ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 542 a7344c9afbaf
parent 538 8016a2480704
child 543 b2bea5968b89
equal deleted inserted replaced
541:5bf9f94c02e1 542:a7344c9afbaf
   246 We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
   246 We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
   247 $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
   247 $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
   248 expressions and $as$ for a list of annotated regular expressions.
   248 expressions and $as$ for a list of annotated regular expressions.
   249 The alternative constructor ($\sum$) has been generalised to 
   249 The alternative constructor ($\sum$) has been generalised to 
   250 accept a list of annotated regular expressions rather than just 2.
   250 accept a list of annotated regular expressions rather than just 2.
   251 %We will show that these bitcodes encode information about
   251 
   252 %the ($\POSIX$) value that should be generated by the Sulzmann and Lu
   252 
   253 %algorithm.
   253 The first thing we define related to bit-coded regular expressions
   254 The most central question is how these partial lexing information
   254 is how we move bits, for instance pasting it at the front of an annotated regex.
   255 represented as bit-codes is augmented and carried around 
       
   256 during a derivative is taken.
       
   257 
       
   258 This is done by adding bitcodes to the 
       
   259 derivatives, for example when one more star iteratoin is taken (we
       
   260 call the operation of derivatives on annotated regular expressions $\bder$
       
   261 because it is derivatives on regexes with bitcodes):
       
   262 \begin{center}
       
   263   \begin{tabular}{@{}lcl@{}}
       
   264   $\bder \; c\; (_{bs}a^*) $ & $\dn$ &
       
   265       $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
       
   266        (_{[]}a^*))$
       
   267 \end{tabular}    
       
   268 \end{center}    
       
   269 
       
   270 \noindent
       
   271 For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
       
   272 there is no danger of confusion with derivatives on plain regular expressions, 
       
   273 for example, the above can be expressed as
       
   274 \begin{center}
       
   275   \begin{tabular}{@{}lcl@{}}
       
   276   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
       
   277       $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
       
   278        (_{[]}a^*))$
       
   279 \end{tabular}    
       
   280 \end{center}   
       
   281 
       
   282 
       
   283 Using the picture we used earlier to depict this, the transformation when 
       
   284 taking a derivative w.r.t a star is like below:
       
   285 \centering
       
   286 \begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
       
   287 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   288     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   289         {$bs$
       
   290          \nodepart{two} $a^*$ };
       
   291 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   292 \end{tikzpicture} 
       
   293 &
       
   294 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   295     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   296         {$v_{\text{previous iterations}}$
       
   297          \nodepart{two} $a^*$};
       
   298 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   299 \end{tikzpicture}
       
   300 \\
       
   301 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   302     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   303         { $bs$ + [Z]
       
   304          \nodepart{two}  $(a\backslash c )\cdot a^*$ };
       
   305 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   306 \end{tikzpicture}
       
   307 &
       
   308 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   309     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   310         {$v_{\text{previous iterations}}$ + 1 more iteration
       
   311          \nodepart{two} $(a\backslash c )\cdot a^*$ };
       
   312 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   313 \end{tikzpicture}
       
   314 \end{tabular}    
       
   315 \noindent
       
   316 The operation $\fuse$ is just to attach bit-codes 
   255 The operation $\fuse$ is just to attach bit-codes 
   317 to the front of an annotated regular expression:
   256 to the front of an annotated regular expression:
   318 \begin{center}
   257 \begin{center}
   319 \begin{tabular}{lcl}
   258 \begin{tabular}{lcl}
   320   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
   259   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
   330      $_{bs @ bs'}a^*$
   269      $_{bs @ bs'}a^*$
   331 \end{tabular}    
   270 \end{tabular}    
   332 \end{center} 
   271 \end{center} 
   333 
   272 
   334 \noindent
   273 \noindent
   335 Another place in the $\bder$ function where it differs
   274 With that we are able to define $\internalise$.
   336 from normal derivatives on un-annotated regular expressions
       
   337 is the sequence case:
       
   338 \begin{center}
       
   339   \begin{tabular}{@{}lcl@{}}
       
   340 
       
   341   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
       
   342      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   343 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
       
   344 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
       
   345   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
       
   346 \end{tabular}    
       
   347 \end{center}    
       
   348 Here 
       
   349 
       
   350 
       
   351 \begin{center}
       
   352   \begin{tabular}{@{}lcl@{}}
       
   353   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
   354   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
   355   $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
       
   356         $\textit{if}\;c=d\; \;\textit{then}\;
       
   357          _{bs}\ONE\;\textit{else}\;\ZERO$\\  
       
   358   $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
       
   359   $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
       
   360   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
       
   361      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   362 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
       
   363 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
       
   364   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
       
   365   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
       
   366       $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
       
   367        (_{[]}r^*))$
       
   368 \end{tabular}    
       
   369 \end{center}    
       
   370 
       
   371 
   275 
   372 To do lexing using annotated regular expressions, we shall first
   276 To do lexing using annotated regular expressions, we shall first
   373 transform the usual (un-annotated) regular expressions into annotated
   277 transform the usual (un-annotated) regular expressions into annotated
   374 regular expressions. This operation is called \emph{internalisation} and
   278 regular expressions. This operation is called \emph{internalisation} and
   375 defined as follows:
   279 defined as follows:
   390 \end{tabular}    
   294 \end{tabular}    
   391 \end{center}    
   295 \end{center}    
   392 %\end{definition}
   296 %\end{definition}
   393 
   297 
   394 \noindent
   298 \noindent
   395 We use up arrows here to indicate that the basic un-annotated regular
   299 We use an up arrow with postfix notation
   396 expressions are ``lifted up'' into something slightly more complex. In the
   300 to denote the operation, 
   397 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
   301 for convenience. The $\textit{internalise} \; r$
   398 attach bits to the front of an annotated regular expression. Its
   302 notation is more cumbersome.
   399 definition is as follows:
   303 The opposite of $\textit{internalise}$ is
   400 
   304 $\erase$, where all the bit-codes are removed,
   401  
   305 and the alternative operator $\sum$ for annotated
   402 
   306 regular expressions is transformed to the binary alternatives
   403 \noindent
   307 for plain regular expressions.
   404 After internalising the regular expression, we perform successive
   308 \begin{center}
   405 derivative operations on the annotated regular expressions. This
   309 	\begin{tabular}{lcr}
   406 derivative operation is the same as what we had previously for the
   310 		$\erase \; \ZERO$ & $\dn$ & $\ZERO$\\
   407 basic regular expressions, except that we beed to take care of
   311 		$\erase \; _{bs}\ONE$ & $\dn$ & $\ONE$\\
   408 the bitcodes:
   312 		$\erase \; _{bs}\mathbf{c}$ & $\dn$ & $\mathbf{c}$\\
   409 
   313 		$\erase \; _{bs} a_1 \cdot a_2$ & $\dn$ & $\erase \; r_1\cdot\erase\; r_2$\\
   410 
   314 		$\erase \; _{bs} [] $ & $\dn$ & $\ZERO$\\
   411 
   315 		$\erase \; _{bs} [a] $ & $\dn$ & $\erase \; a$\\
   412 
   316 		$\erase \; _{bs} \sum [a_1, \; a_2]$ & $\dn$ & $\erase \; a_1 +\erase \; a_2$\\
   413 
   317 		$\erase \; _{bs} \sum (a :: as)$ & $\dn$ & $\erase \; a + \erase \; _{[]} \sum as$\\
   414 %\end{definition}
   318 		$\erase \; _{bs} a^*$ & $\dn$ & $(\erase \; a)^*$
   415 \noindent
   319 	\end{tabular}
   416 For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
   320 \end{center}
   417 we need to unfold it into a sequence,
   321 \noindent
   418 and attach an additional bit $Z$ to the front of $r \backslash c$
   322 We also abbreviate the $\erase\; a$ operation
   419 to indicate one more star iteration. Also the sequence clause
   323 as $a_\downarrow$, for conciseness.
   420 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
   324 For bit-coded regular expressions, as a different datatype, 
   421 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
   325 testing whether they contain empty string in their lauguage requires
   422 that it is for annotated regular expressions, therefore we omit the
   326 a dedicated function $\bnullable$
   423 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
   327 which simply calls $\erase$ first before testing whether it is $\nullable$.
   424 $a_1$ matches the string prior to character $c$ (more on this later),
   328 \begin{center}
   425 then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
   329 	\begin{tabular}{lcr}
   426 \backslash c)$ will collapse the regular expression $a_1$(as it has
   330 		$\bnullable \; a $ & $\dn$ & $\nullable \; (a_\downarrow)$
   427 already been fully matched) and store the parsing information at the
   331 	\end{tabular}
   428 head of the regular expression $a_2 \backslash c$ by fusing to it. The
   332 \end{center}
   429 bitsequence $\textit{bs}$, which was initially attached to the
   333 The function for collecting the
   430 first element of the sequence $a_1 \cdot a_2$, has
   334 bitcodes of a $\bnullable$ annotated regular expression
   431 now been elevated to the top-level of $\sum$, as this information will be
   335 is a generalised version of the $\textit{mkeps}$ function
   432 needed whichever way the sequence is matched---no matter whether $c$ belongs
       
   433 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
       
   434 the lexing information, we complete the lexing by collecting the
       
   435 bitcodes using a generalised version of the $\textit{mkeps}$ function
       
   436 for annotated regular expressions, called $\textit{bmkeps}$:
   336 for annotated regular expressions, called $\textit{bmkeps}$:
   437 
   337 
   438 
   338 
   439 %\begin{definition}[\textit{bmkeps}]\mbox{}
   339 %\begin{definition}[\textit{bmkeps}]\mbox{}
   440 \begin{center}
   340 \begin{center}
   441 \begin{tabular}{lcl}
   341 \begin{tabular}{lcl}
   442   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
   342   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
   443   $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
   343   $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
   444      $\textit{if}\;\textit{bnullable}\,a$\\
   344      $\textit{if}\;\textit{bnullable}\,a$\\
   445   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
   345   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
   446   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
   346   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{[]}\sum \textit{as})$\\
   447   $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
   347   $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
   448      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
   348      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
   449   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
   349   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
   450      $bs \,@\, [Z]$
   350      $bs \,@\, [Z]$
   451 \end{tabular}    
   351 \end{tabular}    
   454 
   354 
   455 \noindent
   355 \noindent
   456 This function completes the value information by travelling along the
   356 This function completes the value information by travelling along the
   457 path of the regular expression that corresponds to a POSIX value and
   357 path of the regular expression that corresponds to a POSIX value and
   458 collecting all the bitcodes, and using $S$ to indicate the end of star
   358 collecting all the bitcodes, and using $S$ to indicate the end of star
   459 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
   359 iterations. 
   460 decode them, we get the value we expect. The corresponding lexing
   360 
   461 algorithm looks as follows:
   361 The most central question is how these partial lexing information
       
   362 represented as bit-codes is augmented and carried around 
       
   363 during a derivative is taken.
       
   364 This is done by adding bitcodes to the 
       
   365 derivatives, for example when one more star iteratoin is taken (we
       
   366 call the operation of derivatives on annotated regular expressions $\bder$
       
   367 because it is derivatives on regexes with \emph{b}itcodes),
       
   368 we need to unfold it into a sequence,
       
   369 and attach an additional bit $Z$ to the front of $r \backslash c$
       
   370 to indicate one more star iteration. 
       
   371 \begin{center}
       
   372   \begin{tabular}{@{}lcl@{}}
       
   373   $\bder \; c\; (_{bs}a^*) $ & $\dn$ &
       
   374       $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
       
   375        (_{[]}a^*))$
       
   376 \end{tabular}    
       
   377 \end{center}    
       
   378 
       
   379 \noindent
       
   380 For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
       
   381 there is no danger of confusion with derivatives on plain regular expressions, 
       
   382 for example, the above can be expressed as
       
   383 \begin{center}
       
   384   \begin{tabular}{@{}lcl@{}}
       
   385   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
       
   386       $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
       
   387        (_{[]}a^*))$
       
   388 \end{tabular}    
       
   389 \end{center}   
       
   390 
       
   391 \noindent
       
   392 Using the picture we used earlier to depict this, the transformation when 
       
   393 taking a derivative w.r.t a star is like below:
       
   394 
       
   395 \begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
       
   396 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   397     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   398         {$bs$
       
   399          \nodepart{two} $a^*$ };
       
   400 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   401 \end{tikzpicture} 
       
   402 &
       
   403 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   404     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   405         {$v_{\text{previous iterations}}$
       
   406          \nodepart{two} $a^*$};
       
   407 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   408 \end{tikzpicture}
       
   409 \\
       
   410 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   411     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   412         { $bs$ + [Z]
       
   413          \nodepart{two}  $(a\backslash c )\cdot a^*$ };
       
   414 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   415 \end{tikzpicture}
       
   416 &
       
   417 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
       
   418     \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
       
   419         {$v_{\text{previous iterations}}$ + 1 more iteration
       
   420          \nodepart{two} $(a\backslash c )\cdot a^*$ };
       
   421 %\caption{term 1 \ref{term:1}'s matching configuration}
       
   422 \end{tikzpicture}
       
   423 \end{tabular}    
       
   424 \noindent
       
   425 Another place in the $\bder$ function where it differs
       
   426 from normal derivatives (on un-annotated regular expressions)
       
   427 is the sequence case:
       
   428 \begin{center}
       
   429   \begin{tabular}{@{}lcl@{}}
       
   430 
       
   431   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
       
   432      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   433 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
       
   434 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
       
   435   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
       
   436 \end{tabular}    
       
   437 \end{center}    
       
   438 
       
   439 The difference is that (when $a_1$ is $\bnullable$)
       
   440 we use $\bmkeps$ to store the lexing information
       
   441 in $a_1$ before collapsing it (as it has been fully matched by string prior to $c$, 
       
   442 and attach the collected bit-codes to the front of $a_2$
       
   443 before throwing away $a_1$. We assume that $\bmkeps$ correctly extracts the bitcode for how $a_1$
       
   444 matches the string prior to $c$ (more on this later).
       
   445 The bitsequence $\textit{bs}$ which was initially attached to the first element of the sequence
       
   446 $a_1 \cdot a_2$, has now been elevated to the top level of teh $\sum$. 
       
   447 This is because this piece of information will be needed whichever way the sequence is matched,
       
   448 regardless of whether $c$ belongs to $a_1$ or $a_2$.
       
   449 
       
   450 In the injection-based lexing, $r_1$ is immediately thrown away in 
       
   451 subsequent derivatives on the right branch (when $r_1$ is $\nullable$),
       
   452 \begin{center}
       
   453 	$(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$
       
   454 \end{center}
       
   455 \noindent
       
   456 as it knows $r_1$ is stored on stack and available once the recursive 
       
   457 call to later derivatives finish.
       
   458 Therefore, if the $\Right$ branch is taken in a $\POSIX$ match,
       
   459 we construct back the sequence value once step back by
       
   460 calling a on $\mkeps(r_1)$.
       
   461 \begin{center}
       
   462 	\begin{tabular}{lcr}
       
   463 		$\ldots r_1 \cdot r_2$ & $\rightarrow$ & $r_1\cdot r_2 + r_2 \backslash c \ldots $\\
       
   464 		$\ldots \Seq(v_1, v_2) (\Seq(\mkeps(r1), (\inj \; r_2 \; c\; v_{2c})))$ & $\leftarrow$ & $\Right(v_{2c})\ldots$
       
   465 	\end{tabular}
       
   466 \end{center}
       
   467 \noindent
       
   468 The rest of the clauses of $\bder$ is rather similar to
       
   469 $\der$, and is put together here as a wholesome definition
       
   470 for $\bder$:
       
   471 \begin{center}
       
   472   \begin{tabular}{@{}lcl@{}}
       
   473   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
   474   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
   475   $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
       
   476         $\textit{if}\;c=d\; \;\textit{then}\;
       
   477          _{bs}\ONE\;\textit{else}\;\ZERO$\\  
       
   478   $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
       
   479   $_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\
       
   480   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
       
   481      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   482 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
       
   483 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
       
   484   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
       
   485   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
       
   486       $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
       
   487        (_{[]}r^*))$
       
   488 \end{tabular}    
       
   489 \end{center}    
       
   490 \noindent
       
   491 Generalising the derivative operation with bitcodes to strings, we have 
       
   492 \begin{center}
       
   493 	\begin{tabular}{@{}lcl@{}}
       
   494 		$a\backslash_s [] $ & $\dn$ & $a$\\
       
   495 		$a\backslash (c :: s) $ & $\dn$ & $(a \backslash c) \backslash_s s$
       
   496 	\end{tabular}
       
   497 \end{center}
       
   498 As we did earlier, we omit the $s$ subscript at $\backslash_s$ when there is no danger
       
   499 of confusion.
       
   500 Putting this all together, we obtain a lexer with bit-coded regular expressions
       
   501 as its internal data structures, which we call $\blexer$:
   462 
   502 
   463 \begin{center}
   503 \begin{center}
   464 \begin{tabular}{lcl}
   504 \begin{tabular}{lcl}
   465   $\textit{blexer}\;r\,s$ & $\dn$ &
   505   $\textit{blexer}\;r\,s$ & $\dn$ &
   466       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
   506       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
   469   & & $\;\;\textit{else}\;\textit{None}$
   509   & & $\;\;\textit{else}\;\textit{None}$
   470 \end{tabular}
   510 \end{tabular}
   471 \end{center}
   511 \end{center}
   472 
   512 
   473 \noindent
   513 \noindent
   474 In this definition $\_\backslash s$ is the  generalisation  of the derivative
   514 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
   475 operation from characters to strings (just like the derivatives for un-annotated
   515 \begin{conjecture}
   476 regular expressions).
   516 $\blexer \;r \; s = \lexer \; r \; s$.
   477 
   517 \end{conjecture}
       
   518 This was claimed but not formalised in Sulzmann and Lu's work.
       
   519 We introduce the proof later, after we give out all
       
   520 the needed auxiliary functions and definitions
   478 
   521 
   479 
   522 
   480 %-----------------------------------
   523 %-----------------------------------
   481 %	SUBSECTION 1
   524 %	SUBSECTION 1
   482 %-----------------------------------
   525 %-----------------------------------
   483 \section{Specifications of Some Helper Functions}
   526 \section{Specifications of Some Helper Functions}
   484 Here we give some functions' definitions, 
   527 The functions we introduce will give a more detailed glimpse into 
   485 which we will use later.
   528 the lexing process, which might not be possible
   486 \begin{center}
   529 using $\lexer$ or $\blexer$ themselves.
   487 \begin{tabular}{ccc}
   530 The first function we shall look at is $\retrieve$.
   488 $\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
   531 \subsection{$\retrieve$}
       
   532 Our bit-coded lexer "retrieve"s the bitcodes using $\bmkeps$
       
   533 after we finished doing all the derivatives:
       
   534 \begin{center}
       
   535 \begin{tabular}{lcl}
       
   536 	& & $\ldots$\\
       
   537   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   538   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   539   & & $\ldots$
   489 \end{tabular}
   540 \end{tabular}
   490 \end{center}
   541 \end{center}
   491 
   542 \noindent
   492 
   543 Recall that $\bmkeps$ looks for the leftmost branch of an alternative
   493 %----------------------------------------------------------------------------------------
   544 and completes a star's iterations by attaching a $Z$ at the end of the bitcodes
   494 %	SECTION  correctness proof
   545 extracted. It "retrieves" a sequence by visiting both children and then stitch together 
   495 %----------------------------------------------------------------------------------------
   546 two bitcodes using concatenation. After the entire tree structure of the regular 
   496 \section{Correctness of Bit-coded Algorithm (Without Simplification)}
   547 expression has been traversed using the above manner, we get a bitcode encoding the 
   497 We now give the proof the correctness of the algorithm with bit-codes.
   548 lexing result.
   498 
   549 We know that this "retrieved" bitcode leads to the correct value after decoding,
   499 Ausaf and Urban cleverly defined an auxiliary function called $\flex$,
   550 which is $v_0$ in the bird's eye view of the injection-based lexing diagram.
       
   551 Now assume we keep every other data structure in the diagram \ref{InjFigure},
       
   552 and only replace all the plain regular expression by their annotated counterparts,
       
   553 computed during a $\blexer$ run.
       
   554 Then we obtain a diagram for the annotated regular expression derivatives and
       
   555 their corresponding values, though the values are never calculated in $\blexer$.
       
   556 We have that $a_n$ contains all the lexing result information.
       
   557 \vspace{20mm}
       
   558 \begin{center}%\label{graph:injLexer}
       
   559 \begin{tikzcd}[
       
   560 	every matrix/.append style = {name=p},
       
   561 	remember picture, overlay,
       
   562 	]
       
   563 	a_0 \arrow[r, "\backslash c_0"]  \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\
       
   564 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   565 \end{tikzcd}
       
   566 \begin{tikzpicture}[
       
   567 	remember picture, overlay,
       
   568 E/.style = {ellipse, draw=blue, dashed,
       
   569             inner xsep=4mm,inner ysep=-4mm, rotate=90, fit=#1}
       
   570                         ]
       
   571 \node[E = (p-1-1) (p-2-1)] {};
       
   572 \node[E = (p-1-4) (p-2-4)] {};
       
   573 \end{tikzpicture}
       
   574 
       
   575 \end{center}
       
   576 \vspace{20mm}
       
   577 \noindent
       
   578 On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$.
       
   579 Encircled in the diagram  are the two pairs $v_0, a_0$ and $v_n, a_n$, which both 
       
   580 encode the correct lexical result. Though for the leftmost pair, we have
       
   581 the information condensed in $v_0$ the value part, whereas for the rightmost pair,
       
   582 the information is concentrated on $a_n$.
       
   583 We know that in the intermediate steps the pairs $v_i, a_i$, must in some way encode the complete
       
   584 lexing information as well. Therefore, we need a unified approach to extract such lexing result
       
   585 from a value $v_i$ and its annotated regular expression $a_i$. 
       
   586 And the function $f$ must satisfy these requirements:
       
   587 \begin{itemize}
       
   588 	\item
       
   589 		$f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
       
   590 	\item
       
   591 		$f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
       
   592 \end{itemize}
       
   593 \noindent
       
   594 If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,
       
   595 The core of the function $f$ is something that produces the bitcodes
       
   596 $\code \; v_0$.
       
   597 It is unclear how, but Sulzmann and Lu came up with a function satisfying all the above
       
   598 requirements, named \emph{retrieve}:
       
   599 
       
   600 
       
   601 
       
   602 \begin{center}
       
   603 \begin{tabular}{lcr}
       
   604 	$\retrieve \; \, (_{bs} \ONE) \; \, (\Empty)$ & $\dn$ & $\textit{bs}$\\
       
   605 	$\retrieve \; \, (_{bs} \mathbf{c} ) \; \Char(c)$ & $\dn$ & $ \textit{bs}$\\
       
   606 	$\retrieve \; \, (_{bs} a_1 \cdot a_2) \; \Seq(v_1, v_2)$ & $\dn$ &  $\textit{bs} @ (\retrieve \; a_1\; v_1) @ (\retrieve \; a_2 \; v_2)$\\
       
   607 	$\retrieve \; \, (_{bs} \Sigma (a :: \textit{as}) \; \,\Left(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v)$\\
       
   608 	$\retrieve \; \, (_{bs} \Sigma (a :: \textit{as} \; \, \Right(v)$ & $\dn$ & $\textit{bs} @ (\retrieve \; (_{[]}\Sigma \textit{as}) \; v)$\\
       
   609 	$\retrieve \; \, (_{bs} a^*) \; \, (\Stars(v :: vs)) $ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v) @ (\retrieve \; (_{[]} a^*) \; (\Stars(vs)))$\\
       
   610 	$\retrieve \; \, (_{bs} a^*) \; \, (\Stars([]) $ & $\dn$ & $\textit{bs} @ [Z]$
       
   611 \end{tabular}
       
   612 \end{center}
       
   613 \noindent
       
   614 As promised, $\retrieve$ collects the right bit-codes from the 
       
   615 final derivative $a_n$:
       
   616 \begin{lemma}
       
   617 	$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
       
   618 \end{lemma}
       
   619 \begin{proof}
       
   620 	By a routine induction on $a$.
       
   621 \end{proof}
       
   622 The design of $\retrieve$ enables extraction of bit-codes
       
   623 from not only $\bnullable$ (annotated) regular expressions,
       
   624 but also those that are not $\bnullable$.
       
   625 For example, if we have the regular expression just internalised
       
   626 and the lexing result value, we could $\retrieve$ the bitcdoes
       
   627 that look as if we have en$\code$-ed the value:
       
   628 \begin{lemma}
       
   629 	$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
       
   630 \end{lemma}
       
   631 \begin{proof}
       
   632 	By induction on $r$.
       
   633 \end{proof}
       
   634 \noindent
       
   635 The following property is more interesting, as
       
   636 it provides a "bridge" between $a_0, v_0$ and $a_n, v_n$ in the
       
   637 lexing diagram.
       
   638 If you take derivative of an annotated regular expression, 
       
   639 you can $\retrieve$ the same bit-codes as before the derivative took place,
       
   640 provided that you use the corresponding value:
       
   641 
       
   642 \begin{lemma}\label{retrieveStepwise}
       
   643 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
       
   644 \end{lemma}
       
   645 \begin{proof}
       
   646 	By induction on $r$, where $v$ is allowed to be arbitrary.
       
   647 	The induction principle is function $\erase$'s cases.
       
   648 \end{proof}
       
   649 \noindent
       
   650 $\retrieve$ is connected to the $\blexer$ in the following way:
       
   651 \begin{lemma}\label{blexer_retrieve}
       
   652 $\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
       
   653 \end{lemma}
       
   654 \noindent
       
   655 $\retrieve$ allows free navigation on the diagram \ref{InjFigure} for annotated regexes of $\blexer$.
       
   656 For plain regular expressions something similar is required as well.
       
   657 
       
   658 \subsection{$\flex$}
       
   659 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$,
   500 defined as
   660 defined as
   501 \begin{center}
   661 \begin{center}
   502 \begin{tabular}{lcr}
   662 \begin{tabular}{lcr}
   503 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
   663 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
   504 $\flex \; r \; f \; c :: s \; v$ &  $=$ &   $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$
   664 $\flex \; r \; f \; c :: s \; v$ &  $=$ &   $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$
   508 and does the injection in a stack-like manner (last taken derivative first injected).
   668 and does the injection in a stack-like manner (last taken derivative first injected).
   509 $\flex$ is connected to the $\lexer$:
   669 $\flex$ is connected to the $\lexer$:
   510 \begin{lemma}
   670 \begin{lemma}
   511 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
   671 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
   512 \end{lemma}
   672 \end{lemma}
   513 $\flex$ provides us a bridge between $\lexer$ and $\blexer$.
   673 \begin{proof}
       
   674 	By reverse induction on $s$.
       
   675 \end{proof}
       
   676 $\flex$ provides us a bridge between $\lexer$'s intermediate steps.
   514 What is even better about $\flex$ is that it allows us to 
   677 What is even better about $\flex$ is that it allows us to 
   515 directly operate on the value $\mkeps (r\backslash v)$,
   678 directly operate on the value $\mkeps (r\backslash v)$,
   516 which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
   679 which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
   517 When the value created by $\mkeps$ becomes available, one can 
   680 When the value created by $\mkeps$ becomes available, one can 
   518 prove some stepwise properties of lexing nicely:
   681 prove some stepwise properties of lexing nicely:
   519 \begin{lemma}\label{flexStepwise}
   682 \begin{lemma}\label{flexStepwise}
   520 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
   683 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
   521 \end{lemma}
   684 \end{lemma}
   522 
   685 \begin{proof}
   523 And for $\blexer$ we have a function with stepwise properties like $\flex$ as well,
   686 	By induction on the shape of $r\backslash s$
   524 called $\retrieve$\ref{retrieveDef}.
   687 \end{proof}
   525 $\retrieve$ takes bit-codes from annotated regular expressions
   688 \noindent
   526 guided by a value.
   689 With $\flex$ and $\retrieve$ ready, we are ready to connect $\lexer$ and $\blexer$ .
   527 $\retrieve$ is connected to the $\blexer$ in the following way:
   690 
   528 \begin{lemma}\label{blexer_retrieve}
   691 \subsection{Correctness Proof of Bit-coded Algorithm}
   529 $\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
       
   530 \end{lemma}
       
   531 If you take derivative of an annotated regular expression, 
       
   532 you can $\retrieve$ the same bit-codes as before the derivative took place,
       
   533 provided that you use the corresponding value:
       
   534 
       
   535 \begin{lemma}\label{retrieveStepwise}
       
   536 $\retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
       
   537 \end{lemma}
       
   538 The other good thing about $\retrieve$ is that it can be connected to $\flex$:
       
   539 %centralLemma1
       
   540 \begin{lemma}\label{flex_retrieve}
   692 \begin{lemma}\label{flex_retrieve}
   541 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
   693 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
   542 \end{lemma}
   694 \end{lemma}
   543 \begin{proof}
   695 \begin{proof}
   544 By induction on $s$. The induction tactic is reverse induction on strings.
   696 By induction on $s$. The induction tactic is reverse induction on strings.
   565 \].
   717 \].
   566 \end{proof}
   718 \end{proof}
   567 
   719 
   568 With the above lemma we can now link $\flex$ and $\blexer$.
   720 With the above lemma we can now link $\flex$ and $\blexer$.
   569 
   721 
       
   722 %----------------------------------------------------------------------------------------
       
   723 %	SECTION  correctness proof
       
   724 %----------------------------------------------------------------------------------------
       
   725 \section{Correctness of Bit-coded Algorithm (Without Simplification)}
       
   726 We now give the proof the correctness of the algorithm with bit-codes.
       
   727 
   570 \begin{lemma}\label{flex_blexer}
   728 \begin{lemma}\label{flex_blexer}
   571 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
   729 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
   572 \end{lemma}
   730 \end{lemma}
   573 \begin{proof}
   731 \begin{proof}
   574 Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
   732 Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
   575 \end{proof}
   733 \end{proof}
   576 Finally 
   734 Finally the correctness of $\blexer$ is given as it outputs the same result as $\lexer$:
   577 
   735 \begin{theorem}
   578 
   736 	$\blexer\; r \; s = \lexer \; r \; s$
   579 
   737 \end{theorem}
       
   738 \begin{proof}
       
   739 	Straightforward corollary of \ref{flex_blexer}.
       
   740 \end{proof}
       
   741 \noindent
       
   742 Our main reason for wanting a bit-coded algorithm over 
       
   743 the injection-based one is for its capabilities of allowing
       
   744 more aggressive simplifications.
       
   745 We will elaborate on this in the next chapter.
       
   746 
       
   747