ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
changeset 624 8ffa28fce271
parent 622 4b1149fb5aec
child 638 dd9dde2d902b
equal deleted inserted replaced
623:c0c1ebe09c7d 624:8ffa28fce271
     8 %simplifications and therefore introduce our version of the bitcoded algorithm and 
     8 %simplifications and therefore introduce our version of the bitcoded algorithm and 
     9 %its correctness proof in 
     9 %its correctness proof in 
    10 %Chapter 3\ref{Chapter3}. 
    10 %Chapter 3\ref{Chapter3}. 
    11 In this chapter, we are going to describe the bit-coded algorithm
    11 In this chapter, we are going to describe the bit-coded algorithm
    12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
    12 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
       
    13 derivatives of
    13 regular expressions. 
    14 regular expressions. 
    14 We have implemented their algorithm in Scala, and found out inefficiencies
    15 We have implemented their algorithm in Scala and Isabelle, 
       
    16 and found problems 
    15 in their algorithm such as de-duplication not working properly and redundant
    17 in their algorithm such as de-duplication not working properly and redundant
    16 fixpoint construction. Their algorithm is improved and verified with the help of
    18 fixpoint construction. 
    17 formal proofs.
       
    18 \section{The Motivation Behind Using Bitcodes}
    19 \section{The Motivation Behind Using Bitcodes}
    19 We first do a recap of what was going on 
    20 Let us give again the definition of $\lexer$ from Chapter \ref{Inj}:
    20 in the lexer algorithm in Chapter \ref{Inj},
       
    21 \begin{center}
    21 \begin{center}
    22 \begin{tabular}{lcl}
    22 \begin{tabular}{lcl}
    23 	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
    23 	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
    24 	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
    24 	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
    25 	& & $\quad \phantom{\mid}\; \None \implies \None$\\
    25 	& & $\quad \phantom{\mid}\; \None \implies \None$\\
    26 	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
    26 	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
    27 \end{tabular}
    27 \end{tabular}
    28 \end{center}
    28 \end{center}
    29 \noindent
    29 \noindent
    30 The algorithm recursively calls $\lexer$ on 
    30 The first problem with this algorithm is that
       
    31 for the function $\inj$ to work properly
       
    32 we cannot destroy the structure of a regular expression,
       
    33 and therefore canno simplify aggressively.
       
    34 For example,
       
    35 \[
       
    36 	r + (r + a) \rightarrow r + a
       
    37 \]
       
    38 cannot be done because that would require
       
    39 breaking up the inner alternative.
       
    40 The $\lexer$ function therefore only enables
       
    41 same-level de-duplications like
       
    42 \[
       
    43 	r + r \rightarrow r.
       
    44 \]
       
    45 Secondly, the algorithm recursively calls $\lexer$ on 
    31 each new character input,
    46 each new character input,
    32 and before starting a child call
    47 and before starting a child call
    33 it stores information of previous lexing steps
    48 it stores information of previous lexing steps
    34 on a stack, in the form of regular expressions
    49 on a stack, in the form of regular expressions
    35 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
    50 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
   294 \caption{Stored $r_i, c_{i+1}$ Used by $\inj$}
   309 \caption{Stored $r_i, c_{i+1}$ Used by $\inj$}
   295 \end{figure}
   310 \end{figure}
   296 \noindent
   311 \noindent
   297 Storing all $r_i, c_{i+1}$ pairs recursively
   312 Storing all $r_i, c_{i+1}$ pairs recursively
   298 allows the algorithm to work in an elegant way, at the expense of 
   313 allows the algorithm to work in an elegant way, at the expense of 
   299 storing quite a bit of verbose information.
   314 storing quite a bit of verbose information on the stack.
   300 The stack seems to grow at least quadratically fast with respect
   315 The stack seems to grow at least quadratically with respect
   301 to the input (not taking into account the size bloat of $r_i$),
   316 to the input (not taking into account the size bloat of $r_i$),
   302 which can be inefficient and prone to stack overflow.
   317 which can be inefficient and prone to stack overflows.
   303 \section{Bitcoded Algorithm}
   318 \section{Bitcoded Algorithm}
   304 To address this,
   319 To address this,
   305 Sulzmann and Lu chose to  define a new datatype 
   320 Sulzmann and Lu defined a new datatype 
   306 called \emph{annotated regular expression},
   321 called \emph{annotated regular expression},
   307 which condenses all the partial lexing information
   322 which condenses all the partial lexing information
   308 (that was originally stored in $r_i, c_{i+1}$ pairs)
   323 (that was originally stored in $r_i, c_{i+1}$ pairs)
   309 into bitcodes.
   324 into bitcodes.
   310 The bitcodes are then carried with the regular
   325 The bitcodes are then carried with the regular
   331 \end{center}  
   346 \end{center}  
   332 \noindent
   347 \noindent
   333 where $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
   348 where $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
   334 expressions and $as$ for lists of annotated regular expressions.
   349 expressions and $as$ for lists of annotated regular expressions.
   335 The alternative constructor, written, $\sum$, has been generalised to 
   350 The alternative constructor, written, $\sum$, has been generalised to 
   336 accept a list of annotated regular expressions rather than just two.
   351 take a list of annotated regular expressions rather than just two.
   337 Why is it generalised? This is because when we open up nested 
   352 Why is it generalised? This is because when we analyse nested 
   338 alternatives, there could be more than two elements at the same level
   353 alternatives, there can be more than two elements at the same level
   339 after de-duplication, which can no longer be stored in a binary
   354 after de-duplication, which can no longer be stored in a binary
   340 constructor.
   355 constructor.
   341 Bits and bitcodes (lists of bits) are defined as:
   356 Bits and bitcodes (lists of bits) are defined as:
   342 \begin{center}
   357 \begin{center}
   343 		$b ::=   S \mid  Z \qquad
   358 		$b ::=   S \mid  Z \qquad
   345 $
   360 $
   346 \end{center}
   361 \end{center}
   347 \noindent
   362 \noindent
   348 We use $S$ and $Z$ rather than $1$ and $0$ is to avoid
   363 We use $S$ and $Z$ rather than $1$ and $0$ is to avoid
   349 confusion with the regular expressions $\ZERO$ and $\ONE$.
   364 confusion with the regular expressions $\ZERO$ and $\ONE$.
   350 The idea is to use the attached bitcodes
   365 The idea is to use the bitcodes
   351 to indicate which choice was made at a certain point
   366 to indicate which choice was made at a certain point
   352 during lexing.
   367 during lexing.
   353 For example,
   368 For example,
   354 $(_{Z}a+_{S}b) \backslash a$ gives us
   369 $(_{Z}a+_{S}b) \backslash a$ gives us
   355 $_{Z}\ONE + \ZERO$, this $Z$ bitcode will
   370 $_{Z}\ONE + \ZERO$, where the $Z$ bitcode will
   356 later be used to decode that a left branch was
   371 later be used to decode that a left branch was
   357 selected at the time when the part $a+b$ is being taken
   372 selected at the time when the part $a+b$ was anallysed by
   358 derivative of.
   373 derivative.
   359 \subsection{A Bird's Eye View of the Bit-coded Lexer}
   374 \subsection{A Bird's Eye View of the Bit-coded Lexer}
   360 Before we give out the rest of the functions and definitions 
   375 Before we give the details of the functions and definitions 
   361 related to our
   376 related to our
   362 $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level
   377 $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level
   363 view of the algorithm, so the reader does not get lost in
   378 view of the algorithm.
   364 the details.
       
   365 \begin{figure}[H]
   379 \begin{figure}[H]
   366 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
   380 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick] 
   367 %\draw (-6,-6) grid (6,6);
   381 %\draw (-6,-6) grid (6,6);
   368 
   382 
   369 	\node [circle, draw] (r0) at (-6, 2) {$r$};
   383 	\node [circle, draw] (r0) at (-6, 2) {$r$};
   434 	%edge [] node {} (ldots2);
   448 	%edge [] node {} (ldots2);
   435 
   449 
   436 
   450 
   437 
   451 
   438 \end{tikzpicture}
   452 \end{tikzpicture}
   439 \caption{Bird's Eye View of $\blexer$}
   453 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand
       
   454 	for the annotated regular expressions in each derivative step.
       
   455 	The characters used in each derivative is written as $c_i$.
       
   456 The relative size increase reflect the size increase in each derivative step.}
   440 \end{figure}
   457 \end{figure}
   441 \noindent
   458 \noindent
   442 The plain regular expressions
   459 The plain regular expressions
   443 are first ``lifted'' to an annotated regular expression,
   460 are first ``lifted'' to an annotated regular expression,
   444 with the function called \emph{internalise}.
   461 with the function called \emph{internalise} ($r \rightarrow _{bs}a$).
   445 Then the annotated regular expression $_{bs}a$ will
   462 Then the annotated regular expression $_{bs}a$ will
   446 go through successive derivatives with respect 
   463 be transformed by successive derivatives with respect 
   447 to the input stream of characters $c_1, c_2$ etc.
   464 to the input stream of characters $c_1, c_2$ etc.
   448 Each time a derivative is taken, the bitcodes
   465 Each time a derivative is taken, the bitcodes
   449 are moved around, discarded or augmented together 
   466 are moved around, discarded or augmented together 
   450 with the regular expression they are attached to.
   467 with the regular expression they are attached to.
   451 After all input has been consumed, the 
   468 After all input has been consumed, the 
   452 bitcodes are collected by $\bmkeps$,
   469 bitcodes are collected by $\bmkeps$,
   453 which traverses the nullable part of the regular expression
   470 which traverses the nullable part of the regular expression
   454 and collect the bitcodes along the way.
   471 and collects the bitcodes along the way.
   455 The collected bitcodes are then $\decode$d with the guidance
   472 The collected bitcodes are then $\decode$d with the guidance
   456 of the input regular expression $r$.
   473 of the input regular expression $r$ ( $_{bs}a \rightarrow r$).
   457 The most notable improvements of $\blexer$
   474 The most notable improvements of $\blexer$
   458 over $\lexer$ are 
   475 over $\lexer$ are 
   459 \begin{itemize}
   476 \begin{itemize}
   460 	\item
   477 	\item
   461 
   478 
   462 		An absence of the second injection phase.
   479 		An absence of the second injection phase.
   463 	\item
   480 	\item
   464 		One need not store each pair of the 
   481 		One does not need to store each pair of the 
   465 		intermediate regular expressions $_{bs_i}a_i$ and  $c_{i+1}$. 
   482 		intermediate regular expressions $_{bs_i}a_i$ and  $c_{i+1}$. 
   466 		The previous annotated regular expression $_{bs_i}a_i$'s information is passed
   483 		The previous annotated regular expression $_{bs_i}a_i$'s information is passed
   467 		on without loss to its successor $_{bs_{i+1}}a_{i+1}$,
   484 		on without loss to its successor $_{bs_{i+1}}a_{i+1}$,
   468 		and $c_{i+1}$'s already contained in the strings in $L\;r$ \footnote{
   485 		and $c_{i+1}$'s information is stored in $L\;r$.\footnote{
   469 		which can be easily recovered if we decode in the correct order}.
   486 		which will be used during the decode phase, where we use $r$ as
       
   487 	a source of information.}
   470 	\item
   488 	\item
   471 		The simplification works much better--one can maintain correctness
   489 		Finally, simplification works much better--one can maintain correctness
   472 		while applying quite strong simplifications, as we shall wee.
   490 		while applying quite aggressive simplifications.
   473 \end{itemize}
   491 \end{itemize}
   474 \noindent
   492 %In the next section we will 
   475 In the next section we will 
   493 %give the operations needed in $\blexer$,
   476 give the operations needed in $\blexer$,
   494 %with some remarks on the idea behind their definitions.
   477 with some remarks on the idea behind their definitions.
       
   478 \subsection{Operations in $\textit{Blexer}$}
   495 \subsection{Operations in $\textit{Blexer}$}
   479 The first operation we define related to bit-coded regular expressions
   496 The first operation we define related to bit-coded regular expressions
   480 is how we move bits to the inside of regular expressions.
   497 is how we move bits to the inside of regular expressions.
   481 Called $\fuse$, this operation attaches bit-codes 
   498 This operation is called $\fuse$:
   482 to the front of an annotated regular expression:
       
   483 \begin{center}
   499 \begin{center}
   484 \begin{tabular}{lcl}
   500 \begin{tabular}{lcl}
   485   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
   501 	$\textit{fuse}\;bs \; (\ZERO)$ & $\dn$ & $\ZERO$\\
   486   $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
   502 	$\textit{fuse}\;bs\; (_{bs'})\ONE$ & $\dn$ &
   487      $_{bs @ bs'}\ONE$\\
   503      $_{bs @ bs'}\ONE$\\
   488   $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
   504 	$\textit{fuse}\;bs\;(_{bs'}{\bf c})$ & $\dn$ &
   489      $_{bs@bs'}{\bf c}$\\
   505      $_{bs@bs'}{\bf c}$\\
   490   $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
   506 	$\textit{fuse}\;bs\;(_{bs'}\sum\textit{as})$ & $\dn$ &
   491      $_{bs@bs'}\sum\textit{as}$\\
   507      $_{bs@bs'}\sum\textit{as}$\\
   492   $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
   508 	$\textit{fuse}\;bs\; (_{bs'}a_1\cdot a_2)$ & $\dn$ &
   493      $_{bs@bs'}a_1 \cdot a_2$\\
   509      $_{bs@bs'}a_1 \cdot a_2$\\
   494   $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
   510   $\textit{fuse}\;bs\;(_{bs'}a^*)$ & $\dn$ &
   495      $_{bs @ bs'}a^*$
   511      $_{bs @ bs'}a^*$
   496 \end{tabular}    
   512 \end{tabular}    
   497 \end{center} 
   513 \end{center} 
   498 
   514 
   499 \noindent
   515 \noindent
   500 With \emph{fuse} we are able to define the $\internalise$ function
   516 With \emph{fuse} we are able to define the $\internalise$ function, 
       
   517 written $(\_)^\uparrow$,
   501 that translates a ``standard'' regular expression into an
   518 that translates a ``standard'' regular expression into an
   502 annotated regular expression.
   519 annotated regular expression.
   503 This function will be applied before we start
   520 This function will be applied before we start
   504 with the derivative phase of the algorithm.
   521 with the derivative phase of the algorithm.
   505 
   522 
   516   $(r^*)^\uparrow$ & $\dn$ &
   533   $(r^*)^\uparrow$ & $\dn$ &
   517          $_{[]}(r^\uparrow)^*$\\
   534          $_{[]}(r^\uparrow)^*$\\
   518 \end{tabular}    
   535 \end{tabular}    
   519 \end{center}    
   536 \end{center}    
   520 \noindent
   537 \noindent
   521 We use an up arrow with postfix notation
       
   522 to denote this operation
       
   523 for convenience. 
       
   524 The opposite of $\textit{internalise}$ is
   538 The opposite of $\textit{internalise}$ is
   525 $\erase$, where all the bit-codes are removed,
   539 $\erase$, where all bit-codes are removed,
   526 and the alternative operator $\sum$ for annotated
   540 and the alternative operator $\sum$ for annotated
   527 regular expressions is transformed to the binary version 
   541 regular expressions is transformed to the binary version 
   528 in plain regular expressions.
   542 in (plain) regular expressions. This can be defined as follows:
   529 \begin{center}
   543 \begin{center}\label{eraseDef}
   530 	\begin{tabular}{lcl}
   544 	\begin{tabular}{lcl}
   531 		$\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\
   545 		$\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\
   532 		$( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\
   546 		$( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\
   533 		$( _{bs}\mathbf{c} )_\downarrow$ & $\dn$ & $\mathbf{c}$\\
   547 		$( _{bs}\mathbf{c} )_\downarrow$ & $\dn$ & $\mathbf{c}$\\
   534 		$( _{bs} a_1 \cdot a_2 )_\downarrow$ & $\dn$ & 
   548 		$( _{bs} a_1 \cdot a_2 )_\downarrow$ & $\dn$ & 
   535 		$ (a_1) _\downarrow \cdot  (a_2) _\downarrow$\\
   549 		$ (a_1) _\downarrow \cdot  (a_2) _\downarrow$\\
   536 		$( _{bs} [])_\downarrow $ & $\dn$ & $\ZERO $\\
   550 		$( _{bs} [])_\downarrow $ & $\dn$ & $\ZERO $\\
   537 		$( _{bs} [a]  )_\downarrow$ & $\dn$ & $a_\downarrow$\\
   551 		$( _{bs} [a]  )_\downarrow$ & $\dn$ & $a_\downarrow$\\
   538 		$_{bs} \sum [a_1, \; a_2]$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\
   552 		$(_{bs} \sum [a_1, \; a_2])_\downarrow$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\
   539 		$(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\
   553 		$(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\
   540 		$( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$
   554 		$( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$
   541 	\end{tabular}
   555 	\end{tabular}
   542 \end{center}
   556 \end{center}
   543 \noindent
   557 \noindent
   544 We also abbreviate the $\erase\; a$ operation
   558 We also abbreviate the $\erase\; a$ operation
   545 as $a_\downarrow$, for conciseness.
   559 as $a_\downarrow$, for conciseness.
   546 
   560 
   547 Testing whether an annotated regular expression
   561 Determining whether an annotated regular expression
   548 contains the empty string in its lauguage requires
   562 contains the empty string in its lauguage requires
   549 a dedicated function $\bnullable$.
   563 a dedicated function $\bnullable$.
       
   564 In our formalisation this function is defined by simply calling $\erase$
       
   565 before $\nullable$.
   550 $\bnullable$ simply calls $\erase$ before $\nullable$.
   566 $\bnullable$ simply calls $\erase$ before $\nullable$.
   551 \begin{definition}
   567 \begin{definition}
   552 		$\bnullable \; a \dn  \nullable \; (a_\downarrow)$
   568 		$\bnullable \; a \dn  \nullable \; (a_\downarrow)$
   553 \end{definition}
   569 \end{definition}
   554 The function for collecting 
   570 The function for collecting 
   555 bitcodes from a 
   571 bitcodes from a 
   556 (b)nullable regular expression is called $\bmkeps$.
   572 (b)nullable regular expression is called $\bmkeps$.
   557 $\bmkeps$ is a variation of the $\textit{mkeps}$ function,
   573 This function  is a lifted version of the $\textit{mkeps}$ function,
   558 which follows the path $\mkeps$ takes to traverse the
   574 which follows the path $\mkeps$ takes to traverse the
   559 $\nullable$ branches when visiting a regular expression,
   575 $\nullable$ branches when visiting a regular expression,
   560 but gives back bitcodes instead of a value.
   576 but collects bitcodes instead of generating a value.
   561 \begin{center}
   577 \begin{center}
   562 \begin{tabular}{lcl}
   578 \begin{tabular}{lcl}
   563   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
   579   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
   564   $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
   580   $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
   565      $\textit{if}\;\textit{bnullable}\,a$\\
   581      $\textit{if}\;\textit{bnullable}\,a$\\
   570   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
   586   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
   571      $bs \,@\, [S]$
   587      $bs \,@\, [S]$
   572 \end{tabular}    
   588 \end{tabular}    
   573 \end{center}    
   589 \end{center}    
   574 \noindent
   590 \noindent
   575 $\bmkeps$, just like $\mkeps$, 
   591 This function, just like $\mkeps$, 
   576 visits a regular expression tree respecting
   592 visits a regular expression tree respecting
   577 the POSIX rules. The difference, however, is that
   593 the POSIX rules. 
   578 it does not create values, but only bitcodes.
       
   579 It traverses each child of the sequence regular expression
   594 It traverses each child of the sequence regular expression
   580 from left to right and creates a bitcode by stitching
   595 from left to right and creates a bitcode by stitching
   581 together bitcodes obtained from the children expressions.
   596 together bitcodes obtained from the children expressions.
   582 In the case of alternative regular expressions, 
   597 In the case of alternative regular expressions, 
   583 it looks for the leftmost
   598 it looks for the leftmost
   586 %Whenever there is some bitcodes attached to a
   601 %Whenever there is some bitcodes attached to a
   587 %node, it returns the bitcodes concatenated with whatever
   602 %node, it returns the bitcodes concatenated with whatever
   588 %child recursive calls return.
   603 %child recursive calls return.
   589 The only time when $\bmkeps$ creates new bitcodes
   604 The only time when $\bmkeps$ creates new bitcodes
   590 is when it completes a star's iterations by attaching a $S$ to the end of the bitcode
   605 is when it completes a star's iterations by attaching a $S$ to the end of the bitcode
   591 list it returns.\\
   606 list it returns.
       
   607 
   592 The bitcodes extracted by $\bmkeps$ need to be 
   608 The bitcodes extracted by $\bmkeps$ need to be 
   593 $\decode$d (with the guidance of a plain regular expression):
   609 $\decode$d (with the guidance of a plain regular expression):
   594 %\begin{definition}[Bitdecoding of Values]\mbox{}
   610 %\begin{definition}[Bitdecoding of Values]\mbox{}
   595 \begin{center}
   611 \begin{center}
   596 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   612 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   618        \textit{else}\;\textit{None}$                       
   634        \textit{else}\;\textit{None}$                       
   619 \end{tabular}    
   635 \end{tabular}    
   620 \end{center} 
   636 \end{center} 
   621 \noindent
   637 \noindent
   622 The function $\decode'$ returns a pair consisting of 
   638 The function $\decode'$ returns a pair consisting of 
   623 a partially decoded value and some leftover bit list that cannot
   639 a partially decoded value and some leftover bit-list.
   624 be decide yet.
       
   625 The function $\decode'$ succeeds if the left-over 
   640 The function $\decode'$ succeeds if the left-over 
   626 bit-sequence is empty.
   641 bit-sequence is empty.
   627 $\decode$ is terminating as $\decode'$ is terminating.
   642 
   628 $\decode'$ is terminating 
   643 %$\decode$ is terminating as $\decode'$ is terminating.
   629 because at least one of $\decode'$'s parameters will go down in terms
   644 %$\decode'$ is terminating 
   630 of size.\\
   645 %because at least one of $\decode'$'s parameters will go down in terms
       
   646 %of size.\\
   631 The reverse operation of $\decode$ is $\code$.
   647 The reverse operation of $\decode$ is $\code$.
   632 $\textit{code}$ encodes a value into a bitcode by converting
   648 This function encodes a value into a bitcode by converting
   633 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
   649 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
   634 star iteration by $S$. The border where a star iteration
   650 star iteration by $S$. The border where a star iteration
   635 terminates is marked by $Z$. 
   651 terminates is marked by $Z$. 
   636 This coding is lossy, as it throws away the information about
   652 This coding is lossy, as it throws away the information about
   637 characters, and does not encode the ``boundary'' between two
   653 characters, and does not encode the ``boundary'' between two
   638 sequence values. Moreover, with only the bitcode we cannot even tell
   654 sequence values. Moreover, with only the bitcode we cannot even tell
   639 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$.  
   655 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$,
       
   656 but this swill not be necessary in our proof.
   640 \begin{center}
   657 \begin{center}
   641 \begin{tabular}{lcl}
   658 \begin{tabular}{lcl}
   642   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
   659   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
   643   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
   660   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
   644   $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
   661   $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
   648   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
   665   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
   649                                                  code(\Stars\,vs)$
   666                                                  code(\Stars\,vs)$
   650 \end{tabular}    
   667 \end{tabular}    
   651 \end{center} 
   668 \end{center} 
   652 \noindent
   669 \noindent
   653 Assume we have a value $v$ and regular expression $r$
   670 We can prove that given a value $v$ and regular expression $r$
   654 with $\vdash v:r$,
   671 with $\vdash v:r$,
   655 then we have the property that $\decode$ and $\code$ are
   672 then we have the property that $\decode$ and $\code$ are
   656 reverse operations of one another:
   673 reverse operations of one another:
   657 \begin{lemma}
   674 \begin{lemma}
   658 \[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \]
   675 \[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \]
   662 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
   679 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
   663 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
   680 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
   664 we obtain the property.
   681 we obtain the property.
   665 \end{proof}
   682 \end{proof}
   666 \noindent
   683 \noindent
   667 Now we give out the central part of this lexing algorithm,
   684 Now we define the central part of Sulzmann and Lu's lexing algorithm,
   668 the $\bder$ function (stands for \emph{b}itcoded-derivative):
   685 the $\bder$ function (which stands for \emph{b}itcoded-derivative):
   669 \begin{center}
   686 \begin{center}
   670   \begin{tabular}{@{}lcl@{}}
   687   \begin{tabular}{@{}lcl@{}}
   671   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
   688   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
   672   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
   689   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
   673   $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
   690   $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
   684       $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
   701       $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
   685        (_{[]}r^*))$
   702        (_{[]}r^*))$
   686 \end{tabular}    
   703 \end{tabular}    
   687 \end{center}    
   704 \end{center}    
   688 \noindent
   705 \noindent
   689 For most time we use the infix notation $(\_\backslash\_)$ 
   706 The $\bder$ function tells us how regular expressions can be recursively traversed,
   690 to mean $\bder$ for brevity when
       
   691 there is no danger of confusion with derivatives on plain regular expressions.
       
   692 For example, we write $( _{[]}r^* ) \backslash c$ instead of $\bder \;c \; _{[]}r^*$,
       
   693 as the bitcodes at the front of $r^*$ indicates that it is 
       
   694 a bit-coded regular expression, not a plain one.\\
       
   695 $\bder$ tells us how regular expressions can be recursively traversed,
       
   696 where the bitcodes are augmented and carried around 
   707 where the bitcodes are augmented and carried around 
   697 when a derivative is taken.
   708 when a derivative is taken.
   698 We give the intuition behind some of the more involved cases in 
   709 We give the intuition behind some of the more involved cases in 
   699 $\bder$. \\
   710 $\bder$. \\
   700 For example,
   711 For example,
   701 in the \emph{star} case,
   712 in the \emph{star} case,
   702 a derivative on $_{bs}a^*$ means 
   713 a derivative of $_{bs}a^*$ means 
   703 that one more star iteratoin needs to be taken.
   714 that one more star iteration needs to be taken.
   704 we need to unfold it into a sequence,
   715 We therefore need to unfold it into a sequence,
   705 and attach an additional bit $Z$ to the front of $r \backslash c$
   716 and attach an additional bit $Z$ to the front of $a \backslash c$
   706 as a record to indicate one new star iteration is unfolded.
   717 as a record to indicate one new star iteration is unfolded.
   707 
   718 
   708 \noindent
   719 \noindent
   709 \begin{center}
   720 \begin{center}
   710   \begin{tabular}{@{}lcl@{}}
   721   \begin{tabular}{@{}lcl@{}}
   714 \end{tabular}    
   725 \end{tabular}    
   715 \end{center}   
   726 \end{center}   
   716 
   727 
   717 \noindent
   728 \noindent
   718 This information will be recovered later by the $\decode$ function.
   729 This information will be recovered later by the $\decode$ function.
   719 The intuition is that the bit $Z$ will be decoded at the right location,
   730 %The intuition is that the bit $Z$ will be decoded at the right location,
   720 because we accumulate bits from left to right (a rigorous proof will be given
   731 %because we accumulate bits from left to right (a rigorous proof will be given
   721 later).
   732 %later).
   722 
   733 
   723 %\begin{tikzpicture}[ > = stealth, % arrow head style
   734 %\begin{tikzpicture}[ > = stealth, % arrow head style
   724 %        shorten > = 1pt, % don't touch arrow head to node
   735 %        shorten > = 1pt, % don't touch arrow head to node
   725 %        semithick % line style
   736 %        semithick % line style
   726 %    ]
   737 %    ]
   759 % 	\end{scope}
   770 % 	\end{scope}
   760 %\end{tikzpicture}
   771 %\end{tikzpicture}
   761 
   772 
   762 \noindent
   773 \noindent
   763 Another place the $\bder$ function differs
   774 Another place the $\bder$ function differs
   764 from derivatives on plain regular expressions
   775 from derivatives on regular expressions
   765 is the sequence case:
   776 is the sequence case:
   766 \begin{center}
   777 \begin{center}
   767   \begin{tabular}{@{}lcl@{}}
   778   \begin{tabular}{@{}lcl@{}}
   768 
   779 
   769   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
   780   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
   772 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
   783 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
   773   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
   784   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
   774 \end{tabular}    
   785 \end{tabular}    
   775 \end{center}    
   786 \end{center}    
   776 \noindent
   787 \noindent
   777 The difference is that (when $a_1$ is $\bnullable$)
   788 %We assume that $\bmkeps$ 
       
   789 %correctly extracts the bitcode for how $a_1$
       
   790 %matches the string prior to $c$ (more on this later).
       
   791 %The bitsequence $\textit{bs}$ which was initially 
       
   792 %attached to the first element of the sequence
       
   793 %$a_1 \cdot a_2$, has now been 
       
   794 %elevated to the top level of the $\sum$ constructor. 
       
   795 %This is because this piece of information will be needed 
       
   796 %whichever way the sequence is matched,
       
   797 %regardless of whether $c$ belongs to $a_1$ or $a_2$.
       
   798 The difference mainly lies in the $\textit{if}$ branch (when $a_1$ is $\bnullable$):
   778 we use $\bmkeps$ to store the lexing information
   799 we use $\bmkeps$ to store the lexing information
   779 in $a_1$ before collapsing 
   800 in $a_1$ before collapsing 
   780 it (as it has been fully matched by string prior to $c$),
   801 it (as it has been fully matched by string prior to $c$),
   781 and attach the collected bit-codes to the front of $a_2$
   802 and attach the collected bit-codes to the front of $a_2$
   782 before throwing away $a_1$. We assume that $\bmkeps$ 
   803 before throwing away $a_1$. 
   783 correctly extracts the bitcode for how $a_1$
       
   784 matches the string prior to $c$ (more on this later).
       
   785 The bitsequence $\textit{bs}$ which was initially 
       
   786 attached to the first element of the sequence
       
   787 $a_1 \cdot a_2$, has now been 
       
   788 elevated to the top level of the $\sum$ constructor. 
       
   789 This is because this piece of information will be needed 
       
   790 whichever way the sequence is matched,
       
   791 regardless of whether $c$ belongs to $a_1$ or $a_2$.
       
   792 
       
   793 In the injection-based lexer, $r_1$ is immediately thrown away in 
   804 In the injection-based lexer, $r_1$ is immediately thrown away in 
   794 subsequent derivatives on the right branch (when $r_1$ is $\nullable$),
   805 in the $\textit{if}$ branch,
   795 depite $r_1$ potentially storing information of previous matches:
   806 the information $r_1$ stores is therefore lost:
   796 \begin{center}
   807 \begin{center}
   797 	$(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$
   808 	\begin{tabular}{lcl}
       
   809 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
       
   810 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   811 	&   & $\mathit{else}\; \ldots$\\
       
   812 	\end{tabular}
   798 \end{center}
   813 \end{center}
   799 \noindent
   814 \noindent
   800 this loss of information makes it necessary
   815 
   801 to store on stack all the regular expressions'
   816 %\noindent
   802 ``snapshots'' before they were
   817 %this loss of information makes it necessary
   803 taken derivative of.
   818 %to store on stack all the regular expressions'
   804 So that the related information will be available 
   819 %``snapshots'' before they were
   805 once the child recursive 
   820 %taken derivative of.
   806 call finishes.\\
   821 %So that the related information will be available 
       
   822 %once the child recursive 
       
   823 %call finishes.\\
   807 The rest of the clauses of $\bder$ is rather similar to
   824 The rest of the clauses of $\bder$ is rather similar to
   808 $\der$. \\
   825 $\der$. \\
   809 Generalising the derivative operation with bitcodes to strings, we have 
   826 Generalising the derivative operation with bitcodes to strings, we have 
   810 \begin{center}
   827 \begin{center}
   811 	\begin{tabular}{@{}lcl@{}}
   828 	\begin{tabular}{@{}lcl@{}}
   827   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
   844   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
   828   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
   845   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
   829   & & $\;\;\textit{else}\;\textit{None}$
   846   & & $\;\;\textit{else}\;\textit{None}$
   830 \end{tabular}
   847 \end{tabular}
   831 \end{center}
   848 \end{center}
   832 
   849 \noindent
   833 \noindent
   850 This function first attaches bitcodes to a
   834 $\blexer$ first attaches bitcodes to a
   851 plain regular expression $r$, and then builds successive derivatives
   835 plain regular expression, and then do successive derivatives
       
   836 with respect to the input string $s$, and
   852 with respect to the input string $s$, and
   837 then test whether the result is nullable.
   853 then test whether the result is (b)nullable.
   838 If yes, then extract the bitcodes out of the nullable expression,
   854 If yes, then extract the bitcodes from the nullable expression,
   839 and decodes the bitcodes into a lexical value.
   855 and decodes the bitcodes into a lexical value.
   840 If there does not exists a match between $r$ and $s$ the lexer
   856 If there does not exists a match between $r$ and $s$ the lexer
   841 outputs $\None$ indicating a failed lex.\\
   857 outputs $\None$ indicating a mismatch.\\
   842 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
   858 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
   843 \begin{property}
   859 \begin{property}
   844 $\blexer \;r \; s = \lexer \; r \; s$.
   860 $\blexer \;r \; s = \lexer \; r \; s$.
   845 \end{property}
   861 \end{property}
   846 \noindent
   862 \noindent
   847 This was claimed but not formalised in Sulzmann and Lu's work.
   863 This was claimed but not formalised in Sulzmann and Lu's work.
   848 We introduce the proof later, after we give all
   864 We introduce the proof later, after we give all
   849 the needed auxiliary functions and definitions.
   865 the needed auxiliary functions and definitions.
   850 \subsection{An Example $\blexer$ Run}
   866 \subsection{An Example $\blexer$ Run}
   851 Before introducing the proof we shall first walk the reader 
   867 Before introducing the proof we shall first walk 
   852 through a concrete example of our $\blexer$ calculating the right 
   868 through a concrete example of our $\blexer$ calculating the right 
   853 lexical information through bit-coded regular expressions.\\
   869 lexical information through bit-coded regular expressions.\\
   854 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$.
   870 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$.
   855 We give again the bird's eye view of this particular example 
   871 We give again the bird's eye view of this particular example 
   856 in each stage of the algorithm:
   872 in each stage of the algorithm:
   917 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$
   933 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$
   918 turned into $ZS$ after derivative w.r.t $b$
   934 turned into $ZS$ after derivative w.r.t $b$
   919 is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$
   935 is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$
   920 before processing $_Zb+_Sc$.\\
   936 before processing $_Zb+_Sc$.\\
   921 The annotated regular expressions
   937 The annotated regular expressions
   922 would look too cumbersome if we explicitly indicate all the
   938 would look overwhelming if we explicitly indicate all the
   923 locations where bitcodes are attached.
   939 locations where bitcodes are attached.
   924 For example,
   940 For example,
   925 $(aa)^*\cdot (b+c)$ would 
   941 $(aa)^*\cdot (b+c)$ would 
   926 look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$ 
   942 look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$ 
   927 after 
   943 after 
   928 internalise.
   944 internalise.
   929 Therefore for readability we omit bitcodes if they are empty. 
   945 Therefore for readability we omit bitcodes if they are empty. 
   930 This applies to all example annotated 
   946 This applies to all annotated 
   931 regular expressions in this thesis.\\
   947 regular expressions in this thesis.\\
   932 %and assume we have just read the first character $a$:
   948 %and assume we have just read the first character $a$:
   933 %\begin{center}
   949 %\begin{center}
   934 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
   950 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
   935 %    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
   951 %    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
  1036 	\item
  1052 	\item
  1037 		The bit-encodings work properly,
  1053 		The bit-encodings work properly,
  1038 		allowing the possibility of 
  1054 		allowing the possibility of 
  1039 		pulling out the right lexical value from an
  1055 		pulling out the right lexical value from an
  1040 		annotated regular expression at 
  1056 		annotated regular expression at 
  1041 		any stage of the algorithm.
  1057 		any stage, using $\bmkeps$ or $\retrieve$ (details in
       
  1058 		lemmas \ref{bmkepsRetrieve} and \ref{blexer_retrieve}).
  1042 \end{itemize}
  1059 \end{itemize}
  1043 We will elaborate on this, with the help of
  1060 We will elaborate on this, with the help of
  1044 some helper functions such as $\retrieve$ and
  1061 some helper functions such as $\retrieve$ and
  1045 $\flex$.
  1062 $\flex$.
  1046 \subsection{Specifications of Some Helper Functions}
  1063 \subsection{Specifications of Some Helper Functions}
  1049 using $\lexer$ or $\blexer$ alone.
  1066 using $\lexer$ or $\blexer$ alone.
  1050 \subsubsection{$\textit{Retrieve}$}
  1067 \subsubsection{$\textit{Retrieve}$}
  1051 The first function we shall introduce is $\retrieve$.
  1068 The first function we shall introduce is $\retrieve$.
  1052 Sulzmann and Lu gave its definition, and
  1069 Sulzmann and Lu gave its definition, and
  1053 Ausaf and Urban found
  1070 Ausaf and Urban found
  1054 its usage in mechanised proofs.
  1071 useful in their mechanised proofs.
  1055 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$,
  1072 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$,
  1056 after all the derivatives has been taken:
  1073 after all the derivatives have been taken:
  1057 \begin{center}
  1074 \begin{center}
  1058 \begin{tabular}{lcl}
  1075 \begin{tabular}{lcl}
  1059 	& & $\ldots$\\
  1076 	& & $\ldots$\\
  1060   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  1077   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  1061   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  1078   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  1062   & & $\ldots$
  1079   & & $\ldots$
  1063 \end{tabular}
  1080 \end{tabular}
  1064 \end{center}
  1081 \end{center}
  1065 \noindent
  1082 \noindent
  1066 $\bmkeps$ retrieves the value $v$'s
  1083 The function $\bmkeps$ retrieves the value $v$'s
  1067 information in the format
  1084 information in the format
  1068 of bitcodes, by travelling along the
  1085 of bitcodes, by travelling along the
  1069 path of the regular expression that corresponds to a POSIX match,
  1086 path of the regular expression that corresponds to a POSIX match,
  1070 collecting all the bitcodes.
  1087 collecting all the bitcodes.
  1071 We know that this "retrieved" bitcode leads to the correct value after decoding,
  1088 We know that this "retrieved" bitcode leads to the correct value after decoding,
  1072 which is $v_0$ in the injection-based lexing diagram.
  1089 which is $v_0$ in the injection-based lexing diagram:
  1073 As an observation we pointed at the beginning of this section,
       
  1074 the annotated regular expressions generated in successive derivative steps
       
  1075 in $\blexer$ after $\erase$ has the same structure 
       
  1076 as those appeared in $\lexer$.
       
  1077 We redraw the diagram below to visualise this fact.
       
  1078 We pretend that all the values are 
       
  1079 ready despite they are only calculated in $\lexer$.
       
  1080 In general we have $\vdash v_i:(a_i)_\downarrow$.
       
  1081 \vspace{20mm}
  1090 \vspace{20mm}
  1082 \begin{figure}[H]%\label{graph:injLexer}
  1091 \begin{figure}[H]%\label{graph:injLexer}
  1083 \begin{center}
  1092 \begin{center}
  1084 \begin{tikzcd}[
  1093 \begin{tikzcd}[
  1085 	every matrix/.append style = {name=p},
  1094 	every matrix/.append style = {name=p},
  1097 \node[E = (p-1-1) (p-2-1)] {};
  1106 \node[E = (p-1-1) (p-2-1)] {};
  1098 \node[E = (p-1-4) (p-2-4)] {};
  1107 \node[E = (p-1-4) (p-2-4)] {};
  1099 \node[E = (p-1-2) (p-2-2)] {};
  1108 \node[E = (p-1-2) (p-2-2)] {};
  1100 \node[E = (p-1-3) (p-2-3)] {};
  1109 \node[E = (p-1-3) (p-2-3)] {};
  1101 \end{tikzpicture}
  1110 \end{tikzpicture}
  1102 
  1111 \caption{Injection-based lexing diagram, 
       
  1112 with matching value and regular expression pairs
       
  1113 encircled}\label{Inj2}
  1103 \end{figure}
  1114 \end{figure}
  1104 \vspace{20mm}
  1115 \vspace{20mm}
  1105 \noindent
  1116 \noindent
  1106 We encircle in the diagram  all the pairs $v_i, a_i$ to show that these values
  1117 We have that $\vdash v_i:(a_i)_\downarrow$,
       
  1118 where $v_i$ are the intermediate values generated step by step in $\lexer$.
       
  1119 These values are not explicitly created in $\blexer$.
       
  1120 
       
  1121 We encircled in the diagram  all the pairs $v_i, a_i$ to show that these values
  1107 and regular expressions share the same structure.
  1122 and regular expressions share the same structure.
  1108 These pairs all contain adequate information to 
  1123 These pairs all contain adequate information to 
  1109 obtain the final lexing result.
  1124 obtain the final lexing result.
  1110 For example, in the leftmost pair the 
  1125 For example, in the leftmost pair the 
  1111 lexical information is condensed in 
  1126 lexical information is condensed in 
  1112 $v_0$, whereas for the rightmost pair,
  1127 $v_0$, whereas for the rightmost pair,
  1113 the lexing result hides is in the bitcodes of $a_n$.\\
  1128 the lexing result hides is in the bitcodes of $a_n$.\\
  1114 $\bmkeps$ is able to extract from $a_n$ the result
  1129 The function $\bmkeps$ is able to extract from $a_n$ the result
  1115 by looking for nullable parts of the regular expression.
  1130 by looking for nullable parts of the regular expression.
  1116 However for regular expressions $a_i$ in general,
  1131 However for regular expressions $a_i$ in general,
  1117 they might not necessarily be nullable.
  1132 they might not necessarily be nullable.
  1118 For those regular expressions that are not nullable,
  1133 For those regular expressions that are not nullable,
  1119 $\bmkeps$ will not work. 
  1134 $\bmkeps$ will not work. 
  1127 		$f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
  1142 		$f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
  1128 	\item
  1143 	\item
  1129 		$f \; a_i\;v_i = f \; a_0 \; v_0 = \code \; v_0$ %\decode \;(\code \; v_0) \; (\erase \; a_0)$
  1144 		$f \; a_i\;v_i = f \; a_0 \; v_0 = \code \; v_0$ %\decode \;(\code \; v_0) \; (\erase \; a_0)$
  1130 \end{itemize}
  1145 \end{itemize}
  1131 \noindent
  1146 \noindent
  1132 Sulzmann and Lu came up with a function satisfying 
  1147 Sulzmann and Lu came up with the following definition for $f$ satisfying 
  1133 the above
  1148 the above
  1134 requirements, named \emph{retrieve}:
  1149 requirements, and named it \emph{retrieve}:
       
  1150 \vspace{-7mm}
  1135 \begin{center}
  1151 \begin{center}
  1136 \begin{tabular}{llcl}
  1152 \begin{tabular}{llcl}
  1137 	$\retrieve \; \, _{bs} \ONE$ & $   \Empty$ & $\dn$ & $\textit{bs}$\\
  1153 	$\retrieve \; \, _{bs} \ONE$ & $   \Empty$ & $\dn$ & $\textit{bs}$\\
  1138 
  1154 
  1139 	$\retrieve \; \, _{bs} \mathbf{c}$ & $ (\Char \; c) $ & $\dn$ & $ \textit{bs}$\\
  1155 	$\retrieve \; \, _{bs} \mathbf{c}$ & $ (\Char \; c) $ & $\dn$ & $ \textit{bs}$\\
  1153 
  1169 
  1154 	$\retrieve \; \, _{bs} a^* $ & $  (\Stars \; []) $ & $\dn$ & $\textit{bs} \; @ \; [S]$
  1170 	$\retrieve \; \, _{bs} a^* $ & $  (\Stars \; []) $ & $\dn$ & $\textit{bs} \; @ \; [S]$
  1155 \end{tabular}
  1171 \end{tabular}
  1156 \end{center}
  1172 \end{center}
  1157 \noindent
  1173 \noindent
  1158 As promised, $\retrieve$ collects the right bit-codes from the 
  1174 As stated, $\retrieve$ collects the right bit-codes from the 
  1159 final derivative $a_n$, guided by $v_n$:
  1175 final derivative $a_n$, guided by $v_n$. This can be proved
       
  1176 as follows:
  1160 \begin{lemma}\label{bmkepsRetrieve}
  1177 \begin{lemma}\label{bmkepsRetrieve}
  1161 	$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
  1178 	$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
  1162 \end{lemma}
  1179 \end{lemma}
  1163 \begin{proof}
  1180 \begin{proof}
  1164 	By a routine induction on $a$.
  1181 	By a routine induction on $a$.
  1165 \end{proof}
  1182 \end{proof}
  1166 \noindent
  1183 \noindent
  1167 The design of $\retrieve$ enables us to extract bitcodes
  1184 %The design of $\retrieve$ enables us to extract bitcodes
  1168 from both annotated regular expressions and values.
  1185 %from both annotated regular expressions and values.
  1169 $\retrieve$ always ``sucks up'' all the information.
  1186 %$\retrieve$ always ``sucks up'' all the information.
  1170 When more information is stored in the value, we would be able to
  1187 When more information is stored in the value, we would be able to
  1171 extract more from the value, and vice versa.
  1188 extract more from the value, and vice versa.
  1172 For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$
  1189 For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$
  1173 exactly the same bitcodes as $\code \; (\Stars \; vs)$:
  1190 exactly the same bitcodes as $\code \; (\Stars \; vs)$:
  1174 \begin{lemma}
  1191 \begin{lemma}\label{retrieveEncodeSTARS}
  1175   If $\forall v \in vs. \vdash v : r$, and  $\code \; v = \retrieve \; (\internalise\; r) \; v$\\
  1192   If $\forall v \in vs. \vdash v : r$, and  $\code \; v = \retrieve \; (\internalise\; r) \; v$\\
  1176   then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \;  r)^*) \; (\Stars \; vs)$
  1193   then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \;  r)^*) \; (\Stars \; vs)$
  1177 \end{lemma}\label{retrieveEncodeSTARS}
  1194 \end{lemma}
  1178 \begin{proof}
  1195 \begin{proof}
  1179 	By induction on the value list $vs$.
  1196 	By induction on the value list $vs$.
  1180 \end{proof}
  1197 \end{proof}
  1181 \noindent
  1198 \noindent
  1182 Similarly, when the value list does not hold information,
  1199 Similarly, when the value list does not hold information,
  1183 only the bitcodes plus some delimiter will be recorded:
  1200 only the bitcodes plus some delimiter will be recorded:
  1184 \begin{center}
  1201 \begin{center}
  1185   $\retrieve \; _{bs}((\internalise \;  r)^*) \; (\Stars \; [] )  = bs @ [S]$.
  1202   $\retrieve \; _{bs}((\internalise \;  r)^*) \; (\Stars \; [] )  = bs @ [S]$.
  1186 \end{center}
  1203 \end{center}
  1187 In general, if we have a regular expression just internalised
  1204 In general, if we have a regular expression that is just internalised
  1188 and the lexing result value, we could $\retrieve$ the bitcdoes
  1205 and the final lexing result value, we can $\retrieve$ the bitcodes
  1189 that look as if we have en$\code$-ed the value as bitcodes:
  1206 that look as if we have en$\code$-ed the value as bitcodes:
  1190 \begin{lemma}
  1207 \begin{lemma}
  1191 	$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
  1208 	$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
  1192 \end{lemma}
  1209 \end{lemma}
  1193 \begin{proof}
  1210 \begin{proof}
  1194 	By induction on $r$.
  1211 	By induction on $r$.
  1195 	The star case uses lemma \ref{retrieveEncodeSTARS}.
  1212 	The star case uses lemma \ref{retrieveEncodeSTARS}.
  1196 \end{proof}
  1213 \end{proof}
  1197 \noindent
  1214 \noindent
  1198 The following property is more interesting, as
  1215 The following property of $\retrieve$ is crucial, as
  1199 it provides a "bridge" between $a_0$ and $a_n $ in the
  1216 it provides a "bridge" between $a_0$ and $a_n $ in the
  1200 lexing diagram by linking adjacent regular expressions $a_i$ and
  1217 lexing diagram by linking adjacent regular expressions $a_i$ and
  1201 $a_{i+1}$.
  1218 $a_{i+1}$.
  1202 The property says that one 
  1219 The property says that one 
  1203 can retrieve the same bits from a derivative 
  1220 can retrieve the same bits from a derivative 
  1204 regular expression as those from 
  1221 regular expression as those from 
  1205 before the derivative took place,
  1222 before the derivative took place,
  1206 provided that the right values are used respectively:
  1223 provided that the corresponding values are used respectively:
  1207 \begin{lemma}\label{retrieveStepwise}
  1224 \begin{lemma}\label{retrieveStepwise}
  1208 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
  1225 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
  1209 \end{lemma}
  1226 \end{lemma}
  1210 \begin{proof}
  1227 \begin{proof}
  1211 	By induction on $r$, where $v$ is allowed to be arbitrary.
  1228 	By induction on $r$, generalising over $v$.
  1212 	The induction principle is function $\erase$'s cases.
  1229 	The induction principle in our formalisation
       
  1230 	is function $\erase$'s cases.
  1213 \end{proof}
  1231 \end{proof}
  1214 \noindent
  1232 \noindent
  1215 Before we move on to the next
  1233 Before we move on to the next
  1216 helper function, we rewrite $\blexer$ in
  1234 helper function, we rewrite $\blexer$ in
  1217 the following way which makes later proofs more convenient:
  1235 the following way which makes later proofs more convenient:
  1220 \end{lemma}
  1238 \end{lemma}
  1221 \begin{proof}
  1239 \begin{proof}
  1222 	Using lemma \ref{bmkepsRetrieve}.
  1240 	Using lemma \ref{bmkepsRetrieve}.
  1223 \end{proof}
  1241 \end{proof}
  1224 \noindent
  1242 \noindent
  1225 $\retrieve$ allows connecting
  1243 The function $\retrieve$ allows connecting
  1226 between the intermediate 
  1244 between the intermediate 
  1227 results $a_i$ and $a_{i+1}$ in $\blexer$.
  1245 results $a_i$ and $a_{i+1}$ in $\blexer$.
  1228 For plain regular expressions something similar is required.
  1246 For plain regular expressions something similar is required.
  1229 
  1247 
  1230 \subsection{$\flex$}
  1248 \subsection{$\flex$}
  1231 Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$,
  1249 Ausaf and Urban defined an auxiliary function called $\flex$ for $\lexer$,
  1232 defined as
  1250 defined as
  1233 \begin{center}
  1251 \begin{center}
  1234 \begin{tabular}{lcl}
  1252 \begin{tabular}{lcl}
  1235 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
  1253 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
  1236 $\flex \; r \; f \; (c :: s) \; v$ &  $=$ &   $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$
  1254 $\flex \; r \; f \; (c :: s) \; v$ &  $=$ &   $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$
  1237 \end{tabular}
  1255 \end{tabular}
  1238 \end{center}
  1256 \end{center}
  1239 which accumulates the characters that need to be injected back, 
  1257 which accumulates the characters that need to be injected back, 
  1240 and does the injection in a stack-like manner (last taken derivative first injected).
  1258 and does the injection in a stack-like manner (last taken derivative first injected).
       
  1259 The function
  1241 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
  1260 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
  1242 $r$, the identity function $id$, the input string $s$ and the value
  1261 $r$, the identity function $id$, the input string $s$ and the value
  1243 $v_n= \mkeps \; (r\backslash s)$:
  1262 $v_n= \mkeps \; (r\backslash s)$:
  1244 \begin{lemma}
  1263 \begin{lemma}
  1245 	$\flex \;\; r \;\; \textit{id}\;\; s \;\; (\mkeps \;(r\backslash s)) = \lexer \; r \; s$
  1264 	$\flex \;\; r \;\; \textit{id}\;\; s \;\; (\mkeps \;(r\backslash s)) = \lexer \; r \; s$
  1247 \begin{proof}
  1266 \begin{proof}
  1248 	By reverse induction on $s$.
  1267 	By reverse induction on $s$.
  1249 \end{proof}
  1268 \end{proof}
  1250 \noindent
  1269 \noindent
  1251 The main advantage of using $\flex$
  1270 The main advantage of using $\flex$
  1252 in addition to $\lexer$ is that
  1271 in $\lexer$ is that
  1253 $\flex$ makes the value $v$ and function $f$
  1272 $\flex$ makes the value $v$ and function $f$
  1254 that manipulates the value  explicit parameters,
  1273 that manipulates the value  explicit parameters,
  1255 which ``exposes'' $\lexer$'s recursive call
  1274 which ``exposes'' $\lexer$'s recursive call
  1256 arguments and allows us to ``set breakpoints'' and ``resume''
  1275 arguments and allows us to ``set breakpoints'' and ``resume''
  1257 at any point during $\lexer$'s recursive calls.\\
  1276 at any point during $\lexer$'s recursive calls. Therefore
  1258 The following stepwise property holds. 
  1277 the following stepwise property holds. 
  1259 \begin{lemma}\label{flexStepwise}
  1278 \begin{lemma}\label{flexStepwise}
  1260 	$\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f  \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $
  1279 	$\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f  \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $
  1261 \end{lemma}
  1280 \end{lemma}
  1262 \begin{proof}
  1281 \begin{proof}
  1263 	By induction on the shape of $r\backslash s$.
  1282 	By induction on the shape of $r\backslash s$.
  1264 \end{proof}
  1283 \end{proof}
  1265 \noindent
  1284 \noindent
  1266 With $\flex$ and $\retrieve$, 
  1285 With $\flex$ and $\retrieve$, 
  1267 we are ready to connect $\lexer$ and $\blexer$,
  1286 we are ready to connect $\lexer$ and $\blexer$,
  1268 giving the correctness proof.
  1287 giving the correctness proof for $\blexer$.
  1269 
  1288 
  1270 %----------------------------------------------------------------------------------------
  1289 %----------------------------------------------------------------------------------------
  1271 %	SECTION  correctness proof
  1290 %	SECTION  correctness proof
  1272 %----------------------------------------------------------------------------------------
  1291 %----------------------------------------------------------------------------------------
  1273 \section{Correctness of Bit-coded Algorithm (Without Simplification)}
  1292 \section{Correctness of the  Bit-coded Algorithm (Without Simplification)}
  1274 $\flex$ and $\blexer$ essentially calculates the same thing.
  1293 We can first show that 
       
  1294 $\flex$ and $\blexer$ calculates the same thing.
  1275 \begin{lemma}\label{flex_retrieve}
  1295 \begin{lemma}\label{flex_retrieve}
  1276 	If $\vdash v: (r\backslash s)$, then\\
  1296 	If $\vdash v: (r\backslash s)$, then\\
  1277 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
  1297 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
  1278 \end{lemma}
  1298 \end{lemma}
  1279 \begin{proof}
  1299 \begin{proof}
  1326 \noindent
  1346 \noindent
  1327 To piece things together and spell out the correctness
  1347 To piece things together and spell out the correctness
  1328 result of the bitcoded lexer more explicitly,
  1348 result of the bitcoded lexer more explicitly,
  1329 we use the fact from the previous chapter that
  1349 we use the fact from the previous chapter that
  1330 \[
  1350 \[
  1331 	 (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s = v
  1351 	 (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s =\Some \;  v
       
  1352 	 \quad \quad
       
  1353 	 \nexists v. (r, s) \rightarrow v\;\; \textit{iff} \;\; \lexer \;r\;s = None
  1332 \]
  1354 \]
  1333 to obtain this corollary:
  1355 to obtain this corollary:
  1334 \begin{corollary}\label{blexerPOSIX}
  1356 \begin{corollary}\label{blexerPOSIX}
  1335 	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = v$
  1357 	The $\blexer$ function correctly implements a POSIX lexer, namely\\
       
  1358 	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = \Some \; v$\\
       
  1359 	$\nexists v. (r, s) \rightarrow v\;\; \textit{iff} \;\; \blexer \;r\;s = None$
  1336 \end{corollary}
  1360 \end{corollary}
  1337 Our main reason for wanting a bit-coded algorithm over 
  1361 Our main reason for analysing the bit-coded algorithm over 
  1338 the injection-based one is for its capabilities of allowing
  1362 the injection-based one is that it allows us to define
  1339 more aggressive simplifications.
  1363 more aggressive simplifications.
  1340 We will elaborate on this in the next chapter.
  1364 We will elaborate on this in the next chapter.
  1341 
  1365 
  1342 
  1366