% Chapter Template
% Main chapter title
\chapter{Bit-coded Algorithm of Sulzmann and Lu}
\label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive
%simplifications and therefore introduce our version of the bitcoded algorithm and
%its correctness proof in
%Chapter 3\ref{Chapter3}.
In this chapter, we describe the bit-coded algorithm
introduced by Sulzmann and Lu, which addresses the problem of
under-simplified regular expressions.
\section{Bit-coded Algorithm}
The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
stores information about previous lexing steps
on a stack, in the form of regular expressions
and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
\begin{envForCaption}
\begin{ceqn}
\begin{equation}%\label{graph:injLexer}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}
\caption{Injection-based Lexing from Chapter~\ref{Inj}}\label{InjFigure}
\end{envForCaption}
\noindent
This is both inefficient and prone to stack overflow.
A natural question arises: can we store the lexing
information on the fly, while still using regular expression
derivatives?
At any point during a run of the lexer, the current input position splits the
string into a sub-string that has already been consumed
and a sub-string that is yet to come.
What has been consumed is already known, and one would expect this to be
reflected in the value and the regular expression at that step as well. But this is not the
case for injection-based regular expression derivatives.
Take the regular expression $(aa)^* \cdot bc$ matching the string $aabc$
as an example, and suppose we have just read the first two characters $aa$:
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{Consumed: $aa$
\nodepart{two} Not Yet Reached: $bc$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\caption{Partially matched String}
\end{envForCaption}
\end{center}
%\caption{Input String}\label{StringPartial}
%\end{figure}
\noindent
We have the value that has already been partially calculated,
and the part that has yet to come:
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$\Seq(\Stars[\Seq(\Char(a), \Char(a))], ???)$
\nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\caption{Partially constructed Value}
\end{envForCaption}
\end{center}
In the regular expression derivative (after simplification),
all we have left is what is about to come:
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
{$???$
\nodepart{two} To Come: $b c$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\caption{Derivative}
\end{envForCaption}
\end{center}
\noindent
The already-matched part is missing.
What if we keep the partially constructed value
attached to the front of the regular expression?
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$\Seq(\Stars[\Seq(\Char(a), \Char(a))], \ldots)$
\nodepart{two} To Come: $b c$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\caption{Derivative}
\end{envForCaption}
\end{center}
\noindent
If we do this kind of ``attachment''
and augment the attached partially
constructed value each time a
character is taken off:
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$\Seq(\Stars[\Seq(\Char(a), \Char(a))], \Seq(\Char(b), \ldots))$
\nodepart{two} To Come: $c$};
\end{tikzpicture}\\
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$\Seq(\Stars[\Seq(\Char(a), \Char(a))], \Seq(\Char(b), \Char(c)))$
\nodepart{two} EOF};
\end{tikzpicture}
\caption{After $\backslash b$ and $\backslash c$}
\end{envForCaption}
\end{center}
\noindent
In the end we could recover the value without a backward injection phase.
But (partial) values are clumsy to attach directly to a regular expression, so
we instead use bit-codes to encode them.
Bits and bitcodes (lists of bits) are defined as:
\begin{envForCaption}
\begin{center}
$b ::= S \mid Z \qquad
bs ::= [] \mid b::bs
$
\end{center}
\caption{Bit-codes datatype}
\end{envForCaption}
\noindent
We use $S$ and $Z$ rather than $1$ and $0$ in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$.
Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:
\begin{envForCaption}
\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
$\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
$\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
$\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
code(\Stars\,vs)$
\end{tabular}
\end{center}
\caption{Coding Function for Values}
\end{envForCaption}
\noindent
Here $\textit{code}$ encodes a value into a bit-code by converting
$\Left$ into $Z$, $\Right$ into $S$, and marking the start of every non-empty
star iteration with $S$. The border where a star terminates
is marked with $Z$.
This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression.
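To make the coding function concrete, here is a small sketch in Scala (our
choice of illustration language; the datatype encoding and names are our own
and not part of Sulzmann and Lu's formalisation):
\begin{verbatim}
// bits and values as Scala 3 enums; code mirrors the clauses above
enum Bit:
  case S, Z

enum Val:
  case Empty
  case Chr(c: Char)
  case Sequ(v1: Val, v2: Val)
  case Left(v: Val)
  case Right(v: Val)
  case Stars(vs: List[Val])

import Bit.*, Val.*

// the coding function records only the "choices" made inside a value
// and drops the characters
def code(v: Val): List[Bit] = v match
  case Empty          => Nil
  case Chr(_)         => Nil
  case Left(v)        => Z :: code(v)
  case Right(v)       => S :: code(v)
  case Sequ(v1, v2)   => code(v1) ::: code(v2)
  case Stars(Nil)     => List(Z)
  case Stars(v :: vs) => S :: code(v) ::: code(Stars(vs))
\end{verbatim}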
We define the reverse operation of $\code$, called $\decode$.
As expected, $\decode$ requires not only the bit-codes
but also a regular expression, which guides the decoding and
fills in the missing characters:
%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{envForCaption}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
$\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
(\Left\,v, bs_1)$\\
$\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
$\textit{decode}\,bs\,r$ & $\dn$ &
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\end{envForCaption}
%\end{definition}
\noindent
$\decode'$ does most of the work, while $\decode$ additionally checks that no
bit-codes are left over and returns just the value.
$\decode$ is terminating because $\decode'$ is terminating.
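The following Scala sketch continues the encoding above; it includes, for
self-containedness, the plain regular expressions from Chapter \ref{Inj}
(again, the names are our own):
\begin{verbatim}
enum Rexp:
  case ZERO
  case ONE
  case CHAR(c: Char)
  case ALT(r1: Rexp, r2: Rexp)
  case SEQ(r1: Rexp, r2: Rexp)
  case STAR(r: Rexp)

import Rexp.*

// decodeAux plays the role of decode'; it returns the decoded value
// together with the not-yet-consumed bits
def decodeAux(bs: List[Bit], r: Rexp): (Val, List[Bit]) = (bs, r) match
  case (bs, ONE)     => (Empty, bs)
  case (bs, CHAR(c)) => (Chr(c), bs)
  case (Z :: bs, ALT(r1, _)) =>
    val (v, bs1) = decodeAux(bs, r1); (Left(v), bs1)
  case (S :: bs, ALT(_, r2)) =>
    val (v, bs1) = decodeAux(bs, r2); (Right(v), bs1)
  case (bs, SEQ(r1, r2)) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (Z :: bs, STAR(_)) => (Stars(Nil), bs)
  case (S :: bs, STAR(r)) =>
    val (v, bs1) = decodeAux(bs, r)
    decodeAux(bs1, STAR(r)) match
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
      case _                => sys.error("ill-formed bit-code")
  case _ => sys.error("bit-code does not fit the regular expression")

// decode additionally checks that no bits are left over
def decode(bs: List[Bit], r: Rexp): Option[Val] =
  decodeAux(bs, r) match
    case (v, Nil) => Some(v)
    case _        => None
\end{verbatim}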
We have the property that $\decode$ and $\code$ are
reverse operations of one another:
\begin{lemma}
\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
\end{lemma}
\begin{proof}
By proving a more general version of the lemma for $\decode'$:
\[\vdash v : r \implies \decode' \; ((\code \; v) \,@\, ds) \; r = (v, ds) \]
Setting $ds$ to $[]$ and unfolding the definition of $\decode$
then yields the lemma.
\end{proof}
With the $\code$ and $\decode$ functions in hand, we know how to
switch between bit-codes and values---the two different representations of
lexing information.
The next step is to integrate this information into the regular expressions
the algorithm operates on.
Attaching bits to the front of regular expressions is the solution Sulzmann and Lu
gave for storing partial values on the fly:
\begin{center}
\begin{tabular}{lcl}
$\textit{a}$ & $::=$ & $\ZERO$\\
& $\mid$ & $_{bs}\ONE$\\
& $\mid$ & $_{bs}{\bf c}$\\
& $\mid$ & $_{bs}\sum\,as$\\
& $\mid$ & $_{bs}a_1\cdot a_2$\\
& $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})
\noindent
We call these regular expressions carrying bit-codes \emph{annotated regular expressions};
$bs$ stands for bit-codes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
The first operation we define on bit-coded regular expressions
is $\fuse$, which attaches (``pastes'') bit-codes
to the front of an annotated regular expression:
\begin{center}
\begin{tabular}{lcl}
$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
$_{bs @ bs'}\ONE$\\
$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
$_{bs@bs'}{\bf c}$\\
$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
$_{bs@bs'}\sum\textit{as}$\\
$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
$_{bs@bs'}a_1 \cdot a_2$\\
$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
$_{bs @ bs'}a^*$
\end{tabular}
\end{center}
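\noindent
The following Scala sketch gives one possible encoding of annotated regular
expressions together with $\fuse$ (the constructor names are our own):
\begin{verbatim}
// every constructor except AZERO carries a list of bits at the front
enum ARexp:
  case AZERO
  case AONE(bs: List[Bit])
  case ACHAR(bs: List[Bit], c: Char)
  case AALTS(bs: List[Bit], as: List[ARexp])
  case ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp)
  case ASTAR(bs: List[Bit], a: ARexp)

import ARexp.*

// fuse prepends bits to the top-most node
def fuse(bs: List[Bit], a: ARexp): ARexp = a match
  case AZERO             => AZERO
  case AONE(bsq)         => AONE(bs ::: bsq)
  case ACHAR(bsq, c)     => ACHAR(bs ::: bsq, c)
  case AALTS(bsq, as)    => AALTS(bs ::: bsq, as)
  case ASEQ(bsq, a1, a2) => ASEQ(bs ::: bsq, a1, a2)
  case ASTAR(bsq, a)     => ASTAR(bs ::: bsq, a)
\end{verbatim}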
\noindent
With that we are able to define $\internalise$.
To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:
%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ &
$_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
\textit{fuse}\,[S]\,r_2^\uparrow]$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
$(r^*)^\uparrow$ & $\dn$ &
$_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}
\noindent
For convenience, we use an up arrow in postfix notation
to denote this operation,
as the notation $\textit{internalise} \; r$
is more cumbersome.
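In the Scala sketch, internalisation (assuming the datatypes and $\fuse$
from the earlier sketches) might look as follows:
\begin{verbatim}
// the left/right branch of an alternative is marked with Z/S via fuse
def internalise(r: Rexp): ARexp = r match
  case ZERO        => AZERO
  case ONE         => AONE(Nil)
  case CHAR(c)     => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(Z), internalise(r1)),
                                      fuse(List(S), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r)     => ASTAR(Nil, internalise(r))
\end{verbatim}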
The opposite of $\textit{internalise}$ is
$\erase$, which removes all the bit-codes
and transforms the alternative operator $\sum$ of annotated
regular expressions back into the binary alternatives
of plain regular expressions:
\begin{center}
\begin{tabular}{lcr}
$\erase \; \ZERO$ & $\dn$ & $\ZERO$\\
$\erase \; _{bs}\ONE$ & $\dn$ & $\ONE$\\
$\erase \; _{bs}\mathbf{c}$ & $\dn$ & $\mathbf{c}$\\
$\erase \; _{bs} a_1 \cdot a_2$ & $\dn$ & $(\erase \; a_1)\cdot(\erase\; a_2)$\\
$\erase \; _{bs} \sum [] $ & $\dn$ & $\ZERO$\\
$\erase \; _{bs} \sum [a] $ & $\dn$ & $\erase \; a$\\
$\erase \; _{bs} \sum [a_1, \; a_2]$ & $\dn$ & $\erase \; a_1 +\erase \; a_2$\\
$\erase \; _{bs} \sum (a :: as)$ & $\dn$ & $\erase \; a + \erase \; (_{[]} \sum as)$\\
$\erase \; _{bs} a^*$ & $\dn$ & $(\erase \; a)^*$
\end{tabular}
\end{center}
\noindent
We also abbreviate the operation $\erase\; a$
as $a_\downarrow$, for conciseness.
Since bit-coded regular expressions are a different datatype,
testing whether their language contains the empty string requires
a dedicated function $\bnullable$,
which simply erases the annotations before testing $\nullable$:
\begin{center}
\begin{tabular}{lcr}
$\bnullable \; a $ & $\dn$ & $\nullable \; (a_\downarrow)$
\end{tabular}
\end{center}
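\noindent
In our running Scala sketch, $\erase$ and $\bnullable$ might look as follows
(the $\nullable$ function from Chapter \ref{Inj} is repeated so that the
sketch stays self-contained):
\begin{verbatim}
// erase drops the bit-codes and turns the n-ary alternatives
// back into binary ones
def erase(a: ARexp): Rexp = a match
  case AZERO             => ZERO
  case AONE(_)           => ONE
  case ACHAR(_, c)       => CHAR(c)
  case AALTS(_, Nil)     => ZERO
  case AALTS(_, List(a)) => erase(a)
  case AALTS(_, a :: as) => ALT(erase(a), erase(AALTS(Nil, as)))
  case ASEQ(_, a1, a2)   => SEQ(erase(a1), erase(a2))
  case ASTAR(_, a)       => STAR(erase(a))

def nullable(r: Rexp): Boolean = r match
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true

def bnullable(a: ARexp): Boolean = nullable(erase(a))
\end{verbatim}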
The function that collects the
bit-codes of a $\bnullable$ annotated regular expression
is a generalisation of the $\textit{mkeps}$ function
to annotated regular expressions, called $\textit{bmkeps}$:
%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a$\\
& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
& &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{[]}\sum \textit{as})$\\
$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
$bs \,@\, [Z]$
\end{tabular}
\end{center}
%\end{definition}
\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, and using $S$ to indicate the end of star
iterations.
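A Scala sketch of $\bmkeps$ (again in our own encoding) is:
\begin{verbatim}
// collect the bits along a nullable path: leftmost bnullable alternative,
// both parts of a sequence, and Z to close off a star
def bmkeps(a: ARexp): List[Bit] = a match
  case AONE(bs)           => bs
  case AALTS(bs, a :: as) =>
    if bnullable(a) then bs ::: bmkeps(a)
    else bs ::: bmkeps(AALTS(Nil, as))
  case ASEQ(bs, a1, a2)   => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _)       => bs ::: List(Z)
  case _                  => sys.error("bmkeps: not bnullable")
\end{verbatim}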
The central question is how this partial lexing information,
represented as bit-codes, is augmented and carried around
when a derivative is taken.
This is done by adding bit-codes in the
derivative operation. For example, when one more star iteration is taken (we
call the derivative operation on annotated regular expressions $\bder$
because it takes derivatives of regular expressions with \emph{b}itcodes),
we need to unfold the star into a sequence
and attach an additional bit $Z$ to the front of $a \backslash c$
to indicate one more star iteration:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\bder \; c\; (_{bs}a^*) $ & $\dn$ &
$_{bs}(\textit{fuse}\, [Z] \; (\bder \; c \; a))\cdot
(_{[]}a^*)$
\end{tabular}
\end{center}
\noindent
Most of the time we use the infix notation $\backslash$ to mean $\bder$, for brevity, when
there is no danger of confusion with derivatives of plain regular expressions.
For example, the above clause can be expressed as
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}(\textit{fuse}\, [Z] \; (a\,\backslash c))\cdot
(_{[]}a^*)$
\end{tabular}
\end{center}
\noindent
Using the pictures from earlier, the transformation performed when
taking the derivative of a star can be depicted as follows:
\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$bs$
\nodepart{two} $a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
&
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$v_{\text{previous iterations}}$
\nodepart{two} $a^*$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\\
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$bs \,@\, [Z]$
\nodepart{two} $(a\backslash c )\cdot a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
&
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
\node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
{$v_{\text{previous iterations}}$ + 1 more iteration
\nodepart{two} $(a\backslash c )\cdot a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\end{tabular}
\noindent
Another place where $\bder$ differs
from the ordinary derivative (on un-annotated regular expressions)
is the sequence case:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
\end{tabular}
\end{center}
The difference is that, when $a_1$ is $\bnullable$,
we use $\bmkeps$ to extract the lexing information
stored in $a_1$ (which has been fully matched by the string prior to $c$)
and attach the collected bit-codes to the front of $a_2$
before throwing $a_1$ away. We assume that $\bmkeps$ correctly extracts the bitcode for how $a_1$
matches the string prior to $c$ (more on this later).
The bit-sequence $\textit{bs}$, which was initially attached to the first element of the sequence
$a_1 \cdot a_2$, has now been elevated to the top level of the $\sum$.
This is because this piece of information will be needed whichever way the sequence is matched,
regardless of whether $c$ belongs to $a_1$ or $a_2$.
In injection-based lexing, $r_1$ is immediately thrown away in
the right branch of the derivative (when $r_1$ is $\nullable$),
\begin{center}
$(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$
\end{center}
\noindent
because the lexer knows that $r_1$ is stored on the stack and becomes available again once
the recursive calls on the later derivatives finish.
Therefore, if the $\Right$ branch is taken in a $\POSIX$ match,
the sequence value is reconstructed one step back, during injection,
using $\mkeps(r_1)$:
\begin{center}
\begin{tabular}{lcr}
$\ldots r_1 \cdot r_2$ & $\rightarrow$ & $(r_1\backslash c)\cdot r_2 + r_2 \backslash c \ldots $\\
$\ldots \Seq(\mkeps(r_1), \inj \; r_2 \; c\; v_{2c})$ & $\leftarrow$ & $\Right(v_{2c})\ldots$
\end{tabular}
\end{center}
\noindent
The remaining clauses of $\bder$ are rather similar to
those of $\der$; we put them together here to give the complete definition
of $\bder$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
_{bs}\ONE\;\textit{else}\;\ZERO$\\
$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
$_{bs}\sum\;(\textit{map} \; (\_\backslash c) \; as )$\\
$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}(\textit{fuse}\, [Z] \; (a\,\backslash c))\cdot
(_{[]}a^*)$
\end{tabular}
\end{center}
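\noindent
A Scala sketch of $\bder$, building on the helper functions from the previous
sketches, may make the flow of the bit-codes clearer:
\begin{verbatim}
def bder(c: Char, a: ARexp): ARexp = a match
  case AZERO         => AZERO
  case AONE(_)       => AZERO
  case ACHAR(bs, d)  => if c == d then AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if bnullable(a1) then
      // a nullable a1 is summarised by bmkeps(a1), fused onto a2's derivative
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a)  =>
    // Z records that one more star iteration has been started
    ASEQ(bs, fuse(List(Z), bder(c, a)), ASTAR(Nil, a))
\end{verbatim}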
\noindent
Generalising the derivative operation with bitcodes to strings, we have
\begin{center}
\begin{tabular}{@{}lcl@{}}
$a\backslash_s [] $ & $\dn$ & $a$\\
$a\backslash_s (c :: s) $ & $\dn$ & $(a \backslash c) \backslash_s s$
\end{tabular}
\end{center}
As before, we omit the subscript $s$ in $\backslash_s$ when there is no danger
of confusion.
Putting this all together, we obtain a lexer with bit-coded regular expressions
as its internal data structures, which we call $\blexer$:
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
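\noindent
In the Scala sketch, the derivative is iterated over the string and $\blexer$
is assembled from the functions defined so far (a sketch in our own encoding,
not Sulzmann and Lu's code):
\begin{verbatim}
def bders(s: List[Char], a: ARexp): ARexp = s match
  case Nil     => a
  case c :: cs => bders(cs, bder(c, a))

def blexer(r: Rexp, s: String): Option[Val] =
  val a = bders(s.toList, internalise(r))
  if bnullable(a) then decode(bmkeps(a), r) else None

// running example: (aa)* . bc on the input "aabc"
// blexer(SEQ(STAR(SEQ(CHAR('a'), CHAR('a'))), SEQ(CHAR('b'), CHAR('c'))), "aabc")
// returns Some(...) with the same POSIX value as the injection-based lexer
\end{verbatim}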
\noindent
Ausaf and Urban formally proved the correctness of $\blexer$, namely
\begin{conjecture}
$\blexer \;r \; s = \lexer \; r \; s$.
\end{conjecture}
This was claimed but not formalised in Sulzmann and Lu's work.
We give the proof later, after introducing all
the necessary auxiliary functions and definitions.
%-----------------------------------
% SUBSECTION 1
%-----------------------------------
\section{Specifications of Some Helper Functions}
The functions we introduce in this section give a more detailed glimpse into
the lexing process than is possible
with $\lexer$ or $\blexer$ alone.
The first function we shall look at is $\retrieve$.
\subsection{$\textit{Retrieve}$}
Our bit-coded lexer ``retrieve''s the bit-codes using $\bmkeps$
after all the derivatives have been taken:
\begin{center}
\begin{tabular}{lcl}
& & $\ldots$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\ldots$
\end{tabular}
\end{center}
\noindent
Recall that $\bmkeps$ goes to the leftmost $\bnullable$ branch of an alternative
and completes a star's iterations by attaching a $Z$ at the end of the bit-codes
extracted. It ``retrieves'' the bit-codes of a sequence by visiting both children and stitching
the two bit-codes together with concatenation. After the entire tree structure of the regular
expression has been traversed in this manner, we obtain a bit-code encoding the
lexing result.
We know that this ``retrieved'' bit-code leads to the correct value after decoding,
which is $v_0$ in the bird's-eye view of the injection-based lexing diagram.
Now assume we keep every other data structure in diagram \ref{InjFigure}
and only replace the plain regular expressions by their annotated counterparts,
computed during a $\blexer$ run.
Then we obtain a diagram of the annotated regular expression derivatives and
their corresponding values, even though the values are never calculated in $\blexer$.
We have that $a_n$ contains all the lexing result information.
\vspace{20mm}
\begin{center}%\label{graph:injLexer}
\begin{tikzcd}[
every matrix/.append style = {name=p},
remember picture, overlay,
]
a_0 \arrow[r, "\backslash c_0"] \arrow[d] & a_1 \arrow[r, "\backslash c_1"] \arrow[d] & a_2 \arrow[r, dashed] \arrow[d] & a_n \arrow[d] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\begin{tikzpicture}[
remember picture, overlay,
E/.style = {ellipse, draw=blue, dashed,
inner xsep=4mm,inner ysep=-4mm, rotate=90, fit=#1}
]
\node[E = (p-1-1) (p-2-1)] {};
\node[E = (p-1-4) (p-2-4)] {};
\end{tikzpicture}
\end{center}
\vspace{20mm}
\noindent
On the other hand, $v_0$ also encodes the correct lexing result, as we have proven for $\lexer$.
Encircled in the diagram are the two pairs $v_0, a_0$ and $v_n, a_n$, which both
encode the correct lexical result. For the leftmost pair the
information is condensed in the value $v_0$, whereas for the rightmost pair
it is concentrated in $a_n$.
The intermediate pairs $v_i, a_i$ must in some way encode the complete
lexing information as well. We therefore seek a unified function $f$ that extracts this lexing result
from a value $v_i$ and its annotated regular expression $a_i$;
the function $f$ must satisfy these requirements:
\begin{itemize}
\item
$f \; a_i\;v_i = f \; a_n \; v_n = \decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
\item
$f \; a_i\;v_i = f \; a_0 \; v_0 = v_0 = \decode \;(\code \; v_0) \; (\erase \; a_0)$
\end{itemize}
\noindent
If we factor out the common part $\decode \; \_ \; (\erase \; a_0)$,
the core of such a function $f$ is something that produces the bit-codes
$\code \; v_0$.
Sulzmann and Lu came up with a function satisfying all of the above
requirements, named \emph{retrieve}:
\begin{center}
\begin{tabular}{lcr}
$\retrieve \; \, (_{bs} \ONE) \; \, (\Empty)$ & $\dn$ & $\textit{bs}$\\
$\retrieve \; \, (_{bs} \mathbf{c} ) \; \, (\Char(c))$ & $\dn$ & $ \textit{bs}$\\
$\retrieve \; \, (_{bs} a_1 \cdot a_2) \; \, (\Seq(v_1, v_2))$ & $\dn$ & $\textit{bs} @ (\retrieve \; a_1\; v_1) @ (\retrieve \; a_2 \; v_2)$\\
$\retrieve \; \, (_{bs} \sum (a :: \textit{as})) \; \, (\Left(v))$ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v)$\\
$\retrieve \; \, (_{bs} \sum (a :: \textit{as})) \; \, (\Right(v))$ & $\dn$ & $\textit{bs} @ (\retrieve \; (_{[]}\sum \textit{as}) \; v)$\\
$\retrieve \; \, (_{bs} a^*) \; \, (\Stars(v :: vs)) $ & $\dn$ & $\textit{bs} @ (\retrieve \; a \; v) @ (\retrieve \; (_{[]} a^*) \; (\Stars(vs)))$\\
$\retrieve \; \, (_{bs} a^*) \; \, (\Stars([])) $ & $\dn$ & $\textit{bs} @ [Z]$
\end{tabular}
\end{center}
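\noindent
A Scala sketch of $\retrieve$ in our encoding is:
\begin{verbatim}
// walk an annotated regular expression and a value of the corresponding
// (erased) shape in lockstep, collecting the bit-codes
def retrieve(a: ARexp, v: Val): List[Bit] = (a, v) match
  case (AONE(bs), Empty)                => bs
  case (ACHAR(bs, _), Chr(_))           => bs
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
    bs ::: retrieve(a1, v1) ::: retrieve(a2, v2)
  case (AALTS(bs, a :: _), Left(v))     => bs ::: retrieve(a, v)
  case (AALTS(bs, _ :: as), Right(v))   => bs ::: retrieve(AALTS(Nil, as), v)
  case (ASTAR(bs, a), Stars(v :: vs))   =>
    bs ::: retrieve(a, v) ::: retrieve(ASTAR(Nil, a), Stars(vs))
  case (ASTAR(bs, _), Stars(Nil))       => bs ::: List(Z)
  case _ => sys.error("retrieve: value does not fit the regular expression")
\end{verbatim}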
\noindent
As promised, $\retrieve$ collects the right bit-codes from the
final derivative $a_n$:
\begin{lemma}
$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
\end{lemma}
\begin{proof}
By a routine induction on $a$.
\end{proof}
The design of $\retrieve$ enables the extraction of bit-codes
not only from $\bnullable$ (annotated) regular expressions,
but also from those that are not $\bnullable$.
For example, if we have a regular expression that has just been internalised
together with the lexing result value, we can $\retrieve$ the bit-codes
that look as if we had en$\code$-ed the value:
\begin{lemma}
$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
\end{lemma}
\begin{proof}
By induction on $r$.
\end{proof}
\noindent
The following property is more interesting, as
it provides a ``bridge'' between $a_0, v_0$ and $a_n, v_n$ in the
lexing diagram:
if we take the derivative of an annotated regular expression,
we can $\retrieve$ the same bit-codes as before the derivative was taken,
provided that we use the corresponding value:
\begin{lemma}\label{retrieveStepwise}
$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$
\end{lemma}
\begin{proof}
By induction on $r$, where $v$ is allowed to be arbitrary.
The induction follows the case analysis of the function $\erase$.
\end{proof}
\noindent
$\retrieve$ is connected to the $\blexer$ in the following way:
\begin{lemma}\label{blexer_retrieve}
$\blexer \; r \; s = \decode \; (\retrieve \; ((\internalise \; r)\backslash s) \; (\mkeps \; (r \backslash s) )) \; r$
\end{lemma}
\noindent
$\retrieve$ allows us to navigate freely in diagram \ref{InjFigure} on the annotated regular expressions of $\blexer$.
For plain regular expressions something similar is needed as well.
\subsection{$\flex$}
Ausaf and Urban cleverly defined an auxiliary function for $\lexer$,
called $\flex$:
\begin{center}
\begin{tabular}{lcr}
$\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\
$\flex \; r \; f \; (c :: s) \; v$ & $=$ & $\flex \; (r\backslash c) \; (\lambda v. \, f (\inj \; r\; c\; v))\; s \; v$
\end{tabular}
\end{center}
It accumulates the characters that need to be injected back,
and performs the injections in a stack-like manner (the derivative taken last is injected first).
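A Scala sketch of $\flex$, assuming Scala versions of $\der$ and $\inj$ from
Chapter \ref{Inj} are in scope with the signatures used below, is:
\begin{verbatim}
// accumulate the pending injections in f; note the derivative der(c, r)
// in the recursive call
def flex(r: Rexp, f: Val => Val, s: List[Char], v: Val): Val = s match
  case Nil     => f(v)
  case c :: cs => flex(der(c, r), (w: Val) => f(inj(r, c, w)), cs, v)
\end{verbatim}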
$\flex$ is connected to $\lexer$ as follows:
\begin{lemma}
$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
\end{lemma}
\begin{proof}
By reverse induction on $s$.
\end{proof}
$\flex$ provides a bridge between $\lexer$'s intermediate steps.
What is even better about $\flex$ is that it allows us to
operate directly on the value $\mkeps\, (r\backslash s)$,
which is pivotal in the definitions of $\lexer$ and $\blexer$, but not visible as an argument.
When the value created by $\mkeps$ becomes available, one can
prove stepwise properties of lexing nicely:
\begin{lemma}\label{flexStepwise}
$\textit{flex} \; r \; f \; (s@[c]) \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
\end{lemma}
\begin{proof}
By induction on the shape of $r\backslash s$.
\end{proof}
\noindent
With $\flex$ and $\retrieve$ in place, we can connect $\lexer$ and $\blexer$.
\subsection{Correctness Proof of Bit-coded Algorithm}
\begin{lemma}\label{flex_retrieve}
$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
\end{lemma}
\begin{proof}
By reverse induction on the string $s$, where $v$ is allowed to be arbitrary.
The crucial point in the inductive step is to rewrite
\[
\retrieve \; (r \backslash (s@[c])) \; v
\]
as
\[
\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; v)
\]
using lemma \ref{retrieveStepwise}.
Decoding the latter and applying the induction hypothesis gives
\[
\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; v),
\]
which in turn can be rewritten as
\[
\flex \; r \; \textit{id} \; (s@[c]) \; v
\]
by lemma \ref{flexStepwise}.
\end{proof}
With the above lemma we can now link $\flex$ and $\blexer$.
%----------------------------------------------------------------------------------------
% SECTION correctness proof
%----------------------------------------------------------------------------------------
\section{Correctness of Bit-coded Algorithm (Without Simplification)}
We now give the proof of the correctness of the algorithm with bit-codes.
\begin{lemma}\label{flex_blexer}
$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$
\end{lemma}
\begin{proof}
Using the two lemmas \ref{flex_retrieve} and \ref{blexer_retrieve} above.
\end{proof}
Finally, the correctness of $\blexer$ follows, as it outputs the same result as $\lexer$:
\begin{theorem}
$\blexer\; r \; s = \lexer \; r \; s$
\end{theorem}
\begin{proof}
Straightforward corollary of lemma \ref{flex_blexer}, together with the lemma connecting $\flex$ and $\lexer$.
\end{proof}
\noindent
Our main reason for preferring the bit-coded algorithm over
the injection-based one is that it allows
more aggressive simplifications.
We will elaborate on this in the next chapter.