lexing: ChengsongTanPhdThesis/Chapters/Bitcoded1.tex@8016a2480704


% Chapter Template

% Main chapter title
\chapter{Bit-coded Algorithm of Sulzmann and Lu}

\label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
%simplifications and therefore introduce our version of the bitcoded algorithm and 
%its correctness proof in 
%Chapter 3\ref{Chapter3}. 
In this chapter, we are going to introduce the bit-coded algorithm
introduced by Sulzmann and Lu to address the problem of 
under-simplified regular expressions. 
\section{Bit-coded Algorithm}
The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},
stores information of previous lexing steps
on a stack, in the form of regular expressions
and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
\begin{envForCaption}
\begin{ceqn}
\begin{equation}%\label{graph:injLexer}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}
\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}
\end{envForCaption}
\noindent
This is both inefficient and prone to stack overflow.
A natural question arises as to whether we can store lexing
information on the fly, while still using regular expression 
derivatives?

In a lexing algorithm's run, split by the current input position,
we have a sub-string that has been consumed,
and the sub-string that has yet to come.
We already know what was before, and this should be reflected in the value 
and the regular expression at that step as well. But this is not the 
case for injection-based regular expression derivatives.
Take the regex $(aa)^* \cdot bc$ matching the string $aabc$
as an example, if we have just read the two former characters $aa$:

\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        {Consumed: $aa$
         \nodepart{two} Not Yet Reached: $bc$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\caption{Partially matched String} 
\end{envForCaption}
\end{center}
%\caption{Input String}\label{StringPartial}
%\end{figure}

\noindent
We have the value that has already been partially calculated,
and the part that has yet to come:
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        {$\Seq(\Stars[\Char(a), \Char(a)], ???)$
         \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\caption{Partially constructed Value} 
\end{envForCaption}
\end{center}

In the regex derivative part , (after simplification)
all we have is just what is about to come:
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},]
        {$???$
         \nodepart{two} To Come: $b c$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\caption{Derivative} 
\end{envForCaption}
\end{center}
\noindent
The previous part is missing.
How about keeping the partially constructed value 
attached to the front of the regular expression?
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$
         \nodepart{two} To Come: $b c$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\caption{Derivative} 
\end{envForCaption}
\end{center}
\noindent
If we do this kind of "attachment"
and each time augment the attached partially
constructed value when taking off a 
character:
\begin{center}
\begin{envForCaption}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$
         \nodepart{two} To Come: $c$};
\end{tikzpicture}\\
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$
         \nodepart{two} EOF};
\end{tikzpicture}
\caption{After $\backslash b$ and $\backslash c$} 
\end{envForCaption}
\end{center}
\noindent
In the end we could recover the value without a backward phase.
But (partial) values are a bit clumsy to stick together with a regular expression, so 
we instead use bit-codes to encode them.

Bits and bitcodes (lists of bits) are defined as:
\begin{envForCaption}
\begin{center}
		$b ::=   S \mid  Z \qquad
bs ::= [] \mid b::bs    
$
\end{center}
\caption{Bit-codes datatype}
\end{envForCaption}

\noindent
Using $S$ and $Z$ rather than $1$ and $0$ is to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$.
Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes: 
\begin{envForCaption}
\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}    
\end{center} 
\caption{Coding Function for Values}
\end{envForCaption}

\noindent
Here $\textit{code}$ encodes a value into a bit-code by converting
$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
star iteration by $S$. The border where a local star terminates
is marked by $Z$. 
This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. 


We define the reverse operation of $\code$, which is $\decode$.
As expected, $\decode$ not only requires the bit-codes,
but also a regular expression to guide the decoding and 
fill the gaps of characters:


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{envForCaption}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\                           
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ & 
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
  
  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$                       
\end{tabular}    
\end{center} 
\end{envForCaption}
%\end{definition}

\noindent
$\decode'$ does most of the job while $\decode$ throws
away leftover bit-codes and returns the value only.
$\decode$ is terminating as $\decode'$ is terminating.
We have the property that $\decode$ and $\code$ are
reverse operations of one another:
\begin{lemma}
\[\vdash v : r \implies \decode \; (\code \; v) \; r = \textit{Some}(v) \]
\end{lemma}
\begin{proof}
By proving a more general version of the lemma, on $\decode'$:
\[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
Then setting $ds$ to be $[]$ and unfolding $\decode$ definition
we get the lemma.
\end{proof}
With the $\code$ and $\decode$ functions in hand, we know how to 
switch between bit-codes and value--the two different representations of 
lexing information. 
The next step is to integrate this information into the working regular expression.
Attaching bits to the front of regular expressions is the solution Sulzamann and Lu
gave for storing partial values on the fly:

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\sum\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}    
\end{center}  
%(in \textit{ALTS})

\noindent
We call these regular expressions carrying bit-codes \emph{Annotated regular expressions}.
$bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to 
accept a list of annotated regular expressions rather than just 2.
%We will show that these bitcodes encode information about
%the ($\POSIX$) value that should be generated by the Sulzmann and Lu
%algorithm.
The most central question is how these partial lexing information
represented as bit-codes is augmented and carried around 
during a derivative is taken.

This is done by adding bitcodes to the 
derivatives, for example when one more star iteratoin is taken (we
call the operation of derivatives on annotated regular expressions $\bder$
because it is derivatives on regexes with bitcodes):
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\bder \; c\; (_{bs}a^*) $ & $\dn$ &
      $_{bs}(\textit{fuse}\, [Z] \; \bder \; c \; a)\cdot
       (_{[]}a^*))$
\end{tabular}    
\end{center}    

\noindent
For most time we use the infix notation $\backslash$ to mean $\bder$ for brevity when
there is no danger of confusion with derivatives on plain regular expressions, 
for example, the above can be expressed as
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [Z] \; a\,\backslash c)\cdot
       (_{[]}a^*))$
\end{tabular}    
\end{center}   


Using the picture we used earlier to depict this, the transformation when 
taking a derivative w.r.t a star is like below:
\centering
\begin{tabular}{@{}l@{\hspace{1mm}}l@{\hspace{0mm}}c@{}}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        {$bs$
         \nodepart{two} $a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture} 
&
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        {$v_{\text{previous iterations}}$
         \nodepart{two} $a^*$};
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\\
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        { $bs$ + [Z]
         \nodepart{two}  $(a\backslash c )\cdot a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
&
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
        {$v_{\text{previous iterations}}$ + 1 more iteration
         \nodepart{two} $(a\backslash c )\cdot a^*$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\end{tabular}    
\noindent
The operation $\fuse$ is just to attach bit-codes 
to the front of an annotated regular expression:
\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
     $_{bs@bs'}\sum\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}    
\end{center} 

\noindent
Another place in the $\bder$ function where it differs
from normal derivatives on un-annotated regular expressions
is the sequence case:
\begin{center}
  \begin{tabular}{@{}lcl@{}}

  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
\end{tabular}    
\end{center}    
Here 


\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
       (_{[]}r^*))$
\end{tabular}    
\end{center}    


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\,
  \textit{fuse}\,[S]\,r_2^\uparrow]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}    
\end{center}    
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

 

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we beed to take care of
the bitcodes:





%\end{definition}
\noindent
For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
we need to unfold it into a sequence,
and attach an additional bit $Z$ to the front of $r \backslash c$
to indicate one more star iteration. Also the sequence clause
is more subtle---when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
\backslash c)$ will collapse the regular expression $a_1$(as it has
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing to it. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top-level of $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $ a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:


%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
     $bs \,@\, [Z]$
\end{tabular}    
\end{center}    
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, and using $S$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the  generalisation  of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).



%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
\section{Specifications of Some Helper Functions}
Here we give some functions' definitions, 
which we will use later.
\begin{center}
\begin{tabular}{ccc}
$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$
\end{tabular}
\end{center}


%----------------------------------------------------------------------------------------
%	SECTION  correctness proof
%----------------------------------------------------------------------------------------
\section{Correctness of Bit-coded Algorithm (Without Simplification)}
We now give the proof the correctness of the algorithm with bit-codes.

Ausaf and Urban cleverly defined an auxiliary function called $\flex$,
defined as
\begin{center}
\begin{tabular}{lcr}
$\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
$\flex \; r \; f \; c :: s \; v$ &  $=$ &   $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$
\end{tabular}
\end{center}
which accumulates the characters that needs to be injected back, 
and does the injection in a stack-like manner (last taken derivative first injected).
$\flex$ is connected to the $\lexer$:
\begin{lemma}
$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
\end{lemma}
$\flex$ provides us a bridge between $\lexer$ and $\blexer$.
What is even better about $\flex$ is that it allows us to 
directly operate on the value $\mkeps (r\backslash v)$,
which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
When the value created by $\mkeps$ becomes available, one can 
prove some stepwise properties of lexing nicely:
\begin{lemma}\label{flexStepwise}
$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
\end{lemma}

And for $\blexer$ we have a function with stepwise properties like $\flex$ as well,
called $\retrieve$\ref{retrieveDef}.
$\retrieve$ takes bit-codes from annotated regular expressions
guided by a value.
$\retrieve$ is connected to the $\blexer$ in the following way:
\begin{lemma}\label{blexer_retrieve}
$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
\end{lemma}
If you take derivative of an annotated regular expression, 
you can $\retrieve$ the same bit-codes as before the derivative took place,
provided that you use the corresponding value:

\begin{lemma}\label{retrieveStepwise}
$\retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
\end{lemma}
The other good thing about $\retrieve$ is that it can be connected to $\flex$:
%centralLemma1
\begin{lemma}\label{flex_retrieve}
$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
\end{lemma}
\begin{proof}
By induction on $s$. The induction tactic is reverse induction on strings.
$v$ is allowed to be arbitrary.
The crucial point is to rewrite 
\[
\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
\]
as
\[
\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash s@[c]))
\].
This enables us to equate 
\[
\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
\] 
with 
\[
\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
\],
which in turn can be rewritten as
\[
\flex \; r \; \textit{id} \; s@[c] \;  (\mkeps (r\backslash s@[c]))
\].
\end{proof}

With the above lemma we can now link $\flex$ and $\blexer$.

\begin{lemma}\label{flex_blexer}
$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
\end{lemma}
\begin{proof}
Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
\end{proof}
Finally
author	Chengsong
	Thu, 09 Jun 2022 12:57:53 +0100
changeset 538	8016a2480704
parent 537	50e590823220
child 542	a7344c9afbaf
permissions	-rwxr-xr-x