lexing: comparison ChengsongTanPhdThesis/Chapters/Bitcoded1.tex

equal deleted inserted replaced

-:c0c1ebe09c7d
+:8ffa28fce271
 %simplifications and therefore introduce our version of the bitcoded algorithm and
 %its correctness proof in
 %Chapter 3\ref{Chapter3}.
 In this chapter, we are going to describe the bit-coded algorithm
 introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of
+derivatives of
 regular expressions.
-We have implemented their algorithm in Scala, and found out inefficiencies
+We have implemented their algorithm in Scala and Isabelle,
+and found problems
 in their algorithm such as de-duplication not working properly and redundant
-fixpoint construction. Their algorithm is improved and verified with the help of
+fixpoint construction.
-formal proofs.
 \section{The Motivation Behind Using Bitcodes}
-We first do a recap of what was going on
+Let us give again the definition of $\lexer$ from Chapter \ref{Inj}:
-in the lexer algorithm in Chapter \ref{Inj},
 \begin{center}
 \begin{tabular}{lcl}
 	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
 	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
 	& & $\quad \phantom{\mid}\; \None \implies \None$\\
 	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
 \end{tabular}
 \end{center}
 \noindent
-The algorithm recursively calls $\lexer$ on
+The first problem with this algorithm is that
+for the function $\inj$ to work properly
+we cannot destroy the structure of a regular expression,
+and therefore canno simplify aggressively.
+For example,
+\[
+	r + (r + a) \rightarrow r + a
+\]
+cannot be done because that would require
+breaking up the inner alternative.
+The $\lexer$ function therefore only enables
+same-level de-duplications like
+\[
+	r + r \rightarrow r.
+\]
+Secondly, the algorithm recursively calls $\lexer$ on
 each new character input,
 and before starting a child call
 it stores information of previous lexing steps
 on a stack, in the form of regular expressions
 and characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.
 \caption{Stored $r_i, c_{i+1}$ Used by $\inj$}
 \end{figure}
 \noindent
 Storing all $r_i, c_{i+1}$ pairs recursively
 allows the algorithm to work in an elegant way, at the expense of
-storing quite a bit of verbose information.
+storing quite a bit of verbose information on the stack.
-The stack seems to grow at least quadratically fast with respect
+The stack seems to grow at least quadratically with respect
 to the input (not taking into account the size bloat of $r_i$),
-which can be inefficient and prone to stack overflow.
+which can be inefficient and prone to stack overflows.
 \section{Bitcoded Algorithm}
 To address this,
-Sulzmann and Lu chose to  define a new datatype
+Sulzmann and Lu defined a new datatype
 called \emph{annotated regular expression},
 which condenses all the partial lexing information
 (that was originally stored in $r_i, c_{i+1}$ pairs)
 into bitcodes.
 The bitcodes are then carried with the regular
 \end{center}
 \noindent
 where $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
 expressions and $as$ for lists of annotated regular expressions.
 The alternative constructor, written, $\sum$, has been generalised to
-accept a list of annotated regular expressions rather than just two.
+take a list of annotated regular expressions rather than just two.
-Why is it generalised? This is because when we open up nested
+Why is it generalised? This is because when we analyse nested
-alternatives, there could be more than two elements at the same level
+alternatives, there can be more than two elements at the same level
 after de-duplication, which can no longer be stored in a binary
 constructor.
 Bits and bitcodes (lists of bits) are defined as:
 \begin{center}
 		$b ::=   S \mid  Z \qquad
 $
 \end{center}
 \noindent
 We use $S$ and $Z$ rather than $1$ and $0$ is to avoid
 confusion with the regular expressions $\ZERO$ and $\ONE$.
-The idea is to use the attached bitcodes
+The idea is to use the bitcodes
 to indicate which choice was made at a certain point
 during lexing.
 For example,
 $(_{Z}a+_{S}b) \backslash a$ gives us
-$_{Z}\ONE + \ZERO$, this $Z$ bitcode will
+$_{Z}\ONE + \ZERO$, where the $Z$ bitcode will
 later be used to decode that a left branch was
-selected at the time when the part $a+b$ is being taken
+selected at the time when the part $a+b$ was anallysed by
-derivative of.
+derivative.
 \subsection{A Bird's Eye View of the Bit-coded Lexer}
-Before we give out the rest of the functions and definitions
+Before we give the details of the functions and definitions
 related to our
 $\blexer$ (\emph{b}-itcoded lexer), we first provide a high-level
-view of the algorithm, so the reader does not get lost in
+view of the algorithm.
-the details.
 \begin{figure}[H]
 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,thick]
 %\draw (-6,-6) grid (6,6);
 	\node [circle, draw] (r0) at (-6, 2) {$r$};
 	%edge [] node {} (ldots2);
 \end{tikzpicture}
-\caption{Bird's Eye View of $\blexer$}
+\caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand
+	for the annotated regular expressions in each derivative step.
+	The characters used in each derivative is written as $c_i$.
+The relative size increase reflect the size increase in each derivative step.}
 \end{figure}
 \noindent
 The plain regular expressions
 are first ``lifted'' to an annotated regular expression,
-with the function called \emph{internalise}.
+with the function called \emph{internalise} ($r \rightarrow _{bs}a$).
 Then the annotated regular expression $_{bs}a$ will
-go through successive derivatives with respect
+be transformed by successive derivatives with respect
 to the input stream of characters $c_1, c_2$ etc.
 Each time a derivative is taken, the bitcodes
 are moved around, discarded or augmented together
 with the regular expression they are attached to.
 After all input has been consumed, the
 bitcodes are collected by $\bmkeps$,
 which traverses the nullable part of the regular expression
-and collect the bitcodes along the way.
+and collects the bitcodes along the way.
 The collected bitcodes are then $\decode$d with the guidance
-of the input regular expression $r$.
+of the input regular expression $r$ ( $_{bs}a \rightarrow r$).
 The most notable improvements of $\blexer$
 over $\lexer$ are
 \begin{itemize}
 	\item
 		An absence of the second injection phase.
 	\item
-		One need not store each pair of the
+		One does not need to store each pair of the
 		intermediate regular expressions $_{bs_i}a_i$ and  $c_{i+1}$.
 		The previous annotated regular expression $_{bs_i}a_i$'s information is passed
 		on without loss to its successor $_{bs_{i+1}}a_{i+1}$,
-		and $c_{i+1}$'s already contained in the strings in $L\;r$ \footnote{
+		and $c_{i+1}$'s information is stored in $L\;r$.\footnote{
-		which can be easily recovered if we decode in the correct order}.
+		which will be used during the decode phase, where we use $r$ as
+	a source of information.}
 	\item
-		The simplification works much better--one can maintain correctness
+		Finally, simplification works much better--one can maintain correctness
-		while applying quite strong simplifications, as we shall wee.
+		while applying quite aggressive simplifications.
 \end{itemize}
-\noindent
+%In the next section we will
-In the next section we will
+%give the operations needed in $\blexer$,
-give the operations needed in $\blexer$,
+%with some remarks on the idea behind their definitions.
-with some remarks on the idea behind their definitions.
 \subsection{Operations in $\textit{Blexer}$}
 The first operation we define related to bit-coded regular expressions
 is how we move bits to the inside of regular expressions.
-Called $\fuse$, this operation attaches bit-codes
+This operation is called $\fuse$:
-to the front of an annotated regular expression:
 \begin{center}
 \begin{tabular}{lcl}
-$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
+	$\textit{fuse}\;bs \; (\ZERO)$ & $\dn$ & $\ZERO$\\
-$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
+	$\textit{fuse}\;bs\; (_{bs'})\ONE$ & $\dn$ &
 $_{bs @ bs'}\ONE$\\
-$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
+	$\textit{fuse}\;bs\;(_{bs'}{\bf c})$ & $\dn$ &
 $_{bs@bs'}{\bf c}$\\
-$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
+	$\textit{fuse}\;bs\;(_{bs'}\sum\textit{as})$ & $\dn$ &
 $_{bs@bs'}\sum\textit{as}$\\
-$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
+	$\textit{fuse}\;bs\; (_{bs'}a_1\cdot a_2)$ & $\dn$ &
 $_{bs@bs'}a_1 \cdot a_2$\\
-$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
+$\textit{fuse}\;bs\;(_{bs'}a^*)$ & $\dn$ &
 $_{bs @ bs'}a^*$
 \end{tabular}
 \end{center}
 \noindent
-With \emph{fuse} we are able to define the $\internalise$ function
+With \emph{fuse} we are able to define the $\internalise$ function,
+written $(\_)^\uparrow$,
 that translates a ``standard'' regular expression into an
 annotated regular expression.
 This function will be applied before we start
 with the derivative phase of the algorithm.
 $(r^*)^\uparrow$ & $\dn$ &
 $_{[]}(r^\uparrow)^*$\\
 \end{tabular}
 \end{center}
 \noindent
-We use an up arrow with postfix notation
-to denote this operation
-for convenience.
 The opposite of $\textit{internalise}$ is
-$\erase$, where all the bit-codes are removed,
+$\erase$, where all bit-codes are removed,
 and the alternative operator $\sum$ for annotated
 regular expressions is transformed to the binary version
-in plain regular expressions.
+in (plain) regular expressions. This can be defined as follows:
-\begin{center}
+\begin{center}\label{eraseDef}
 	\begin{tabular}{lcl}
 		$\ZERO_\downarrow$ & $\dn$ & $\ZERO$\\
 		$( _{bs}\ONE )_\downarrow$ & $\dn$ & $\ONE$\\
 		$( _{bs}\mathbf{c} )_\downarrow$ & $\dn$ & $\mathbf{c}$\\
 		$( _{bs} a_1 \cdot a_2 )_\downarrow$ & $\dn$ &
 		$ (a_1) _\downarrow \cdot  (a_2) _\downarrow$\\
 		$( _{bs} [])_\downarrow $ & $\dn$ & $\ZERO $\\
 		$( _{bs} [a]  )_\downarrow$ & $\dn$ & $a_\downarrow$\\
-		$_{bs} \sum [a_1, \; a_2]$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\
+		$(_{bs} \sum [a_1, \; a_2])_\downarrow$ & $\dn$ & $ (a_1) _\downarrow + ( a_2 ) _\downarrow $\\
 		$(_{bs} \sum (a :: as))_\downarrow$ & $\dn$ & $ a_\downarrow + \; (_{[]} \sum as)_\downarrow$\\
 		$( _{bs} a^* )_\downarrow$ & $\dn$ & $(a_\downarrow)^*$
 	\end{tabular}
 \end{center}
 \noindent
 We also abbreviate the $\erase\; a$ operation
 as $a_\downarrow$, for conciseness.
-Testing whether an annotated regular expression
+Determining whether an annotated regular expression
 contains the empty string in its lauguage requires
 a dedicated function $\bnullable$.
+In our formalisation this function is defined by simply calling $\erase$
+before $\nullable$.
 $\bnullable$ simply calls $\erase$ before $\nullable$.
 \begin{definition}
 		$\bnullable \; a \dn  \nullable \; (a_\downarrow)$
 \end{definition}
 The function for collecting
 bitcodes from a
 (b)nullable regular expression is called $\bmkeps$.
-$\bmkeps$ is a variation of the $\textit{mkeps}$ function,
+This function  is a lifted version of the $\textit{mkeps}$ function,
 which follows the path $\mkeps$ takes to traverse the
 $\nullable$ branches when visiting a regular expression,
-but gives back bitcodes instead of a value.
+but collects bitcodes instead of generating a value.
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
 $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
 $\textit{if}\;\textit{bnullable}\,a$\\
 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
 $bs \,@\, [S]$
 \end{tabular}
 \end{center}
 \noindent
-$\bmkeps$, just like $\mkeps$,
+This function, just like $\mkeps$,
 visits a regular expression tree respecting
-the POSIX rules. The difference, however, is that
+the POSIX rules.
-it does not create values, but only bitcodes.
 It traverses each child of the sequence regular expression
 from left to right and creates a bitcode by stitching
 together bitcodes obtained from the children expressions.
 In the case of alternative regular expressions,
 it looks for the leftmost
 %Whenever there is some bitcodes attached to a
 %node, it returns the bitcodes concatenated with whatever
 %child recursive calls return.
 The only time when $\bmkeps$ creates new bitcodes
 is when it completes a star's iterations by attaching a $S$ to the end of the bitcode
-list it returns.\\
+list it returns.
 The bitcodes extracted by $\bmkeps$ need to be
 $\decode$d (with the guidance of a plain regular expression):
 %\begin{definition}[Bitdecoding of Values]\mbox{}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 \textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
 \noindent
 The function $\decode'$ returns a pair consisting of
-a partially decoded value and some leftover bit list that cannot
+a partially decoded value and some leftover bit-list.
-be decide yet.
 The function $\decode'$ succeeds if the left-over
 bit-sequence is empty.
-$\decode$ is terminating as $\decode'$ is terminating.
-$\decode'$ is terminating
+%$\decode$ is terminating as $\decode'$ is terminating.
-because at least one of $\decode'$'s parameters will go down in terms
+%$\decode'$ is terminating
-of size.\\
+%because at least one of $\decode'$'s parameters will go down in terms
+%of size.\\
 The reverse operation of $\decode$ is $\code$.
-$\textit{code}$ encodes a value into a bitcode by converting
+This function encodes a value into a bitcode by converting
 $\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-empty
 star iteration by $S$. The border where a star iteration
 terminates is marked by $Z$.
 This coding is lossy, as it throws away the information about
 characters, and does not encode the ``boundary'' between two
 sequence values. Moreover, with only the bitcode we cannot even tell
-whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$.
+whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$,
+but this swill not be necessary in our proof.
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
 $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\
 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\;
 code(\Stars\,vs)$
 \end{tabular}
 \end{center}
 \noindent
-Assume we have a value $v$ and regular expression $r$
+We can prove that given a value $v$ and regular expression $r$
 with $\vdash v:r$,
 then we have the property that $\decode$ and $\code$ are
 reverse operations of one another:
 \begin{lemma}
 \[If \vdash v : r \; then \;\decode \; (\code \; v) \; r = \textit{Some}(v) \]
 \[\vdash v : r \implies \decode' \; ((\code \; v) @ ds) \; r = (v, ds) \]
 Then setting $ds$ to be $[]$ and unfolding $\decode$ definition,
 we obtain the property.
 \end{proof}
 \noindent
-Now we give out the central part of this lexing algorithm,
+Now we define the central part of Sulzmann and Lu's lexing algorithm,
-the $\bder$ function (stands for \emph{b}itcoded-derivative):
+the $\bder$ function (which stands for \emph{b}itcoded-derivative):
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
 $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
 $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot
 (_{[]}r^*))$
 \end{tabular}
 \end{center}
 \noindent
-For most time we use the infix notation $(\_\backslash\_)$
+The $\bder$ function tells us how regular expressions can be recursively traversed,
-to mean $\bder$ for brevity when
-there is no danger of confusion with derivatives on plain regular expressions.
-For example, we write $( _{[]}r^* ) \backslash c$ instead of $\bder \;c \; _{[]}r^*$,
-as the bitcodes at the front of $r^*$ indicates that it is
-a bit-coded regular expression, not a plain one.\\
-$\bder$ tells us how regular expressions can be recursively traversed,
 where the bitcodes are augmented and carried around
 when a derivative is taken.
 We give the intuition behind some of the more involved cases in
 $\bder$. \\
 For example,
 in the \emph{star} case,
-a derivative on $_{bs}a^*$ means
+a derivative of $_{bs}a^*$ means
-that one more star iteratoin needs to be taken.
+that one more star iteration needs to be taken.
-we need to unfold it into a sequence,
+We therefore need to unfold it into a sequence,
-and attach an additional bit $Z$ to the front of $r \backslash c$
+and attach an additional bit $Z$ to the front of $a \backslash c$
 as a record to indicate one new star iteration is unfolded.
 \noindent
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 \end{tabular}
 \end{center}
 \noindent
 This information will be recovered later by the $\decode$ function.
-The intuition is that the bit $Z$ will be decoded at the right location,
+%The intuition is that the bit $Z$ will be decoded at the right location,
-because we accumulate bits from left to right (a rigorous proof will be given
+%because we accumulate bits from left to right (a rigorous proof will be given
-later).
+%later).
 %\begin{tikzpicture}[ > = stealth, % arrow head style
 %        shorten > = 1pt, % don't touch arrow head to node
 %        semithick % line style
 %    ]
 % 	\end{scope}
 %\end{tikzpicture}
 \noindent
 Another place the $\bder$ function differs
-from derivatives on plain regular expressions
+from derivatives on regular expressions
 is the sequence case:
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$
 \end{tabular}
 \end{center}
 \noindent
-The difference is that (when $a_1$ is $\bnullable$)
+%We assume that $\bmkeps$
+%correctly extracts the bitcode for how $a_1$
+%matches the string prior to $c$ (more on this later).
+%The bitsequence $\textit{bs}$ which was initially
+%attached to the first element of the sequence
+%$a_1 \cdot a_2$, has now been
+%elevated to the top level of the $\sum$ constructor.
+%This is because this piece of information will be needed
+%whichever way the sequence is matched,
+%regardless of whether $c$ belongs to $a_1$ or $a_2$.
+The difference mainly lies in the $\textit{if}$ branch (when $a_1$ is $\bnullable$):
 we use $\bmkeps$ to store the lexing information
 in $a_1$ before collapsing
 it (as it has been fully matched by string prior to $c$),
 and attach the collected bit-codes to the front of $a_2$
-before throwing away $a_1$. We assume that $\bmkeps$
+before throwing away $a_1$.
-correctly extracts the bitcode for how $a_1$
-matches the string prior to $c$ (more on this later).
-The bitsequence $\textit{bs}$ which was initially
-attached to the first element of the sequence
-$a_1 \cdot a_2$, has now been
-elevated to the top level of the $\sum$ constructor.
-This is because this piece of information will be needed
-whichever way the sequence is matched,
-regardless of whether $c$ belongs to $a_1$ or $a_2$.
 In the injection-based lexer, $r_1$ is immediately thrown away in
-subsequent derivatives on the right branch (when $r_1$ is $\nullable$),
+in the $\textit{if}$ branch,
-depite $r_1$ potentially storing information of previous matches:
+the information $r_1$ stores is therefore lost:
 \begin{center}
-	$(r_1 \cdot r_2 )\backslash c = (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$
+	\begin{tabular}{lcl}
+$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
+	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+	&   & $\mathit{else}\; \ldots$\\
+	\end{tabular}
 \end{center}
 \noindent
-this loss of information makes it necessary
-to store on stack all the regular expressions'
+%\noindent
-``snapshots'' before they were
+%this loss of information makes it necessary
-taken derivative of.
+%to store on stack all the regular expressions'
-So that the related information will be available
+%``snapshots'' before they were
-once the child recursive
+%taken derivative of.
-call finishes.\\
+%So that the related information will be available
+%once the child recursive
+%call finishes.\\
 The rest of the clauses of $\bder$ is rather similar to
 $\der$. \\
 Generalising the derivative operation with bitcodes to strings, we have
 \begin{center}
 	\begin{tabular}{@{}lcl@{}}
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\;\textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
+\noindent
-\noindent
+This function first attaches bitcodes to a
-$\blexer$ first attaches bitcodes to a
+plain regular expression $r$, and then builds successive derivatives
-plain regular expression, and then do successive derivatives
 with respect to the input string $s$, and
-then test whether the result is nullable.
+then test whether the result is (b)nullable.
-If yes, then extract the bitcodes out of the nullable expression,
+If yes, then extract the bitcodes from the nullable expression,
 and decodes the bitcodes into a lexical value.
 If there does not exists a match between $r$ and $s$ the lexer
-outputs $\None$ indicating a failed lex.\\
+outputs $\None$ indicating a mismatch.\\
 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
 \begin{property}
 $\blexer \;r \; s = \lexer \; r \; s$.
 \end{property}
 \noindent
 This was claimed but not formalised in Sulzmann and Lu's work.
 We introduce the proof later, after we give all
 the needed auxiliary functions and definitions.
 \subsection{An Example $\blexer$ Run}
-Before introducing the proof we shall first walk the reader
+Before introducing the proof we shall first walk
 through a concrete example of our $\blexer$ calculating the right
 lexical information through bit-coded regular expressions.\\
 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$.
 We give again the bird's eye view of this particular example
 in each stage of the algorithm:
 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$
 turned into $ZS$ after derivative w.r.t $b$
 is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$
 before processing $_Zb+_Sc$.\\
 The annotated regular expressions
-would look too cumbersome if we explicitly indicate all the
+would look overwhelming if we explicitly indicate all the
 locations where bitcodes are attached.
 For example,
 $(aa)^*\cdot (b+c)$ would
 look like $_{[]}(_{[]}(_{[]}a \cdot _{[]}a)^*\cdot _{[]}(_{[]}b+_{[]}c))$
 after
 internalise.
 Therefore for readability we omit bitcodes if they are empty.
-This applies to all example annotated
+This applies to all annotated
 regular expressions in this thesis.\\
 %and assume we have just read the first character $a$:
 %\begin{center}
 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
 %    \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},]
 	\item
 		The bit-encodings work properly,
 		allowing the possibility of
 		pulling out the right lexical value from an
 		annotated regular expression at
-		any stage of the algorithm.
+		any stage, using $\bmkeps$ or $\retrieve$ (details in
+		lemmas \ref{bmkepsRetrieve} and \ref{blexer_retrieve}).
 \end{itemize}
 We will elaborate on this, with the help of
 some helper functions such as $\retrieve$ and
 $\flex$.
 \subsection{Specifications of Some Helper Functions}
 using $\lexer$ or $\blexer$ alone.
 \subsubsection{$\textit{Retrieve}$}
 The first function we shall introduce is $\retrieve$.
 Sulzmann and Lu gave its definition, and
 Ausaf and Urban found
-its usage in mechanised proofs.
+useful in their mechanised proofs.
 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$,
-after all the derivatives has been taken:
+after all the derivatives have been taken:
 \begin{center}
 \begin{tabular}{lcl}
 	& & $\ldots$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\ldots$
 \end{tabular}
 \end{center}
 \noindent
-$\bmkeps$ retrieves the value $v$'s
+The function $\bmkeps$ retrieves the value $v$'s
 information in the format
 of bitcodes, by travelling along the
 path of the regular expression that corresponds to a POSIX match,
 collecting all the bitcodes.
 We know that this "retrieved" bitcode leads to the correct value after decoding,
-which is $v_0$ in the injection-based lexing diagram.
+which is $v_0$ in the injection-based lexing diagram:
-As an observation we pointed at the beginning of this section,
-the annotated regular expressions generated in successive derivative steps
-in $\blexer$ after $\erase$ has the same structure
-as those appeared in $\lexer$.
-We redraw the diagram below to visualise this fact.
-We pretend that all the values are
-ready despite they are only calculated in $\lexer$.
-In general we have $\vdash v_i:(a_i)_\downarrow$.
 \vspace{20mm}
 \begin{figure}[H]%\label{graph:injLexer}
 \begin{center}
 \begin{tikzcd}[
 	every matrix/.append style = {name=p},
 \node[E = (p-1-1) (p-2-1)] {};
 \node[E = (p-1-4) (p-2-4)] {};
 \node[E = (p-1-2) (p-2-2)] {};
 \node[E = (p-1-3) (p-2-3)] {};
 \end{tikzpicture}
+\caption{Injection-based lexing diagram,
+with matching value and regular expression pairs
+encircled}\label{Inj2}
 \end{figure}
 \vspace{20mm}
 \noindent
-We encircle in the diagram  all the pairs $v_i, a_i$ to show that these values
+We have that $\vdash v_i:(a_i)_\downarrow$,
+where $v_i$ are the intermediate values generated step by step in $\lexer$.
+These values are not explicitly created in $\blexer$.
+We encircled in the diagram  all the pairs $v_i, a_i$ to show that these values
 and regular expressions share the same structure.
 These pairs all contain adequate information to
 obtain the final lexing result.
 For example, in the leftmost pair the
 lexical information is condensed in
 $v_0$, whereas for the rightmost pair,
 the lexing result hides is in the bitcodes of $a_n$.\\
-$\bmkeps$ is able to extract from $a_n$ the result
+The function $\bmkeps$ is able to extract from $a_n$ the result
 by looking for nullable parts of the regular expression.
 However for regular expressions $a_i$ in general,
 they might not necessarily be nullable.
 For those regular expressions that are not nullable,
 $\bmkeps$ will not work.
 		$f \; a_i\;v_i = f \; a_n \; v_n = \bmkeps \; a_n$%\decode \; (\bmkeps \; a_n) \; (\erase \; a_0)$
 	\item
 		$f \; a_i\;v_i = f \; a_0 \; v_0 = \code \; v_0$ %\decode \;(\code \; v_0) \; (\erase \; a_0)$
 \end{itemize}
 \noindent
-Sulzmann and Lu came up with a function satisfying
+Sulzmann and Lu came up with the following definition for $f$ satisfying
 the above
-requirements, named \emph{retrieve}:
+requirements, and named it \emph{retrieve}:
+\vspace{-7mm}
 \begin{center}
 \begin{tabular}{llcl}
 	$\retrieve \; \, _{bs} \ONE$ & $   \Empty$ & $\dn$ & $\textit{bs}$\\
 	$\retrieve \; \, _{bs} \mathbf{c}$ & $ (\Char \; c) $ & $\dn$ & $ \textit{bs}$\\
 	$\retrieve \; \, _{bs} a^* $ & $  (\Stars \; []) $ & $\dn$ & $\textit{bs} \; @ \; [S]$
 \end{tabular}
 \end{center}
 \noindent
-As promised, $\retrieve$ collects the right bit-codes from the
+As stated, $\retrieve$ collects the right bit-codes from the
-final derivative $a_n$, guided by $v_n$:
+final derivative $a_n$, guided by $v_n$. This can be proved
+as follows:
 \begin{lemma}\label{bmkepsRetrieve}
 	$\bnullable \; a \implies \bmkeps \; a = \retrieve \; a \; (\mkeps \; (\erase \; a))$
 \end{lemma}
 \begin{proof}
 	By a routine induction on $a$.
 \end{proof}
 \noindent
-The design of $\retrieve$ enables us to extract bitcodes
+%The design of $\retrieve$ enables us to extract bitcodes
-from both annotated regular expressions and values.
+%from both annotated regular expressions and values.
-$\retrieve$ always ``sucks up'' all the information.
+%$\retrieve$ always ``sucks up'' all the information.
 When more information is stored in the value, we would be able to
 extract more from the value, and vice versa.
 For example in star iterations, $\retrieve$ will be able to extract from $\Stars \; vs$
 exactly the same bitcodes as $\code \; (\Stars \; vs)$:
-\begin{lemma}
+\begin{lemma}\label{retrieveEncodeSTARS}
 If $\forall v \in vs. \vdash v : r$, and  $\code \; v = \retrieve \; (\internalise\; r) \; v$\\
 then $\code \; (\Stars \; vs) = \retrieve \; _{[]}((\internalise \;  r)^*) \; (\Stars \; vs)$
-\end{lemma}\label{retrieveEncodeSTARS}
+\end{lemma}
 \begin{proof}
 	By induction on the value list $vs$.
 \end{proof}
 \noindent
 Similarly, when the value list does not hold information,
 only the bitcodes plus some delimiter will be recorded:
 \begin{center}
 $\retrieve \; _{bs}((\internalise \;  r)^*) \; (\Stars \; [] )  = bs @ [S]$.
 \end{center}
-In general, if we have a regular expression just internalised
+In general, if we have a regular expression that is just internalised
-and the lexing result value, we could $\retrieve$ the bitcdoes
+and the final lexing result value, we can $\retrieve$ the bitcodes
 that look as if we have en$\code$-ed the value as bitcodes:
 \begin{lemma}
 	$\vdash v : r \implies \retrieve \; (r)^\uparrow \; v = \code \; v$
 \end{lemma}
 \begin{proof}
 	By induction on $r$.
 	The star case uses lemma \ref{retrieveEncodeSTARS}.
 \end{proof}
 \noindent
-The following property is more interesting, as
+The following property of $\retrieve$ is crucial, as
 it provides a "bridge" between $a_0$ and $a_n $ in the
 lexing diagram by linking adjacent regular expressions $a_i$ and
 $a_{i+1}$.
 The property says that one
 can retrieve the same bits from a derivative
 regular expression as those from
 before the derivative took place,
-provided that the right values are used respectively:
+provided that the corresponding values are used respectively:
 \begin{lemma}\label{retrieveStepwise}
 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
 \end{lemma}
 \begin{proof}
-	By induction on $r$, where $v$ is allowed to be arbitrary.
+	By induction on $r$, generalising over $v$.
-	The induction principle is function $\erase$'s cases.
+	The induction principle in our formalisation
+	is function $\erase$'s cases.
 \end{proof}
 \noindent
 Before we move on to the next
 helper function, we rewrite $\blexer$ in
 the following way which makes later proofs more convenient:
 \end{lemma}
 \begin{proof}
 	Using lemma \ref{bmkepsRetrieve}.
 \end{proof}
 \noindent
-$\retrieve$ allows connecting
+The function $\retrieve$ allows connecting
 between the intermediate
 results $a_i$ and $a_{i+1}$ in $\blexer$.
 For plain regular expressions something similar is required.
 \subsection{$\flex$}
-Ausaf and Urban cleverly defined an auxiliary function called $\flex$ for $\lexer$,
+Ausaf and Urban defined an auxiliary function called $\flex$ for $\lexer$,
 defined as
 \begin{center}
 \begin{tabular}{lcl}
 $\flex \; r \; f \; [] \; v$       &  $=$ &   $f\; v$\\
 $\flex \; r \; f \; (c :: s) \; v$ &  $=$ &   $\flex \; r \; (\lambda v. \, f (\inj \; r\; c\; v)) \;\; s \; v$
 \end{tabular}
 \end{center}
 which accumulates the characters that need to be injected back,
 and does the injection in a stack-like manner (last taken derivative first injected).
+The function
 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
 $r$, the identity function $id$, the input string $s$ and the value
 $v_n= \mkeps \; (r\backslash s)$:
 \begin{lemma}
 	$\flex \;\; r \;\; \textit{id}\;\; s \;\; (\mkeps \;(r\backslash s)) = \lexer \; r \; s$
 \begin{proof}
 	By reverse induction on $s$.
 \end{proof}
 \noindent
 The main advantage of using $\flex$
-in addition to $\lexer$ is that
+in $\lexer$ is that
 $\flex$ makes the value $v$ and function $f$
 that manipulates the value  explicit parameters,
 which ``exposes'' $\lexer$'s recursive call
 arguments and allows us to ``set breakpoints'' and ``resume''
-at any point during $\lexer$'s recursive calls.\\
+at any point during $\lexer$'s recursive calls. Therefore
-The following stepwise property holds.
+the following stepwise property holds.
 \begin{lemma}\label{flexStepwise}
 	$\textit{flex} \;\; r \;\; f \;\; (s@[c]) \;\; v = \flex \;\; r \;\; f  \;\; s \;\; (\inj \; (r\backslash s) \; c \; v) $
 \end{lemma}
 \begin{proof}
 	By induction on the shape of $r\backslash s$.
 \end{proof}
 \noindent
 With $\flex$ and $\retrieve$,
 we are ready to connect $\lexer$ and $\blexer$,
-giving the correctness proof.
+giving the correctness proof for $\blexer$.
 %----------------------------------------------------------------------------------------
 %	SECTION  correctness proof
 %----------------------------------------------------------------------------------------
-\section{Correctness of Bit-coded Algorithm (Without Simplification)}
+\section{Correctness of the  Bit-coded Algorithm (Without Simplification)}
-$\flex$ and $\blexer$ essentially calculates the same thing.
+We can first show that
+$\flex$ and $\blexer$ calculates the same thing.
 \begin{lemma}\label{flex_retrieve}
 	If $\vdash v: (r\backslash s)$, then\\
 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
 \end{lemma}
 \begin{proof}
 \noindent
 To piece things together and spell out the correctness
 result of the bitcoded lexer more explicitly,
 we use the fact from the previous chapter that
 \[
-	 (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s = v
+	 (r, s) \rightarrow v \;\; \textit{iff} \;\; \lexer \; r \; s =\Some \;  v
+	 \quad \quad
+	 \nexists v. (r, s) \rightarrow v\;\; \textit{iff} \;\; \lexer \;r\;s = None
 \]
 to obtain this corollary:
 \begin{corollary}\label{blexerPOSIX}
-	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = v$
+	The $\blexer$ function correctly implements a POSIX lexer, namely\\
+	$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \; s = \Some \; v$\\
+	$\nexists v. (r, s) \rightarrow v\;\; \textit{iff} \;\; \blexer \;r\;s = None$
 \end{corollary}
-Our main reason for wanting a bit-coded algorithm over
+Our main reason for analysing the bit-coded algorithm over
-the injection-based one is for its capabilities of allowing
+the injection-based one is that it allows us to define
 more aggressive simplifications.
 We will elaborate on this in the next chapter.

changeset 624	8ffa28fce271
parent 622	4b1149fb5aec
child 638	dd9dde2d902b