% Chapter Template% Main chapter title\chapter{Correctness of Bit-coded Algorithm without Simplification}\label{Bitcoded1} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}%Then we illustrate how the algorithm without bitcodes falls short for such aggressive %simplifications and therefore introduce our version of the bitcoded algorithm and %its correctness proof in %Chapter 3\ref{Chapter3}. \section{Bit-coded Algorithm}The lexer algorithm in Chapter \ref{Inj}, as shown in \ref{InjFigure},stores information of previous lexing stepson a stack, in the form of regular expressionsand characters: $r_0$, $c_0$, $r_1$, $c_1$, etc.\begin{envForCaption}\begin{ceqn}\begin{equation}%\label{graph:injLexer}\begin{tikzcd}r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] \end{tikzcd}\end{equation}\end{ceqn}\caption{Injection-based Lexing from Chapter\ref{Inj}}\label{InjFigure}\end{envForCaption}\noindentThis is both inefficient and prone to stack overflow.A natural question arises as to whether we can store lexinginformation on the fly, while still using regular expression derivatives?In a lexing algorithm's run, split by the current input position,we have a sub-string that has been consumed,and the sub-string that has yet to come.We already know what was before, and this should be reflected in the value and the regular expression at that step as well. But this is not the case for injection-based regular expression derivatives.Take the regex $(aa)^* \cdot bc$ matching the string $aabc$as an example, if we have just read the two former characters $aa$:\begin{center}\begin{envForCaption}\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] {Consumed: $aa$ \nodepart{two} Not Yet Reached: $bc$ };%\caption{term 1 \ref{term:1}'s matching configuration}\end{tikzpicture}\caption{Partially matched String} \end{envForCaption}\end{center}%\caption{Input String}\label{StringPartial}%\end{figure}\noindentWe have the value that has already been partially calculated,and the part that has yet to come:\begin{center}\begin{envForCaption}\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] {$\Seq(\Stars[\Char(a), \Char(a)], ???)$ \nodepart{two} $\Seq(\ldots, \Seq(\Char(b), \Char(c)))$};%\caption{term 1 \ref{term:1}'s matching configuration}\end{tikzpicture}\caption{Partially constructed Value} \end{envForCaption}\end{center}In the regex derivative part , (after simplification)all we have is just what is about to come:\begin{center}\begin{envForCaption}\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={white!30,blue!20},] {$???$ \nodepart{two} To Come: $b c$};%\caption{term 1 \ref{term:1}'s matching configuration}\end{tikzpicture}\caption{Derivative} \end{envForCaption}\end{center}\noindentThe previous part is missing.How about keeping the partially constructed value attached to the front of the regular expression?\begin{center}\begin{envForCaption}\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] {$\Seq(\Stars[\Char(a), \Char(a)], \ldots)$ \nodepart{two} To Come: $b c$};%\caption{term 1 \ref{term:1}'s matching configuration}\end{tikzpicture}\caption{Derivative} \end{envForCaption}\end{center}\noindentIf we do this kind of "attachment"and each time augment the attached partiallyconstructed value when taking off a character:\begin{center}\begin{envForCaption}\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \ldots))$ \nodepart{two} To Come: $c$};\end{tikzpicture}\\\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}] \node [rectangle split, rectangle split horizontal, rectangle split parts=2, rectangle split part fill={red!30,blue!20},] {$\Seq(\Stars[\Char(a), \Char(a)], \Seq(\Char(b), \Char(c)))$ \nodepart{two} EOF};\end{tikzpicture}\caption{After $\backslash b$ and $\backslash c$} \end{envForCaption}\end{center}\noindentIn the end we could recover the value without a backward phase.But (partial) values are a bit clumsy to stick together with a regular expression, so we instead use bit-codes to encode them.Bits and bitcodes (lists of bits) are defined as:\begin{envForCaption}\begin{center} $b ::= S \mid Z \qquadbs ::= [] \mid b::bs $\end{center}\caption{Bit-codes datatype}\end{envForCaption}\noindentThe $S$ and $Z$ are not in bold in order to avoid confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (orbit-lists) can be used to encode values (or potentially incomplete values) in acompact form. This can be straightforwardly seen in the followingcoding function from values to bitcodes: \begin{envForCaption}\begin{center}\begin{tabular}{lcl} $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ $\textit{code}(\Left\,v)$ & $\dn$ & $Z :: code(v)$\\ $\textit{code}(\Right\,v)$ & $\dn$ & $S :: code(v)$\\ $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ $\textit{code}(\Stars\,[])$ & $\dn$ & $[Z]$\\ $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $S :: code(v) \;@\; code(\Stars\,vs)$\end{tabular} \end{center} \caption{Coding Function for Values}\end{envForCaption}\noindentHere $\textit{code}$ encodes a value into a bit-code by converting$\Left$ into $Z$, $\Right$ into $S$, and marks the start of any non-emptystar iteration by $S$. The border where a local star terminatesis marked by $Z$. This coding is lossy, as it throws away the information aboutcharacters, and also does not encode the ``boundary'' between twosequence values. Moreover, with only the bitcode we cannot even tellwhether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$. Thereason for choosing this compact way of storing information is that therelatively small size of bits can be easily manipulated and ``movedaround'' in a regular expression. We define the reverse operation of $\code$, which is $\decode$.As expected, $\decode$ not only requires the bit-codes,but also a regular expression to guide the decoding and fill the gaps of characters:%\begin{definition}[Bitdecoding of Values]\mbox{}\begin{envForCaption}\begin{center}\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ $\textit{decode}'\,(Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; (\Left\,v, bs_1)$\\ $\textit{decode}'\,(S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; (\Right\,v, bs_1)$\\ $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ $\textit{decode}'\,(Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ $\textit{decode}'\,(S\!::\!bs)\,(r^*)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ $\textit{decode}\,bs\,r$ & $\dn$ & $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; \textit{else}\;\textit{None}$ \end{tabular} \end{center}\caption{Bit-decoding of values} \end{envForCaption}%\end{definition}Sulzmann and Lu's integrated the bitcodes into regular expressions tocreate annotated regular expressions \cite{Sulzmann2014}.\emph{Annotated regular expressions} are defined by the followinggrammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}\begin{center}\begin{tabular}{lcl} $\textit{a}$ & $::=$ & $\ZERO$\\ & $\mid$ & $_{bs}\ONE$\\ & $\mid$ & $_{bs}{\bf c}$\\ & $\mid$ & $_{bs}\sum\,as$\\ & $\mid$ & $_{bs}a_1\cdot a_2$\\ & $\mid$ & $_{bs}a^*$\end{tabular} \end{center} %(in \textit{ALTS})\noindentwhere $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regularexpressions and $as$ for a list of annotated regular expressions.The alternative constructor($\sum$) has been generalized to accept a list of annotated regular expressions rather than just 2.We will show that these bitcodes encode information aboutthe (POSIX) value that should be generated by the Sulzmann and Lualgorithm.To do lexing using annotated regular expressions, we shall firsttransform the usual (un-annotated) regular expressions into annotatedregular expressions. This operation is called \emph{internalisation} anddefined as follows:%\begin{definition}\begin{center}\begin{tabular}{lcl} $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ $(r_1 + r_2)^\uparrow$ & $\dn$ & $_{[]}\sum[\textit{fuse}\,[Z]\,r_1^\uparrow,\, \textit{fuse}\,[S]\,r_2^\uparrow]$\\ $(r_1\cdot r_2)^\uparrow$ & $\dn$ & $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ $(r^*)^\uparrow$ & $\dn$ & $_{[]}(r^\uparrow)^*$\\\end{tabular} \end{center} %\end{definition}\noindentWe use up arrows here to indicate that the basic un-annotated regularexpressions are ``lifted up'' into something slightly more complex. In thefourth clause, $\textit{fuse}$ is an auxiliary function that helps toattach bits to the front of an annotated regular expression. Itsdefinition is as follows:\begin{center}\begin{tabular}{lcl} $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & $_{bs @ bs'}\ONE$\\ $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & $_{bs@bs'}{\bf c}$\\ $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & $_{bs@bs'}\sum\textit{as}$\\ $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & $_{bs@bs'}a_1 \cdot a_2$\\ $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & $_{bs @ bs'}a^*$\end{tabular} \end{center} \noindentAfter internalising the regular expression, we perform successivederivative operations on the annotated regular expressions. Thisderivative operation is the same as what we had previously for thebasic regular expressions, except that we beed to take care ofthe bitcodes:\iffalse %\begin{definition}{bder}\begin{center} \begin{tabular}{@{}lcl@{}} $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & $\textit{if}\;c=d\; \;\textit{then}\; \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a_1$\\ & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, (\textit{STAR}\,[]\,r)$\end{tabular} \end{center} %\end{definition}\begin{center} \begin{tabular}{@{}lcl@{}} $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & $\textit{if}\;c=d\; \;\textit{then}\; _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a_1$\\ & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & $_{bs}\textit{SEQ}\;(\textit{fuse}\, [Z] \; r\,\backslash c )\, (_{bs}\textit{STAR}\,[]\,r)$\end{tabular} \end{center} %\end{definition}\fi\begin{center} \begin{tabular}{@{}lcl@{}} $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & $\textit{if}\;c=d\; \;\textit{then}\; _{bs}\ONE\;\textit{else}\;\ZERO$\\ $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & $_{bs}\sum\;(\textit{map} (\_\backslash c) as )$\\ $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a_1$\\ & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ $(_{bs}a^*)\,\backslash c$ & $\dn$ & $_{bs}(\textit{fuse}\, [Z] \; r\,\backslash c)\cdot (_{[]}r^*))$\end{tabular} \end{center} %\end{definition}\noindentFor instance, when we do derivative of $_{bs}a^*$ with respect to c,we need to unfold it into a sequence,and attach an additional bit $Z$ to the front of $r \backslash c$to indicate one more star iteration. Also the sequence clauseis more subtle---when $a_1$ is $\textit{bnullable}$ (here\textit{bnullable} is exactly the same as $\textit{nullable}$, exceptthat it is for annotated regular expressions, therefore we omit thedefinition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how$a_1$ matches the string prior to character $c$ (more on this later),then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \; a_1 (a_2\backslash c)$ will collapse the regular expression $a_1$(as it hasalready been fully matched) and store the parsing information at thehead of the regular expression $a_2 \backslash c$ by fusing to it. Thebitsequence $\textit{bs}$, which was initially attached to thefirst element of the sequence $a_1 \cdot a_2$, hasnow been elevated to the top-level of $\sum$, as this information will beneeded whichever way the sequence is matched---no matter whether $c$ belongsto $a_1$ or $ a_2$. After building these derivatives and maintaining allthe lexing information, we complete the lexing by collecting thebitcodes using a generalised version of the $\textit{mkeps}$ functionfor annotated regular expressions, called $\textit{bmkeps}$:%\begin{definition}[\textit{bmkeps}]\mbox{}\begin{center}\begin{tabular}{lcl} $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a$\\ & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\ $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & $bs \,@\, [Z]$\end{tabular} \end{center} %\end{definition}\noindentThis function completes the value information by travelling along thepath of the regular expression that corresponds to a POSIX value andcollecting all the bitcodes, and using $S$ to indicate the end of stariterations. If we take the bitcodes produced by $\textit{bmkeps}$ anddecode them, we get the value we expect. The corresponding lexingalgorithm looks as follows:\begin{center}\begin{tabular}{lcl} $\textit{blexer}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ & & $\;\;\textit{else}\;\textit{None}$\end{tabular}\end{center}\noindentIn this definition $\_\backslash s$ is the generalisation of the derivativeoperation from characters to strings (just like the derivatives for un-annotatedregular expressions).Now we introduce the simplifications, which is why we introduce the bitcodes in the first place.\subsection*{Simplification Rules}This section introduces aggressive (in terms of size) simplification ruleson annotated regular expressionsto keep derivatives small. Such simplifications are promisingas we havegenerated test data that showthat a good tight bound can be achieved. We could onlypartially cover the search space as there are infinitely many regularexpressions and strings. One modification we introduced is to allow a list of annotated regularexpressions in the $\sum$ constructor. This allows us to not justdelete unnecessary $\ZERO$s and $\ONE$s from regular expressions, butalso unnecessary ``copies'' of regular expressions (very similar tosimplifying $r + r$ to just $r$, but in a more general setting). Anothermodification is that we use simplification rules inspired by Antimirov'swork on partial derivatives. They maintain the idea that only the first``copy'' of a regular expression in an alternative contributes to thecalculation of a POSIX value. All subsequent copies can be pruned away fromthe regular expression. A recursive definition of our simplification function that looks somewhat similar to our Scala code is given below:%\comment{Use $\ZERO$, $\ONE$ and so on. %Is it $ALTS$ or $ALTS$?}\\\begin{center} \begin{tabular}{@{}lcl@{}} $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\ &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ \end{tabular} \end{center} \noindentThe simplification does a pattern matching on the regular expression.When it detected that the regular expression is an alternative orsequence, it will try to simplify its child regular expressionsrecursively and then see if one of the children turns into $\ZERO$ or$\ONE$, which might trigger further simplification at the current level.The most involved part is the $\sum$ clause, where we use twoauxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nestedalternatives and reduce as many duplicates as possible. Function$\textit{distinct}$ keeps the first occurring copy only and removes all later oneswhen detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.Its recursive definition is given below: \begin{center} \begin{tabular}{@{}lcl@{}} $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) \end{tabular} \end{center} \noindentHere $\textit{flatten}$ behaves like the traditional functional programming flattenfunction, except that it also removes $\ZERO$s. Or in terms of regular expressions, itremoves parentheses, for example changing $a+(b+c)$ into $a+b+c$.Having defined the $\simp$ function,we can use the previous notation of naturalextension from derivative w.r.t.~character to derivativew.r.t.~string:%\comment{simp in the [] case?}\begin{center}\begin{tabular}{lcl}$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\$r \backslash_{simp} [\,] $ & $\dn$ & $r$\end{tabular}\end{center}\noindentto obtain an optimised version of the algorithm: \begin{center}\begin{tabular}{lcl} $\textit{blexer\_simp}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ & & $\;\;\textit{else}\;\textit{None}$\end{tabular}\end{center}\noindentThis algorithm keeps the regular expression size small, for example,with this simplification our previous $(a + aa)^*$ example's 8000 nodeswill be reduced to just 17 and stays constant, no matter how long theinput string is.%-----------------------------------% SUBSECTION 1%-----------------------------------\section{Specifications of Some Helper Functions}Here we give some functions' definitions, which we will use later.\begin{center}\begin{tabular}{ccc}$\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}$\end{tabular}\end{center}%----------------------------------------------------------------------------------------% SECTION correctness proof%----------------------------------------------------------------------------------------\section{Correctness of Bit-coded Algorithm (Without Simplification)}We now give the proof the correctness of the algorithm with bit-codes.Ausaf and Urban cleverly defined an auxiliary function called $\flex$,defined as\begin{center}\begin{tabular}{lcr}$\flex \; r \; f \; [] \; v$ & $=$ & $f\; v$\\$\flex \; r \; f \; c :: s \; v$ & $=$ & $\flex \; r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v$\end{tabular}\end{center}which accumulates the characters that needs to be injected back, and does the injection in a stack-like manner (last taken derivative first injected).$\flex$ is connected to the $\lexer$:\begin{lemma}$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$\end{lemma}$\flex$ provides us a bridge between $\lexer$ and $\blexer$.What is even better about $\flex$ is that it allows us to directly operate on the value $\mkeps (r\backslash v)$,which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument.When the value created by $\mkeps$ becomes available, one can prove some stepwise properties of lexing nicely:\begin{lemma}\label{flexStepwise}$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $\end{lemma}And for $\blexer$ we have a function with stepwise properties like $\flex$ as well,called $\retrieve$\ref{retrieveDef}.$\retrieve$ takes bit-codes from annotated regular expressionsguided by a value.$\retrieve$ is connected to the $\blexer$ in the following way:\begin{lemma}\label{blexer_retrieve}$\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$\end{lemma}If you take derivative of an annotated regular expression, you can $\retrieve$ the same bit-codes as before the derivative took place,provided that you use the corresponding value:\begin{lemma}\label{retrieveStepwise}$\retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$\end{lemma}The other good thing about $\retrieve$ is that it can be connected to $\flex$:%centralLemma1\begin{lemma}\label{flex_retrieve}$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$\end{lemma}\begin{proof}By induction on $s$. The induction tactic is reverse induction on strings.$v$ is allowed to be arbitrary.The crucial point is to rewrite \[\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) \]as\[\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c]))\].This enables us to equate \[\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) \] with \[\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))\],which in turn can be rewritten as\[\flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c]))\].\end{proof}With the above lemma we can now link $\flex$ and $\blexer$.\begin{lemma}\label{flex_blexer}$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$\end{lemma}\begin{proof}Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.\end{proof}Finally