% Chapter Template% Main chapter title\chapter{Correctness of Bit-coded Algorithm with Simplification}\label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}%Then we illustrate how the algorithm without bitcodes falls short for such aggressive %simplifications and therefore introduce our version of the bitcoded algorithm and %its correctness proof in %Chapter 3\ref{Chapter3}. In this chapter we introduce simplificationsfor annotated regular expressions that can be applied to each intermediate derivative result. This allowsus to make $\blexer$ much more efficient.Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,but their simplification functions were inefficient and in some cases needs fixing.%We contrast our simplification function %with Sulzmann and Lu's, indicating the simplicity of our algorithm.%This is another case for the usefulness %and reliability of formal proofs on algorithms.%These ``aggressive'' simplifications would not be possible in the injection-based %lexing we introduced in chapter \ref{Inj}.%We then prove the correctness with the improved version of %$\blexer$, called $\blexersimp$, by establishing %$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.%\section{Simplifications by Sulzmann and Lu}Consider the derivatives of the following example $(a^*a^*)^*$:%and $(a^* + (aa)^*)^*$:\begin{center} \begin{tabular}{lcl} $(a^*a^*)^*$ & $ \stackrel{\backslash a}{\longrightarrow}$ & $ (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ & $ \stackrel{\backslash a}{\longrightarrow} $ & $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ & $\stackrel{\backslash a}{ \longrightarrow} $ & $\ldots$\\ \end{tabular}\end{center}\noindentAs can be seen, there are serveral duplications.A simple-minded simplification function cannot simplifythe third regular expression in the above chain of derivativeregular expressions, namely\begin{center}$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\end{center}because the duplicates arenot next to each other and therefore the rule$r+ r \rightarrow r$ does not fire.One would expect a better simplification function to work in the following way:\begin{gather*} ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ \bigg\downarrow (1) \\ (a^*a^* + a^* \color{gray} + a^* \color{black})\cdot(a^*a^*)^* + \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\ \bigg\downarrow (2) \\ (a^*a^* + a^* )\cdot(a^*a^*)^* \color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\ \bigg\downarrow (3) \\ (a^*a^* + a^* )\cdot(a^*a^*)^* \end{gather*}\noindentIn the first step, the nested alternative regular expression$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.Now the third term $a^*$ can clearly be identified as a duplicateand therefore removed in the second step. This causes the twotop-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ removed in the final step.Sulzmann and Lu's simplification function (using our notations) can achieve thissimplification:\begin{center} \begin{tabular}{lcl} $\textit{simp}\_{SL} \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & $\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\ & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\ $\textit{simp}\_{SL} \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} \; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\; \textit{then} \;\; \ZERO$\\ & & $\textit{else}\;\;_{bs}((\textit{simp}\_{SL} \;r_1)\cdot (\textit{simp}\_{SL} \; r_2))$\\ $\textit{simp}\_{SL} \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\ $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ $\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\ $\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum (\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ \end{tabular}\end{center}\noindentThe $\textit{zeroable}$ predicate tests whether the regular expressionis equivalent to $\ZERO$, andcan be defined as:\begin{center} \begin{tabular}{lcl} $\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\; \zeroable \;_{[]}\sum\;rs $\\ $\zeroable\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\zeroable\; r_1 \;\; \lor \;\; \zeroable \; r_2$\\ $\zeroable\;_{bs}r^*$ & $\dn$ & $\textit{false}$ \\ $\zeroable\;_{bs}c$ & $\dn$ & $\textit{false}$\\ $\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\ $\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$ \end{tabular}\end{center}\noindentThe \begin{center} \begin{tabular}{lcl} $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ \end{tabular}\end{center}\noindentclause does flatten the alternative as required in step (1),but $\textit{simp}\_{SL}$ is insufficient if we want to do steps (2) and (3),as these ``identical'' terms have different bit-annotations.They also suggested that the $\textit{simp}\_{SL} $ function should beapplied repeatedly until a fixpoint is reached.We call this construction $\textit{SLSimp}$:\begin{center} \begin{tabular}{lcl} $\textit{SLSimp} \; r$ & $\dn$ & $\textit{while}((\textit{simp}\_{SL} \; r)\; \cancel{=} \; r)$ \\ & & $\quad r := \textit{simp}\_{SL} \; r$\\ & & $\textit{return} \; r$ \end{tabular}\end{center}We call the operation of alternatingly applying derivatives and simplifications(until the string is exhausted) Sulz-simp-derivative,written $\backslash_{SLSimp}$:\begin{center}\begin{tabular}{lcl} $r \backslash_{SLSimp} (c\!::\!s) $ & $\dn$ & $(\textit{SLSimp} \; (r \backslash c)) \backslash_{SLSimp}\, s$ \\$r \backslash_{SLSimp} [\,] $ & $\dn$ & $r$\end{tabular}\end{center}\noindentAfter the derivatives have been taken, the bitcodesare extracted and decoded in the same manneras $\blexer$:\begin{center}\begin{tabular}{lcl} $\textit{blexer\_SLSimp}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash_{SLSimp}\, s\;\textit{in}$\\ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ & & $\;\;\textit{else}\;\textit{None}$\end{tabular}\end{center}\noindentWe implemented this lexing algorithm in Scala, and found that the final derivative regular expressionsize grows exponentially fast (note the logarithmic scale):\begin{figure}[H] \centering\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, ylabel={size}, ymode = log, legend entries={Final Derivative Size}, legend pos=north west, legend cell align=left]\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexer.data};\end{axis}\end{tikzpicture} \caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}\end{figure}\noindentAt $n= 20$ we already get an out of memory error with Scala's normal JVM heap size settings.In fact their simplification does not improve much overthe simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.The time required also grows exponentially:\begin{figure}[H] \centering\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, ylabel={time}, %ymode = log, legend entries={time in secs}, legend pos=north west, legend cell align=left]\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexerTime.data};\end{axis}\end{tikzpicture} \caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime}\end{figure}\noindentwhich seems like a counterexample for their linear complexity claim:\begin{quote}\itLinear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations:simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input. \end{quote}\noindentThe assumption that the size of the regular expressionsin the algorithmwould stay below a finite constant is not true, not in theexamples we considered.The main reason behind this is that (i) the $\textit{nub}$function requires identical annotations between two annotated regular expressions to qualify as duplicates,and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$even if both $a^*$ denote the same language, and(ii) the ``flattening'' only applies to the head of the listin the \begin{center} \begin{tabular}{lcl} $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ \end{tabular}\end{center}\noindentclause, and therefore is not strong enough to simplify allneeded parts of the regular expression. Moreover, even if the regular expressions sizedo stay finite, one has to take into account thatthe $\textit{simp}\_{SL}$ function is applied many timesin each derivative step, and that number is not necessarilya constant with respect to the size of the regular expression.%To not get ``caught off guard'' by%these counterexamples,%one needs to be more careful when designing the%simplification function and making claims about them.\section{Our $\textit{Simp}$ Function}We will now introduce our simplification function.%by making a contrast with $\textit{simp}\_{SL}$.We also describethe ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$algorithm and why it fails to achieve the desired effect. Our simplification function comes with a formalcorrectness proof.\subsection{Flattening Nested Alternatives}The idea behind the clause\begin{center} $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\end{center}is that it allowsduplicate removal of regular expressions at different``levels'' of alternatives.For example, this would help with thefollowing simplification:\begin{center}$(a+r)+r \longrightarrow a+r$\end{center}The problem is that only the head elementis ``spilled out''.It is more desirableto flattenan entire list to open up possibilities for further simplificationswith later regular expressions.Not flattening the rest of the elements also means thatthe later de-duplication processs does not fully remove further duplicates.For example,using $\textit{simp}\_{SL}$ we cannotsimplify\begin{center} $((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+((a^*a^*)+a^*)\cdot (a^*a^*)^*$\end{center}due to the underlined part not being in the head of the alternative.We define our flatten operation so that it flattens the entire list: \begin{center} \begin{tabular}{@{}lcl@{}} $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\ $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\ $\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise) \end{tabular} \end{center} \noindentOur $\flts$ operation also throws away $\ZERO$sas they do not contribute to a lexing result.\subsection{Duplicate Removal}After flattening is done, we can deduplicate.The de-duplicate function is called $\distinctBy$,and that is where we make our second improvement overSulzmann and Lu's simplification method.The process goes as follows:\begin{center}$rs \stackrel{\textit{flts}}{\longrightarrow} rs_{flat} \xrightarrow{\distinctBy \; rs_{flat} \; \rerases\; \varnothing} rs_{distinct}$%\stackrel{\distinctBy \; %rs_{flat} \; \erase\; \varnothing}{\longrightarrow} \; rs_{distinct}$\end{center}where the $\distinctBy$ function is defined as:\begin{center} \begin{tabular}{@{}lcl@{}} $\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\ $\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & $\quad \textit{if} (f \; x \in acc)\;\; \textit{then} \;\; \distinctBy \; xs \; f \; acc$\\ & & $\quad \textit{else}\;\; x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$ \end{tabular}\end{center}\noindentThe reason we define a distinct function under a mapping $f$ is becausewe want to eliminate regular expressions that are syntactically the same,but have different bit-codes.For example, we can remove the second $a^*a^*$ from$_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because itrepresents a match with shorter initial sub-match (and therefore is definitely not POSIX),and will be discarded by $\bmkeps$ later.\begin{center} $_{ZSZ}\underbrace{a^*}_{ZS:\; match \; 1\; times\quad}\underbrace{a^*}_{Z: \;match\; 1 \;times} + _{SZZ}\underbrace{a^*}_{S: \; match \; 0 \; times\quad}\underbrace{a^*}_{ZZ: \; match \; 2 \; times} $\end{center}%$_{bs1} r_1 + _{bs2} r_2 \text{where} (r_1)_{\downarrow} = (r_2)_{\downarrow}$Due to the way our algorithm works,the matches that conform to the POSIX standard will always be placed further to the left. When we traverse the list from left to right,regular expressions we have already seenwill definitely not contribute to a POSIX value,even if they are attached with different bitcodes.These duplicates therefore need to be removed.To achieve this, we call $\rerases$ as the function $f$ during the distinctionoperation. The function$\rerases$ is very similar to $\erase$, except that it preserves the structurewhen erasing an alternative regular expression.The reason why we use $\rerases$ instead of $\erase$ is thatit keeps the structures of alternative annotated regular expressionswhereas $\erase$ would turn it back into a binary tree structure.Not having to mess with the structure greatly simplifies the finiteness proof in chapter \ref{Finite}.We give the definitions of $\rerases$ here together withthe new datatype used by $\rerases$ (as our plainregular expression datatype does not allow non-binary alternatives).For now we can think of $\rerases$ as the function $(\_)_\downarrow$ defined in chapter \ref{Bitcoded1}and $\rrexp$ as plain regular expressions, but having a general list constructorfor alternatives:\begin{figure}[H]\begin{center} $\rrexp ::= \RZERO \mid \RONE \mid \RCHAR{c} \mid \RSEQ{r_1}{r_2} \mid \RALTS{rs} \mid \RSTAR{r} $\end{center}\caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative constructor}\label{rrexpDef}\end{figure}The function $\rerases$ we define as follows:\begin{center}\begin{tabular}{lcl}$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\ $\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$\end{tabular}\end{center}\subsection{Putting Things Together}We can now give the definition of our simplification function:%that looks somewhat similar to our Scala code is \begin{center} \begin{tabular}{@{}lcl@{}} $\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\ $\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \rerases \; \varnothing) $ \\ $\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ \end{tabular} \end{center} \noindentThe simplification (named $\textit{bsimp}$ for \emph{b}it-coded) does a pattern matching on the regular expression.When it detected that the regular expression is an alternative orsequence, it will try to simplify its children regular expressionsrecursively and then see if one of the children turns into $\ZERO$ or$\ONE$, which might trigger further simplification at the current level.Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$.\begin{center} \begin{tabular}{@{}lcl@{}} $\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\ &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ &&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\ &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \end{tabular}\end{center}\noindentThe most involved part is the $\sum$ clause, where we first call $\flts$ onthe simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$.and then call $\distinctBy$ on that list, the predicate determining whether two elements are the same is $\rerases \; r_1 = \rerases\; r_2$.Finally, depending on whether the regular expression list $as'$ has turned into asingleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$decides whether to keep the current level constructor $\sum$ as it is, and removes it when there are less than two elements:\begin{center} \begin{tabular}{lcl} $\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ \end{tabular}\end{center}Having defined the $\textit{bsimp}$ function,we add it as a phase after a derivative is taken.\begin{center} \begin{tabular}{lcl} $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$ \end{tabular}\end{center}%Following previous notations%when extending from derivatives w.r.t.~character to derivative%w.r.t.~string, we define the derivative that nests simplifications %with derivatives:%\comment{simp in the [] case?}We extend this from characters to strings:\begin{center}\begin{tabular}{lcl}$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$\end{tabular}\end{center}\noindentThe lexer that extracts bitcodes from the derivatives with simplifications from our $\simp$ functionis called $\blexersimp$:\begin{center}\begin{tabular}{lcl} $\textit{blexer\_simp}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ & & $\;\;\textit{else}\;\textit{None}$\end{tabular}\end{center}\noindentThis algorithm keeps the regular expression size small.\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$After Simplification}Recall theprevious $(a^*a^*)^*$ examplewhere $\textit{simp}\_{SL}$ could notprevent the fast growth (over3 million nodes just below $20$ input length)will be reduced to just 15 and stays constant no matter how long theinput string is.This is shown in the graphs below.\begin{figure}[H]\begin{center}\begin{tabular}{ll}\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, ylabel={derivative size}, width=7cm, height=4cm, legend entries={Lexer with $\textit{bsimp}$}, legend pos= south east, legend cell align=left]\addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data};\end{axis}\end{tikzpicture} %\label{fig:BitcodedLexer}&\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, ylabel={derivative size}, width = 7cm, height = 4cm, legend entries={Lexer with $\textit{simp}\_{SL}$}, legend pos= north west, legend cell align=left]\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};\end{axis}\end{tikzpicture} \end{tabular}\end{center}\caption{Our Improvement over Sulzmann and Lu's in terms of size}\end{figure}\noindentGiven the size difference, it is notsurprising that our $\blexersimp$ significantly outperforms$\textit{blexer\_SLSimp}$.In the next section we are going to establish that oursimplification preserves the correctness of the algorithm.%----------------------------------------------------------------------------------------% SECTION rewrite relation%----------------------------------------------------------------------------------------\section{Correctness of $\blexersimp$}We first introduce the rewriting relation \emph{rrewrite}($\rrewrite$) between two regular expressions,which stands for an atomicsimplification.We then prove properties aboutthis rewriting relation and its reflexive transitive closure.Finally we leverage these properties to showan equivalence between the results generated by$\blexer$ and $\blexersimp$.\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}In the $\blexer$'s correctness proof, wedid not directly derive the fact that $\blexer$ generates the POSIX value,but first proved that $\blexer$ generates the same result as $\lexer$.Then we re-usethe correctness of $\lexer$to obtain \begin{center} $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$\\ $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; r\;s = \None$.\end{center}%\begin{center}% $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.%\end{center}Here we apply thismodularised technique againby first proving that$\blexersimp \; r \; s $ produces the same output as $\blexer \; r\; s$,and then piecing it together with $\blexer$'s correctness to achieve our maintheorem:\begin{center} $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = \Some \;v$ \\ $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; r\;s = \None$\end{center}\noindentThe overall idea for the proofof $\blexer \;r \;s = \blexersimp \; r \;s$ is that the transition from $r$ to $\textit{bsimp}\; r$ can bebroken down into smaller rewrite steps of the form:\begin{center} $r \rightsquigarrow^* \textit{bsimp} \; r$\end{center}where each rewrite step, written $\rightsquigarrow$,is an ``atomic'' simplification thatis similar to a small-step reduction in operational semantics (see figure \ref{rrewriteRules} for the rules):\begin{figure}[H]\begin{mathpar} \inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\} \inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\} \inferrule * [Right = $S_1$]{\vspace{0em}}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\ \inferrule * [Right = $SL$] {\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\} \inferrule * [Right = $SR$] {\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\ \inferrule * [Right = $A0$] {\vspace{0em}}{ _{bs}\sum [] \rightsquigarrow \ZERO} \inferrule * [Right = $A1$] {\vspace{0em}}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a} \inferrule * [Right = $AL$] {\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2} \inferrule * [Right = $LE$] {\vspace{0em}}{ [] \stackrel{s}{\rightsquigarrow} []} \inferrule * [Right = $LT$] {rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{ r :: rs_1 \stackrel{s}{\rightsquigarrow} r :: rs_2 } \inferrule * [Right = $LH$] {r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs} \inferrule * [Right = $L\ZERO$] {\vspace{0em}}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs} \inferrule * [Right = $LS$] {\vspace{0em}}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) } \inferrule * [Right = $LD$] {\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rs_c \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}\end{mathpar}\caption{The rewrite rules that generate simplified regular expressions in small steps: $r_1 \rightsquigarrow r_2$ is for bitcoded regular expressions and $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$ for lists of bitcoded regular expressions. Interesting is the LD rule that allows copies of regular expressions to be removed provided a regular expression earlier in the list can match the same strings.}\label{rrewriteRules}\end{figure}\noindentThe rules $LT$ and $LH$ are for rewriting two regular expression listssuch that one regular expressionin the left-hand-side list is rewritable in one stepto the right-hand-side's regular expression at the same position.This helps with defining the ``context rule'' $AL$.\\The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$are defined in the usual way:\begin{figure}[H] \centering\begin{mathpar} \inferrule{\vspace{0em}}{ r \rightsquigarrow^* r \\} \inferrule{\vspace{0em}}{rs \stackrel{s*}{\rightsquigarrow} rs \\} \inferrule{r_1 \rightsquigarrow^* r_2 \land \; r_2 \rightsquigarrow^* r_3}{r_1 \rightsquigarrow^* r_3\\} \inferrule{rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}\end{mathpar}\caption{The Reflexive Transitive Closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}\end{figure}%Two rewritable terms will remain rewritable to each other%even after a derivative is taken:The main point of our rewriting relationis that it is preserved under derivatives,namely\begin{center} $r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$\end{center}And also, if two terms are rewritable to each other,then they produce the same bitcodes:\begin{center} $r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$\end{center}The decoding phase of both $\blexer$ and $\blexersimp$are the same, which means that if they receive the samebitcodes before the decoding phase,they generate the same value after decoding is done.We will prove the three properties we mentioned above in the next sub-section.\subsection{Important Properties of $\rightsquigarrow$}First we prove some basic facts about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$, $\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$,which will be needed later.\\The inference rules (\ref{rrewriteRules}) we gave in the previous section have their ``many-steps version'':\begin{lemma}\label{squig1} \hspace{0em} \begin{itemize} \item $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$ \item $r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs)\; \rightsquigarrow^*\; _{bs} \sum (r' :: rs)$ \item The rewriting in many steps property is composible in terms of the sequence constructor:\\ $r_1 \rightsquigarrow^* r_2 \implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* \; _{bs} r_2 \cdot r_3 \quad $ and $\quad r_3 \rightsquigarrow^* r_4 \implies _{bs} r_1 \cdot r_3 \rightsquigarrow^* _{bs} \; r_1 \cdot r_4$ \item The rewriting in many steps properties $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ is preserved under the function $\fuse$:\\ $r_1 \rightsquigarrow^* r_2 \implies \fuse \; bs \; r_1 \rightsquigarrow^* \; \fuse \; bs \; r_2 \quad $ and $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\fuse \; bs) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$ \end{itemize}\end{lemma}\begin{proof} By an induction on the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively. The third and fourth points are by the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and $rs_2 \stackrel{s}{\rightsquigarrow} rs_3 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$, which can be indutively proven by the inductive cases of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$.\end{proof}\noindentThe inference rules of $\stackrel{s}{\rightsquigarrow}$are defined in terms of the list cons operation, wherewe establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relation is also preserved w.r.t appending and prepending of a list.In addition, wealso prove some relations between $\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$.\begin{lemma}\label{ssgqTossgs} \hspace{0em} \begin{itemize} \item $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$ \item $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2 \; \; \textit{and} \; \; rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$ \item The $\stackrel{s}{\rightsquigarrow} $ relation after appending a list becomes $\stackrel{s*}{\rightsquigarrow}$:\\ $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$ \item $r_1 \rightsquigarrow^* r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$ \item $rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$ \item If we can rewrite a regular expression in many steps to $\ZERO$, then we can also rewrite any sequence containing it to $\ZERO$:\\ $r_1 \rightsquigarrow^* \ZERO \implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$ \end{itemize}\end{lemma}\begin{proof} The first part is by induction on the list $rs$. The second part is by induction on the inductive cases of $\stackrel{s*}{\rightsquigarrow}$. The third part is by rule induction of $\stackrel{s}{\rightsquigarrow}$. The fourth sub-lemma is by rule induction of $\stackrel{s*}{\rightsquigarrow}$ and using part one to three. The fifth part is a corollary of part four. The last part is proven by rule induction again on $\rightsquigarrow^*$.\end{proof}\noindentNow we are ready to give the proofs of the following properties:\begin{itemize} \item $r \rightsquigarrow^* r'\land \bnullable \; r_1 \implies \bmkeps \; r = \bmkeps \; r'$. \\ \item $r \rightsquigarrow^* \textit{bsimp} \;r$.\\ \item $r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\\end{itemize}\subsubsection{Property 1: $r \rightsquigarrow^* r'\land \bnullable \; r_1 \implies \bmkeps \; r = \bmkeps \; r'$}Intuitively, this property says we can extract the same bitcodes using $\bmkeps$ from the nullablecomponents of two regular expressions $r$ and $r'$,if we can rewrite from one to the other in finitelymany steps.\\For convenience, we define a predicate for a list of regular expressionshaving at least one nullable regular expressions:\begin{center} $\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$\end{center}\noindentThe rewriting relation $\rightsquigarrow$ preserves (b)nullability:\begin{lemma}\label{rewritesBnullable} \hspace{0em} \begin{itemize} \item $\text{If} \; r_1 \rightsquigarrow r_2, \; \text{then} \; \bnullable \; r_1 = \bnullable \; r_2$ \item $\text{If} \; rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \; \text{then} \; \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$ \item $r_1 \rightsquigarrow^* r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$ \end{itemize}\end{lemma}\begin{proof} By rule induction of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$. The third point is a corollary of the second.\end{proof}\noindentFor convenience again,we define $\bmkepss$ on a list $rs$,which extracts the bit-codes on the first $\bnullable$ element in $rs$:\begin{center} \begin{tabular}{lcl} $\bmkepss \; [] $ & $\dn$ & $[]$\\ $\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \;(\bnullable \; r) \;\; \textit{then} \;\; \bmkeps \; r \; \textit{else} \;\; \bmkepss \; rs$ \end{tabular}\end{center}\noindentIf both regular expressions in a rewriting relation are nullable, then they produce the same bitcodes:\begin{lemma}\label{rewriteBmkepsAux} \hspace{0em} \begin{itemize} \item $r_1 \rightsquigarrow r_2 \implies (\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 = \bmkeps \; r_2)$ \item and $rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 \implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies \bmkepss \; rs_1 = \bmkepss \; rs2)$ \end{itemize}\end{lemma}\begin{proof} By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$.\end{proof}\noindentWith lemma \ref{rewriteBmkepsAux} in place we are ready to prove itsmany-step version: \begin{lemma} $\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$\end{lemma}\begin{proof} By rule induction of $\stackrel{*}{\rightsquigarrow} $. Lemma $\ref{rewritesBnullable}$ gives us both $r$ and $r'$ are nullable. The lemma \ref{rewriteBmkepsAux} solves the inductive case.\end{proof}\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$}Now we get to the key part of the proof, which says that our simplification's helper functions such as $\distinctBy$ and $\flts$ describereducts of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^* $.\\The first lemma to prove is a more general version of $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:\begin{lemma} $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} (rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$\end{lemma}\noindentIt says that that for a list made of two parts $rs_1 @ rs_2$, one can throw away the duplicateelements in $rs_2$, as well as those that have appeared in $rs_1$.\begin{proof} By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary.\end{proof}\noindentSetting $rs_2$ to be empty,we get the corollary\begin{corollary}\label{dBPreserves} $rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$.\end{corollary}\noindentSimilarly the flatten function $\flts$ describes a reduct of$\stackrel{s*}{\rightsquigarrow}$ as well:\begin{lemma}\label{fltsPreserves} $rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$\end{lemma}\begin{proof} By an induction on $rs$.\end{proof}\noindentThe function $\bsimpalts$ preserves rewritability:\begin{lemma}\label{bsimpaltsPreserves} $_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$\end{lemma}\noindentThe simplification function$\textit{bsimp}$ only transforms the regular expression using steps specified by $\rightsquigarrow^*$ and nothing else:\begin{lemma} $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$\end{lemma}\begin{proof} By an induction on $r$. The most involved case is the alternative, where we use lemmas \ref{bsimpaltsPreserves}, \ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\ \begin{center} \begin{tabular}{lcl} $rs$ & $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\ & $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\ & $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\ \end{tabular} \end{center} Using this we can derive the following rewrite sequence:\\ \begin{center} \begin{tabular}{lcl} $r$ & $=$ & $_{bs}\sum rs$\\[1.5ex] & $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex] & $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex] & $\rightsquigarrow^*$ & $\bsimpalts \; bs \; (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi)$\\[1.5ex] %& $\rightsquigarrow^*$ & $ _{bs} \sum (\distinctBy \; %(\flts \; (\map \; \textit{bsimp}\; rs)) \; \; %\rerases \; \;\phi) $\\[1.5ex] & $\rightsquigarrow^*$ & $\textit{bsimp} \; r$\\[1.5ex] \end{tabular} \end{center} \end{proof}\subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}The rewrite relation $\rightsquigarrow$ changes into $\stackrel{*}{\rightsquigarrow}$after derivatives are taken on both sides:\begin{lemma}\label{rewriteBder} \hspace{0em} \begin{itemize} \item If $r_1 \rightsquigarrow r_2$, then $r_1 \backslash c \rightsquigarrow^* r_2 \backslash c$ \item If $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$, then $ \map \; (\_\backslash c) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$ \end{itemize}\end{lemma}\begin{proof} By induction on $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas.\end{proof}\noindentNow we can prove property 3, as an immediate corollary:\begin{corollary}\label{rewritesBder} $r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^* r_2 \backslash c$\end{corollary}\begin{proof} By rule induction of $\stackrel{*}{\rightsquigarrow} $ and lemma \ref{rewriteBder}.\end{proof}\noindentThis can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$to obtain the correspondence between$\blexer$ and $\blexersimp$'s intermediatederivative regular expressions \begin{lemma}\label{bderBderssimp} $a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $\end{lemma}\begin{proof} By an induction on $s$.\end{proof}\subsection{Main Theorem}Now with \ref{bderBderssimp} in place we are ready for the main theorem.\begin{theorem} $\blexer \; r \; s = \blexersimp{r}{s}$\end{theorem}\noindent\begin{proof} We can rewrite in many steps from the original lexer's derivative regular expressions to the lexer with simplification applied (by lemma \ref{bderBderssimp}): \begin{center} $a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $. \end{center} We know that they generate the same bits, if the lexing result is a match: \begin{center} $\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ \end{center} Now that they generate the same bits, we know that they give the same value after decoding. \begin{center} $\bnullable \; (a \backslash s) \implies \decode \; r \; (\bmkeps \; (a \backslash s)) = \decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$ \end{center} Which is required by our proof goal: \begin{center} $\blexer \; r \; s = \blexersimp \; r \; s$. \end{center} \end{proof}\noindentAs a corollary,we can link this result with the lemma we proved earlier that \begin{center} $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\ $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; r\;s = \None$.\end{center}and obtain the corollary that the bit-coded lexer with simplification isindeed correctly outputting POSIX lexing result, if such a result exists.\begin{corollary} $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\ $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; r\;s = \None$.\end{corollary}\subsection{Comments on the Proof Techniques Used}Straightforward and simple as the proof may seem,the efforts we spent obtaining it were far from trivial.We initially attempted to re-use the argument in \cref{flex_retrieve}. The problem is that both functions $\inj$ and $\retrieve$ require that the annotated regular expressions stay unsimplified, so that one can correctly compare $v_{i+1}$ and $r_i$ and $v_i$ in diagram \ref{graph:inj} and ``fit the key into the lock hole''.\noindentWe also tried to prove \begin{center}$\textit{bsimp} \;\; (\bderssimp{a}{s}) = \textit{bsimp} \;\; (a\backslash s)$,\end{center}but this turns out to be not true.A counterexample would be\[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\; \text{and} \;\; s = bb.\]\noindentThen we would have \begin{center} $\textit{bsimp}\;\; ( a \backslash s )$ = $_{[]}(_{ZZ}\ONE + _{ZS}c ) $\end{center}\noindentwhereas \begin{center} $\textit{bsimp} \;\;( \bderssimp{a}{s} )$ = $_{Z}(_{Z} \ONE + _{S} c)$.\end{center}Unfortunately, if we apply $\textit{bsimp}$ differentlywe will always have this discrepancy. This is due to the $\map \; (\fuse\; bs) \; as$ operation happening at different locations in the regular expression.\\The rewriting relation $\rightsquigarrow^*$ allows us to ignore this discrepancyand view the expressions \begin{center} $_{[]}(_{ZZ}\ONE + _{ZS}c ) $\\ and\\ $_{Z}(_{Z} \ONE + _{S} c)$\end{center}as equal, because they were both re-writtenfrom the same expression.\\The simplification rewriting rulesgiven in \ref{rrewriteRules} are by no meansfinal,one could come up new rulessuch as $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow\SEQs [r_1, r_2, r_3]$.However this does not fit with the proof techniqueof our main theorem, but seem to not violate the POSIXproperty.Having established the correctness of our$\blexersimp$, in the next chapter we shall prove that with our $\simp$ function,for a given $r$, the derivative size is alwaysfinitely bounded by a constant.