diff -r a7344c9afbaf -r b2bea5968b89 ChengsongTanPhdThesis/Chapters/Bitcoded2.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Tue Jun 14 18:06:33 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Thu Jun 23 16:09:40 2022 +0100 @@ -14,92 +14,124 @@ Now we introduce the simplifications, which is why we introduce the bitcodes in the first place. -\subsection*{Simplification Rules} +\section{Simplification for Annotated Regular Expressions} +The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s +and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns +are scattered around different levels: -This section introduces aggressive (in terms of size) simplification rules -on annotated regular expressions -to keep derivatives small. Such simplifications are promising -as we have -generated test data that show -that a good tight bound can be achieved. We could only -partially cover the search space as there are infinitely many regular -expressions and strings. +\begin{center} + $(a^*a^*)^* \rightarrow (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ + $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ +\end{center} +\noindent +Despite that we have already implemented the simple-minded simplifications +such as throwing away useless $\ONE$s and $\ZERO$s. +The simplification rule $r + r \rightarrow $ cannot make a difference either +since it only removes duplicates on the same level, not something like +$(r+a)+r \rightarrow r+a$. +This requires us to break up nested alternatives into lists, for example +using the flatten operation similar to the one defined for any function language's +list datatype: -One modification we introduced is to allow a list of annotated regular -expressions in the $\sum$ constructor. This allows us to not just -delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but -also unnecessary ``copies'' of regular expressions (very similar to -simplifying $r + r$ to just $r$, but in a more general setting). Another -modification is that we use simplification rules inspired by Antimirov's -work on partial derivatives. They maintain the idea that only the first -``copy'' of a regular expression in an alternative contributes to the -calculation of a POSIX value. All subsequent copies can be pruned away from -the regular expression. A recursive definition of our simplification function + \begin{center} + \begin{tabular}{@{}lcl@{}} + $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; + (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\ + $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\ + $\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise) +\end{tabular} +\end{center} + +\noindent +There is a minor difference though, in that our $\flts$ operation defined by us +also throws away $\ZERO$s. +For a flattened list of regular expressions, a de-duplication can be done efficiently: + + +\begin{center} + \begin{tabular}{@{}lcl@{}} + $\distinctBy \; [] \; f\; acc $ & $ =$ & $ []$\\ + $\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & \\ + & & $\quad \textit{if} (f \; x) \in acc \textit{then} \distinctBy \; xs \; f \; acc$\\ + & & $\quad \textit{else} x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$ + \end{tabular} +\end{center} +\noindent +The reason we define a distinct function under a mapping $f$ is because +we want to eliminate regular expressions that are the same syntactically, +but having different bit-codes (more on the reason why we can do this later). +To achieve this, we call $\erase$ as the function $f$ during the distinction +operation. +A recursive definition of our simplification function that looks somewhat similar to our Scala code is given below: -%\comment{Use $\ZERO$, $\ONE$ and so on. -%Is it $ALTS$ or $ALTS$?}\\ \begin{center} \begin{tabular}{@{}lcl@{}} - $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ - &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ - &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ - &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ - - $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\ - &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ - &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ - &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ - - $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ + $\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\ + $\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \erase \; \phi) $ \\ + $\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ \end{tabular} \end{center} \noindent -The simplification does a pattern matching on the regular expression. +The simplification (named $\bsimp$ for \emph{b}it-coded) does a pattern matching on the regular expression. When it detected that the regular expression is an alternative or -sequence, it will try to simplify its child regular expressions +sequence, it will try to simplify its children regular expressions recursively and then see if one of the children turns into $\ZERO$ or $\ONE$, which might trigger further simplification at the current level. -The most involved part is the $\sum$ clause, where we use two -auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested -alternatives and reduce as many duplicates as possible. Function -$\textit{distinct}$ keeps the first occurring copy only and removes all later ones -when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. -Its recursive definition is given below: - - \begin{center} - \begin{tabular}{@{}lcl@{}} - $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; - (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ - $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ - $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) -\end{tabular} -\end{center} - +Current level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$, +using rules such as $\ZERO \cdot r \rightarrow \ZERO$ and $\ONE \cdot r \rightarrow r$. +\begin{center} + \begin{tabular}{@{}lcl@{}} + $\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\ + &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; (_{bs1}\ONE, a_2') \Rightarrow \textit{fuse} \; (bs@bs_1) \; a_2'$ \\ + &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ + \end{tabular} +\end{center} \noindent -Here $\textit{flatten}$ behaves like the traditional functional programming flatten -function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it -removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. - -Having defined the $\simp$ function, -we can use the previous notation of natural +The most involved part is the $\sum$ clause, where we first call $\flts$ on +the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$. +and then call $\distinctBy$ on that list, the predicate determining whether two +elements are the same is $\erase \; r_1 = \erase\; r_2$. +Finally, depending on whether the regular expression list $as'$ has turned into a +singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$ +decides whether to keep the current level constructor $\sum$ as it is, and +removes it when there are less than two elements: +\begin{center} + \begin{tabular}{lcl} + $\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ + &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ + &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ + &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ + \end{tabular} + +\end{center} +Having defined the $\bsimp$ function, +we add it as a phase after a derivative is taken, +so it stays small: +\begin{center} + \begin{tabular}{lcl} + $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$ + \end{tabular} +\end{center} +Following previous notation of natural extension from derivative w.r.t.~character to derivative -w.r.t.~string:%\comment{simp in the [] case?} +w.r.t.~string, we define the derivative that nests simplifications with derivatives:%\comment{simp in the [] case?} \begin{center} \begin{tabular}{lcl} -$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ -$r \backslash_{simp} [\,] $ & $\dn$ & $r$ +$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\ +$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$ \end{tabular} \end{center} \noindent -to obtain an optimised version of the algorithm: - +Extracting bit-codes from the derivatives that had been simplified repeatedly after +each derivative run, the simplified $\blexer$, called $\blexersimp$, +is defined as \begin{center} \begin{tabular}{lcl} $\textit{blexer\_simp}\;r\,s$ & $\dn$ & @@ -112,16 +144,17 @@ \noindent This algorithm keeps the regular expression size small, for example, -with this simplification our previous $(a + aa)^*$ example's 8000 nodes +with this simplification our previous $(a + aa)^*$ example's fast growth (over +$10^5$ nodes at around $20$ input length) will be reduced to just 17 and stays constant, no matter how long the input string is. +We show some graphs to better demonstrate this imrpovement. - -\section{ $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification} -We -Unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current -simplification rules (we put the graphs together to show the contrast) +\section{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification} +For $(a+aa)^*$, it used to grow to over $9000$ nodes with simple-minded simplification +at only around $15$ input characters: +\begin{center} \begin{tabular}{ll} \begin{tikzpicture} \begin{axis}[ @@ -129,7 +162,38 @@ ylabel={derivative size}, width=7cm, height=4cm, - legend entries={Lexer with $\bsimp$}, + legend entries={Lexer with $\textit{bsimp}$}, + legend pos= south east, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_bsimp.data}; +\end{axis} +\end{tikzpicture} %\label{fig:BitcodedLexer} +& +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + ylabel={derivative size}, + width = 7cm, + height = 4cm, + legend entries={Lexer without $\textit{bsimp}$}, + legend pos= north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star_easysimp.data}; +\end{axis} +\end{tikzpicture} +\end{tabular} +\end{center} +And for $(a^*\cdot a^*)^*$, unlike in \ref{fig:BetterWaterloo}, the size with simplification now stay nicely constant with current +simplification rules (we put the graphs together to show the contrast) +\begin{center} +\begin{tabular}{ll} +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + ylabel={derivative size}, + width=7cm, + height=4cm, + legend entries={Lexer with $\textit{bsimp}$}, legend pos= south east, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {BitcodedLexer.data}; @@ -142,65 +206,349 @@ ylabel={derivative size}, width = 7cm, height = 4cm, - legend entries={Lexer without $\bsimp$}, + legend entries={Lexer without $\textit{bsimp}$}, legend pos= north west, legend cell align=left] \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; \end{axis} \end{tikzpicture} \end{tabular} +\end{center} +%---------------------------------------------------------------------------------------- +% SECTION rewrite relation +%---------------------------------------------------------------------------------------- +\section{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)} +The overall idea for the correctness +\begin{conjecture} + $\blexersimp \; r \; s = \blexer \; r\; s$ +\end{conjecture} +is that the $\textit{bsimp}$ will not change the regular expressions so much that +it becomes impossible to extract the $\POSIX$ values. +To capture this "similarity" between unsimplified regular expressions and simplified ones, +we devise the rewriting relation $\rrewrite$, written infix as $\rightsquigarrow$: +\begin{center} +\begin{mathpar} + \inferrule{}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\} + + \inferrule{}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\} + + \inferrule{}{_{bs1} ((_{bs2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\ + + + + \inferrule{\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\} + + \inferrule{\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\ + + \inferrule{}{ _{bs}\sum [] \rightsquigarrow \ZERO\\} + + \inferrule{}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a\\} + + \inferrule{\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow rs_2\\} + + \inferrule{}{\\ [] \stackrel{s}{\rightsquigarrow} []} + + \inferrule{rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{\\ r :: rs_1 \rightsquigarrow r :: rs_2 \rightsquigarrow} + + \inferrule{r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs\\} + + \inferrule{}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs} + + \inferrule{}{_{bs} \sum (rs_1 :: rs_b) \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs_1) \; rs_1) @ rsb) } + + \inferrule{\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rsc \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c} + +\end{mathpar} +\end{center} +These "rewrite" steps define the atomic simplifications we could impose on regular expressions +under our simplification algorithm. +For convenience, we define a relation $\stackrel{s}{\rightsquigarrow}$ for rewriting +a list of regular exression to another list. +The $\rerase{}$ function is used instead of $_\downarrow$ for the finiteness bound proof of next chapter, +which we will discuss later. For the moment the reader can assume they basically do the same thing as $\erase$. + +Usually more than one steps are taking place during the simplification of a regular expression, +therefore we define the reflexive transitive closure of the $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ +relations: + +\begin{center} +\begin{mathpar} + \inferrule{}{ r \stackrel{*}{\rightsquigarrow} r \\} + \inferrule{}{\\ rs \stackrel{s*}{\rightsquigarrow} rs \\} + \inferrule{\\r_1 \stackrel{*}{\rightsquigarrow} r_2 \land \; r_2 \stackrel{*}{\rightsquigarrow} r_3}{r_1 \stackrel{*}{\rightsquigarrow} r_3\\} + \inferrule{\\rs_1 \stackrel{*}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{*}{\rightsquigarrow} rs_3} +\end{mathpar} +\end{center} +Now that we have modelled the rewriting behaviour of our simplifier $\bsimp$, we prove mainly +three properties about how these relations connect to $\blexersimp$: +\begin{itemize} + \item + $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$ + The algorithm $\bsimp$ only transforms the regex $r$ using steps specified by + $\stackrel{*}{\rightsquigarrow}$ and nothing else. + \item + $r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow r'\backslash c$. + The relation $\stackrel{*}{rightsquigarrow}$ is preserved under derivative. + \item + $(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$. If we reach another + expression in finitely many atomic simplification steps, then these two regular expressions + will produce the same bit-codes under the bit collection function $\bmkeps$ used by our $\blexer$. + +\end{itemize} +\section{Three Important properties} +These properties would work together towards the correctness theorem. +We start proving each of these lemmas below. +\subsection{$(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$ and $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$} +The first few properties we establish is that the inference rules we gave for $\rightsquigarrow$ +and $\stackrel{s}{\rightsquigarrow}$ also hold as implications for $\stackrel{*}{\rightsquigarrow}$ and +$\stackrel{s*}{\rightsquigarrow}$. +\begin{lemma} + $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \stackrel{*}{\rightsquigarrow} _{bs} \sum rs_2$ +\end{lemma} +\begin{proof} + By rule induction of $\stackrel{s*}{\rightsquigarrow}$. +\end{proof} +\begin{lemma} + $r \stackrel{*}{\rightsquigarrow} r' \implies _{bs} \sum r :: rs \stackrel{*}{\rightsquigarrow} _{bs} \sum r' :: rs$ +\end{lemma} +\begin{proof} + By rule induction of $\stackrel{*}{\rightsquigarrow} $. +\end{proof} +\noindent +Then we establish that the $\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ relation is preserved w.r.t appending and prepending of a list: +\begin{lemma} + $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$ +\end{lemma} +\begin{proof} + By induction on the list $rs$. +\end{proof} + +\begin{lemma} + $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2$ +\end{lemma} +\begin{proof} + By rule induction of $\stackrel{s*}{\rightsquigarrow}$. +\end{proof} + +\noindent +The $\stackrel{s}{\rightsquigarrow} $ relation after appending a list becomes $\stackrel{s*}{\rightsquigarrow}$: +\begin{lemma}\label{ssgqTossgs} + $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$ +\end{lemma} +\begin{proof} + By rule induction of $\stackrel{s}{\rightsquigarrow}$. +\end{proof} +\begin{lemma} + $rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$ +\end{lemma} +\begin{proof} + By rule induction of $\stackrel{s*}{\rightsquigarrow}$ and using \ref{ssgqTossgs}. +\end{proof} +Here are two lemmas relating $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$: +\begin{lemma}\label{singleton} + $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies [r_1] \stackrel{s*}{\rightsquigarrow} [r_2]$ +\end{lemma} +\begin{proof} + By rule induction of $ \stackrel{*}{\rightsquigarrow} $. +\end{proof} +\begin{lemma} + $rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies + r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$ +\end{lemma} +\begin{proof} + By using \ref{singleton}. +\end{proof} +Now we get to the "meaty" part of the proof, which relates the relations $\stackrel{s*}{\rightsquigarrow}$ and +$\stackrel{*}{\rightsquigarrow} $ with our simplification components such $\distinctBy$ and $\flts$. +The first lemma below says that for a list made of two parts $rs_1 @ rs_2$, one can throw away the duplicate +elements in $rs_2$, as well as those that have appeared in $rs_1$: +\begin{lemma} + $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} (rs_1 @ (\distinctBy \; rs_2 \; \; \rerase{\_}\; \; (\map\;\; \rerase{\_}\; \; rs_1)))$ +\end{lemma} +\begin{proof} + By induction on $rs_2$, where $rs_1$ is allowed to be arbitrary. +\end{proof} +The above h as the corollary that is suitable for the actual way $\distinctBy$ is called in $\bsimp$: +\begin{lemma}\label{dBPreserves} + $rs_ 1 \rightarrow \distinctBy \; rs_1 \; \phi$ +\end{lemma} -%---------------------------------------------------------------------------------------- -% SECTION common identities -%---------------------------------------------------------------------------------------- -\section{Common Identities In Simplification-Related Functions} -Need to prove these before starting on the big correctness proof. +The flatten function $\flts$ works within the $\rightsquigarrow$ relation: +\begin{lemma}\label{fltsPreserves} + $rs \stackrel{s*}{\rightsquigarrow} \flts \; rs$ +\end{lemma} + +The rewriting in many steps property is composible in terms of regular expression constructors: +\begin{lemma} + $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} \; _{bs} r_2 \cdot r_3 \quad $ and +$r_3 \stackrel{*}{\rightsquigarrow} r_4 \implies _{bs} r_1 \cdot r_3 \stackrel{*}{\rightsquigarrow} _{bs} \; r_1 \cdot r_4$ +\end{lemma} -%----------------------------------- -% SUBSECTION -%----------------------------------- -\subsection{Idempotency of $\simp$} - -\begin{equation} - \simp \;r = \simp\; \simp \; r -\end{equation} -This property means we do not have to repeatedly -apply simplification in each step, which justifies -our definition of $\blexersimp$. -It will also be useful in future proofs where properties such as -closed forms are needed. -The proof is by structural induction on $r$. +The rewriting in many steps properties $\stackrel{*}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$ is preserved under the function $\fuse$: +\begin{lemma} + $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \fuse \; bs \; r_1 \stackrel{*}{\rightsquigarrow} \; \fuse \; bs \; r_2 \quad $ and + $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\fuse \; bs) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs) \; rs_2$ +\end{lemma} +\begin{proof} + By the properties $r_1 \rightsquigarrow r_2 \implies \fuse \; bs \; r_1 \implies \fuse \; bs \; r_2$ and + $rs_2 \stackrel{s}{\rightsquigarrow} rs_3 \implies \map \; (\fuse \; bs) rs_2 \stackrel{s*}{\rightsquigarrow} \map \; (\fuse \; bs)\; rs_3$. +\end{proof} +\noindent +If we could rewrite a regular expression in many steps to $\ZERO$, then +we could also rewrite any sequence containing it to $\ZERO$: +\begin{lemma} + $r_1 \stackrel{*}{\rightsquigarrow} \ZERO \implies _{bs}r_1\cdot r_2 \stackrel{*}{\rightsquigarrow} \ZERO$ +\end{lemma} +\begin{lemma} + $\bmkeps \; (r \backslash s) = \bmkeps \; \bderssimp{r}{s}$ +\end{lemma} +\noindent +The function $\bsimpalts$ preserves rewritability: +\begin{lemma}\label{bsimpaltsPreserves} + $_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$ +\end{lemma} +\noindent +Before we give out the next lemmas, we define a predicate for a list of regular expressions +having at least one nullable regular expressions: +\begin{definition} + $\textit{bnullables} \; rs \dn \exists r \in rs. \bnullable r$ +\end{definition} +The rewriting relation $\rightsquigarrow$ preserves nullability: +\begin{lemma} + $r_1 \rightsquigarrow r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$ and + $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \textit{bnullables} \; rs_1 = \textit{bnullables} \; rs_2$ +\end{lemma} +\begin{proof} + By rule induction of $\rightarrow$ and $\stackrel{s}{\rightsquigarrow}$. +\end{proof} +So does the many steps rewriting: +\begin{lemma}\label{rewritesBnullable} + $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies \bnullable \; r_1 = \bnullable \; r_2$ +\end{lemma} +\begin{proof} + By rule induction of $\stackrel{*}{\rightsquigarrow} $. +\end{proof} +\noindent +And if both regular expressions in a rewriting relation are nullable, then they +produce the same bit-codes: -%----------------------------------- -% SUBSECTION -%----------------------------------- -\subsection{Syntactic Equivalence Under $\simp$} -We prove that minor differences can be annhilated -by $\simp$. -For example, +\begin{lemma}\label{rewriteBmkepsAux} + $r_1 \rightsquigarrow r_2 \implies (\bnullable \; r_1 \land \bnullable \; r_2 \implies \bmkeps \; r_1 = + \bmkeps \; r_2)$ and + $rs_ 1 \stackrel{s}{\rightsquigarrow} rs_2 \implies (\bnullables \; rs_1 \land \bnullables \; rs_2 \implies + \bmkepss \; rs_1 = \bmkepss \; rs2)$ +\end{lemma} +\noindent +The definition of $\bmkepss$ on list $rs$ is just to extract the bit-codes on the first element in $rs$ that +is $bnullable$: \begin{center} -$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = - \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ + \begin{tabular}{lcl} + $\bmkepss \; [] $ & $\dn$ & $[]$\\ + $\bmkepss \; r :: rs$ & $\dn$ & $\textit{if} \; \bnullable \; r then \; (\bmkeps \; r) \; \textit{else} \; \bmkepss \; rs$ + \end{tabular} \end{center} - - - - - +\noindent +And now we are ready to prove the key property that if you +have two regular expressions, one rewritable in many steps to the other, +and one of them is $\bnullable$, then they will both yield the same bits under $\bmkeps$: +\begin{lemma} + $(r \stackrel{*}{\rightsquigarrow} r'\land \bnullable \; r_1) \implies \bmkeps \; r = \bmkeps \; r'$ +\end{lemma} +\begin{proof} + By rule induction of $\stackrel{*}{\rightsquigarrow} $, using \ref{rewriteBmkepsAux} and $\ref{rewritesBnullable}$. +\end{proof} +\noindent +the other property is also ready: +\begin{lemma} + $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$ +\end{lemma} +\begin{proof} + By an induction on $r$. +The most difficult case would be the alternative case, where we using properties such as \ref{bsimpaltsPreserves} and \ref{fltsPreserves} and \ref{dBPreserves}, we could continuously rewrite a list like:\\ + $rs \stackrel{s*}{\rightsquigarrow} \map \; \textit{bsimp} \; rs$\\ + $\ldots \stackrel{s*}{\rightsquigarrow} \flts \; (\map \; \textit{bsimp} \; rs)$\\ + $\ldots \;\stackrel{s*}{\rightsquigarrow} \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi$\\ + Then we could do the following regular expresssion many steps rewrite:\\ + $ _{bs} \sum \distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi \stackrel{*}{\rightsquigarrow} \bsimpalts \; bs \; (\distinctBy \; (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerase \; \phi)$ + \\ + +\end{proof} +\section{Proof for the Property: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$} +The first thing we prove is that if we could rewrite in one step, then after derivative +we could rewrite in many steps: +\begin{lemma}\label{rewriteBder} + $r_1 \rightsquigarrow r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$ and + $rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies \map \; (\_\backslash c) \; rs_1 \stackrel{s*}{\rightsquigarrow} \map \; (\_ \backslash c) \; rs_2$ +\end{lemma} +\begin{proof} + By induction on $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$, using a number of the previous lemmas. +\end{proof} +Now we can prove that once we could rewrite from one expression to another in many steps, +then after a derivative on both sides we could still rewrite one to another in many steps: +\begin{lemma} + $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$ +\end{lemma} +\begin{proof} + By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma :\ref{rewriteBder}. +\end{proof} +\noindent +This can be extended and combined with the previous two important properties +so that a regular expression's successivve derivatives can be rewritten in many steps +to its simplified counterpart: +\begin{lemma}\label{bderBderssimp} + $a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $ +\end{lemma} +\subsection{Main Theorem} +Now with \ref{bdersBderssimp} we are ready for the main theorem. +To link $\blexersimp$ and $\blexer$, +we first say that they give out the same bits, if the lexing result is a match: +\begin{lemma} + $\bnullable \; (a \backslash s) \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ +\end{lemma} +\noindent +Now that they give out the same bits, we know that they give the same value after decoding, +which we know is correct value as $\blexer$ is correct: +\begin{theorem} + $\blexer \; r \; s = \blexersimp{r}{s}$ +\end{theorem} +\noindent - -%---------------------------------------------------------------------------------------- -% SECTION corretness proof -%---------------------------------------------------------------------------------------- -\section{Proof Technique of Correctness of Bit-coded Algorithm with Simplification} +\subsection{Comments on the Proof Techniques Used} The non-trivial part of proving the correctness of the algorithm with simplification compared with not having simplification is that we can no longer use the argument in \cref{flex_retrieve}. -The function \retrieve needs the structure of the annotated regular expression to +The function \retrieve needs the cumbersome structure of the (umsimplified) +annotated regular expression to agree with the structure of the value, but simplification will always mess with the -structure: -%TODO: after simp does not agree with each other: (a + 0) --> a v.s. Left(Char(a)) \ No newline at end of file +structure. + +We also tried to prove $\bsimp{\bderssimp{a}{s}} = \bsimp{a\backslash s}$, +but this turns out to be not true, A counterexample of this being +\[ r = [(1+c)\cdot [aa \cdot (1+c)]] \land s = aa +\] + +Then we would have $\bsimp{a \backslash s}$ being +$_{[]}(_{ZZ}\ONE + _{ZS}c ) $ +whereas $\bsimp{\bderssimp{a}{s}}$ would be +$_{Z}(_{Z} \ONE + _{S} c)$. +Unfortunately if we apply $\textit{bsimp}$ at different +stages we will always have this discrepancy, due to +whether the $\map \; (\fuse\; bs) \; as$ operation in $\textit{bsimp}$ +is taken at some points will be entirely dependant on when the simplification +take place whether there is a larger alternative structure surrounding the +alternative being simplified. +The good thing about $\stackrel{*}{\rightsquigarrow} $ is that it allows +us not specify how exactly the "atomic" simplification steps $\rightsquigarrow$ +are taken, but simply say that they can be taken to make two similar +regular expressions equal, and can be done after interleaving derivatives +and simplifications. + + +Having correctness property is good. But we would also like the lexer to be efficient in +some sense, for exampe, not grinding to a halt at certain cases. +In the next chapter we shall prove that for a given $r$, the internal derivative size is always +finitely bounded by a constant.