diff -r a5f666410101 -r fd068f39ac23 ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Sat Sep 10 22:30:22 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Mon Sep 12 23:32:18 2022 +0200
@@ -11,39 +11,43 @@
-In this chapter we introduce the simplifications
+In this chapter we introduce simplifications
 on annotated regular expressions that can be applied to
 each intermediate derivative result. This allows
 us to make $\blexer$ much more efficient.
-We contrast this simplification function
-with Sulzmann and Lu's original
-simplifications, indicating the simplicity of our algorithm and
-improvements we made, demostrating
-the usefulness and reliability of formal proofs on algorithms.
+Sulzmann and Lu already had some bit-coded simplifications,
+but their simplification functions were inefficient.
+We contrast our simplification function
+with Sulzmann and Lu's, indicating the simplicity of our algorithm.
+This is another case demonstrating the usefulness
+and reliability of formal proofs about algorithms.
 These ``aggressive'' simplifications would not be possible in the injection-based
 lexing we introduced in chapter \ref{Inj}.
-We then go on to prove the correctness with the improved version of
+We then prove the correctness of the improved version of
 $\blexer$, called $\blexersimp$, by establishing
 $\blexer \; r \; s = \blexersimp \; r \; s$ using a term rewriting system.
 \section{Simplifications by Sulzmann and Lu}
-The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
-and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
-are scattered around different levels, and therefore requires
-de-duplication at different levels:
+Consider the derivatives of examples such as $(a^*a^*)^*$
+and $(a^* + (aa)^*)^*$:
 \begin{center}
 $(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
 \end{center}
 \noindent
-As we have already mentioned in \ref{eqn:growth2},
-a simple-minded simplification function cannot simplify
+As can be seen, there is a lot of duplication
+in the example we have already mentioned in
+\ref{eqn:growth2}.
+A simple-minded simplification function cannot simplify
 the third regular expression in the above chain of derivative
-regular expressions:
+regular expressions, namely
 \begin{center}
 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
 \end{center}
-one would expect a better simplification function to work in the
+because the duplicates are
+not next to each other and therefore the rule
+$r + r \rightarrow r$ does not fire.
+One would expect a better simplification function to work in the
 following way:
 \begin{gather*}
 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
@@ -61,10 +65,16 @@
 )\cdot(a^*a^*)^*
 \end{gather*}
 \noindent
-This motivating example came from testing Sulzmann and Lu's
+In the first step, the nested alternative regular expression
+$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
+Now the third term $a^*$ is clearly identified as a duplicate
+and therefore removed in the second step. This causes the two
+top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$
+to be removed in the final step.\\
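+Before looking at Sulzmann and Lu's own simplification, the three steps
+just described can be sketched in Scala. This is only an illustrative
+toy under our own naming (plain regular expressions without
+bit-annotations; not the verified definitions of this thesis):
+\begin{verbatim}
+sealed trait Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHR(c: Char) extends Rexp
+case class ALTS(rs: List[Rexp]) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// step 1: flatten nested alternatives into a single list
+def flts(rs: List[Rexp]): List[Rexp] = rs match {
+  case Nil => Nil
+  case ZERO :: tl => flts(tl)
+  case ALTS(rs1) :: tl => rs1 ::: flts(tl)
+  case r :: tl => r :: flts(tl)
+}
+
+// steps 2 and 3: simplify bottom-up, then remove duplicates
+def simp(r: Rexp): Rexp = r match {
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, s2) => s2
+    case (s1, ONE) => s1
+    case (s1, s2) => SEQ(s1, s2)
+  }
+  case ALTS(rs) => flts(rs.map(simp)).distinct match {
+    case Nil => ZERO
+    case s :: Nil => s
+    case rs1 => ALTS(rs1)
+  }
+  case _ => r
+}
+\end{verbatim}
+\noindent
+Because \texttt{flts} spills out every nested alternative and
+\texttt{distinct} keeps only the first copy of each element, applying
+\texttt{simp} to the expression above yields
+$(a^*a^*+a^*)\cdot(a^*a^*)^*$, the same result as the three manual
+steps.\\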
+This motivating example is from testing Sulzmann and Lu's
 algorithm: their simplification does not work!
-We quote their $\textit{simp}$ function verbatim here:
+Consider their simplification (using our notation):
 \begin{center}
 \begin{tabular}{lcl}
 $\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
@@ -85,10 +95,10 @@
 \end{tabular}
 \end{center}
 \noindent
-the $\textit{zeroable}$ predicate
-which tests whether the regular expression
+where the $\textit{zeroable}$ predicate,
+which tests whether the regular expression
 is equivalent to $\ZERO$,
-is defined as:
+can be defined as:
 \begin{center}
 \begin{tabular}{lcl}
 $\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
@@ -190,6 +200,23 @@
 The assumption that the size of the regular expressions
 in the algorithm
 would stay below a finite constant is not true.
+The main reason behind this is that (i) the $\textit{nub}$
+function requires identical annotations between two
+annotated regular expressions to qualify as duplicates,
+and cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$
+even if both $a^*$ denote the same language, and
+(ii) the ``flattening'' only applies to the head of the list
+in the
+\begin{center}
+  \begin{tabular}{lcl}
+
+  $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+  $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
+  \end{tabular}
+\end{center}
+\noindent
+clause, and therefore is not thorough enough to simplify all
+needed parts of the regular expression.\\
 In addition to that, even if the sizes of the regular expressions do
 stay finite, one has to take into account that the $\simpsulz$
 function is applied many times
@@ -209,14 +236,14 @@
 by our solution. These solutions come with correctness
 statements that are backed up by formal proofs.
 \subsection{Flattening Nested Alternatives}
-The idea behind the
+The idea behind the clause
 \begin{center}
 $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
 _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
 \end{center}
-clause is that it allows
+is that it allows
 duplicate removal of regular expressions at different
-levels.
+``levels'' of alternatives.
 For example, this would help with the
 following simplification:
@@ -226,15 +253,15 @@
 The problem here is that only the head element
 is ``spilled out'',
 whereas we would want to flatten
-an entire list to open up possibilities for further simplifications.\\
+an entire list to open up possibilities for further simplifications.
 Not flattening the rest of the elements also means that
 the later de-duplication process
-does not fully remove apparent duplicates.
+does not fully remove further duplicates.
 For example,
 using $\simpsulz$ we could not
 simplify
 \begin{center}
-$((a^* a^*)+ (a^* + a^*))\cdot (a^*a^*)^*+
+  $((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$
 \end{center}
 due to the underlined part not being in the first element
@@ -427,17 +454,16 @@
 This algorithm keeps the regular expression size small.
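+A hypothetical Scala rendering of the de-duplication step may make the
+contrast with $\textit{nub}$ clearer. The names \texttt{distinctBy} and
+\texttt{erase} are ours, and the verified Isabelle definition differs
+in details:
+\begin{verbatim}
+// keep the first copy of each element, where "sameness" is
+// measured through a projection f; instantiating f with a
+// function erasing bit-annotations makes differently
+// annotated copies of a* count as duplicates
+def distinctBy[A, B](xs: List[A], f: A => B,
+                     seen: Set[B] = Set()): List[A] = xs match {
+  case Nil => Nil
+  case x :: tl =>
+    if (seen.contains(f(x))) distinctBy(tl, f, seen)
+    else x :: distinctBy(tl, f, seen + f(x))
+}
+\end{verbatim}
+\noindent
+Calling \texttt{distinctBy(rs, erase)} instead of \texttt{nub(rs)}
+recognises $_{SZZ}a^*$ and $_{SZS}a^*$ as duplicates, which addresses
+issue (i) discussed above.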
-\subsection{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against
-$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
-For example,
-with our simplification the
+\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
+After Simplification}
+The previous $(a^*a^*)^*$ example where $\simpsulz$ could not
-stop the fast growth (over
+prevent the fast growth (over
 3 million nodes just below $20$ input length)
-will be reduced to just 15 and stays constant, no matter how long the
+will now be reduced by our simplification to just 15 nodes, staying
+constant no matter how long the
 input string is.
-This is demonstrated in the graphs below.
+This is shown in the graphs below.
 \begin{figure}[H]
 \begin{center}
 \begin{tabular}{ll}
@@ -482,12 +508,11 @@
 \section{Correctness of $\blexersimp$}
 In this section we give details of the correctness proof of $\blexersimp$,
-an important contribution of this thesis.\\
+one of the contributions of this thesis.\\
 We first introduce the rewriting relation \emph{rrewrite}
 ($\rrewrite$) between two regular expressions,
 which expresses an atomic
-simplification step from the left-hand-side
-to the right-hand-side.
+simplification.
 We then prove properties about
 this rewriting relation and its reflexive transitive closure.
 Finally we leverage these properties to show
@@ -496,7 +521,7 @@
 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
 In $\blexer$'s correctness proof, we
-did not directly derive the fact that $\blexer$ gives out the POSIX value,
+did not directly derive the fact that $\blexer$ generates the POSIX value,
 but first proved that $\blexer$ is linked with $\lexer$.
 Then we re-use
 the correctness of $\lexer$
@@ -511,8 +536,8 @@
 produces the same output as $\blexer \; r\; s$,
 and then piecing it together with $\blexer$'s correctness to achieve our main
-theorem:\footnote{ the case when
-$s$ is not in $L \; r$, is routine to establish }
+theorem:\footnote{The case when
+$s$ is not in $L \; r$ is routine to establish.}
 \begin{center}
 $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$
 \end{center}
@@ -526,7 +551,7 @@
 \end{center}
 where each rewrite step, written $\rightsquigarrow$,
 is an ``atomic'' simplification that
-cannot be broken down any further:
+is similar to a small-step reduction in operational semantics:
 \begin{figure}[H]
 \begin{mathpar}
 \inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
@@ -592,8 +617,10 @@
 \caption{The Reflexive Transitive Closure of
 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
 \end{figure}
-Two rewritable terms will remain rewritable to each other
-even after a derivative is taken:
+%Two rewritable terms will remain rewritable to each other
+%even after a derivative is taken:
+Rewriting is preserved under derivatives,
+namely
 \begin{center}
 $r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
 \end{center}
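+To convey this small-step flavour, the rewrite relation can be
+approximated by a one-step function in Scala. The sketch below is our
+own unverified code: the datatype mirrors the annotated regular
+expressions, and only a few sequence rules in the spirit of
+$S\ZERO_l$ are shown:
+\begin{verbatim}
+type Bits = List[Boolean]
+sealed trait ARexp
+case object AZERO extends ARexp
+case class AONE(bs: Bits) extends ARexp
+case class ACHR(bs: Bits, c: Char) extends ARexp
+case class ASEQ(bs: Bits, r1: ARexp, r2: ARexp) extends ARexp
+case class AALTS(bs: Bits, rs: List[ARexp]) extends ARexp
+case class ASTAR(bs: Bits, r: ARexp) extends ARexp
+
+// attach bits to the front of an annotated regular expression
+def fuse(bs: Bits, r: ARexp): ARexp = r match {
+  case AZERO => AZERO
+  case AONE(bs1) => AONE(bs ++ bs1)
+  case ACHR(bs1, c) => ACHR(bs ++ bs1, c)
+  case ASEQ(bs1, r1, r2) => ASEQ(bs ++ bs1, r1, r2)
+  case AALTS(bs1, rs) => AALTS(bs ++ bs1, rs)
+  case ASTAR(bs1, r1) => ASTAR(bs ++ bs1, r1)
+}
+
+// one atomic rewrite step at the root; None if no rule applies
+def rewrite(r: ARexp): Option[ARexp] = r match {
+  case ASEQ(_, AZERO, _) => Some(AZERO)   // rule "S ZERO l" above
+  case ASEQ(_, _, AZERO) => Some(AZERO)
+  case ASEQ(bs, AONE(bs1), r2) => Some(fuse(bs ++ bs1, r2))
+  case _ => None
+}
+\end{verbatim}
+\noindent
+Repeatedly applying such steps, also inside contexts, gives the
+reflexive transitive closure $\rightsquigarrow^*$ for which the
+preservation property above is stated.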
@@ -956,7 +983,7 @@
 \subsection{Comments on the Proof Techniques Used}
 Straightforward and simple as the proof may seem,
-the efforts we spent obtaining it was far from trivial.\\
+the efforts we spent obtaining it were far from trivial.\\
 We initially attempted to re-use the argument in \cref{flex_retrieve}.
 The problem was that both functions $\inj$ and $\retrieve$ require
@@ -1007,6 +1034,16 @@
 \end{center}
 as equal, because they were both re-written from the same expression.\\
+The simplification rewriting rules
+given in \ref{rrewriteRules} are by no means
+final; one could come up with new rules
+such as
+$\SEQ r_1 \cdot (\SEQ r_2 \cdot r_3) \rightarrow
+\SEQs [r_1, r_2, r_3]$.
+This does not fit with the proof technique
+of our main theorem, but seems not to violate the POSIX
+property.\\
 Having a correctness property is good.
 But we would also like a guarantee that the lexer is not slow in
 some sense, for example, not grinding to a halt regardless of the input.
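+One simple empirical check in that direction is to measure the sizes of
+the simplified derivatives directly. The following script-style Scala
+sketch reuses the toy \texttt{Rexp} and \texttt{simp} from the earlier
+sketch in this chapter (again our own unverified code):
+\begin{verbatim}
+// assumes Rexp, ZERO, ONE, CHR, ALTS, SEQ, STAR and simp
+// from the earlier toy sketch
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHR(_) => false
+  case ALTS(rs) => rs.exists(nullable)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO | ONE => ZERO
+  case CHR(d) => if (c == d) ONE else ZERO
+  case ALTS(rs) => ALTS(rs.map(der(c, _)))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALTS(List(SEQ(der(c, r1), r2), der(c, r2)))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHR(_) => 1
+  case ALTS(rs) => 1 + rs.map(size).sum
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+// sizes of simplified derivatives of (a*a*)* w.r.t. "aaa..."
+val astar = STAR(CHR('a'))
+var r: Rexp = STAR(SEQ(astar, astar))
+for (n <- 1 to 20) {
+  r = simp(der('a', r))
+  println(s"$n: ${size(r)}")
+}
+\end{verbatim}
+\noindent
+With \texttt{simp} switched on, the printed sizes become constant
+immediately (15 for this example, counting one node per constructor);
+with it switched off, they grow very quickly.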