diff -r dd9dde2d902b -r 80cc6dc4c98b ChengsongTanPhdThesis/Chapters/Bitcoded2.tex --- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Fri Dec 30 01:52:32 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Fri Dec 30 17:37:51 2022 +0000 @@ -16,7 +16,7 @@ each intermediate derivative result. This allows us to make $\blexer$ much more efficient. Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions, -but their simplification functions were inefficient and in some cases needs fixing. +but their simplification functions were inefficient and in some cases needed fixing. %We contrast our simplification function %with Sulzmann and Lu's, indicating the simplicity of our algorithm. %This is another case for the usefulness @@ -51,7 +51,7 @@ \end{center} because the duplicates are not next to each other and therefore the rule -$r+ r \rightarrow r$ does not fire. +$r+ r \rightarrow r$ from $\textit{simp}$ does not fire. One would expect a better simplification function to work in the following way: \begin{gather*} @@ -94,7 +94,7 @@ $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ $\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\ $\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum - (\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ + (\nub \; (\filter \; (\neg\zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ \end{tabular} \end{center} @@ -163,7 +163,7 @@ \noindent We implemented this lexing algorithm in Scala, and found that the final derivative regular expression -size grows exponentially fast (note the logarithmic scale): +size still grows exponentially (note the logarithmic scale): \begin{figure}[H] \centering \begin{tikzpicture} @@ -206,17 +206,18 @@ \end{figure} \noindent which seems like a counterexample for -their linear complexity claim: +Sulzmann and Lu's linear complexity claim +in their paper \cite{Sulzmann2014}: \begin{quote}\it -Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations: -simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input. +``Linear-Time Complexity Claim \\It is easy to see that each call of one of the functions/operations: +simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.'' \end{quote} \noindent The assumption that the size of the regular expressions in the algorithm -would stay below a finite constant is not true, not in the +would stay below a finite constant is not true, at least not in the examples we considered. 
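+
+\noindent
+The growth can be reproduced with a few lines of Scala. The following
+minimal sketch records the size of every intermediate simplified
+derivative; the names $\textit{bder}$, $\textit{simpSL}$,
+$\textit{asize}$ and $\textit{internalise}$ are hypothetical stand-ins
+for the bitcoded derivative, Sulzmann and Lu's simplification, the size
+function and the internalisation operation $(\cdot)^\uparrow$:
+\begin{lstlisting}[language=Scala]
+// sizes of all intermediate simplified derivatives of r along s;
+// scanLeft keeps every intermediate result, not just the final one
+def sizes(r: ARexp, s: List[Char]): List[Int] =
+  s.scanLeft(r)((a, c) => simpSL(bder(c, a))).map(asize)
+
+// e.g. sizes(internalise(/* (a + aa)* */ r0), ("a" * 20).toList)
+// yields a sequence that grows roughly exponentially, as in the graph
+\end{lstlisting}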
-The main reason behind this is that (i) the $\textit{nub}$ +The main reason behind this is that (i) Haskell's $\textit{nub}$ function requires identical annotations between two annotated regular expressions to qualify as duplicates, and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$ @@ -232,24 +233,24 @@ \end{center} \noindent clause, and therefore is not strong enough to simplify all -needed parts of the regular expression. Moreover, even if the regular expressions size -do stay finite, one has to take into account that -the $\textit{simp}\_{SL}$ function is applied many times -in each derivative step, and that number is not necessarily -a constant with respect to the size of the regular expression. +needed parts of the regular expression. Moreover, +the $\textit{simp}\_{SL}$ function is applied repeatedly +in each derivative step until a fixed point is reached, +which makes the algorithm even more +unpredictable and inefficient. %To not get ``caught off guard'' by %these counterexamples, %one needs to be more careful when designing the %simplification function and making claims about them. \section{Our $\textit{Simp}$ Function} -We will now introduce our simplification function. +We will now introduce our own simplification function. %by making a contrast with $\textit{simp}\_{SL}$. We also describe the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$ algorithm -and why it fails to achieve the desired effect. -Our simplification function comes with a formal +and why it fails to achieve the desired effect of keeping the sizes of derivatives finitely bounded. +In addition, our simplification function will come with a formal correctness proof. \subsection{Flattening Nested Alternatives} The idea behind the clause @@ -282,7 +283,7 @@ $((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+ ((a^*a^*)+a^*)\cdot (a^*a^*)^*$ \end{center} -due to the underlined part not being in the head +due to the underlined part not being the head of the alternative. We define our flatten operation so that it flattens @@ -400,7 +401,7 @@ \noindent The simplification (named $\textit{bsimp}$ for \emph{b}it-coded) does a pattern matching on the regular expression. -When it detected that the regular expression is an alternative or +When it detects that the regular expression is an alternative or sequence, it will try to simplify its children regular expressions recursively and then see if one of the children turns into $\ZERO$ or $\ONE$, which might trigger further simplification at the current level. @@ -459,15 +460,15 @@ \begin{center} \begin{tabular}{lcl} $\textit{blexer\_simp}\;r\,s$ & $\dn$ & - $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ + $\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ & & $\;\;\textit{else}\;\textit{None}$ \end{tabular} \end{center} - \noindent -This algorithm keeps the regular expression size small. +This algorithm keeps the regular expression size small, +as we shall demonstrate with some examples in the next section. 
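+
+\noindent
+Before turning to those examples, the following Scala sketch shows how
+the $\textit{bsimp}$ function described above might look in code. It
+assumes constructors $\textit{AZERO}$, $\textit{AONE}$, $\textit{ASEQ}$
+and $\textit{AALTS}$ for bitcoded regular expressions, together with
+the helper functions $\textit{fuse}$, $\textit{flts}$,
+$\textit{distinctBy}$ and $\textit{erase}$ from this chapter:
+\begin{lstlisting}[language=Scala]
+def bsimp(r: ARexp): ARexp = r match {
+  case ASEQ(bs, r1, r2) => (bsimp(r1), bsimp(r2)) match {
+    // a ZERO child annihilates the whole sequence
+    case (AZERO, _)       => AZERO
+    case (_, AZERO)       => AZERO
+    // a ONE on the left vanishes; its bits are fused onto the right
+    case (AONE(bs1), r2s) => fuse(bs ++ bs1, r2s)
+    case (r1s, r2s)       => ASEQ(bs, r1s, r2s)
+  }
+  case AALTS(bs, rs) =>
+    // simplify the children, flatten nested alternatives and remove
+    // duplicates modulo annotations, then tidy up degenerate cases
+    distinctBy(flts(rs.map(bsimp)), erase) match {
+      case Nil       => AZERO
+      case r1 :: Nil => fuse(bs, r1)
+      case rs1       => AALTS(bs, rs1)
+    }
+  case _ => r
+}
+\end{lstlisting}
+Note that, unlike $\textit{simp}\_{SL}$, this function is applied only
+once per derivative step rather than repeatedly until a fixed point is
+reached.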
\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
@@ -510,12 +511,12 @@
 \end{tikzpicture}
 \end{tabular}
 \end{center}
-\caption{Our Improvement over Sulzmann and Lu's in terms of size}
+\caption{Our Improvement over Sulzmann and Lu's in terms of size of the derivatives.}
 \end{figure}
 \noindent
 Given the size difference, it is not
 surprising that our $\blexersimp$ significantly outperforms
-$\textit{blexer\_SLSimp}$.
+Sulzmann and Lu's $\textit{blexer\_SLSimp}$.
 In the next section we are going
 to establish that our simplification preserves the correctness of the algorithm.
 %----------------------------------------------------------------------------------------
@@ -621,7 +622,8 @@
 such that one regular expression in the left-hand-side list is rewritable in one step
 to the right-hand-side's regular expression at the same position.
-This helps with defining the ``context rule'' $AL$.\\
+This helps with defining the ``context rule'' $AL$.
+
 The reflexive transitive closures of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
 are defined in the usual way:
 \begin{figure}[H]
@@ -775,7 +777,8 @@
 extract the same bitcodes using $\bmkeps$ from the nullable
 components of two regular expressions $r$ and $r'$,
 if we can rewrite from one to the other in finitely
-many steps.\\
+many steps.
+
 For convenience,
 we define a predicate for a list of regular expressions
 having at least one nullable regular expression:
@@ -850,7 +853,8 @@
 which says that our simplification's helper functions
 such as $\distinctBy$ and $\flts$ describe reducts of $\stackrel{s*}{\rightsquigarrow}$ and
-$\rightsquigarrow^* $.\\
+$\rightsquigarrow^* $.
+
 The first lemma to prove is a more general version of
 $rs_1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:
 \begin{lemma}
@@ -972,7 +976,7 @@
 derivative regular expressions to the lexer with simplification applied (by lemma \ref{bderBderssimp}):
 \begin{center}
-	$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
+	$a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $.
 \end{center}
 We know that they generate the same bits if the lexing result is a match:
 \begin{center}
@@ -998,8 +1002,8 @@
 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; r\;s = \None$.
 \end{center}
-and obtain the corollary that the bit-coded lexer with simplification is
-indeed correctly outputting POSIX lexing result, if such a result exists.
+and obtain the property that the bit-coded lexer with simplification is
+indeed correctly generating a POSIX lexing result, if such a result exists.
 \begin{corollary}
 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\
 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
@@ -1015,17 +1019,15 @@
 that the annotated regular expressions stay unsimplified,
 so that one can correctly compare $v_{i+1}$ and $r_i$ and
 $v_i$
-in diagram \ref{graph:inj} and
-``fit the key into the lock hole''.
+in diagram \ref{graph:inj}.
 
-\noindent
 We also tried to prove
 \begin{center}
 $\textit{bsimp} \;\; (\bderssimp{a}{s}) = \textit{bsimp} \;\; (a\backslash s)$,
 \end{center}
 but this turns out not to be true.
-A counterexample would be
+A counterexample is
 \[ a = [(_{Z}1+_{S}c)\cdot [bb \cdot (_{Z}1+_{S}c)]] \;\; \text{and} \;\; s = bb.
 \]
@@ -1046,7 +1048,8 @@
 we will always have this discrepancy.
 This is due to the $\map \; (\fuse\; bs) \; as$ operation
-happening at different locations in the regular expression.\\
+happening at different locations in the regular expression.
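+
+\noindent
+A small (hypothetical) Scala fragment makes the two locations concrete:
+the bits $bs$ can either stay on the alternative node itself or be
+pushed into every child by $\textit{fuse}$; both variants decode to the
+same value, but they are not syntactically equal:
+\begin{lstlisting}[language=Scala]
+// bs: Bits and r1, r2: ARexp are arbitrary (hypothetical) values
+val atNode  = AALTS(bs, List(r1, r2))         // bits kept at the node
+val fusedIn = AALTS(Nil, List(fuse(bs, r1),   // the same bits fused
+                              fuse(bs, r2)))  // into each child
+// atNode != fusedIn as terms, yet bmkeps extracts the same bitsequence
+// from both (whenever one of the children is nullable)
+\end{lstlisting}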
+ The rewriting relation $\rightsquigarrow^*$ allows us to ignore this discrepancy @@ -1058,7 +1061,8 @@ \end{center} as equal, because they were both re-written -from the same expression.\\ +from the same expression. + The simplification rewriting rules given in \ref{rrewriteRules} are by no means final,