$\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\
$\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$
\end{tabular}
\end{center}
\noindent
They suggested that the $\simpsulz$ function should be
applied repeatedly until a fixpoint is reached.
We call this construction $\textit{sulzSimp}$:
\begin{center}
\begin{tabular}{lcl}
$\textit{sulzSimp} \; r$ & $\dn$ &
$\textit{while}((\simpsulz \; r)\; \cancel{=} \; r)$ \\
& & $\quad r := \simpsulz \; r$\\
& & $\textit{return} \; r$
\end{tabular}
\end{center}
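This while loop is ordinary fixpoint iteration. As a sketch (our own, not Sulzmann and Lu's actual code), the generic pattern in Scala is the following, where the one-step simplification function is passed in as a parameter (the name `simpSulz` below is a hypothetical stand-in):

```scala
// Generic fixpoint iteration: keep applying f until the result
// stops changing. sulzSimp is this loop instantiated with the
// one-step simplification function.
def fixpoint[A](f: A => A)(a: A): A = {
  val b = f(a)
  if (b == a) a else fixpoint(f)(b)
}
```

Instantiated as `fixpoint(simpSulz _)(r)` this computes $\textit{sulzSimp}\;r$, provided the one-step function eventually reaches a fixpoint; if it oscillates between two expressions the loop diverges.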
|
We call the operation of alternately
applying derivatives and simplifications
(until the string is exhausted) the Sulz-simp-derivative,
written $\backslash_{sulzSimp}$:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{sulzSimp} (c\!::\!s) $ & $\dn$ & $(\textit{sulzSimp} \; (r \backslash c)) \backslash_{sulzSimp}\, s$ \\
$r \backslash_{sulzSimp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
After the derivatives have been taken, the bitcodes
are extracted and decoded in the same manner
as in $\blexer$:
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_sulzSimp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{sulzSimp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
We implemented this lexing algorithm in Scala,
and found that the final derivative regular expression
size grows exponentially fast:
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
ylabel={size},
ymode = log,
legend entries={Final Derivative Size},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexer.data};
\end{axis}
\end{tikzpicture}
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexer}
\end{figure}
\noindent
At $n= 20$ we already get an out-of-memory error with Scala's normal
JVM heap size settings.
In fact their simplification does not improve over
the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
The time required also grows exponentially:
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
ylabel={time},
ymode = log,
legend entries={time in secs},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {SulzmannLuLexerTime.data};
\end{axis}
\end{tikzpicture}
\caption{Lexing the regular expression $(a^*a^*)^*$ against strings of the form
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
$ using Sulzmann and Lu's lexer}\label{SulzmannLuLexerTime}
\end{figure}
\noindent
This seems like a counterexample to
their linear complexity claim:
\begin{quote}\it
Linear-Time Complexity Claim\\
It is easy to see that each call of one of the functions/operations:
$\backslash$, simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input.
\end{quote}
\noindent
The assumption that the size of the regular expressions
in the algorithm
would stay below a finite constant is not true.
In addition, even if the regular expression sizes
do stay finite, one has to take into account that
the $\simpsulz$ function is applied many times
in each derivative step, and that number is not necessarily
a constant with respect to the size of the regular expression.
To not get ``caught off guard'' by
such counterexamples,
one needs to be more careful when designing the
simplification function and making claims about it.

\section{Our $\textit{Simp}$ Function}
|
We will now introduce our simplification function
by contrasting it with $\simpsulz$.
We describe
the ideas behind the components of their algorithm
and why they fail to achieve the desired effect, followed
by our solution. These solutions come with correctness
statements that are backed up by formal proofs.
\subsection{Flattening Nested Alternatives}
The idea behind the clause
\begin{center}
$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
\end{center}
is that it allows
duplicate removal of regular expressions at different
levels.
For example, this would help with the
following simplification:
\begin{center}
$(a+r)+r \longrightarrow a+r$
\end{center}
The problem here is that only the head element
is ``spilled out'',
whereas we would want to flatten
an entire list to open up possibilities for further simplifications.\\
Not flattening the rest of the elements also means that
the later de-duplication process
does not fully remove apparent duplicates.
For example,
using $\simpsulz$ we could not
simplify
\begin{center}
$((a^* a^*)+ (a^* + a^*))\cdot (a^*a^*)^*+
((a^*a^*)+a^*)\cdot (a^*a^*)^*$
\end{center}
because the duplicates are not in the first element
of the alternative.\\
We define a flatten operation that flattens not only
the first regular expression of an alternative,
but the entire list:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\
$\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\
$\textit{flts} \; a :: as'$ & $\dn$ & $a :: \textit{flts} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}
\noindent
Our $\flts$ operation
also throws away $\ZERO$s,
as they do not contribute to a lexing result.
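A Scala transcription of $\flts$ might look as follows; the `ARexp` datatype here is a cut-down stand-in for the bitcoded regular expressions (the names are ours, not taken from the formalisation):

```scala
// A cut-down bitcoded regex datatype, just enough to express flts.
// Bitcodes are modelled as lists of booleans.
sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Boolean]) extends ARexp
case class AALTS(bs: List[Boolean], rs: List[ARexp]) extends ARexp

// fuse prepends a bitcode onto the top-level annotation.
def fuse(bs: List[Boolean], r: ARexp): ARexp = r match {
  case AZERO          => AZERO
  case AONE(bs1)      => AONE(bs ++ bs1)
  case AALTS(bs1, rs) => AALTS(bs ++ bs1, rs)
}

// flts spills the children of every nested alternative into the
// surrounding list (fusing the inner bitcode onto them) and drops AZEROs.
def flts(rs: List[ARexp]): List[ARexp] = rs match {
  case Nil                    => Nil
  case AALTS(bs, rs1) :: rest => rs1.map(fuse(bs, _)) ++ flts(rest)
  case AZERO :: rest          => flts(rest)
  case r :: rest              => r :: flts(rest)
}
```

Note that, exactly as in the definition above, only one level of nesting is opened up per call: the spilled-out children are not themselves flattened further.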
\subsection{Duplicate Removal}
After flattening is done, we are ready to de-duplicate.
The de-duplication function is called $\distinctBy$,
and it is where we make our second improvement over
Sulzmann and Lu's simplification.
The process goes as follows:
\begin{center}
$rs \stackrel{\textit{flts}}{\longrightarrow}
rs_{flat}
\xrightarrow{\distinctBy \;
rs_{flat} \; \rerases\; \varnothing} rs_{distinct}$
%\stackrel{\distinctBy \;
%rs_{flat} \; \erase\; \varnothing}{\longrightarrow} \; rs_{distinct}$
\end{center}
where the $\distinctBy$ function is defined as:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\distinctBy \; [] \; f\; acc $ & $=$ & $ []$\\
$\distinctBy \; (x :: xs) \; f \; acc$ & $=$ & $\quad \textit{if}\; (f \; x \in acc)\;\; \textit{then} \;\; \distinctBy \; xs \; f \; acc$\\
& & $\quad \textit{else}\;\; x :: (\distinctBy \; xs \; f \; (\{f \; x\} \cup acc))$
\end{tabular}
\end{center}
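This definition transcribes almost verbatim into Scala; the following is a sketch with a set accumulator, matching the equations above:

```scala
// distinctBy keeps the first occurrence of each element, where
// "sameness" is judged through the projection f; acc records the
// projections seen so far.
def distinctBy[A, B](xs: List[A], f: A => B,
                     acc: Set[B] = Set.empty[B]): List[A] =
  xs match {
    case Nil => Nil
    case x :: rest =>
      if (acc.contains(f(x))) distinctBy(rest, f, acc)
      else x :: distinctBy(rest, f, acc + f(x))
  }
```

Calling it with an erasure function as `f` removes regular expressions that differ only in their bitcodes, which is exactly the use made of it below.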
\noindent
The reason we define the distinct function under a mapping $f$ is that
we want to eliminate regular expressions that are syntactically the same,
but with different bit-codes.
For example, we can remove the second $a^*a^*$ from
$_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it
represents a match with a shorter initial sub-match
(and therefore is definitely not POSIX),
and will be discarded by $\bmkeps$ later:
\begin{center}
$_{ZSZ}\underbrace{a^*}_{ZS:\; match \; 1\; time\quad}\underbrace{a^*}_{Z: \;match\; 1 \;time} +
_{SZZ}\underbrace{a^*}_{S: \; match \; 0 \; times\quad}\underbrace{a^*}_{ZZ: \; match \; 2 \; times}
$
\end{center}
%$_{bs1} r_1 + _{bs2} r_2 \text{where} (r_1)_{\downarrow} = (r_2)_{\downarrow}$
|
Due to the way our algorithm works,
the matches that conform to the POSIX standard
will always be placed further to the left. When we
traverse the list from left to right,
regular expressions we have already seen
will definitely not contribute to a POSIX value,
even if they are attached with different bitcodes.
These duplicates therefore need to be removed.
To achieve this, we use $\rerases$ as the function $f$ during the distinction
operation.\\
$\rerases$ is very similar to $\erase$, except that it preserves the structure
when erasing an alternative regular expression.
The reason why we use $\rerases$ instead of $\erase$ is that
it keeps the structure of alternative
annotated regular expressions,
whereas $\erase$ would turn it back into a binary structure.
Not having to mess with the structure
greatly simplifies the finiteness proof in chapter \ref{Finite}.
We give the definition of $\rerases$ here together with
the new datatype used by $\rerases$ (as our plain
regular expression datatype does not allow non-binary alternatives),
and explain in detail
why we want it in the next chapter.
For the moment the reader can just think of
$\rerases$ as $\erase$ and $\rrexp$ as plain regular expressions.
\[ \rrexp ::= \RZERO \mid \RONE
\mid \RCHAR{c}
\mid \RSEQ{r_1}{r_2}
\mid \RALTS{rs}
\mid \RSTAR{r}
\]
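As a datatype, $\rrexp$ corresponds to a plain regex ADT whose alternative constructor takes a list of children. A Scala rendering (the constructor names mirror the grammar above; the Scala naming itself is ours) could be:

```scala
// rrexp as a Scala ADT: like plain regular expressions, except that
// alternatives hold a list of children rather than exactly two.
sealed trait RRexp
case object RZERO extends RRexp
case object RONE extends RRexp
case class RCHAR(c: Char) extends RRexp
case class RSEQ(r1: RRexp, r2: RRexp) extends RRexp
case class RALTS(rs: List[RRexp]) extends RRexp
case class RSTAR(r: RRexp) extends RRexp
```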
|
The notation for $\rerases$ also follows that of $\erase$,
which is a postfix operator written as a subscript,
except that it has an \emph{r} attached to it to distinguish it from $\erase$:
\begin{center}
\begin{tabular}{lcl}
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
\end{tabular}
\end{center}

\subsection{Putting Things Together}
A recursive definition of our simplification function
is given below:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{bsimp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ \textit{bsimp}_{ASEQ} \; bs \;(\textit{bsimp} \; a_1) \; (\textit{bsimp} \; a_2) $ \\
$\textit{bsimp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{bsimp}_{ALTS} \; \textit{bs} \; (\textit{distinctBy} \; ( \textit{flatten} ( \textit{map} \; bsimp \; as)) \; \rerases \; \varnothing) $ \\
$\textit{bsimp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification function (named $\textit{bsimp}$ for \emph{b}it-coded)
performs a pattern match on the regular expression.
When it detects that the regular expression is an alternative or
a sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
Current-level simplifications are handled by the function $\textit{bsimp}_{ASEQ}$,
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
\end{tabular}
\end{center}
We extend this from characters to strings:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
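Operationally this is just a left fold over the string, simplifying after every character derivative. A generic Scala sketch (with `der` and `simp` passed in as parameters, since their concrete definitions live elsewhere in the formalisation) is:

```scala
// Derivative-with-simplification over a string: fold over the
// characters, taking one derivative and simplifying at each step.
def dersSimp[R](r: R, s: List[Char],
                der: (R, Char) => R, simp: R => R): R =
  s.foldLeft(r)((acc, c) => simp(der(acc, c)))
```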

\noindent
The lexer that extracts bitcodes from the
derivatives with simplifications by our $\textit{bsimp}$ function
is called $\blexersimp$:
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small.

\subsection{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
For example,
with our simplification the
previous $(a^*a^*)^*$ example,
where $\simpsulz$ could not
stop the fast growth (over
3 million nodes just below $20$ input length),
will be reduced to just 15 nodes and stays constant, no matter how long the
input string is.
This is demonstrated in the graphs below.
\begin{figure}[H]
\begin{center}
\begin{tabular}{ll}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
\begin{axis}[
xlabel={$n$},
ylabel={derivative size},
width = 7cm,
height = 4cm,
legend entries={Lexer with $\simpsulz$},
legend pos= north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
\caption{Our Improvement over Sulzmann and Lu's in terms of size}
\end{figure}
\noindent
Given the size difference, it is not
surprising that our $\blexersimp$ significantly outperforms
$\textit{blexer\_sulzSimp}$.
In the next section we are going to establish the
first important property of our lexer: correctness.
%----------------------------------------------------------------------------------------
%	SECTION rewrite relation
%----------------------------------------------------------------------------------------
\section{Correctness of $\blexersimp$}
|
In this section we give the details
of the correctness proof of $\blexersimp$,
an important contribution of this thesis.\\
We first introduce the rewriting relation \emph{rrewrite}
($\rrewrite$) between two regular expressions,
which expresses an atomic
simplification step from the left-hand side
to the right-hand side.
We then prove properties about
this rewriting relation and its reflexive transitive closure.
Finally we leverage these properties to show
an equivalence between the internal data structures of
$\blexer$ and $\blexersimp$.

\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
In the $\blexer$'s correctness proof, we
did not directly derive the fact that $\blexer$ gives out the POSIX value,
but first proved that $\blexer$ is linked with $\lexer$.
Then we re-used
the correctness of $\lexer$
to obtain
\begin{center}
$(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.
\end{center}
Here we apply this
modularised technique again
by first proving that
$\blexersimp \; r \; s $
produces the same output as $\blexer \; r\; s$,
and then piecing it together with
$\blexer$'s correctness to achieve our main
theorem:\footnote{The case when
$s$ is not in $L \; r$ is routine to establish.}
\begin{center}
$(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$
\end{center}
\noindent
The overall idea for the proof
of $\blexer \;r \;s = \blexersimp \; r \;s$
is that the transition from $r$ to $\textit{bsimp}\; r$ can be
broken down into finitely many rewrite steps:
\begin{center}
$r \rightsquigarrow^* \textit{bsimp} \; r$
\end{center}
where each rewrite step, written $\rightsquigarrow$,
is an ``atomic'' simplification that
cannot be broken down any further:
|
\begin{figure}[H]
\begin{mathpar}
\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}

\inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\}

\inferrule * [Right = $S_1$]{\vspace{0em}}{_{bs_1} ((_{bs_2} \ONE) \cdot r) \rightsquigarrow \fuse \; (bs_1 @ bs_2) \; r\\}\\

\inferrule * [Right = $SL$] {\\ r_1 \rightsquigarrow r_2}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_2 \cdot r_3\\}

\inferrule * [Right = $SR$] {\\ r_3 \rightsquigarrow r_4}{_{bs} r_1 \cdot r_3 \rightsquigarrow _{bs} r_1 \cdot r_4\\}\\

\inferrule * [Right = $A0$] {\vspace{0em}}{ _{bs}\sum [] \rightsquigarrow \ZERO}

\inferrule * [Right = $A1$] {\vspace{0em}}{ _{bs}\sum [a] \rightsquigarrow \fuse \; bs \; a}

\inferrule * [Right = $AL$] {\\ rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{_{bs}\sum rs_1 \rightsquigarrow \; _{bs}\sum rs_2}

\inferrule * [Right = $LE$] {\vspace{0em}}{ [] \stackrel{s}{\rightsquigarrow} []}

\inferrule * [Right = $LT$] {rs_1 \stackrel{s}{\rightsquigarrow} rs_2}{ r :: rs_1 \stackrel{s}{\rightsquigarrow} r :: rs_2 }

\inferrule * [Right = $LH$] {r_1 \rightsquigarrow r_2}{ r_1 :: rs \stackrel{s}{\rightsquigarrow} r_2 :: rs}

\inferrule * [Right = $L\ZERO$] {\vspace{0em}}{\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs}

\inferrule * [Right = $LS$] {\vspace{0em}}{(_{bs} \sum rs_1) :: rs_b \stackrel{s}{\rightsquigarrow} ((\map \; (\fuse \; bs) \; rs_1) @ rs_b) }

\inferrule * [Right = $LD$] {\\ \rerase{a_1} = \rerase{a_2}}{rs_a @ [a_1] @ rs_b @ [a_2] @ rs_c \stackrel{s}{\rightsquigarrow} rs_a @ [a_1] @ rs_b @ rs_c}
\end{mathpar}
\caption{
The rewrite rules that generate simplified regular expressions
in small steps: $r_1 \rightsquigarrow r_2$ is for bitcoded regular expressions
and $rs_1 \stackrel{s}{\rightsquigarrow} rs_2$ for
lists of bitcoded regular expressions.
Interesting is the $LD$ rule, which allows copies of regular expressions
to be removed provided a regular expression
earlier in the list can match the same strings.
}\label{rrewriteRules}
\end{figure}
398 relations: |
569 \noindent |
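For instance (an illustrative example of ours, not part of the formal development),
two copies of the same bitcoded regular expression that differ only in their
bit annotations satisfy the side condition of $LD$, and therefore the later copy
can be deleted:
\begin{center}
$rs_a @ [_{bs_1}(c \cdot d)] @ rs_b @ [_{bs_2}(c \cdot d)] \;\stackrel{s}{\rightsquigarrow}\; rs_a @ [_{bs_1}(c \cdot d)] @ rs_b$
\end{center}
\noindent
because $\rerase{_{bs_1}(c \cdot d)} = \rerase{_{bs_2}(c \cdot d)}$.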
The rules such as $LT$ and $LH$ are for rewriting between two regular expression lists
such that one regular expression
in the left-hand-side list is rewritable in one step
to the right-hand-side's regular expression at the same position.
This helps with defining the ``context rules'' such as $AL$.\\
The reflexive transitive closures of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
are defined in the usual way:
\begin{figure}[H]
\centering
\begin{mathpar}
\inferrule{\vspace{0em}}{ r \rightsquigarrow^* r }

\inferrule{\vspace{0em}}{ rs \stackrel{s*}{\rightsquigarrow} rs }

\inferrule{r_1 \rightsquigarrow r_2 \land \; r_2 \rightsquigarrow^* r_3}{r_1 \rightsquigarrow^* r_3}

\inferrule{rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \land \; rs_2 \stackrel{s*}{\rightsquigarrow} rs_3}{rs_1 \stackrel{s*}{\rightsquigarrow} rs_3}
\end{mathpar}
\caption{The reflexive transitive closures of
$\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
\end{figure}
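\noindent
As a small example of ours, two applications of the $L\ZERO$ rule can be chained
through the closure: from the single steps
$\ZERO :: \ZERO :: rs \stackrel{s}{\rightsquigarrow} \ZERO :: rs$ and
$\ZERO :: rs \stackrel{s}{\rightsquigarrow} rs$ we obtain
\begin{center}
$\ZERO :: \ZERO :: rs \stackrel{s*}{\rightsquigarrow} rs.$
\end{center}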
Three properties connect these rewrite relations with $\blexersimp$.
First, the simplification function $\bsimp$ only transforms the regular
expression $r$ using steps specified by $\rightsquigarrow^*$ and nothing else:
\begin{center}
$r \rightsquigarrow^* \bsimp{r}$
\end{center}
Second, two rewritable terms will remain rewritable to each other
even after a derivative is taken:
\begin{center}
$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
\end{center}
And finally, if two terms are rewritable to each other,
then they produce the same bitcodes:
\begin{center}
$(r \rightsquigarrow^* r' \land \bnullable \; r) \implies \bmkeps \; r = \bmkeps \; r'$
\end{center}
The decoding phase of $\blexer$ and $\blexersimp$
is the same, which means that if they get the same
bitcodes before the decoding phase,
they get the same value after decoding is done.
We will prove these three properties
in the next sub-section.
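To preview how the three properties fit together (a sketch of ours, anticipating
the correctness argument), the first two properties give by an induction on the
string $s$ that the simplified derivative remains reachable by rewriting from the
plain derivative, and the third property then equates the extracted bitcodes:
\begin{center}
$(r \backslash s) \rightsquigarrow^* (r \backslash_{bsimps} s)
\quad\textit{and therefore}\quad
\bmkeps \; (r \backslash s) = \bmkeps \; (r \backslash_{bsimps} s)$
\end{center}
\noindent
where we write $\backslash_{bsimps}$ for the simplifying derivative used by
$\blexersimp$, assuming both sides are $\bnullable$.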
\subsection{Important Properties of $\rightsquigarrow$}
First we prove some basic facts
about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$,
$\rightsquigarrow^*$ and $\stackrel{s*}{\rightsquigarrow}$,
which will be needed later.\\
The inference rules of Figure~\ref{rrewriteRules}
have their ``many-steps versions'':
\begin{lemma}
\hspace{0em}
\begin{itemize}
\item
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies _{bs} \sum rs_1 \rightsquigarrow^* \; _{bs} \sum rs_2$
\item
$r \rightsquigarrow^* r' \implies _{bs} \sum (r :: rs) \rightsquigarrow^* \; _{bs} \sum (r' :: rs)$
\end{itemize}
\end{lemma}
\begin{proof}
By an induction on
the inductive cases of $\stackrel{s*}{\rightsquigarrow}$ and $\rightsquigarrow^*$ respectively.
\end{proof}
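\noindent
For instance (our own illustration), since $[\ZERO, r] \stackrel{s}{\rightsquigarrow} [r]$
holds by the $L\ZERO$ rule and therefore $[\ZERO, r] \stackrel{s*}{\rightsquigarrow} [r]$,
the first part of this lemma yields
\begin{center}
$_{bs} \sum [\ZERO, r] \;\rightsquigarrow^*\; _{bs} \sum [r].$
\end{center}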
\noindent
The inference rules of $\stackrel{s}{\rightsquigarrow}$
are defined in terms of the list cons operation; here
we establish that the
$\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$
relations are also preserved w.r.t.\ appending and prepending of a list:
\begin{lemma}\label{ssgqTossgs}
\hspace{0em}
\begin{itemize}
\item
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs @ rs_1 \stackrel{s}{\rightsquigarrow} rs @ rs_2$
\item
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies
rs @ rs_1 \stackrel{s*}{\rightsquigarrow} rs @ rs_2$
\item
The $\stackrel{s}{\rightsquigarrow}$ relation after appending
a list becomes $\stackrel{s*}{\rightsquigarrow}$:\\
$rs_1 \stackrel{s}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
\end{itemize}
\end{lemma}
\begin{proof}
The first part is by induction on the list $rs$.
The second part is by induction on the inductive cases of $\stackrel{s*}{\rightsquigarrow}$.
The third part is by rule induction of $\stackrel{s}{\rightsquigarrow}$.
\end{proof}
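\noindent
For example (our own illustration), instantiating the lemma with the single step
$[\ZERO, r] \stackrel{s}{\rightsquigarrow} [r]$, the first and third parts give
\begin{center}
$rs @ [\ZERO, r] \stackrel{s}{\rightsquigarrow} rs @ [r]
\qquad\textit{and}\qquad
[\ZERO, r] @ rs \stackrel{s*}{\rightsquigarrow} [r] @ rs.$
\end{center}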
\begin{lemma}
$rs_1 \stackrel{s*}{\rightsquigarrow} rs_2 \implies rs_1 @ rs \stackrel{s*}{\rightsquigarrow} rs_2 @ rs$
\end{lemma}
\begin{proof}