--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Sat Nov 12 00:37:23 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Sat Nov 12 21:34:40 2022 +0000
@@ -12,32 +12,37 @@
In this chapter we introduce simplifications
-on annotated regular expressions that can be applied to
+for annotated regular expressions that can be applied to
each intermediate derivative result. This allows
us to make $\blexer$ much more efficient.
-Sulzmann and Lu already had some bit-coded simplifications,
-but their simplification functions were inefficient.
-We contrast our simplification function
-with Sulzmann and Lu's, indicating the simplicity of our algorithm.
-This is another case for the usefulness
-and reliability of formal proofs on algorithms.
-These ``aggressive'' simplifications would not be possible in the injection-based
-lexing we introduced in chapter \ref{Inj}.
-We then prove the correctness with the improved version of
-$\blexer$, called $\blexersimp$, by establishing
-$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
-
+Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
+but their simplification functions were inefficient and in some cases needs fixing.
+%We contrast our simplification function
+%with Sulzmann and Lu's, indicating the simplicity of our algorithm.
+%This is another case for the usefulness
+%and reliability of formal proofs on algorithms.
+%These ``aggressive'' simplifications would not be possible in the injection-based
+%lexing we introduced in chapter \ref{Inj}.
+%We then prove the correctness with the improved version of
+%$\blexer$, called $\blexersimp$, by establishing
+%$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
+%
\section{Simplifications by Sulzmann and Lu}
-Consider the derivatives of examples such as $(a^*a^*)^*$
-and $(a^* + (aa)^*)^*$:
+Consider the derivatives of the following example $(a^*a^*)^*$:
+%and $(a^* + (aa)^*)^*$:
\begin{center}
- $(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
- $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
+ \begin{tabular}{lcl}
+ $(a^*a^*)^*$ & $ \stackrel{\backslash a}{\longrightarrow}$ &
+ $ (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
+ &
+ $ \stackrel{\backslash a}{\longrightarrow} $ &
+ $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\
+ & $\stackrel{\backslash a}{
+ \longrightarrow} $ & $\ldots$\\
+ \end{tabular}
\end{center}
\noindent
-As can be seen, there is a lot of duplication
-in the example we have already mentioned in
-\ref{eqn:growth2}.
+As can be seen, there are serveral duplications.
A simple-minded simplification function cannot simplify
the third regular expression in the above chain of derivative
regular expressions, namely
@@ -52,52 +57,51 @@
\begin{gather*}
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\
- \bigg\downarrow \\
+ \bigg\downarrow (1) \\
(a^*a^* + a^*
\color{gray} + a^* \color{black})\cdot(a^*a^*)^* +
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\
- \bigg\downarrow \\
+ \bigg\downarrow (2) \\
(a^*a^* + a^*
)\cdot(a^*a^*)^*
\color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\
- \bigg\downarrow \\
+ \bigg\downarrow (3) \\
(a^*a^* + a^*
)\cdot(a^*a^*)^*
\end{gather*}
\noindent
In the first step, the nested alternative regular expression
$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
-Now the third term $a^*$ is clearly identified as a duplicate
-and therefore removed in the second step. This causes the two
+Now the third term $a^*$ can clearly be identified as a duplicate
+and therefore removed in the second step.
+This causes the two
top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$
-removed in the final step.\\
-This motivating example is from testing Sulzmann and Lu's
-algorithm: their simplification does
-not work!
-Consider their simplification (using our notations):
+removed in the final step.
+Sulzmann and Lu's simplification function (using our notations) can achieve this
+simplification:
\begin{center}
\begin{tabular}{lcl}
- $\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
+ $\textit{simp}\_{SL} \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
$\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\
& &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\
- $\simpsulz \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if}
+ $\textit{simp}\_{SL} \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if}
\; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\;
\textit{then} \;\; \ZERO$\\
- & & $\textit{else}\;\;_{bs}((\simpsulz \;r_1)\cdot
- (\simpsulz \; r_2))$\\
- $\simpsulz \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
- $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+ & & $\textit{else}\;\;_{bs}((\textit{simp}\_{SL} \;r_1)\cdot
+ (\textit{simp}\_{SL} \; r_2))$\\
+ $\textit{simp}\_{SL} \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\
+ $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
- $\simpsulz \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\simpsulz \; r)$\\
- $\simpsulz \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum
- (\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz \; r) :: \map \; \simpsulz \; rs)))$\\
+ $\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\
+ $\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum
+ (\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\
\end{tabular}
\end{center}
\noindent
-where the $\textit{zeroable}$ predicate
+The $\textit{zeroable}$ predicate
tests whether the regular expression
-is equivalent to $\ZERO$,
+is equivalent to $\ZERO$, and
can be defined as:
\begin{center}
\begin{tabular}{lcl}
@@ -111,25 +115,36 @@
\end{tabular}
\end{center}
\noindent
-They suggested that the $\simpsulz $ function should be
-applied repeatedly until a fixpoint is reached.
-We call this construction $\textit{sulzSimp}$:
+The
\begin{center}
\begin{tabular}{lcl}
- $\textit{sulzSimp} \; r$ & $\dn$ &
- $\textit{while}((\simpsulz \; r)\; \cancel{=} \; r)$ \\
- & & $\quad r := \simpsulz \; r$\\
+ $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+ $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
+ \end{tabular}
+\end{center}
+\noindent
+clause does flatten the alternative as required in step (1),
+but $\textit{simp}\_{SL}$ is insufficient if we want to do steps (2) and (3),
+as these ``identical'' terms have different bit-annotations.
+They also suggested that the $\textit{simp}\_{SL} $ function should be
+applied repeatedly until a fixpoint is reached.
+We call this construction $\textit{SLSimp}$:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\textit{SLSimp} \; r$ & $\dn$ &
+ $\textit{while}((\textit{simp}\_{SL} \; r)\; \cancel{=} \; r)$ \\
+ & & $\quad r := \textit{simp}\_{SL} \; r$\\
& & $\textit{return} \; r$
\end{tabular}
\end{center}
We call the operation of alternatingly
applying derivatives and simplifications
(until the string is exhausted) Sulz-simp-derivative,
-written $\backslash_{sulzSimp}$:
+written $\backslash_{SLSimp}$:
\begin{center}
\begin{tabular}{lcl}
- $r \backslash_{sulzSimp} (c\!::\!s) $ & $\dn$ & $(\textit{sulzSimp} \; (r \backslash c)) \backslash_{sulzSimp}\, s$ \\
-$r \backslash_{sulzSimp} [\,] $ & $\dn$ & $r$
+ $r \backslash_{SLSimp} (c\!::\!s) $ & $\dn$ & $(\textit{SLSimp} \; (r \backslash c)) \backslash_{SLSimp}\, s$ \\
+$r \backslash_{SLSimp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
@@ -138,8 +153,8 @@
as $\blexer$:
\begin{center}
\begin{tabular}{lcl}
- $\textit{blexer\_sulzSimp}\;r\,s$ & $\dn$ &
- $\textit{let}\;a = (r^\uparrow)\backslash_{sulzSimp}\, s\;\textit{in}$\\
+ $\textit{blexer\_SLSimp}\;r\,s$ & $\dn$ &
+ $\textit{let}\;a = (r^\uparrow)\backslash_{SLSimp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
@@ -148,7 +163,7 @@
\noindent
We implemented this lexing algorithm in Scala,
and found that the final derivative regular expression
-size grows exponentially fast:
+size grows exponentially fast (note the logarithmic scale):
\begin{figure}[H]
\centering
\begin{tikzpicture}
@@ -169,7 +184,7 @@
\noindent
At $n= 20$ we already get an out of memory error with Scala's normal
JVM heap size settings.
-In fact their simplification does not improve over
+In fact their simplification does not improve much over
the simple-minded simplifications we have shown in \ref{fig:BetterWaterloo}.
The time required also grows exponentially:
\begin{figure}[H]
@@ -199,46 +214,47 @@
\noindent
The assumption that the size of the regular expressions
in the algorithm
-would stay below a finite constant is not ture.
-The main reason behind this is that (i) The $\textit{nub}$
+would stay below a finite constant is not true, not in the
+examples we considered.
+The main reason behind this is that (i) the $\textit{nub}$
function requires identical annotations between two
annotated regular expressions to qualify as duplicates,
-and cannot simplify the cases like $_{SZZ}a^*+_{SZS}a^*$
-even if both $a^*$ denote the same language.
-(ii) The ``flattening'' only applies to the head of the list
+and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$
+even if both $a^*$ denote the same language, and
+(ii) the ``flattening'' only applies to the head of the list
in the
\begin{center}
\begin{tabular}{lcl}
- $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+ $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
$_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
\end{tabular}
\end{center}
\noindent
-clause, and therefore is not thorough enough to simplify all
-needed parts of the regular expression.\\
-In addition to that, even if the regular expressions size
+clause, and therefore is not strong enough to simplify all
+needed parts of the regular expression. Moreover, even if the regular expressions size
do stay finite, one has to take into account that
-the $\simpsulz$ function is applied many times
+the $\textit{simp}\_{SL}$ function is applied many times
in each derivative step, and that number is not necessarily
a constant with respect to the size of the regular expression.
-To not get ``caught off guard'' by
-these counterexamples,
-one needs to be more careful when designing the
-simplification function and making claims about them.
+%To not get ``caught off guard'' by
+%these counterexamples,
+%one needs to be more careful when designing the
+%simplification function and making claims about them.
\section{Our $\textit{Simp}$ Function}
-We will now introduce our simplification function,
-by making a contrast with $\simpsulz$.
-We describe
-the ideas behind components in their algorithm
-and why they fail to achieve the desired effect, followed
-by our solution. These solutions come with correctness
-statements that are backed up by formal proofs.
+We will now introduce our simplification function.
+%by making a contrast with $\textit{simp}\_{SL}$.
+We also describe
+the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$
+algorithm
+and why it fails to achieve the desired effect.
+Our simplification function comes with a formal
+correctness proof.
\subsection{Flattening Nested Alternatives}
The idea behind the clause
\begin{center}
-$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
+ $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
\end{center}
is that it allows
@@ -250,25 +266,27 @@
\begin{center}
$(a+r)+r \longrightarrow a+r$
\end{center}
-The problem here is that only the head element
-is ``spilled out'',
-whereas we would want to flatten
-an entire list to open up possibilities for further simplifications.
+The problem is that only the head element
+is ``spilled out''.
+It is more desirable
+to flatten
+an entire list to open up possibilities for further simplifications
+with later regular expressions.
Not flattening the rest of the elements also means that
the later de-duplication processs
does not fully remove further duplicates.
For example,
-using $\simpsulz$ we could not
+using $\textit{simp}\_{SL}$ we cannot
simplify
\begin{center}
$((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
((a^*a^*)+a^*)\cdot (a^*a^*)^*$
\end{center}
-due to the underlined part not in the first element
-of the alternative.\\
-We define a flatten operation that flattens not only
-the first regular expression of an alternative,
-but the entire list:
+due to the underlined part not being in the head
+of the alternative.
+
+We define our flatten operation so that it flattens
+the entire list:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
@@ -282,10 +300,10 @@
also throws away $\ZERO$s
as they do not contribute to a lexing result.
\subsection{Duplicate Removal}
-After flattening is done, we are ready to deduplicate.
+After flattening is done, we can deduplicate.
The de-duplicate function is called $\distinctBy$,
and that is where we make our second improvement over
-Sulzmann and Lu's.
+Sulzmann and Lu's simplification method.
The process goes as follows:
\begin{center}
$rs \stackrel{\textit{flts}}{\longrightarrow}
@@ -306,7 +324,7 @@
\noindent
The reason we define a distinct function under a mapping $f$ is because
we want to eliminate regular expressions that are syntactically the same,
-but with different bit-codes.
+but have different bit-codes.
For example, we can remove the second $a^*a^*$ from
$_{ZSZ}a^*a^* + _{SZZ}a^*a^*$, because it
represents a match with shorter initial sub-match
@@ -327,21 +345,23 @@
even if they are attached with different bitcodes.
These duplicates therefore need to be removed.
To achieve this, we call $\rerases$ as the function $f$ during the distinction
-operation.\\
+operation. The function
$\rerases$ is very similar to $\erase$, except that it preserves the structure
when erasing an alternative regular expression.
The reason why we use $\rerases$ instead of $\erase$ is that
it keeps the structures of alternative
annotated regular expressions
-whereas $\erase$ would turn it back into a binary structure.
+whereas $\erase$ would turn it back into a binary tree structure.
Not having to mess with the structure
greatly simplifies the finiteness proof in chapter
-\ref{Finite} (we will follow up with more details there).
+\ref{Finite}.
We give the definitions of $\rerases$ here together with
the new datatype used by $\rerases$ (as our plain
regular expression datatype does not allow non-binary alternatives).
-For the moment the reader can just think of
-$\rerases$ as $\erase$ and $\rrexp$ as plain regular expressions.
+For now we can think of
+$\rerases$ as the function $(\_)_\downarrow$ defined in chapter \ref{Bitcoded1}
+and $\rrexp$ as plain regular expressions, but having a general list constructor
+for alternatives:
\begin{figure}[H]
\begin{center}
$\rrexp ::= \RZERO \mid \RONE
@@ -353,9 +373,7 @@
\caption{$\rrexp$: plain regular expressions, but with $\sum$ alternative
constructor}\label{rrexpDef}
\end{figure}
-The notation of $\rerases$ also follows that of $\erase$,
-which is a postfix operator written as a subscript,
-except that it has an \emph{r} attached to it to distinguish against $\erase$:
+The function $\rerases$ we define as follows:
\begin{center}
\begin{tabular}{lcl}
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
@@ -368,8 +386,7 @@
\end{center}
\subsection{Putting Things Together}
-A recursive definition of our simplification function
-is given below:
+We can now give the definition of our simplification function:
%that looks somewhat similar to our Scala code is
\begin{center}
\begin{tabular}{@{}lcl@{}}
@@ -404,21 +421,20 @@
and then call $\distinctBy$ on that list, the predicate determining whether two
elements are the same is $\rerases \; r_1 = \rerases\; r_2$.
Finally, depending on whether the regular expression list $as'$ has turned into a
-singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$
+singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$
decides whether to keep the current level constructor $\sum$ as it is, and
removes it when there are less than two elements:
\begin{center}
\begin{tabular}{lcl}
- $\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
+ $\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
\end{tabular}
\end{center}
-Having defined the $\bsimp$ function,
-we add it as a phase after a derivative is taken,
-so it stays small:
+Having defined the $\textit{bsimp}$ function,
+we add it as a phase after a derivative is taken.
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
@@ -428,7 +444,7 @@
%when extending from derivatives w.r.t.~character to derivative
%w.r.t.~string, we define the derivative that nests simplifications
%with derivatives:%\comment{simp in the [] case?}
-We extend this from character to string:
+We extend this from characters to strings:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
@@ -458,7 +474,7 @@
After Simplification}
Recall the
previous $(a^*a^*)^*$ example
-where $\simpsulz$ could not
+where $\textit{simp}\_{SL}$ could not
prevent the fast growth (over
3 million nodes just below $20$ input length)
will be reduced to just 15 and stays constant no matter how long the
@@ -486,7 +502,7 @@
ylabel={derivative size},
width = 7cm,
height = 4cm,
- legend entries={Lexer with $\simpsulz$},
+ legend entries={Lexer with $\textit{simp}\_{SL}$},
legend pos= north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
@@ -499,36 +515,38 @@
\noindent
Given the size difference, it is not
surprising that our $\blexersimp$ significantly outperforms
-$\textit{blexer\_sulzSimp}$.
-In the next section we are going to establish the
-first important property of our lexer--the correctness.
+$\textit{blexer\_SLSimp}$.
+In the next section we are going to establish that our
+simplification preserves the correctness of the algorithm.
%----------------------------------------------------------------------------------------
% SECTION rewrite relation
%----------------------------------------------------------------------------------------
\section{Correctness of $\blexersimp$}
-In this section we give details
-of the correctness proof of $\blexersimp$,
-one of the contributions of this thesis.\\
We first introduce the rewriting relation \emph{rrewrite}
($\rrewrite$) between two regular expressions,
-which expresses an atomic
+which stands for an atomic
simplification.
We then prove properties about
this rewriting relation and its reflexive transitive closure.
Finally we leverage these properties to show
-an equivalence between the internal data structures of
+an equivalence between the results generated by
$\blexer$ and $\blexersimp$.
\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
In the $\blexer$'s correctness proof, we
did not directly derive the fact that $\blexer$ generates the POSIX value,
-but first proved that $\blexer$ is linked with $\lexer$.
+but first proved that $\blexer$ generates the same result as $\lexer$.
Then we re-use
the correctness of $\lexer$
-to obtain
+to obtain
\begin{center}
- $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.
+ $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$\\
+ $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;
+ r\;s = \None$.
\end{center}
+%\begin{center}
+% $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$.
+%\end{center}
Here we apply this
modularised technique again
by first proving that
@@ -536,22 +554,25 @@
produces the same output as $\blexer \; r\; s$,
and then piecing it together with
$\blexer$'s correctness to achieve our main
-theorem:\footnote{ The case when
-$s$ is not in $L \; r$, is routine to establish.}
+theorem:
\begin{center}
- $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$
+ $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = \Some \;v$
+ \\
+ $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
+ r\;s = \None$
\end{center}
\noindent
The overall idea for the proof
of $\blexer \;r \;s = \blexersimp \; r \;s$
is that the transition from $r$ to $\textit{bsimp}\; r$ can be
-broken down into finitely many rewrite steps:
+broken down into smaller rewrite steps of the form:
\begin{center}
$r \rightsquigarrow^* \textit{bsimp} \; r$
\end{center}
where each rewrite step, written $\rightsquigarrow$,
is an ``atomic'' simplification that
-is similar to a small-step reduction in operational semantics:
+is similar to a small-step reduction in operational semantics (
+see figure \ref{rrewriteRules} for the rules):
\begin{figure}[H]
\begin{mathpar}
\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
@@ -596,11 +617,11 @@
}\label{rrewriteRules}
\end{figure}
\noindent
-The rules such as $LT$ and $LH$ are for rewriting between two regular expression lists
+The rules $LT$ and $LH$ are for rewriting two regular expression lists
such that one regular expression
in the left-hand-side list is rewritable in one step
to the right-hand-side's regular expression at the same position.
-This helps with defining the ``context rules'' such as $AL$.\\
+This helps with defining the ``context rule'' $AL$.\\
The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$
are defined in the usual way:
\begin{figure}[H]
@@ -619,20 +640,21 @@
\end{figure}
%Two rewritable terms will remain rewritable to each other
%even after a derivative is taken:
-Rewriting is preserved under derivatives,
+The main point of our rewriting relation
+is that it is preserved under derivatives,
namely
\begin{center}
$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
\end{center}
-And finally, if two terms are rewritable to each other,
+And also, if two terms are rewritable to each other,
then they produce the same bitcodes:
\begin{center}
$r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$
\end{center}
The decoding phase of both $\blexer$ and $\blexersimp$
-are the same, which means that if they get the same
+are the same, which means that if they receive the same
bitcodes before the decoding phase,
-they get the same value after decoding is done.
+they generate the same value after decoding is done.
We will prove the three properties
we mentioned above in the next sub-section.
\subsection{Important Properties of $\rightsquigarrow$}
@@ -685,7 +707,7 @@
\end{proof}
\noindent
The inference rules of $\stackrel{s}{\rightsquigarrow}$
-are defined in terms of list cons operation, here
+are defined in terms of the list cons operation, where
we establish that the
$\stackrel{s}{\rightsquigarrow}$ and $\stackrel{s*}{\rightsquigarrow}$
relation is also preserved w.r.t appending and prepending of a list.
@@ -717,9 +739,9 @@
$rs_3 \stackrel{s*}{\rightsquigarrow} rs_4 \land r_1 \rightsquigarrow^* r_2 \implies
r_2 :: rs_3 \stackrel{s*}{\rightsquigarrow} r_2 :: rs_4$
\item
- If we could rewrite a regular expression
+ If we can rewrite a regular expression
in many steps to $\ZERO$, then
- we could also rewrite any sequence containing it to $\ZERO$:\\
+ we can also rewrite any sequence containing it to $\ZERO$:\\
$r_1 \rightsquigarrow^* \ZERO
\implies _{bs}r_1\cdot r_2 \rightsquigarrow^* \ZERO$
\end{itemize}
@@ -736,18 +758,18 @@
The last part is proven by rule induction again on $\rightsquigarrow^*$.
\end{proof}
\noindent
-Now we are ready to give the proofs of the below properties:
+Now we are ready to give the proofs of the following properties:
\begin{itemize}
\item
- $(r \rightsquigarrow^* r'\land \bnullable \; r_1)
+ $r \rightsquigarrow^* r'\land \bnullable \; r_1
\implies \bmkeps \; r = \bmkeps \; r'$. \\
\item
$r \rightsquigarrow^* \textit{bsimp} \;r$.\\
\item
$r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\
\end{itemize}
-These properties would work together towards the correctness theorem.
-\subsubsection{Property 1: $(r \rightsquigarrow^* r'\land \bnullable \; r_1)
+
+\subsubsection{Property 1: $r \rightsquigarrow^* r'\land \bnullable \; r_1
\implies \bmkeps \; r = \bmkeps \; r'$}
Intuitively, this property says we can
extract the same bitcodes using $\bmkeps$ from the nullable
@@ -761,7 +783,7 @@
$\textit{bnullables} \; rs \quad \dn \quad \exists r \in rs. \;\; \bnullable \; r$
\end{center}
\noindent
-The rewriting relation $\rightsquigarrow$ preserves nullability:
+The rewriting relation $\rightsquigarrow$ preserves (b)nullability:
\begin{lemma}\label{rewritesBnullable}
\hspace{0em}
\begin{itemize}
@@ -812,23 +834,23 @@
By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$.
\end{proof}
\noindent
-With lemma \ref{rewriteBmkepsAux} we are ready to prove its
+With lemma \ref{rewriteBmkepsAux} in place we are ready to prove its
many-step version:
\begin{lemma}
$\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$
\end{lemma}
\begin{proof}
- By rule induction of $\stackrel{*}{\rightsquigarrow} $.
- $\ref{rewritesBnullable}$ tells us both $r$ and $r'$ are nullable.
- \ref{rewriteBmkepsAux} solves the inductive case.
+ By rule induction of $\stackrel{*}{\rightsquigarrow} $. Lemma
+ $\ref{rewritesBnullable}$ gives us both $r$ and $r'$ are nullable.
+ The lemma \ref{rewriteBmkepsAux} solves the inductive case.
\end{proof}
-\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$}
-Now we get to the ``meaty'' part of the proof,
+\subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$}
+Now we get to the key part of the proof,
which says that our simplification's helper functions
-such as $\distinctBy$ and $\flts$ conform to
-the $\stackrel{s*}{\rightsquigarrow}$ and
-$\rightsquigarrow^* $ rewriting relations.\\
+such as $\distinctBy$ and $\flts$ describe
+reducts of $\stackrel{s*}{\rightsquigarrow}$ and
+$\rightsquigarrow^* $.\\
The first lemma to prove is a more general version of
$rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$:
\begin{lemma}
@@ -849,7 +871,7 @@
$rs_1 \stackrel{s*}{\rightsquigarrow} \distinctBy \; rs_1 \; \phi$.
\end{corollary}
\noindent
-The flatten function $\flts$ conforms to
+Similarly the flatten function $\flts$ describes a reduct of
$\stackrel{s*}{\rightsquigarrow}$ as well:
\begin{lemma}\label{fltsPreserves}
@@ -865,14 +887,14 @@
\end{lemma}
\noindent
The simplification function
-$\textit{bsimp}$ only transforms the regex $r$ using steps specified by
-$\rightsquigarrow^*$ and nothing else.
+$\textit{bsimp}$ only transforms the regular expression using steps specified by
+$\rightsquigarrow^*$ and nothing else:
\begin{lemma}
- $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$
+ $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$
\end{lemma}
\begin{proof}
By an induction on $r$.
- The most involved case would be the alternative,
+ The most involved case is the alternative,
where we use lemmas \ref{bsimpaltsPreserves},
\ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\
\begin{center}
@@ -883,7 +905,7 @@
(\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\
\end{tabular}
\end{center}
- Using this we derive the following rewrite relation:\\
+ Using this we can derive the following rewrite sequence:\\
\begin{center}
\begin{tabular}{lcl}
$r$ & $=$ & $_{bs}\sum rs$\\[1.5ex]
@@ -900,10 +922,9 @@
\end{center}
\end{proof}
\subsubsection{Property 3: $r_1 \stackrel{*}{\rightsquigarrow} r_2 \implies r_1 \backslash c \stackrel{*}{\rightsquigarrow} r_2 \backslash c$}
-The rewritability relation
-$\rightsquigarrow$ is preserved under derivatives--
-it is just that we might need multiple steps
-where originally only one step was needed:
+The rewrite relation
+$\rightsquigarrow$ changes into $\stackrel{*}{\rightsquigarrow}$
+after derivatives are taken on both sides:
\begin{lemma}\label{rewriteBder}
\hspace{0em}
\begin{itemize}
@@ -927,11 +948,11 @@
r_2 \backslash c$
\end{corollary}
\begin{proof}
- By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma \ref{rewriteBder}.
+ By rule induction of $\stackrel{*}{\rightsquigarrow} $ and lemma \ref{rewriteBder}.
\end{proof}
\noindent
This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$
-to obtain the rewritability between
+to obtain the correspondence between
$\blexer$ and $\blexersimp$'s intermediate
derivative regular expressions
\begin{lemma}\label{bderBderssimp}
@@ -941,52 +962,56 @@
By an induction on $s$.
\end{proof}
\subsection{Main Theorem}
-Now with \ref{bderBderssimp} we are ready for the main theorem.
+Now with \ref{bderBderssimp} in place we are ready for the main theorem.
\begin{theorem}
$\blexer \; r \; s = \blexersimp{r}{s}$
\end{theorem}
\noindent
\begin{proof}
- One can rewrite in many steps from the original lexer's
+ We can rewrite in many steps from the original lexer's
derivative regular expressions to the
lexer with simplification applied (by lemma \ref{bderBderssimp}):
\begin{center}
$a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $.
\end{center}
- we know that they give out the same bits, if the lexing result is a match:
+ We know that they generate the same bits, if the lexing result is a match:
\begin{center}
$\bnullable \; (a \backslash s)
\implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$
\end{center}
- Now that they give out the same bits, we know that they give the same value after decoding.
+ Now that they generate the same bits, we know that they give the same value after decoding.
\begin{center}
$\bnullable \; (a \backslash s)
\implies \decode \; r \; (\bmkeps \; (a \backslash s)) =
\decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$
\end{center}
- Which is equivalent to our proof goal:
+ Which is required by our proof goal:
\begin{center}
$\blexer \; r \; s = \blexersimp \; r \; s$.
\end{center}
\end{proof}
\noindent
As a corollary,
-we link this result with the lemma we proved earlier that
+we can link this result with the lemma we proved earlier that
\begin{center}
- $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = v$
+ $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\
+ $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\;
+ r\;s = \None$.
\end{center}
and obtain the corollary that the bit-coded lexer with simplification is
indeed correctly outputting POSIX lexing result, if such a result exists.
\begin{corollary}
- $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $
+ $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\
+ $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\;
+ r\;s = \None$.
\end{corollary}
\subsection{Comments on the Proof Techniques Used}
Straightforward and simple as the proof may seem,
-the efforts we spent obtaining it were far from trivial.\\
+the efforts we spent obtaining it were far from trivial.
We initially attempted to re-use the argument
in \cref{flex_retrieve}.
-The problem was that both functions $\inj$ and $\retrieve$ require
+The problem is that both functions $\inj$ and $\retrieve$ require
that the annotated regular expressions stay unsimplified,
so that one can
correctly compare $v_{i+1}$ and $r_i$ and $v_i$
@@ -1041,16 +1066,11 @@
such as
$\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow
\SEQs [r_1, r_2, r_3]$.
-This does not fit with the proof technique
+However this does not fit with the proof technique
of our main theorem, but seem to not violate the POSIX
-property.\\
-Having correctness property is good.
-But we would also a guarantee that the lexer is not slow in
-some sense, for exampe, not grinding to a halt regardless of the input.
-As we have already seen, Sulzmann and Lu's simplification function
-$\simpsulz$ cannot achieve this, because their claim that
-the regular expression size does not grow arbitrary large
-was not true.
-In the next chapter we shall prove that with our $\simp$,
-for a given $r$, the internal derivative size is always
+property.
+
+Having established the correctness of our
+$\blexersimp$, in the next chapter we shall prove that with our $\simp$ function,
+for a given $r$, the derivative size is always
finitely bounded by a constant.