--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Sat Sep 10 22:30:22 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Mon Sep 12 23:32:18 2022 +0200
@@ -11,39 +11,43 @@
-In this chapter we introduce the simplifications
+In this chapter we introduce simplifications
on annotated regular expressions that can be applied to
each intermediate derivative result. This allows
us to make $\blexer$ much more efficient.
-We contrast this simplification function
-with Sulzmann and Lu's original
-simplifications, indicating the simplicity of our algorithm and
-improvements we made, demostrating
-the usefulness and reliability of formal proofs on algorithms.
+Sulzmann and Lu already had some bit-coded simplifications,
+but their simplification functions were inefficient.
+We contrast our simplification function
+with Sulzmann and Lu's, indicating the simplicity of our algorithm.
+This is another case for the usefulness
+and reliability of formal proofs on algorithms.
These ``aggressive'' simplifications would not be possible in the injection-based
lexing we introduced in chapter \ref{Inj}.
-We then go on to prove the correctness with the improved version of
+We then prove the correctness with the improved version of
$\blexer$, called $\blexersimp$, by establishing
$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
\section{Simplifications by Sulzmann and Lu}
-The first thing we notice in the fast growth of examples such as $(a^*a^*)^*$'s
-and $(a^* + (aa)^*)^*$'s derivatives is that a lot of duplicated sub-patterns
-are scattered around different levels, and therefore requires
-de-duplication at different levels:
+Consider the derivatives of examples such as $(a^*a^*)^*$
+and $(a^* + (aa)^*)^*$:
\begin{center}
$(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$
\end{center}
\noindent
-As we have already mentioned in \ref{eqn:growth2},
-a simple-minded simplification function cannot simplify
+As can be seen, there is a lot of duplication
+in the example we have already mentioned in
+\ref{eqn:growth2}.
+A simple-minded simplification function cannot simplify
the third regular expression in the above chain of derivative
-regular expressions:
+regular expressions, namely
\begin{center}
$((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$
\end{center}
-one would expect a better simplification function to work in the
+because the duplicates are
+not next to each other and therefore the rule
+$r+ r \rightarrow r$ does not fire.
+One would expect a better simplification function to work in the
following way:
\begin{gather*}
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
@@ -61,10 +65,16 @@
)\cdot(a^*a^*)^*
\end{gather*}
\noindent
-This motivating example came from testing Sulzmann and Lu's
+In the first step, the nested alternative regular expression
+$(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$.
+Now the third term $a^*$ is clearly identified as a duplicate
+and therefore removed in the second step. This causes the two
+top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$
+removed in the final step.\\
+This motivating example is from testing Sulzmann and Lu's
algorithm: their simplification does
not work!
-We quote their $\textit{simp}$ function verbatim here:
+Consider their simplification (using our notations):
\begin{center}
\begin{tabular}{lcl}
$\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ &
@@ -85,10 +95,10 @@
\end{tabular}
\end{center}
\noindent
-the $\textit{zeroable}$ predicate
-which tests whether the regular expression
+where the $\textit{zeroable}$ predicate
+tests whether the regular expression
is equivalent to $\ZERO$,
-is defined as:
+can be defined as:
\begin{center}
\begin{tabular}{lcl}
$\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\;
@@ -190,6 +200,23 @@
The assumption that the size of the regular expressions
in the algorithm
would stay below a finite constant is not ture.
+The main reason behind this is that (i) The $\textit{nub}$
+function requires identical annotations between two
+annotated regular expressions to qualify as duplicates,
+and cannot simplify the cases like $_{SZZ}a^*+_{SZS}a^*$
+even if both $a^*$ denote the same language.
+(ii) The ``flattening'' only applies to the head of the list
+in the
+\begin{center}
+ \begin{tabular}{lcl}
+
+ $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ &
+ $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\
+ \end{tabular}
+\end{center}
+\noindent
+clause, and therefore is not thorough enough to simplify all
+needed parts of the regular expression.\\
In addition to that, even if the regular expressions size
do stay finite, one has to take into account that
the $\simpsulz$ function is applied many times
@@ -209,14 +236,14 @@
by our solution. These solutions come with correctness
statements that are backed up by formal proofs.
\subsection{Flattening Nested Alternatives}
-The idea behind the
+The idea behind the clause
\begin{center}
$\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad
_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$
\end{center}
-clause is that it allows
+is that it allows
duplicate removal of regular expressions at different
-levels.
+``levels'' of alternatives.
For example, this would help with the
following simplification:
@@ -226,15 +253,15 @@
The problem here is that only the head element
is ``spilled out'',
whereas we would want to flatten
-an entire list to open up possibilities for further simplifications.\\
+an entire list to open up possibilities for further simplifications.
Not flattening the rest of the elements also means that
the later de-duplication processs
-does not fully remove apparent duplicates.
+does not fully remove further duplicates.
For example,
using $\simpsulz$ we could not
simplify
\begin{center}
-$((a^* a^*)+ (a^* + a^*))\cdot (a^*a^*)^*+
+ $((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+
((a^*a^*)+a^*)\cdot (a^*a^*)^*$
\end{center}
due to the underlined part not in the first element
@@ -427,17 +454,16 @@
This algorithm keeps the regular expression size small.
-\subsection{$(a+aa)^*$ and $(a^*\cdot a^*)^*$ against
-$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ After Simplification}
-For example,
-with our simplification the
+\subsection{Examples $(a+aa)^*$ and $(a^*\cdot a^*)^*$
+After Simplification}
+Recall the
previous $(a^*a^*)^*$ example
where $\simpsulz$ could not
-stop the fast growth (over
+prevent the fast growth (over
3 million nodes just below $20$ input length)
-will be reduced to just 15 and stays constant, no matter how long the
+will be reduced to just 15 and stays constant no matter how long the
input string is.
-This is demonstrated in the graphs below.
+This is shown in the graphs below.
\begin{figure}[H]
\begin{center}
\begin{tabular}{ll}
@@ -482,12 +508,11 @@
\section{Correctness of $\blexersimp$}
In this section we give details
of the correctness proof of $\blexersimp$,
-an important contribution of this thesis.\\
+one of the contributions of this thesis.\\
We first introduce the rewriting relation \emph{rrewrite}
($\rrewrite$) between two regular expressions,
which expresses an atomic
-simplification step from the left-hand-side
-to the right-hand-side.
+simplification.
We then prove properties about
this rewriting relation and its reflexive transitive closure.
Finally we leverage these properties to show
@@ -496,7 +521,7 @@
\subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)}
In the $\blexer$'s correctness proof, we
-did not directly derive the fact that $\blexer$ gives out the POSIX value,
+did not directly derive the fact that $\blexer$ generates the POSIX value,
but first proved that $\blexer$ is linked with $\lexer$.
Then we re-use
the correctness of $\lexer$
@@ -511,8 +536,8 @@
produces the same output as $\blexer \; r\; s$,
and then piecing it together with
$\blexer$'s correctness to achieve our main
-theorem:\footnote{ the case when
-$s$ is not in $L \; r$, is routine to establish }
+theorem:\footnote{ The case when
+$s$ is not in $L \; r$, is routine to establish.}
\begin{center}
$(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$
\end{center}
@@ -526,7 +551,7 @@
\end{center}
where each rewrite step, written $\rightsquigarrow$,
is an ``atomic'' simplification that
-cannot be broken down any further:
+is similar to a small-step reduction in operational semantics:
\begin{figure}[H]
\begin{mathpar}
\inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\}
@@ -592,8 +617,10 @@
\caption{The Reflexive Transitive Closure of
$\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure}
\end{figure}
-Two rewritable terms will remain rewritable to each other
-even after a derivative is taken:
+%Two rewritable terms will remain rewritable to each other
+%even after a derivative is taken:
+Rewriting is preserved under derivatives,
+namely
\begin{center}
$r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$
\end{center}
@@ -956,7 +983,7 @@
\subsection{Comments on the Proof Techniques Used}
Straightforward and simple as the proof may seem,
-the efforts we spent obtaining it was far from trivial.\\
+the efforts we spent obtaining it were far from trivial.\\
We initially attempted to re-use the argument
in \cref{flex_retrieve}.
The problem was that both functions $\inj$ and $\retrieve$ require
@@ -1007,6 +1034,16 @@
\end{center}
as equal, because they were both re-written
from the same expression.\\
+The simplification rewriting rules
+given in \ref{rrewriteRules} are by no means
+final,
+one could come up new rules
+such as
+$\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow
+\SEQs [r_1, r_2, r_3]$.
+This does not fit with the proof technique
+of our main theorem, but seem to not violate the POSIX
+property.\\
Having correctness property is good.
But we would also a guarantee that the lexer is not slow in
some sense, for exampe, not grinding to a halt regardless of the input.