--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Aug 12 00:39:23 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Sun Aug 14 09:44:27 2022 +0100
@@ -9,20 +9,32 @@
In this chapter we give a guarantee in terms of size:
given an annotated regular expression $a$, for any string $s$
-our algorithm's internal data structure
-$\bderssimp{a}{s}$'s size
+our algorithm's internal annotated regular expression
+size
\begin{center}
$\llbracket \bderssimp{a}{s} \rrbracket$
\end{center}
\noindent
is finitely bounded
-by a constant $N_a$ that only depends on $a$.
+by a constant $N_a$ that only depends on $a$,
+where the size of an annotated regular expression is defined
+in terms of the number of nodes in its tree structure:
+\begin{center}
+\begin{tabular}{ccc}
+ $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+\end{tabular}
+\end{center}
\section{Formalising About Size}
+\noindent
In our lexer $\blexersimp$,
The regular expression is repeatedly being taken derivative of
and then simplified.
-
\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.4cm,
every node/.style={minimum size=9mm}]
@@ -37,36 +49,48 @@
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
-\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
+\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$};
\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
\end{tikzpicture}
\end{center}
\noindent
As illustrated in the picture,
-each time the derivative is taken derivative of, it grows,
-and when it is being simplified, it shrinks.
-The blue ones are the regular expressions after simplification,
-which would be smaller than before.
+each time the regular expression
+is taken derivative of, it grows (the black nodes),
+and after simplification, it shrinks (the blue nodes).
+This intuition is depicted by the relative size
+difference between the black and blue nodes.
We give a mechanised proof that after simplification
the regular expression's size (the blue ones)
$\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
a fixed $a$.
-While it is not yet a direct formalisation of our lexer's complexity,
-it is a stepping stone towards a complexity proof because
-if the data structures out lexer has to traverse is large, the program
-will certainly be slow.
-The bound is not yet tight, and we seek to improve $N_a$ so that
+There are two problems with this finiteness proof, though.
+\begin{itemize}
+ \item
+First, It is not yet a direct formalisation of our lexer's complexity,
+as a complexity proof would require looking into
+the time it takes to execute {\bf all} the operations
+involved in the lexer (simp, collect, decode), not just the derivative.
+\item
+Second, the bound is not yet tight, and we seek to improve $N_a$ so that
it is polynomial on $\llbracket a \rrbracket$.
-We believe a formalisation of the complexity-related properties
-of the algorithm is important in our context, because we want to address
-catastrophic backtracking, which is not a correctness issue but
-in essence a performance issue, a formal proof can give
-the strongest assurance that such issues cannot arise
-regardless of what the input is.
-This level of certainty cannot come from a pencil and paper proof or
-eimpirical evidence.
-
+\end{itemize}
+Still, we believe this size formalisation
+of the algorithm is important in our context, because
+\begin{itemize}
+ \item
+ Derivatives are the most important phases of our lexer algorithm.
+ Size properties about derivatives covers the majority of the algorithm
+ and is therefore a good indication of complexity of the entire program.
+ \item
+ What the size bound proof does ensure is an absence of
+ catastrophic backtracking, which is prevalent in regular expression engines
+ in popular programming languages like Java.
+ We prove catastrophic backtracking cannot happen for {\bf all} inputs,
+ which is an advantage over work which
+ only gives empirical evidence on some test cases.
+\end{itemize}
For example, Sulzmann and Lu's made claimed that their algorithm
with bitcodes and simplifications can lex in linear time with respect
to the input string. This assumes that each
@@ -76,8 +100,9 @@
of each derivative step to grow arbitrarily large.
Here in our proof we state that such explosions cannot happen
with our simplification function.
-
-Here is a bird's eye view of how the proof of finiteness works:
+\subsection{Overview of the Proof}
+Here is a bird's eye view of how the proof of finiteness works,
+which involves three steps:
\begin{center}
\begin{tikzpicture}[scale=1,font=\bf,
node/.style={
@@ -98,45 +123,45 @@
\end{tikzpicture}
\end{center}
\noindent
-We explain the above steps one by one:
+We explain the steps one by one:
\begin{itemize}
\item
-We first use a new datatype, called $\textit{rrexp}$s, whose
-inductive type definition and derivative and simplification operations are
+We first define a new datatype
+that is more straightforward to tweak
+into the shape we want
+compared with an annotated regular expression,
+called $\textit{rrexp}$s.
+Its inductive type definition and
+derivative and simplification operations are
almost identical to those of the annotated regular expressions,
except that no bitcodes are attached.
-This new datatype is more straightforward to tweak
-compared with an annotated regular expression.
\item
We have a set of equalities for this new datatype that enables one to
-rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
+rewrite $\rderssimp{r_1 \cdot r_2}{s}$
+and $\rderssimp{r^*}{s}$ (the most involved
+inductive cases)
by a combinatioin of derivatives of their
-children regular expressions ($r_1$, $r_2$, and $r$, respectively).
-These equalities are chained together to get into a shape
-that is very easy to estimate, which we call the \emph{Closed Forms}
+children regular expressions ($r_1$, $r_2$, and $r$, respectively),
+which we call the \emph{Closed Forms}
of the derivatives.
\item
-This closed form is controlled by terms that
-are easier to deal with, wich are in turn bounded loosely
+The Closed Forms of the regular expressions
+are controlled by terms that
+are easier to deal with.
+Using inductive hypothesis, these terms
+are in turn bounded loosely
by a large yet constant number.
\end{itemize}
We give details of these steps in the next sections.
-The first step is to use something simpler than annotated regular expressions.
+The first step is to use
+$\textit{rrexp}$s,
+something simpler than
+annotated regular expressions.
-\section{the $\mathbf{r}$-rexp datatype and the size functions}
+\section{the $\textit{rrexp}$ Datatype and Its Size Functions}
We want to prove the size property about annotated regular expressions.
The size is
written $\llbracket r\rrbracket$, whose intuitive definition is as below
-\begin{center}
-\begin{tabular}{ccc}
- $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
- $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
- $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
-$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
-$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
-$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
-\end{tabular}
-\end{center}
\noindent
We first note that $\llbracket \_ \rrbracket$
is unaware of bitcodes, since