--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Jul 14 00:32:41 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Mon Jul 24 11:09:48 2023 +0100
@@ -1,43 +1,32 @@
% Chapter Template
-\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
+\chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
% used for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.
-
-In this chapter, we define the basic notions
-for regular languages and regular expressions.
-This is essentially a description in ``English''
-the functions and datatypes of our formalisation in Isabelle/HOL.
-We also define what $\POSIX$ lexing means,
-followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
-that produces the output conforming
-to the $\POSIX$ standard\footnote{In what follows
-we choose to use the Isabelle-style notation
-for function applications, where
-the parameters of a function are not enclosed
-inside a pair of parentheses (e.g. $f \;x \;y$
-instead of $f(x,\;y)$). This is mainly
-to make the text visually more concise.}.
-
+We start with a technical overview of the catastrophic backtracking problem,
+motivating rigorous approaches to regular expression matching and lexing.
+Then we introduce basic notations and definitions of our problem.
\section{Technical Overview}
-
Consider for example the regular expression $(a^*)^*\,b$ and
strings of the form $aa..a$. These strings cannot be matched by this regular
-expression: Obviously the expected $b$ in the last
+expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of lenght of around 30 $a$'s,
-the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
+the decision takes large amounts of time to finish.
+This is inproportional
+to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
The algorithms clearly show exponential behaviour, and as can be seen
is triggered by some relatively simple regular expressions.
Java 9 and newer
-versions improve this behaviour somewhat, but are still slow compared
+versions mitigates this behaviour by several magnitudes,
+but are still slow compared
with the approach we are going to use in this thesis.
@@ -188,15 +177,73 @@
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=true,
+ %xtick={0,5,...,30},
+ %xmax=33,
+ %ymax=35,
+ restrict x to domain*=0:60000,
+ restrict y to domain*=0:35,
+ %ytick={0,5,...,30},
+ %scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Scala},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=true,
+ %xtick={0,5000,...,40000},
+ %xmax=40000,
+ %ymax=35,
+ restrict x to domain*=0:60000,
+ restrict y to domain*=0:45,
+ %ytick={0,5,...,30},
+ %scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend style={cells={align=left}},
+ legend entries={Isabelle \\ Extracted},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
+\end{axis}
+\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
- The reason for their superlinear behaviour is that they do a depth-first-search
+ The reason for their fast growth %superlinear behaviour
+ is that they do a depth-first-search
using NFAs.
If the string does not match, the regular expression matching
engine starts to explore all possibilities.
+ The last two graphs are for our implementation in Scala, one manual
+ and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
+ Our lexer
+ performs better in this case, and
+ is formally verified.
+ Despite being almost identical, the codegen-generated lexer
+ % is generated directly from Isabelle using $\textit{codegen}$,
+ is slower than the manually written version since the code synthesised by
+ $\textit{codegen}$ does not use native integer or string
+ types of the target language.
+ %Note that we are comparing our \emph{lexer} against other languages' matchers.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
@@ -236,6 +283,34 @@
approach based on Brzozowski derivatives and formal proofs.
+
+
+In this chapter, we define the basic notions
+for regular languages and regular expressions.
+This is essentially a description in ``English''
+the functions and datatypes of our formalisation in Isabelle/HOL.
+We also define what $\POSIX$ lexing means,
+followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
+that produces the output conforming
+to the $\POSIX$ standard.
+This is a preliminary chapter which describes the results of
+Sulzmann and Lu and Ausaf et al \ref{AusafDyckhoffUrban2016},
+but the proof details and function definitions are needed to motivate our work
+as we show in chapter \ref{Bitcoded2} how the proofs break down when
+simplifications are applied.
+%TODO: Actually show this in chapter 4.
+
+In what follows
+we choose to use the Isabelle-style notation
+for function and datatype constructor applications, where
+the parameters of a function are not enclosed
+inside a pair of parentheses (e.g. $f \;x \;y$
+instead of $f(x,\;y)$). This is mainly
+to make the text visually more concise.
+
+
+
+
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have clearly been studied for many, many years.
@@ -339,7 +414,7 @@
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
-either become excruciatingly slow
+either become very slow
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
@@ -942,7 +1017,7 @@
of the sizes of derivatives.
Elegant and beautiful
as many implementations are,
-they can be excruciatingly slow.
+they can be still quite slow.
For example, Sulzmann and Lu
claim a linear running time of their proposed algorithm,
but that was falsified by our experiments. The running time
@@ -1530,7 +1605,7 @@
\end{figure}
\noindent
If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow, as shown in
+the algorithm can be as slow as backtracking lexers, as shown in
\ref{NaiveMatcher}.
Note that both axes are in logarithmic scale.
Around two dozen characters
@@ -1984,7 +2059,6 @@
$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
\end{tabular}
\end{center}
-
\noindent
The function recurses on
the shape of regular
@@ -1998,7 +2072,15 @@
injected back to that sub-value;
$\inj$ assembles all parts
to form a new value.
+\subsection{An Example of how Injection Works}
+We will provide a few examples on why
+\begin{center}
+ \begin{tabular}{lcl}
+ $\inj \; (a\cdot b)\cdot c \; \Seq \; \ONE \; b$ & $=$ & $(a+e)$\\
+ $A$ & $B$ & $C$
+ \end{tabular}
+\end{center}
For instance, the last clause is an
injection into a sequence value $v_{i+1}$
whose second child