--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Dec 30 01:52:32 2022 +0000
@@ -217,29 +217,28 @@
Given their usefulness and ubiquity, one would assume that
modern regular expression matching implementations
are mature and fully studied.
-Indeed, in a popular programming language's regular expression engine,
-supplying it with regular expressions and strings,
-in most cases one can
+Indeed, in many popular programming languages' regular expression engines,
+one can supply a regular expression and a string, and in most cases
get the matching information in a very short time.
-Those engines can be blindingly fast--some
+Those engines can sometimes be blindingly fast--some
network intrusion detection systems
use regular expression engines that are able to process
hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can sometimes exhibit a surprising security vulnerability
+However, those engines can sometimes also exhibit a surprising security vulnerability
under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
-Consider for example $(a^*)^*\,b$ and
-strings of the form $aa..a$, these strings cannot be matched by this regular
-expression: Obviously the expected $b$ in the last
+Consider for example the regular expression $(a^*)^*\,b$ and
+strings of the form $aa\ldots a$. These strings cannot be matched by this regular
+expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
-say of lenght of around 30 $a$'s,
+say of length around 30 $a$'s,
the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
-This is clearly exponential behaviour, and as can be seen
+The algorithms clearly show exponential behaviour, and as can be seen
is triggered by some relatively simple regular expressions.
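+This effect is easy to reproduce. The following is a minimal Scala
+sketch using Java's built-in backtracking engine (absolute timings
+depend on the JVM version and machine; on Java 8 the runtime roughly
+doubles with each additional $a$, while newer JVMs behave better, as
+noted below):
+\begin{lstlisting}[language=Scala]
+// time Java's backtracking matcher on (a*)*b against a^n
+import java.util.regex.Pattern
+
+object EvilRegexDemo extends App {
+  val evil = Pattern.compile("(a*)*b")
+  for (n <- 20 to 30) {
+    val input = "a" * n
+    val start = System.nanoTime()
+    evil.matcher(input).matches() // always false: the b is missing
+    val secs  = (System.nanoTime() - start) / 1e9
+    println(f"n = $n%2d: $secs%.2f s")
+  }
+}
+\end{lstlisting}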
Java 9 and newer
versions improve this behaviour somewhat, but are still slow compared
@@ -248,7 +247,7 @@
This superlinear blowup in regular expression engines
-has caused grief in ``real life'' in the past where it is
+has caused grief in ``real life'' where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
@@ -423,9 +422,9 @@
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
-are real problems rather than just "a parlour trick".
+are a real problem rather than just ``a parlour trick''.
-This work aims to address this issue
+The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness
@@ -627,12 +626,18 @@
close to ten million.
A smaller sample XSD they gave
is:
-\begin{verbatim}
+\lstset{
+ basicstyle=\fontsize{8.5}{9}\ttfamily,
+ language=XML,
+ morekeywords={encoding,
+ xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
+}
+\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
- <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
- <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+ <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
-\end{verbatim}
+\end{lstlisting}
This can be seen as the regex
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
@@ -787,7 +792,7 @@
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' by their respective numbers.
%These sub-expressions are called "capturing groups".
-To do so, we use the syntax $\backslash i$
+To do so, one uses the syntax $\backslash i$
-to denote that we want the sub-string
-of the input matched by the i-th
+to denote that one wants the sub-string
+of the input matched by the $i$-th
sub-expression to appear again,
@@ -836,7 +841,7 @@
which matches four-character palindromes
like $abba$, $x??x$ and so on.
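+Assuming the palindrome regex above is written as
+$(.)(.)\backslash 2 \backslash 1$, a small Scala sketch of it,
+using Java's engine (which supports back-references), is:
+\begin{lstlisting}[language=Scala]
+// \2 and \1 refer back to whatever groups 2 and 1 matched
+import java.util.regex.Pattern
+
+object PalindromeDemo extends App {
+  val palindrome = Pattern.compile("""(.)(.)\2\1""")
+  assert(palindrome.matcher("abba").matches())
+  assert(palindrome.matcher("x??x").matches())
+  assert(!palindrome.matcher("abcd").matches()) // not a palindrome
+}
+\end{lstlisting}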
-Back-references is a regex construct
+Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
@@ -928,7 +933,8 @@
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
-The POSIX strategy is widely adopted in many regular expression matchers.
+The POSIX strategy is widely adopted in many regular expression matchers
+because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
@@ -947,11 +953,11 @@
$[0, 2), [2, 5)$
respectively.
But feeding this example to the different engines
-in regex101 \parencite{regex101} \footnote{
+listed at regex101\footnote{
regex101 is an online regular expression matcher which
-provides API for trying out regular expression
+provides an API for trying out regular expression
engines of multiple popular programming languages like
-Java, Python, Go, etc.}
+Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
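+This mismatch is easy to observe directly. In the following Scala
+sketch the pattern \texttt{(aba|ab|a)*} is our assumption of the
+example regular expression discussed above:
+\begin{lstlisting}[language=Scala]
+// Java's backtracking engine stops early instead of producing
+// the POSIX longest match
+import java.util.regex.Pattern
+
+object PosixMismatchDemo extends App {
+  val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")
+  if (m.find())
+    println(s"[${m.start()}, ${m.end()})") // prints [0, 3)
+  // a POSIX (longest-match) matcher would return [0, 5)
+}
+\end{lstlisting}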
@@ -959,8 +965,8 @@
commented that most regex libraries are not
correctly implementing the central POSIX
rule, called the maximum munch rule.
-Grathwohl \parencite{grathwohl2014crash} wrote,
-\begin{quote}
+Grathwohl \parencite{grathwohl2014crash} wrote:
+\begin{quote}\it
``The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.''
@@ -968,7 +974,7 @@
%\noindent
-We think the implementation complexity of POSIX rules also come from
-the specification being not very precise.
+We think the implementation complexity of POSIX rules also comes from
+the specification not being very precise.
-The developers of the regexp package of GO
+The developers of the regexp package of Go
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
@@ -1008,13 +1014,16 @@
are the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also prove
-that their specification coincides with the
+that their specification coincides with an earlier (unformalised)
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regards to that specification.
They also found that the informal POSIX
-specification by Sulzmann and Lu needs to be revised for the correctness proof.
+specification by Sulzmann and Lu needs to be substantially
+revised in order for the correctness proof to go through.
+According to Ausaf et al., the original
+specification and proof could not be repaired.
In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
@@ -1030,7 +1039,7 @@
The derivative of a regular expression $r$
with respect to a character $c$, is written as $r \backslash c$.
This operation tells us what $r$ transforms into
-if we ``chop'' off the first character $c$
+if we ``chop'' off a particular character $c$
from all strings in the language of $r$ (defined
later as $L \; r$).
%To give a flavour of Brzozowski derivatives, we present
@@ -1060,14 +1069,14 @@
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
-With this property, derivatives give a simple solution
+With this property, derivatives can give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
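+These definitions translate almost verbatim into a functional
+language. The following is a minimal, unoptimised Scala sketch of
+the standard clauses (the thesis's own code, given in later chapters,
+differs in details):
+\begin{lstlisting}[language=Scala]
+// regular expressions as a datatype
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// whether r can match the empty string
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+// the Brzozowski derivative of r w.r.t. the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO        => ZERO
+  case ONE         => ZERO
+  case CHAR(d)     => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
+}
+
+// r matches s iff the iterated derivative is nullable
+def matches(r: Rexp, s: String): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
+\end{lstlisting}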
-There had been several mechanised proofs about this property in various theorem
+There are several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
@@ -1079,7 +1088,7 @@
\begin{center}
\begin{tabular}{lcl}
$r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot
- r^{\{n-1\}}$\\
+  r^{\{n-1\}} \;\; (\text{when } n > 0)$\\
\end{tabular}
\end{center}
\noindent
@@ -1089,8 +1098,8 @@
large counters
quite well.
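+In the style of the Scala sketch given earlier, this clause could be
+rendered as follows (the constructor name NTIMES is our choice, not
+fixed terminology):
+\begin{lstlisting}[language=Scala]
+// bounded repetition r{n}: the counter stays a plain integer,
+// so the expression is never expanded into n copies of r
+case class NTIMES(r: Rexp, n: Int) extends Rexp
+
+// the additional clause for nullable:
+//   case NTIMES(r1, n) => n == 0 || nullable(r1)
+
+// the additional clause for der, implementing the equation above:
+//   case NTIMES(r1, n) =>
+//     if (n == 0) ZERO
+//     else SEQ(der(c, r1), NTIMES(r1, n - 1))
+\end{lstlisting}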
-There has also been
-extensions to other constructs.
+There have also been
+extensions of derivatives to other regex constructs.
For example, Owens et al include the derivatives
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
@@ -1106,11 +1115,11 @@
Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
-in generating a formally verified algorithm for lexing--the clauses and property
+in generating a formally verified algorithm for lexing: the clauses and property
can be easily expressed in a functional programming language
or converted to theorem prover
-code, with great extensibility.
-Perhaps this is the reason why it has sparked quite a bit of interest
+code, with great ease.
+Perhaps this is the reason why derivatives have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years (
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
@@ -1125,7 +1134,8 @@
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
-Second, derivative-based matchers need to be more efficient.
+Second, derivative-based matchers need to be more efficient in terms
+of the sizes of derivatives.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow.
@@ -1146,7 +1156,7 @@
\noindent
They concluded that the algorithm is quadratic in terms of
the length of the input string.
-When we tried out their extracted OCaml code with our example $(a+aa)^*$,
+When we tried out their extracted OCaml code with the example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.
@@ -1158,16 +1168,17 @@
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
-function.
+function. This injection function in a sense undoes the ``damage''
+of the derivatives chopping off characters.
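+As a rough sketch (the precise definitions are the subject of
+chapter \ref{Inj}), the two phases can be rendered in Scala as
+follows, reusing the Rexp operations from the earlier sketch:
+\begin{lstlisting}[language=Scala]
+// values record how a regular expression matched a string
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+// a value witnessing how a nullable r matches the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE         => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_)     => Stars(Nil)
+}
+
+// inj undoes one derivative step, putting c back into the value
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
+  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
+  case (CHAR(d), Empty)                 => Chr(d)
+}
+
+// phase one: derivatives going forward;
+// phase two: injections on the way back
+def lexer(r: Rexp, s: List[Char]): Val = s match {
+  case Nil     => if (nullable(r)) mkeps(r) else sys.error("no match")
+  case c :: cs => inj(r, c, lexer(der(c, r), cs))
+}
+\end{lstlisting}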
In earlier work, Ausaf et al provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.
-The version of the algorithm proven correct
-suffers heavily from a
-second difficulty, where the internal derivatives can
+The version of the algorithm proven correct
+however suffers heavily from a
+second difficulty: derivatives can
grow to arbitrarily big sizes.
For example if we start with the
regular expression $(a+aa)^*$ and take
@@ -1185,7 +1196,7 @@
\end{tabular}
\end{center}
-\noindent where after around 35 steps we run out of memory on a
+\noindent where after around 35 steps we usually run out of memory on a
typical computer. Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
-simplification rules that can be applied to regular regular
+simplification rules that can be applied to regular
@@ -1242,6 +1253,7 @@
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
-but deterministic finite automaton instead.
+but deterministic finite automata instead.
+We will also study this work in a later section.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
@@ -1256,8 +1268,8 @@
backtracking and error-prone matchers: a formally verified
regular expression lexing algorithm
that is both fast
-and correct by extending Ausaf et al.'s work.
-The end result is %a regular expression lexing algorithm that comes with
+and correct.
+The result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
an improved version of Sulzmann and Lu's bit-coded algorithm using
@@ -1268,10 +1280,10 @@
\item
a complexity-related property for that algorithm saying that the
internal data structure will
-remain finite,
+remain below a finite bound,
\item
-and extension to
-the bounded repetitions construct with the correctness and finiteness property
+and an extension to
+the bounded repetition constructs with the correctness and finiteness property
maintained.
\end{itemize}
\noindent
@@ -1285,34 +1297,33 @@
Further improvements to the algorithm with an even stronger version of
simplification can be made. We conjecture that the resulting size of derivatives
-can be bounded by a cubic bound w.r.t. the size of the regular expression.
+can be bounded by a cubic function of the size of the regular expression.
-
-
-
-
+We will give the relevant code in Scala,
+but do not provide a formal proof in Isabelle/HOL.
+This is still future work.
\section{Structure of the thesis}
In chapter \ref{Inj} we will introduce the concepts
and notations we
use for describing regular expressions and derivatives,
-and the first version of their lexing algorithm without bitcodes (including
+and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including
its correctness proof).
-We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
+We will give their second lexing algorithm with bitcodes in chapter \ref{Bitcoded1}
together with the correctness proof by Ausaf and Urban.
Then we illustrate in chapter \ref{Bitcoded2}
how Sulzmann and Lu's
-simplifications fail to simplify. We therefore introduce our version of the
-algorithm with simplification and
-its correctness proof .
+simplifications fail to simplify correctly. We therefore introduce our own version of the
+algorithm with correct simplifications and
+its correctness proof.
In chapter \ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is a finite bound on the size of any
regular expression's derivatives.
We also show how one can extend the
algorithm to include bounded repetitions.
In chapter \ref{Cubic} we discuss stronger simplification rules which
-improve the finite bound to a cubic bound.%and the NOT regular expression.
+improve the finite bound to a cubic bound. %and the NOT regular expression.
Chapter \ref{RelatedWork} introduces relevant work for this thesis.
-Chapter \ref{Future} concludes with avenues of future research.
+Chapter \ref{Future} concludes and mentions avenues of future research.