ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 638 dd9dde2d902b
parent 637 e3752aac8ec2
child 646 56057198e4f5
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Dec 30 01:52:32 2022 +0000
@@ -217,29 +217,28 @@
 Given their usefulness and ubiquity, one would assume that
 modern regular expression matching implementations
 are mature and fully studied.
-Indeed, in a popular programming language's regular expression engine, 
-supplying it with regular expressions and strings,
-in most cases one can
+Indeed, in many popular programming languages' regular expression engines, 
+one can supply a regular expression and a string, and in most cases get
 get the matching information in a very short time.
-Those engines can be blindingly fast--some 
+Those engines can sometimes be blindingly fast--some 
 network intrusion detection systems
 use regular expression engines that are able to process 
 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can sometimes exhibit a surprising security vulnerability
+However, those engines can sometimes also exhibit a surprising security vulnerability
 under a certain class of inputs.
 %However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
 
-Consider for example $(a^*)^*\,b$ and 
-strings of the form $aa..a$, these strings cannot be matched by this regular
+Consider for example the regular expression $(a^*)^*\,b$ and 
+strings of the form $aa..a$. These strings cannot be matched by this regular
 expression: Obviously the expected $b$ in the last
 position is missing. One would assume that modern regular expression
 matching engines can find this out very quickly. Surprisingly, if one tries
 this example in JavaScript, Python or Java 8, even with small strings, 
 say of lenght of around 30 $a$'s,
 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
-This is clearly exponential behaviour, and as can be seen
+The algorithms clearly show exponential behaviour, and as can be seen
 is triggered by some relatively simple regular expressions.
 Java 9 and newer
 versions improve this behaviour somewhat, but are still slow compared 
@@ -248,7 +247,7 @@
 
 
 This superlinear blowup in regular expression engines
-has caused grief in ``real life'' in the past where it is 
+has caused grief in ``real life'' where it is 
 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
 For example, on 20 July 2016 one evil
 regular expression brought the webpage
@@ -423,9 +422,9 @@
 than 1000 evil regular expressions
 in Node.js, Python core libraries, npm and pypi. 
 They therefore concluded that evil regular expressions
-are real problems rather than just "a parlour trick".
+are a real problem rather than just "a parlour trick".
 
-This work aims to address this issue
+The work in this thesis aims to address this issue
 with the help of formal proofs.
 We describe a lexing algorithm based
 on Brzozowski derivatives with verified correctness 
@@ -627,12 +626,18 @@
 close to ten million. 
 A smaller sample XSD they gave
 is:
-\begin{verbatim}
+\lstset{
+	basicstyle=\fontsize{8.5}{9}\ttfamily,
+  language=XML,
+  morekeywords={encoding,
+    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
+}
+\begin{lstlisting}
 <sequence minOccurs="0" maxOccurs="65535">
- <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
- <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
 </sequence>
-\end{verbatim}
+\end{lstlisting}
 This can be seen as the regex
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions 
@@ -787,7 +792,7 @@
 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
 by 1 to 4, and can be ``referred back'' by their respective numbers. 
 %These sub-expressions are called "capturing groups".
-To do so, we use the syntax $\backslash i$ 
+To do so, one uses the syntax $\backslash i$ 
 to denote that we want the sub-string 
 of the input matched by the i-th
 sub-expression to appear again, 
@@ -836,7 +841,7 @@
 which matches four-character palindromes
 like $abba$, $x??x$ and so on.
 
-Back-references is a regex construct 
+Back-references are a regex construct 
 that programmers find quite useful.
 According to Becchi and Crawley \cite{Becchi08},
 6\% of Snort rules (up until 2008) use them.
@@ -928,7 +933,8 @@
 disambiguate.
 The more widely used strategy is called POSIX,
 which roughly speaking always chooses the longest initial match.
-The POSIX strategy is widely adopted in many regular expression matchers.
+The POSIX strategy is widely adopted in many regular expression matchers
+because it is a natural strategy for lexers.
 However, many implementations (including the C libraries
 used by Linux and OS X distributions) contain bugs
 or do not meet the specification they claim to adhere to.
@@ -947,11 +953,11 @@
 $[0, 2), [2, 5)$
 respectively.
 But feeding this example to the different engines
-in regex101 \parencite{regex101} \footnote{
+listed at regex101 \footnote{
 	regex101 is an online regular expression matcher which
 	provides API for trying out regular expression
 	engines of multiple popular programming languages like
-Java, Python, Go, etc.}
+Java, Python, Go, etc.} \parencite{regex101}. 
 yields
 only two incomplete matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
@@ -959,8 +965,8 @@
 commented that most regex libraries are not
 correctly implementing the central POSIX
 rule, called the maximum munch rule.
-Grathwohl \parencite{grathwohl2014crash} wrote,
-\begin{quote}
+Grathwohl \parencite{grathwohl2014crash} wrote:
+\begin{quote}\it
 	``The POSIX strategy is more complicated than the 
 	greedy because of the dependence on information about 
 	the length of matched strings in the various subexpressions.''
@@ -968,7 +974,7 @@
 %\noindent
 We think the implementation complexity of POSIX rules also come from
 the specification being not very precise.
-The developers of the regexp package of GO 
+The developers of the regexp package of Go 
 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
 commented that
 \begin{quote}\it
@@ -1008,13 +1014,16 @@
 are the first to
 give a quite simple formalised POSIX
 specification in Isabelle/HOL, and also prove 
-that their specification coincides with the 
+that their specification coincides with an earlier (unformalised) 
 POSIX specification given by Okui and Suzuki \cite{Okui10}.
 They then formally proved the correctness of
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
 with regards to that specification.
 They also found that the informal POSIX
-specification by Sulzmann and Lu needs to be revised for the correctness proof.
+specification by Sulzmann and Lu needs to be substantially 
+revised in order for the correctness proof to go through.
+Their original specification and proof were unfixable
+according to Ausaf et al.
 
 In the next section, we will briefly
 introduce Brzozowski derivatives and Sulzmann
@@ -1030,7 +1039,7 @@
 The derivative of a regular expression $r$
 with respect to a character $c$, is written as $r \backslash c$.
 This operation tells us what $r$ transforms into
-if we ``chop'' off the first character $c$ 
+if we ``chop'' off a particular character $c$ 
 from all strings in the language of $r$ (defined
 later as $L \; r$).
 %To give a flavour of Brzozowski derivatives, we present
@@ -1060,14 +1069,14 @@
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-With this property, derivatives give a simple solution
+With this property, derivatives can give a simple solution
 to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then $r$ matches $s$ (and {\em vice versa}).  
 %This makes formally reasoning about these properties such
 %as correctness and complexity smooth and intuitive.
-There had been several mechanised proofs about this property in various theorem
+There are several mechanised proofs of this property in various theorem
 provers,
 for example one by Owens and Slind \cite{Owens2008} in HOL4,
 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
@@ -1079,7 +1088,7 @@
 \begin{center}
 	\begin{tabular}{lcl}
 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
-		r^{\{n-1\}}$\\
+		r^{\{n-1\}} (\text{when} n > 0)$\\
 	\end{tabular}
 \end{center}
 \noindent
@@ -1089,8 +1098,8 @@
 large counters
 quite well.
 
-There has also been 
-extensions to other constructs.
+There have also been 
+extensions of derivatives to other regex constructs.
 For example, Owens et al include the derivatives
 for the \emph{NOT} regular expression, which is
 able to concisely express C-style comments of the form
@@ -1106,11 +1115,11 @@
 
 Given the above definitions and properties of
 Brzozowski derivatives, one quickly realises their potential
-in generating a formally verified algorithm for lexing--the clauses and property
+in generating a formally verified algorithm for lexing: the clauses and property
 can be easily expressed in a functional programming language 
 or converted to theorem prover
-code, with great extensibility.
-Perhaps this is the reason why it has sparked quite a bit of interest
+code, with great ease.
+Perhaps this is the reason why derivatives have sparked quite a bit of interest
 in the functional programming and theorem prover communities in the last
 fifteen or so years (
 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
@@ -1125,7 +1134,8 @@
 little information in the context of lexing where separate tokens must
 be identified and also classified (for example as keywords
 or identifiers). 
-Second, derivative-based matchers need to be more efficient.
+Second, derivative-based matchers need to be more efficient in terms
+of the sizes of derivatives.
 Elegant and beautiful
 as many implementations are,
 they can be excruciatingly slow. 
@@ -1146,7 +1156,7 @@
 \noindent
 They concluded that the algorithm is quadratic in terms of 
 the length of the input string.
-When we tried out their extracted OCaml code with our example $(a+aa)^*$,
+When we tried out their extracted OCaml code with the example $(a+aa)^*$,
 the time it took to match a string of 40 $a$'s was approximately 5 minutes.
 
 
@@ -1158,16 +1168,17 @@
 \emph{how} a regular expression matches a string following the POSIX
 rules for regular expression matching. They achieve this by adding a
 second ``phase'' to Brzozowski's algorithm involving an injection
-function.
+function. This injection function in a sense undoes the ``damage''
+of the derivatives chopping off characters.
 In earlier work, Ausaf et al provided the formal
 specification of what POSIX matching means and proved in Isabelle/HOL
 the correctness
 of this extended algorithm accordingly
 \cite{AusafDyckhoffUrban2016}.
 
-The version of the algorithm proven correct 
-suffers heavily from a 
-second difficulty, where the internal derivatives can
+The version of the algorithm proven correct
+suffers however heavily from a 
+second difficulty, where derivatives can
 grow to arbitrarily big sizes. 
 For example if we start with the
 regular expression $(a+aa)^*$ and take
@@ -1185,7 +1196,7 @@
 \end{tabular}
 \end{center}
  
-\noindent where after around 35 steps we run out of memory on a
+\noindent where after around 35 steps we usually run out of memory on a
 typical computer.  Clearly, the
 notation involving $\ZERO$s and $\ONE$s already suggests
 simplification rules that can be applied to regular regular
@@ -1242,6 +1253,7 @@
 There is also some newer work called
 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
 but deterministic finite automaton instead.
+We will also study this work in a later section.
 %An example that gives problem to automaton approaches would be
 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
 %It requires at least $2^{n+1}$ states to represent
@@ -1256,8 +1268,8 @@
 backtracking and error-prone matchers: a formally verified
 regular expression lexing algorithm
 that is both fast
-and correct by extending Ausaf et al.'s work.
-The end result is %a regular expression lexing algorithm that comes with 
+and correct. 
+The result is %a regular expression lexing algorithm that comes with 
 \begin{itemize}
 \item
 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
@@ -1268,10 +1280,10 @@
 \item 
 a complexity-related property for that algorithm saying that the 
 internal data structure will
-remain finite,
+remain below a finite bound,
 \item
-and extension to
-the bounded repetitions construct with the correctness and finiteness property
+and an extension to
+the bounded repetition constructs with the correctness and finiteness property
 maintained.
 \end{itemize}
 \noindent
@@ -1285,34 +1297,33 @@
 Further improvements to the algorithm with an even stronger version of 
 simplification can be made. We conjecture that the resulting size of derivatives
 can be bounded by a cubic bound w.r.t. the size of the regular expression.
-
-
-
-
+We will give relevant code in Scala, 
+but do not give a formal proof for that in Isabelle/HOL.
+This is still future work.
 
 
 \section{Structure of the thesis}
 In chapter \ref{Inj} we will introduce the concepts
 and notations we 
 use for describing regular expressions and derivatives,
-and the first version of their lexing algorithm without bitcodes (including 
+and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including 
 its correctness proof).
 We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
 together with the correctness proof by Ausaf and Urban.
 Then we illustrate in chapter \ref{Bitcoded2}
 how Sulzmann and Lu's
-simplifications fail to simplify. We therefore introduce our version of the
-algorithm with simplification and 
-its correctness proof .  
+simplifications fail to simplify correctly. We therefore introduce our own version of the
+algorithm with correct simplifications and 
+their correctness proof.  
 In chapter \ref{Finite} we give the second guarantee
 of our bitcoded algorithm, that is a finite bound on the size of any 
 regular expression's derivatives. 
 We also show how one can extend the
 algorithm to include bounded repetitions. 
 In chapter \ref{Cubic} we discuss stronger simplification rules which 
-improve the finite bound to a cubic bound.%and the NOT regular expression.
+improve the finite bound to a cubic bound. %and the NOT regular expression.
 Chapter \ref{RelatedWork} introduces relevant work for this thesis.
-Chapter \ref{Future} concludes with avenues of future research.
+Chapter \ref{Future} concludes and mentions avenues of future research.