lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

equal deleted inserted replaced

-:e3752aac8ec2
+:dd9dde2d902b
 detection systems that inspect suspicious traffic; or compiler
 front ends.
 Given their usefulness and ubiquity, one would assume that
 modern regular expression matching implementations
 are mature and fully studied.
-Indeed, in a popular programming language's regular expression engine,
+Indeed, in many popular programming languages' regular expression engines,
-supplying it with regular expressions and strings,
+one can supply a regular expression and a string, and in most cases get
-in most cases one can
 get the matching information in a very short time.
-Those engines can be blindingly fast--some
+Those engines can sometimes be blindingly fast--some
 network intrusion detection systems
 use regular expression engines that are able to process
 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can sometimes exhibit a surprising security vulnerability
+However, those engines can sometimes also exhibit a surprising security vulnerability
 under a certain class of inputs.
 %However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
-Consider for example $(a^*)^*\,b$ and
+Consider for example the regular expression $(a^*)^*\,b$ and
-strings of the form $aa..a$, these strings cannot be matched by this regular
+strings of the form $aa..a$. These strings cannot be matched by this regular
 expression: Obviously the expected $b$ in the last
 position is missing. One would assume that modern regular expression
 matching engines can find this out very quickly. Surprisingly, if one tries
 this example in JavaScript, Python or Java 8, even with small strings,
 say of lenght of around 30 $a$'s,
 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
-This is clearly exponential behaviour, and as can be seen
+The algorithms clearly show exponential behaviour, and as can be seen
 is triggered by some relatively simple regular expressions.
 Java 9 and newer
 versions improve this behaviour somewhat, but are still slow compared
 with the approach we are going to use in this thesis.
 This superlinear blowup in regular expression engines
-has caused grief in ``real life'' in the past where it is
+has caused grief in ``real life'' where it is
 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
 For example, on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
 attacks.
 Davis et al. \cite{Davis18} detected more
 than 1000 evil regular expressions
 in Node.js, Python core libraries, npm and pypi.
 They therefore concluded that evil regular expressions
-are real problems rather than just "a parlour trick".
+are a real problem rather than just "a parlour trick".
-This work aims to address this issue
+The work in this thesis aims to address this issue
 with the help of formal proofs.
 We describe a lexing algorithm based
 on Brzozowski derivatives with verified correctness
 and a finiteness property for the size of derivatives
 (which are all done in Isabelle/HOL).
 have bounded regular expressions in them.
 Often the counters are quite large, with the largest being
 close to ten million.
 A smaller sample XSD they gave
 is:
-\begin{verbatim}
+\lstset{
+	basicstyle=\fontsize{8.5}{9}\ttfamily,
+language=XML,
+morekeywords={encoding,
+xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
+}
+\begin{lstlisting}
 <sequence minOccurs="0" maxOccurs="65535">
-<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
-<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
 </sequence>
-\end{verbatim}
+\end{lstlisting}
 This can be seen as the regex
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions
 satisfying certain constraints (such as
 satisfying the floating point number format).
 \end{center}
 The sub-expressions
 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
 by 1 to 4, and can be ``referred back'' by their respective numbers.
 %These sub-expressions are called "capturing groups".
-To do so, we use the syntax $\backslash i$
+To do so, one uses the syntax $\backslash i$
 to denote that we want the sub-string
 of the input matched by the i-th
 sub-expression to appear again,
 exactly the same as it first appeared:
 \begin{center}
 $(.)(.)\backslash 2\backslash 1$
 \end{center}
 which matches four-character palindromes
 like $abba$, $x??x$ and so on.
-Back-references is a regex construct
+Back-references are a regex construct
 that programmers find quite useful.
 According to Becchi and Crawley \cite{Becchi08},
 6\% of Snort rules (up until 2008) use them.
 The most common use of back-references
 is to express well-formed html files,
 with a regular expression.
 In such cases the regular expressions matcher needs to
 disambiguate.
 The more widely used strategy is called POSIX,
 which roughly speaking always chooses the longest initial match.
-The POSIX strategy is widely adopted in many regular expression matchers.
+The POSIX strategy is widely adopted in many regular expression matchers
+because it is a natural strategy for lexers.
 However, many implementations (including the C libraries
 used by Linux and OS X distributions) contain bugs
 or do not meet the specification they claim to adhere to.
 Kuklewicz maintains a unit test repository which lists some
 problems with existing regular expression engines \cite{KuklewiczHaskell}.
 involves the entire string $ababa$,
 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
 $[0, 2), [2, 5)$
 respectively.
 But feeding this example to the different engines
-in regex101 \parencite{regex101} \footnote{
+listed at regex101 \footnote{
 	regex101 is an online regular expression matcher which
 	provides API for trying out regular expression
 	engines of multiple popular programming languages like
-Java, Python, Go, etc.}
+Java, Python, Go, etc.} \parencite{regex101}.
 yields
 only two incomplete matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
 commented that most regex libraries are not
 correctly implementing the central POSIX
 rule, called the maximum munch rule.
-Grathwohl \parencite{grathwohl2014crash} wrote,
+Grathwohl \parencite{grathwohl2014crash} wrote:
-\begin{quote}
+\begin{quote}\it
 	``The POSIX strategy is more complicated than the
 	greedy because of the dependence on information about
 	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
 We think the implementation complexity of POSIX rules also come from
 the specification being not very precise.
-The developers of the regexp package of GO
+The developers of the regexp package of Go
 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
 commented that
 \begin{quote}\it
 ``
 The POSIX rule is computationally prohibitive
 %and is crying out for formalising.
 Ausaf et al. \cite{AusafDyckhoffUrban2016}
 are the first to
 give a quite simple formalised POSIX
 specification in Isabelle/HOL, and also prove
-that their specification coincides with the
+that their specification coincides with an earlier (unformalised)
 POSIX specification given by Okui and Suzuki \cite{Okui10}.
 They then formally proved the correctness of
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
 with regards to that specification.
 They also found that the informal POSIX
-specification by Sulzmann and Lu needs to be revised for the correctness proof.
+specification by Sulzmann and Lu needs to be substantially
+revised in order for the correctness proof to go through.
+Their original specification and proof were unfixable
+according to Ausaf et al.
 In the next section, we will briefly
 introduce Brzozowski derivatives and Sulzmann
 and Lu's algorithm, which the main point of this thesis builds on.
 %We give a taste of what they
 Brzozowski \cite{Brzozowski1964} first introduced the
 concept of a \emph{derivative} of regular expression in 1964.
 The derivative of a regular expression $r$
 with respect to a character $c$, is written as $r \backslash c$.
 This operation tells us what $r$ transforms into
-if we ``chop'' off the first character $c$
+if we ``chop'' off a particular character $c$
 from all strings in the language of $r$ (defined
 later as $L \; r$).
 %To give a flavour of Brzozowski derivatives, we present
 %two straightforward clauses from it:
 %\begin{center}
 %This property can be used on regular expressions
 %matching and lexing--to test whether a string $s$ is in $L \; r$,
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-With this property, derivatives give a simple solution
+With this property, derivatives can give a simple solution
 to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then $r$ matches $s$ (and {\em vice versa}).
 %This makes formally reasoning about these properties such
 %as correctness and complexity smooth and intuitive.
-There had been several mechanised proofs about this property in various theorem
+There are several mechanised proofs of this property in various theorem
 provers,
 for example one by Owens and Slind \cite{Owens2008} in HOL4,
 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
 relatively straightforwardly. For example, the derivative for
 this can be defined as:
 \begin{center}
 	\begin{tabular}{lcl}
 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
-		r^{\{n-1\}}$\\
+		r^{\{n-1\}} (\text{when} n > 0)$\\
 	\end{tabular}
 \end{center}
 \noindent
 Experimental results suggest that  unlike DFA-based solutions
 for bounded regular expressions,
 derivatives can cope
 large counters
 quite well.
-There has also been
+There have also been
-extensions to other constructs.
+extensions of derivatives to other regex constructs.
 For example, Owens et al include the derivatives
 for the \emph{NOT} regular expression, which is
 able to concisely express C-style comments of the form
 $/* \ldots */$ (see \cite{Owens2008}).
 Another extension for derivatives is
 Given the above definitions and properties of
 Brzozowski derivatives, one quickly realises their potential
-in generating a formally verified algorithm for lexing--the clauses and property
+in generating a formally verified algorithm for lexing: the clauses and property
 can be easily expressed in a functional programming language
 or converted to theorem prover
-code, with great extensibility.
+code, with great ease.
-Perhaps this is the reason why it has sparked quite a bit of interest
+Perhaps this is the reason why derivatives have sparked quite a bit of interest
 in the functional programming and theorem prover communities in the last
 fifteen or so years (
 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
 \cite{Chen12} and \cite{Coquand2012}
 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
 First, Brzozowski's original matcher only generates a yes/no answer
 for whether a regular expression matches a string or not.  This is too
 little information in the context of lexing where separate tokens must
 be identified and also classified (for example as keywords
 or identifiers).
-Second, derivative-based matchers need to be more efficient.
+Second, derivative-based matchers need to be more efficient in terms
+of the sizes of derivatives.
 Elegant and beautiful
 as many implementations are,
 they can be excruciatingly slow.
 For example, Sulzmann and Lu
 claim a linear running time of their proposed algorithm,
 	$\underbrace{a \ldots a}_{\text{n a's}}$.
 \end{center}
 \noindent
 They concluded that the algorithm is quadratic in terms of
 the length of the input string.
-When we tried out their extracted OCaml code with our example $(a+aa)^*$,
+When we tried out their extracted OCaml code with the example $(a+aa)^*$,
 the time it took to match a string of 40 $a$'s was approximately 5 minutes.
 \subsection{Sulzmann and Lu's Algorithm}
 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
 by cleverly extending Brzozowski's matching
 algorithm. Their extended version generates additional information on
 \emph{how} a regular expression matches a string following the POSIX
 rules for regular expression matching. They achieve this by adding a
 second ``phase'' to Brzozowski's algorithm involving an injection
-function.
+function. This injection function in a sense undoes the ``damage''
+of the derivatives chopping off characters.
 In earlier work, Ausaf et al provided the formal
 specification of what POSIX matching means and proved in Isabelle/HOL
 the correctness
 of this extended algorithm accordingly
 \cite{AusafDyckhoffUrban2016}.
 The version of the algorithm proven correct
-suffers heavily from a
+suffers however heavily from a
-second difficulty, where the internal derivatives can
+second difficulty, where derivatives can
 grow to arbitrarily big sizes.
 For example if we start with the
 regular expression $(a+aa)^*$ and take
 successive derivatives according to the character $a$, we end up with
 a sequence of ever-growing derivatives like
 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
 \end{tabular}
 \end{center}
-\noindent where after around 35 steps we run out of memory on a
+\noindent where after around 35 steps we usually run out of memory on a
 typical computer.  Clearly, the
 notation involving $\ZERO$s and $\ONE$s already suggests
 simplification rules that can be applied to regular regular
 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
 This is relevant work for us and we will compare it later with
 our derivative-based matcher we are going to present.
 There is also some newer work called
 Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
 but deterministic finite automaton instead.
+We will also study this work in a later section.
 %An example that gives problem to automaton approaches would be
 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
 %It requires at least $2^{n+1}$ states to represent
 %as a DFA.
 In this thesis,
 we propose a solution to catastrophic
 backtracking and error-prone matchers: a formally verified
 regular expression lexing algorithm
 that is both fast
-and correct by extending Ausaf et al.'s work.
+and correct.
-The end result is %a regular expression lexing algorithm that comes with
+The result is %a regular expression lexing algorithm that comes with
 \begin{itemize}
 \item
 an improved version of  Sulzmann and Lu's bit-coded algorithm using
 derivatives with simplifications,
 accompanied by
 a proven correctness theorem according to POSIX specification
 given by Ausaf et al. \cite{AusafDyckhoffUrban2016},
 \item
 a complexity-related property for that algorithm saying that the
 internal data structure will
-remain finite,
+remain below a finite bound,
 \item
-and extension to
+and an extension to
-the bounded repetitions construct with the correctness and finiteness property
+the bounded repetition constructs with the correctness and finiteness property
 maintained.
 \end{itemize}
 \noindent
 With a formal finiteness bound in place,
 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
 \url{https://github.com/hellotommmy/posix}
 \end{center}
 Further improvements to the algorithm with an even stronger version of
 simplification can be made. We conjecture that the resulting size of derivatives
 can be bounded by a cubic bound w.r.t. the size of the regular expression.
+We will give relevant code in Scala,
+but do not give a formal proof for that in Isabelle/HOL.
+This is still future work.
 \section{Structure of the thesis}
 In chapter \ref{Inj} we will introduce the concepts
 and notations we
 use for describing regular expressions and derivatives,
-and the first version of their lexing algorithm without bitcodes (including
+and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including
 its correctness proof).
 We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
 together with the correctness proof by Ausaf and Urban.
 Then we illustrate in chapter \ref{Bitcoded2}
 how Sulzmann and Lu's
-simplifications fail to simplify. We therefore introduce our version of the
+simplifications fail to simplify correctly. We therefore introduce our own version of the
-algorithm with simplification and
+algorithm with correct simplifications and
-its correctness proof .
+their correctness proof.
 In chapter \ref{Finite} we give the second guarantee
 of our bitcoded algorithm, that is a finite bound on the size of any
 regular expression's derivatives.
 We also show how one can extend the
 algorithm to include bounded repetitions.
 In chapter \ref{Cubic} we discuss stronger simplification rules which
-improve the finite bound to a cubic bound.%and the NOT regular expression.
+improve the finite bound to a cubic bound. %and the NOT regular expression.
 Chapter \ref{RelatedWork} introduces relevant work for this thesis.
-Chapter \ref{Future} concludes with avenues of future research.
+Chapter \ref{Future} concludes and mentions avenues of future research.

changeset 638	dd9dde2d902b
parent 637	e3752aac8ec2
child 646	56057198e4f5