lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

equal deleted inserted replaced

-:17c7611fb0a9
+:4b1149fb5aec
 What is unexpected is that even in the cases
 not involving back-references, there is still
 a (non-negligible) chance they might backtrack super-linearly,
 as shown in the graphs in figure\ref{fig:aStarStarb}.
-\subsection{Summary of the Catastrophic Backtracking Problem}
 Summing these up, we can categorise existing
 practical regex libraries into two kinds:
-(i)The ones  with  linear
+(i) The ones  with  linear
-time guarantees like Go and Rust. The cost with them is that
+time guarantees like Go and Rust. The downside with them is that
 they impose restrictions
-on the user input (not allowing back-references,
+on the regular expressions (not allowing back-references,
-bounded repetitions cannot exceed a counter limit etc.).
+bounded repetitions cannot exceed an ad hoc limit etc.).
 (ii) Those
 that allow large bounded regular expressions and back-references
-at the expense of using a backtracking algorithm.
+at the expense of using backtracking algorithms.
-They could grind to a halt
+They can potentially ``grind to a halt''
-on some very simple cases, posing a vulnerability of
+on some very simple cases, resulting
-a ReDoS attack.
+ReDoS attacks.
+The proble with both approaches is the motivation for us
-We would like to have regex engines that can
+to look again at the regular expression matching problem.
-deal with the regular part (e.g.
+Another motivation is that regular expression matching algorithms
-bounded repetitions) of regexes more
+that follow the POSIX standard often contain errors and bugs
-efficiently.
+as we shall explain next.
-Also we want to make sure that they do it correctly.
-It turns out that such aim is not so easy to achieve.
+%We would like to have regex engines that can
+%deal with the regular part (e.g.
+%bounded repetitions) of regexes more
+%efficiently.
+%Also we want to make sure that they do it correctly.
+%It turns out that such aim is not so easy to achieve.
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
 % For example, the Rust regex engine claims to be linear,
 % but does not support lookarounds and back-references.
 % The GoLang regex library does not support over 1000 repetitions.
 % Java and Python both support back-references, but shows
 %, whereas $\NFA$s usually run into trouble
 %on the second phase.
 \section{Error-prone POSIX Implementations}
-When there are multiple ways of matching a string
+Very often there are multiple ways of matching a string
-with a regular expression, a matcher needs to
+with a regular expression.
+In such cases the regular expressions matcher needs to
 disambiguate.
-The standard for which particular match to pick
+The more widely used strategy is called POSIX,
-is called the disambiguation strategy.
+which roughly speaking always chooses the longest initial match.
-The more intuitive strategy is called POSIX,
+The POSIX strategy is widely adopted in many regular expression matchers.
-which always chooses the longest initial match.
-An alternative strategy would be greedy matches,
-which always ends a sub-match as early as possible.
-The POSIX standard is widely adopted in many operating systems.
 However, many implementations (including the C libraries
 used by Linux and OS X distributions) contain bugs
 or do not meet the specification they claim to adhere to.
-In some cases, they either fail to generate a lexing
+Kuklewicz maintains a unit test repository which lists some
+problems with existing regular expression engines \ref{KuklewiczHaskell}.
+In some cases, they either fail to generate a
 result when there exists a match,
-or give results that are inconsistent with the $\POSIX$ standard.
+or give results that are inconsistent with the POSIX standard.
-A concrete example would be the regex given by \cite{fowler2003}
+A concrete example is the regex:
 \begin{center}
 	$(aba + ab + a)^* \text{and the string} ababa$
 \end{center}
-The correct $\POSIX$ match for the above would be
+The correct POSIX match for the above
-with the entire string $ababa$,
+is the entire string $ababa$,
-split into two Kleene star iterations, $[ab] [aba]$ at positions
+split into two Kleene star iterations, namely $[ab], [aba]$ at positions
 $[0, 2), [2, 5)$
 respectively.
-But trying this out in regex101\parencite{regex101}
+But trying this out in regex101 \parencite{regex101} \footnote{
-with different language engines would yield
+	regex101 is an online regular expression matcher which
-the same two fragmented matches: $[aba]$ at $[0, 3)$
+	provides API for trying out regular expression
+	engines of multiple popular programming languages like
+Java, Python, Go, etc.}
+with different engines yields
+always matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
 commented that most regex libraries are not
-correctly implementing the POSIX (maximum-munch)
+correctly implementing the central POSIX
-rule of regular expression matching.
+rule, called the maximum munch rule.
-As Grathwohl\parencite{grathwohl2014crash} wrote,
+Grathwohl \parencite{grathwohl2014crash} wrote,
 \begin{quote}
 	``The POSIX strategy is more complicated than the
 	greedy because of the dependence on information about
 	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
-The implementation complexity of POSIX rules also come from
+We think the implementation complexity of POSIX rules also come from
-the specification being not very clear.
+the specification being not very precise.
 There are many informal summaries of this disambiguation
 strategy, which are often quite long and delicate.
 For example Kuklewicz \cite{KuklewiczHaskell}
-described the POSIX rule as
+described the POSIX rule as (section 1, last paragraph):
 \begin{quote}
-	``
 	\begin{itemize}
 		\item
 regular expressions (REs) take the leftmost starting match, and the longest match starting there
 earlier subpatterns have leftmost-longest priority over later subpatterns\\
 \item
 text of component subexpressions must be contained in the text of the
 higher-level subexpressions\\
 \item
 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
 \item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\''
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
 \end{itemize}
 \end{quote}
-The text above
+%The text above
-is trying to capture something very precise,
+%is trying to capture something very precise,
-and is crying out for formalising.
+%and is crying out for formalising.
 Ausaf et al. \cite{AusafDyckhoffUrban2016}
-are the first to fill the gap
+are the first to
-by not just describing such a formalised POSIX
+give a quite simple formalised POSIX
-specification in Isabelle/HOL, but also proving
+specification in Isabelle/HOL, and also prove
 that their specification coincides with the
-POSIX specification given by Okui and Suzuki \cite{Okui10}
+POSIX specification given by Okui and Suzuki \cite{Okui10}.
-which is a completely
-different characterisation.
 They then formally proved the correctness of
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
-based on that specification.
+with regards to that specification.
+They also found that the informal POSIX
-In the next section we will very briefly
+specification by Sulzmann and Lu does not work for the correctness proof.
+In the next section we will briefly
 introduce Brzozowski derivatives and Sulzmann
-and Lu's algorithm, which this thesis builds on.
+and Lu's algorithm, which the main point of this thesis builds on.
-We give a taste of what they
+%We give a taste of what they
-are like and why they are suitable for regular expression
+%are like and why they are suitable for regular expression
-matching and lexing.
+%matching and lexing.
+\section{Formal Specification of POSIX Matching
-\section{Our Solution--Formal Specification of POSIX Matching
 and Brzozowski Derivatives}
-Now we start with the central topic of the thesis: Brzozowski derivatives.
+%Now we start with the central topic of the thesis: Brzozowski derivatives.
 Brzozowski \cite{Brzozowski1964} first introduced the
-concept of the \emph{derivative} in the 1960s.
+concept of a \emph{derivative} of regular expression in 1964.
 The derivative of a regular expression $r$
-with respect to a character $c$, is written as $r \backslash c$.\footnote{
+with respect to a character $c$, is written as $r \backslash c$.
-	Despite having the same name, regular expression
+This operation tells us what $r$ transforms into
-	derivatives bear little similarity with the mathematical definition
+if we ``chop'' off the first character $c$
-	of derivatives on functions.
+from all strings in the language of $r$ (defined
-}
+later as $L \; r$).
-It tells us what $r$ would transform into
+%To give a flavour of Brzozowski derivatives, we present
-if we chop off the first character $c$
+%two straightforward clauses from it:
-from all strings in the language of $r$ ($L \; r$).
+%\begin{center}
-To give a flavour of Brzozowski derivatives, we present
+%	\begin{tabular}{lcl}
-two straightforward clauses from it:
+%		$d \backslash c$     & $\dn$ &
-\begin{center}
+%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-	\begin{tabular}{lcl}
+%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-		$d \backslash c$     & $\dn$ &
+%	\end{tabular}
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+%\end{center}
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+%\noindent
-	\end{tabular}
+%The first clause says that for the regular expression
-\end{center}
+%denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
-\noindent
+%we check the derivative character $c$ against $d$,
-The first clause says that for the regular expression
+%returning a set containing only the empty string $\{ [] \}$
-denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
+%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
-we check the derivative character $c$ against $d$,
+%The second clause states that to obtain the regular expression
-returning a set containing only the empty string $\{ [] \}$
+%representing all strings' head character $c$ being chopped off
-if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
+%from $r_1 + r_2$, one simply needs to recursively take derivative
-The second clause states that to obtain the regular expression
+%of $r_1$ and $r_2$ and then put them together.
-representing all strings' head character $c$ being chopped off
+Derivatives have the property
-from $r_1 + r_2$, one simply needs to recursively take derivative
-of $r_1$ and $r_2$ and then put them together.
-Thanks to the definition, derivatives have the nice property
 that $s \in L \; (r\backslash c)$ if and only if
-$c::s \in L \; r$.
+$c::s \in L \; r$ where $::$ stands for list prepending.
 %This property can be used on regular expressions
 %matching and lexing--to test whether a string $s$ is in $L \; r$,
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-Derivatives give a simple solution
+With this derivatives give a simple solution
-to the problem of matching and lexing a string $s$ with a regular
+to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then $r$ matches $s$ (and {\em vice versa}).
+%This makes formally reasoning about these properties such
-This makes formally reasoning about these properties such
+%as correctness and complexity smooth and intuitive.
-as correctness and complexity smooth and intuitive.
+There had been several mechanised proofs about this property in various theorem
-In fact, there has already been several mechanised proofs about them,
+provers,
-for example the one by Owens and Slind \cite{Owens2008} in HOL4,
+for example one by Owens and Slind \cite{Owens2008} in HOL4,
 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
-In addition, one can extend the clauses to bounded repetitions
+In addition, one can extend derivatives to bounded repetitions
-``for free'':
+relatively straightforwardly. For example, the derivative for
+this can be defined as:
 \begin{center}
 	\begin{tabular}{lcl}
 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
 		r^{\{n-1\}}$\\
 	\end{tabular}
 \end{center}
 \noindent
-And experimental results suggest that  unlike DFA-based solutions,
+Experimental results suggest that  unlike DFA-based solutions
-this derivatives can support
+for bounded regular expressions,
-bounded regular expressions with large counters
+derivatives can cope
+large counters
 quite well.
 There has also been
 extensions to other constructs.
 For example, Owens et al include the derivatives
-for \emph{NOT} regular expressions, which is
+for the \emph{NOT} regular expression, which is
 able to concisely express C-style comments of the form
-$/* \ldots */$.
+$/* \ldots */$ (see \cite{Owens2008}).
-Another extension for derivatives would be
+Another extension for derivatives is
-regular expressions with look-aheads, done by
+regular expressions with look-aheads, done
 by Miyazaki and Minamide
 \cite{Takayuki2019}.
 %We therefore use Brzozowski derivatives on regular expressions
 %lexing
 in the functional programming and theorem prover communities in the last
 fifteen or so years (
 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
 \cite{Chen12} and \cite{Coquand2012}
 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
-after they were first published.
+after they were first published by Brzozowski.
 However, there are two difficulties with derivative-based matchers:
 First, Brzozowski's original matcher only generates a yes/no answer
 for whether a regular expression matches a string or not.  This is too
 is actually $\Omega(2^n)$ in the worst case.
 A similar claim about a theoretical runtime of $O(n^2)$
 is made for the Verbatim \cite{Verbatim}
 %TODO: give references
 lexer, which calculates POSIX matches and is based on derivatives.
-They formalized the correctness of the lexer, but not the complexity.
+They formalized the correctness of the lexer, but not their complexity result.
-In the performance evaluation section, they simply analyzed the run time
+In the performance evaluation section, they analyzed the run time
 of matching $a$ with the string
 \begin{center}
-	$\underbrace{a \ldots a}_{\text{n a's}}$
+	$\underbrace{a \ldots a}_{\text{n a's}}$.
 \end{center}
-and concluded that the algorithm is quadratic in terms of input length.
+\noindent
+They concluded that the algorithm is quadratic in terms of
+the length of the input string.
 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
-the time it took to lex only 40 $a$'s was 5 minutes.
+the time it took to match a string of 40 $a$'s was approximately 5 minutes.
 \subsection{Sulzmann and Lu's Algorithm}
 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
-difficulty by cleverly extending Brzozowski's matching
+problem with the yes/no answer
+by cleverly extending Brzozowski's matching
 algorithm. Their extended version generates additional information on
 \emph{how} a regular expression matches a string following the POSIX
 rules for regular expression matching. They achieve this by adding a
 second ``phase'' to Brzozowski's algorithm involving an injection
-function simplification of internal data structures
+function.
-eliminating the exponential behaviours.
+In earlier work, Ausaf et al provided the formal
-In an earlier work, Ausaf et al provided the formal
 specification of what POSIX matching means and proved in Isabelle/HOL
 the correctness
-of Sulzmann and Lu's extended algorithm accordingly
+of this extended algorithm accordingly
 \cite{AusafDyckhoffUrban2016}.
 The version of the algorithm proven correct
-suffers from the
+suffers heavily from a
-second difficulty though, where the internal derivatives can
+second difficulty, where the internal derivatives can
 grow to arbitrarily big sizes.
 For example if we start with the
 regular expression $(a+aa)^*$ and take
 successive derivatives according to the character $a$, we end up with
 a sequence of ever-growing derivatives like
 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
 \end{tabular}
 \end{center}
 \noindent where after around 35 steps we run out of memory on a
-typical computer (we shall define in the next chapter
+typical computer.  Clearly, the
-the precise details of our
-regular expressions and the derivative operation).  Clearly, the
 notation involving $\ZERO$s and $\ONE$s already suggests
 simplification rules that can be applied to regular regular
 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
-r$. While such simple-minded simplifications have been proved in our
+r$. While such simple-minded simplifications have been proved in
-earlier work to preserve the correctness of Sulzmann and Lu's
+the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
 \emph{not} help with limiting the growth of the derivatives shown
 above: the growth is slowed, but the derivatives can still grow rather
 quickly beyond any finite bound.
-Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
+Therefore we want to look in this thesis at a second
-\cite{Sulzmann2014} where they introduce bit-coded
+algorithm by Sulzmann and Lu where they
-regular expressions. In this version, POSIX values are
+overcame this ``growth problem'' \cite{Sulzmann2014}.
+In this version, POSIX values are
 represented as bit sequences and such sequences are incrementally generated
 when derivatives are calculated. The compact representation
 of bit sequences and regular expressions allows them to define a more
 ``aggressive'' simplification method that keeps the size of the
 derivatives finite no matter what the length of the string is.
 method [..] in combination with the simplification steps [..]
 yields POSIX parse trees. We have tested this claim
 extensively [..] but yet
 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}
-Ausaf and Urban were able to back this correctness claim with
+Ausaf and Urban made some initial progress towards the
-a formal proof.
+full correctness proof but still had to leave out the optimisation
+Sulzmann and Lu proposed.
-However a faster formally verified
+Ausaf  wrote \cite{Ausaf},
-lexing program with the optimisations
-mentioned by Sulzmann and Lu's second algorithm
-is still missing.
-As they stated,
 \begin{quote}\it
 ``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
 \end{quote}
 This thesis implements the aggressive simplifications envisioned
 by Ausaf and Urban,
-together with a formal proof of the correctness with those simplifications.
+together with a formal proof of the correctness of those simplifications.
 One of the most recent work in the context of lexing
 %with this issue
 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
 remain finite,
 \item
 and extension to
 the bounded repetitions construct with the correctness and finiteness property
 maintained.
 \end{itemize}
+\noindent
 With a formal finiteness bound in place,
 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
 Further improvements to the algorithm with an even stronger version of
-simplification is made.
+simplification can be made.
-Thanks to our theorem-prover-friendly approach,
-we believe that
-this finiteness bound can be improved to a bound
-linear to input and
-cubic to the regular expression size using a technique by
-Antimirov\cite{Antimirov95}.
+\section{Structure of the thesis}
-Once formalised, this would be a guarantee for the absence of all super-linear behavious.
+In chapter \ref{Inj} we will introduce the concepts
-We are working out the
+and notations we
-details.
+use for describing regular expressions and derivatives,
+and the first version of their lexing algorithm without bitcodes (including
+its correctness proof).
+We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
+together with the correctness proof by Ausaf and Urban.
+Then we illustrate in chapter \ref{Bitcoded2}
+how Sulzmann and Lu's
+simplifications fail to simplify. We therefore introduce our version of the
+algorithm with simplification and
+its correctness proof .
+In chapter \ref{Finite} we give the second guarantee
+of our bitcoded algorithm, that is a finite bound on the size of any
+regular expression's derivatives.
+In chapter \ref{Cubic} we discuss stronger simplification rules which
+improve the finite bound to a polynomial bound, and also show how one can extend the
+algorithm to include bounded repetitions. %and the NOT regular expression.
-To our best knowledge, no lexing libraries using Brzozowski derivatives
-have similar complexity-related bounds,
-and claims about running time are usually speculative and backed by empirical
-evidence on a few test cases.
-If a matching or lexing algorithm
-does not come with certain basic complexity related
-guarantees (for examaple the internal data structure size
-does not grow indefinitely),
-then they cannot claim with confidence having solved the problem
-of catastrophic backtracking.
-\section{Structure of the thesis}
-In chapter 2 \ref{Inj} we will introduce the concepts
-and notations we
-use for describing the lexing algorithm by Sulzmann and Lu,
-and then give the lexing algorithm.
-We will give its variant in \ref{Bitcoded1}.
-Then we illustrate in \ref{Bitcoded2}
-how the algorithm without bitcodes falls short for such aggressive
-simplifications and therefore introduce our version of the
-bit-coded algorithm and
-its correctness proof .
-In \ref{Finite} we give the second guarantee
-of our bitcoded algorithm, that is a finite bound on the size of any
-regex's derivatives.
-In \ref{Cubic} we discuss stronger simplifications to improve the finite bound
-in \ref{Finite} to a polynomial one, and demonstrate how one can extend the
-algorithm to include constructs such as bounded repetitions and negations.
 %----------------------------------------------------------------------------------------

changeset 622	4b1149fb5aec
parent 620	ae6010c14e49
child 628	7af4e2420a8c