lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

equal deleted inserted replaced

-:37b6fd310a16
+:61139fdddae0
 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
 of formal languages.
 Solving the back-reference expressions matching problem
-is NP-complete\parencite{alfred2014algorithms}.
+is known to be NP-complete \parencite{alfred2014algorithms}.
 A non-bactracking,
 efficient solution is not known to exist.
 Regex libraries supporting back-references such as
 PCRE \cite{pcre} therefore have to
-revert to a depth-first search algorithm that backtracks.
+revert to a depth-first search algorithm which backtracks.
-The unreasonable part with them, is that even in the case of
+What is unexpected is that even in the cases
-regexes not involving back-references, there is still
+not involving back-references, there is still
-a (non-negligible) chance they might backtrack super-linearly.
+a (non-negligible) chance they might backtrack super-linearly,
+as shown in the graphs in \ref{fig:aStarStarb}.
 \subsection{Summary of the Catastrophic Backtracking Problem}
 Summing these up, we can categorise existing
 practical regex libraries into two kinds:
 (i)The ones  with  linear
 \end{quote}
 The text above
 is trying to capture something very precise,
 and is crying out for formalising.
 Ausaf et al. \cite{AusafDyckhoffUrban2016}
-are the first to describe such a formalised POSIX
+are the first to fill the gap
-specification in Isabelle/HOL.
+by not just describing such a formalised POSIX
+specification in Isabelle/HOL, but also proving
+that their specification coincides with the
+POSIX specification given by Okui and Suzuki \cite{Okui10}
+which is a completely
+different characterisation.
 They then formally proved the correctness of
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
 based on that specification.
-In this thesis,
-we propose a solution to catastrophic
-backtracking and error-prone matchers--a formally verified
-regular expression lexing algorithm
-that is both fast
-and correct by extending Ausaf et al.'s work.
-The end result is a regular expression lexing algorithm that comes with
-\begin{itemize}
-\item
-a proven correctness theorem according to POSIX specification
-given by Ausaf et al. \cite{AusafDyckhoffUrban2016},
-\item
-a proven complexity-related property saying that the
-algorithm's internal data structure will
-remain finite,
-\item
-and extension to
-the bounded repetitions construct with the correctness and finiteness property
-maintained.
-\end{itemize}
 In the next section we will very briefly
-introduce Brzozowski derivatives.
+introduce Brzozowski derivatives and Sulzmann
-We will give a taste to what they
+and Lu's algorithm, which this thesis builds on.
+We give a taste of what they
 are like and why they are suitable for regular expression
 matching and lexing.
-\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
+\section{Our Solution--Formal Specification of POSIX Matching
-We propose Brzozowski derivatives on regular expressions as
+and Brzozowski Derivatives}
-a solution to this.
+Now we start with the central topic of the thesis: Brzozowski derivatives.
-In the last fifteen or so years, Brzozowski's derivatives of regular
+Brzozowski \cite{Brzozowski1964} first introduced the
-expressions have sparked quite a bit of interest in the functional
+concept of the \emph{derivative} in the 1960s.
-programming and theorem prover communities.
+The derivative of a regular expression $r$
+with respect to a character $c$, is written as $r \backslash c$.\footnote{
-\subsection{Motivation}
+	Despite having the same name, regular expression
+	derivatives bear little similarity with the mathematical definition
+	of derivatives on functions.
+}
+It tells us what $r$ would transform into
+if we chop off the first character $c$
+from all strings in the language of $r$ ($L \; r$).
+To give a flavour of Brzozowski derivatives, we present
+two straightforward clauses from it:
+\begin{center}
+	\begin{tabular}{lcl}
+		$d \backslash c$     & $\dn$ &
+		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+	\end{tabular}
+\end{center}
+\noindent
+The first clause says that for the regular expression
+denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
+we check the derivative character $c$ against $d$,
+returning a set containing only the empty string $\{ [] \}$
+if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
+The second clause states that to obtain the regular expression
+representing all strings' head character $c$ being chopped off
+from $r_1 + r_2$, one simply needs to recursively take derivative
+of $r_1$ and $r_2$ and then put them together.
+Thanks to the definition, derivatives have the nice property
+that $s \in L \; (r\backslash c)$ if and only if
+$c::s \in L \; r$.
+%This property can be used on regular expressions
+%matching and lexing--to test whether a string $s$ is in $L \; r$,
+%one simply takes derivatives of $r$ successively with
+%respect to the characters (in the correct order) in $s$,
+%and then test whether the empty string is in the last regular expression.
 Derivatives give a simple solution
-to the problem of matching a string $s$ with a regular
+to the problem of matching and lexing a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then $r$ matches $s$ (and {\em vice versa}).
-The beauty of
+This makes formally reasoning about these properties such
-Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
+as correctness and complexity smooth and intuitive.
-expressible in any functional language, and easily definable and
+In fact, there has already been several mechanised proofs about them,
-reasoned about in theorem provers---the definitions just consist of
+for example the one by Owens and Slind \cite{Owens2008} in HOL4,
-inductive datatypes and simple recursive functions.
+another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
-And an algorithms based on it by
+yet another in Coq by Coquand and Siles \cite{Coquand2012}.
-Suzmann and Lu  \parencite{Sulzmann2014} allows easy extension
-to include  extended regular expressions and
+In addition, one can extend the clauses to bounded repetitions
-simplification of internal data structures
+``for free'':
-eliminating the exponential behaviours.
+\begin{center}
+	\begin{tabular}{lcl}
-However, two difficulties with derivative-based matchers exist:
+		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
-\subsubsection{Problems with Current Brzozowski Matchers}
+		r^{\{n-1\}}$\\
+	\end{tabular}
+\end{center}
+\noindent
+And experimental results suggest that  unlike DFA-based solutions,
+this derivatives can support
+bounded regular expressions with large counters
+quite well.
+There has also been
+extensions to other constructs.
+For example, Owens et al include the derivatives
+for \emph{NOT} regular expressions, which is
+able to concisely express C-style comments of the form
+$/* \ldots */$.
+Another extension for derivatives would be
+regular expressions with look-aheads, done by
+by Miyazaki and Minamide
+\cite{Takayuki2019}.
+%We therefore use Brzozowski derivatives on regular expressions
+%lexing
+Given the above definitions and properties of
+Brzozowski derivatives, one quickly realises their potential
+in generating a formally verified algorithm for lexing--the clauses and property
+can be easily expressed in a functional programming language
+or converted to theorem prover
+code, with great extensibility.
+Perhaps this is the reason why it has sparked quite a bit of interest
+in the functional programming and theorem prover communities in the last
+fifteen or so years (
+\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
+\cite{Chen12} and \cite{Coquand2012}
+to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
+after they were first published.
+However, there are two difficulties with derivative-based matchers:
 First, Brzozowski's original matcher only generates a yes/no answer
 for whether a regular expression matches a string or not.  This is too
 little information in the context of lexing where separate tokens must
 be identified and also classified (for example as keywords
-or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
+or identifiers).
+Second, derivative-based matchers need to be more efficient.
+Elegant and beautiful
+as many implementations are,
+they can be excruciatingly slow.
+For example, Sulzmann and Lu
+claim a linear running time of their proposed algorithm,
+but that was falsified by our experiments. The running time
+is actually $\Omega(2^n)$ in the worst case.
+A similar claim about a theoretical runtime of $O(n^2)$
+is made for the Verbatim \cite{Verbatim}
+%TODO: give references
+lexer, which calculates POSIX matches and is based on derivatives.
+They formalized the correctness of the lexer, but not the complexity.
+In the performance evaluation section, they simply analyzed the run time
+of matching $a$ with the string
+\begin{center}
+	$\underbrace{a \ldots a}_{\text{n a's}}$
+\end{center}
+and concluded that the algorithm is quadratic in terms of input length.
+When we tried out their extracted OCaml code with our example $(a+aa)^*$,
+the time it took to lex only 40 $a$'s was 5 minutes.
+\subsection{Sulzmann and Lu's Algorithm}
+Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
 difficulty by cleverly extending Brzozowski's matching
 algorithm. Their extended version generates additional information on
 \emph{how} a regular expression matches a string following the POSIX
 rules for regular expression matching. They achieve this by adding a
 second ``phase'' to Brzozowski's algorithm involving an injection
-function.  In our own earlier work, we provided the formal
+function simplification of internal data structures
+eliminating the exponential behaviours.
+In an earlier work, Ausaf et al provided the formal
 specification of what POSIX matching means and proved in Isabelle/HOL
 the correctness
 of Sulzmann and Lu's extended algorithm accordingly
 \cite{AusafDyckhoffUrban2016}.
-The second difficulty is that Brzozowski's derivatives can
+The version of the algorithm proven correct
-grow to arbitrarily big sizes. For example if we start with the
+suffers from the
+second difficulty though, where the internal derivatives can
+grow to arbitrarily big sizes.
+For example if we start with the
 regular expression $(a+aa)^*$ and take
 successive derivatives according to the character $a$, we end up with
 a sequence of ever-growing derivatives like
 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
 \end{tabular}
 \end{center}
 \noindent where after around 35 steps we run out of memory on a
-typical computer (we shall define shortly the precise details of our
+typical computer (we shall define in the next chapter
+the precise details of our
 regular expressions and the derivative operation).  Clearly, the
 notation involving $\ZERO$s and $\ONE$s already suggests
 simplification rules that can be applied to regular regular
 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
 earlier work to preserve the correctness of Sulzmann and Lu's
 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
 \emph{not} help with limiting the growth of the derivatives shown
 above: the growth is slowed, but the derivatives can still grow rather
 quickly beyond any finite bound.
 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
 \cite{Sulzmann2014} where they introduce bit-coded
 regular expressions. In this version, POSIX values are
 represented as bit sequences and such sequences are incrementally generated
 method [..] in combination with the simplification steps [..]
 yields POSIX parse trees. We have tested this claim
 extensively [..] but yet
 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}
 Ausaf and Urban were able to back this correctness claim with
 a formal proof.
-But as they stated,
+However a faster formally verified
+lexing program with the optimisations
+mentioned by Sulzmann and Lu's second algorithm
+is still missing.
+As they stated,
 \begin{quote}\it
-The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.
+``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
 \end{quote}
 This thesis implements the aggressive simplifications envisioned
 by Ausaf and Urban,
-and gives a formal proof of the correctness with those simplifications.
+together with a formal proof of the correctness with those simplifications.
 %----------------------------------------------------------------------------------------
 \section{Contribution}
+In this thesis,
+we propose a solution to catastrophic
-This work addresses the vulnerability of super-linear and
+backtracking and error-prone matchers: a formally verified
-buggy regex implementations by the combination
+regular expression lexing algorithm
-of Brzozowski's derivatives and interactive theorem proving.
+that is both fast
-We give an
+and correct by extending Ausaf et al.'s work.
-improved version of  Sulzmann and Lu's bit-coded algorithm using
+The end result is %a regular expression lexing algorithm that comes with
-derivatives, which come with a formal guarantee in terms of correctness and
+\begin{itemize}
-running time as an Isabelle/HOL proof.
+\item
+an improved version of  Sulzmann and Lu's bit-coded algorithm using
+derivatives with simplifications,
+accompanied by
+a proven correctness theorem according to POSIX specification
+given by Ausaf et al. \cite{AusafDyckhoffUrban2016},
+\item
+a complexity-related property for that algorithm saying that the
+internal data structure will
+remain finite,
+\item
+and extension to
+the bounded repetitions construct with the correctness and finiteness property
+maintained.
+\end{itemize}
+With a formal finiteness bound in place,
+we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
 Further improvements to the algorithm with an even stronger version of
 simplification is made.
-We have not yet come up with one, but believe that it leads to a
+Thanks to our theorem-prover-friendly approach,
-formalised proof with a time bound linear to input and
+we believe that
-cubic to regular expression size using a technique by
+this finiteness bound can be improved to a bound
-Antimirov\cite{Antimirov}.
+linear to input and
+cubic to the regular expression size using a technique by
+Antimirov\cite{Antimirov95}.
+Once formalised, this would be a guarantee for the absence of all super-linear behavious.
+We are working out the
+details.
-The main contribution of this thesis is
-\begin{itemize}
-\item
-a proven correct lexing algorithm
-\item
-with formalized finite bounds on internal data structures' sizes.
-\end{itemize}
 To our best knowledge, no lexing libraries using Brzozowski derivatives
-have a provable time guarantee,
+have similar complexity-related bounds,
-and claims about running time are usually speculative and backed by thin empirical
+and claims about running time are usually speculative and backed by empirical
-evidence.
+evidence on a few test cases.
-%TODO: give references
+If a matching or lexing algorithm
-For example, Sulzmann and Lu had proposed an algorithm  in which they
+does not come with certain basic complexity related
-claim a linear running time.
+guarantees (for examaple the internal data structure size
-But that was falsified by our experiments and the running time
+does not grow indefinitely),
-is actually $\Omega(2^n)$ in the worst case.
+then they cannot claim with confidence having solved the problem
-A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
+of catastrophic backtracking.
-%TODO: give references
-lexer, which calculates POSIX matches and is based on derivatives.
-They formalized the correctness of the lexer, but not the complexity.
-In the performance evaluation section, they simply analyzed the run time
-of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
-and concluded that the algorithm is quadratic in terms of input length.
-When we tried out their extracted OCaml code with our example $(a+aa)^*$,
-the time it took to lex only 40 $a$'s was 5 minutes.
-\subsection{Related Work}
-We are aware
-of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
-Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
-in Coq is given by Coquand and Siles \parencite{Coquand2012}.
-Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
-When a regular expression does not behave as intended,
-people usually try to rewrite the regex to some equivalent form
-or they try to avoid the possibly problematic patterns completely,
-for which many false positives exist\parencite{Davis18}.
-Animated tools to "debug" regular expressions such as
-\parencite{regexploit2021} \parencite{regex101} are also popular.
-We are also aware of static analysis work on regular expressions that
-aims to detect potentially expoential regex patterns. Rathnayake and Thielecke
-\parencite{Rathnayake2014StaticAF} proposed an algorithm
-that detects regular expressions triggering exponential
-behavious on backtracking matchers.
-Weideman \parencite{Weideman2017Static} came up with
-non-linear polynomial worst-time estimates
-for regexes, attack string that exploit the worst-time
-scenario, and "attack automata" that generates
-attack strings.
 \section{Structure of the thesis}

changeset 609	61139fdddae0
parent 608	37b6fd310a16
child 612	8c234a1bc7e0