--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sat Oct 01 12:06:46 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Mon Oct 03 02:08:49 2022 +0100
@@ -819,15 +819,16 @@
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Solving the back-reference expressions matching problem
-is NP-complete\parencite{alfred2014algorithms}.
+is known to be NP-complete \parencite{alfred2014algorithms}.
A non-backtracking,
efficient solution is not known to exist.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
-revert to a depth-first search algorithm that backtracks.
-The unreasonable part with them, is that even in the case of
-regexes not involving back-references, there is still
-a (non-negligible) chance they might backtrack super-linearly.
+revert to a depth-first search algorithm that backtracks.
+What is unexpected is that even in cases
+not involving back-references, there is still
+a (non-negligible) chance they might backtrack super-linearly,
+as shown in the graphs in Figure~\ref{fig:aStarStarb}.
\subsection{Summary of the Catastrophic Backtracking Problem}
Summing these up, we can categorise existing
@@ -944,84 +945,172 @@
is trying to capture something very precise,
and is crying out for formalising.
Ausaf et al. \cite{AusafDyckhoffUrban2016}
-are the first to describe such a formalised POSIX
-specification in Isabelle/HOL.
+were the first to fill this gap:
+they not only described such a formalised POSIX
+specification in Isabelle/HOL, but also proved
+that their specification coincides with the
+POSIX specification given by Okui and Suzuki \cite{Okui10},
+which is a completely
+different characterisation.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
based on that specification.
-In this thesis,
-we propose a solution to catastrophic
-backtracking and error-prone matchers--a formally verified
-regular expression lexing algorithm
-that is both fast
-and correct by extending Ausaf et al.'s work.
-The end result is a regular expression lexing algorithm that comes with
-\begin{itemize}
-\item
-a proven correctness theorem according to POSIX specification
-given by Ausaf et al. \cite{AusafDyckhoffUrban2016},
-\item
-a proven complexity-related property saying that the
-algorithm's internal data structure will
-remain finite,
-\item
-and extension to
-the bounded repetitions construct with the correctness and finiteness property
-maintained.
- \end{itemize}
In the next section we will very briefly
-introduce Brzozowski derivatives.
-We will give a taste to what they
+introduce Brzozowski derivatives and Sulzmann
+and Lu's algorithm, which this thesis builds on.
+We give a taste of what they
are like and why they are suitable for regular expression
matching and lexing.
-\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
-We propose Brzozowski derivatives on regular expressions as
- a solution to this.
-In the last fifteen or so years, Brzozowski's derivatives of regular
-expressions have sparked quite a bit of interest in the functional
-programming and theorem prover communities.
+\section{Our Solution--Formal Specification of POSIX Matching
+and Brzozowski Derivatives}
+Now we start with the central topic of the thesis: Brzozowski derivatives.
+Brzozowski \cite{Brzozowski1964} first introduced the
+concept of the \emph{derivative} in the 1960s.
+The derivative of a regular expression $r$
+with respect to a character $c$ is written as $r \backslash c$.\footnote{
+	Despite having the same name, regular expression
+	derivatives bear little resemblance to the mathematical definition
+	of derivatives of functions.
+}
+It tells us what $r$ would transform into
+if we chop off the first character $c$
+from all strings in the language of $r$ ($L \; r$).
+To give a flavour of Brzozowski derivatives, we present
+two straightforward clauses of the definition:
+\begin{center}
+ \begin{tabular}{lcl}
+ $d \backslash c$ & $\dn$ &
+ $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+ \end{tabular}
+\end{center}
+\noindent
+The first clause says that for the regular expression
+denoting a singleton set consisting of a single-character string $\{ d \}$,
+we check the derivative character $c$ against $d$,
+returning a set containing only the empty string $\{ [] \}$
+if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
+The second clause states that to obtain the regular expression
+representing all strings in $r_1 + r_2$ with the head character $c$
+chopped off, one simply takes the derivatives
+of $r_1$ and $r_2$ recursively and puts them back together as an alternative.
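+These clauses translate almost verbatim into a functional
+programming language. The following is a minimal sketch in Scala
+of basic regular expressions, the emptiness test $\mathit{nullable}$
+and the derivative operation; the precise definitions used in this
+thesis are given in the next chapter.
+\begin{verbatim}
+// a sketch of basic regular expressions as a Scala datatype
+abstract class Rexp
+case object ZERO extends Rexp                   // empty language
+case object ONE extends Rexp                    // empty string
+case class CHAR(d: Char) extends Rexp           // single character
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp // alternative
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp // sequence
+case class STAR(r: Rexp) extends Rexp           // Kleene star
+
+// nullable(r) tests whether r matches the empty string
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// der(c, r) computes the derivative r \ c; the CHAR and ALT
+// clauses are exactly the two clauses shown above
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+\end{verbatim}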
-\subsection{Motivation}
-
+Thanks to this definition, derivatives have the nice property
+that $s \in L \; (r\backslash c)$ if and only if
+$c::s \in L \; r$.
+%This property can be used on regular expressions
+%matching and lexing--to test whether a string $s$ is in $L \; r$,
+%one simply takes derivatives of $r$ successively with
+%respect to the characters (in the correct order) in $s$,
+%and then test whether the empty string is in the last regular expression.
Derivatives give a simple solution
-to the problem of matching a string $s$ with a regular
+to the problem of matching and lexing a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
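+Under the assumptions of the earlier sketch, this matching
+procedure amounts to only a few lines of Scala:
+\begin{verbatim}
+// ders extends the derivative from characters to strings
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+// r matches s iff the derivative of r by all of s is nullable
+def matcher(r: Rexp, s: String): Boolean =
+  nullable(ders(s.toList, r))
+\end{verbatim}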
-The beauty of
-Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
-expressible in any functional language, and easily definable and
-reasoned about in theorem provers---the definitions just consist of
-inductive datatypes and simple recursive functions.
-And an algorithms based on it by
-Suzmann and Lu \parencite{Sulzmann2014} allows easy extension
-to include extended regular expressions and
- simplification of internal data structures
- eliminating the exponential behaviours.
+This makes formal reasoning about properties such
+as correctness and complexity smooth and intuitive.
+In fact, there have already been several mechanised proofs about them,
+for example the one by Owens and Slind \cite{Owens2008} in HOL4,
+another one by Krauss and Nipkow \cite{Krauss2011} in Isabelle/HOL, and
+yet another in Coq by Coquand and Siles \cite{Coquand2012}.
+
+In addition, one can extend the clauses to bounded repetitions
+``for free'':
+\begin{center}
+ \begin{tabular}{lcl}
+ $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot
+ r^{\{n-1\}}$\\
+ \end{tabular}
+\end{center}
+\noindent
+The clause above is for the case $n > 0$; the derivative of
+$r^{\{0\}}$, which matches only the empty string, is simply $\ZERO$.
+Experimental results suggest that, unlike DFA-based solutions,
+derivatives can support
+bounded regular expressions with large counters
+quite well.
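+As a sketch of how this extension looks in code, the earlier
+Scala functions gain one clause each (with the $n = 0$ case made
+explicit, since $r^{\{0\}}$ matches only the empty string):
+\begin{verbatim}
+// bounded repetition: NTIMES(r, n) stands for r^{n}
+case class NTIMES(r: Rexp, n: Int) extends Rexp
+
+def nullableN(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullableN(r1) || nullableN(r2)
+  case SEQ(r1, r2) => nullableN(r1) && nullableN(r2)
+  case STAR(_) => true
+  case NTIMES(r1, n) => n == 0 || nullableN(r1)  // r^{0} matches []
+}
+
+def derN(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(derN(c, r1), derN(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullableN(r1)) ALT(SEQ(derN(c, r1), r2), derN(c, r2))
+    else SEQ(derN(c, r1), r2)
+  case STAR(r1) => SEQ(derN(c, r1), STAR(r1))
+  case NTIMES(r1, n) =>                 // the clause shown above
+    if (n == 0) ZERO else SEQ(derN(c, r1), NTIMES(r1, n - 1))
+}
+\end{verbatim}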
-However, two difficulties with derivative-based matchers exist:
-\subsubsection{Problems with Current Brzozowski Matchers}
+There have also been
+extensions to other constructs.
+For example, Owens et al.\ include the derivatives
+for \emph{NOT} regular expressions, which are
+able to concisely express C-style comments of the form
+$/* \ldots */$.
+Another extension of derivatives is to
+regular expressions with look-aheads, done
+by Miyazaki and Minamide
+\cite{Takayuki2019}.
+%We therefore use Brzozowski derivatives on regular expressions
+%lexing
+
+
+
+Given the above definitions and properties of
+Brzozowski derivatives, one quickly realises their potential
+in generating a formally verified lexing algorithm: the clauses and the property
+can be easily expressed in a functional programming language
+or converted to theorem prover
+code, and are easy to extend.
+Perhaps this is the reason why derivatives have sparked quite a bit of interest
+in the functional programming and theorem prover communities in the last
+fifteen or so years (
+\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
+\cite{Chen12} and \cite{Coquand2012}
+to name a few), despite having been buried in the ``sands of time'' \cite{Owens2008}
+after they were first published.
+
+
+However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
-or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this
+or identifiers).
+Second, efficiency is a problem for derivative-based matchers:
+elegant and beautiful
+as many implementations are,
+they can be excruciatingly slow.
+For example, Sulzmann and Lu
+claim a linear running time for their proposed algorithm,
+but that was falsified by our experiments. The running time
+is actually $\Omega(2^n)$ in the worst case.
+A similar claim about a theoretical runtime of $O(n^2)$
+is made for the Verbatim \cite{Verbatim}
+lexer, which calculates POSIX matches and is based on derivatives.
+Its authors formalised the correctness of the lexer, but not its complexity.
+In the performance evaluation section, they simply analysed the running time
+of matching $a$ with the string
+\begin{center}
+ $\underbrace{a \ldots a}_{\text{n a's}}$
+\end{center}
+and concluded that the algorithm is quadratic in terms of input length.
+When we tried out their extracted OCaml code with our example $(a+aa)^*$,
+the time it took to lex only 40 $a$'s was 5 minutes.
+
+
+\subsection{Sulzmann and Lu's Algorithm}
+Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
-function. In our own earlier work, we provided the formal
+function. Their algorithm also allows simplification of internal data structures,
+which eliminates the exponential behaviours.
+In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.
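+To give a taste of the two phases, the following Scala sketch
+(following Ausaf et al.'s presentation of Sulzmann and Lu's
+algorithm, and reusing the datatype from the earlier sketch)
+shows how values recording \emph{how} a regular expression
+matched a string are built: \texttt{mkeps} constructs the value
+for a nullable regular expression matching the empty string, and
+\texttt{inj} injects a character back into a value.
+\begin{verbatim}
+// values record how a regular expression matched a string
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+// the value for how a nullable r matches the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)
+}
+
+// inj(r, c, v) turns a value v for der(c, r) into a value for r
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (CHAR(_), Empty) => Chr(c)
+  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
+  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
+}
+
+// phase 1 takes derivatives; phase 2 injects characters back
+def lexer(r: Rexp, s: List[Char]): Val = s match {
+  case Nil =>
+    if (nullable(r)) mkeps(r) else throw new Exception("lexing error")
+  case c :: cs => inj(r, c, lexer(der(c, r), cs))
+}
+\end{verbatim}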
-The second difficulty is that Brzozowski's derivatives can
-grow to arbitrarily big sizes. For example if we start with the
+The version of the algorithm proven correct, however,
+suffers from the
+second difficulty: the internal derivatives can
+grow to arbitrarily large sizes.
+For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like
@@ -1038,7 +1127,8 @@
\end{center}
\noindent where after around 35 steps we run out of memory on a
-typical computer (we shall define shortly the precise details of our
+typical computer (we shall define in the next chapter
+the precise details of our
regular expressions and the derivative operation). Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
@@ -1051,7 +1141,6 @@
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
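+As a small illustration of this growth problem, the following
+sketch (reusing the earlier Scala functions, and using the number
+of nodes as a rough size measure, which is our assumption here)
+prints the sizes of successive derivatives of $(a+aa)^*$:
+\begin{verbatim}
+// rough size measure: count the nodes of a regular expression
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+// sizes grow rapidly with each derivative step
+(1 to 20).foldLeft(r: Rexp) { (ri, i) =>
+  val next = der('a', ri)
+  println(s"step $i: size ${size(next)}")
+  next
+}
+\end{verbatim}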
-
Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bit-coded
regular expressions. In this version, POSIX values are
@@ -1075,94 +1164,76 @@
extensively [..] but yet
have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
-
Ausaf and Urban were able to back this correctness claim with
a formal proof.
-But as they stated,
+However, a faster formally verified
+lexing program with the optimisations
+of Sulzmann and Lu's second algorithm
+was still missing.
+As they stated,
\begin{quote}\it
-The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.
+``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}
-
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
-and gives a formal proof of the correctness with those simplifications.
+together with a formal proof of correctness in the presence of those simplifications.
%----------------------------------------------------------------------------------------
\section{Contribution}
-
-
-This work addresses the vulnerability of super-linear and
-buggy regex implementations by the combination
-of Brzozowski's derivatives and interactive theorem proving.
-We give an
-improved version of Sulzmann and Lu's bit-coded algorithm using
-derivatives, which come with a formal guarantee in terms of correctness and
-running time as an Isabelle/HOL proof.
-Further improvements to the algorithm with an even stronger version of
-simplification is made.
-We have not yet come up with one, but believe that it leads to a
-formalised proof with a time bound linear to input and
-cubic to regular expression size using a technique by
-Antimirov\cite{Antimirov}.
-
-
-The main contribution of this thesis is
+In this thesis,
+we propose a solution to catastrophic
+backtracking and error-prone matchers: a formally verified
+regular expression lexing algorithm
+that is both fast
+and correct, obtained by extending Ausaf et al.'s work.
+The end result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
-a proven correct lexing algorithm
+an improved version of Sulzmann and Lu's bit-coded algorithm using
+derivatives with simplifications,
+accompanied by
+a proven correctness theorem with respect to the POSIX specification
+given by Ausaf et al. \cite{AusafDyckhoffUrban2016},
+\item
+a complexity-related property for that algorithm saying that the
+internal data structure will
+remain finite,
\item
-with formalized finite bounds on internal data structures' sizes.
-\end{itemize}
-
-To our best knowledge, no lexing libraries using Brzozowski derivatives
-have a provable time guarantee,
-and claims about running time are usually speculative and backed by thin empirical
-evidence.
-%TODO: give references
-For example, Sulzmann and Lu had proposed an algorithm in which they
-claim a linear running time.
-But that was falsified by our experiments and the running time
-is actually $\Omega(2^n)$ in the worst case.
-A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
-%TODO: give references
-lexer, which calculates POSIX matches and is based on derivatives.
-They formalized the correctness of the lexer, but not the complexity.
-In the performance evaluation section, they simply analyzed the run time
-of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
-and concluded that the algorithm is quadratic in terms of input length.
-When we tried out their extracted OCaml code with our example $(a+aa)^*$,
-the time it took to lex only 40 $a$'s was 5 minutes.
-
+and an extension to
+the bounded repetitions construct, with the correctness and finiteness properties
+maintained.
+ \end{itemize}
-\subsection{Related Work}
-We are aware
-of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
-Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
-in Coq is given by Coquand and Siles \parencite{Coquand2012}.
-Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
-
+With a formal finiteness bound in place,
+we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
+Further improvements to the algorithm with an even stronger version of
+simplification are under way.
+Thanks to our theorem-prover-friendly approach,
+we believe that
+this finiteness bound can be improved to a bound
+linear in the input size and
+cubic in the regular expression size, using a technique by
+Antimirov~\cite{Antimirov95}.
+Once formalised, this would be a guarantee for the absence of all super-linear behaviours.
+We are working out the
+details.
+
- When a regular expression does not behave as intended,
-people usually try to rewrite the regex to some equivalent form
-or they try to avoid the possibly problematic patterns completely,
-for which many false positives exist\parencite{Davis18}.
-Animated tools to "debug" regular expressions such as
- \parencite{regexploit2021} \parencite{regex101} are also popular.
-We are also aware of static analysis work on regular expressions that
-aims to detect potentially expoential regex patterns. Rathnayake and Thielecke
-\parencite{Rathnayake2014StaticAF} proposed an algorithm
-that detects regular expressions triggering exponential
-behavious on backtracking matchers.
-Weideman \parencite{Weideman2017Static} came up with
-non-linear polynomial worst-time estimates
-for regexes, attack string that exploit the worst-time
-scenario, and "attack automata" that generates
-attack strings.
+To the best of our knowledge, no lexing libraries using Brzozowski derivatives
+have similar complexity-related bounds,
+and claims about running time are usually speculative and backed by empirical
+evidence on a few test cases.
+If a matching or lexing algorithm
+does not come with certain basic complexity-related
+guarantees (for example that the internal data structure size
+does not grow indefinitely),
+then it cannot claim with confidence to have solved the problem
+of catastrophic backtracking.
+