--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Overview.tex Wed Aug 23 03:02:31 2023 +0100
@@ -0,0 +1,872 @@
+\chapter{Technical Overview}
+\label{Overview}
+We start with a technical overview of the
+catastrophic backtracking problem,
+motivating rigorous approaches to regular expression matching and lexing.
+In doing so we also briefly
+introduce common terminology such as
+bounded repetitions and back-references.
+
+\section{Motivating Examples}
+Consider for example the regular expression $(a^*)^*\,b$ and
+strings of the form $aa..a$. These strings cannot be matched by this regular
+expression: obviously the expected $b$ in the last
+position is missing. One would assume that modern regular expression
+matching engines can detect this very quickly. Surprisingly, if one tries
+this example in JavaScript, Python or Java 8, even with small strings,
+say of length around 30 $a$'s,
+the decision takes a large amount of time to finish,
+which is out of all proportion
+to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
+The algorithms clearly show exponential behaviour, which,
+as can be seen, is triggered by some relatively simple regular expressions.
+Java 9 and newer
+versions mitigate this behaviour by several orders of magnitude,
+but are still slow compared
+with the approach we are going to use in this thesis.
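+
+To make the blow-up concrete, the following small experiment (our own
+sketch, using Scala~2.13, not code from any of the libraries measured)
+reproduces it on the JVM; the dramatic timings only show up on a Java 8
+runtime, since Java 9 and newer mitigate this behaviour as noted above.
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// a sketch reproducing the blow-up with the JVM's backtracking
+// regex engine (dramatic on Java 8; Java 9+ mitigates it)
+object Evil extends App {
+  val evil  = "(a*)*b".r        // the regular expression (a*)*b
+  val input = "a" * 30          // aa..a with no final b: cannot match
+  val t0 = System.nanoTime()
+  println(evil.matches(input))  // false, found only after backtracking
+  println(s"${(System.nanoTime() - t0) / 1e9} secs")
+}
+\end{lstlisting}
+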
+
+
+
+This superlinear blowup in regular expression engines
+has caused grief in ``real life'' where it is
+given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
+A less serious example is a bug in the Atom editor:
+a user found out that when writing the following code
+\begin{verbatim}
+   vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr
+   && vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() ==
+   reflect.Slice
+\end{verbatim}
+it took the editor half an hour to finish computing.
+This was subsequently fixed by Galbraith \cite{fixedAtom};
+the issue was due to the regex engine of Atom.
+But evil regular expressions can be more than a nuisance in a text editor:
+on 20 July 2016 one evil
+regular expression brought the webpage
+\href{http://stackexchange.com}{Stack Exchange} to its
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
+In this instance, a regular expression intended to just trim white
+spaces from the beginning and the end of a line actually consumed
+massive amounts of CPU resources---causing the web servers to grind to a
+halt. In this example, the time needed to process
+the string was
+$O(n^2)$ with respect to the string length $n$. This
+quadratic overhead was enough for the homepage of Stack Exchange to
+respond so slowly that the load balancer assumed a $\mathit{DoS}$
+attack and therefore stopped the servers from responding to any
+requests. This made the whole site unavailable.
+
+\begin{figure}[p]
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={JavaScript},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Python},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\end{axis}
+\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Java 8},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Dart},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\end{axis}
+\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Swift},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=true,
+ %xtick={0,5000,...,40000},
+ %xmax=40000,
+ %ymax=35,
+ restrict x to domain*=0:40000,
+ restrict y to domain*=0:35,
+ %ytick={0,5,...,30},
+ %scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Java9+},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
+\end{axis}
+\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=true,
+ %xtick={0,5,...,30},
+ %xmax=33,
+ %ymax=35,
+ restrict x to domain*=0:60000,
+ restrict y to domain*=0:35,
+ %ytick={0,5,...,30},
+ %scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Scala},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=true,
+ %xtick={0,5000,...,40000},
+ %xmax=40000,
+ %ymax=35,
+ restrict x to domain*=0:60000,
+ restrict y to domain*=0:45,
+ %ytick={0,5,...,30},
+ %scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend style={cells={align=left}},
+ legend entries={Isabelle \\ Extracted},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{2}{c}{Graphs}
+\end{tabular}
+\end{center}
+\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
+ of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
+  The reason for their fast growth %superlinear behaviour
+  is that they do a depth-first search
+  using NFAs.
+ If the string does not match, the regular expression matching
+ engine starts to explore all possibilities.
+ The last two graphs are for our implementation in Scala, one manual
+ and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
+ Our lexer
+ performs better in this case, and
+ is formally verified.
+ Despite being almost identical, the codegen-generated lexer
+ % is generated directly from Isabelle using $\textit{codegen}$,
+ is slower than the manually written version since the code synthesised by
+ $\textit{codegen}$ does not use native integer or string
+ types of the target language.
+ %Note that we are comparing our \emph{lexer} against other languages' matchers.
+}\label{fig:aStarStarb}
+\end{figure}\afterpage{\clearpage}
+
+A more recent example is a global outage of all Cloudflare servers on
+2 July 2019.
+The traffic Cloudflare handles each day is more than that of
+Twitter, Amazon, Instagram, Apple, Bing and Wikipedia combined, yet
+its servers became completely unavailable for half an hour because
+a poorly-written regular expression, roughly of the form $.^*.^*=.^*$,
+exhausted the CPU resources that serve HTTP traffic.
+Although the outage
+had several causes, at the heart was a regular expression that
+was used to monitor network
+traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
+These problems with regular expressions
+are not isolated events that happen
+very rarely,
+%but actually widespread.
+%They occur so often that they have a
+but they actually occur often enough that they have a
+name: Regular-Expression-Denial-Of-Service (ReDoS)
+attacks.
+Davis et al. \cite{Davis18} detected more
+than 1000 evil regular expressions
+in Node.js, Python core libraries, npm and pypi.
+They therefore concluded that evil regular expressions
+are a real problem rather than just ``a parlour trick''.
+
+The work in this thesis aims to address this issue
+with the help of formal proofs.
+We describe a lexing algorithm based
+on Brzozowski derivatives with verified correctness
+and a finiteness property for the size of derivatives
+(both established in Isabelle/HOL).
+Such properties %guarantee the absence of
+are an important step in preventing
+catastrophic backtracking once and for all.
+We will give more details in the next sections
+on (i) why the slow cases in figure \ref{fig:aStarStarb}
+can occur in traditional regular expression engines
+and (ii) why we choose our
+approach based on Brzozowski derivatives and formal proofs.
+
+
+
+\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
+Regular expressions and regular expression matchers
+have been studied for many years.
+Theoretical results in automata theory state
+that basic regular expression matching can be done in time linear
+with respect to the length of the input string.
+This assumes that the regular expression
+$r$ was pre-processed and turned into a
+deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
+By basic we mean textbook definitions such as the one
+below, involving only regular expressions for characters, alternatives,
+sequences, and Kleene stars:
+\[
+ r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
+\]
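+
+To fix intuitions, below is a sketch of this basic syntax as a Scala
+datatype (our own naming, not the Isabelle definitions given later in
+this thesis), with the two usual constants for the empty language and
+the empty-string language added:
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// basic regular expressions as an inductive datatype (a sketch)
+abstract class Rexp
+case object ZERO extends Rexp                    // matches no string
+case object ONE extends Rexp                     // matches only ""
+case class CHAR(c: Char) extends Rexp            // a single character
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative r1 + r2
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence r1 . r2
+case class STAR(r: Rexp) extends Rexp            // Kleene star r*
+\end{lstlisting}
+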
+Modern regular expression matchers used by programmers,
+however,
+support much richer constructs, such as bounded repetitions,
+negations,
+and back-references.
+To differentiate, we use the word \emph{regex} to refer
+to those expressions with richer constructs while reserving the
+term \emph{regular expression}
+for the more traditional meaning in formal languages theory.
+We follow this convention
+in this thesis.
+In the future, we aim to support all the popular features of regexes,
+but for this work we mainly look at basic regular expressions
+and bounded repetitions.
+
+
+%Most modern regex libraries
+%the so-called PCRE standard (Peral Compatible Regular Expressions)
+%has the back-references
+Regexes come with a number of constructs
+that make it more convenient for
+programmers to write regular expressions.
+Depending on the type of construct,
+the task of matching and lexing with them
+has different levels of complexity.
+Some of those constructs are simply syntactic sugar: shorthand
+notations
+that save programmers a few keystrokes.
+These will not cause problems for regex libraries.
+For example the
+non-binary alternative involving three or more choices just means:
+\[
+ (a | b | c) \stackrel{means}{=} ((a + b)+ c)
+\]
+Similarly, the range operator
+%used to express the alternative
+%of all characters between its operands,
+is just a concise way
+of expressing an alternative of consecutive characters:
+\[
+	[0-9] \stackrel{means}{=} (0 | 1 | \ldots | 9 )
+\]
+The
+wildcard character `$.$' is used to refer to any single character
+except the newline:
+\[
+	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
+\]
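+
+Such conveniences can be defined away; reusing the \texttt{Rexp} sketch
+from above, hypothetical helper functions for $n$-ary alternatives and
+character ranges could look as follows:
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// desugaring sketches for the conveniences above (hypothetical helpers)
+def alts(rs: List[Rexp]): Rexp =         // a | b | c as nested ALTs
+  rs.reduceRight(ALT(_, _))
+def range(from: Char, to: Char): Rexp =  // [0-9] as 0 | 1 | ... | 9
+  alts((from to to).toList.map(CHAR(_)))
+\end{lstlisting}
+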
+
+\subsection{Bounded Repetitions}
+More interesting are bounded repetitions, which can
+make the regular expressions much
+more compact.
+Normally there are four kinds of bounded repetitions:
+$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
+(where $n$ and $m$ are constant natural numbers).
+Like the Kleene star, the set of strings, or language, that
+a bounded regular expression can match
+is defined using the power operation on sets of strings:
+\begin{center}
+ \begin{tabular}{lcl}
+ $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
+ $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
+ $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
+ $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
+ \end{tabular}
+\end{center}
+The attraction of bounded repetitions is that they can be
+used to avoid a size blow up: for example $r^{\{n\}}$
+is a shorthand for
+the much longer regular expression:
+\[
+	\underbrace{r\ldots r}_{\text{$n$ copies of $r$}}.
+\]
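+
+A naive implementation could simply desugar $r^{\{n\}}$ by unfolding, as
+in the sketch below (again on our \texttt{Rexp} datatype); the point is
+that the resulting expression grows linearly in $n$, and automata
+constructed from it grow correspondingly:
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// naive desugaring of r{n} into n sequenced copies of r (a sketch);
+// the resulting expression has size proportional to n
+def unfold(r: Rexp, n: Int): Rexp =
+  if (n == 0) ONE else SEQ(r, unfold(r, n - 1))
+\end{lstlisting}
+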
+%Therefore, a naive algorithm that simply unfolds
+%them into their desugared forms
+%will suffer from at least an exponential runtime increase.
+
+
+The problem with matching
+such bounded repetitions
+is that tools based on the classic notion of
+automata need to expand $r^{\{n\}}$ into $n$ connected
+copies of the automaton for $r$. This leads to very inefficient matching
+algorithms or algorithms that consume large amounts of memory.
+Implementations using $\DFA$s will
+in such situations
+either become very slow
+(for example Verbatim++ \cite{Verbatimpp}) or run
+out of memory (for example $\mathit{LEX}$ and
+$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
+in C and JAVA that generate $\mathit{DFA}$-based
+lexers. The user provides a set of regular expressions
+and configurations, and then
+gets an output program encoding a minimized $\mathit{DFA}$
+that can be compiled and run.
+When given a bounded repetition such as the $r^{\{n\}}$ from above,
+a small $n$ (say 20) can result in a program representing a
+DFA
+with millions of states.}) for large counters.
+A classic example of this phenomenon is the regular expression $(a+b)^*a(a+b)^{\{n\}}$
+where the minimal DFA requires at least $2^{n+1}$ states.
+For example, when $n$ is equal to 2,
+the corresponding $\mathit{NFA}$ looks like:
+\vspace{6mm}
+\begin{center}
+\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
+ \node[state,initial] (q_0) {$q_0$};
+ \node[state, red] (q_1) [right=of q_0] {$q_1$};
+ \node[state, red] (q_2) [right=of q_1] {$q_2$};
+ \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
+ \path[->]
+ (q_0) edge node {a} (q_1)
+ edge [loop below] node {a,b} ()
+ (q_1) edge node {a,b} (q_2)
+ (q_2) edge node {a,b} (q_3);
+\end{tikzpicture}
+\end{center}
+and when turned into a DFA by the subset construction
+requires at least $2^3$ states.\footnote{The
+red states are ``countdown states'' which count down
+the number of characters needed in addition to the current
+string to make a successful match.
+For example, state $q_1$ indicates a match that has
+gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
+and has just consumed the ``delimiter'' $a$ in the middle, and
+needs to match 2 more iterations of $(a|b)$ to complete.
+State $q_2$ on the other hand, can be viewed as a state
+after $q_1$ has consumed 1 character, and just waits
+for 1 more character to complete.
+The state $q_3$ is the last (accepting) state, requiring 0
+more characters.
+Depending on the prefix of the
+input string read so far,
+the states $q_1$, $q_2$ and $q_3$
+may or may
+not be active.
+A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
+contain at least $2^3$ non-equivalent states that cannot be merged,
+because the subset construction during determinisation will generate
+all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
+Generalising this to regular expressions with larger
+bounded repetition numbers, we have that
+regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
+require at least $2^{n+1}$ states, provided $r$ itself matches
+more than one string.
+These states represent all the different
+scenarios in which ``countdown'' states are active.}
+
+
+Bounded repetitions are important because they
+tend to occur frequently in practice,
+for example in the regex library RegExLib, in
+the rules library of Snort \cite{Snort1999}\footnote{
+Snort is a network intrusion detection (NID) tool
+for monitoring network traffic.
+The network security community curates a list
+of malicious patterns written as regexes,
+which is used by Snort's detection engine
+to match against network traffic for any hostile
+activities such as buffer overflow attacks.},
+as well as in XML Schema definitions (XSDs).
+According to Bj\"{o}rklund et al.\ \cite{xml2015},
+more than half of the
+XSDs they found on the Maven.org central repository
+have bounded regular expressions in them.
+Often the counters are quite large, with the largest being
+close to ten million.
+A smaller sample XSD they gave
+is:
+\lstset{
+ basicstyle=\fontsize{8.5}{9}\ttfamily,
+ language=XML,
+ morekeywords={encoding,
+ xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
+}
+\begin{lstlisting}
+<sequence minOccurs="0" maxOccurs="65535">
+ <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{lstlisting}
+This can be seen as the regex
+$(a b^{\{2\ldots 12\}})^{\{0 \ldots 65535\}}$, where $a$ and $b$ are themselves
+regular expressions
+satisfying certain constraints (such as
+conforming to the floating point number format).
+It is therefore quite unsatisfying that
+some regular expression matching libraries
+impose ad hoc limits
+on bounded regular expressions:
+for example, in the regular expression matching library of the Go
+language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
+can be above 1000, and in the built-in Rust regular expression library
+expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
+for being too big.\footnote{Try it out here: \url{https://rustexp.lpil.uk}}
+As Becchi and Crawley \cite{Becchi08} have pointed out,
+the reason for these restrictions
+is that these libraries simulate a non-deterministic finite
+automaton (NFA) with a breadth-first search.
+This way the number of active states can
+grow as large as the counter.
+When the counters are large,
+the memory requirement can become
+infeasible, and a regex engine
+like the one in Go will reject such a pattern straight away.
+\begin{figure}[H]
+\begin{center}
+\begin{tikzpicture} [node distance = 2cm, on grid, auto]
+
+ \node (q0) [state, initial] {$0$};
+ \node (q1) [state, right = of q0] {$1$};
+ %\node (q2) [state, right = of q1] {$2$};
+ \node (qdots) [right = of q1] {$\ldots$};
+ \node (qn) [state, right = of qdots] {$n$};
+ \node (qn1) [state, right = of qn] {$n+1$};
+ \node (qn2) [state, right = of qn1] {$n+2$};
+ \node (qn3) [state, accepting, right = of qn2] {$n+3$};
+
+\path [-stealth, thick]
+ (q0) edge [loop above] node {a} ()
+ (q0) edge node {a} (q1)
+ %(q1) edge node {.} (q2)
+ (q1) edge node {.} (qdots)
+ (qdots) edge node {.} (qn)
+ (qn) edge node {.} (qn1)
+ (qn1) edge node {b} (qn2)
+ (qn2) edge node {$c$} (qn3);
+\end{tikzpicture}
+%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
+% \node[state,initial] (q_0) {$0$};
+% \node[state, ] (q_1) [right=of q_0] {$1$};
+% \node[state, ] (q_2) [right=of q_1] {$2$};
+% \node[state,
+% \node[state, accepting, ](q_3) [right=of q_2] {$3$};
+% \path[->]
+% (q_0) edge node {a} (q_1)
+% edge [loop below] node {a,b} ()
+% (q_1) edge node {a,b} (q_2)
+% (q_2) edge node {a,b} (q_3);
+%\end{tikzpicture}
+\end{center}
+\caption{The example given by Becchi and Crawley
+  showing that NFA simulation can consume large
+  amounts of memory: $.^*a.^{\{n\}}bc$ matching
+  strings of the form $aaa\ldots aaaabc$.
+  When traversing in a breadth-first manner,
+  all states from $0$ up to $n+1$ will become active.}\label{fig:inj}
+
+\end{figure}
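+
+To make this breadth-first simulation concrete, below is a minimal
+sketch of it (our own code, with states as plain integers); the set of
+active states is threaded through the input and, for counter-like NFAs
+as in figure \ref{fig:inj}, can grow to $n+1$ elements:
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// breadth-first NFA simulation (a sketch): all currently active
+// states are kept in one set, which for counter-like NFAs can grow
+// with the size of the counter
+case class NFA(start: Set[Int],
+               accept: Set[Int],
+               delta: (Int, Char) => Set[Int])
+
+def bfsMatch(nfa: NFA, s: String): Boolean = {
+  val finalStates = s.foldLeft(nfa.start) {
+    (active, c) => active.flatMap(q => nfa.delta(q, c))
+  }
+  finalStates.exists(nfa.accept)  // accept if any active state accepts
+}
+\end{lstlisting}
+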
+%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
+%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
+%in terms of input string length.
+%TODO:try out these lexers
+These problems can of course be solved in matching algorithms where
+automata go beyond the classic notion and for instance include explicit
+counters \cite{Turo_ov__2020}.
+These solutions can be quite efficient,
+with the ability to process
+gigabits of input per second
+even with large counters \cite{Becchi08}.
+These practical solutions do not come with
+formal guarantees, and as pointed out by
+Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
+%But formal reasoning about these automata especially in Isabelle
+%can be challenging
+%and un-intuitive.
+%Therefore, we take correctness and runtime claims made about these solutions
+%with a grain of salt.
+
+In the work reported in \cite{ITP2023} and here,
+we add better support, using derivatives,
+for bounded regular expressions $r^{\{n\}}$.
+Our results
+extend straightforwardly to
+repetitions with intervals such as
+$r^{\{n\ldots m\}}$.
+The merit of Brzozowski derivatives (more on this later)
+on this problem is that
+they can be naturally extended to support bounded repetitions.
+Moreover these extensions are still made up of only small
+inductive datatypes and recursive functions,
+making them convenient to reason about in theorem provers.
+%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
+%straightforwardly extended to deal with bounded regular expressions
+%and moreover the resulting code still consists of only simple
+%recursive functions and inductive datatypes.
+Finally, bounded regular expressions do not destroy our finite
+boundedness property, which we shall prove later on.
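+
+To preview why derivatives handle counters gracefully, below is a
+sketch (on our \texttt{Rexp} datatype from before, extended with a
+constructor for $r^{\{n\}}$) of the two functions at the heart of the
+derivative approach; the precise Isabelle definitions appear in later
+chapters:
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// extending the Rexp sketch with bounded repetition r{n}
+case class NTIMES(r: Rexp, n: Int) extends Rexp
+
+// nullable: does r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+  case NTIMES(r1, n) => n == 0 || nullable(r1)
+}
+
+// Brzozowski derivative of r w.r.t. the character c: for NTIMES the
+// counter is simply decremented; no automaton states are constructed
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+  case NTIMES(r1, n) =>
+    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
+}
+\end{lstlisting}
+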
+
+
+
+
+
+\subsection{Back-References}
+The other way to simulate an $\mathit{NFA}$ for matching is choosing
+a single transition each time, keeping all the other options in
+a queue or stack, and backtracking if that choice eventually
+fails.
+This method, often called a ``depth-first search'',
+is efficient in many cases but can end up
+with exponential runtime.
+The backtracking method is employed in regex libraries
+that support \emph{back-references}, for example
+in Java and Python.
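+
+The following sketch (ours, again on the \texttt{Rexp} datatype) shows
+the depth-first strategy in continuation-passing style; the
+\texttt{ALT} clause is where one choice is tried first and the other is
+kept for backtracking, which is the source of the exponential worst
+case:
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// depth-first matching with backtracking (a sketch, NTIMES omitted):
+// k is called on every suffix left over after r matches some prefix
+def bt(r: Rexp, s: List[Char], k: List[Char] => Boolean): Boolean =
+  r match {
+    case ZERO => false
+    case ONE => k(s)
+    case CHAR(c) => s match {
+      case d :: rest if c == d => k(rest)
+      case _ => false
+    }
+    case ALT(r1, r2) => bt(r1, s, k) || bt(r2, s, k) // backtracking
+    case SEQ(r1, r2) => bt(r1, s, rest => bt(r2, rest, k))
+    case STAR(r1) =>  // zero iterations, or one (non-empty) and repeat
+      k(s) || bt(r1, s, rest =>
+        rest.length < s.length && bt(STAR(r1), rest, k))
+  }
+
+def btMatches(r: Rexp, s: String): Boolean = bt(r, s.toList, _.isEmpty)
+\end{lstlisting}
+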
+%\section{Back-references and The Terminology Regex}
+
+%When one constructs an $\NFA$ out of a regular expression
+%there is often very little to be done in the first phase, one simply
+%construct the $\NFA$ states based on the structure of the input regular expression.
+
+%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
+%one by keeping track of all active states after consuming
+%a character, and update that set of states iteratively.
+%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
+%for a path terminating
+%at an accepting state.
+
+
+
+Consider the following regular expression where the sequence
+operator is omitted for brevity:
+\begin{center}
+ $r_1r_2r_3r_4$
+\end{center}
+In this example,
+one could label sub-expressions of interest
+by parenthesizing them and giving
+them a number in the order in which their opening parentheses appear.
+One possible way of parenthesizing and labelling is:
+\begin{center}
+ $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
+\end{center}
+The sub-expressions
+$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
+by 1 to 4, and can be ``referred back'' by their respective numbers.
+%These sub-expressions are called "capturing groups".
+To do so, one uses the syntax $\backslash i$
+to denote that we want the sub-string
+of the input matched by the $i$-th
+sub-expression to appear again,
+exactly the same as it first appeared:
+\begin{center}
+$\ldots\underset{\text{$i$-th lparen}}{(}{r_i})\ldots
+\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
+\end{center}
+Once the sub-string $s_i$ for the sub-expression $r_i$
+has been fixed, there is no variability on what the back-reference
+label $\backslash i$ can be---it is tied to $s_i$.
+The overall string will look like $\ldots s_i \ldots s_i \ldots$.
+%The backslash and number $i$ are the
+%so-called "back-references".
+%Let $e$ be an expression made of regular expressions
+%and back-references. $e$ contains the expression $e_i$
+%as its $i$-th capturing group.
+%The semantics of back-reference can be recursively
+%written as:
+%\begin{center}
+% \begin{tabular}{c}
+% $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
+% $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
+% \end{tabular}
+%\end{center}
+A concrete example
+of a back-reference is
+\begin{center}
+$(.^*)\backslash 1$,
+\end{center}
+which matches
+strings that can be split into two identical halves,
+for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
+Note that this is different from
+repeating the sub-expression verbatim like
+\begin{center}
+ $(.^*)(.^*)$,
+\end{center}
+which does not impose any restrictions on what strings the second
+sub-expression $.^*$
+might match.
+Another example of back-references is
+\begin{center}
+$(.)(.)\backslash 2\backslash 1$
+\end{center}
+which matches four-character palindromes
+like $abba$, $x??x$ and so on.
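+
+Both examples can be tried directly with Scala's Java-based regex
+engine (a quick experiment of ours, using Scala~2.13):
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// back-references via the JVM's regex engine (not our lexer)
+val halves     = "(.*)\\1".r         // \1 must repeat group 1
+val palindrome = "(.)(.)\\2\\1".r
+println(halves.matches("foofoo"))    // true: two identical halves
+println(halves.matches("foobar"))    // false
+println(palindrome.matches("abba"))  // true: 4-character palindrome
+\end{lstlisting}
+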
+
+Back-references are a regex construct
+that programmers find quite useful.
+According to Becchi and Crawley \cite{Becchi08},
+6\% of Snort rules (up until 2008) use them.
+The most common use of back-references
+is to express well-formed HTML files,
+where back-references are convenient for matching
+opening and closing tags like
+\begin{center}
+ $\langle html \rangle \ldots \langle / html \rangle$
+\end{center}
+A regex describing such a format
+is
+\begin{center}
+ $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
+\end{center}
+Despite being useful, the expressive power of regexes
+goes beyond regular languages
+once back-references are included.
+In fact, back-references allow regexes to express
+languages that are not even contained in the class of context-free
+languages.
+For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
+expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
+which cannot be expressed by
+context-free grammars \cite{campeanu2003formal}.
+Such a language lies in the context-sensitive class of
+the Chomsky hierarchy.
+Also, the matching problem for regexes with back-references
+is known to be NP-complete \parencite{alfred2014algorithms}.
+Regex libraries supporting back-references such as
+PCRE \cite{pcre} therefore have to
+revert to a depth-first search algorithm including backtracking.
+What is unexpected is that even in the cases
+not involving back-references, there is still
+a (non-negligible) chance they might backtrack super-linearly,
+as shown in the graphs in figure \ref{fig:aStarStarb}.
+
+Summing up, we can categorise existing
+practical regex libraries into two kinds:
+(i) the ones with linear
+time guarantees like Go and Rust, whose downside is that
+they impose restrictions
+on the regular expressions (not allowing back-references,
+bounded repetitions cannot exceed an ad hoc limit, etc.);
+and (ii) those
+that allow large bounded regular expressions and back-references
+at the expense of using backtracking algorithms.
+The latter can potentially ``grind to a halt''
+on some very simple cases, resulting in
+ReDoS attacks if exposed to the internet.
+
+The problems with both approaches are the motivation for us
+to look again at the regular expression matching problem.
+Another motivation is that regular expression matching algorithms
+that follow the POSIX standard often contain errors and bugs
+as we shall explain next.
+
+%We would like to have regex engines that can
+%deal with the regular part (e.g.
+%bounded repetitions) of regexes more
+%efficiently.
+%Also we want to make sure that they do it correctly.
+%It turns out that such aim is not so easy to achieve.
+ %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
+% For example, the Rust regex engine claims to be linear,
+% but does not support lookarounds and back-references.
+% The GoLang regex library does not support over 1000 repetitions.
+% Java and Python both support back-references, but shows
+%catastrophic backtracking behaviours on inputs without back-references(
+%when the language is still regular).
+ %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
+ %TODO: verify the fact Rust does not allow 1000+ reps
+
+
+
+
+%The time cost of regex matching algorithms in general
+%involve two different phases, and different things can go differently wrong on
+%these phases.
+%$\DFA$s usually have problems in the first (construction) phase
+%, whereas $\NFA$s usually run into trouble
+%on the second phase.
+
+
+\section{Error-prone POSIX Implementations}
+Very often there are multiple ways of matching a string
+with a regular expression.
+In such cases the regular expression matcher needs to
+disambiguate.
+The more widely used strategy is called POSIX,
+which roughly speaking always chooses the longest initial match.
+The POSIX strategy is widely adopted in many regular expression matchers
+because it is a natural strategy for lexers.
+However, many implementations (including the C libraries
+used by Linux and OS X distributions) contain bugs
+or do not meet the specification they claim to adhere to.
+Kuklewicz maintains a unit test repository which lists some
+problems with existing regular expression engines \cite{KuklewiczHaskell}.
+In some cases, they either fail to generate a
+result when there exists a match,
+or give results that are inconsistent with the POSIX standard.
+A concrete example is the regex
+\begin{center}
+	$(aba + ab + a)^*$ \quad and the string \quad $ababa$.
+\end{center}
+The correct POSIX match for the above
+involves the entire string $ababa$,
+split into two Kleene star iterations, namely $[ab], [aba]$ at positions
+$[0, 2), [2, 5)$
+respectively.
+But feeding this example to the different engines
+listed at regex101\footnote{
+	regex101 is an online regular expression matcher which
+	provides an API for trying out the regular expression
+	engines of multiple popular programming languages like
+Java, Python, Go, etc.} \parencite{regex101}
+yields
+only two incomplete matches: $[aba]$ at $[0, 3)$
+and $a$ at $[4, 5)$.
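+
+This behaviour is easy to reproduce, for instance with Scala's
+Java-based backtracking engine (our own quick check, using Scala~2.13):
+\begin{lstlisting}[language=Scala, basicstyle=\small\ttfamily]
+// the JVM's backtracking engine also misses the POSIX-longest match
+val r = "(aba|ab|a)*".r
+val m = r.findFirstMatchIn("ababa").get
+println(s"[${m.start}, ${m.end})")  // prints [0, 3), i.e. aba,
+                                    // not the POSIX answer [0, 5)
+\end{lstlisting}
+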
+Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
+commented that most regex libraries are not
+correctly implementing the central POSIX
+rule, called the maximum munch rule.
+Grathwohl \parencite{grathwohl2014crash} wrote:
+\begin{quote}\it
+ ``The POSIX strategy is more complicated than the
+ greedy because of the dependence on information about
+ the length of matched strings in the various subexpressions.''
+\end{quote}
+%\noindent
+People have recognised that the implementation complexity of the POSIX rules also comes from
+the specification not being very precise.
+The developers of the regexp package of Go
+\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
+commented that
+\begin{quote}\it
+``The POSIX rule is computationally prohibitive
+and not even well-defined.''
+\end{quote}
+There are many informal summaries of this disambiguation
+strategy, which are often quite long and delicate.
+For example Kuklewicz \cite{KuklewiczHaskell}
+described the POSIX rule as (section 1, last paragraph):
+\begin{quote}
+	\begin{itemize}
+		\item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns
+\item
+REs have right associative concatenation which can be changed with parenthesis
+\item
+parenthesized subexpressions return the match from their last usage
+\item
+text of component subexpressions must be contained in the text of the
+higher-level subexpressions
+\item
+if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions
+\item
+if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string
+\end{itemize}
+\end{quote}
+%The text above
+%is trying to capture something very precise,
+%and is crying out for formalising.