diff -r 660cf698eb26 -r 3831621d7b14 ChengsongTanPhdThesis/Chapters/Overview.tex
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Overview.tex	Wed Aug 23 03:02:31 2023 +0100
@@ -0,0 +1,872 @@
+\chapter{Technical Overview}
+\label{Overview}
+We start with a technical overview of the
+catastrophic backtracking problem,
+motivating rigorous approaches to regular expression matching and lexing.
+In doing so we also briefly
+introduce common terminology such as
+bounded repetitions and back-references.
+
+\section{Motivating Examples}
+Consider for example the regular expression $(a^*)^*\,b$ and
+strings of the form $aa..a$. These strings cannot be matched by this regular
+expression: obviously the expected $b$ in the last
+position is missing. One would assume that modern regular expression
+matching engines can find this out very quickly. Surprisingly, if one tries
+this example in JavaScript, Python or Java 8, even with small strings,
+say of length around 30 $a$'s,
+the decision takes a large amount of time to finish,
+which is disproportionate
+to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
+The algorithms clearly show exponential behaviour, which, as can be seen,
+is triggered by relatively simple regular expressions.
+Java 9 and newer
+versions mitigate this behaviour by several orders of magnitude,
+but are still slow compared
+with the approach we are going to use in this thesis.
+
+
+
+This superlinear blowup in regular expression engines
+has caused grief in ``real life'' where it is
+given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
+A less serious example is a bug in the Atom editor:
+a user found that, after writing the following code,
+\begin{verbatim}
+   vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr
+\end{verbatim}
+\begin{verbatim}
+   && vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() ==
+\end{verbatim}
+\begin{verbatim}
+   reflect.Slice
+\end{verbatim}
+the editor took half an hour to finish computing.
+The issue was due to Atom's regex engine and was subsequently fixed
+by Galbraith \cite{fixedAtom}.
+But evil regular expressions can be more than a nuisance in a text editor:
+on 20 July 2016 one evil
+regular expression brought the webpage
+\href{http://stackexchange.com}{Stack Exchange} to its
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
+In this instance, a regular expression intended to just trim white
+spaces from the beginning and the end of a line actually consumed
+massive amounts of CPU resources---causing the web servers to grind to a
+halt. In this example, the time needed to process
+the string was
+$O(n^2)$ with respect to the string length $n$. This
+quadratic overhead was enough for the homepage of Stack Exchange to
+respond so slowly that the load balancer assumed a $\mathit{DoS}$
+attack and therefore stopped the servers from responding to any
+requests. This made the whole site unavailable.
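+
+This behaviour can be reproduced with a few lines of Scala. The
+snippet below is a small illustrative experiment (not part of our
+verified development) that uses the JVM's built-in
+\texttt{java.util.regex} engine and measures how long a \emph{failing}
+match of $(a^*)^*\,b$ takes as the input grows; the absolute timings
+will of course depend on the hardware and the JVM version:
+\begin{lstlisting}[language=Scala]
+import java.util.regex.Pattern
+
+object EvilRegexDemo extends App {
+  val evil = Pattern.compile("(a*)*b")
+  for (n <- 20 to 30) {
+    val input = "a" * n            // n copies of 'a', no final 'b'
+    val start = System.nanoTime()
+    evil.matcher(input).matches()  // always false: the 'b' is missing
+    val secs  = (System.nanoTime() - start) / 1e9
+    println(f"n = $n%2d: $secs%.3f secs")
+  }
+}
+\end{lstlisting}
+On a Java 8 JVM the reported times roughly double with each additional
+$a$, in line with the graphs in figure \ref{fig:aStarStarb}.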
+ +\begin{figure}[p] +\begin{center} +\begin{tabular}{@{}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={JavaScript}, + legend pos=north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Python}, + legend pos=north west, + legend cell align=left] +\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; +\end{axis} +\end{tikzpicture}\\ +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Java 8}, + legend pos=north west, + legend cell align=left] +\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Dart}, + legend pos=north west, + legend cell align=left] +\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; +\end{axis} +\end{tikzpicture}\\ +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Swift}, + legend pos=north west, + legend cell align=left] +\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=true, + %xtick={0,5000,...,40000}, + %xmax=40000, + %ymax=35, + restrict x to domain*=0:40000, + restrict y to domain*=0:35, + %ytick={0,5,...,30}, + %scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Java9+}, + legend pos=north west, + legend cell align=left] +\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; +\end{axis} +\end{tikzpicture}\\ +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + ylabel={time in secs}, + enlargelimits=true, + %xtick={0,5,...,30}, + %xmax=33, + %ymax=35, + restrict x to domain*=0:60000, + restrict y to domain*=0:35, + %ytick={0,5,...,30}, + %scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Scala}, + legend pos=north west, + legend cell align=left] +\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + 
	enlargelimits=true,
	%xtick={0,5000,...,40000},
	%xmax=40000,
	%ymax=35,
	restrict x to domain*=0:60000,
	restrict y to domain*=0:45,
	%ytick={0,5,...,30},
	%scaled ticks=false,
	axis lines=left,
	width=5cm,
	height=4cm,
	legend style={cells={align=left}},
	legend entries={Isabelle \\ Extracted},
	legend pos=north west,
	legend cell align=left]
+\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{2}{c}{Graphs}
+\end{tabular}
+\end{center}
+\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
+	of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
+	The reason for their fast growth %superlinear behaviour
+	is that they do a depth-first search
+	using NFAs.
+	If the string does not match, the regular expression matching
+	engine starts to explore all possibilities.
+	The last two graphs are for our implementation in Scala, one manual
+	and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
+	Our lexer
+	performs better in this case, and
+	is formally verified.
+	Despite the two being almost identical, the lexer extracted by
+	$\textit{codegen}$
+	is slower than the manually written version, since the code synthesised by
+	$\textit{codegen}$ does not use native integer or string
+	types of the target language.
+	%Note that we are comparing our \emph{lexer} against other languages' matchers.
+}\label{fig:aStarStarb}
+\end{figure}\afterpage{\clearpage}
+
+A more recent example is a global outage of all Cloudflare servers on
+2 July 2019.
+Cloudflare serves more traffic each day than
+Twitter, Amazon, Instagram, Apple, Bing and Wikipedia combined, yet
+it became completely unavailable for half an hour because
+a poorly-written regular expression, roughly of the form $.^*.^*=.^*$,
+exhausted the CPU resources that serve HTTP traffic.
+Although the outage
+had several causes, at the heart was a regular expression that
+was used to monitor network
+traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
+These problems with regular expressions
+are not isolated events that happen
+very rarely;
+they occur often enough that they have a
+name: Regular-Expression-Denial-Of-Service (ReDoS)
+attacks.
+Davis et al.\ \cite{Davis18} detected more
+than 1000 evil regular expressions
+in Node.js, Python core libraries, npm and PyPI.
+They therefore concluded that evil regular expressions
+are a real problem rather than just ``a parlour trick''.
+
+The work in this thesis aims to address this issue
+with the help of formal proofs.
+We describe a lexing algorithm based
+on Brzozowski derivatives with verified correctness
+and a finiteness property for the size of derivatives
+(all of which is proven in Isabelle/HOL).
+Such properties %guarantee the absence of
+are an important step in preventing
+catastrophic backtracking once and for all.
+We will give more details in the next sections
+on (i) why the slow cases in figure \ref{fig:aStarStarb}
+can occur in traditional regular expression engines
+and (ii) why we chose our
+approach based on Brzozowski derivatives and formal proofs.
+
+
+
+\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
+Regular expressions and regular expression matchers
+have been studied for many years.
+Theoretical results in automata theory state
+that basic regular expression matching should be linear
+w.r.t.\ the length of the input.
+This assumes that the regular expression
+$r$ was pre-processed and turned into a
+deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
+By basic we mean textbook definitions such as the one
+below, involving only regular expressions for characters, alternatives,
+sequences, and Kleene stars:
+\[
+	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
+\]
+Modern regular expression matchers used by programmers,
+however,
+support much richer constructs, such as bounded repetitions,
+negations,
+and back-references.
+To differentiate, we use the word \emph{regex} to refer
+to those expressions with richer constructs, while reserving the
+term \emph{regular expression}
+for the more traditional meaning in formal language theory.
+We follow this convention
+throughout this thesis.
+In the future, we aim to support all the popular features of regexes,
+but for this work we mainly look at basic regular expressions
+and bounded repetitions.
+
+
+%Most modern regex libraries
+%the so-called PCRE standard (Perl Compatible Regular Expressions)
+%has the back-references
+Regexes come with a number of constructs
+that make it more convenient for
+programmers to write regular expressions.
+Depending on the types of constructs,
+the task of matching and lexing with them
+will have different levels of complexity.
+Some of those constructs are merely syntactic sugar, that is,
+shorthand notations
+that save programmers a few keystrokes.
+These will not cause problems for regex libraries.
+For example the
+non-binary alternative involving three or more choices just means
+nested binary alternatives:
+\[
+	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
+\]
+Similarly, the range operator
+%used to express the alternative
+%of all characters between its operands,
+is just a concise way
+of expressing an alternative of consecutive characters:
+\[
+	[0-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
+\]
+The
+wildcard character `$.$' is used to refer to any single character
+except the newline:
+\[
+	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
+\]
+
+\subsection{Bounded Repetitions}
+More interesting are bounded repetitions, which can
+make regular expressions much
+more compact.
+Normally there are four kinds of bounded repetitions:
+$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
+(where $n$ and $m$ are constant natural numbers).
+Like the Kleene star, the set of strings or language
+a bounded regular expression can match
+is defined using the power operation on sets:
+\begin{center}
+	\begin{tabular}{lcl}
+		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
+		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
+		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
+		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
+	\end{tabular}
+\end{center}
+The attraction of bounded repetitions is that they can be
+used to avoid a size blowup: for example $r^{\{n\}}$
+is a shorthand for
+the much longer regular expression
+\[
+	\underbrace{r\ldots r}_\text{$n$ copies of $r$}.
+\]
+%Therefore, a naive algorithm that simply unfolds
+%them into their desugared forms
+%will suffer from at least an exponential runtime increase.
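+
+In an implementation, the counters can simply be stored as integers in
+the abstract syntax tree, so that $r^{\{n\}}$ occupies constant rather
+than linear space in $n$. A minimal Scala sketch of such a datatype
+(the constructor names here are illustrative only, not taken from any
+particular library) could look as follows:
+\begin{lstlisting}[language=Scala]
+// basic regular expressions ...
+sealed trait Rexp
+case object ZERO extends Rexp                    // matches nothing
+case object ONE extends Rexp                     // matches the empty string
+case class CHAR(c: Char) extends Rexp            // c
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
+case class STAR(r: Rexp) extends Rexp            // r*
+// ... extended with the four bounded repetitions; the counters are
+// kept as plain integers, keeping the tree size small
+case class NTIMES(r: Rexp, n: Int) extends Rexp           // r{n}
+case class UPTO(r: Rexp, m: Int) extends Rexp             // r{..m}
+case class FROM(r: Rexp, n: Int) extends Rexp             // r{n..}
+case class BETWEEN(r: Rexp, n: Int, m: Int) extends Rexp  // r{n..m}
+\end{lstlisting}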
+
+
+The problem with matching
+such bounded repetitions
+is that tools based on the classic notion of
+automata need to expand $r^{\{n\}}$ into $n$ connected
+copies of the automaton for $r$. This leads to very inefficient matching
+algorithms or algorithms that consume large amounts of memory.
+Implementations using $\DFA$s will
+in such situations
+either become very slow
+(for example Verbatim++ \cite{Verbatimpp}) or run
+out of memory (for example $\mathit{LEX}$ and
+$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
+in C and Java that generate $\mathit{DFA}$-based
+lexers. The user provides a set of regular expressions
+and configurations, and then
+gets an output program encoding a minimized $\mathit{DFA}$
+that can be compiled and run.
+When given a bounded regular expression like the ones above,
+even a relatively small $n$ (say 20) results in a program representing a
+DFA
+with millions of states.}) for large counters.
+A classic example of this phenomenon is the regular expression
+$(a+b)^* a (a+b)^{\{n\}}$,
+whose minimal DFA requires at least $2^{n+1}$ states.
+For example, when $n$ is equal to 2,
+the corresponding $\mathit{NFA}$ looks like:
+\vspace{6mm}
+\begin{center}
+\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
+   \node[state,initial] (q_0)   {$q_0$};
+   \node[state, red] (q_1) [right=of q_0] {$q_1$};
+   \node[state, red] (q_2) [right=of q_1] {$q_2$};
+   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
+    \path[->]
+    (q_0) edge  node {a} (q_1)
+    	  edge [loop below] node {a,b} ()
+    (q_1) edge  node  {a,b} (q_2)
+    (q_2) edge  node  {a,b} (q_3);
+\end{tikzpicture}
+\end{center}
+and when turned into a DFA by the subset construction
+requires at least $2^3$ states.\footnote{The
+red states are ``countdown states'' which count down
+the number of characters still needed
+to complete a successful match.
+For example, state $q_1$ indicates a match that has
+gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
+has just consumed the ``delimiter'' $a$ in the middle, and
+needs to match 2 more iterations of $(a|b)$ to complete.
+State $q_2$, on the other hand, can be viewed as the state
+after $q_1$ has consumed 1 character; it just waits
+for 1 more character to complete.
+The state $q_3$ is the last (accepting) state, requiring 0
+more characters.
+Depending on the suffix of the
+input string up to the current read location,
+the states $q_1$, $q_2$ and $q_3$
+may or may
+not be active.
+A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
+contain at least $2^3$ non-equivalent states that cannot be merged,
+because the subset construction during determinisation will generate
+all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
+Generalizing this to regular expressions with larger
+bounded repetition counts, we have that
+regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
+require at least $2^{n+1}$ states if $r$ itself matches
+more than 1 string.
+This is to represent all different
+scenarios in which ``countdown'' states are active.}
+
+
+Bounded repetitions are important because they
+tend to occur frequently in practical use,
+for example in the regex library RegExLib, in
+the rules library of Snort \cite{Snort1999}\footnote{
+Snort is a network intrusion detection (NID) tool
+for monitoring network traffic.
+The network security community curates a list
+of malicious patterns written as regexes,
+which is used by Snort's detection engine
+to match against network traffic for any hostile
+activities such as buffer overflow attacks.},
+as well as in XML Schema definitions (XSDs).
+According to Bj\"{o}rklund et al.\ \cite{xml2015},
+more than half of the
+XSDs they found on the Maven.org central repository
+have bounded regular expressions in them.
+Often the counters are quite large, with the largest being
+close to ten million.
+A smaller example XSD they gave
+is:
+\lstset{
+	basicstyle=\fontsize{8.5}{9}\ttfamily,
+	language=XML,
+	morekeywords={encoding,
+		xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
+}
+\begin{lstlisting}
+<sequence minOccurs="0" maxOccurs="65535">
+	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{lstlisting}
+This can be seen as the regex
+$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
+regular expressions
+satisfying certain constraints (such as
+satisfying the floating point number format).
+It is therefore quite unsatisfying that
+some regular expression matching libraries
+impose ad hoc limits
+on bounded regular expressions:
+for example, in the regular expression matching library of the Go
+language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
+can be above 1000, and in the built-in Rust regular expression library
+expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
+for being too big.\footnote{Try it out here: \url{https://rustexp.lpil.uk}}
+As Becchi and Crawley \cite{Becchi08} have pointed out,
+the reason for these restrictions
+is that they simulate a non-deterministic finite
+automaton (NFA) with a breadth-first search.
+In this way the number of active states can
+be proportional to the counter value.
+When the counters are large,
+the memory requirement can become
+infeasible, and a regex engine
+like Go's will reject such patterns straight away.
+\begin{figure}[H]
+\begin{center}
+\begin{tikzpicture} [node distance = 2cm, on grid, auto]
+
+    \node (q0) [state, initial] {$0$};
+    \node (q1) [state, right = of q0] {$1$};
+    %\node (q2) [state, right = of q1] {$2$};
+    \node (qdots) [right = of q1] {$\ldots$};
+    \node (qn) [state, right = of qdots] {$n$};
+    \node (qn1) [state, right = of qn] {$n+1$};
+    \node (qn2) [state, right = of qn1] {$n+2$};
+    \node (qn3) [state, accepting, right = of qn2] {$n+3$};
+
+\path [-stealth, thick]
+    (q0) edge [loop above] node {a} ()
+    (q0) edge node {a} (q1)
+    %(q1) edge node {.} (q2)
+    (q1) edge node {.} (qdots)
+    (qdots) edge node {.} (qn)
+    (qn) edge node {.} (qn1)
+    (qn1) edge node {b} (qn2)
+    (qn2) edge node {$c$} (qn3);
+\end{tikzpicture}
+\end{center}
+\caption{The example given by Becchi and Crawley
+	showing that NFA simulation can consume large
+	amounts of memory: $.^*a.^{\{n\}}bc$ matching
+	strings of the form $aaa\ldots aaaabc$.
+	When traversing in a breadth-first manner,
+all states from 0 to $n+1$ can become active.}\label{fig:inj}
+
+\end{figure}
+%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
+%type of $\mathit{NFA}$ simulation and guarantee a linear runtime
+%in terms of input string length.
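+
+As an aside, a simple calculation shows why the Rust expression above
+counts as ``too big'': nested counters multiply. From the language
+definitions above we have
+\[
+	L \left( (r^{\{n\}})^{\{m\}} \right) = \left( (L \; r)^n \right)^m = (L \; r)^{n \cdot m},
+\]
+so $a^{\{1000\}\{100\}\{5\}}$ denotes the same language as
+$a^{\{500000\}}$, and a breadth-first NFA simulation would have to be
+prepared to track on the order of half a million active states.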
+%TODO:try out these lexers
+These problems can of course be solved in matching algorithms where
+automata go beyond the classic notion and for instance include explicit
+counters \cite{Turo_ov__2020}.
+These solutions can be quite efficient,
+with the ability to process
+gigabits of input per second
+even with large counters \cite{Becchi08}.
+However, these practical solutions do not come with
+formal guarantees and, as pointed out by
+Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
+%But formal reasoning about these automata especially in Isabelle
+%can be challenging
+%and un-intuitive.
+%Therefore, we take correctness and runtime claims made about these solutions
+%with a grain of salt.
+
+In the work reported in \cite{ITP2023} and here,
+we add better support using derivatives
+for bounded regular expressions such as $r^{\{n\}}$.
+Our results
+extend straightforwardly to
+repetitions with intervals such as
+$r^{\{n\ldots m\}}$.
+The merit of Brzozowski derivatives (more on this later)
+on this problem is that
+they can be naturally extended to support bounded repetitions.
+Moreover these extensions are still made up of only small
+inductive datatypes and recursive functions,
+making them easy to handle in theorem provers.
+%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
+%straightforwardly extended to deal with bounded regular expressions
+%and moreover the resulting code still consists of only simple
+%recursive functions and inductive datatypes.
+Finally, bounded regular expressions do not destroy our finite
+boundedness property, which we shall prove later on.
+
+
+
+
+
+\subsection{Back-References}
+The other way to simulate an $\mathit{NFA}$ for matching is to choose
+a single transition each time, keep all the other options in
+a queue or stack, and backtrack if that choice eventually
+fails.
+This method, often called a ``depth-first search'',
+is efficient in many cases, but can end up
+with exponential runtime.
+The backtracking method is employed in regex libraries
+that support \emph{back-references}, for example
+in Java and Python.
+%\section{Back-references and The Terminology Regex}
+
+%When one constructs an $\NFA$ out of a regular expression
+%there is often very little to be done in the first phase, one simply
+%construct the $\NFA$ states based on the structure of the input regular expression.
+
+%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
+%one by keeping track of all active states after consuming
+%a character, and update that set of states iteratively.
+%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
+%for a path terminating
+%at an accepting state.
+
+
+
+Consider the following regular expression where the sequence
+operator is omitted for brevity:
+\begin{center}
+	$r_1r_2r_3r_4$
+\end{center}
+In this example,
+one could label sub-expressions of interest
+by parenthesizing them and giving
+them a number in the order in which their opening parentheses appear.
+One possible way of parenthesizing and labelling is:
+\begin{center}
+	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
+\end{center}
+The sub-expressions
+$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
+by 1 to 4, and can be ``referred back'' to by their respective numbers.
+%These sub-expressions are called "capturing groups".
+To do so, one uses the syntax $\backslash i$
+to denote that we want the sub-string
+of the input matched by the $i$-th
+sub-expression to appear again,
+exactly the same as it first appeared:
+\begin{center}
+$\ldots\underset{\text{$i$-th lparen}}{(}{r_i})\ldots
+\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
+\end{center}
+Once the sub-string $s_i$ for the sub-expression $r_i$
+has been fixed, there is no variability in what the back-reference
+label $\backslash i$ can match---it is tied to $s_i$.
+The overall string will look like $\ldots s_i \ldots s_i \ldots$
+%The backslash and number $i$ are the
+%so-called "back-references".
+%Let $e$ be an expression made of regular expressions
+%and back-references. $e$ contains the expression $e_i$
+%as its $i$-th capturing group.
+%The semantics of back-reference can be recursively
+%written as:
+%\begin{center}
+%	\begin{tabular}{c}
+%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
+%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
+%	\end{tabular}
+%\end{center}
+A concrete example
+of a back-reference is
+\begin{center}
+$(.^*)\backslash 1$,
+\end{center}
+which matches
+strings that can be split into two identical halves,
+for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
+Note that this is different from
+repeating the sub-expression verbatim like
+\begin{center}
+	$(.^*)(.^*)$,
+\end{center}
+which does not impose any restrictions on what strings the second
+sub-expression $.^*$
+might match.
+Another example of a back-reference is
+\begin{center}
+$(.)(.)\backslash 2\backslash 1$,
+\end{center}
+which matches four-character palindromes
+like $abba$, $x??x$ and so on.
+
+Back-references are a regex construct
+that programmers find quite useful.
+According to Becchi and Crawley \cite{Becchi08},
+6\% of Snort rules (up until 2008) use them.
+The most common use of back-references
+is to describe well-formed HTML files,
+where back-references are convenient for matching
+corresponding opening and closing tags like
+\begin{center}
+	$\langle html \rangle \ldots \langle / html \rangle$
+\end{center}
+A regex describing such a format
+is
+\begin{center}
+	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
+\end{center}
+Despite being useful, the expressive power of regexes
+goes beyond regular languages
+once back-references are included.
+In fact, back-references allow regexes to express
+languages that are not even context-free.
+For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
+expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
+which cannot be expressed by
+context-free grammars \cite{campeanu2003formal};
+it is, however, context-sensitive.
+Moreover, solving the matching problem involving back-references
+is known to be NP-complete \parencite{alfred2014algorithms}.
+Regex libraries supporting back-references such as
+PCRE \cite{pcre} therefore have to
+resort to a depth-first search algorithm with backtracking.
+What is unexpected is that even in the cases
+not involving back-references, there is still
+a (non-negligible) chance they might backtrack super-linearly,
+as shown in the graphs in figure \ref{fig:aStarStarb}.
+
+Summing up, we can categorise existing
+practical regex libraries into two kinds:
+(i) the ones with linear
+time guarantees like Go and Rust.
+The downside is that
+they impose restrictions
+on the regular expressions (not allowing back-references,
+limiting bounded repetitions to an ad hoc bound, etc.).
+And (ii) those
+that allow large bounded regular expressions and back-references
+at the expense of using backtracking algorithms.
+They can potentially ``grind to a halt''
+on some very simple cases, resulting
+in ReDoS attacks if exposed to the internet.
+
+The problems with both approaches are the motivation for us
+to look again at the regular expression matching problem.
+Another motivation is that regular expression matching algorithms
+that follow the POSIX standard often contain errors and bugs,
+as we shall explain next.
+
+%We would like to have regex engines that can
+%deal with the regular part (e.g.
+%bounded repetitions) of regexes more
+%efficiently.
+%Also we want to make sure that they do it correctly.
+%It turns out that such aim is not so easy to achieve.
+  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
+% For example, the Rust regex engine claims to be linear,
+% but does not support lookarounds and back-references.
+% The GoLang regex library does not support over 1000 repetitions.
+% Java and Python both support back-references, but show
+%catastrophic backtracking behaviours on inputs without back-references(
+%when the language is still regular).
+  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
+  %TODO: verify the fact Rust does not allow 1000+ reps
+
+
+
+
+%The time cost of regex matching algorithms in general
+%involve two different phases, and different things can go differently wrong on
+%these phases.
+%$\DFA$s usually have problems in the first (construction) phase
+%, whereas $\NFA$s usually run into trouble
+%on the second phase.
+
+
+\section{Error-prone POSIX Implementations}
+Very often there are multiple ways of matching a string
+with a regular expression.
+In such cases the regular expression matcher needs to
+disambiguate.
+The most widely used strategy is called POSIX, which,
+roughly speaking, always chooses the longest initial match.
+The POSIX strategy is adopted by many regular expression matchers
+because it is a natural strategy for lexers.
+However, many implementations (including the C libraries
+used by Linux and OS X distributions) contain bugs
+or do not meet the specification they claim to adhere to.
+Kuklewicz maintains a unit test repository which lists some
+problems with existing regular expression engines \cite{KuklewiczHaskell}.
+In some cases, they either fail to generate a
+result when there exists a match,
+or give results that are inconsistent with the POSIX standard.
+A concrete example is the regex
+\begin{center}
+	$(aba + ab + a)^*$ and the string $ababa$.
+\end{center}
+The correct POSIX match for the above
+involves the entire string $ababa$,
+split into two Kleene star iterations, namely $[ab], [aba]$ at positions
+$[0, 2), [2, 5)$
+respectively.
+But feeding this example to the different engines
+listed at regex101\footnote{
+	regex101 is an online regular expression matcher which
+	provides an API for trying out the regular expression
+	engines of multiple popular programming languages like
+Java, Python, Go, etc.} \parencite{regex101}
+yields
+only two incomplete matches: $[aba]$ at $[0, 3)$
+and $a$ at $[4, 5)$.
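+The effect is easy to reproduce. In the following small Scala
+experiment (an illustrative sketch of ours, using the JVM's greedy
+backtracking engine in \texttt{java.util.regex}, which does not
+implement the POSIX rule; other engines may report the spans slightly
+differently) the first match found is also cut short:
+\begin{lstlisting}[language=Scala]
+import java.util.regex.Pattern
+
+object GreedyVsPosix extends App {
+  val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")
+  if (m.find())
+    println(s"'${m.group()}' at [${m.start()}, ${m.end()})")
+  // prints 'aba' at [0, 3): the greedy star commits to the first
+  // alternative and stops, whereas the POSIX longest-match rule
+  // requires 'ababa' at [0, 5)
+}
+\end{lstlisting}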
+Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
+commented that most regex libraries do not
+correctly implement the central POSIX
+rule, called the maximum munch rule.
+Grathwohl \parencite{grathwohl2014crash} wrote:
+\begin{quote}\it
+	``The POSIX strategy is more complicated than the
+	greedy because of the dependence on information about
+	the length of matched strings in the various subexpressions.''
+\end{quote}
+%\noindent
+People have recognised that the implementation complexity of the POSIX
+rules also comes from
+the specification not being very precise.
+The developers of the regexp package of Go
+\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
+commented that
+\begin{quote}\it
+``The POSIX rule is computationally prohibitive
+and not even well-defined.''
+\end{quote}
+There are many informal summaries of this disambiguation
+strategy, which are often quite long and delicate.
+For example Kuklewicz \cite{KuklewiczHaskell}
+described the POSIX rule as follows (section 1, last paragraph):
+\begin{quote}
+	\begin{itemize}
+		\item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns
+\item
+REs have right associative concatenation which can be changed with parenthesis
+\item
+parenthesized subexpressions return the match from their last usage
+\item
+text of component subexpressions must be contained in the text of the
+higher-level subexpressions
+\item
+if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions
+\item
+if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string
+\end{itemize}
+\end{quote}
+%The text above
+%is trying to capture something very precise,
+%and is crying out for formalising.
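+To illustrate the first two rules, consider matching the string $abc$
+against the regular expression $(a + ab) \cdot (c + bc)$ (an example
+adapted from the POSIX lexing literature). A greedy matcher would try
+the first alternatives first and settle for $a$ followed by $bc$. The
+POSIX rule instead gives the earlier subpattern leftmost-longest
+priority: the first subexpression must match $ab$, the longest
+possibility, leaving $c$ for the second.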