diff -r 660cf698eb26 -r 3831621d7b14 ChengsongTanPhdThesis/Chapters/Inj.tex
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Tue Jul 25 17:28:29 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Wed Aug 23 03:02:31 2023 +0100
@@ -7,906 +7,33 @@
 % used for describing the lexing algorithm by Sulzmann and Lu,
 %and then give the algorithm and its variant and discuss
 %why more aggressive simplifications are needed.
-We start with a technical overview of the catastrophic backtracking problem,
-motivating rigorous approaches to regular expression matching and lexing.
-Then we introduce the basic notations and definitions of our problem.
-
-\section{Technical Overview}
-Consider for example the regular expression $(a^*)^*\,b$ and
-strings of the form $aa..a$. These strings cannot be matched by this regular
-expression: obviously the expected $b$ in the last
-position is missing. One would assume that modern regular expression
-matching engines can find this out very quickly. Surprisingly, if one tries
-this example in JavaScript, Python or Java 8, even with small strings,
-say of length around 30 $a$'s,
-the decision takes a large amount of time to finish.
-This is disproportionate
-to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
-The algorithms clearly show exponential behaviour, which, as can be seen,
-is triggered by relatively simple regular expressions.
-Java 9 and newer
-versions mitigate this behaviour by several orders of magnitude,
-but are still slow compared
-with the approach we are going to use in this thesis.
-
-
-
-This superlinear blowup in regular expression engines
-has caused grief in ``real life'' where it is
-given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
-For example, on 20 July 2016 one evil
-regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
-In this instance, a regular expression intended to just trim white
-spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU resources---causing the web servers to grind to a
-halt. In this example, the time needed to process
-the string was
-$O(n^2)$ with respect to the string length $n$. This
-quadratic overhead was enough for the homepage of Stack Exchange to
-respond so slowly that the load balancer assumed a $\mathit{DoS}$
-attack and therefore stopped the servers from responding to any
-requests. This made the whole site unavailable.
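The effect is easy to reproduce. The following small Scala program is a
sketch for illustration only: it goes through the JDK's backtracking regex
engine, and the exact timings depend on the machine and the JDK version.
It measures how long \texttt{java.util.regex} takes to reject strings of
$a$'s against the pattern $(a^*)^*\,b$:

\begin{lstlisting}[language=Scala]
// Sketch: timing the JDK's backtracking matcher on (a*)*b.
// On a Java 8 JVM the times grow exponentially in n; newer JDKs
// mitigate this particular pattern.
import java.util.regex.Pattern

object Backtrack {
  def time[A](block: => A): (A, Double) = {
    val t0 = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - t0) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    val pat = Pattern.compile("(a*)*b")
    for (n <- 10 to 30 by 5) {
      val s = "a" * n
      val (matched, secs) = time(pat.matcher(s).matches())
      println(f"n=$n%2d matched=$matched time=$secs%.3f s")
    }
  }
}
\end{lstlisting}

This reproduces the JavaScript/Python/Java~8 curves in figure
\ref{fig:aStarStarb} qualitatively; the absolute numbers will of course
differ.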
- -\begin{figure}[p] -\begin{center} -\begin{tabular}{@{}c@{\hspace{0mm}}c@{}} -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=35, - ytick={0,5,...,30}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={JavaScript}, - legend pos=north west, - legend cell align=left] -\addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - %ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=35, - ytick={0,5,...,30}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={Python}, - legend pos=north west, - legend cell align=left] -\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; -\end{axis} -\end{tikzpicture}\\ -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=35, - ytick={0,5,...,30}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={Java 8}, - legend pos=north west, - legend cell align=left] -\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - %ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=35, - ytick={0,5,...,30}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={Dart}, - legend pos=north west, - legend cell align=left] -\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; -\end{axis} -\end{tikzpicture}\\ -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - ylabel={time in secs}, - enlargelimits=false, - xtick={0,5,...,30}, - xmax=33, - ymax=35, - ytick={0,5,...,30}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={Swift}, - legend pos=north west, - legend cell align=left] -\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - %ylabel={time in secs}, - enlargelimits=true, - %xtick={0,5000,...,40000}, - %xmax=40000, - %ymax=35, - restrict x to domain*=0:40000, - restrict y to domain*=0:35, - %ytick={0,5,...,30}, - %scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={Java9+}, - legend pos=north west, - legend cell align=left] -\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; -\end{axis} -\end{tikzpicture}\\ -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - ylabel={time in secs}, - enlargelimits=true, - %xtick={0,5,...,30}, - %xmax=33, - %ymax=35, - restrict x to domain*=0:60000, - restrict y to domain*=0:35, - %ytick={0,5,...,30}, - %scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={Scala}, - legend pos=north west, - legend cell align=left] -\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data}; -\end{axis} -\end{tikzpicture} - & -\begin{tikzpicture} -\begin{axis}[ - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - %ylabel={time in secs}, - 
enlargelimits=true,
-    %xtick={0,5000,...,40000},
-    %xmax=40000,
-    %ymax=35,
-    restrict x to domain*=0:60000,
-    restrict y to domain*=0:45,
-    %ytick={0,5,...,30},
-    %scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm,
-    legend style={cells={align=left}},
-    legend entries={Isabelle \\ Extracted},
-    legend pos=north west,
-    legend cell align=left]
-\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{2}{c}{Graphs}
-\end{tabular}
-\end{center}
-\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
-  of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
-  The reason for their fast growth %superlinear behaviour
-  is that they do a depth-first search
-  using NFAs.
-  If the string does not match, the regular expression matching
-  engine starts to explore all possibilities.
-  The last two graphs are for our implementation in Scala, one manual
-  and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
-  Our lexer
-  performs better in this case, and
-  is formally verified.
-  Despite being almost identical, the codegen-generated lexer
-  % is generated directly from Isabelle using $\textit{codegen}$,
-  is slower than the manually written version since the code synthesised by
-  $\textit{codegen}$ does not use the native integer or string
-  types of the target language.
-  %Note that we are comparing our \emph{lexer} against other languages' matchers.
-}\label{fig:aStarStarb}
-\end{figure}\afterpage{\clearpage}
-
-A more recent example is a global outage of all Cloudflare servers on 2 July
-2019. A poorly written regular expression exhibited catastrophic backtracking
-and exhausted the CPUs that serve HTTP traffic. Although the outage
-had several causes, at the heart was a regular expression that
-was used to monitor network
-traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
-These problems with regular expressions
-are not isolated events that happen
-very rarely,
-%but actually widespread.
-%They occur so often that they have a
-but occur often enough that they have a
-name: Regular-Expression-Denial-Of-Service (ReDoS)
-attacks.
-Davis et al.\ \cite{Davis18} detected more
-than 1000 evil regular expressions
-in Node.js, Python core libraries, npm and pypi.
-They therefore concluded that evil regular expressions
-are a real problem rather than just ``a parlour trick''.
-
-The work in this thesis aims to address this issue
-with the help of formal proofs.
-We describe a lexing algorithm based
-on Brzozowski derivatives with verified correctness
-and a finiteness property for the size of derivatives
-(all of which is formalised in Isabelle/HOL).
-Such properties %guarantee the absence of
-are an important step in preventing
-catastrophic backtracking once and for all.
-We will give more details in the next sections
-on (i) why the slow cases in figure \ref{fig:aStarStarb}
-can occur in traditional regular expression engines
-and (ii) why we choose our
-approach based on Brzozowski derivatives and formal proofs.
-
-
-
-
-In this chapter, we define the basic notions
-for regular languages and regular expressions.
-This is essentially a description in ``English'' of
-the functions and datatypes of our formalisation in Isabelle/HOL.
-We also define what $\POSIX$ lexing means,
-followed by the first lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014}
-that produces an output conforming
-to the $\POSIX$ standard.
-This is a preliminary chapter which describes the results of
-Sulzmann and Lu and Ausaf et al.\ \cite{AusafDyckhoffUrban2016},
-but the proof details and function definitions are needed to motivate our work,
-as we show in chapter \ref{Bitcoded2} how the proofs break down when
-simplifications are applied.
-%TODO: Actually show this in chapter 4.
-
-In what follows
-we choose to use the Isabelle-style notation
-for function and datatype constructor applications, where
-the parameters of a function are not enclosed
-inside a pair of parentheses (e.g. $f \;x \;y$
-instead of $f(x,\;y)$). This is mainly
-to make the text visually more concise.
-
-
-
-
-\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
-Regular expressions and regular expression matchers
-have been studied for many decades.
-Theoretical results in automata theory state
-that basic regular expression matching should be linear
-w.r.t.\ the length of the input.
-This assumes that the regular expression
-$r$ was pre-processed and turned into a
-deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
-By basic we mean textbook definitions such as the one
-below, involving only regular expressions for characters, alternatives,
-sequences, and Kleene stars:
-\[
-	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
-\]
-Modern regular expression matchers used by programmers,
-however,
-support much richer constructs, such as bounded repetitions,
-negations,
-and back-references.
-To differentiate, we use the word \emph{regex} to refer
-to those expressions with richer constructs while reserving the
-term \emph{regular expression}
-for the more traditional meaning in formal language theory.
-We follow this convention
-throughout the thesis.
-In the future, we aim to support all the popular features of regexes,
-but for this work we mainly look at basic regular expressions
-and bounded repetitions.
-
-
-
-%Most modern regex libraries
-%the so-called PCRE standard (Perl Compatible Regular Expressions)
-%has the back-references
-Regexes come with a number of constructs
-that make it more convenient for
-programmers to write regular expressions.
-Depending on the type of construct,
-the task of matching and lexing
-has different levels of complexity.
-Some of those constructs are just syntactic sugar, that is,
-shorthand notations
-that save programmers a few keystrokes.
-These will not cause problems for regex libraries.
-For example the
-non-binary alternative involving three or more choices just means:
-\[
-	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
-\]
-Similarly, the range operator
-%used to express the alternative
-%of all characters between its operands,
-is just a concise way
-of expressing an alternative of consecutive characters:
-\[
-	[0-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
-\]
-The wildcard character `$.$' refers to any single character except
-the newline:
-\[
-	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
-\]
-
-\subsection{Bounded Repetitions}
-More interesting are bounded repetitions, which can
-make regular expressions much
-more compact.
-Normally there are four kinds of bounded repetitions:
-$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
-(where $n$ and $m$ are constant natural numbers).
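Basic regular expressions, including bounded repetitions, form a small
inductive datatype, which is one reason they are pleasant to work with in
a theorem prover or a functional language. As a concrete illustration,
one possible Scala rendering is sketched below; the constructor names are
ours and simply mirror the grammar above, with the usual constants for
the empty language (ZERO) and the empty string (ONE) added:

\begin{lstlisting}[language=Scala]
// Illustrative datatype for basic regular expressions,
// mirroring r ::= c | r1 + r2 | r1 . r2 | r*,
// extended with the bounded repetition r{n}.
abstract class Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative  r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence     r1 . r2
case class STAR(r: Rexp) extends Rexp            // Kleene star  r*
case class NTIMES(r: Rexp, n: Int) extends Rexp  // repetition   r{n}
\end{lstlisting}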
-Like the Kleene star, the set of strings, or language,
-that a bounded regular expression can match
-is defined using the power operation on sets:
-\begin{center}
-	\begin{tabular}{lcl}
-		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
-		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
-		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
-		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
-	\end{tabular}
-\end{center}
-The attraction of bounded repetitions is that they can be
-used to avoid a size blow-up: for example $r^{\{n\}}$
-is a shorthand for
-the much longer regular expression:
-\[
-	\underbrace{r\ldots r}_\text{n copies of r}.
-\]
-%Therefore, a naive algorithm that simply unfolds
-%them into their desugared forms
-%will suffer from at least an exponential runtime increase.
-The problem with matching
-such bounded repetitions
-is that tools based on the classic notion of
-automata need to expand $r^{\{n\}}$ into $n$ connected
-copies of the automaton for $r$. This leads to very inefficient matching
-algorithms or algorithms that consume large amounts of memory.
-Implementations using $\DFA$s will
-in such situations
-either become very slow
-(for example Verbatim++ \cite{Verbatimpp}) or run
-out of memory (for example $\mathit{LEX}$ and
-$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
-in C and Java that generate $\mathit{DFA}$-based
-lexers. The user provides a set of regular expressions
-and configurations, and then
-gets an output program encoding a minimized $\mathit{DFA}$
-that can be compiled and run.
-When given the above countdown regular expression,
-a small $n$ (say 20) would result in a program representing a
-DFA
-with millions of states.}) for large counters.
-A classic example of this phenomenon is the regular expression $(a+b)^* a (a+b)^{\{n\}}$
-whose minimal DFA requires at least $2^{n+1}$ states.
-For example, when $n$ is equal to 2,
-the corresponding $\mathit{NFA}$ looks like:
-\vspace{6mm}
-\begin{center}
-\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
-   \node[state,initial] (q_0)   {$q_0$};
-   \node[state, red] (q_1) [right=of q_0] {$q_1$};
-   \node[state, red] (q_2) [right=of q_1] {$q_2$};
-   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
-    \path[->]
-    (q_0) edge  node {a} (q_1)
-    	  edge [loop below] node {a,b} ()
-    (q_1) edge  node  {a,b} (q_2)
-    (q_2) edge  node  {a,b} (q_3);
-\end{tikzpicture}
-\end{center}
-and when turned into a DFA by the subset construction
-requires at least $2^3$ states.\footnote{The
-red states are ``countdown states'' which count down
-the number of characters needed, in addition to the current
-string, to make a successful match.
-For example, state $q_1$ indicates a match that has
-gone past the $(a+b)^*$ part of $(a+b)^*a(a+b)^{\{2\}}$,
-just consumed the ``delimiter'' $a$ in the middle, and
-needs to match 2 more iterations of $(a+b)$ to complete.
-State $q_2$, on the other hand, can be viewed as a state
-after $q_1$ has consumed 1 character, and just waits
-for 1 more character to complete.
-The state $q_3$ is the last (accepting) state, requiring 0
-more characters.
-Depending on the part of the
-input string read so far,
-the states $q_1$, $q_2$ and $q_3$
-may or may
-not be active.
-A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
-contain at least $2^3$ non-equivalent states that cannot be merged,
-because the subset construction during determinisation will generate
-all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
-Generalizing this to regular expressions with larger
-repetition bounds, we have that
-regexes of the shape $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
-require at least $2^{n+1}$ states if the language of $r$ contains
-more than one string.
-This is to represent all different
-scenarios in which ``countdown'' states are active.}
-
+\marginpar{Georg's comment: addressed the problem of attributing work correctly;
+showed clearly that these definitions come from Ausaf et al.}
-Bounded repetitions are important because they
-tend to occur frequently in practical use,
-for example in the regex library RegExLib, in
-the rules library of Snort \cite{Snort1999}\footnote{
-Snort is a network intrusion detection (NID) tool
-for monitoring network traffic.
-The network security community curates a list
-of malicious patterns written as regexes,
-which is used by Snort's detection engine
-to match against network traffic for hostile
-activities such as buffer overflow attacks.},
-as well as in XML Schema definitions (XSDs).
-According to Bj\"{o}rklund et al.\ \cite{xml2015},
-more than half of the
-XSDs they found on the Maven.org central repository
-have bounded regular expressions in them.
-Often the counters are quite large, with the largest being
-close to ten million.
-A smaller sample XSD they gave
-is:
-\lstset{
-	basicstyle=\fontsize{8.5}{9}\ttfamily,
-  language=XML,
-  morekeywords={encoding,
-    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
-}
-\begin{lstlisting}
-
-
-
-
-\end{lstlisting}
-This can be seen as the regex
-$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
-regular expressions
-satisfying certain constraints (such as
-conforming to the floating point number format).
-It is therefore quite unsatisfying that
-some regular expression matching libraries
-impose ad hoc limits
-on bounded regular expressions:
-For example, in the regular expression matching library of the Go
-language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
-can be above 1000, and in the built-in Rust regular expression library
-expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
-for being too big.
-As Becchi and Crawley \cite{Becchi08} have pointed out,
-the reason for these restrictions
-is that they simulate a non-deterministic finite
-automaton (NFA) with a breadth-first search.
-This way the number of active states can
-be as large as the counter value.
-When the counters are large,
-the memory requirement can become
-infeasible, and a regex engine
-like Go's will reject such patterns straight away.
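The difference in representation size is easy to see with the
illustrative Scala datatype sketched earlier: unfolding a counter
multiplies the size of the expression (and of any automaton built from
it), whereas a counter kept symbolically is constant-size.

\begin{lstlisting}[language=Scala]
// Naive unfolding of r{n} into n sequenced copies of r.
def unfold(r: Rexp, n: Int): Rexp =
  if (n == 0) ONE else SEQ(r, unfold(r, n - 1))

// unfold(CHAR('a'), 65535) contains 65535 copies of CHAR('a');
// NTIMES(CHAR('a'), 65535) is a single node carrying a counter.
\end{lstlisting}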
-\begin{figure}[H]
-\begin{center}
-\begin{tikzpicture} [node distance = 2cm, on grid, auto]
-
-    \node (q0) [state, initial] {$0$};
-    \node (q1) [state, right = of q0] {$1$};
-    %\node (q2) [state, right = of q1] {$2$};
-    \node (qdots) [right = of q1] {$\ldots$};
-    \node (qn) [state, right = of qdots] {$n$};
-    \node (qn1) [state, right = of qn] {$n+1$};
-    \node (qn2) [state, right = of qn1] {$n+2$};
-    \node (qn3) [state, accepting, right = of qn2] {$n+3$};
-
-\path [-stealth, thick]
-    (q0) edge [loop above] node {a} ()
-    (q0) edge node {a} (q1)
-    %(q1) edge node {.} (q2)
-    (q1) edge node {.} (qdots)
-    (qdots) edge node {.} (qn)
-    (qn) edge node {.} (qn1)
-    (qn1) edge node {b} (qn2)
-    (qn2) edge node {$c$} (qn3);
-\end{tikzpicture}
-%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
-%   \node[state,initial] (q_0)   {$0$};
-%   \node[state, ] (q_1) [right=of q_0] {$1$};
-%   \node[state, ] (q_2) [right=of q_1] {$2$};
-%   \node[state,
-%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
-%    \path[->]
-%    (q_0) edge  node {a} (q_1)
-%    	  edge [loop below] node {a,b} ()
-%    (q_1) edge  node  {a,b} (q_2)
-%    (q_2) edge  node  {a,b} (q_3);
-%\end{tikzpicture}
-\end{center}
-\caption{The example given by Becchi and Crawley
-	that NFA simulation can consume large
-	amounts of memory: $.^*a.^{\{n\}}bc$ matching
-	strings of the form $aaa\ldots aaaabc$.
-	When traversing in a breadth-first manner,
-all states from 0 to $n+1$ will become active.}\label{fig:inj}
+This is a preliminary chapter which describes the results of
+Sulzmann and Lu \cite{Sulzmann2014} and part of the formalisations by
+Ausaf et al.\ \cite{AusafDyckhoffUrban2016} (mainly the definitions).
+These results are not part of this PhD work,
+but the details are included to provide necessary context
+for our work on the correctness proof of $\blexersimp$,
+as we show in chapter \ref{Bitcoded2} how the proofs break down when
+simplifications are applied.
-\end{figure}
-%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
-%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
-%in terms of input string length.
-%TODO:try out these lexers
-These problems can of course be solved in matching algorithms where
-automata go beyond the classic notion and for instance include explicit
-counters \cite{Turo_ov__2020}.
-These solutions can be quite efficient,
-with the ability to process
-gigabits of string input per second
-even with large counters \cite{Becchi08}.
-These practical solutions do not come with
-formal guarantees, and as pointed out by
-Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
-%But formal reasoning about these automata especially in Isabelle
-%can be challenging
-%and un-intuitive.
-%Therefore, we take correctness and runtime claims made about these solutions
-%with a grain of salt.
-
-In the work reported in \cite{FoSSaCS2023} and here,
-we add better support using derivatives
-for bounded regular expressions such as $r^{\{n\}}$.
-Our results
-extend straightforwardly to
-repetitions with intervals such as
-$r^{\{n\ldots m\}}$.
-The merit of Brzozowski derivatives (more on this later)
-on this problem is that
-they can be naturally extended to support bounded repetitions.
-Moreover these extensions are still made up of only small
-inductive datatypes and recursive functions,
-making them handy to deal with in theorem provers.
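To preview why this is so (the formal definitions of the derivative
operation appear later in this chapter), here is a sketch, in the
illustrative Scala from above, of a derivative-based matcher extended
with bounded repetition. The point is the last clause of each function:
the counter is merely decremented, never unfolded.

\begin{lstlisting}[language=Scala]
// Sketch of a Brzozowski-style matcher with bounded repetition.
def nullable(r: Rexp): Boolean = r match {
  case ZERO          => false
  case ONE           => true
  case CHAR(_)       => false
  case ALT(r1, r2)   => nullable(r1) || nullable(r2)
  case SEQ(r1, r2)   => nullable(r1) && nullable(r2)
  case STAR(_)       => true
  case NTIMES(r1, n) => n == 0 || nullable(r1)
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
  case NTIMES(r1, n) =>
    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

// A string matches if the iterated derivative is nullable.
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{lstlisting}

Without the simplifications introduced later, the intermediate
expressions produced by \texttt{der} can still grow quickly; taming that
growth while preserving correctness is one of the main topics of this
thesis.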
-%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
-%straightforwardly extended to deal with bounded regular expressions
-%and moreover the resulting code still consists of only simple
-%recursive functions and inductive datatypes.
-Finally, bounded regular expressions do not destroy our finite
-boundedness property, which we shall prove later on.
-
-
-
-
-
-\subsection{Back-References}
-The other way to simulate an $\mathit{NFA}$ for matching is to choose
-a single transition each time, keeping all the other options in
-a queue or stack, and backtracking if that choice eventually
-fails.
-This method, often called a ``depth-first search'',
-is efficient in many cases, but can end up
-with exponential runtime.
-The backtracking method is employed in regex libraries
-that support \emph{back-references}, for example
-in Java and Python.
-%\section{Back-references and The Terminology Regex}
-
-%When one constructs an $\NFA$ out of a regular expression
-%there is often very little to be done in the first phase, one simply
-%construct the $\NFA$ states based on the structure of the input regular expression.
-
-%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
-%one by keeping track of all active states after consuming
-%a character, and update that set of states iteratively.
-%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
-%for a path terminating
-%at an accepting state.
-
+In the coming section, the definitions of basic notions
+for regular languages and regular expressions are given.
+This is essentially a description in ``English'' of
+the functions and datatypes used by Ausaf et al.
+\cite{AusafDyckhoffUrban2016,Ausaf}
+in their formalisation in Isabelle/HOL.
+We include them as we build on their formalisation,
+and therefore inherently use these definitions.
-Consider the following regular expression where the sequence
-operator is omitted for brevity:
-\begin{center}
-	$r_1r_2r_3r_4$
-\end{center}
-In this example,
-one could label sub-expressions of interest
-by parenthesizing them and giving
-them a number in the order in which their opening parentheses appear.
-One possible way of parenthesizing and labelling is:
-\begin{center}
-	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
-\end{center}
-The sub-expressions
-$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
-by 1 to 4, and can be ``referred back'' to by their respective numbers.
-%These sub-expressions are called "capturing groups".
-To do so, one uses the syntax $\backslash i$
-to denote that we want the sub-string
-of the input matched by the $i$-th
-sub-expression to appear again,
-exactly the same as it first appeared:
-\begin{center}
-$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
-\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
-\end{center}
-Once the sub-string $s_i$ for the sub-expression $r_i$
-has been fixed, there is no variability in what the back-reference
-label $\backslash i$ can match---it is tied to $s_i$.
-The overall string will look like $\ldots s_i \ldots s_i \ldots $
-%The backslash and number $i$ are the
-%so-called "back-references".
-%Let $e$ be an expression made of regular expressions
-%and back-references. $e$ contains the expression $e_i$
-%as its $i$-th capturing group.
-%The semantics of back-reference can be recursively
-%written as:
-%\begin{center}
-%	\begin{tabular}{c}
-%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
-%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
-%	\end{tabular}
-%\end{center}
-A concrete example
-of back-references is
-\begin{center}
-$(.^*)\backslash 1$,
-\end{center}
-which matches
-strings that can be split into two identical halves,
-for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
-Note that this is different from
-repeating the sub-expression verbatim like
-\begin{center}
-	$(.^*)(.^*)$,
-\end{center}
-which does not impose any restrictions on what strings the second
-sub-expression $.^*$
-might match.
-Another example of back-references is
-\begin{center}
-$(.)(.)\backslash 2\backslash 1$
-\end{center}
-which matches four-character palindromes
-of the form $xyyx$, such as $abba$.
-
-Back-references are a regex construct
-that programmers find quite useful.
-According to Becchi and Crawley \cite{Becchi08},
-6\% of Snort rules (up until 2008) use them.
-The most common use of back-references
-is to match well-formed HTML files,
-where back-references are convenient for matching
-opening and closing tags like
-\begin{center}
-	$\langle html \rangle \ldots \langle / html \rangle$
-\end{center}
-A regex describing such a format
-is
-\begin{center}
-	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
-\end{center}
-Despite being useful, the expressive power of regexes
-goes beyond regular languages
-once back-references are included.
-In fact, back-references allow regexes to express
-languages that cannot be captured by context-free
-grammars either.
-For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
-expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by
-context-free grammars \cite{campeanu2003formal}.
-Such a language belongs to the context-sensitive level
-of the formal language hierarchy.
-Moreover, solving the matching problem in the presence of back-references
-is known to be NP-complete \parencite{alfred2014algorithms}.
-Regex libraries supporting back-references such as
-PCRE \cite{pcre} therefore have to
-revert to a depth-first search algorithm with backtracking.
-What is unexpected is that even in the cases
-not involving back-references, there is still
-a (non-negligible) chance they might backtrack super-linearly,
-as shown in the graphs in figure \ref{fig:aStarStarb}.
-
-Summing up, we can categorise existing
-practical regex libraries into two kinds:
-(i) the ones with linear
-time guarantees, like Go and Rust; the downside is that
-they impose restrictions
-on the regular expressions (not allowing back-references,
-capping bounded repetitions at an ad hoc limit, etc.);
-and (ii) those
-that allow large bounded regular expressions and back-references
-at the expense of using backtracking algorithms.
-They can potentially ``grind to a halt''
-on some very simple cases, resulting
-in ReDoS attacks if exposed to the internet.
-
-The problems with both approaches are the motivation for us
-to look again at the regular expression matching problem.
-Another motivation is that regular expression matching algorithms
-that follow the POSIX standard often contain errors and bugs,
-as we shall explain next.
-
-%We would like to have regex engines that can
-%deal with the regular part (e.g.
-%bounded repetitions) of regexes more
-%efficiently.
-%Also we want to make sure that they do it correctly.
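The context-sensitive example above is easy to check experimentally. The
following snippet is a small illustration (it relies on the JDK's
backtracking regex engine, called from Scala) of the back-reference
regex $(a^*)b\backslash 1 b\backslash 1$ recognising exactly the strings
of the form $a^nba^nba^n$:

\begin{lstlisting}[language=Scala]
import java.util.regex.Pattern

// (a*)b\1b\1: both back-references must repeat the first group.
val p = Pattern.compile("(a*)b\\1b\\1")
assert(p.matcher("bb").matches())          // n = 0
assert(p.matcher("aabaabaa").matches())    // n = 2
assert(!p.matcher("aabaabaaa").matches())  // blocks of unequal length
\end{lstlisting}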
-%It turns out that such an aim is not so easy to achieve.
- %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
-% For example, the Rust regex engine claims to be linear,
-% but does not support lookarounds and back-references.
-% The GoLang regex library does not support over 1000 repetitions.
-% Java and Python both support back-references, but shows
-%catastrophic backtracking behaviours on inputs without back-references(
-%when the language is still regular).
- %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
- %TODO: verify the fact Rust does not allow 1000+ reps
-
-
-
-
-%The time cost of regex matching algorithms in general
-%involve two different phases, and different things can go differently wrong on
-%these phases.
-%$\DFA$s usually have problems in the first (construction) phase
-%, whereas $\NFA$s usually run into trouble
-%on the second phase.
-
-
-\section{Error-prone POSIX Implementations}
-Very often there are multiple ways of matching a string
-with a regular expression.
-In such cases the regular expression matcher needs to
-disambiguate.
-The more widely used strategy is called POSIX,
-which roughly speaking always chooses the longest initial match.
-The POSIX strategy is widely adopted in many regular expression matchers
-because it is a natural strategy for lexers.
-However, many implementations (including the C libraries
-used by Linux and OS X distributions) contain bugs
-or do not meet the specification they claim to adhere to.
-Kuklewicz maintains a unit test repository which lists some
-problems with existing regular expression engines \cite{KuklewiczHaskell}.
-In some cases, they either fail to generate a
-result when a match exists,
-or give results that are inconsistent with the POSIX standard.
-A concrete example is the regex
-\begin{center}
-	$(aba + ab + a)^*$ and the string $ababa$
-\end{center}
-The correct POSIX match for the above
-involves the entire string $ababa$,
-split into two Kleene star iterations, namely $[ab], [aba]$ at positions
-$[0, 2), [2, 5)$
-respectively.
-But feeding this example to the different engines
-listed at regex101\footnote{
-	regex101 is an online regular expression matcher which
-	provides an API for trying out the regular expression
-	engines of multiple popular programming languages like
-Java, Python, Go, etc.} \parencite{regex101}
-yields
-only two incomplete matches: $[aba]$ at $[0, 3)$
-and $a$ at $[4, 5)$.
-Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
-commented that most regex libraries are not
-correctly implementing the central POSIX
-rule, called the maximum munch rule.
-Grathwohl \parencite{grathwohl2014crash} wrote:
-\begin{quote}\it
-	``The POSIX strategy is more complicated than the
-	greedy because of the dependence on information about
-	the length of matched strings in the various subexpressions.''
-\end{quote}
-%\noindent
-People have recognised that the implementation complexity of POSIX rules also comes from
-the specification not being very precise.
-The developers of the regexp package of Go
-\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
-commented that
-\begin{quote}\it
-``The POSIX rule is computationally prohibitive
-and not even well-defined.''
-\end{quote}
-There are many informal summaries of this disambiguation
-strategy, which are often quite long and delicate.
-For example Kuklewicz \cite{KuklewiczHaskell}
-described the POSIX rule as (section 1, last paragraph):
-\begin{quote}
-	\begin{itemize}
-		\item
-regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
-\item
-earlier subpatterns have leftmost-longest priority over later subpatterns\\
-\item
-higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
-\item
-REs have right associative concatenation which can be changed with parenthesis\\
-\item
-parenthesized subexpressions return the match from their last usage\\
-\item
-text of component subexpressions must be contained in the text of the
-higher-level subexpressions\\
-\item
-if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
-\item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
-\end{itemize}
-\end{quote}
-%The text above
-%is trying to capture something very precise,
-%and is crying out for formalising.
-Ribeiro and Du Bois \cite{RibeiroAgda2017} have
-formalised the notion of bit-coded regular expressions
-and proved their relation to simple regular expressions in
-the dependently-typed proof assistant Agda.
-They also proved the soundness and completeness of a matching algorithm
-based on the bit-coded regular expressions.
-Ausaf et al.\ \cite{AusafDyckhoffUrban2016}
-were the first to
-give a quite simple formalised POSIX
-specification in Isabelle/HOL, and also to prove
-that their specification coincides with an earlier (unformalised)
-POSIX specification given by Okui and Suzuki \cite{Okui10}.
-They then formally proved the correctness of
-a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
-with regard to that specification.
-They also found that the informal POSIX
-specification by Sulzmann and Lu needs to be substantially
-revised in order for the correctness proof to go through.
-Their original specification and proof were unfixable
-according to Ausaf et al.
-
-
-In the next section, we will briefly
-introduce Brzozowski derivatives and Sulzmann
-and Lu's algorithm, which the main point of this thesis builds on.
+% In the next section, we will briefly
+% introduce Brzozowski derivatives and Sulzmann
+% and Lu's algorithm, which the main point of this thesis builds on.
 %We give a taste of what they
 %are like and why they are suitable for regular expression
 %matching and lexing.
@@ -1679,6 +806,24 @@
 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
 elegant (arguably as beautiful as the definition of the Brzozowski derivative)
 solution for this.
+\marginpar{explicitly attribute the work}
+For the same reason described
+at the beginning of this chapter,
+we introduce the formal
+semantics of $\POSIX$ lexing by
+Ausaf et al.~\cite{AusafDyckhoffUrban2016},
+followed by the first lexing algorithm by
+Sulzmann and Lu \cite{Sulzmann2014}
+that produces output conforming
+to the $\POSIX$ standard.
+%TODO: Actually show this in chapter 4.
+In what follows
+we choose to use the Isabelle-style notation
+for function and datatype constructor applications, where
+the parameters of a function are not enclosed
+inside a pair of parentheses (e.g. $f \;x \;y$
+instead of $f(x,\;y)$). This is mainly
+to make the text visually more concise.
 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
 In this section, we present a two-phase regular expression lexing