\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
% used for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.
We start with a technical overview of the catastrophic backtracking problem,
motivating rigorous approaches to regular expression matching and lexing.
Then we introduce basic notations and definitions of our problem.

\section{Technical Overview}
15 Consider for example the regular expression $(a^*)^*\,b$ and |
|
16 strings of the form $aa..a$. These strings cannot be matched by this regular |
|
17 expression: obviously the expected $b$ in the last |
|
18 position is missing. One would assume that modern regular expression |
|
19 matching engines can find this out very quickly. Surprisingly, if one tries |
|
20 this example in JavaScript, Python or Java 8, even with small strings, |
|
say of length around 30 $a$'s,
|
22 the decision takes large amounts of time to finish. |
|
This is disproportionate
to the simplicity of the input (see graphs in figure \ref{fig:aStarStarb}).
|
The algorithms clearly show exponential behaviour, which as can be seen
is triggered by some relatively simple regular expressions.
|
Java 9 and newer
versions mitigate this behaviour by several orders of magnitude,
but they are still slow compared
|
30 with the approach we are going to use in this thesis. |
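The exponential growth can be made concrete with a short counting argument: every way of chopping $a^n$ into non-empty chunks, each consumed by one iteration of the inner $a^*$ of $(a^*)^*$, is a path a depth-first engine may explore before concluding that the trailing $b$ is missing. The following Python sketch (our own illustration, not the code of any of the engines; the name \texttt{splits} is made up) counts these paths:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def splits(n: int) -> int:
    """Number of ways (a*)* can consume n a's as a sequence of
    non-empty iterations of the inner a* (compositions of n)."""
    if n == 0:
        return 1  # exactly one way: use no iterations at all
    return sum(splits(n - k) for k in range(1, n + 1))

# the number of paths doubles with every extra character: 2^(n-1)
assert [splits(n) for n in range(1, 6)] == [1, 2, 4, 8, 16]
```

At $n = 30$ this already amounts to $2^{29}$, roughly half a billion paths, which is consistent with the runtimes shown in figure \ref{fig:aStarStarb}.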
|
31 |
|
32 |
|
33 |
|
34 This superlinear blowup in regular expression engines |
|
35 has caused grief in ``real life'' where it is |
|
36 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
|
37 For example, on 20 July 2016 one evil |
|
38 regular expression brought the webpage |
|
39 \href{http://stackexchange.com}{Stack Exchange} to its |
|
40 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
|
41 In this instance, a regular expression intended to just trim white |
|
42 spaces from the beginning and the end of a line actually consumed |
|
43 massive amounts of CPU resources---causing the web servers to grind to a |
|
44 halt. In this example, the time needed to process |
|
45 the string was |
|
46 $O(n^2)$ with respect to the string length $n$. This |
|
47 quadratic overhead was enough for the homepage of Stack Exchange to |
|
48 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
|
49 attack and therefore stopped the servers from responding to any |
|
requests. This made the whole site unavailable.
|
51 |
|
52 \begin{figure}[p] |
|
53 \begin{center} |
|
54 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}} |
|
55 \begin{tikzpicture} |
|
56 \begin{axis}[ |
|
57 xlabel={$n$}, |
|
58 x label style={at={(1.05,-0.05)}}, |
|
59 ylabel={time in secs}, |
|
60 enlargelimits=false, |
|
61 xtick={0,5,...,30}, |
|
62 xmax=33, |
|
63 ymax=35, |
|
64 ytick={0,5,...,30}, |
|
65 scaled ticks=false, |
|
66 axis lines=left, |
|
67 width=5cm, |
|
68 height=4cm, |
|
69 legend entries={JavaScript}, |
|
70 legend pos=north west, |
|
71 legend cell align=left] |
|
72 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
|
73 \end{axis} |
|
74 \end{tikzpicture} |
|
75 & |
|
76 \begin{tikzpicture} |
|
77 \begin{axis}[ |
|
78 xlabel={$n$}, |
|
79 x label style={at={(1.05,-0.05)}}, |
|
80 %ylabel={time in secs}, |
|
81 enlargelimits=false, |
|
82 xtick={0,5,...,30}, |
|
83 xmax=33, |
|
84 ymax=35, |
|
85 ytick={0,5,...,30}, |
|
86 scaled ticks=false, |
|
87 axis lines=left, |
|
88 width=5cm, |
|
89 height=4cm, |
|
90 legend entries={Python}, |
|
91 legend pos=north west, |
|
92 legend cell align=left] |
|
93 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
94 \end{axis} |
|
95 \end{tikzpicture}\\ |
|
96 \begin{tikzpicture} |
|
97 \begin{axis}[ |
|
98 xlabel={$n$}, |
|
99 x label style={at={(1.05,-0.05)}}, |
|
100 ylabel={time in secs}, |
|
101 enlargelimits=false, |
|
102 xtick={0,5,...,30}, |
|
103 xmax=33, |
|
104 ymax=35, |
|
105 ytick={0,5,...,30}, |
|
106 scaled ticks=false, |
|
107 axis lines=left, |
|
108 width=5cm, |
|
109 height=4cm, |
|
110 legend entries={Java 8}, |
|
111 legend pos=north west, |
|
112 legend cell align=left] |
|
113 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
114 \end{axis} |
|
115 \end{tikzpicture} |
|
116 & |
|
117 \begin{tikzpicture} |
|
118 \begin{axis}[ |
|
119 xlabel={$n$}, |
|
120 x label style={at={(1.05,-0.05)}}, |
|
121 %ylabel={time in secs}, |
|
122 enlargelimits=false, |
|
123 xtick={0,5,...,30}, |
|
124 xmax=33, |
|
125 ymax=35, |
|
126 ytick={0,5,...,30}, |
|
127 scaled ticks=false, |
|
128 axis lines=left, |
|
129 width=5cm, |
|
130 height=4cm, |
|
131 legend entries={Dart}, |
|
132 legend pos=north west, |
|
133 legend cell align=left] |
|
134 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; |
|
135 \end{axis} |
|
136 \end{tikzpicture}\\ |
|
137 \begin{tikzpicture} |
|
138 \begin{axis}[ |
|
139 xlabel={$n$}, |
|
140 x label style={at={(1.05,-0.05)}}, |
|
141 ylabel={time in secs}, |
|
142 enlargelimits=false, |
|
143 xtick={0,5,...,30}, |
|
144 xmax=33, |
|
145 ymax=35, |
|
146 ytick={0,5,...,30}, |
|
147 scaled ticks=false, |
|
148 axis lines=left, |
|
149 width=5cm, |
|
150 height=4cm, |
|
151 legend entries={Swift}, |
|
152 legend pos=north west, |
|
153 legend cell align=left] |
|
154 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; |
|
155 \end{axis} |
|
156 \end{tikzpicture} |
|
157 & |
|
158 \begin{tikzpicture} |
|
159 \begin{axis}[ |
|
160 xlabel={$n$}, |
|
161 x label style={at={(1.05,-0.05)}}, |
|
162 %ylabel={time in secs}, |
|
163 enlargelimits=true, |
|
164 %xtick={0,5000,...,40000}, |
|
165 %xmax=40000, |
|
166 %ymax=35, |
|
167 restrict x to domain*=0:40000, |
|
168 restrict y to domain*=0:35, |
|
169 %ytick={0,5,...,30}, |
|
170 %scaled ticks=false, |
|
171 axis lines=left, |
|
172 width=5cm, |
|
173 height=4cm, |
|
174 legend entries={Java9+}, |
|
175 legend pos=north west, |
|
176 legend cell align=left] |
|
177 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
|
178 \end{axis} |
|
179 \end{tikzpicture}\\ |
|
180 \begin{tikzpicture} |
|
181 \begin{axis}[ |
|
182 xlabel={$n$}, |
|
183 x label style={at={(1.05,-0.05)}}, |
|
184 ylabel={time in secs}, |
|
185 enlargelimits=true, |
|
186 %xtick={0,5,...,30}, |
|
187 %xmax=33, |
|
188 %ymax=35, |
|
189 restrict x to domain*=0:60000, |
|
190 restrict y to domain*=0:35, |
|
191 %ytick={0,5,...,30}, |
|
192 %scaled ticks=false, |
|
193 axis lines=left, |
|
194 width=5cm, |
|
195 height=4cm, |
|
196 legend entries={Scala}, |
|
197 legend pos=north west, |
|
198 legend cell align=left] |
|
199 \addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data}; |
|
200 \end{axis} |
|
201 \end{tikzpicture} |
|
202 & |
|
203 \begin{tikzpicture} |
|
204 \begin{axis}[ |
|
205 xlabel={$n$}, |
|
206 x label style={at={(1.05,-0.05)}}, |
|
207 %ylabel={time in secs}, |
|
208 enlargelimits=true, |
|
209 %xtick={0,5000,...,40000}, |
|
210 %xmax=40000, |
|
211 %ymax=35, |
|
212 restrict x to domain*=0:60000, |
|
213 restrict y to domain*=0:45, |
|
214 %ytick={0,5,...,30}, |
|
215 %scaled ticks=false, |
|
216 axis lines=left, |
|
217 width=5cm, |
|
218 height=4cm, |
|
219 legend style={cells={align=left}}, |
|
220 legend entries={Isabelle \\ Extracted}, |
|
221 legend pos=north west, |
|
222 legend cell align=left] |
|
223 \addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data}; |
|
224 \end{axis} |
|
225 \end{tikzpicture}\\ |
|
226 \multicolumn{2}{c}{Graphs} |
|
227 \end{tabular} |
|
228 \end{center} |
|
229 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
|
230 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
|
231 The reason for their fast growth %superlinear behaviour |
|
232 is that they do a depth-first-search |
|
233 using NFAs. |
|
234 If the string does not match, the regular expression matching |
|
235 engine starts to explore all possibilities. |
|
236 The last two graphs are for our implementation in Scala, one manual |
|
237 and one extracted from the verified lexer in Isabelle by $\textit{codegen}$. |
|
238 Our lexer |
|
239 performs better in this case, and |
|
240 is formally verified. |
|
241 Despite being almost identical, the codegen-generated lexer |
|
242 % is generated directly from Isabelle using $\textit{codegen}$, |
|
243 is slower than the manually written version since the code synthesised by |
|
244 $\textit{codegen}$ does not use native integer or string |
|
245 types of the target language. |
|
246 %Note that we are comparing our \emph{lexer} against other languages' matchers. |
|
247 }\label{fig:aStarStarb} |
|
248 \end{figure}\afterpage{\clearpage} |
|
249 |
|
250 A more recent example is a global outage of all Cloudflare servers on 2 July |
|
251 2019. A poorly written regular expression exhibited catastrophic backtracking |
|
252 and exhausted CPUs that serve HTTP traffic. Although the outage |
|
253 had several causes, at the heart was a regular expression that |
|
254 was used to monitor network |
|
255 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
|
256 These problems with regular expressions |
|
257 are not isolated events that happen |
|
258 very rarely, |
|
259 %but actually widespread. |
|
260 %They occur so often that they have a |
|
but they actually occur often enough that they have a
|
262 name: Regular-Expression-Denial-Of-Service (ReDoS) |
|
263 attacks. |
|
264 Davis et al. \cite{Davis18} detected more |
|
265 than 1000 evil regular expressions |
|
266 in Node.js, Python core libraries, npm and pypi. |
|
267 They therefore concluded that evil regular expressions |
|
are a real problem rather than just ``a parlour trick''.
|
269 |
|
270 The work in this thesis aims to address this issue |
|
271 with the help of formal proofs. |
|
272 We describe a lexing algorithm based |
|
273 on Brzozowski derivatives with verified correctness |
|
274 and a finiteness property for the size of derivatives |
|
275 (which are all done in Isabelle/HOL). |
|
276 Such properties %guarantee the absence of |
|
277 are an important step in preventing |
|
278 catastrophic backtracking once and for all. |
|
279 We will give more details in the next sections |
|
on (i) why the slow cases in figure \ref{fig:aStarStarb}
|
281 can occur in traditional regular expression engines |
|
282 and (ii) why we choose our |
|
283 approach based on Brzozowski derivatives and formal proofs. |
|
284 |
|
285 |
|
286 |
|
287 |
|
This is a preliminary chapter which describes the results of
Sulzmann and Lu \cite{Sulzmann2014} and part of the formalisations by
Ausaf et al.\ \cite{AusafDyckhoffUrban2016} (mainly the definitions).
These results are not part of this PhD work,
but the details are included to provide necessary context
for our work on the correctness proof of $\blexersimp$,
as we show in chapter \ref{Bitcoded2} how the proofs break down when
simplifications are applied.

In the coming section, the definitions of basic notions
for regular languages and regular expressions are given.
This is essentially a description in ``English'' of
the functions and datatypes used by Ausaf et al.\
\cite{AusafDyckhoffUrban2016} \cite{Ausaf}
in their formalisation in Isabelle/HOL.
We include them as we build on their formalisation,
and therefore inherently use these definitions.
We also define what $\POSIX$ lexing means,
followed by the first lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014}
that produces the output conforming
to the $\POSIX$ standard.

In what follows
we choose to use the Isabelle-style notation
for function and datatype constructor applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g.\ $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.

\section{Preliminaries}
315 Regular expressions and regular expression matchers |
|
have been studied for many decades.
|
317 Theoretical results in automata theory state |
|
318 that basic regular expression matching should be linear |
|
w.r.t.\ the length of the input.
|
320 This assumes that the regular expression |
|
321 $r$ was pre-processed and turned into a |
|
322 deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}. |
|
323 By basic we mean textbook definitions such as the one |
|
324 below, involving only regular expressions for characters, alternatives, |
|
325 sequences, and Kleene stars: |
|
326 \[ |
|
327 r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
|
328 \] |
|
329 Modern regular expression matchers used by programmers, |
|
330 however, |
|
331 support much richer constructs, such as bounded repetitions, |
|
332 negations, |
|
333 and back-references. |
|
334 To differentiate, we use the word \emph{regex} to refer |
|
335 to those expressions with richer constructs while reserving the |
|
336 term \emph{regular expression} |
|
337 for the more traditional meaning in formal languages theory. |
|
338 We follow this convention |
|
339 in this thesis. |
|
340 In the future, we aim to support all the popular features of regexes, |
|
341 but for this work we mainly look at basic regular expressions |
|
342 and bounded repetitions. |
|
343 |
|
344 |
|
345 |
|
346 %Most modern regex libraries |
|
347 %the so-called PCRE standard (Peral Compatible Regular Expressions) |
|
348 %has the back-references |
|
349 Regexes come with a number of constructs |
|
350 that make it more convenient for |
|
351 programmers to write regular expressions. |
|
352 Depending on the types of constructs |
|
353 the task of matching and lexing with them |
|
354 will have different levels of complexity. |
|
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
|
358 These will not cause problems for regex libraries. |
|
359 For example the |
|
360 non-binary alternative involving three or more choices just means: |
|
361 \[ |
|
362 (a | b | c) \stackrel{means}{=} ((a + b)+ c) |
|
363 \] |
|
364 Similarly, the range operator |
|
365 %used to express the alternative |
|
366 %of all characters between its operands, |
|
367 is just a concise way |
|
368 of expressing an alternative of consecutive characters: |
|
369 \[ |
|
[0-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
|
371 \] |
|
372 for an alternative. The |
|
wildcard character ``$.$'' is used to refer to any single character,
|
374 \[ |
|
375 . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] |
|
376 \] |
|
377 except the newline. |
|
378 |
|
379 \subsection{Bounded Repetitions} |
|
380 More interesting are bounded repetitions, which can |
|
381 make the regular expressions much |
|
382 more compact. |
|
383 Normally there are four kinds of bounded repetitions: |
|
384 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$ |
|
385 (where $n$ and $m$ are constant natural numbers). |
|
386 Like the star regular expressions, the set of strings or language |
|
387 a bounded regular expression can match |
|
388 is defined using the power operation on sets: |
|
389 \begin{center} |
|
390 \begin{tabular}{lcl} |
|
391 $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\ |
|
392 $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\ |
|
393 $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ |
|
394 $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ |
|
395 \end{tabular} |
|
396 \end{center} |
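Representing (finite) languages as Python sets of strings, the four clauses above can be sketched directly; this is only an illustration, and the helper names \texttt{lang\_pow} and \texttt{lang\_bounded} are made up rather than taken from the formalisation:

```python
def concat(A: set, B: set) -> set:
    """Language concatenation: all strings a @ b with a in A, b in B."""
    return {a + b for a in A for b in B}

def lang_pow(A: set, n: int) -> set:
    """(L r)^n: all concatenations of n strings drawn from A."""
    result = {""}                      # A^0 = {[]}
    for _ in range(n):
        result = concat(result, A)
    return result

def lang_bounded(A: set, n: int, m: int) -> set:
    """L(r^{n..m}) = union of (L r)^i for n <= i <= m."""
    out = set()
    for i in range(n, m + 1):
        out |= lang_pow(A, i)
    return out

A = {"a", "bb"}
assert lang_pow(A, 2) == {"aa", "abb", "bba", "bbbb"}
assert lang_bounded(A, 0, 1) == {"", "a", "bb"}
```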
|
397 The attraction of bounded repetitions is that they can be |
|
398 used to avoid a size blow up: for example $r^{\{n\}}$ |
|
399 is a shorthand for |
|
400 the much longer regular expression: |
|
401 \[ |
|
\underbrace{r\ldots r}_{\text{$n$ copies of $r$}}.
|
403 \] |
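The size saving is easy to quantify: unfolding $r^{\{n\}}$ multiplies the size of the expression by $n$, and nested counters multiply these factors together. A small sketch, treating regular expressions as plain strings for simplicity (the helper \texttt{desugar} is made up for illustration):

```python
def desugar(r: str, n: int) -> str:
    """Unfold r{n} naively into n explicit copies of r."""
    return r * n

# a single counter multiplies the size by n ...
assert len(desugar("(ab)", 1000)) == 1000 * len("(ab)")
# ... and nested counters multiply: a{1000}{100}{5} yields 500000 copies
assert len(desugar(desugar(desugar("a", 1000), 100), 5)) == 500_000
```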
|
404 %Therefore, a naive algorithm that simply unfolds |
|
405 %them into their desugared forms |
|
406 %will suffer from at least an exponential runtime increase. |
|
407 |
|
408 |
|
409 The problem with matching |
|
410 such bounded repetitions |
|
411 is that tools based on the classic notion of |
|
412 automata need to expand $r^{\{n\}}$ into $n$ connected |
|
413 copies of the automaton for $r$. This leads to very inefficient matching |
|
414 algorithms or algorithms that consume large amounts of memory. |
|
415 Implementations using $\DFA$s will |
|
416 in such situations |
|
417 either become very slow |
|
418 (for example Verbatim++ \cite{Verbatimpp}) or run |
|
419 out of memory (for example $\mathit{LEX}$ and |
|
420 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators |
|
421 in C and JAVA that generate $\mathit{DFA}$-based |
|
422 lexers. The user provides a set of regular expressions |
|
423 and configurations, and then |
|
424 gets an output program encoding a minimized $\mathit{DFA}$ |
|
425 that can be compiled and run. |
|
426 When given the above countdown regular expression, |
|
427 a small $n$ (say 20) would result in a program representing a |
|
428 DFA |
|
429 with millions of states.}) for large counters. |
|
430 A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$ |
|
431 where the minimal DFA requires at least $2^{n+1}$ states. |
|
432 For example, when $n$ is equal to 2, |
|
433 the corresponding $\mathit{NFA}$ looks like: |
|
434 \vspace{6mm} |
|
435 \begin{center} |
|
436 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
437 \node[state,initial] (q_0) {$q_0$}; |
|
438 \node[state, red] (q_1) [right=of q_0] {$q_1$}; |
|
439 \node[state, red] (q_2) [right=of q_1] {$q_2$}; |
|
440 \node[state, accepting, red](q_3) [right=of q_2] {$q_3$}; |
|
441 \path[->] |
|
442 (q_0) edge node {a} (q_1) |
|
443 edge [loop below] node {a,b} () |
|
444 (q_1) edge node {a,b} (q_2) |
|
445 (q_2) edge node {a,b} (q_3); |
|
446 \end{tikzpicture} |
|
447 \end{center} |
|
448 and when turned into a DFA by the subset construction |
|
449 requires at least $2^3$ states.\footnote{The |
|
red states are ``countdown states'' which count down
|
451 the number of characters needed in addition to the current |
|
452 string to make a successful match. |
|
453 For example, state $q_1$ indicates a match that has |
|
454 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
|
and just consumed the ``delimiter'' $a$ in the middle, and
|
456 needs to match 2 more iterations of $(a|b)$ to complete. |
|
457 State $q_2$ on the other hand, can be viewed as a state |
|
458 after $q_1$ has consumed 1 character, and just waits |
|
459 for 1 more character to complete. |
|
460 The state $q_3$ is the last (accepting) state, requiring 0 |
|
461 more characters. |
|
462 Depending on the suffix of the |
|
463 input string up to the current read location, |
|
464 the states $q_1$ and $q_2$, $q_3$ |
|
465 may or may |
|
466 not be active. |
|
467 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
|
468 contain at least $2^3$ non-equivalent states that cannot be merged, |
|
469 because the subset construction during determinisation will generate |
|
470 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. |
|
Generalising this to regular expressions with larger
bounded repetition counts, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if the language of $r$ itself
contains more than one string.
|
476 This is to represent all different |
|
scenarios in which ``countdown'' states are active.}
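This blow-up can also be confirmed computationally. The sketch below (our own illustration; the function names are invented) encodes the NFA for $(a+b)^*a(a+b)^{\{n\}}$ with states $0$ to $n+1$ and determinises it with the subset construction, counting the reachable subsets:

```python
def nfa_step(n: int, state: int, c: str) -> set:
    """Successor states of the NFA for (a|b)* a (a|b)^{n}.
    State 0 loops on a and b; an 'a' may also start the countdown;
    states 1..n count down; state n+1 is accepting."""
    succ = set()
    if state == 0:
        succ.add(0)          # the (a|b)* self-loop
        if c == "a":
            succ.add(1)      # guess: this 'a' is the delimiter
    elif state <= n:
        succ.add(state + 1)  # consume one more (a|b)
    return succ

def dfa_states(n: int) -> set:
    """All subsets reachable from {0} via the subset construction."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        S = todo.pop()
        for c in "ab":
            T = frozenset(q for p in S for q in nfa_step(n, p, c))
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return seen

assert len(dfa_states(2)) == 2 ** 3   # 8 states for n = 2
assert len(dfa_states(4)) == 2 ** 5   # the growth is exponential in n
```

Every reachable subset has the form $\{0\} \cup S$ for some $S \subseteq \{1,\ldots,n+1\}$, which gives the $2^{n+1}$ states mentioned above.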
|
478 |
|
479 |
|
480 Bounded repetitions are important because they |
|
481 tend to occur frequently in practical use, |
|
482 for example in the regex library RegExLib, in |
|
483 the rules library of Snort \cite{Snort1999}\footnote{ |
|
484 Snort is a network intrusion detection (NID) tool |
|
485 for monitoring network traffic. |
|
486 The network security community curates a list |
|
487 of malicious patterns written as regexes, |
|
488 which is used by Snort's detection engine |
|
489 to match against network traffic for any hostile |
|
490 activities such as buffer overflow attacks.}, |
|
491 as well as in XML Schema definitions (XSDs). |
|
492 According to Bj\"{o}rklund et al \cite{xml2015}, |
|
493 more than half of the |
|
494 XSDs they found on the Maven.org central repository |
|
495 have bounded regular expressions in them. |
|
496 Often the counters are quite large, with the largest being |
|
497 close to ten million. |
|
498 A smaller sample XSD they gave |
|
499 is: |
|
500 \lstset{ |
|
501 basicstyle=\fontsize{8.5}{9}\ttfamily, |
|
502 language=XML, |
|
503 morekeywords={encoding, |
|
504 xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute} |
|
505 } |
|
506 \begin{lstlisting} |
|
507 <sequence minOccurs="0" maxOccurs="65535"> |
|
508 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
|
509 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
|
510 </sequence> |
|
511 \end{lstlisting} |
|
512 This can be seen as the regex |
|
513 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
|
514 regular expressions |
|
515 satisfying certain constraints (such as |
|
516 satisfying the floating point number format). |
|
517 It is therefore quite unsatisfying that |
|
518 some regular expressions matching libraries |
|
impose ad hoc limits
|
520 for bounded regular expressions: |
|
521 For example, in the regular expression matching library in the Go |
|
522 language the regular expression $a^{1001}$ is not permitted, because no counter |
|
523 can be above 1000, and in the built-in Rust regular expression library |
|
524 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
|
525 for being too big. |
|
526 As Becchi and Crawley \cite{Becchi08} have pointed out, |
|
527 the reason for these restrictions |
|
is that these libraries simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states can
be proportional to the counter value.
|
532 When the counters are large, |
|
533 the memory requirement could become |
|
534 infeasible, and a regex engine |
|
535 like in Go will reject this pattern straight away. |
|
536 \begin{figure}[H] |
|
537 \begin{center} |
|
538 \begin{tikzpicture} [node distance = 2cm, on grid, auto] |
|
539 |
|
540 \node (q0) [state, initial] {$0$}; |
|
541 \node (q1) [state, right = of q0] {$1$}; |
|
542 %\node (q2) [state, right = of q1] {$2$}; |
|
543 \node (qdots) [right = of q1] {$\ldots$}; |
|
544 \node (qn) [state, right = of qdots] {$n$}; |
|
545 \node (qn1) [state, right = of qn] {$n+1$}; |
|
546 \node (qn2) [state, right = of qn1] {$n+2$}; |
|
547 \node (qn3) [state, accepting, right = of qn2] {$n+3$}; |
|
548 |
|
549 \path [-stealth, thick] |
|
550 (q0) edge [loop above] node {a} () |
|
551 (q0) edge node {a} (q1) |
|
552 %(q1) edge node {.} (q2) |
|
553 (q1) edge node {.} (qdots) |
|
554 (qdots) edge node {.} (qn) |
|
555 (qn) edge node {.} (qn1) |
|
556 (qn1) edge node {b} (qn2) |
|
557 (qn2) edge node {$c$} (qn3); |
|
558 \end{tikzpicture} |
|
559 %\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
560 % \node[state,initial] (q_0) {$0$}; |
|
561 % \node[state, ] (q_1) [right=of q_0] {$1$}; |
|
562 % \node[state, ] (q_2) [right=of q_1] {$2$}; |
|
563 % \node[state, |
|
564 % \node[state, accepting, ](q_3) [right=of q_2] {$3$}; |
|
565 % \path[->] |
|
566 % (q_0) edge node {a} (q_1) |
|
567 % edge [loop below] node {a,b} () |
|
568 % (q_1) edge node {a,b} (q_2) |
|
569 % (q_2) edge node {a,b} (q_3); |
|
570 %\end{tikzpicture} |
|
571 \end{center} |
|
572 \caption{The example given by Becchi and Crawley |
|
573 that NFA simulation can consume large |
|
574 amounts of memory: $.^*a.^{\{n\}}bc$ matching |
|
575 strings of the form $aaa\ldots aaaabc$. |
|
576 When traversing in a breadth-first manner, |
|
577 all states from 0 till $n+1$ will become active.}\label{fig:inj} |
|
578 |
|
579 \end{figure} |
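The situation in figure \ref{fig:inj} can be reproduced with a small breadth-first simulation; this sketch of the NFA for $.^*a.^{\{n\}}bc$ is our own illustration, with the state numbering taken from the figure:

```python
def step(n: int, active: set, c: str) -> set:
    """One breadth-first step of the NFA for .* a .{n} b c."""
    nxt = set()
    for q in active:
        if q == 0:
            nxt.add(0)                 # the .* self-loop
            if c == "a":
                nxt.add(1)             # guess the delimiting 'a'
        elif q <= n:
            nxt.add(q + 1)             # .{n} accepts any character
        elif q == n + 1 and c == "b":
            nxt.add(n + 2)
        elif q == n + 2 and c == "c":
            nxt.add(n + 3)             # accepting state
    return nxt

def active_after(n: int, s: str) -> set:
    active = {0}
    for c in s:
        active = step(n, active, c)
    return active

# with n = 5, a long run of a's keeps states 0..n+1 active simultaneously
assert active_after(5, "a" * 20) == set(range(0, 7))
```

After at least $n+1$ characters $a$, all of the states $0,\ldots,n+1$ are active at once, so for large counters the active set, and hence the memory requirement, grows with $n$.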
|
580 %Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this |
|
581 %type of $\mathit{NFA}$ simulation and guarantees a linear runtime |
|
582 %in terms of input string length. |
|
583 %TODO:try out these lexers |
|
584 These problems can of course be solved in matching algorithms where |
|
585 automata go beyond the classic notion and for instance include explicit |
|
586 counters \cite{Turo_ov__2020}. |
|
587 These solutions can be quite efficient, |
|
588 with the ability to process |
|
589 gigabits of strings input per second |
|
590 even with large counters \cite{Becchi08}. |
|
591 These practical solutions do not come with |
|
592 formal guarantees, and as pointed out by |
|
593 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone. |
|
594 %But formal reasoning about these automata especially in Isabelle |
|
595 %can be challenging |
|
596 %and un-intuitive. |
|
597 %Therefore, we take correctness and runtime claims made about these solutions |
|
598 %with a grain of salt. |
|
599 |
|
600 In the work reported in \cite{FoSSaCS2023} and here, |
|
601 we add better support using derivatives |
|
for bounded regular expressions such as $r^{\{n\}}$.
|
603 Our results |
|
604 extend straightforwardly to |
|
605 repetitions with intervals such as |
|
606 $r^{\{n\ldots m\}}$. |
|
The merit of Brzozowski derivatives (more on this later)
for this problem is that
they can be naturally extended to support bounded repetitions.
|
610 Moreover these extensions are still made up of only small |
|
611 inductive datatypes and recursive functions, |
|
612 making it handy to deal with them in theorem provers. |
|
613 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be |
|
614 %straightforwardly extended to deal with bounded regular expressions |
|
615 %and moreover the resulting code still consists of only simple |
|
616 %recursive functions and inductive datatypes. |
|
617 Finally, bounded regular expressions do not destroy our finite |
|
618 boundedness property, which we shall prove later on. |
|
619 |
|
620 |
|
621 |
|
622 |
|
623 |
|
624 \subsection{Back-References} |
|
625 The other way to simulate an $\mathit{NFA}$ for matching is choosing |
|
626 a single transition each time, keeping all the other options in |
|
627 a queue or stack, and backtracking if that choice eventually |
|
628 fails. |
|
This method, often called a ``depth-first search'',
|
630 is efficient in many cases, but could end up |
|
631 with exponential run time. |
|
632 The backtracking method is employed in regex libraries |
|
633 that support \emph{back-references}, for example |
|
634 in Java and Python. |
|
635 %\section{Back-references and The Terminology Regex} |
|
636 |
|
637 %When one constructs an $\NFA$ out of a regular expression |
|
638 %there is often very little to be done in the first phase, one simply |
|
639 %construct the $\NFA$ states based on the structure of the input regular expression. |
|
640 |
|
641 %In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways: |
|
642 %one by keeping track of all active states after consuming |
|
643 %a character, and update that set of states iteratively. |
|
644 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$ |
|
645 %for a path terminating |
|
646 %at an accepting state. |
|
647 |
|
648 |
|
649 |
|
650 Consider the following regular expression where the sequence |
|
651 operator is omitted for brevity: |
|
652 \begin{center} |
|
653 $r_1r_2r_3r_4$ |
|
654 \end{center} |
|
655 In this example, |
|
656 one could label sub-expressions of interest |
|
657 by parenthesizing them and giving |
|
658 them a number in the order in which their opening parentheses appear. |
|
659 One possible way of parenthesizing and labelling is: |
|
660 \begin{center} |
|
661 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
|
662 \end{center} |
|
663 The sub-expressions |
|
664 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
|
665 by 1 to 4, and can be ``referred back'' by their respective numbers. |
|
666 %These sub-expressions are called "capturing groups". |
|
667 To do so, one uses the syntax $\backslash i$ |
|
668 to denote that we want the sub-string |
|
of the input matched by the $i$-th
|
670 sub-expression to appear again, |
|
671 exactly the same as it first appeared: |
|
672 \begin{center} |
|
673 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
|
674 \underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$ |
|
675 \end{center} |
|
676 Once the sub-string $s_i$ for the sub-expression $r_i$ |
|
677 has been fixed, there is no variability on what the back-reference |
|
678 label $\backslash i$ can be---it is tied to $s_i$. |
|
The overall string will look like $\ldots s_i \ldots s_i \ldots$.
|
680 %The backslash and number $i$ are the |
|
681 %so-called "back-references". |
|
682 %Let $e$ be an expression made of regular expressions |
|
683 %and back-references. $e$ contains the expression $e_i$ |
|
684 %as its $i$-th capturing group. |
|
685 %The semantics of back-reference can be recursively |
|
686 %written as: |
|
687 %\begin{center} |
|
688 % \begin{tabular}{c} |
|
689 % $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
|
690 % $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
|
691 % \end{tabular} |
|
692 %\end{center} |
|
693 A concrete example |
|
694 for back-references is |
|
695 \begin{center} |
|
696 $(.^*)\backslash 1$, |
|
697 \end{center} |
|
698 which matches |
|
699 strings that can be split into two identical halves, |
|
700 for example $\mathit{foofoo}$, $\mathit{ww}$ and so on. |
|
701 Note that this is different from |
|
702 repeating the sub-expression verbatim like |
|
703 \begin{center} |
|
704 $(.^*)(.^*)$, |
|
705 \end{center} |
|
706 which does not impose any restrictions on what strings the second |
|
707 sub-expression $.^*$ |
|
708 might match. |
|
709 Another example for back-references is |
|
710 \begin{center} |
|
711 $(.)(.)\backslash 2\backslash 1$ |
|
712 \end{center} |
|
713 which matches four-character palindromes |
|
714 like $abba$, $x??x$ and so on. |
|
715 |
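Both examples can be checked against an existing backtracking
engine; the following quick sketch uses Python's \texttt{re} module,
whose back-references behave as described above:

```python
import re

# (.*)\1 matches exactly the strings made of two identical halves.
assert re.fullmatch(r'(.*)\1', 'foofoo') is not None
assert re.fullmatch(r'(.*)\1', 'foobar') is None

# (.*)(.*) places no constraint on the second group, so it matches
# any (newline-free) string.
assert re.fullmatch(r'(.*)(.*)', 'foobar') is not None

# (.)(.)\2\1 matches four-character palindromes such as "abba".
assert re.fullmatch(r'(.)(.)\2\1', 'abba') is not None
assert re.fullmatch(r'(.)(.)\2\1', 'abab') is None
```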
|
Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to describe well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format is
\begin{center}
$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
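A runnable version of this tag-matching idiom can again be given with
Python's \texttt{re} module (a simplified sketch of the idea, not a
full treatment of HTML):

```python
import re

# \1 forces the closing tag to repeat whatever tag name the first
# group captured; a simplified sketch, not a full HTML grammar.
tag_pair = r'<(.+)>.*</\1>'
assert re.fullmatch(tag_pair, '<html>some content</html>') is not None
# A mismatched closing tag is rejected:
assert re.fullmatch(tag_pair, '<html>some content</body>') is None
```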
|
Despite being useful, back-references lift the expressive power of
regexes beyond the regular languages.
In fact, regexes with back-references can express
languages that are not even context-free.
For example, the regex $(a^*)b\backslash 1 b \backslash 1$
denotes the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be generated by any
context-free grammar \cite{campeanu2003formal}.
This language belongs to the context-sensitive level
of the formal language hierarchy.
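This can be observed concretely with a backtracking engine; assuming
Python's \texttt{re} semantics:

```python
import re

# (a*)b\1b\1 denotes { a^n b a^n b a^n | n >= 0 }, a language that is
# not context-free: the back-references force all three blocks of a's
# to have the same length.
pattern = r'(a*)b\1b\1'
for n in range(6):
    assert re.fullmatch(pattern, 'a' * n + 'b' + 'a' * n + 'b' + 'a' * n)
# Blocks of unequal length are rejected:
assert re.fullmatch(pattern, 'aab' + 'ab' + 'aa') is None
```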
|
Moreover, the matching problem for regexes with back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references, such as
PCRE \cite{pcre}, therefore have to
resort to depth-first search algorithms with backtracking.
What is unexpected is that even in cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown by the graphs in figure \ref{fig:aStarStarb}.
|
Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) those with linear-time
guarantees, like the engines of Go and Rust; their downside is that
they impose restrictions
on the regular expressions (back-references are not allowed,
bounded repetitions cannot exceed an ad hoc limit, and so on);
and (ii) those
that allow large bounded repetitions and back-references,
at the expense of using backtracking algorithms.
The latter can potentially ``grind to a halt''
on some very simple inputs, resulting
in ReDoS attacks when exposed to the internet.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.
|
%We would like to have regex engines that can
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.
|
\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The most widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is adopted by many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit-test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
$(aba + ab + a)^*$ \quad and the string \quad $ababa$.
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
|
But feeding this example to the different engines
listed at regex101\footnote{
regex101 is an online regular expression matcher which
provides an API for trying out the regular expression
engines of multiple popular programming languages like
Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $[a]$ at $[4, 5)$.
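Python's engine, one of those available through regex101, reproduces
this greedy (non-POSIX) behaviour; a small check, assuming the
backtracking semantics of Python's \texttt{re} module:

```python
import re

# A POSIX matcher must return the longest initial match, 'ababa'
# (split as 'ab' + 'aba').  Python's greedy backtracking engine
# instead commits to 'aba' in the first star iteration; at the
# remaining input 'ba' no alternative matches, so it stops there.
m = re.match(r'(aba|ab|a)*', 'ababa')
assert m.group(0) == 'aba'  # not the POSIX answer 'ababa'
```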
|
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries do not
correctly implement the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
``The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
People have recognised that the implementation complexity of the POSIX rules
also comes from the specification not being very precise.
The developers of the regexp package of
Go\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
|
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as follows (section 1, last paragraph):
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there
\item
earlier subpatterns have leftmost-longest priority over later subpatterns
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns
\item
REs have right associative concatenation which can be changed with parenthesis
\item
parenthesized subexpressions return the match from their last usage
\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string
\end{itemize}
\end{quote}
|
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ribeiro and Du Bois \cite{RibeiroAgda2017} have
formalised the notion of bit-coded regular expressions
and proved their relation to simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al.~\cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and to prove
that their specification coincides with an earlier (unformalised)
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needed to be substantially
revised in order for the correctness proof to go through;
according to Ausaf et al., the original specification and
proof were unfixable.


In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main results of this thesis build.
|
%We give a taste of what they
%are like and why they are suitable for regular expression
%matching and lexing.
\section{Formal Specification of POSIX Matching
and Brzozowski Derivatives}