
\chapter{Technical Overview}
\label{Overview}
We start with a technical overview of the 
catastrophic backtracking problem,
motivating rigorous approaches to regular expression matching and lexing.
In doing so we also briefly
introduce common terminology such as
bounded repetitions and back-references.

\section{Motivating Examples}
Consider for example the regular expression $(a^*)^*\,b$ and 
strings of the form $aa\ldots a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings, 
say of length around 30 $a$'s,
the decision takes a large amount of time to finish,
which is disproportionate
to the simplicity of the input (see graphs in Figure~\ref{fig:aStarStarb}).
The matching engines clearly show exponential behaviour, which, as can be seen,
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions mitigate this behaviour by several orders of magnitude, 
but are still slow compared 
with the approach we are going to use in this thesis.
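
This blow-up is straightforward to reproduce. The following Scala sketch,
which we give only for illustration, measures the runtime of the JVM's
built-in backtracking engine \texttt{java.util.regex} on the example above
(exact timings will of course vary with machine and JVM version):
\begin{verbatim}
import java.util.regex.Pattern

object EvilDemo extends App {
  val evil = Pattern.compile("(a*)*b")
  for (n <- 20 to 30) {
    val input = "a" * n                    // n copies of the letter a
    val start = System.nanoTime()
    evil.matcher(input).matches()          // always false: the b is missing
    val secs = (System.nanoTime() - start) / 1e9
    println(s"n = $n: $secs secs")
  }
}
\end{verbatim}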
 


This superlinear blowup in regular expression engines
has caused grief in ``real life'' where it 
goes by the name ``catastrophic backtracking'' and gives rise to so-called ``evil'' regular expressions.
A less serious example is a bug in the Atom editor:
a user found out that the editor took
half an hour to finish computing after they wrote the following code
\begin{verbatim}
   vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr 
    && vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() == 
    reflect.Slice
\end{verbatim}
This was subsequently fixed by Galbraith \cite{fixedAtom};
the issue was due to the regex engine of Atom.
But evil regular expressions can be more than a nuisance in a text editor:
on 20 July 2016 one evil
regular expression brought the website
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was 
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable. 

\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={JavaScript},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Python},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\ 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Java 8},  
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Dart},  
    legend pos=north west,
    legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\ 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Swift},  
    legend pos=north west,
    legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
  & 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5000,...,40000},
    %xmax=40000,
    %ymax=35,
    restrict x to domain*=0:40000,
    restrict y to domain*=0:35,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Java9+},  
    legend pos=north west,
    legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\ 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5,...,30},
    %xmax=33,
    %ymax=35,
    restrict x to domain*=0:60000,
    restrict y to domain*=0:35,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Scala},  
    legend pos=north west,
    legend cell align=left]
\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
\end{axis}
\end{tikzpicture}
  & 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5000,...,40000},
    %xmax=40000,
    %ymax=35,
    restrict x to domain*=0:60000,
    restrict y to domain*=0:45,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend style={cells={align=left}},
    legend entries={Isabelle \\ Extracted},
    legend pos=north west,
    legend cell align=left]
\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
\end{axis}
\end{tikzpicture}\\ 
\end{tabular}    
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
           of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   The reason for their fast growth %superlinear behaviour 
   is that they perform a depth-first search
   over NFAs.
   If the string does not match, the regular expression matching
   engine has to explore all possibilities. 
   The last two graphs are for our implementation in Scala, one written manually
   and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
   Our lexer  
   performs better in this case, and
   is formally verified.
   Despite being almost identical, the codegen-generated lexer
   % is generated directly from Isabelle using $\textit{codegen}$,
   is slower than the manually written version since the code synthesised by
   $\textit{codegen}$ does not use native integer or string
   types of the target language.
   %Note that we are comparing our \emph{lexer} against other languages' matchers.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}

A more recent example is a global outage of all Cloudflare servers on 
2 July 2019. 
The traffic Cloudflare services each day is more than
Twitter, Amazon, Instagram, Apple, Bing and Wikipedia combined, yet
it became completely unavailable for half an hour because 
a poorly-written regular expression, roughly of the form $.^*.^*=.^*$,
exhausted the CPU resources serving HTTP traffic. 
Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
These problems with regular expressions 
are not isolated events that happen
very rarely, 
but occur often enough that they have acquired a
name: Regular-expression-Denial-of-Service (ReDoS)
attacks.
Davis et al.~\cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi. 
They therefore concluded that evil regular expressions
are a real problem rather than just ``a parlour trick''.

The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness 
and a finiteness property for the size of derivatives
(both proven in Isabelle/HOL).
Such properties
are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in Figure~\ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our 
approach based on Brzozowski derivatives and formal proofs.



\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers 
have been studied for many years.
Theoretical results in automata theory state 
that basic regular expression matching should be linear
w.r.t.\ the length of the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention 
throughout this thesis.
In the future, we aim to support all the popular features of regexes, 
but for this work we mainly look at basic regular expressions
and bounded repetitions.


%Most modern regex libraries
%the so-called PCRE standard (Peral Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for 
programmers to write regular expressions.
Depending on the type of construct,
the task of matching and lexing with them
has different levels of complexity.
Some of those constructs are merely syntactic sugar, that is,
shorthand notations
that save programmers a few keystrokes.
These do not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
Similarly, the range operator
is just a concise way
of expressing an alternative of consecutive characters:
\[
	[0-9] \stackrel{means}{=} (0 | 1 | \ldots | 9 )  
\]
The
wildcard character `$.$' matches any single character,
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
except for the newline character.
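
To make the discussion concrete, below is a minimal Scala sketch, with
names of our own choosing, of a datatype mirroring the basic grammar from
above (plus the usual constants for the empty language and the empty
string), together with desugaring functions for the non-binary alternative
and the range operator:
\begin{verbatim}
// Basic regular expressions, mirroring  r ::= c | r1 + r2 | r1 . r2 | r*
abstract class Rexp
case object ZERO extends Rexp                   // matches no string
case object ONE extends Rexp                    // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// Desugaring: (a|b|c) becomes ((a+b)+c), and [0-9] becomes an
// alternative of the consecutive characters 0, 1, ..., 9.
def alts(rs: List[Rexp]): Rexp = rs.reduceLeft(ALT(_, _))
def range(from: Char, to: Char): Rexp =
  alts((from to to).map(CHAR(_)).toList)
\end{verbatim}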

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can 
make the regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like star regular expressions, the set of strings, or language,
that a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
The attraction of bounded repetitions is that they can be
used to avoid a size blow up: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression:
\[
	\underbrace{r\ldots r}_\text{n copies of r}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.
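
The following sketch, reusing the \texttt{Rexp} datatype from above, spells
out this naive unfolding and the resulting growth in size:
\begin{verbatim}
// Naive desugaring of r{n} into n sequenced copies of r
// (using the Rexp datatype from the earlier sketch)
def unfold(r: Rexp, n: Int): Rexp =
  if (n == 0) ONE else SEQ(r, unfold(r, n - 1))

def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
}

// size(unfold(r, n)) = n * (size(r) + 1) + 1 grows linearly in n,
// but n itself is written with only O(log n) digits, so the
// desugared regular expression is exponentially larger than r{n}.
\end{verbatim}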


The problem with matching 
such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected 
copies of the automaton for $r$. This leads to very inefficient matching
algorithms  or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
either become very slow 
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and 
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
in C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and 
gets an output program encoding a minimised $\mathit{DFA}$
that can be compiled and run. 
When given a bounded regular expression as above, even
a small $n$ (say 20) results in a program representing a
DFA
with millions of states.}) for large counters.
A classic example of this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
for which the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
   \node[state,initial] (q_0)   {$q_0$}; 
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->] 
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
and when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The 
red states are ``countdown states'' which count down 
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and 
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
State $q_3$ is the last (accepting) state, requiring 0 
more characters.
Depending on the characters of the
input string read so far,
each of the states $q_1$, $q_2$ and $q_3$
may or may
not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged, 
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalising this to regular expressions with larger
repetition bounds, regexes of the shape $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
require at least $2^{n+1}$ states if the language of $r$ contains
more than one string.
This is to represent all the different 
scenarios in which ``countdown'' states are active.}
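
The $2^{n+1}$ lower bound mentioned in the footnote can be checked
computationally. The sketch below, again only for illustration, runs the
subset construction on the NFA of $(a+b)^*a(a+b)^{\{n\}}$ and counts the
reachable DFA states:
\begin{verbatim}
// Subset construction for the NFA of (a+b)* a (a+b){n}:
// state 0 is the (a+b)* loop, states 1..n are the countdown,
// state n+1 is accepting. Counts the reachable DFA states.
def dfaStates(n: Int): Int = {
  def step(s: Set[Int], c: Char): Set[Int] =
    s.flatMap { q =>
      if (q == 0) { if (c == 'a') Set(0, 1) else Set(0) }
      else if (q <= n) Set(q + 1)   // countdown on either a or b
      else Set.empty[Int]           // q = n+1: accepting, no outgoing edges
    }
  var seen = Set(Set(0))
  var todo = List(Set(0))
  while (todo.nonEmpty) {
    val s = todo.head; todo = todo.tail
    for (c <- List('a', 'b')) {
      val t = step(s, c)
      if (!seen(t)) { seen += t; todo = t :: todo }
    }
  }
  seen.size   // evaluates to 2^(n+1), e.g. dfaStates(2) == 8
}
\end{verbatim}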


Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.}, 
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the 
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million. 
A sample XSD with smaller counters they gave
is:
\lstset{
	basicstyle=\fontsize{8.5}{9}\ttfamily,
  language=XML,
  morekeywords={encoding,
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
}
\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{lstlisting}
This can be seen as the regex
$(ab^{\{2\ldots 12\}})^{\{0 \ldots 65535\}}$, where $a$ and $b$ are themselves
regular expressions 
satisfying certain constraints (such as 
satisfying the floating point number format).
It is therefore quite unsatisfying that 
some regular expression matching libraries
impose ad hoc limits
on bounded regular expressions:
for example, in the regular expression matching library in the Go
language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.\footnote{Try it out here: \url{https://rustexp.lpil.uk}} 
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that these libraries simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states can
be proportional to the counter.
When the counters are large, 
the memory requirement can become
infeasible, and a regex engine
like Go's will reject such patterns straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
 
    	\node (q0) [state, initial] {$0$};
	\node (q1) [state, right = of q0] {$1$};
	%\node (q2) [state, right = of q1] {$2$};
	\node (qdots) [right = of q1] {$\ldots$};
	\node (qn) [state, right = of qdots] {$n$};
	\node (qn1) [state, right = of qn] {$n+1$};
	\node (qn2) [state, right = of qn1] {$n+2$};
	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
 
\path [-stealth, thick]
	(q0) edge [loop above] node {a} ()
    (q0) edge node {a}   (q1) 
    %(q1) edge node {.}   (q2)
    (q1) edge node {.}   (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
%   \node[state,initial] (q_0)   {$0$}; 
%   \node[state, ] (q_1) [right=of q_0] {$1$}; 
%   \node[state, ] (q_2) [right=of q_1] {$2$}; 
%   \node[state,
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
%    \path[->] 
%    (q_0) edge  node {a} (q_1)
%    	  edge [loop below] node {a,b} ()
%    (q_1) edge  node  {a,b} (q_2)
%    (q_2) edge  node  {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
	showing that NFA simulation can consume large
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
	strings of the form $aaa\ldots aaaabc$.
	When traversing in a breadth-first manner,
all states from 0 to $n+1$ will become active.}\label{fig:inj}

\end{figure}
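
The memory behaviour described by Becchi and Crawley can be illustrated by
a small breadth-first simulation of the NFA in Figure~\ref{fig:inj} (our
own sketch, not their implementation): on a long string of $a$'s, all
countdown states eventually become active at the same time.
\begin{verbatim}
// Breadth-first simulation of the NFA in the figure for .* a .{n} b c:
// state 0 loops on any character; states 1..n+1 are the countdown;
// 'b' then 'c' lead to the accepting state n+3.
def step(n: Int)(states: Set[Int], c: Char): Set[Int] =
  states.flatMap { q =>
    if (q == 0) { if (c == 'a') Set(0, 1) else Set(0) }
    else if (q <= n) Set(q + 1)                 // countdown: any character
    else if (q == n + 1 && c == 'b') Set(n + 2)
    else if (q == n + 2 && c == 'c') Set(n + 3)
    else Set.empty[Int]
  }

def simulate(n: Int, input: String): Set[Int] =
  input.foldLeft(Set(0))(step(n))

// On inputs of the form aaa...a the active set grows to n+2 states,
// e.g. simulate(1000, "a" * 2000).size == 1002: memory grows with
// the counter, which is why engines like Go reject large counters.
\end{verbatim}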
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where 
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabits of input per second
even with large counters \cite{Becchi08}.
However, these practical solutions do not come with
formal guarantees, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
%But formal reasoning about these automata especially in Isabelle 
%can be challenging
%and un-intuitive. 
%Therefore, we take correctness and runtime claims made about these solutions
%with a grain of salt.

In the work reported in \cite{ITP2023} and here, 
we add better support using derivatives
for bounded regular expressions such as $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as 
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only small
inductive datatypes and recursive functions,
making them easy to handle in theorem provers.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
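
As a first taste, the following Scala sketch extends the \texttt{Rexp}
datatype from the earlier sketch with a constructor for $r^{\{n\}}$ and
gives the corresponding derivative clauses; this mirrors, but should not be
confused with, the verified Isabelle definitions presented later:
\begin{verbatim}
// Extends the Rexp datatype from the earlier sketch with
case class NTIMES(r: Rexp, n: Int) extends Rexp   // r{n}

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
  case NTIMES(r1, n) => n == 0 || nullable(r1)
}

// The derivative of r{n} just decrements the counter:
// no unfolding into n copies is ever needed.
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  case NTIMES(r1, n) =>
    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

// Matching: repeatedly take derivatives, then test for nullability.
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}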





\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is to choose
a single transition each time, keep all the other options in 
a queue or stack, and backtrack if that choice eventually 
fails. 
This method, often called ``depth-first search'', 
is efficient in many cases, but can end up
with exponential runtime.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply 
%construct the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming 
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.



Consider the following regular expression where the sequence
operator is omitted for brevity:
\begin{center}
	$r_1r_2r_3r_4$
\end{center}
In this example,
one could label sub-expressions of interest 
by parenthesizing them and giving 
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is: 
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers. 
%These sub-expressions are called "capturing groups".
To do so, one uses the syntax $\backslash i$ 
to denote that we want the sub-string 
of the input matched by the $i$-th
sub-expression to appear again, 
exactly the same as it first appeared: 
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
\end{center}
Once the sub-string $s_i$ for the sub-expression $r_i$
has been fixed, there is no variability in what the back-reference
label $\backslash i$ can match---it is tied to $s_i$.
The overall string will look like $\ldots s_i \ldots s_i \ldots$
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions 
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
of back-references is
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from 
repeating the  sub-expression verbatim like
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second 
sub-expression $.^*$
might match.
Another example of back-references is
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
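
Both examples can be tried out directly with a backtracking engine, for
instance the JVM's \texttt{java.util.regex}, which supports back-references:
\begin{verbatim}
import java.util.regex.Pattern

// (.*)\1 : two identical halves
Pattern.matches("(.*)\\1", "foofoo")     // true
Pattern.matches("(.*)\\1", "foobar")     // false

// (.)(.)\2\1 : four-character palindromes
Pattern.matches("(.)(.)\\2\\1", "abba")  // true
Pattern.matches("(.)(.)\\2\\1", "x??x")  // true
\end{verbatim}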

Back-references are a regex construct 
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to express well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like 
\begin{center}
	$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes 
goes beyond regular languages 
once back-references are included.
In fact, back-references allow regexes to express 
languages that cannot be captured by context-free
grammars either.
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n \mid n \in \mathbb{N}\}$,
which cannot be expressed by 
context-free grammars \cite{campeanu2003formal};
such a language lies in the context-sensitive family
of formal languages. 
Moreover, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as 
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm involving backtracking.
What is unexpected is that even in cases 
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in Figure~\ref{fig:aStarStarb}.

Summing up, we can categorise existing 
practical regex libraries into two kinds:
(i) those with linear-time
guarantees like Go and Rust; the downside is that
they impose restrictions
on the regular expressions (not allowing back-references, 
bounded repetitions that cannot exceed an ad hoc limit, etc.);
and (ii) those 
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
The latter can potentially ``grind to a halt''
on some very simple cases, resulting in
ReDoS attacks if exposed to the internet.

The problems with both approaches are the motivation for us 
to look again at the regular expression matching problem. 
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.

%We would like to have regex engines that can 
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
% For example, the Rust regex engine claims to be linear, 
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.  
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
 %TODO: verify the fact Rust does not allow 1000+ reps




%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on 
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.


\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
This strategy is adopted by many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$.
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$, 
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But feeding this example to the different engines
listed at regex101~\parencite{regex101}\footnote{
	regex101 is an online regular expression matcher which
	provides an API for trying out regular expression
	engines of multiple popular programming languages like
Java, Python, Go, etc.}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
commented that most regex libraries do not
correctly implement the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
	``The POSIX strategy is more complicated than the 
	greedy because of the dependence on information about 
	the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
People have recognised that the implementation complexity of POSIX rules also comes from
the specification being not very precise.
The developers of the regexp package of Go 
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell} 
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there
\item
earlier subpatterns have leftmost-longest priority over later subpatterns
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns
\item
REs have right associative concatenation which can be changed with parenthesis
\item
parenthesized subexpressions return the match from their last usage
\item
text of component subexpressions must be contained in the text of the 
higher-level subexpressions
\item
if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions
\item
if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string
	\end{itemize}
\end{quote}
%The text above 
%is trying to capture something very precise,
%and is crying out for formalising.