--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Oct 04 00:25:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Oct 11 13:09:47 2022 +0100
@@ -201,11 +201,9 @@
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text-processing; network intrusion
-detection systems that reject suspicious traffic; or compiler
-front ends--the majority of the solutions to these tasks
-involve lexing with regular
-expressions.
-Given its usefulness and ubiquity, one would imagine that
+detection systems that inspect suspicious traffic; or compiler
+front ends.
+Given the usefulness and ubiquity of regular expressions, one would assume that
modern regular expression matching implementations
are mature and fully studied.
Indeed, in a popular programming language's regex engine,
@@ -222,26 +220,25 @@
%TODO: get source for SNORT/BRO's regex matching engine/speed
-Take $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
+Consider $(a^*)^*\,b$ and ask whether
+strings of the form $aa..a$ can be matched by this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
-length, say around 30 $a$'s,
-the decision takes crazy time to finish (graph \ref{fig:aStarStarb}).
+matching engines can find this out very quickly. Surprisingly, if one tries
+this example in JavaScript, Python or Java 8, even with small strings,
+say of length around 30 $a$'s,
+the decision takes an absurdly long time to finish (see graphs in figure \ref{fig:aStarStarb}).
This is clearly exponential behaviour, and
-is triggered by some relatively simple regex patterns.
+is triggered by some relatively simple regular expressions.
Java 9 and newer
-versions improves this behaviour, but is still slow compared
-with the approach we are going to use.
-
+versions improve this behaviour somewhat, but are still slow compared
+with the approach we are going to use in this thesis.
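The effect is easy to reproduce with Python's built-in re module; the following is a quick, machine-dependent sketch (the timings are illustrative only):

```python
import re
import time

# Backtracking engines try exponentially many ways of splitting the
# a's between the two stars of (a*)* before giving up.
pattern = re.compile(r'(a*)*b')

for n in [14, 16, 18, 20]:
    s = 'a' * n
    start = time.perf_counter()
    result = pattern.match(s)
    elapsed = time.perf_counter() - start
    assert result is None   # the required final 'b' is missing
    print(f'n={n:2d}: {elapsed:.4f}s')
# Each increment of n roughly doubles the time; around n = 30 the
# attempt takes minutes on typical hardware.
```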
This superlinear blowup in regular expression engines
-had repeatedly caused grief in real life that they
-get a name for them--``catastrophic backtracking''.
+has repeatedly caused grief in ``real life'' and has been given
+the name ``catastrophic backtracking''; the culprits are often called ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
@@ -257,7 +254,7 @@
requests. This made the whole site become unavailable.
\begin{figure}[p]
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
@@ -298,28 +295,7 @@
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Java 8},
- legend pos=north west,
- legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}\\
+\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
@@ -334,10 +310,10 @@
axis lines=left,
width=5cm,
height=4cm,
- legend entries={Dart},
+ legend entries={Java 8},
legend pos=north west,
legend cell align=left]
-\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
@@ -355,128 +331,171 @@
axis lines=left,
width=5cm,
height=4cm,
+ legend entries={Dart},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\end{axis}
+\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
legend entries={Swift},
legend pos=north west,
legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
- & \\
-\multicolumn{3}{c}{Graphs}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=true,
+ %xtick={0,5000,...,40000},
+ %xmax=40000,
+ %ymax=35,
+ restrict x to domain*=0:40000,
+ restrict y to domain*=0:35,
+ %ytick={0,5,...,30},
+ %scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+  legend entries={Java 9+},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{2}{c}{Graphs}
\end{tabular}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
- The reason for their superlinear behaviour is that they do a depth-first-search.
- If the string does not match, the engine starts to explore all possibilities.
+	The reason for their superlinear behaviour is that they do a depth-first search
+ using NFAs.
+ If the string does not match, the regular expression matching
+ engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
A more recent example is a global outage of all Cloudflare servers on 2 July
-2019. A poorly written regular expression exhibited exponential
-behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
+2019. A poorly written regular expression exhibited catastrophic backtracking
+and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very occasionally, but actually widespread.
-They occur so often that they get a
-name--Regular-Expression-Denial-Of-Service (ReDoS)
+They occur so often that they have a
+name: Regular Expression Denial of Service (ReDoS)
attack.
\citeauthor{Davis18} detected more
-than 1000 super-linear (SL) regular expressions
-in Node.js, Python core libraries, and npm and pypi.
+than 1000 evil regular expressions
+in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
-are problems "more than a parlour trick", but one that
-requires
-more research attention.
+are real problems rather than ``a parlour trick''.
This work aims to address this issue
with the help of formal proofs.
-We offer a lexing algorithm based
-on Brzozowski derivatives with certified correctness (in
+We describe a lexing algorithm based
+on Brzozowski derivatives with verified correctness (in
Isabelle/HOL)
-and finiteness property.
-Such properties guarantee the absence of
-catastrophic backtracking in most cases.
+and a finiteness property.
+Such properties %guarantee the absence of
+are an important step in preventing
+catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in graph \ref{fig:aStarStarb}
-can occur
+can occur in traditional regular expression engines
and (ii) why we choose our
-approach (Brzozowski derivatives and formal proofs).
+approach based on Brzozowski derivatives and formal proofs.
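To give a first taste of the approach, a Brzozowski-derivative matcher fits in a few dozen lines. The following is an illustrative Python rendering of our own (with a naive simplification step); the verified algorithm is formalised in Isabelle/HOL and is considerably more refined:

```python
# Regular expressions as tagged tuples:
# ('zero',) ('one',) ('chr', c) ('alt', r1, r2) ('seq', r1, r2) ('star', r)
ZERO, ONE = ('zero',), ('one',)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])   # seq

def der(c, r):
    """Brzozowski derivative of r with respect to character c."""
    tag = r[0]
    if tag in ('zero', 'one'):
        return ZERO
    if tag == 'chr':
        return ONE if r[1] == c else ZERO
    if tag == 'alt':
        return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        left = ('seq', der(c, r[1]), r[2])
        return ('alt', left, der(c, r[2])) if nullable(r[1]) else left
    return ('seq', der(c, r[1]), r)            # star

def simp(r):
    """Naive simplification that keeps derivatives small."""
    tag = r[0]
    if tag == 'alt':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO: return r2
        if r2 == ZERO or r1 == r2: return r1
        return ('alt', r1, r2)
    if tag == 'seq':
        r1, r2 = simp(r[1]), simp(r[2])
        if ZERO in (r1, r2): return ZERO
        if r1 == ONE: return r2
        if r2 == ONE: return r1
        return ('seq', r1, r2)
    return r

def matches(r, s):
    for c in s:
        r = simp(der(c, r))
    return nullable(r)

# (a*)*b: well-behaved here, where backtracking engines blow up.
astarstarb = ('seq', ('star', ('star', ('chr', 'a'))), ('chr', 'b'))
print(matches(astarstarb, 'aaab'))    # True
print(matches(astarstarb, 'a' * 30))  # False
```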
-\section{Regex, and the Problems with Regex Matchers}
+\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
-Theoretical results in automata theory say
+Theoretical results in automata theory state
that basic regular expression matching should be linear
w.r.t the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
-deterministic finite automata (DFA) before matching,
-which could be exponential\cite{Sakarovitch2009}.
+deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
-below, involving only characters, alternatives,
+below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
- r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
+ r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
Modern regular expression matchers used by programmers,
however,
-support richer constructs such as bounded repetitions
+support much richer constructs, such as bounded repetitions
and back-references.
-To differentiate, people use the word \emph{regex} to refer
+To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal languages theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
-but for this work we mainly look at regular expressions.
+but for this work we mainly look at basic regular expressions
+and bounded repetitions.
%Most modern regex libraries
%the so-called PCRE standard (Peral Compatible Regular Expressions)
%has the back-references
-Regexes come with a lot of constructs
-beyond the basic ones
+Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
-Depending on the types of these constructs
+Depending on the types of constructs
the task of matching and lexing with them
-will have different levels of complexity increase.
-Some of those constructs are syntactic sugars that are
+will have different levels of complexity.
+Some of those constructs are just syntactic sugar, that is,
simply short hand notations
that save the programmers a few keystrokes.
-These will not cause trouble for regex libraries.
-
-\noindent
+These will not cause problems for regex libraries.
For example the
-non-binary alternative involving three or more choices:
+non-binary alternative involving three or more choices just means:
\[
(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
-the range operator $-$ used to express the alternative
-of all characters between its operands in a concise way:
+Similarly, the range operator $-$, used to express the alternative
+of all characters between its operands, is just a concise notation:
\[
[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
\]
-and the
-wildcard character $.$ used to refer to any single character:
+for an alternative. The
+wildcard character $.$ is used to refer to any single character,
\[
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
+except the newline.
-\noindent
\subsection{Bounded Repetitions}
-Some of those constructs do make the expressions much
+More interesting are bounded repetitions, which can
+make the regular expressions much
more compact.
-For example, the bounded regular expressions
-(where $n$ and $m$ are constant natural numbers)
-$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
-defined as
+They come in four variants:
+$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
+(where $n$ and $m$ are constant natural numbers).
+As with the Kleene star, the set of strings, or language,
+a bounded regular expression can match
+is defined using the power operation on sets:
\begin{center}
\begin{tabular}{lcl}
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
@@ -485,9 +504,9 @@
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
\end{tabular}
\end{center}
-are exponentially smaller compared with
-their unfolded form: for example $r^{\{n\}}$
-as opposed to
+The attraction of bounded repetitions is that they can be
+used to avoid a blow up: for example $r^{\{n\}}$
+is a shorthand for
\[
\underbrace{r\ldots r}_\text{n copies of r}.
\]
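The power operation used in the definitions above is easy to make concrete; here is a small Python sketch in which finite sets of strings stand in for languages (the function names are ours):

```python
def lang_concat(L1, L2):
    """Concatenation of two languages: all pairwise concatenations."""
    return {s1 + s2 for s1 in L1 for s2 in L2}

def lang_pow(L, n):
    """(L)^n: n-fold concatenation, with L^0 = {''}."""
    result = {''}
    for _ in range(n):
        result = lang_concat(result, L)
    return result

def lang_bounded(L, n, m):
    """L(r^{n..m}): the union of L^i for n <= i <= m."""
    out = set()
    for i in range(n, m + 1):
        out |= lang_pow(L, i)
    return out

print(sorted(lang_pow({'a', 'b'}, 2)))    # ['aa', 'ab', 'ba', 'bb']
print(sorted(lang_bounded({'a'}, 1, 3)))  # ['a', 'aa', 'aaa']
```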
@@ -495,8 +514,10 @@
%them into their desugared forms
%will suffer from at least an exponential runtime increase.
-The problem here is that tools based on the classic notion of
-automata need to expand $r^{n}$ into $n$ connected
+
+The problem with matching
+is that tools based on the classic notion of
+automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
@@ -512,11 +533,11 @@
When given the above countdown regular expression,
a small $n$ (a few dozen) would result in a
determinised automata
-with millions of states.}) under large counters.
-A classic example is the regular expression $(a+b)^* a (a+b)^{n}$
+with millions of states.}) for large counters.
+A classic example of this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
-an $\mathit{NFA}$ describing it would look like:
+the corresponding $\mathit{NFA}$ looks like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
@@ -530,8 +551,8 @@
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
-which requires at least $2^3$ states
-for its subset construction.\footnote{The
+when turned into a DFA by the subset construction
+requires at least $2^3$ states.\footnote{The
red states are "countdown states" which counts down
the number of characters needed in addition to the current
string to make a successful match.
@@ -560,23 +581,10 @@
This is to represent all different
scenarios which "countdown" states are active.}
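The state count for the $n=2$ instance can be checked mechanically with a small subset-construction sketch in Python (the NFA encoding follows the diagram above; the state numbering is ours):

```python
# NFA for (a+b)* a (a+b)^2, following the diagram above:
# q0 loops on a and b, an a may additionally advance to q1,
# and two further characters reach the accepting state q3.
nfa = {
    (0, 'a'): {0, 1}, (0, 'b'): {0},
    (1, 'a'): {2},    (1, 'b'): {2},
    (2, 'a'): {3},    (2, 'b'): {3},
    (3, 'a'): set(),  (3, 'b'): set(),
}

def subset_construction(start):
    """Return all subset states reachable from {start}."""
    init = frozenset({start})
    seen, todo = {init}, [init]
    while todo:
        s = todo.pop()
        for c in 'ab':
            t = frozenset(q2 for q in s for q2 in nfa[(q, c)])
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

print(len(subset_construction(0)))  # 8 = 2^{2+1} reachable DFA states
```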
-One of the most recent work in the context of lexing
-%with this issue
-is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
-This is relevant work and we will compare later on
-our derivative-based matcher we are going to present.
-There is also some newer work called
-Verbatim++\cite{Verbatimpp}, which does not use derivatives,
-but deterministic finite automaton instead.
-%An example that gives problem to automaton approaches would be
-%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
-%It requires at least $2^{n+1}$ states to represent
-%as a DFA.
-
Bounded repetitions are very important because they
-tend to occur a lot in practical use.
-For example in the regex library RegExLib,
+tend to occur a lot in practical use,
+for example in the regex library RegExLib,
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
@@ -588,22 +596,23 @@
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
more than half of the
-XSDs they found have bounded regular expressions in them.
-Often the counters are quite large, the largest up to ten million.
+XSDs they found on the Maven.org central repository
+have bounded regular expressions in them.
+Often the counters are quite large, with the largest being
+around ten million.
An example XSD they gave
-was:
-%\begin{verbatim}
-%<sequence minOccurs="0" maxOccurs="65535">
-% <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
-% <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-%</sequence>
-%\end{verbatim}
+is:
+\begin{verbatim}
+<sequence minOccurs="0" maxOccurs="65535">
+ <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{verbatim}
This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
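Abstracting the two element types to single characters $a$ and $b$, the XSD constraint corresponds to a bounded-repetition regex that Python's re module happens to accept; a hypothetical sketch (ignoring the inner floating-point format):

```python
import re

# a stands for a TimeIncr element, b for a MotionParams element:
# (a b^{2..12})^{0..65535}
pattern = re.compile(r'(?:ab{2,12}){0,65535}')

print(bool(pattern.fullmatch('')))                      # True: zero occurrences allowed
print(bool(pattern.fullmatch('abb' + 'a' + 'b' * 12)))  # True: 2 and 12 b's
print(bool(pattern.fullmatch('ab')))                    # False: too few b's
```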
-
It is therefore quite unsatisfying that
some regular expressions matching libraries
impose adhoc limits
@@ -615,21 +624,22 @@
for being too big.
As Becchi and Crawley\cite{Becchi08} have pointed out,
the reason for these restrictions
-are that they simulate a non-deterministic finite
+is that they simulate a non-deterministic finite
automata (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large,
the memory requirement could become
-infeasible, and the pattern needs to be rejected straight away.
+infeasible, and a regex engine
+like the one in Go will reject such a pattern straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
\node (q0) [state, initial] {$0$};
\node (q1) [state, right = of q0] {$1$};
- \node (q2) [state, right = of q1] {$2$};
- \node (qdots) [right = of q2] {$\ldots$};
+ %\node (q2) [state, right = of q1] {$2$};
+ \node (qdots) [right = of q1] {$\ldots$};
\node (qn) [state, right = of qdots] {$n$};
\node (qn1) [state, right = of qn] {$n+1$};
\node (qn2) [state, right = of qn1] {$n+2$};
@@ -638,8 +648,8 @@
\path [-stealth, thick]
(q0) edge [loop above] node {a} ()
(q0) edge node {a} (q1)
- (q1) edge node {.} (q2)
- (q2) edge node {.} (qdots)
+ %(q1) edge node {.} (q2)
+ (q1) edge node {.} (qdots)
(qdots) edge node {.} (qn)
(qn) edge node {.} (qn1)
(qn1) edge node {b} (qn2)
@@ -672,14 +682,15 @@
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
-These solutions can be quite effective,
+These solutions can be quite efficient,
with the ability to process
-gigabytes of string input per second
+gigabytes of string input per second
even with large counters \cite{Becchi08}.
-But formally reasoning about these automata can be challenging
-and un-intuitive.
-Therefore, correctness and runtime claims made about these solutions need to be
-taken with a grain of salt.
+But formal reasoning about these automata, especially in Isabelle,
+can be challenging
+and unintuitive.
+Therefore, we take correctness and runtime claims made about these solutions
+with a grain of salt.
In the work reported in \cite{CSL2022} and here,
we add better support using derivatives
@@ -1180,6 +1191,20 @@
together with a formal proof of the correctness with those simplifications.
+One of the most recent works in the context of lexing
+%with this issue
+is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
+This is relevant work for us and we will compare it later with
+the derivative-based matcher we are going to present.
+There is also some newer work called
+Verbatim++\cite{Verbatimpp}, which does not use derivatives,
+but a deterministic finite automaton instead.
+%An example that gives problem to automaton approaches would be
+%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
+%It requires at least $2^{n+1}$ states to represent
+%as a DFA.
+
+
%----------------------------------------------------------------------------------------
\section{Contribution}