ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 612 8c234a1bc7e0
parent 609 61139fdddae0
child 618 233cf2b97d1a
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Oct 04 00:25:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Oct 11 13:09:47 2022 +0100
@@ -201,11 +201,9 @@
 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
 command-line tools like $\mathit{grep}$ that facilitate easy 
 text-processing; network intrusion
-detection systems that reject suspicious traffic; or compiler
-front ends--the majority of the solutions to these tasks 
-involve lexing with regular 
-expressions.
-Given its usefulness and ubiquity, one would imagine that
+detection systems that inspect suspicious traffic; or compiler
+front ends.
+Given their usefulness and ubiquity, one would assume that
 modern regular expression matching implementations
 are mature and fully studied.
 Indeed, in a popular programming language's regex engine, 
@@ -222,26 +220,25 @@
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
 
-Take $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
+Consider $(a^*)^*\,b$ and ask whether
+strings of the form $aa..a$ can be matched by this regular
 expression. Obviously this is not the case---the expected $b$ in the last
 position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
-length, say around 30 $a$'s,
-the decision takes crazy time to finish (graph \ref{fig:aStarStarb}).
+matching engines can find this out very quickly. Surprisingly, if one tries
+this example in JavaScript, Python or Java 8, even with small strings, 
+say of lenght of around 30 $a$'s,
+the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}).
 This is clearly exponential behaviour, and 
-is triggered by some relatively simple regex patterns.
+is triggered by some relatively simple regular expressions.
 Java 9 and newer
-versions improves this behaviour, but is still slow compared 
-with the approach we are going to use.
-
+versions improve this behaviour somewhat, but is still slow compared 
+with the approach we are going to use in this thesis.
 
 
 
 This superlinear blowup in regular expression engines
-had repeatedly caused grief in real life that they
-get a name for them--``catastrophic backtracking''.
+had repeatedly caused grief in ``real life'' where it is 
+given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
 For example, on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
@@ -257,7 +254,7 @@
 requests. This made the whole site become unavailable. 
 
 \begin{figure}[p]
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
     xlabel={$n$},
@@ -298,28 +295,7 @@
     legend cell align=left]
 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
 \end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={Java 8},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}\\
+\end{tikzpicture}\\ 
 \begin{tikzpicture}
 \begin{axis}[
     xlabel={$n$},
@@ -334,10 +310,10 @@
     axis lines=left,
     width=5cm,
     height=4cm, 
-    legend entries={Dart},  
+    legend entries={Java 8},  
     legend pos=north west,
     legend cell align=left]
-\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
 \end{axis}
 \end{tikzpicture}
   &
@@ -355,128 +331,171 @@
     axis lines=left,
     width=5cm,
     height=4cm, 
+    legend entries={Dart},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\end{axis}
+\end{tikzpicture}\\ 
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,-0.05)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=35,
+    ytick={0,5,...,30},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
     legend entries={Swift},  
     legend pos=north west,
     legend cell align=left]
 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
 \end{axis}
 \end{tikzpicture}
-  & \\
-\multicolumn{3}{c}{Graphs}
+  & 
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,-0.05)}},
+    %ylabel={time in secs},
+    enlargelimits=true,
+    %xtick={0,5000,...,40000},
+    %xmax=40000,
+    %ymax=35,
+    restrict x to domain*=0:40000,
+    restrict y to domain*=0:35,
+    %ytick={0,5,...,30},
+    %scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={Java9+},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
+\end{axis}
+\end{tikzpicture}\\ 
+\multicolumn{2}{c}{Graphs}
 \end{tabular}    
 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
-   The reason for their superlinear behaviour is that they do a depth-first-search.
-   If the string does not match, the engine starts to explore all possibilities. 
+   The reason for their superlinear behaviour is that they do a depth-first-search
+   using NFAs.
+   If the string does not match, the regular expression matching
+   engine starts to explore all possibilities. 
 }\label{fig:aStarStarb}
 \end{figure}\afterpage{\clearpage}
 
 A more recent example is a global outage of all Cloudflare servers on 2 July
-2019. A poorly written regular expression exhibited exponential
-behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
+2019. A poorly written regular expression exhibited catastrophic backtracking
+and exhausted CPUs that serve HTTP traffic. Although the outage
 had several causes, at the heart was a regular expression that
 was used to monitor network
 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
 These problems with regular expressions 
 are not isolated events that happen
 very occasionally, but actually widespread.
-They occur so often that they get a 
-name--Regular-Expression-Denial-Of-Service (ReDoS)
+They occur so often that they have a 
+name: Regular-Expression-Denial-Of-Service (ReDoS)
 attack.
 \citeauthor{Davis18} detected more
-than 1000 super-linear (SL) regular expressions
-in Node.js, Python core libraries, and npm and pypi. 
+than 1000 evil regular expressions
+in Node.js, Python core libraries, npm and in pypi. 
 They therefore concluded that evil regular expressions
-are problems "more than a parlour trick", but one that
-requires
-more research attention.
+are real problems rather than "a parlour trick".
 
 This work aims to address this issue
 with the help of formal proofs.
-We offer a lexing algorithm based
-on Brzozowski derivatives with certified correctness (in 
+We describe a lexing algorithm based
+on Brzozowski derivatives with verified correctness (in 
 Isabelle/HOL)
-and finiteness property.
-Such properties guarantee the absence of 
-catastrophic backtracking in most cases.
+and a finiteness property.
+Such properties %guarantee the absence of 
+are an important step in preventing
+catastrophic backtracking once and for all.
 We will give more details in the next sections
 on (i) why the slow cases in graph \ref{fig:aStarStarb}
-can occur
+can occur in traditional regular expression engines
 and (ii) why we choose our 
-approach (Brzozowski derivatives and formal proofs).
+approach based on Brzozowski derivatives and formal proofs.
 
 
-\section{Regex, and the Problems with Regex Matchers}
+\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
 Regular expressions and regular expression matchers 
 have of course been studied for many, many years.
-Theoretical results in automata theory say
+Theoretical results in automata theory state 
 that basic regular expression matching should be linear
 w.r.t the input.
 This assumes that the regular expression
 $r$ was pre-processed and turned into a
-deterministic finite automata (DFA) before matching,
-which could be exponential\cite{Sakarovitch2009}.
+deterministic finite automaton (DFA) before matching\cite{Sakarovitch2009}.
 By basic we mean textbook definitions such as the one
-below, involving only characters, alternatives,
+below, involving only regular expressions for characters, alternatives,
 sequences, and Kleene stars:
 \[
-	r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
+	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
 \]
 Modern regular expression matchers used by programmers,
 however,
-support richer constructs such as bounded repetitions
+support much richer constructs, such as bounded repetitions
 and back-references.
-To differentiate, people use the word \emph{regex} to refer
+To differentiate, we use the word \emph{regex} to refer
 to those expressions with richer constructs while reserving the
 term \emph{regular expression}
 for the more traditional meaning in formal languages theory.
 We follow this convention 
 in this thesis.
 In the future, we aim to support all the popular features of regexes, 
-but for this work we mainly look at regular expressions.
+but for this work we mainly look at basic regular expressions
+and bounded repetitions.
 
 
 
 %Most modern regex libraries
 %the so-called PCRE standard (Peral Compatible Regular Expressions)
 %has the back-references
-Regexes come with a lot of constructs
-beyond the basic ones
+Regexes come with a number of constructs
 that make it more convenient for 
 programmers to write regular expressions.
-Depending on the types of these constructs
+Depending on the types of constructs
 the task of matching and lexing with them
-will have different levels of complexity increase.
-Some of those constructs are syntactic sugars that are
+will have different levels of complexity.
+Some of those constructs are just syntactic sugars that are
 simply short hand notations
 that save the programmers a few keystrokes.
-These will not cause trouble for regex libraries.
-
-\noindent
+These will not cause problems for regex libraries.
 For example the
-non-binary alternative involving three or more choices:
+non-binary alternative involving three or more choices just means:
 \[
 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
 \]
-the range operator $-$ used to express the alternative
-of all characters between its operands in a concise way:
+Similarly, the range operator used to express the alternative
+of all characters between its operands is just a concise way:
 \[
 	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
 \]
-and the
-wildcard character $.$ used to refer to any single character:
+for an alternative. The
+wildcard character $.$ is used to refer to any single character,
 \[
 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
 \]
+except the newline.
 
-\noindent
 \subsection{Bounded Repetitions}
-Some of those constructs do make the expressions much
+More interesting are bounded repetitions, which can 
+make the regular expressions much
 more compact.
-For example, the bounded regular expressions
-(where $n$ and $m$ are constant natural numbers)
-$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
-defined as 
+There are 
+$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
+(where $n$ and $m$ are constant natural numbers).
+Like the star regular expressions, the set of strings or language
+a bounded regular expression can match
+is defined using the power operation on sets:
 \begin{center}
 	\begin{tabular}{lcl}
 		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
@@ -485,9 +504,9 @@
 		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
 	\end{tabular}
 \end{center}
-are exponentially smaller compared with
-their unfolded form: for example $r^{\{n\}}$
-as opposed to
+The attraction of bounded repetitions is that they can be
+used to avoid a blow up: for example $r^{\{n\}}$
+is a shorthand for
 \[
 	\underbrace{r\ldots r}_\text{n copies of r}.
 \]
@@ -495,8 +514,10 @@
 %them into their desugared forms
 %will suffer from at least an exponential runtime increase.
 
-The problem here is that tools based on the classic notion of
-automata need to expand $r^{n}$ into $n$ connected 
+
+The problem with matching 
+is that tools based on the classic notion of
+automata need to expand $r^{\{n\}}$ into $n$ connected 
 copies of the automaton for $r$. This leads to very inefficient matching
 algorithms  or algorithms that consume large amounts of memory.
 Implementations using $\DFA$s will
@@ -512,11 +533,11 @@
 When given the above countdown regular expression,
 a small $n$ (a few dozen) would result in a 
 determinised automata
-with millions of states.}) under large counters.
-A classic example is the regular expression $(a+b)^*  a (a+b)^{n}$
+with millions of states.}) for large counters.
+A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
 where the minimal DFA requires at least $2^{n+1}$ states.
 For example, when $n$ is equal to 2,
-an $\mathit{NFA}$ describing it would look like:
+The corresponding $\mathit{NFA}$ looks like:
 \begin{center}
 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
    \node[state,initial] (q_0)   {$q_0$}; 
@@ -530,8 +551,8 @@
     (q_2) edge  node  {a,b} (q_3);
 \end{tikzpicture}
 \end{center}
-which requires at least $2^3$ states
-for its subset construction.\footnote{The 
+when turned into a DFA by the subset construction
+requires at least $2^3$ states.\footnote{The 
 red states are "countdown states" which counts down 
 the number of characters needed in addition to the current
 string to make a successful match.
@@ -560,23 +581,10 @@
 This is to represent all different 
 scenarios which "countdown" states are active.}
 
-One of the most recent work in the context of lexing
-%with this issue
-is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
-This is relevant work and we will compare later on
-our derivative-based matcher we are going to present.
-There is also some newer work called
-Verbatim++\cite{Verbatimpp}, which does not use derivatives, 
-but deterministic finite automaton instead.
-%An example that gives problem to automaton approaches would be
-%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
-%It requires at least $2^{n+1}$ states to represent
-%as a DFA.
-
 
 Bounded repetitions are very important because they
-tend to occur a lot in practical use.
-For example in the regex library RegExLib,
+tend to occur a lot in practical use,
+for example in the regex library RegExLib,
 the rules library of Snort \cite{Snort1999}\footnote{
 Snort is a network intrusion detection (NID) tool
 for monitoring network traffic.
@@ -588,22 +596,23 @@
 as well as in XML Schema definitions (XSDs).
 According to Bj\"{o}rklund et al \cite{xml2015},
 more than half of the 
-XSDs they found have bounded regular expressions in them.
-Often the counters are quite large, the largest up to ten million. 
+XSDs they found on the Maven.org central repository
+have bounded regular expressions in them.
+Often the counters are quite large, with the largest being
+approximately up to ten million. 
 An example XSD they gave
-was:
-%\begin{verbatim}
-%<sequence minOccurs="0" maxOccurs="65535">
-%  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
-%  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-%</sequence>
-%\end{verbatim}
+is:
+\begin{verbatim}
+<sequence minOccurs="0" maxOccurs="65535">
+ <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{verbatim}
 This can be seen as the expression 
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions 
 satisfying certain constraints (such as 
 satisfying the floating point number format).
-
 It is therefore quite unsatisfying that 
 some regular expressions matching libraries
 impose adhoc limits
@@ -615,21 +624,22 @@
 for being too big. 
 As Becchi and Crawley\cite{Becchi08}  have pointed out,
 the reason for these restrictions
-are that they simulate a non-deterministic finite
+is that they simulate a non-deterministic finite
 automata (NFA) with a breadth-first search.
 This way the number of active states could
 be equal to the counter number.
 When the counters are large, 
 the memory requirement could become
-infeasible, and the pattern needs to be rejected straight away.
+infeasible, and a regex engine
+like Go will reject this pattern straight away.
 \begin{figure}[H]
 \begin{center}
 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
  
     	\node (q0) [state, initial] {$0$};
 	\node (q1) [state, right = of q0] {$1$};
-	\node (q2) [state, right = of q1] {$2$};
-	\node (qdots) [right = of q2] {$\ldots$};
+	%\node (q2) [state, right = of q1] {$2$};
+	\node (qdots) [right = of q1] {$\ldots$};
 	\node (qn) [state, right = of qdots] {$n$};
 	\node (qn1) [state, right = of qn] {$n+1$};
 	\node (qn2) [state, right = of qn1] {$n+2$};
@@ -638,8 +648,8 @@
 \path [-stealth, thick]
 	(q0) edge [loop above] node {a} ()
     (q0) edge node {a}   (q1) 
-    (q1) edge node {.}   (q2)
-    (q2) edge node {.}   (qdots)
+    %(q1) edge node {.}   (q2)
+    (q1) edge node {.}   (qdots)
     (qdots) edge node {.} (qn)
     (qn) edge node {.} (qn1)
     (qn1) edge node {b} (qn2)
@@ -672,14 +682,15 @@
 These problems can of course be solved in matching algorithms where 
 automata go beyond the classic notion and for instance include explicit
 counters \cite{Turo_ov__2020}.
-These solutions can be quite effective,
+These solutions can be quite efficient,
 with the ability to process
-gigabytes of string input per second
+gigabytes of strings input per second
 even with large counters \cite{Becchi08}.
-But formally reasoning about these automata can be challenging
-and un-intuitive.
-Therefore, correctness and runtime claims made about these solutions need to be
-taken with a grain of salt.
+But formal reasoning about these automata especially in Isabelle 
+can be challenging
+and un-intuitive. 
+Therefore, we take correctness and runtime claims made about these solutions
+with a grain of salt.
 
 In the work reported in \cite{CSL2022} and here, 
 we add better support using derivatives
@@ -1180,6 +1191,20 @@
 together with a formal proof of the correctness with those simplifications.
 
 
+One of the most recent work in the context of lexing
+%with this issue
+is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
+This is relevant work for us and we will compare it later with 
+our derivative-based matcher we are going to present.
+There is also some newer work called
+Verbatim++\cite{Verbatimpp}, which does not use derivatives, 
+but deterministic finite automaton instead.
+%An example that gives problem to automaton approaches would be
+%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
+%It requires at least $2^{n+1}$ states to represent
+%as a DFA.
+
+
 %----------------------------------------------------------------------------------------
 \section{Contribution}