lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

equal deleted inserted replaced

-:bc1df466150a
+:8c234a1bc7e0
 Regular expressions are widely used in computer science:
 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
 command-line tools like $\mathit{grep}$ that facilitate easy
 text-processing; network intrusion
-detection systems that reject suspicious traffic; or compiler
+detection systems that inspect suspicious traffic; or compiler
-front ends--the majority of the solutions to these tasks
+front ends.
-involve lexing with regular
+Given their usefulness and ubiquity, one would assume that
-expressions.
-Given its usefulness and ubiquity, one would imagine that
 modern regular expression matching implementations
 are mature and fully studied.
 Indeed, in a popular programming language's regex engine,
 supplying it with regular expressions and strings,
 in most cases one can
 under a certain class of inputs.
 %However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
-Take $(a^*)^*\,b$ and ask whether
+Consider $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
+strings of the form $aa..a$ can be matched by this regular
 expression. Obviously this is not the case---the expected $b$ in the last
 position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
+matching engines can find this out very quickly. Surprisingly, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
+this example in JavaScript, Python or Java 8, even with small strings,
-length, say around 30 $a$'s,
+say of lenght of around 30 $a$'s,
-the decision takes crazy time to finish (graph \ref{fig:aStarStarb}).
+the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}).
 This is clearly exponential behaviour, and
-is triggered by some relatively simple regex patterns.
+is triggered by some relatively simple regular expressions.
 Java 9 and newer
-versions improves this behaviour, but is still slow compared
+versions improve this behaviour somewhat, but is still slow compared
-with the approach we are going to use.
+with the approach we are going to use in this thesis.
 This superlinear blowup in regular expression engines
-had repeatedly caused grief in real life that they
+had repeatedly caused grief in ``real life'' where it is
-get a name for them--``catastrophic backtracking''.
+given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
 For example, on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
 In this instance, a regular expression intended to just trim white
 respond so slowly that the load balancer assumed a $\mathit{DoS}$
 attack and therefore stopped the servers from responding to any
 requests. This made the whole site become unavailable.
 \begin{figure}[p]
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 x label style={at={(1.05,-0.05)}},
 ylabel={time in secs},
 legend entries={Python},
 legend pos=north west,
 legend cell align=left]
 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
 \end{axis}
-\end{tikzpicture}
+\end{tikzpicture}\\
-&
-\begin{tikzpicture}
-\begin{axis}[
-xlabel={$n$},
-x label style={at={(1.05,-0.05)}},
-%ylabel={time in secs},
-enlargelimits=false,
-xtick={0,5,...,30},
-xmax=33,
-ymax=35,
-ytick={0,5,...,30},
-scaled ticks=false,
-axis lines=left,
-width=5cm,
-height=4cm,
-legend entries={Java 8},
-legend pos=north west,
-legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}\\
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 x label style={at={(1.05,-0.05)}},
 ylabel={time in secs},
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=5cm,
 height=4cm,
-legend entries={Dart},
+legend entries={Java 8},
 legend pos=north west,
 legend cell align=left]
-\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
 \end{axis}
 \end{tikzpicture}
 &
 \begin{tikzpicture}
 \begin{axis}[
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=5cm,
 height=4cm,
+legend entries={Dart},
+legend pos=north west,
+legend cell align=left]
+\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\end{axis}
+\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n$},
+x label style={at={(1.05,-0.05)}},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=35,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
 legend entries={Swift},
 legend pos=north west,
 legend cell align=left]
 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
 \end{axis}
 \end{tikzpicture}
-& \\
+&
-\multicolumn{3}{c}{Graphs}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n$},
+x label style={at={(1.05,-0.05)}},
+%ylabel={time in secs},
+enlargelimits=true,
+%xtick={0,5000,...,40000},
+%xmax=40000,
+%ymax=35,
+restrict x to domain*=0:40000,
+restrict y to domain*=0:35,
+%ytick={0,5,...,30},
+%scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend entries={Java9+},
+legend pos=north west,
+legend cell align=left]
+\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{2}{c}{Graphs}
 \end{tabular}
 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
-The reason for their superlinear behaviour is that they do a depth-first-search.
+The reason for their superlinear behaviour is that they do a depth-first-search
-If the string does not match, the engine starts to explore all possibilities.
+using NFAs.
+If the string does not match, the regular expression matching
+engine starts to explore all possibilities.
 }\label{fig:aStarStarb}
 \end{figure}\afterpage{\clearpage}
 A more recent example is a global outage of all Cloudflare servers on 2 July
-2019. A poorly written regular expression exhibited exponential
+2019. A poorly written regular expression exhibited catastrophic backtracking
-behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
+and exhausted CPUs that serve HTTP traffic. Although the outage
 had several causes, at the heart was a regular expression that
 was used to monitor network
 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
 These problems with regular expressions
 are not isolated events that happen
 very occasionally, but actually widespread.
-They occur so often that they get a
+They occur so often that they have a
-name--Regular-Expression-Denial-Of-Service (ReDoS)
+name: Regular-Expression-Denial-Of-Service (ReDoS)
 attack.
 \citeauthor{Davis18} detected more
-than 1000 super-linear (SL) regular expressions
+than 1000 evil regular expressions
-in Node.js, Python core libraries, and npm and pypi.
+in Node.js, Python core libraries, npm and in pypi.
 They therefore concluded that evil regular expressions
-are problems "more than a parlour trick", but one that
+are real problems rather than "a parlour trick".
-requires
-more research attention.
 This work aims to address this issue
 with the help of formal proofs.
-We offer a lexing algorithm based
+We describe a lexing algorithm based
-on Brzozowski derivatives with certified correctness (in
+on Brzozowski derivatives with verified correctness (in
 Isabelle/HOL)
-and finiteness property.
+and a finiteness property.
-Such properties guarantee the absence of
+Such properties %guarantee the absence of
-catastrophic backtracking in most cases.
+are an important step in preventing
+catastrophic backtracking once and for all.
 We will give more details in the next sections
 on (i) why the slow cases in graph \ref{fig:aStarStarb}
-can occur
+can occur in traditional regular expression engines
 and (ii) why we choose our
-approach (Brzozowski derivatives and formal proofs).
+approach based on Brzozowski derivatives and formal proofs.
-\section{Regex, and the Problems with Regex Matchers}
+\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
 Regular expressions and regular expression matchers
 have of course been studied for many, many years.
-Theoretical results in automata theory say
+Theoretical results in automata theory state
 that basic regular expression matching should be linear
 w.r.t the input.
 This assumes that the regular expression
 $r$ was pre-processed and turned into a
-deterministic finite automata (DFA) before matching,
+deterministic finite automaton (DFA) before matching\cite{Sakarovitch2009}.
-which could be exponential\cite{Sakarovitch2009}.
 By basic we mean textbook definitions such as the one
-below, involving only characters, alternatives,
+below, involving only regular expressions for characters, alternatives,
 sequences, and Kleene stars:
 \[
-	r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
+	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
 \]
 Modern regular expression matchers used by programmers,
 however,
-support richer constructs such as bounded repetitions
+support much richer constructs, such as bounded repetitions
 and back-references.
-To differentiate, people use the word \emph{regex} to refer
+To differentiate, we use the word \emph{regex} to refer
 to those expressions with richer constructs while reserving the
 term \emph{regular expression}
 for the more traditional meaning in formal languages theory.
 We follow this convention
 in this thesis.
 In the future, we aim to support all the popular features of regexes,
-but for this work we mainly look at regular expressions.
+but for this work we mainly look at basic regular expressions
+and bounded repetitions.
 %Most modern regex libraries
 %the so-called PCRE standard (Peral Compatible Regular Expressions)
 %has the back-references
-Regexes come with a lot of constructs
+Regexes come with a number of constructs
-beyond the basic ones
 that make it more convenient for
 programmers to write regular expressions.
-Depending on the types of these constructs
+Depending on the types of constructs
 the task of matching and lexing with them
-will have different levels of complexity increase.
+will have different levels of complexity.
-Some of those constructs are syntactic sugars that are
+Some of those constructs are just syntactic sugars that are
 simply short hand notations
 that save the programmers a few keystrokes.
-These will not cause trouble for regex libraries.
+These will not cause problems for regex libraries.
-\noindent
 For example the
-non-binary alternative involving three or more choices:
+non-binary alternative involving three or more choices just means:
 \[
 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
 \]
-the range operator $-$ used to express the alternative
+Similarly, the range operator used to express the alternative
-of all characters between its operands in a concise way:
+of all characters between its operands is just a concise way:
 \[
 	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
 \]
-and the
+for an alternative. The
-wildcard character $.$ used to refer to any single character:
+wildcard character $.$ is used to refer to any single character,
 \[
 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
 \]
+except the newline.
-\noindent
 \subsection{Bounded Repetitions}
-Some of those constructs do make the expressions much
+More interesting are bounded repetitions, which can
+make the regular expressions much
 more compact.
-For example, the bounded regular expressions
+There are
-(where $n$ and $m$ are constant natural numbers)
+$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
-$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
+(where $n$ and $m$ are constant natural numbers).
-defined as
+Like the star regular expressions, the set of strings or language
+a bounded regular expression can match
+is defined using the power operation on sets:
 \begin{center}
 	\begin{tabular}{lcl}
 		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
 		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
 		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
 		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
 	\end{tabular}
 \end{center}
-are exponentially smaller compared with
+The attraction of bounded repetitions is that they can be
-their unfolded form: for example $r^{\{n\}}$
+used to avoid a blow up: for example $r^{\{n\}}$
-as opposed to
+is a shorthand for
 \[
 	\underbrace{r\ldots r}_\text{n copies of r}.
 \]
 %Therefore, a naive algorithm that simply unfolds
 %them into their desugared forms
 %will suffer from at least an exponential runtime increase.
-The problem here is that tools based on the classic notion of
-automata need to expand $r^{n}$ into $n$ connected
+The problem with matching
+is that tools based on the classic notion of
+automata need to expand $r^{\{n\}}$ into $n$ connected
 copies of the automaton for $r$. This leads to very inefficient matching
 algorithms  or algorithms that consume large amounts of memory.
 Implementations using $\DFA$s will
 either become excruciatingly slow
 (for example Verbatim++\cite{Verbatimpp}) or get
 gets an output program encoding a minimized $\mathit{DFA}$
 that can be compiled and run.
 When given the above countdown regular expression,
 a small $n$ (a few dozen) would result in a
 determinised automata
-with millions of states.}) under large counters.
+with millions of states.}) for large counters.
-A classic example is the regular expression $(a+b)^*  a (a+b)^{n}$
+A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
 where the minimal DFA requires at least $2^{n+1}$ states.
 For example, when $n$ is equal to 2,
-an $\mathit{NFA}$ describing it would look like:
+The corresponding $\mathit{NFA}$ looks like:
 \begin{center}
 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
 \node[state,initial] (q_0)   {$q_0$};
 \node[state, red] (q_1) [right=of q_0] {$q_1$};
 \node[state, red] (q_2) [right=of q_1] {$q_2$};
 	  edge [loop below] node {a,b} ()
 (q_1) edge  node  {a,b} (q_2)
 (q_2) edge  node  {a,b} (q_3);
 \end{tikzpicture}
 \end{center}
-which requires at least $2^3$ states
+when turned into a DFA by the subset construction
-for its subset construction.\footnote{The
+requires at least $2^3$ states.\footnote{The
 red states are "countdown states" which counts down
 the number of characters needed in addition to the current
 string to make a successful match.
 For example, state $q_1$ indicates a match that has
 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
 would require at least $2^{n+1}$ states, if $r$ itself contains
 more than 1 string.
 This is to represent all different
 scenarios which "countdown" states are active.}
-One of the most recent work in the context of lexing
-%with this issue
-is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
-This is relevant work and we will compare later on
-our derivative-based matcher we are going to present.
-There is also some newer work called
-Verbatim++\cite{Verbatimpp}, which does not use derivatives,
-but deterministic finite automaton instead.
-%An example that gives problem to automaton approaches would be
-%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
-%It requires at least $2^{n+1}$ states to represent
-%as a DFA.
 Bounded repetitions are very important because they
-tend to occur a lot in practical use.
+tend to occur a lot in practical use,
-For example in the regex library RegExLib,
+for example in the regex library RegExLib,
 the rules library of Snort \cite{Snort1999}\footnote{
 Snort is a network intrusion detection (NID) tool
 for monitoring network traffic.
 The network security community curates a list
 of malicious patterns written as regexes,
 to match against network traffic for any hostile
 activities such as buffer overflow attacks.},
 as well as in XML Schema definitions (XSDs).
 According to Bj\"{o}rklund et al \cite{xml2015},
 more than half of the
-XSDs they found have bounded regular expressions in them.
+XSDs they found on the Maven.org central repository
-Often the counters are quite large, the largest up to ten million.
+have bounded regular expressions in them.
+Often the counters are quite large, with the largest being
+approximately up to ten million.
 An example XSD they gave
-was:
+is:
-%\begin{verbatim}
+\begin{verbatim}
-%<sequence minOccurs="0" maxOccurs="65535">
+<sequence minOccurs="0" maxOccurs="65535">
-%  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
-%  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-%</sequence>
+</sequence>
-%\end{verbatim}
+\end{verbatim}
 This can be seen as the expression
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions
 satisfying certain constraints (such as
 satisfying the floating point number format).
 It is therefore quite unsatisfying that
 some regular expressions matching libraries
 impose adhoc limits
 for bounded regular expressions:
 For example, in the regular expression matching library in the Go
 can be above 1000, and in the built-in Rust regular expression library
 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
 for being too big.
 As Becchi and Crawley\cite{Becchi08}  have pointed out,
 the reason for these restrictions
-are that they simulate a non-deterministic finite
+is that they simulate a non-deterministic finite
 automata (NFA) with a breadth-first search.
 This way the number of active states could
 be equal to the counter number.
 When the counters are large,
 the memory requirement could become
-infeasible, and the pattern needs to be rejected straight away.
+infeasible, and a regex engine
+like Go will reject this pattern straight away.
 \begin{figure}[H]
 \begin{center}
 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
 	\node (q0) [state, initial] {$0$};
 	\node (q1) [state, right = of q0] {$1$};
-	\node (q2) [state, right = of q1] {$2$};
+	%\node (q2) [state, right = of q1] {$2$};
-	\node (qdots) [right = of q2] {$\ldots$};
+	\node (qdots) [right = of q1] {$\ldots$};
 	\node (qn) [state, right = of qdots] {$n$};
 	\node (qn1) [state, right = of qn] {$n+1$};
 	\node (qn2) [state, right = of qn1] {$n+2$};
 	\node (qn3) [state, accepting, right = of qn2] {$n+3$};
 \path [-stealth, thick]
 	(q0) edge [loop above] node {a} ()
 (q0) edge node {a}   (q1)
-(q1) edge node {.}   (q2)
+%(q1) edge node {.}   (q2)
-(q2) edge node {.}   (qdots)
+(q1) edge node {.}   (qdots)
 (qdots) edge node {.} (qn)
 (qn) edge node {.} (qn1)
 (qn1) edge node {b} (qn2)
 (qn2) edge node {$c$} (qn3);
 \end{tikzpicture}
 %in terms of input string length.
 %TODO:try out these lexers
 These problems can of course be solved in matching algorithms where
 automata go beyond the classic notion and for instance include explicit
 counters \cite{Turo_ov__2020}.
-These solutions can be quite effective,
+These solutions can be quite efficient,
 with the ability to process
-gigabytes of string input per second
+gigabytes of strings input per second
 even with large counters \cite{Becchi08}.
-But formally reasoning about these automata can be challenging
+But formal reasoning about these automata especially in Isabelle
-and un-intuitive.
+can be challenging
-Therefore, correctness and runtime claims made about these solutions need to be
+and un-intuitive.
-taken with a grain of salt.
+Therefore, we take correctness and runtime claims made about these solutions
+with a grain of salt.
 In the work reported in \cite{CSL2022} and here,
 we add better support using derivatives
 for bounded regular expressions $r^{\{n\}}$.
 The results
 This thesis implements the aggressive simplifications envisioned
 by Ausaf and Urban,
 together with a formal proof of the correctness with those simplifications.
+One of the most recent work in the context of lexing
+%with this issue
+is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
+This is relevant work for us and we will compare it later with
+our derivative-based matcher we are going to present.
+There is also some newer work called
+Verbatim++\cite{Verbatimpp}, which does not use derivatives,
+but deterministic finite automaton instead.
+%An example that gives problem to automaton approaches would be
+%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
+%It requires at least $2^{n+1}$ states to represent
+%as a DFA.
 %----------------------------------------------------------------------------------------
 \section{Contribution}
 In this thesis,
 we propose a solution to catastrophic

changeset 612	8c234a1bc7e0
parent 609	61139fdddae0
child 618	233cf2b97d1a