ChengsongTanPhdThesis/Chapters/Introduction.tex
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
However, those matchers can exhibit a surprising security vulnerability
under a certain class of inputs.
%However, this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed

Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s,
the decision takes an unreasonably long time to finish (see the graph in
Figure~\ref{fig:aStarStarb}).
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns.
Java 9 and newer
versions improve this behaviour, but are still slow compared
with the approach we are going to use.

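This behaviour is easy to reproduce. The following short Python snippet,
a minimal sketch written purely for illustration using only the
standard-library \texttt{re} and \texttt{time} modules, times the failing
match of $(a^*)^*\,b$ against strings of $a$'s; the runtime roughly
doubles with every additional $a$:

\begin{verbatim}
# Minimal sketch: timing (a*)*b against 'a'*n with Python's
# backtracking re engine; expect roughly exponential growth.
import re
import time

pattern = re.compile(r'(a*)*b')

for n in range(15, 26):
    start = time.perf_counter()
    # the input contains no 'b', so the match must fail, but only
    # after the engine has explored a huge number of ways to split
    # the a's between the two stars
    assert pattern.match('a' * n) is None
    print(n, round(time.perf_counter() - start, 3), 'seconds')
\end{verbatim}
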
This superlinear blowup in regular expression engines
has repeatedly caused grief in real life, so much so that
it has been given a name: ``catastrophic backtracking''.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.

\begin{figure}[p]
% [three-column tabular of tikzpicture timing plots elided in this excerpt]
\caption{% [beginning of caption elided in this excerpt]
   The reason for their superlinear behaviour is that they do a depth-first search.
   If the string does not match, the engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
% [lines elided in this excerpt]
They therefore concluded that evil regular expressions
are a problem ``more than a parlour trick'', but one that
requires
more research attention.

This work aims to address this issue
with the help of formal proofs.
We offer a lexing algorithm based
on Brzozowski derivatives with certified correctness and a certified
finiteness property (both proved in Isabelle/HOL).
These properties guarantee the absence of
catastrophic backtracking in most cases.
We will explain in more detail why we chose our
approach (Brzozowski derivatives and formal proofs)
in the next sections.


\section{The Problem with Bounded Repetitions}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work and we will compare it later on
with the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, but automata instead.
For that, the problem is dealing with bounded regular expressions of the form
$r^{n}$ where $n$ is a constant specifying that $r$ must repeat
exactly $n$ times. The Verbatim++ lexer becomes excruciatingly slow
on the bounded repetitions construct.

In the work reported in \cite{CSL2022} and here, we add better support
for them.
The other repetition constructs include
$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$, which specify
intervals for how many times $r$ should match.
$r^{\ldots m}$ means repeating
at most $m$ times, $r^{n\ldots}$ means repeating at least $n$ times and
$r^{n\ldots m}$ means repeating between $n$ and $m$ times.
The results presented in this thesis extend straightforwardly to them
too.
Their formal definitions will be given later.
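As a preview, using our notation and anticipating those formal
definitions, the intended languages are as follows, where $L(r)^k$
denotes the $k$-fold concatenation of the language $L(r)$ with itself:
\[
L(r^{\ldots m}) \,=\, \bigcup_{0 \le k \le m} L(r)^k
\qquad
L(r^{n\ldots}) \,=\, \bigcup_{n \le k} L(r)^k
\qquad
L(r^{n\ldots m}) \,=\, \bigcup_{n \le k \le m} L(r)^k
\]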

Bounded repetitions are important because they
tend to occur often in practical use, for example in RegExLib,
Snort, as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
bounded regular expressions occur frequently in the latter and can have
counters up to ten million. An example XSD with a large counter they gave
was:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
    <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
    <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
the floating point number format).
The problem here is that tools based on the classic notion of
automata need to expand $r^{n}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
A classic example is the regular expression $(a+b)^*\,a\,(a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states (more on this
later).
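
To make this state blowup concrete, the following Python sketch (again
written purely for illustration) determinises the obvious NFA for
$(a+b)^*\,a\,(a+b)^{n}$ via the subset construction and counts the
reachable states; for this particular language the count coincides with
the size of the minimal DFA, namely $2^{n+1}$:

\begin{verbatim}
# Sketch: subset construction for the NFA of (a|b)* a (a|b)^n.
# NFA states 0..n+1: state 0 loops on a and b, and additionally
# moves to 1 on an a; states 1..n advance on any symbol; state
# n+1 is accepting.
def dfa_size(n):
    def step(states, c):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)
                if c == 'a':
                    nxt.add(1)
            elif s <= n:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                       # explore all reachable subsets
        st = todo.pop()
        for c in 'ab':
            t = step(st, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

for n in range(1, 11):
    print(n, dfa_size(n))             # prints n and 2^(n+1)
\end{verbatim}
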
Therefore regular expression matching libraries that rely on the classic
notion of DFAs often impose ad hoc limits
for bounded regular expressions:
for example, in the regular expression matching library in the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big. These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
straightforwardly extended to deal with bounded regular expressions
and moreover the resulting code still consists of only simple
recursive functions and inductive datatypes.
Finally, bounded repetitions do not destroy our finite
boundedness property, which we shall prove later on.

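To illustrate how straightforward this extension is, the following
minimal Python sketch (purely illustrative; it is emphatically not the
verified Isabelle/HOL development of this thesis) adds a constructor
\texttt{NTimes} for $r^{\{n\}}$ to a textbook implementation of
Brzozowski derivatives. Note that the derivative simply decrements the
counter instead of expanding $r^{n}$ into $n$ copies:

\begin{verbatim}
# Sketch: Brzozowski derivatives with bounded repetition r{n}.
# Requires Python 3.10+ (structural pattern matching).
from dataclasses import dataclass

class Rexp: pass

@dataclass
class Zero(Rexp): pass               # matches no string
@dataclass
class One(Rexp): pass                # matches only the empty string
@dataclass
class Chr(Rexp): c: str              # matches the character c
@dataclass
class Alt(Rexp): r1: Rexp; r2: Rexp  # alternative  r1 + r2
@dataclass
class Seq(Rexp): r1: Rexp; r2: Rexp  # sequence     r1 . r2
@dataclass
class Star(Rexp): r: Rexp            # Kleene star  r*
@dataclass
class NTimes(Rexp): r: Rexp; n: int  # bounded repetition r{n}

def nullable(r):            # does r match the empty string?
    match r:
        case One() | Star(_):   return True
        case Zero() | Chr(_):   return False
        case Alt(r1, r2):       return nullable(r1) or nullable(r2)
        case Seq(r1, r2):       return nullable(r1) and nullable(r2)
        case NTimes(r1, n):     return n == 0 or nullable(r1)

def der(c, r):              # derivative of r with respect to c
    match r:
        case Zero() | One():    return Zero()
        case Chr(d):            return One() if d == c else Zero()
        case Alt(r1, r2):       return Alt(der(c, r1), der(c, r2))
        case Seq(r1, r2) if nullable(r1):
            return Alt(Seq(der(c, r1), r2), der(c, r2))
        case Seq(r1, r2):       return Seq(der(c, r1), r2)
        case Star(r1):          return Seq(der(c, r1), Star(r1))
        case NTimes(_, 0):      return Zero()
        case NTimes(r1, n):     # decrement the counter, no expansion
            return Seq(der(c, r1), NTimes(r1, n - 1))

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

print(matches(NTimes(Chr('a'), 3), "aaa"))  # True
print(matches(NTimes(Chr('a'), 3), "aa"))   # False
\end{verbatim}

Simplification, which is essential for keeping the derivatives small,
is omitted from this sketch.
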

\section{The Terminology Regex, and why Backtracking?}
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in
these phases.
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second (matching) phase.


\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regular expressions are not limited to slowness on certain
cases.
Another thing about these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
% [lines elided in this excerpt]
certain regex engines are running horribly slow on the ``catastrophic''
cases and propose a solution that addresses both of these problems
based on Brzozowski and Sulzmann and Lu's work.

\subsection{Different Phases of a Matching/Lexing Algorithm}


Most lexing algorithms can be roughly divided into