--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Sep 23 00:44:22 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sat Sep 24 00:49:10 2022 +0100
@@ -221,6 +221,41 @@
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
+
+Take $(a^*)^*\,b$ and ask whether
+strings of the form $aa\ldots a$ match this regular
+expression. Obviously this is not the case---the expected $b$ in the last
+position is missing. One would expect that modern regular expression
+matching engines can find this out very quickly. Alas, if one tries
+this example in JavaScript, Python or Java 8, even with strings of a small
+length, say around 30 $a$'s,
+the decision takes an unreasonably long time to finish (see the graphs
+in Figure~\ref{fig:aStarStarb}).
+This is clearly exponential behaviour, and
+it is triggered by some relatively simple regex patterns.
+Java 9 and newer
+versions improve this behaviour, but are still slow compared
+with the approach we are going to use.
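+
+To get a feel for this behaviour, the following small Scala program (a
+hypothetical test harness, not the exact setup used for the graphs)
+exercises the JVM's built-in backtracking engine
+(\texttt{java.util.regex}, the same engine Java programs use) on the
+regular expression above; the timings depend on the machine and the JVM
+version, so the numbers are only indicative.
+
+\begin{verbatim}
+// times the JVM's backtracking matcher on (a*)*b against
+// strings consisting only of a's (so no match exists);
+// on Java 8 the last iterations can take minutes
+import java.util.regex.Pattern
+
+object CatastrophicBacktracking {
+  def main(args: Array[String]): Unit = {
+    val evil = Pattern.compile("(a*)*b")
+    for (n <- 20 to 30) {
+      val input = "a" * n                  // no trailing b
+      val start = System.nanoTime()
+      evil.matcher(input).matches()        // eventually returns false
+      val secs  = (System.nanoTime() - start) / 1e9
+      println(f"n = $n%2d   time = $secs%.2f s")
+    }
+  }
+}
+\end{verbatim}
+
+On Java 8 the running time grows roughly exponentially with the number
+of $a$'s, which is the behaviour plotted in Figure~\ref{fig:aStarStarb}.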
+
+
+
+
+This superlinear blowup in regular expression engines
+has repeatedly caused grief in real life, so much so that the
+phenomenon has acquired a name: ``catastrophic backtracking''.
+For example, on 20 July 2016 one evil
+regular expression brought the webpage
+\href{http://stackexchange.com}{Stack Exchange} to its
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
+In this instance, a regular expression intended to just trim white
+spaces from the beginning and the end of a line actually consumed
+massive amounts of CPU resources---causing web servers to grind to a
+halt. In this example, the time needed to process
+the string was $O(n^2)$ with respect to the string length. This
+quadratic overhead was enough for the homepage of Stack Exchange to
+respond so slowly that the load balancer assumed a $\mathit{DoS}$
+attack and therefore stopped the servers from responding to any
+requests. This made the whole site unavailable.
+
\begin{figure}[p]
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
@@ -336,40 +371,6 @@
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
-Take $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
-expression. Obviously this is not the case---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
-length, say around 30 $a$'s, one discovers that
-this decision takes crazy time to finish given the simplicity of the problem.
-This is clearly exponential behaviour, and
-is triggered by some relatively simple regex patterns, as the graphs
- in \ref{fig:aStarStarb} show.
-Java 9 and newer
-versions improves this behaviour, but is still slow compared
-with the approach we are going to use.
-
-
-
-
-This superlinear blowup in regular expression engines
-had repeatedly caused grief in real life.
-For example, on 20 July 2016 one evil
-regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
-In this instance, a regular expression intended to just trim white
-spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU resources---causing web servers to grind to a
-halt. In this example, the time needed to process
-the string was $O(n^2)$ with respect to the string length. This
-quadratic overhead was enough for the homepage of Stack Exchange to
-respond so slowly that the load balancer assumed a $\mathit{DoS}$
-attack and therefore stopped the servers from responding to any
-requests. This made the whole site become unavailable.
-
A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
@@ -390,9 +391,20 @@
requires
more research attention.
+This work aims to address this issue
+with the help of formal proofs.
+We offer a lexing algorithm based
+on Brzozowski derivatives with a correctness theorem and a
+finiteness property, both certified in Isabelle/HOL.
+Such properties guarantee the absence of
+catastrophic backtracking in most cases.
+We will give more details on why we chose this
+approach (Brzozowski derivatives and formal proofs)
+in the next sections.
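+
+To give a first taste of the approach, below is a minimal, unverified
+Scala sketch of Brzozowski's derivative-based matcher (our own
+illustration; the certified algorithm in this thesis is more refined,
+as it also computes POSIX lexical values, uses bitcoded regular
+expressions and performs simplification, but the underlying recursion
+is the same).
+
+\begin{verbatim}
+// regular expressions as an inductive datatype
+abstract class Rexp
+case object ZERO extends Rexp                 // matches no string
+case object ONE  extends Rexp                 // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// does r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+// Brzozowski derivative of r with respect to the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO        => ZERO
+  case ONE         => ZERO
+  case CHAR(d)     => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
+}
+
+// matching by repeatedly taking derivatives
+def matches(r: Rexp, s: String): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
+\end{verbatim}
+
+Without the simplification and bitcoding steps introduced later, the
+derivatives computed by this sketch can grow large; bounding that
+growth is what the finiteness property is about.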
-\ChristianComment{I am not totally sure where this sentence should be
-put, seems a little out-standing here.}
+
+\section{The Problem with Bounded Repetitions}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
One of the most recent works in the context of lexing
@@ -403,28 +415,78 @@
Verbatim++\cite{Verbatimpp}; this does not use derivatives, but automata instead.
For that the problem is dealing with the bounded regular expressions of the form
$r^{n}$ where $n$ is a constant specifying that $r$ must repeat
-exactly $n$ times.
+exactly $n$ times. The Verbatim++ lexer becomes excruciatingly slow
+on the bounded repetition construct.
+
+In the work reported in \cite{CSL2022} and here, we add better support
+for bounded repetitions.
The other repetition constructs include
-$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating
-at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times.
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which specify
+intervals for how many times $r$ should match.
+$r^{\ldots m}$ means repeating
+at most $m$ times, $r^{n\ldots}$ means repeating at least $n$ times and
+$r^{n\ldots m}$ means repeating between $n$ and $m$ times.
+The results presented in this thesis extend straightforwardly to them
+too.
Their formal definitions will be given later.
+
Bounded repetitions are important because they
-tend to occur often in practical use\cite{xml2015}, for example in RegExLib,
+tend to occur often in practical use, for example in RegExLib,
Snort, as well as in XML Schema definitions (XSDs).
-One XSD that seems to be related to the MPEG-7 standard involves
-the below regular expression:
+According to Bj\"{o}rklund et al.~\cite{xml2015},
+bounded regular expressions occur frequently in the latter and can have
+counters of up to ten million. An example they gave of an XSD with
+large counters is:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
-This is just a fancy way of writing the regular expression
+This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
-satisfy certain constraints such as floating point number format.
+satisfying certain constraints (such as
+conforming to the floating point number format).
+The problem here is that tools based on the classic notion of
+automata need to expand $r^{n}$ into $n$ connected
+copies of the automaton for $r$. This leads to very inefficient matching
+algorithms or algorithms that consume large amounts of memory.
+A classic example is the regular expression $(a+b)^* a (a+b)^{n}$
+where the minimal DFA requires at least $2^{n+1}$ states (more on this
+later).
+Therefore regular expression matching libraries that rely on the classic
+notion of DFAs often impose ad hoc limits
+on bounded regular expressions:
+for example, in the regular expression matching library of the Go
+language the regular expression $a^{1001}$ is not permitted, because no counter
+can be above 1000; and in the Rust regular expression library,
+expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
+for being too big. These problems can of course be solved in matching algorithms where
+automata go beyond the classic notion and for instance include explicit
+counters \cite{Turo_ov__2020}.
+The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
+straightforwardly extended to deal with bounded regular expressions
+and moreover the resulting code still consists of only simple
+recursive functions and inductive datatypes.
+Finally, bounded regular expressions do not destroy our finite
+boundedness property, which we shall prove later on.
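+To illustrate that the extension is indeed straightforward, a sketch of
+the only clauses that need to be added to the nullability and derivative
+functions for an exact repetition $r^{n}$ is given below (in informal
+notation, where $\mathbf{0}$ stands for the regular expression matching
+no strings; the precise definitions appear later):
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{nullable}(r^{n})$ & $=$ & $(n = 0) \vee \textit{nullable}(r)$\\
+  $\textit{der}\,c\,(r^{n})$ & $=$ &
+     $\textit{if}\ n = 0\ \textit{then}\ \mathbf{0}\
+      \textit{else}\ (\textit{der}\,c\,r)\cdot r^{n-1}$\\
+\end{tabular}
+\end{center}
+The counter $n$ is kept as a number and decremented, rather than being
+expanded into $n$ copies of $r$, which is why the code remains a small
+set of recursive functions over an inductive datatype.
+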
-The problems are not limited to slowness on certain
+\section{The Terminology Regex, and Why Backtracking?}
+Shouldn't regular expression matching be linear?
+How can one explain the super-linear behaviour of the
+regex matching engines we have?
+The time cost of regex matching algorithms in general
+arises from two different phases, and different things can go wrong in
+each of these phases.
+$\DFA$s usually have problems in the first (construction) phase,
+whereas $\NFA$s usually run into trouble
+in the second (matching) phase.
+
+
+\section{Error-prone POSIX Implementations}
+The problems with practical implementations
+of regular expression matching are not limited to slowness on certain
cases.
Another thing about these libraries is that there
is no correctness guarantee.
@@ -468,19 +530,6 @@
based on Brzozowski and Sulzmann and Lu's work.
- \section{Why are current regex engines slow?}
-
-%find literature/find out for yourself that REGEX->DFA on basic regexes
-%does not blow up the size
-Shouldn't regular expression matching be linear?
-How can one explain the super-linear behaviour of the
-regex matching engines we have?
-The time cost of regex matching algorithms in general
-involve two different phases, and different things can go differently wrong on
-these phases.
-$\DFA$s usually have problems in the first (construction) phase
-, whereas $\NFA$s usually run into trouble
-on the second phase.
\subsection{Different Phases of a Matching/Lexing Algorithm}
--- a/ChengsongTanPhdThesis/example.bib Fri Sep 23 00:44:22 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib Sat Sep 24 00:49:10 2022 +0100
@@ -5,6 +5,13 @@
%% Saved with string encoding Unicode (UTF-8)
+
+@unpublished{CSL22,
+author = "Chengsong Tan and Christian Urban",
+title = "POSIX Lexing with Bitcoded Derivatives",
+note = "submitted",
+}
+
@INPROCEEDINGS{Verbatim, author={Egolf, Derek and Lasser, Sam and Fisher, Kathleen}, booktitle={2021 IEEE Security and Privacy Workshops (SPW)}, title={Verbatim: A Verified Lexer Generator}, year={2021}, volume={}, number={}, pages={92-100}, doi={10.1109/SPW53761.2021.00022}}