--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Mon Sep 12 23:32:18 2022 +0200
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Sep 22 00:31:09 2022 +0100
@@ -151,7 +151,7 @@
\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
-\def\not{\textit{not}}
+%\def\not{\textit{not}}
@@ -192,8 +192,32 @@
%TODO: look up snort rules to use here--give readers idea of what regexes look like
-\begin{figure}
-\centering
+
+
+
+
+
+Regular expressions are widely used in computer science:
+be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
+command-line tools like $\mathit{grep}$ that facilitate easy
+text-processing; network intrusion
+detection systems that reject suspicious traffic; or compiler
+front ends. The majority of the solutions to these tasks
+involve lexing with regular
+expressions.
+Given their usefulness and ubiquity, one would imagine that
+modern regular expression matching implementations
+are mature and well studied.
+Indeed, supplying a regular expression and a string to the
+regex engine of a popular programming language, one can
+usually obtain rich matching information in a very short time.
+Some network intrusion detection systems
+use regex engines that are able to process
+megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
+Unfortunately, this is not the case for \emph{all} inputs.
+%TODO: get source for SNORT/BRO's regex matching engine/speed
+
+\begin{figure}[p]
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
@@ -257,36 +281,56 @@
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
- of the form $\underbrace{aa..a}_{n}$.}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Dart},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Swift},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
+\end{axis}
+\end{tikzpicture}
+ & \\
+\multicolumn{3}{c}{Graphs}
\end{tabular}
-\caption{aStarStarb} \label{fig:aStarStarb}
-\end{figure}
-
-
-
-
-
-Regular expressions are widely used in computer science:
-be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
-command-line tools like $\mathit{grep}$ that facilitate easy
-text-processing; network intrusion
-detection systems that reject suspicious traffic; or compiler
-front ends--the majority of the solutions to these tasks
-involve lexing with regular
-expressions.
-Given its usefulness and ubiquity, one would imagine that
-modern regular expression matching implementations
-are mature and fully studied.
-Indeed, in a popular programming language' regex engine,
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
-use regex engines that are able to process
-megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
-%TODO: get source for SNORT/BRO's regex matching engine/speed
-
+\caption{Graphs showing the runtime for matching $(a^*)^*\,b$ against strings
+	of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
+	The reason for their superlinear behaviour is that they perform a backtracking depth-first search:
+	when the string does not match, the engine explores all possible ways of splitting the input before giving up.
+}\label{fig:aStarStarb}
+\end{figure}\afterpage{\clearpage}
Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
@@ -299,14 +343,12 @@
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns, as the graphs
in \ref{fig:aStarStarb} show.
-
+Java 9 and newer
+versions improve this behaviour, but are still slow compared
+with the approach we are going to use.
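The blowup can be reproduced in a few lines with any backtracking engine. The following is an illustrative sketch using Python's standard-library `re` module (a backtracking matcher), not one of the benchmark scripts behind the graphs; the exact timings are machine-dependent, the point is their growth with $n$:

```python
import re
import time

# The catastrophic pattern (a*)*b from the text, compiled once.
PATTERN = re.compile(r'(a*)*b')

def time_match(n):
    """Time a failing match of (a*)*b against n copies of 'a'."""
    s = 'a' * n
    start = time.perf_counter()
    result = PATTERN.fullmatch(s)   # always fails: s contains no 'b'
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    result, elapsed = time_match(n)
    print(f'n={n:2d}  match={result}  time={elapsed:.4f}s')
```

Each extra `a` roughly doubles the number of ways the backtracker can split the input among the star iterations, so the running time grows accordingly.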
-\ChristianComment{Superlinear I just leave out the explanation
-which I find once used would distract the flow. Plus if i just say exponential
-here the 2016 event in StackExchange was not exponential, but just quardratic so would be
-in accurate}
This superlinear blowup in regular expression engines
has repeatedly caused grief in real life.
@@ -330,7 +372,6 @@
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
-%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually widespread.
@@ -345,22 +386,29 @@
requires
more research attention.
+Regular expressions and regular expression matchers
+have of course been studied for many, many years.
+One of the most recent works in the context of lexing
+is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
+This is relevant work and we will compare it later on
+with the derivative-based matcher we are going to present.
+There is also some newer work called
+Verbatim++~\cite{Verbatimpp}, which does not use derivatives, but automata instead.
+For automata-based approaches, one problem is bounded repetitions ($r^{n}$,
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
+They often occur in practical use, for example in the Snort XML definitions.
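The derivative-based idea alluded to above can be sketched in a few lines. The following is an illustrative Python toy, not the verified Isabelle/HOL lexer this thesis presents; the tuple encoding of regexes and the simplification rules are choices made for this sketch only:

```python
# Regexes as tuples: ('0',) = empty language, ('1',) = empty string,
# ('c', ch) = character, ('+', r1, r2) = alternative,
# ('.', r1, r2) = sequence, ('*', r) = Kleene star.

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ('1', '*'):
        return True
    if tag == '+':
        return nullable(r[1]) or nullable(r[2])
    if tag == '.':
        return nullable(r[1]) and nullable(r[2])
    return False  # '0' and 'c'

def simp(r):
    """Light top-level simplification; keeps derivatives from growing."""
    tag = r[0]
    if tag == '+':
        a, b = r[1], r[2]
        if a == ('0',): return b
        if b == ('0',): return a
        if a == b: return a
    if tag == '.':
        a, b = r[1], r[2]
        if a == ('0',) or b == ('0',): return ('0',)
        if a == ('1',): return b
        if b == ('1',): return a
    return r

def deriv(r, ch):
    """Brzozowski derivative of r with respect to character ch."""
    tag = r[0]
    if tag in ('0', '1'):
        return ('0',)
    if tag == 'c':
        return ('1',) if r[1] == ch else ('0',)
    if tag == '+':
        return simp(('+', deriv(r[1], ch), deriv(r[2], ch)))
    if tag == '.':
        left = simp(('.', deriv(r[1], ch), r[2]))
        if nullable(r[1]):
            return simp(('+', left, deriv(r[2], ch)))
        return left
    # tag == '*'
    return simp(('.', deriv(r[1], ch), r))

def matches(r, s):
    """Match by repeatedly taking derivatives, then testing nullability."""
    for ch in s:
        r = deriv(r, ch)
    return nullable(r)

# The pattern (a*)*b from the text:
a_star = ('*', ('c', 'a'))
regex = ('.', ('*', a_star), ('c', 'b'))
print(matches(regex, 'a' * 1000 + 'b'))  # True
print(matches(regex, 'a' * 1000))        # False
```

With the simplification rules in place, the derivative of $(a^*)^*\,b$ stabilises after one character, so the thousand-character inputs above are processed in time linear in the string length, in contrast to the backtracking engines in the graphs.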
-But the problems are not limited to slowness on certain
+
+The problems are not limited to slowness on certain
cases.
Another problem with these libraries is that they
provide no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be
-the regex
-\begin{verbatim}
-(aba|ab|a)*
-\end{verbatim}
-and the string
-\begin{verbatim}
-ababa
-\end{verbatim}
+A concrete example would be the regex
+\begin{center}
+	$(aba + ab + a)^*$ and the string $ababa$.
+\end{center}
The correct $\POSIX$ match for the above would be
with the entire string $ababa$,
split into two Kleene star iterations, $[ab] [aba]$ at positions
@@ -375,12 +423,13 @@
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
-As Grathwohl\parencite{grathwohl2014crash} commented,
-\begin{center}
- ``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
-\end{center}
-
-
+As Grathwohl \parencite{grathwohl2014crash} wrote,
+\begin{quote}
+ The POSIX strategy is more complicated than the
+ greedy because of the dependence on information about
+ the length of matched strings in the various subexpressions.
+\end{quote}
+%\noindent
To summarise the above, regular expressions are important.
They are popular and programming languages' library functions
for them are very fast on non-catastrophic cases.
@@ -424,7 +473,6 @@
representing that regex $r$ is being operated on.
We represent the time
it takes by $P_2(r, s)$.\\
-
For $\mathit{DFA}$,
we have $P_2(r, s) = O( |s| )$,
because we take at most $|s|$ steps,
@@ -435,13 +483,14 @@
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.
-
-
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
-expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
+expressions like $r^n$ into
+\[
+	\underbrace{r \cdots r}_{n\ \text{copies of}\ r}.
+\]
The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^|s|$.
+to $|r| \cdot 2^{|s|}$.
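To see where the exponential in this bound comes from, consider again $(a^*)^*\,b$ on the input $a^n$; the following is a standard counting argument, not specific to any one engine:

```latex
% Each way of splitting a^n among the iterations of the outer star
% corresponds to a composition of n, i.e. an ordered sum
% n = k_1 + k_2 + \cdots + k_j with all k_i \geq 1. Summing over the
% number of parts j, a backtracking matcher must reject
\[
	\sum_{j=1}^{n} \binom{n-1}{j-1} \;=\; 2^{\,n-1}
\]
% candidate splits before it can report failure on the missing b.
```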
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
@@ -638,7 +687,6 @@
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
-\ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?}
So we have practical implementations