ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 601 ce4e5151a836
parent 600 fd068f39ac23
child 602 46db6ae66448
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Mon Sep 12 23:32:18 2022 +0200
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Sep 22 00:31:09 2022 +0100
@@ -151,7 +151,7 @@
 \def\zeroable{\textit{zeroable}}
 \def\nub{\textit{nub}}
 \def\filter{\textit{filter}}
-\def\not{\textit{not}}
+%\def\not{\textit{not}}
 
 
 
@@ -192,8 +192,32 @@
 
 %TODO: look up snort rules to use here--give readers idea of what regexes look like
 
-\begin{figure}
-\centering
+
+
+
+
+
+Regular expressions are widely used in computer science: 
+be it in text editors \parencite{atomEditor} with syntax highlighting and auto-completion;
+command-line tools like $\mathit{grep}$ that facilitate easy 
+text-processing; network intrusion
+detection systems that reject suspicious traffic; or compiler
+front ends. The majority of the solutions to these tasks 
+involve lexing with regular 
+expressions.
+Given their usefulness and ubiquity, one would assume that
+modern regular expression matching implementations
+are mature and well studied.
+Indeed, given a regular expression and a string, the regex engine of a
+popular programming language usually returns
+rich matching information in a very short time.
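+For example, the following small Scala snippet (using the JVM's
+standard \texttt{java.util.regex} engine, one such implementation)
+matches a string and reports the submatches; the pattern and input
+here are made up purely for illustration:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+val p = Pattern.compile("(ab)+(c*)")
+val m = p.matcher("ababcc")
+if (m.matches()) {
+  println(m.group(0))   // whole match:          ababcc
+  println(m.group(1))   // last (ab) iteration:  ab
+  println(m.group(2))   // the c* part:          cc
+}
+\end{verbatim}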
+Some network intrusion detection systems
+use regex engines that are able to process 
+megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
+Unfortunately, this is not the case for \emph{all} inputs.
+%TODO: get source for SNORT/BRO's regex matching engine/speed
+
+\begin{figure}[p]
 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
@@ -257,36 +281,56 @@
 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
 \end{axis}
 \end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
-           of the form $\underbrace{aa..a}_{n}$.}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,-0.05)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=35,
+    ytick={0,5,...,30},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={Dart},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\end{axis}
+\end{tikzpicture}
+  &
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,-0.05)}},
+    %ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=35,
+    ytick={0,5,...,30},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={Swift},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
+\end{axis}
+\end{tikzpicture}
+  & \\
+\multicolumn{3}{c}{Graphs}
 \end{tabular}    
-\caption{aStarStarb} \label{fig:aStarStarb}
-\end{figure}
-
-
-
-
-
-Regular expressions are widely used in computer science: 
-be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
-command-line tools like $\mathit{grep}$ that facilitate easy 
-text-processing; network intrusion
-detection systems that reject suspicious traffic; or compiler
-front ends--the majority of the solutions to these tasks 
-involve lexing with regular 
-expressions.
-Given its usefulness and ubiquity, one would imagine that
-modern regular expression matching implementations
-are mature and fully studied.
-Indeed, in a popular programming language' regex engine, 
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
-use regex engines that are able to process 
-megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
-%TODO: get source for SNORT/BRO's regex matching engine/speed
-
+\caption{Runtime for matching $(a^*)^*\,b$ with strings 
+           of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
+   The reason for the superlinear behaviour is that these engines perform a depth-first search with backtracking:
+   if the string does not match, the engine goes on to explore all other possibilities. 
+}\label{fig:aStarStarb}
+\end{figure}\afterpage{\clearpage}
 
 Take $(a^*)^*\,b$ and ask whether
 strings of the form $aa..a$ match this regular
@@ -299,14 +343,12 @@
 This is clearly exponential behaviour, and 
 is triggered by some relatively simple regex patterns, as the graphs
 in Figure~\ref{fig:aStarStarb} show.
-
+Java 9 and newer
+versions improve this behaviour somewhat, but it is still slow compared 
+with the approach we are going to use.
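+This catastrophic behaviour can be reproduced with a few lines of
+Scala driving the JVM's built-in \texttt{java.util.regex} engine
+(the class and variable names are made up for this sketch, and the
+exact timings of course depend on the JVM version):
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object CatastrophicDemo {
+  def main(args: Array[String]): Unit = {
+    val p = Pattern.compile("(a*)*b")
+    for (n <- 5 to 30) {
+      val str = "a" * n               // n copies of a, and no b
+      val start = System.nanoTime()
+      p.matcher(str).matches()        // fails, after much backtracking
+      val secs = (System.nanoTime() - start) / 1e9
+      println(s"n = $n: $secs secs")
+    }
+  }
+}
+\end{verbatim}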
 
 
 
-\ChristianComment{Superlinear I just leave out the explanation 
-which I find once used would distract the flow. Plus if i just say exponential
-here the 2016 event in StackExchange was not exponential, but just quardratic so would be 
-in accurate}
 
 This superlinear blowup in regular expression engines
 has repeatedly caused grief in real life.
@@ -330,7 +372,6 @@
 had several causes, at the heart was a regular expression that
 was used to monitor network
 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
-%TODO: data points for some new versions of languages
 These problems with regular expressions 
 are not isolated events that happen
 very occasionally, but are actually widespread.
@@ -345,22 +386,29 @@
 requires
 more research attention.
 
+Regular expressions and regular expression matchers 
+have of course been studied for many, many years.
+One of the most recent works in the context of lexing
+is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
+This work is relevant and we will compare it later on
+with the derivative-based matcher we are going to present.
+There is also newer work called
+Verbatim++~\cite{Verbatimpp}, which does not use derivatives, but automata instead.
+For automata-based approaches, bounded repetitions ($r^{n}$,
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$) pose a problem;
+they often occur in practice, for example in the Snort XML definitions.
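+
+To give a first flavour of the derivative-based approach mentioned
+above, below is a minimal Scala sketch of Brzozowski's classic
+derivative-based matcher; it only answers yes/no questions, and the
+matcher presented in this thesis extends these definitions considerably:
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                    // matches nothing
+case object ONE extends Rexp                     // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// can r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// the derivative of r with respect to the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// r matches s iff the derivative w.r.t. all of s is nullable
+def matches(r: Rexp, s: String): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
+\end{verbatim}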
 
-But the problems are not limited to slowness on certain 
+
+The problems are not limited to slowness in certain 
 cases. 
 Another problem with these libraries is that they
 provide no correctness guarantee.
 In some cases, they either fail to generate a lexing result when there exists a match,
 or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be 
-the regex
-\begin{verbatim}
-(aba|ab|a)*
-\end{verbatim}
-and the string
-\begin{verbatim}
-ababa
-\end{verbatim}
+A concrete example would be the regex
+\begin{center}
+	$(aba + ab + a)^*$ and the string $ababa$.
+\end{center}
 The correct $\POSIX$ match for the above would be 
 with the entire string $ababa$, 
 split into two Kleene star iterations, $[ab] [aba]$ at positions
@@ -375,12 +423,13 @@
 correctly implementing the POSIX (maximum-munch)
 rule of regular expression matching.
 
-As Grathwohl\parencite{grathwohl2014crash} commented,
-\begin{center}
-	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
-\end{center}
-
-
+As Grathwohl~\parencite{grathwohl2014crash} wrote,
+\begin{quote}
+	The POSIX strategy is more complicated than the 
+	greedy because of the dependence on information about 
+	the length of matched strings in the various subexpressions.
+\end{quote}
+%\noindent
 To summarise the above, regular expressions are important.
 They are popular, and programming languages' library functions
 for them are very fast on non-catastrophic cases.
@@ -424,7 +473,6 @@
 representing that regex $r$ is being operated on. 
 We represent the time
 it takes by $P_2(r, s)$.\\
-
 For $\mathit{DFA}$,
 we have $P_2(r, s) = O( |s| )$,
 because we take at most $|s|$ steps, 
@@ -435,13 +483,14 @@
 transition upon receiving an input symbol.
 But unfortunately in the worst case
 $P_1(r) = O(2^{|r|})$. An example will be given later. 
-
-
 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
-expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
+expressions like $r^n$ into 
+\[
+	\underbrace{r \cdots r}_{\text{n copies of r}}.
+\]
 $P_2(r, s)$ is bounded by $O(|r|\cdot|s|)$ if we do not backtrack.
 On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^|s|$.
+to $|r| \cdot 2^{|s|}$.
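+As a back-of-the-envelope illustration of where this exponential
+bound comes from (a sketch, not a formal argument): for $(a^*)^*\,b$
+and the input $\underbrace{aa..a}_{n}$, a backtracking matcher may try
+every way of splitting the $n$ copies of $a$ into non-empty chunks
+consumed by successive iterations of the outer star. There are at least
+\[
+	\sum_{k=1}^{n}\binom{n-1}{k-1} \;=\; 2^{n-1}
+\]
+such splits (the compositions of $n$), and each of them eventually
+fails to find the trailing $b$, which explains the runtimes in
+Figure~\ref{fig:aStarStarb}.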
 %on the input
 %And when calculating the time complexity of the matching algorithm,
 %we are assuming that each input reading step requires constant time.
@@ -638,7 +687,6 @@
 %when the language is still regular).
  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
  %TODO: verify the fact Rust does not allow 1000+ reps
-\ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?}
 
 
 So we have practical implementations