diff -r fd068f39ac23 -r ce4e5151a836 ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Mon Sep 12 23:32:18 2022 +0200
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Sep 22 00:31:09 2022 +0100
@@ -151,7 +151,7 @@
\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
-\def\not{\textit{not}}
+%\def\not{\textit{not}}
@@ -192,8 +192,32 @@
%TODO: look up snort rules to use here--give readers idea of what regexes look like
-\begin{figure}
-\centering
+
+
+
+
+
+Regular expressions are widely used in computer science:
+be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
+command-line tools like $\mathit{grep}$ that facilitate easy
+text-processing; network intrusion
+detection systems that reject suspicious traffic; or compiler
+front ends. The majority of the solutions to these tasks
+involve lexing with regular
+expressions.
+Given their usefulness and ubiquity, one would imagine that
+modern regular expression matching implementations
+are mature and fully studied.
+Indeed, given a regular expression and a string, the regex
+engine of a popular programming language can usually produce
+rich matching information in a very short time.
+Some network intrusion detection systems
+use regex engines that are able to process
+megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
+Unfortunately, this is not the case for \emph{all} inputs.
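The "rich matching information" mentioned above is easy to observe on well-behaved inputs. The following small Python sketch (the log line and pattern are made up purely for illustration) asks the `re` engine for a full breakdown of a match, which it produces essentially instantly:

```python
import re

# On benign inputs, a mainstream backtracking engine answers in
# microseconds and returns rich information: the overall span and
# every capture group. Python's re module stands in here for "a
# popular programming language's regex engine"; the log line and
# pattern are hypothetical examples, not taken from the thesis.
log_line = '2022-09-22 00:31:09 GET /index.html 200'
pattern = r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (\S+) (\d+)'

m = re.match(pattern, log_line)

assert m is not None
assert m.group(1) == '2022-09-22'   # date capture
assert m.group(3) == 'GET'          # HTTP method capture
assert m.group(5) == '200'          # status code capture
assert m.span() == (0, 39)          # the whole line matched
```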
+%TODO: get source for SNORT/BRO's regex matching engine/speed + +\begin{figure}[p] \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} \begin{tikzpicture} \begin{axis}[ @@ -257,36 +281,56 @@ \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; \end{axis} \end{tikzpicture}\\ -\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings - of the form $\underbrace{aa..a}_{n}$.} +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Dart}, + legend pos=north west, + legend cell align=left] +\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Swift}, + legend pos=north west, + legend cell align=left] +\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; +\end{axis} +\end{tikzpicture} + & \\ +\multicolumn{3}{c}{Graphs} \end{tabular} -\caption{aStarStarb} \label{fig:aStarStarb} -\end{figure} - - - - - -Regular expressions are widely used in computer science: -be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; -command-line tools like $\mathit{grep}$ that facilitate easy -text-processing; network intrusion -detection systems that reject suspicious traffic; or compiler -front ends--the majority of the solutions to these tasks -involve lexing with regular -expressions. -Given its usefulness and ubiquity, one would imagine that -modern regular expression matching implementations -are mature and fully studied. 
-Indeed, in a popular programming language' regex engine,
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
-use regex engines that are able to process
-megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
-%TODO: get source for SNORT/BRO's regex matching engine/speed
-
+\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
+  of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
+  The reason for their superlinear behaviour is that they perform a depth-first search:
+  if the string does not match, the engine backtracks and explores all remaining possibilities.
+}\label{fig:aStarStarb}
+\end{figure}\afterpage{\clearpage}

Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$
match this regular
@@ -299,14 +343,12 @@
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns, as the graphs
in \ref{fig:aStarStarb} show.
-
+Java 9 and newer
+versions improve on this behaviour, but are still slow compared
+with the approach we are going to use.

-\ChristianComment{Superlinear I just leave out the explanation
-which I find once used would distract the flow. Plus if i just say exponential
-here the 2016 event in StackExchange was not exponential, but just quardratic so would be
-in accurate}

This superlinear blowup in regular expression engines
had repeatedly caused grief in real life.
@@ -330,7 +372,6 @@
had several causes, at the heart was a regular
expression that was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
-%TODO: data points for some new versions of languages

These problems with regular expressions
are not isolated events that happen
very occasionally,
but actually widespread.
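The blowup on $(a^*)^*\,b$ can be reproduced directly with Python's backtracking `re` engine. This is a minimal timing sketch, not part of the thesis's benchmark setup; the input sizes are kept small because the running time roughly doubles with each extra character:

```python
import re
import time

# The pattern (a*)*b from the text: nested stars force a backtracking
# engine to try many ways of splitting the a's among the two stars
# before it can conclude that the final 'b' is missing.
PATTERN = re.compile(r'(a*)*b')

def time_failure(n):
    """Time how long the engine takes to *fail* on n copies of 'a'."""
    s = 'a' * n
    start = time.perf_counter()
    result = PATTERN.match(s)
    return result, time.perf_counter() - start

for n in (5, 10, 15, 20):
    result, elapsed = time_failure(n)
    # There is no 'b', so the match always fails -- but only after the
    # engine has explored the exponentially many splittings.
    assert result is None
    print(f'n = {n:2d}: {elapsed:.4f}s')
```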
@@ -345,22 +386,29 @@
requires more research attention.

+Regular expressions and regular expression matchers
+have of course been studied for many, many years.
+One of the most recent works in the context of lexing
+is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
+This is relevant work and we will compare it later on
+with our derivative-based matcher.
+There is also some newer work called
+Verbatim++\cite{Verbatimpp}, which does not use derivatives, but automata instead.
+For automata, a problem is posed by the bounded repetitions ($r^{n}$,
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
+These often occur in practice, for example in the Snort XML definitions.

-But the problems are not limited to slowness on certain
+
+The problems are not limited to slowness on certain
cases.
Another thing about these libraries
is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing
result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be
-the regex
-\begin{verbatim}
-(aba|ab|a)*
-\end{verbatim}
-and the string
-\begin{verbatim}
-ababa
-\end{verbatim}
+A concrete example would be the regex
+\begin{center}
+	$(aba + ab + a)^*$ and the string $ababa$
+\end{center}
The correct $\POSIX$ match for the above
would be
with the entire string $ababa$,
split into two Kleene star iterations, $[ab] [aba]$ at positions
@@ -375,12 +423,13 @@
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
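The divergence from POSIX on this very example can be demonstrated with Python's `re` module, whose backtracking engine uses leftmost-greedy (PCRE-style) rather than leftmost-longest (POSIX) semantics:

```python
import re

# The regex (aba|ab|a)* from the text, run against 'ababa'.
# Python's engine is greedy and ordered: the first iteration of the
# star picks the first alternative that matches, namely 'aba'.
m = re.match(r'(aba|ab|a)*', 'ababa')

# After consuming 'aba', the remaining input is 'ba', where no
# alternative matches, so the star simply stops.  The overall match
# covers only the first three characters ...
assert m.group(0) == 'aba'

# ... whereas the POSIX longest-match rule demands the entire string
# 'ababa', split into the two iterations [ab][aba].
assert m.group(0) != 'ababa'
```

Note that forcing a whole-string match (e.g. with `re.fullmatch`) makes the engine backtrack into the `[ab][aba]` split, but a plain leftmost match, as above, stops early.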
-As Grathwohl\parencite{grathwohl2014crash} commented,
-\begin{center}
-	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
-\end{center}
-
-
+As Grathwohl\parencite{grathwohl2014crash} wrote,
+\begin{quote}
+	The POSIX strategy is more complicated than the
+	greedy because of the dependence on information about
+	the length of matched strings in the various subexpressions.
+\end{quote}
+%\noindent
To summarise the above, regular expressions are important.
They are popular and programming languages' library functions
for them are very fast on non-catastrophic cases.
@@ -424,7 +473,6 @@
representing that regex $r$ is being operated on.
We represent the time
it takes by $P_2(r, s)$.\\
-
For $\mathit{DFA}$,
we have $P_2(r, s) = O( |s| )$,
because we take at most $|s|$ steps,
@@ -435,13 +483,14 @@
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(exp^{|r|})$. An example will be given later.
-
-
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
-expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
+expressions like $r^n$ into
+\[
+	\underbrace{r \cdots r}_{n\text{ copies of }r}.
+\]
The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^|s|$.
+to $|r|\cdot 2^{|s|}$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
@@ -638,7 +687,6 @@
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
-\ChristianComment{Comment required: Java 17 updated graphs?
Is it ok to still use Java 8 graphs?} So we have practical implementations