ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 601 ce4e5151a836
parent 600 fd068f39ac23
child 602 46db6ae66448
   149 
   149 
   150 
   150 
   151 \def\zeroable{\textit{zeroable}}
   151 \def\zeroable{\textit{zeroable}}
   152 \def\nub{\textit{nub}}
   152 \def\nub{\textit{nub}}
   153 \def\filter{\textit{filter}}
   153 \def\filter{\textit{filter}}
   154 \def\not{\textit{not}}
   154 %\def\not{\textit{not}}
   155 
   155 
   156 
   156 
   157 
   157 
   158 \def\RZERO{\mathbf{0}_r }
   158 \def\RZERO{\mathbf{0}_r }
   159 \def\RONE{\mathbf{1}_r}
   159 \def\RONE{\mathbf{1}_r}
   190 %This part is about regular expressions, Brzozowski derivatives,
   190 %This part is about regular expressions, Brzozowski derivatives,
   191 %and a bit-coded lexing algorithm with proven correctness and time bounds.
   191 %and a bit-coded lexing algorithm with proven correctness and time bounds.
   192 
   192 
   193 %TODO: look up snort rules to use here--give readers idea of what regexes look like
   193 %TODO: look up snort rules to use here--give readers idea of what regexes look like
   194 
   194 
   195 \begin{figure}
   195 
   196 \centering
   196 
       
   197 
       
   198 
       
   199 
       
Regular expressions are widely used in computer science: 
be it in text editors \parencite{atomEditor} with syntax highlighting and auto-completion;
in command-line tools like $\mathit{grep}$ that facilitate easy 
text-processing; in network intrusion
detection systems that reject suspicious traffic; or in compiler
front ends.
The majority of the solutions to these tasks 
involve lexing with regular 
expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
Indeed, if one supplies a regular expression and a string to the regex engine of
a popular programming language, one can
obtain rich matching information in a very short time.
Some network intrusion detection systems
use regex engines that are able to process 
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
Unfortunately, this is not the case for \textbf{all} inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
       
   219 
       
   220 \begin{figure}[p]
   197 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   221 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   198 \begin{tikzpicture}
   222 \begin{tikzpicture}
   199 \begin{axis}[
   223 \begin{axis}[
   200     xlabel={$n$},
   224     xlabel={$n$},
   201     x label style={at={(1.05,-0.05)}},
   225     x label style={at={(1.05,-0.05)}},
   255     legend pos=north west,
   279     legend pos=north west,
   256     legend cell align=left]
   280     legend cell align=left]
   257 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
   281 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
   258 \end{axis}
   282 \end{axis}
   259 \end{tikzpicture}\\
   283 \end{tikzpicture}\\
   260 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
   284 \begin{tikzpicture}
   261            of the form $\underbrace{aa..a}_{n}$.}
   285 \begin{axis}[
       
   286     xlabel={$n$},
       
   287     x label style={at={(1.05,-0.05)}},
       
   288     ylabel={time in secs},
       
   289     enlargelimits=false,
       
   290     xtick={0,5,...,30},
       
   291     xmax=33,
       
   292     ymax=35,
       
   293     ytick={0,5,...,30},
       
   294     scaled ticks=false,
       
   295     axis lines=left,
       
   296     width=5cm,
       
   297     height=4cm, 
       
   298     legend entries={Dart},  
       
   299     legend pos=north west,
       
   300     legend cell align=left]
       
   301 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
       
   302 \end{axis}
       
   303 \end{tikzpicture}
       
   304   &
       
   305 \begin{tikzpicture}
       
   306 \begin{axis}[
       
   307     xlabel={$n$},
       
   308     x label style={at={(1.05,-0.05)}},
       
   309     %ylabel={time in secs},
       
   310     enlargelimits=false,
       
   311     xtick={0,5,...,30},
       
   312     xmax=33,
       
   313     ymax=35,
       
   314     ytick={0,5,...,30},
       
   315     scaled ticks=false,
       
   316     axis lines=left,
       
   317     width=5cm,
       
   318     height=4cm, 
       
   319     legend entries={Swift},  
       
   320     legend pos=north west,
       
   321     legend cell align=left]
       
   322 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
       
   323 \end{axis}
       
   324 \end{tikzpicture}
       
   325   & \\
       
   326 \multicolumn{3}{c}{Graphs}
   262 \end{tabular}    
   327 \end{tabular}    
   263 \caption{aStarStarb} \label{fig:aStarStarb}
   328 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
   264 \end{figure}
   329            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   265 
   The reason for their superlinear behaviour is that they use a depth-first search:
   if the string does not match, the engine backtracks and explores all remaining possibilities. 
   267 
   332 }\label{fig:aStarStarb}
   268 
   333 \end{figure}\afterpage{\clearpage}
   269 
       
   334 
   291 Take $(a^*)^*\,b$ and ask whether
   335 Take $(a^*)^*\,b$ and ask whether
   292 strings of the form $aa..a$ match this regular
   336 strings of the form $aa..a$ match this regular
   293 expression. Obviously this is not the case---the expected $b$ in the last
   337 expression. Obviously this is not the case---the expected $b$ in the last
   294 position is missing. One would expect that modern regular expression
   338 position is missing. One would expect that modern regular expression
length, say around 30 $a$'s, one discovers that 
this decision takes an unexpectedly long time to finish, given the simplicity of the problem.
This is clearly exponential behaviour, and 
is triggered by some relatively simple regex patterns, as the graphs
in Figure~\ref{fig:aStarStarb} show.
   302 
Java 9 and later versions
improve this behaviour, but the matching is still slow compared 
with the approach we are going to use.
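This behaviour is easy to reproduce with the backtracking regex facilities that ship with
mainstream languages. The following Scala snippet, which simply delegates to
\texttt{java.util.regex}, is a minimal sketch of the kind of measurement behind
Figure~\ref{fig:aStarStarb}; the exact numbers will of course depend on the JVM and
Java version used:
\begin{verbatim}
// Illustrative measurement sketch (not the benchmark code used for the graphs).
// Time how long the JDK's backtracking matcher takes on (a*)*b for strings
// of n a's; these can never match, since the final b is missing.
object CatastrophicBacktracking {
  def seconds[A](block: => A): Double = {
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1e9
  }

  def main(args: Array[String]): Unit = {
    for (n <- 5 to 30 by 5) {
      val input = "a" * n
      val t = seconds { input.matches("(a*)*b") }  // uses java.util.regex
      println(f"n = $n%2d   time = $t%.3f s")
    }
  }
}
\end{verbatim}
The running time grows roughly by a constant factor with each additional $a$, which is
the exponential growth visible in the graphs.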
   305 
   349 
\ChristianComment{Superlinear: I just leave out the explanation, 
which I find would distract from the flow if used. Plus, if I just say exponential
here, the 2016 StackExchange event was not exponential but just quadratic, so that would be 
inaccurate.}
       
   310 
   352 
This superlinear blowup in regular expression engines
has repeatedly caused grief in real life.
   313 For example, on 20 July 2016 one evil
   355 For example, on 20 July 2016 one evil
   314 regular expression brought the webpage
   356 regular expression brought the webpage
   328 2019. A poorly written regular expression exhibited exponential
   370 2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
   333 %TODO: data points for some new versions of languages
       
These problems with regular expressions 
are not isolated events that happen
very occasionally, but are actually widespread.
They occur so often that they have been given a 
name: Regular Expression Denial of Service (ReDoS)
They therefore concluded that evil regular expressions
are ``more than a parlour trick'': they are a problem that
requires
more research attention.
   347 
   388 
   348 
Regular expressions and regular expression matchers 
have of course been studied for many, many years.
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work with which we will later compare
the derivative-based matcher we are going to present.
There is also newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives but automata instead.
For such an automata-based approach, a problem is posed by bounded repetitions ($r^{n}$,
$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
These occur often in practical use, for example in the Snort XML definitions.
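To give a concrete sense of these constructors, writing $L(r)$ for the language of a
regular expression $r$, a bounded repetition simply constrains how many times its body
may be iterated; for instance
\[
	L(a^{3}) = \{aaa\}, \qquad
	L(a^{2\ldots 4}) = \{aa,\; aaa,\; aaaa\}.
\]
Unfolding such counters into explicit copies of the body, as one would have to do with
the basic constructors alone, blows up the size of the expression (think of $a^{1000}$),
which is why they deserve special treatment.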
       
   400 
       
   401 
       
The problems are not limited to slowness on certain 
inputs.
Another problem with these libraries is that they
offer no correctness guarantee.
In some cases, they either fail to produce a lexing result when a match exists,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be the regex
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$
\end{center}
       
The correct $\POSIX$ match for the above is 
the entire string $ababa$, 
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
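Spelled out, the POSIX decomposition of this match is
\[
	ababa \;=\; \underbrace{ab}_{[0,2)} \; \underbrace{aba}_{[2,5)},
\]
where the first Kleene star iteration matches $ab$ and the second matches $aba$.
The only other way of splitting the whole string into iterations of $(aba + ab + a)$
is $[ab][ab][a]$, which POSIX rejects because the second iteration is required to
match as much of the remaining input as possible (maximum munch).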
   373 
   421 
Kuklewicz \parencite{KuklewiczHaskell} commented that most regex libraries do not
correctly implement the POSIX (maximum-munch)
rule of regular expression matching.
   377 
   425 
As Grathwohl \parencite{grathwohl2014crash} wrote,
   379 \begin{center}
   427 \begin{quote}
   380 	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
   428 	The POSIX strategy is more complicated than the 
   381 \end{center}
   429 	greedy because of the dependence on information about 
   382 
   430 	the length of matched strings in the various subexpressions.
   383 
   431 \end{quote}
       
   432 %\noindent
To summarise the above, regular expressions are important.
They are popular, and the regex library functions of programming languages
are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first problem is that the running time might blow up.
The second phase is the lexing phase, in which the input string 
$s$ is read and the data structure
representing the regex $r$ is operated on. 
We represent the time
it takes by $P_2(r, s)$.\\
   427 
       
For a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps, 
and each step takes
at most one transition:
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later. 
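A classic instance of this exponential blow-up (not necessarily the example used later)
is the family of expressions
\[
	(a + b)^* \, a \, \underbrace{(a + b) \cdots (a + b)}_{n-1}
\]
which match exactly the strings over $\{a, b\}$ whose $n$-th-last character is an $a$.
Intuitively, a DFA has to remember the last $n$ characters it has read, so it needs at
least $2^{n}$ states, while the size of the expression only grows linearly in $n$.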
   438 
       
   439 
       
   440 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
   486 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
   441 expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
   487 expressions like $r^n$ into 
       
   488 \[
       
	\underbrace{r \cdots r}_{\text{$n$ copies of $r$}}.
       
   490 \]
$P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound blows up
to $|r|\cdot 2^{|s|}$.
   445 %on the input
   494 %on the input
   446 %And when calculating the time complexity of the matching algorithm,
   495 %And when calculating the time complexity of the matching algorithm,
   447 %we are assuming that each input reading step requires constant time.
   496 %we are assuming that each input reading step requires constant time.
   448 %which translates to that the number of 
   497 %which translates to that the number of 
   449 %states active and transitions taken each time is bounded by a
   498 %states active and transitions taken each time is bounded by a
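To make the $|r|\cdot|s|$ bound more concrete, the following Scala sketch simulates an
NFA breadth-first: it reads the input only once and keeps a \emph{set} of currently
active states, so each input character touches at most as many states as the automaton
has. The automaton below is a hand-written toy NFA for the language of $(a^*)^*\,b$
(which is just $a^*b$); it is only meant as an illustration and, for simplicity, has no
$\epsilon$-transitions:
\begin{verbatim}
// Breadth-first NFA simulation: read the input once, keeping a set of
// active states.  Each of the |s| input characters touches at most |Q|
// states, giving the |r| * |s| bound; no backtracking is involved.
object NfaSim {
  type State = Int

  def step(delta: Map[(State, Char), Set[State]],
           states: Set[State], c: Char): Set[State] =
    states.flatMap(q => delta.getOrElse((q, c), Set.empty[State]))

  def accepts(delta: Map[(State, Char), Set[State]],
              start: Set[State], fin: Set[State], s: String): Boolean =
    s.foldLeft(start)((qs, c) => step(delta, qs, c)).exists(fin.contains)

  def main(args: Array[String]): Unit = {
    // toy NFA for the language of (a*)*b, i.e. a*b (hand-written for
    // illustration, not derived from the regular expression):
    // state 0 loops on 'a' and moves to the accepting state 1 on 'b'
    val delta = Map((0, 'a') -> Set(0), (0, 'b') -> Set(1))
    println(accepts(delta, Set(0), Set(1), "a" * 1000 + "b")) // true
    println(accepts(delta, Set(0), Set(1), "a" * 1000))       // false, still fast
  }
}
\end{verbatim}
A backtracking matcher, in contrast, explores one alternative at a time and may revisit
the same input position exponentially often, which is where the $|r|\cdot 2^{|s|}$ bound
above comes from.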
   636 % Java and Python both support back-references, but shows
   685 % Java and Python both support back-references, but shows
   637 %catastrophic backtracking behaviours on inputs without back-references(
   686 %catastrophic backtracking behaviours on inputs without back-references(
   638 %when the language is still regular).
   687 %when the language is still regular).
   639  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
   688  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
   640  %TODO: verify the fact Rust does not allow 1000+ reps
   689  %TODO: verify the fact Rust does not allow 1000+ reps
   641 \ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?}
       
   642 
   690 
   643 
   691 
So we have practical implementations 
of regular expression matching/lexing which are fast
but do not come with any guarantees that they will not grind to a halt