ChengsongTanPhdThesis/Chapters/Chapter1.tex
changeset 518 ff7945a988a3
parent 516 6fecb7fe8cd0
child 519 856d025dbc15
equal deleted inserted replaced
517:3c5e58d08939 518:ff7945a988a3
    39 
    39 
    40 \def\AZERO{\textit{AZERO}}
    40 \def\AZERO{\textit{AZERO}}
    41 \def\AONE{\textit{AONE}}
    41 \def\AONE{\textit{AONE}}
    42 \def\ACHAR{\textit{ACHAR}}
    42 \def\ACHAR{\textit{ACHAR}}
    43 
    43 
    44 
    44 \def\POSIX{\textit{POSIX}}
    45 \def\ALTS{\textit{ALTS}}
    45 \def\ALTS{\textit{ALTS}}
    46 \def\ASTAR{\textit{ASTAR}}
    46 \def\ASTAR{\textit{ASTAR}}
    47 \def\DFA{\textit{DFA}}
    47 \def\DFA{\textit{DFA}}
    48 \def\bmkeps{\textit{bmkeps}}
    48 \def\bmkeps{\textit{bmkeps}}
    49 \def\retrieve{\textit{retrieve}}
    49 \def\retrieve{\textit{retrieve}}
    50 \def\blexer{\textit{blexer}}
    50 \def\blexer{\textit{blexer}}
    51 \def\flex{\textit{flex}}
    51 \def\flex{\textit{flex}}
    52 \def\inj{\mathit{inj}}
    52 \def\inj{\mathit{inj}}
    53 \def\Empty{\mathit{Empty}}
    53 \def\Empty{\mathit{Empty}}
    54 \def\Left{\mathit{Left}}xc
    54 \def\Left{\mathit{Left}}
    55 \def\Right{\mathit{Right}}
    55 \def\Right{\mathit{Right}}
    56 \def\Stars{\mathit{Stars}}
    56 \def\Stars{\mathit{Stars}}
    57 \def\Char{\mathit{Char}}
    57 \def\Char{\mathit{Char}}
    58 \def\Seq{\mathit{Seq}}
    58 \def\Seq{\mathit{Seq}}
    59 \def\Der{\mathit{Der}}
    59 \def\Der{\mathit{Der}}
    62 \def\S{\mathit{S}}
    62 \def\S{\mathit{S}}
    63 \def\rup{r^\uparrow}
    63 \def\rup{r^\uparrow}
    64 %\def\bderssimp{\mathit{bders}\_\mathit{simp}}
    64 %\def\bderssimp{\mathit{bders}\_\mathit{simp}}
    65 \def\distinctWith{\textit{distinctWith}}
    65 \def\distinctWith{\textit{distinctWith}}
    66 
    66 
       
    67 \def\size{\mathit{size}}
    67 \def\rexp{\mathbf{rexp}}
    68 \def\rexp{\mathbf{rexp}}
    68 \def\simp{\mathit{simp}}
    69 \def\simp{\mathit{simp}}
    69 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
    70 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
    70 \def\map{\mathit{map}}
    71 \def\map{\mathit{map}}
    71 \def\distinct{\mathit{distinct}}
    72 \def\distinct{\mathit{distinct}}
    77 \newcommand\rnullable[1]{\textit{rnullable}(#1)}
    78 \newcommand\rnullable[1]{\textit{rnullable}(#1)}
    78 \newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
    79 \newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
    79 \newcommand\asize[1]{\llbracket #1 \rrbracket}
    80 \newcommand\asize[1]{\llbracket #1 \rrbracket}
    80 \newcommand\rerase[1]{ (#1)\downarrow_r}
    81 \newcommand\rerase[1]{ (#1)\downarrow_r}
    81 
    82 
       
    83 
    82 \def\erase{\textit{erase}}
    84 \def\erase{\textit{erase}}
    83 \def\STAR{\textit{STAR}}
    85 \def\STAR{\textit{STAR}}
    84 \def\flts{\textit{flts}}
    86 \def\flts{\textit{flts}}
    85 
    87 
    86 
    88 
    98 
   100 
    99 %TODO: look up snort rules to use here--give readers idea of what regexes look like
   101 %TODO: look up snort rules to use here--give readers idea of what regexes look like
   100 
   102 
   101 
   103 
   102 Regular expressions are widely used in computer science: 
   104 Regular expressions are widely used in computer science: 
   103 be it in text-editors\parencite{atomEditor} with syntax hightlighting and auto completion, 
   105 be it in text-editors\parencite{atomEditor} with syntax highlighting and auto-completion, 
   104 command line tools like $\mathit{grep}$ that facilitates easy 
   106 command-line tools like $\mathit{grep}$ that facilitate easy 
   105 text processing , network intrusion
   107 text processing, network intrusion
   106 detection systems that rejects suspicious traffic, or compiler
   108 detection systems that reject suspicious traffic, or compiler
   107 front ends--the majority of the solutions to these tasks 
   109 front ends--the majority of the solutions to these tasks 
   108 involve lexing with regular 
   110 involve lexing with regular 
   109 expressions.
   111 expressions.
   110 
       
   111 Given its usefulness and ubiquity, one would imagine that
   112 Given its usefulness and ubiquity, one would imagine that
   112 modern regular expression matching implementations
   113 modern regular expression matching implementations
   113 are mature and fully-studied.
   114 are mature and fully studied.
   114 
   115 Indeed, in a popular programming language' regex engine, 
   115 If you go to a popular programming language's
   116 supplying it with regular expressions and strings, one can
   116 regex engine,
   117 get rich matching information in a very short time.
    117 you can supply it with regexes and strings of your own,
   118 Some network intrusion detection systems
   118 and get matching/lexing  information such as how a 
   119 use regex engines that are able to process 
   119 sub-part of the regex matches a sub-part of the string.
   120 megabytes or even gigabytes of data per second\parencite{Turo_ov__2020}.
   120 These lexing libraries are on average quite fast.
   121 Unfortunately, this is not the case for $\mathbf{all}$ inputs.
   121 %TODO: get source for SNORT/BRO's regex matching engine/speed
   122 %TODO: get source for SNORT/BRO's regex matching engine/speed
   122 For example, the regex engines some network intrusion detection
       
   123 systems use are able to process
       
   124  megabytes or even gigabytes of network traffic per second.
       
   125  
       
    126 Why do we need our own version, if the existing algorithms are
       
    127 already blindingly fast? It turns out that they are not always fast.
       
   128 
   123 
   129 
   124 
   130 Take $(a^*)^*\,b$ and ask whether
   125 Take $(a^*)^*\,b$ and ask whether
   131 strings of the form $aa..a$ match this regular
   126 strings of the form $aa..a$ match this regular
   132 expression. Obviously this is not the case---the expected $b$ in the last
   127 expression. Obviously this is not the case---the expected $b$ in the last
   133 position is missing. One would expect that modern regular expression
   128 position is missing. One would expect that modern regular expression
   134 matching engines can find this out very quickly. Alas, if one tries
   129 matching engines can find this out very quickly. Alas, if one tries
   135 this example in JavaScript, Python or Java 8 with strings like 28
   130 this example in JavaScript, Python or Java 8, even with strings of a small
   136 $a$'s, one discovers that this decision takes around 30 seconds and
   131 length, say around 30 $a$'s, one discovers that 
   137 takes considerably longer when adding a few more $a$'s, as the graphs
    132 this decision takes an unreasonably long time given the simplicity of the problem.
       
   133 This is clearly exponential behaviour, and 
       
   134 is triggered by some relatively simple regex patterns, as the graphs
   138 below show:
   135 below show:
   139 
   136 
   140 \begin{center}
   137 \begin{figure}
       
   138 \centering
   141 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   139 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   142 \begin{tikzpicture}
   140 \begin{tikzpicture}
   143 \begin{axis}[
   141 \begin{axis}[
   144     xlabel={$n$},
   142     xlabel={$n$},
   145     x label style={at={(1.05,-0.05)}},
   143     x label style={at={(1.05,-0.05)}},
   202 \end{axis}
   200 \end{axis}
   203 \end{tikzpicture}\\
   201 \end{tikzpicture}\\
   204 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
   202 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
   205            of the form $\underbrace{aa..a}_{n}$.}
   203            of the form $\underbrace{aa..a}_{n}$.}
   206 \end{tabular}    
   204 \end{tabular}    
   207 \end{center}  
   205 \caption{aStarStarb} \label{fig:aStarStarb}
   208 
   206 \end{figure}
   209 
   207 
   210 This is clearly exponential behaviour, and 
   208 
   211 is triggered by some relatively simple regex patterns.
       
   212 
   209 
   213 
   210 
    214 This superlinear blowup in matching algorithms sometimes causes
    211 This superlinear blowup in matching algorithms sometimes causes
   215 considerable grief in real life: for example on 20 July 2016 one evil
   212 considerable grief in real life: for example on 20 July 2016 one evil
   216 regular expression brought the webpage
   213 regular expression brought the webpage
   226 the string was $O(n^2)$ with respect to the string length. This
   223 the string was $O(n^2)$ with respect to the string length. This
   227 quadratic overhead was enough for the homepage of Stack Exchange to
   224 quadratic overhead was enough for the homepage of Stack Exchange to
   228 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   225 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   229 attack and therefore stopped the servers from responding to any
   226 attack and therefore stopped the servers from responding to any
    230 requests. This made the whole site unavailable. 
    227 requests. This made the whole site unavailable. 
   231 A more 
   228 A more recent example is a global outage of all Cloudflare servers on 2 July
   232 recent example is a global outage of all Cloudflare servers on 2 July
       
   233 2019. A poorly written regular expression exhibited exponential
   229 2019. A poorly written regular expression exhibited exponential
   234 behaviour and exhausted CPUs that serve HTTP traffic. Although the
   230 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
   235 outage had several causes, at the heart was a regular expression that
   231 had several causes, at the heart was a regular expression that
   236 was used to monitor network
   232 was used to monitor network
   237 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
   233 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
   238 %TODO: data points for some new versions of languages
   234 %TODO: data points for some new versions of languages
   239 These problems with regular expressions 
   235 These problems with regular expressions 
   240 are not isolated events that happen
   236 are not isolated events that happen
   241 very occasionally, but actually quite widespread.
    237 very occasionally, but are actually widespread.
   242 They occur so often that they get a 
   238 They occur so often that they get a 
   243 name--Regular-Expression-Denial-Of-Service (ReDoS)
   239 name--Regular-Expression-Denial-Of-Service (ReDoS)
   244 attack.
   240 attack.
   245 Davis et al. \parencite{Davis18} detected more
   241 Davis et al. \parencite{Davis18} detected more
   246 than 1000 super-linear (SL) regular expressions
   242 than 1000 super-linear (SL) regular expressions
   248 They therefore concluded that evil regular expressions
   244 They therefore concluded that evil regular expressions
    249 are more than just ``a parlour trick'', but a problem that
    245 are more than just ``a parlour trick'', but a problem that
   250 requires
   246 requires
   251 more research attention.
   247 more research attention.
   252 
   248 
   253  \section{Why are current algorithm for regexes slow?}
   249  \section{The Problem Behind Slow Cases}
   254 
   250 
   255 %find literature/find out for yourself that REGEX->DFA on basic regexes
   251 %find literature/find out for yourself that REGEX->DFA on basic regexes
   256 %does not blow up the size
   252 %does not blow up the size
   257 Shouldn't regular expression matching be linear?
   253 Shouldn't regular expression matching be linear?
   258 How can one explain the super-linear behaviour of the 
   254 How can one explain the super-linear behaviour of the 
   305 that can be compiled and run. 
   301 that can be compiled and run. 
    306 The good thing about $\mathit{DFA}$s is that once
    302 The good thing about $\mathit{DFA}$s is that once
   307 generated, they are fast and stable, unlike
   303 generated, they are fast and stable, unlike
   308 backtracking algorithms. 
   304 backtracking algorithms. 
   309 However, they do not scale well with bounded repetitions.\\
   305 However, they do not scale well with bounded repetitions.\\
   310 
       
   311 
       
   312 \begin{center}
       
   313 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   314 \begin{tikzpicture}
       
   315 \begin{axis}[
       
   316     ymode=log,
       
   317     xlabel={$n$},
       
   318     x label style={at={(1.05,-0.05)}},
       
   319     ylabel={time in secs},
       
   320     enlargelimits=false,
       
   321     xtick={1,2,...,8},
       
   322     xmax=9,
       
   323     ymax=16000000,
       
   324     ytick={0,500,...,3500},
       
   325     scaled ticks=false,
       
   326     axis lines=left,
       
   327     width=5cm,
       
   328     height=4cm, 
       
   329     legend entries={JavaScript},  
       
   330     legend pos=north west,
       
   331     legend cell align=left]
       
   332 \addplot[red,mark=*, mark options={fill=white}] table {re-chengsong.data};
       
   333 \end{axis}
       
   334 \end{tikzpicture}
       
   335 \end{tabular}
       
   336 \end{center}
       
   337 
       
    338 Another graph shows how the maximum size grows with the number of $a$'s:
       
   339 \begin{center}
       
   340 \begin{tikzpicture}
       
   341     \begin{axis}[
       
   342         height=0.5\textwidth,
       
   343         width=\textwidth,
       
   344         xlabel=number of a's,
       
   345         xtick={0,...,9},
       
   346         ylabel=maximum size,
       
   347         ymode=log,
       
   348        log basis y={2}
       
   349 ]
       
   350         \addplot[mark=*,blue] table {re-chengsong.data};
       
   351     \end{axis}
       
   352 \end{tikzpicture}
       
   353 \end{center}
       
   354 
   306 
   355 
   307 
   356 
   308 
   357 
   309 
   358 Bounded repetitions, usually written in the form
   310 Bounded repetitions, usually written in the form
   467 \end{center}
   419 \end{center}
   468 The concrete example
   420 The concrete example
   469 $((a|b|c|\ldots|z)^*)\backslash 1$
   421 $((a|b|c|\ldots|z)^*)\backslash 1$
    470 would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.\\
    422 would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.\\
    471 Back-reference is a construct in the ``regex'' standard
    423 Back-reference is a construct in the ``regex'' standard
   472 that programmers found quite useful, but not exactly 
   424 that programmers found useful, but not exactly 
   473 regular any more.
   425 regular any more.
    474 In fact, back-references allow regexes to express
    426 In fact, back-references allow regexes to express
    475 languages that are not contained in the class of context-free
    427 languages that are not contained in the class of context-free
    476 languages either.
    428 languages either.
    477 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
    429 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   504  %TODO: verify the fact Rust does not allow 1000+ reps
   456  %TODO: verify the fact Rust does not allow 1000+ reps
   505  %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
   457  %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
   506 \section{Buggy Regex Engines} 
   458 \section{Buggy Regex Engines} 
   507 
   459 
   508 
   460 
   509  Another thing about the these libraries is that there
   461  Another thing about these libraries is that there
   510  is no correctness guarantee.
   462  is no correctness guarantee.
   511  In some cases they either fails to generate a lexing result when there is a match,
   463  In some cases, they either fail to generate a lexing result when there exists a match,
   512  or gives the wrong way of matching.
    464  or produce a match that is not the intended one.
   513  
   465  
   514 
   466 
   515 It turns out that regex libraries not only suffer from 
   467 It turns out that regex libraries not only suffer from 
   516 exponential backtracking problems, 
   468 exponential backtracking problems, 
    517 but also from undesired (or even buggy) outputs.
    469 but also from undesired (or even buggy) outputs.
   555 %\section{How people solve problems with regexes}
   507 %\section{How people solve problems with regexes}
   556 
   508 
   557 
   509 
   558 When a regular expression does not behave as intended,
   510 When a regular expression does not behave as intended,
   559 people usually try to rewrite the regex to some equivalent form
   511 people usually try to rewrite the regex to some equivalent form
   560 or they try to avoid the possibly problematic patterns completely\parencite{Davis18},
   512 or they try to avoid the possibly problematic patterns completely,
   561 of which there are many false positives.
   513 for which many false positives exist\parencite{Davis18}.
    562 Animated tools to ``debug'' regular expressions
    514 Animated tools to ``debug'' regular expressions
   563 are also quite popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} 
   515 are also popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} 
   564 to name a few.
   516 to name a few.
   565 There is also static analysis work on regular expressions that
   517 We are also aware of static analysis work on regular expressions that
    566 aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
    518 aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
   567 \parencite{Rathnayake2014StaticAF} proposed an algorithm
   519 \parencite{Rathnayake2014StaticAF} proposed an algorithm
   568 that detects regular expressions triggering exponential
   520 that detects regular expressions triggering exponential
    569 behaviour on backtracking matchers.
    521 behaviour on backtracking matchers.
   570 Weideman \parencite{Weideman2017Static} came up with 
   522 Weideman \parencite{Weideman2017Static} came up with 
   606 
   558 
   607 
   559 
   608 
   560 
   609 %----------------------------------------------------------------------------------------
   561 %----------------------------------------------------------------------------------------
   610 
   562 
   611 \section{Our Contribution}
   563 \section{Contribution}
   612 
   564 
   613 
   565 
   614 
   566 
   615 This work addresses the vulnerability of super-linear and
   567 This work addresses the vulnerability of super-linear and
   616 buggy regex implementations by the combination
   568 buggy regex implementations by the combination
   625 Antimirov.
   577 Antimirov.
   626 
   578 
   627  
   579  
   628 The main contribution of this thesis is a proven correct lexing algorithm
   580 The main contribution of this thesis is a proven correct lexing algorithm
   629 with formalized time bounds.
   581 with formalized time bounds.
   630 To our best knowledge, there is no lexing libraries using Brzozowski derivatives
    582 To the best of our knowledge, no lexing libraries using Brzozowski derivatives
   631 that have a provable time guarantee, 
   583 have a provable time guarantee, 
   632 and claims about running time are usually speculative and backed by thin empirical
   584 and claims about running time are usually speculative and backed by thin empirical
   633 evidence.
   585 evidence.
   634 %TODO: give references
   586 %TODO: give references
    635 For example, Sulzmann and Lu proposed an algorithm in which they
    587 For example, Sulzmann and Lu proposed an algorithm in which they
    636 claimed a linear running time.
    588 claimed a linear running time.
   679 
   631 
    680 The language, or the set of strings denoted by a regular expression, is defined as
    632 The language, or the set of strings denoted by a regular expression, is defined as
   681 %TODO: FILL in the other defs
   633 %TODO: FILL in the other defs
   682 \begin{center}
   634 \begin{center}
   683 \begin{tabular}{lcl}
   635 \begin{tabular}{lcl}
    684 $L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
    636 $L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; (r_2)$\\
    685 $L \; r_1 \cdot r_2$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; r_1 \wedge s_2 \in L \; r_2\}$\\
    637 $L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
   686 \end{tabular}
   638 \end{tabular}
   687 \end{center}
   639 \end{center}
    688 This is also called the ``language interpretation'' of regular expressions.
    640 This is also called the ``language interpretation'' of regular expressions.
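For instance, under this interpretation the regular expression $(a + b)\cdot c$ denotes a set with exactly two strings:
\begin{center}
$L \; ((a + b)\cdot c) \;=\; \{s_1 @ s_2 \mid s_1 \in \{a, b\} \wedge s_2 \in \{c\}\} \;=\; \{ac,\; bc\}$
\end{center}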
   689 
   641 
   690 
   642 
   694 strings without the head character $c$.
   646 strings without the head character $c$.
   695 
   647 
    696 Formally, we first define such a transformation on any set of strings, which
    648 Formally, we first define such a transformation on any set of strings, which
   697 we call semantic derivative:
   649 we call semantic derivative:
   698 \begin{center}
   650 \begin{center}
   699 $\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
   651 $\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
   700 \end{center}
   652 \end{center}
    701 Mathematically, it is the set of tails of those strings in the string set that start with the character $c$.
    653 Mathematically, it is the set of tails of those strings in $A$ that start with the character $c$.
   702 
   654 
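For example, applied to a small concrete set of strings (an illustrative example):
\begin{center}
$\Der \; a \; \{ab,\; aa,\; ba\} \;=\; \{b,\; a\}$
\end{center}
The string $ba$ does not start with $a$ and therefore contributes nothing.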
    703 If the $\textit{StringSet}$ happens to have some structure, for example,
    655 If the set $A$ happens to have some structure, for example,
   704 if it is regular, then we have that it
   656 if it is regular, then we have that it
    786 a new regular expression denoting all the strings it contains
    738 a new regular expression denoting all the strings it contains
    787 with a certain head character $c$ chopped off.
    739 with a certain head character $c$ chopped off.
   788 The most involved cases are the sequence 
   740 The most involved cases are the sequence 
   789 and star case.
   741 and star case.
   790 The sequence case says that if the first regular expression
   742 The sequence case says that if the first regular expression
   791 contains an empty string then second component of the sequence
    743 contains the empty string, then the second component of the sequence
   792 might be chosen as the target regular expression to be chopped
   744 might be chosen as the target regular expression to be chopped
   793 off its head character.
   745 off its head character.
   794 The star regular expression unwraps the iteration of
    746 The star regular expression's derivative unwraps one iteration of the
   795 regular expression and attack the star regular expression
   747 regular expression and attaches the star regular expression
   796 to its back again to make sure there are 0 or more iterations
   748 to the sequence's second element to make sure a copy is retained
   797 following this unfolded iteration.
    749 for possibly more iterations in later phases of lexing.
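To make the sequence and star cases concrete, the derivative operation can be sketched in Scala roughly as follows. This is only an illustrative sketch with ad-hoc constructor names, not the formal definition used in the rest of this thesis, but it follows the clauses just described:
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                      // matches nothing
case object ONE extends Rexp                       // matches only the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    // if r1 can match the empty string, the head character
    // may also be chopped off from r2
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) =>
    // unwrap one iteration and keep the star for further iterations
    SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}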
   798 
   750 
   799 
   751 
   800 The main property of the derivative operation
   752 The main property of the derivative operation
   801 that enables us to reason about the correctness of
   753 that enables us to reason about the correctness of
   802 an algorithm using derivatives is 
   754 an algorithm using derivatives is 
   842 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   794 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
   843 
   795 
    844 This is a beautiful and simple definition.
    796 This is a beautiful and simple definition.
   845 
   797 
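Expressed as code (continuing the illustrative Scala sketch from above), the matcher simply folds the derivative over the string and tests nullability of the final result:
\begin{verbatim}
// derivative with respect to a whole string
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// the matcher: s is in L(r) iff the final derivative is nullable
def matcher(r: Rexp, s: String): Boolean =
  nullable(ders(s.toList, r))
\end{verbatim}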
   846 If we implement the above algorithm naively, however,
   798 If we implement the above algorithm naively, however,
   847 the algorithm can be excruciatingly slow. For example, when starting with the regular
   799 the algorithm can be excruciatingly slow. 
   848 expression $(a + aa)^*$ and building 12 successive derivatives
   800 
       
   801 
       
   802 \begin{figure}
       
   803 \centering
       
   804 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   805 \begin{tikzpicture}
       
   806 \begin{axis}[
       
   807     xlabel={$n$},
       
   808     x label style={at={(1.05,-0.05)}},
       
   809     ylabel={time in secs},
       
   810     enlargelimits=false,
       
   811     xtick={0,5,...,30},
       
   812     xmax=33,
       
   813     ymax=10000,
       
   814     ytick={0,1000,...,10000},
       
   815     scaled ticks=false,
       
   816     axis lines=left,
       
   817     width=5cm,
       
   818     height=4cm, 
       
   819     legend entries={JavaScript},  
       
   820     legend pos=north west,
       
   821     legend cell align=left]
       
   822 \addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
       
   823 \end{axis}
       
   824 \end{tikzpicture}\\
       
   825 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   826            of the form $\underbrace{aa..a}_{n}$.}
       
   827 \end{tabular}    
       
   828 \caption{EightThousandNodes} \label{fig:EightThousandNodes}
       
   829 \end{figure}
       
   830 
       
   831 
       
   832 (8000 node data to be added here)
       
   833 For example, when starting with the regular
       
   834 expression $(a + aa)^*$ and building a few successive derivatives (around 10)
   849 w.r.t.~the character $a$, one obtains a derivative regular expression
   835 w.r.t.~the character $a$, one obtains a derivative regular expression
   850 with more than 8000 nodes (when viewed as a tree). Operations like
    836 with more than 8000 nodes (when viewed as a tree), see Figure~\ref{fig:EightThousandNodes}.
       
   837 The reason why $(a + aa) ^*$ explodes so drastically is that without
       
   838 pruning, the algorithm will keep records of all possible ways of matching:
       
   839 \begin{center}
       
    840 $(a + aa) ^* \backslash (aa) = (\ZERO + (\ZERO a + \ONE))\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
       
   841 \end{center}
       
   842 
       
   843 \noindent
       
    844 Each of the non-trivial alternative branches above corresponds to one way of matching: 
       
    845 $aa$, $a \quad a$ and $a \quad a \cdot (a)$ (the last one being incomplete).
       
   846 These different ways of matching will grow exponentially with the string length,
       
   847 and without simplifications that throw away some of these very similar matchings,
       
   848 it is no surprise that these expressions grow so quickly.
       
   849 Operations like
   851 $\backslash$ and $\nullable$ need to traverse such trees and
   850 $\backslash$ and $\nullable$ need to traverse such trees and
   852 consequently the bigger the size of the derivative the slower the
   851 consequently the bigger the size of the derivative the slower the
   853 algorithm. 
   852 algorithm. 
   854 
   853 
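The growth is easy to observe experimentally. Reusing the illustrative Scala sketch from above, together with an ad-hoc size function that counts tree nodes, one can print the sizes of successive derivatives of $(a + aa)^*$; the exact numbers depend on the chosen representation, but the rapid growth is unmistakable:
\begin{verbatim}
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
}

val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
// size of the derivative after n characters 'a'
for (n <- 1 to 12)
  println(s"$n: ${size(ders(List.fill(n)('a'), r))}")
\end{verbatim}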
    855 Brzozowski was quick in finding that during this process a lot of useless
    854 Brzozowski was quick in finding that during this process a lot of useless
   947 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
   946 Take the example where $r= (a^*\cdot a^*)^*$ and the string 
   948 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   947 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
   949 The number of different ways of matching 
   948 The number of different ways of matching 
   950 without allowing any value under a star to be flattened
   949 without allowing any value under a star to be flattened
   951 to an empty string can be given by the following formula:
   950 to an empty string can be given by the following formula:
   952 \begin{center}
   951 \begin{equation}
   953 	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
   952 	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
   954 \end{center}
   953 \end{equation}
   955 and a closed form formula can be calculated to be
   954 and a closed form formula can be calculated to be
   956 \begin{equation}
   955 \begin{equation}
   957 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
   956 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
   958 \end{equation}
   957 \end{equation}
    959 which clearly grows exponentially.
    958 which clearly grows exponentially.
   960 
   959 
   961 A lexer aimed at getting all the possible values has an exponential
   960 A lexer aimed at getting all the possible values has an exponential
   962 worst case runtime. Therefore it is impractical to try to generate
   961 worst case runtime. Therefore it is impractical to try to generate
   963 all possible matches in a run. In practice, we are usually 
   962 all possible matches in a run. In practice, we are usually 
    964 interested in POSIX values, which by intuition always
    963 interested in POSIX values, which by intuition always
   965 match the leftmost regular expression when there is a choice
   964 \begin{itemize}
   966 and always match a sub part as much as possible before proceeding
   965 \item
   967 to the next token. For example, the above example has the POSIX value
   966 match the leftmost regular expression when multiple options of matching
       
   967 are available  
       
   968 \item 
       
   969 always match a subpart as much as possible before proceeding
       
   970 to the next token.
       
   971 \end{itemize}
       
   972 
       
   973 
       
   974  For example, the above example has the POSIX value
    968 $ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
    975 $ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
    969 The output of the algorithm we want is a POSIX matching
    976 The output of the algorithm we want is a POSIX matching
   970 encoded as a value.
   977 encoded as a value.
       
   978 The reason why we are interested in $\POSIX$ values is that they can
       
   979 be practically used in the lexing phase of a compiler front end.
       
   980 For instance, when lexing a code snippet 
       
   981 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
       
   982 as an identifier rather than a keyword.
       
   983 
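To give an idea of how such values can be represented, the constructors used above correspond roughly to the following Scala datatype (an illustrative sketch only; $\Char$ and $\Seq$ are renamed to avoid clashes with Scala's built-in types):
\begin{verbatim}
abstract class Val
case object Empty extends Val                   // the value for ONE
case class Chr(c: Char) extends Val             // the value for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val   // one value per SEQ component
case class Left(v: Val) extends Val             // the left ALT branch was taken
case class Right(v: Val) extends Val            // the right ALT branch was taken
case class Stars(vs: List[Val]) extends Val     // one value per star iteration
\end{verbatim}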
   971 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   984 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   972 algorithm by a second phase (the first phase being building successive
   985 algorithm by a second phase (the first phase being building successive
   973 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   986 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   974 is generated in case the regular expression matches  the string. 
   987 is generated in case the regular expression matches  the string. 
   975 Pictorially, the Sulzmann and Lu algorithm is as follows:
   988 Pictorially, the Sulzmann and Lu algorithm is as follows:
  1254   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  1267   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  1255   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
  1268   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
  1256         $\textit{if}\;c=d\; \;\textit{then}\;
  1269         $\textit{if}\;c=d\; \;\textit{then}\;
  1257          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
  1270          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
  1258   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
  1271   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
  1259   $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
  1272   $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
  1260   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
  1273   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
  1261      $\textit{if}\;\textit{bnullable}\,a_1$\\
  1274      $\textit{if}\;\textit{bnullable}\,a_1$\\
  1262 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
  1275 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
  1263 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  1276 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  1264   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  1277   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  1314 %\end{definition}
  1327 %\end{definition}
  1315 \noindent
  1328 \noindent
   1316 For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
   1329 For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
  1317 we need to unfold it into a sequence,
  1330 we need to unfold it into a sequence,
  1318 and attach an additional bit $0$ to the front of $r \backslash c$
  1331 and attach an additional bit $0$ to the front of $r \backslash c$
  1319 to indicate that there is one more star iteration. Also the sequence clause
  1332 to indicate one more star iteration. Also the sequence clause
  1320 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
  1333 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
  1321 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
  1334 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
  1322 that it is for annotated regular expressions, therefore we omit the
  1335 that it is for annotated regular expressions, therefore we omit the
  1323 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
  1336 definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
  1324 $a_1$ matches the string prior to character $c$ (more on this later),
  1337 $a_1$ matches the string prior to character $c$ (more on this later),
  1373 \noindent
  1386 \noindent
   1374 In this definition $\_\backslash s$ is the generalisation of the derivative
   1387 In this definition $\_\backslash s$ is the generalisation of the derivative
  1375 operation from characters to strings (just like the derivatives for un-annotated
  1388 operation from characters to strings (just like the derivatives for un-annotated
  1376 regular expressions).
  1389 regular expressions).
  1377 
  1390 
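Concretely, this generalisation simply folds the character derivative over the string, in the same way as for ordinary regular expressions:
\begin{center}
\begin{tabular}{lcl}
$a \backslash []$ & $\dn$ & $a$\\
$a \backslash (c::s)$ & $\dn$ & $(a \backslash c) \backslash s$\\
\end{tabular}
\end{center}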
  1378 Remember tha one of the important reasons we introduced bitcodes
   1391 Now we introduce the simplifications, which are the reason why we
  1379 is that they can make simplification more structured and therefore guaranteeing
   1392 use bitcodes in the first place.
  1380 the correctness.
  1393 
  1381 
  1394 \subsection*{Simplification Rules}
  1382 \subsection*{Our Simplification Rules}
  1395 
  1383 
  1396 This section introduces aggressive (in terms of size) simplification rules
  1384 In this section we introduce aggressive (in terms of size) simplification rules
       
  1385 on annotated regular expressions
  1397 on annotated regular expressions
  1386 in order to keep derivatives small. Such simplifications are promising
  1398 to keep derivatives small. Such simplifications are promising
  1387 as we have
  1399 as we have
  1388 generated test data that show
  1400 generated test data that show
  1389 that a good tight bound can be achieved. Obviously we could only
  1401 that a good tight bound can be achieved. We could only
  1390 partially cover  the search space as there are infinitely many regular
  1402 partially cover the search space as there are infinitely many regular
  1391 expressions and strings. 
  1403 expressions and strings. 
  1392 
  1404 
  1393 One modification we introduced is to allow a list of annotated regular
  1405 One modification we introduced is to allow a list of annotated regular
  1394 expressions in the $\sum$ constructor. This allows us to not just
  1406 expressions in the $\sum$ constructor. This allows us to not just
  1395 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
  1407 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
  1424 \end{center}    
  1436 \end{center}    
  1425 
  1437 
  1426 \noindent
  1438 \noindent
   1427 The simplification function pattern-matches on the regular expression.
   1439 The simplification function pattern-matches on the regular expression.
   1428 When it detects that the regular expression is an alternative or
   1440 When it detects that the regular expression is an alternative or
  1429 sequence, it will try to simplify its children regular expressions
  1441 sequence, it will try to simplify its child regular expressions
  1430 recursively and then see if one of the children turn into $\ZERO$ or
  1442 recursively and then see if one of the children turns into $\ZERO$ or
  1431 $\ONE$, which might trigger further simplification at the current level.
  1443 $\ONE$, which might trigger further simplification at the current level.
  1432 The most involved part is the $\sum$ clause, where we use two
  1444 The most involved part is the $\sum$ clause, where we use two
  1433 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
  1445 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
  1434 alternatives and reduce as many duplicates as possible. Function
  1446 alternatives and reduce as many duplicates as possible. Function
  1435 $\textit{distinct}$  keeps the first occurring copy only and remove all later ones
  1447 $\textit{distinct}$  keeps the first occurring copy only and removes all later ones
   1436 when it detects duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
   1448 when it detects duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
  1437 Its recursive definition is given below:
  1449 Its recursive definition is given below:
  1438 
  1450 
  1439  \begin{center}
  1451  \begin{center}
  1440   \begin{tabular}{@{}lcl@{}}
  1452   \begin{tabular}{@{}lcl@{}}
  1489 succession) all the characters of the string matches the empty string,
  1501 succession) all the characters of the string matches the empty string,
  1490 then $r$ matches $s$ (and {\em vice versa}).  
  1502 then $r$ matches $s$ (and {\em vice versa}).  
  1491 
  1503 
  1492 
  1504 
  1493 
  1505 
  1494 However, there are two difficulties with derivative-based matchers:
  1506 However, two difficulties with derivative-based matchers exist:
  1495 First, Brzozowski's original matcher only generates a yes/no answer
  1507 First, Brzozowski's original matcher only generates a yes/no answer
  1496 for whether a regular expression matches a string or not.  This is too
  1508 for whether a regular expression matches a string or not.  This is too
  1497 little information in the context of lexing where separate tokens must
  1509 little information in the context of lexing where separate tokens must
  1498 be identified and also classified (for example as keywords
  1510 be identified and also classified (for example as keywords
  1499 or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
  1511 or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
  1552 of this version, but do not provide any supporting proof arguments, not
  1564 of this version, but do not provide any supporting proof arguments, not
  1553 even ``pencil-and-paper'' arguments. They write about their bitcoded
  1565 even ``pencil-and-paper'' arguments. They write about their bitcoded
  1554 \emph{incremental parsing method} (that is the algorithm to be formalised
  1566 \emph{incremental parsing method} (that is the algorithm to be formalised
   1555 in this thesis):
   1567 in this thesis):
  1556 
  1568 
  1557 
  1569 %motivation part
  1558 \begin{quote}\it
  1570 \begin{quote}\it
  1559   ``Correctness Claim: We further claim that the incremental parsing
  1571   ``Correctness Claim: We further claim that the incremental parsing
  1560   method [..] in combination with the simplification steps [..]
  1572   method [..] in combination with the simplification steps [..]
  1561   yields POSIX parse trees. We have tested this claim
  1573   yields POSIX parse trees. We have tested this claim
  1562   extensively [..] but yet
  1574   extensively [..] but yet