--- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex Fri May 20 18:52:03 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex Thu May 26 20:51:40 2022 +0100
@@ -41,7 +41,7 @@
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}
-
+\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
@@ -51,7 +51,7 @@
\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
-\def\Left{\mathit{Left}}xc
+\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
@@ -64,6 +64,7 @@
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
+\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
@@ -79,6 +80,7 @@
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)\downarrow_r}
+
\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}
@@ -100,31 +102,24 @@
Regular expressions are widely used in computer science:
-be it in text-editors\parencite{atomEditor} with syntax hightlighting and auto completion,
-command line tools like $\mathit{grep}$ that facilitates easy
-text processing , network intrusion
-detection systems that rejects suspicious traffic, or compiler
+be it in text-editors\parencite{atomEditor} with syntax highlighting and auto-completion,
+command-line tools like $\mathit{grep}$ that facilitate easy
+text processing, network intrusion
+detection systems that reject suspicious traffic, or compiler
front ends--the majority of the solutions to these tasks
involve lexing with regular
expressions.
-
Given its usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
-are mature and fully-studied.
-
-If you go to a popular programming language's
-regex engine,
-you can supply it with regex and strings of your own,
-and get matching/lexing information such as how a
-sub-part of the regex matches a sub-part of the string.
-These lexing libraries are on average quite fast.
+are mature and fully studied.
+Indeed, given a popular programming language's regex engine,
+one can supply it with regular expressions and strings, and
+obtain rich matching information in a very short time.
+Some network intrusion detection systems
+use regex engines that are able to process
+megabytes or even gigabytes of data per second\parencite{Turo_ov__2020}.
+Unfortunately, this is not the case for \emph{all} inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
-For example, the regex engines some network intrusion detection
-systems use are able to process
- megabytes or even gigabytes of network traffic per second.
-
-Why do we need to have our version, if the algorithms are
-blindingly fast already? Well it turns out it is not always the case.
Take $(a^*)^*\,b$ and ask whether
@@ -132,12 +127,15 @@
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8 with strings like 28
-$a$'s, one discovers that this decision takes around 30 seconds and
-takes considerably longer when adding a few more $a$'s, as the graphs
+this example in JavaScript, Python or Java 8, even with relatively short
+strings, say of around 30 $a$'s, one discovers that
+this decision takes an unreasonably long time given the simplicity of the problem.
+This is clearly exponential behaviour, and
+is triggered by some relatively simple regex patterns, as the graphs
below show:
-\begin{center}
+\begin{figure}
+\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
@@ -204,11 +202,10 @@
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
-\end{center}
+\caption{Runtime for matching $(a^*)^*\,b$ with strings of the form $\underbrace{aa..a}_{n}$.} \label{fig:aStarStarb}
+\end{figure}
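+
+This slowdown is easy to reproduce. Below is a minimal Scala sketch that
+times $(a^*)^*\,b$ against strings of $a$'s using the JVM's
+\texttt{java.util.regex} engine (which Scala's regex support wraps); the
+timing loop and the range of $n$ are our own choices for illustration:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object ReDoSDemo {
+  def main(args: Array[String]): Unit = {
+    val p = Pattern.compile("(a*)*b")
+    for (n <- 5 to 30 by 5) {
+      val s = "a" * n                 // input without the expected 'b'
+      val t0 = System.nanoTime()
+      p.matcher(s).matches()          // backtracks catastrophically
+      val secs = (System.nanoTime() - t0) / 1e9
+      println(f"n = $n%2d: $secs%.3f secs")
+    }
+  }
+}
+\end{verbatim}
+On Java 8 the runtime roughly doubles with each additional $a$, in line with
+the graphs in Figure~\ref{fig:aStarStarb}; newer JVMs may behave differently.
+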
-This is clearly exponential behaviour, and
-is triggered by some relatively simple regex patterns.
This superlinear blowup in matching algorithms sometimes causes
@@ -228,17 +225,16 @@
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.
-A more
-recent example is a global outage of all Cloudflare servers on 2 July
+A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
-behaviour and exhausted CPUs that serve HTTP traffic. Although the
-outage had several causes, at the heart was a regular expression that
+behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
+had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
-very occasionally, but actually quite widespread.
+very occasionally, but are actually widespread.
They occur so often that they have been given a
name--the Regular-Expression-Denial-Of-Service (ReDoS)
attack.
@@ -250,7 +246,7 @@
requires
more research attention.
- \section{Why are current algorithm for regexes slow?}
+ \section{The Problem Behind Slow Cases}
%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
@@ -309,50 +305,6 @@
However, they do not scale well with bounded repetitions.\\
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
- ymode=log,
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={1,2,...,8},
- xmax=9,
- ymax=16000000,
- ytick={0,500,...,3500},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={JavaScript},
- legend pos=north west,
- legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {re-chengsong.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}
-
-Another graph:
-\begin{center}
-\begin{tikzpicture}
- \begin{axis}[
- height=0.5\textwidth,
- width=\textwidth,
- xlabel=number of a's,
- xtick={0,...,9},
- ylabel=maximum size,
- ymode=log,
- log basis y={2}
-]
- \addplot[mark=*,blue] table {re-chengsong.data};
- \end{axis}
-\end{tikzpicture}
-\end{center}
-
-
Bounded repetitions, usually written in the form
@@ -469,7 +421,7 @@
$((a|b|c|\ldots|z)^*)\backslash 1$
would match strings like $\mathit{bobo}$ and $\mathit{weewee}$.\\
Back-reference is a construct in the ``regex'' standard
-that programmers found quite useful, but not exactly
+that programmers found useful, but not exactly
regular any more.
In fact, back-references allow regexes to express
languages that cannot be contained in context-free
@@ -506,10 +458,10 @@
\section{Buggy Regex Engines}
- Another thing about the these libraries is that there
+ Another problem with these libraries is that there
is no correctness guarantee.
- In some cases they either fails to generate a lexing result when there is a match,
- or gives the wrong way of matching.
+ In some cases, they either fail to generate a lexing result when there exists a match,
+ or produce the wrong match.
It turns out that regex libraries not only suffer from
@@ -557,12 +509,12 @@
When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form
-or they try to avoid the possibly problematic patterns completely\parencite{Davis18},
-of which there are many false positives.
+or they try to avoid the possibly problematic patterns completely,
+for which many false positives exist\parencite{Davis18}.
Animated tools to ``debug'' regular expressions
-are also quite popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101}
+are also popular: regexploit\parencite{regexploit2021} and regex101\parencite{regex101},
to name a few.
-There is also static analysis work on regular expressions that
+We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
@@ -608,7 +560,7 @@
%----------------------------------------------------------------------------------------
-\section{Our Contribution}
+\section{Contribution}
@@ -627,8 +579,8 @@
The main contribution of this thesis is a proven correct lexing algorithm
with formalized time bounds.
-To our best knowledge, there is no lexing libraries using Brzozowski derivatives
-that have a provable time guarantee,
+To the best of our knowledge, no lexing libraries using Brzozowski derivatives
+have a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
@@ -681,8 +633,8 @@
%TODO: FILL in the other defs
\begin{center}
\begin{tabular}{lcl}
-$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
-$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\
+$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; (r_2)$\\
+$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \land s_2 \in L \; (r_2)\}$\\
\end{tabular}
\end{center}
These clauses are also called the ``language interpretation'' of regular expressions.
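+For example, for the regular expression $(a + b) \cdot c$ these clauses give
+\begin{center}
+$L \; ((a + b) \cdot c) = \{s_1 @ s_2 \mid s_1 \in \{a, b\} \land s_2 \in \{c\}\} = \{ac, bc\}$
+\end{center}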
@@ -696,7 +648,7 @@
Formally, we first define such a transformation on any set of strings, which
we call the semantic derivative:
\begin{center}
-$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
+$\Der \; c\; A = \{s \mid c :: s \in A\}$
\end{center}
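+For instance, $\Der \; a \; \{ab, aa, ba\} = \{b, a\}$: the two strings
+starting with $a$ are kept with that $a$ chopped off, while $ba$ is dropped.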
Mathematically, it can be expressed as the
@@ -788,13 +740,13 @@
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
-contains an empty string then second component of the sequence
+contains an empty string then the second component of the sequence
might be chosen as the regular expression
from which the head character is chopped off.
-The star regular expression unwraps the iteration of
-regular expression and attack the star regular expression
-to its back again to make sure there are 0 or more iterations
-following this unfolded iteration.
+The derivative of a star regular expression unwraps one iteration of
+the regular expression and attaches the star regular expression
+as the second element of the resulting sequence, so that a copy is retained
+for possibly more iterations in later phases of lexing.
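+
+To make these clauses concrete, here is a sketch of the derivative operation
+in Scala, assuming a standard regular expression datatype (the constructor
+names below are our own; the thesis uses its own notation):
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                  // matches nothing
+case object ONE extends Rexp                   // matches only the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO | ONE  => ZERO
+  case CHAR(d)     => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>            // if r1 is nullable, c may also come from r2
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))  // unwrap once, keep the star
+}
+
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil     => r
+  case c :: cs => ders(cs, der(c, r))
+}
+def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
+\end{verbatim}
+A string $s$ is accepted precisely when the derivative of $r$ w.r.t.~all of
+$s$ is nullable, which is what \texttt{matches} computes.
+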
The main property of the derivative operation
@@ -844,10 +796,57 @@
This is a beautiful and simple definition.
If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow. For example, when starting with the regular
-expression $(a + aa)^*$ and building 12 successive derivatives
+it can be excruciatingly slow.
+
+
+\begin{figure}
+\centering
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={size of the derivative},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=10000,
+ ytick={0,1000,...,10000},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={$(a + aa)^*$},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{3}{c}{Graph: size of successive derivatives of $(a + aa)^*$
+ w.r.t.~strings of the form $\underbrace{aa..a}_{n}$.}
+\end{tabular}
+\caption{Sizes of successive derivatives of $(a + aa)^*$, exceeding 8000 nodes after around ten derivatives.} \label{fig:EightThousandNodes}
+\end{figure}
+
+
+%TODO: 8000 node data to be added here
+For example, when starting with the regular
+expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree). Operations like
+with more than 8000 nodes (when viewed as a tree), see Figure~\ref{fig:EightThousandNodes}.
+The reason why $(a + aa)^*$ explodes so drastically is that without
+pruning, the algorithm keeps a record of all possible ways of matching:
+\begin{center}
+$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
+\end{center}
+
+\noindent
+Each of the above alternative branches corresponds to one way of matching
+the string $aa$: as one iteration $aa$, as two iterations $a \quad a$, or as
+$a$ followed by a not-yet-completed iteration of $aa$.
+The number of different ways of matching grows exponentially with the string length,
+and without simplifications that throw away some of these very similar matchings,
+it is no surprise that these expressions grow so quickly.
+Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm.
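+
+The blowup can be observed directly by counting nodes. A small sketch,
+reusing the \texttt{Rexp} and \texttt{der} definitions from the earlier
+Scala sketch (\texttt{size} simply counts tree nodes):
+\begin{verbatim}
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2)          => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
+  case STAR(r1)             => 1 + size(r1)
+}
+
+// (a + aa)*: derive repeatedly w.r.t. 'a' and print the sizes
+val r0: Rexp = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+Iterator.iterate(r0)(der('a', _)).map(size).take(13).zipWithIndex.foreach {
+  case (s, n) => println(s"after $n derivatives: $s nodes")
+}
+\end{verbatim}
+Without simplification the node counts grow roughly exponentially, exceeding
+8000 after around ten derivatives, as Figure~\ref{fig:EightThousandNodes}
+shows.
+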
@@ -949,9 +948,9 @@
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following formula:
-\begin{center}
- $C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
-\end{center}
+\begin{equation}
+ C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
+\end{equation}
and a closed-form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
@@ -962,12 +961,26 @@
worst case runtime. Therefore it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which intuitively always
-match the leftmost regular expression when there is a choice
-and always match a sub part as much as possible before proceeding
-to the next token. For example, the above example has the POSIX value
+\begin{itemize}
+\item
+match the leftmost regular expression when multiple ways of matching
+are available
+\item
+always match a subpart as much as possible before proceeding
+to the next token.
+\end{itemize}
+
+
+ For instance, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from an algorithm is such a POSIX match
encoded as a value.
+The reason why we are interested in $\POSIX$ values is that they can
+be practically used in the lexing phase of a compiler front end.
+For instance, when lexing a code snippet
+$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
+as an identifier rather than a keyword.
+
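+One way to make the longest-match rule concrete is to compare, for each
+alternative, the length of the longest prefix it matches. The following toy
+Scala sketch does this with the derivative machinery from above
+(\texttt{longestPrefix} and the helper constructors are our own illustration,
+not the algorithm formalised in this thesis):
+\begin{verbatim}
+// length of the longest prefix of s matched by r, via iterated derivatives
+def longestPrefix(r: Rexp, s: String): Option[Int] = {
+  val ds = s.toList.scanLeft(r)((r1, c) => der(c, r1)) // r, r\c1, r\c1c2, ...
+  ds.zipWithIndex.collect { case (d, i) if nullable(d) => i }.lastOption
+}
+
+def str(cs: String): Rexp = cs.toList.map(c => CHAR(c): Rexp).reduce(SEQ(_, _))
+def alts(cs: String): Rexp = cs.toList.map(c => CHAR(c): Rexp).reduce(ALT(_, _))
+
+val keyword    = str("if")
+val letter     = alts("abcdefghijklmnopqrstuvwxyz")
+val identifier = SEQ(letter, STAR(letter))           // one or more letters
+
+longestPrefix(keyword, "iffoo=3")     // Some(2): matches "if"
+longestPrefix(identifier, "iffoo=3")  // Some(5): matches "iffoo"
+\end{verbatim}
+Since \texttt{identifier} matches the longer prefix, a POSIX lexer tokenises
+\texttt{iffoo} as an identifier, as desired.
+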
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
@@ -1256,7 +1269,7 @@
$\textit{if}\;c=d\; \;\textit{then}\;
\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
$(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
- $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
+ $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
@@ -1316,7 +1329,7 @@
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $r \backslash c$
-to indicate that there is one more star iteration. Also the sequence clause
+to indicate one more star iteration. Also the sequence clause
is more subtle---when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
@@ -1375,19 +1388,18 @@
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
-Remember tha one of the important reasons we introduced bitcodes
-is that they can make simplification more structured and therefore guaranteeing
-the correctness.
+Now we introduce the simplification rules, which are the reason why we
+introduced bitcodes in the first place.
-\subsection*{Our Simplification Rules}
+\subsection*{Simplification Rules}
-In this section we introduce aggressive (in terms of size) simplification rules
+This section introduces aggressive (in terms of size) simplification rules
on annotated regular expressions
-in order to keep derivatives small. Such simplifications are promising
+to keep derivatives small. Such simplifications are promising
as we have
generated test data that show
-that a good tight bound can be achieved. Obviously we could only
-partially cover the search space as there are infinitely many regular
+that a good tight bound can be achieved. We could only
+partially cover the search space as there are infinitely many regular
expressions and strings.
One modification we introduced is to allow a list of annotated regular
@@ -1426,13 +1438,13 @@
\noindent
The simplification performs a pattern match on the regular expression.
When it detects that the regular expression is an alternative or
-sequence, it will try to simplify its children regular expressions
-recursively and then see if one of the children turn into $\ZERO$ or
+sequence, it will try to simplify its child regular expressions
+recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and reduce as many duplicates as possible. Function
-$\textit{distinct}$ keeps the first occurring copy only and remove all later ones
+$\textit{distinct}$ keeps the first occurring copy only and removes all later ones
when duplicates are detected. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:
@@ -1491,7 +1503,7 @@
-However, there are two difficulties with derivative-based matchers:
+However, two difficulties with derivative-based matchers exist:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
@@ -1554,7 +1566,7 @@
\emph{incremental parsing method} (that is the algorithm to be formalised
in this thesis):
-
+%motivation part
\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method [..] in combination with the simplification steps [..]