ChengsongTanPhdThesis/Chapters/Chapter1.tex
changeset 518 ff7945a988a3
parent 516 6fecb7fe8cd0
child 519 856d025dbc15
--- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Fri May 20 18:52:03 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Thu May 26 20:51:40 2022 +0100
@@ -41,7 +41,7 @@
 \def\AONE{\textit{AONE}}
 \def\ACHAR{\textit{ACHAR}}
 
-
+\def\POSIX{\textit{POSIX}}
 \def\ALTS{\textit{ALTS}}
 \def\ASTAR{\textit{ASTAR}}
 \def\DFA{\textit{DFA}}
@@ -51,7 +51,7 @@
 \def\flex{\textit{flex}}
 \def\inj{\mathit{inj}}
 \def\Empty{\mathit{Empty}}
-\def\Left{\mathit{Left}}xc
+\def\Left{\mathit{Left}}
 \def\Right{\mathit{Right}}
 \def\Stars{\mathit{Stars}}
 \def\Char{\mathit{Char}}
@@ -64,6 +64,7 @@
 %\def\bderssimp{\mathit{bders}\_\mathit{simp}}
 \def\distinctWith{\textit{distinctWith}}
 
+\def\size{\mathit{size}}
 \def\rexp{\mathbf{rexp}}
 \def\simp{\mathit{simp}}
 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
@@ -79,6 +80,7 @@
 \newcommand\asize[1]{\llbracket #1 \rrbracket}
 \newcommand\rerase[1]{ (#1)\downarrow_r}
 
+
 \def\erase{\textit{erase}}
 \def\STAR{\textit{STAR}}
 \def\flts{\textit{flts}}
@@ -100,31 +102,24 @@
 
 
 Regular expressions are widely used in computer science: 
-be it in text-editors\parencite{atomEditor} with syntax hightlighting and auto completion, 
-command line tools like $\mathit{grep}$ that facilitates easy 
-text processing , network intrusion
-detection systems that rejects suspicious traffic, or compiler
+be it in text-editors\parencite{atomEditor} with syntax highlighting and auto-completion, 
+command-line tools like $\mathit{grep}$ that facilitate easy 
+text processing, network intrusion
+detection systems that reject suspicious traffic, or compiler
 front ends--the majority of the solutions to these tasks 
 involve lexing with regular 
 expressions.
-
 Given its usefulness and ubiquity, one would imagine that
 modern regular expression matching implementations
-are mature and fully-studied.
-
-If you go to a popular programming language's
-regex engine,
-you can supply it with regex and strings of your own,
-and get matching/lexing  information such as how a 
-sub-part of the regex matches a sub-part of the string.
-These lexing libraries are on average quite fast.
+are mature and fully studied.
+Indeed, given a popular programming language's regex engine, 
+one can supply it with regular expressions and strings and
+obtain rich matching information in a very short time.
+Some network intrusion detection systems
+use regex engines that are able to process 
+megabytes or even gigabytes of data per second\parencite{Turo_ov__2020}.
+Unfortunately, this is not the case for \textbf{all} inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
-For example, the regex engines some network intrusion detection
-systems use are able to process
- megabytes or even gigabytes of network traffic per second.
- 
-Why do we need to have our version, if the algorithms are
-blindingly fast already? Well it turns out it is not always the case.
 
 
 Take $(a^*)^*\,b$ and ask whether
@@ -132,12 +127,15 @@
 expression. Obviously this is not the case---the expected $b$ in the last
 position is missing. One would expect that modern regular expression
 matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8 with strings like 28
-$a$'s, one discovers that this decision takes around 30 seconds and
-takes considerably longer when adding a few more $a$'s, as the graphs
+this example in JavaScript, Python or Java 8, even with strings of a
+modest length, say around 30 $a$'s, one discovers that 
+this decision takes an unreasonably long time given the simplicity of the problem.
+This is clearly exponential behaviour, and 
+is triggered by some relatively simple regex patterns, as the graphs
 below show:
 
-\begin{center}
+\begin{figure}
+\centering
 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
@@ -204,11 +202,10 @@
 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
            of the form $\underbrace{aa..a}_{n}$.}
 \end{tabular}    
-\end{center}  
+\caption{Runtime for matching $(a^*)^*\,b$ with strings of the form $\underbrace{aa..a}_{n}$.} \label{fig:aStarStarb}
+\end{figure}
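+
+The graphs in Figure~\ref{fig:aStarStarb} are easy to reproduce. The
+following is a minimal Scala sketch that drives the JVM's built-in
+\texttt{java.util.regex} engine (exact timings, and the point at which
+the runtime becomes unbearable, will of course vary with the machine):
+
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object AStarStarB {
+  def main(args: Array[String]): Unit = {
+    val pat = Pattern.compile("(a*)*b")
+    for (n <- 20 to 30) {
+      val input = "a" * n          // strings of the form aa..a, no final b
+      val start = System.nanoTime()
+      pat.matcher(input).matches() // backtracks exponentially before failing
+      val secs  = (System.nanoTime() - start) / 1e9
+      println(f"n = $n%2d: $secs%.2f s")
+    }
+  }
+}
+\end{verbatim}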
 
 
-This is clearly exponential behaviour, and 
-is triggered by some relatively simple regex patterns.
 
 
 This superlinear blowup in matching algorithms sometimes causes
@@ -228,17 +225,16 @@
 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
 attack and therefore stopped the servers from responding to any
 requests. This made the whole site unavailable. 
-A more 
-recent example is a global outage of all Cloudflare servers on 2 July
+A more recent example is a global outage of all Cloudflare servers on 2 July
 2019. A poorly written regular expression exhibited exponential
-behaviour and exhausted CPUs that serve HTTP traffic. Although the
-outage had several causes, at the heart was a regular expression that
+behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
+had several causes, at the heart was a regular expression that
 was used to monitor network
 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
 %TODO: data points for some new versions of languages
 These problems with regular expressions 
 are not isolated events that happen
-very occasionally, but actually quite widespread.
+very occasionally, but are actually widespread.
 They occur so often that they get a 
 name--Regular-Expression-Denial-Of-Service (ReDoS)
 attack.
@@ -250,7 +246,7 @@
 requires
 more research attention.
 
- \section{Why are current algorithm for regexes slow?}
+ \section{The Problem Behind Slow Cases}
 
 %find literature/find out for yourself that REGEX->DFA on basic regexes
 %does not blow up the size
@@ -309,50 +305,6 @@
 However, they do not scale well with bounded repetitions.\\
 
 
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    ymode=log,
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={1,2,...,8},
-    xmax=9,
-    ymax=16000000,
-    ytick={0,500,...,3500},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={JavaScript},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {re-chengsong.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}
-
-Another graph:
-\begin{center}
-\begin{tikzpicture}
-    \begin{axis}[
-        height=0.5\textwidth,
-        width=\textwidth,
-        xlabel=number of a's,
-        xtick={0,...,9},
-        ylabel=maximum size,
-        ymode=log,
-       log basis y={2}
-]
-        \addplot[mark=*,blue] table {re-chengsong.data};
-    \end{axis}
-\end{tikzpicture}
-\end{center}
-
-
 
 
 Bounded repetitions, usually written in the form
@@ -469,7 +421,7 @@
 $((a|b|c|\ldots|z)^*)\backslash 1$
 would match strings like $\mathit{bobo}$, $\mathit{weewee}$, etc.\\
 Back-reference is a construct in the ``regex'' standard
-that programmers found quite useful, but not exactly 
+that programmers found useful, but not exactly 
 regular any more.
 In fact, this allows the regex construct to express 
 languages that cannot be contained in context-free
@@ -506,10 +458,10 @@
 \section{Buggy Regex Engines} 
 
 
- Another thing about the these libraries is that there
+ Another problem with these libraries is that there
  is no correctness guarantee.
- In some cases they either fails to generate a lexing result when there is a match,
- or gives the wrong way of matching.
+ In some cases, they either fail to generate a lexing result when a match exists,
+ or produce the wrong match.
  
 
 It turns out that regex libraries not only suffer from 
@@ -557,12 +509,12 @@
 
 When a regular expression does not behave as intended,
 people usually try to rewrite the regex to some equivalent form
-or they try to avoid the possibly problematic patterns completely\parencite{Davis18},
-of which there are many false positives.
+or they try to avoid the possibly problematic patterns completely,
+an approach that produces many false positives\parencite{Davis18}.
 Animated tools to ``debug'' regular expressions
-are also quite popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} 
+are also popular; regexploit\parencite{regexploit2021} and regex101\parencite{regex101}, 
 to name a few.
-There is also static analysis work on regular expressions that
+We are also aware of static analysis work on regular expressions that
 aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
 \parencite{Rathnayake2014StaticAF} proposed an algorithm
 that detects regular expressions triggering exponential
@@ -608,7 +560,7 @@
 
 %----------------------------------------------------------------------------------------
 
-\section{Our Contribution}
+\section{Contribution}
 
 
 
@@ -627,8 +579,8 @@
  
 The main contribution of this thesis is a proven correct lexing algorithm
 with formalized time bounds.
-To our best knowledge, there is no lexing libraries using Brzozowski derivatives
-that have a provable time guarantee, 
+To the best of our knowledge, no lexing libraries using Brzozowski derivatives
+have a provable time guarantee, 
 and claims about running time are usually speculative and backed by thin empirical
 evidence.
 %TODO: give references
@@ -681,8 +633,8 @@
 %TODO: FILL in the other defs
 \begin{center}
 \begin{tabular}{lcl}
-$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
-$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\
+$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; (r_2)$\\
+$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
 \end{tabular}
 \end{center}
 These clauses are also called the ``language interpretation'' of regular expressions.
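+For example, for characters $a$, $b$ and $c$, these clauses give
+\begin{center}
+$L \; ((a + b) \cdot c) = \{ac,\, bc\}$.
+\end{center}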
@@ -696,7 +648,7 @@
 Formally, we first define such a transformation on any string set, which
 we call semantic derivative:
 \begin{center}
-$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
+$\Der \; c \; A = \{s \mid c :: s \in A\}$
 \end{center}
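+For example, taking the derivative of the set $\{ab, aa, bc\}$
+w.r.t.~the character $a$ keeps exactly the strings starting with $a$,
+with that $a$ chopped off:
+\begin{center}
+$\Der \; a \; \{ab, aa, bc\} = \{b, a\}$.
+\end{center}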
 Mathematically, it can be expressed as the 
 
@@ -788,13 +740,13 @@
 The most involved cases are the sequence 
 and star case.
 The sequence case says that if the first regular expression
-contains an empty string then second component of the sequence
+contains the empty string, then the second component of the sequence
 might be chosen as the target regular expression
 to have its head character chopped off.
-The star regular expression unwraps the iteration of
-regular expression and attack the star regular expression
-to its back again to make sure there are 0 or more iterations
-following this unfolded iteration.
+The derivative of a star regular expression unwraps one iteration of
+the regular expression and attaches the star regular expression
+as the second element of the resulting sequence, to make sure a copy is retained
+for possibly more iterations in later phases of lexing.
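+
+These clauses translate directly into code. The following is a naive
+Scala sketch of $\nullable$ and the derivative operation (the datatype
+and function names are ours, mirroring the mathematical notation):
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                 // matches nothing
+case object ONE extends Rexp                  // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1)) // unfold once, keep the star
+}
+\end{verbatim}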
 
 
 The main property of the derivative operation
@@ -844,10 +796,57 @@
 This is a beautifully simple definition.
 
 If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow. For example, when starting with the regular
-expression $(a + aa)^*$ and building 12 successive derivatives
+the algorithm can be excruciatingly slow. 
+
+
+\begin{figure}
+\centering
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,-0.05)}},
+    ylabel={size of the derivative},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=10000,
+    ytick={0,1000,...,10000},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={$(a + aa)^*$},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{3}{c}{Graph: size of the derivatives of $(a + aa)^*$ w.r.t.~strings 
+           of the form $\underbrace{aa..a}_{n}$.}
+\end{tabular}    
+\caption{Size of successive derivatives of $(a + aa)^*$.} \label{fig:EightThousandNodes}
+\end{figure}
+
+
+%TODO: (8000 node data to be added here)
+For example, when starting with the regular
+expression $(a + aa)^*$ and building a few successive derivatives (around 10)
 w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree). Operations like
+with more than 8000 nodes (when viewed as a tree); see Figure~\ref{fig:EightThousandNodes}.
+The reason why $(a + aa) ^*$ explodes so drastically is that without
+pruning, the algorithm will keep records of all possible ways of matching:
+\begin{center}
+$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
+\end{center}
+
+\noindent
+Each of the above alternative branches corresponds to a way of matching:
+$aa$ in one iteration, $a$ followed by $a$ in two iterations, and $a$
+followed by an incomplete match of $aa$.
+The number of such ways grows exponentially with the string length,
+and without simplifications that throw away some of these very similar matchings,
+it is no surprise that these expressions grow so quickly.
+Operations like
 $\backslash$ and $\nullable$ need to traverse such trees and
 consequently the bigger the size of the derivative the slower the
 algorithm. 
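+
+The blowup is easy to observe concretely. Building on the Scala sketch
+above, a simple $\size$ function that counts tree nodes (our own helper,
+for illustration) exhibits the roughly exponential growth of the
+derivatives of $(a + aa)^*$:
+
+\begin{verbatim}
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+// (a + aa)*
+val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+
+// sizes of r, r\a, r\aa, ... -- these grow roughly exponentially
+val sizes = (1 to 12).scanLeft(r)((ri, _) => der('a', ri)).map(size)
+println(sizes.mkString(", "))
+\end{verbatim}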
@@ -949,9 +948,9 @@
 The number of different ways of matching 
 without allowing any value under a star to be flattened
 to an empty string can be given by the following formula:
-\begin{center}
-	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
-\end{center}
+\begin{equation}
+	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
+\end{equation}
 and a closed form formula can be calculated to be
 \begin{equation}
 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
@@ -962,12 +961,26 @@
 worst case runtime. Therefore it is impractical to try to generate
 all possible matches in a run. In practice, we are usually 
 interested in POSIX values, which by intuition always
-match the leftmost regular expression when there is a choice
-and always match a sub part as much as possible before proceeding
-to the next token. For example, the above example has the POSIX value
+\begin{itemize}
+\item
+match the leftmost regular expression when multiple options of matching
+are available  
+\item 
+always match a subpart as much as possible before proceeding
+to the next token.
+\end{itemize}
+
+
+ For instance, the above example has the POSIX value
 $\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
 The output of the algorithm we want is such a POSIX matching
 encoded as a value.
+The reason why we are interested in $\POSIX$ values is that they can
+be practically used in the lexing phase of a compiler front end.
+For instance, when lexing a code snippet 
+$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, 
+we want $\textit{iffoo}$ to be recognized as an identifier (the longest match) 
+rather than as the keyword $\textit{if}$ followed by the leftover $\textit{foo}$.
+
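+As a side note on why the disambiguation policy matters: the JVM's
+\texttt{java.util.regex} engine resolves alternatives leftmost-first
+rather than by the POSIX longest-match rule, as the following small
+Scala check illustrates (the regex and input are our own toy example):
+
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object PosixVsLeftmost {
+  def main(args: Array[String]): Unit = {
+    // keyword + identifier, encoded as an ordered alternation
+    val tok = Pattern.compile("if|[a-z]+")
+    val m = tok.matcher("iffoo")
+    if (m.lookingAt())
+      println(m.group()) // prints "if": the first alternative wins,
+                         // POSIX longest match would give "iffoo"
+  }
+}
+\end{verbatim}
+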
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
@@ -1256,7 +1269,7 @@
         $\textit{if}\;c=d\; \;\textit{then}\;
          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
-  $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
+  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
      $\textit{if}\;\textit{bnullable}\,a_1$\\
 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
@@ -1316,7 +1329,7 @@
 For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
 we need to unfold it into a sequence,
 and attach an additional bit $0$ to the front of $r \backslash c$
-to indicate that there is one more star iteration. Also the sequence clause
+to indicate one more star iteration. Also the sequence clause
 is more subtle---when $a_1$ is $\textit{bnullable}$ (here
 \textit{bnullable} is exactly the same as $\textit{nullable}$, except
 that it is for annotated regular expressions, therefore we omit the
@@ -1375,19 +1388,18 @@
 operation from characters to strings (just like the derivatives for un-annotated
 regular expressions).
 
-Remember tha one of the important reasons we introduced bitcodes
-is that they can make simplification more structured and therefore guaranteeing
-the correctness.
+We now come to the simplifications, which are the reason why we introduced 
+bitcodes in the first place.
 
-\subsection*{Our Simplification Rules}
+\subsection*{Simplification Rules}
 
-In this section we introduce aggressive (in terms of size) simplification rules
+This section introduces aggressive (in terms of size) simplification rules
 on annotated regular expressions
-in order to keep derivatives small. Such simplifications are promising
+to keep derivatives small. Such simplifications are promising
 as we have
 generated test data that show
-that a good tight bound can be achieved. Obviously we could only
-partially cover  the search space as there are infinitely many regular
+that a good tight bound can be achieved. We could only
+partially cover the search space as there are infinitely many regular
 expressions and strings. 
 
 One modification we introduced is to allow a list of annotated regular
@@ -1426,13 +1438,13 @@
 \noindent
 The simplification does pattern matching on the regular expression.
 When it detects that the regular expression is an alternative or
-sequence, it will try to simplify its children regular expressions
-recursively and then see if one of the children turn into $\ZERO$ or
+sequence, it will try to simplify its child regular expressions
+recursively and then see if one of the children turns into $\ZERO$ or
 $\ONE$, which might trigger further simplification at the current level.
 The most involved part is the $\sum$ clause, where we use two
 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
 alternatives and reduce as many duplicates as possible. Function
-$\textit{distinct}$  keeps the first occurring copy only and remove all later ones
+$\textit{distinct}$ keeps only the first occurring copy and removes all later
 duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
 Its recursive definition is given below:
 
@@ -1491,7 +1503,7 @@
 
 
 
-However, there are two difficulties with derivative-based matchers:
+However, two difficulties with derivative-based matchers exist:
 First, Brzozowski's original matcher only generates a yes/no answer
 for whether a regular expression matches a string or not.  This is too
 little information in the context of lexing where separate tokens must
@@ -1554,7 +1566,7 @@
 \emph{incremental parsing method} (that is the algorithm to be formalised
 in this paper):
 
-
+%motivation part
 \begin{quote}\it
   ``Correctness Claim: We further claim that the incremental parsing
   method [..] in combination with the simplification steps [..]