diff -r 3c5e58d08939 -r ff7945a988a3 ChengsongTanPhdThesis/Chapters/Chapter1.tex --- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex Fri May 20 18:52:03 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex Thu May 26 20:51:40 2022 +0100 @@ -41,7 +41,7 @@ \def\AONE{\textit{AONE}} \def\ACHAR{\textit{ACHAR}} - +\def\POSIX{\textit{POSIX}} \def\ALTS{\textit{ALTS}} \def\ASTAR{\textit{ASTAR}} \def\DFA{\textit{DFA}} @@ -51,7 +51,7 @@ \def\flex{\textit{flex}} \def\inj{\mathit{inj}} \def\Empty{\mathit{Empty}} -\def\Left{\mathit{Left}}xc +\def\Left{\mathit{Left}} \def\Right{\mathit{Right}} \def\Stars{\mathit{Stars}} \def\Char{\mathit{Char}} @@ -64,6 +64,7 @@ %\def\bderssimp{\mathit{bders}\_\mathit{simp}} \def\distinctWith{\textit{distinctWith}} +\def\size{\mathit{size}} \def\rexp{\mathbf{rexp}} \def\simp{\mathit{simp}} \def\simpALTs{\mathit{simp}\_\mathit{ALTs}} @@ -79,6 +80,7 @@ \newcommand\asize[1]{\llbracket #1 \rrbracket} \newcommand\rerase[1]{ (#1)\downarrow_r} + \def\erase{\textit{erase}} \def\STAR{\textit{STAR}} \def\flts{\textit{flts}} @@ -100,31 +102,24 @@ Regular expressions are widely used in computer science: -be it in text-editors\parencite{atomEditor} with syntax hightlighting and auto completion, -command line tools like $\mathit{grep}$ that facilitates easy -text processing , network intrusion -detection systems that rejects suspicious traffic, or compiler +be it in text-editors\parencite{atomEditor} with syntax highlighting and auto-completion, +command-line tools like $\mathit{grep}$ that facilitate easy +text processing, network intrusion +detection systems that reject suspicious traffic, or compiler front ends---the majority of the solutions to these tasks involve lexing with regular expressions. - Given their usefulness and ubiquity, one would imagine that modern regular expression matching implementations -are mature and fully-studied. - -If you go to a popular programming language's -regex engine, -you can supply it with regex and strings of your own, -and get matching/lexing information such as how a -sub-part of the regex matches a sub-part of the string. -These lexing libraries are on average quite fast. +are mature and fully studied. +Indeed, given the regex engine of a popular programming language, +one can supply it with regular expressions and strings, and +get rich matching information in a very short time. +Some network intrusion detection systems +use regex engines that are able to process +megabytes or even gigabytes of data per second\parencite{Turo_ov__2020}. +Unfortunately, this is not the case for $\mathbf{all}$ inputs. %TODO: get source for SNORT/BRO's regex matching engine/speed -For example, the regex engines some network intrusion detection -systems use are able to process - megabytes or even gigabytes of network traffic per second. - -Why do we need to have our version, if the algorithms are -blindingly fast already? Take $(a^*)^*\,b$ and ask whether @@ -132,12 +127,15 @@ expression. Obviously this is not the case---the expected $b$ in the last position is missing. One would expect that modern regular expression matching engines can find this out very quickly. 
Alas, if one tries -this example in JavaScript, Python or Java 8 with strings like 28 -$a$'s, one discovers that this decision takes around 30 seconds and -takes considerably longer when adding a few more $a$'s, as the graphs +this example in JavaScript, Python or Java 8, even with strings of a small +length, say around 30 $a$'s, one discovers that +this decision takes an unreasonably long time given the simplicity of the problem. +This is clearly exponential behaviour, and +is triggered by some relatively simple regex patterns, as the graphs below show: -\begin{center} +\begin{figure} +\centering \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} \begin{tikzpicture} \begin{axis}[ @@ -204,11 +202,10 @@ \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings of the form $\underbrace{aa..a}_{n}$.} \end{tabular} -\end{center} +\caption{Runtime for matching $(a^*)^*\,b$ with strings of the form $\underbrace{aa..a}_{n}$.} \label{fig:aStarStarb} +\end{figure} This superlinear blowup in matching algorithms sometimes causes @@ -228,17 +225,16 @@ respond so slowly that the load balancer assumed a $\mathit{DoS}$ attack and therefore stopped the servers from responding to any requests. This made the whole site unavailable. -A more -recent example is a global outage of all Cloudflare servers on 2 July +A more recent example is a global outage of all Cloudflare servers on 2 July 2019. A poorly written regular expression exhibited exponential -behaviour and exhausted CPUs that serve HTTP traffic. Although the -outage had several causes, at the heart was a regular expression that +behaviour and exhausted CPUs that serve HTTP traffic. Although the outage +had several causes, at the heart was a regular expression that was used to monitor network traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}} %TODO: data points for some new versions of languages These problems with regular expressions are not isolated events that happen -very occasionally, but actually quite widespread. +very occasionally, but are actually widespread. They occur so often that they have been given a name---Regular-Expression-Denial-Of-Service (ReDoS) attacks. @@ -250,7 +246,7 @@ requires more research attention. 
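+The blowup is easy to reproduce oneself. The following small Scala
+program (our own throwaway test, not part of any formal development in
+this thesis; the object name and the \texttt{time} helper are ours) measures
+the JVM's built-in backtracking engine \texttt{java.util.regex} on the
+pattern $(a^*)^*\,b$; on a Java~8 JVM the running time roughly doubles
+with every additional $a$:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object ReDoSDemo {
+  // the pattern from the graphs above; the expected b is missing
+  // from the inputs, forcing the engine to backtrack exhaustively
+  val evil: Pattern = Pattern.compile("(a*)*b")
+
+  // run a block and return its result together with seconds elapsed
+  def time[A](block: => A): (A, Double) = {
+    val start = System.nanoTime()
+    val result = block
+    (result, (System.nanoTime() - start) / 1e9)
+  }
+
+  def main(args: Array[String]): Unit = {
+    for (n <- 5 to 30 by 5) {
+      val input = "a" * n
+      val (matched, secs) = time(evil.matcher(input).matches())
+      println(f"n = $n%2d  matched = $matched  time = $secs%.3f s")
+    }
+  }
+}
+\end{verbatim}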
- \section{Why are current algorithm for regexes slow?} + \section{The Problem Behind Slow Cases} %find literature/find out for yourself that REGEX->DFA on basic regexes %does not blow up the size @@ -309,50 +305,6 @@ However, they do not scale well with bounded repetitions.\\ -\begin{center} -\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} -\begin{tikzpicture} -\begin{axis}[ - ymode=log, - xlabel={$n$}, - x label style={at={(1.05,-0.05)}}, - ylabel={time in secs}, - enlargelimits=false, - xtick={1,2,...,8}, - xmax=9, - ymax=16000000, - ytick={0,500,...,3500}, - scaled ticks=false, - axis lines=left, - width=5cm, - height=4cm, - legend entries={JavaScript}, - legend pos=north west, - legend cell align=left] -\addplot[red,mark=*, mark options={fill=white}] table {re-chengsong.data}; -\end{axis} -\end{tikzpicture} -\end{tabular} -\end{center} - -Another graph: -\begin{center} -\begin{tikzpicture} - \begin{axis}[ - height=0.5\textwidth, - width=\textwidth, - xlabel=number of a's, - xtick={0,...,9}, - ylabel=maximum size, - ymode=log, - log basis y={2} -] - \addplot[mark=*,blue] table {re-chengsong.data}; - \end{axis} -\end{tikzpicture} -\end{center} - - Bounded repetitions, usually written in the form @@ -469,7 +421,7 @@ $((a|b|c|\ldots|z)^*)\backslash 1$ would match strings like $\mathit{bobo}$, $\mathit{weewee}$, etc.\\ Back-reference is a construct in the ``regex'' standard -that programmers found quite useful, but not exactly +that programmers find useful, but which is not exactly regular any more. In fact, it allows the regex construct to express languages that cannot be contained in context-free @@ -506,10 +458,10 @@ \section{Buggy Regex Engines} - Another thing about the these libraries is that there + Another problem with these libraries is that there is no correctness guarantee. - In some cases they either fails to generate a lexing result when there is a match, - or gives the wrong way of matching. + In some cases, they either fail to generate a lexing result when there exists a match, + or produce an incorrect match. It turns out that regex libraries not only suffer from @@ -557,12 +509,12 @@ When a regular expression does not behave as intended, people usually try to rewrite the regex to some equivalent form -or they try to avoid the possibly problematic patterns completely\parencite{Davis18}, -of which there are many false positives. +or they try to avoid the possibly problematic patterns completely, +for which many false positives exist\parencite{Davis18}. Animated tools to ``debug'' regular expressions -are also quite popular, regexploit\parencite{regexploit2021}, regex101\parencite{regex101} +are also popular; regexploit\parencite{regexploit2021} and regex101\parencite{regex101}, to name a few. -There is also static analysis work on regular expressions that +We are also aware of static analysis work on regular expressions that aims to detect potentially exponential regex patterns. Rathnayake and Thielecke \parencite{Rathnayake2014StaticAF} proposed an algorithm that detects regular expressions triggering exponential @@ -608,7 +560,7 @@ %---------------------------------------------------------------------------------------- -\section{Our Contribution} +\section{Contribution} @@ -627,8 +579,8 @@ The main contribution of this thesis is a proven correct lexing algorithm with formalized time bounds. 
-To our best knowledge, there is no lexing libraries using Brzozowski derivatives -that have a provable time guarantee, +To the best of our knowledge, no lexing libraries using Brzozowski derivatives +have a provable time guarantee, and claims about running time are usually speculative and backed by thin empirical evidence. %TODO: give references @@ -681,8 +633,8 @@ %TODO: FILL in the other defs \begin{center} \begin{tabular}{lcl} -$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\ -$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\ +$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; (r_2)$\\ +$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\ \end{tabular} \end{center} These equations are also called the ``language interpretation'' of regular expressions. @@ -696,7 +648,7 @@ Formally, we first define such a transformation on any string set, which we call the semantic derivative: \begin{center} -$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$ +$\Der \; c\; A = \{s \mid c :: s \in A\}$ \end{center} Mathematically, it can be expressed as the @@ -788,13 +740,13 @@ The most involved cases are the sequence and star case. The sequence case says that if the first regular expression -contains an empty string then second component of the sequence +contains the empty string then the second component of the sequence may also be the regular expression that has its head character chopped off. -The star regular expression unwraps the iteration of -regular expression and attack the star regular expression -to its back again to make sure there are 0 or more iterations -following this unfolded iteration. +The derivative of a star regular expression unwraps one iteration of +the regular expression and attaches the star regular expression +as the sequence's second element, making sure a copy is retained +for possibly more iterations in later phases of lexing. The main property of the derivative operation @@ -844,10 +796,57 @@ This is a beautiful and simple definition. If we implement the above algorithm naively, however, -the algorithm can be excruciatingly slow. For example, when starting with the regular -expression $(a + aa)^*$ and building 12 successive derivatives +the algorithm can be excruciatingly slow. + + +\begin{figure} +\centering +\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n$}, + x label style={at={(1.05,-0.05)}}, + ylabel={size of derivative}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=10000, + ytick={0,1000,...,10000}, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={$(a + aa)^*$}, + legend pos=north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data}; +\end{axis} +\end{tikzpicture}\\ +\multicolumn{3}{c}{Graphs: Size of the derivatives of $(a + aa)^*$ w.r.t.~strings + of the form $\underbrace{aa..a}_{n}$.} +\end{tabular} +\caption{Size explosion of successive derivatives of $(a + aa)^*$.} \label{fig:EightThousandNodes} +\end{figure} + + +%TODO: add the 8000 node data here +For example, when starting with the regular +expression $(a + aa)^*$ and building a few successive derivatives +(around 10) w.r.t.~the character $a$, one obtains a derivative regular expression +with more than 8000 nodes (when viewed as a tree), see Figure~\ref{fig:EightThousandNodes}. 
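+This behaviour can be observed with a few lines of Scala. The sketch
+below (ours, purely illustrative; the constructor names merely mirror
+the text) implements $\nullable$, the derivative and the matcher
+literally as defined above, together with a \textit{size} function
+counting the nodes of a regular expression when viewed as a tree:
+\begin{verbatim}
+// naive Brzozowski matcher, with no simplification at all
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO    => ZERO
+  case ONE     => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))  // keep a copy of the star
+}
+
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil     => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
+
+// number of nodes of a regular expression, viewed as a tree
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2)          => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
+  case STAR(r1)             => 1 + size(r1)
+}
+
+// watch the derivatives of (a + aa)* explode without simplification
+val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+for (n <- 1 to 12)
+  println(s"$n: ${size(ders(List.fill(n)('a'), r))}")
+\end{verbatim}
+Running this loop (for example as a Scala script) prints rapidly
+growing sizes, reproducing the explosion plotted in
+Figure~\ref{fig:EightThousandNodes}; the exact numbers depend on how
+the derivative clauses nest the alternatives.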
+The reason why $(a + aa)^*$ explodes so drastically is that without +pruning, the algorithm keeps a record of all possible ways of matching: +\begin{center} +$(a + aa)^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$ +\end{center} + +\noindent +Each of the above alternative branches corresponds to a way of matching: +$aa$, $a \quad a$ and the incomplete $a \quad a \cdot (a)$. +The number of such ways of matching grows exponentially with the string length, +and without simplifications that throw away some of these very similar matches, +it is no surprise that the expressions grow so quickly. +Operations like $\backslash$ and $\nullable$ need to traverse such trees and consequently the bigger the size of the derivative, the slower the algorithm. @@ -949,9 +948,9 @@ The number of different ways of matching without allowing any value under a star to be flattened to an empty string can be given by the following formula: -\begin{center} - $C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$ -\end{center} +\begin{equation} + C_n = (n+1)+n C_1+\ldots + 2 C_{n-1} +\end{equation} and a closed-form formula can be calculated to be \begin{equation} C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} @@ -962,12 +961,26 @@ worst-case runtime. Therefore it is impractical to try to generate all possible matches in a run. In practice, we are usually interested in POSIX values, which by intuition always +\begin{itemize} +\item +match the leftmost regular expression when multiple options of matching +are available, and +\item +match a subpart as much as possible before proceeding +to the next token. +\end{itemize} + + For example, the above example has the POSIX value $\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$. The output we want from an algorithm is such a POSIX match encoded as a value. +The reason why we are interested in $\POSIX$ values is that they can +be used directly in the lexing phase of a compiler front end. +For instance, when lexing a code snippet +$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized +as one identifier rather than the keyword $\textit{if}$ followed by +the identifier $\textit{foo}$, which is ensured by the POSIX longest-match rule. + The contribution of Sulzmann and Lu is an extension of Brzozowski's algorithm by a second phase (the first phase being building successive derivatives---see \eqref{graph:*}). In this second phase, a POSIX value @@ -1256,7 +1269,7 @@ $\textit{if}\;c=d\; \;\textit{then}\; \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & - $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\ + $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a_1$\\ & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ @@ -1316,7 +1329,7 @@ For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$, we need to unfold it into a sequence, and attach an additional bit $0$ to the front of $r \backslash c$ -to indicate that there is one more star iteration. Also the sequence clause +to indicate one more star iteration. 
Also the sequence clause is more subtle---when $a_1$ is $\textit{bnullable}$ (here \textit{bnullable} is exactly the same as $\textit{nullable}$, except that it is for annotated regular expressions, therefore we omit the @@ -1375,19 +1388,18 @@ operation from characters to strings (just like the derivatives for un-annotated regular expressions). -Remember tha one of the important reasons we introduced bitcodes -is that they can make simplification more structured and therefore guaranteeing -the correctness. +Now we introduce the simplifications, which are the main reason for +introducing the bitcodes in the first place. -\subsection*{Our Simplification Rules} +\subsection*{Simplification Rules} -In this section we introduce aggressive (in terms of size) simplification rules +This section introduces aggressive (in terms of size) simplification rules on annotated regular expressions -in order to keep derivatives small. Such simplifications are promising +to keep derivatives small. Such simplifications are promising as we have generated test data that show -that a good tight bound can be achieved. Obviously we could only -partially cover the search space as there are infinitely many regular +that a good tight bound can be achieved. We can only +partially cover the search space as there are infinitely many regular expressions and strings. One modification we introduced is to allow a list of annotated regular @@ -1426,13 +1438,13 @@ \noindent The simplification function performs a pattern match on the regular expression. When it detects that the regular expression is an alternative or -sequence, it will try to simplify its children regular expressions -recursively and then see if one of the children turn into $\ZERO$ or +sequence, it will try to simplify its child regular expressions +recursively and then see if one of the children turns into $\ZERO$ or $\ONE$, which might trigger further simplification at the current level. The most involved part is the $\sum$ clause, where we use two auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested alternatives and reduce as many duplicates as possible. Function -$\textit{distinct}$ keeps the first occurring copy only and remove all later ones +$\textit{distinct}$ keeps the first occurring copy only and removes all later ones when it detects duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. Its recursive definition is given below: @@ -1491,7 +1503,7 @@ -However, there are two difficulties with derivative-based matchers: +However, two difficulties with derivative-based matchers exist: First, Brzozowski's original matcher only generates a yes/no answer for whether a regular expression matches a string or not. This is too little information in the context of lexing where separate tokens must @@ -1554,7 +1566,7 @@ \emph{incremental parsing method} (that is the algorithm to be formalised in this thesis): - +%motivation part \begin{quote}\it ``Correctness Claim: We further claim that the incremental parsing method [..] in combination with the simplification steps [..]