diff -r 46db6ae66448 -r 370fe1dde7c7 ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Sep 23 00:44:22 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Sep 24 00:49:10 2022 +0100
@@ -221,6 +221,41 @@
 %However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
+
+Take $(a^*)^*\,b$ and ask whether
+strings of the form $aa..a$ match this regular
+expression. Obviously this is not the case---the expected $b$ in the last
+position is missing. One would expect that modern regular expression
+matching engines can find this out very quickly. Alas, if one tries
+this example in JavaScript, Python or Java 8, even with strings of a small
+length, say around 30 $a$'s,
+the decision takes an unreasonably long time to finish
+(see the graphs in Figure~\ref{fig:aStarStarb}).
+This is clearly exponential behaviour, and
+it is triggered by some relatively simple regex patterns.
+Java 9 and newer
+versions improve this behaviour, but they are still slow compared
+with the approach we are going to use.
+
+This superlinear blowup in regular expression engines
+has repeatedly caused grief in real life, so much so that
+it has been given its own name: ``catastrophic backtracking''.
+For example, on 20 July 2016 one evil
+regular expression brought the webpage
+\href{http://stackexchange.com}{Stack Exchange} to its
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
+In this instance, a regular expression intended to just trim white
+spaces from the beginning and the end of a line actually consumed
+massive amounts of CPU resources---causing web servers to grind to a
+halt. In this example, the time needed to process
+the string was $O(n^2)$ with respect to the string length. This
+quadratic overhead was enough for the homepage of Stack Exchange to
+respond so slowly that the load balancer assumed a $\mathit{DoS}$
+attack and therefore stopped the servers from responding to any
+requests. This made the whole site unavailable.
+
 \begin{figure}[p]
 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
 \begin{tikzpicture}
@@ -336,40 +371,6 @@
 }\label{fig:aStarStarb}
 \end{figure}\afterpage{\clearpage}
 
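+
+To give a first taste of the derivative-based approach we are going to
+use, the following Scala fragment is a rough sketch of a Brzozowski
+derivative matcher (it is only an illustration, not the certified lexer
+developed in this thesis, and the simplification rules shown are just a
+few of the rewrites discussed later). It decides $(a^*)^*\,b$ against
+strings of $a$'s by taking one derivative per input character, without
+any backtracking:
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp            // matches nothing
+case object ONE extends Rexp             // matches the empty string
+case class CHR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// can r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// Brzozowski derivative of r with respect to the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// a few local rewrites that keep the derivatives small
+def simp(r: Rexp): Rexp = r match {
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, s2) => s2
+    case (s1, ZERO) => s1
+    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
+  }
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _)  => ZERO
+    case (_, ZERO)  => ZERO
+    case (ONE, s2)  => s2
+    case (s1, ONE)  => s1
+    case (s1, s2)   => SEQ(s1, s2)
+  }
+  case _ => r
+}
+
+def matches(r: Rexp, s: String): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
+
+val EVIL = SEQ(STAR(STAR(CHR('a'))), CHR('b'))   // (a*)* b
+matches(EVIL, "a" * 30)   // false, answered essentially instantly
+\end{verbatim}
+
+Each input character causes exactly one derivative step, so the running
+time grows in proportion to the length of the string, provided the
+derivatives are kept small. Keeping them small is precisely what the
+simplification rules and the finiteness property studied in this thesis
+are about.
+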
-Take $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
-expression. Obviously this is not the case---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
-length, say around 30 $a$'s, one discovers that
-this decision takes crazy time to finish given the simplicity of the problem.
-This is clearly exponential behaviour, and
-is triggered by some relatively simple regex patterns, as the graphs
- in \ref{fig:aStarStarb} show.
-Java 9 and newer
-versions improves this behaviour, but is still slow compared
-with the approach we are going to use.
-
-
-
-
-This superlinear blowup in regular expression engines
-had repeatedly caused grief in real life.
-For example, on 20 July 2016 one evil
-regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
-In this instance, a regular expression intended to just trim white
-spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU resources---causing web servers to grind to a
-halt. In this example, the time needed to process
-the string was $O(n^2)$ with respect to the string length. This
-quadratic overhead was enough for the homepage of Stack Exchange to
-respond so slowly that the load balancer assumed a $\mathit{DoS}$
-attack and therefore stopped the servers from responding to any
-requests. This made the whole site become unavailable.
-
 A more recent example is a global outage of all Cloudflare servers on 2 July
 2019. A poorly written regular expression exhibited exponential
 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
@@ -390,9 +391,20 @@
 requires more research attention.
 
+This work aims to address this issue
+with the help of formal proofs.
+We offer a lexing algorithm based on Brzozowski derivatives
+together with a correctness proof and a finiteness property,
+both certified in Isabelle/HOL.
+These properties guarantee the absence of
+catastrophic backtracking in most cases.
+We will explain in more detail why we chose this
+approach (Brzozowski derivatives and formal proofs)
+in the next sections.
 
-\ChristianComment{I am not totally sure where this sentence should be
-put, seems a little out-standing here.}
+
+\section{The Problem with Bounded Repetitions}
 Regular expressions and regular expression matchers
 have of course been studied for many, many years.
 One of the most recent work in the context of lexing
@@ -403,28 +415,78 @@
 Verbatim++\cite{Verbatimpp}, this does not use derivatives,
 but automaton instead.
 For that the problem is dealing with the bounded regular expressions of the form
 $r^{n}$ where $n$ is a constant specifying that $r$ must repeat
-exactly $n$ times.
+exactly $n$ times. The Verbatim++ lexer becomes excruciatingly slow
+on bounded repetition constructs.
+
+In the work reported in \cite{CSL2022} and here, we add better support
+for them.
 The other repetition constructs include
-$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating
-at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times.
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$, which specify
+intervals for how many times $r$ should match:
+$r^{\ldots m}$ means repeating
+at most $m$ times, $r^{n\ldots}$ means repeating at least $n$ times and
+$r^{n\ldots m}$ means repeating between $n$ and $m$ times.
+The results presented in this thesis extend straightforwardly to them
+too. Their formal definitions will be given later.
+
 Bounded repetitions are important because they
-tend to occur often in practical use\cite{xml2015}, for example in RegExLib,
+tend to occur often in practical use, for example in RegExLib,
 Snort, as well as in XML Schema definitions (XSDs).
-One XSD that seems to be related to the MPEG-7 standard involves
-the below regular expression:
+According to Bj\"{o}rklund et al.~\cite{xml2015},
+bounded regular expressions occur frequently in the latter and can have
+counters up to ten million.
+An example XSD with a large counter they gave
+was:
 \begin{verbatim}
 
 \end{verbatim}
-This is just a fancy way of writing the regular expression
+This can be seen as the expression
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions
-satisfy certain constraints such as floating point number format.
+satisfying certain constraints (such as
+conforming to the floating point number format).
+The problem here is that tools based on the classic notion of
+automata need to expand $r^{n}$ into $n$ connected
+copies of the automaton for $r$. This leads to very inefficient matching
+algorithms or algorithms that consume large amounts of memory.
+A classic example is the regular expression $(a+b)^* a (a+b)^{n}$,
+whose minimal DFA requires at least $2^{n+1}$ states, since the
+automaton has to remember which of the last $n+1$ characters read
+were $a$'s (more on this later).
+Therefore regular expression matching libraries that rely on the classic
+notion of DFAs often impose ad hoc limits
+on bounded regular expressions:
+for example, in the regular expression matching library of the Go
+language the regular expression $a^{1001}$ is not permitted, because no counter
+can be above 1000, and in the built-in Rust regular expression library
+expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
+for being too big. These problems can of course be solved in matching algorithms where
+automata go beyond the classic notion and for instance include explicit
+counters \cite{Turo_ov__2020}.
+The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
+straightforwardly extended to deal with bounded regular expressions,
+and moreover the resulting code still consists of only simple
+recursive functions and inductive datatypes.
+Finally, bounded repetitions do not destroy our finiteness
+property, which we shall prove later on.
 
-The problems are not limited to slowness on certain
+\section{The Terminology Regex, and Why Backtracking?}
+Shouldn't regular expression matching be linear?
+How can one explain the super-linear behaviour of the
+regex matching engines we have?
+Regex matching algorithms in general
+involve two different phases, and different things can go wrong in
+each of these phases:
+$\DFA$s usually have problems in the first (construction) phase,
+whereas $\NFA$s usually run into trouble
+in the second (matching) phase.
+
+
+\section{Error-prone POSIX Implementations}
+The problems with practical implementations
+of regular expressions are not limited to slowness on certain
 cases.
 Another thing about these libraries
 is that there
 is no correctness guarantee.
@@ -468,19 +530,6 @@
 based on Brzozowski and Sulzmann and Lu's work.
 
 
- \section{Why are current regex engines slow?}
-
-%find literature/find out for yourself that REGEX->DFA on basic regexes
-%does not blow up the size
-Shouldn't regular expression matching be linear?
-How can one explain the super-linear behaviour of the
-regex matching engines we have?
-The time cost of regex matching algorithms in general
-involve two different phases, and different things can go differently wrong on
-these phases.
-$\DFA$s usually have problems in the first (construction) phase
-, whereas $\NFA$s usually run into trouble
-on the second phase.
 
 \subsection{Different Phases of a Matching/Lexing Algorithm}
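+
+To give a first, concrete impression of what can go wrong in the
+second phase, the following deliberately naive matcher (a hypothetical
+sketch purely for illustration, reusing the \texttt{Rexp} datatype and
+the \texttt{EVIL} expression from the earlier fragment; no production
+engine is implemented quite like this) simulates the nondeterministic
+choices of an $\NFA$ by brute-force backtracking:
+
+\begin{verbatim}
+// naive backtracking matcher: nothing is constructed up front,
+// but the matching phase tries every way of splitting the input
+def bt(r: Rexp, s: List[Char]): Boolean = r match {
+  case ZERO => false
+  case ONE  => s.isEmpty
+  case CHR(c) => s == List(c)
+  case ALT(r1, r2) => bt(r1, s) || bt(r2, s)
+  case SEQ(r1, r2) =>
+    (0 to s.length).exists(i => bt(r1, s.take(i)) && bt(r2, s.drop(i)))
+  case STAR(r1) =>
+    s.isEmpty ||
+      (1 to s.length).exists(i => bt(r1, s.take(i)) && bt(STAR(r1), s.drop(i)))
+}
+
+// bt(EVIL, ("a" * 30).toList) re-explores the same splits of the
+// input over and over and takes an impractically long time, whereas
+// matches(EVIL, "a" * 30) from the earlier sketch answers immediately.
+\end{verbatim}
+
+The construction phase here is trivial, but the matching phase
+re-examines the same suffixes of the input exponentially often. This is
+the same kind of behaviour as plotted in Figure~\ref{fig:aStarStarb}.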