more restructuring chap1
authorChengsong
Sat, 24 Sep 2022 00:49:10 +0100
changeset 603 370fe1dde7c7
parent 602 46db6ae66448
child 604 16d67f9c07d4
more restructuring chap1
ChengsongTanPhdThesis/Chapters/Introduction.tex
ChengsongTanPhdThesis/example.bib
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Sep 23 00:44:22 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Sep 24 00:49:10 2022 +0100
@@ -221,6 +221,41 @@
 %However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
+
+Take $(a^*)^*\,b$ and ask whether
+strings of the form $aa..a$ match this regular
+expression. Obviously this is not the case---the expected $b$ in the last
+position is missing. One would expect that modern regular expression
+matching engines can find this out very quickly. Alas, if one tries
+this example in JavaScript, Python or Java 8, even with strings of a small
+length, say around 30 $a$'s,
+the decision takes an unreasonably long time to finish (see the graphs in Figure~\ref{fig:aStarStarb}).
+This is clearly exponential behaviour, and 
+is triggered by some relatively simple regex patterns.
+Java 9 and newer
+versions improve this behaviour, but they are still slow compared 
+with the approach we are going to use.
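+
+To see this behaviour first-hand, one can time such a failing match
+directly. The following is a small Python sketch (Python being one of
+the affected languages); the exact timings will of course vary from
+machine to machine:
+\begin{verbatim}
+# time how long Python's backtracking matcher takes to
+# reject strings of the form aa..a against (a*)*b
+import re, time
+
+pattern = re.compile(r'(a*)*b')
+
+for n in range(20, 30, 2):
+    s = 'a' * n
+    t = time.time()
+    pattern.match(s)   # fails, but only after trying
+                       # exponentially many splits of the a's
+    print(n, time.time() - t)
+\end{verbatim}
+Each additional $a$ roughly doubles the running time, which is
+consistent with the exponential curves in Figure~\ref{fig:aStarStarb}.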
+
+
+
+
+This superlinear blowup in regular expression engines
+has repeatedly caused grief in real life, so much so that the
+phenomenon has acquired its own name: ``catastrophic backtracking''.
+For example, on 20 July 2016 one evil
+regular expression brought the webpage
+\href{http://stackexchange.com}{Stack Exchange} to its
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
+In this instance, a regular expression intended to just trim white
+spaces from the beginning and the end of a line actually consumed
+massive amounts of CPU resources---causing web servers to grind to a
+halt. In this example, the time needed to process
+the string was $O(n^2)$ with respect to the string length. This
+quadratic overhead was enough for the homepage of Stack Exchange to
+respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
+attack and therefore stopped the servers from responding to any
+requests. This made the whole site unavailable. 
+
 \begin{figure}[p]
 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
 \begin{tikzpicture}
@@ -336,40 +371,6 @@
 }\label{fig:aStarStarb}
 \end{figure}\afterpage{\clearpage}
 
-Take $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
-expression. Obviously this is not the case---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
-length, say around 30 $a$'s, one discovers that 
-this decision takes crazy time to finish given the simplicity of the problem.
-This is clearly exponential behaviour, and 
-is triggered by some relatively simple regex patterns, as the graphs
- in \ref{fig:aStarStarb} show.
-Java 9 and newer
-versions improves this behaviour, but is still slow compared 
-with the approach we are going to use.
-
-
-
-
-This superlinear blowup in regular expression engines
-had repeatedly caused grief in real life.
-For example, on 20 July 2016 one evil
-regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
-In this instance, a regular expression intended to just trim white
-spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU resources---causing web servers to grind to a
-halt. In this example, the time needed to process
-the string was $O(n^2)$ with respect to the string length. This
-quadratic overhead was enough for the homepage of Stack Exchange to
-respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
-attack and therefore stopped the servers from responding to any
-requests. This made the whole site become unavailable. 
-
 A more recent example is a global outage of all Cloudflare servers on 2 July
 2019. A poorly written regular expression exhibited exponential
 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
@@ -390,9 +391,20 @@
 requires
 more research attention.
 
+This work aims to address this issue
+with the help of formal proofs.
+We offer a lexing algorithm based
+on Brzozowski derivatives with a certified correctness proof (in 
+Isabelle/HOL)
+and a finiteness property.
+Together these properties guarantee the absence of 
+catastrophic backtracking in most cases.
+We will give more details on why we chose this 
+approach (Brzozowski derivatives and formal proofs)
+in the following sections.
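+
+To give a first taste of the approach, the core of a derivative-based
+matcher fits into a few lines. Below is an unverified Python sketch;
+the encoding and the names ZERO, ONE, CHR, ALT, SEQ and STAR are our
+choice for this illustration only, whereas the verified development in
+this thesis is carried out in Isabelle/HOL:
+\begin{verbatim}
+# Brzozowski's matcher: repeatedly take the derivative of the
+# regular expression w.r.t. the next character, then test
+# whether the final expression can match the empty string
+ZERO = ('ZERO',)                      # matches nothing
+ONE  = ('ONE',)                       # matches only ""
+def CHR(c):      return ('CHR', c)
+def ALT(r1, r2): return ('ALT', r1, r2)
+def SEQ(r1, r2): return ('SEQ', r1, r2)
+def STAR(r):     return ('STAR', r)
+
+def nullable(r):                      # can r match ""?
+    tag = r[0]
+    if tag in ('ONE', 'STAR'): return True
+    if tag == 'ALT': return nullable(r[1]) or nullable(r[2])
+    if tag == 'SEQ': return nullable(r[1]) and nullable(r[2])
+    return False                      # ZERO and CHR cannot
+
+def der(c, r):                        # derivative of r w.r.t. c
+    tag = r[0]
+    if tag in ('ZERO', 'ONE'): return ZERO
+    if tag == 'CHR': return ONE if r[1] == c else ZERO
+    if tag == 'ALT': return ALT(der(c, r[1]), der(c, r[2]))
+    if tag == 'SEQ':
+        left = SEQ(der(c, r[1]), r[2])
+        return ALT(left, der(c, r[2])) if nullable(r[1]) else left
+    if tag == 'STAR': return SEQ(der(c, r[1]), r)
+
+def matches(r, s):
+    for c in s:
+        r = der(c, r)
+    return nullable(r)
+\end{verbatim}
+This naive version never backtracks, but without simplification the
+intermediate expressions can still grow large; a key part of this
+thesis is adding simplification and proving that the expression sizes
+remain finitely bounded.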
 
-\ChristianComment{I am not totally sure where this sentence should be
-put, seems a little out-standing here.}
+
+\section{The Problem with Bounded Repetitions}
 Regular expressions and regular expression matchers 
 have of course been studied for many, many years.
 One of the most recent works in the context of lexing
@@ -403,28 +415,78 @@
 Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automata instead.
 For these tools, a problem is dealing with the bounded regular expressions of the form
 $r^{n}$ where $n$ is a constant specifying that $r$ must repeat
-exactly $n$ times.
+exactly $n$ times. The Verbatim++ lexer becomes excruciatingly slow
+on this bounded repetition construct.
+
+In the work reported in \cite{CSL2022} and here, we add better support
+for them. 
 The other repetition constructs include
-$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating
-at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times.
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which specify 
+intervals for how many times $r$ should match.
+$r^{\ldots m}$ means repeating
+at most $m$ times, $r^{n\ldots}$ means repeating at least $n$ times and 
+$r^{n\ldots m}$ means repeating between $n$ and $m$ times.
+The results presented in this thesis extend straightforwardly to them
+too. 
 Their formal definitions will be given later.
+
 Bounded repetitions are important because they
-tend to occur often in practical use\cite{xml2015}, for example in RegExLib,
+tend to occur often in practical use, for example in RegExLib,
 Snort, as well as in XML Schema definitions (XSDs).
-One XSD that seems to be related to the MPEG-7 standard involves
-the below regular expression:
+According to Bj\"{o}rklund et al.~\cite{xml2015},
+bounded regular expressions occur frequently in the latter and can have
+counters up to ten million. An example XSD with a large counter they gave
+was:
 \begin{verbatim}
 <sequence minOccurs="0" maxOccurs="65535">
     <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
     <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
 </sequence>
 \end{verbatim}
-This is just a fancy way of writing the regular expression 
+This can be seen as the expression 
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions 
-satisfy certain constraints such as floating point number format.
+satisfying certain constraints (such as 
+conforming to the floating point number format).
+The problem here is that tools based on the classic notion of
+automata need to expand $r^{n}$ into $n$ connected 
+copies of the automaton for $r$. This leads to very inefficient matching
+algorithms  or algorithms that consume large amounts of memory.
+A classic example is the regular expression $(a+b)^*  a (a+b)^{n}$
+where the minimal DFA requires at least $2^{n+1}$ states (more on this
+later).
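+The intuition is that the DFA has to remember which of the last $n+1$
+characters were $a$'s, giving $2^{n+1}$ distinguishable situations.
+One can observe this blowup with a small subset-construction
+experiment; the following Python sketch (the NFA encoding with states
+$0$ to $n+1$ is our own, for illustration only) counts the reachable
+DFA states for this expression:
+\begin{verbatim}
+# count the states of the subset-construction DFA for
+# the regular expression (a+b)* a (a+b)^n
+def count_dfa_states(n):
+    def step(states, c):              # one NFA transition step
+        nxt = set()
+        for s in states:
+            if s == 0:
+                nxt.add(0)            # (a+b)* loops on state 0
+                if c == 'a':
+                    nxt.add(1)        # guess: this a is the middle one
+            elif s <= n:
+                nxt.add(s + 1)        # one of the n trailing (a+b)
+        return frozenset(nxt)
+
+    start = frozenset({0})
+    seen, todo = {start}, [start]
+    while todo:                       # explore all reachable subsets
+        subset = todo.pop()
+        for c in 'ab':
+            t = step(subset, c)
+            if t not in seen:
+                seen.add(t)
+                todo.append(t)
+    return len(seen)
+
+for n in range(1, 6):
+    print(n, count_dfa_states(n))     # 4, 8, 16, 32, 64  (= 2^(n+1))
+\end{verbatim}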
+Therefore regular expression matching libraries that rely on the classic
+notion of DFAs often impose ad hoc limits
+for bounded regular expressions:
+For example, in the regular expression matching library in the Go
+language the regular expression $a^{1001}$ is not permitted, because no counter
+can be above 1000, and in the built-in Rust regular expression library
+expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
+for being too big. These problems can of course be solved in matching algorithms where 
+automata go beyond the classic notion and for instance include explicit
+counters \cite{Turo_ov__2020}.
+The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
+straightforwardly extended to deal with bounded regular expressions
+and moreover the resulting code still consists of only simple
+recursive functions and inductive datatypes.
+Finally, bounded regular expressions do not destroy our finiteness
+property, which we shall prove later on.
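+As a taste of how direct the extension is, the following sketch adds
+the clauses for $r^{n}$ to the Python sketch given earlier (NTIMES is
+again our own constructor name for this illustration; the formal
+definitions appear later in the thesis):
+\begin{verbatim}
+# extra clauses for r{n} ("exactly n times"); reuses ZERO,
+# SEQ, der and nullable from the earlier sketch
+def NTIMES(r, n): return ('NTIMES', r, n)
+
+# nullable(r{n}) holds iff n = 0 or nullable(r)
+def nullable_ntimes(r):
+    return r[2] == 0 or nullable(r[1])
+
+# der c (r{n}) = ZERO                  if n = 0
+#              = (der c r) . r{n-1}    otherwise
+def der_ntimes(c, r):
+    if r[2] == 0:
+        return ZERO
+    return SEQ(der(c, r[1]), NTIMES(r[1], r[2] - 1))
+\end{verbatim}
+Crucially, the counter is kept inside the constructor and merely
+decremented in each derivative step, rather than $r^{n}$ being
+unfolded into $n$ copies of $r$ up front.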
 
-The problems are not limited to slowness on certain 
+\section{The Terminology Regex, and Why Backtracking?}
+Shouldn't regular expression matching be linear?
+How can one explain the super-linear behaviour of the 
+regex matching engines we have?
+The time cost of regex matching algorithms in general
+involves two different phases, and different things can go wrong in 
+each of these phases.
+$\DFA$s usually have problems in the first (construction) phase,
+whereas $\NFA$s usually run into trouble
+in the second (matching) phase.
+
+
+\section{Error-prone POSIX Implementations}
+The problems with practical implementations
+of regular expression matchers are not limited to slowness in certain
 cases. 
 Another problem with these libraries is that they
 provide no correctness guarantee.
@@ -468,19 +530,6 @@
 based on Brzozowski and Sulzmann and Lu's work.
 
 
- \section{Why are current regex engines slow?}
-
-%find literature/find out for yourself that REGEX->DFA on basic regexes
-%does not blow up the size
-Shouldn't regular expression matching be linear?
-How can one explain the super-linear behaviour of the 
-regex matching engines we have?
-The time cost of regex matching algorithms in general
-involve two different phases, and different things can go differently wrong on 
-these phases.
-$\DFA$s usually have problems in the first (construction) phase
-, whereas $\NFA$s usually run into trouble
-on the second phase.
 
 \subsection{Different Phases of a Matching/Lexing Algorithm}
 
--- a/ChengsongTanPhdThesis/example.bib	Fri Sep 23 00:44:22 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib	Sat Sep 24 00:49:10 2022 +0100
@@ -5,6 +5,13 @@
 
 
 %% Saved with string encoding Unicode (UTF-8) 
+
+@unpublished{CSL2022,
+author = "Chengsong Tan and Christian Urban",
+title = "POSIX Lexing with Bitcoded Derivatives",
+note = "submitted",
+}
+
 @INPROCEEDINGS{Verbatim,  author={Egolf, Derek and Lasser, Sam and Fisher, Kathleen},  booktitle={2021 IEEE Security and Privacy Workshops (SPW)},   title={Verbatim: A Verified Lexer Generator},   year={2021},  volume={},  number={},  pages={92-100},  doi={10.1109/SPW53761.2021.00022}}