--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Dec 18 18:17:52 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Dec 20 22:32:54 2022 +0000
@@ -421,16 +421,16 @@
attacks.
Davis et al. \cite{Davis18} detected more
than 1000 evil regular expressions
-in Node.js, Python core libraries, npm and in pypi.
+in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
are real problems rather than just "a parlour trick".
This work aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
-on Brzozowski derivatives with verified correctness (in
-Isabelle/HOL)
-and a finiteness property for the size of derivatives.
+on Brzozowski derivatives with verified correctness
+and a finiteness property for the size of derivatives
+(which are all done in Isabelle/HOL).
Such properties %guarantee the absence of
are an important step in preventing
catastrophic backtracking once and for all.
@@ -443,13 +443,13 @@
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
-have of course been studied for many, many years.
+have clearly been studied for many, many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
w.r.t the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
-deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
+deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
@@ -458,7 +458,8 @@
\]
Modern regular expression matchers used by programmers,
however,
-support much richer constructs, such as bounded repetitions
+support much richer constructs, such as bounded repetitions,
+negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
@@ -481,7 +482,7 @@
Depending on the types of constructs
the task of matching and lexing with them
will have different levels of complexity.
-Some of those constructs are just syntactic sugars that are
+Some of those constructs are syntactic sugars that are
simply short hand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
@@ -490,13 +491,16 @@
\[
(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
-Similarly, the range operator used to express the alternative
-of all characters between its operands is just a concise way:
+Similarly, the range operator
+%used to express the alternative
+%of all characters between its operands,
+is just a concise way
+of expressing an alternative of consecutive characters:
\[
- [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
+ [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
\]
for an alternative. The
-wildcard character $.$ is used to refer to any single character,
+wildcard character '$.$' is used to refer to any single character,
\[
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
@@ -506,7 +510,7 @@
More interesting are bounded repetitions, which can
make the regular expressions much
more compact.
-There are
+Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the star regular expressions, the set of strings or language
@@ -521,8 +525,9 @@
\end{tabular}
\end{center}
The attraction of bounded repetitions is that they can be
-used to avoid a blow up: for example $r^{\{n\}}$
+used to avoid a size blow up: for example $r^{\{n\}}$
is a shorthand for
+the much longer regular expression:
\[
\underbrace{r\ldots r}_\text{n copies of r}.
\]
@@ -532,28 +537,31 @@
The problem with matching
+such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
+in such situations
either become excruciatingly slow
-(for example Verbatim++~\cite{Verbatimpp}) or get
-out of memory errors (for example $\mathit{LEX}$ and
-$\mathit{JFLEX}$\footnote{which are lexer generators
+(for example Verbatim++ \cite{Verbatimpp}) or run
+out of memory (for example $\mathit{LEX}$ and
+$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
in C and JAVA that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
-and configurations to them, and then
+and configurations, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above countdown regular expression,
-a small $n$ (a few dozen) would result in a
-determinised automata
+a small $n$ (say 20) would result in a program representing a
+DFA
with millions of states.}) for large counters.
A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
-The corresponding $\mathit{NFA}$ looks like:
+the corresponding $\mathit{NFA}$ looks like:
+\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
@@ -567,24 +575,25 @@
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
-when turned into a DFA by the subset construction
+and when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The
-red states are "countdown states" which counts down
+red states are "countdown states" which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the "delimiter" $a$ in the middle, and
-need to match 2 more iterations of $(a|b)$ to complete.
+needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$ on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
-$q_3$ is the last state, requiring 0 more character and is accepting.
+The state $q_3$ is the last (accepting) state, requiring 0
+more characters.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$ and $q_2$, $q_3$
may or may
-not be active, independent from each other.
+not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
@@ -595,12 +604,12 @@
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all different
-scenarios which "countdown" states are active.}
+scenarios in which "countdown" states are active.}
-Bounded repetitions are very important because they
-tend to occur a lot in practical use,
-for example in the regex library RegExLib,
+Bounded repetitions are important because they
+tend to occur frequently in practical use,
+for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
@@ -615,8 +624,8 @@
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
-approximately up to ten million.
-An example XSD they gave
+close to ten million.
+A smaller sample XSD they gave
is:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
@@ -624,7 +633,7 @@
<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
-This can be seen as the expression
+This can be seen as the regex
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
@@ -638,7 +647,7 @@
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
-As Becchi and Crawley~\cite{Becchi08} have pointed out,
+As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that they simulate a non-deterministic finite
automata (NFA) with a breadth-first search.
@@ -647,7 +656,7 @@
When the counters are large,
the memory requirement could become
infeasible, and a regex engine
-like Go will reject this pattern straight away.
+like in Go will reject this pattern straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
@@ -702,23 +711,26 @@
with the ability to process
gigabytes of strings input per second
even with large counters \cite{Becchi08}.
-But formal reasoning about these automata especially in Isabelle
-can be challenging
-and un-intuitive.
-Therefore, we take correctness and runtime claims made about these solutions
-with a grain of salt.
+These practical solutions do not come with
+formal guarantees, and as pointed out by
+Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
+%But formal reasoning about these automata especially in Isabelle
+%can be challenging
+%and un-intuitive.
+%Therefore, we take correctness and runtime claims made about these solutions
+%with a grain of salt.
In the work reported in \cite{FoSSaCS2023} and here,
we add better support using derivatives
-for bounded regular expressions $r^{\{n\}}$.
-The results
+for bounded regular expression $r^{\{n\}}$.
+Our results
extend straightforwardly to
-repetitions with an interval such as
+repetitions with intervals such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
it can be naturally extended to support bounded repetitions.
-Moreover these extensions are still made up of only
+Moreover these extensions are still made up of only small
inductive datatypes and recursive functions,
making it handy to deal with them in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be