--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Dec 20 22:32:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Dec 23 13:27:56 2022 +0000
@@ -709,7 +709,7 @@
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
-gigabytes of strings input per second
+gigabits of input per second
even with large counters \cite{Becchi08}.
These practical solutions do not come with
formal guarantees, and as pointed out by
@@ -748,8 +748,9 @@
The other way to simulate an $\mathit{NFA}$ for matching is choosing
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
-fails. This method, often called a "depth-first-search",
-is efficient in a lot of cases, but could end up
+fails.
+This method, often called a ``depth-first search'',
+is efficient in many cases, but could end up
with exponential run time.
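+To make this concrete, the following minimal sketch in Scala
+shows such a backtracking search, phrased over the structure of
+the regular expression rather than an explicit $\mathit{NFA}$;
+the datatype and names are ours, for illustration only:
+\begin{verbatim}
+sealed trait Rexp
+case object ZERO extends Rexp                    // matches no string
+case object ONE extends Rexp                     // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// bt(r, s, k): match a prefix of s against r and hand the
+// remaining suffix to the continuation k; when k fails, the
+// disjunctions below backtrack to the next alternative
+def bt(r: Rexp, s: List[Char], k: List[Char] => Boolean): Boolean =
+  r match {
+    case ZERO        => false
+    case ONE         => k(s)
+    case CHAR(c)     => s match {
+      case d :: rest if c == d => k(rest)
+      case _                   => false
+    }
+    case ALT(r1, r2) => bt(r1, s, k) || bt(r2, s, k)   // choice point
+    case SEQ(r1, r2) => bt(r1, s, rest => bt(r2, rest, k))
+    case STAR(r1)    =>
+      k(s) || bt(r1, s, rest => rest != s && bt(STAR(r1), rest, k))
+  }
+
+def matches(r: Rexp, s: String): Boolean =
+  bt(r, s.toList, rest => rest.isEmpty)
+\end{verbatim}
+Every \texttt{||} above is a choice point to which the search may
+return; on expressions such as $(a^*)^*\,b$ the number of choice
+points revisited grows exponentially with the input length.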
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
@@ -769,15 +770,16 @@
-Given a regular expression like this (the sequence
-operator is omitted for brevity):
+Consider the following regular expression where the sequence
+operator is omitted for brevity:
\begin{center}
$r_1r_2r_3r_4$
\end{center}
+In this example,
one could label sub-expressions of interest
by parenthesizing them and giving
-them a number by the order in which their opening parentheses appear.
-One possible way of parenthesizing and labelling is given below:
+them a number in the order in which their opening parentheses appear.
+One possible way of parenthesizing and labelling is:
\begin{center}
$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
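+For instance, Java's \texttt{java.util.regex} engine (used here
+from Scala) numbers groups in exactly this way; the concrete
+regular expression below is our own stand-in for $r_1r_2r_3r_4$:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+// (a(b(c)(d))) is parenthesised like (r1(r2(r3)(r4))) above;
+// groups are numbered by their opening parentheses
+val m = Pattern.compile("(a(b(c)(d)))").matcher("abcd")
+if (m.matches()) {
+  println(m.group(1))   // abcd -- spans the whole expression
+  println(m.group(2))   // bcd
+  println(m.group(3))   // c
+  println(m.group(4))   // d
+}
+\end{verbatim}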
@@ -787,13 +789,17 @@
%These sub-expressions are called "capturing groups".
To do so, we use the syntax $\backslash i$
to denote that we want the sub-string
-of the input just matched by the i-th
+of the input matched by the i-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
-\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
+\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
\end{center}
+Once the sub-string $s_i$ for the sub-expression $r_i$
+has been fixed, there is no variability in what the back-reference
+label $\backslash i$ can match: it is tied to $s_i$.
+The overall string will look like $\ldots s_i \ldots s_i \ldots$.
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions
@@ -823,7 +829,7 @@
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
-Another example of back-references is
+Another example involving back-references is
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
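+This matches exactly the four-character strings of the shape
+$c_1c_2c_2c_1$, such as $abba$. A quick check with Java's
+backtracking engine (our own illustration):
+\begin{verbatim}
+import java.util.regex.Pattern
+
+val p = Pattern.compile("(.)(.)\\2\\1")
+println(p.matcher("abba").matches())   // true:  \2 = b, \1 = a
+println(p.matcher("abab").matches())   // false: third char is not \2
+\end{verbatim}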
@@ -832,7 +838,7 @@
-Back-references is a regex construct
+Back-references are a regex construct
that programmers find quite useful.
-According to Becchi and Crawley~\cite{Becchi08},
+According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
-is to express well-formed html files,
+is to describe well-formed HTML files,
@@ -854,20 +860,21 @@
languages either.
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
+which cannot be expressed by
+context-free grammars \cite{campeanu2003formal}.
-Such a language is contained in the context-sensitive hierarchy
-of formal languages.
+Such a language belongs to the context-sensitive
+languages in the Chomsky hierarchy.
-Also solving the matching problem involving back-references
-is known to be NP-complete \parencite{alfred2014algorithms}.
+Also, solving the matching problem involving back-references
+is known to be NP-complete \cite{alfred2014algorithms}.
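+The effect of the back-reference can be checked concretely;
+the following small test with Java's engine is our own
+illustration:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+// (a*)b\1b\1 recognises { a^n b a^n b a^n | n >= 0 }
+val p = Pattern.compile("(a*)b\\1b\\1")
+println(p.matcher("aabaabaa").matches())    // true:  n = 2
+println(p.matcher("aabaabaaa").matches())   // false: last block too long
+\end{verbatim}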
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
-revert to a depth-first search algorithm which backtracks.
+resort to a depth-first search algorithm with backtracking.
What is unexpected is that even in the cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
-as shown in the graphs in figure\ref{fig:aStarStarb}.
+as shown in the graphs in Figure~\ref{fig:aStarStarb}.
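+This behaviour is easy to reproduce; the sketch below (our own)
+times Java's backtracking engine on $(a^*)^*\,b$ against strings
+of the form $a^n$:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+// (a*)*b never matches a^n (there is no b), but the engine
+// still explores a rapidly growing number of ways to split
+// the a's between the two stars before giving up
+val p = Pattern.compile("(a*)*b")
+for (n <- 5 to 25 by 5) {
+  val s = "a" * n
+  val t0 = System.nanoTime()
+  p.matcher(s).matches()
+  val ms = (System.nanoTime() - t0) / 1e6
+  println(f"n = $n%2d: $ms%8.2f ms")
+}
+\end{verbatim}
+On a typical JVM the running times blow up well before $n$
+reaches $30$, consistent with the graphs referenced above.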
-Summing these up, we can categorise existing
+Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) The ones with linear
time guarantees like Go and Rust. The downside with them is that
@@ -932,20 +939,21 @@
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex:
\begin{center}
- $(aba + ab + a)^* \text{and the string} ababa$
+ $(aba + ab + a)^* \quad \text{and the string} \quad ababa$
\end{center}
The correct POSIX match for the above
-is the entire string $ababa$,
+involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
-But trying this out in regex101 \parencite{regex101} \footnote{
+But feeding this example to the different engines
+in regex101 \cite{regex101}\footnote{
regex101 is an online regular expression matcher which
-provides API for trying out regular expression
+provides an API for trying out the regular expression
engines of multiple popular programming languages like
Java, Python, Go, etc.}
-with different engines yields
-always matches: $[aba]$ at $[0, 3)$
+yields
+only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
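+These two partial matches can also be reproduced outside
+regex101, for instance with Java's engine (our own check):
+\begin{verbatim}
+import java.util.regex.Pattern
+
+val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")
+while (m.find())
+  if (m.group.nonEmpty)   // the star also yields empty matches
+    println(s"[${m.start}, ${m.end}): ${m.group}")
+// prints [0, 3): aba
+//        [4, 5): a
+\end{verbatim}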
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries are not
@@ -960,6 +968,15 @@
%\noindent
-We think the implementation complexity of POSIX rules also come from
-the specification being not very precise.
+We think the implementation complexity of POSIX rules also comes from
+the specification not being very precise.
+The developers of the Go regexp
+package\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
+commented that
+\begin{quote}\it
+``The POSIX rule is computationally prohibitive
+and not even well-defined.''
+\end{quote}
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
-For example Kuklewicz \cite{KuklewiczHaskell}
+For example, Kuklewicz \cite{KuklewiczHaskell}
@@ -997,9 +1014,9 @@
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
-with regards to that specification.
+with regard to that specification.
They also found that the informal POSIX
-specification by Sulzmann and Lu does not work for the correctness proof.
+specification by Sulzmann and Lu needs to be revised for the correctness proof.
-In the next section we will briefly
+In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
-and Lu's algorithm, which the main point of this thesis builds on.
+and Lu's algorithm, on which the main part of this thesis builds.
%We give a taste of what they
@@ -1027,7 +1044,7 @@
%\end{center}
%\noindent
%The first clause says that for the regular expression
-%denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
+%denoting a singleton set consisting of a single-character string $\{ d \}$,
%we check the derivative character $c$ against $d$,
%returning a set containing only the empty string $\{ [] \}$
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
@@ -1043,7 +1060,7 @@
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
-With this derivatives give a simple solution
+With this property, derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,