ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 636 0bcb4a7cb40c
parent 635 7ce2389dff4b
child 637 e3752aac8ec2
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Dec 20 22:32:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Dec 23 13:27:56 2022 +0000
@@ -709,7 +709,7 @@
 counters \cite{Turo_ov__2020}.
 These solutions can be quite efficient,
 with the ability to process
-gigabytes of strings input per second
+gigabits of string input per second
 even with large counters \cite{Becchi08}.
 These practical solutions do not come with
 formal guarantees, and as pointed out by
@@ -748,8 +748,9 @@
 The other way to simulate an $\mathit{NFA}$ for matching is choosing  
 a single transition each time, keeping all the other options in 
 a queue or stack, and backtracking if that choice eventually 
-fails. This method, often called a  "depth-first-search", 
-is efficient in a lot of cases, but could end up
+fails. 
+This method, often called a ``depth-first-search'',
+is efficient in many cases, but could end up
 with exponential run time.
 The backtracking method is employed in regex libraries
 that support \emph{back-references}, for example
@@ -769,15 +770,16 @@
 
 
 
-Given a regular expression like this (the sequence
-operator is omitted for brevity):
+Consider the following regular expression where the sequence
+operator is omitted for brevity:
 \begin{center}
 	$r_1r_2r_3r_4$
 \end{center}
+In this example,
 one could label sub-expressions of interest 
 by parenthesizing them and giving 
-them a number by the order in which their opening parentheses appear.
-One possible way of parenthesizing and labelling is given below:
+them a number in the order in which their opening parentheses appear.
+One possible way of parenthesizing and labelling is: 
 \begin{center}
 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
 \end{center}
@@ -787,13 +789,17 @@
 %These sub-expressions are called "capturing groups".
 To do so, we use the syntax $\backslash i$ 
 to denote that we want the sub-string 
-of the input just matched by the i-th
+of the input matched by the i-th
 sub-expression to appear again, 
 exactly the same as it first appeared: 
 \begin{center}
 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
-\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
+\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
 \end{center}
+Once the sub-string $s_i$ for the sub-expression $r_i$
+has been fixed, there is no variability in what the back-reference
+label $\backslash i$ can match: it is tied to $s_i$.
+The overall string will look like $\ldots s_i \ldots s_i \ldots$.
 %The backslash and number $i$ are the
 %so-called "back-references".
 %Let $e$ be an expression made of regular expressions 
@@ -823,7 +829,7 @@
 which does not impose any restrictions on what strings the second 
 sub-expression $.^*$
 might match.
-Another example of back-references is
+Another example of back-references is
 \begin{center}
 $(.)(.)\backslash 2\backslash 1$
 \end{center}
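The behaviour of this pattern can be tried out directly in a backtracking engine; the following is a small sketch using Java's \texttt{java.util.regex} library (called from Scala), not part of the thesis's own matcher:

```scala
import java.util.regex.Pattern

// The back-reference pattern (.)(.)\2\1 from the text:
// the third character must repeat what the second group
// matched, and the fourth what the first group matched.
val p = Pattern.compile("(.)(.)\\2\\1")

assert(p.matcher("abba").matches())  // a + b + \2 = b + \1 = a
assert(!p.matcher("abab").matches()) // third char 'a' != group 2 ('b')
```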
@@ -832,7 +838,7 @@
 
 Back-references is a regex construct 
 that programmers find quite useful.
-According to Becchi and Crawley~\cite{Becchi08},
+According to Becchi and Crawley \cite{Becchi08},
 6\% of Snort rules (up until 2008) use them.
 The most common use of back-references
 is to express well-formed html files,
@@ -854,20 +860,21 @@
 languages either.
 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
+which cannot be expressed by 
+context-free grammars \cite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
 of formal languages. 
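Although the language is context-sensitive, a backtracking engine handles it without difficulty on concrete inputs; a small illustration using Java's \texttt{java.util.regex} library (via Scala) is given below:

```scala
import java.util.regex.Pattern

// (a*)b\1b\1 accepts exactly the strings of the
// form a^n b a^n b a^n, for n >= 0.
val p = Pattern.compile("(a*)b\\1b\\1")

assert(p.matcher("bb").matches())         // n = 0
assert(p.matcher("aabaabaa").matches())   // n = 2
assert(!p.matcher("aabaabaaa").matches()) // unequal blocks of a's
```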
 Also solving the matching problem involving back-references
 is known to be NP-complete \parencite{alfred2014algorithms}.
 Regex libraries supporting back-references such as 
 PCRE \cite{pcre} therefore have to
-revert to a depth-first search algorithm which backtracks.
+revert to a depth-first search algorithm with backtracking.
 What is unexpected is that even in the cases 
 not involving back-references, there is still
 a (non-negligible) chance they might backtrack super-linearly,
-as shown in the graphs in figure\ref{fig:aStarStarb}.
+as shown in the graphs in Figure~\ref{fig:aStarStarb}.
 
-Summing these up, we can categorise existing 
+Summing up, we can categorise existing 
 practical regex libraries into two kinds:
 (i) The ones  with  linear
 time guarantees like Go and Rust. The downside with them is that
@@ -932,20 +939,21 @@
 or give results that are inconsistent with the POSIX standard.
 A concrete example is the regex:
 \begin{center}
-	$(aba + ab + a)^* \text{and the string} ababa$
+	$(aba + ab + a)^* \;\text{and the string}\; ababa$
 \end{center}
 The correct POSIX match for the above
-is the entire string $ababa$, 
+involves the entire string $ababa$, 
 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
 $[0, 2), [2, 5)$
 respectively.
-But trying this out in regex101 \parencite{regex101} \footnote{
+But feeding this example to the different engines
+in regex101 \parencite{regex101}\footnote{
 	regex101 is an online regular expression matcher which
 	provides API for trying out regular expression
 	engines of multiple popular programming languages like
 Java, Python, Go, etc.}
-with different engines yields
-always matches: $[aba]$ at $[0, 3)$
+yields
+only two incomplete matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
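The same behaviour can be reproduced with Java's built-in backtracking engine (used here from Scala); an unanchored search returns the shorter, non-POSIX first match:

```scala
import java.util.regex.Pattern

// The example regex and string from the text.
val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")

// The greedy first iteration consumes "aba" and the star then
// stops, yielding [aba] at [0, 3) rather than the POSIX match
// [ababa] at [0, 5).
assert(m.find())
assert(m.group() == "aba" && m.start() == 0 && m.end() == 3)
```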
 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
 commented that most regex libraries are not
@@ -960,6 +968,15 @@
 %\noindent
 We think the implementation complexity of POSIX rules also come from
 the specification being not very precise.
+The developers of the Go regexp package%
+\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
+commented that
+\begin{quote}\it
+``The POSIX rule is computationally prohibitive
+and not even well-defined.''
+\end{quote}
 There are many informal summaries of this disambiguation
 strategy, which are often quite long and delicate.
 For example Kuklewicz \cite{KuklewiczHaskell} 
@@ -997,9 +1014,9 @@
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
 with regards to that specification.
 They also found that the informal POSIX
-specification by Sulzmann and Lu does not work for the correctness proof.
+specification by Sulzmann and Lu needs to be revised for the correctness proof.
 
-In the next section we will briefly
+In the next section, we will briefly
 introduce Brzozowski derivatives and Sulzmann
 and Lu's algorithm, which the main point of this thesis builds on.
 %We give a taste of what they 
@@ -1027,7 +1044,7 @@
 %\end{center}
 %\noindent
 %The first clause says that for the regular expression
-%denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
+%denoting a singleton set consisting of a single-character string $\{ d \}$,
 %we check the derivative character $c$ against $d$,
 %returning a set containing only the empty string $\{ [] \}$
 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
@@ -1043,7 +1060,7 @@
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-With this derivatives give a simple solution
+With this property, derivatives give a simple solution
 to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,