lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

equal deleted inserted replaced

-:7ce2389dff4b
+:0bcb4a7cb40c
 These problems can of course be solved in matching algorithms where
 automata go beyond the classic notion and for instance include explicit
 counters \cite{Turo_ov__2020}.
 These solutions can be quite efficient,
 with the ability to process
-gigabytes of strings input per second
+gigabits of strings input per second
 even with large counters \cite{Becchi08}.
 These practical solutions do not come with
 formal guarantees, and as pointed out by
 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
 %But formal reasoning about these automata especially in Isabelle
 \subsection{Back-References}
 The other way to simulate an $\mathit{NFA}$ for matching is choosing
 a single transition each time, keeping all the other options in
 a queue or stack, and backtracking if that choice eventually
-fails. This method, often called a  "depth-first-search",
+fails.
-is efficient in a lot of cases, but could end up
+This method, often called a  "depth-first-search",
+is efficient in many cases, but could end up
 with exponential run time.
 The backtracking method is employed in regex libraries
 that support \emph{back-references}, for example
 in Java and Python.
 %\section{Back-references and The Terminology Regex}
 %for a path terminating
 %at an accepting state.
-Given a regular expression like this (the sequence
+Consider the following regular expression where the sequence
-operator is omitted for brevity):
+operator is omitted for brevity:
 \begin{center}
 	$r_1r_2r_3r_4$
 \end{center}
+In this example,
 one could label sub-expressions of interest
 by parenthesizing them and giving
-them a number by the order in which their opening parentheses appear.
+them a number in the order in which their opening parentheses appear.
-One possible way of parenthesizing and labelling is given below:
+One possible way of parenthesizing and labelling is:
 \begin{center}
 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
 \end{center}
 The sub-expressions
 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
 by 1 to 4, and can be ``referred back'' by their respective numbers.
 %These sub-expressions are called "capturing groups".
 To do so, we use the syntax $\backslash i$
 to denote that we want the sub-string
-of the input just matched by the i-th
+of the input matched by the i-th
 sub-expression to appear again,
 exactly the same as it first appeared:
 \begin{center}
 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
-\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
+\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
 \end{center}
+Once the sub-string $s_i$ for the sub-expression $r_i$
+has been fixed, there is no variability on what the back-reference
+label $\backslash i$ can be---it is tied to $s_i$.
+The overall string will look like $\ldots s_i \ldots s_i \ldots $
 %The backslash and number $i$ are the
 %so-called "back-references".
 %Let $e$ be an expression made of regular expressions
 %and back-references. $e$ contains the expression $e_i$
 %as its $i$-th capturing group.
 	$(.^*)(.^*)$,
 \end{center}
 which does not impose any restrictions on what strings the second
 sub-expression $.^*$
 might match.
-Another example of back-references is
+Another example for back-references is
 \begin{center}
 $(.)(.)\backslash 2\backslash 1$
 \end{center}
 which matches four-character palindromes
 like $abba$, $x??x$ and so on.
 Back-references is a regex construct
 that programmers find quite useful.
-According to Becchi and Crawley~\cite{Becchi08},
+According to Becchi and Crawley \cite{Becchi08},
 6\% of Snort rules (up until 2008) use them.
 The most common use of back-references
 is to express well-formed html files,
 where back-references are convenient for matching
 opening and closing tags like
 In fact, they allow the regex construct to express
 languages that cannot be contained in context-free
 languages either.
 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
+which cannot be expressed by
+context-free grammars \cite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
 of formal languages.
 Also solving the matching problem involving back-references
 is known to be NP-complete \parencite{alfred2014algorithms}.
 Regex libraries supporting back-references such as
 PCRE \cite{pcre} therefore have to
-revert to a depth-first search algorithm which backtracks.
+revert to a depth-first search algorithm including backtracking.
 What is unexpected is that even in the cases
 not involving back-references, there is still
 a (non-negligible) chance they might backtrack super-linearly,
-as shown in the graphs in figure\ref{fig:aStarStarb}.
+as shown in the graphs in figure \ref{fig:aStarStarb}.
-Summing these up, we can categorise existing
+Summing up, we can categorise existing
 practical regex libraries into two kinds:
 (i) The ones  with  linear
 time guarantees like Go and Rust. The downside with them is that
 they impose restrictions
 on the regular expressions (not allowing back-references,
 In some cases, they either fail to generate a
 result when there exists a match,
 or give results that are inconsistent with the POSIX standard.
 A concrete example is the regex:
 \begin{center}
-	$(aba + ab + a)^* \text{and the string} ababa$
+	$(aba + ab + a)^* \text{and the string} \; ababa$
 \end{center}
 The correct POSIX match for the above
-is the entire string $ababa$,
+involves the entire string $ababa$,
 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
 $[0, 2), [2, 5)$
 respectively.
-But trying this out in regex101 \parencite{regex101} \footnote{
+But feeding this example to the different engines
+in regex101 \parencite{regex101} \footnote{
 	regex101 is an online regular expression matcher which
 	provides API for trying out regular expression
 	engines of multiple popular programming languages like
 Java, Python, Go, etc.}
-with different engines yields
+yields
-always matches: $[aba]$ at $[0, 3)$
+only two incomplete matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
 commented that most regex libraries are not
 correctly implementing the central POSIX
 rule, called the maximum munch rule.
 	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
 We think the implementation complexity of POSIX rules also come from
 the specification being not very precise.
+The developers of the regexp package of GO
+\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
+commented that
+\begin{quote}\it
+``
+The POSIX rule is computationally prohibitive
+and not even well-defined.
+``
+\end{quote}
 There are many informal summaries of this disambiguation
 strategy, which are often quite long and delicate.
 For example Kuklewicz \cite{KuklewiczHaskell}
 described the POSIX rule as (section 1, last paragraph):
 \begin{quote}
 POSIX specification given by Okui and Suzuki \cite{Okui10}.
 They then formally proved the correctness of
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
 with regards to that specification.
 They also found that the informal POSIX
-specification by Sulzmann and Lu does not work for the correctness proof.
+specification by Sulzmann and Lu needs to be revised for the correctness proof.
-In the next section we will briefly
+In the next section, we will briefly
 introduce Brzozowski derivatives and Sulzmann
 and Lu's algorithm, which the main point of this thesis builds on.
 %We give a taste of what they
 %are like and why they are suitable for regular expression
 %matching and lexing.
 %$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
 %	\end{tabular}
 %\end{center}
 %\noindent
 %The first clause says that for the regular expression
-%denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
+%denoting a singleton set consisting of a single-character string $\{ d \}$,
 %we check the derivative character $c$ against $d$,
 %returning a set containing only the empty string $\{ [] \}$
 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
 %The second clause states that to obtain the regular expression
 %representing all strings' head character $c$ being chopped off
 %This property can be used on regular expressions
 %matching and lexing--to test whether a string $s$ is in $L \; r$,
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-With this derivatives give a simple solution
+With this property, derivatives give a simple solution
 to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then $r$ matches $s$ (and {\em vice versa}).
 %This makes formally reasoning about these properties such

changeset 636	0bcb4a7cb40c
parent 635	7ce2389dff4b
child 637	e3752aac8ec2