diff -r 7ce2389dff4b -r 0bcb4a7cb40c ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Dec 20 22:32:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Dec 23 13:27:56 2022 +0000
@@ -709,7 +709,7 @@
 counters \cite{Turo_ov__2020}.
 These solutions can be quite efficient,
 with the ability to process
-gigabytes of strings input per second
+gigabits of string input per second
 even with large counters \cite{Becchi08}.
 These practical solutions do not come with
 formal guarantees, and as pointed out by
@@ -748,8 +748,9 @@
 The other way to simulate an $\mathit{NFA}$ for matching is choosing
 a single transition each time, keeping all the other options in
 a queue or stack, and backtracking if that choice eventually
-fails. This method, often called a "depth-first-search",
-is efficient in a lot of cases, but could end up
+fails.
+This method, often called a ``depth-first search'',
+is efficient in many cases, but could end up
 with exponential run time.
 The backtracking method is employed in regex libraries
 that support \emph{back-references}, for example
@@ -769,15 +770,16 @@
 
 
 
-Given a regular expression like this (the sequence
-operator is omitted for brevity):
+Consider the following regular expression, where the sequence
+operator is omitted for brevity:
 \begin{center}
 	$r_1r_2r_3r_4$
 \end{center}
+In this example,
 one could label sub-expressions of interest
 by parenthesizing them and giving
-them a number by the order in which their opening parentheses appear.
-One possible way of parenthesizing and labelling is given below:
+them a number in the order in which their opening parentheses appear.
+One possible way of parenthesizing and labelling is:
 \begin{center}
 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
 \end{center}
@@ -787,13 +789,17 @@
 %These sub-expressions are called "capturing groups".
 To do so, we use the syntax $\backslash i$
 to denote that we want the sub-string
-of the input just matched by the i-th
+of the input matched by the i-th
 sub-expression to appear again,
 exactly the same as it first appeared:
 \begin{center}
 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
-\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
+\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
 \end{center}
+Once the sub-string $s_i$ for the sub-expression $r_i$
+has been fixed, there is no variability in what the back-reference
+label $\backslash i$ can be---it is tied to $s_i$.
+The overall string will look like $\ldots s_i \ldots s_i \ldots$.
 %The backslash and number $i$ are the
 %so-called "back-references".
 %Let $e$ be an expression made of regular expressions
@@ -823,7 +829,7 @@
 which does not impose any restrictions on what strings the second
 sub-expression $.^*$
 might match.
-Another example of back-references is
+Another example of a back-reference is
 \begin{center}
 $(.)(.)\backslash 2\backslash 1$
 \end{center}
@@ -832,7 +838,7 @@
 
 Back-references is a regex construct
 that programmers find quite useful.
-According to Becchi and Crawley~\cite{Becchi08},
+According to Becchi and Crawley \cite{Becchi08},
 6\% of Snort rules (up until 2008) use them.
 The most common use of back-references
 is to express well-formed html files,
@@ -854,20 +860,21 @@
 languages either.
 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
+which cannot be expressed by
+context-free grammars \cite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
 of formal languages.
 Also solving the matching problem involving back-references
 is known to be NP-complete \parencite{alfred2014algorithms}.
 Regex libraries supporting back-references such as
 PCRE \cite{pcre} therefore have to
-revert to a depth-first search algorithm which backtracks.
+revert to a depth-first search algorithm with backtracking.
 What is unexpected is that even in the cases
 not involving back-references, there is still
 a (non-negligible) chance they might backtrack super-linearly,
-as shown in the graphs in figure\ref{fig:aStarStarb}.
+as shown in the graphs in figure \ref{fig:aStarStarb}.
 
-Summing these up, we can categorise existing
+Summing up, we can categorise existing
 practical regex libraries into two kinds:
 (i) The ones with linear
 time guarantees like Go and Rust. The downside with them is that
@@ -932,20 +939,21 @@
 or give results that are inconsistent with the POSIX standard.
 A concrete example is the regex:
 \begin{center}
-	$(aba + ab + a)^* \text{and the string} ababa$
+	$(aba + ab + a)^*$ and the string $ababa$
 \end{center}
 The correct POSIX match for the above
-is the entire string $ababa$,
+involves the entire string $ababa$,
 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
 $[0, 2), [2, 5)$
 respectively.
-But trying this out in regex101 \parencite{regex101} \footnote{
+But feeding this example to the different engines
+in regex101 \parencite{regex101} \footnote{
 regex101 is an online regular expression matcher which provides API
 for trying out regular expression
 engines of multiple popular programming languages like
 Java, Python, Go, etc.}
-with different engines yields
-always matches: $[aba]$ at $[0, 3)$
+yields
+only two incomplete matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
 commented that most regex libraries are not
@@ -960,6 +968,15 @@
 %\noindent
 We think the implementation complexity of POSIX rules also come from
 the specification being not very precise.
+The developers of the regexp package of Go
+\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
+commented that
+\begin{quote}\it
+``
+The POSIX rule is computationally prohibitive
+and not even well-defined.
+''
+\end{quote}
 There are many informal summaries of this disambiguation
 strategy, which are often quite long and delicate.
 For example Kuklewicz \cite{KuklewiczHaskell}
@@ -997,9 +1014,9 @@
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
 with regards to that specification.
 They also found that the informal POSIX
-specification by Sulzmann and Lu does not work for the correctness proof.
+specification by Sulzmann and Lu needs to be revised for the correctness proof.
 
-In the next section we will briefly
+In the next section, we will briefly
 introduce Brzozowski derivatives and Sulzmann
 and Lu's algorithm, which the main point of this thesis builds on.
 %We give a taste of what they
@@ -1027,7 +1044,7 @@
 %\end{center}
 %\noindent
 %The first clause says that for the regular expression
-%denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
+%denoting a singleton set consisting of a single-character string $\{ d \}$,
 %we check the derivative character $c$ against $d$,
 %returning a set containing only the empty string $\{ [] \}$
 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
@@ -1043,7 +1060,7 @@
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-With this derivatives give a simple solution
+With this property, derivatives give a simple solution
 to the problem of matching a string $s$ with a regular expression $r$:
 if the derivative of $r$ w.r.t.\ (in succession)
 all the characters of the string matches the empty string,
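Editor's note on the final hunk: the derivative-based matching recipe described there (take successive derivatives, then test nullability) can be sketched in a few lines. The following Python is a minimal illustration under my own constructor names (`Zero`, `One`, `Chr`, `Alt`, `Seq`, `Star`); it is not the thesis's Isabelle/HOL formalisation, only a sketch of the same idea.

```python
from dataclasses import dataclass

# A minimal regular-expression datatype: 0, 1, c, r1 + r2, r1 r2, r*.
# Constructor names are illustrative, not taken from the thesis.
@dataclass
class Zero: pass           # matches nothing
@dataclass
class One: pass            # matches only the empty string
@dataclass
class Chr: c: str          # matches the single character c
@dataclass
class Alt: r1: object; r2: object
@dataclass
class Seq: r1: object; r2: object
@dataclass
class Star: r: object

def nullable(r) -> bool:
    """Does r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False  # Zero, Chr

def der(c: str, r):
    """Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), Star(r.r))
    return Zero()  # Zero, One

def matches(r, s: str) -> bool:
    # Derive w.r.t. each character in succession, then test nullability.
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For instance, the chapter's example $(aba + ab + a)^*$ becomes `Star(Alt(Seq(Chr('a'), Seq(Chr('b'), Chr('a'))), Alt(Seq(Chr('a'), Chr('b')), Chr('a'))))`, and `matches` accepts the string `ababa` with it, since only language membership (not POSIX disambiguation) is decided here.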