ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 607 e6fc9b72c0e3
parent 606 99b530103464
child 608 37b6fd310a16
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Sep 29 14:05:42 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Sep 30 01:47:33 2022 +0100
@@ -577,7 +577,10 @@
 Bounded repetitions are very important because they
 tend to occur a lot in practical use.
 For example in the regex library RegExLib,
-the rules library of Snort, as well as in XML Schema definitions (XSDs).
+the rules library of Snort \cite{Snort1999}\footnote{
+Snort is a network intrusion detection (NID) tool
+for monitoring network traffic.}, 
+as well as in XML Schema definitions (XSDs).
 According to Bj\"{o}rklund et al.~\cite{xml2015},
 more than half of the 
 XSDs they found have bounded regular expressions in them.
@@ -707,9 +710,6 @@
 The backtracking method is employed in regex libraries
 that support \emph{back-references}, for example
 in Java and Python.
-The syntax and expressive power of regexes 
-go beyond the regular language hierarchy
-with back-references.
 %\section{Back-references and The Terminology Regex}
 
 %When one constructs an $\NFA$ out of a regular expression
@@ -766,15 +766,46 @@
 A concrete example
 for back-references would be
 \begin{center}
-$((a|b|c|\ldots|z)^*)\backslash 1$,
+$(.^*)\backslash 1$,
 \end{center}
 which would match 
 strings that can be split into two identical halves,
-for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
-Back-reference is a regex construct 
-that programmers found useful, but not exactly 
-regular any more.
-In fact, that allows the regex construct to express 
+for example $\mathit{foofoo}$, $\mathit{ww}$, etc.
+Note that this is different from 
+repeating the sub-expression verbatim like
+\begin{center}
+	$(.^*)(.^*)$,
+\end{center}
+which does not impose any restrictions on what strings the second 
+sub-expression $.^*$
+might match.
+Another example of back-references would be
+\begin{center}
+$(.)(.)\backslash 2\backslash 1$
+\end{center}
+which expresses four-character palindromes
+like $abba$, $x??x$, etc.
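+To see these constructs in action, here is a small sketch
+using Python's \texttt{re} module (one of the backtracking
+engines mentioned above); the test strings are ours, purely
+for illustration:
+\begin{verbatim}
+import re
+
+# (.*)\1 -- two identical halves
+print(re.fullmatch(r'(.*)\1', 'foofoo'))    # matches, group(1) == 'foo'
+print(re.fullmatch(r'(.*)\1', 'foobar'))    # None: the halves differ
+
+# (.*)(.*) -- no constraint relating the two sub-matches
+print(re.fullmatch(r'(.*)(.*)', 'foobar'))  # matches: any split works
+
+# (.)(.)\2\1 -- four-character palindromes
+print(re.fullmatch(r'(.)(.)\2\1', 'abba'))  # matches
+print(re.fullmatch(r'(.)(.)\2\1', 'abab'))  # None
+\end{verbatim}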
+
+Back-references are a regex construct 
+that programmers find quite useful.
+According to Becchi and Crawley~\cite{Becchi08},
+6\% of Snort rules (up until 2008) make use of them.
+The most common use of back-references
+is to express well-formed HTML files,
+where back-references are handy for matching
+a pair of opening and closing tags like 
+\begin{center}
+	$\langle html \rangle \ldots \langle / html \rangle$
+\end{center}
+A regex describing such a format
+could be
+\begin{center}
+	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
+\end{center}
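+A minimal sketch of this tag-matching idea, again with
+Python's \texttt{re} module (the HTML snippet is made up, and
+we fill in the elided middle part with \texttt{.*} purely for
+illustration):
+\begin{verbatim}
+import re
+
+# <(.+)> captures the tag name; </\1> demands the same name again
+pattern = r'<(.+)>.*</\1>'
+print(re.fullmatch(pattern, '<html>hello</html>'))  # matches
+print(re.fullmatch(pattern, '<html>hello</body>'))  # None: tags differ
+\end{verbatim}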
+Despite being useful, back-references take the syntax and
+expressive power of regexes 
+beyond the regular language hierarchy.
+In fact, they allow regexes to express 
 languages that are not even context-free.
 For example, the back-reference $(a^*)b\backslash 1 b \backslash 1$
@@ -783,20 +814,37 @@
 Such a language is contained in the context-sensitive hierarchy
 of formal languages. 
 Solving the back-reference expressions matching problem
-is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
+is NP-complete\parencite{alfred2014algorithms}.
+A non-backtracking,
 efficient solution is not known to exist.
-%TODO:read a bit more about back reference algorithms
+Regex libraries supporting back-references, such as PCRE, therefore have to
+resort to a depth-first search algorithm that backtracks.
+The unreasonable part is that even in the case of 
+regexes not involving back-references, there is still
+a (non-negligible) chance they might backtrack super-linearly.
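+The following sketch (our own deliberately pathological
+example, run on Python's backtracking engine) illustrates this
+super-linear behaviour; the matching time roughly doubles with
+every additional \texttt{a}:
+\begin{verbatim}
+import re, time
+
+# (a+)+b contains no back-reference, yet a failing match
+# forces the engine to try exponentially many splits
+for n in range(16, 24):
+    s = 'a' * n
+    start = time.perf_counter()
+    re.match(r'(a+)+b', s)        # fails, after ~2^n attempts
+    print(n, time.perf_counter() - start)
+\end{verbatim}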
 
-It seems that languages like Java and Python made the trade-off
-to support back-references at the expense of having to backtrack,
-even in the case of regexes not involving back-references.\\
+\subsection{Summary of the Catastrophic Backtracking Problem}
 Summing these up, we can categorise existing 
-practical regex libraries into the ones  with  linear
-time guarantees like Go and Rust, which impose restrictions
+practical regex libraries into two kinds:
+(i) the ones with linear
+time guarantees, like Go and Rust; the cost is that
+they impose restrictions
 on the user input (not allowing back-references, 
-bounded repetitions cannot exceed 1000 etc.), and ones  
- that allows the programmer much freedom, but grinds to a halt
- in some non-negligible portion of cases.
+bounded repetitions exceeding a fixed counter limit, etc.);
+(ii) those 
+that allow large bounded regular expressions and back-references,
+at the expense of using a backtracking algorithm.
+The latter could grind to a halt
+on some very simple cases, posing the risk of
+a ReDoS attack.
+
+ 
+We would like to have regex engines that can 
+deal with the regular part of regexes (e.g.
+bounded repetitions) more
+efficiently,
+and we want to make sure that they do so correctly.
+It turns out that such an aim is not easy to achieve.
  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
 % For example, the Rust regex engine claims to be linear, 
 % but does not support lookarounds and back-references.
@@ -819,16 +867,25 @@
 
 
 \section{Error-prone POSIX Implementations}
-The problems with practical implementations
-of reare not limited to slowness on certain 
-cases. 
-Another thing about these libraries is that there
-is no correctness guarantee.
-In some cases, they either fail to generate a lexing result when there exists a match,
+When there are multiple ways of matching a string
+with a regular expression, a matcher needs to
+disambiguate.
+The standard for which particular match to pick
+is called the disambiguation strategy.
+The more intuitive strategy is called POSIX,
+which always chooses the longest initial match.
+An alternative strategy is greedy matching,
+which always ends a sub-match as early as possible.
+The POSIX standard is widely adopted in many operating systems.
+However, many implementations (including the C libraries
+used by Linux and OS X distributions) contain bugs
+or do not meet the specification they claim to adhere to.
+In some cases, they either fail to generate a lexing 
+result when there exists a match,
 or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be the regex
+A concrete example would be the regex given by \cite{fowler2003}
 \begin{center}
-	$(aba + ab + a)* \text{and the string} ababa$
+	$(aba + ab + a)^*$ \quad and the string \quad $ababa$
 \end{center}
 The correct $\POSIX$ match for the above would be 
 with the entire string $ababa$, 
@@ -839,117 +896,127 @@
 with different language engines would yield 
 the same two fragmented matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
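+This behaviour is easy to reproduce; a small sketch of ours,
+run on Python's \texttt{re} module (a backtracking engine that
+does not claim POSIX semantics), returns the non-POSIX result
+for exactly this regex and string:
+\begin{verbatim}
+import re
+
+# a POSIX matcher would return the whole string 'ababa'
+m = re.match(r'(aba|ab|a)*', 'ababa')
+print(m.group(0))   # prints 'aba' at [0, 3), the first fragment
+\end{verbatim}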
-
-Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
+Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
+commented that most regex libraries are not
 correctly implementing the POSIX (maximum-munch)
 rule of regular expression matching.
-
 As Grathwohl \parencite{grathwohl2014crash} wrote,
 \begin{quote}
-	The POSIX strategy is more complicated than the 
+	``The POSIX strategy is more complicated than the 
 	greedy because of the dependence on information about 
-	the length of matched strings in the various subexpressions.
+	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
-To summarise the above, regular expressions are important.
-They are popular and programming languages' library functions
-for them are very fast on non-catastrophic cases.
-But there are problems with current practical implementations.
-First thing is that the running time might blow up.
-The second problem is that they might be error-prone on certain
-very simple cases.
-In the next part of the chapter, we will look into reasons why 
-certain regex engines are running horribly slow on the "catastrophic"
-cases and propose a solution that addresses both of these problems
-based on Brzozowski and Sulzmann and Lu's work.
-
-
-\subsection{Different Phases of a Matching/Lexing Algorithm}
+The implementation complexity of the POSIX rules also comes from
+the fact that the specification is not very clear.
+There are many informal summaries of this disambiguation
+strategy, which are often quite long and delicate.
+For example, Kuklewicz described the POSIX rule as
+\begin{quote}
+	``
+	\begin{itemize}
+		\item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns\\
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
+\item
+REs have right associative concatenation which can be changed with parenthesis\\
+\item
+parenthesized subexpressions return the match from their last usage\\
+\item
+text of component subexpressions must be contained in the text of the 
+higher-level subexpressions\\
+\item
+if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions\\
+\item
+if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string\\''
+\end{itemize}
+\end{quote}
+The text above 
+is trying to capture something very precise,
+and is crying out for formalising.
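+To see why, consider a small worked example (ours): matching
+the string $ab$ against the regular expression
+\begin{center}
+	$(a + ab) \cdot (b + \epsilon)$.
+\end{center}
+Both disambiguation strategies match the whole string, but with
+different sub-matches: a greedy matcher tries the alternatives
+from left to right, so the first sub-expression matches $a$ and
+the second matches $b$; under the POSIX rules the earlier
+sub-expression has leftmost-longest priority, so it matches $ab$
+and the second sub-expression matches the empty string.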
 
 
-Most lexing algorithms can be roughly divided into 
-two phases during its run.
-The first phase is the "construction" phase,
-in which the algorithm builds some  
-suitable data structure from the input regex $r$, so that
-it can be easily operated on later.
-We denote
-the time cost for such a phase by $P_1(r)$.
-The second phase is the lexing phase, when the input string 
-$s$ is read and the data structure
-representing that regex $r$ is being operated on. 
-We represent the time
-it takes by $P_2(r, s)$.\\
-For $\mathit{DFA}$,
-we have $P_2(r, s) = O( |s| )$,
-because we take at most $|s|$ steps, 
-and each step takes
-at most one transition--
-a deterministic-finite-automata
-by definition has at most one state active and at most one
-transition upon receiving an input symbol.
-But unfortunately in the  worst case
-$P_1(r) = O(exp^{|r|})$. An example will be given later. 
-For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
-expressions like $r^n$ into 
-\[
-	\underbrace{r \cdots r}_{\text{n copies of r}}.
-\]
-The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
-On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^{|s|}$.
-%on the input
-%And when calculating the time complexity of the matching algorithm,
-%we are assuming that each input reading step requires constant time.
-%which translates to that the number of 
-%states active and transitions taken each time is bounded by a
-%constant $C$.
-%But modern  regex libraries in popular language engines
-% often want to support much richer constructs than just
-% sequences and Kleene stars,
-%such as negation, intersection, 
-%bounded repetitions and back-references.
-%And de-sugaring these "extended" regular expressions 
-%into basic ones might bloat the size exponentially.
-%TODO: more reference for exponential size blowup on desugaring. 
+%\subsection{Different Phases of a Matching/Lexing Algorithm}
+%
+%
+%Most lexing algorithms can be roughly divided into 
+%two phases during its run.
+%The first phase is the "construction" phase,
+%in which the algorithm builds some  
+%suitable data structure from the input regex $r$, so that
+%it can be easily operated on later.
+%We denote
+%the time cost for such a phase by $P_1(r)$.
+%The second phase is the lexing phase, when the input string 
+%$s$ is read and the data structure
+%representing that regex $r$ is being operated on. 
+%We represent the time
+%it takes by $P_2(r, s)$.\\
+%For $\mathit{DFA}$,
+%we have $P_2(r, s) = O( |s| )$,
+%because we take at most $|s|$ steps, 
+%and each step takes
+%at most one transition--
+%a deterministic-finite-automata
+%by definition has at most one state active and at most one
+%transition upon receiving an input symbol.
+%But unfortunately in the  worst case
+%$P_1(r) = O(exp^{|r|})$. An example will be given later. 
+%For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
+%expressions like $r^n$ into 
+%\[
+%	\underbrace{r \cdots r}_{\text{n copies of r}}.
+%\]
+%The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
+%On the other hand, if backtracking is used, the worst-case time bound bloats
+%to $|r| * 2^{|s|}$.
+%%on the input
+%%And when calculating the time complexity of the matching algorithm,
+%%we are assuming that each input reading step requires constant time.
+%%which translates to that the number of 
+%%states active and transitions taken each time is bounded by a
+%%constant $C$.
+%%But modern  regex libraries in popular language engines
+%% often want to support much richer constructs than just
+%% sequences and Kleene stars,
+%%such as negation, intersection, 
+%%bounded repetitions and back-references.
+%%And de-sugaring these "extended" regular expressions 
+%%into basic ones might bloat the size exponentially.
+%%TODO: more reference for exponential size blowup on desugaring. 
+%
+%\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
+%
+%
+%The good things about $\mathit{DFA}$s is that once
+%generated, they are fast and stable, unlike
+%backtracking algorithms. 
+%However, they do not scale well with bounded repetitions.
+%
+%\subsubsection{Problems with Bounded Repetitions}
+%
+%
 
-\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
-
-
-The good things about $\mathit{DFA}$s is that once
-generated, they are fast and stable, unlike
-backtracking algorithms. 
-However, they do not scale well with bounded repetitions.
-
-\subsubsection{Problems with Bounded Repetitions}
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-So we have practical implementations 
-on regular expression matching/lexing which are fast
-but do not come with any guarantees that it will not grind to a halt
-or give wrong answers.
-Our goal is to have a regex lexing algorithm that comes with 
+To summarise, we need regex libraries that are both fast
+and correct.
+That correctness needs to be built on a precise
+model of what POSIX disambiguation is.
+We propose a solution that addresses both problems,
+based on the work of Brzozowski, of Sulzmann and Lu, and of Ausaf and Urban.
+The end result is a regular expression lexing algorithm that comes with 
 \begin{itemize}
 \item
-proven correctness 
+a proven correctness theorem with respect to the POSIX specification 
+given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
 \item 
-proven non-catastrophic properties
+a proven property that the algorithm's internal data structures
+remain finite,
 \item
-easy extensions to
-constructs like 
- bounded repetitions, negation,  lookarounds, and even back-references.
+and an extension to
+the bounded repetitions construct, with the correctness and finiteness properties
+maintained.
  \end{itemize}
  
 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}