--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Sep 29 14:05:42 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Sep 30 01:47:33 2022 +0100
@@ -577,7 +577,10 @@
Bounded repetitions are very important because they
tend to occur a lot in practical use.
For example in the regex library RegExLib,
-the rules library of Snort, as well as in XML Schema definitions (XSDs).
+in the rules library of Snort \cite{Snort1999}\footnote{
+Snort is a network intrusion detection (NID) tool
+for monitoring network traffic.},
+as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
more than half of the
XSDs they found have bounded regular expressions in them.
@@ -707,9 +710,6 @@
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
-The syntax and expressive power of regexes
-go beyond the regular language hierarchy
-with back-references.
%\section{Back-references and The Terminology Regex}
%When one constructs an $\NFA$ out of a regular expression
@@ -766,15 +766,46 @@
A concrete example
for back-references would be
\begin{center}
-$((a|b|c|\ldots|z)^*)\backslash 1$,
+$(.^*)\backslash 1$,
\end{center}
which would match
strings that can be split into two identical halves,
-for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
-Back-reference is a regex construct
-that programmers found useful, but not exactly
-regular any more.
-In fact, that allows the regex construct to express
+for example $\mathit{foofoo}$ and $\mathit{ww}$.
+Note that this is different from
+repeating the sub-expression verbatim like
+\begin{center}
+ $(.^*)(.^*)$,
+\end{center}
+which does not impose any restrictions on what strings the second
+sub-expression $.^*$
+might match.
+Another example of back-references would be
+\begin{center}
+$(.)(.)\backslash 2\backslash 1$
+\end{center}
+which expresses four-character palindromes
+like $abba$, $x??x$ etc.
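+These constraints can be seen concretely in the following snippet, which
+uses Python's \texttt{re} module purely for illustration (any
+back-reference-capable engine, such as Java's or PCRE, behaves similarly):
+\begin{verbatim}
+import re
+
+# \1 forces the second half to repeat whatever the first group captured
+print(re.fullmatch(r'(.*)\1', 'foofoo') is not None)    # True
+print(re.fullmatch(r'(.*)\1', 'foobar') is not None)    # False
+
+# without the back-reference, the two sub-expressions are unconstrained
+print(re.fullmatch(r'(.*)(.*)', 'foobar') is not None)  # True
+
+# (.)(.)\2\1 describes four-character palindromes
+print(re.fullmatch(r'(.)(.)\2\1', 'abba') is not None)  # True
+print(re.fullmatch(r'(.)(.)\2\1', 'abcd') is not None)  # False
+\end{verbatim}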
+
+Back-references are a regex construct
+that programmers find quite useful.
+According to Becchi and Crawley \cite{Becchi08},
+6\% of Snort rules (up until 2008) make use of them.
+A very common use of back-references
+is to express well-formed HTML files,
+where back-references are handy for matching
+a pair of opening and closing tags like
+\begin{center}
+ $\langle html \rangle \ldots \langle / html \rangle$
+\end{center}
+A regex describing such a format
+could be
+\begin{center}
+ $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
+\end{center}
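+An executable rendering of this idea, with the tag name restricted to
+lowercase letters and using Python's \texttt{re} module purely for
+illustration, could look as follows:
+\begin{verbatim}
+import re
+
+# \1 requires the closing tag to repeat the captured opening tag name
+tag_pair = re.compile(r'<([a-z]+)>.*</\1>', re.DOTALL)
+
+print(tag_pair.fullmatch('<html>some text</html>') is not None)  # True
+print(tag_pair.fullmatch('<html>some text</body>') is not None)  # False
+\end{verbatim}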
+Despite being useful, back-references take the syntax and expressive power
+of regexes beyond the regular language hierarchy.
+In fact, they allow regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash 1 b \backslash 1$
@@ -783,20 +814,37 @@
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Solving the back-reference expressions matching problem
-is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
+is NP-complete \parencite{alfred2014algorithms}.
+A non-backtracking,
efficient solution is not known to exist.
-%TODO:read a bit more about back reference algorithms
+Regex libraries supporting back-references, such as PCRE, therefore have to
+resort to a depth-first search algorithm that backtracks.
+The unreasonable part is that, even in the case of
+regexes not involving back-references, there is still
+a (non-negligible) chance they might backtrack super-linearly.
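+This can be triggered by regexes as simple as $(a^*)^*\,b$ matched against
+strings of $a$s.
+The following rough timing experiment, using Python's \texttt{re} module
+purely for illustration, exhibits the blow-up (exact numbers will of course
+vary between machines and library versions):
+\begin{verbatim}
+import re, time
+
+# nested stars followed by a character the input never contains,
+# forcing the engine to explore exponentially many splits
+pattern = re.compile(r'(a*)*b')
+
+for n in [18, 20, 22, 24, 26]:
+    s = 'a' * n
+    start = time.time()
+    pattern.match(s)              # fails, but only after backtracking
+    print(n, round(time.time() - start, 2), 'seconds')
+\end{verbatim}
+On a backtracking engine the time grows exponentially with each extra $a$,
+even though no back-reference is involved.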
-It seems that languages like Java and Python made the trade-off
-to support back-references at the expense of having to backtrack,
-even in the case of regexes not involving back-references.\\
+\subsection{Summary of the Catastrophic Backtracking Problem}
Summing these up, we can categorise existing
-practical regex libraries into the ones with linear
-time guarantees like Go and Rust, which impose restrictions
+practical regex libraries into two kinds:
+(i) those with linear
+time guarantees like Go and Rust; the cost is that
+they impose restrictions
on the user input (not allowing back-references,
-bounded repetitions cannot exceed 1000 etc.), and ones
- that allows the programmer much freedom, but grinds to a halt
- in some non-negligible portion of cases.
+bounded repetitions cannot exceed a certain counter limit, etc.);
+(ii) those
+that allow large bounded repetitions and back-references,
+at the expense of using a backtracking algorithm
+that can grind to a halt
+on some very simple cases, posing the risk of
+ReDoS attacks.
+
+
+We would like to have regex engines that can
+deal with the regular part of regexes
+(e.g.\ bounded repetitions) more
+efficiently.
+We also want to make sure that they do so correctly.
+It turns out that this aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
@@ -819,16 +867,25 @@
\section{Error-prone POSIX Implementations}
-The problems with practical implementations
-of reare not limited to slowness on certain
-cases.
-Another thing about these libraries is that there
-is no correctness guarantee.
-In some cases, they either fail to generate a lexing result when there exists a match,
+When there are multiple ways of matching a string
+with a regular expression, a matcher needs to
+disambiguate.
+The standard for which particular match to pick
+is called the disambiguation strategy.
+The more intuitive strategy is called POSIX,
+which always chooses the longest initial match.
+An alternative is the greedy strategy,
+which always ends a sub-match as early as possible.
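+As a small illustration of the difference, consider matching the string $ab$
+against the regular expression
+\begin{center}
+	$(a + ab)(b + \epsilon)$
+\end{center}
+Both strategies match the whole string, but they split it differently:
+the POSIX strategy makes the earlier sub-expression as long as possible and
+yields the sub-matches $ab$ and $\epsilon$, whereas the greedy strategy
+commits to the first alternative $a$ and yields the sub-matches $a$ and $b$.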
+The POSIX standard is adopted by many operating systems.
+However, many implementations (including the C libraries
+used by Linux and OS X distributions) contain bugs
+or do not meet the specification they claim to adhere to.
+In some cases, they either fail to generate a lexing
+result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be the regex
+A concrete example would be the regex given by Fowler \cite{fowler2003}
\begin{center}
- $(aba + ab + a)* \text{and the string} ababa$
+	$(aba + ab + a)^*$ and the string $ababa$
\end{center}
The correct $\POSIX$ match for the above would be
with the entire string $ababa$,
@@ -839,117 +896,127 @@
with different language engines would yield
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
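+This behaviour is easy to reproduce; the snippet below uses Python's
+\texttt{re} module purely as an example of a typical backtracking,
+non-POSIX engine:
+\begin{verbatim}
+import re
+
+# the star stops after the first alternative 'aba' succeeds once,
+# instead of searching for the longest overall match 'ababa'
+print(re.match(r'(aba|ab|a)*', 'ababa').group(0))   # prints 'aba'
+
+# scanning repeatedly for the alternatives yields the two
+# fragmented matches described above
+print(re.findall(r'aba|ab|a', 'ababa'))             # ['aba', 'a']
+\end{verbatim}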
-
-Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
+Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
+commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
-
As Grathwohl\parencite{grathwohl2014crash} wrote,
\begin{quote}
- The POSIX strategy is more complicated than the
+ ``The POSIX strategy is more complicated than the
greedy because of the dependence on information about
- the length of matched strings in the various subexpressions.
+ the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
-To summarise the above, regular expressions are important.
-They are popular and programming languages' library functions
-for them are very fast on non-catastrophic cases.
-But there are problems with current practical implementations.
-First thing is that the running time might blow up.
-The second problem is that they might be error-prone on certain
-very simple cases.
-In the next part of the chapter, we will look into reasons why
-certain regex engines are running horribly slow on the "catastrophic"
-cases and propose a solution that addresses both of these problems
-based on Brzozowski and Sulzmann and Lu's work.
-
-
-\subsection{Different Phases of a Matching/Lexing Algorithm}
+The implementation complexity of POSIX rules also comes from
+the fact that the specification is not very precise.
+There are many informal summaries of this disambiguation
+strategy, which are often quite long and delicate.
+For example, Kuklewicz described the POSIX rule as follows:
+\begin{quote}
+	``
+	\begin{itemize}
+	\item
+	regular expressions (REs) take the leftmost starting match, and the longest match starting there
+	\item
+	earlier subpatterns have leftmost-longest priority over later subpatterns
+	\item
+	higher-level subpatterns have leftmost-longest priority over their component subpatterns
+	\item
+	REs have right associative concatenation which can be changed with parenthesis
+	\item
+	parenthesized subexpressions return the match from their last usage
+	\item
+	text of component subexpressions must be contained in the text of the
+	higher-level subexpressions
+	\item
+	if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions
+	\item
+	if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string''
+	\end{itemize}
+\end{quote}
+The text above
+is trying to capture something very precise,
+and is crying out for a formalisation.
-Most lexing algorithms can be roughly divided into
-two phases during its run.
-The first phase is the "construction" phase,
-in which the algorithm builds some
-suitable data structure from the input regex $r$, so that
-it can be easily operated on later.
-We denote
-the time cost for such a phase by $P_1(r)$.
-The second phase is the lexing phase, when the input string
-$s$ is read and the data structure
-representing that regex $r$ is being operated on.
-We represent the time
-it takes by $P_2(r, s)$.\\
-For $\mathit{DFA}$,
-we have $P_2(r, s) = O( |s| )$,
-because we take at most $|s|$ steps,
-and each step takes
-at most one transition--
-a deterministic-finite-automata
-by definition has at most one state active and at most one
-transition upon receiving an input symbol.
-But unfortunately in the worst case
-$P_1(r) = O(exp^{|r|})$. An example will be given later.
-For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
-expressions like $r^n$ into
-\[
- \underbrace{r \cdots r}_{\text{n copies of r}}.
-\]
-The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
-On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^{|s|}$.
-%on the input
-%And when calculating the time complexity of the matching algorithm,
-%we are assuming that each input reading step requires constant time.
-%which translates to that the number of
-%states active and transitions taken each time is bounded by a
-%constant $C$.
-%But modern regex libraries in popular language engines
-% often want to support much richer constructs than just
-% sequences and Kleene stars,
-%such as negation, intersection,
-%bounded repetitions and back-references.
-%And de-sugaring these "extended" regular expressions
-%into basic ones might bloat the size exponentially.
-%TODO: more reference for exponential size blowup on desugaring.
+%\subsection{Different Phases of a Matching/Lexing Algorithm}
+%
+%
+%Most lexing algorithms can be roughly divided into
+%two phases during its run.
+%The first phase is the "construction" phase,
+%in which the algorithm builds some
+%suitable data structure from the input regex $r$, so that
+%it can be easily operated on later.
+%We denote
+%the time cost for such a phase by $P_1(r)$.
+%The second phase is the lexing phase, when the input string
+%$s$ is read and the data structure
+%representing that regex $r$ is being operated on.
+%We represent the time
+%it takes by $P_2(r, s)$.\\
+%For $\mathit{DFA}$,
+%we have $P_2(r, s) = O( |s| )$,
+%because we take at most $|s|$ steps,
+%and each step takes
+%at most one transition--
+%a deterministic-finite-automata
+%by definition has at most one state active and at most one
+%transition upon receiving an input symbol.
+%But unfortunately in the worst case
+%$P_1(r) = O(exp^{|r|})$. An example will be given later.
+%For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
+%expressions like $r^n$ into
+%\[
+% \underbrace{r \cdots r}_{\text{n copies of r}}.
+%\]
+%The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
+%On the other hand, if backtracking is used, the worst-case time bound bloats
+%to $|r| * 2^{|s|}$.
+%%on the input
+%%And when calculating the time complexity of the matching algorithm,
+%%we are assuming that each input reading step requires constant time.
+%%which translates to that the number of
+%%states active and transitions taken each time is bounded by a
+%%constant $C$.
+%%But modern regex libraries in popular language engines
+%% often want to support much richer constructs than just
+%% sequences and Kleene stars,
+%%such as negation, intersection,
+%%bounded repetitions and back-references.
+%%And de-sugaring these "extended" regular expressions
+%%into basic ones might bloat the size exponentially.
+%%TODO: more reference for exponential size blowup on desugaring.
+%
+%\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
+%
+%
+%The good things about $\mathit{DFA}$s is that once
+%generated, they are fast and stable, unlike
+%backtracking algorithms.
+%However, they do not scale well with bounded repetitions.
+%
+%\subsubsection{Problems with Bounded Repetitions}
+%
+%
-\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
-
-
-The good things about $\mathit{DFA}$s is that once
-generated, they are fast and stable, unlike
-backtracking algorithms.
-However, they do not scale well with bounded repetitions.
-
-\subsubsection{Problems with Bounded Repetitions}
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-So we have practical implementations
-on regular expression matching/lexing which are fast
-but do not come with any guarantees that it will not grind to a halt
-or give wrong answers.
-Our goal is to have a regex lexing algorithm that comes with
+To summarise, we need regex libraries that are both fast
+and correct.
+Moreover, this correctness needs to be built on a precise
+model of what POSIX disambiguation is.
+We propose a solution that addresses both problems,
+based on the work of Brzozowski, of Sulzmann and Lu, and of Ausaf and Urban.
+The end result is a regular expression lexing algorithm that comes with
\begin{itemize}
\item
-proven correctness
+a proven correctness theorem with respect to the POSIX specification
+given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
\item
-proven non-catastrophic properties
+a proven property saying that the algorithm's internal data structure will
+remain finite,
\item
-easy extensions to
-constructs like
- bounded repetitions, negation, lookarounds, and even back-references.
+and an extension to
+the bounded repetitions construct, with the correctness and finiteness
+properties maintained.
\end{itemize}
\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}