lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

equal deleted inserted replaced

-:99b530103464
+:e6fc9b72c0e3
 Bounded repetitions are very important because they
 tend to occur a lot in practical use.
 For example in the regex library RegExLib,
-the rules library of Snort, as well as in XML Schema definitions (XSDs).
+the rules library of Snort \cite{Snort1999}\footnote{
+Snort is a network intrusion detection (NID) tool
+for monitoring network traffic.},
+as well as in XML Schema definitions (XSDs).
 According to Bj\"{o}rklund et al \cite{xml2015},
 more than half of the
 XSDs they found have bounded regular expressions in them.
 Often the counters are quite large, the largest up to ten million.
 An example XSD they gave
 is efficient in a lot of cases, but could end up
 with exponential run time.
 The backtracking method is employed in regex libraries
 that support \emph{back-references}, for example
 in Java and Python.
-The syntax and expressive power of regexes
-go beyond the regular language hierarchy
-with back-references.
 %\section{Back-references and The Terminology Regex}
 %When one constructs an $\NFA$ out of a regular expression
 %there is often very little to be done in the first phase, one simply
 %construct the $\NFA$ states based on the structure of the input regular expression.
 %	\end{tabular}
 %\end{center}
 A concrete example
 for back-references would be
 \begin{center}
-$((a|b|c|\ldots|z)^*)\backslash 1$,
+$(.^*)\backslash 1$,
 \end{center}
 which would match
 strings that can be split into two identical halves,
-for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
+for example $\mathit{foofoo}$, $\mathit{ww}$ and etc.
-Back-reference is a regex construct
+Note that this is different from
-that programmers found useful, but not exactly
+repeating the  sub-expression verbatim like
-regular any more.
+\begin{center}
-In fact, that allows the regex construct to express
+	$(.^*)(.^*)$,
+\end{center}
+which does not impose any restrictions on what strings the second
+sub-expression $.^*$
+might match.
+Another example of back-references would be
+\begin{center}
+$(.)(.)\backslash 2\backslash 1$
+\end{center}
+which expresses four-character palindromes
+like $abba$, $x??x$ etc.
+Back-references is a regex construct
+that programmers found quite useful.
+According to Becchi and Crawley\cite{Becchi08},
+6\% of Snort rules (up until 2008) include the use of them.
+The most common use of back-references
+would be expressing well-formed html files,
+where back-references would be handy in expressing
+a pair of opening and closing tags like
+\begin{center}
+	$\langle html \rangle \ldots \langle / html \rangle$
+\end{center}
+A regex describing such a format
+could be
+\begin{center}
+	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
+\end{center}
+Despite being useful, the syntax and expressive power of regexes
+go beyond the regular language hierarchy
+with back-references.
+In fact, they allow the regex construct to express
 languages that cannot be contained in context-free
 languages either.
 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
 of formal languages.
 Solving the back-reference expressions matching problem
-is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
+is NP-complete\parencite{alfred2014algorithms}.
+A non-bactracking,
 efficient solution is not known to exist.
-%TODO:read a bit more about back reference algorithms
+Regex libraries supporting back-references such as PCRE therefore have to
+revert to a depth-first search algorithm that backtracks.
-It seems that languages like Java and Python made the trade-off
+The unreasonable part with them, is that even in the case of
-to support back-references at the expense of having to backtrack,
+regexes not involving back-references, there is still
-even in the case of regexes not involving back-references.\\
+a (non-negligible) chance they might backtrack super-linearly.
+\subsection{Summary of the Catastrophic Backtracking Problem}
 Summing these up, we can categorise existing
-practical regex libraries into the ones  with  linear
+practical regex libraries into two kinds:
-time guarantees like Go and Rust, which impose restrictions
+(i)The ones  with  linear
+time guarantees like Go and Rust. The cost with them is that
+they impose restrictions
 on the user input (not allowing back-references,
-bounded repetitions cannot exceed 1000 etc.), and ones
+bounded repetitions cannot exceed a counter limit etc.).
-that allows the programmer much freedom, but grinds to a halt
+(ii) Those
-in some non-negligible portion of cases.
+that allow large bounded regular expressions and back-references
+at the expense of using a backtracking algorithm.
+They could grind to a halt
+on some very simple cases, posing a vulnerability of
+a ReDoS attack.
+We would like to have regex engines that can
+deal with the regular part (e.g.
+bounded repetitions) of regexes more
+efficiently.
+Also we want to make sure that they do it correctly.
+It turns out that such aim is not so easy to achieve.
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
 % For example, the Rust regex engine claims to be linear,
 % but does not support lookarounds and back-references.
 % The GoLang regex library does not support over 1000 repetitions.
 % Java and Python both support back-references, but shows
 %, whereas $\NFA$s usually run into trouble
 %on the second phase.
 \section{Error-prone POSIX Implementations}
-The problems with practical implementations
+When there are multiple ways of matching a string
-of reare not limited to slowness on certain
+with a regular expression, a matcher needs to
-cases.
+disambiguate.
-Another thing about these libraries is that there
+The standard for which particular match to pick
-is no correctness guarantee.
+is called the disambiguation strategy.
-In some cases, they either fail to generate a lexing result when there exists a match,
+The more intuitive strategy is called POSIX,
+which always chooses the longest initial match.
+An alternative strategy would be greedy matches,
+which always ends a sub-match as early as possible.
+The POSIX standard is widely adopted in many operating systems.
+However, many implementations (including the C libraries
+used by Linux and OS X distributions) contain bugs
+or do not meet the specification they claim to adhere to.
+In some cases, they either fail to generate a lexing
+result when there exists a match,
 or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be the regex
+A concrete example would be the regex given by \cite{fowler2003}
 \begin{center}
-	$(aba + ab + a)* \text{and the string} ababa$
+	$(aba + ab + a)^* \text{and the string} ababa$
 \end{center}
 The correct $\POSIX$ match for the above would be
 with the entire string $ababa$,
 split into two Kleene star iterations, $[ab] [aba]$ at positions
 $[0, 2), [2, 5)$
 respectively.
 But trying this out in regex101\parencite{regex101}
 with different language engines would yield
 the same two fragmented matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
+Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
-Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
+commented that most regex libraries are not
 correctly implementing the POSIX (maximum-munch)
 rule of regular expression matching.
 As Grathwohl\parencite{grathwohl2014crash} wrote,
 \begin{quote}
-	The POSIX strategy is more complicated than the
+	``The POSIX strategy is more complicated than the
 	greedy because of the dependence on information about
-	the length of matched strings in the various subexpressions.
+	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
-To summarise the above, regular expressions are important.
+The implementation complexity of POSIX rules also come from
-They are popular and programming languages' library functions
+the specification being not very clear.
-for them are very fast on non-catastrophic cases.
+There are many informal summaries of this disambiguation
-But there are problems with current practical implementations.
+strategy, which are often quite long and delicate.
-First thing is that the running time might blow up.
+For example Kuklewicz described the POSIX rule as
-The second problem is that they might be error-prone on certain
+\begin{quote}
-very simple cases.
+	``
-In the next part of the chapter, we will look into reasons why
+	\begin{itemize}
-certain regex engines are running horribly slow on the "catastrophic"
+		\item
-cases and propose a solution that addresses both of these problems
+regular expressions (REs) take the leftmost starting match, and the longest match starting there
-based on Brzozowski and Sulzmann and Lu's work.
+earlier subpatterns have leftmost-longest priority over later subpatterns\\
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
-\subsection{Different Phases of a Matching/Lexing Algorithm}
+\item
+REs have right associative concatenation which can be changed with parenthesis\\
+\item
-Most lexing algorithms can be roughly divided into
+parenthesized subexpressions return the match from their last usage\\
-two phases during its run.
+\item
-The first phase is the "construction" phase,
+text of component subexpressions must be contained in the text of the
-in which the algorithm builds some
+higher-level subexpressions\\
-suitable data structure from the input regex $r$, so that
+\item
-it can be easily operated on later.
+if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
-We denote
+\item
-the time cost for such a phase by $P_1(r)$.
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\''
-The second phase is the lexing phase, when the input string
+\end{itemize}
-$s$ is read and the data structure
+\end{quote}
-representing that regex $r$ is being operated on.
+The text above
-We represent the time
+is trying to capture something very precise,
-it takes by $P_2(r, s)$.\\
+and is crying out for formalising.
-For $\mathit{DFA}$,
-we have $P_2(r, s) = O( |s| )$,
-because we take at most $|s|$ steps,
+%\subsection{Different Phases of a Matching/Lexing Algorithm}
-and each step takes
+%
-at most one transition--
+%
-a deterministic-finite-automata
+%Most lexing algorithms can be roughly divided into
-by definition has at most one state active and at most one
+%two phases during its run.
-transition upon receiving an input symbol.
+%The first phase is the "construction" phase,
-But unfortunately in the  worst case
+%in which the algorithm builds some
-$P_1(r) = O(exp^{|r|})$. An example will be given later.
+%suitable data structure from the input regex $r$, so that
-For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
+%it can be easily operated on later.
-expressions like $r^n$ into
+%We denote
-\[
+%the time cost for such a phase by $P_1(r)$.
-	\underbrace{r \cdots r}_{\text{n copies of r}}.
+%The second phase is the lexing phase, when the input string
-\]
+%$s$ is read and the data structure
-The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
+%representing that regex $r$ is being operated on.
-On the other hand, if backtracking is used, the worst-case time bound bloats
+%We represent the time
-to $|r| * 2^{|s|}$.
+%it takes by $P_2(r, s)$.\\
-%on the input
+%For $\mathit{DFA}$,
-%And when calculating the time complexity of the matching algorithm,
+%we have $P_2(r, s) = O( |s| )$,
-%we are assuming that each input reading step requires constant time.
+%because we take at most $|s|$ steps,
-%which translates to that the number of
+%and each step takes
-%states active and transitions taken each time is bounded by a
+%at most one transition--
-%constant $C$.
+%a deterministic-finite-automata
-%But modern  regex libraries in popular language engines
+%by definition has at most one state active and at most one
-% often want to support much richer constructs than just
+%transition upon receiving an input symbol.
-% sequences and Kleene stars,
+%But unfortunately in the  worst case
-%such as negation, intersection,
+%$P_1(r) = O(exp^{|r|})$. An example will be given later.
-%bounded repetitions and back-references.
+%For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
-%And de-sugaring these "extended" regular expressions
+%expressions like $r^n$ into
-%into basic ones might bloat the size exponentially.
+%\[
-%TODO: more reference for exponential size blowup on desugaring.
+%	\underbrace{r \cdots r}_{\text{n copies of r}}.
+%\]
-\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
+%The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
+%On the other hand, if backtracking is used, the worst-case time bound bloats
+%to $|r| * 2^{|s|}$.
-The good things about $\mathit{DFA}$s is that once
+%%on the input
-generated, they are fast and stable, unlike
+%%And when calculating the time complexity of the matching algorithm,
-backtracking algorithms.
+%%we are assuming that each input reading step requires constant time.
-However, they do not scale well with bounded repetitions.
+%%which translates to that the number of
+%%states active and transitions taken each time is bounded by a
-\subsubsection{Problems with Bounded Repetitions}
+%%constant $C$.
+%%But modern  regex libraries in popular language engines
+%% often want to support much richer constructs than just
+%% sequences and Kleene stars,
+%%such as negation, intersection,
+%%bounded repetitions and back-references.
+%%And de-sugaring these "extended" regular expressions
+%%into basic ones might bloat the size exponentially.
+%%TODO: more reference for exponential size blowup on desugaring.
+%
+%\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
+%
+%
+%The good things about $\mathit{DFA}$s is that once
+%generated, they are fast and stable, unlike
-So we have practical implementations
+%backtracking algorithms.
-on regular expression matching/lexing which are fast
+%However, they do not scale well with bounded repetitions.
-but do not come with any guarantees that it will not grind to a halt
+%
-or give wrong answers.
+%\subsubsection{Problems with Bounded Repetitions}
-Our goal is to have a regex lexing algorithm that comes with
+%
+%
+To summarise, we need regex libraries that are both fast
+and correct.
+And that correctness needs to be built on a precise
+model of what POSIX disambiguation is.
+We propose a solution that addresses both problems
+based on Brzozowski, Sulzmann and Lu and Ausaf and Urban's work.
+The end result is a regular expression lexing algorithm that comes with
 \begin{itemize}
 \item
-proven correctness
+a proven correctness theorem according to POSIX specification
+given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
 \item
-proven non-catastrophic properties
+a proven property saying that the algorithm's internal data structure will
+remain finite,
 \item
-easy extensions to
+and extension to
-constructs like
+the bounded repetitions construct with the correctness and finiteness property
-bounded repetitions, negation,  lookarounds, and even back-references.
+maintained.
 \end{itemize}
 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
 We propose Brzozowski derivatives on regular expressions as
 a solution to this.

changeset 607	e6fc9b72c0e3
parent 606	99b530103464
child 608	37b6fd310a16