diff -r 99b530103464 -r e6fc9b72c0e3 ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Sep 29 14:05:42 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Sep 30 01:47:33 2022 +0100
@@ -577,7 +577,10 @@
 Bounded repetitions are very important because they
 tend to occur a lot in practical use.
 For example in the regex library RegExLib,
-the rules library of Snort, as well as in XML Schema definitions (XSDs).
+the rules library of Snort \cite{Snort1999}\footnote{
+Snort is a network intrusion detection (NID) tool
+for monitoring network traffic.},
+as well as in XML Schema definitions (XSDs).
 According to Bj\"{o}rklund et al \cite{xml2015},
 more than half of the
 XSDs they found have bounded regular expressions in them.
@@ -707,9 +710,6 @@
 The backtracking method is employed in regex libraries
 that support \emph{back-references}, for example
 in Java and Python.
-The syntax and expressive power of regexes
-go beyond the regular language hierarchy
-with back-references.
 %\section{Back-references and The Terminology Regex}

 %When one constructs an $\NFA$ out of a regular expression
@@ -766,15 +766,46 @@
 A concrete example
 for back-references would be
 \begin{center}
-$((a|b|c|\ldots|z)^*)\backslash 1$,
+$(.^*)\backslash 1$,
 \end{center}
 which would match
 strings that can be split into two identical halves,
-for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
-Back-reference is a regex construct
-that programmers found useful, but not exactly
-regular any more.
-In fact, that allows the regex construct to express
+for example $\mathit{foofoo}$ and $\mathit{ww}$.
+Note that this is different from
+repeating the sub-expression verbatim, as in
+\begin{center}
+	$(.^*)(.^*)$,
+\end{center}
+which does not impose any restrictions on what strings the second
+sub-expression $.^*$
+might match.
+Another example of a back-reference would be
+\begin{center}
+$(.)(.)\backslash 2\backslash 1$
+\end{center}
+which matches four-character palindromes
+like $abba$ and $xyyx$.
+
+Back-references are a regex construct
+that programmers find quite useful.
+According to Becchi and Crawley~\cite{Becchi08},
+6\% of Snort rules (up until 2008) make use of them.
+A common use of back-references
+is to describe well-formed HTML documents,
+where back-references are handy for expressing
+a pair of matching opening and closing tags like
+\begin{center}
+	$\langle html \rangle \ldots \langle / html \rangle$
+\end{center}
+A regex describing such a format
+could be
+\begin{center}
+	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
+\end{center}
+Despite being useful, back-references take the syntax and
+expressive power of regexes
+beyond the regular language hierarchy.
+In fact, they allow regexes to express
 languages that cannot be contained in context-free
 languages either.
 For example, the back-reference $(a^*)b\backslash 1 b \backslash 1$
@@ -783,20 +814,37 @@
 Such a language is contained in the context-sensitive hierarchy
 of formal languages.
 Solving the back-reference expressions matching problem
-is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
+is NP-complete\parencite{alfred2014algorithms}.
+A non-backtracking,
 efficient solution is not known to exist.
-%TODO:read a bit more about back reference algorithms
+Regex libraries supporting back-references, such as PCRE, therefore have to
+revert to a depth-first search algorithm that backtracks.
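+
+To make back-references concrete, the following small Scala snippet (an
+illustration only, using the JVM's backtracking \texttt{java.util.regex}
+engine rather than anything developed in this thesis) tries out the two
+example regexes from above:
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object BackrefDemo {
+  def main(args: Array[String]): Unit = {
+    // (.*)\1 -- a group followed by a repetition of whatever that group matched
+    val doubled = Pattern.compile("(.*)\\1")
+    println(doubled.matcher("foofoo").matches()) // true:  "foo" ++ "foo"
+    println(doubled.matcher("foobar").matches()) // false: the two halves differ
+
+    // (.)(.)\2\1 -- four-character palindromes such as "abba"
+    val palindrome = Pattern.compile("(.)(.)\\2\\1")
+    println(palindrome.matcher("abba").matches()) // true
+    println(palindrome.matcher("abab").matches()) // false
+  }
+}
+\end{verbatim}
+The engine finds these matches by trying candidate strings for each group and
+backtracking whenever a later back-reference fails, which is exactly the
+depth-first search mentioned above.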
+The unreasonable part is that even in the case of
+regexes not involving back-references, there is still
+a (non-negligible) chance that they backtrack super-linearly.

-It seems that languages like Java and Python made the trade-off
-to support back-references at the expense of having to backtrack,
-even in the case of regexes not involving back-references.\\
+\subsection{Summary of the Catastrophic Backtracking Problem}
 Summing these up, we can categorise existing
-practical regex libraries into the ones  with  linear
-time guarantees like Go and Rust, which impose restrictions
+practical regex libraries into two kinds:
+(i) the ones with linear
+time guarantees, like Go and Rust. The cost is that
+they impose restrictions
 on the user input (not allowing back-references,
-bounded repetitions cannot exceed 1000 etc.), and ones
- that allows the programmer much freedom, but grinds to a halt
- in some non-negligible portion of cases.
+or bounded repetitions that exceed a fixed counter limit, etc.);
+(ii) the ones
+that allow large bounded regular expressions and back-references
+at the expense of using a backtracking algorithm.
+They can grind to a halt
+on some very simple cases, posing the risk of
+a ReDoS attack.
+
+
+We would like to have regex engines that can
+deal with the regular part of regexes (e.g.
+bounded repetitions) more
+efficiently, and we want to make sure
+that they do so correctly.
+It turns out that this aim is not so easy to achieve.
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
 % For example, the Rust regex engine claims to be linear,
 % but does not support lookarounds and back-references.
@@ -819,16 +867,25 @@
 \section{Error-prone POSIX Implementations}
-The problems with practical implementations
-of reare not limited to slowness on certain
-cases.
-Another thing about these libraries is that there
-is no correctness guarantee.
-In some cases, they either fail to generate a lexing result when there exists a match,
+When there are multiple ways of matching a string
+with a regular expression, a matcher needs to
+disambiguate.
+The rule for deciding which particular match to pick
+is called the disambiguation strategy.
+The more intuitive strategy is called POSIX,
+which always chooses the longest initial match.
+An alternative strategy is greedy matching,
+which always ends a sub-match as early as possible.
+The POSIX standard is widely adopted in many operating systems.
+However, many implementations (including the C libraries
+used by Linux and OS X distributions) contain bugs
+or do not meet the specification they claim to adhere to.
+In some cases, they either fail to generate a lexing
+result when there exists a match,
 or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be the regex
+A concrete example, given by Fowler \cite{fowler2003}, is the regex
 \begin{center}
-	$(aba + ab + a)* \text{and the string} ababa$
+	$(aba + ab + a)^*$ and the string $ababa$.
 \end{center}
 The correct $\POSIX$ match for the above
 would be the entire string $ababa$,
 but trying this regex with different language engines
 would yield
 the same two fragmented matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
-
-Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
+Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
+commented that most regex libraries are not
 correctly implementing the POSIX (maximum-munch)
 rule of regular expression matching.
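+
+The fragmentation can be observed with a small experiment (again only a
+sketch in Scala, using the JVM's \texttt{java.util.regex} engine, which
+follows a greedy rather than a POSIX strategy); a POSIX-compliant matcher
+would be expected to report the single match $ababa$ instead:
+\begin{verbatim}
+import scala.util.matching.Regex
+
+object PosixProbe {
+  def main(args: Array[String]): Unit = {
+    val r: Regex = "(aba|ab|a)*".r
+    // print every non-empty match together with its [start, end) interval
+    for (m <- r.findAllMatchIn("ababa") if m.matched.nonEmpty)
+      println(s"${m.matched} at [${m.start}, ${m.end})")
+    // a greedy engine reports the fragments "aba" at [0, 3) and "a" at [4, 5),
+    // whereas the longest (POSIX) match would be "ababa" at [0, 5)
+  }
+}
+\end{verbatim}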
-
 As Grathwohl\parencite{grathwohl2014crash} wrote,
 \begin{quote}
-	The POSIX strategy is more complicated than the
+	``The POSIX strategy is more complicated than the
 	greedy because of the dependence on information about
-	the length of matched strings in the various subexpressions.
+	the length of matched strings in the various subexpressions.''
 \end{quote}
 %\noindent
-To summarise the above, regular expressions are important.
-They are popular and programming languages' library functions
-for them are very fast on non-catastrophic cases.
-But there are problems with current practical implementations.
-First thing is that the running time might blow up.
-The second problem is that they might be error-prone on certain
-very simple cases.
-In the next part of the chapter, we will look into reasons why
-certain regex engines are running horribly slow on the "catastrophic"
-cases and propose a solution that addresses both of these problems
-based on Brzozowski and Sulzmann and Lu's work.
-
-
-\subsection{Different Phases of a Matching/Lexing Algorithm}
+The implementation complexity of the POSIX rules also comes from
+the fact that the specification is not very clear.
+There are many informal summaries of this disambiguation
+strategy, which are often quite long and delicate.
+For example, Kuklewicz described the POSIX rule as
+\begin{quote}
+	``
+	\begin{itemize}
+		\item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns\\
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
+\item
+REs have right associative concatenation which can be changed with parenthesis\\
+\item
+parenthesized subexpressions return the match from their last usage\\
+\item
+text of component subexpressions must be contained in the text of the
+higher-level subexpressions\\
+\item
+if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+\item
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\''
+\end{itemize}
+\end{quote}
+The text above
+is trying to capture something very precise,
+and is crying out for a formalisation.

-Most lexing algorithms can be roughly divided into
-two phases during its run.
-The first phase is the "construction" phase,
-in which the algorithm builds some
-suitable data structure from the input regex $r$, so that
-it can be easily operated on later.
-We denote
-the time cost for such a phase by $P_1(r)$.
-The second phase is the lexing phase, when the input string
-$s$ is read and the data structure
-representing that regex $r$ is being operated on.
-We represent the time
-it takes by $P_2(r, s)$.\\
-For $\mathit{DFA}$,
-we have $P_2(r, s) = O( |s| )$,
-because we take at most $|s|$ steps,
-and each step takes
-at most one transition--
-a deterministic-finite-automata
-by definition has at most one state active and at most one
-transition upon receiving an input symbol.
-But unfortunately in the worst case
-$P_1(r) = O(exp^{|r|})$. An example will be given later.
-For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
-expressions like $r^n$ into
-\[
-	\underbrace{r \cdots r}_{\text{n copies of r}}.
-\]
-The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
-On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^{|s|}$.
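+
+As a small illustration of the rule ``parenthesized subexpressions return the
+match from their last usage'' in the list above, consider the following
+sketch (in Scala, using the JVM's \texttt{java.util.regex} engine, whose
+capture semantics happen to agree with that particular rule):
+\begin{verbatim}
+import java.util.regex.Pattern
+
+object LastUsageDemo {
+  def main(args: Array[String]): Unit = {
+    // the starred group is used three times when matching "aaa";
+    // the reported capture comes from its last usage only
+    val m = Pattern.compile("(a)*").matcher("aaa")
+    if (m.matches()) println(m.group(1)) // prints "a", not "aaa"
+  }
+}
+\end{verbatim}
+A formal specification has to pin down each of these informal rules with at
+least this level of precision.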
-%on the input -%And when calculating the time complexity of the matching algorithm, -%we are assuming that each input reading step requires constant time. -%which translates to that the number of -%states active and transitions taken each time is bounded by a -%constant $C$. -%But modern regex libraries in popular language engines -% often want to support much richer constructs than just -% sequences and Kleene stars, -%such as negation, intersection, -%bounded repetitions and back-references. -%And de-sugaring these "extended" regular expressions -%into basic ones might bloat the size exponentially. -%TODO: more reference for exponential size blowup on desugaring. +%\subsection{Different Phases of a Matching/Lexing Algorithm} +% +% +%Most lexing algorithms can be roughly divided into +%two phases during its run. +%The first phase is the "construction" phase, +%in which the algorithm builds some +%suitable data structure from the input regex $r$, so that +%it can be easily operated on later. +%We denote +%the time cost for such a phase by $P_1(r)$. +%The second phase is the lexing phase, when the input string +%$s$ is read and the data structure +%representing that regex $r$ is being operated on. +%We represent the time +%it takes by $P_2(r, s)$.\\ +%For $\mathit{DFA}$, +%we have $P_2(r, s) = O( |s| )$, +%because we take at most $|s|$ steps, +%and each step takes +%at most one transition-- +%a deterministic-finite-automata +%by definition has at most one state active and at most one +%transition upon receiving an input symbol. +%But unfortunately in the worst case +%$P_1(r) = O(exp^{|r|})$. An example will be given later. +%For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold +%expressions like $r^n$ into +%\[ +% \underbrace{r \cdots r}_{\text{n copies of r}}. +%\] +%The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack. +%On the other hand, if backtracking is used, the worst-case time bound bloats +%to $|r| * 2^{|s|}$. +%%on the input +%%And when calculating the time complexity of the matching algorithm, +%%we are assuming that each input reading step requires constant time. +%%which translates to that the number of +%%states active and transitions taken each time is bounded by a +%%constant $C$. +%%But modern regex libraries in popular language engines +%% often want to support much richer constructs than just +%% sequences and Kleene stars, +%%such as negation, intersection, +%%bounded repetitions and back-references. +%%And de-sugaring these "extended" regular expressions +%%into basic ones might bloat the size exponentially. +%%TODO: more reference for exponential size blowup on desugaring. +% +%\subsection{Why $\mathit{DFA}s$ can be slow in the first phase} +% +% +%The good things about $\mathit{DFA}$s is that once +%generated, they are fast and stable, unlike +%backtracking algorithms. +%However, they do not scale well with bounded repetitions. +% +%\subsubsection{Problems with Bounded Repetitions} +% +% -\subsection{Why $\mathit{DFA}s$ can be slow in the first phase} - - -The good things about $\mathit{DFA}$s is that once -generated, they are fast and stable, unlike -backtracking algorithms. -However, they do not scale well with bounded repetitions. - -\subsubsection{Problems with Bounded Repetitions} - - - - - - - - - - - - - - -So we have practical implementations -on regular expression matching/lexing which are fast -but do not come with any guarantees that it will not grind to a halt -or give wrong answers. 
-Our goal is to have a regex lexing algorithm that comes with
+To summarise, we need regex libraries that are both fast
+and correct.
+That correctness needs to be built on a precise
+model of what POSIX disambiguation is.
+We propose a solution that addresses both problems,
+based on the work of Brzozowski, of Sulzmann and Lu, and of Ausaf and Urban.
+The end result is a regular expression lexing algorithm that comes with
 \begin{itemize}
 \item
-proven correctness
+a proven correctness theorem with respect to the POSIX specification
+given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
 \item
-proven non-catastrophic properties
+a proven property showing that the algorithm's internal data structures
+remain finite in size,
 \item
-easy extensions to
-constructs like
- bounded repetitions, negation, lookarounds, and even back-references.
+and an extension to
+the bounded repetitions construct, with the correctness and finiteness
+properties maintained.
 \end{itemize}

 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
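+
+As a preview of the approach developed in the rest of this thesis, the
+following is a minimal, self-contained sketch of Brzozowski derivatives and
+the matcher built from them (written in Scala for illustration; it follows
+the textbook definitions and is not taken verbatim from the formalisation):
+\begin{verbatim}
+// basic regular expressions
+sealed trait Rexp
+case object ZERO extends Rexp                 // matches nothing
+case object ONE extends Rexp                  // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+object Matcher {
+  // nullable(r): can r match the empty string?
+  def nullable(r: Rexp): Boolean = r match {
+    case ZERO        => false
+    case ONE         => true
+    case CHAR(_)     => false
+    case ALT(r1, r2) => nullable(r1) || nullable(r2)
+    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+    case STAR(_)     => true
+  }
+
+  // der(c, r): regular expression for the strings s such that c followed by s is in L(r)
+  def der(c: Char, r: Rexp): Rexp = r match {
+    case ZERO        => ZERO
+    case ONE         => ZERO
+    case CHAR(d)     => if (c == d) ONE else ZERO
+    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+    case SEQ(r1, r2) =>
+      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+      else SEQ(der(c, r1), r2)
+    case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
+  }
+
+  // matching: take derivatives for each character, then test nullability
+  def matches(r: Rexp, s: String): Boolean =
+    nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
+}
+
+// example: a(b*) matches "abb" but not "ba"
+// Matcher.matches(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb")  ==> true
+// Matcher.matches(SEQ(CHAR('a'), STAR(CHAR('b'))), "ba")   ==> false
+\end{verbatim}
+The lexing algorithm studied in this thesis extends this matcher so that it
+also constructs a value recording how the string was matched.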