diff -r 16d67f9c07d4 -r ed53ce26ecb6 ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sun Sep 25 01:40:12 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Wed Sep 28 01:28:01 2022 +0100
@@ -406,14 +406,16 @@ approach (Brzozowski derivatives and formal proofs).
-\section{Terminology, and the Problem with Bounded Repetitions}
+\section{Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers have of course been studied for many, many years.
-Theoretical results in automata theory says
+Theoretical results in automata theory say
that basic regular expression matching should be linear
-w.r.t the input, provided that the regular expression
-$r$ had been pre-processed and turned into a
-deterministic finite automata (DFA).
+w.r.t.\ the input.
+This assumes that the regular expression
+$r$ was pre-processed and turned into a
+deterministic finite automaton (DFA) before matching,
+a construction that could be exponential in the size of $r$\cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one below, involving only characters, alternatives, sequences, and Kleene stars:
@@ -424,133 +426,96 @@ however, support richer constructs such as bounded repetitions and back-references.
-The syntax and expressive power of those
-matching engines
-make ``regular expressions'' quite different from
-their original meaning in the formal languages
-theory.
-To differentiate, people tend to use the word \emph{regex} to refer
-those expressions with richer constructs, and regular expressions
-for the more traditional meaning.
-For example, the PCRE standard (Peral Compatible Regular Expressions)
-is such a regex syntax standard.
-We follow this convention in this thesis.
-We aim to support all the popular features of regexes in the future,
+To differentiate, people use the word \emph{regex} to refer
+to those expressions with richer constructs while reserving the
+term \emph{regular expression}
+for the more traditional meaning in formal languages theory.
+We follow this convention
+in this thesis.
+In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at regular expressions.

-\subsection{A Little Introduction to Regexes: Bounded Repetitions
-and Back-references}
+
+
+%Most modern regex libraries
+%the so-called PCRE standard (Perl Compatible Regular Expressions)
+%has the back-references
Regexes come with a lot of constructs
-that makes it more convenient for
+beyond the basic ones
+that make it more convenient for
programmers to write regular expressions.
+Depending on the type of construct,
+matching and lexing with them
+becomes harder to different degrees.
Some of those constructs are syntactic sugars that are simply shorthand notations
-that save the programmers a few keystrokes,
-for example the
+that save the programmers a few keystrokes.
+These will not cause trouble for regex libraries.
+
+\noindent
+For example, the
non-binary alternative involving three or more choices:
\[
-	r = (a | b | c | \ldots | z)^*
+	(a | b | c) \stackrel{means}{=} ((a + b) + c)
\]
-, the range operator $-$ which means the alternative
-of all characters between its operands:
-\[
-	r = [0-9a-zA-Z] \; \text{(all alpha-numeric characters)}
-\]
-and the
-wildcard character $.$ meaning any character
+the range operator $-$ used to express the alternative
+of all characters between its operands in a concise way:
\[
-	.
= [0-9a-zA-Z+-()*&\ldots] - + [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)} \] -Some of those constructs do make the expressions much -more compact, and matching time could be greatly increase. -For example, $r^{n}$ is exponentially more concise compared with -the expression $\underbrace{r}_\text{n copies of r}$, -and therefore a naive algorithm that simply unfolds -$r^{n}$ into $\underbrace{r}_\text{n copies of r}$ -will suffer exponential runtime increase. -Some constructs can even raise the expressive -power to the non-regular realm, for example -the back-references. +and the +wildcard character $.$ used to refer to any single character: +\[ + . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] +\] -bounded repetitions, as we have discussed in the -previous section. -This super-linear behaviour of the -regex matching engines we have? -One of the most recent work in the context of lexing -is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}. -This is relevant work and we will compare later on -our derivative-based matcher we are going to present. -There is also some newer work called -Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead. -For that the problem is dealing with the bounded regular expressions of the form -$r^{n}$ where $n$ is a constant specifying that $r$ must repeat -exactly $n$ times. The Verbatim++ lexer becomes excruciatingly slow -on the bounded repetitions construct. +\noindent +\subsection{Bounded Repetitions} +Some of those constructs do make the expressions much +more compact. +For example, the bounded regular expressions +(where $n$ and $m$ are constant natural numbers) +$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$, +defined as +\begin{center} + \begin{tabular}{lcl} + $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\ + $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\ + $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ + $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ + \end{tabular} +\end{center} +are exponentially smaller compared with +their unfolded form: for example $r^{\{n\}}$ +as opposed to +\[ + \underbrace{r\ldots r}_\text{n copies of r}. +\] +%Therefore, a naive algorithm that simply unfolds +%them into their desugared forms +%will suffer from at least an exponential runtime increase. -In the work reported in \cite{CSL2022} and here, we add better support -for them. -The other repetition constructs include -$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which specify -intervals for how many times $r$ should match. -$r^{\ldots m}$ means repeating -at most $m$ times, $r^{n\ldots}$ means repeating at least $n$ times and -$r^{n\ldots m}$ means repeating between $n$ and $m$ times. -The results presented in this thesis extend straightforwardly to them -too. -Their formal definitions will be given later. - -Bounded repetitions are important because they -tend to occur often in practical use, for example in RegExLib, -Snort, as well as in XML Schema definitions (XSDs). -According to Bj\"{o}rklund et al \cite{xml2015}, -bounded regular expressions occur frequently in the latter and can have -counters up to ten million. 
An example XSD with a large counter they gave
-was:
-\begin{verbatim}
-
-
-
-
-\end{verbatim}
-This can be seen as the expression
-$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
-regular expressions
-satisfying certain constraints (such as
-satisfying the floating point number format).
The problem here is that tools based on the classic
notion of automata need to expand $r^{n}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
+Implementations using $\DFA$s will
+either become excruciatingly slow
+(for example Verbatim++\cite{Verbatimpp}) or get
+out-of-memory errors (for example $\mathit{LEX}$ and
+$\mathit{JFLEX}$\footnote{These are lexer generators
+in C and JAVA that generate $\mathit{DFA}$-based
+lexers. The user provides a set of regular expressions
+and configurations to them, and then
+gets an output program encoding a minimized $\mathit{DFA}$
+that can be compiled and run.
+When given the above countdown regular expression,
+a small $n$ (a few dozen) would already result in a
+determinised automaton
+with millions of states.}) under large counters.
A classic example is the regular expression $(a+b)^* a (a+b)^{n}$
-where the minimal DFA requires at least $2^{n+1}$ states (more on this
-later).
-Therefore regular expressions matching libraries that rely on the classic
-notion of DFAs often impose adhoc limits
-for bounded regular expressions:
-For example, in the regular expression matching library in the Go
-language the regular expression $a^{1001}$ is not permitted, because no counter
-can be above 1000, and in the built-in Rust regular expression library
-expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
-for being too big. These problems can of course be solved in matching algorithms where
-automata go beyond the classic notion and for instance include explicit
-counters \cite{Turo_ov__2020}.
-The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
-straightforwardly extended to deal with bounded regular expressions
-and moreover the resulting code still consists of only simple
-recursive functions and inductive datatypes.
-Finally, bounded regular expressions do not destroy our finite
-boundedness property, which we shall prove later on.
-
-\section{Back-references and The Terminology Regex}
-
-
-Bounded repetitions, usually written in the form
-$r^{\{c\}}$ (where $c$ is a constant natural number),
-denotes a regular expression accepting strings
-that can be divided into $c$ substrings, where each
-substring is in $r$.
-For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
+where the minimal DFA requires at least $2^{n+1}$ states.
+For example, when $n$ is equal to 2, an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
@@ -565,7 +530,9 @@ (q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
-The red states are "countdown states" which counts down
+which requires at least $2^3$ states
+for its subset construction.\footnote{The
+red states are ``countdown states'' which count down
the number of characters needed in addition to the current string
to make a successful match.
For example, state $q_1$ indicates a match that has @@ -588,166 +555,103 @@ Generalizing this to regular expressions with larger bounded repetitions number, we have that regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s -would require at least $2^{n+1}$ states, if $r$ contains +would require at least $2^{n+1}$ states, if $r$ itself contains more than 1 string. This is to represent all different -scenarios which "countdown" states are active. -For those regexes, tools that uses $\DFA$s will get -out of memory errors. - - -The time cost of regex matching algorithms in general -involve two different phases, and different things can go differently wrong on -these phases. -$\DFA$s usually have problems in the first (construction) phase -, whereas $\NFA$s usually run into trouble -on the second phase. - +scenarios which "countdown" states are active.} -\section{Error-prone POSIX Implementations} -The problems with practical implementations -of reare not limited to slowness on certain -cases. -Another thing about these libraries is that there -is no correctness guarantee. -In some cases, they either fail to generate a lexing result when there exists a match, -or give results that are inconsistent with the $\POSIX$ standard. -A concrete example would be the regex -\begin{center} - $(aba + ab + a)* \text{and the string} ababa$ -\end{center} -The correct $\POSIX$ match for the above would be -with the entire string $ababa$, -split into two Kleene star iterations, $[ab] [aba]$ at positions -$[0, 2), [2, 5)$ -respectively. -But trying this out in regex101\parencite{regex101} -with different language engines would yield -the same two fragmented matches: $[aba]$ at $[0, 3)$ -and $a$ at $[4, 5)$. - -Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not -correctly implementing the POSIX (maximum-munch) -rule of regular expression matching. - -As Grathwohl\parencite{grathwohl2014crash} wrote, -\begin{quote} - The POSIX strategy is more complicated than the - greedy because of the dependence on information about - the length of matched strings in the various subexpressions. -\end{quote} -%\noindent -To summarise the above, regular expressions are important. -They are popular and programming languages' library functions -for them are very fast on non-catastrophic cases. -But there are problems with current practical implementations. -First thing is that the running time might blow up. -The second problem is that they might be error-prone on certain -very simple cases. -In the next part of the chapter, we will look into reasons why -certain regex engines are running horribly slow on the "catastrophic" -cases and propose a solution that addresses both of these problems -based on Brzozowski and Sulzmann and Lu's work. - - - -\subsection{Different Phases of a Matching/Lexing Algorithm} +One of the most recent work in the context of lexing +%with this issue +is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}. +This is relevant work and we will compare later on +our derivative-based matcher we are going to present. +There is also some newer work called +Verbatim++\cite{Verbatimpp}, which does not use derivatives, +but deterministic finite automaton instead. +%An example that gives problem to automaton approaches would be +%the regular expression $(a|b)^*a(a|b)^{\{n\}}$. +%It requires at least $2^{n+1}$ states to represent +%as a DFA. -Most lexing algorithms can be roughly divided into -two phases during its run. 
-The first phase is the "construction" phase, -in which the algorithm builds some -suitable data structure from the input regex $r$, so that -it can be easily operated on later. -We denote -the time cost for such a phase by $P_1(r)$. -The second phase is the lexing phase, when the input string -$s$ is read and the data structure -representing that regex $r$ is being operated on. -We represent the time -it takes by $P_2(r, s)$.\\ -For $\mathit{DFA}$, -we have $P_2(r, s) = O( |s| )$, -because we take at most $|s|$ steps, -and each step takes -at most one transition-- -a deterministic-finite-automata -by definition has at most one state active and at most one -transition upon receiving an input symbol. -But unfortunately in the worst case -$P_1(r) = O(exp^{|r|})$. An example will be given later. -For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold -expressions like $r^n$ into -\[ - \underbrace{r \cdots r}_{\text{n copies of r}}. -\] -The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack. -On the other hand, if backtracking is used, the worst-case time bound bloats -to $|r| * 2^{|s|}$. -%on the input -%And when calculating the time complexity of the matching algorithm, -%we are assuming that each input reading step requires constant time. -%which translates to that the number of -%states active and transitions taken each time is bounded by a -%constant $C$. -%But modern regex libraries in popular language engines -% often want to support much richer constructs than just -% sequences and Kleene stars, -%such as negation, intersection, -%bounded repetitions and back-references. -%And de-sugaring these "extended" regular expressions -%into basic ones might bloat the size exponentially. -%TODO: more reference for exponential size blowup on desugaring. +Bounded repetitions are very important because they +tend to occur a lot in practical use. +For example in the regex library RegExLib, +the rules library of Snort, as well as in XML Schema definitions (XSDs). +According to Bj\"{o}rklund et al \cite{xml2015}, +more than half of the +XSDs they found have bounded regular expressions in them. +Often the counters are quite large, the largest up to ten million. +An example XSD they gave +was: +\begin{verbatim} + + + + +\end{verbatim} +This can be seen as the expression +$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves +regular expressions +satisfying certain constraints (such as +satisfying the floating point number format). -\subsection{Why $\mathit{DFA}s$ can be slow in the first phase} +It is therefore quite unsatisfying that +some regular expressions matching libraries +impose adhoc limits +for bounded regular expressions: +For example, in the regular expression matching library in the Go +language the regular expression $a^{1001}$ is not permitted, because no counter +can be above 1000, and in the built-in Rust regular expression library +expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message +for being too big. These problems can of course be solved in matching algorithms where +automata go beyond the classic notion and for instance include explicit +counters \cite{Turo_ov__2020}. + +In the work reported in \cite{CSL2022} and here, +we add better support using derivatives +for bounded regular expressions $r^{\{n\}}$. +The results +extend straightforwardly to +repetitions with an interval such as +$r^{\{n\ldots m\}}$. 
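To give a first flavour of why derivatives handle bounded repetitions gracefully, below is a minimal sketch in Scala (the datatype and function names are chosen purely for illustration; this is not the implementation developed in this thesis). The key point is that the derivative of $r^{\{n\}}$ only decrements the counter, instead of unfolding the repetition into $n$ concatenated copies of $r$:
\begin{verbatim}
// illustrative sketch only, not the thesis's actual code
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp
case class NTIMES(r: Rexp, n: Int) extends Rexp   // r{n}

def nullable(r: Rexp): Boolean = r match {
  case ZERO          => false
  case ONE           => true
  case CHAR(_)       => false
  case ALT(r1, r2)   => nullable(r1) || nullable(r2)
  case SEQ(r1, r2)   => nullable(r1) && nullable(r2)
  case STAR(_)       => true
  case NTIMES(r1, n) => n == 0 || nullable(r1)
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
  // the counter is decremented, never unfolded
  case NTIMES(r1, n) =>
    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))

// matches(NTIMES(CHAR('a'), 3), "aaa")  returns true
// matches(NTIMES(CHAR('a'), 3), "aa")   returns false
\end{verbatim}
Because the counter is kept as a number and only ever decremented, the bounded repetition never has to be expanded into $n$ copies of $r$, which is what makes the derivative-based treatment of $r^{\{n\}}$ (and of $r^{\{n\ldots m\}}$) attractive.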
+The merit of Brzozowski derivatives +on this problem is that +it can be naturally extended to support bounded repetitions. +Moreover these extensions are still made up of only +inductive datatypes and recursive functions, +making it handy to deal with using theorem provers. +%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be +%straightforwardly extended to deal with bounded regular expressions +%and moreover the resulting code still consists of only simple +%recursive functions and inductive datatypes. +Finally, bounded regular expressions do not destroy our finite +boundedness property, which we shall prove later on. -The good things about $\mathit{DFA}$s is that once -generated, they are fast and stable, unlike -backtracking algorithms. -However, they do not scale well with bounded repetitions. - -\subsubsection{Problems with Bounded Repetitions} - - - - - -\subsubsection{Tools that uses $\mathit{DFA}$s} -%TODO:more tools that use DFAs? -$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools -in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based -lexers. The user provides a set of regular expressions -and configurations to such lexer generators, and then -gets an output program encoding a minimized $\mathit{DFA}$ -that can be compiled and run. -When given the above countdown regular expression, -a small number $n$ would result in a determinised automata -with millions of states. +\subsection{Back-References} +%Some constructs can even raise the expressive +%power to the non-regular realm, for example +%the back-references. +%bounded repetitions, as we have discussed in the +%previous section. +%This super-linear behaviour of the +%regex matching engines we have? +The syntax and expressive power of regexes +can go quickly beyond the regular language hierarchy +with back-references. +%\section{Back-references and The Terminology Regex} -For this reason, regex libraries that support -bounded repetitions often choose to use the $\mathit{NFA}$ -approach. - - - - - - - +%When one constructs an $\NFA$ out of a regular expression +%there is often very little to be done in the first phase, one simply +%construct the $\NFA$ states based on the structure of the input regular expression. -\subsection{Why $\mathit{NFA}$s can be slow in the second phase} -When one constructs an $\NFA$ out of a regular expression -there is often very little to be done in the first phase, one simply -construct the $\NFA$ states based on the structure of the input regular expression. - -In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways: -one by keeping track of all active states after consuming -a character, and update that set of states iteratively. -This can be viewed as a breadth-first-search of the $\mathit{NFA}$ -for a path terminating -at an accepting state. +%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways: +%one by keeping track of all active states after consuming +%a character, and update that set of states iteratively. +%This can be viewed as a breadth-first-search of the $\mathit{NFA}$ +%for a path terminating +%at an accepting state. Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this type of $\mathit{NFA}$ simulation and guarantees a linear runtime in terms of input string length. 
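The $\mathit{NFA}$ simulation alluded to here keeps, after every input character, the set of all states the $\mathit{NFA}$ could currently be in, rather than exploring one path at a time and backtracking on failure. A minimal sketch in Scala (using a small, hand-written example $\mathit{NFA}$; this is hypothetical and not the actual code of the Go or Rust engines) is:
\begin{verbatim}
// set-of-states NFA simulation; illustrative only
case class NFA(start: Set[Int],
               accept: Set[Int],
               delta: Map[(Int, Char), Set[Int]])

def accepts(nfa: NFA, s: String): Boolean = {
  // after each character we keep the whole set of active states
  val finalStates = s.foldLeft(nfa.start) { (states, c) =>
    states.flatMap(q => nfa.delta.getOrElse((q, c), Set.empty[Int]))
  }
  finalStates.exists(nfa.accept)
}

// example NFA for (a|b)*a : state 0 is the start, state 1 accepts
val nfa = NFA(
  start  = Set(0),
  accept = Set(1),
  delta  = Map((0, 'a') -> Set(0, 1), (0, 'b') -> Set(0))
)
// accepts(nfa, "ba") returns true, accepts(nfa, "bb") returns false
\end{verbatim}
Since the set of active states can never contain more states than the $\mathit{NFA}$ has in total, each of the $|s|$ steps does a bounded amount of work, which is where the linear (in the input) runtime guarantee comes from.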
@@ -838,6 +742,134 @@ %TODO: verify the fact Rust does not allow 1000+ reps
+
+
+%The time cost of regex matching algorithms in general
+%involve two different phases, and different things can go differently wrong on
+%these phases.
+%$\DFA$s usually have problems in the first (construction) phase
+%, whereas $\NFA$s usually run into trouble
+%on the second phase.
+
+
+\section{Error-prone POSIX Implementations}
+The problems with practical implementations
+of regexes are not limited to slowness on certain
+cases.
+Another problem with these libraries is that they
+provide no correctness guarantee.
+In some cases, they either fail to generate a lexing result when there exists a match,
+or give results that are inconsistent with the $\POSIX$ standard.
+A concrete example would be the regex
+\begin{center}
+	$(aba + ab + a)^* \;\text{ and the string }\; ababa$
+\end{center}
+The correct $\POSIX$ match for the above would be
+with the entire string $ababa$,
+split into two Kleene star iterations, $[ab] [aba]$ at positions
+$[0, 2), [2, 5)$
+respectively.
+But trying this out in regex101\parencite{regex101}
+with different language engines would yield
+the same two fragmented matches: $[aba]$ at $[0, 3)$
+and $a$ at $[4, 5)$.
+
+Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
+correctly implementing the POSIX (maximum-munch)
+rule of regular expression matching.
+
+As Grathwohl\parencite{grathwohl2014crash} wrote,
+\begin{quote}
+	The POSIX strategy is more complicated than the
+	greedy because of the dependence on information about
+	the length of matched strings in the various subexpressions.
+\end{quote}
+%\noindent
+To summarise the above, regular expressions are important.
+They are popular, and the library functions for them
+in programming languages are very fast on non-catastrophic cases.
+But there are problems with current practical implementations.
+The first problem is that the running time might blow up.
+The second problem is that they might be error-prone on certain
+very simple cases.
+In the next part of the chapter, we will look into reasons why
+certain regex engines run horribly slowly on the ``catastrophic''
+cases and propose a solution that addresses both of these problems
+based on Brzozowski and Sulzmann and Lu's work.
+
+
+\subsection{Different Phases of a Matching/Lexing Algorithm}
+
+
+Most lexing algorithms can be roughly divided into
+two phases during their run.
+The first phase is the ``construction'' phase,
+in which the algorithm builds some
+suitable data structure from the input regex $r$, so that
+it can be easily operated on later.
+We denote
+the time cost for such a phase by $P_1(r)$.
+The second phase is the lexing phase, when the input string
+$s$ is read and the data structure
+representing that regex $r$ is being operated on.
+We represent the time
+it takes by $P_2(r, s)$.\\
+For $\mathit{DFA}$s,
+we have $P_2(r, s) = O( |s| )$,
+because we take at most $|s|$ steps,
+and each step takes
+at most one transition, since
+a deterministic finite automaton
+by definition has at most one state active and at most one
+transition upon receiving an input symbol.
+But unfortunately in the worst case
+$P_1(r) = O(2^{|r|})$. An example will be given later.
+For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
+expressions like $r^n$ into
+\[
+	\underbrace{r \cdots r}_{\text{n copies of r}}.
+\]
+The time $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
+On the other hand, if backtracking is used, the worst-case time bound bloats
+to $|r| \cdot 2^{|s|}$.
+%on the input
+%And when calculating the time complexity of the matching algorithm,
+%we are assuming that each input reading step requires constant time.
+%which translates to that the number of
+%states active and transitions taken each time is bounded by a
+%constant $C$.
+%But modern regex libraries in popular language engines
+% often want to support much richer constructs than just
+% sequences and Kleene stars,
+%such as negation, intersection,
+%bounded repetitions and back-references.
+%And de-sugaring these "extended" regular expressions
+%into basic ones might bloat the size exponentially.
+%TODO: more reference for exponential size blowup on desugaring.
+
+\subsection{Why $\mathit{DFA}$s can be slow in the first phase}
+
+
+The good thing about $\mathit{DFA}$s is that once
+generated, they are fast and stable, unlike
+backtracking algorithms.
+However, they do not scale well with bounded repetitions.
+
+\subsubsection{Problems with Bounded Repetitions}
+
+
+
+
+
+
+
+
+
+
+
+
+
+
So we have practical implementations of
regular expression matching/lexing which are fast
but do not come with any guarantees that it will not grind to a halt