ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 605 ed53ce26ecb6
parent 604 16d67f9c07d4
child 606 99b530103464
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sun Sep 25 01:40:12 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Wed Sep 28 01:28:01 2022 +0100
@@ -406,14 +406,16 @@
 approach (Brzozowski derivatives and formal proofs).
 
 
-\section{Terminology, and the Problem with Bounded Repetitions}
+\section{Regex, and the Problems with Regex Matchers}
 Regular expressions and regular expression matchers 
 have of course been studied for many, many years.
-Theoretical results in automata theory says
+Theoretical results in automata theory say
 that basic regular expression matching should be linear
-w.r.t the input, provided that the regular expression
-$r$ had been pre-processed and turned into a
-deterministic finite automata (DFA).
+w.r.t the input.
+This assumes that the regular expression
+$r$ was pre-processed and turned into a
+deterministic finite automata (DFA) before matching,
+which could be exponential\cite{Sakarovitch2009}.
 By basic we mean textbook definitions such as the one
 below, involving only characters, alternatives,
 sequences, and Kleene stars:
@@ -424,133 +426,96 @@
 however,
 support richer constructs such as bounded repetitions
 and back-references.
-The syntax and expressive power of those 
-matching engines
-make ``regular expressions'' quite different from 
-their original meaning in the formal languages
-theory.
-To differentiate, people tend to use the word \emph{regex} to refer
-those expressions with richer constructs, and regular expressions
-for the more traditional meaning.
-For example, the PCRE standard (Peral Compatible Regular Expressions)
-is such a regex syntax standard.
-We follow this convention in this thesis.
-We aim to support all the popular features of regexes in the future,
+To differentiate, people use the word \emph{regex} to refer
+to those expressions with richer constructs while reserving the
+term \emph{regular expression}
+for the more traditional meaning in formal languages theory.
+We follow this convention 
+in this thesis.
+In the future, we aim to support all the popular features of regexes, 
 but for this work we mainly look at regular expressions.
 
-\subsection{A Little Introduction to Regexes: Bounded Repetitions
-and Back-references}
+
+
+%Most modern regex libraries
+%the so-called PCRE standard (Peral Compatible Regular Expressions)
+%has the back-references
 Regexes come with a lot of constructs
-that makes it more convenient for 
+beyond the basic ones
+that make it more convenient for 
 programmers to write regular expressions.
+Depending on the types of these constructs
+the task of matching and lexing with them
+will have different levels of complexity increase.
 Some of those constructs are syntactic sugars that are
 simply short hand notations
-that save the programmers a few keystrokes,
-for example the
+that save the programmers a few keystrokes.
+These will not cause trouble for regex libraries.
+
+\noindent
+For example the
 non-binary alternative involving three or more choices:
 \[
-	r = (a | b | c | \ldots | z)^*
+	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
 \]
-, the range operator $-$ which means the alternative
-of all characters between its operands:
-\[
-	r = [0-9a-zA-Z] \; \text{(all alpha-numeric characters)}
-\]
-and the 
-wildcard character $.$ meaning any character
+the range operator $-$ used to express the alternative
+of all characters between its operands in a concise way:
 \[
-	. = [0-9a-zA-Z+-()*&\ldots]
-
+	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
 \]
-Some of those constructs do make the expressions much
-more compact, and matching time could be greatly increase.
-For example, $r^{n}$ is exponentially more concise compared with
-the expression $\underbrace{r}_\text{n copies of r}$,
-and therefore a naive algorithm that simply unfolds
-$r^{n}$ into $\underbrace{r}_\text{n copies of r}$
-will suffer exponential runtime increase.
-Some constructs can even raise the expressive
-power to the non-regular realm, for example
-the back-references.
+and the
+wildcard character $.$ used to refer to any single character:
+\[
+	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
+\]
 
-bounded repetitions, as we have discussed in the 
-previous section.
-This super-linear behaviour of the 
-regex matching engines we have?
-One of the most recent work in the context of lexing
-is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
-This is relevant work and we will compare later on
-our derivative-based matcher we are going to present.
-There is also some newer work called
-Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead.
-For that the problem is dealing with the bounded regular expressions of the form
-$r^{n}$ where $n$ is a constant specifying that $r$ must repeat
-exactly $n$ times. The Verbatim++ lexer becomes excruciatingly slow
-on the bounded repetitions construct.
+\noindent
+\subsection{Bounded Repetitions}
+Some of those constructs do make the expressions much
+more compact.
+For example, the bounded regular expressions
+(where $n$ and $m$ are constant natural numbers)
+$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
+defined as 
+\begin{center}
+	\begin{tabular}{lcl}
+		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
+		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
+		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
+		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
+	\end{tabular}
+\end{center}
+are exponentially smaller compared with
+their unfolded form: for example $r^{\{n\}}$
+as opposed to
+\[
+	\underbrace{r\ldots r}_\text{n copies of r}.
+\]
+%Therefore, a naive algorithm that simply unfolds
+%them into their desugared forms
+%will suffer from at least an exponential runtime increase.
 
-In the work reported in \cite{CSL2022} and here, we add better support
-for them. 
-The other repetition constructs include
-$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which specify 
-intervals for how many times $r$ should match.
-$r^{\ldots m}$ means repeating
-at most $m$ times, $r^{n\ldots}$ means repeating at least $n$ times and 
-$r^{n\ldots m}$ means repeating between $n$ and $m$ times.
-The results presented in this thesis extend straightforwardly to them
-too. 
-Their formal definitions will be given later.
-
-Bounded repetitions are important because they
-tend to occur often in practical use, for example in RegExLib,
-Snort, as well as in XML Schema definitions (XSDs).
-According to Bj\"{o}rklund et al \cite{xml2015},
-bounded regular expressions occur frequently in the latter and can have
-counters up to ten million. An example XSD with a large counter they gave
-was:
-\begin{verbatim}
-<sequence minOccurs="0" maxOccurs="65535">
-    <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
-    <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-</sequence>
-\end{verbatim}
-This can be seen as the expression 
-$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
-regular expressions 
-satisfying certain constraints (such as 
-satisfying the floating point number format).
 The problem here is that tools based on the classic notion of
 automata need to expand $r^{n}$ into $n$ connected 
 copies of the automaton for $r$. This leads to very inefficient matching
 algorithms  or algorithms that consume large amounts of memory.
+Implementations using $\DFA$s will
+either become excruciatingly slow 
+(for example Verbatim++\cite{Verbatimpp}) or get
+out of memory errors (for example $\mathit{LEX}$ and 
+$\mathit{JFLEX}$\footnote{which are lexer generators
+in C and JAVA that generate $\mathit{DFA}$-based
+lexers. The user provides a set of regular expressions
+and configurations to them, and then 
+gets an output program encoding a minimized $\mathit{DFA}$
+that can be compiled and run. 
+When given the above countdown regular expression,
+a small $n$ (a few dozen) would result in a 
+determinised automata
+with millions of states.}) under large counters.
 A classic example is the regular expression $(a+b)^*  a (a+b)^{n}$
-where the minimal DFA requires at least $2^{n+1}$ states (more on this
-later).
-Therefore regular expressions matching libraries that rely on the classic
-notion of DFAs  often impose adhoc limits
-for bounded regular expressions:
-For example, in the regular expression matching library in the Go
-language the regular expression $a^{1001}$ is not permitted, because no counter
-can be above 1000, and in the built-in Rust regular expression library
-expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
-for being too big. These problems can of course be solved in matching algorithms where 
-automata go beyond the classic notion and for instance include explicit
-counters \cite{Turo_ov__2020}.
-The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
-straightforwardly extended to deal with bounded regular expressions
-and moreover the resulting code still consists of only simple
-recursive functions and inductive datatypes.
-Finally, bounded regular expressions do not destroy our finite
-boundedness property, which we shall prove later on.
-
-\section{Back-references and The Terminology Regex}
-
-
-Bounded repetitions, usually written in the form
-$r^{\{c\}}$ (where $c$ is a constant natural number),
-denotes a regular expression accepting strings
-that can be divided into $c$ substrings, where each 
-substring is in $r$. 
-For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
+where the minimal DFA requires at least $2^{n+1}$ states.
+For example, when $n$ is equal to 2,
 an $\mathit{NFA}$ describing it would look like:
 \begin{center}
 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
@@ -565,7 +530,9 @@
     (q_2) edge  node  {a,b} (q_3);
 \end{tikzpicture}
 \end{center}
-The red states are "countdown states" which counts down 
+which requires at least $2^3$ states
+for its subset construction.\footnote{The 
+red states are "countdown states" which counts down 
 the number of characters needed in addition to the current
 string to make a successful match.
 For example, state $q_1$ indicates a match that has
@@ -588,166 +555,103 @@
 Generalizing this to regular expressions with larger
 bounded repetitions number, we have that
 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
-would require at least $2^{n+1}$ states, if $r$ contains
+would require at least $2^{n+1}$ states, if $r$ itself contains
 more than 1 string.
 This is to represent all different 
-scenarios which "countdown" states are active.
-For those regexes, tools that uses $\DFA$s will get
-out of memory errors.
-
-
-The time cost of regex matching algorithms in general
-involve two different phases, and different things can go differently wrong on 
-these phases.
-$\DFA$s usually have problems in the first (construction) phase
-, whereas $\NFA$s usually run into trouble
-on the second phase.
-
+scenarios which "countdown" states are active.}
 
-\section{Error-prone POSIX Implementations}
-The problems with practical implementations
-of reare not limited to slowness on certain 
-cases. 
-Another thing about these libraries is that there
-is no correctness guarantee.
-In some cases, they either fail to generate a lexing result when there exists a match,
-or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be the regex
-\begin{center}
-	$(aba + ab + a)* \text{and the string} ababa$
-\end{center}
-The correct $\POSIX$ match for the above would be 
-with the entire string $ababa$, 
-split into two Kleene star iterations, $[ab] [aba]$ at positions
-$[0, 2), [2, 5)$
-respectively.
-But trying this out in regex101\parencite{regex101}
-with different language engines would yield 
-the same two fragmented matches: $[aba]$ at $[0, 3)$
-and $a$ at $[4, 5)$.
-
-Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
-correctly implementing the POSIX (maximum-munch)
-rule of regular expression matching.
-
-As Grathwohl\parencite{grathwohl2014crash} wrote,
-\begin{quote}
-	The POSIX strategy is more complicated than the 
-	greedy because of the dependence on information about 
-	the length of matched strings in the various subexpressions.
-\end{quote}
-%\noindent
-To summarise the above, regular expressions are important.
-They are popular and programming languages' library functions
-for them are very fast on non-catastrophic cases.
-But there are problems with current practical implementations.
-First thing is that the running time might blow up.
-The second problem is that they might be error-prone on certain
-very simple cases.
-In the next part of the chapter, we will look into reasons why 
-certain regex engines are running horribly slow on the "catastrophic"
-cases and propose a solution that addresses both of these problems
-based on Brzozowski and Sulzmann and Lu's work.
-
-
-
-\subsection{Different Phases of a Matching/Lexing Algorithm}
+One of the most recent work in the context of lexing
+%with this issue
+is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
+This is relevant work and we will compare later on
+our derivative-based matcher we are going to present.
+There is also some newer work called
+Verbatim++\cite{Verbatimpp}, which does not use derivatives, 
+but deterministic finite automaton instead.
+%An example that gives problem to automaton approaches would be
+%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
+%It requires at least $2^{n+1}$ states to represent
+%as a DFA.
 
 
-Most lexing algorithms can be roughly divided into 
-two phases during its run.
-The first phase is the "construction" phase,
-in which the algorithm builds some  
-suitable data structure from the input regex $r$, so that
-it can be easily operated on later.
-We denote
-the time cost for such a phase by $P_1(r)$.
-The second phase is the lexing phase, when the input string 
-$s$ is read and the data structure
-representing that regex $r$ is being operated on. 
-We represent the time
-it takes by $P_2(r, s)$.\\
-For $\mathit{DFA}$,
-we have $P_2(r, s) = O( |s| )$,
-because we take at most $|s|$ steps, 
-and each step takes
-at most one transition--
-a deterministic-finite-automata
-by definition has at most one state active and at most one
-transition upon receiving an input symbol.
-But unfortunately in the  worst case
-$P_1(r) = O(exp^{|r|})$. An example will be given later. 
-For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
-expressions like $r^n$ into 
-\[
-	\underbrace{r \cdots r}_{\text{n copies of r}}.
-\]
-The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
-On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^{|s|}$.
-%on the input
-%And when calculating the time complexity of the matching algorithm,
-%we are assuming that each input reading step requires constant time.
-%which translates to that the number of 
-%states active and transitions taken each time is bounded by a
-%constant $C$.
-%But modern  regex libraries in popular language engines
-% often want to support much richer constructs than just
-% sequences and Kleene stars,
-%such as negation, intersection, 
-%bounded repetitions and back-references.
-%And de-sugaring these "extended" regular expressions 
-%into basic ones might bloat the size exponentially.
-%TODO: more reference for exponential size blowup on desugaring. 
+Bounded repetitions are very important because they
+tend to occur a lot in practical use.
+For example in the regex library RegExLib,
+the rules library of Snort, as well as in XML Schema definitions (XSDs).
+According to Bj\"{o}rklund et al \cite{xml2015},
+more than half of the 
+XSDs they found have bounded regular expressions in them.
+Often the counters are quite large, the largest up to ten million. 
+An example XSD they gave
+was:
+\begin{verbatim}
+<sequence minOccurs="0" maxOccurs="65535">
+    <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+    <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{verbatim}
+This can be seen as the expression 
+$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
+regular expressions 
+satisfying certain constraints (such as 
+satisfying the floating point number format).
 
-\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
+It is therefore quite unsatisfying that 
+some regular expressions matching libraries
+impose adhoc limits
+for bounded regular expressions:
+For example, in the regular expression matching library in the Go
+language the regular expression $a^{1001}$ is not permitted, because no counter
+can be above 1000, and in the built-in Rust regular expression library
+expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
+for being too big. These problems can of course be solved in matching algorithms where 
+automata go beyond the classic notion and for instance include explicit
+counters \cite{Turo_ov__2020}.
+
+In the work reported in \cite{CSL2022} and here, 
+we add better support using derivatives
+for bounded regular expressions $r^{\{n\}}$.
+The results
+extend straightforwardly to
+repetitions with an interval such as 
+$r^{\{n\ldots m\}}$.
+The merit of Brzozowski derivatives
+on this problem is that
+it can be naturally extended to support bounded repetitions.
+Moreover these extensions are still made up of only
+inductive datatypes and recursive functions,
+making it handy to deal with using theorem provers.
+%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
+%straightforwardly extended to deal with bounded regular expressions
+%and moreover the resulting code still consists of only simple
+%recursive functions and inductive datatypes.
+Finally, bounded regular expressions do not destroy our finite
+boundedness property, which we shall prove later on.
 
 
-The good things about $\mathit{DFA}$s is that once
-generated, they are fast and stable, unlike
-backtracking algorithms. 
-However, they do not scale well with bounded repetitions.
-
-\subsubsection{Problems with Bounded Repetitions}
-
-
-
-
-
-\subsubsection{Tools that uses $\mathit{DFA}$s}
-%TODO:more tools that use DFAs?
-$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
-in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based
-lexers. The user provides a set of regular expressions
-and configurations to such lexer generators, and then 
-gets an output program encoding a minimized $\mathit{DFA}$
-that can be compiled and run. 
-When given the above countdown regular expression,
-a small number $n$ would result in a determinised automata
-with millions of states.
+\subsection{Back-References}
+%Some constructs can even raise the expressive
+%power to the non-regular realm, for example
+%the back-references.
+%bounded repetitions, as we have discussed in the 
+%previous section.
+%This super-linear behaviour of the 
+%regex matching engines we have?
+The syntax and expressive power of regexes 
+can go quickly beyond the regular language hierarchy
+with back-references.
+%\section{Back-references and The Terminology Regex}
 
-For this reason, regex libraries that support 
-bounded repetitions often choose to use the $\mathit{NFA}$ 
-approach.
-
-
-
-
-
-
-
+%When one constructs an $\NFA$ out of a regular expression
+%there is often very little to be done in the first phase, one simply 
+%construct the $\NFA$ states based on the structure of the input regular expression.
 
-\subsection{Why $\mathit{NFA}$s can be slow in the second phase}
-When one constructs an $\NFA$ out of a regular expression
-there is often very little to be done in the first phase, one simply 
-construct the $\NFA$ states based on the structure of the input regular expression.
-
-In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
-one by keeping track of all active states after consuming 
-a character, and update that set of states iteratively.
-This can be viewed as a breadth-first-search of the $\mathit{NFA}$
-for a path terminating
-at an accepting state.
+%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
+%one by keeping track of all active states after consuming 
+%a character, and update that set of states iteratively.
+%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
+%for a path terminating
+%at an accepting state.
 Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
 type of $\mathit{NFA}$ simulation and guarantees a linear runtime
 in terms of input string length.
@@ -838,6 +742,134 @@
  %TODO: verify the fact Rust does not allow 1000+ reps
 
 
+
+
+%The time cost of regex matching algorithms in general
+%involve two different phases, and different things can go differently wrong on 
+%these phases.
+%$\DFA$s usually have problems in the first (construction) phase
+%, whereas $\NFA$s usually run into trouble
+%on the second phase.
+
+
+\section{Error-prone POSIX Implementations}
+The problems with practical implementations
+of reare not limited to slowness on certain 
+cases. 
+Another thing about these libraries is that there
+is no correctness guarantee.
+In some cases, they either fail to generate a lexing result when there exists a match,
+or give results that are inconsistent with the $\POSIX$ standard.
+A concrete example would be the regex
+\begin{center}
+	$(aba + ab + a)* \text{and the string} ababa$
+\end{center}
+The correct $\POSIX$ match for the above would be 
+with the entire string $ababa$, 
+split into two Kleene star iterations, $[ab] [aba]$ at positions
+$[0, 2), [2, 5)$
+respectively.
+But trying this out in regex101\parencite{regex101}
+with different language engines would yield 
+the same two fragmented matches: $[aba]$ at $[0, 3)$
+and $a$ at $[4, 5)$.
+
+Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
+correctly implementing the POSIX (maximum-munch)
+rule of regular expression matching.
+
+As Grathwohl\parencite{grathwohl2014crash} wrote,
+\begin{quote}
+	The POSIX strategy is more complicated than the 
+	greedy because of the dependence on information about 
+	the length of matched strings in the various subexpressions.
+\end{quote}
+%\noindent
+To summarise the above, regular expressions are important.
+They are popular and programming languages' library functions
+for them are very fast on non-catastrophic cases.
+But there are problems with current practical implementations.
+First thing is that the running time might blow up.
+The second problem is that they might be error-prone on certain
+very simple cases.
+In the next part of the chapter, we will look into reasons why 
+certain regex engines are running horribly slow on the "catastrophic"
+cases and propose a solution that addresses both of these problems
+based on Brzozowski and Sulzmann and Lu's work.
+
+
+\subsection{Different Phases of a Matching/Lexing Algorithm}
+
+
+Most lexing algorithms can be roughly divided into 
+two phases during its run.
+The first phase is the "construction" phase,
+in which the algorithm builds some  
+suitable data structure from the input regex $r$, so that
+it can be easily operated on later.
+We denote
+the time cost for such a phase by $P_1(r)$.
+The second phase is the lexing phase, when the input string 
+$s$ is read and the data structure
+representing that regex $r$ is being operated on. 
+We represent the time
+it takes by $P_2(r, s)$.\\
+For $\mathit{DFA}$,
+we have $P_2(r, s) = O( |s| )$,
+because we take at most $|s|$ steps, 
+and each step takes
+at most one transition--
+a deterministic-finite-automata
+by definition has at most one state active and at most one
+transition upon receiving an input symbol.
+But unfortunately in the  worst case
+$P_1(r) = O(exp^{|r|})$. An example will be given later. 
+For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
+expressions like $r^n$ into 
+\[
+	\underbrace{r \cdots r}_{\text{n copies of r}}.
+\]
+The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
+On the other hand, if backtracking is used, the worst-case time bound bloats
+to $|r| * 2^{|s|}$.
+%on the input
+%And when calculating the time complexity of the matching algorithm,
+%we are assuming that each input reading step requires constant time.
+%which translates to that the number of 
+%states active and transitions taken each time is bounded by a
+%constant $C$.
+%But modern  regex libraries in popular language engines
+% often want to support much richer constructs than just
+% sequences and Kleene stars,
+%such as negation, intersection, 
+%bounded repetitions and back-references.
+%And de-sugaring these "extended" regular expressions 
+%into basic ones might bloat the size exponentially.
+%TODO: more reference for exponential size blowup on desugaring. 
+
+\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
+
+
+The good things about $\mathit{DFA}$s is that once
+generated, they are fast and stable, unlike
+backtracking algorithms. 
+However, they do not scale well with bounded repetitions.
+
+\subsubsection{Problems with Bounded Repetitions}
+
+
+
+
+
+
+
+
+
+
+
+
+
+
 So we have practical implementations 
 on regular expression matching/lexing which are fast
 but do not come with any guarantees that it will not grind to a halt