ChengsongTanPhdThesis/Chapters/Introduction.tex
with the help of formal proofs.
We offer a lexing algorithm based
on Brzozowski derivatives with certified correctness (in
Isabelle/HOL)
and a finiteness property.
Such properties guarantee the absence of
catastrophic backtracking in most cases.
We will give more details in the next sections
on (i) why the slow cases in Figure~\ref{fig:aStarStarb}
can occur
and (ii) why we choose our
approach (Brzozowski derivatives and formal proofs).

\section{Terminology, and the Problem with Bounded Repetitions}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
       
Theoretical results in automata theory say
that basic regular expression matching should be linear
with respect to the input, provided that the regular expression
$r$ has been pre-processed and turned into a
deterministic finite automaton (DFA).
By basic we mean textbook definitions such as the one
below, involving only characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
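To make this textbook definition concrete, here is a minimal sketch of it as a Scala
datatype (the constructor names are our own, chosen for illustration, and need not
coincide with the implementation presented later in this thesis):
\begin{verbatim}
// a sketch of basic regular expressions as a Scala datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only the empty string
case class CH(c: Char) extends Rexp              // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

// example: (a + b)*
val r: Rexp = STAR(ALT(CH('a'), CH('b')))
\end{verbatim}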
       
Modern regular expression matchers used by programmers,
however,
support richer constructs such as bounded repetitions
and back-references.
The syntax and expressive power of those
matching engines
make ``regular expressions'' quite different from
their original meaning in formal language
theory.
To differentiate, people tend to use the word \emph{regex} to refer to
those expressions with richer constructs, and \emph{regular expression}
for the more traditional meaning.
For example, the PCRE standard (Perl Compatible Regular Expressions)
is such a regex syntax standard.
We follow this convention in this thesis.
We aim to support all the popular features of regexes in the future,
but for this work we mainly look at regular expressions.
       
\subsection{A Little Introduction to Regexes: Bounded Repetitions
and Back-references}
Regexes come with a lot of constructs
that make it more convenient for
programmers to write regular expressions.
Some of those constructs are simply syntactic sugar,
shorthand notations
that save the programmer a few keystrokes,
for example the
non-binary alternative involving three or more choices:
\[
	r = (a | b | c | \ldots | z)^*
\]
the range operator $-$, which stands for the alternative
of all characters between its operands:
\[
	r = [0-9a-zA-Z] \; \text{(all alpha-numeric characters)}
\]
and the
wildcard character $.$, which matches any character:
\[
	. = [0-9a-zA-Z+-()*&\ldots]
\]
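To illustrate that such sugar is indeed just notation, here is one possible desugaring
of the range operator in Scala, reusing the \texttt{Rexp} datatype sketched above
(again only an illustration, not the implementation used later):
\begin{verbatim}
// desugar a range like [0-9] into nested alternatives
def charRange(from: Char, to: Char): Rexp =
  (from to to).map(c => CH(c): Rexp).reduceLeft(ALT(_, _))

// [0-9] becomes CH('0') + CH('1') + ... + CH('9')
val digits = charRange('0', '9')
\end{verbatim}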
       
Some of those constructs do make the expressions much
more compact, but matching time can greatly increase.
For example, $r^{n}$ is exponentially more concise than
the expression $\underbrace{r \cdots r}_{\text{n copies of r}}$,
and therefore a naive algorithm that simply unfolds
$r^{n}$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$
will suffer an exponential increase in runtime.
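The following sketch shows such a naive unfolding, again reusing the \texttt{Rexp}
datatype from above (illustrative only):
\begin{verbatim}
// naive unfolding of r^{n} into n sequenced copies of r
def unfold(r: Rexp, n: Int): Rexp =
  if (n == 0) ONE else SEQ(r, unfold(r, n - 1))

// the size of a regular expression, counting constructors
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CH(_) => 1
  case ALT(r1, r2)        => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)        => 1 + size(r1) + size(r2)
  case STAR(r1)           => 1 + size(r1)
}
\end{verbatim}
Here $\mathit{size}(\mathit{unfold}(r, n))$ grows linearly in $n$, whereas the
notation $r^{n}$ only grows with the number of digits of $n$.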
       
Some constructs can even raise the expressive
power to the non-regular realm, for example
the back-references;
others can make the matching time super-linear, for example
bounded repetitions, as we have discussed in the
previous section.
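For instance, a back-reference can force two parts of a string to be identical,
yielding the non-regular language of ``squares'' $\{ww \mid w \in \{a,b\}^+\}$.
The following Scala snippet (running the JVM's backtracking
\texttt{java.util.regex} engine) sketches this:
\begin{verbatim}
import java.util.regex.Pattern

// ([ab]+)\1 matches exactly the squares w w over {a, b}
val square = Pattern.compile("([ab]+)\\1")
println(square.matcher("abab").matches()) // true:  w = ab
println(square.matcher("aba").matches())  // false: not of the form w w
\end{verbatim}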
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
This is relevant work and we will later compare it with
the derivative-based matcher we are going to present.
There is also some newer work called
and moreover the resulting code still consists of only simple
recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
\section{Back-references and The Terminology Regex}
A bounded repetition, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denotes a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring is in $r$.
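One way to state this formally is as the $c$-fold concatenation of the language of $r$:
\[
	L(r^{\{c\}}) = \{s_1 \ldots s_c \mid s_1 \in L(r), \ldots, s_c \in L(r)\}
\]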
more than one string.
This is to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools that use $\DFA$s will run
out of memory.
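A standard example of this blow-up (used here purely for illustration) is the
regular expression
\[
	(a + b)^* \cdot a \cdot (a + b)^{\{n\}},
\]
for which a $\DFA$ must remember which of the last $n+1$ characters were $a$'s,
and therefore needs at least $2^{n}$ states.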
       

The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in
each phase:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.
       

\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regexes are not limited to slowness on certain
cases.
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
       
A concrete example would be the regex
\begin{center}
	$(aba + ab + a)^*$ and the string $ababa$.
\end{center}
The correct $\POSIX$ match for the above would be
the entire string $ababa$,
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101\parencite{regex101}
with different language engines would yield
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $[a]$ at $[4, 5)$.
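This behaviour can also be reproduced programmatically. For example, the
following Scala snippet (running the JVM's backtracking
\texttt{java.util.regex} engine and skipping the empty matches it also reports)
prints the same two fragments:
\begin{verbatim}
import java.util.regex.Pattern

val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")
while (m.find()) {
  if (m.group().nonEmpty)  // find() also yields empty matches; skip them
    println(s"${m.group()} at [${m.start}, ${m.end})")
}
// prints: aba at [0, 3)
//         a at [4, 5)
\end{verbatim}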
       

Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries do not
correctly implement the POSIX (maximum-munch)
rule of regular expression matching.
       

As Grathwohl\parencite{grathwohl2014crash} wrote,
\begin{quote}
	The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.
\end{quote}
%\noindent
       
To summarise the above, regular expressions are important.
They are popular, and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first problem is that the running time might blow up.
The second problem is that they might be error-prone on certain
very simple cases.
In the next part of the chapter, we will look into the reasons why
certain regex engines run horribly slowly on the ``catastrophic''
cases, and propose a solution that addresses both of these problems,
based on the work of Brzozowski and of Sulzmann and Lu.
       

\subsection{Different Phases of a Matching/Lexing Algorithm}

       
Most lexing algorithms can be roughly divided into
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of this phase by $P_1(r)$.
The second phase is the lexing phase, in which the input string
$s$ is read and the data structure
representing the regex $r$ is operated on.
We denote the time
it takes by $P_2(r, s)$.\\
       
For a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition:
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately, in the worst case,
$P_1(r) = O(2^{|r|})$. An example will be given later.
       
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into
\[
	\underbrace{r \cdots r}_{\text{n copies of r}}.
\]
$P_2(r, s)$ is bounded by $O(|r|\cdot|s|)$ if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $O(|r| \cdot 2^{|s|})$.
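The bounds discussed above can be summarised as follows:
\begin{center}
\begin{tabular}{l|c|c}
	& $P_1(r)$ & $P_2(r, s)$ \\
	\hline
	$\mathit{DFA}$ & $O(2^{|r|})$ (worst case) & $O(|s|)$ \\
	$\mathit{NFA}$, no backtracking & $O(|r|)$ & $O(|r|\cdot|s|)$ \\
	$\mathit{NFA}$, backtracking & $O(|r|)$ & $O(|r|\cdot 2^{|s|})$ \\
\end{tabular}
\end{center}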
       
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
%often want to support much richer constructs than just
%sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these ``extended'' regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.
       

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}

The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.
       

\subsubsection{Problems with Bounded Repetitions}


\subsubsection{Tools that use $\mathit{DFA}$s}
%TODO: more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based