lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex

-:e6fc9b72c0e3
+:37b6fd310a16
 Bounded repetitions are very important because they
 occur frequently in practice,
 for example in the regex library RegExLib,
 the rules library of Snort \cite{Snort1999}\footnote{
 Snort is a network intrusion detection (NID) tool
-for monitoring network traffic.},
+for monitoring network traffic.
+The network security community curates a list
+of malicious patterns written as regexes,
+which is used by Snort's detection engine
+to match against network traffic for any hostile
+activities such as buffer overflow attacks.},
 as well as in XML Schema definitions (XSDs).
 According to Bj\"{o}rklund et al.\ \cite{xml2015},
 more than half of the
 XSDs they found have bounded regular expressions in them.
 Often the counters are quite large, with the largest being up to ten million.
 for bounded regular expressions $r^{\{n\}}$.
 The results
 extend straightforwardly to
 repetitions with an interval such as
 $r^{\{n\ldots m\}}$.
-The merit of Brzozowski derivatives
+The merit of Brzozowski derivatives (more on this later)
 on this problem is that
 they can be naturally extended to support bounded repetitions.
 Moreover, these extensions are still made up of only
 inductive datatypes and recursive functions,
 which makes them convenient to handle in theorem provers.
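A minimal sketch of this idea, written in Python rather than in a theorem prover. It is not the thesis's formalisation: the class names and the functions `nullable`, `der` and `matches` are hypothetical names chosen for illustration, and the clause for bounded repetition is the usual textbook rule, der c (r^{n}) = (der c r) . r^{n-1} for n > 0, which is assumed here. The point it illustrates is that $r^{\{n\}}$ only adds one constructor to the datatype and one clause to each recursive function.

```python
from dataclasses import dataclass


class Rexp:
    """Base class of the (hypothetical) regular expression datatype."""


@dataclass(frozen=True)
class Zero(Rexp):          # matches no string
    pass


@dataclass(frozen=True)
class One(Rexp):           # matches only the empty string
    pass


@dataclass(frozen=True)
class Char(Rexp):          # matches the single character c
    c: str


@dataclass(frozen=True)
class Alt(Rexp):           # r1 + r2
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Seq(Rexp):           # r1 . r2
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Star(Rexp):          # r*
    r: Rexp


@dataclass(frozen=True)
class NTimes(Rexp):        # r^{n}: exactly n copies of r (the new constructor)
    r: Rexp
    n: int


def nullable(r: Rexp) -> bool:
    """Return True if r can match the empty string."""
    if isinstance(r, (Zero, Char)):
        return False
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    if isinstance(r, NTimes):
        return r.n == 0 or nullable(r.r)
    raise TypeError(f"unknown regex: {r!r}")


def der(c: str, r: Rexp) -> Rexp:
    """Derivative of r with respect to the character c."""
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    if isinstance(r, NTimes):
        # Assumed rule: consume c from one copy, leaving n - 1 copies.
        if r.n == 0:
            return Zero()
        return Seq(der(c, r.r), NTimes(r.r, r.n - 1))
    raise TypeError(f"unknown regex: {r!r}")


def matches(r: Rexp, s: str) -> bool:
    """Take the derivative for each character, then test nullability."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

With these definitions, `matches(NTimes(Char("a"), 3), "aaa")` holds while the strings `"aa"` and `"aaaa"` are rejected. The sketch performs no simplification, so the intermediate expressions can grow; it only shows the shape of the extension, not an efficient matcher.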
 go beyond the regular language hierarchy
 with back-references.
 In fact, they allow the regex construct to express
 languages that cannot be contained in context-free
 languages either.
-For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
+For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
 which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
 of formal languages.
 The matching problem for expressions with back-references
 is NP-complete \parencite{alfred2014algorithms}.
 A non-backtracking,
 efficient solution is not known to exist.
-Regex libraries supporting back-references such as PCRE therefore have to
+Regex libraries supporting back-references such as
+PCRE \cite{pcre} therefore have to
 revert to a depth-first search algorithm that backtracks.
 The unreasonable part is that, even in the case of
 regexes not involving back-references, there is still
 a (non-negligible) chance they might backtrack super-linearly.
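The back-reference example above can be checked with a short sketch using Python's standard `re` module, which is itself a backtracking engine that supports back-references. The helper name `in_language` is hypothetical and only serves the illustration; the pattern is the one from the text, with `\1` referring to whatever the group `(a*)` captured.

```python
import re

# The pattern from the text: (a*) b \1 b \1
BACKREF = re.compile(r"(a*)b\1b\1")


def in_language(s: str) -> bool:
    """Return True if s has the form a^n b a^n b a^n for some n >= 0."""
    return BACKREF.fullmatch(s) is not None
```

Under this sketch `in_language("aabaabaa")` and `in_language("bb")` (the case n = 0) hold, while `in_language("aabaaba")` does not, because the three blocks of `a` must have equal length.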
 %\noindent
 The implementation complexity of POSIX rules also comes from
 the specification not being very clear.
 There are many informal summaries of this disambiguation
 strategy, which are often quite long and delicate.
-For example Kuklewicz described the POSIX rule as
+For example, Kuklewicz \cite{KuklewiczHaskell}
+described the POSIX rule as
 \begin{quote}
 	``
 	\begin{itemize}
 		\item
 regular expressions (REs) take the leftmost starting match, and the longest match starting there
 \end{itemize}
 \end{quote}
 The text above
 is trying to capture something very precise,
 and is crying out for formalising.
+Ausaf et al. \cite{AusafDyckhoffUrban2016}
+are the first to describe such a formalised POSIX
+specification in Isabelle/HOL.
+They then formally proved the correctness of
+a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
+based on that specification.
+In this thesis,
+we propose a solution to catastrophic
+backtracking and error-prone matchers--a formally verified
+regular expression lexing algorithm
+that is both fast
+and correct by extending Ausaf et al.'s work.
-%\subsection{Different Phases of a Matching/Lexing Algorithm}
-%
-%
-%Most lexing algorithms can be roughly divided into
-%two phases during its run.
-%The first phase is the "construction" phase,
-%in which the algorithm builds some
-%suitable data structure from the input regex $r$, so that
-%it can be easily operated on later.
-%We denote
-%the time cost for such a phase by $P_1(r)$.
-%The second phase is the lexing phase, when the input string
-%$s$ is read and the data structure
-%representing that regex $r$ is being operated on.
-%We represent the time
-%it takes by $P_2(r, s)$.\\
-%For $\mathit{DFA}$,
-%we have $P_2(r, s) = O( |s| )$,
-%because we take at most $|s|$ steps,
-%and each step takes
-%at most one transition--
-%a deterministic-finite-automata
-%by definition has at most one state active and at most one
-%transition upon receiving an input symbol.
-%But unfortunately in the  worst case
-%$P_1(r) = O(exp^{|r|})$. An example will be given later.
-%For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
-%expressions like $r^n$ into
-%\[
-%	\underbrace{r \cdots r}_{\text{n copies of r}}.
-%\]
-%The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
-%On the other hand, if backtracking is used, the worst-case time bound bloats
-%to $|r| * 2^{|s|}$.
-%%on the input
-%%And when calculating the time complexity of the matching algorithm,
-%%we are assuming that each input reading step requires constant time.
-%%which translates to that the number of
-%%states active and transitions taken each time is bounded by a
-%%constant $C$.
-%%But modern  regex libraries in popular language engines
-%% often want to support much richer constructs than just
-%% sequences and Kleene stars,
-%%such as negation, intersection,
-%%bounded repetitions and back-references.
-%%And de-sugaring these "extended" regular expressions
-%%into basic ones might bloat the size exponentially.
-%%TODO: more reference for exponential size blowup on desugaring.
-%
-%\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
-%
-%
-%The good things about $\mathit{DFA}$s is that once
-%generated, they are fast and stable, unlike
-%backtracking algorithms.
-%However, they do not scale well with bounded repetitions.
-%
-%\subsubsection{Problems with Bounded Repetitions}
-%
-%
-To summarise, we need regex libraries that are both fast
-and correct.
-And that correctness needs to be built on a precise
-model of what POSIX disambiguation is.
-We propose a solution that addresses both problems
-based on Brzozowski, Sulzmann and Lu and Ausaf and Urban's work.
 The end result is a regular expression lexing algorithm that comes with
 \begin{itemize}
 \item
 a proven correctness theorem according to the POSIX specification
-given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
+given by Ausaf et al. \cite{AusafDyckhoffUrban2016},
 \item
-a proven property saying that the algorithm's internal data structure will
+a proven complexity-related property saying that the
+algorithm's internal data structure will
 remain finite,
 \item
 and an extension to
 the bounded repetitions construct with the correctness and finiteness properties
 maintained.
 \end{itemize}
+In the next section we will very briefly
+introduce Brzozowski derivatives.
+We will give a taste of what they
+are like and why they are suitable for regular expression
+matching and lexing.
 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
 We propose Brzozowski derivatives on regular expressions as
 a solution to this.
 In the last fifteen or so years, Brzozowski's derivatives of regular

changeset 608	37b6fd310a16
parent 607	e6fc9b72c0e3
child 609	61139fdddae0