--- a/ninems/ninems.tex Fri Jun 28 12:40:58 2019 +0100
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,511 +0,0 @@
-\documentclass[a4paper,UKenglish]{lipics}
-\usepackage{graphic}
-\usepackage{data}
-% \documentclass{article}
-%\usepackage[utf8]{inputenc}
-%\usepackage[english]{babel}
-%\usepackage{listings}
-% \usepackage{amsthm}
-% \usepackage{hyperref}
-% \usepackage[margin=0.5in]{geometry}
-%\usepackage{pmboxdraw}
-
-\title{POSIX Regular Expression Matching and Lexing}
-\author{Chengsong Tan}
-\affil{King's College London\\
-London, UK\\
-\texttt{chengsong.tan@kcl.ac.uk}}
-\authorrunning{Chengsong Tan}
-\Copyright{Chengsong Tan}
-
-\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
-\newcommand{\ZERO}{\mbox{\bf 0}}
-\newcommand{\ONE}{\mbox{\bf 1}}
-\def\lexer{\mathit{lexer}}
-\def\mkeps{\mathit{mkeps}}
-\def\inj{\mathit{inj}}
-\def\Empty{\mathit{Empty}}
-\def\Left{\mathit{Left}}
-\def\Right{\mathit{Right}}
-\def\Stars{\mathit{Stars}}
-\def\Char{\mathit{Char}}
-\def\Seq{\mathit{Seq}}
-\def\Der{\mathit{Der}}
-\def\nullable{\mathit{nullable}}
-\def\Z{\mathit{Z}}
-\def\S{\mathit{S}}
-
-%\theoremstyle{theorem}
-%\newtheorem{theorem}{Theorem}
-%\theoremstyle{lemma}
-%\newtheorem{lemma}{Lemma}
-%\newcommand{\lemmaautorefname}{Lemma}
-%\theoremstyle{definition}
-%\newtheorem{definition}{Definition}
-
-
-\begin{document}
-
-\maketitle
-
-\begin{abstract}
- Brzozowski introduced in 1964 a beautifully simple algorithm for
- regular expression matching based on the notion of derivatives of
-  regular expressions. In 2014, Sulzmann and Lu extended this
-  algorithm so that it not only decides whether or not a regular
-  expression matches a string, but in the case of a match also
-  computes \emph{how} the string is matched. This is important for
-  applications such as
- lexing (tokenising a string). The problem is to make the algorithm
- by Sulzmann and Lu fast on all inputs without breaking its
- correctness.
-\end{abstract}
-
-\section{Introduction}
-
-This PhD-project is about regular expression matching and
-lexing. Given the maturity of this topic, the reader might wonder:
-Surely, regular expressions must have already been studied to death?
-What could possibly \emph{not} be known in this area? And surely all
-implemented algorithms for regular expression matching are blindingly
-fast?
-
-
-
-Unfortunately these preconceptions are not supported by evidence: Take
-for example the regular expression $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
-expression. Obviously they do not match---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8 with strings of 28
-$a$'s, one discovers that this decision takes around 30 seconds and
-considerably longer still when adding a few more $a$'s, as the graphs
-below show:
-
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={JavaScript},
- legend pos=north west,
- legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
-\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Python},
- legend pos=north west,
- legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Java 8},
- legend pos=north west,
- legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
- of the form $\underbrace{aa..a}_{n}$.}
-\end{tabular}
-\end{center}
-
-\noindent These are clearly abysmal and possibly surprising results.
-One would expect these systems to do much better than that---after
-all, given a DFA and a string, deciding whether the string is matched
-by this DFA should take time linear in the length of the string.
-
-Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
-exhibit this ``exponential behaviour''. Unfortunately, such regular
-expressions are not just a few ``outliers''; they are frequent
-enough that a separate name has been coined for
-them---\emph{evil regular expressions}. In empirical work, Davis et
-al.\ report that they found thousands of such evil regular expressions
-in the JavaScript and Python ecosystems \cite{Davis18}.
-
-This exponential blowup can cause real pain in practice:
-for example, on 20 July 2016 one evil regular expression brought the
-webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \cite{SE16}.
-In this instance, a regular expression intended to just trim white
-spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU-resources and because of this the web servers
-ground to a halt. This happened when a post with 20,000 white spaces
-was submitted, but importantly the white spaces were neither at the
-beginning nor at the end. As a result, the regular expression matching
-engine needed to backtrack over many choices.
-
-The underlying problem is that many ``real life'' regular expression
-matching engines do not use DFAs for matching. This is because they
-support regular expressions that are not covered by classical
-automata theory, and in this more general setting there are quite a
-few research questions still unanswered and fast algorithms still need
-to be developed.
-
-There is also another under-researched problem to do with regular
-expressions and lexing, i.e.~the process of breaking up strings into
-sequences of tokens according to some regular expressions. In this
-setting one is not just interested in whether or not a regular
-expression matches a string, but, in the case of a match, also in \emph{how} it
-matches the string. Consider for example a regular expression
-$r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
-and so on; and a regular expression $r_{id}$ for recognising
-identifiers (say, a single letter followed by letters or
-digits). One can then form the compound regular expression
-$(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how
-should the string \textit{iffoo} be tokenised? It could be tokenised
-as a keyword followed by an identifier, or the entire string as a
-single identifier. Similarly, how should the string \textit{if} be
-tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
-``fire''---so is it an identifier or a keyword? While in applications
-there is a well-known strategy to decide these questions, called POSIX
-matching, only relatively recently precise definitions of what POSIX
-matching actually means have been formalised
-\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
-POSIX matching means matching the longest initial substring and
-in the case of a tie, the initial submatch is chosen according to some priorities attached to the
-regular expressions (e.g.~keywords have a higher priority than
-identifiers). This sounds rather simple, but according to Grathwohl et
-al.\ \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
-
-\begin{quote}
-\it{}``The POSIX strategy is more complicated than the greedy because of
-the dependence on information about the length of matched strings in the
-various subexpressions.''
-\end{quote}
-
-\noindent
-This is also supported by evidence collected by Kuklewicz
-\cite{Kuklewicz} who noticed that a number of POSIX regular expression
-matchers calculate incorrect results.
-
-Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
-regular expression matching according to the POSIX strategy
-\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
-Brzozowski from 1964 where he introduced the notion of derivatives of
-regular expressions \cite{Brzozowski1964}. We shall briefly explain
-the algorithms next.
-
-\section{The Algorithms by Brzozowski and by Sulzmann and Lu}
-
-Suppose regular expressions are given by the following grammar (for
-the moment ignore the grammar for values on the right-hand side):
-
-
-\begin{center}
- \begin{tabular}{c@{\hspace{20mm}}c}
- \begin{tabular}{@{}rrl@{}}
- \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
- $r$ & $::=$ & $\ZERO$\\
- & $\mid$ & $\ONE$ \\
- & $\mid$ & $c$ \\
- & $\mid$ & $r_1 \cdot r_2$\\
- & $\mid$ & $r_1 + r_2$ \\
- & $\mid$ & $r^*$ \\
- \end{tabular}
- &
- \begin{tabular}{@{\hspace{0mm}}rrl@{}}
- \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
-  $v$ & $::=$ & $\Empty$ \\
- & $\mid$ & $\Char(c)$ \\
- & $\mid$ & $\Seq\,v_1\, v_2$\\
- & $\mid$ & $\Left(v)$ \\
- & $\mid$ & $\Right(v)$ \\
-  & $\mid$ & $\Stars\,[v_1,\ldots,v_n]$ \\
- \end{tabular}
- \end{tabular}
-\end{center}
-
-\noindent
-The intended meaning of the regular expressions is as usual: $\ZERO$
-cannot match any string, $\ONE$ can match the empty string, the
-character regular expression $c$ can match the character $c$, and so
-on. The brilliant contribution by Brzozowski is the notion of
-\emph{derivatives} of regular expressions. The idea behind this
-notion is as follows: suppose a regular expression $r$ can match a
-string of the form $c\!::\! s$ (that is a list of characters starting
-with $c$), what does the regular expression look like that can match
-just $s$? Brzozowski gave a neat answer to this question. He defined
-the following operation on regular expressions, written
-$r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
-
-\begin{center}
-\begin{tabular}{lcl}
- $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
- $\ONE \backslash c$ & $\dn$ & $\ZERO$\\
- $d \backslash c$ & $\dn$ &
- $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if}\;\nullable(r_1)$\\
- & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
- & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
- $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
-\end{tabular}
-\end{center}
-
-\noindent
-In this definition $\nullable(\_)$ stands for a simple recursive
-function that tests whether a regular expression can match the empty
-string (its definition is omitted). Assuming the classic notion of a
-\emph{language} of a regular expression, written $L(\_)$, the main
-property of the derivative operation is that
-
-\begin{center}
-$c\!::\!s \in L(r)$ holds
-if and only if $s \in L(r\backslash c)$.
-\end{center}
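-
-\noindent
-To make these definitions concrete, here is a small sketch in Scala;
-the definition of $\nullable$, omitted above, is the standard one:
-
-\begin{verbatim}
-// regular expressions as an inductive datatype
-abstract class Rexp
-case object ZERO extends Rexp                    // matches nothing
-case object ONE extends Rexp                     // matches the empty string
-case class CHAR(c: Char) extends Rexp            // matches the character c
-case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
-case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
-case class STAR(r: Rexp) extends Rexp            // r*
-
-// can r match the empty string?
-def nullable(r: Rexp): Boolean = r match {
-  case ZERO => false
-  case ONE => true
-  case CHAR(_) => false
-  case ALT(r1, r2) => nullable(r1) || nullable(r2)
-  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
-  case STAR(_) => true
-}
-
-// the derivative of r w.r.t. the character c
-def der(c: Char, r: Rexp): Rexp = r match {
-  case ZERO => ZERO
-  case ONE => ZERO
-  case CHAR(d) => if (c == d) ONE else ZERO
-  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
-  case SEQ(r1, r2) =>
-    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
-    else SEQ(der(c, r1), r2)
-  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
-}
-\end{verbatim}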
-
-\noindent
-The beauty of derivatives is that they lead to a really simple regular
-expression matching algorithm: To find out whether a string $s$
-matches a regular expression $r$, build the derivatives of $r$
-w.r.t.\ (in succession) all the characters of the string $s$. Finally,
-test whether the resulting regular expression can match the empty
-string. If it can, then $r$ matches $s$; otherwise it does
-not. For us the main advantage is that derivatives can be
-straightforwardly implemented in any functional programming language,
-and can be easily defined and reasoned about in theorem provers---the
-definitions just consist of inductive datatypes and simple recursive
-functions. Moreover, the notion of derivatives can be easily
-generalised to cover extended regular expression constructors such as
-the not-regular expression, written $\neg\,r$, or bounded repetitions
-(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
-straightforwardly realised within the classic automata approach.
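-
-A minimal sketch of this matcher, building on the Scala definitions
-above: fold $der$ over the characters of the string and then test
-$\nullable$.
-
-\begin{verbatim}
-// extend the derivative from characters to strings
-def ders(s: List[Char], r: Rexp): Rexp = s match {
-  case Nil => r
-  case c :: cs => ders(cs, der(c, r))
-}
-
-// does r match s?
-def matches(r: Rexp, s: String): Boolean =
-  nullable(ders(s.toList, r))
-
-// example: the evil regular expression (a*)* b from the introduction
-val EVIL = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
-// matches(EVIL, "aaab") == true; matches(EVIL, "aaaa") == false
-// (without simplification, however, the derivatives can grow quickly;
-// this is the topic of the next section)
-\end{verbatim}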
-
-
-One limitation, however, of Brzozowski's algorithm is that it only
-produces a YES/NO answer for whether a string is matched by a
-regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
-algorithm to allow generation of an actual matching, called a
-\emph{value}---see the grammar above for its definition. Assuming a
-regular expression matches a string, values encode the information of
-\emph{how} the string is matched by the regular expression---that is,
-which part of the string is matched by which part of the regular
-expression. To illustrate this consider the string $xy$ and the
-regular expression $(x + (y + xy))^*$. We can view this regular
-expression as a tree and if the string $xy$ is matched by two Star
-``iterations'', then the $x$ is matched by the left-most alternative
-in this tree and the $y$ by the left alternative nested inside the
-right alternative. This suggests recording this matching as
-
-\begin{center}
-$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
-\end{center}
-
-\noindent
-where $\Stars$ records how many
-iterations were used; and $\Left$, respectively $\Right$, which
-alternative is used. The value for
-matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
-would look as follows
-
-\begin{center}
-$\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
-\end{center}
-
-\noindent
-where $\Stars$ has only a single-element list for the single iteration
-and $\Seq$ indicates that $xy$ is matched by a sequence regular
-expression.
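-
-In code, values can be represented by another inductive datatype. A
-possible Scala rendering is below; note that the constructors Left and
-Right shadow Scala's built-in ones:
-
-\begin{verbatim}
-// values, mirroring the grammar given above
-abstract class Val
-case object Empty extends Val
-case class Chr(c: Char) extends Val
-case class Sequ(v1: Val, v2: Val) extends Val
-case class Left(v: Val) extends Val
-case class Right(v: Val) extends Val
-case class Stars(vs: List[Val]) extends Val
-
-// the two-iteration value for xy from above:
-//   Stars(List(Left(Chr('x')), Right(Left(Chr('y')))))
-// the POSIX value, using a single iteration:
-//   Stars(List(Sequ(Chr('x'), Chr('y'))))
-\end{verbatim}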
-
-The contribution of Sulzmann and Lu is an extension of Brzozowski's
-algorithm by a second phase (the first phase being building successive
-derivatives). In this second phase, for every successful match the
-corresponding POSIX value is computed. As mentioned before, from this
-value we can extract the information \emph{how} a regular expression
-matched a string. We omit the details here on how Sulzmann and Lu
-achieved this~(see \cite{Sulzmann2014}). Rather, we shall focus next on the
-process of simplification of regular expressions, which is needed in
-order to obtain \emph{fast} versions of Brzozowski's and of Sulzmann
-and Lu's algorithms. This is where the PhD-project hopes to advance
-the state-of-the-art.
-
-
-\section{Simplification of Regular Expressions}
-
-The main drawback of building successive derivatives according to
-Brzozowski's definition is that they can grow very quickly in size.
-This is mainly due to the fact that the derivative operation often
-generates ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result,
-if implemented naively both algorithms by Brzozowski and by Sulzmann
-and Lu are excruciatingly slow. For example when starting with the
-regular expression $(a + aa)^*$ and building 12 successive derivatives
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree). Operations like
-derivative and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm. Fortunately, one can simplify regular expressions after
-each derivative step. Various simplifications of regular expressions
-are possible, such as the simplifications of $\ZERO + r$,
-$r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
-$r$. These simplifications do not affect the answer for whether a
-regular expression matches a string or not, but fortunately also do
-not affect the POSIX strategy of how regular expressions match
-strings---although the latter is much harder to establish. Some
-initial results in this regard have been obtained in
-\cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
-is a very tight bound for the size. Such a tight bound is suggested by
-work of Antimirov who proved that (partial) derivatives can be bounded
-by the number of characters contained in the initial regular
-expression \cite{Antimirov95}. We believe, and have generated test
-data, that a similar bound can be obtained for the derivatives in
-Sulzmann and Lu's algorithm. Let us give some details about this next.
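-
-Before turning to these details, the following Scala sketch shows what
-a single simplification step along the above rules could look like;
-the clauses for $\ZERO \cdot r$ and $r \cdot \ZERO$ are added for
-completeness:
-
-\begin{verbatim}
-// simplify bottom-up: ZERO + r => r, r + ZERO => r, r + r => r,
-// ONE . r => r, r . ONE => r, ZERO . r => ZERO, r . ZERO => ZERO
-def simp(r: Rexp): Rexp = r match {
-  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
-    case (ZERO, s2) => s2
-    case (s1, ZERO) => s1
-    case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)
-  }
-  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
-    case (ZERO, _) => ZERO
-    case (_, ZERO) => ZERO
-    case (ONE, s2) => s2
-    case (s1, ONE) => s1
-    case (s1, s2) => SEQ(s1, s2)
-  }
-  case r => r
-}
-
-// interleave simplification with the derivative steps
-def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
-  case Nil => r
-  case c :: cs => dersSimp(cs, simp(der(c, r)))
-}
-\end{verbatim}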
-
-We first followed Sulzmann and Lu's idea of introducing
-\emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
-defined by the following grammar:
-
-\begin{center}
-\begin{tabular}{lcl}
- $\textit{a}$ & $::=$ & $\textit{ZERO}$\\
- & $\mid$ & $\textit{ONE}\;\;bs$\\
- & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
- & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
- & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
- & $\mid$ & $\textit{STAR}\;\;bs\,a$
-\end{tabular}
-\end{center}
-
-\noindent
-where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
-list of annotated regular expressions. These bitsequences encode
-information about the (POSIX) value that should be generated by the
-Sulzmann and Lu algorithm. There are operations that can transform the
-usual (un-annotated) regular expressions into annotated regular
-expressions, and there are operations for encoding/decoding values to
-or from bitsequences. For example the encoding function for values is
-defined as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
- $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
- $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
- $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
- $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
- $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
- $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
- $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\;
- code(\Stars\,vs)$
-\end{tabular}
-\end{center}
-
-\noindent
-where $\Z$ and $\S$ are arbitrary names for the ``bits'' in the
-bitsequences. Although this encoding is ``lossy'' in the sense of not
-recording all details of a value, Sulzmann and Lu have defined the
-decoding function such that with additional information (namely the
-corresponding regular expression) a value can be precisely extracted
-from a bitsequence.
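-
-A possible Scala rendering of the annotated regular expressions and of
-the encoding function is given below; the constructor names are
-prefixed with A in order to avoid clashes with the un-annotated
-regular expressions:
-
-\begin{verbatim}
-// the two "bits" of the bitsequences
-abstract class Bit
-case object Z extends Bit
-case object S extends Bit
-
-// annotated regular expressions, mirroring the grammar above
-abstract class ARexp
-case object AZERO extends ARexp
-case class AONE(bs: List[Bit]) extends ARexp
-case class ACHAR(bs: List[Bit], c: Char) extends ARexp
-case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
-case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
-case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp
-
-// the encoding function for values from above
-def code(v: Val): List[Bit] = v match {
-  case Empty => Nil
-  case Chr(_) => Nil
-  case Left(v1) => Z :: code(v1)
-  case Right(v1) => S :: code(v1)
-  case Sequ(v1, v2) => code(v1) ++ code(v2)
-  case Stars(Nil) => List(S)
-  case Stars(v1 :: vs) => Z :: code(v1) ++ code(Stars(vs))
-}
-\end{verbatim}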
-
-The main point of the bitsequences and annotated regular expressions
-is that we can apply rather aggressive (in terms of size)
-simplification rules in order to keep derivatives small. We have
-developed such ``aggressive'' simplification rules and generated test
-data that show that the expected bound can be achieved. Obviously we
-could only partially cover the search space as there are infinitely
-many regular expressions and strings. One modification we introduced
-is to allow a list of annotated regular expressions in the
-\textit{ALTS} constructor. This allows us to not just delete
-unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
-unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general
-setting). Another modification is that we use simplification rules
-inspired by Antimirov's work on partial derivatives. They maintain the
-idea that only the first ``copy'' of a regular expression in an
-alternative contributes to the calculation of a POSIX value. All
-subsequent copies can be pruned from the regular expression.
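-
-A small sketch of this pruning idea on the list-based \textit{ALTS},
-ignoring how the bitsequence annotations are fused when nested
-alternatives are flattened, and assuming a hypothetical function
-erase that strips the annotations off an annotated regular expression:
-
-\begin{verbatim}
-// strip the bitsequence annotations (definition omitted here)
-def erase(a: ARexp): Rexp = ???
-
-// keep only the first "copy" of each alternative, where copies
-// are compared up to their annotations
-def distinctAlts(as: List[ARexp]): List[ARexp] = {
-  val seen = scala.collection.mutable.Set[Rexp]()
-  as.filter(a => seen.add(erase(a)))
-}
-\end{verbatim}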
-
-We are currently engaged with proving that our simplification rules
-actually do not affect the POSIX value that should be generated by the
-algorithm according to the specification of a POSIX value and
-furthermore that the derivatives stay small for all inputs. For
-this proof we use the theorem prover Isabelle. Once completed, this
-result will advance the state-of-the-art: Sulzmann and Lu wrote in
-their paper \cite{Sulzmann2014} about the bitcoded ``incremental
-parsing method'' (that is the matching algorithm outlined in this
-section):
-
-\begin{quote}\it
- ``Correctness Claim: We further claim that the incremental parsing
- method in Figure~5 in combination with the simplification steps in
- Figure 6 yields POSIX parse trees. We have tested this claim
- extensively by using the method in Figure~3 as a reference but yet
- have to work out all proof details.''
-\end{quote}
-
-\noindent
-We aim to settle this correctness claim and furthermore to obtain a
-much tighter bound on the sizes of derivatives. The result would be an
-algorithm that is correct and fast on all inputs; the exponential
-blow-up observed in JavaScript, Python and Java would be ruled out for
-our algorithm.
-
-\section{Conclusion}
-
-In this PhD-project we are interested in fast algorithms for regular
-expression matching. While this seems to be a ``settled'' area, in
-fact interesting research questions pop up as soon as one steps
-outside classic automata theory (for example in terms of what kind
-of regular expressions are supported). The reason why it is
-interesting for us to look at the derivative approach introduced by
-Brzozowski for regular expression matching, and then much further
-developed by Sulzmann and Lu, is that derivatives can elegantly deal
-with some of the regular expressions that are of interest in ``real
-life''. This includes the not-regular expression, written $\neg\,r$
-(that is all strings that are not recognised by $r$), but also bounded
-regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is
-also hope that the derivatives can provide another angle for how to
-deal more efficiently with back-references. Back-references are one of
-the reasons why the regular expression engines in JavaScript, Python
-and Java choose not to implement the classic automata approach of
-transforming regular expressions into NFAs and then DFAs---it is
-simply not known how such back-references can be represented by DFAs.
-
-
-\bibliographystyle{plain}
-\bibliography{root}
-
-
-\end{document}