# HG changeset patch
# User Chengsong
# Date 1561921549 -3600
# Node ID 550d12bb6032bd7393e05dd48e4ce308148bffd2
# Parent  ae23e2d63936b8a9fb48b000b97ec5be0e5e3760
# Parent  7a74081288c14f7ab9372bae2a9dff5ffe05aeee
resolved i guess?

diff -r 7a74081288c1 -r 550d12bb6032 ecp/ecoop_paper.tex
--- a/ecp/ecoop_paper.tex	Fri Jun 28 12:40:58 2019 +0100
+++ b/ecp/ecoop_paper.tex	Sun Jun 30 20:05:49 2019 +0100
@@ -166,7 +166,7 @@
 This exponential blowup sometimes causes real pain in real life:
 for example on 20 July 2016 one evil regular expression brought the
-webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \cite{SE16}.
+webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}.
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
 massive amounts of CPU-resources and because of this the web servers
diff -r 7a74081288c1 -r 550d12bb6032 ninems/ninems.tex
--- a/ninems/ninems.tex	Fri Jun 28 12:40:58 2019 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,511 +0,0 @@
-\documentclass[a4paper,UKenglish]{lipics}
-\usepackage{graphic}
-\usepackage{data}
-% \documentclass{article}
-%\usepackage[utf8]{inputenc}
-%\usepackage[english]{babel}
-%\usepackage{listings}
-% \usepackage{amsthm}
-% \usepackage{hyperref}
-% \usepackage[margin=0.5in]{geometry}
-%\usepackage{pmboxdraw}
-
-\title{POSIX Regular Expression Matching and Lexing}
-\author{Chengsong Tan}
-\affil{King's College London\\
-London, UK\\
-\texttt{chengsong.tan@kcl.ac.uk}}
-\authorrunning{Chengsong Tan}
-\Copyright{Chengsong Tan}
-
-\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
-\newcommand{\ZERO}{\mbox{\bf 0}}
-\newcommand{\ONE}{\mbox{\bf 1}}
-\def\lexer{\mathit{lexer}}
-\def\mkeps{\mathit{mkeps}}
-\def\inj{\mathit{inj}}
-\def\Empty{\mathit{Empty}}
-\def\Left{\mathit{Left}}
-\def\Right{\mathit{Right}}
-\def\Stars{\mathit{Stars}}
-\def\Char{\mathit{Char}}
-\def\Seq{\mathit{Seq}}
-\def\Der{\mathit{Der}}
-\def\nullable{\mathit{nullable}}
-\def\Z{\mathit{Z}}
-\def\S{\mathit{S}}
-
-%\theoremstyle{theorem}
-%\newtheorem{theorem}{Theorem}
-%\theoremstyle{lemma}
-%\newtheorem{lemma}{Lemma}
-%\newcommand{\lemmaautorefname}{Lemma}
-%\theoremstyle{definition}
-%\newtheorem{definition}{Definition}
-
-
-\begin{document}
-
-\maketitle
-
-\begin{abstract}
-  Brzozowski introduced in 1964 a beautifully simple algorithm for
-  regular expression matching based on the notion of derivatives of
-  regular expressions. In 2014, Sulzmann and Lu extended this
-  algorithm so that it not only gives a YES/NO answer for whether or
-  not a regular expression matches a string, but in the YES-case also
-  reports \emph{how} the string is matched. This is important for
-  applications such as lexing (tokenising a string). The problem is to
-  make the algorithm by Sulzmann and Lu fast on all inputs without
-  breaking its correctness.
-\end{abstract}
-
-\section{Introduction}
-
-This PhD-project is about regular expression matching and
-lexing. Given the maturity of this topic, the reader might wonder:
-Surely, regular expressions must have already been studied to death?
-What could possibly be \emph{not} known in this area? And surely all
-implemented algorithms for regular expression matching are blindingly
-fast?
-
-
-
-Unfortunately these preconceptions are not supported by evidence: Take
-for example the regular expression $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
-expression. Obviously they do not match---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8 with strings like 28
-$a$'s, one discovers that this decision takes around 30 seconds and
-considerably longer when a few more $a$'s are added, as the graphs
-below show:
-
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm,
-    legend entries={JavaScript},
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm,
-    legend entries={Python},
-    legend pos=north west,
-    legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=35,
-    ytick={0,5,...,30},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm,
-    legend entries={Java 8},
-    legend pos=north west,
-    legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
-           of the form $\underbrace{aa..a}_{n}$.}
-\end{tabular}
-\end{center}
-
-\noindent These are clearly abysmal and possibly surprising results. One
-would expect these systems to do much better than that---after all,
-given a DFA and a string, deciding whether the string is matched by the
-DFA should take time linear in the length of the string.
-
-Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
-exhibit this ``exponential behaviour''. Unfortunately, such regular
-expressions are not just a few ``outliers''; they are frequent enough
-that a separate name has been coined for them---\emph{evil regular
-expressions}. In empirical work, Davis et al.\ report that they have
-found thousands of such evil regular expressions in the JavaScript and
-Python ecosystems \cite{Davis18}.
-
-This exponential blowup sometimes causes real pain in real life:
-for example on 20 July 2016 one evil regular expression brought the
-webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \cite{SE16}.
-In this instance, a regular expression intended to just trim white
-spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU-resources and because of this the web servers
-ground to a halt. This happened when a post with 20,000 white spaces
-was submitted, but importantly the white spaces were neither at the
-beginning nor at the end.
-As a result, the regular expression matching
-engine needed to backtrack over many choices.
-
-The underlying problem is that many ``real life'' regular expression
-matching engines do not use DFAs for matching. This is because they
-support regular expressions that are not covered by the classical
-automata theory, and in this more general setting quite a few research
-questions remain unanswered and fast algorithms still need to be
-developed.
-
-There is also another under-researched problem to do with regular
-expressions and lexing, i.e.~the process of breaking up strings into
-sequences of tokens according to some regular expressions. In this
-setting one is not just interested in whether or not a regular
-expression matches a string, but also, in the case of a match, in
-\emph{how} it matches the string. Consider for example a regular
-expression $r_{key}$ for recognising keywords such as \textit{if},
-\textit{then} and so on; and a regular expression $r_{id}$ for
-recognising identifiers (say, a single letter followed by letters or
-digits). One can then form the compound regular expression
-$(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how
-should the string \textit{iffoo} be tokenised? It could be tokenised
-as a keyword followed by an identifier, or the entire string as a
-single identifier. Similarly, how should the string \textit{if} be
-tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
-``fire''---so is it an identifier or a keyword? While in applications
-there is a well-known strategy to decide these questions, called POSIX
-matching, only relatively recently precise definitions of what POSIX
-matching actually means have been formalised
-\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
-POSIX matching means matching the longest initial substring and
-in the case of a tie, the initial submatch is chosen according to some
-priorities attached to the regular expressions (e.g.~keywords have a
-higher priority than identifiers). For the examples above this means
-that \textit{iffoo} is tokenised as a single identifier, since $r_{id}$
-matches the longer initial substring, whereas \textit{if} is tokenised
-as a keyword because of the higher priority of $r_{key}$. This sounds
-rather simple, but according to Grathwohl et al.\
-\cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
-
-\begin{quote}
-\it{}``The POSIX strategy is more complicated than the greedy because of
-the dependence on information about the length of matched strings in the
-various subexpressions.''
-\end{quote}
-
-\noindent
-This is also supported by evidence collected by Kuklewicz
-\cite{Kuklewicz} who noticed that a number of POSIX regular expression
-matchers calculate incorrect results.
-
-Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
-regular expression matching according to the POSIX strategy
-\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
-Brzozowski from 1964, in which he introduced the notion of derivatives
-of regular expressions \cite{Brzozowski1964}. We shall briefly explain
-the algorithms next.
-
-\section{The Algorithms by Brzozowski, and Sulzmann and Lu}
-
-Suppose regular expressions are given by the following grammar (for
-the moment ignore the grammar for values on the right-hand side):
-
-
-\begin{center}
-  \begin{tabular}{c@{\hspace{20mm}}c}
-    \begin{tabular}{@{}rrl@{}}
-      \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
-      $r$ & $::=$  & $\ZERO$\\
-          & $\mid$ & $\ONE$\\
-          & $\mid$ & $c$\\
-          & $\mid$ & $r_1 \cdot r_2$\\
-          & $\mid$ & $r_1 + r_2$\\
-          & $\mid$ & $r^*$\\
-    \end{tabular}
-    &
-    \begin{tabular}{@{\hspace{0mm}}rrl@{}}
-      \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
-      $v$ & $::=$  & $\Empty$\\
-          & $\mid$ & $\Char(c)$\\
-          & $\mid$ & $\Seq\,v_1\, v_2$\\
-          & $\mid$ & $\Left(v)$\\
-          & $\mid$ & $\Right(v)$\\
-          & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$\\
-    \end{tabular}
-  \end{tabular}
-\end{center}
-
-\noindent
-The intended meaning of the regular expressions is as usual: $\ZERO$
-cannot match any string, $\ONE$ can match the empty string, the
-character regular expression $c$ can match the character $c$, and so
-on. The brilliant contribution by Brzozowski is the notion of
-\emph{derivatives} of regular expressions. The idea behind this
-notion is as follows: suppose a regular expression $r$ can match a
-string of the form $c\!::\! s$ (that is a list of characters starting
-with $c$), what does the regular expression look like that can match
-just $s$? Brzozowski gave a neat answer to this question. He defined
-the following operation on regular expressions, written
-$r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
-  $\ONE \backslash c$  & $\dn$ & $\ZERO$\\
-  $d \backslash c$     & $\dn$ &
-  $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if}\;\nullable(r_1)$\\
-  &   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
-  &   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
-  $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
-\end{tabular}
-\end{center}
-
-\noindent
-In this definition $\nullable(\_)$ stands for a simple recursive
-function that tests whether a regular expression can match the empty
-string (its definition is omitted). Assuming the classic notion of a
-\emph{language} of a regular expression, written $L(\_)$, the main
-property of the derivative operation is that
-
-\begin{center}
-$c\!::\!s \in L(r)$ holds
-if and only if $s \in L(r\backslash c)$.
-\end{center}
-
-\noindent
-The beauty of derivatives is that they lead to a really simple regular
-expression matching algorithm: To find out whether a string $s$
-matches a regular expression $r$, build the derivatives of $r$
-w.r.t.\ (in succession) all the characters of the string $s$. Finally,
-test whether the resulting regular expression can match the empty
-string. If it can, then $r$ matches $s$; if it cannot, then $r$ does
-not match $s$. For us the main advantage is that derivatives can be
-straightforwardly implemented in any functional programming language,
-and are easily definable and reasoned about in theorem provers---the
-definitions just consist of inductive datatypes and simple recursive
-functions, as the sketch below shows.
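-\noindent
-To make this concrete, here is a minimal sketch of Brzozowski's
-matcher in Scala. The definitions mirror the mathematical ones above;
-the concrete names (\texttt{Rexp}, \texttt{der}, \texttt{matcher} and
-so on) are our own choices for this sketch, not fixed by the algorithm.
-
-\begin{verbatim}
-// regular expressions as an inductive datatype
-abstract class Rexp
-case object ZERO extends Rexp                    // matches nothing
-case object ONE extends Rexp                     // matches the empty string
-case class CHAR(c: Char) extends Rexp            // matches the character c
-case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
-case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
-case class STAR(r: Rexp) extends Rexp            // r*
-
-// can the regular expression match the empty string?
-def nullable(r: Rexp): Boolean = r match {
-  case ZERO => false
-  case ONE => true
-  case CHAR(_) => false
-  case ALT(r1, r2) => nullable(r1) || nullable(r2)
-  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
-  case STAR(_) => true
-}
-
-// the derivative of r w.r.t. the character c
-def der(c: Char, r: Rexp): Rexp = r match {
-  case ZERO => ZERO
-  case ONE => ZERO
-  case CHAR(d) => if (c == d) ONE else ZERO
-  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
-  case SEQ(r1, r2) =>
-    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
-    else SEQ(der(c, r1), r2)
-  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
-}
-
-// r matches s iff the derivative of r w.r.t. all
-// characters of s can match the empty string
-def matcher(r: Rexp, s: String): Boolean =
-  nullable(s.foldLeft(r)((r, c) => der(c, r)))
-
-// example: the evil expression (a*)*b does not match "aaa"
-// matcher(SEQ(STAR(STAR(CHAR('a'))), CHAR('b')), "aaa") == false
-\end{verbatim}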
-Moreover, the notion of derivatives can be easily
-generalised to cover extended regular expression constructors such as
-the not-regular expression, written $\neg\,r$, or bounded repetitions
-(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
-straightforwardly realised within the classic automata approach.
-
-
-One limitation, however, of Brzozowski's algorithm is that it only
-produces a YES/NO answer for whether a string is matched by a regular
-expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
-algorithm to allow generation of an actual matching, called a
-\emph{value}---see the grammar above for its definition. Assuming a
-regular expression matches a string, values encode the information of
-\emph{how} the string is matched by the regular expression---that is,
-which part of the string is matched by which part of the regular
-expression. To illustrate this consider the string $xy$ and the
-regular expression $(x + (y + xy))^*$. We can view this regular
-expression as a tree and if the string $xy$ is matched by two Star
-``iterations'', then the $x$ is matched by the left-most alternative
-in this tree and the $y$ by the left alternative inside the right
-alternative. This suggests recording this matching as
-
-\begin{center}
-$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
-\end{center}
-
-\noindent
-where $\Stars$ records how many
-iterations were used; and $\Left$, respectively $\Right$, which
-alternative is used. The value for
-matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
-would look as follows
-
-\begin{center}
-$\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
-\end{center}
-
-\noindent
-where $\Stars$ has only a single-element list for the single iteration
-and $\Seq$ indicates that $xy$ is matched by a sequence regular
-expression.
-
-The contribution of Sulzmann and Lu is an extension of Brzozowski's
-algorithm by a second phase (the first phase being building successive
-derivatives). In this second phase, for every successful match the
-corresponding POSIX value is computed. As mentioned before, from this
-value we can extract the information \emph{how} a regular expression
-matched a string. We omit the details here of how Sulzmann and Lu
-achieved this~(see \cite{Sulzmann2014}). Rather, we shall focus next on the
-process of simplification of regular expressions, which is needed in
-order to obtain \emph{fast} versions of Brzozowski's, and Sulzmann
-and Lu's, algorithms. This is where the PhD-project hopes to advance
-the state-of-the-art.
-
-
-\section{Simplification of Regular Expressions}
-
-The main drawback of building successive derivatives according to
-Brzozowski's definition is that they can grow very quickly in size.
-This is mainly due to the fact that the derivative operation often
-generates ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a
-result, if implemented naively both algorithms by Brzozowski and by
-Sulzmann and Lu are excruciatingly slow. For example when starting with
-the regular expression $(a + aa)^*$ and building 12 successive
-derivatives w.r.t.~the character $a$, one obtains a derivative regular
-expression with more than 8000 nodes (when viewed as a tree).
-Operations like derivative and $\nullable$ need to traverse such trees
-and consequently the bigger the size of the derivative the slower the
-algorithm. Fortunately, one can simplify regular expressions after
-each derivative step.
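-\noindent
-This growth is easy to observe with the Scala sketch from above; the
-helper \texttt{size}, our own addition for this sketch, counts the
-nodes of a regular expression when viewed as a tree:
-
-\begin{verbatim}
-// number of nodes of a regular expression, viewed as a tree
-def size(r: Rexp): Int = r match {
-  case ZERO | ONE | CHAR(_) => 1
-  case ALT(r1, r2) => 1 + size(r1) + size(r2)
-  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
-  case STAR(r1) => 1 + size(r1)
-}
-
-// sizes of the first 12 derivatives of (a + aa)* w.r.t. 'a';
-// without simplification these sizes grow very quickly
-val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
-val sizes =
-  (0 to 12).map(n => size(("a" * n).foldLeft(r: Rexp)((r, c) => der(c, r))))
-\end{verbatim}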
-Various simplifications of regular expressions
-are possible, such as the simplifications of $\ZERO + r$,
-$r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
-$r$. These simplifications do not affect the answer for whether a
-regular expression matches a string or not, but fortunately also do
-not affect the POSIX strategy of how regular expressions match
-strings---although the latter is much harder to establish. Some
-initial results in this regard have been obtained in
-\cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
-is a very tight bound for the size of the derivatives. Such a tight
-bound is suggested by work of Antimirov who proved that (partial)
-derivatives can be bounded by the number of characters contained in
-the initial regular expression \cite{Antimirov95}. We believe, and
-have generated test data, that a similar bound can be obtained for the
-derivatives in Sulzmann and Lu's algorithm. Let us give some details
-about this next.
-
-We first followed Sulzmann and Lu's idea of introducing
-\emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
-defined by the following grammar:
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
-               & $\mid$ & $\textit{ONE}\;\;bs$\\
-               & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
-               & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
-               & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
-               & $\mid$ & $\textit{STAR}\;\;bs\,a$
-\end{tabular}
-\end{center}
-
-\noindent
-where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
-list of annotated regular expressions. These bitsequences encode
-information about the (POSIX) value that should be generated by the
-Sulzmann and Lu algorithm. There are operations that can transform the
-usual (un-annotated) regular expressions into annotated regular
-expressions, and there are operations for encoding/decoding values to
-or from bitsequences. For example the encoding function for values is
-defined as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
-  $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
-  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
-  $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
-  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\;
-  code(\Stars\,vs)$
-\end{tabular}
-\end{center}
-
-\noindent
-where $\Z$ and $\S$ are arbitrary names for the ``bits'' in the
-bitsequences. Although this encoding is ``lossy'' in the sense of not
-recording all details of a value, Sulzmann and Lu have defined the
-decoding function such that with additional information (namely the
-corresponding regular expression) a value can be precisely extracted
-from a bitsequence.
-
-The main point of the bitsequences and annotated regular expressions
-is that we can apply rather aggressive (in terms of size)
-simplification rules in order to keep derivatives small. We have
-developed such ``aggressive'' simplification rules and generated test
-data that show that the expected bound can be achieved. Obviously we
-could only partially cover the search space as there are infinitely
-many regular expressions and strings. One modification we introduced
-is to allow a list of annotated regular expressions in the
-\textit{ALTS} constructor, as sketched below.
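-\noindent
-The following Scala sketch illustrates the list-based \textit{ALTS}
-together with one such ``aggressive'' simplification step. The names
-and the precise set of rules shown here are our own illustrative
-choices (and \texttt{Rexp} is reused from the earlier sketch); they
-are not meant as a definitive rendering of our Isabelle definitions.
-
-\begin{verbatim}
-// bits and annotated regular expressions, with a list-based ALTS
-abstract class Bit
-case object Z extends Bit
-case object S extends Bit
-type Bits = List[Bit]
-
-abstract class ARexp
-case object AZERO extends ARexp
-case class AONE(bs: Bits) extends ARexp
-case class ACHAR(bs: Bits, c: Char) extends ARexp
-case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
-case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
-case class ASTAR(bs: Bits, a: ARexp) extends ARexp
-
-// attach bits to the front of an annotated regular expression
-def fuse(bs: Bits, a: ARexp): ARexp = a match {
-  case AZERO => AZERO
-  case AONE(bs1) => AONE(bs ++ bs1)
-  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
-  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
-  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
-  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
-}
-
-// forget the bits (used for detecting "copies"); ALTS lists
-// are assumed to be non-empty here
-def erase(a: ARexp): Rexp = a match {
-  case AZERO => ZERO
-  case AONE(_) => ONE
-  case ACHAR(_, c) => CHAR(c)
-  case AALTS(_, as) => as.map(erase).reduceRight(ALT(_, _))
-  case ASEQ(_, a1, a2) => SEQ(erase(a1), erase(a2))
-  case ASTAR(_, a1) => STAR(erase(a1))
-}
-
-// delete useless ZEROs and flatten nested alternatives
-def flatten(as: List[ARexp]): List[ARexp] = as flatMap {
-  case AZERO => Nil
-  case AALTS(bs, as1) => as1.map(fuse(bs, _))
-  case a => List(a)
-}
-
-// simplification step: in an alternative only the first "copy"
-// of a regular expression is kept (duplicates are detected
-// modulo the bit-annotations)
-def simp(a: ARexp): ARexp = a match {
-  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
-    case (AZERO, _) => AZERO
-    case (_, AZERO) => AZERO
-    case (AONE(bs1), s2) => fuse(bs ++ bs1, s2)
-    case (s1, s2) => ASEQ(bs, s1, s2)
-  }
-  case AALTS(bs, as) => flatten(as.map(simp)).distinctBy(erase) match {
-    case Nil => AZERO
-    case List(a1) => fuse(bs, a1)
-    case as1 => AALTS(bs, as1)
-  }
-  case a => a
-}
-\end{verbatim}
-
-\noindent
-Here \texttt{distinctBy} implements the pruning of later ``copies''
-that is described next.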
-This list-based representation allows us to not just delete
-unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
-unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general
-setting). Another modification is that we use simplification rules
-inspired by Antimirov's work on partial derivatives. They maintain the
-idea that only the first ``copy'' of a regular expression in an
-alternative contributes to the calculation of a POSIX value. All
-subsequent copies can be pruned from the regular expression.
-
-We are currently engaged in proving that our simplification rules
-actually do not affect the POSIX value that should be generated by the
-algorithm according to the specification of a POSIX value, and
-furthermore that the derivatives stay small for all inputs. For this
-proof we use the theorem prover Isabelle. Once completed, this result
-will advance the state-of-the-art: Sulzmann and Lu wrote in their
-paper \cite{Sulzmann2014} about the bitcoded ``incremental parsing
-method'' (that is the matching algorithm outlined in this section):
-
-\begin{quote}\it
-  ``Correctness Claim: We further claim that the incremental parsing
-  method in Figure~5 in combination with the simplification steps in
-  Figure 6 yields POSIX parse trees. We have tested this claim
-  extensively by using the method in Figure~3 as a reference but yet
-  have to work out all proof details.''
-\end{quote}
-
-\noindent
-We would settle the correctness claim and furthermore obtain a much
-tighter bound on the sizes of derivatives. The result is that our
-algorithm should be correct and faster on all inputs. The original
-blow-up, as observed in JavaScript, Python and Java, would then be
-ruled out for our algorithm.
-
-\section{Conclusion}
-
-In this PhD-project we are interested in fast algorithms for regular
-expression matching. While this seems to be a ``settled'' area, in
-fact interesting research questions are popping up as soon as one steps
-outside the classic automata theory (for example in terms of what kind
-of regular expressions are supported). The reason why it is
-interesting for us to look at the derivative approach introduced by
-Brzozowski for regular expression matching, and then much further
-developed by Sulzmann and Lu, is that derivatives can elegantly deal
-with some of the regular expressions that are of interest in ``real
-life''. This includes the not-regular expression, written $\neg\,r$
-(that is all strings that are not recognised by $r$), but also bounded
-regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is
-also hope that the derivatives can provide another angle for how to
-deal more efficiently with back-references, which are one of the
-reasons why regular expression engines in JavaScript, Python and Java
-choose not to implement the classic automata approach of transforming
-regular expressions into NFAs and then DFAs---because we simply do not
-know how such back-references can be represented by DFAs.
-
-
-\bibliographystyle{plain}
-\bibliography{root}
-
-
-\end{document}