ecp/ecoop_paper.tex
changeset 17 3241b1e71633
child 22 feffec3af1a1
       
     1 \documentclass[a4paper,UKenglish]{lipics}
       
     2 \usepackage{graphicx}
       
     3 \usepackage{pgfplots}
       
     4 % \documentclass{article}
       
     5 %\usepackage[utf8]{inputenc}
       
     6 %\usepackage[english]{babel}
       
     7 %\usepackage{listings}
       
     8 % \usepackage{amsthm}
       
     9 % \usepackage{hyperref}
       
    10 % \usepackage[margin=0.5in]{geometry}
       
    11 %\usepackage{pmboxdraw}
       
    12  
       
    13 \title{POSIX Regular Expression Matching and Lexing}
       
    14 \author[1]{Anonymous}
       
    15 
       
    16 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
    17 \newcommand{\ZERO}{\mbox{\bf 0}}
       
    18 \newcommand{\ONE}{\mbox{\bf 1}}
       
    19 \def\lexer{\mathit{lexer}}
       
    20 \def\mkeps{\mathit{mkeps}}
       
    21 \def\inj{\mathit{inj}}
       
    22 \def\Empty{\mathit{Empty}}
       
    23 \def\Left{\mathit{Left}}
       
    24 \def\Right{\mathit{Right}}
       
    25 \def\Stars{\mathit{Stars}}
       
    26 \def\Char{\mathit{Char}}
       
    27 \def\Seq{\mathit{Seq}}
       
    28 \def\Der{\mathit{Der}}
       
    29 \def\nullable{\mathit{nullable}}
       
    30 \def\Z{\mathit{Z}}
       
    31 \def\S{\mathit{S}}
       
    32 
       
    33 %\theoremstyle{theorem}
       
    34 %\newtheorem{theorem}{Theorem}
       
    35 %\theoremstyle{lemma}
       
    36 %\newtheorem{lemma}{Lemma}
       
    37 %\newcommand{\lemmaautorefname}{Lemma}
       
    38 %\theoremstyle{definition}
       
    39 %\newtheorem{definition}{Definition}
       
    40 
       
    41 
       
    42 \begin{document}
       
    43 
       
    44 \maketitle
       
    45 \begin{abstract}
       
    46   Brzozowski introduced in 1964 a beautifully simple algorithm for
       
    47   regular expression matching based on the notion of derivatives of
       
    48   regular expressions. In 2014, Sulzmann and Lu extended this
       
    49   algorithm so that it does not just give a YES/NO answer for whether

    50   a regular expression matches a string, but, in case it matches,

    51   also computes \emph{how} it matches the string.  This is important for applications such as
       
    52   lexing (tokenising a string). The problem is to make the algorithm
       
    53   by Sulzmann and Lu fast on all inputs without breaking its
       
    54   correctness.
       
    55 \end{abstract}
       
    56 
       
    57 \section{Introduction}
       
    58 
       
    59 This PhD-project is about regular expression matching and
       
    60 lexing. Given the maturity of this topic, the reader might wonder:
       
    61 Surely, regular expressions must have already been studied to death?
       
    62 What could possibly be \emph{not} known in this area? And surely all
       
    63 implemented algorithms for regular expression matching are blindingly
       
    64 fast?
       
    65 
       
    66 
       
    67 
       
    68 Unfortunately, these preconceptions are not supported by evidence: Take
       
    69 for example the regular expression $(a^*)^*\,b$ and ask whether
       
    70 strings of the form $aa..a$ match this regular
       
    71 expression. Obviously they do not match---the expected $b$ in the last
       
    72 position is missing. One would expect that modern regular expression
       
    73 matching engines can find this out very quickly. Alas, if one tries
       
    74 this example in JavaScript, Python or Java 8 with strings of 28

    75 $a$'s, one discovers that this decision takes around 30 seconds, and

    76 considerably longer when a few more $a$'s are added, as the graphs
       
    77 below show:
       
    78 
       
    79 \begin{center}
       
    80 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
    81 \begin{tikzpicture}
       
    82 \begin{axis}[
       
    83     xlabel={$n$},
       
    84     x label style={at={(1.05,-0.05)}},
       
    85     ylabel={time in secs},
       
    86     enlargelimits=false,
       
    87     xtick={0,5,...,30},
       
    88     xmax=33,
       
    89     ymax=35,
       
    90     ytick={0,5,...,30},
       
    91     scaled ticks=false,
       
    92     axis lines=left,
       
    93     width=5cm,
       
    94     height=4cm, 
       
    95     legend entries={JavaScript},  
       
    96     legend pos=north west,
       
    97     legend cell align=left]
       
    98 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
    99 \end{axis}
       
   100 \end{tikzpicture}
       
   101   &
       
   102 \begin{tikzpicture}
       
   103 \begin{axis}[
       
   104     xlabel={$n$},
       
   105     x label style={at={(1.05,-0.05)}},
       
   106     %ylabel={time in secs},
       
   107     enlargelimits=false,
       
   108     xtick={0,5,...,30},
       
   109     xmax=33,
       
   110     ymax=35,
       
   111     ytick={0,5,...,30},
       
   112     scaled ticks=false,
       
   113     axis lines=left,
       
   114     width=5cm,
       
   115     height=4cm, 
       
   116     legend entries={Python},  
       
   117     legend pos=north west,
       
   118     legend cell align=left]
       
   119 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   120 \end{axis}
       
   121 \end{tikzpicture}
       
   122   &
       
   123 \begin{tikzpicture}
       
   124 \begin{axis}[
       
   125     xlabel={$n$},
       
   126     x label style={at={(1.05,-0.05)}},
       
   127     %ylabel={time in secs},
       
   128     enlargelimits=false,
       
   129     xtick={0,5,...,30},
       
   130     xmax=33,
       
   131     ymax=35,
       
   132     ytick={0,5,...,30},
       
   133     scaled ticks=false,
       
   134     axis lines=left,
       
   135     width=5cm,
       
   136     height=4cm, 
       
   137     legend entries={Java 8},  
       
   138     legend pos=north west,
       
   139     legend cell align=left]
       
   140 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   141 \end{axis}
       
   142 \end{tikzpicture}\\
       
   143 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   144            of the form $\underbrace{aa..a}_{n}$.}
       
   145 \end{tabular}    
       
   146 \end{center}  
       
   147 
       
   148 \noindent These are clearly abysmal and possibly surprising results.
       
   149 One would expect these systems to do much better than that---after

   150 all, given a DFA and a string, deciding whether the string is matched

   151 by this DFA should take time linear in the length of the string.
       
   152 
       
   153 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
       
   154 exhibit this ``exponential behaviour''.  Unfortunately, such regular
       
   155 expressions are not just a few ``outliers'', but actually they are
       
   156 frequent enough that a separate name has been created for
       
   157 them---\emph{evil regular expressions}. In empirical work, Davis et al.
       
   158 report that they have found thousands of such evil regular expressions
       
   159 in the JavaScript and Python ecosystems \cite{Davis18}.
       
   160 
       
   161 This exponential blowup sometimes causes real pain in ``real life'':
       
   162 for example, on 20 July 2016 one evil regular expression brought the

   163 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees.
       
   164 In this instance, a regular expression intended to just trim white
       
   165 spaces from the beginning and the end of a line actually consumed
       
   166 massive amounts of CPU-resources and because of this the web servers
       
   167 ground to a halt. This happened when a post with 20,000 white spaces
       
   168 was submitted, but importantly the white spaces were neither at the
       
   169 beginning nor at the end. As a result, the regular expression matching
       
   170 engine needed to backtrack over many choices.
       
   171 
       
   172 The underlying problem is that many ``real life'' regular expression
       
   173 matching engines do not use DFAs for matching. This is because they
       
   174 support regular expressions that are not covered by the classical
       
   175 automata theory, and in this more general setting there are quite a
       
   176 few research questions still unanswered and fast algorithms still need
       
   177 to be developed.
       
   178 
       
   179 There is also another under-researched problem to do with regular
       
   180 expressions and lexing, i.e.~the process of breaking up strings into
       
   181 sequences of tokens according to some regular expressions. In this
       
   182 setting one is not just interested in whether or not a regular
       
   183 expression matches a string, but if it matches also in \emph{how} it
       
   184 matches the string.  Consider for example a regular expression
       
   185 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
       
   186 and so on; and a regular expression $r_{id}$ for recognising
       
   187 identifiers (say, a single character followed by characters or
       
   188 numbers). One can then form the compound regular expression
       
   189 $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But then how
       
   190 should the string \textit{iffoo} be tokenised?  It could be tokenised
       
   191 as a keyword followed by an identifier, or the entire string as a
       
   192 single identifier.  Similarly, how should the string \textit{if} be
       
   193 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
       
   194 ``fire''---so is it an identifier or a keyword?  While in applications
       
   195 there is a well-known strategy to decide these questions, called POSIX
       
   196 matching, only relatively recently precise definitions of what POSIX
       
   197 matching actually means have been formalised
       
   198 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
       
   199 POSIX matching means to match the longest initial substring and
       
   200 possible ties are solved according to some priorities attached to the
       
   201 regular expressions (e.g.~keywords have a higher priority than
       
   202 identifiers). This sounds rather simple, but according to Grathwohl et
       
   203 al.\ \cite[Page~36]{CrashCourse2014} this is not the case. They wrote:
       
   204 
       
   205 \begin{quote}
       
   206 \it{}``The POSIX strategy is more complicated than the greedy because of 
       
   207 the dependence on information about the length of matched strings in the 
       
   208 various subexpressions.''
       
   209 \end{quote}
       
   210 
       
   211 \noindent
       
   212 This is also supported by evidence collected by Kuklewicz
       
   213 \cite{Kuklewicz}, who noticed that a number of POSIX regular expression
       
   214 matchers calculate incorrect results.
       
   215 
       
   216 Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
       
   217 regular expression matching according to the POSIX strategy
       
   218 \cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
       
   219 Brzozowski from 1964 where he introduced the notion of derivatives of
       
   220 regular expressions \cite{Brzozowski1964}. We shall briefly explain
       
   221 the algorithms next.
       
   222 
       
   223 \section{The Algorithms by Brzozowski, and Sulzmann and Lu}
       
   224 
       
   225 Suppose regular expressions are given by the following grammar (for
       
   226 the moment ignore the grammar for values on the right-hand side):
       
   227 
       
   228 
       
   229 \begin{center}
       
   230 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   231 		\begin{tabular}{@{}rrl@{}}
       
   232 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   233 			$r$ & $::=$  & $\ZERO$\\
       
   234 			& $\mid$ & $\ONE$   \\
       
   235 			& $\mid$ & $c$          \\
       
   236 			& $\mid$ & $r_1 \cdot r_2$\\
       
   237 			& $\mid$ & $r_1 + r_2$   \\
       
   238 			\\
       
   239 			& $\mid$ & $r^*$         \\
       
   240 		\end{tabular}
       
   241 		&
       
   242 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   243 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   244 			$v$ & $::=$  & \\
       
   245 			&        & $\Empty$   \\
       
   246 			& $\mid$ & $\Char(c)$          \\
       
   247 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   248 			& $\mid$ & $\Left(v)$   \\
       
   249 			& $\mid$ & $\Right(v)$  \\
       
   250 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   251 		\end{tabular}
       
   252 	\end{tabular}
       
   253 \end{center}
       
   254 
       
   255 \noindent
       
   256 The intended meaning of the regular expressions is as usual: $\ZERO$
       
   257 cannot match any string, $\ONE$ can match the empty string, the
       
   258 character regular expression $c$ can match the character $c$, and so
       
   259 on. The brilliant contribution by Brzozowski is the notion of
       
   260 \emph{derivatives} of regular expressions.  The idea behind this
       
   261 notion is as follows: suppose a regular expression $r$ can match a
       
   262 string of the form $c\!::\! s$ (that is a list of characters starting
       
   263 with $c$), what does the regular expression look like that can match
       
   264 just $s$? Brzozowski gave a neat answer to this question. He defined
       
   265 the following operation on regular expressions, written
       
   266 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
       
   267 
       
   268 \begin{center}
       
   269 \begin{tabular}{lcl}
       
   270 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   271 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   272 		$d \backslash c$     & $\dn$ & 
       
   273 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   274 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   275 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if}\;\nullable(r_1)$\\
       
   276 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   277 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   278 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   279 \end{tabular}
       
   280 \end{center}
       
   281 
       
   282 \noindent
       
   283 In this definition $\nullable(\_)$ stands for a simple recursive
       
   284 function that tests whether a regular expression can match the empty
       
   285 string (its definition is omitted). Assuming the classic notion of a
       
   286 \emph{language} of a regular expression, written $L(\_)$, the main
       
   287 property of the derivative operation is that
       
   288 
       
   289 \begin{center}
       
   290 $c\!::\!s \in L(r)$ holds
       
   291 if and only if $s \in L(r\backslash c)$.
       
   292 \end{center}
       
   293 
       
   294 \noindent
       
   295 The beauty of derivatives is that they lead to a really simple regular
       
   296 expression matching algorithm: To find out whether a string $s$
       
   297 matches with a regular expression $r$, build the derivatives of $r$
       
   298 w.r.t.\ (in succession) all the characters of the string $s$. Finally,
       
   299 test whether the resulting regular expression can match the empty
       
   300 string.  If yes, then $r$ matches $s$; if no, then it does

   301 not. For us the main advantage is that derivatives can be
       
   302 straightforwardly implemented in any functional programming language,
       
   303 and are easily definable and reasoned about in theorem provers---the
       
   304 definitions just consist of inductive datatypes and simple recursive
       
   305 functions. Moreover, the notion of derivatives can be easily
       
   306 generalised to cover extended regular expression constructors such as
       
   307 the not-regular expression, written $\neg\,r$, or bounded repetitions
       
   308 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
       
   309 straightforwardly realised within the classic automata approach.
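This matcher can be sketched directly in Python (a sketch only, not the project's actual implementation; regular expressions are represented here as nested tuples rather than as the datatype above):

```python
# Regular expressions as nested tuples:
# ("zero",), ("one",), ("char", c), ("alt", r1, r2),
# ("seq", r1, r2), ("star", r)

def nullable(r):
    """Test whether r can match the empty string."""
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False  # zero and char

def der(r, c):
    """The Brzozowski derivative of r w.r.t. the character c."""
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "char":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(r[1], c), der(r[2], c))
    if tag == "seq":
        if nullable(r[1]):
            return ("alt", ("seq", der(r[1], c), r[2]), der(r[2], c))
        return ("seq", der(r[1], c), r[2])
    if tag == "star":
        return ("seq", der(r[1], c), r)

def matches(r, s):
    """r matches s iff the iterated derivative is nullable."""
    for c in s:
        r = der(r, c)
    return nullable(r)
```

For instance, with $r = (a^*)^*\,b$ the string $aaab$ is accepted while $aaa$ is rejected. Note that, without simplification, the intermediate derivatives built by this sketch can grow quickly.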
       
   310 
       
   311 
       
   312 One limitation, however, of Brzozowski's algorithm is that it only
       
   313 produces a YES/NO answer for whether a string is matched by a
       
   314 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
       
   315 algorithm to allow generation of an actual matching, called a
       
   316 \emph{value}---see the grammar above for its definition.  Assuming a
       
   317 regular expression matches a string, values encode the information of
       
   318 \emph{how} the string is matched by the regular expression---that is,
       
   319 which part of the string is matched by which part of the regular
       
   320 expression. To illustrate this consider the string $xy$ and the
       
   321 regular expression $(x + (y + xy))^*$. We can view this regular
       
   322 expression as a tree and if the string $xy$ is matched by two Star
       
   323 ``iterations'', then the $x$ is matched by the left-most alternative
       
   324 in this tree and the $y$ by the left alternative nested inside the right alternative. This suggests
       
   325 to record this matching as
       
   326 
       
   327 \begin{center}
       
   328 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
       
   329 \end{center}
       
   330 
       
   331 \noindent
       
   332 where $\Stars$ records how many
       
   333 iterations were used; and $\Left$, respectively $\Right$, which
       
   334 alternative is used. The value for
       
   335 matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
       
   336 would look as follows
       
   337 
       
   338 \begin{center}
       
   339 $\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
       
   340 \end{center}
       
   341 
       
   342 \noindent
       
   343 where $\Stars$ has only a single-element list for the single iteration
       
   344 and $\Seq$ indicates that $xy$ is matched by a sequence regular
       
   345 expression.
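The two values above can be made concrete with a small Python sketch (values again as nested tuples; the helper \textit{flatten}, which recovers the matched string from a value, is not defined in the text above and is introduced here only for illustration):

```python
# Values as nested tuples:
# ("empty",), ("chr", c), ("seq", v1, v2),
# ("left", v), ("right", v), ("stars", [v1, ..., vn])

def flatten(v):
    """Recover the string that a value encodes (an illustrative
    helper, not part of the definitions in the text)."""
    tag = v[0]
    if tag == "empty":
        return ""
    if tag == "chr":
        return v[1]
    if tag == "seq":
        return flatten(v[1]) + flatten(v[2])
    if tag in ("left", "right"):
        return flatten(v[1])
    if tag == "stars":
        return "".join(flatten(w) for w in v[1])

# The two values from the example, both for matching xy
# against (x + (y + xy))*:
two_iterations = ("stars", [("left", ("chr", "x")),
                            ("right", ("left", ("chr", "y")))])
one_iteration = ("stars", [("seq", ("chr", "x"), ("chr", "y"))])
```

Both values flatten to the same string $xy$; they differ only in \emph{how} the match was obtained, and the POSIX strategy selects the second one.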
       
   346 
       
   347 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   348 algorithm by a second phase (the first phase being building successive
       
   349 derivatives). In this second phase, for every successful match the
       
   350 corresponding POSIX value is computed. As mentioned before, from this
       
   351 value we can extract the information \emph{how} a regular expression
       
   352 matched a string. We omit the details here on how Sulzmann and Lu
       
   353 achieved this~(see \cite{Sulzmann2014}). Rather, we shall focus next on the
       
   354 process of simplification of regular expressions, which is needed in
       
   355 order to obtain \emph{fast} versions of Brzozowski's, and Sulzmann
       
   356 and Lu's algorithms.  This is where the PhD-project hopes to advance
       
   357 the state-of-the-art.
       
   358 
       
   359 
       
   360 \section{Simplification of Regular Expressions}
       
   361 
       
   362 The main drawback of building successive derivatives according to
       
   363 Brzozowski's definition is that they can grow very quickly in size.
       
   364 This is mainly due to the fact that the derivative operation often

   365 generates ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
       
   366 if implemented naively both algorithms by Brzozowski and by Sulzmann
       
   367 and Lu are excruciatingly slow. For example when starting with the
       
   368 regular expression $(a + aa)^*$ and building 12 successive derivatives
       
   369 w.r.t.~the character $a$, one obtains a derivative regular expression
       
   370 with more than 8000 nodes (when viewed as a tree). Operations like
       
   371 derivative and $\nullable$ need to traverse such trees and
       
   372 consequently the bigger the size of the derivative the slower the
       
   373 algorithm. Fortunately, one can simplify regular expressions after
       
   374 each derivative step. Various simplifications of regular expressions
       
   375 are possible, such as the simplifications of $\ZERO + r$,
       
   376 $r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
       
   377 $r$. These simplifications do not affect the answer for whether a
       
   378 regular expression matches a string or not, but fortunately also do
       
   379 not affect the POSIX strategy of how regular expressions match
       
   380 strings---although the latter is much harder to establish. Some
       
   381 initial results in this regard have been obtained in
       
   382 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
       
   383 is a very tight bound for the size. Such a tight bound is suggested by
       
   384 work of Antimirov who proved that (partial) derivatives can be bounded
       
   385 by the number of characters contained in the initial regular
       
   386 expression \cite{Antimirov95}. We believe, and have generated test
       
   387 data, that a similar bound can be obtained for the derivatives in
       
   388 Sulzmann and Lu's algorithm. Let us give some details about this next.
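Before that, the effect of the simplifications above can be illustrated with a Python sketch (a sketch only, using nested tuples; \textit{der} and \textit{nullable} implement the derivative operation defined in the previous section):

```python
# Regular expressions as nested tuples: ("zero",), ("one",),
# ("char", c), ("alt", r1, r2), ("seq", r1, r2), ("star", r)

def nullable(r):
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False  # zero and char

def der(r, c):
    """The derivative operation from the previous section."""
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "char":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(r[1], c), der(r[2], c))
    if tag == "seq":
        if nullable(r[1]):
            return ("alt", ("seq", der(r[1], c), r[2]), der(r[2], c))
        return ("seq", der(r[1], c), r[2])
    if tag == "star":
        return ("seq", der(r[1], c), r)

def size(r):
    """Number of nodes of a regular expression viewed as a tree."""
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def simp(r):
    """One bottom-up pass of the simplifications mentioned above:
    0 + r, r + 0, 1 . r, r . 1 and r + r."""
    tag = r[0]
    if tag == "alt":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ("zero",):
            return r2
        if r2 == ("zero",):
            return r1
        if r1 == r2:
            return r1
        return ("alt", r1, r2)
    if tag == "seq":
        r1, r2 = simp(r[1]), simp(r[2])
        if ("zero",) in (r1, r2):
            return ("zero",)
        if r1 == ("one",):
            return r2
        if r2 == ("one",):
            return r1
        return ("seq", r1, r2)
    return r

# Twelve derivatives of (a + aa)* w.r.t. the character a,
# once without and once with simplification after each step:
a = ("char", "a")
r0 = ("star", ("alt", a, ("seq", a, a)))
naive = simped = r0
for _ in range(12):
    naive = der(naive, "a")
    simped = simp(der(simped, "a"))
```

In this example the simplified derivative is considerably smaller than the naive one, but it still grows, because duplicate subexpressions hide inside nested alternatives where the $r + r$ rule cannot see them---this is exactly what the more aggressive simplification rules described below address.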
       
   389 
       
   390 We first followed Sulzmann and Lu's idea of introducing
       
   391 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
       
   392 defined by the following grammar:
       
   393 
       
   394 \begin{center}
       
   395 \begin{tabular}{lcl}
       
   396   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
       
   397                   & $\mid$ & $\textit{ONE}\;\;bs$\\
       
   398                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
       
   399                   & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
       
   400                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
       
   401                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
       
   402 \end{tabular}    
       
   403 \end{center}  
       
   404 
       
   405 \noindent
       
   406 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
       
   407 list of annotated regular expressions. These bitsequences encode
       
   408 information about the (POSIX) value that should be generated by the
       
   409 Sulzmann and Lu algorithm. There are operations that can transform the
       
   410 usual (un-annotated) regular expressions into annotated regular
       
   411 expressions, and there are operations for encoding/decoding values to
       
   412 or from bitsequences. For example the encoding function for values is
       
   413 defined as follows:
       
   414 
       
   415 \begin{center}
       
   416 \begin{tabular}{lcl}
       
   417   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       
   418   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
       
   419   $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
       
   420   $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
       
   421   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
       
   422   $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
       
   423   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\;
       
   424                                                  code(\Stars\,vs)$
       
   425 \end{tabular}    
       
   426 \end{center} 
       
   427 
       
   428 \noindent
       
   429 where $\Z$ and $\S$ are arbitrary names for the ``bits'' in the
       
   430 bitsequences. Although this encoding is ``lossy'' in the sense of not
       
   431 recording all details of a value, Sulzmann and Lu have defined the
       
   432 decoding function such that with additional information (namely the
       
   433 corresponding regular expression) a value can be precisely extracted
       
   434 from a bitsequence.
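The encoding function can be sketched in Python as follows (values as nested tuples; the bits $\Z$ and $\S$ are arbitrarily realised as 0 and 1):

```python
# Values as nested tuples: ("empty",), ("chr", c), ("seq", v1, v2),
# ("left", v), ("right", v), ("stars", [v1, ..., vn]).
# The bits Z and S are arbitrarily realised as 0 and 1.
Z, S = 0, 1

def code(v):
    """The encoding function for values defined above."""
    tag = v[0]
    if tag in ("empty", "chr"):
        return []
    if tag == "left":
        return [Z] + code(v[1])
    if tag == "right":
        return [S] + code(v[1])
    if tag == "seq":
        return code(v[1]) + code(v[2])
    if tag == "stars":
        if not v[1]:
            return [S]
        return [Z] + code(v[1][0]) + code(("stars", v[1][1:]))

# The two values for matching xy against (x + (y + xy))* from above:
two_iterations = ("stars", [("left", ("chr", "x")),
                            ("right", ("left", ("chr", "y")))])
one_iteration = ("stars", [("seq", ("chr", "x"), ("chr", "y"))])
```

The one-iteration (POSIX) value encodes to the bitsequence $[\Z,\S]$: one iteration ($\Z$) followed by the end of the star list ($\S$). Note that the characters themselves are not recorded---this is the ``lossy'' aspect mentioned above, which the decoding function compensates for by consulting the corresponding regular expression.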
       
   435 
       
   436 The main point of the bitsequences and annotated regular expressions
       
   437 is that we can apply rather aggressive (in terms of size)
       
   438 simplification rules in order to keep derivatives small.  We have
       
   439 developed such ``aggressive'' simplification rules and generated test
       
   440 data that show that the expected bound can be achieved. Obviously we
       
   441 could cover the search space only partially, as there are infinitely
       
   442 many regular expressions and strings. One modification we introduced
       
   443 is to allow a list of annotated regular expressions in the
       
   444 \textit{ALTS} constructor. This allows us to not just delete
       
   445 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
       
   446 unnecessary ``copies'' of regular expressions (very similar to
       
   447 simplifying $r + r$ to just $r$, but in a more general
       
   448 setting). Another modification is that we use simplification rules
       
   449 inspired by Antimirov's work on partial derivatives. They maintain the
       
   450 idea that only the first ``copy'' of a regular expression in an
       
   451 alternative contributes to the calculation of a POSIX value. All
       
   452 subsequent copies can be pruned from the regular expression.
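This modified simplification on \textit{ALTS} can be sketched in Python (a sketch only; the bitsequence annotations are omitted for brevity, and alternatives are kept in a tuple):

```python
# Annotated regular expressions as nested tuples, with the
# bitsequence annotations omitted; the alternatives of ALTS
# are kept in a tuple: ("alts", (a1, ..., an))

def simp_alts(r):
    """Flatten nested ALTS, drop ZEROs, and keep only the first
    copy of each alternative (later copies are pruned)."""
    if r[0] != "alts":
        return r
    flat, seen = [], []
    for x in map(simp_alts, r[1]):
        # a nested ALTS contributes all of its alternatives
        xs = x[1] if x[0] == "alts" else (x,)
        for y in xs:
            if y != ("zero",) and y not in seen:
                seen.append(y)
                flat.append(y)
    if not flat:
        return ("zero",)
    if len(flat) == 1:
        return flat[0]
    return ("alts", tuple(flat))

# Example: a + (0 + a + b) simplifies to the two alternatives a + b,
# where the second copy of a has been pruned:
r = ("alts", (("char", "a"),
              ("alts", (("zero",), ("char", "a"), ("char", "b")))))
```

Here \texttt{simp\_alts(r)} yields \texttt{("alts", (("char", "a"), ("char", "b")))}; only the first copy of each alternative survives, mirroring the pruning described above.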
       
   453 
       
   454 We are currently engaged in proving that our simplification rules
       
   455 actually do not affect the POSIX value that should be generated by the
       
   456 algorithm according to the specification of a POSIX value and
       
   457 furthermore that the derivatives stay small for all inputs.  For
       
   458 this proof we use the theorem prover Isabelle. Once completed, this
       
   459 result will advance the state-of-the-art: Sulzmann and Lu wrote in
       
   460 their paper \cite{Sulzmann2014} about the bitcoded ``incremental
       
   461 parsing method'' (that is the matching algorithm outlined in this
       
   462 section):
       
   463 
       
   464 \begin{quote}\it
       
   465   ``Correctness Claim: We further claim that the incremental parsing
       
   466   method in Figure~5 in combination with the simplification steps in
       
   467   Figure 6 yields POSIX parse trees. We have tested this claim
       
   468   extensively by using the method in Figure~3 as a reference but yet
       
   469   have to work out all proof details.''
       
   470 \end{quote}  
       
   471 
       
   472 \noindent
       
   473 We would settle the correctness claim and furthermore obtain a much
       
   474 tighter bound on the sizes of derivatives. The result is that our
       
   475 algorithm should be correct and fast on all inputs.  The exponential

   476 blow-up, as observed in JavaScript, Python and Java, could then no

   477 longer occur in our algorithm.
       
   478 
       
   479 \section{Conclusion}
       
   480 
       
   481 In this PhD-project we are interested in fast algorithms for regular
       
   482 expression matching. While this seems to be a ``settled'' area, in
       
   483 fact interesting research questions are popping up as soon as one steps
       
   484 outside the classic automata theory (for example in terms of what kind
       
   485 of regular expressions are supported). The reason why it is
       
   486 interesting for us to look at the derivative approach introduced by
       
   487 Brzozowski for regular expression matching, and then much further
       
   488 developed by Sulzmann and Lu, is that derivatives can elegantly deal
       
   489 with some of the regular expressions that are of interest in ``real
       
   490 life''. This includes the not-regular expression, written $\neg\,r$
       
   491 (that is, all strings that are not recognised by $r$), but also bounded

   492 regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is
       
   493 also hope that the derivatives can provide another angle for how to
       
   494 deal more efficiently with back-references, which are one of the
       
   495 reasons why regular expression engines in JavaScript, Python and Java
       
   496 choose not to implement the classic automata approach of transforming
       
   497 regular expressions into NFAs and then DFAs---because we simply do not
       
   498 know how such back-references can be represented by DFAs.
       
   499 
       
   500 
       
   501 \bibliographystyle{plain}
       
   502 \bibliography{root}
       
   503 
       
   504 
       
   505 \end{document}