ninems/ninems.tex
changeset 26 7a74081288c1
parent 25 5ca7bf724474
equal deleted inserted replaced
25:5ca7bf724474 26:7a74081288c1
       
     1 \documentclass[a4paper,UKenglish]{lipics}
       
     2 \usepackage{graphic}
       
     3 \usepackage{data}
       
     4 % \documentclass{article}
       
     5 %\usepackage[utf8]{inputenc}
       
     6 %\usepackage[english]{babel}
       
     7 %\usepackage{listings}
       
     8 % \usepackage{amsthm}
       
     9 % \usepackage{hyperref}
       
    10 % \usepackage[margin=0.5in]{geometry}
       
    11 %\usepackage{pmboxdraw}
       
    12  
       
    13 \title{POSIX Regular Expression Matching and Lexing}
       
    14 \author{Chengsong Tan}
       
    15 \affil{King's College London\\
       
    16 London, UK\\
       
    17 \texttt{chengsong.tan@kcl.ac.uk}}
       
    18 \authorrunning{Chengsong Tan}
       
    19 \Copyright{Chengsong Tan}
       
    20 
       
    21 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
    22 \newcommand{\ZERO}{\mbox{\bf 0}}
       
    23 \newcommand{\ONE}{\mbox{\bf 1}}
       
    24 \def\lexer{\mathit{lexer}}
       
    25 \def\mkeps{\mathit{mkeps}}
       
    26 \def\inj{\mathit{inj}}
       
    27 \def\Empty{\mathit{Empty}}
       
    28 \def\Left{\mathit{Left}}
       
    29 \def\Right{\mathit{Right}}
       
    30 \def\Stars{\mathit{Stars}}
       
    31 \def\Char{\mathit{Char}}
       
    32 \def\Seq{\mathit{Seq}}
       
    33 \def\Der{\mathit{Der}}
       
    34 \def\nullable{\mathit{nullable}}
       
    35 \def\Z{\mathit{Z}}
       
    36 \def\S{\mathit{S}}
       
    37 
       
    38 %\theoremstyle{theorem}
       
    39 %\newtheorem{theorem}{Theorem}
       
    40 %\theoremstyle{lemma}
       
    41 %\newtheorem{lemma}{Lemma}
       
    42 %\newcommand{\lemmaautorefname}{Lemma}
       
    43 %\theoremstyle{definition}
       
    44 %\newtheorem{definition}{Definition}
       
    45 
       
    46 
       
    47 \begin{document}
       
    48 
       
    49 \maketitle
       
    50 
       
    51 \begin{abstract}
       
    52   Brzozowski introduced in 1964 a beautifully simple algorithm for
       
    53   regular expression matching based on the notion of derivatives of
       
    54   regular expressions. In 2014, Sulzmann and Lu extended this
       
    55   algorithm to not just give a YES/NO answer for whether or not a regular
       
    56   expression matches a string, but in case it matches also \emph{how}
       
    57   it matches the string.  This is important for applications such as
       
    58   lexing (tokenising a string). The problem is to make the algorithm
       
    59   by Sulzmann and Lu fast on all inputs without breaking its
       
    60   correctness.
       
    61 \end{abstract}
       
    62 
       
    63 \section{Introduction}
       
    64 
       
    65 This PhD-project is about regular expression matching and
       
    66 lexing. Given the maturity of this topic, the reader might wonder:
       
    67 Surely, regular expressions must have already been studied to death?
       
    68 What could possibly be \emph{not} known in this area? And surely all
       
    69 implemented algorithms for regular expression matching are blindingly
       
    70 fast?
       
    71 
       
    72 
       
    73 
       
    74 Unfortunately these preconceptions are not supported by evidence: Take
       
    75 for example the regular expression $(a^*)^*\,b$ and ask whether
       
    76 strings of the form $aa..a$ match this regular
       
    77 expression. Obviously they do not match---the expected $b$ in the last
       
    78 position is missing. One would expect that modern regular expression
       
    79 matching engines can find this out very quickly. Alas, if one tries
       
    80 this example in JavaScript, Python or Java 8 with strings like 28
       
    81 $a$'s, one discovers that this decision takes around 30 seconds and
       
    82 takes considerably longer when adding a few more $a$'s, as the graphs
       
    83 below show:
       
    84 
       
    85 \begin{center}
       
    86 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
    87 \begin{tikzpicture}
       
    88 \begin{axis}[
       
    89     xlabel={$n$},
       
    90     x label style={at={(1.05,-0.05)}},
       
    91     ylabel={time in secs},
       
    92     enlargelimits=false,
       
    93     xtick={0,5,...,30},
       
    94     xmax=33,
       
    95     ymax=35,
       
    96     ytick={0,5,...,30},
       
    97     scaled ticks=false,
       
    98     axis lines=left,
       
    99     width=5cm,
       
   100     height=4cm, 
       
   101     legend entries={JavaScript},  
       
   102     legend pos=north west,
       
   103     legend cell align=left]
       
   104 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
   105 \end{axis}
       
   106 \end{tikzpicture}
       
   107   &
       
   108 \begin{tikzpicture}
       
   109 \begin{axis}[
       
   110     xlabel={$n$},
       
   111     x label style={at={(1.05,-0.05)}},
       
   112     %ylabel={time in secs},
       
   113     enlargelimits=false,
       
   114     xtick={0,5,...,30},
       
   115     xmax=33,
       
   116     ymax=35,
       
   117     ytick={0,5,...,30},
       
   118     scaled ticks=false,
       
   119     axis lines=left,
       
   120     width=5cm,
       
   121     height=4cm, 
       
   122     legend entries={Python},  
       
   123     legend pos=north west,
       
   124     legend cell align=left]
       
   125 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   126 \end{axis}
       
   127 \end{tikzpicture}
       
   128   &
       
   129 \begin{tikzpicture}
       
   130 \begin{axis}[
       
   131     xlabel={$n$},
       
   132     x label style={at={(1.05,-0.05)}},
       
   133     %ylabel={time in secs},
       
   134     enlargelimits=false,
       
   135     xtick={0,5,...,30},
       
   136     xmax=33,
       
   137     ymax=35,
       
   138     ytick={0,5,...,30},
       
   139     scaled ticks=false,
       
   140     axis lines=left,
       
   141     width=5cm,
       
   142     height=4cm, 
       
   143     legend entries={Java 8},  
       
   144     legend pos=north west,
       
   145     legend cell align=left]
       
   146 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   147 \end{axis}
       
   148 \end{tikzpicture}\\
       
   149 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   150            of the form $\underbrace{aa..a}_{n}$.}
       
   151 \end{tabular}    
       
   152 \end{center}  
       
   153 
       
   154 \noindent These are clearly abysmal and possibly surprising results.
       
   155 One would expect these systems doing much better than that---after
       
   156 all, given a DFA and a string, whether a string is matched by this DFA
       
   157 should be linear.
       
   158 
       
   159 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
       
   160 exhibit this ``exponential behaviour''.  Unfortunately, such regular
       
   161 expressions are not just a few ``outliers'', but actually they are
       
   162 frequent enough that a separate name has been created for
       
   163 them---\emph{evil regular expressions}. In empiric work, Davis et al
       
   164 report that they have found thousands of such evil regular expressions
       
   165 in the JavaScript and Python ecosystems \cite{Davis18}.
       
   166 
       
   167 This exponential blowup sometimes causes real pain in real life:
       
   168 for example on 20 July 2016 one evil regular expression brought the
       
   169 webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \cite{SE16}.
       
   170 In this instance, a regular expression intended to just trim white
       
   171 spaces from the beginning and the end of a line actually consumed
       
   172 massive amounts of CPU-resources and because of this the web servers
       
   173 ground to a halt. This happened when a post with 20,000 white spaces
       
   174 was submitted, but importantly the white spaces were neither at the
       
   175 beginning nor at the end. As a result, the regular expression matching
       
   176 engine needed to backtrack over many choices.
       
   177 
       
   178 The underlying problem is that many ``real life'' regular expression
       
   179 matching engines do not use DFAs for matching. This is because they
       
   180 support regular expressions that are not covered by the classical
       
   181 automata theory, and in this more general setting there are quite a
       
   182 few research questions still unanswered and fast algorithms still need
       
   183 to be developed.
       
   184 
       
   185 There is also another under-researched problem to do with regular
       
   186 expressions and lexing, i.e.~the process of breaking up strings into
       
   187 sequences of tokens according to some regular expressions. In this
       
   188 setting one is not just interested in whether or not a regular
       
   189 expression matches a string, but if it matches also in \emph{how} it
       
   190 matches the string.  Consider for example a regular expression
       
   191 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
       
   192 and so on; and a regular expression $r_{id}$ for recognising
       
   193 identifiers (say, a single character followed by characters or
       
   194 numbers). One can then form the compound regular expression
       
   195 $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But then how
       
   196 should the string \textit{iffoo} be tokenised?  It could be tokenised
       
   197 as a keyword followed by an identifier, or the entire string as a
       
   198 single identifier.  Similarly, how should the string \textit{if} be
       
   199 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
       
   200 ``fire''---so is it an identifier or a keyword?  While in applications
       
   201 there is a well-known strategy to decide these questions, called POSIX
       
   202 matching, only relatively recently precise definitions of what POSIX
       
   203 matching actually means have been formalised
       
   204 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
       
   205 POSIX matching means matching the longest initial substring and
       
   206 in the case of a tie, the initial submatch is chosen according to some priorities attached to the
       
   207 regular expressions (e.g.~keywords have a higher priority than
       
   208 identifiers). This sounds rather simple, but according to Grathwohl et
       
   209 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
       
   210 
       
   211 \begin{quote}
       
   212 \it{}``The POSIX strategy is more complicated than the greedy because of 
       
   213 the dependence on information about the length of matched strings in the 
       
   214 various subexpressions.''
       
   215 \end{quote}
       
   216 
       
   217 \noindent
       
   218 This is also supported by evidence collected by Kuklewicz
       
   219 \cite{Kuklewicz} who noticed that a number of POSIX regular expression
       
   220 matchers calculate incorrect results.
       
   221 
       
   222 Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
       
   223 regular expression matching according to the POSIX strategy
       
   224 \cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
       
   225 Brzozowski from 1964 where he introduced the notion of derivatives of
       
   226 regular expressions \cite{Brzozowski1964}. We shall briefly explain
       
   227 the algorithms next.
       
   228 
       
   229 \section{The Algorithms by  Brzozowski, and Sulzmann and Lu}
       
   230 
       
   231 Suppose regular expressions are given by the following grammar (for
       
   232 the moment ignore the grammar for values on the right-hand side):
       
   233 
       
   234 
       
   235 \begin{center}
       
   236 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   237 		\begin{tabular}{@{}rrl@{}}
       
   238 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   239 			$r$ & $::=$  & $\ZERO$\\
       
   240 			& $\mid$ & $\ONE$   \\
       
   241 			& $\mid$ & $c$          \\
       
   242 			& $\mid$ & $r_1 \cdot r_2$\\
       
   243 			& $\mid$ & $r_1 + r_2$   \\
       
   244 			\\
       
   245 			& $\mid$ & $r^*$         \\
       
   246 		\end{tabular}
       
   247 		&
       
   248 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   249 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   250 			$v$ & $::=$  & \\
       
   251 			&        & $\Empty$   \\
       
   252 			& $\mid$ & $\Char(c)$          \\
       
   253 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   254 			& $\mid$ & $\Left(v)$   \\
       
   255 			& $\mid$ & $\Right(v)$  \\
       
   256 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   257 		\end{tabular}
       
   258 	\end{tabular}
       
   259 \end{center}
       
   260 
       
   261 \noindent
       
   262 The intended meaning of the regular expressions is as usual: $\ZERO$
       
   263 cannot match any string, $\ONE$ can match the empty string, the
       
   264 character regular expression $c$ can match the character $c$, and so
       
   265 on. The brilliant contribution by Brzozowski is the notion of
       
   266 \emph{derivatives} of regular expressions.  The idea behind this
       
   267 notion is as follows: suppose a regular expression $r$ can match a
       
   268 string of the form $c\!::\! s$ (that is a list of characters starting
       
   269 with $c$), what does the regular expression look like that can match
       
   270 just $s$? Brzozowski gave a neat answer to this question. He defined
       
   271 the following operation on regular expressions, written
       
   272 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
       
   273 
       
   274 \begin{center}
       
   275 \begin{tabular}{lcl}
       
   276 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   277 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   278 		$d \backslash c$     & $\dn$ & 
       
   279 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   280 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   281 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \nullable(r_1)$\\
       
   282 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   283 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   284 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   285 \end{tabular}
       
   286 \end{center}
       
   287 
       
   288 \noindent
       
   289 In this definition $\nullable(\_)$ stands for a simple recursive
       
   290 function that tests whether a regular expression can match the empty
       
   291 string (its definition is omitted). Assuming the classic notion of a
       
   292 \emph{language} of a regular expression, written $L(\_)$, the main
       
   293 property of the derivative operation is that
       
   294 
       
   295 \begin{center}
       
   296 $c\!::\!s \in L(r)$ holds
       
   297 if and only if $s \in L(r\backslash c)$.
       
   298 \end{center}
       
   299 
       
   300 \noindent
       
   301 The beauty of derivatives is that they lead to a really simple regular
       
   302 expression matching algorithm: To find out whether a string $s$
       
   303 matches with a regular expression $r$, build the derivatives of $r$
       
   304 w.r.t.\ (in succession) all the characters of the string $s$. Finally,
       
   305 test whether the resulting regular expression can match the empty
       
   306 string.  If yes, then $r$ matches $s$, and no in the negative
       
   307 case. For us the main advantage is that derivatives can be
       
   308 straightforwardly implemented in any functional programming language,
       
   309 and are easily definable and reasoned about in theorem provers---the
       
   310 definitions just consist of inductive datatypes and simple recursive
       
   311 functions. Moreover, the notion of derivatives can be easily
       
   312 generalised to cover extended regular expression constructors such as
       
   313 the not-regular expression, written $\neg\,r$, or bounded repetitions
       
   314 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
       
   315 straightforwardly realised within the classic automata approach.
       
   316 
       
   317 
       
   318 One limitation, however, of Brzozowski's algorithm is that it only
       
   319 produces a YES/NO answer for whether a string is being matched by a
       
   320 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
       
   321 algorithm to allow generation of an actual matching, called a
       
   322 \emph{value}---see the grammar above for its definition.  Assuming a
       
   323 regular expression matches a string, values encode the information of
       
   324 \emph{how} the string is matched by the regular expression---that is,
       
   325 which part of the string is matched by which part of the regular
       
   326 expression. To illustrate this consider the string $xy$ and the
       
   327 regular expression $(x + (y + xy))^*$. We can view this regular
       
   328 expression as a tree and if the string $xy$ is matched by two Star
       
   329 ``iterations'', then the $x$ is matched by the left-most alternative
       
   330 in this tree and the $y$ by the right-left alternative. This suggests
       
   331 to record this matching as
       
   332 
       
   333 \begin{center}
       
   334 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
       
   335 \end{center}
       
   336 
       
   337 \noindent
       
   338 where $\Stars$ records how many
       
   339 iterations were used; and $\Left$, respectively $\Right$, which
       
   340 alternative is used. The value for
       
   341 matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
       
   342 would look as follows
       
   343 
       
   344 \begin{center}
       
   345 $\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
       
   346 \end{center}
       
   347 
       
   348 \noindent
       
   349 where $\Stars$ has only a single-element list for the single iteration
       
   350 and $\Seq$ indicates that $xy$ is matched by a sequence regular
       
   351 expression.
       
   352 
       
   353 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   354 algorithm by a second phase (the first phase being building successive
       
   355 derivatives). In this second phase, for every successful match the
       
   356 corresponding POSIX value is computed. As mentioned before, from this
       
   357 value we can extract the information \emph{how} a regular expression
       
   358 matched a string. We omit the details here on how Sulzmann and Lu
       
   359 achieved this~(see \cite{Sulzmann2014}). Rather, we shall focus next on the
       
   360 process of simplification of regular expressions, which is needed in
       
   361 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
       
   362 and Lu's algorithms.  This is where the PhD-project hopes to advance
       
   363 the state-of-the-art.
       
   364 
       
   365 
       
   366 \section{Simplification of Regular Expressions}
       
   367 
       
   368 The main drawback of building successive derivatives according to
       
   369 Brzozowski's definition is that they can grow very quickly in size.
       
   370 This is mainly due to the fact that the derivative operation generates
       
   371 often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
       
   372 if implemented naively both algorithms by Brzozowski and by Sulzmann
       
   373 and Lu are excruciatingly slow. For example when starting with the
       
   374 regular expression $(a + aa)^*$ and building 12 successive derivatives
       
   375 w.r.t.~the character $a$, one obtains a derivative regular expression
       
   376 with more than 8000 nodes (when viewed as a tree). Operations like
       
   377 derivative and $\nullable$ need to traverse such trees and
       
   378 consequently the bigger the size of the derivative the slower the
       
   379 algorithm. Fortunately, one can simplify regular expressions after
       
   380 each derivative step. Various simplifications of regular expressions
       
   381 are possible, such as the simplifications of $\ZERO + r$,
       
   382 $r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
       
   383 $r$. These simplifications do not affect the answer for whether a
       
   384 regular expression matches a string or not, but fortunately also do
       
   385 not affect the POSIX strategy of how regular expressions match
       
   386 strings---although the latter is much harder to establish. Some
       
   387 initial results in this regard have been obtained in
       
   388 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
       
   389 is a very tight bound for the size. Such a tight bound is suggested by
       
   390 work of Antimirov who proved that (partial) derivatives can be bound
       
   391 by the number of characters contained in the initial regular
       
   392 expression \cite{Antimirov95}. We believe, and have generated test
       
   393 data, that a similar bound can be obtained for the derivatives in
       
   394 Sulzmann and Lu's algorithm. Let us give some details about this next.
       
   395 
       
   396 We first followed Sulzmann and Lu's idea of introducing
       
   397 \emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
       
   398 defined by the following grammar:
       
   399 
       
   400 \begin{center}
       
   401 \begin{tabular}{lcl}
       
   402   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
       
   403                   & $\mid$ & $\textit{ONE}\;\;bs$\\
       
   404                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
       
   405                   & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
       
   406                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
       
   407                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
       
   408 \end{tabular}    
       
   409 \end{center}  
       
   410 
       
   411 \noindent
       
   412 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
       
   413 list of annotated regular expressions. These bitsequences encode
       
   414 information about the (POSIX) value that should be generated by the
       
   415 Sulzmann and Lu algorithm. There are operations that can transform the
       
   416 usual (un-annotated) regular expressions into annotated regular
       
   417 expressions, and there are operations for encoding/decoding values to
       
   418 or from bitsequences. For example the encoding function for values is
       
   419 defined as follows:
       
   420 
       
   421 \begin{center}
       
   422 \begin{tabular}{lcl}
       
   423   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       
   424   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
       
   425   $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
       
   426   $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
       
   427   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
       
   428   $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
       
   429   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\;
       
   430                                                  code(\Stars\,vs)$
       
   431 \end{tabular}    
       
   432 \end{center} 
       
   433 
       
   434 \noindent
       
   435 where $\Z$ and $\S$ are arbitrary names for the ``bits'' in the
       
   436 bitsequences. Although this encoding is ``lossy'' in the sense of not
       
   437 recording all details of a value, Sulzmann and Lu have defined the
       
   438 decoding function such that with additional information (namely the
       
   439 corresponding regular expression) a value can be precisely extracted
       
   440 from a bitsequence.
       
   441 
       
   442 The main point of the bitsequences and annotated regular expressions
       
   443 is that we can apply rather aggressive (in terms of size)
       
   444 simplification rules in order to keep derivatives small.  We have
       
   445 developed such ``aggressive'' simplification rules and generated test
       
   446 data that show that the expected bound can be achieved. Obviously we
       
   447 could only partially cover  the search space as there are infinitely
       
   448 many regular expressions and strings. One modification we introduced
       
   449 is to allow a list of annotated regular expressions in the
       
   450 \textit{ALTS} constructor. This allows us to not just delete
       
   451 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
       
   452 unnecessary ``copies'' of regular expressions (very similar to
       
   453 simplifying $r + r$ to just $r$, but in a more general
       
   454 setting). Another modification is that we use simplification rules
       
   455 inspired by Antimirov's work on partial derivatives. They maintain the
       
   456 idea that only the first ``copy'' of a regular expression in an
       
   457 alternative contributes to the calculation of a POSIX value. All
       
   458 subsequent copies can be prunned from the regular expression.
       
   459 
       
   460 We are currently engaged with proving that our simplification rules
       
   461 actually do not affect the POSIX value that should be generated by the
       
   462 algorithm according to the specification of a POSIX value and
       
   463 furthermore that our derivatives stay small for all derivatives. For
       
   464 this proof we use the theorem prover Isabelle. Once completed, this
       
   465 result will advance the state-of-the-art: Sulzmann and Lu wrote in
       
   466 their paper \cite{Sulzmann2014} about the bitcoded ``incremental
       
   467 parsing method'' (that is the matching algorithm outlined in this
       
   468 section):
       
   469 
       
   470 \begin{quote}\it
       
   471   ``Correctness Claim: We further claim that the incremental parsing
       
   472   method in Figure~5 in combination with the simplification steps in
       
   473   Figure 6 yields POSIX parse trees. We have tested this claim
       
   474   extensively by using the method in Figure~3 as a reference but yet
       
   475   have to work out all proof details.''
       
   476 \end{quote}  
       
   477 
       
   478 \noindent
       
   479 We would settle the correctness claim and furthermore obtain a much
       
   480 tighter bound on the sizes of derivatives. The result is that our
       
   481 algorithm should be correct and faster on all inputs.  The original
       
   482 blow-up, as observed in JavaScript, Python and Java, would be excluded
       
   483 from happening in our algorithm.
       
   484 
       
   485 \section{Conclusion}
       
   486 
       
   487 In this PhD-project we are interested in fast algorithms for regular
       
   488 expression matching. While this seems to be a ``settled'' area, in
       
   489 fact interesting research questions are popping up as soon as one steps
       
   490 outside the classic automata theory (for example in terms of what kind
       
   491 of regular expressions are supported). The reason why it is
       
   492 interesting for us to look at the derivative approach introduced by
       
   493 Brzozowski for regular expression matching, and then much further
       
   494 developed by Sulzmann and Lu, is that derivatives can elegantly deal
       
   495 with some of the regular expressions that are of interest in ``real
       
   496 life''. This includes the not-regular expression, written $\neg\,r$
       
   497 (that is all strings that are not recognised by $r$), but also bounded
       
   498 regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$). There is
       
   499 also hope that the derivatives can provide another angle for how to
       
   500 deal more efficiently with back-references, which are one of the
       
   501 reasons why regular expression engines in JavaScript, Python and Java
       
   502 choose to not implement the classic automata approach of transforming
       
   503 regular expressions into NFAs and then DFAs---because we simply do not
       
   504 know how such back-references can be represented by DFAs.
       
   505 
       
   506 
       
   507 \bibliographystyle{plain}
       
   508 \bibliography{root}
       
   509 
       
   510 
       
   511 \end{document}