       
     1 \documentclass[a4paper,UKenglish]{lipics}
       
\usepackage{graphicx}
\usepackage{pgfplots}
       
     3 \usepackage{data}
       
     4 \usepackage{tikz-cd}
       
     5 %\usepackage{algorithm}
       
     6 \usepackage{amsmath}
       
     7 \usepackage[noend]{algpseudocode}
       
     8 \usepackage{enumitem}
       
     9 \usepackage{nccmath}
       
    10 
       
    11 \definecolor{darkblue}{rgb}{0,0,0.6}
       
    12 \hypersetup{colorlinks=true,allcolors=darkblue}
       
    13 \newcommand{\comment}[1]%
       
    14 {{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}
       
    15 
       
    16 % \documentclass{article}
       
    17 %\usepackage[utf8]{inputenc}
       
    18 %\usepackage[english]{babel}
       
    19 %\usepackage{listings}
       
    20 % \usepackage{amsthm}
       
    21 %\usepackage{hyperref}
       
    22 % \usepackage[margin=0.5in]{geometry}
       
    23 %\usepackage{pmboxdraw}
       
    24  
       
    25 \title{POSIX Regular Expression Matching and Lexing}
       
    26 \author{Chengsong Tan}
       
    27 \affil{King's College London\\
       
    28 London, UK\\
       
    29 \texttt{chengsong.tan@kcl.ac.uk}}
       
    30 \authorrunning{Chengsong Tan}
       
    31 \Copyright{Chengsong Tan}
       
    32 
       
    33 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
    34 \newcommand{\ZERO}{\mbox{\bf 0}}
       
    35 \newcommand{\ONE}{\mbox{\bf 1}}
       
    36 \def\bders{\textit{bders}}
       
    37 \def\lexer{\mathit{lexer}}
       
    38 \def\blexer{\textit{blexer}}
       
    39 \def\blexers{\mathit{blexer\_simp}}
       
    40 \def\mkeps{\mathit{mkeps}}
       
    41 \def\bmkeps{\textit{bmkeps}}
       
    42 \def\inj{\mathit{inj}}
       
    43 \def\Empty{\mathit{Empty}}
       
    44 \def\Left{\mathit{Left}}
       
    45 \def\Right{\mathit{Right}}
       
    46 \def\Stars{\mathit{Stars}}
       
    47 \def\Char{\mathit{Char}}
       
    48 \def\Seq{\mathit{Seq}}
       
    49 \def\Der{\mathit{Der}}
       
    50 \def\nullable{\mathit{nullable}}
       
    51 \def\Z{\mathit{Z}}
       
    52 \def\S{\mathit{S}}
       
    53 \def\flex{\textit{flex}}
       
    54 \def\rup{r^\uparrow}
       
    55 \def\retrieve{\textit{retrieve}}
       
    56 \def\AALTS{\textit{AALTS}}
       
    57 \def\AONE{\textit{AONE}}
       
    58 %\theoremstyle{theorem}
       
    59 %\newtheorem{theorem}{Theorem}
       
    60 %\theoremstyle{lemma}
       
    61 %\newtheorem{lemma}{Lemma}
       
    62 %\newcommand{\lemmaautorefname}{Lemma}
       
    63 %\theoremstyle{definition}
       
    64 %\newtheorem{definition}{Definition}
       
    65 \algnewcommand\algorithmicswitch{\textbf{switch}}
       
    66 \algnewcommand\algorithmiccase{\textbf{case}}
       
    67 \algnewcommand\algorithmicassert{\texttt{assert}}
       
    68 \algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
       
    69 % New "environments"
       
    70 \algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
       
    71 \algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
       
    72 \algtext*{EndSwitch}%
       
    73 \algtext*{EndCase}%
       
    74 
       
    75 
       
    76 \begin{document}
       
    77 
       
    78 \maketitle
       
    79 
       
    80 \begin{abstract}
       
    81   Brzozowski introduced in 1964 a beautifully simple algorithm for
       
    82   regular expression matching based on the notion of derivatives of
       
    83   regular expressions. In 2014, Sulzmann and Lu extended this
       
    84   algorithm to not just give a YES/NO answer for whether or not a
       
    85   regular expression matches a string, but in case it does also
       
    86   answers with \emph{how} it matches the string.  This is important for
       
    87   applications such as lexing (tokenising a string). The problem is to
       
    88   make the algorithm by Sulzmann and Lu fast on all inputs without
       
    89   breaking its correctness. We have already developed some
       
    90   simplification rules for this, but have not yet proved that they
       
    91   preserve the correctness of the algorithm. We also have not yet
       
    92   looked at extended regular expressions, such as bounded repetitions,
       
    93   negation and back-references.
       
    94 \end{abstract}
       
    95 
       
    96 \section{Introduction}
       
    97 
       
    98 
       
    99 
       
   100 \noindent\rule[0.5ex]{\linewidth}{1pt}
       
Between the two horizontal rules is the new material.\\
In the past six months I have been trying to prove that the bit-coded algorithm with simplification is correct, namely that
       
   103 \begin{center}
       
   104 $\blexers \;r \; s = \blexer \; r \; s$
       
   105 \end{center}
       
   106 \noindent
       
To prove this, we need to show that the two functions produce the same output
regardless of whether $s \in L(r)$.
The definitions of $\blexer$ and $\blexers$ are as follows:
       
   110 \begin{center}
       
   111 \begin{tabular}{lcl}
       
   112   $\textit{blexer}\;r\,s$ & $\dn$ &
       
   113       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
   114   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   115   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   116   & & $\;\;\textit{else}\;\textit{None}$
       
   117 \end{tabular}
       
   118 \end{center}
       
   119 
       
   120  \begin{center}
       
   121 \begin{tabular}{lcl}
       
   122   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
   123       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
   124   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   125   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   126   & & $\;\;\textit{else}\;\textit{None}$
       
   127 \end{tabular}
       
   128 \end{center}
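\noindent
For readers who prefer code, the following Scala sketch shows the top-level
structure shared by the two functions (it is only our transcription of the
two definitions above; the helpers \texttt{internalise} (i.e.~$r^\uparrow$),
\texttt{bders}, \texttt{bdersSimp}, \texttt{bnullable}, \texttt{bmkeps} and
\texttt{decode} are assumed, not defined here):

\begin{verbatim}
// Sketch: top-level structure of the two bit-coded lexers.
// internalise, bders, bdersSimp, bnullable, bmkeps and decode
// are assumed helper functions.
def blexer(r: Rexp, s: List[Char]): Option[Val] = {
  val a = bders(s, internalise(r))      // (r^) \ s
  if (bnullable(a)) Some(decode(bmkeps(a), r)) else None
}

def blexerSimp(r: Rexp, s: List[Char]): Option[Val] = {
  val a = bdersSimp(s, internalise(r))  // (r^) \_simp s
  if (bnullable(a)) Some(decode(bmkeps(a), r)) else None
}
\end{verbatim}

\noindent
The two lexers differ only in which derivative function they use.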
       
\noindent
In terms of the definitions of $\blexer$ and $\blexers$, the problem boils
down to proving the following two propositions (depending on which branch of
the if-else clause is taken):
       
   132 
       
   133 \begin{itemize}
       
   134 
       
\item
When $s$ is a string in the language $L(r)$,\\
$\textit{bmkeps}\, ((r^\uparrow)\backslash_{simp}\, s)  = \textit{bmkeps}\, ((r^\uparrow)\backslash s)$;
\item
when $s$ is not a string of the language $L(r)$,
$(r^\uparrow)\backslash_{simp}\, s$ is not nullable.
\end{itemize}
       
The second proposition is relatively straightforward to prove in Isabelle.
The first requires more effort.
       
   144 It builds on the result that the bit-coded algorithm without simplification
       
   145 produces the correct result:
       
   146 \begin{center}
       
$\blexer \;r\; s = \lexer \; r\; s$
       
   148 \end{center}
       
   149 \noindent
       
The definition of $\lexer$ and its correctness proof are omitted here
(see \cite{AusafDyckhoffUrban2016}).
If we can prove that the bit-coded algorithm with simplification produces
the same result as the original bit-coded algorithm,
then we are done.
       
   155 The correctness proof of
       
   156 \begin{center}
       
$\blexer \; r\; s = \lexer \;r \;s$
       
   158 \end{center}
       
   159 \noindent
       
   160 might provide us insight into proving 
       
   161 \begin{center}
       
$\blexer \; r \;s = \blexers \; r \;s$
       
   163 \end{center}
       
   164 \noindent
       
   165 (that is also
       
   166 why we say the new proof builds on the older one).
       
That proof defines the function $\flex$ as another way of
expressing the $\lexer$ function:
\begin{center}
$\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; (r\backslash s))$
\end{center}
\noindent
(the proof of this equality will be explained later).
The definition of $\flex$ is as follows:
       
   175 \begin{center}
       
   176 \begin{tabular}{lcl}
       
$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \;  (r\backslash c) \;(\lambda v. f (\inj \; r \; c \; v)) \;s$ \\
       
   178 $\textit{flex} \;r\; f\;  [\,] $ & $\dn$ & $f$
       
   179 \end{tabular}
       
   180 \end{center}
       
   181 \noindent
       
Here $\flex$ essentially does lexing by
       
   183 stacking up injection functions while doing derivatives,
       
   184 explicitly showing the order of characters being
       
   185 injected back in each step.
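\noindent
As a concrete illustration, here is a small Scala sketch of $\flex$ (our own
transcription of the definition above; the functions \texttt{der} and
\texttt{inj} and the datatypes \texttt{Rexp} and \texttt{Val} are assumed as
sketched later in this document):

\begin{verbatim}
// flex stacks up one injection per character while taking
// derivatives; der and inj are assumed helper functions.
def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val =
  s match {
    case Nil => f
    case c :: cs =>
      flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
  }
\end{verbatim}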
       
   186 With $\flex$ we can write $\lexer$ this way: 
       
   187 \begin{center}
       
$\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; (r\backslash s))$
       
   189 \end{center}
       
   190 \noindent
       
$\flex$ focuses on the injections rather than the derivatives,
whereas the original definition of $\lexer$ puts equal emphasis on
the derivative and the injection performed for each character:
       
   198 \begin{center}
       
   199 \begin{tabular}{lcl}
       
$\textit{lexer} \; r\; (c\!::\!s) $ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \;s \; \textit{of}$ \\
 & & $\quad\textit{None} \; \Longrightarrow \; \textit{None}$\\
  & & $\quad\textbar \; \textit{Some} \; v \; \Longrightarrow \; \textit{Some} \; (\inj \; r\;c\;v)$\\
$\textit{lexer} \; r\;  [\,] $ & $\dn$ & $\textit{if} \; \nullable (r) \; \textit{then} \; \textit{Some} \; (\mkeps \; r) \; \textit{else} \;\textit{None}$
       
   204 \end{tabular}
       
   205 \end{center}
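\noindent
A Scala sketch of this definition might look as follows (again assuming
\texttt{der}, \texttt{inj}, \texttt{mkeps} and \texttt{nullable} as sketched
later in this document):

\begin{verbatim}
// Direct recursive lexer: take a derivative for each character,
// then inject the characters back into the mkeps value.
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None => None
    case Some(v) => Some(inj(r, c, v))
  }
}
\end{verbatim}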
       
   206 \noindent
       
Using this feature of $\flex$, we can rewrite lexing
w.r.t.~$s @ [c]$ in terms of lexing
w.r.t.~$s$:
       
   210 \begin{center}
       
$\flex \; r \; \textit{id} \; (s@[c]) \; v = \flex \;  r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; v)$.
       
   212 \end{center}
       
   213 \noindent
       
This allows us to use
the inductive hypothesis to obtain
       
   216 \begin{center} 
       
   217 $ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
       
   218 \end{center}
       
   219 \noindent
       
By a property of $\retrieve$, the right-hand side of the above equality equals
$\textit{decode}\,(\retrieve\, (\rup \backslash(s @ [c]))\; v)\; r$, and this gives the
main lemma result:
       
   223 \begin{center}
       
   224 $ \flex \;r\;  id \; (s@[c]) \; v =\textit{decode}(\textit{retrieve} (\rup \backslash (s@[c])) \;v) r$
       
   225 \end{center}
       
   226 \noindent
       
To use this lemma in our
correctness proof, we simply replace the $v$ on the
right-hand side of the above equality with
$\mkeps\;(r\backslash (s@[c]))$ and apply the lemma that
       
   231  
       
   232 \begin{center}
       
$\textit{decode} \; (\bmkeps \; \rup) \; r = \textit{decode} \; (\retrieve \; \rup \; (\mkeps\;r)) \;r$
       
   234 \end{center}
       
   235 \noindent
       
   236 We get the correctness of our bit-coded algorithm:
       
   237 \begin{center}
       
$\flex \;r\;  \textit{id} \; s \; (\mkeps \; (r\backslash s)) = \textit{decode}\;(\bmkeps \; (\rup\backslash s)) \; r$
       
   239 \end{center}
       
   240 \noindent
       
The bridge between the above chain of equalities
is the use of $\retrieve$.
If we want to use a similar technique for the
simplified version of the algorithm,
we face the problem that in the above
equalities
$\retrieve \; a \; v$ is not always defined.
For example,
$\retrieve \; (\AALTS(\Z, \AONE(\S), \AONE(\S))) \; (\Left(\Empty))$
is defined, but $\retrieve \; (\AONE(\Z\S)) \; (\Left(\Empty))$ is not,
even though we can extract the same POSIX
bits from the two annotated regular expressions.
       
   253 That means, if we 
       
   254 want to prove that 
       
   255 \begin{center}
       
$\textit{decode} \; (\bmkeps \; (\rup\backslash s)) \; r = \textit{decode} \; (\bmkeps \; (\rup\backslash_{simp} s)) \; r$
       
   257 \end{center}
       
   258 \noindent
       
holds by using $\retrieve$,
then we probably need to prove an equality like the one below:
       
   261 \begin{center}
       
   262 %$\retrieve \; \rup\backslash_{simp} s \; \mkeps(r\backslash_{simp} s)=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
       
   263 $\retrieve \; \rup\backslash_{simp} s \; \mkeps(r\backslash s)=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
       
   264 \end{center}
       
   265 \noindent
       
we would need to rectify the value $\mkeps\,(r\backslash s)$ into something simpler
so that the $\retrieve$ function becomes defined on the simplified regular expression.\\
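\noindent
For reference, the following Scala sketch shows one possible definition of
$\retrieve$ over bit-annotated regular expressions, following Sulzmann and
Lu's presentation. The bit convention used here (a $\Z$ bit for a further star
iteration, an $\S$ bit for its end) and the value constructors
(\texttt{Empty}, \texttt{Chr}, \texttt{Sequ}, \texttt{Left}, \texttt{Right},
\texttt{Stars}, sketched later in this document) are assumptions of this
sketch. Note that the pattern match is partial: the regular expression and
the value must agree in shape, which is exactly the problem discussed above.

\begin{verbatim}
// Bits and annotated regular expressions (binary alternatives
// only, for brevity).
abstract class Bit
case object Z extends Bit
case object S extends Bit

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALT(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp

// retrieve is partial: it is only defined when the value v
// fits the shape of the annotated regular expression.
def retrieve(r: ARexp, v: Val): List[Bit] = (r, v) match {
  case (AONE(bs), Empty) => bs
  case (ACHAR(bs, _), Chr(_)) => bs
  case (AALT(bs, r1, _), Left(v1)) => bs ++ retrieve(r1, v1)
  case (AALT(bs, _, r2), Right(v2)) => bs ++ retrieve(r2, v2)
  case (ASEQ(bs, r1, r2), Sequ(v1, v2)) =>
    bs ++ retrieve(r1, v1) ++ retrieve(r2, v2)
  case (ASTAR(bs, _), Stars(Nil)) => bs ++ List(S)
  case (ASTAR(bs, r1), Stars(v1 :: vs)) =>
    bs ++ List(Z) ++ retrieve(r1, v1) ++
      retrieve(ASTAR(Nil, r1), Stars(vs))
}
\end{verbatim}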
       
   268 %HERE CONSTRUCTION SITE
       
The $\textit{vsimp}$ function (whose definition is still to be filled in here)
tries to simplify the value in lockstep with the
regular expression:\\
       
   272 
       
   273 
       
The problem here is that
we used $\retrieve$ for the key induction step, namely for the equality of
\begin{center}
$\textit{decode}\, (\retrieve\, (r\backslash (s @ [c]))\; v)\; r$ \quad and \quad
$\textit{decode}\, (\retrieve\, (r\backslash s)\; (\inj\, (r\backslash s)\, c\, v))\; r$.
\end{center}
\noindent
Here $\textit{decode}$ recovers a value that corresponds to a (possibly partial) match
from bits, the bits are extracted by $\retrieve$,
and the key value $v$ that guides $\retrieve$ is
$\mkeps\,(r\backslash s)$, $\inj\, r\, c\, (\mkeps\,(r\backslash s))$, $\inj\,(\inj\,(v))$, and so on.
       
The problem is that we need $\textit{vsimp}$ to produce a value that is suitable for decoding, so that
\begin{center}
$\textit{Some}\,(\flex\; r\;\textit{id}\;(s@[c])\;v) = \textit{Some}\,(\flex\; r\;\textit{id}\;s\;(\inj\, (r\backslash s)\,c\,v))$.
\end{center}
\noindent
Another way, which Christian came up with and which might circumvent the
problem of finding a suitable value, is to not state a $\textit{vsimp}$
function at all, but instead to include all possible values in a set that a regular expression is able to produce,
and to prove that both $r$ and $\textit{simp}(r)$ are able to produce the bits that correspond to the POSIX value
       
   291 
       
produced by feeding the same initial regular expression $r$ and string $s$ to the
two functions $\textit{ders}$ and $\textit{ders\_simp}$.
The underlying idea is that if $\textit{bmkeps}(r_1) = \textit{bmkeps}(r_2)$, then the
two regular expressions are interchangeable as far as the lexing output is concerned.
       
   297 
       
Suppose we define the equivalence relation $\sim_{m\epsilon}$ between two regular expressions
$r_1$ and $r_2$ as follows:
\begin{center}
$r_1 \sim_{m\epsilon} r_2  \iff \textit{bmkeps}(r_1)= \textit{bmkeps}(r_2)$
\end{center}
\noindent
(in other words, $r_1$ and $r_2$ produce the same output under the function $\textit{bmkeps}$).
Then the first goal
might be restated as
\begin{center}
$(r^\uparrow)\backslash_{simp}\, s  \sim_{m\epsilon} (r^\uparrow)\backslash s$.
\end{center}
       
I tried to establish such equivalence relations between regular expressions
that arise from interleaving derivative and simplification steps, written in
shorthand like $dddr$ and $dddsr$ (where $d$ stands for a derivative step and
$s$ for a simplification step),
but right now I am only able to establish $dsr \sim_{m\epsilon} dr$, using structural induction on $r$.
Equivalences that involve multiple derivative operations are harder to prove.
       
Two attempts have been made so far:

(1) Induction on the number of derivative operations (in other words, on the
length of the string $s$). The inductive hypothesis was initially specified as
\begin{quote}
``For an arbitrary regular expression $r$ and for all strings $s$ in the
language of $r$ whose length does not exceed $n$,
$\textit{ders}\; s\; r \sim_{m\epsilon} \textit{ders\_simp}\; s\; r$.''
\end{quote}
\noindent
and the proof goal may be stated as
\begin{quote}
``For an arbitrary regular expression $r$ and for all strings $s$ in the
language of $r$ whose length does not exceed $n+1$,
$\textit{ders}\; s\; r \sim_{m\epsilon} \textit{ders\_simp}\; s\; r$.''
\end{quote}
\noindent
The problem here is that although we can easily break a string $s$ of length
$n+1$ down into $s_1\,@\,[c]$, it is not that easy to use the inductive
hypothesis as a stepping stone, because $s_1$ may well not be in the language
$L(r)$. This inhibits us from obtaining the fact that
$\textit{ders}\; s_1\; r \sim_{m\epsilon} \textit{ders\_simp}\; s_1\; r$.
Further exploration is needed to amend this hypothesis so it also includes the
situation when $\textit{ders}\; s_1\; r$ is not nullable.
       
For example, what information (bits? values?) can be extracted
from the regular expression $\textit{ders}(s_1,r)$ so that we can compute or predict the possible
result of $\textit{bmkeps}$ after another derivative operation? What function $f$ can be used to
carry out this task? The possible way of exploration can be
perceived more directly through the scheme below:
find functions $f$ and $g$ such that
\begin{center}
\begin{tabular}{lcl}
$f(\textit{bders}\; s_1\; r)$ & $=$ & $\textit{re}_1$\\
$f(\textit{bderss}\; s_1\; r)$ & $=$ & $\textit{re}_2$\\
$\textit{bmkeps}(\textit{bders}\; s\; r)$ & $=$ & $g(\textit{re}_1,c)$\\
$\textit{bmkeps}(\textit{bderssimp}\; s\; r)$ & $=$ & $g(\textit{re}_2,c)$\\
\end{tabular}
\end{center}
\noindent
and $g(\textit{re}_1,c) = g(\textit{re}_2,c)$.
       
The inductive hypothesis would be
\begin{quote}
``For all strings $s_1$ of length $\leq n$:
$f(\textit{bders}\; s_1\; r) = \textit{re}_1$ and
$f(\textit{bderss}\; s_1\; r) = \textit{re}_2$.''
\end{quote}
\noindent
Proving this would be a lemma for the main proof, which would be
\begin{quote}
``$\textit{bmkeps}(\textit{bders}\; s\; r) = g(\textit{re}_1,c)$ and
$\textit{bmkeps}(\textit{bderssimp}\; s\; r) = g(\textit{re}_2,c)$
for $s = s_1\,@\,[c]$.''
\end{quote}
       
\noindent
For the lemma to be provable, $f$ needs to capture a recursive property:
it needs to store not only the ``nullable after one character'' information,
but also the ``nullable after two characters'' information,
and so on, so that it is able to predict what $f$ will compute after a derivative operation;
in other words, it needs to be ``infinitely recursive''.\\
       
To prove the lemma, in other words, to get
\begin{quote}
``For all strings $s_1$ of length $\leq n+1$:
$f(\textit{bders}\; s_1\; r) = \textit{re}_3$ and
$f(\textit{bderss}\; s_1\; r) = \textit{re}_4$''
\end{quote}
\noindent
from
\begin{quote}
``For all strings $s_1$ of length $\leq n$:
$f(\textit{bders}\; s_1\; r) = \textit{re}_1$ and
$f(\textit{bderss}\; s_1\; r) = \textit{re}_2$'',
\end{quote}
\noindent
it might be best to construct an auxiliary function $h$ such that
\begin{center}
$h(\textit{re}_1, c) = \textit{re}_3$ \quad and \quad $h(\textit{re}_2, c) = \textit{re}_4$,
\end{center}
\noindent
where $\textit{re}_3 = f(\textit{bder}\; c\; (\textit{bders}\; s_1\; r))$ and
$\textit{re}_4 = f(\textit{simp}(\textit{bder}\; c\; (\textit{bderss}\; s_1\; r)))$.
       
The key point here is that we are interested not only in what $\textit{bders}\; s\; r$ will produce under
$\textit{bmkeps}$, but also in how it will behave after one derivative operation followed by $\textit{bmkeps}$, after two
derivative operations, and so on. In essence, we are preserving the regular expression
itself under the function $f$, in a less compact way than the regular expression: we are
not just recording but also interpreting what the regular expression matches.
In other words, we need to prove properties of $\textit{bderss}\; s\; r$ beyond the $\textit{bmkeps}$ result,
i.e., not just for the nullable derivatives, but also for those with characters remaining to be matched.\\
       
(2) We observed the fact that
\begin{center}
$\textit{erase}(sdddddr) = \textit{erase}(sdsdsdsr)$,
\end{center}
\noindent
that is to say, despite the bits being moved around on the regular expression
(a difference in bits), the structure of the (unannotated) regular expression
after one simplification is exactly the same after the
same sequence of derivative operations,
regardless of whether we did simplification
along the way.
       
However, without $\textit{erase}$ the above equality does not hold:
for the regular expression
$(a+b)(a+a^*)$,
if we take the derivative with respect to the string $aa$,
we get
%TODO
annotated regular expressions where $sdddddr$ does not equal $sdsdsdsr$.\\
This equivalence-class method might still have the potential of proving the
desired result, but it is not finished yet.
In parallel, I tried another method using $\retrieve$.\\
       
   404 
       
   405 
       
   406 
       
   407 \noindent\rule[0.5ex]{\linewidth}{1pt}
       
   408 
       
   409 This PhD-project is about regular expression matching and
       
   410 lexing. Given the maturity of this topic, the reader might wonder:
       
   411 Surely, regular expressions must have already been studied to death?
       
   412 What could possibly be \emph{not} known in this area? And surely all
       
   413 implemented algorithms for regular expression matching are blindingly
       
   414 fast?
       
   415 
       
   416 Unfortunately these preconceptions are not supported by evidence: Take
       
   417 for example the regular expression $(a^*)^*\,b$ and ask whether
       
   418 strings of the form $aa..a$ match this regular
       
   419 expression. Obviously this is not the case---the expected $b$ in the last
       
   420 position is missing. One would expect that modern regular expression
       
   421 matching engines can find this out very quickly. Alas, if one tries
       
   422 this example in JavaScript, Python or Java 8 with strings like 28
       
$a$'s, one discovers that this decision takes around 30 seconds, and
considerably longer when adding a few more $a$'s, as the graphs
       
   425 below show:
       
   426 
       
   427 \begin{center}
       
   428 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   429 \begin{tikzpicture}
       
   430 \begin{axis}[
       
   431     xlabel={$n$},
       
   432     x label style={at={(1.05,-0.05)}},
       
   433     ylabel={time in secs},
       
   434     enlargelimits=false,
       
   435     xtick={0,5,...,30},
       
   436     xmax=33,
       
   437     ymax=35,
       
   438     ytick={0,5,...,30},
       
   439     scaled ticks=false,
       
   440     axis lines=left,
       
   441     width=5cm,
       
   442     height=4cm, 
       
   443     legend entries={JavaScript},  
       
   444     legend pos=north west,
       
   445     legend cell align=left]
       
   446 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
   447 \end{axis}
       
   448 \end{tikzpicture}
       
   449   &
       
   450 \begin{tikzpicture}
       
   451 \begin{axis}[
       
   452     xlabel={$n$},
       
   453     x label style={at={(1.05,-0.05)}},
       
   454     %ylabel={time in secs},
       
   455     enlargelimits=false,
       
   456     xtick={0,5,...,30},
       
   457     xmax=33,
       
   458     ymax=35,
       
   459     ytick={0,5,...,30},
       
   460     scaled ticks=false,
       
   461     axis lines=left,
       
   462     width=5cm,
       
   463     height=4cm, 
       
   464     legend entries={Python},  
       
   465     legend pos=north west,
       
   466     legend cell align=left]
       
   467 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   468 \end{axis}
       
   469 \end{tikzpicture}
       
   470   &
       
   471 \begin{tikzpicture}
       
   472 \begin{axis}[
       
   473     xlabel={$n$},
       
   474     x label style={at={(1.05,-0.05)}},
       
   475     %ylabel={time in secs},
       
   476     enlargelimits=false,
       
   477     xtick={0,5,...,30},
       
   478     xmax=33,
       
   479     ymax=35,
       
   480     ytick={0,5,...,30},
       
   481     scaled ticks=false,
       
   482     axis lines=left,
       
   483     width=5cm,
       
   484     height=4cm, 
       
   485     legend entries={Java 8},  
       
   486     legend pos=north west,
       
   487     legend cell align=left]
       
   488 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   489 \end{axis}
       
   490 \end{tikzpicture}\\
       
   491 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   492            of the form $\underbrace{aa..a}_{n}$.}
       
   493 \end{tabular}    
       
   494 \end{center}  
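\noindent
The measurements above can be reproduced with a small driver like the
following Scala sketch, which times the JVM's built-in backtracking engine
(\texttt{java.util.regex}) on the evil expression; the absolute numbers will
of course vary with hardware:

\begin{verbatim}
import java.util.regex.Pattern

object EvilRegexTest {
  // wall-clock time of a computation in seconds
  def time[A](block: => A): Double = {
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1e9
  }

  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a*)*b")
    for (n <- 20 to 30) {
      val input = "a" * n            // n a's, no final b
      val secs = time(evil.matcher(input).matches())
      println(f"n = $n%2d: $secs%6.2f secs")
    }
  }
}
\end{verbatim}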
       
   495 
       
   496 \noindent These are clearly abysmal and possibly surprising results. One
       
   497 would expect these systems to do  much better than that---after all,
       
given a DFA and a string, deciding whether the string is matched by this
DFA should be linear in the size of the regular expression and
the string.
       
   501 
       
   502 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
       
   503 exhibit this super-linear behaviour.  But unfortunately, such regular
       
   504 expressions are not just a few outliers. They are actually 
       
   505 frequent enough to have a separate name created for
       
them---\emph{evil regular expressions}. In empirical work, Davis et al
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}. A static-analysis
approach that is both sound and complete exists~\cite{17Bir}, but its running
time on certain examples in the RegExLib and Snort regular expression
libraries is unacceptable. Therefore the problem of efficiency still remains.
       
   512 
       
   513 This superlinear blowup in matching algorithms sometimes causes
       
   514 considerable grief in real life: for example on 20 July 2016 one evil
       
   515 regular expression brought the webpage
       
   516 \href{http://stackexchange.com}{Stack Exchange} to its
       
   517 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
       
   518 In this instance, a regular expression intended to just trim white
       
   519 spaces from the beginning and the end of a line actually consumed
       
   520 massive amounts of CPU-resources---causing web servers to grind to a
       
   521 halt. This happened when a post with 20,000 white spaces was submitted,
       
   522 but importantly the white spaces were neither at the beginning nor at
       
   523 the end. As a result, the regular expression matching engine needed to
       
   524 backtrack over many choices. In this example, the time needed to process
       
   525 the string was $O(n^2)$ with respect to the string length. This
       
   526 quadratic overhead was enough for the homepage of Stack Exchange to
       
   527 respond so slowly that the load balancer assumed there must be some
       
   528 attack and therefore stopped the servers from responding to any
       
   529 requests. This made the whole site become unavailable. Another very
       
   530 recent example is a global outage of all Cloudflare servers on 2 July
       
   531 2019. A poorly written regular expression exhibited exponential
       
   532 behaviour and exhausted CPUs that serve HTTP traffic. Although the
       
   533 outage had several causes, at the heart was a regular expression that
       
   534 was used to monitor network
       
   535 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
       
   536 
       
   537 The underlying problem is that many ``real life'' regular expression
       
   538 matching engines do not use DFAs for matching. This is because they
       
   539 support regular expressions that are not covered by the classical
       
   540 automata theory, and in this more general setting there are quite a few
       
   541 research questions still unanswered and fast algorithms still need to be
       
   542 developed (for example how to treat efficiently bounded repetitions, negation and
       
   543 back-references).
       
   544 %question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
       
   545 %how do they avoid dfas exponential states if they use them for fast matching?
       
   546 
       
   547 There is also another under-researched problem to do with regular
       
   548 expressions and lexing, i.e.~the process of breaking up strings into
       
   549 sequences of tokens according to some regular expressions. In this
       
   550 setting one is not just interested in whether or not a regular
       
   551 expression matches a string, but also in \emph{how}.  Consider for
       
   552 example a regular expression $r_{key}$ for recognising keywords such as
       
   553 \textit{if}, \textit{then} and so on; and a regular expression $r_{id}$
       
   554 for recognising identifiers (say, a single character followed by
       
   555 characters or numbers). One can then form the compound regular
       
   556 expression $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But
       
   557 then how should the string \textit{iffoo} be tokenised?  It could be
       
   558 tokenised as a keyword followed by an identifier, or the entire string
       
   559 as a single identifier.  Similarly, how should the string \textit{if} be
       
   560 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
       
   561 ``fire''---so is it an identifier or a keyword?  While in applications
       
   562 there is a well-known strategy to decide these questions, called POSIX
       
   563 matching, only relatively recently precise definitions of what POSIX
       
   564 matching actually means have been formalised
       
   565 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a
       
   566 definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014},
       
   567 but the corresponding correctness proof turned out to be  faulty
       
   568 \cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching
       
   569 the longest initial substring. In the case of a tie, the initial
       
   570 sub-match is chosen according to some priorities attached to the regular
       
   571 expressions (e.g.~keywords have a higher priority than identifiers).
       
   572 This sounds rather simple, but according to Grathwohl et al \cite[Page
       
   573 36]{CrashCourse2014} this is not the case. They wrote:
       
   574 
       
   575 \begin{quote}
       
   576 \it{}``The POSIX strategy is more complicated than the greedy because of 
       
   577 the dependence on information about the length of matched strings in the 
       
   578 various subexpressions.''
       
   579 \end{quote}
       
   580 
       
   581 \noindent
       
   582 This is also supported by evidence collected by Kuklewicz
       
   583 \cite{Kuklewicz} who noticed that a number of POSIX regular expression
       
   584 matchers calculate incorrect results.
       
   585 
       
   586 Our focus in this project is on an algorithm introduced by Sulzmann and
       
   587 Lu in 2014 for regular expression matching according to the POSIX
       
   588 strategy \cite{Sulzmann2014}. Their algorithm is based on an older
       
   589 algorithm by Brzozowski from 1964 where he introduced the notion of
       
   590 derivatives of regular expressions~\cite{Brzozowski1964}. We shall
       
   591 briefly explain this algorithm next.
       
   592 
       
   593 \section{The Algorithm by Brzozowski based on Derivatives of Regular
       
   594 Expressions}
       
   595 
       
   596 Suppose (basic) regular expressions are given by the following grammar:
       
   597 \[			r ::=   \ZERO \mid  \ONE
       
   598 			 \mid  c  
       
   599 			 \mid  r_1 \cdot r_2
       
   600 			 \mid  r_1 + r_2   
       
   601 			 \mid r^*         
       
   602 \]
       
   603 
       
   604 \noindent
       
   605 The intended meaning of the constructors is as follows: $\ZERO$
       
   606 cannot match any string, $\ONE$ can match the empty string, the
       
   607 character regular expression $c$ can match the character $c$, and so
       
   608 on.
       
   609 
       
   610 The ingenious contribution by Brzozowski is the notion of
       
   611 \emph{derivatives} of regular expressions.  The idea behind this
       
   612 notion is as follows: suppose a regular expression $r$ can match a
       
   613 string of the form $c\!::\! s$ (that is a list of characters starting
       
   614 with $c$), what does the regular expression look like that can match
       
   615 just $s$? Brzozowski gave a neat answer to this question. He started
       
   616 with the definition of $nullable$:
       
   617 \begin{center}
       
   618 		\begin{tabular}{lcl}
       
   619 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   620 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   621 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   622 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   623 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   624 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   625 		\end{tabular}
       
   626 	\end{center}
       
   627 This function simply tests whether the empty string is in $L(r)$.
       
   628 He then defined
       
   629 the following operation on regular expressions, written
       
   630 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
       
   631 
       
   632 \begin{center}
       
   633 \begin{tabular}{lcl}
       
   634 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   635 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   636 		$d \backslash c$     & $\dn$ & 
       
   637 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   638 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   639 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   640 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   641 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   642 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   643 \end{tabular}
       
   644 \end{center}
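\noindent
These definitions translate almost verbatim into a functional language. The
following Scala sketch is our own transcription (constructor names chosen for
readability); it also serves as the basis for the later code snippets in this
document:

\begin{verbatim}
// Regular expressions as an inductive datatype.
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches []
case class CHAR(c: Char) extends Rexp            // matches c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

// does r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}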
       
   645 
       
   646 %Assuming the classic notion of a
       
   647 %\emph{language} of a regular expression, written $L(\_)$, t
       
   648 
       
   649 \noindent
       
   650 The main property of the derivative operation is that
       
   651 
       
   652 \begin{center}
       
   653 $c\!::\!s \in L(r)$ holds
       
   654 if and only if $s \in L(r\backslash c)$.
       
   655 \end{center}
       
   656 
       
   657 \noindent
       
   658 For us the main advantage is that derivatives can be
       
   659 straightforwardly implemented in any functional programming language,
       
   660 and are easily definable and reasoned about in theorem provers---the
       
   661 definitions just consist of inductive datatypes and simple recursive
       
   662 functions. Moreover, the notion of derivatives can be easily
       
   663 generalised to cover extended regular expression constructors such as
       
   664 the not-regular expression, written $\neg\,r$, or bounded repetitions
       
   665 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
       
   666 straightforwardly realised within the classic automata approach.
       
   667 For the moment however, we focus only on the usual basic regular expressions.
       
   668 
       
   669 
       
Now if we want to find out whether a string $s$ matches a regular
expression $r$, we can build the derivatives of $r$ w.r.t.\ (in succession)
all the characters of the string $s$. Finally, we test whether the
resulting regular expression can match the empty string.  If yes, then
$r$ matches $s$, otherwise it does not. To implement this idea
       
   675 we can generalise the derivative operation to strings like this:
       
   676 
       
   677 \begin{center}
       
   678 \begin{tabular}{lcl}
       
   679 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
   680 $r \backslash [\,] $ & $\dn$ & $r$
       
   681 \end{tabular}
       
   682 \end{center}
       
   683 
       
   684 \noindent
       
and then define the regular-expression matching algorithm as:
       
   686 \[
       
   687 match\;s\;r \;\dn\; nullable(r\backslash s)
       
   688 \]
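\noindent
In the Scala sketch from the previous section, the string derivative and this
matcher are two further short functions:

\begin{verbatim}
// extend the derivative from characters to strings
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(r: Rexp, s: String): Boolean =
  nullable(ders(s.toList, r))

// example: (a + aa)*
// val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
// matches(r, "aaa")   // true
// matches(r, "aab")   // false
\end{verbatim}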
       
   689 
       
   690 \noindent
       
   691 This algorithm looks graphically as follows:
       
   692 \begin{equation}\label{graph:*}
       
   693 \begin{tikzcd}
       
   694 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
       
   695 \end{tikzcd}
       
   696 \end{equation}
       
   697 
       
   698 \noindent
       
   699 where we start with  a regular expression  $r_0$, build successive
       
   700 derivatives until we exhaust the string and then use \textit{nullable}
       
   701 to test whether the result can match the empty string. It can  be
       
   702 relatively  easily shown that this matcher is correct  (that is given
       
   703 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
       
   704 
       
   705  
       
   706 \section{Values and the Algorithm by Sulzmann and Lu}
       
   707 
       
   708 One limitation of Brzozowski's algorithm is that it only produces a
       
   709 YES/NO answer for whether a string is being matched by a regular
       
   710 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
       
   711 to allow generation of an actual matching, called a \emph{value} or
       
   712 sometimes also \emph{lexical value}.  These values and regular
       
   713 expressions correspond to each other as illustrated in the following
       
   714 table:
       
   715 
       
   716 
       
   717 \begin{center}
       
   718 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   719 		\begin{tabular}{@{}rrl@{}}
       
   720 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   721 			$r$ & $::=$  & $\ZERO$\\
       
   722 			& $\mid$ & $\ONE$   \\
       
   723 			& $\mid$ & $c$          \\
       
   724 			& $\mid$ & $r_1 \cdot r_2$\\
       
   725 			& $\mid$ & $r_1 + r_2$   \\
       
   726 			\\
       
   727 			& $\mid$ & $r^*$         \\
       
   728 		\end{tabular}
       
   729 		&
       
   730 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   731 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   732 			$v$ & $::=$  & \\
       
   733 			&        & $\Empty$   \\
       
   734 			& $\mid$ & $\Char(c)$          \\
       
   735 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   736 			& $\mid$ & $\Left(v)$   \\
       
   737 			& $\mid$ & $\Right(v)$  \\
       
   738 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   739 		\end{tabular}
       
   740 	\end{tabular}
       
   741 \end{center}
       
   742 
       
   743 \noindent
       
   744 No value  corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$;
       
   745 $\Char$ to the character regular expression; $\Seq$ to the sequence
       
   746 regular expression and so on. The idea of values is to encode a kind of
       
   747 lexical value for how the sub-parts of a regular expression match the
       
sub-parts of a string. To see this, suppose there is a \emph{flatten} operation,
written $|v|$ for values. We can use this function to extract the
       
   750 underlying string of a value $v$. For example, $|\mathit{Seq} \,
       
   751 (\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
       
   752 flatten, we can describe how values encode lexical values: $\Seq\,v_1\,
       
   753 v_2$ encodes a tree with two children nodes that tells how the string
       
   754 $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches
       
   755 the substring $|v_1|$ and, respectively, $r_2$ matches the substring
       
   756 $|v_2|$. Exactly how these two are matched is contained in the children
       
   757 nodes $v_1$ and $v_2$ of parent $\textit{Seq}$. 
       
   758 
       
   759 To give a concrete example of how values work, consider the string $xy$
       
   760 and the regular expression $(x + (y + xy))^*$. We can view this regular
       
   761 expression as a tree and if the string $xy$ is matched by two Star
       
   762 ``iterations'', then the $x$ is matched by the left-most alternative in
       
this tree and the $y$ by the left alternative of the right branch. This suggests to
       
   764 record this matching as
       
   765 
       
   766 \begin{center}
       
   767 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
       
   768 \end{center}
       
   769 
       
   770 \noindent
       
   771 where $\Stars \; [\ldots]$ records all the
       
   772 iterations; and $\Left$, respectively $\Right$, which
       
   773 alternative is used. The value for
       
   774 matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
       
   775 would look as follows
       
   776 
       
   777 \begin{center}
       
   778 $\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
       
   779 \end{center}
       
   780 
       
   781 \noindent
       
   782 where $\Stars$ has only a single-element list for the single iteration
       
   783 and $\Seq$ indicates that $xy$ is matched by a sequence regular
       
   784 expression.
       
   785 
       
   786 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   787 algorithm by a second phase (the first phase being building successive
       
   788 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
       
   789 is generated in case the regular expression matches  the string. 
       
   790 Pictorially, the Sulzmann and Lu algorithm is as follows:
       
   791 
       
   792 \begin{ceqn}
       
   793 \begin{equation}\label{graph:2}
       
   794 \begin{tikzcd}
       
   795 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   796 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   797 \end{tikzcd}
       
   798 \end{equation}
       
   799 \end{ceqn}
       
   800 
       
   801 \noindent
       
   802 For convenience, we shall employ the following notations: the regular
       
   803 expression we start with is $r_0$, and the given string $s$ is composed
       
   804 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
   805 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
   806 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
   807 the derivative $r_n$. We test whether this derivative is
       
   808 $\textit{nullable}$ or not. If not, we know the string does not match
       
   809 $r$ and no value needs to be generated. If yes, we start building the
       
   810 values incrementally by \emph{injecting} back the characters into the
       
   811 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
   812 algorithm from the right to left. For the first value $v_n$, we call the
       
   813 function $\textit{mkeps}$, which builds the lexical value
       
   814 for how the empty string has been matched by the (nullable) regular
       
   815 expression $r_n$. This function is defined as
       
   816 
       
   817 	\begin{center}
       
   818 		\begin{tabular}{lcl}
       
   819 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   820 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   821 			& \textit{if} $\nullable(r_{1})$\\ 
       
   822 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   823 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   824 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   826 		\end{tabular}
       
   827 	\end{center}
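\noindent
In our Scala sketch, values and $\textit{mkeps}$ can be written as follows
(the constructor names \texttt{Chr} and \texttt{Sequ} are our own choice, to
avoid clashes with the regular-expression constructors):

\begin{verbatim}
// Values record how a regular expression matched a string.
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// how a nullable regular expression matches the empty string;
// deliberately no cases for ZERO and CHAR
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}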
       
   828 
       
   829 
       
   830 \noindent There are no cases for $\ZERO$ and $c$, since
       
these regular expressions cannot match the empty string. Note
       
   832 also that in case of alternatives we give preference to the
       
   833 regular expression on the left-hand side. This will become
       
important later on for the value that is calculated.
       
   835 
       
   836 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   837 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   838 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   839 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
   840 matches $s$. For this Sulzmann and Lu defined a function that reverses
       
   841 the ``chopping off'' of characters during the derivative phase. The
       
   842 corresponding function is called \emph{injection}, written
       
   843 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   844 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   845 is a character ${c_{i-1}}$, the character we want to inject and the
       
   846 third argument is the value ${v_i}$, into which one wants to inject the
       
   847 character (it corresponds to the regular expression after the character
       
   848 has been chopped off). The result of this function is a new value. The
       
   849 definition of $\textit{inj}$ is as follows: 
       
   850 
       
   851 \begin{center}
       
   852 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   860 \end{tabular}
       
   861 \end{center}
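\noindent
The clauses above can be transcribed into Scala almost one-to-one (again a
sketch based on the table, reusing the datatypes from the previous snippets):

\begin{verbatim}
// inj r c v: inject the character c back into the value v,
// where v describes how the derivative r\c matched a string.
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) =>
    Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) =>
    Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) =>
    Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}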
       
   862 
       
   863 \noindent This definition is by recursion on the ``shape'' of regular
       
expressions and values. To understand this definition better, consider
       
   865 the situation when we build the derivative on regular expression $r_{i-1}$.
       
   866 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
       
   867 ``hole'' in $r_i$ and its corresponding value $v_i$. 
       
   868 To calculate $v_{i-1}$, we need to
       
   869 locate where that hole is and fill it. 
       
   870 We can find this location by
       
comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
$r_a \cdot r_b$ and $v_i$ is of shape $\Left(\Seq(v_1,v_2))$, we know immediately that
%
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]

\noindent
because otherwise, if $r_a$ were not nullable, we would have
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\]

\noindent
and the value $v_i$ would be of shape $\Seq(\ldots)$, contradicting the fact that
$v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
$\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
branch of \[ (r_a \cdot r_b)\backslash c =
\underline{(r_a\backslash c) \cdot r_b} \,+\, r_b\backslash c\] (underlined)
is taken instead of the right one. This means $c$ is chopped off
from $r_a$ rather than $r_b$.
       
We have therefore found out
that the hole will be in $r_a$. So we recursively call $\inj\,
r_a\,c\,v_1$ to fill that hole in $v_1$. After injection, the value
$v_{i-1}$ for $r_{i-1} = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_1)\,v_2$.
       
   892 Other clauses can be understood in a similar way.
       
   893 
       
   894 %\comment{Other word: insight?}
       
The following example gives an insight into $\textit{inj}$'s effect and
how Sulzmann and Lu's algorithm works as a whole. Suppose we have a
regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it
against the string $abc$ (when $abc$ is written as a regular expression,
the standard way of expressing it is $a \cdot (b \cdot c)$, but we
usually omit the parentheses and dots here for better readability). This
algorithm returns a POSIX value, which means it will produce the longest
matching. Consequently, it matches the string $abc$ in one star
iteration, using the longest alternative $abc$ in the sub-expression (we shall use $r$ to denote this
sub-expression for conciseness):
       
   905 
       
   906 \[((((a+b)+ab)+c)+\underbrace{abc}_r)\] 
       
   907 
       
   908 \noindent
       
   909 Before $\textit{inj}$ is called, our lexer first builds derivative using
       
   910 string $abc$ (we simplified some regular expressions like $\ZERO \cdot
       
   911 b$ to $\ZERO$ for conciseness; we also omit parentheses if they are
       
   912 clear from the context):
       
   913 
       
   914 %Similarly, we allow
       
   915 %$\textit{ALT}$ to take a list of regular expressions as an argument
       
   916 %instead of just 2 operands to reduce the nested depth of
       
   917 %$\textit{ALT}$
       
   918 
       
   919 \begin{center}
       
   920 \begin{tabular}{lcl}
       
   921 $r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
       
   922       & $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$\\
       
   923       & $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^*) + $\\ 
       
   924       &                              & $\phantom{r_3 = (} ((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* )$
       
   925 \end{tabular}
       
   926 \end{center}
       
   927 
       
   928 \noindent
       
Since $r_3$ is nullable, we can call $\textit{mkeps}$
to construct a lexical value for how $r_3$ matched the string $abc$.
This function gives the following value $v_3$:
       
   932 
       
   933 
       
   934 \begin{center}
       
   935 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
       
   936 \end{center}
       
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
       
   938 
       
   939 \begin{center}
       
   940 	\begin{tabular}{l@{\hspace{2mm}}l}
       
   941     & $\big(\underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} 
       
   942     \;+\; (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*\big)$ \smallskip\\
       
   943     $+$ & $\big((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*
       
   944     \;+\; (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* \big)$
       
   945   	\end{tabular}
       
   946  \end{center}
       
   947 
       
   948 \noindent
       
   949  Note that the leftmost location of term $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
       
   950  \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
       
   951  $\textit{mkeps}$ to pick it up because $\textit{mkeps}$ is defined to always choose the
       
   952  left one when it is nullable. In the case of this example, $abc$ is
       
   953  preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is
       
   954  generated by two applications of the splitting clause
       
   955 
       
   956 \begin{center}
       
     $(r_1 \cdot r_2)\backslash c  \;\;(\textit{when $r_1$ is nullable}) = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
       
   958 \end{center}
       
   959        
       
   960 \noindent
       
   961 By this clause, we put $r_1 \backslash c \cdot r_2 $ at the
       
   962 $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This
       
   963 allows $\textit{mkeps}$ to always pick up among two matches the one with a longer
       
   964 initial sub-match. Removing the outside $\Left(\Left(...))$, the inside
       
   965 sub-value 
       
   966  
       
   967 \begin{center}
       
   968  $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
       
   969 \end{center}
       
   970 
       
   971 \noindent
       
   972 tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot
       
   973 \ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a sequence of two nullable regular
       
expressions. The first one is an alternative; we take its rightmost
branch, whose language contains the empty string. The second
nullable regular expression is a Kleene star. $\Stars$ tells us how it
generates the empty string: by zero iterations, forming
       
   978 $\ONE$. Now $\textit{inj}$ injects characters back and incrementally
       
   979 builds a lexical value based on $v_3$. Using the value $v_3$, the character
       
$c$, and the regular expression $r_2$, we can recover how $r_2$ matched
the string $[c]$: $\textit{inj} \; r_2 \; c \; v_3$ gives us
       
   982  \begin{center}
       
   983  $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
       
   984  \end{center}
       
   985 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
       
   986 \begin{center}
       
   987 $v_1 = \Seq(\Right(\Seq(\Empty, \Seq(b, c))), \Stars [])$
       
   988 \end{center}
       
   989  for how 
       
   990  \begin{center}
       
 $r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$
       
   992  \end{center}
       
   993   matched  the string $bc$ before it split into two substrings. 
       
   994   Finally, after injecting character $a$ back to $v_1$, 
       
   995   we get  the lexical value tree 
       
   996   \begin{center}
       
   997   $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
       
   998   \end{center}
       
   999    for how $r$ matched $abc$. This completes the algorithm.
       
  1000    
       
  1001 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
       
  1002 Readers might have noticed that the lexical value information is actually
       
  1003 already available when doing derivatives. For example, immediately after
       
  1004 the operation $\backslash a$ we know that if we want to match a string
       
  1005 that starts with $a$, we can either take the initial match to be 
       
  1006 
       
  1007  \begin{center}
       
  1008 \begin{enumerate}
       
  1009     \item[1)] just $a$ or
       
  1010     \item[2)] string $ab$ or 
       
  1011     \item[3)] string $abc$.
       
  1012 \end{enumerate}
       
  1013 \end{center}
       
  1014 
       
  1015 \noindent
       
In order to differentiate between these choices, we just need to
remember their positions---$a$ is on the left, $ab$ is in the middle,
and $abc$ is on the right. Which of these alternatives is chosen
later does not affect their relative position, because the algorithm does
not change this order. If this parsing information can be determined
statically and is not changed by later derivatives, there is no point in
traversing it twice. This leads to an optimisation: if we
store the information for lexical values inside the regular expression,
update it with each derivative step, and collect it once the
derivatives are finished and $\textit{mkeps}$ has decided which
branch is POSIX, then we can generate the lexical value in one pass, instead of
performing the remaining $n$ injections. This leads to Sulzmann and Lu's novel
idea of using bitcodes in derivatives.
       
  1029 
       
  1030 In the next section, we shall focus on the bitcoded algorithm and the
       
  1031 process of simplification of regular expressions. This is needed in
       
order to obtain \emph{fast} versions of Brzozowski's and Sulzmann
and Lu's algorithms.  This is where the PhD-project aims to advance the
       
  1034 state-of-the-art.
       
  1035 
       
  1036 
       
  1037 \section{Simplification of Regular Expressions}
       
  1038 
       
Using bitcodes to guide  parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielson for
efficient regular expression  lexing using DFAs~\cite{nielson11bcre}.
       
  1042 Sulzmann and Lu took this idea of bitcodes a step further by integrating
       
  1043 bitcodes into derivatives. The reason why we want to use bitcodes in
       
  1044 this project is that we want to introduce more aggressive simplification
       
  1045 rules in order to keep the size of derivatives small throughout. This is
       
  1046 because the main drawback of building successive derivatives according
       
  1047 to Brzozowski's definition is that they can grow very quickly in size.
       
This is mainly due to the fact that the derivative operation often
generates ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
       
  1050 implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
       
  1051 are excruciatingly slow. For example when starting with the regular
       
  1052 expression $(a + aa)^*$ and building 12 successive derivatives
       
  1053 w.r.t.~the character $a$, one obtains a derivative regular expression
       
  1054 with more than 8000 nodes (when viewed as a tree). Operations like
       
  1055 $\textit{der}$ and $\nullable$ need to traverse such trees and
       
  1056 consequently the bigger the size of the derivative the slower the
       
  1057 algorithm. 
       
  1058 
       
  1059 Fortunately, one can simplify regular expressions after each derivative
       
  1060 step. Various simplifications of regular expressions are possible, such
       
  1061 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
       
  1062 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
       
  1063 affect the answer for whether a regular expression matches a string or
       
  1064 not, but fortunately also do not affect the POSIX strategy of how
       
  1065 regular expressions match strings---although the latter is much harder
       
  1066 to establish. Some initial results in this regard have been
       
  1067 obtained in \cite{AusafDyckhoffUrban2016}. 
       
  1068 
       
  1069 Unfortunately, the simplification rules outlined above  are not
       
  1070 sufficient to prevent a size explosion in all cases. We
       
  1071 believe a tighter bound can be achieved that prevents an explosion in
       
  1072 \emph{all} cases. Such a tighter bound is suggested by work of Antimirov who
       
proved that (partial) derivatives can be bounded by the number of
       
  1074 characters contained in the initial regular expression
       
  1075 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
       
  1076 expressions as follows:
       
  1077 
       
  1078 \begin{center}
       
  1079 \begin{tabular}{lcl}
       
 $\textit{pder} \; c \; \ZERO$ & $\dn$ & $\emptyset$\\
 $\textit{pder} \; c \; \ONE$ & $\dn$ & $\emptyset$ \\
 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \textit{then} \; \{  \ONE   \}  \; \textit{else} \; \emptyset$ \\ 
  $\textit{pder} \; c \; (r_1+r_2)$ & $\dn$ & $\textit{pder} \; c \; r_1 \cup \textit{pder} \; c \;  r_2$ \\
   $\textit{pder} \; c \; (r_1 \cdot r_2)$ & $\dn$ & $\textit{if} \; \textit{nullable} \; r_1 $\\
     & & $\textit{then} \; \{  r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1   \}  \cup \textit{pder} \; c \; r_2 \;$\\
     & & $\textit{else} \; \{  r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1   \} $ \\ 
     $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in \textit{pder} \; c \; r   \}  $ \\  
       
  1088  \end{tabular}
       
  1089  \end{center}
       
  1090 
       
  1091 \noindent
       
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s children expressions or
concatenations of them. Antimirov proved a tight bound on the sum of
the sizes of \emph{all} partial derivatives, no matter what the string
looks like. Roughly speaking, this size sum is at most cubic in the
size of the regular expression.
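
For comparison, the partial-derivative operation transcribes almost
line-by-line into Scala. The sketch below reuses the \texttt{Rexp}
datatype and the \texttt{nullable} function from the earlier sketch:

\begin{verbatim}
// Antimirov's partial derivatives: a set of regular expressions
// instead of a single one.
def pder(c: Char, r: Rexp): Set[Rexp] = r match {
  case ZERO => Set()
  case ONE => Set()
  case CHAR(d) => if (c == d) Set(ONE) else Set()
  case ALT(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case SEQ(r1, r2) =>
    val s: Set[Rexp] = pder(c, r1).map(r => SEQ(r, r2))
    if (nullable(r1)) s ++ pder(c, r2) else s
  case STAR(r1) => pder(c, r1).map(r => SEQ(r, STAR(r1)))
}
\end{verbatim}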
       
  1098 
       
  1099 If we want the size of derivatives in Sulzmann and Lu's algorithm to
       
  1100 stay below this bound, we would need more aggressive simplifications.
       
  1101 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
       
  1102 deleting duplicates whenever possible. For example, the parentheses in
       
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c +  b \cdot c + b
       
  1104 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
       
  1105 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
       
  1106 $a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
       
  1107 to achieve the same size bound as that of the partial derivatives. 
       
  1108 
       
  1109 In order to implement the idea of ``spilling out alternatives'' and to
       
make this compatible with the $\textit{inj}$-mechanism, we use
       
  1111 \emph{bitcodes}. Bits and bitcodes (lists of bits) are just:
       
  1112 
       
  1113 %This allows us to prove a tight
       
  1114 %bound on the size of regular expression during the running time of the
       
  1115 %algorithm if we can establish the connection between our simplification
       
  1116 %rules and partial derivatives.
       
  1117 
       
  1118  %We believe, and have generated test
       
  1119 %data, that a similar bound can be obtained for the derivatives in
       
  1120 %Sulzmann and Lu's algorithm. Let us give some details about this next.
       
  1121 
       
  1122 
       
  1123 \begin{center}
       
  1124 		$b ::=   S \mid  Z \qquad
       
  1125 bs ::= [] \mid b:bs    
       
  1126 $
       
  1127 \end{center}
       
  1128 
       
  1129 \noindent
       
  1130 The $S$ and $Z$ are arbitrary names for the bits in order to avoid 
       
  1131 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
       
  1132 bit-lists) can be used to encode values (or incomplete values) in a
       
  1133 compact form. This can be straightforwardly seen in the following
       
  1134 coding function from values to bitcodes: 
       
  1135 
       
  1136 \begin{center}
       
  1137 \begin{tabular}{lcl}
       
  1138   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       
  1139   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
       
  1140   $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
       
  1141   $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
       
  1142   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
       
  1143   $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\
       
  1144   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
       
  1145                                                  code(\Stars\,vs)$
       
  1146 \end{tabular}    
       
  1147 \end{center} 
       
  1148 
       
  1149 \noindent
       
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
star iteration into $\S$, and the border where a local star terminates
into $\Z$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, from the bitcode alone we cannot even tell
whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that
bits are small and can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will 
need the corresponding regular expression as extra information. This
means the decoding function is defined as follows:
       
  1162 
       
  1163 
       
  1164 %\begin{definition}[Bitdecoding of Values]\mbox{}
       
  1165 \begin{center}
       
  1166 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
  1167   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
  1168   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
  1169   $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
  1170      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
  1171        (\Left\,v, bs_1)$\\
       
  1172   $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
  1173      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
  1174        (\Right\,v, bs_1)$\\                           
       
  1175   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
  1176         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
  1177   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
       
  1178   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
  1179   $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
       
  1180   $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & 
       
  1181          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
  1182   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
       
  1183   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
  1184   
       
  1185   $\textit{decode}\,bs\,r$ & $\dn$ &
       
  1186      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
  1187   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
  1188        \textit{else}\;\textit{None}$                       
       
  1189 \end{tabular}    
       
  1190 \end{center}    
       
  1191 %\end{definition}
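
Both functions translate directly into Scala. The following sketch
reuses the \texttt{Rexp} and \texttt{Val} datatypes from before; the
match in \texttt{decodeAux} is partial, mirroring the partiality of
$\textit{decode}'$:

\begin{verbatim}
abstract class Bit
case object Z extends Bit
case object S extends Bit

def code(v: Val): List[Bit] = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Left(v1) => Z :: code(v1)
  case Right(v1) => S :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(Z)
  case Stars(v1 :: vs) => S :: (code(v1) ::: code(Stars(vs)))
}

// Decoding needs the regular expression as extra information; it
// returns the decoded value plus the not-yet-consumed bits.
def decodeAux(r: Rexp, bs: List[Bit]): (Val, List[Bit]) = (r, bs) match {
  case (ONE, bs1) => (Empty, bs1)
  case (CHAR(c), bs1) => (Chr(c), bs1)
  case (ALT(r1, r2), Z :: bs1) =>
    val (v, bs2) = decodeAux(r1, bs1); (Left(v), bs2)
  case (ALT(r1, r2), S :: bs1) =>
    val (v, bs2) = decodeAux(r2, bs1); (Right(v), bs2)
  case (SEQ(r1, r2), bs1) =>
    val (v1, bs2) = decodeAux(r1, bs1)
    val (v2, bs3) = decodeAux(r2, bs2)
    (Sequ(v1, v2), bs3)
  case (STAR(_), Z :: bs1) => (Stars(Nil), bs1)
  case (STAR(r1), S :: bs1) =>
    val (v, bs2) = decodeAux(r1, bs1)
    val (Stars(vs), bs3) = decodeAux(STAR(r1), bs2)
    (Stars(v :: vs), bs3)
}

def decode(bs: List[Bit], r: Rexp): Option[Val] = decodeAux(r, bs) match {
  case (v, Nil) => Some(v)
  case _ => None
}
\end{verbatim}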
       
  1192 
       
Sulzmann and Lu integrated the bitcodes into regular expressions to
       
  1194 create annotated regular expressions \cite{Sulzmann2014}.
       
  1195 \emph{Annotated regular expressions} are defined by the following
       
  1196 grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
  1197 
       
  1198 \begin{center}
       
  1199 \begin{tabular}{lcl}
       
  1200   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
       
  1201                   & $\mid$ & $\textit{ONE}\;\;bs$\\
       
  1202                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
       
  1203                   & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
       
  1204                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
       
  1205                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
       
  1206 \end{tabular}    
       
  1207 \end{center}  
       
  1208 %(in \textit{ALTS})
       
  1209 
       
  1210 \noindent
       
where $bs$ stands for bitcodes, $a$ for \textbf{a}nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\textit{ALTS}$) has been generalised to 
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.
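
In Scala, this grammar corresponds to a datatype along the following
lines (a sketch; the \texttt{A}-prefixed constructor names are ours):

\begin{verbatim}
// Annotated regular expressions: each constructor (except AZERO)
// carries a bitcode bs.
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp
\end{verbatim}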
       
  1218 
       
  1219 
       
  1220 To do lexing using annotated regular expressions, we shall first
       
  1221 transform the usual (un-annotated) regular expressions into annotated
       
  1222 regular expressions. This operation is called \emph{internalisation} and
       
  1223 defined as follows:
       
  1224 
       
  1225 %\begin{definition}
       
  1226 \begin{center}
       
  1227 \begin{tabular}{lcl}
       
  1228   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
       
  1229   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
       
  1230   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
       
  1231   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
  1232   $\textit{ALTS}\;[]\,List((\textit{fuse}\,[\Z]\,r_1^\uparrow),\,
       
  1233   (\textit{fuse}\,[\S]\,r_2^\uparrow))$\\
       
  1234   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
  1235          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
       
  1236   $(r^*)^\uparrow$ & $\dn$ &
       
  1237          $\textit{STAR}\;[]\,r^\uparrow$\\
       
  1238 \end{tabular}    
       
  1239 \end{center}    
       
  1240 %\end{definition}
       
  1241 
       
  1242 \noindent
       
  1243 We use up arrows here to indicate that the basic un-annotated regular
       
  1244 expressions are ``lifted up'' into something slightly more complex. In the
       
  1245 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
  1246 attach bits to the front of an annotated regular expression. Its
       
  1247 definition is as follows:
       
  1248 
       
  1249 \begin{center}
       
  1250 \begin{tabular}{lcl}
       
  1251   $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
       
  1252   $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
       
  1253      $\textit{ONE}\,(bs\,@\,bs')$\\
       
  1254   $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
       
  1255      $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
       
  1256   $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,as)$ & $\dn$ &
       
  1257      $\textit{ALTS}\,(bs\,@\,bs')\,as$\\
       
  1258   $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
       
  1259      $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
       
  1260   $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
       
  1261      $\textit{STAR}\,(bs\,@\,bs')\,a$
       
  1262 \end{tabular}    
       
  1263 \end{center}  
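
In terms of the Scala datatype above, $\textit{fuse}$ and
internalisation can be sketched as follows:

\begin{verbatim}
// fuse attaches bits to the front of an annotated regular expression.
def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ::: bs1)
  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
}

// internalise lifts an ordinary regular expression into an annotated
// one (the up-arrow operation in the text).
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(Z), internalise(r1)),
                    fuse(List(S), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}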
       
  1264 
       
  1265 \noindent
       
  1266 After internalising the regular expression, we perform successive
       
  1267 derivative operations on the annotated regular expressions. This
       
  1268 derivative operation is the same as what we had previously for the
       
basic regular expressions, except that we need to take care of
       
  1270 the bitcodes:
       
  1271 
       
  1272  %\begin{definition}{bder}
       
  1273 \begin{center}
       
  1274   \begin{tabular}{@{}lcl@{}}
       
  1275   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1276   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1277   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
  1278         $\textit{if}\;c=d\; \;\textit{then}\;
       
  1279          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
  1280   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
  1281   $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
       
  1282   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
  1283      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
  1284 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
  1285 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
  1286   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
       
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z]\, (a\,\backslash c))\,
       (\textit{STAR}\,[]\,a)$
       
  1290 \end{tabular}    
       
  1291 \end{center}    
       
  1292 %\end{definition}
       
  1293 
       
  1294 \noindent
       
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
we need to attach an additional bit $\Z$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The $\textit{SEQ}$ clause
is more subtle: the interesting case is when $a_1$ is $\textit{bnullable}$
(here \textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it works on annotated regular expressions; we therefore omit the
definition). Assuming that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
the right branch of $\textit{ALTS}$, which is $\textit{fuse} \; (\textit{bmkeps}\; a_1)\, (a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it on. The
bitsequence $bs$, which was initially attached to the head of $\textit{SEQ}$, has
now been elevated to the top-level of $\textit{ALTS}$, as this information will be
needed whichever way the $\textit{SEQ}$ is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
       
  1314 
       
  1315 
       
  1316 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
  1317 \begin{center}
       
  1318 \begin{tabular}{lcl}
       
  1319   $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
       
  1320   $\textit{bmkeps}\,(\textit{ALTS}\;bs\,a::as)$ & $\dn$ &
       
  1321      $\textit{if}\;\textit{bnullable}\,a$\\
       
  1322   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
  & &$\textit{else}\;\textit{bmkeps}\,(\textit{ALTS}\;bs\,as)$\\
       
  1324   $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
       
  1325      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
  1326   $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
       
  1327      $bs \,@\, [\S]$
       
  1328 \end{tabular}    
       
  1329 \end{center}    
       
  1330 %\end{definition}
       
  1331 
       
  1332 \noindent
       
  1333 This function completes the value information by travelling along the
       
  1334 path of the regular expression that corresponds to a POSIX value and
       
  1335 collecting all the bitcodes, and using $S$ to indicate the end of star
       
  1336 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
       
  1337 decode them, we get the value we expect. The corresponding lexing
       
  1338 algorithm looks as follows:
       
  1339 
       
  1340 \begin{center}
       
  1341 \begin{tabular}{lcl}
       
  1342   $\textit{blexer}\;r\,s$ & $\dn$ &
       
  1343       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
  1344   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1345   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1346   & & $\;\;\textit{else}\;\textit{None}$
       
  1347 \end{tabular}
       
  1348 \end{center}
       
  1349 
       
  1350 \noindent
       
  1351 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
  1352 operation from characters to strings (just like the derivatives for un-annotated
       
  1353 regular expressions).
       
  1354 
       
  1355 The main point of the bitcodes and annotated regular expressions is that
       
  1356 we can apply rather aggressive (in terms of size) simplification rules
       
  1357 in order to keep derivatives small. We have developed such
       
  1358 ``aggressive'' simplification rules and generated test data that show
       
  1359 that the expected bound can be achieved. Obviously we could only
       
  1360 partially cover  the search space as there are infinitely many regular
       
  1361 expressions and strings. 
       
  1362 
       
  1363 One modification we introduced is to allow a list of annotated regular
       
  1364 expressions in the \textit{ALTS} constructor. This allows us to not just
       
  1365 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
  1366 also unnecessary ``copies'' of regular expressions (very similar to
       
  1367 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
  1368 modification is that we use simplification rules inspired by Antimirov's
       
  1369 work on partial derivatives. They maintain the idea that only the first
       
  1370 ``copy'' of a regular expression in an alternative contributes to the
       
  1371 calculation of a POSIX value. All subsequent copies can be pruned away from
       
  1372 the regular expression. A recursive definition of our  simplification function 
       
  1373 that looks somewhat similar to our Scala code is given below:
       
  1374 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
  1375 %Is it $ALTS$ or $ALTS$?}\\
       
  1376 
       
  1377 \begin{center}
       
  1378   \begin{tabular}{@{}lcl@{}}
       
  1379    
       
  $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\textit{ZERO}, \_) \Rightarrow  \textit{ZERO}$ \\
   &&$\quad\textit{case} \; (\_, \textit{ZERO}) \Rightarrow  \textit{ZERO}$ \\
   &&$\quad\textit{case} \;  (\textit{ONE}, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \textit{ONE}) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\

  $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; \textit{simp} \; as)) \; \textit{match} $ \\
  &&$\quad\textit{case} \; [] \Rightarrow  \textit{ZERO}$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse} \; bs \; a$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\ 

   $\textit{simp} \; a$ & $\dn$ & $a \qquad \textit{otherwise}$   
       
  1393 \end{tabular}    
       
  1394 \end{center}    
       
  1395 
       
  1396 \noindent
       
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it simplifies its children regular expressions
recursively and then checks whether one of the children has turned into
$\textit{ZERO}$ or $\textit{ONE}$, which might trigger further
simplification at the current level.
The most involved part is the $\textit{ALTS}$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
$\textit{ALTS}$ and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy of a duplicate
and removes all later ones. Function $\textit{flatten}$ opens up nested
\textit{ALTS}; its recursive definition is given below:
       
  1408 
       
  1409  \begin{center}
       
  1410   \begin{tabular}{@{}lcl@{}}
       
  $\textit{flatten} \; ((\textit{ALTS}\;bs\,as) :: as')$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; (\textit{ZERO} :: as')$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
    $\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
       
  1415 \end{tabular}    
       
  1416 \end{center}  
       
  1417 
       
  1418 \noindent
       
  1419 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
  1420 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
  1421 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
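
A Scala sketch of these two functions is given below. Two caveats: when
a $\textit{SEQ}$ collapses because its first child has simplified to a
$\textit{ONE}$, the bits carried by that $\textit{ONE}$ must be
preserved (the mathematical definition leaves them implicit), and we
omit the $(a_1', \textit{ONE})$ clause, since $\textit{fuse}$ can only
attach bits at the front, not at the end. Also, our actual
implementation removes duplicates up to their bitcodes, which Scala's
built-in \texttt{distinct} only approximates here.

\begin{verbatim}
// flatten opens up nested AALTS and drops AZEROs.
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s) // keep the ONE's bits
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case a1 => a1
}
\end{verbatim}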
       
  1422 
       
  1423 Suppose we apply simplification after each derivative step, and view
       
  1424 these two operations as an atomic one: $a \backslash_{simp}\,c \dn
       
  1425 \textit{simp}(a \backslash c)$. Then we can use the previous natural
       
  1426 extension from derivative w.r.t.~character to derivative
       
  1427 w.r.t.~string:%\comment{simp in  the [] case?}
       
  1428 
       
  1429 \begin{center}
       
  1430 \begin{tabular}{lcl}
       
  1431 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
  1432 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
  1433 \end{tabular}
       
  1434 \end{center}
       
  1435 
       
  1436 \noindent
       
  1437 we obtain an optimised version of the algorithm:
       
  1438 
       
  1439  \begin{center}
       
  1440 \begin{tabular}{lcl}
       
  1441   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
  1442       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
  1443   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1444   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1445   & & $\;\;\textit{else}\;\textit{None}$
       
  1446 \end{tabular}
       
  1447 \end{center}
       
  1448 
       
  1449 \noindent
       
This algorithm keeps the regular expression size small. For example,
with this simplification the derivative of our previous $(a + aa)^*$
example is reduced from more than 8000 nodes to just 6, and its size
stays constant no matter how long the input string is.
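
In code, the simplifying lexer differs from \texttt{blexer} only in
interleaving \texttt{simp} with the derivative steps (a sketch):

\begin{verbatim}
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}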
       
  1454 
       
  1455 
       
  1456 
       
  1457 \section{Current Work}
       
  1458 
       
  1459 We are currently engaged in two tasks related to this algorithm. The
       
  1460 first task is proving that our simplification rules actually do not
       
  1461 affect the POSIX value that should be generated by the algorithm
       
  1462 according to the specification of a POSIX value and furthermore obtain a
       
  1463 much tighter bound on the sizes of derivatives. The result is that our
       
algorithm should be correct and faster on all inputs.  The
blow-up, as observed in JavaScript, Python and Java, could then no
longer happen with our algorithm. For this proof we use the theorem prover
       
  1467 Isabelle. Once completed, this result will advance the state-of-the-art:
       
  1468 Sulzmann and Lu wrote in their paper~\cite{Sulzmann2014} about the
       
  1469 bitcoded ``incremental parsing method'' (that is the lexing algorithm
       
  1470 outlined in this section):
       
  1471 
       
  1472 \begin{quote}\it
       
  1473   ``Correctness Claim: We further claim that the incremental parsing
       
  1474   method in Figure~5 in combination with the simplification steps in
       
  1475   Figure 6 yields POSIX parse tree [our lexical values]. We have tested this claim
       
  1476   extensively by using the method in Figure~3 as a reference but yet
       
  1477   have to work out all proof details.''
       
  1478 \end{quote}  
       
  1479 
       
  1480 \noindent 
       
We would like to settle this correctness claim. It is relatively
       
  1482 straightforward to establish that after one simplification step, the part of a
       
  1483 nullable derivative that corresponds to a POSIX value remains intact and can
       
  1484 still be collected, in other words, we can show that
       
  1485 %\comment{Double-check....I
       
  1486 %think this  is not the case}
       
  1487 %\comment{If i remember correctly, you have proved this lemma.
       
  1488 %I feel this is indeed not true because you might place arbitrary 
       
  1489 %bits on the regex r, however if this is the case, did i remember wrongly that
       
  1490 %you proved something like simplification does not affect $\textit{bmkeps}$ results?
       
  1491 %Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
       
  1492 %to a regex. Maybe it works now.}
       
  1493 
       
  1494 \begin{center}
       
	$\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{bsimp} \; a) \qquad (\textit{provided } a \textit{ is bnullable})$
       
  1496 \end{center} 
       
  1497 
       
  1498 \noindent
       
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information from
a regular expression. The hard part of this proof is to establish that
       
  1502 
       
  1503 \begin{center}
       
	$ \textit{blexer\_simp} \; r \; s =  \textit{blexer} \; r \; s$
       
  1505 \end{center}
       
  1506 %comment{This is not true either...look at the definion blexer/blexer-simp}
       
  1507 
       
\noindent That is, if we take the derivative of a regular expression $r$ and then
simplify it, and repeat this process until we exhaust the string, we get a
regular expression $r''$ (the $\textit{LHS}$) that provides the same POSIX matching
information as the result $r'$ (the $\textit{RHS}$) of the
normal derivative algorithm, which only takes derivatives repeatedly and does no
simplification at all.  This might seem at first glance very unintuitive, as
$r'$ could be exponentially larger than $r''$, but it can be explained in the
following way: we are pruning away the possible matches that are not POSIX.
Since there could be exponentially many 
non-POSIX matchings but only one POSIX matching, it
is understandable that $r''$ can be a lot smaller while still providing
the same POSIX value if there is one.  Proving this is not as straightforward as the
previous proposition, as the two regular expressions $r'$ and $r''$ might have
become very different.  The crucial point is to find the
$\textit{POSIX}$  information of a regular expression and how it is modified,
augmented and propagated 
during simplification, in parallel with the unsimplified regular expression
in the subsequent derivative operations.  To aid this,
we use the helper function $\textit{retrieve}$ described by Sulzmann and Lu:
       
  1527 \begin{center}
       
  1528 \begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
       
  1529   $\textit{retrieve}\,(\textit{ONE}\,bs)\,\Empty$ & $\dn$ & $bs$\\
       
  1530   $\textit{retrieve}\,(\textit{CHAR}\,bs\,c)\,(\Char\,d)$ & $\dn$ & $bs$\\
       
  1531   $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Left\,v)$ & $\dn$ &
       
  1532      $bs \,@\, \textit{retrieve}\,a\,v$\\
       
  1533   $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Right\,v)$ & $\dn$ &
       
  1534   $bs \,@\, \textit{retrieve}\,(\textit{ALTS}\,bs\,as)\,v$\\
       
  1535   $\textit{retrieve}\,(\textit{SEQ}\,bs\,a_1\,a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
       
  1536      $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
       
  1537   $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,[])$ & $\dn$ &
       
  1538      $bs \,@\, [\S]$\\
       
  1539   $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
       
  1540   \multicolumn{3}{l}{
       
  1541      \hspace{3cm}$bs \,@\, [\Z] \,@\, \textit{retrieve}\,a\,v\,@\,
       
  1542                     \textit{retrieve}\,(\textit{STAR}\,[]\,a)\,(\Stars\,vs)$}\\
       
  1543 \end{tabular}
       
  1544 \end{center}
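
In Scala, $\textit{retrieve}$ can be sketched as follows (the match is
partial: the value $v$ has to correspond to the annotated regular
expression $a$):

\begin{verbatim}
// retrieve reads off the bits of a along the path dictated by v.
def retrieve(a: ARexp, v: Val): List[Bit] = (a, v) match {
  case (AONE(bs), Empty) => bs
  case (ACHAR(bs, _), Chr(_)) => bs
  case (AALTS(bs, a1 :: as), Left(v1)) => bs ::: retrieve(a1, v1)
  case (AALTS(bs, a1 :: as), Right(v1)) =>
    bs ::: retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
    bs ::: retrieve(a1, v1) ::: retrieve(a2, v2)
  case (ASTAR(bs, a1), Stars(Nil)) => bs ::: List(S)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ::: List(Z) ::: retrieve(a1, v1) :::
      retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}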
       
  1545 %\comment{Did not read further}\\
       
The function $\textit{retrieve}$ assembles the bitcode 
%that corresponds to a lexical value for how
%the current derivative matches the suffix of the string(the characters that
%have not yet appeared, but will appear as the successive derivatives go on.
%How do we get this "future" information? By the value $v$, which is
%computed by a pass of the algorithm that uses
%$inj$ as described in the previous section).  
using information from both the derivative regular expression and the
value. Sulzmann and Lu proposed this function, but did not prove
anything about it. Ausaf and Urban used it to connect the bitcoded
algorithm to the older algorithm by the following equation:
       
  1557  
       
 \begin{center} $\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\;
	 ((r^\uparrow)\backslash c)\;v)$ 
 \end{center} 
       
  1561 
       
  1562 \noindent
       
  1563 whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf
       
and Urban also used this fact to prove the correctness of the bitcoded
       
  1565 algorithm without simplification.  Our purpose of using this, however,
       
  1566 is to establish 
       
  1567 
       
  1568 \begin{center}
       
  1569 $ \textit{retrieve} \;
       
  1570 a \; v \;=\; \textit{retrieve}  \; (\textit{simp}\,a) \; v'.$
       
  1571 \end{center}
       
The idea is that using $v'$, a simplified version of $v$ that has gone
through the same simplification steps as $\textit{simp}(a)$, we are able
to extract a bitcode that gives the same parsing information as the
unsimplified one. However, we have noticed that constructing such a $v'$
from $v$ is not straightforward. The point of this exercise is that we might
be able to finally bridge the gap by proving
       
  1578 
       
  1579 \begin{center}
       
  1580 $\textit{retrieve} \; (r^\uparrow   \backslash  s) \; v = \;\textit{retrieve} \;
       
  1581 (\textit{simp}(r^\uparrow)  \backslash  s) \; v'$
       
  1582 \end{center}
       
  1583 
       
  1584 \noindent
       
  1585 and subsequently
       
  1586 
       
  1587 \begin{center}
       
  1588 $\textit{retrieve} \; (r^\uparrow \backslash  s) \; v\; = \; \textit{retrieve} \;
       
  1589 (r^\uparrow  \backslash_{simp}  \, s) \; v'$.
       
  1590 \end{center}
       
  1591 
       
  1592 \noindent
       
  1593 The $\textit{LHS}$ of the above equation is the bitcode we want. This
       
  1594 would prove that our simplified version of regular expression still
       
  1595 contains all the bitcodes needed. The task here is to find a way to
       
  1596 compute the correct $v'$.
       
  1597 
       
The second task is to speed up the more aggressive simplification.  Currently
it is slower than the original naive simplification by Ausaf and Urban (whose
naive version of course can ``explode'' in
some cases).  It is therefore not surprising that the speed is also much slower
than that of the regular expression engines in popular programming languages, such as Java
and Python, on most inputs where matching is linear. For example, just rewriting the
example regular expression from the beginning of this report, $(a^*)^*\,b$, into
$a^*\,b$ eliminates the ambiguity in the matching and makes the matching time
linear with respect to the input string size. This allows the 
DFA approach to become blindingly fast, and to dwarf the speed of our current
implementation. For example, here is a comparison of the Java regex engine 
and our implementation on this example.
       
  1610 
       
  1611 \begin{center}
       
  1612 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
  1613 \begin{tikzpicture}
       
  1614 \begin{axis}[
       
  1615     xlabel={$n*1000$},
       
  1616     x label style={at={(1.05,-0.05)}},
       
  1617     ylabel={time in secs},
       
  1618     enlargelimits=false,
       
  1619     xtick={0,5,...,30},
       
  1620     xmax=33,
       
  1621     ymax=9,
       
  1622     scaled ticks=true,
       
  1623     axis lines=left,
       
  1624     width=5cm,
       
  1625     height=4cm, 
       
  1626     legend entries={Bitcoded Algorithm},  
       
  1627     legend pos=north west,
       
  1628     legend cell align=left]
       
  1629 \addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data};
       
  1630 \end{axis}
       
  1631 \end{tikzpicture}
       
  1632   &
       
  1633 \begin{tikzpicture}
       
  1634 \begin{axis}[
       
  1635     xlabel={$n*1000$},
       
  1636     x label style={at={(1.05,-0.05)}},
       
  1637     %ylabel={time in secs},
       
  1638     enlargelimits=false,
       
  1639     xtick={0,5,...,30},
       
  1640     xmax=33,
       
  1641     ymax=9,
       
  1642     scaled ticks=false,
       
  1643     axis lines=left,
       
  1644     width=5cm,
       
  1645     height=4cm, 
       
  1646     legend entries={Java},  
       
  1647     legend pos=north west,
       
  1648     legend cell align=left]
       
  1649 \addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data};
       
  1650 \end{axis}
       
  1651 \end{tikzpicture}\\
       
  1652 \multicolumn{3}{c}{Graphs: Runtime for matching $a^*\,b$ with strings 
       
  1653            of the form $\underbrace{aa..a}_{n}$.}
       
  1654 \end{tabular}    
       
  1655 \end{center}  
       
  1656 
       
  1657 
       
The Java regex engine can match strings of thousands of characters in a few milliseconds,
whereas our current algorithm gets excruciatingly slow on inputs of this size.
The running time is linear in theory, but this does not appear to be the 
case in the actual implementation. It therefore needs to be explored how to
make our algorithm faster on all inputs.  It could be that the recursive calls
needed to manipulate the bits are causing the slowdown. A possible solution
is to rewrite the recursive functions in tail-recursive form.
       
  1665 Another possibility would be to explore
       
  1666 again the connection to DFAs to speed up the algorithm on 
       
  1667 subcalls that are small enough. This is very much work in progress.
       
  1668 
       
  1669 \section{Conclusion}
       
  1670 
       
  1671 In this PhD-project we are interested in fast algorithms for regular
       
  1672 expression matching. While this seems to be a ``settled'' area, in
       
  1673 fact interesting research questions are popping up as soon as one steps
       
  1674 outside the classic automata theory (for example in terms of what kind
       
  1675 of regular expressions are supported). The reason why it is
       
  1676 interesting for us to look at the derivative approach introduced by
       
  1677 Brzozowski for regular expression matching, and then much further
       
  1678 developed by Sulzmann and Lu, is that derivatives can elegantly deal
       
  1679 with some of the regular expressions that are of interest in ``real
       
  1680 life''. This includes the not-regular expression, written $\neg\,r$
       
  1681 (that is all strings that are not recognised by $r$), but also bounded
       
regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is
       
  1683 also hope that the derivatives can provide another angle for how to
       
  1684 deal more efficiently with back-references, which are one of the
       
  1685 reasons why regular expression engines in JavaScript, Python and Java
       
  1686 choose to not implement the classic automata approach of transforming
       
  1687 regular expressions into NFAs and then DFAs---because we simply do not
       
  1688 know how such back-references can be represented by DFAs.
       
  1689 We also plan to implement the bitcoded algorithm
       
  1690 in some imperative language like C to see if the inefficiency of the 
       
  1691 Scala implementation
       
  1692 is language specific. To make this research more comprehensive we also plan
       
to contrast our (faster) version of the bitcoded algorithm with the
Symbolic Regex Matcher, RE2, the Rust regex engine, and the static
analysis approach, by implementing them in the same language and then comparing
their performance.
       
  1697 
       
  1698 \bibliographystyle{plain}
       
  1699 \bibliography{root}
       
  1700 
       
  1701 
       
  1702 \end{document}