       
\documentclass[a4paper,UKenglish]{lipics}
\usepackage{graphic}
\usepackage{data}
\usepackage{tikz-cd}
%\usepackage{algorithm}
\usepackage{amsmath}
\usepackage[noend]{algpseudocode}
\usepackage{enumitem}
\usepackage{nccmath}

\usetikzlibrary{positioning}

\definecolor{darkblue}{rgb}{0,0,0.6}
\hypersetup{colorlinks=true,allcolors=darkblue}
\newcommand{\comment}[1]%
{{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}

% \documentclass{article}
%\usepackage[utf8]{inputenc}
%\usepackage[english]{babel}
%\usepackage{listings}
% \usepackage{amsthm}
%\usepackage{hyperref}
% \usepackage[margin=0.5in]{geometry}
%\usepackage{pmboxdraw}

\title{POSIX Regular Expression Matching and Lexing}
\author{Chengsong Tan}
\affil{King's College London\\
London, UK\\
\texttt{chengsong.tan@kcl.ac.uk}}
\authorrunning{Chengsong Tan}
\Copyright{Chengsong Tan}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}

\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}

\def\awidth{\mathit{awidth}}
\def\pder{\mathit{pder}}
\def\maxterms{\mathit{maxterms}}
\def\bsimp{\mathit{bsimp}}

%\theoremstyle{theorem}
%\newtheorem{theorem}{Theorem}
%\theoremstyle{lemma}
%\newtheorem{lemma}{Lemma}
%\newcommand{\lemmaautorefname}{Lemma}
%\theoremstyle{definition}
%\newtheorem{definition}{Definition}
\algnewcommand\algorithmicswitch{\textbf{switch}}
\algnewcommand\algorithmiccase{\textbf{case}}
\algnewcommand\algorithmicassert{\texttt{assert}}
\algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
% New "environments"
\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
\algtext*{EndSwitch}%
\algtext*{EndCase}%
       
\begin{document}

\maketitle

\begin{abstract}
  Brzozowski introduced in 1964 a beautifully simple algorithm for
  regular expression matching based on the notion of derivatives of
  regular expressions. In 2014, Sulzmann and Lu extended this
  algorithm to not just give a YES/NO answer for whether or not a
  regular expression matches a string, but in case it does, to also
  answer \emph{how} it matches the string.  This is important for
  applications such as lexing (tokenising a string). The problem is to
  make the algorithm by Sulzmann and Lu fast on all inputs without
  breaking its correctness. We have already developed some
  simplification rules for this, but have not yet proved that they
  preserve the correctness of the algorithm. We also have not yet
  looked at extended regular expressions, such as bounded repetitions,
  negation and back-references.
\end{abstract}

       
\section{Introduction}
\subsection{Practical Example of Regex}

%TODO: read rules libraries and the explanation for some of the rules
Consider the problem of matching some string $s$ with a regex such as

\begin{verbatim}
(?:(?:\"|'|\]|\}|\\|\d|
(?:nan|infinity|true|false|null|undefined|symbol|math)
|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) 
\end{verbatim}


%Could be from a network intrusion detection algorithm.
%Checking whether there is some malicious code 
%in the network data blocks being routed.
%If so, discard the data and identify the sender for future alert.

       
\subsection{The Problem: Efficient Matching and Lexing}
A programmer writes patterns to process texts,
where a regex is a structured symbolic pattern
specifying what a string should look like.

The above regex looks complicated, but it can be 
described in terms of some basic constructs.


Suppose (basic) regular expressions are given by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*         
\]

\noindent
The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.

The underlying algorithmic problem is:
\begin{center}
\begin{tabular}{lcr}
$\textit{Match}(r, s)$ & $ =  $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\
&				& $\textit{else} \; \textit{output} \; \textit{NO}$
\end{tabular}
\end{center}

       
Deciding whether a string is in the language of a regex
can intuitively be done by constructing an NFA~\cite{Thompson_1968}
and simulating its running:
\begin{figure}
\centering
\includegraphics[scale=0.5]{pics/regex_nfa_base.png}
\end{figure}

\begin{figure}
\centering
\includegraphics[scale=0.5]{pics/regex_nfa_seq1.png}
\end{figure}

\begin{figure}
\centering
\includegraphics[scale=0.5]{pics/regex_nfa_seq2.png}
\end{figure}

\begin{figure}
\centering
\includegraphics[scale=0.5]{pics/regex_nfa_alt.png}
\end{figure}


\begin{figure}
\centering
\includegraphics[scale=0.5]{pics/regex_nfa_star.png}
\end{figure}

Should this not be simple enough that modern matching engines
have no problems with it at all?
Not really:

       
Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings like 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
below show:

       
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={JavaScript},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Python},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={Java 8},  
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}    
\end{center}  

       
Why? Because these engines use NFAs that can backtrack.
%TODO: what does it mean to do DFS BFS on NFA's


What about determinisation then?
\begin{itemize}
\item
Turning NFAs into DFAs can cause the size of the automata
to blow up exponentially.
\item
We also want to extract submatch information.
For example, if
$r_1 \cdot r_2$ matches $s$,
we want to know the split $s = s_1@s_2$ where $s_i$ 
corresponds to $r_i$ (here $s_i$ might be, say, an
attacker's IP address).
\item
Variants such as counting automata exist,
but they are usually made super fast only on a certain class
of regexes, such as bounded repetitions:
\begin{verbatim}
.*a.{100}
\end{verbatim}
On a lot of inputs this works very well, with good practical
performance on average (around 10\,MiB per second).
But such engines cannot be super fast on all inputs of regexes and strings,
and can be imprecise (incorrect) when it comes to more complex regexes.

\end{itemize}
%TODO: real world example?

       
\subsection{Derivatives}
The question is:
is there an efficient lexing algorithm with provable guarantees on 
correctness and running time?
Yes: Brzozowski derivatives~\cite{Brzozowski1964}!

       
\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
	\end{center}

This function simply tests whether the empty string is in $L(r)$.
Brzozowski then defined
the following operation on regular expressions, written
$r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):

\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
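
As an illustration (a hypothetical transcription for this text, not the formal Isabelle definitions), the two tables above can be implemented directly in a functional style, for example in Python with regular expressions encoded as nested tuples (an assumed encoding):

```python
# Regexes as nested tuples (an assumed encoding, for illustration only):
#   ("0",)          ZERO          ("1",)         ONE
#   ("c", ch)       character     ("+", r1, r2)  alternative
#   (".", r1, r2)   sequence      ("*", r)       star

def nullable(r):
    """Test whether r can match the empty string (first table above)."""
    op = r[0]
    if op in ("0", "c"):
        return False
    if op in ("1", "*"):
        return True
    if op == "+":
        return nullable(r[1]) or nullable(r[2])
    if op == ".":
        return nullable(r[1]) and nullable(r[2])

def der(c, r):
    """The derivative of r w.r.t. the character c (second table above)."""
    op = r[0]
    if op in ("0", "1"):
        return ("0",)
    if op == "c":
        return ("1",) if r[1] == c else ("0",)
    if op == "+":
        return ("+", der(c, r[1]), der(c, r[2]))
    if op == ".":
        seq = (".", der(c, r[1]), r[2])
        return ("+", seq, der(c, r[2])) if nullable(r[1]) else seq
    if op == "*":
        return (".", der(c, r[1]), r)
```

For instance, the derivative of $a \cdot b$ w.r.t.\ $a$ is $\ONE \cdot b$, since $a$ is not nullable and so the else-branch of the sequence case applies.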
       
       
\begin{ceqn}
\begin{equation}\label{graph:01}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}
This is nicely functional and its correctness is easily provable, but it suffers
from a large stack size with long strings and an
inability to perform even moderate simplification.


Sulzmann and Lu's bit-coded algorithm:
\begin{figure}
\centering
\includegraphics[scale=0.3]{bitcoded_sulzmann.png}
\end{figure}

       
This one-phase algorithm is free from the burden of large stack usage:

\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.9cm,
                    every node/.style={minimum size=7mm}]
\node (r0)  {$r_0$};
\node (r1) [right=of r0]{$r_1$};
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash\,c_0$};
\node (r2) [right=of r1]{$r_2$};
\draw[->, line width = 0.2mm](r1)--(r2) node[above,midway] {$\backslash\,c_1$};
\node (rn) [right=of r2]{$r_n$};
\draw[dashed,->,line width=0.2mm](r2)--(rn) node[above,midway] {} ;
\draw (rn) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
\node (bs) [below=of rn]{$bs$};
\draw[->,line width=0.2mm](rn) -- (bs);
\node (v0) [left=of bs] {$v_0$};
\draw[->,line width=0.2mm](bs)--(v0) node[below,midway] {$\textit{decode}$};
\draw (rn) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{collect bits}$}};
\draw[->, line width=0.2mm](v0)--(r0) node[below, midway] {};
\end{tikzpicture}
\end{center}


This is functional code and its correctness is easily provable (a proof
has been given by Ausaf and Urban~\cite{AusafDyckhoffUrban2016}).

But it suffers from exponential blowups even with the simplification steps:
\begin{figure}
\centering
\includegraphics[scale= 0.3]{pics/nub_filter_simp.png}
\end{figure}
Sulzmann and Lu claimed the algorithm to be linear w.r.t.\ the size of the input.

An example that blows it up: $(a+aa)^*$.
       
\section{Contributions}
\subsection{Our Contribution 1}
We give an improved version of the above algorithm that solves most blow-up 
cases, including the above example.

We also give a formalised closed form for string derivatives:
\[ (\sum rs) \backslash_s s = \textit{simp}(\sum_{r \in rs}(r \backslash_s s)) \]
\[ (r_1\cdot r_2) \backslash_s s = \textit{simp}((r_1 \backslash_s s)  \cdot r_2 + \sum_{s' \in \textit{Suffix}(s)} r_2 \backslash_s s' )\]
\[ r_0^* \backslash_s s = \textit{simp}(\sum_{s' \in \textit{Substr}(s)} (r_0 \backslash_s s') \cdot r_0^*) \]


This comes with a size guarantee which makes sure that the size of the
derivatives does not grow unboundedly:


\begin{theorem}
Given a regular expression $r$, we have 
\begin{center}
$\exists N_r.\;\forall s. \; |r \backslash_s s| < N_r$
\end{center}
\end{theorem}

The proof uses the terms of partial derivatives to bound the size.
For example, for a sequence:
\begin{center}
\begin{tabular}{lcl}
$|\maxterms (\bsimp  ((a\cdot b) \backslash s))|$ & $=$ & $|\maxterms(\bsimp( (a\backslash s \cdot b) + \sum_{s'\in \mathit{Suf}(s)}(b\backslash s') ))|$\\
& $\leq$ & $| (\pder_{s} \; a ) \cdot b|  + | \bigcup_{s' \in \mathit{Suf}(s)} (\pder_{s'} \; b)|$\\
& $=$ & $\awidth(a) + \awidth(b)$ \\
& $=$ & $\awidth(a\cdot b)$
\end{tabular}
\end{center} 
       
       
\subsection{Our Contribution 2}
A more aggressive simplification that prunes away sub-parts
of a regex based on which terms have appeared before.
This gives us a truly linear bound w.r.t.\ the input length.



\section{To be completed}

Benchmarking our algorithm against JFlex,
counting-set automata, Silex and other mainstream regex engines
(and incorporating their ideas, such as zippers and other data
structures that reduce memory use).

Extending the approach to back-references.


       
\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do  much better than that---after all,
given a DFA and a string, deciding whether the string is matched by the
DFA should be linear in the size of the regular expression and
the string.

Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this super-linear behaviour.  But unfortunately, such regular
expressions are not just a few outliers. They are actually 
frequent enough to have a separate name created for
them---\emph{evil regular expressions}. In empirical work, Davis et al
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}. A static-analysis
approach that is both sound and complete exists~\cite{17Bir}, but its running 
time on certain examples in the RegExLib and Snort regular expression
libraries is unacceptable. Therefore the problem of efficiency still remains.

       
This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU-resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed there must be some
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable. Another very
recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the
outage had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}

       
The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are quite a few
research questions still unanswered and fast algorithms still need to be
developed (for example how to treat efficiently bounded repetitions, negation and
back-references).
%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?

       
There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but also in \emph{how}.  Consider for
example a regular expression $r_{key}$ for recognising keywords such as
\textit{if}, \textit{then} and so on; and a regular expression $r_{id}$
for recognising identifiers (say, a single character followed by
characters or numbers). One can then form the compound regular
expression $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But
then how should the string \textit{iffoo} be tokenised?  It could be
tokenised as a keyword followed by an identifier, or the entire string
as a single identifier.  Similarly, how should the string \textit{if} be
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire''---so is it an identifier or a keyword?  While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, only relatively recently precise definitions of what POSIX
matching actually means have been formalised
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a
definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014},
but the corresponding correctness proof turned out to be  faulty
\cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching
the longest initial substring. In the case of a tie, the initial
sub-match is chosen according to some priorities attached to the regular
expressions (e.g.~keywords have a higher priority than identifiers).
This sounds rather simple, but according to Grathwohl et al \cite[Page
36]{CrashCourse2014} this is not the case. They wrote:

\begin{quote}
\it{}``The POSIX strategy is more complicated than the greedy because of 
the dependence on information about the length of matched strings in the 
various subexpressions.''
\end{quote}

\noindent
This is also supported by evidence collected by Kuklewicz
\cite{Kuklewicz} who noticed that a number of POSIX regular expression
matchers calculate incorrect results.

       
Our focus in this project is on an algorithm introduced by Sulzmann and
Lu in 2014 for regular expression matching according to the POSIX
strategy \cite{Sulzmann2014}. Their algorithm is based on an older
algorithm by Brzozowski from 1964 where he introduced the notion of
derivatives of regular expressions~\cite{Brzozowski1964}. We shall
briefly explain this algorithm next.

\section{The Algorithm by Brzozowski based on Derivatives of Regular
Expressions}

       
Suppose (basic) regular expressions are given by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*         
\]

\noindent
The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.

The ingenious contribution by Brzozowski is the notion of
\emph{derivatives} of regular expressions.  The idea behind this
notion is as follows: suppose a regular expression $r$ can match a
string of the form $c\!::\! s$ (that is a list of characters starting
with $c$), what does the regular expression look like that can match
just $s$? Brzozowski gave a neat answer to this question. He started
with the definition of $\nullable$:
\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
	\end{center}
       
This function simply tests whether the empty string is in $L(r)$.
He then defined
the following operation on regular expressions, written
$r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):

\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

       
%Assuming the classic notion of a
%\emph{language} of a regular expression, written $L(\_)$, t

\noindent
The main property of the derivative operation is that

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
For us the main advantage is that derivatives can be
straightforwardly implemented in any functional programming language,
and are easily definable and reasoned about in theorem provers---the
definitions just consist of inductive datatypes and simple recursive
functions. Moreover, the notion of derivatives can be easily
generalised to cover extended regular expression constructors such as
the not-regular expression, written $\neg\,r$, or bounded repetitions
(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
straightforwardly realised within the classic automata approach.
For the moment however, we focus only on the usual basic regular expressions.

       
   617 Now if we want to find out whether a string $s$ matches with a regular
       
   618 expression $r$, we can build the derivatives of $r$ w.r.t.\ (in succession)
       
   619 all the characters of the string $s$. Finally, test whether the
       
   620 resulting regular expression can match the empty string.  If yes, then
       
   621 $r$ matches $s$, and no in the negative case. To implement this idea
       
   622 we can generalise the derivative operation to strings like this:
       

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
       

\noindent
and then define the regular-expression matching algorithm as:
\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
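To illustrate, this matcher can be implemented directly in a functional style. The following is a minimal sketch in which regular expressions are encoded as tagged tuples---an encoding chosen purely for this illustration, not the datatype used in the formalisation:

```python
# Sketch of Brzozowski's matcher. Regular expressions are tagged tuples
# (a hypothetical encoding for illustration): ('zero',), ('one',),
# ('char', c), ('alt', r1, r2), ('seq', r1, r2), ('star', r).

def nullable(r):
    # can r match the empty string?
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def der(c, r):
    # derivative of r with respect to the character c
    tag = r[0]
    if tag in ('zero', 'one'):
        return ('zero',)
    if tag == 'char':
        return ('one',) if r[1] == c else ('zero',)
    if tag == 'alt':
        return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        left = ('seq', der(c, r[1]), r[2])
        return ('alt', left, der(c, r[2])) if nullable(r[1]) else left
    if tag == 'star':
        return ('seq', der(c, r[1]), r)

def ders(s, r):
    # derivative with respect to a string, character by character
    for c in s:
        r = der(c, r)
    return r

def match(s, r):
    return nullable(ders(s, r))
```

For instance, for the regular expression $(a + aa)^*$ the sketch answers \texttt{True} on the string \texttt{aaa} and \texttt{False} on \texttt{ab}.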
       

\noindent
This algorithm can be illustrated graphically as follows:
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}
       

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

       
\section{Values and the Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is matched by a regular
expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also \emph{lexical value}.  These values and regular
expressions correspond to each other as illustrated in the following
table:
       
   663 
       
   664 \begin{center}
       
   665 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   666 		\begin{tabular}{@{}rrl@{}}
       
   667 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   668 			$r$ & $::=$  & $\ZERO$\\
       
   669 			& $\mid$ & $\ONE$   \\
       
   670 			& $\mid$ & $c$          \\
       
   671 			& $\mid$ & $r_1 \cdot r_2$\\
       
   672 			& $\mid$ & $r_1 + r_2$   \\
       
   673 			\\
       
   674 			& $\mid$ & $r^*$         \\
       
   675 		\end{tabular}
       
   676 		&
       
   677 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   678 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   679 			$v$ & $::=$  & \\
       
   680 			&        & $\Empty$   \\
       
   681 			& $\mid$ & $\Char(c)$          \\
       
   682 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   683 			& $\mid$ & $\Left(v)$   \\
       
   684 			& $\mid$ & $\Right(v)$  \\
       
   685 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   686 		\end{tabular}
       
   687 	\end{tabular}
       
   688 \end{center}
       

\noindent
No value corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$;
$\Char$ to the character regular expression; $\Seq$ to the sequence
regular expression and so on. The idea of values is to encode a kind of
lexical value for how the sub-parts of a regular expression match the
sub-parts of a string. To see this, consider a \emph{flatten} operation,
written $|v|$ for values. We can use this function to extract the
underlying string of a value $v$. For example, $|\mathit{Seq} \,
(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
flatten, we can describe how values encode lexical values: $\Seq\,v_1\,
v_2$ encodes a tree with two children nodes that tells us how the string
$|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches
the substring $|v_1|$ and, respectively, $r_2$ matches the substring
$|v_2|$. Exactly how these two are matched is contained in the children
nodes $v_1$ and $v_2$ of the parent $\textit{Seq}$. 
       

To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative of the right branch. This
suggests recording this matching as

\begin{center}
$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
\end{center}
       

\noindent
where $\Stars \; [\ldots]$ records all the
iterations; and $\Left$, respectively $\Right$, which
alternative is used. The value for
matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
would look as follows
       

\begin{center}
$\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
\end{center}

\noindent
where $\Stars$ has only a single-element list for the single iteration
and $\Seq$ indicates that $xy$ is matched by a sequence regular
expression.
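The flatten operation $|v|$ can be sketched as a simple recursion over values; here values are encoded as tagged tuples, a hypothetical encoding used only for illustration:

```python
# Sketch of the flatten operation |v|. Values are tagged tuples
# (a hypothetical encoding for illustration): ('empty',), ('chr', c),
# ('seq', v1, v2), ('left', v), ('right', v), ('stars', [v1, ..., vn]).

def flatten(v):
    # extract the underlying string of a value
    tag = v[0]
    if tag == 'empty':
        return ''
    if tag == 'chr':
        return v[1]
    if tag == 'seq':
        return flatten(v[1]) + flatten(v[2])
    if tag in ('left', 'right'):
        return flatten(v[1])
    if tag == 'stars':
        return ''.join(flatten(u) for u in v[1])
```

Both values for $xy$ shown above---the two-iteration value and the single-iteration POSIX value---flatten to the same string $xy$.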
       

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
is generated in case the regular expression matches the string. 
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}
       

\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r_0$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds the lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
       

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
			& \textit{if} $\nullable(r_{1})$\\ 
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}
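The definition of $\mkeps$ transcribes directly into a recursive function. The following sketch uses a hypothetical tagged-tuple encoding of regular expressions and values, chosen only for illustration; $\textit{nullable}$ is repeated to keep the sketch self-contained:

```python
# Sketch of mkeps. Regular expressions are tagged tuples such as
# ('one',), ('alt', r1, r2), ('seq', r1, r2), ('star', r); values are
# ('empty',), ('left', v), ('right', v), ('seq', v1, v2), ('stars', vs).
# This encoding is hypothetical and used only for illustration.

def nullable(r):
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def mkeps(r):
    # value for how a nullable r matches the empty string,
    # preferring the left alternative
    tag = r[0]
    if tag == 'one':
        return ('empty',)
    if tag == 'alt':
        if nullable(r[1]):
            return ('left', mkeps(r[1]))
        return ('right', mkeps(r[2]))
    if tag == 'seq':
        return ('seq', mkeps(r[1]), mkeps(r[2]))
    if tag == 'star':
        return ('stars', [])
```

Note how the alternative case mirrors the left-preference of the definition: the right branch is only chosen when the left one is not nullable.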
       


\noindent There are no cases for $\ZERO$ and $c$, since
these regular expressions cannot match the empty string. Note
also that in case of alternatives we give preference to the
regular expression on the left-hand side. This will become
important later on for determining which value is calculated.
       

After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off; the second
is a character ${c_{i-1}}$, the character we want to inject; and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows: 
       
   797 
       
   798 \begin{center}
       
   799 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   800   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   801   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   802   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   803   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   804   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   805   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   806   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   807 \end{tabular}
       
   808 \end{center}
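These clauses can be transcribed almost verbatim. The sketch below uses a hypothetical tagged-tuple encoding of regular expressions and values (for illustration only), and repeats $\textit{nullable}$ and $\mkeps$ so that it is self-contained:

```python
# Sketch of the injection function. Regular expressions are tuples like
# ('char', c), ('alt', r1, r2), ('seq', r1, r2), ('star', r); values are
# ('empty',), ('chr', c), ('left', v), ('right', v), ('seq', v1, v2),
# ('stars', vs). This encoding is hypothetical, for illustration only.

def nullable(r):
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def mkeps(r):
    tag = r[0]
    if tag == 'one':  return ('empty',)
    if tag == 'alt':
        return ('left', mkeps(r[1])) if nullable(r[1]) else ('right', mkeps(r[2]))
    if tag == 'seq':  return ('seq', mkeps(r[1]), mkeps(r[2]))
    if tag == 'star': return ('stars', [])

def inj(r, c, v):
    tag = r[0]
    if tag == 'char':
        # inj c c Empty = Char c
        return ('chr', c)
    if tag == 'alt':
        if v[0] == 'left':
            return ('left', inj(r[1], c, v[1]))
        return ('right', inj(r[2], c, v[1]))
    if tag == 'seq':
        if v[0] == 'seq':
            return ('seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'left':
            # v = Left(Seq(v1, v2)): c was chopped off from r1
            s = v[1]
            return ('seq', inj(r[1], c, s[1]), s[2])
        # v = Right(v'): r1 was nullable and c was chopped off from r2
        return ('seq', mkeps(r[1]), inj(r[2], c, v[1]))
    if tag == 'star':
        # v = Seq(v', Stars vs): prepend the repaired iteration
        return ('stars', [inj(r[1], c, v[1])] + v[2][1])
```

For example, injecting $a$ into the value $\Seq(\Empty, \Char\,b)$ for $(a\cdot b)\backslash a$ yields $\Seq(\Char\,a, \Char\,b)$.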
       

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values. To understand this definition better, consider
the situation when we build the derivative of the regular expression $r_{i-1}$.
For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
``hole'' in $r_i$ and its corresponding value $v_i$. 
To calculate $v_{i-1}$, we need to
locate where that hole is and fill it. 
We can find this location by
comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
$r_a \cdot r_b$, and $v_i$ is of shape $\Left(\Seq(v_a,v_b))$, we know immediately that 
%
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]

\noindent
because if $r_a$ were not nullable, the derivative would be
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\]

\noindent
and the value $v_i$ would have to be of shape $\Seq(\ldots)$, contradicting the fact that
$v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
$\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
branch of \[ (r_a \cdot r_b)\backslash c =
\underline{(r_a\backslash c) \cdot r_b} \,+\, r_b\backslash c\] (underlined)
is taken instead of the right one. This means $c$ was chopped off 
from $r_a$ rather than $r_b$.
We have therefore found out 
that the hole is in $r_a$. So we recursively call $\inj\, 
r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
$v_{i-1}$ for $r_{i-1} = r_a \cdot r_b$ is then $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
Other clauses can be understood in a similar way.
       

%\comment{Other word: insight?}
The following example gives an insight into $\textit{inj}$'s effect and
how Sulzmann and Lu's algorithm works as a whole. Suppose we have the
regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it
against the string $abc$ (when $abc$ is written as a regular expression,
the standard way of expressing it is $a \cdot (b \cdot c)$, but we
usually omit the parentheses and dots here for better readability). The
algorithm returns a POSIX value, which means it will produce the longest
matching. Consequently, it matches the string $abc$ in one star
iteration, using the longest alternative $abc$ in the sub-expression (we shall use $r$ to denote this
sub-expression for conciseness):

\[((((a+b)+ab)+c)+\underbrace{abc}_r)\] 
       

\noindent
Before $\textit{inj}$ is called, our lexer first builds the derivatives using the
string $abc$ (we simplify some regular expressions like $\ZERO \cdot
b$ to $\ZERO$ for conciseness; we also omit parentheses if they are
clear from the context):

%Similarly, we allow
%$\textit{ALT}$ to take a list of regular expressions as an argument
%instead of just 2 operands to reduce the nested depth of
%$\textit{ALT}$
       
\begin{center}
\begin{tabular}{lcl}
$r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
      & $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$\\
      & $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^*) + $\\ 
      &                              & $\phantom{r_3 = (} ((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* )$
\end{tabular}
\end{center}
       

\noindent
Since $r_3$ is nullable, we can call $\textit{mkeps}$ 
to construct a lexical value for how $r_3$ matches the empty string. 
This function gives the following value $v_3$: 

\begin{center}
$\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
\end{center}
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
       

\begin{center}
	\begin{tabular}{l@{\hspace{2mm}}l}
    & $\big(\underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} 
    \;+\; (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*\big)$ \smallskip\\
    $+$ & $\big((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*
    \;+\; (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* \big)$
  	\end{tabular}
 \end{center}
       

\noindent
 Note that the leftmost location of the term $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
 \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
 $\textit{mkeps}$ to pick it up because $\textit{mkeps}$ is defined to always choose the
 left one when it is nullable. In the case of this example, $abc$ is
 preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is
 generated by two applications of the splitting clause

\begin{center}
     $(r_1 \cdot r_2)\backslash c \;=\; (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c \qquad (\textit{when $r_1$ is nullable}).$
\end{center}
       

\noindent
By this clause, we put $(r_1 \backslash c) \cdot r_2 $ at the
$\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This
allows $\textit{mkeps}$ to always pick up among two matches the one with the longer
initial sub-match. Removing the outside $\Left(\Left(...))$, the inside
sub-value 
 
\begin{center}
 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
\end{center}
       

\noindent
tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot
\ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a sequence of two nullable regular
expressions. The first one is an alternative; we take its rightmost
alternative, whose language contains the empty string. The second
nullable regular expression is a Kleene star. $\Stars$ tells us how it
generates the nullable regular expression: by zero iterations to form
$\ONE$. Now $\textit{inj}$ injects characters back and incrementally
builds a lexical value based on $v_3$. Using the value $v_3$, the character
$c$, and the regular expression $r_2$, we can recover how $r_2$ matched
the string $[c]$: $\textit{inj} \; r_2 \; c \; v_3$ gives us
 \begin{center}
 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
 \end{center}
       
which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
\begin{center}
$v_1 = \Seq(\Right(\Seq(\Empty, \Seq(b, c))), \Stars [])$
\end{center}
 for how 
 \begin{center}
 $r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$
 \end{center}
  matched the string $bc$ before it split into two substrings. 
  Finally, after injecting the character $a$ back into $v_1$, 
  we get the lexical value tree 
  \begin{center}
  $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
  \end{center}
   for how $r$ matched $abc$. This completes the algorithm.
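Putting the two phases together gives a complete lexer: derivatives forwards, then $\mkeps$ followed by injections backwards. The following sketch uses a hypothetical tagged-tuple encoding of regular expressions and values (for illustration only); on the earlier example $(x + (y + xy))^*$ with the string $xy$ it returns the POSIX value with a single star iteration, using the $xy$ alternative.

```python
# Sketch of the two-phase Sulzmann/Lu lexer. Regular expressions and
# values are tagged tuples -- a hypothetical encoding for illustration.

def nullable(r):
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def der(c, r):
    tag = r[0]
    if tag in ('zero', 'one'):
        return ('zero',)
    if tag == 'char':
        return ('one',) if r[1] == c else ('zero',)
    if tag == 'alt':
        return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        left = ('seq', der(c, r[1]), r[2])
        return ('alt', left, der(c, r[2])) if nullable(r[1]) else left
    if tag == 'star':
        return ('seq', der(c, r[1]), r)

def mkeps(r):
    tag = r[0]
    if tag == 'one':  return ('empty',)
    if tag == 'alt':
        return ('left', mkeps(r[1])) if nullable(r[1]) else ('right', mkeps(r[2]))
    if tag == 'seq':  return ('seq', mkeps(r[1]), mkeps(r[2]))
    if tag == 'star': return ('stars', [])

def inj(r, c, v):
    tag = r[0]
    if tag == 'char':
        return ('chr', c)
    if tag == 'alt':
        if v[0] == 'left':
            return ('left', inj(r[1], c, v[1]))
        return ('right', inj(r[2], c, v[1]))
    if tag == 'seq':
        if v[0] == 'seq':
            return ('seq', inj(r[1], c, v[1]), v[2])
        if v[0] == 'left':                              # Left(Seq(v1, v2))
            return ('seq', inj(r[1], c, v[1][1]), v[1][2])
        return ('seq', mkeps(r[1]), inj(r[2], c, v[1])) # Right(v')
    if tag == 'star':                                   # Seq(v', Stars vs)
        return ('stars', [inj(r[1], c, v[1])] + v[2][1])

def flatten(v):
    tag = v[0]
    if tag == 'empty': return ''
    if tag == 'chr':   return v[1]
    if tag == 'seq':   return flatten(v[1]) + flatten(v[2])
    if tag in ('left', 'right'): return flatten(v[1])
    return ''.join(flatten(u) for u in v[1])            # stars

def lexer(r, s):
    # phase 1: build derivatives; phase 2: mkeps, then inject backwards
    if not s:
        return mkeps(r) if nullable(r) else None
    v = lexer(der(s[0], r), s[1:])
    return None if v is None else inj(r, s[0], v)
```

The resulting value records the $\Right$-choices leading to the $xy$ alternative, and flattens back to the input string.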
       

%We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
Readers might have noticed that the lexical value information is actually
already available when doing derivatives. For example, immediately after
the operation $\backslash a$ we know that if we want to match a string
that starts with $a$, we can take the initial match to be either

\begin{center}
\begin{enumerate}
    \item[1)] just $a$,
    \item[2)] the string $ab$, or 
    \item[3)] the string $abc$.
\end{enumerate}
\end{center}
       

\noindent
In order to differentiate between these choices, we just need to
remember their positions---$a$ is on the left, $ab$ is in the middle,
and $abc$ is on the right. Which of these alternatives is chosen
later does not affect their relative position because the algorithm does
not change this order. If this parsing information can be determined and
does not change because of later derivatives, there is no point in
traversing this information twice. This leads to an optimisation---if we
store the information for lexical values inside the regular expression,
update it when we do derivatives on them, and collect the information
when finished with derivatives and call $\textit{mkeps}$ for deciding which
branch is POSIX, we can generate the lexical value in one pass, instead of
doing the remaining $n$ injections. This leads to Sulzmann and Lu's novel
idea of using bitcodes in derivatives.
       

In the next section, we shall focus on the bitcoded algorithm and the
process of simplification of regular expressions. This is needed in
order to obtain \emph{fast} versions of Brzozowski's and Sulzmann
and Lu's algorithms.  This is where the PhD project aims to advance the
state of the art.

       
\section{Simplification of Regular Expressions}

       
Using bitcodes to guide parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielson for
efficient regular expression lexing using DFAs~\cite{nielson11bcre}.
Sulzmann and Lu took this idea of bitcodes a step further by integrating
bitcodes into derivatives. The reason why we want to use bitcodes in
this project is that we want to introduce more aggressive simplification
rules in order to keep the size of derivatives small throughout. This is
because the main drawback of building successive derivatives according
to Brzozowski's definition is that they can grow very quickly in size.
This is mainly due to the fact that the derivative operation often
generates ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
are excruciatingly slow. For example when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\textit{der}$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm. 
       

Fortunately, one can simplify regular expressions after each derivative
step. Various simplifications of regular expressions are possible, such
as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
affect the answer for whether a regular expression matches a string or
not, but fortunately also do not affect the POSIX strategy of how
regular expressions match strings---although the latter is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}. 
       

Unfortunately, the simplification rules outlined above are not
sufficient to prevent a size explosion in all cases. We
believe a tighter bound can be achieved that prevents an explosion in
\emph{all} cases. Such a tighter bound is suggested by work of Antimirov who
proved that (partial) derivatives can be bounded by the number of
characters contained in the initial regular expression
\cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
expressions as follows:

       
\begin{center}
\begin{tabular}{lcl}
 $\textit{pder} \; c \; \ZERO$ & $\dn$ & $\emptyset$\\
 $\textit{pder} \; c \; \ONE$ & $\dn$ & $\emptyset$ \\
 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{  \ONE   \}  \; \textit{else} \; \emptyset$ \\ 
 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $\textit{pder} \; c \; r_1 \cup \textit{pder} \; c \;  r_2$ \\
 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; \nullable \; r_1 $\\
   & & $\textit{then} \; \{  r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1   \}  \cup \textit{pder} \; c \; r_2 \;$\\
   & & $\textit{else} \; \{  r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1   \} $ \\ 
   $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in \textit{pder} \; c \; r   \}  $ \\  
 \end{tabular}
 \end{center}
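Antimirov's definition can be sketched directly, with sets of regular expressions represented as Python sets of tagged tuples (a hypothetical encoding, used only for illustration):

```python
# Sketch of Antimirov's partial derivatives. Regular expressions are
# tagged tuples (hypothetical encoding): ('zero',), ('one',),
# ('char', c), ('alt', r1, r2), ('seq', r1, r2), ('star', r).

def nullable(r):
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def pder(c, r):
    # a *set* of regular expressions, not a single one
    tag = r[0]
    if tag in ('zero', 'one'):
        return set()
    if tag == 'char':
        return {('one',)} if r[1] == c else set()
    if tag == 'alt':
        return pder(c, r[1]) | pder(c, r[2])
    if tag == 'seq':
        left = {('seq', p, r[2]) for p in pder(c, r[1])}
        return left | pder(c, r[2]) if nullable(r[1]) else left
    if tag == 'star':
        return {('seq', p, r) for p in pder(c, r[1])}
```

For example, $\textit{pder} \; a \; (a + a\cdot b)$ yields the two-element set $\{\ONE,\; \ONE \cdot b\}$, whereas the Brzozowski derivative would keep both summands inside a single alternative.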
       

\noindent
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s children expressions or a
concatenation of them. Antimirov proved a tight bound on the sum of
the sizes of \emph{all} partial derivatives, no matter what the string
looks like. Roughly speaking, this size sum will be at most cubic in the
size of the regular expression.
       

If we want the size of derivatives in Sulzmann and Lu's algorithm to
stay below this bound, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b \cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve the same size bound as that of the partial derivatives. 
       

In order to implement the idea of ``spilling out alternatives'' and to
make them compatible with the $\textit{inj}$-mechanism, we use
\emph{bitcodes}. Bits and bitcodes (lists of bits) are just:

%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

 %We believe, and have generated test
%data, that a similar bound can be obtained for the derivatives in
%Sulzmann and Lu's algorithm. Let us give some details about this next.

       
\begin{center}
		$b ::=   S \mid  Z \qquad
bs ::= [] \mid b:bs    
$
\end{center}
       

\noindent
The $S$ and $Z$ are arbitrary names for the bits in order to avoid 
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes: 
       

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}    
\end{center} 
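The coding function can be sketched as follows, with bits represented by the strings \texttt{'Z'} and \texttt{'S'} and values as tagged tuples (a hypothetical encoding, for illustration only):

```python
# Sketch of the coding function from values to bitcodes. Values are
# tagged tuples (hypothetical encoding): ('empty',), ('chr', c),
# ('seq', v1, v2), ('left', v), ('right', v), ('stars', vs).

def code(v):
    tag = v[0]
    if tag in ('empty', 'chr'):
        return []                      # characters are thrown away
    if tag == 'left':
        return ['Z'] + code(v[1])
    if tag == 'right':
        return ['S'] + code(v[1])
    if tag == 'seq':
        return code(v[1]) + code(v[2])
    if tag == 'stars':
        if not v[1]:
            return ['Z']               # marks the end of the iterations
        return ['S'] + code(v[1][0]) + code(('stars', v[1][1:]))
```

For instance, the single-iteration value $\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$ from before codes to the bit-list $[S, Z]$: one $S$ for the start of the iteration and one $Z$ for its end, with both characters thrown away.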
       

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
star iteration into $\S$, and the border where a local star terminates
into $\Z$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will 
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
       
  1109 
       
  1110 
       
  1111 %\begin{definition}[Bitdecoding of Values]\mbox{}
       
  1112 \begin{center}
       
  1113 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
  1114   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
  1115   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
  1116   $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
  1117      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
  1118        (\Left\,v, bs_1)$\\
       
  1119   $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
  1120      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
  1121        (\Right\,v, bs_1)$\\                           
       
  1122   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
  1123         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
  1124   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
       
  1125   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
  1126   $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
       
  1127   $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & 
       
  1128          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
  1129   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
       
  1130   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
  1131   
       
  1132   $\textit{decode}\,bs\,r$ & $\dn$ &
       
  1133      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
  1134   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
  1135        \textit{else}\;\textit{None}$                       
       
  1136 \end{tabular}    
       
  1137 \end{center}    
       
  1138 %\end{definition}
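As a concrete, runnable sketch, the two decoding functions can be rendered in Scala roughly as follows. The datatype constructors and helper names (such as \texttt{decode2}) are our own rendering for illustration, not the exact definitions of the implementation; the convention is that $\Z$ marks one more star iteration and $\S$ terminates the iterations, matching the $bs\,@\,[\S]$ clause of the $\textit{bmkeps}$ function later in this section.

```scala
// bits Z and S
sealed trait Bit
case object Z extends Bit
case object S extends Bit

// basic regular expressions
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// values
sealed trait Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// decode': consumes bits, guided by the regular expression;
// Z chooses the left alternative / one more star iteration,
// S the right alternative / the end of the iterations
def decode2(bs: List[Bit], r: Rexp): (Val, List[Bit]) = (bs, r) match {
  case (bs, ONE)     => (Empty, bs)
  case (bs, CHAR(c)) => (Chr(c), bs)
  case (Z :: bs, ALT(r1, _)) =>
    val (v, bs1) = decode2(bs, r1); (Left(v), bs1)
  case (S :: bs, ALT(_, r2)) =>
    val (v, bs1) = decode2(bs, r2); (Right(v), bs1)
  case (bs, SEQ(r1, r2)) =>
    val (v1, bs1) = decode2(bs, r1)
    val (v2, bs2) = decode2(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (S :: bs, STAR(_)) => (Stars(Nil), bs)
  case (Z :: bs, STAR(r1)) =>
    val (v, bs1) = decode2(bs, r1)
    decode2(bs1, STAR(r1)) match {
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
      case _                => sys.error("malformed bitcode")
    }
}

// decode succeeds only if all bits are consumed
def decode(bs: List[Bit], r: Rexp): Option[Val] =
  decode2(bs, r) match {
    case (v, Nil) => Some(v)
    case _        => None
  }
```

For example, decoding the bits \texttt{List(Z, S)} against \texttt{STAR(CHAR('a'))} yields \texttt{Some(Stars(List(Chr('a'))))}: the $\Z$ opens one iteration and the $\S$ closes the star.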
       
  1139 
       
Sulzmann and Lu integrated the bitcodes into regular expressions to
       
  1141 create annotated regular expressions \cite{Sulzmann2014}.
       
  1142 \emph{Annotated regular expressions} are defined by the following
       
  1143 grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
  1144 
       
  1145 \begin{center}
       
  1146 \begin{tabular}{lcl}
       
  1147   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
       
  1148                   & $\mid$ & $\textit{ONE}\;\;bs$\\
       
  1149                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
       
  1150                   & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
       
  1151                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
       
  1152                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
       
  1153 \end{tabular}    
       
  1154 \end{center}  
       
  1155 %(in \textit{ALTS})
       
  1156 
       
  1157 \noindent
       
  1158 where $bs$ stands for bitcodes, $a$  for $\bold{a}$nnotated regular
       
  1159 expressions and $as$ for a list of annotated regular expressions.
       
The alternative constructor ($\textit{ALTS}$) has been generalised to 
accept a list of annotated regular expressions rather than just two.
       
  1162 We will show that these bitcodes encode information about
       
  1163 the (POSIX) value that should be generated by the Sulzmann and Lu
       
  1164 algorithm.
       
  1165 
       
  1166 
       
  1167 To do lexing using annotated regular expressions, we shall first
       
  1168 transform the usual (un-annotated) regular expressions into annotated
       
  1169 regular expressions. This operation is called \emph{internalisation} and
       
  1170 defined as follows:
       
  1171 
       
  1172 %\begin{definition}
       
  1173 \begin{center}
       
  1174 \begin{tabular}{lcl}
       
  1175   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
       
  1176   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
       
  1177   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
       
  1178   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
  1179   $\textit{ALTS}\;[]\,List((\textit{fuse}\,[\Z]\,r_1^\uparrow),\,
       
  1180   (\textit{fuse}\,[\S]\,r_2^\uparrow))$\\
       
  1181   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
  1182          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
       
  1183   $(r^*)^\uparrow$ & $\dn$ &
       
  1184          $\textit{STAR}\;[]\,r^\uparrow$\\
       
  1185 \end{tabular}    
       
  1186 \end{center}    
       
  1187 %\end{definition}
       
  1188 
       
  1189 \noindent
       
  1190 We use up arrows here to indicate that the basic un-annotated regular
       
  1191 expressions are ``lifted up'' into something slightly more complex. In the
       
  1192 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
  1193 attach bits to the front of an annotated regular expression. Its
       
  1194 definition is as follows:
       
  1195 
       
  1196 \begin{center}
       
  1197 \begin{tabular}{lcl}
       
  1198   $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
       
  1199   $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
       
  1200      $\textit{ONE}\,(bs\,@\,bs')$\\
       
  1201   $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
       
  1202      $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
       
  1203   $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,as)$ & $\dn$ &
       
  1204      $\textit{ALTS}\,(bs\,@\,bs')\,as$\\
       
  1205   $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
       
  1206      $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
       
  1207   $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
       
  1208      $\textit{STAR}\,(bs\,@\,bs')\,a$
       
  1209 \end{tabular}    
       
  1210 \end{center}  
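A small Scala sketch of internalisation and $\textit{fuse}$, assuming the datatype renderings used throughout (plain constructors for basic regular expressions and \texttt{A}-prefixed ones for annotated regular expressions; the names are ours, chosen for illustration):

```scala
sealed trait Bit
case object Z extends Bit
case object S extends Bit

// basic regular expressions
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// annotated regular expressions
sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ++ bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ++ bs1, a1)
}

// internalisation: lift a basic regex into an annotated one,
// marking the two branches of an alternative with Z and S
def internalise(r: Rexp): ARexp = r match {
  case ZERO        => AZERO
  case ONE         => AONE(Nil)
  case CHAR(c)     => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(Z), internalise(r1)),
                                      fuse(List(S), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1)    => ASTAR(Nil, internalise(r1))
}
```

For instance, internalising $a + \ONE$ produces an \texttt{AALTS} whose left branch carries the bit $\Z$ and whose right branch carries $\S$.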
       
  1211 
       
  1212 \noindent
       
  1213 After internalising the regular expression, we perform successive
       
  1214 derivative operations on the annotated regular expressions. This
       
  1215 derivative operation is the same as what we had previously for the
       
basic regular expressions, except that we need to take care of
       
  1217 the bitcodes:
       
  1218 
       
  1219  %\begin{definition}{bder}
       
  1220 \begin{center}
       
  1221   \begin{tabular}{@{}lcl@{}}
       
  1222   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1223   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1224   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
  1225         $\textit{if}\;c=d\; \;\textit{then}\;
       
  1226          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
  1227   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
  1228   $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
       
  1229   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
  1230      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
  1231 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
  1232 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
  1233   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
       
  1234   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
       
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z]\, (a\,\backslash c))\,
       (\textit{STAR}\,[]\,a)$
       
  1237 \end{tabular}    
       
  1238 \end{center}    
       
  1239 %\end{definition}
       
  1240 
       
  1241 \noindent
       
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
we need to attach an additional bit $\Z$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The $\textit{SEQ}$ clause
is more subtle: when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions, so we omit its
definition), the derivative can also proceed into $a_2$. Assume that
$\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later);
then the right branch of the $\textit{ALTS}$, which is
$\textit{fuse} \; (\textit{bmkeps} \; a_1) \; (a_2 \backslash c)$, will
collapse the regular expression $a_1$ (as it has
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it there. The
bitsequence $bs$, which was initially attached to the head of the
$\textit{SEQ}$, has now been elevated to the top level of the
$\textit{ALTS}$, as this information will be needed whichever way the
$\textit{SEQ}$ is matched, no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
       
  1261 
       
  1262 
       
  1263 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
  1264 \begin{center}
       
  1265 \begin{tabular}{lcl}
       
  1266   $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
       
  1267   $\textit{bmkeps}\,(\textit{ALTS}\;bs\,a::as)$ & $\dn$ &
       
  1268      $\textit{if}\;\textit{bnullable}\,a$\\
       
  1269   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
  & &$\textit{else}\;\textit{bmkeps}\,(\textit{ALTS}\;bs\,as)$\\
       
  1271   $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
       
  1272      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
  1273   $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
       
  1274      $bs \,@\, [\S]$
       
  1275 \end{tabular}    
       
  1276 \end{center}    
       
  1277 %\end{definition}
       
  1278 
       
  1279 \noindent
       
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value,
collecting all the bitcodes and using $\S$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
       
  1284 decode them, we get the value we expect. The corresponding lexing
       
  1285 algorithm looks as follows:
       
  1286 
       
  1287 \begin{center}
       
  1288 \begin{tabular}{lcl}
       
  1289   $\textit{blexer}\;r\,s$ & $\dn$ &
       
  1290       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
  1291   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1292   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1293   & & $\;\;\textit{else}\;\textit{None}$
       
  1294 \end{tabular}
       
  1295 \end{center}
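For concreteness, the whole bitcoded pipeline ($\textit{bnullable}$, $\textit{bmkeps}$, the derivative and the lexer) can be sketched in Scala as below. This is our own simplified rendering, not the verified definitions: \texttt{blex} returns the raw bitcode, which would still have to be decoded against the original regular expression as in $\textit{blexer}$; also note that in the \textit{ALTS} clause of \texttt{bmkeps} the bits $bs$ are emitted exactly once, when the first nullable alternative is found.

```scala
sealed trait Bit
case object Z extends Bit
case object S extends Bit

sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ++ bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ++ bs1, a1)
}

// nullability, lifted to annotated regular expressions
def bnullable(a: ARexp): Boolean = a match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALTS(_, as)    => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _)     => true
}

// collect the bits along the leftmost nullable (POSIX) path
def bmkeps(a: ARexp): List[Bit] = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1) else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _)     => bs ++ List(S)
}

// derivative of an annotated regular expression w.r.t. a character
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO         => AZERO
  case AONE(_)       => AZERO
  case ACHAR(bs, d)  => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
}

// derivative w.r.t. a string
def bders(s: List[Char], a: ARexp): ARexp =
  s.foldLeft(a)((acc, c) => bder(c, acc))

// the lexer, here returning the raw bitcode
def blex(a: ARexp, s: List[Char]): Option[List[Bit]] =
  bders(s, a) match {
    case a1 if bnullable(a1) => Some(bmkeps(a1))
    case _                   => None
  }
```

Running this sketch on $a^*$ and the string $aa$ yields the bitcode $[\Z,\Z,\S]$: two $\Z$s for the two iterations and an $\S$ closing the star.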
       
  1296 
       
  1297 \noindent
       
  1298 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
  1299 operation from characters to strings (just like the derivatives for un-annotated
       
  1300 regular expressions).
       
  1301 
       
  1302 The main point of the bitcodes and annotated regular expressions is that
       
  1303 we can apply rather aggressive (in terms of size) simplification rules
       
  1304 in order to keep derivatives small. We have developed such
       
  1305 ``aggressive'' simplification rules and generated test data that show
       
  1306 that the expected bound can be achieved. Obviously we could only
       
  1307 partially cover  the search space as there are infinitely many regular
       
  1308 expressions and strings. 
       
  1309 
       
  1310 One modification we introduced is to allow a list of annotated regular
       
  1311 expressions in the \textit{ALTS} constructor. This allows us to not just
       
  1312 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
  1313 also unnecessary ``copies'' of regular expressions (very similar to
       
  1314 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
  1315 modification is that we use simplification rules inspired by Antimirov's
       
  1316 work on partial derivatives. They maintain the idea that only the first
       
  1317 ``copy'' of a regular expression in an alternative contributes to the
       
  1318 calculation of a POSIX value. All subsequent copies can be pruned away from
       
  1319 the regular expression. A recursive definition of our  simplification function 
       
  1320 that looks somewhat similar to our Scala code is given below:
       
  1321 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
  1322 %Is it $ALTS$ or $ALTS$?}\\
       
  1323 
       
  1324 \begin{center}
       
  1325   \begin{tabular}{@{}lcl@{}}
       
  1326    
       
  1327   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
  1328    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
  1329    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
  1330    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
  1331    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
  1332    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
       
  1333 
       
  $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}\,( \textit{flatten} \,( \textit{map} \; \textit{simp} \; as)) \; \textit{match} $ \\
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse} \; bs \; a$ \\
       
  1337    &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\ 
       
  1338 
       
  1339    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
  1340 \end{tabular}    
       
  1341 \end{center}    
       
  1342 
       
  1343 \noindent
       
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its children
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\textit{ALTS}$ clause, where we use two
auxiliary functions, $\textit{flatten}$ and $\textit{distinct}$, to open up nested
$\textit{ALTS}$ and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later ones
when duplicates are detected. Function $\textit{flatten}$ opens up nested \textit{ALTS};
its recursive definition is given below:
       
  1355 
       
  1356  \begin{center}
       
  1357   \begin{tabular}{@{}lcl@{}}
       
  $\textit{flatten} \; ((\textit{ALTS}\;bs\,as) :: as')$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; (\textit{ZERO} :: as')$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
    $\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
       
  1362 \end{tabular}    
       
  1363 \end{center}  
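A Scala sketch of $\textit{flatten}$ and $\textit{simp}$ on the annotated datatype is given below (our own rendering; Scala's built-in \texttt{distinct} on lists keeps the first occurrence, which is exactly the behaviour we need). Two hedges: in the $\ONE$-on-the-left clause we fuse the bits sitting on the $\ONE$ itself as well, and we omit the clause simplifying $\ONE$ on the right of a sequence, since $\textit{fuse}$ can only attach bits at the front, where the $\ONE$'s bits would end up in the wrong position.

```scala
sealed trait Bit
case object Z extends Bit
case object S extends Bit

sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ++ bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ++ bs1, a1)
}

// open up nested ALTS (fusing their bits downwards) and drop ZEROs
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil                   => Nil
  case AZERO :: as1          => flatten(as1)
  case AALTS(bs, as1) :: as2 => as1.map(fuse(bs, _)) ++ flatten(as2)
  case a :: as1              => a :: flatten(as1)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _)      => AZERO
    case (_, AZERO)      => AZERO
    case (AONE(bs1), s2) => fuse(bs ++ bs1, s2) // keep the ONE's bits
    case (s1, s2)        => ASEQ(bs, s1, s2)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil       => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1       => AALTS(bs, as1)
  }
  case a1 => a1
}
```

For example, simplifying the alternative $\ZERO + a + a$ (as an \texttt{AALTS} of three annotated regular expressions) drops the $\ZERO$, removes the duplicate and collapses the singleton alternative into the character regular expression itself.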
       
  1364 
       
  1365 \noindent
       
  1366 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses: for example, it changes $a+(b+c)$ into $a+b+c$.
       
  1369 
       
  1370 Suppose we apply simplification after each derivative step, and view
       
  1371 these two operations as an atomic one: $a \backslash_{simp}\,c \dn
       
  1372 \textit{simp}(a \backslash c)$. Then we can use the previous natural
       
  1373 extension from derivative w.r.t.~character to derivative
       
  1374 w.r.t.~string:%\comment{simp in  the [] case?}
       
  1375 
       
  1376 \begin{center}
       
  1377 \begin{tabular}{lcl}
       
  1378 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
  1379 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
  1380 \end{tabular}
       
  1381 \end{center}
       
  1382 
       
  1383 \noindent
       
With this we obtain an optimised version of the algorithm:
       
  1385 
       
  1386  \begin{center}
       
  1387 \begin{tabular}{lcl}
       
  1388   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
  1389       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
  1390   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1391   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1392   & & $\;\;\textit{else}\;\textit{None}$
       
  1393 \end{tabular}
       
  1394 \end{center}
       
  1395 
       
  1396 \noindent
       
This algorithm keeps the size of regular expressions small: for example,
with this simplification the 8000 nodes of our previous $(a + aa)^*$ example
are reduced to just 6, and the size stays constant no matter how long the
input string is.
       
  1401 
       
  1402 
       
  1403 
       
  1404 \section{Current Work}
       
  1405 
       
  1406 We are currently engaged in two tasks related to this algorithm. The
       
  1407 first task is proving that our simplification rules actually do not
       
  1408 affect the POSIX value that should be generated by the algorithm
       
  1409 according to the specification of a POSIX value and furthermore obtain a
       
  1410 much tighter bound on the sizes of derivatives. The result is that our
       
  1411 algorithm should be correct and faster on all inputs.  The original
       
  1412 blow-up, as observed in JavaScript, Python and Java, would be excluded
       
  1413 from happening in our algorithm. For this proof we use the theorem prover
       
  1414 Isabelle. Once completed, this result will advance the state-of-the-art:
       
  1415 Sulzmann and Lu wrote in their paper~\cite{Sulzmann2014} about the
       
  1416 bitcoded ``incremental parsing method'' (that is the lexing algorithm
       
  1417 outlined in this section):
       
  1418 
       
  1419 \begin{quote}\it
       
  1420   ``Correctness Claim: We further claim that the incremental parsing
       
  1421   method in Figure~5 in combination with the simplification steps in
       
  1422   Figure 6 yields POSIX parse tree [our lexical values]. We have tested this claim
       
  1423   extensively by using the method in Figure~3 as a reference but yet
       
  1424   have to work out all proof details.''
       
  1425 \end{quote}  
       
  1426 
       
  1427 \noindent 
       
We would like to settle this correctness claim. It is relatively
       
  1429 straightforward to establish that after one simplification step, the part of a
       
  1430 nullable derivative that corresponds to a POSIX value remains intact and can
       
  1431 still be collected, in other words, we can show that
       
  1432 %\comment{Double-check....I
       
  1433 %think this  is not the case}
       
  1434 %\comment{If i remember correctly, you have proved this lemma.
       
  1435 %I feel this is indeed not true because you might place arbitrary 
       
  1436 %bits on the regex r, however if this is the case, did i remember wrongly that
       
  1437 %you proved something like simplification does not affect $\textit{bmkeps}$ results?
       
  1438 %Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
       
  1439 %to a regex. Maybe it works now.}
       
  1440 
       
  1441 \begin{center}
       
	$\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{bsimp} \; a) \qquad \text{provided $a$ is $\textit{bnullable}$}$
       
  1443 \end{center} 
       
  1444 
       
  1445 \noindent
       
  1446 as this basically comes down to proving actions like removing the
       
  1447 additional $r$ in $r+r$  does not delete important POSIX information in
       
  1448 a regular expression. The hard part of this proof is to establish that
       
  1449 
       
  1450 \begin{center}
       
	$\textit{blexer\_simp} \; r \; s =  \textit{blexer} \; r \; s$
       
  1452 \end{center}
       
  1453 %comment{This is not true either...look at the definion blexer/blexer-simp}
       
  1454 
       
\noindent That is, if we take the derivative of the regular expression
$r$, simplify it, and repeat this process until the string is exhausted,
we obtain a regular expression $r''$ (the $\textit{LHS}$) that provides
exactly the same POSIX matching information as the result $r'$ (the
$\textit{RHS}$) of the plain derivative algorithm, which only takes
derivatives repeatedly and performs no
simplification at all.  This might seem at first glance very unintuitive, as
$r'$ can be exponentially larger than $r''$, but it can be explained in the
following way: the simplification prunes away the possible matches that are not POSIX.
Since there can be exponentially many 
non-POSIX matchings but only one POSIX matching, it
is understandable that $r''$ can be a lot smaller while still providing
the same POSIX value if there is one.  Proving this is not as straightforward as the
previous proposition, as the two regular expressions $r'$ and $r''$ might have
become very different.  The crucial point is to identify the
POSIX information of a regular expression and how it is modified,
augmented and propagated 
during simplification, in parallel with the unsimplified regular
expression in the subsequent derivative operations.  To aid this,
we use the helper function $\textit{retrieve}$ described by Sulzmann and Lu:
       
  1474 \begin{center}
       
  1475 \begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
       
  1476   $\textit{retrieve}\,(\textit{ONE}\,bs)\,\Empty$ & $\dn$ & $bs$\\
       
  1477   $\textit{retrieve}\,(\textit{CHAR}\,bs\,c)\,(\Char\,d)$ & $\dn$ & $bs$\\
       
  1478   $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Left\,v)$ & $\dn$ &
       
  1479      $bs \,@\, \textit{retrieve}\,a\,v$\\
       
  1480   $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Right\,v)$ & $\dn$ &
       
  $bs \,@\, \textit{retrieve}\,(\textit{ALTS}\,[]\,as)\,v$\\
       
  1482   $\textit{retrieve}\,(\textit{SEQ}\,bs\,a_1\,a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
       
  1483      $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
       
  1484   $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,[])$ & $\dn$ &
       
  1485      $bs \,@\, [\S]$\\
       
  1486   $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
       
  1487   \multicolumn{3}{l}{
       
  1488      \hspace{3cm}$bs \,@\, [\Z] \,@\, \textit{retrieve}\,a\,v\,@\,
       
  1489                     \textit{retrieve}\,(\textit{STAR}\,[]\,a)\,(\Stars\,vs)$}\\
       
  1490 \end{tabular}
       
  1491 \end{center}
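A Scala sketch of $\textit{retrieve}$ is given below (our own rendering). Two hedged additions relative to the table: we include a singleton-$\textit{ALTS}$ clause as the base case of the alternative recursion, and the recursive call for $\Right$ carries an empty bit list so that $bs$ is emitted only once; both points follow the Isabelle formalisation.

```scala
sealed trait Bit
case object Z extends Bit
case object S extends Bit

sealed trait Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

// assemble the bitcode from an annotated regular expression and a
// value that matches its shape
def retrieve(a: ARexp, v: Val): List[Bit] = (a, v) match {
  case (AONE(bs), Empty)                => bs
  case (ACHAR(bs, _), Chr(_))           => bs
  case (AALTS(bs, List(a1)), v1)        => bs ++ retrieve(a1, v1)
  case (AALTS(bs, a1 :: _), Left(v1))   => bs ++ retrieve(a1, v1)
  case (AALTS(bs, _ :: as), Right(v1))  => bs ++ retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
    bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil))       => bs ++ List(S)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ++ List(Z) ++ retrieve(a1, v1) ++ retrieve(ASTAR(Nil, a1), Stars(vs))
}
```

For instance, retrieving from $a^*$ (annotated with empty bits) and the value $\Stars\,[\Char\,a]$ reassembles the bitcode $[\Z,\S]$.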
       
  1492 %\comment{Did not read further}\\
       
  1493 This function assembles the bitcode 
       
  1494 %that corresponds to a lexical value for how
       
  1495 %the current derivative matches the suffix of the string(the characters that
       
  1496 %have not yet appeared, but will appear as the successive derivatives go on.
       
  1497 %How do we get this "future" information? By the value $v$, which is
       
  1498 %computed by a pass of the algorithm that uses
       
  1499 %$inj$ as described in the previous section).  
       
  1500 using information from both the derivative regular expression and the
       
value. Sulzmann and Lu proposed this function, but did not prove
       
  1502 anything about it. Ausaf and Urban used it to connect the bitcoded
       
  1503 algorithm to the older algorithm by the following equation:
       
  1504  
       
 \begin{center} $\textit{inj}\;r\;c\;v = \textit{decode}\;(\textit{retrieve}\;
	 ((r^\uparrow)\backslash c)\;v)\;r$ 
 \end{center} 
       
  1508 
       
  1509 \noindent
       
  1510 whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf
       
and Urban also used this fact to prove the correctness of the bitcoded
algorithm without simplification.  Our purpose of using it, however,
is to establish 
       
  1514 
       
  1515 \begin{center}
       
  1516 $ \textit{retrieve} \;
       
  1517 a \; v \;=\; \textit{retrieve}  \; (\textit{simp}\,a) \; v'.$
       
  1518 \end{center}
       
The idea is that, using $v'$, a simplified version of $v$ that has gone
through the same simplification steps as $\textit{simp}(a)$, we are able
to extract the bitcode that gives the same parsing information as the
unsimplified one. However, we noticed that constructing such a $v'$
from $v$ is not so straightforward. The point is that we might
be able to finally bridge the gap by proving
       
  1525 
       
  1526 \begin{center}
       
  1527 $\textit{retrieve} \; (r^\uparrow   \backslash  s) \; v = \;\textit{retrieve} \;
       
  1528 (\textit{simp}(r^\uparrow)  \backslash  s) \; v'$
       
  1529 \end{center}
       
  1530 
       
  1531 \noindent
       
  1532 and subsequently
       
  1533 
       
  1534 \begin{center}
       
  1535 $\textit{retrieve} \; (r^\uparrow \backslash  s) \; v\; = \; \textit{retrieve} \;
       
  1536 (r^\uparrow  \backslash_{simp}  \, s) \; v'$.
       
  1537 \end{center}
       
  1538 
       
  1539 \noindent
       
  1540 The $\textit{LHS}$ of the above equation is the bitcode we want. This
       
  1541 would prove that our simplified version of regular expression still
       
  1542 contains all the bitcodes needed. The task here is to find a way to
       
  1543 compute the correct $v'$.
       
  1544 
       
The second task is to speed up the more aggressive simplification.  Currently
it is slower than the original naive simplification by Ausaf and Urban (their
naive version can of course ``explode'' in
some cases).  It is therefore not surprising that it is also much slower
than the regular expression engines in popular programming languages, such as Java
and Python, on most inputs that can be matched in linear time. For example, simply rewriting the
example regular expression from the beginning of this report, $(a^*)^*\,b$, into
$a^*\,b$ eliminates the ambiguity in the matching and makes the
matching time linear in the size of the input string. This allows the 
DFA approach to become blindingly fast and to dwarf the speed of our current
implementation. For example, here is a comparison of the Java regex engine 
and our implementation on this example.
       
  1557 
       
  1558 \begin{center}
       
  1559 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
  1560 \begin{tikzpicture}
       
  1561 \begin{axis}[
       
  1562     xlabel={$n*1000$},
       
  1563     x label style={at={(1.05,-0.05)}},
       
  1564     ylabel={time in secs},
       
  1565     enlargelimits=false,
       
  1566     xtick={0,5,...,30},
       
  1567     xmax=33,
       
  1568     ymax=9,
       
  1569     scaled ticks=true,
       
  1570     axis lines=left,
       
  1571     width=5cm,
       
  1572     height=4cm, 
       
  1573     legend entries={Bitcoded Algorithm},  
       
  1574     legend pos=north west,
       
  1575     legend cell align=left]
       
  1576 \addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data};
       
  1577 \end{axis}
       
  1578 \end{tikzpicture}
       
  1579   &
       
  1580 \begin{tikzpicture}
       
  1581 \begin{axis}[
       
  1582     xlabel={$n*1000$},
       
  1583     x label style={at={(1.05,-0.05)}},
       
  1584     %ylabel={time in secs},
       
  1585     enlargelimits=false,
       
  1586     xtick={0,5,...,30},
       
  1587     xmax=33,
       
  1588     ymax=9,
       
  1589     scaled ticks=false,
       
  1590     axis lines=left,
       
  1591     width=5cm,
       
  1592     height=4cm, 
       
  1593     legend entries={Java},  
       
  1594     legend pos=north west,
       
  1595     legend cell align=left]
       
  1596 \addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data};
       
  1597 \end{axis}
       
  1598 \end{tikzpicture}\\
       
  1599 \multicolumn{3}{c}{Graphs: Runtime for matching $a^*\,b$ with strings 
       
  1600            of the form $\underbrace{aa..a}_{n}$.}
       
  1601 \end{tabular}    
       
  1602 \end{center}  
       
  1603 
       
  1604 
       
The Java regex engine can match strings of thousands of characters in a few milliseconds,
whereas our current algorithm becomes excruciatingly slow on inputs of this size.
The running time is linear in theory, but this does not appear to be the 
case in the actual implementation. Therefore it needs to be explored how to
make our algorithm faster on all inputs.  It could be that the recursive calls
needed to manipulate the bits are causing the slowdown. A possible solution
is to rewrite the recursive functions in tail-recursive form.
Another possibility would be to explore
again the connection to DFAs in order to speed up the algorithm on 
subcalls that are small enough. This is very much work in progress.
       
  1615 
       
  1616 \section{Conclusion}
       
  1617 
       
  1618 In this PhD-project we are interested in fast algorithms for regular
       
  1619 expression matching. While this seems to be a ``settled'' area, in
       
  1620 fact interesting research questions are popping up as soon as one steps
       
  1621 outside the classic automata theory (for example in terms of what kind
       
  1622 of regular expressions are supported). The reason why it is
       
  1623 interesting for us to look at the derivative approach introduced by
       
  1624 Brzozowski for regular expression matching, and then much further
       
  1625 developed by Sulzmann and Lu, is that derivatives can elegantly deal
       
  1626 with some of the regular expressions that are of interest in ``real
       
  1627 life''. This includes the not-regular expression, written $\neg\,r$
       
  1628 (that is all strings that are not recognised by $r$), but also bounded
       
regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is
       
  1630 also hope that the derivatives can provide another angle for how to
       
  1631 deal more efficiently with back-references, which are one of the
       
  1632 reasons why regular expression engines in JavaScript, Python and Java
       
  1633 choose to not implement the classic automata approach of transforming
       
  1634 regular expressions into NFAs and then DFAs---because we simply do not
       
  1635 know how such back-references can be represented by DFAs.
       
  1636 We also plan to implement the bitcoded algorithm
       
  1637 in some imperative language like C to see if the inefficiency of the 
       
  1638 Scala implementation
       
is language-specific. To make this research more comprehensive, we also plan
       
to contrast our (faster) version of the bitcoded algorithm with the
Symbolic Regex Matcher, RE2, the Rust regex engine, and the static
analysis approach by implementing them in the same language and then
comparing their performance.
       
  1644 
       
  1645 \bibliographystyle{plain}
       
  1646 \bibliography{root,regex_time_complexity}
       
  1647 
       
  1648 
       
  1649 
       
  1650 \end{document}