     1 \documentclass[a4paper,UKenglish]{lipics}
       
     2 \usepackage{graphic}
       
     3 \usepackage{data}
       
     4 
       
     5 %\usepackage{algorithm}
       
     6 \usepackage{amsmath}
       
     7 \usepackage[noend]{algpseudocode}
       
     8 \usepackage{enumitem}
       
     9 \usepackage{nccmath}
       
    10 \usepackage{tikz-cd}
       
    11 \usetikzlibrary{positioning}
       
    12 
       
    13 \definecolor{darkblue}{rgb}{0,0,0.6}
       
    14 \hypersetup{colorlinks=true,allcolors=darkblue}
       
    15 \newcommand{\comment}[1]%
       
    16 {{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}
       
    17 
       
    18 % \documentclass{article}
       
    19 %\usepackage[utf8]{inputenc}
       
    20 %\usepackage[english]{babel}
       
    21 %\usepackage{listings}
       
    22 % \usepackage{amsthm}
       
    23 %\usepackage{hyperref}
       
    24 % \usepackage[margin=0.5in]{geometry}
       
    25 %\usepackage{pmboxdraw}
       
    26  
       
    27 \title{POSIX Regular Expression Matching and Lexing}
       
    28 \author{Chengsong Tan}
       
    29 \affil{King's College London\\
       
    30 London, UK\\
       
    31 \texttt{chengsong.tan@kcl.ac.uk}}
       
    32 \authorrunning{Chengsong Tan}
       
    33 \Copyright{Chengsong Tan}
       
    34 
       
    35 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
    36 \newcommand{\ZERO}{\mbox{\bf 0}}
       
    37 \newcommand{\ONE}{\mbox{\bf 1}}
       
    38 \def\lexer{\mathit{lexer}}
       
    39 \def\mkeps{\mathit{mkeps}}
       
    40 
       
    41 \def\DFA{\textit{DFA}}
       
    42 \def\bmkeps{\textit{bmkeps}}
       
    43 \def\retrieve{\textit{retrieve}}
       
    44 \def\blexer{\textit{blexer}}
       
    45 \def\flex{\textit{flex}}
       
    46 \def\inj{\mathit{inj}}
       
    47 \def\Empty{\mathit{Empty}}
       
    48 \def\Left{\mathit{Left}}
       
    49 \def\Right{\mathit{Right}}
       
    50 \def\Stars{\mathit{Stars}}
       
    51 \def\Char{\mathit{Char}}
       
    52 \def\Seq{\mathit{Seq}}
       
    53 \def\Der{\mathit{Der}}
       
    54 \def\nullable{\mathit{nullable}}
       
    55 \def\Z{\mathit{Z}}
       
    56 \def\S{\mathit{S}}
       
    57 \def\rup{r^\uparrow}
       
    58 
       
    59 \newcommand{\PDER}{\textit{PDER}}
       
    60 \newcommand{\flts}{\textit{flts}}
       
    61 \newcommand{\distinctBy}{\textit{distinctBy}}
       
    62 \newcommand{\map}{\textit{map}}
       
    63 \newcommand{\size}{\textit{size}}
       
    64 \def\awidth{\mathit{awidth}}
       
    65 \def\pder{\mathit{pder}}
       
    66 \def\maxterms{\mathit{maxterms}}
       
    67 \def\bsimp{\mathit{bsimp}}
       
    68 
       
    69 %\theoremstyle{theorem}
       
    70 %\newtheorem{theorem}{Theorem}
       
    71 %\theoremstyle{lemma}
       
    72 %\newtheorem{lemma}{Lemma}
       
    73 %\newcommand{\lemmaautorefname}{Lemma}
       
    74 %\theoremstyle{definition}
       
    75 %\newtheorem{definition}{Definition}
       
    76 \algnewcommand\algorithmicswitch{\textbf{switch}}
       
    77 \algnewcommand\algorithmiccase{\textbf{case}}
       
    78 \algnewcommand\algorithmicassert{\texttt{assert}}
       
    79 \algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
       
    80 % New "environments"
       
    81 \algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
       
    82 \algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
       
    83 \algtext*{EndSwitch}%
       
    84 \algtext*{EndCase}%
       
    85 
       
    86 
       
    87 \begin{document}
       
    88 
       
    89 \maketitle
       
    90 
       
    91 \begin{abstract}
       
    92   Brzozowski introduced in 1964 a beautifully simple algorithm for
       
    93   regular expression matching based on the notion of derivatives of
       
    94   regular expressions. In 2014, Sulzmann and Lu extended this
       
    95   algorithm to not just give a YES/NO answer for whether or not a
       
regular expression matches a string, but, in case it does, to also
answer \emph{how} it matches the string.  This is important for
       
    98   applications such as lexing (tokenising a string). The problem is to
       
    99   make the algorithm by Sulzmann and Lu fast on all inputs without
       
   100   breaking its correctness. We have already developed some
       
   101   simplification rules for this, but have not yet proved that they
       
   102   preserve the correctness of the algorithm. We also have not yet
       
   103   looked at extended regular expressions, such as bounded repetitions,
       
   104   negation and back-references.
       
   105 \end{abstract}
       
   106 
       
   107 \section{Introduction}
       
   108 \subsection{Basic Regex Introduction}
       
   109 
       
   110 Suppose (basic) regular expressions are given by the following grammar:
       
   111 \[			r ::=   \ZERO \mid  \ONE
       
   112 			 \mid  c  
       
   113 			 \mid  r_1 \cdot r_2
       
   114 			 \mid  r_1 + r_2   
       
   115 			 \mid r^*         
       
   116 \]
       
   117 
       
   118 Problem of matching:
       
   119 
       
   120 \begin{center}
       
   121 \begin{tabular}{lcr}
       
   122 $\textit{Match}(r, s)$ & $ =  $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\
       
   123 &				& $\textit{else} \; \textit{output} \; \textit{NO}$
       
   124 \end{tabular}
       
   125 \end{center}
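To fix notation for later, here is a minimal Scala rendering of this grammar (our own sketch with our own constructor names; the matcher itself comes later via derivatives):
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class STAR(r: Rexp) extends Rexp            // r*
\end{verbatim}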
       
Regexes are omnipresent in modern
software.
Examples: Snort, Bro, etc?
\subsubsection{The rules for network intrusion analysis tools}
TODO: read rules libraries such as Snort and the explanation for some of the rules
TODO: pcre/pcre2?
TODO: any other libraries?


There have been many widely used libraries such as
Henry Spencer's regexp(3), RE2, etc.
They are fast and successful, but ugly corner cases
allowing $\textit{ReDoS}$ attacks exist, and
such cases form a non-negligible portion.
       
   140 \subsection{The practical problem}
       
   141 These corner cases either
       
   142 \begin{itemize}
       
   143 \item
       
   144 go unnoticed until they
       
   145 cause considerable grief in real life
       
   146 \item
       
   147 or force the regex library writers to pose
       
   148 restrictions on the input, limiting the 
       
   149 choice a programmer has when using regexes.
       
   150 \end{itemize}
       
   151 
       
   152 Motivation:
       
   153 We want some library that supports as many constructs as possible,
       
   154 but still gives formal guarantees on the correctness and running
       
   155 time.
       
   156 
       
\subsection{Regexes that brought down Cloudflare}


Matching some string $s$ with the regex
       
   161 
       
   162 \begin{verbatim}
       
   163 (?:(?:\"|'|\]|\}|\\|\d|
       
   164 (?:nan|infinity|true|false|null|undefined|symbol|math)
       
   165 |\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) 
       
   166 \end{verbatim}
       
   167 
       
   168 
       
   169 %Could be from a network intrusion detection algorithm.
       
   170 %Checking whether there is some malicious code 
       
   171 %in the network data blocks being routed.
       
   172 %If so, discard the data and identify the sender for future alert.
       
   173 \section{Existing approaches}
       
   174 \subsection{Shortcomings of different methods}
       
   175 
       
   176 
       
\subsubsection{NFAs}
\textbf{Problems with this approach:}
\begin{itemize}
\item
Can be slow, especially when many states are active.
\item
We want lexing results: there can be exponentially many different matching results.
\end{itemize}
       
   185 
       
   186 
       
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$
       
Even when a regular expression is matched against one particular string, there can
still be more than one value corresponding to the match.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching,
without allowing any value under a star to be flattened
to an empty string, can be given by the following recurrence:
       
   199 \begin{center}
       
   200 	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
       
   201 \end{center}
       
   202 and a closed form formula can be calculated to be
       
   203 \begin{equation}
       
   204 	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
       
   205 \end{equation}
       
which clearly grows exponentially.
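The recurrence can be checked, and its growth observed, with a few lines of Scala (our own sketch; $C_1 = 2$ when following the recurrence as stated):
\begin{verbatim}
// C_n = (n+1) + n*C_1 + (n-1)*C_2 + ... + 2*C_{n-1}
def C(n: Int): BigInt =
  BigInt(n + 1) + (1 until n).map(k => BigInt(n + 1 - k) * C(k)).sum

// (1 to 6).map(C) gives 2, 7, 24, 82, 280, 956; successive ratios
// approach 2 + sqrt(2) ~ 3.414, i.e. exponential growth
\end{verbatim}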
       
A lexer that aims to generate all possible values therefore has an exponential
worst-case runtime, so it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which intuitively always
match the leftmost regular expression when there is a choice
and always match a sub-part as much as possible before proceeding
to the next token. For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output of the algorithm we want is a POSIX match
encoded as a value.\\
       
   217 $\mathbf{TODO:}$
       
   218 \begin{itemize}
       
   219 \item
       
   220 Illustrate graphically how you can match $a*a**$ with $aaa$ in different ways. 
       
   221 \item
       
Give a backtracking algorithm, and explain briefly why this can be exponentially slow.
(When there is a match, it is found straight away; when there is none, the algorithm fails to
recognize immediately that a match cannot possibly be found, and tries out all remaining
possibilities, etc.)
       
   226 \item
       
From the above point, are there static analysis tools that single out those malicious
patterns and tell before a lexer is even run?
Have a more thorough survey of the Birmingham paper.
Give the suitable scenarios for such static analysis algorithms.
       
   231 
       
   232 \end{itemize}
       
   233 
       
   234 \subsubsection{DFAs}
       
The tool JFlex uses them.
Advantages: super fast on most regexes.\\
TODO: show it being fast on a lot of inputs\\
Disadvantages:
state explosion for bounded repetitions, due to
the theoretical bottleneck of having to remember exactly what the
suffix of the input string up to length $n$ is.
``Countdown states activation problem'':
$.*a.\{100\}$ requires over $2^{100}$ DFA states.
Example:
converting $((a|b)^*b.\{10\})\{3\}$ to a $\DFA$
gives the error:
       
   247 \begin{verbatim}
       
   248 147972 states before minimization, 79107 states in minimized DFA
       
   249 Old file "search.java" saved as "search.java~"
       
   250 Writing code to "search.java"
       
   251 
       
   252 Unexpected exception encountered. This indicates a bug in JFlex.
       
   253 Please consider filing an issue at http://github.com/jflex-de/jflex/issues/new
       
   254 
       
   255 
       
   256 character value expected
       
   257 java.lang.IllegalArgumentException: character value expected
       
   258         at jflex.generator.PackEmitter.emitUC(PackEmitter.java:105)
       
   259         at jflex.generator.CountEmitter.emit(CountEmitter.java:116)
       
   260         at jflex.generator.Emitter.emitDynamicInit(Emitter.java:530)
       
   261         at jflex.generator.Emitter.emit(Emitter.java:1369)
       
   262         at jflex.generator.LexGenerator.generate(LexGenerator.java:115)
       
   263         at jflex.Main.generate(Main.java:320)
       
   264         at jflex.Main.main(Main.java:336)
       
   265 \end{verbatim}
       
   266 
       
\subsubsection{Variants of DFAs}
Counting-set automata.
\\
TODO: Microsoft's 2020 OOPSLA CsA work; need to add a bib entry, read it, understand the key novelty, and learn to benchmark like it.
\\
TODO: find weaknesses of such counting-set automata?
\\
Other variants?
       
   275 
       
   276 \subsubsection{NFA and Regex: isomorphic structure}
       
   277 TODO: define mathematically an isomorphism?\\
       
   278 
       
   279 
       
   280 
       
\subsubsection{Variants of NFAs}
How about acting on regular expressions themselves? Certain copies carry redundant information (they will always match the same string), so prune them away!
       
   283 
       
   284 \subsection{Brzozowski's derivatives}
       
   285 
       
   286 \subsection{Sulzmann and Lu's algorithm}
       
   287 
       
   288 \subsection{Bit-coded algorithm}
       
   289 +bitcodes!
       
   290 Built on top of derivatives, but with auxiliary bits
       
   291 \subsection{Correctness Proof}
       
   292 
       
   293 Not proven by Sulzmann  and Lu
       
   294 
       
   295 Proven by Ausaf and Urban!!
       
   296 
       
   297 
       
For this we have started by looking at the proof of
       
   299 \begin{equation}\label{lexer}
       
   300 \blexer \; (r^\uparrow)  s = \lexer \;r \;s,
       
   301 \end{equation}
       
   302 
       
   303 %\noindent
       
   304 %might provide us insight into proving 
       
   305 %\begin{center}
       
   306 %$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
       
   307 %\end{center}
       
   308 
       
   309 \noindent
       
   310 which established that the bit-sequence algorithm produces the same
       
   311 result as the original algorithm, which does not use 
       
   312 bit-sequences.
       
   313 The proof uses two ``tricks''. One is that it uses a \flex-function
       
   314 
       
   315 \begin{center}
       
   316 \begin{tabular}{lcl}
       
   317 $\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \;  (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
       
   318 $\textit{flex} \;r\; f\;  [\,] $ & $\dn$ & $f$
       
   319 \end{tabular}
       
   320 \end{center}
       
   321 
       
   322 \noindent
       
   323 
       
The intuition behind the $\flex$ function is that
it accumulates a series of $\inj$ function applications while doing derivatives,
in a $\mathit{LIFO}$ manner. The arguments of the $\inj$ functions are kept by
remembering which character
was chopped off and what the regular expression looked like before
chopping off that character.
The $\mathit{LIFO}$ order is achieved by always putting the newest $\inj$ application
before the application of $f$, the previously accumulated function applications.\\
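For instance, unfolding the definition on a two-character string gives
\begin{center}
$\flex\; r\; \textit{id}\; [c_1, c_2]\; v \;=\; \inj\; r\; c_1\; (\inj\; (r\backslash c_1)\; c_2\; v)$,
\end{center}
\noindent
where the injection for the last character $c_2$ sits innermost and is applied first.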
       
Therefore, when the function $\flex$ is applied to a string $s@[c]$ whose last
character is $c$, its last injection function can be revealed already:
\begin{equation}\label{flex}
\flex \; r \; id \; (s@[c]) \; v = \flex \;  r \; id \; s \; (\inj \; (r\backslash s) \; c\; v),
\end{equation}
that is, the last character can be taken off and the injection it causes applied to
the argument value $v$.
       
   339 
       
Ausaf and Urban proved that Sulzmann and Lu's lexer
can be characterised by the $\flex$ function:
\begin{center}
$\lexer \;r\; s = \flex \;r\; \textit{id} \; s \;(\mkeps \; (r\backslash s))$.
\end{center}
       
   345 
       
   346 \noindent
       
This property says that Sulzmann and Lu's $\lexer$ does lexing by
       
   348 stacking up injection functions while doing derivatives,
       
   349 explicitly showing the order of characters being
       
   350 injected back in each step.
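To make this concrete, the following is a minimal executable sketch of this lexer in Scala (our own rendering, reusing the \texttt{Rexp} datatype sketched in the introduction; it is not Sulzmann and Lu's code):
\begin{verbatim}
sealed trait Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// mkeps: how the remaining regex matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

// inj: put the chopped-off character c back into the value
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (CHAR(_), Empty) => Chr(c)
}

// flex: stack up the injections in LIFO order while deriving
def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val = s match {
  case Nil => f
  case c :: cs => flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
}

def lexer(r: Rexp, s: List[Char]): Val =
  flex(r, (v: Val) => v, s)(mkeps(ders(s, r)))
\end{verbatim}
For example, \texttt{lexer} applied to $(a+b)^*$ and the string \texttt{ab} returns the value $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$ discussed earlier.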
       
   351 
       
   352 \noindent
       
   353 The other trick, which is the crux in the existing proof, 
       
   354 is the use of the $\retrieve$-function:
       
   355 \begin{center}
       
   356 \begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
       
   357   $\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
       
   358   $\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,d)$ & $\dn$ & $bs$\\
       
   359   $\textit{retrieve}\,(_{bs}\sum a::as)\,(\Left\,v)$ & $\dn$ &
       
   360      $bs \,@\, \textit{retrieve}\,a\,v$\\
       
   361   $\textit{retrieve}\,(_{bs}\sum a::as)\,(\Right\,v)$ & $\dn$ &
       
   362   $\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\sum as)\,v$\\
       
   363   $\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
       
   364      $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
       
   365   $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
       
   366      $bs \,@\, [0]$\\
       
   367   $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
       
   368   \multicolumn{3}{l}{
       
   369      \hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
       
   370                     \textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
       
   371 \end{tabular}
       
   372 \end{center}
       
   373 
       
   374 \noindent
       
Sulzmann and Lu proposed this function, but did not prove
anything about it. Ausaf and Urban made use of the following
fact about $\retrieve$ in their proof:
       
 \begin{equation}\label{retrieve_reversible}
 \retrieve\; (\rup \backslash c) \; v = \retrieve \;  \rup \; (\inj \;r \;c \; v)
 \end{equation}
This says that $\retrieve$ will always pick up
partial information about a lexing value and transform it into suitable bitcodes.
If the information is in the regular expression (stored as bitcodes), it will keep those
bitcodes with the guidance of the value;
if the information is in the value (having been injected back by $\inj$),
it will ``digest'' and transform that part of the value into bitcodes.
       
   387 
       
   388 \noindent
       
   389 
       
Using this together with~\eqref{flex}, we can prove that the bitcoded version of the
lexer is the same as Sulzmann and Lu's lexer:
\begin{center}
$\lexer \; r \; s =  \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{bmkeps}\; (\rup \backslash s) )\; r = \blexer \; r \; s$
\end{center}
\noindent
where $v = \mkeps \; (r\backslash s)$.
       
   396 \begin{proof}
       
We express $\bmkeps$ using $\retrieve$, and the theorem to prove becomes:
\begin{center}
$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
\end{center}
\noindent
We prove the above by reverse induction on the string $s$ (meaning that the inductive step is on
$s @ [c]$ rather than $c :: s$), where
$v$ takes arbitrary values.\\
The base case goes through trivially.\\
For the inductive step, assume that
$ \flex \; r\; id\; s\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; v \;) r$
holds for all values $v$. We need to show that
$ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash (s@[c])) \; v \;) r$.\\
       
   410 ~\eqref{flex} allows us to do the following rewrite:
       
   411 \begin{center} 
       
   412 $ \flex \; r\; id\; (s@[c])\; v = \flex \; r \; id\; s\; (\inj \; (r \backslash s) \; c\; v)=  \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
       
   413 \end{center}
       
   414 ~\eqref{retrieve_reversible} allows us  to further rewrite the $\mathit{RHS}$ of the above to
       
   415 \begin{center}
       
   416 $\textit{decode} \; (\textit{retrieve}\; (\rup \backslash (s @ [c])) \; v\;)  \;r$
       
   417 \end{center}
       
   418 
       
   419 
       
   420 \end{proof}
       
   421 
       
   422 
       
   423 
       
   424 \section{My Work}
       
   425 
       
\subsection{an improved version of the bit-coded algorithm: with simp!}

\subsection{a correctness proof for the bitcoded algorithm}

\subsection{finiteness proof}
       
   431 \subsubsection{closed form}
       
We can give the derivative of a regular expression
with respect to a string a closed form, modulo simplification:
       
   434 \begin{itemize}
       
   435 \item
       
   436 closed form for sequences:
       
   437 \begin{verbatim}
       
   438 lemma seq_closed_form: shows 
       
   439 "rsimp (rders_simp (RSEQ r1 r2) s) = 
       
   440 rsimp ( RALTS ( (RSEQ (rders_simp r1 s) r2) # 
       
   441                 (map (rders r2) (vsuf s r1)) 
       
   442               )  
       
   443       )"
       
   444 \end{verbatim}
       
where the recursive function $\textit{vsuf}$ is defined as follows (a Scala sketch of the helper functions is given after this list):
       
   446 \begin{verbatim}
       
   447 fun vsuf :: "char list -> rrexp -> char list list" where
       
   448 "vsuf [] _ = []"
       
   449 |"vsuf (c#cs) r1 = (if (rnullable r1) then  (vsuf cs (rder c r1)) @ [c # cs]
       
   450                                       else  (vsuf cs (rder c r1))
       
   451                    ) "
       
   452 
       
   453 \end{verbatim}
       
   454 \item
       
   455 closed form for stars:
       
   456 \begin{verbatim}
       
   457 lemma star_closed_form:
       
   458   shows "rders_simp (RSTAR r0) (c#s) = 
       
   459 rsimp ( RALTS ( 
       
   460 (map (\lambda s1. RSEQ (rders_simp r0 s1) (RSTAR r0) ) 
       
(star_updates s r0 [[c]]) ) ))"
       
   462 \end{verbatim}
       
   463 where the recursive function $\textit{star}\_\textit{updates}$ is defined as
       
   464 \begin{verbatim}
       
   465 fun star_update :: "char -> rrexp -> char list list -> char list list" where
       
   466 "star_update c r [] = []"
       
   467 |"star_update c r (s # Ss) = (if (rnullable (rders_simp r s)) 
       
   468                                 then (s@[c]) # [c] # (star_update c r Ss) 
       
   469                                else   (s@[c]) # (star_update c r Ss) )"
       
   470 
       
   471 fun star_updates :: "char list -> rrexp -> char list list -> char list list"
       
   472   where
       
   473 "star_updates [] r Ss = Ss"
       
   474 | "star_updates (c # cs) r Ss = star_updates cs r (star_update c r Ss)"
       
   475 
       
   476 \end{verbatim}
       
   477 
       
   478 
       
   479 \end{itemize}
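The two recursive helper functions can be transcribed into Scala roughly as follows (a sketch only: we reuse \texttt{Rexp}, \texttt{nullable}, \texttt{der} and \texttt{ders} from the lexer sketch in the correctness-proof section, standing in for Isabelle's \texttt{rnullable}, \texttt{rder} and \texttt{rders\_simp} on the \texttt{rrexp} type):
\begin{verbatim}
def vsuf(s: List[Char], r: Rexp): List[List[Char]] = s match {
  case Nil => Nil
  case c :: cs =>
    if (nullable(r)) vsuf(cs, der(c, r)) :+ (c :: cs)
    else vsuf(cs, der(c, r))
}

def starUpdate(c: Char, r: Rexp,
               Ss: List[List[Char]]): List[List[Char]] = Ss match {
  case Nil => Nil
  case s :: rest =>
    if (nullable(ders(s, r)))
      (s :+ c) :: List(c) :: starUpdate(c, r, rest)
    else
      (s :+ c) :: starUpdate(c, r, rest)
}

def starUpdates(s: List[Char], r: Rexp,
                Ss: List[List[Char]]): List[List[Char]] = s match {
  case Nil => Ss
  case c :: cs => starUpdates(cs, r, starUpdate(c, r, Ss))
}
\end{verbatim}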
       
These closed forms are a formalization of the intuition
that we can push the derivatives
of a compound regular expression into its sub-expressions, and the resulting
expression is a linear combination of those sub-expressions' derivatives.
       
   484 \subsubsection{Estimation of closed forms' size}
       
And thanks to $\textit{distinctBy}$ helping with deduplication,
the linear combination can be bounded by the set enumerating all
regular expressions up to a certain size:
       
   488 \begin{verbatim}
       
   489 
       
   490 lemma star_closed_form_bounded_by_rdistinct_list_estimate:
       
   491   shows "rsize (rsimp ( RALTS ( (map (\lambda s1. RSEQ (rders_simp r0 s1) (RSTAR r0) )
       
         (star_updates s r0 [[c]]) ) ))) <=
       
   493         Suc (sum_list (map rsize (rdistinct (map (\lambda s1. RSEQ (rders_simp r0 s1) (RSTAR r0) )
       
         (star_updates s r0 [[c]]) ) {})  ) )"
       
   495          
       
   496          lemma distinct_list_rexp_up_to_certain_size_bouded_by_set_enumerating_up_to_that_size:
       
  shows "\forall r \in set rs. (rsize r) <= N ==> sum_list (map rsize (rdistinct rs {})) <=
       
   498          (card (rexp_enum N))* N"
       
   499          
       
   500          lemma ind_hypo_on_ders_leads_to_stars_bounded:
       
  shows "\forall s. rsize (rders_simp r0 s) <= N ==>
       
   502       (sum_list (map rsize (rdistinct (map (\lambda s1. RSEQ (rders_simp r0 s1) (RSTAR r0) )
       
         (star_updates s r0 [[c]]) ) {})  ) ) <=
       
   504 (card (rexp_enum (Suc (N + rsize (RSTAR r0))))) * (Suc (N + rsize (RSTAR r0)))
       
   505 "
       
   506 \end{verbatim}
       
   507 
       
With the above three lemmas, we can argue that the inductive hypothesis
that $r_0$'s derivatives are bounded above leads to $r_0^*$'s
derivatives being bounded above.
       
   511 \begin{verbatim}
       
   512 
       
   513 lemma r0_bounded_star_bounded:
       
  shows "\forall s. rsize (rders_simp r0 s) <= N ==>
             \forall s. rsize (rders_simp (RSTAR r0) s) <=
       
   516 (card (rexp_enum (Suc (N + rsize (RSTAR r0))))) * (Suc (N + rsize (RSTAR r0)))"
       
   517 \end{verbatim}
       
   518 
       
   519 
       
   520 And we have a similar argument for the sequence case.
       
   521 \subsection{stronger simplification needed}
       
   522 
       
   523 \subsubsection{Bounded List of Terms}
       
We have seen that without simplification the size of $(a+aa)^*$'s derivatives
grows exponentially and is unbounded (where we omit certain nested
parentheses among the four terms in the last explicitly written-out regex):
       
   527 
       
   528 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
       
   529 \begin{center}
       
   530 \begin{tabular}{rll}
       
   531 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   532 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   533 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
       
   534 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   535 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
       
   536 \end{tabular}
       
   537 \end{center}
       
   538 
       
But if we look closely at the regex
\begin{center}
$(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$
\end{center}
       
we realize that:
\begin{itemize}
\item
The regex is equivalent to an alternative taking a long flattened list,
where each element of the list is a sequence, and the second child of that sequence
is always $(a+aa)^*$. In other words, the regex is a ``linear combination''
of terms of the form $(a+aa)\backslash s \cdot (a+aa)^*$ ($s$ is any string).
\item
The number of different terms of the shape $(a+aa) \backslash s \cdot (a+aa)^*$ is
bounded because the first child $(a+aa) \backslash s$ can only be one of
$(\ZERO + \ZERO{}a + \ZERO)$, $(\ZERO + \ZERO{}a + \ONE)$,
$(\ONE + \ONE{}a)$ and $(\ZERO + \ZERO{}a)$.
\item
With simplification the regex is reduced even further:
each term $\bsimp((a+aa)\backslash s)$
is reduced to only
$\ONE$, $\ONE + a$ or $\ZERO$.
\end{itemize}
       
Generalizing this to any regular expression of the form
$\sum_{s\in L(r)} \bsimp(r\backslash s) \cdot r^*$,
we have the following closed form for the string derivative of a star regex:
\begin{center}
$\forall r \;s.\; \exists sl.\;\; \bsimp(r^* \backslash s) = \bsimp(\sum_{s'\in sl}(r\backslash s') \cdot r^*)$.
\end{center}
       
   570 
       
The regex $\bsimp(\sum_{s' \in sl}(r\backslash s') \cdot r^*)$ is bounded by
$\distinctBy(\flts(\sum_{s'\in sl}(\bsimp(r \backslash s')) \cdot r^*))$,
which again is bounded by $\distinctBy(\sum_{s'\in sl}(\bsimp(r\backslash s')) \cdot r^*)$.
This might give us a polynomial bound on the length of the list
$\distinctBy\,[(\bsimp(r\backslash s')) \cdot r^* \mid {s'\in sl}]$, if the terms in
$\distinctBy\,[(\bsimp (r\backslash s')) \mid {s' \in sl}]$ have a polynomial bound.
This is unfortunately not true under our current $\distinctBy$ function:
if we let $r_n = ((aa)^* + (aaa)^* + \ldots + (\underbrace{a\ldots a}_{n\ a\textrm{'s}})^*)$,
then we have that $\maxterms\; r_n = \sup\,(\textit{length}\, [\bsimp(r_n\backslash s') \mid s' \in sl]) =
\textit{lcm}(1,\ldots, n)$. According to \href{http://oeis.org/A003418}{OEIS A003418},
this grows exponentially quickly. So we have found a regex $r_n$ where
$\maxterms\, (r_n^* \backslash s) \geq 2^{n-1}$.
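To get a feel for how fast $\textit{lcm}(1,\ldots,n)$ grows, here is a quick Scala check (our own helper, not part of the formalisation):
\begin{verbatim}
// lcm(1,...,n), OEIS A003418, using lcm(a,b) = a*b / gcd(a,b)
def lcmUpTo(n: Int): BigInt =
  (1 to n).map(BigInt(_)).foldLeft(BigInt(1))((a, b) => a * b / a.gcd(b))

// lcmUpTo(10) == 2520,   lcmUpTo(20) == 232792560
\end{verbatim}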
       
   583 
       
   584 
       
   585 \subsubsection{stronger version of \distinctBy}
       
\href{https://www.researchgate.net/publication/340874991_Partial_derivatives_of_regular_expressions_and_finite_automaton_constructions}{Antimirov}
proved a linear bound on the number of terms in the ``partial derivatives'' of a regular expression:
       
   588 \begin{center}
       
   589 $\size (\PDER(r))  \leq \awidth (r)$.
       
   590 \end{center}
       
   591 
       
The proof is by structural induction on the regular expression $r$.
The hard cases are the sequence case $r_1\cdot r_2$ and the star case $r^*$.
The central idea that allows the induction to go through for this bound is the inclusion:
       
   595 \begin{center}
       
   596 $\pder_{s@[c]} (a\cdot b) \subseteq (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))$
       
   597 \end{center} 
       
   598 
       
   599 This way, 
       
   600 
       
   601 \begin{center}
       
   602 \begin{tabular}{lcl}
       
   603 $| \pder_{s@[c]} (a\cdot b) |$ & $ \leq$ & $ | (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
       
   604 & $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
       
& $\leq$ & $\awidth(a) + \awidth(b)$ \\
       
   606 & $=$ & $\awidth(a+b)$
       
   607 \end{tabular}
       
   608 \end{center} 
       
   609 
       
we have that the number of terms in a compound regular expression $a\cdot b$'s derivatives
is bounded by the number of terms in its sub-expressions' derivatives.
       
   612  
       
This argument can be modified to bound the terms in
our version of regular expressions with strong simplification:
       
   615 \begin{center}
       
   616 \begin{tabular}{lcl}
       
$| \maxterms (\bsimp  (a\cdot b) \backslash s)|$ & $=$ & $ |\maxterms(\bsimp( (a\backslash s \cdot b) + \sum_{s'\in sl}(b\backslash s') ))|$\\
& $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
& $\leq$ & $\awidth(a) + \awidth(b)$ \\
       
   620 & $=$ & $\awidth(a+b)$
       
   621 \end{tabular}
       
   622 \end{center} 
       
   623 
       
   624 
       
   625 
       
   626 
       
   627 \subsection{cubic bound}
       
We bound the number of a regex's subterms by
its alphabetic width.

The entire expression's size can then be bounded by the
number of subterms times each subterm's size.
       
   633 
       
   634 
       
   635 
       
   636 
       
   637 
       
   638 \section{Support for bounded repetitions and other constructs}
       
   639 Example: 
       
   640 $.*a.\{100\}$ after translation to $\DFA$ and minimization will 
       
   641 always take over $2^{100}$ states.
       
   642 
       
\section{Towards a library with fast running time in practice}

Register- and cache-related optimizations?
JVM-related optimizations?
       
   647 
       
   648 \section{Past Report materials}
       
   649 
       
Deciding whether a string is in the language of the regex
can be done intuitively by constructing an NFA\cite{Thompson_1968}
and simulating its execution.

This should be simple enough that modern programmers
have no problems with it at all, right?
Not really:
       
   657 
       
   658 Take $(a^*)^*\,b$ and ask whether
       
   659 strings of the form $aa..a$ match this regular
       
   660 expression. Obviously this is not the case---the expected $b$ in the last
       
   661 position is missing. One would expect that modern regular expression
       
   662 matching engines can find this out very quickly. Alas, if one tries
       
   663 this example in JavaScript, Python or Java 8 with strings like 28
       
   664 $a$'s, one discovers that this decision takes around 30 seconds and
       
   665 takes considerably longer when adding a few more $a$'s, as the graphs
       
   666 below show:
       
   667 
       
   668 \begin{center}
       
   669 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
   670 \begin{tikzpicture}
       
   671 \begin{axis}[
       
   672     xlabel={$n$},
       
   673     x label style={at={(1.05,-0.05)}},
       
   674     ylabel={time in secs},
       
   675     enlargelimits=false,
       
   676     xtick={0,5,...,30},
       
   677     xmax=33,
       
   678     ymax=35,
       
   679     ytick={0,5,...,30},
       
   680     scaled ticks=false,
       
   681     axis lines=left,
       
   682     width=5cm,
       
   683     height=4cm, 
       
   684     legend entries={JavaScript},  
       
   685     legend pos=north west,
       
   686     legend cell align=left]
       
   687 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
   688 \end{axis}
       
   689 \end{tikzpicture}
       
   690   &
       
   691 \begin{tikzpicture}
       
   692 \begin{axis}[
       
   693     xlabel={$n$},
       
   694     x label style={at={(1.05,-0.05)}},
       
   695     %ylabel={time in secs},
       
   696     enlargelimits=false,
       
   697     xtick={0,5,...,30},
       
   698     xmax=33,
       
   699     ymax=35,
       
   700     ytick={0,5,...,30},
       
   701     scaled ticks=false,
       
   702     axis lines=left,
       
   703     width=5cm,
       
   704     height=4cm, 
       
   705     legend entries={Python},  
       
   706     legend pos=north west,
       
   707     legend cell align=left]
       
   708 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   709 \end{axis}
       
   710 \end{tikzpicture}
       
   711   &
       
   712 \begin{tikzpicture}
       
   713 \begin{axis}[
       
   714     xlabel={$n$},
       
   715     x label style={at={(1.05,-0.05)}},
       
   716     %ylabel={time in secs},
       
   717     enlargelimits=false,
       
   718     xtick={0,5,...,30},
       
   719     xmax=33,
       
   720     ymax=35,
       
   721     ytick={0,5,...,30},
       
   722     scaled ticks=false,
       
   723     axis lines=left,
       
   724     width=5cm,
       
   725     height=4cm, 
       
   726     legend entries={Java 8},  
       
   727     legend pos=north west,
       
   728     legend cell align=left]
       
   729 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   730 \end{axis}
       
   731 \end{tikzpicture}\\
       
   732 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   733            of the form $\underbrace{aa..a}_{n}$.}
       
   734 \end{tabular}    
       
   735 \end{center}  
       
   736 
       
Why?
Because these engines use $\textit{NFA}$s that can backtrack.
       
   739 %TODO: what does it mean to do DFS BFS on NFA's
       
   740 	
       
   741 
       
Then how about determinization?
\begin{itemize}
\item
Turning NFAs into DFAs can cause the size of the automata
to blow up exponentially.
\item
We want to extract submatch information.
For example, if
$r_1 \cdot r_2$ matches $s$, we
want to know $s = s_1@s_2$ where $s_i$
corresponds to $r_i$, and where $s_i$ might be the
attacker's IP address.
\item
Variants such as counting automata exist,
but they are usually made super fast only on a certain class
of regexes like bounded repetitions:
\begin{verbatim}
.*a.{100}
\end{verbatim}
On a lot of inputs this works very well,
with good practical performance on average
($\sim$10MiB per second).
But they cannot be super fast on all inputs of regexes and strings,
and can be imprecise (incorrect) when it comes to more complex regexes.

\end{itemize}
       
   768 %TODO: real world example?
       
   769 
       
   770 
       
   771 
       
   772 \subsection{derivatives}
       
   773 Q:
       
   774 Is there an efficient lexing algorithm with provable guarantees on 
       
   775 correctness and running time?
       
   776 Brzozowski Derivatives\cite{Brzozowski1964}!
       
   777 
       
   778 \begin{center}
       
   779 		\begin{tabular}{lcl}
       
   780 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   781 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   782 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   783 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   784 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   785 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   786 		\end{tabular}
       
   787 	\end{center}
       
   788 	
       
   789 	
       
   790 	
       
   791 This function simply tests whether the empty string is in $L(r)$.
       
   792 He then defined
       
   793 the following operation on regular expressions, written
       
   794 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
       
   795 
       
   796 \begin{center}
       
   797 \begin{tabular}{lcl}
       
   798 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   799 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   800 		$d \backslash c$     & $\dn$ & 
       
   801 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   802 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   803 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   804 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   805 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   806 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   807 \end{tabular}
       
   808 \end{center}
       
   809 
       
   810 
       
   811 
       
   812 
       
   813 \begin{ceqn}
       
   814 \begin{equation}\label{graph:01}
       
   815 \begin{tikzcd}
       
   816 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   817 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   818 \end{tikzcd}
       
   819 \end{equation}
       
   820 \end{ceqn}
       
Nicely functional, with correctness easily provable, but it suffers
from large stack usage on long strings and an
inability to perform even moderate simplification.
       
   824 
       
   825 
       
   826 
       
Sulzmann and Lu's bit-coded algorithm:
       
   828 \begin{figure}
       
   829 \centering
       
   830 \includegraphics[scale=0.3]{bitcoded_sulzmann.png}
       
   831 \end{figure}
       
   832 
       
   833 This one-phase algorithm is free from the burden of large stack usage:
       
   834 
       
   835 \begin{center}
       
   836 \begin{tikzpicture}[scale=2,node distance=1.9cm,
       
   837                     every node/.style={minimum size=7mm}]
       
   838 \node (r0)  {$r_0$};
       
   839 \node (r1) [right=of r0]{$r_1$};
       
   840 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash\,c_0$};
       
   841 \node (r2) [right=of r1]{$r_2$};
       
   842 \draw[->, line width = 0.2mm](r1)--(r2) node[above,midway] {$\backslash\,c_1$};
       
   843 \node (rn) [right=of r2]{$r_n$};
       
   844 \draw[dashed,->,line width=0.2mm](r2)--(rn) node[above,midway] {} ;
       
   845 \draw (rn) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
       
   846 \node (bs) [below=of rn]{$bs$};
       
   847 \draw[->,line width=0.2mm](rn) -- (bs);
       
   848 \node (v0) [left=of bs] {$v_0$};
       
   849 \draw[->,line width=0.2mm](bs)--(v0) node[below,midway] {$\textit{decode}$};
       
   850 \draw (rn) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{collect bits}$}};
       
   851 \draw[->, line width=0.2mm](v0)--(r0) node[below, midway] {};
       
   852 \end{tikzpicture}
       
   853 \end{center}
       
   854 
       
   855 
       
   856 
       
This is functional code, and easily provable (proof by Ausaf and Urban).

But it suffers from exponential blow-ups even with the simplification steps:
       
   860 \begin{figure}
       
   861 \centering
       
   862 \includegraphics[scale= 0.3]{pics/nub_filter_simp.png}
       
   863 \end{figure}
       
Claim: Sulzmann and Lu claimed it linear w.r.t.\ the input.

An example that blows it up:
$(a+aa)^*$
       
   868 \section{Contributions}
       
   869 \subsection{Our contribution 1}
       
An improved version of the above algorithm that solves most blow-up
cases, including the above example.
       
   872 
       
A formalized closed form for string derivatives:
\[ (\sum rs) \backslash_s s = \textit{simp}(\sum_{r \in rs}(r \backslash_s s)) \]
\[ (r_1\cdot r_2) \backslash_s s = \textit{simp}(r_1 \backslash_s s  \cdot r_2 + \sum_{s' \in \textit{Suffix}(s)} r_2 \backslash_s s' )\]
\[r_0^* \backslash_s s = \textit{simp}(\sum_{s' \in \textit{substr}(s)} (r_0 \backslash_s s') \cdot r_0^*) \]
       
   877 
       
   878 
       
   879 
       
Also with a size guarantee that makes sure the size of the derivatives
does not grow unboundedly.
       
   882 
       
   883 
       
\begin{theorem}
Given a regular expression $r$, we have
\begin{center}
$\exists N_r.\;\forall s. \; |r \backslash_s s| < N_r$
\end{center}
\end{theorem}
       
   890 
       
The proof uses the terms of partial derivatives to bound the terms in our derivatives:
       
   892 \begin{center}
       
   893 \begin{tabular}{lcl}
       
$| \maxterms (\bsimp  (a\cdot b) \backslash s)|$ & $=$ & $ |\maxterms(\bsimp( (a\backslash s \cdot b) + \sum_{s'\in sl}(b\backslash s') ))|$\\
& $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
& $\leq$ & $\awidth(a) + \awidth(b)$ \\
       
   897 & $=$ & $\awidth(a+b)$
       
   898 \end{tabular}
       
   899 \end{center} 
       
   900 
       
   901 
       
   902 \subsection{Our Contribution 2}
       
More aggressive simplification that prunes away sub-parts
of a regex based on what terms have appeared before,
which gives us a bound that is truly linear in the input length.
       
   906 
       
   907 
       
   908 
       
   909 \section{To be completed}
       
   910 
       
Benchmarking our algorithm against JFlex,
counting-set automata, Silex, and other main regex engines (incorporating their ideas such
as zippers and other data structures that reduce memory use).

Extend to back-references.
       
   916 
       
   917 
       
   918 
       
   919 
       
   920 
       
   921 
       
\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by this
DFA should be linear in the size of the regular expression and
the string.
       
   927 
       
   928 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
       
   929 exhibit this super-linear behaviour.  But unfortunately, such regular
       
   930 expressions are not just a few outliers. They are actually 
       
   931 frequent enough to have a separate name created for
       
them---\emph{evil regular expressions}. In empirical work, Davis et al
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}. A static analysis
approach that is both sound and complete exists \cite{17Bir}, but its running
time on certain examples in the RegExLib and Snort regular expression
       
   938 
       
   939 This superlinear blowup in matching algorithms sometimes causes
       
   940 considerable grief in real life: for example on 20 July 2016 one evil
       
   941 regular expression brought the webpage
       
   942 \href{http://stackexchange.com}{Stack Exchange} to its
       
   943 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
       
   944 In this instance, a regular expression intended to just trim white
       
   945 spaces from the beginning and the end of a line actually consumed
       
   946 massive amounts of CPU-resources---causing web servers to grind to a
       
   947 halt. This happened when a post with 20,000 white spaces was submitted,
       
   948 but importantly the white spaces were neither at the beginning nor at
       
   949 the end. As a result, the regular expression matching engine needed to
       
   950 backtrack over many choices. In this example, the time needed to process
       
   951 the string was $O(n^2)$ with respect to the string length. This
       
   952 quadratic overhead was enough for the homepage of Stack Exchange to
       
   953 respond so slowly that the load balancer assumed there must be some
       
   954 attack and therefore stopped the servers from responding to any
       
   955 requests. This made the whole site become unavailable. Another very
       
   956 recent example is a global outage of all Cloudflare servers on 2 July
       
   957 2019. A poorly written regular expression exhibited exponential
       
   958 behaviour and exhausted CPUs that serve HTTP traffic. Although the
       
   959 outage had several causes, at the heart was a regular expression that
       
   960 was used to monitor network
       
   961 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
       
   962 
       
   963 The underlying problem is that many ``real life'' regular expression
       
   964 matching engines do not use DFAs for matching. This is because they
       
   965 support regular expressions that are not covered by the classical
       
   966 automata theory, and in this more general setting there are quite a few
       
   967 research questions still unanswered and fast algorithms still need to be
       
   968 developed (for example how to treat efficiently bounded repetitions, negation and
       
   969 back-references).
       
   970 %question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
       
   971 %how do they avoid dfas exponential states if they use them for fast matching?
       
   972 
       
   973 There is also another under-researched problem to do with regular
       
   974 expressions and lexing, i.e.~the process of breaking up strings into
       
   975 sequences of tokens according to some regular expressions. In this
       
   976 setting one is not just interested in whether or not a regular
       
   977 expression matches a string, but also in \emph{how}.  Consider for
       
   978 example a regular expression $r_{key}$ for recognising keywords such as
       
   979 \textit{if}, \textit{then} and so on; and a regular expression $r_{id}$
       
   980 for recognising identifiers (say, a single character followed by
       
   981 characters or numbers). One can then form the compound regular
       
   982 expression $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But
       
   983 then how should the string \textit{iffoo} be tokenised?  It could be
       
   984 tokenised as a keyword followed by an identifier, or the entire string
       
   985 as a single identifier.  Similarly, how should the string \textit{if} be
       
   986 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
       
   987 ``fire''---so is it an identifier or a keyword?  While in applications
       
   988 there is a well-known strategy to decide these questions, called POSIX
       
   989 matching, only relatively recently precise definitions of what POSIX
       
   990 matching actually means have been formalised
       
   991 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a
       
   992 definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014},
       
   993 but the corresponding correctness proof turned out to be  faulty
       
   994 \cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching
       
   995 the longest initial substring. In the case of a tie, the initial
       
   996 sub-match is chosen according to some priorities attached to the regular
       
   997 expressions (e.g.~keywords have a higher priority than identifiers).
       
   998 This sounds rather simple, but according to Grathwohl et al \cite[Page
       
   999 36]{CrashCourse2014} this is not the case. They wrote:
       
  1000 
       
  1001 \begin{quote}
       
  1002 \it{}``The POSIX strategy is more complicated than the greedy because of 
       
  1003 the dependence on information about the length of matched strings in the 
       
  1004 various subexpressions.''
       
  1005 \end{quote}
       
  1006 
       
  1007 \noindent
       
  1008 This is also supported by evidence collected by Kuklewicz
       
  1009 \cite{Kuklewicz} who noticed that a number of POSIX regular expression
       
  1010 matchers calculate incorrect results.
       
  1011 
       
  1012 Our focus in this project is on an algorithm introduced by Sulzmann and
       
  1013 Lu in 2014 for regular expression matching according to the POSIX
       
  1014 strategy \cite{Sulzmann2014}. Their algorithm is based on an older
       
  1015 algorithm by Brzozowski from 1964 where he introduced the notion of
       
  1016 derivatives of regular expressions~\cite{Brzozowski1964}. We shall
       
  1017 briefly explain this algorithm next.
       
  1018 
       
  1019 \section{The Algorithm by Brzozowski based on Derivatives of Regular
       
  1020 Expressions}
       
  1021 
       
  1022 Suppose (basic) regular expressions are given by the following grammar:
       
  1023 \[			r ::=   \ZERO \mid  \ONE
       
  1024 			 \mid  c  
       
  1025 			 \mid  r_1 \cdot r_2
       
  1026 			 \mid  r_1 + r_2   
       
  1027 			 \mid r^*         
       
  1028 \]
       
  1029 
       
  1030 \noindent
       
  1031 The intended meaning of the constructors is as follows: $\ZERO$
       
  1032 cannot match any string, $\ONE$ can match the empty string, the
       
  1033 character regular expression $c$ can match the character $c$, and so
       
  1034 on.
       
  1035 
       
  1036 The ingenious contribution by Brzozowski is the notion of
       
  1037 \emph{derivatives} of regular expressions.  The idea behind this
       
  1038 notion is as follows: suppose a regular expression $r$ can match a
       
  1039 string of the form $c\!::\! s$ (that is a list of characters starting
       
  1040 with $c$), what does the regular expression look like that can match
       
  1041 just $s$? Brzozowski gave a neat answer to this question. He started
       
  1042 with the definition of $nullable$:
       
  1043 \begin{center}
       
  1044 		\begin{tabular}{lcl}
       
  1045 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
  1046 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
  1047 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
  1048 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
  1049 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
  1050 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
  1051 		\end{tabular}
       
  1052 	\end{center}
       
  1053 This function simply tests whether the empty string is in $L(r)$.
       
  1054 He then defined
       
  1055 the following operation on regular expressions, written
       
  1056 $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
       
  1057 
       
  1058 \begin{center}
       
  1059 \begin{tabular}{lcl}
       
  1060 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
  1061 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
  1062 		$d \backslash c$     & $\dn$ & 
       
  1063 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
  1064 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
  1065 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
  1066 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
  1067 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
  1068 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
  1069 \end{tabular}
       
  1070 \end{center}
       
  1071 
       
  1072 %Assuming the classic notion of a
       
  1073 %\emph{language} of a regular expression, written $L(\_)$, t
       
  1074 
       
  1075 \noindent
       
  1076 The main property of the derivative operation is that
       
  1077 
       
  1078 \begin{center}
       
  1079 $c\!::\!s \in L(r)$ holds
       
  1080 if and only if $s \in L(r\backslash c)$.
       
  1081 \end{center}
       
  1082 
       
  1083 \noindent
       
  1084 For us the main advantage is that derivatives can be
       
  1085 straightforwardly implemented in any functional programming language,
       
  1086 and are easily definable and reasoned about in theorem provers---the
       
  1087 definitions just consist of inductive datatypes and simple recursive
       
  1088 functions. Moreover, the notion of derivatives can be easily
       
  1089 generalised to cover extended regular expression constructors such as
       
  1090 the not-regular expression, written $\neg\,r$, or bounded repetitions
       
  1091 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
       
  1092 straightforwardly realised within the classic automata approach.
       
  1093 For the moment however, we focus only on the usual basic regular expressions.
       
  1094 
       
  1095 
       
  1096 Now if we want to find out whether a string $s$ matches with a regular
       
  1097 expression $r$, we can build the derivatives of $r$ w.r.t.\ (in succession)
       
  1098 all the characters of the string $s$. Finally, test whether the
       
resulting regular expression can match the empty string. If yes, then
$r$ matches $s$; if no, it does not. To implement this idea
       
  1101 we can generalise the derivative operation to strings like this:
       
  1102 
       
  1103 \begin{center}
       
  1104 \begin{tabular}{lcl}
       
  1105 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
  1106 $r \backslash [\,] $ & $\dn$ & $r$
       
  1107 \end{tabular}
       
  1108 \end{center}
       
  1109 
       
  1110 \noindent
       
and then define the following regular-expression matching algorithm:
       
  1112 \[
       
  1113 match\;s\;r \;\dn\; nullable(r\backslash s)
       
  1114 \]
       
  1115 
       
  1116 \noindent
       
  1117 This algorithm looks graphically as follows:
       
  1118 \begin{equation}\label{graph:*}
       
  1119 \begin{tikzcd}
       
  1120 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
       
  1121 \end{tikzcd}
       
  1122 \end{equation}
       
  1123 
       
  1124 \noindent
       
  1125 where we start with  a regular expression  $r_0$, build successive
       
  1126 derivatives until we exhaust the string and then use \textit{nullable}
       
to test whether the result can match the empty string. It can be shown
relatively easily that this matcher is correct (that is, given a string
$s = c_0...c_{n-1}$ and a regular expression $r_0$, it generates YES if and only if $s \in L(r_0)$).
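To make this matcher concrete, here is a minimal sketch of it in Scala,
written as Scala~3-style top-level definitions (the datatype and function
names are our own choices for illustration; this is not a verified
implementation):

\begin{verbatim}
// regular expressions as an inductive datatype
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// whether r can match the empty string
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative w.r.t. a single character
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a string, and the matcher itself
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}

\noindent
For instance, \texttt{matches(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb")}
evaluates to \texttt{true}.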
       
  1130 
       
  1131  
       
  1132 \section{Values and the Algorithm by Sulzmann and Lu}
       
  1133 
       
  1134 One limitation of Brzozowski's algorithm is that it only produces a
       
  1135 YES/NO answer for whether a string is being matched by a regular
       
  1136 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
       
  1137 to allow generation of an actual matching, called a \emph{value} or
       
  1138 sometimes also \emph{lexical value}.  These values and regular
       
  1139 expressions correspond to each other as illustrated in the following
       
  1140 table:
       
  1141 
       
  1142 
       
  1143 \begin{center}
       
  1144 	\begin{tabular}{c@{\hspace{20mm}}c}
       
  1145 		\begin{tabular}{@{}rrl@{}}
       
  1146 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
  1147 			$r$ & $::=$  & $\ZERO$\\
       
  1148 			& $\mid$ & $\ONE$   \\
       
  1149 			& $\mid$ & $c$          \\
       
  1150 			& $\mid$ & $r_1 \cdot r_2$\\
       
  1151 			& $\mid$ & $r_1 + r_2$   \\
       
  1152 			\\
       
  1153 			& $\mid$ & $r^*$         \\
       
  1154 		\end{tabular}
       
  1155 		&
       
  1156 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
  1157 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
  1158 			$v$ & $::=$  & \\
       
  1159 			&        & $\Empty$   \\
       
  1160 			& $\mid$ & $\Char(c)$          \\
       
  1161 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
  1162 			& $\mid$ & $\Left(v)$   \\
       
  1163 			& $\mid$ & $\Right(v)$  \\
       
  1164 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
  1165 		\end{tabular}
       
  1166 	\end{tabular}
       
  1167 \end{center}
       
  1168 
       
  1169 \noindent
       
  1170 No value  corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$;
       
  1171 $\Char$ to the character regular expression; $\Seq$ to the sequence
       
  1172 regular expression and so on. The idea of values is to encode a kind of
       
  1173 lexical value for how the sub-parts of a regular expression match the
       
  1174 sub-parts of a string. To see this, suppose a \emph{flatten} operation,
       
  1175 written $|v|$ for values. We can use this function to extract the
       
  1176 underlying string of a value $v$. For example, $|\mathit{Seq} \,
       
  1177 (\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
       
  1178 flatten, we can describe how values encode lexical values: $\Seq\,v_1\,
       
  1179 v_2$ encodes a tree with two children nodes that tells how the string
       
  1180 $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches
       
  1181 the substring $|v_1|$ and, respectively, $r_2$ matches the substring
       
  1182 $|v_2|$. Exactly how these two are matched is contained in the children
       
  1183 nodes $v_1$ and $v_2$ of parent $\textit{Seq}$. 
       
  1184 
       
  1185 To give a concrete example of how values work, consider the string $xy$
       
  1186 and the regular expression $(x + (y + xy))^*$. We can view this regular
       
  1187 expression as a tree and if the string $xy$ is matched by two Star
       
  1188 ``iterations'', then the $x$ is matched by the left-most alternative in
       
this tree and the $y$ by the left alternative nested inside the right
alternative. This suggests recording this matching as
       
  1191 
       
  1192 \begin{center}
       
  1193 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
       
  1194 \end{center}
       
  1195 
       
  1196 \noindent
       
  1197 where $\Stars \; [\ldots]$ records all the
       
  1198 iterations; and $\Left$, respectively $\Right$, which
       
  1199 alternative is used. The value for
       
  1200 matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
       
  1201 would look as follows
       
  1202 
       
  1203 \begin{center}
       
  1204 $\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
       
  1205 \end{center}
       
  1206 
       
  1207 \noindent
       
  1208 where $\Stars$ has only a single-element list for the single iteration
       
  1209 and $\Seq$ indicates that $xy$ is matched by a sequence regular
       
  1210 expression.
       
  1211 
       
  1212 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
  1213 algorithm by a second phase (the first phase being building successive
       
  1214 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
       
  1215 is generated in case the regular expression matches  the string. 
       
  1216 Pictorially, the Sulzmann and Lu algorithm is as follows:
       
  1217 
       
  1218 \begin{ceqn}
       
  1219 \begin{equation}\label{graph:2}
       
  1220 \begin{tikzcd}
       
  1221 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
  1222 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
  1223 \end{tikzcd}
       
  1224 \end{equation}
       
  1225 \end{ceqn}
       
  1226 
       
  1227 \noindent
       
  1228 For convenience, we shall employ the following notations: the regular
       
  1229 expression we start with is $r_0$, and the given string $s$ is composed
       
  1230 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
  1231 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
  1232 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
  1233 the derivative $r_n$. We test whether this derivative is
       
  1234 $\textit{nullable}$ or not. If not, we know the string does not match
       
  1235 $r$ and no value needs to be generated. If yes, we start building the
       
  1236 values incrementally by \emph{injecting} back the characters into the
       
  1237 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
  1238 algorithm from the right to left. For the first value $v_n$, we call the
       
  1239 function $\textit{mkeps}$, which builds the lexical value
       
  1240 for how the empty string has been matched by the (nullable) regular
       
  1241 expression $r_n$. This function is defined as
       
  1242 
       
  1243 	\begin{center}
       
  1244 		\begin{tabular}{lcl}
       
  1245 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
  1246 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
  1247 			& \textit{if} $\nullable(r_{1})$\\ 
       
  1248 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
  1249 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
  1250 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
  1252 		\end{tabular}
       
  1253 	\end{center}
       
  1254 
       
  1255 
       
\noindent There are no cases for $\ZERO$ and $c$, since
these regular expressions cannot match the empty string. Note
also that in case of alternatives we give preference to the
regular expression on the left-hand side. This will become
important later on for determining which value is calculated.
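As an illustration, values and the $\mkeps$ function can be sketched in
Scala as follows (the constructor names \texttt{Sequ} and \texttt{Chr} are
our own, chosen to avoid clashes with Scala's built-in \texttt{Seq} and
\texttt{Char}; the match is deliberately partial, since $\ZERO$ and
character regular expressions have no $\mkeps$-value):

\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// value for how a nullable r matches the empty string;
// partial: ZERO and CHAR cannot match the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}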
       
  1261 
       
  1262 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
  1263 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
  1264 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
  1265 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
  1266 matches $s$. For this Sulzmann and Lu defined a function that reverses
       
  1267 the ``chopping off'' of characters during the derivative phase. The
       
  1268 corresponding function is called \emph{injection}, written
       
  1269 $\textit{inj}$; it takes three arguments: the first one is a regular
       
  1270 expression ${r_{i-1}}$, before the character is chopped off, the second
       
  1271 is a character ${c_{i-1}}$, the character we want to inject and the
       
  1272 third argument is the value ${v_i}$, into which one wants to inject the
       
  1273 character (it corresponds to the regular expression after the character
       
  1274 has been chopped off). The result of this function is a new value. The
       
  1275 definition of $\textit{inj}$ is as follows: 
       
  1276 
       
  1277 \begin{center}
       
  1278 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
  1279   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
  1280   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
  1281   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
  1282   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
  1283   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
  1284   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
  1285   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
  1286 \end{tabular}
       
  1287 \end{center}
       
  1288 
       
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values. To understand this definition better, consider
the situation when we build the derivative of the regular expression $r_{i-1}$.
For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
``hole'' in $r_i$ and its corresponding value $v_i$.
To calculate $v_{i-1}$, we need to
locate where that hole is and fill it.
We can find this location by
comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
$r_a \cdot r_b$ and $v_i$ is of shape $\Left(\Seq(v_a,v_b))$, we know
immediately that $r_a$ must be nullable and that the derivative took the form
%
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]

\noindent
because if $r_a$ were not nullable, we would have

\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\]

\noindent
and the value $v_i$ would be of shape $\Seq(\ldots)$, contradicting the fact that
$v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
$\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
branch of \[ (r_a \cdot r_b)\backslash c =
\underline{(r_a\backslash c) \cdot r_b} \,+\, r_b\backslash c\]
(underlined) was taken instead of the right one. This means $c$ was chopped off
from $r_a$ rather than $r_b$.
We have therefore found out
that the hole is in $v_a$. So we recursively call $\inj\,
r_a\,c\,v_a$ to fill that hole. After injection, the value
$v_{i-1}$ for $r_{i-1} = r_a \cdot r_b$ is $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
The other clauses can be understood in a similar way.
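Transcribing the clauses above directly into Scala gives the following
sketch of $\textit{inj}$, together with a lexer that combines the two
phases of \eqref{graph:2} (again our own, unverified transcription; the
match is partial by design, mirroring the partiality of $\textit{inj}$):

\begin{verbatim}
// inject character c back into value v, guided by the shape of r
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (CHAR(_), Empty) => Chr(c)
}

// phase 1: derivatives; phase 2: mkeps followed by injections
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs).map(v => inj(r, c, v))
}
\end{verbatim}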
       
  1319 
       
  1320 %\comment{Other word: insight?}
       
The following example gives an insight into $\textit{inj}$'s effect and
how Sulzmann and Lu's algorithm works as a whole. Suppose we have the
regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it
against the string $abc$ (when $abc$ is written as a regular expression,
the standard way of expressing it is $a \cdot (b \cdot c)$, but we
usually omit the parentheses and dots here for better readability). This
algorithm returns a POSIX value, which means it will produce the longest
matching. Consequently, it matches the string $abc$ in one star
iteration, using the longest alternative $abc$ in the sub-expression
under the star (we shall use $r$ to denote this
sub-expression for conciseness):

\[ r \;\dn\; ((((a+b)+ab)+c)+abc) \]
       
  1333 
       
  1334 \noindent
       
Before $\textit{inj}$ is called, our lexer first builds the derivatives using
the string $abc$ (we simplified some regular expressions like $\ZERO \cdot
       
  1337 b$ to $\ZERO$ for conciseness; we also omit parentheses if they are
       
  1338 clear from the context):
       
  1339 
       
  1340 %Similarly, we allow
       
  1341 %$\textit{ALT}$ to take a list of regular expressions as an argument
       
  1342 %instead of just 2 operands to reduce the nested depth of
       
  1343 %$\textit{ALT}$
       
  1344 
       
  1345 \begin{center}
       
  1346 \begin{tabular}{lcl}
       
  1347 $r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
       
  1348       & $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$\\
       
  1349       & $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^*) + $\\ 
       
  1350       &                              & $\phantom{r_3 = (} ((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^* + (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* )$
       
  1351 \end{tabular}
       
  1352 \end{center}
       
  1353 
       
  1354 \noindent
       
Since $r_3$ is nullable, we can call $\textit{mkeps}$
to construct a lexical value for how $r_3$ matched the empty string.
       
  1357 This function gives the following value $v_3$: 
       
  1358 
       
  1359 
       
  1360 \begin{center}
       
  1361 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
       
  1362 \end{center}
       
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
       
  1364 
       
  1365 \begin{center}
       
  1366 	\begin{tabular}{l@{\hspace{2mm}}l}
       
  1367     & $\big(\underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} 
       
  1368     \;+\; (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*\big)$ \smallskip\\
       
  1369     $+$ & $\big((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*
       
  1370     \;+\; (\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* \big)$
       
  1371   	\end{tabular}
       
  1372  \end{center}
       
  1373 
       
  1374 \noindent
       
  1375  Note that the leftmost location of term $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
       
  1376  \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
       
  1377  $\textit{mkeps}$ to pick it up because $\textit{mkeps}$ is defined to always choose the
       
  1378  left one when it is nullable. In the case of this example, $abc$ is
       
  1379  preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is
       
  1380  generated by two applications of the splitting clause
       
  1381 
       
  1382 \begin{center}
       
     $(r_1 \cdot r_2)\backslash c \;=\; (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c \qquad (\text{when $r_1$ is nullable}).$
       
  1384 \end{center}
       
  1385        
       
  1386 \noindent
       
By this clause, we put $(r_1 \backslash c) \cdot r_2$ at the
\textit{front} and $r_2 \backslash c$ at the \textit{back}. This
allows $\textit{mkeps}$ to always pick up among the two matches the one with the
longer initial sub-match. Removing the outside $\Left(\Left(\ldots))$, the inside
sub-value
       
  1392  
       
  1393 \begin{center}
       
  1394  $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
       
  1395 \end{center}
       
  1396 
       
  1397 \noindent
       
tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot
\ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a sequence of two nullable regular
expressions: the first one is an alternative, in which we take the rightmost
branch---the one whose language contains the empty string; the second
one is a Kleene star, where $\Stars\,[]$ records that the empty string is
matched by zero iterations. Now $\textit{inj}$ injects characters back and incrementally
builds a lexical value based on $v_3$. Using the value $v_3$, the character
$c$, and the regular expression $r_2$, we can recover how $r_2$ matched
the string $[c]$: $\textit{inj} \; r_2 \; c \; v_3$ gives us
       
  1408  \begin{center}
       
  1409  $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
       
  1410  \end{center}
       
  1411 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
       
  1412 \begin{center}
       
  1413 $v_1 = \Seq(\Right(\Seq(\Empty, \Seq(b, c))), \Stars [])$
       
  1414 \end{center}
       
  1415  for how 
       
  1416  \begin{center}
       
 $r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$
       
  1418  \end{center}
       
  1419   matched  the string $bc$ before it split into two substrings. 
       
  1420   Finally, after injecting character $a$ back to $v_1$, 
       
  1421   we get  the lexical value tree 
       
  1422   \begin{center}
       
  1423   $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
       
  1424   \end{center}
       
   for how $r^*$ matched $abc$. This completes the algorithm.
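Assuming the \texttt{lexer} sketch from above, this example run can be
replayed as follows (the encoding of $r^*$ in constructors is our own; the
expected result, corresponding to $v_0$, is shown as a comment):

\begin{verbatim}
// ((((a+b) + ab) + c) + abc)*
val r0: Rexp =
  STAR(ALT(ALT(ALT(ALT(CHAR('a'), CHAR('b')),
                   SEQ(CHAR('a'), CHAR('b'))),
               CHAR('c')),
           SEQ(CHAR('a'), SEQ(CHAR('b'), CHAR('c')))))

lexer(r0, "abc".toList)
// expected:
// Some(Stars(List(Right(Sequ(Chr('a'), Sequ(Chr('b'), Chr('c')))))))
\end{verbatim}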
       
  1426    
       
  1427 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
       
  1428 Readers might have noticed that the lexical value information is actually
       
  1429 already available when doing derivatives. For example, immediately after
       
  1430 the operation $\backslash a$ we know that if we want to match a string
       
  1431 that starts with $a$, we can either take the initial match to be 
       
  1432 
       
  1433  \begin{center}
       
  1434 \begin{enumerate}
       
  1435     \item[1)] just $a$ or
       
  1436     \item[2)] string $ab$ or 
       
  1437     \item[3)] string $abc$.
       
  1438 \end{enumerate}
       
  1439 \end{center}
       
  1440 
       
  1441 \noindent
       
  1442 In order to differentiate between these choices, we just need to
       
remember their positions---$a$ is on the left, $ab$ is in the middle,
       
  1444 and $abc$ is on the right. Which of these alternatives is chosen
       
  1445 later does not affect their relative position because the algorithm does
       
not change this order. If this parsing information can be determined and
does not change with later derivatives, there is no point in
traversing it twice. This leads to an optimisation: if we
store the information for lexical values inside the regular expression,
update it when we take derivatives, and collect it once we have finished
with the derivatives by calling $\textit{mkeps}$ to decide which
branch is POSIX, then we can generate the lexical value in one pass, instead of
a second pass with $n$ injections. This leads to Sulzmann and Lu's novel
idea of using bitcodes in derivatives.
       
  1455 
       
  1456 In the next section, we shall focus on the bitcoded algorithm and the
       
  1457 process of simplification of regular expressions. This is needed in
       
order to obtain \emph{fast} versions of Brzozowski's and Sulzmann
and Lu's algorithms.  This is where the PhD-project aims to advance the
       
  1460 state-of-the-art.
       
  1461 
       
  1462 
       
  1463 \section{Simplification of Regular Expressions}
       
  1464 
       
Using bitcodes to guide parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielson for
       
  1467 efficient regular expression  lexing using DFAs~\cite{nielson11bcre}.
       
  1468 Sulzmann and Lu took this idea of bitcodes a step further by integrating
       
  1469 bitcodes into derivatives. The reason why we want to use bitcodes in
       
  1470 this project is that we want to introduce more aggressive simplification
       
  1471 rules in order to keep the size of derivatives small throughout. This is
       
  1472 because the main drawback of building successive derivatives according
       
  1473 to Brzozowski's definition is that they can grow very quickly in size.
       
This is mainly due to the fact that the derivative operation often
generates ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
       
  1476 implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
       
  1477 are excruciatingly slow. For example when starting with the regular
       
  1478 expression $(a + aa)^*$ and building 12 successive derivatives
       
  1479 w.r.t.~the character $a$, one obtains a derivative regular expression
       
  1480 with more than 8000 nodes (when viewed as a tree). Operations like
       
  1481 $\textit{der}$ and $\nullable$ need to traverse such trees and
       
  1482 consequently the bigger the size of the derivative the slower the
       
  1483 algorithm. 
       
  1484 
       
  1485 Fortunately, one can simplify regular expressions after each derivative
       
  1486 step. Various simplifications of regular expressions are possible, such
       
  1487 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
       
  1488 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
       
  1489 affect the answer for whether a regular expression matches a string or
       
  1490 not, but fortunately also do not affect the POSIX strategy of how
       
  1491 regular expressions match strings---although the latter is much harder
       
  1492 to establish. Some initial results in this regard have been
       
  1493 obtained in \cite{AusafDyckhoffUrban2016}. 
       
  1494 
       
  1495 Unfortunately, the simplification rules outlined above  are not
       
  1496 sufficient to prevent a size explosion in all cases. We
       
  1497 believe a tighter bound can be achieved that prevents an explosion in
       
  1498 \emph{all} cases. Such a tighter bound is suggested by work of Antimirov who
       
  1499 proved that (partial) derivatives can be bound by the number of
       
  1500 characters contained in the initial regular expression
       
  1501 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
       
  1502 expressions as follows:
       
  1503 
       
  1504 \begin{center}
       
  1505 \begin{tabular}{lcl}
       
  1506  $\textit{pder} \; c \; \ZERO$ & $\dn$ & $\emptyset$\\
       
  1507  $\textit{pder} \; c \; \ONE$ & $\dn$ & $\emptyset$ \\
       
 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \textit{then} \; \{  \ONE   \}  \; \textit{else} \; \emptyset$ \\ 
  $\textit{pder} \; c \; (r_1+r_2)$ & $\dn$ & $\textit{pder} \; c \; r_1 \cup \textit{pder} \; c \;  r_2$ \\
   $\textit{pder} \; c \; (r_1 \cdot r_2)$ & $\dn$ & $\textit{if} \; \textit{nullable} \; r_1 $\\
     & & $\textit{then} \; \{  r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1   \}  \cup \textit{pder} \; c \; r_2$\\
     & & $\textit{else} \; \{  r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1   \} $ \\ 
     $\textit{pder} \; c \; (r^*)$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in \textit{pder} \; c \; r   \}  $ \\  
       
  1514  \end{tabular}
       
  1515  \end{center}
       
  1516 
       
  1517 \noindent
       
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s child expressions or
concatenations of them. Antimirov proved a tight bound on the sum of
the sizes of \emph{all} partial derivatives, no matter what the string
looks like: roughly speaking, this sum is at most cubic in the
size of the regular expression.
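For comparison, Antimirov's partial derivatives can be transcribed into
Scala using sets of regular expressions (our own sketch of the clauses
above, reusing the earlier \texttt{Rexp} datatype):

\begin{verbatim}
def pder(c: Char, r: Rexp): Set[Rexp] = r match {
  case ZERO => Set()
  case ONE => Set()
  case CHAR(d) => if (c == d) Set(ONE) else Set()
  case ALT(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case SEQ(r1, r2) =>
    if (nullable(r1)) pder(c, r1).map(SEQ(_, r2)) ++ pder(c, r2)
    else pder(c, r1).map(SEQ(_, r2))
  case STAR(r1) => pder(c, r1).map(SEQ(_, STAR(r1)))
}
\end{verbatim}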
       
  1524 
       
  1525 If we want the size of derivatives in Sulzmann and Lu's algorithm to
       
  1526 stay below this bound, we would need more aggressive simplifications.
       
  1527 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
       
  1528 deleting duplicates whenever possible. For example, the parentheses in
       
  1529 $(a+b) \cdot c + bc$ can be opened up to get $a\cdot c +  b \cdot c + b
       
  1530 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
       
  1531 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
       
  1532 $a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
       
  1533 to achieve the same size bound as that of the partial derivatives. 
       
  1534 
       
  1535 In order to implement the idea of ``spilling out alternatives'' and to
       
make them compatible with the $\textit{inj}$-mechanism, we use
       
  1537 \emph{bitcodes}. Bits and bitcodes (lists of bits) are just:
       
  1538 
       
  1539 %This allows us to prove a tight
       
  1540 %bound on the size of regular expression during the running time of the
       
  1541 %algorithm if we can establish the connection between our simplification
       
  1542 %rules and partial derivatives.
       
  1543 
       
  1544  %We believe, and have generated test
       
  1545 %data, that a similar bound can be obtained for the derivatives in
       
  1546 %Sulzmann and Lu's algorithm. Let us give some details about this next.
       
  1547 
       
  1548 
       
  1549 \begin{center}
       
		$b ::=   \S \mid  \Z \qquad
bs ::= [] \mid b::bs
       
  1552 $
       
  1553 \end{center}
       
  1554 
       
  1555 \noindent
       
The $\S$ and $\Z$ are arbitrary names for the bits in order to avoid 
       
  1557 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
       
  1558 bit-lists) can be used to encode values (or incomplete values) in a
       
  1559 compact form. This can be straightforwardly seen in the following
       
  1560 coding function from values to bitcodes: 
       
  1561 
       
  1562 \begin{center}
       
  1563 \begin{tabular}{lcl}
       
  1564   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       
  1565   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
       
  1566   $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
       
  1567   $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
       
  1568   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
       
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\;
                                                 code(\Stars\,vs)$
       
  1572 \end{tabular}    
       
  1573 \end{center} 
       
  1574 
       
  1575 \noindent
       
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
star iteration into $\Z$, and the border where a star terminates
into $\S$ (this makes the coding consistent with the annotated derivative
and $\textit{bmkeps}$ operations introduced below). This coding is lossy,
as it throws away the information about
       
  1580 characters, and also does not encode the ``boundary'' between two
       
  1581 sequence values. Moreover, with only the bitcode we cannot even tell
       
  1582 whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
       
  1583 reason for choosing this compact way of storing information is that the
       
  1584 relatively small size of bits can be easily manipulated and ``moved
       
around'' in a regular expression. In order to recover values, we will 
need the corresponding regular expression as extra information. This
       
  1587 means the decoding function is defined as:
       
  1588 
       
  1589 
       
  1590 %\begin{definition}[Bitdecoding of Values]\mbox{}
       
  1591 \begin{center}
       
  1592 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
  1593   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
  1594   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
  1595   $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
  1596      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
  1597        (\Left\,v, bs_1)$\\
       
  1598   $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
  1599      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
  1600        (\Right\,v, bs_1)$\\                           
       
  1601   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
  1602         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
  1603   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
       
  1604   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
  $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & 
       
  1607          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
  1608   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
       
  1609   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
  1610   
       
  1611   $\textit{decode}\,bs\,r$ & $\dn$ &
       
  1612      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
  1613   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
  1614        \textit{else}\;\textit{None}$                       
       
  1615 \end{tabular}    
       
  1616 \end{center}    
       
  1617 %\end{definition}
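A Scala sketch of bits, $\textit{code}$ and $\textit{decode}$ might look
as follows, where \texttt{decode2} plays the role of $\textit{decode}'$
and we follow the $\Z$/$\S$ convention for stars that is consistent with
the derivative operations below (the matches are partial and fail on
ill-formed bitcodes, mirroring the definition):

\begin{verbatim}
abstract class Bit
case object Z extends Bit
case object S extends Bit

def code(v: Val): List[Bit] = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Left(v1) => Z :: code(v1)
  case Right(v1) => S :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(S)
  case Stars(v1 :: vs) => Z :: code(v1) ::: code(Stars(vs))
}

def decode2(bs: List[Bit], r: Rexp): (Val, List[Bit]) = (bs, r) match {
  case (_, ONE) => (Empty, bs)
  case (_, CHAR(c)) => (Chr(c), bs)
  case (Z :: bs1, ALT(r1, _)) =>
    val (v, bs2) = decode2(bs1, r1); (Left(v), bs2)
  case (S :: bs1, ALT(_, r2)) =>
    val (v, bs2) = decode2(bs1, r2); (Right(v), bs2)
  case (_, SEQ(r1, r2)) =>
    val (v1, bs1) = decode2(bs, r1)
    val (v2, bs2) = decode2(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (S :: bs1, STAR(_)) => (Stars(Nil), bs1)
  case (Z :: bs1, STAR(r1)) =>
    val (v, bs2) = decode2(bs1, r1)
    decode2(bs2, STAR(r1)) match {
      case (Stars(vs), bs3) => (Stars(v :: vs), bs3)
    }
}

def decode(bs: List[Bit], r: Rexp): Option[Val] = decode2(bs, r) match {
  case (v, Nil) => Some(v)
  case _ => None
}
\end{verbatim}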
       
  1618 
       
Sulzmann and Lu integrated the bitcodes into regular expressions to
       
  1620 create annotated regular expressions \cite{Sulzmann2014}.
       
  1621 \emph{Annotated regular expressions} are defined by the following
       
  1622 grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
  1623 
       
  1624 \begin{center}
       
  1625 \begin{tabular}{lcl}
       
  1626   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
       
  1627                   & $\mid$ & $\textit{ONE}\;\;bs$\\
       
  1628                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
       
  1629                   & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
       
  1630                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
       
  1631                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
       
  1632 \end{tabular}    
       
  1633 \end{center}  
       
  1634 %(in \textit{ALTS})
       
  1635 
       
  1636 \noindent
       
where $bs$ stands for bitcodes, $a$ for \textbf{a}nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\textit{ALTS}$) has been generalised to
accept a list of annotated regular expressions rather than just two.
       
  1641 We will show that these bitcodes encode information about
       
  1642 the (POSIX) value that should be generated by the Sulzmann and Lu
       
  1643 algorithm.
       
  1644 
       
  1645 
       
  1646 To do lexing using annotated regular expressions, we shall first
       
  1647 transform the usual (un-annotated) regular expressions into annotated
       
  1648 regular expressions. This operation is called \emph{internalisation} and
       
  1649 defined as follows:
       
  1650 
       
  1651 %\begin{definition}
       
  1652 \begin{center}
       
  1653 \begin{tabular}{lcl}
       
  1654   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
       
  1655   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
       
  1656   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
       
  1657   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
  1658   $\textit{ALTS}\;[]\,List((\textit{fuse}\,[\Z]\,r_1^\uparrow),\,
       
  1659   (\textit{fuse}\,[\S]\,r_2^\uparrow))$\\
       
  1660   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
  1661          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
       
  1662   $(r^*)^\uparrow$ & $\dn$ &
       
  1663          $\textit{STAR}\;[]\,r^\uparrow$\\
       
  1664 \end{tabular}    
       
  1665 \end{center}    
       
  1666 %\end{definition}
       
  1667 
       
  1668 \noindent
       
  1669 We use up arrows here to indicate that the basic un-annotated regular
       
  1670 expressions are ``lifted up'' into something slightly more complex. In the
       
  1671 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
  1672 attach bits to the front of an annotated regular expression. Its
       
  1673 definition is as follows:
       
  1674 
       
  1675 \begin{center}
       
  1676 \begin{tabular}{lcl}
       
  1677   $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
       
  1678   $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
       
  1679      $\textit{ONE}\,(bs\,@\,bs')$\\
       
  1680   $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
       
  1681      $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
       
  1682   $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,as)$ & $\dn$ &
       
  1683      $\textit{ALTS}\,(bs\,@\,bs')\,as$\\
       
  1684   $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
       
  1685      $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
       
  1686   $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
       
  1687      $\textit{STAR}\,(bs\,@\,bs')\,a$
       
  1688 \end{tabular}    
       
  1689 \end{center}  
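A sketch of these two operations in Scala, with annotated constructors
prefixed by \texttt{A} to keep them apart from the plain ones (our
naming, not the thesis' actual code):

\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ::: bs1)
  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
}

// internalisation (the up-arrow): plain to annotated
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(Z), internalise(r1)),
                    fuse(List(S), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}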
       
  1690 
       
  1691 \noindent
       
  1692 After internalising the regular expression, we perform successive
       
  1693 derivative operations on the annotated regular expressions. This
       
  1694 derivative operation is the same as what we had previously for the
       
basic regular expressions, except that we need to take care of
       
  1696 the bitcodes:
       
  1697 
       
  1698  %\begin{definition}{bder}
       
  1699 \begin{center}
       
  1700   \begin{tabular}{@{}lcl@{}}
       
  1701   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1702   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
  1703   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
  1704         $\textit{if}\;c=d\; \;\textit{then}\;
       
  1705          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
  1706   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
  1707   $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
       
  1708   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
  1709      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
  1710 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
  1711 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
  1712   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
       
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z]\, (a\,\backslash c))\,
       (\textit{STAR}\,[]\,a)$
       
  1716 \end{tabular}    
       
  1717 \end{center}    
       
  1718 %\end{definition}
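The following Scala sketch transcribes this derivative operation, together
with \textit{bnullable}, which is discussed below; it also refers to
\texttt{bmkeps}, sketched after its definition further below:

\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// derivative on annotated regular expressions
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
}
\end{verbatim}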
       
  1719 
       
  1720 \noindent
       
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
we need to attach an additional bit $\Z$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The $\textit{SEQ}$ clause
is more subtle: when $a_1$ is $\textit{bnullable}$ ($\textit{bnullable}$
is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions; we therefore omit its
definition), and assuming that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to the character $c$ (more on this later),
the right branch of the $\textit{ALTS}$, which is $\textit{fuse}\;(\textit{bmkeps}\,a_1)\;(a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it to the front. The
bitsequence $bs$, which was initially attached to the head of the $\textit{SEQ}$, has
now been elevated to the top level of the $\textit{ALTS}$, as this information will be
needed whichever way the $\textit{SEQ}$ is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
       
  1740 
       
  1741 
       
  1742 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
  1743 \begin{center}
       
  1744 \begin{tabular}{lcl}
       
  1745   $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
       
  1746   $\textit{bmkeps}\,(\textit{ALTS}\;bs\,a::as)$ & $\dn$ &
       
  1747      $\textit{if}\;\textit{bnullable}\,a$\\
       
  1748   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(\textit{ALTS}\;[]\,as)$\\
       
  1750   $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
       
  1751      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
  1752   $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
       
  1753      $bs \,@\, [\S]$
       
  1754 \end{tabular}    
       
  1755 \end{center}    
       
  1756 %\end{definition}
       
  1757 
       
  1758 \noindent
       
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value,
collecting all the bitcodes and using $\S$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
       
  1763 decode them, we get the value we expect. The corresponding lexing
       
  1764 algorithm looks as follows:
       
  1765 
       
  1766 \begin{center}
       
  1767 \begin{tabular}{lcl}
       
  1768   $\textit{blexer}\;r\,s$ & $\dn$ &
       
  1769       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
  1770   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1771   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1772   & & $\;\;\textit{else}\;\textit{None}$
       
  1773 \end{tabular}
       
  1774 \end{center}
       
  1775 
       
  1776 \noindent
       
  1777 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
  1778 operation from characters to strings (just like the derivatives for un-annotated
       
  1779 regular expressions).
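A Scala sketch of $\textit{bmkeps}$ and $\textit{blexer}$ follows; note
that in the $\textit{ALTS}$ clause the head bits $bs$ are emitted only
once, matching the definition above:

\begin{verbatim}
def bmkeps(a: ARexp): List[Bit] = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ::: bmkeps(a1)
    else bs ::: bmkeps(AALTS(Nil, as))
  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _) => bs ::: List(S)
}

// derivative w.r.t. a string, then collect and decode the bits
def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}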
       
  1780 
       
  1781 The main point of the bitcodes and annotated regular expressions is that
       
  1782 we can apply rather aggressive (in terms of size) simplification rules
       
  1783 in order to keep derivatives small. We have developed such
       
  1784 ``aggressive'' simplification rules and generated test data that show
       
  1785 that the expected bound can be achieved. Obviously we could only
       
  1786 partially cover  the search space as there are infinitely many regular
       
  1787 expressions and strings. 
       
  1788 
       
  1789 One modification we introduced is to allow a list of annotated regular
       
  1790 expressions in the \textit{ALTS} constructor. This allows us to not just
       
  1791 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
  1792 also unnecessary ``copies'' of regular expressions (very similar to
       
  1793 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
  1794 modification is that we use simplification rules inspired by Antimirov's
       
  1795 work on partial derivatives. They maintain the idea that only the first
       
  1796 ``copy'' of a regular expression in an alternative contributes to the
       
  1797 calculation of a POSIX value. All subsequent copies can be pruned away from
       
  1798 the regular expression. A recursive definition of our  simplification function 
       
  1799 that looks somewhat similar to our Scala code is given below:
       
  1800 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
  1801 %Is it $ALTS$ or $ALTS$?}\\
       
  1802 
       
  1803 \begin{center}
       
  1804   \begin{tabular}{@{}lcl@{}}
       
  1805    
       
  1806   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
  1807    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
  1808    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
  1809    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
  1810    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
  1811    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
       
  1812 
       
  1813   $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
       
  1814   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse}\;bs\;a$ \\
       
  1816    &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\ 
       
  1817 
       
  1818    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
  1819 \end{tabular}    
       
  1820 \end{center}    
       
  1821 
       
  1822 \noindent
       
The simplification does a pattern match on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it tries to simplify its child regular expressions
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\textit{ALTS}$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
$\textit{ALTS}$ and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all
later duplicates. Function $\textit{flatten}$ opens up nested $\textit{ALTS}$.
Its recursive definition is given below:
       
  1834 
       
  1835  \begin{center}
       
  1836   \begin{tabular}{@{}lcl@{}}
       
  1837   $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
       
  1838      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
       
  1839   $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
       
  1840     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
       
  1841 \end{tabular}    
       
  1842 \end{center}  
       
  1843 
       
  1844 \noindent
       
  1845 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
  1846 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
  1847 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
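The following Scala sketch puts the simplification (here called
\texttt{bsimp}), \texttt{flatten} and the duplicate removal together. Two
points are our own choices rather than part of the tables above:
duplicates are detected only after stripping the bits with a helper
\texttt{erase}, and in the $\ONE$-case we fuse the bits of the $\ONE$ as
well, so that no bits are lost; the $(a_1', \ONE)$ case is left out, since
fusing cannot append that $\ONE$'s bits at the correct rear position:

\begin{verbatim}
// strip the bits off an annotated regular expression
def erase(a: ARexp): Rexp = a match {
  case AZERO => ZERO
  case AONE(_) => ONE
  case ACHAR(_, c) => CHAR(c)
  case AALTS(_, Nil) => ZERO
  case AALTS(_, a1 :: Nil) => erase(a1)
  case AALTS(_, a1 :: as) => ALT(erase(a1), erase(AALTS(Nil, as)))
  case ASEQ(_, a1, a2) => SEQ(erase(a1), erase(a2))
  case ASTAR(_, a1) => STAR(erase(a1))
}

// open up nested ALTS and drop ZEROs
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def bsimp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (bsimp(a1), bsimp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s)
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) =>
    flatten(as.map(bsimp)).distinctBy(erase) match {
      case Nil => AZERO
      case a1 :: Nil => fuse(bs, a1)
      case as1 => AALTS(bs, as1)
    }
  case _ => a
}
\end{verbatim}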
       
  1848 
       
  1849 Suppose we apply simplification after each derivative step, and view
       
  1850 these two operations as an atomic one: $a \backslash_{simp}\,c \dn
       
  1851 \textit{simp}(a \backslash c)$. Then we can use the previous natural
       
  1852 extension from derivative w.r.t.~character to derivative
       
  1853 w.r.t.~string:%\comment{simp in  the [] case?}
       
  1854 
       
  1855 \begin{center}
       
  1856 \begin{tabular}{lcl}
       
  1857 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
  1858 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
  1859 \end{tabular}
       
  1860 \end{center}
       
  1861 
       
  1862 \noindent
       
With this we obtain an optimised version of the algorithm:
       
  1864 
       
  1865  \begin{center}
       
  1866 \begin{tabular}{lcl}
       
  1867   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
  1868       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
  1869   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1870   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1871   & & $\;\;\textit{else}\;\textit{None}$
       
  1872 \end{tabular}
       
  1873 \end{center}
       
  1874 
       
  1875 \noindent
       
This algorithm keeps the size of the derivatives small. For example,
with this simplification the derivative of our previous $(a + aa)^*$
example, which previously grew to more than 8000 nodes, is reduced to
just 6 nodes and stays constant, no matter how long the input string is.
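The optimised lexer then differs from the sketched \texttt{blexer} only in
the derivative phase:

\begin{verbatim}
// derivative-then-simplify, extended to strings
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bdersSimp(cs, bsimp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}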
       
  1880 
       
  1881 
       
  1882 
       
  1883 \section{Current Work}
       
  1884 
       
  1885 We are currently engaged in two tasks related to this algorithm. The
       
  1886 first task is proving that our simplification rules actually do not
       
  1887 affect the POSIX value that should be generated by the algorithm
       
  1888 according to the specification of a POSIX value and furthermore obtain a
       
  1889 much tighter bound on the sizes of derivatives. The result is that our
       
  1890 algorithm should be correct and faster on all inputs.  The original
       
  1891 blow-up, as observed in JavaScript, Python and Java, would be excluded
       
  1892 from happening in our algorithm. For this proof we use the theorem prover
       
  1893 Isabelle. Once completed, this result will advance the state-of-the-art:
       
  1894 Sulzmann and Lu wrote in their paper~\cite{Sulzmann2014} about the
       
  1895 bitcoded ``incremental parsing method'' (that is the lexing algorithm
       
  1896 outlined in this section):
       
  1897 
       
  1898 \begin{quote}\it
       
  1899   ``Correctness Claim: We further claim that the incremental parsing
       
  1900   method in Figure~5 in combination with the simplification steps in
       
  1901   Figure 6 yields POSIX parse tree [our lexical values]. We have tested this claim
       
  1902   extensively by using the method in Figure~3 as a reference but yet
       
  1903   have to work out all proof details.''
       
  1904 \end{quote}  
       
  1905 
       
  1906 \noindent 
       
We would like to settle this correctness claim. It is relatively
       
  1908 straightforward to establish that after one simplification step, the part of a
       
  1909 nullable derivative that corresponds to a POSIX value remains intact and can
       
  1910 still be collected, in other words, we can show that
       
  1911 %\comment{Double-check....I
       
  1912 %think this  is not the case}
       
  1913 %\comment{If i remember correctly, you have proved this lemma.
       
  1914 %I feel this is indeed not true because you might place arbitrary 
       
  1915 %bits on the regex r, however if this is the case, did i remember wrongly that
       
  1916 %you proved something like simplification does not affect $\textit{bmkeps}$ results?
       
  1917 %Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
       
  1918 %to a regex. Maybe it works now.}
       
  1919 
       
  1920 \begin{center}
       
	$\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{bsimp} \; a) \qquad \text{provided $a$ is $\textit{bnullable}$}$
       
  1922 \end{center} 
       
  1923 
       
  1924 \noindent
       
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in
a regular expression. The hard part of this proof is to establish that
       
  1928 
       
  1929 \begin{center}
       
	$\textit{blexer\_simp} \; r \; s = \textit{blexer} \; r \; s$
       
  1931 \end{center}
       
  1932 %comment{This is not true either...look at the definion blexer/blexer-simp}
       
  1933 
       
\noindent That is, if we take derivatives of a regular expression $r$ and
simplify after each step until we exhaust the string, we get a regular
expression $r''$ (the $\textit{LHS}$) that provides the same POSIX matching
information as the result $r'$ (the $\textit{RHS}$) of the plain derivative
algorithm, which only takes derivatives repeatedly and does no
simplification at all. This might seem at first glance very unintuitive, as
$r'$ could be exponentially larger than $r''$, but it can be explained in the
following way: we are pruning away the possible matches that are not POSIX.
Since there could be exponentially many
non-POSIX matchings but only one POSIX matching, it
is understandable that $r''$ can be a lot smaller, while we can still provide
the same POSIX value if there is one. Proving this equality is not as
straightforward as the previous proposition, as the two regular expressions
$r'$ and $r''$ might have become very different. The crucial point is to find the
       
  1948 $\textit{POSIX}$  information of a regular expression and how it is modified,
       
  1949 augmented and propagated 
       
  1950 during simplification in parallel with the regular expression that
       
  1951 has not been simplified in the subsequent derivative operations.  To aid this,
       
we use the helper function \emph{retrieve} described by Sulzmann and Lu:
       
  1953 \begin{center}
       
  1954 \begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
       
  1955   $\textit{retrieve}\,(\textit{ONE}\,bs)\,\Empty$ & $\dn$ & $bs$\\
       
  1956   $\textit{retrieve}\,(\textit{CHAR}\,bs\,c)\,(\Char\,d)$ & $\dn$ & $bs$\\
       
  1957   $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Left\,v)$ & $\dn$ &
       
  1958      $bs \,@\, \textit{retrieve}\,a\,v$\\
       
  1959   $\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Right\,v)$ & $\dn$ &
       
  $bs \,@\, \textit{retrieve}\,(\textit{ALTS}\,[]\,as)\,v$\\
       
  1961   $\textit{retrieve}\,(\textit{SEQ}\,bs\,a_1\,a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
       
  1962      $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
       
  1963   $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,[])$ & $\dn$ &
       
  1964      $bs \,@\, [\S]$\\
       
  1965   $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
       
  1966   \multicolumn{3}{l}{
       
  1967      \hspace{3cm}$bs \,@\, [\Z] \,@\, \textit{retrieve}\,a\,v\,@\,
       
  1968                     \textit{retrieve}\,(\textit{STAR}\,[]\,a)\,(\Stars\,vs)$}\\
       
  1969 \end{tabular}
       
  1970 \end{center}
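A Scala sketch of \textit{retrieve} (partial, like the definition; the
$\Right$ clause recurses with empty head bits so that $bs$ is emitted only
once):

\begin{verbatim}
def retrieve(a: ARexp, v: Val): List[Bit] = (a, v) match {
  case (AONE(bs), Empty) => bs
  case (ACHAR(bs, _), Chr(_)) => bs
  case (AALTS(bs, a1 :: _), Left(v1)) => bs ::: retrieve(a1, v1)
  case (AALTS(bs, _ :: as), Right(v1)) =>
    bs ::: retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
    bs ::: retrieve(a1, v1) ::: retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil)) => bs ::: List(S)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ::: List(Z) ::: retrieve(a1, v1) :::
    retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}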
       
  1971 %\comment{Did not read further}\\
       
  1972 This function assembles the bitcode 
       
  1973 %that corresponds to a lexical value for how
       
  1974 %the current derivative matches the suffix of the string(the characters that
       
  1975 %have not yet appeared, but will appear as the successive derivatives go on.
       
  1976 %How do we get this "future" information? By the value $v$, which is
       
  1977 %computed by a pass of the algorithm that uses
       
  1978 %$inj$ as described in the previous section).  
       
  1979 using information from both the derivative regular expression and the
       
value. Sulzmann and Lu proposed this function, but did not prove
       
  1981 anything about it. Ausaf and Urban used it to connect the bitcoded
       
  1982 algorithm to the older algorithm by the following equation:
       
  1983  
       
\begin{center}
$\inj\;r\;c\;v = \textit{decode}\,(\retrieve\,((r^\uparrow)\backslash c)\,v)$
\end{center}
       
\noindent
whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf
and Urban also used this fact to prove the correctness of the bitcoded
algorithm without simplification. Our purpose in using it, however,
is to establish
       
\begin{center}
$\textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; (\textit{simp}\,a) \; v'.$
\end{center}

The idea is that, using $v'$, a simplified version of $v$ that has gone
through the same simplification steps as $\textit{simp}(a)$, we are able
to extract a bitcode that gives the same parsing information as the
unsimplified one. However, we noticed that constructing such a $v'$ from
$v$ is not so straightforward. The point of computing it is that we might
be able to finally bridge the gap by proving
       
\begin{center}
$\textit{retrieve} \; (r^\uparrow \backslash s) \; v = \;\textit{retrieve} \;
(\textit{simp}(r^\uparrow) \backslash s) \; v'$
\end{center}
       
\noindent
and subsequently

\begin{center}
$\textit{retrieve} \; (r^\uparrow \backslash s) \; v\; = \; \textit{retrieve} \;
(r^\uparrow \backslash_{simp} \, s) \; v'$.
\end{center}
       
\noindent
The $\textit{LHS}$ of the above equation is the bitcode we want. This
would prove that our simplified version of the regular expression still
contains all the bitcodes needed. The task here is to find a way to
compute the correct $v'$.
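A related mechanism appears in Ausaf and Urban's verified lexer with
simplification, where every simplification rule is paired with a
rectification function that maps a value of the simplified regular
expression back to a value of the original one. Computing our $v'$ would
require the transformation in the opposite direction, but the shape of these
functions is suggestive. Below is a minimal Scala sketch of such
rule/rectification pairs, reusing the \texttt{Val} datatype from the sketch
above; the rule coverage and names are illustrative only.

\begin{verbatim}
// simp returns the simplified regular expression together with a
// function mapping a value of the simplified regex back to a value
// of the original (a rectification function).
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp

def simp(r: Rexp): (Rexp, Val => Val) = r match {
  case ALT(r1, r2) =>
    val (s1, f1) = simp(r1)
    val (s2, f2) = simp(r2)
    (s1, s2) match {
      case (ZERO, _) => (s2, v => Right(f2(v)))  // 0 + r  ~>  r
      case (_, ZERO) => (s1, v => Left(f1(v)))   // r + 0  ~>  r
      case _ => (ALT(s1, s2), {
        case Left(v)  => Left(f1(v))
        case Right(v) => Right(f2(v))
        case v        => v
      })
    }
  case _ => (r, v => v)
}
\end{verbatim}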
       
The second task is to speed up the more aggressive simplification. Currently
it is slower than the original naive simplification by Ausaf and Urban (the
naive version as implemented by Ausaf and Urban can of course ``explode'' in
some cases). It is therefore not surprising that the speed is also much
slower than that of the regular expression engines in popular programming
languages such as Java and Python on most inputs on which they behave
linearly. For example, just rewriting the example regular expression from
the beginning of this report, $(a^*)^*\,b$, into $a^*\,b$ eliminates the
ambiguity in the matching and makes the matching time linear in the size of
the input string. This allows the DFA approach to become blindingly fast,
dwarfing the speed of our current implementation. As an illustration, here
is a comparison of the Java regex engine and our implementation on this
example.
       
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n*1000$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=9,
    scaled ticks=true,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Bitcoded Algorithm},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n*1000$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=9,
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $a^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}
       
The Java regex engine can match strings of thousands of characters in a few
milliseconds, whereas our current algorithm becomes excruciatingly slow on
inputs of this size. The running time is in theory linear, but this does not
appear to be the case in the actual implementation. It therefore needs to be
explored how to make our algorithm faster on all inputs. One suspect is the
recursive calls that are needed to manipulate the bits; a possible remedy is
to rewrite these recursive functions in tail-recursive form (see the sketch
after this paragraph). Another possibility would be to explore again the
connection to DFAs to speed up the algorithm on subcalls that are small
enough. This is very much work in progress.
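To illustrate the first idea, consider the following toy example, which is
purely illustrative and not taken from our code base: counting the star
iterations encoded in a bitcode. The naive version keeps a stack frame per
bit because of the pending addition, whereas the accumulator-passing version
has the recursive call in tail position, so the Scala compiler turns it into
a loop.

\begin{verbatim}
import scala.annotation.tailrec

// Naive version: not tail-recursive, the "1 + ..." happens after
// the recursive call returns (Z = false, S = true as above).
def countIters(bs: List[Boolean]): Int = bs match {
  case false :: rest => 1 + countIters(rest)
  case _             => 0
}

// Accumulator-passing version: the recursive call is the last
// action, so @tailrec is accepted and compiled to a loop.
@tailrec
def countItersTR(bs: List[Boolean], acc: Int = 0): Int = bs match {
  case false :: rest => countItersTR(rest, acc + 1)
  case _             => acc
}
\end{verbatim}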
       
\section{Conclusion}
       
In this PhD project we are interested in fast algorithms for regular
expression matching. While this seems to be a ``settled'' area, interesting
research questions in fact pop up as soon as one steps outside classic
automata theory (for example in terms of what kind of regular expressions
are supported). The reason why it is interesting for us to look at the
derivative approach, introduced by Brzozowski for regular expression
matching and then much further developed by Sulzmann and Lu, is that
derivatives can elegantly deal with some of the regular expressions that are
of interest in ``real life''. This includes the negation regular expression,
written $\neg\,r$ (matching all strings that are not recognised by $r$), but
also bounded regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$.
There is also hope that derivatives can provide another angle for dealing
more efficiently with back-references, which are one of the reasons why the
regular expression engines in JavaScript, Python and Java choose not to
implement the classic automata approach of transforming regular expressions
into NFAs and then DFAs: we simply do not know how such back-references can
be represented by DFAs. We also plan to implement the bitcoded algorithm in
an imperative language like C to see whether the inefficiency of the Scala
implementation is language-specific. To make this research more
comprehensive we also plan to contrast our (faster) version of the bitcoded
algorithm with the Symbolic Regex Matcher, RE2, the Rust regex engine, and
the static analysis approach, by implementing them in the same language and
then comparing their performance.
       
       
\bibliographystyle{plain}
\bibliography{root,regex_time_complexity}
       
\end{document}