% PhdThesisRealOne/LaTeXTemplates_masters-doctoral-thesis_v2/Chapters/Chapter1.tex
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

Regular expressions are widely used in modern-day programming tasks.
Be it IDEs with syntax highlighting and auto-completion,
command line tools like $\mathit{grep}$ that facilitate easy
processing of text by search and replace, network intrusion
detection systems that reject suspicious traffic, or compiler
front-ends--there is always an important phase of the
task that involves matching a regular
expression against a string.
Given its usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.

If you use a popular programming language's
regex engine,
you can supply it with regexes and strings of your own,
and get matching/lexing information such as how a
sub-part of the regex matches a sub-part of the string.

This opens up the possibility of
a ReDoS (regular expression denial-of-service) attack.
       
\section{Why Backtracking Algorithms for Regexes?}

Theoretical results say that regular expression matching
should be linear with respect to the input. One could construct
an NFA via Thompson's construction and simulate running it.
This would be linear.

2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the
outage had several causes, at the heart was a regular expression that
was used to monitor network traffic.
%\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
       

It turns out that regex libraries not only suffer from
exponential backtracking problems,
but also undesired (or even buggy) outputs.
%TODO: comment from who
xxx commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
A concrete example would be
the regex
\begin{verbatim}
(((((a*a*)b*)b){20})*)c
\end{verbatim}
and the string
\begin{verbatim}
baabaabababaabaaaaaaaaababaa
aababababaaaabaaabaaaaaabaab
aabababaababaaaaaaaaababaaaa
babababaaaaaaaaaaaaac
\end{verbatim}

This seemingly complex regex simply says ``some $a$'s
followed by some $b$'s then followed by one single $b$,
and this iterates 20 times, finally followed by a $c$''.
A POSIX match would involve the entire string, ``eating up''
all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly

This regex would trigger catastrophic backtracking in
languages like Python and Java,
whereas it gives a correct but uninformative (non-POSIX)
match in languages like Go or .NET--namely the match consisting of only
the character $c$.
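
One way to observe the exponential behaviour is to feed the regex and the
string above to a backtracking engine, for instance Java's
\texttt{java.util.regex} called from Scala. The following snippet is only an
illustration (not part of the lexer developed in this thesis); on a
backtracking engine the call to \texttt{matches} may run for an extremely
long time.
\begin{verbatim}
import java.util.regex.Pattern

object CatastrophicDemo {
  def main(args: Array[String]): Unit = {
    // the regex and string from the example above
    val pat = Pattern.compile("(((((a*a*)b*)b){20})*)c")
    val str = "baabaabababaabaaaaaaaaababaa" +
              "aababababaaaabaaabaaaaaabaab" +
              "aabababaababaaaaaaaaababaaaa" +
              "babababaaaaaaaaaaaaac"
    val t0 = System.nanoTime()
    val res = pat.matcher(str).matches()   // may backtrack exponentially
    println(s"matched = $res after ${(System.nanoTime() - t0) / 1e6} ms")
  }
}
\end{verbatim}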
       
Backtracking or depth-first search might give us
exponential running time, and quite a few tools
avoid that by determinising the $\mathit{NFA}$
into a $\mathit{DFA}$ and minimising it.
For example, $\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based
lexers.
       
However, they do not scale well with bounded repetitions.
Bounded repetitions, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denote regular expressions accepting strings
that can be divided into $c$ substrings, where each
substring is in $r$.
%TODO: draw example NFA.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like:
       
   273 \begin{center}
       
   274 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
       
   275    \node[state,initial] (q_0)   {$q_0$}; 
       
   276    \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
       
   277    \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
       
   278    \node[state,accepting](q_3) [right=of q_2] {$q_3$};
       
   279     \path[->] 
       
   280     (q_0) edge  node {a} (q_1)
       
   281     	  edge [loop below] node {a,b} ()
       
   282     (q_1) edge  node  {a,b} (q_2)
       
   283           edge [loop above] node {0} ()
       
   284     (q_2) edge  node  {a,b} (q_3);
       
   285 \end{tikzpicture}
       
   286 \end{center}
       
The red states are ``counter states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
just consumed the ``delimiter'' $a$ in the middle, and
needs to match two more iterations of $a|b$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed one character, and just waits
for one more character to complete.
Depending on the actual characters appearing in the
input string, the states $q_1$ and $q_2$ may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least four non-equivalent states that cannot be merged,
because the subset states indicating which of $q_1$ and $q_2$
are active are at least four: $\phi$, $\{q_1\}$,
$\{q_2\}$ and $\{q_1, q_2\}$.
Generalising this to regular expressions with a larger
bounded repetition number, the regular expression $r^*ar^{\{n\}}$
in general requires at least $2^n$ states
in a $\mathit{DFA}$, in order to represent all the different
configurations of ``countdown'' states.
For such regexes, tools like $\mathit{JFLEX}$
would generate gigantic $\mathit{DFA}$s or even
run out of memory.
       
   313 
       
For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.
One can simulate the $\mathit{NFA}$ in two ways:
one is by keeping track of all active states after consuming
a character, and updating that set of states iteratively.
This amounts to a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{GO}$ and $\mathit{RUST}$ use this
type of $\mathit{NFA}$ simulation, which guarantees a linear runtime
in terms of input string length.
The other way to use an $\mathit{NFA}$ for matching is to explore
a single path at a time, and backtrack if that path
fails. This is a depth-first search that can end up
with exponential running time.
The reason backtracking algorithms are used in languages like
Java and Python is that they support back-references.
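
The state-set simulation can be sketched in Scala as follows (a minimal
illustration with an assumed, simplified NFA representation;
$\epsilon$-transitions and the bookkeeping needed for lexing are omitted):
\begin{verbatim}
// a transition table maps (state, character) to a set of successor states
case class Nfa(start: Set[Int],
               accepting: Set[Int],
               delta: Map[(Int, Char), Set[Int]])

def accepts(nfa: Nfa, s: String): Boolean = {
  // after each character we know the set of *all* states the NFA can be in,
  // so the running time is linear in the length of the input string
  val finalStates = s.foldLeft(nfa.start) { (states, c) =>
    states.flatMap(q => nfa.delta.getOrElse((q, c), Set.empty[Int]))
  }
  finalStates.exists(nfa.accepting)
}
\end{verbatim}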
       
\subsection{Back References in Regexes--A Non-Regular Part}
If we label sub-expressions by parenthesising them and give
them a number according to the order in which their opening parentheses appear,
$\underset{1}{(}\ldots\underset{2}{(}\ldots\underset{3}{(}\ldots\underset{4}{(}\ldots)\ldots)\ldots)\ldots)$,
we can use the following syntax to denote that we want a string just matched by a
sub-expression to appear at a certain location again exactly:
$(.*)\backslash 1$
would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.
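
For illustration, a backtracking engine such as Java's (here called from
Scala) supports this construct directly; the snippet below is only an
example of the behaviour, not part of the development in this thesis:
\begin{verbatim}
import java.util.regex.Pattern

val backref = Pattern.compile("(.*)\\1")
println(backref.matcher("bobo").matches())    // true:  "bo" repeated
println(backref.matcher("weewee").matches())  // true:  "wee" repeated
println(backref.matcher("bob").matches())     // false: not of the form ww
\end{verbatim}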
       

Back-references are a construct in the ``regex'' standard
that programmers find quite useful, but which is not exactly
regular any more.
In fact, it allows regexes to express
languages that cannot be contained in the context-free
languages.
For example, the back-reference $((a|b)^*)\backslash 1$
expresses the language $\{ww \mid w \in \{a,b\}^*\}$,
which cannot be expressed by context-free grammars.
To express such a language one would need context-sensitive
grammars, and matching/parsing for such grammars is NP-hard in
general.
%TODO:cite reference for NP-hardness of CSG matching
For such constructs, the most intuitive way of matching is
to use backtracking algorithms, and there are no known algorithms
that do not backtrack.
%TODO:read a bit more about back reference algorithms
       

\section{Our Solution--Brzozowski Derivatives}


Is it possible to have a regex lexing algorithm with proven correctness and
time complexity, which allows easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references?

We propose Brzozowski's derivatives as a solution to this problem.

The main contribution of this thesis is a proven correct lexing algorithm
with formalized time bounds.
To the best of our knowledge, there are no lexing libraries using Brzozowski derivatives
that have a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu proposed an algorithm for which they
claim a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not its complexity.
In the performance evaluation section, they simply analyzed the running time
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.
We therefore believe that our proof of performance on general
inputs, rather than on specific examples, is a novel contribution.\\

\section{Preliminaries about Lexing Using Brzozowski Derivatives}
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.
The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly

We can give the meaning of regular expression derivatives
by language interpretation:

       
   503 \begin{center}
       
   504 \begin{tabular}{lcl}
       
   505 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   506 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   507 		$d \backslash c$     & $\dn$ & 
       
   508 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   509 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   510 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   511 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   512 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   513 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   514 \end{tabular}
       
   515 \end{center}
       
\noindent
The derivative function, written $\backslash c$,
defines how a regular expression evolves into
a new regular expression describing the strings of the original one
after the head character $c$ has been chopped off.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
might also be chosen as the target regular expression from which the
head character is chopped off.
The star case unwraps one iteration of
the regular expression and attaches the star regular expression
to its back again, to make sure there can be zero or more iterations
following this unfolded iteration.
       
   532 
       
   533 
       
   534 The main property of the derivative operation
       
   535 that enables us to reason about the correctness of
       
   536 an algorithm using derivatives is 
       
   537 
       
   538 \begin{center}
       
   539 $c\!::\!s \in L(r)$ holds
       
   540 if and only if $s \in L(r\backslash c)$.
       
   541 \end{center}
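
For instance, for the regular expression $a \cdot b$ and the string $ab$,
unfolding this property twice reduces the matching question to a
nullability test:
\begin{center}
\begin{tabular}{lcl}
$ab \in L(a\cdot b)$ & if and only if & $b \in L((a\cdot b)\backslash a) = L(\ONE \cdot b)$\\
                     & if and only if & $[\,] \in L((\ONE\cdot b)\backslash b) = L(\ZERO \cdot b + \ONE)$
\end{tabular}
\end{center}
\noindent
The last regular expression is nullable, therefore the match succeeds.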
       
   542 
       
   543 \noindent
       
   544 We can generalise the derivative operation shown above for single characters
       
   545 to strings as follows:
       
   546 
       
   547 \begin{center}
       
   548 \begin{tabular}{lcl}
       
   549 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
   550 $r \backslash [\,] $ & $\dn$ & $r$
       
   551 \end{tabular}
       
   552 \end{center}
       
   553 
       
   554 \noindent
       
   555 and then define Brzozowski's  regular-expression matching algorithm as:
       
   556 
       
   557 \[
       
   558 match\;s\;r \;\dn\; nullable(r\backslash s)
       
   559 \]
       
   560 
       
   561 \noindent
       
Assuming that a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:
       
   564 
       
   565 \begin{equation}\label{graph:*}
       
   566 \begin{tikzcd}
       
   567 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
       
   568 \end{tikzcd}
       
   569 \end{equation}
       
   570 
       
   571 \noindent
       
   572 where we start with  a regular expression  $r_0$, build successive
       
   573 derivatives until we exhaust the string and then use \textit{nullable}
       
   574 to test whether the result can match the empty string. It can  be
       
   575 relatively  easily shown that this matcher is correct  (that is given
       
   576 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
       
   577 
       
A beautiful and simple definition.
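
To give a feel for how compact this matcher is, the following Scala sketch
spells it out (textbook-style code with assumed names; it is an
illustration, not necessarily the exact code used later in this thesis):
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a string and the YES/NO matcher
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(s: String, r: Rexp): Boolean = nullable(ders(s.toList, r))
\end{verbatim}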
       
   579 
       
   580 If we implement the above algorithm naively, however,
       
   581 the algorithm can be excruciatingly slow. For example, when starting with the regular
       
   582 expression $(a + aa)^*$ and building 12 successive derivatives
       
   583 w.r.t.~the character $a$, one obtains a derivative regular expression
       
   584 with more than 8000 nodes (when viewed as a tree). Operations like
       
   585 $\backslash$ and $\nullable$ need to traverse such trees and
       
   586 consequently the bigger the size of the derivative the slower the
       
   587 algorithm. 
       
   588 
       
Brzozowski was quick in finding that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, making the derivatives far from optimal.
He also introduced some ``similarity rules'' such
as $P+(Q+R) = (P+Q)+R$ to merge syntactically
different but language-equivalent sub-regexes and to further decrease the size
of the intermediate regexes.
       
   595 
       
More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher without the intermediate regular expressions becoming cumbersome.
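
As an illustration, the following sketch (building on the Scala code above;
a basic version only, not the simplification used later for the bit-coded
regular expressions) removes useless $\ZERO$s, $\ONE$s and obvious
duplicates after every derivative step:
\begin{verbatim}
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s) => if (r1s == r2s) r1s else ALT(r1s, r2s)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, r2s) => r2s
    case (r1s, ONE) => r1s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case _ => r
}

// the composed operation: simplify after every derivative step
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}
\end{verbatim}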
       
   602 
       
   603 
       
   604 If we want the size of derivatives in the algorithm to
       
   605 stay even lower, we would need more aggressive simplifications.
       
   606 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
       
   607 deleting duplicates whenever possible. For example, the parentheses in
       
   608 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
       
   609 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
       
   610 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
       
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
       
   612 to achieve a very tight size bound, namely,
       
   613  the same size bound as that of the \emph{partial derivatives}. 
       
   614 
       
Building derivatives and then simplifying them:
so far so good. But what if we want to
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back again to the world
without simplification first for a moment.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.
       
   623 
       
   624 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
       
   625 
       
   626 
       
   627 They first defined the datatypes for storing the 
       
   628 lexing information called a \emph{value} or
       
   629 sometimes also \emph{lexical value}.  These values and regular
       
   630 expressions correspond to each other as illustrated in the following
       
   631 table:
       
   632 
       
   633 \begin{center}
       
   634 	\begin{tabular}{c@{\hspace{20mm}}c}
       
   635 		\begin{tabular}{@{}rrl@{}}
       
   636 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
       
   637 			$r$ & $::=$  & $\ZERO$\\
       
   638 			& $\mid$ & $\ONE$   \\
       
   639 			& $\mid$ & $c$          \\
       
   640 			& $\mid$ & $r_1 \cdot r_2$\\
       
   641 			& $\mid$ & $r_1 + r_2$   \\
       
   642 			\\
       
   643 			& $\mid$ & $r^*$         \\
       
   644 		\end{tabular}
       
   645 		&
       
   646 		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
       
   647 			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
       
   648 			$v$ & $::=$  & \\
       
   649 			&        & $\Empty$   \\
       
   650 			& $\mid$ & $\Char(c)$          \\
       
   651 			& $\mid$ & $\Seq\,v_1\, v_2$\\
       
   652 			& $\mid$ & $\Left(v)$   \\
       
   653 			& $\mid$ & $\Right(v)$  \\
       
   654 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
       
   655 		\end{tabular}
       
   656 	\end{tabular}
       
   657 \end{center}
       
   658 
       
   659 \noindent
       
One regular expression can have multiple lexical values. For example,
for the regular expression $(a+b)^*$, it has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a)),\,\Right(Char(b))]$,
$\ldots$, and vice versa.
Even for a regular expression matching a particular string, there could
still be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching,
without allowing any value under a star to be flattened
to an empty string, can be given by the following formula:
\begin{center}
	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed form formula can be calculated to be
\begin{equation}
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.
A lexer aimed at enumerating all the possible values has an exponential
worst-case runtime. Therefore it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
match the leftmost regular expression when there is a choice
and always match a sub-part as much as possible before proceeding
to the next token. For example, the above example has the POSIX value
$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
The output we want from an algorithm is a POSIX match
encoded as a value.
       
   690 The contribution of Sulzmann and Lu is an extension of Brzozowski's
       
   691 algorithm by a second phase (the first phase being building successive
       
   692 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
       
   693 is generated in case the regular expression matches  the string. 
       
   694 Pictorially, the Sulzmann and Lu algorithm is as follows:
       
   695 
       
   696 \begin{ceqn}
       
   697 \begin{equation}\label{graph:2}
       
   698 \begin{tikzcd}
       
   699 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
       
   700 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
       
   701 \end{tikzcd}
       
   702 \end{equation}
       
   703 \end{ceqn}
       
   704 
       
   705 
       
   706 \noindent
       
   707 For convenience, we shall employ the following notations: the regular
       
   708 expression we start with is $r_0$, and the given string $s$ is composed
       
   709 of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
       
   710 left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
       
   711 to the characters $c_0$, $c_1$  until we exhaust the string and obtain
       
   712 the derivative $r_n$. We test whether this derivative is
       
   713 $\textit{nullable}$ or not. If not, we know the string does not match
       
   714 $r$ and no value needs to be generated. If yes, we start building the
       
   715 values incrementally by \emph{injecting} back the characters into the
       
   716 earlier values $v_n, \ldots, v_0$. This is the second phase of the
       
   717 algorithm from the right to left. For the first value $v_n$, we call the
       
   718 function $\textit{mkeps}$, which builds a POSIX lexical value
       
   719 for how the empty string has been matched by the (nullable) regular
       
   720 expression $r_n$. This function is defined as
       
   721 
       
   722 	\begin{center}
       
   723 		\begin{tabular}{lcl}
       
   724 			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
       
   725 			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
       
   726 			& \textit{if} $\nullable(r_{1})$\\ 
       
   727 			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
       
   728 			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
       
   729 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
       
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
       
   731 		\end{tabular}
       
   732 	\end{center}
       
   733 
       
   734 
       
   735 \noindent 
       
   736 After the $\mkeps$-call, we inject back the characters one by one in order to build
       
   737 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
       
   738 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
       
   739 After injecting back $n$ characters, we get the lexical value for how $r_0$
       
matches $s$. The POSIX value is maintained throughout the process.
       
   741 For this Sulzmann and Lu defined a function that reverses
       
   742 the ``chopping off'' of characters during the derivative phase. The
       
   743 corresponding function is called \emph{injection}, written
       
   744 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   745 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   746 is a character ${c_{i-1}}$, the character we want to inject and the
       
   747 third argument is the value ${v_i}$, into which one wants to inject the
       
   748 character (it corresponds to the regular expression after the character
       
   749 has been chopped off). The result of this function is a new value. The
       
   750 definition of $\textit{inj}$ is as follows: 
       
   751 
       
   752 \begin{center}
       
   753 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   754   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   755   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   756   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   757   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   758   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   759   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   760   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   761 \end{tabular}
       
   762 \end{center}
       
   763 
       
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses basically do one thing--identify the ``holes'' in the
value into which the character is injected back.
For instance, consider the last clause, which injects back into a value
that will turn into a new star value corresponding to a star:
we know the value must be a sequence value. And we know that the first
value of that sequence corresponds to the child regex of the star
with its first character chopped off--an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value with this new iteration
added to the previous list of iterations, all under the $Stars$
top level.
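
The interplay between $\textit{mkeps}$ and $\textit{inj}$ can be sketched in
Scala as follows (building on the earlier \texttt{Rexp} sketch; the names
and the error convention are assumed and the code is meant as an
illustration of the two phases rather than as the verified algorithm):
\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (CHAR(_), Empty) => Chr(c)
}

// first phase: derivatives; second phase: mkeps followed by injections
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r) else throw new Exception("lexing error")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}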
       
   778 
       
We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well--they reflect
the size of the regular expressions by definition.

One can introduce simplification on the regexes and values, but one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications
during the injection phase.
One can then prove that the POSIX value of how
regular expressions match strings will not be affected---although this is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
       
   794 
       
   795 %Brzozowski, after giving the derivatives and simplification,
       
   796 %did not explore lexing with simplification or he may well be 
       
   797 %stuck on an efficient simplificaiton with a proof.
       
   798 %He went on to explore the use of derivatives together with 
       
   799 %automaton, and did not try lexing using derivatives.
       
   800 
       
We want to get rid of the complex and fragile rectification of values.
Can we avoid creating those intermediate values $v_1,\ldots, v_n$,
and obtain the lexing information that should already be there while
doing derivatives, in one pass, without a second injection phase?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions, called \emph{bitcodes}.
       
   811 
       
   812 \subsection*{Bit-coded Algorithm}
       
   813 Bits and bitcodes (lists of bits) are defined as:
       
   814 
       
   815 \begin{center}
       
   816 		$b ::=   1 \mid  0 \qquad
       
   817 bs ::= [] \mid b::bs    
       
   818 $
       
   819 \end{center}
       
   820 
       
   821 \noindent
       
   822 The $1$ and $0$ are not in bold in order to avoid 
       
   823 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
       
   824 bit-lists) can be used to encode values (or potentially incomplete values) in a
       
   825 compact form. This can be straightforwardly seen in the following
       
   826 coding function from values to bitcodes: 
       
   827 
       
   828 \begin{center}
       
   829 \begin{tabular}{lcl}
       
   830   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       
   831   $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
       
   832   $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
       
   833   $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
       
   834   $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
       
   835   $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
       
   836   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
       
   837                                                  code(\Stars\,vs)$
       
   838 \end{tabular}    
       
   839 \end{center} 
       
   840 
       
   841 \noindent
       
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
       
   854 
       
   855 
       
   856 %\begin{definition}[Bitdecoding of Values]\mbox{}
       
   857 \begin{center}
       
   858 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
   859   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
       
   860   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
       
   861   $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   862      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       
   863        (\Left\,v, bs_1)$\\
       
   864   $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
       
   865      $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       
   866        (\Right\,v, bs_1)$\\                           
       
   867   $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
       
   868         $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
       
   869   & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
       
   870   & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
       
   871   $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
       
   872   $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
       
   873          $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   874   & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
       
   875   & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
       
   876   
       
   877   $\textit{decode}\,bs\,r$ & $\dn$ &
       
   878      $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
       
   879   & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       
   880        \textit{else}\;\textit{None}$                       
       
   881 \end{tabular}    
       
   882 \end{center}    
       
   883 %\end{definition}
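
In Scala, the coding and decoding functions can be sketched as follows
(building on the \texttt{Rexp} and \texttt{Val} sketches above; constructor
names are assumed):
\begin{verbatim}
type Bits = List[Int]  // bits kept as 0/1 Ints, to avoid confusion with ZERO/ONE

def code(v: Val): Bits = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Left(v1) => 0 :: code(v1)
  case Right(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}

// decoding needs the regular expression to know how to interpret the bits
def decodeAux(r: Rexp, bs: Bits): (Val, Bits) = (r, bs) match {
  case (ONE, bs1) => (Empty, bs1)
  case (CHAR(c), bs1) => (Chr(c), bs1)
  case (ALT(r1, _), 0 :: bs1) =>
    val (v, bs2) = decodeAux(r1, bs1); (Left(v), bs2)
  case (ALT(_, r2), 1 :: bs1) =>
    val (v, bs2) = decodeAux(r2, bs1); (Right(v), bs2)
  case (SEQ(r1, r2), bs1) =>
    val (v1, bs2) = decodeAux(r1, bs1)
    val (v2, bs3) = decodeAux(r2, bs2)
    (Sequ(v1, v2), bs3)
  case (STAR(_), 0 :: bs1) => (Stars(Nil), bs1)
  case (STAR(r1), 1 :: bs1) =>
    val (v, bs2) = decodeAux(r1, bs1)
    decodeAux(STAR(r1), bs2) match {
      case (Stars(vs), bs3) => (Stars(v :: vs), bs3)
      case _ => throw new Exception("decoding error")
    }
}

def decode(r: Rexp, bs: Bits): Option[Val] = decodeAux(r, bs) match {
  case (v, Nil) => Some(v)
  case _ => None
}
\end{verbatim}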
       
   884 
       
Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
   889 
       
   890 \begin{center}
       
   891 \begin{tabular}{lcl}
       
   892   $\textit{a}$ & $::=$  & $\ZERO$\\
       
   893                   & $\mid$ & $_{bs}\ONE$\\
       
   894                   & $\mid$ & $_{bs}{\bf c}$\\
       
   895                   & $\mid$ & $_{bs}\sum\,as$\\
       
   896                   & $\mid$ & $_{bs}a_1\cdot a_2$\\
       
   897                   & $\mid$ & $_{bs}a^*$
       
   898 \end{tabular}    
       
   899 \end{center}  
       
   900 %(in \textit{ALTS})
       
   901 
       
   902 \noindent
       
   903 where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
       
   904 expressions and $as$ for a list of annotated regular expressions.
       
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
       
   907 We will show that these bitcodes encode information about
       
   908 the (POSIX) value that should be generated by the Sulzmann and Lu
       
   909 algorithm.
       
   910 
       
   911 
       
   912 To do lexing using annotated regular expressions, we shall first
       
   913 transform the usual (un-annotated) regular expressions into annotated
       
   914 regular expressions. This operation is called \emph{internalisation} and
       
   915 defined as follows:
       
   916 
       
   917 %\begin{definition}
       
   918 \begin{center}
       
   919 \begin{tabular}{lcl}
       
   920   $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
       
   921   $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
       
   922   $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
       
   923   $(r_1 + r_2)^\uparrow$ & $\dn$ &
       
   924   $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
       
   925   \textit{fuse}\,[1]\,r_2^\uparrow]$\\
       
   926   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
       
   927          $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
       
   928   $(r^*)^\uparrow$ & $\dn$ &
       
   929          $_{[]}(r^\uparrow)^*$\\
       
   930 \end{tabular}    
       
   931 \end{center}    
       
   932 %\end{definition}
       
   933 
       
   934 \noindent
       
   935 We use up arrows here to indicate that the basic un-annotated regular
       
   936 expressions are ``lifted up'' into something slightly more complex. In the
       
   937 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
       
   938 attach bits to the front of an annotated regular expression. Its
       
   939 definition is as follows:
       
   940 
       
   941 \begin{center}
       
   942 \begin{tabular}{lcl}
       
   943   $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
       
   944   $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
       
   945      $_{bs @ bs'}\ONE$\\
       
   946   $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
       
   947      $_{bs@bs'}{\bf c}$\\
       
   948   $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
       
   949      $_{bs@bs'}\sum\textit{as}$\\
       
   950   $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
       
   951      $_{bs@bs'}a_1 \cdot a_2$\\
       
   952   $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
       
   953      $_{bs @ bs'}a^*$
       
   954 \end{tabular}    
       
   955 \end{center}  
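
A corresponding Scala sketch of annotated regular expressions, $\textit{fuse}$
and internalisation (again with assumed constructor names, building on the
\texttt{Bits} type above) is:
\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ::: bs1)
  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
}

// lift an ordinary regular expression into an annotated one
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(0), internalise(r1)),
                    fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}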
       
   956 
       
   957 \noindent
       
   958 After internalising the regular expression, we perform successive
       
   959 derivative operations on the annotated regular expressions. This
       
   960 derivative operation is the same as what we had previously for the
       
basic regular expressions, except that we need to take care of
       
   962 the bitcodes:
       
   963 
       
   964 
       
   965 \iffalse
       
   966  %\begin{definition}{bder}
       
   967 \begin{center}
       
   968   \begin{tabular}{@{}lcl@{}}
       
   969   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   970   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   971   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
       
   972         $\textit{if}\;c=d\; \;\textit{then}\;
       
   973          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
       
   974   $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
       
   975   $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
       
   976   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   977      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   978 					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
       
   979 					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
   980   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
       
   981   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
       
   982       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       
   983        (\textit{STAR}\,[]\,r)$
       
   984 \end{tabular}    
       
   985 \end{center}    
       
   986 %\end{definition}
       
   987 
       
   988 \begin{center}
       
   989   \begin{tabular}{@{}lcl@{}}
       
   990   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   991   $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
       
   992   $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
       
   993         $\textit{if}\;c=d\; \;\textit{then}\;
       
   994          _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
       
   995   $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
       
   996   $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
       
   997   $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
       
   998      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
   999 					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
       
  1000 					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
       
  1001   & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
       
  1002   $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
       
  1003       $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       
  1004        (_{bs}\textit{STAR}\,[]\,r)$
       
  1005 \end{tabular}    
       
  1006 \end{center}    
       
  1007 %\end{definition}
       
  1008 \fi
       
  1009 
       
  1010 \begin{center}
       
  1011   \begin{tabular}{@{}lcl@{}}
       
  1012   $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
  1013   $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
       
  1014   $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
       
  1015         $\textit{if}\;c=d\; \;\textit{then}\;
       
  1016          _{bs}\ONE\;\textit{else}\;\ZERO$\\  
       
  1017   $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
       
  1018   $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
       
  1019   $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
       
  1020      $\textit{if}\;\textit{bnullable}\,a_1$\\
       
  1021 					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
       
  1022 					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
       
  1023   & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
       
  1024   $(_{bs}a^*)\,\backslash c$ & $\dn$ &
       
  1025       $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
       
  1026        (_{[]}r^*))$
       
  1027 \end{tabular}    
       
  1028 \end{center}    
       
  1029 
       
  1030 %\end{definition}
       
  1031 \noindent
       
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $r \backslash c$
to indicate that there is one more star iteration. The sequence clause
is also more subtle---the interesting case is when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later);
then the right branch of the alternative, which is $\textit{fuse} \; (\bmkeps \;  a_1) \; (a_2
\backslash c)$, will collapse the regular expression $a_1$ (as it has
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it to it. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top level of $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
       
  1053 
       
  1054 
       
  1055 %\begin{definition}[\textit{bmkeps}]\mbox{}
       
  1056 \begin{center}
       
  1057 \begin{tabular}{lcl}
       
  1058   $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
       
  1059   $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
       
  1060      $\textit{if}\;\textit{bnullable}\,a$\\
       
  1061   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
       
  1062   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
       
  1063   $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
       
  1064      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
       
  1065   $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
       
  1066      $bs \,@\, [0]$
       
  1067 \end{tabular}    
       
  1068 \end{center}    
       
  1069 %\end{definition}
       
  1070 
       
\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:
       
  1078 
       
  1079 \begin{center}
       
  1080 \begin{tabular}{lcl}
       
  1081   $\textit{blexer}\;r\,s$ & $\dn$ &
       
  1082       $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
       
  1083   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1084   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1085   & & $\;\;\textit{else}\;\textit{None}$
       
  1086 \end{tabular}
       
  1087 \end{center}
       
  1088 
       
  1089 \noindent
       
  1090 In this definition $\_\backslash s$ is the  generalisation  of the derivative
       
  1091 operation from characters to strings (just like the derivatives for un-annotated
       
  1092 regular expressions).
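
Putting the pieces together, the bit-coded lexer can be sketched in Scala as
follows (building on the earlier sketches; an illustration with assumed
names, not the verified code):
\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, as) =>
    // the first nullable alternative determines the (POSIX) bits
    bs ::: bmkeps(as.find(bnullable).get)
  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _) => bs ::: List(0)
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(r, bmkeps(a)) else None
}
\end{verbatim}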
       
  1093 
       
Remember that one of the important reasons we introduced bitcodes
is that they can make simplification more structured, and therefore
easier to prove correct.
       
  1097 
       
  1098 \subsection*{Our Simplification Rules}
       
  1099 
       
  1100 In this section we introduce aggressive (in terms of size) simplification rules
       
  1101 on annotated regular expressions
       
  1102 in order to keep derivatives small. Such simplifications are promising
       
  1103 as we have
       
  1104 generated test data that show
       
  1105 that a good tight bound can be achieved. Obviously we could only
       
  1106 partially cover  the search space as there are infinitely many regular
       
  1107 expressions and strings. 
       
  1108 
       
  1109 One modification we introduced is to allow a list of annotated regular
       
  1110 expressions in the $\sum$ constructor. This allows us to not just
       
  1111 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
  1112 also unnecessary ``copies'' of regular expressions (very similar to
       
  1113 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
  1114 modification is that we use simplification rules inspired by Antimirov's
       
  1115 work on partial derivatives. They maintain the idea that only the first
       
  1116 ``copy'' of a regular expression in an alternative contributes to the
       
  1117 calculation of a POSIX value. All subsequent copies can be pruned away from
       
  1118 the regular expression. A recursive definition of our  simplification function 
       
  1119 that looks somewhat similar to our Scala code is given below:
       
  1120 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
  1121 %Is it $ALTS$ or $ALTS$?}\\
       
  1122 
       
  1123 \begin{center}
       
  1124   \begin{tabular}{@{}lcl@{}}
       
  1125    
       
  1126   $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
  1127    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
  1128    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
  1129    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
  1130    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
  1131    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
       
  1132 
       
  1133   $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
       
  1134   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
  1135    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
       
  1136    &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
       
  1137 
       
  1138    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
  1139 \end{tabular}    
       
  1140 \end{center}    
       
  1141 
       
  1142 \noindent
       
The simplification does pattern matching on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and reduce as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later
duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:
       
  1154 
       
  1155  \begin{center}
       
  1156   \begin{tabular}{@{}lcl@{}}
       
  1157   $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
       
  1158      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
       
  1159   $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
       
  1160     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
       
  1161 \end{tabular}    
       
  1162 \end{center}  
       
  1163 
       
  1164 \noindent
       
  1165 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
  1166 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
  1167 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
       
  1168 
       
  1169 Having defined the $\simp$ function,
       
  1170 we can use the previous notation of  natural
       
  1171 extension from derivative w.r.t.~character to derivative
       
  1172 w.r.t.~string:%\comment{simp in  the [] case?}
       
  1173 
       
  1174 \begin{center}
       
  1175 \begin{tabular}{lcl}
       
  1176 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
  1177 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
  1178 \end{tabular}
       
  1179 \end{center}
       
  1180 
       
  1181 \noindent
       
  1182 to obtain an optimised version of the algorithm:
       
  1183 
       
  1184  \begin{center}
       
  1185 \begin{tabular}{lcl}
       
  1186   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
  1187       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
  1188   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1189   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1190   & & $\;\;\textit{else}\;\textit{None}$
       
  1191 \end{tabular}
       
  1192 \end{center}
       
  1193 
       
  1194 \noindent
       
This algorithm keeps the regular expression size small: for example,
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
are reduced to just 6, and the size stays constant no matter how long the
input string is.
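
The simplification phase and the optimised lexer can be sketched in Scala as
follows (a condensed illustration building on the earlier sketches; the
right-unit case for sequences is left out and the bits attached to a
$\ONE$ on the left are fused explicitly):
\begin{verbatim}
// open up nested alternatives and drop ZEROs, fusing the outer bits inwards
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def bsimp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (bsimp(a1), bsimp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s)
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(bsimp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case _ => a
}

def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bdersSimp(cs, bsimp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(r, bmkeps(a)) else None
}
\end{verbatim}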
       
  1199 
       
  1200 

Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,

allowing a ReDoS (regular expression denial-of-service) attack.

%----------------------------------------------------------------------------------------

\section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking}

The reason behind this is that regex libraries in popular language engines
often want to support richer constructs
than the most basic regular expressions, and lexing rather than matching
is needed for sub-match extraction.

   440 
There is also static analysis work on regular expressions that
have potentially exponential behaviour. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
People have also developed static analysis methods for
generating non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.
For a comprehensive analysis, please refer to Weideman's thesis
\parencite{Weideman2017Static}.
   441 
  1322 
\subsection{DFA Approach}

Exponential states.
