% ChengsongTanPhdThesis/Chapters/Inj.tex
\begin{lemma}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent
The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is the following:

\begin{lemma}\label{derStepwise}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{lemma}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s \; []$ & $\dn$ & $r$\\
$r \backslash_s \; (c :: s)$ & $\dn$ & $(r \backslash c) \backslash_s s$\\
\end{tabular}
\end{center}

\noindent
When there is no ambiguity we will use $\backslash$ to denote
string derivatives for brevity.
Brzozowski's regular-expression matcher can then be defined as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
\end{definition}

The process of building successive derivatives and then testing for
nullability can be pictured as follows:
\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}
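\noindent
The following is a minimal Scala sketch of this matcher (it is not verbatim
the evaluation code of this thesis: the constructors ZERO, ONE, SEQ and ALTS
are those used by the simplification function later in this chapter, while
CHAR and STAR are assumed in addition):
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                     // matches nothing
case object ONE extends Rexp                      // matches the empty string
case class CHAR(c: Char) extends Rexp             // matches one character
case class ALTS(r1: Rexp, r2: Rexp) extends Rexp  // alternative
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // sequence
case class STAR(r: Rexp) extends Rexp             // star

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALTS(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALTS(r1, r2) => ALTS(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALTS(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(s: String, r: Rexp): Boolean = nullable(ders(s.toList, r))
\end{verbatim}
\noindent
Here \texttt{ders} is the string derivative and \texttt{matcher} implements
the definition of $\textit{match}$ given above.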

\noindent
It can be shown relatively easily that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r  = \textit{true} \Longleftrightarrow s \in L(r)$
\end{lemma}
\begin{proof}
By the stepwise property of $\backslash$ (Lemma~\ref{derStepwise}).
\end{proof}
\noindent
If we implement the above algorithm naively, however,
it can be excruciatingly slow, as Figure~\ref{NaiveMatcher} shows.


\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    legend entries={Naive Matcher},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture} 
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}
       
   359    
       
   360 \noindent
       
   361 For this we need to introduce certain 
       
   362 rewrite rules for the intermediate results,
       
   363 such as $r + r \rightarrow r$,
       
   364 and make sure those rules do not change the 
       
   365 language of the regular expression.
       
   366 We have a simplification function (that is as simple as possible
       
   367 while having much power on making a regex simpler):
       
   368 \begin{verbatim}
       
   369 def simp(r: Rexp) : Rexp = r match {
       
   370   case SEQ(r1, r2) => 
       
   371     (simp(r1), simp(r2)) match {
       
   372       case (ZERO, _) => ZERO
       
   373       case (_, ZERO) => ZERO
       
   374       case (ONE, r2s) => r2s
       
   375       case (r1s, ONE) => r1s
       
   376       case (r1s, r2s) => SEQ(r1s, r2s)
       
   377     }
       
   378   case ALTS(r1, r2) => {
       
   379     (simp(r1), simp(r2)) match {
       
   380       case (ZERO, r2s) => r2s
       
   381       case (r1s, ZERO) => r1s
       
   382       case (r1s, r2s) =>
       
   383         if(r1s == r2s) r1s else ALTS(r1s, r2s)
       
   384     }
       
   385   }
       
   386   case r => r
       
   387 }
       
   388 \end{verbatim}
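\noindent
As a small (hypothetical) usage example, assuming a CHAR constructor for
character regular expressions, the expression $(\ONE\cdot a) + a$ is reduced
to just $a$:
\begin{verbatim}
// SEQ(ONE, CHAR('a')) simplifies to CHAR('a'); the two identical
// alternatives are then merged by the r + r -> r rule.
simp(ALTS(SEQ(ONE, CHAR('a')), CHAR('a')))   // == CHAR('a')
\end{verbatim}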
       
   389 If we repeatedly incorporate these 
       
   390 rules during the matching algorithm, 
       
   391 we have a lexer with simplification:
       
   392 \begin{verbatim}
       
   393 def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
       
   394   case Nil => simp(r)
       
   395   case c :: cs => ders_simp(cs, simp(der(c, r)))
       
   396 }
       
   397 
       
   398 def simp_matcher(s: String, r: Rexp) : Boolean = 
       
   399   nullable(ders_simp(s.toList, r))
       
   400 
       
   401 \end{verbatim}
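\noindent
As a hypothetical usage example (again assuming the CHAR and STAR
constructors and the derivative function der), the expression $(a^*)^*\cdot b$
from Figure~\ref{NaiveMatcher} is now handled comfortably:
\begin{verbatim}
// (a*)* . b  -- the example that makes the naive matcher blow up
val EVIL = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))

// the simplified derivatives stay small, so matching is fast
simp_matcher("a" * 1000 + "b", EVIL)   // true
simp_matcher("a" * 1000, EVIL)         // false
\end{verbatim}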
       
   402 \noindent
       
   403 After putting in those rules, the example of \ref{NaiveMatcher}
       
   404 is now very tame in the length of inputs:
       
   405 
       
   406 
       
   407 \begin{tikzpicture}
       
   408 \begin{axis}[
       
   409     xlabel={$n$},
       
   410     ylabel={time in secs},
       
   411     ymode = log,
       
   412     xmode = log,
       
   413     legend entries={Matcher With Simp},  
       
   414     legend pos=north west,
       
   415     legend cell align=left]
       
   416 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
       
   417 \end{axis}
       
   418 \end{tikzpicture} \label{fig:BetterMatcher}
       
   419 
       
   420 
       
   421 \noindent
       
   422 Note how the x-axis is in logarithmic scale.
To summarise, the matcher builds derivatives, tests whether the empty string
is in the language of the resulting regular expression, and applies
simplification rules along the way.
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
\citeauthor{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
Here we present the two phases of the regular-expression lexing
algorithm by Sulzmann and Lu, which is based on the function $\inj$.
They first defined a datatype for storing the
lexing information, called a \emph{value} or
sometimes also a \emph{lexical value}.  These values and regular
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for it:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
By induction on $s$, $r$ and $v_1$, following the inductive definition of the
$\POSIX$ rules. Each case is proven by a combination of
the $\POSIX$ rules and the induction hypothesis.
Probably the most cumbersome cases are the sequence and the star with non-empty iterations.

We give the reasoning for the sequence case as follows:
when we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
we know that there cannot be a longer string $s_1'$ such that $(s_1', r_1) \rightarrow v_1'$,
$(s_2', r_2) \rightarrow v_2'$ and $s_1' @ s_2' = s$ all hold.
For any split into $s_1'$ and $s_2'$ where $s_1'$ is shorter, the corresponding
values cannot form a $\POSIX$ value for $s$.
If we have some other values $v_1'$ and $v_2'$ such that
$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
then by the induction hypothesis $v_1' = v_1$ and $v_2'= v_2$,
which means this ``different'' $\POSIX$ value $\Seq(v_1', v_2')$
is in fact the same as $\Seq(v_1, v_2)$.
\end{proof}
Now we know what a $\POSIX$ value is; the problem is how do we achieve
such a value in a lexing algorithm, using derivatives?

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value in which this latest iteration
is added to the previous list of iterations, all under the $\Stars$
top level.
The POSIX value is maintained throughout the process:
\begin{lemma}\label{injPosix}
$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj \; r \; c\; v)$
\end{lemma}


Putting all the functions $\inj$, $\mkeps$ and $\backslash$ together,
and taking into consideration the possibility of a non-match,
we have a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s)\; \textit{of} $\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
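\noindent
In Scala, this definition can be transcribed directly (a sketch, assuming a
datatype Val for values and the functions mkeps and inj described above):
\begin{verbatim}
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None => None
    case Some(v) => Some(inj(r, c, v))
  }
}
\end{verbatim}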
\noindent
The central property of the $\lexer$ is that it gives the correct result
according to the $\POSIX$ rules:
By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
The $[]$ case is proven by Lemma~\ref{mePosix}, and the inductive case
by Lemma~\ref{injPosix}.
\end{proof}


Pictorially, the algorithm works as follows (for convenience, we employ the
following notation: the regular expression we start with is $r_0$, and the
given string $s$ is composed of characters $c_0 c_1 \ldots c_{n-1}$; the
values built incrementally by \emph{injecting} the characters back into the
earlier values are $v_n, \ldots, v_0$; corresponding values and regular
expressions always carry the same subscript, that is $\vdash v_i : r_i$):

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}
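\noindent
As a small worked example of the two phases, consider $r_0 = a \cdot b$ and
$s = ab$ (the values are computed with the clauses of $\mkeps$ and $\inj$
given above):
\begin{center}
\begin{tabular}{lcl}
$r_0 = a \cdot b$ & & $v_0 = \inj\; r_0\; a\; v_1 = \Seq(\Char(a), \Char(b))$\\
$r_1 = r_0\backslash a = \ONE \cdot b$ & & $v_1 = \inj\; r_1\; b\; v_2 = \Seq(\textit{Empty}, \Char(b))$\\
$r_2 = r_1\backslash b = (\ZERO \cdot b) + \ONE$ & & $v_2 = \mkeps\; r_2 = \textit{Right}(\textit{Empty})$\\
\end{tabular}
\end{center}
\noindent
The derivatives $r_1$ and $r_2$ are built in the first phase; in the second
phase $\mkeps$ produces $v_2$, and injecting $b$ and then $a$ back yields
$v_1$ and finally $v_0$, which describes how $r_0$ matches $ab$.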

\noindent
As we did earlier in this chapter for the matcher, one can
introduce simplification on the regular expressions.
However, now there is also a backward phase and we must make sure that
the values still align with the (simplified) regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regular expressions and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each step of the injection phase.
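\noindent
The following Scala fragment sketches this idea for alternatives only (it is
an illustration rather than the rectification lexer of
\cite{AusafDyckhoffUrban2016}; Left and Right are assumed to be the value
constructors for alternatives):
\begin{verbatim}
// Simplification that also returns a rectification function, mapping
// values of the simplified regex back to values of the original regex.
def simp_rect(r: Rexp): (Rexp, Val => Val) = r match {
  case ALTS(r1, r2) =>
    val (r1s, f1) = simp_rect(r1)
    val (r2s, f2) = simp_rect(r2)
    (r1s, r2s) match {
      case (ZERO, _) => (r2s, v => Right(f2(v)))  // value comes from r2
      case (_, ZERO) => (r1s, v => Left(f1(v)))   // value comes from r1
      case _ if r1s == r2s => (r1s, v => Left(f1(v)))
      case _ => (ALTS(r1s, r2s), {
        case Left(v)  => Left(f1(v))
        case Right(v) => Right(f2(v)) })
    }
  // the SEQ clauses are analogous (using the Sequ value constructor)
  case _ => (r, v => v)
}
\end{verbatim}
\noindent
During lexing one would then apply the recorded rectification function to the
value obtained for the simplified derivative before injecting the character
back.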
       

One can then prove that the POSIX value describing how a
regular expression matches a string is not affected---although this is much
harder to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

However, even with these simplification rules, we could still end up in
trouble when we encounter cases that require more involved and aggressive
simplifications.

\section{A Case Requiring More Aggressive Simplification}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 9000 nodes (when viewed as a tree),
even with simplification, as shown in Figure~\ref{fig:BetterWaterloo}.
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    legend entries={Size of derivative},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture} 
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}

That is because our lexing algorithm currently keeps a lot of
``useless'' values that will never be used.
They represent different ways of matching, and their number grows
exponentially with the length of the string.

For $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
if we do not allow any empty iterations in its lexical values,
there are $n - 1$ ``splitting points'' on $s$ at which we can independently
choose to split or not, so that the sub-strings
segmented by the chosen splitting points form different iterations.
For example, when $n=4$:
\begin{center}
\begin{tabular}{lcr}
$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (one iteration; it is further divided between the inner sequence $a^*\cdot a^*$)\\
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a},\, v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
 & $\textit{etc}.$ &
 \end{tabular}
\end{center}
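\noindent
A small hypothetical Scala helper (not part of the lexer) makes the
combinatorics explicit---enumerating all ways of cutting the string into
non-empty blocks, one block per iteration, gives $2^{n-1}$ possibilities:
\begin{verbatim}
// All ways of cutting a non-empty string into non-empty consecutive blocks.
def splits(s: List[Char]): List[List[List[Char]]] = s match {
  case Nil      => List(Nil)
  case c :: Nil => List(List(List(c)))
  case c :: cs  =>
    splits(cs).flatMap(rest =>
      List((c :: rest.head) :: rest.tail,  // extend the first block
           List(c) :: rest))               // or start a new block here
}

splits("aaaa".toList).size   // 8 = 2^3, matching the table above
\end{verbatim}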
\noindent
And for each iteration, there are still multiple ways to split
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer that keeps all the possible values will naturally
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
of a match. This means Sulzmann and Lu's injection-based algorithm
is exponential by nature.
Somehow one has to determine which
lexical values are $\POSIX$ and keep only those in a lexing algorithm.


For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
We want to keep only this value, and to discard during lexing all the
sub-parts of the regular expression that do not correspond to it.
To do this, a two-phase algorithm with rectification is a bit too fragile.
Can we not create those intermediate values $v_1,\ldots, v_n$,
and get the lexing information that should already be there while
doing derivatives in one pass, without a second injection phase?
At the same time, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information into the
regular expressions, called \emph{bitcodes}.