ninems/ninems.tex
changeset 82 3153338ec6e4
parent 81 a0df84875788
child 83 8c1195dd6136
equal deleted inserted replaced
81:a0df84875788 82:3153338ec6e4
   176 \end{center}  
   176 \end{center}  
   177 
   177 
   178 \noindent These are clearly abysmal and possibly surprising results. One
   178 \noindent These are clearly abysmal and possibly surprising results. One
   179 would expect these systems to do  much better than that---after all,
   179 would expect these systems to do  much better than that---after all,
   180 given a DFA and a string, deciding whether a string is matched by this
   180 given a DFA and a string, deciding whether a string is matched by this
   181 DFA should be linear?
   181 DFA should be linear in the size of the regular expression and
       
   182 the string?
   182 
   183 
   183 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
   184 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
   184 exhibit this exponential behaviour.  But unfortunately, such regular
   185 exhibit this super-linear behaviour.  But unfortunately, such regular
   185 expressions are not just a few outliers. They are actually 
   186 expressions are not just a few outliers. They are actually 
   186 frequent enough to have a separate name created for
   187 frequent enough to have a separate name created for
   187 them---\emph{evil regular expressions}. In empirical work, Davis et al.
   188 them---\emph{evil regular expressions}. In empirical work, Davis et al.
   188 report that they have found thousands of such evil regular expressions
   189 report that they have found thousands of such evil regular expressions
   189 in the JavaScript and Python ecosystems \cite{Davis18}.
   190 in the JavaScript and Python ecosystems \cite{Davis18}.
   190 
   191 
   191 \comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in  the Davis et al paper}
   192 \comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in  the Davis et al paper}
   192 This exponential blowup in matching algorithms sometimes causes
   193 This superlinear blowup in matching algorithms sometimes causes
   193 considerable grief in real life: for example on 20 July 2016 one evil
   194 considerable grief in real life: for example on 20 July 2016 one evil
   194 regular expression brought the webpage
   195 regular expression brought the webpage
   195 \href{http://stackexchange.com}{Stack Exchange} to its
   196 \href{http://stackexchange.com}{Stack Exchange} to its
   196 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
   197 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
   197 In this instance, a regular expression intended to just trim white
   198 In this instance, a regular expression intended to just trim white
   200 grind to a halt. This happened when a post with 20,000 white spaces was
   201 grind to a halt. This happened when a post with 20,000 white spaces was
   201 submitted, but importantly the white spaces were neither at the
   202 submitted, but importantly the white spaces were neither at the
   202 beginning nor at the end. 
   203 beginning nor at the end. 
   203 As a result, the regular expression matching
   204 As a result, the regular expression matching
   204 engine needed to backtrack over many choices.
   205 engine needed to backtrack over many choices.
   205 In this example, the time needed to process the string is not 
   206 In this example, the time needed to process the string is
   206 exactly the classical exponential case, but rather $O(n^2)$
   207 $O(n^2)$
   207 with respect to the string length. But this is enough for the 
   208 with respect to the string length. This quadratic
   208 home page of Stack Exchange to respond not fast enough to
   209 overhead is enough for the 
   209 the load balancer, which thought that there must be some
   210 home page of Stack Exchange to respond so slowly to
       
   211 the load balancer that it thought there must be some
   210 attack and therefore stopped the servers from responding to 
   212 attack and therefore stopped the servers from responding to 
   211 requests. This made the whole site unavailable.
   213 requests. This made the whole site unavailable.
   212 Another very recent example is a global outage of all Cloudflare servers
   214 Another very recent example is a global outage of all Cloudflare servers
   213 on 2 July 2019. A poorly written regular expression exhibited
   215 on 2 July 2019. A poorly written regular expression exhibited
   214 exponential behaviour and exhausted the CPUs that serve HTTP traffic.
   216 exponential behaviour and exhausted the CPUs that serve HTTP traffic.
   430 sub-parts of a string. To see this, consider a \emph{flatten} operation,
   432 sub-parts of a string. To see this, consider a \emph{flatten} operation,
   431 written $|v|$ for values. We can use this function to extract the
   433 written $|v|$ for values. We can use this function to extract the
   432 underlying string of a value $v$. For example, $|\mathit{Seq} \,
   434 underlying string of a value $v$. For example, $|\mathit{Seq} \,
   433 (\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
   435 (\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.  Using
   434 flatten, we can describe how values encode \comment{Avoid the notion
   436 flatten, we can describe how values encode \comment{Avoid the notion
   435 parse trees! Also later!!}parse trees: $\Seq\,v_1\, v_2$ encodes a tree with two
   437 parse trees! Also later!!}lexical values: $\Seq\,v_1\, v_2$ encodes a tree with two
   436 child nodes that tells how the string $|v_1| @ |v_2|$ matches the
   438 child nodes that tells how the string $|v_1| @ |v_2|$ matches the
   437 regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$ and,
   439 regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$ and,
   438 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two
   440 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two
   439 are matched is contained in the children nodes $v_1$ and $v_2$ of parent
   441 are matched is contained in the children nodes $v_1$ and $v_2$ of parent
   440 $\textit{Seq}$. 
   442 $\textit{Seq}$. 
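
\noindent To make this concrete, the following is a minimal Scala sketch of
values and the flatten operation (hypothetical names, not the actual code of
this thesis; \texttt{Chr} and \texttt{Sequ} stand for the values
$\textit{Char}$ and $\textit{Seq}$ in order to avoid clashes with Scala's
built-in types):

\begin{verbatim}
// values, mirroring Empty, Char, Seq, Left, Right and Stars from the text
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// the flatten operation |v|: the underlying string of a value
def flatten(v: Val): String = v match {
  case Empty        => ""
  case Chr(c)       => c.toString
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case Left(v1)     => flatten(v1)
  case Right(v1)    => flatten(v1)
  case Stars(vs)    => vs.map(flatten).mkString
}

// example from the text: |Seq (Char x) (Char y)| is the string "xy"
flatten(Sequ(Chr('x'), Chr('y')))   // == "xy"
\end{verbatim}
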
   491 $\textit{nullable}$ or not. If not, we know the string does not match
   493 $\textit{nullable}$ or not. If not, we know the string does not match
   492 $r$ and no value needs to be generated. If yes, we start building the
   494 $r$ and no value needs to be generated. If yes, we start building the
   493 values incrementally by \emph{injecting} back the characters into the
   495 values incrementally by \emph{injecting} back the characters into the
   494 earlier values $v_n, \ldots, v_0$. This is the second phase of the
   496 earlier values $v_n, \ldots, v_0$. This is the second phase of the
   495 algorithm, which operates from right to left. For the first value $v_n$, we call the
   497 algorithm, which operates from right to left. For the first value $v_n$, we call the
   496 function $\textit{mkeps}$, which builds the \comment{Avoid}parse tree
   498 function $\textit{mkeps}$, which builds the \comment{Avoid}lexical value
   497 for how the empty string has been matched by the (nullable) regular
   499 for how the empty string has been matched by the (nullable) regular
   498 expression $r_n$. This function is defined as
   500 expression $r_n$. This function is defined as
   499 
   501 
   500 	\begin{center}
   502 	\begin{center}
   501 		\begin{tabular}{lcl}
   503 		\begin{tabular}{lcl}
   515 also that in case of alternatives we give preference to the
   517 also that in case of alternatives we give preference to the
   516 regular expression on the left-hand side. This will become
   518 regular expression on the left-hand side. This will become
   517 important later on when we determine which value is calculated.
   519 important later on when we determine which value is calculated.
   518 
   520 
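
\noindent For reference, a Scala sketch of this function could look as
follows (using the value datatype from the earlier sketch; the regular
expression constructors and the $\textit{nullable}$ function are the ones
from the earlier sections, here under hypothetical Scala names). Note how
the alternative clause prefers the left-hand side:

\begin{verbatim}
// regular expressions, as introduced in the earlier sections
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// mkeps: a value for how a nullable regular expression matches ""
// (only ever called on nullable regular expressions)
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))  // prefer left
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}
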
   519 After the $\mkeps$-call, we inject back the characters one by one in order to build
   521 After the $\mkeps$-call, we inject back the characters one by one in order to build
   520 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
   522 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
   521 ($s_i = c_i \ldots c_{n-1}$) from the previous parse tree $v_{i+1}$.
   523 ($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
   522 After injecting back $n$ characters, we get the parse tree for how $r_0$
   524 After injecting back $n$ characters, we get the lexical value for how $r_0$
   523 matches $s$. For this Sulzmann and Lu defined a function that reverses
   525 matches $s$. For this Sulzmann and Lu defined a function that reverses
   524 the ``chopping off'' of characters during the derivative phase. The
   526 the ``chopping off'' of characters during the derivative phase. The
   525 corresponding function is called \emph{injection}, written
   527 corresponding function is called \emph{injection}, written
   526 $\textit{inj}$; it takes three arguments: the first one is a regular
   528 $\textit{inj}$; it takes three arguments: the first one is a regular
   527 expression ${r_{i-1}}$, before the character is chopped off, the second
   529 expression ${r_{i-1}}$, before the character is chopped off, the second
   608 \end{tabular}
   610 \end{tabular}
   609 \end{center}
   611 \end{center}
   610 
   612 
   611 \noindent
   613 \noindent
   612 In  case $r_3$ is nullable, we can call $\textit{mkeps}$ 
   614 In  case $r_3$ is nullable, we can call $\textit{mkeps}$ 
   613 to construct a parse tree for how $r_3$ matched the string $abc$. 
   615 to construct a lexical value for how $r_3$ matched the string $abc$. 
   614 This function gives the following value $v_3$: 
   616 This function gives the following value $v_3$: 
   615 
   617 
   616 
   618 
   617 \begin{center}
   619 \begin{center}
   618 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
   620 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
   619 \end{center}
   621 \end{center}
   620 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
   622 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
   621 
   623 
   622 \begin{center}\comment{better layout}
   624 \begin{center}\comment{better layout}
   623    $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
   625 	\begin{tabular}{lcl}
   624   \cdot r^*) +((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*+(\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* ).$
   626 		$( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*}$ & $+$ & $(\ZERO+\ZERO+\ZERO + \ONE + \ZERO)
       
   627   		\cdot r^*) +$\\
       
   628 		$((\ZERO+\ONE+\ZERO  + \ZERO + \ZERO) \cdot r^*$ & $+$ & $(\ZERO+\ZERO+\ZERO  + \ONE + \ZERO) \cdot r^* ).$
       
   629   	\end{tabular}
   625  \end{center}
   630  \end{center}
   626 
   631 
   627 \noindent
   632 \noindent
   628  Note that the leftmost location of term $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
   633  Note that the leftmost location of term $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot
   629  \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
   634  \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
   653 expressions. The first one is an alternative; we take the rightmost
   658 expressions. The first one is an alternative; we take the rightmost
   654 alternative---whose language contains the empty string. The second
   659 alternative---whose language contains the empty string. The second
   655 nullable regular expression is a Kleene star. $\Stars$ tells us how it
   660 nullable regular expression is a Kleene star. $\Stars$ tells us how it
   656 matches the empty string: by 0 iterations, which corresponds to
   661 matches the empty string: by 0 iterations, which corresponds to
   657 $\ONE$. Now $\textit{inj}$ injects characters back and incrementally
   662 $\ONE$. Now $\textit{inj}$ injects characters back and incrementally
   658 builds a parse tree based on $v_3$. Using the value $v_3$, the character
   663 builds a lexical value based on $v_3$. Using the value $v_3$, the character
   659 $c$, and the regular expression $r_2$, we can recover how $r_2$ matched
   664 $c$, and the regular expression $r_2$, we can recover how $r_2$ matched
   660 the string $[c]$: $\textit{inj} \; r_2 \; c \; v_3$ gives us
   665 the string $[c]$: $\textit{inj} \; r_2 \; c \; v_3$ gives us
   661  \begin{center}
   666  \begin{center}
   662  $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
   667  $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
   663  \end{center}
   668  \end{center}
   669  \begin{center}
   674  \begin{center}
   670  $r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$
   675  $r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$
   671  \end{center}
   676  \end{center}
   672   matched  the string $bc$ before it split into 2 pieces. 
   677   matched  the string $bc$ before it split into 2 pieces. 
   673   Finally, after injecting character $a$ back to $v_1$, 
   678   Finally, after injecting character $a$ back to $v_1$, 
   674   we get  the parse tree 
   679   we get the lexical value 
   675   \begin{center}
   680   \begin{center}
   676   $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
   681   $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$
   677   \end{center}
   682   \end{center}
   678    for how $r$ matched $abc$. This completes the algorithm.
   683    for how $r$ matched $abc$. This completes the algorithm.
   679    
   684    
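
\noindent The two phases can be summarised in a few lines of Scala. The
sketch below reuses the datatypes and functions from the earlier sketches;
the derivative function $\textit{der}$ is the one from the earlier sections,
and $\textit{inj}$ follows the description above (hypothetical names, not
the thesis' actual code):

\begin{verbatim}
// derivative of a regular expression w.r.t. a character
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// inj r c v: turn a value v for (der c r) back into a value for r
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty)                 => Chr(d)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}

// first phase: derivatives; second phase: mkeps followed by injections
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil     => if (nullable(r)) mkeps(r) else sys.error("not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}

// for the running example, lex(r, "abc".toList) yields the value v_0 above
\end{verbatim}
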
   680 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
   685 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. 
   681 Readers might have noticed that the parse tree information is actually
   686 Readers might have noticed that the lexical value information is actually
   682 already available when doing derivatives. For example, immediately after
   687 already available when doing derivatives. For example, immediately after
   683 the operation $\backslash a$ we know that if we want to match a string
   688 the operation $\backslash a$ we know that if we want to match a string
   684 that starts with $a$, we can either take the initial match to be 
   689 that starts with $a$, we can either take the initial match to be 
   685 
   690 
   686  \begin{center}
   691  \begin{center}
   697 and $abc$ is on the right. Which of these alternatives is chosen
   702 and $abc$ is on the right. Which of these alternatives is chosen
   698 later does not affect their relative position because the algorithm does
   703 later does not affect their relative position because the algorithm does
   699 not change this order. If this parsing information can be determined and
   704 not change this order. If this parsing information can be determined and
   700 does not change because of later derivatives, there is no point in
   705 does not change because of later derivatives, there is no point in
   701 traversing this information twice. This leads to an optimisation---if we
   706 traversing this information twice. This leads to an optimisation---if we
   702 store the information for parse trees inside the regular expression,
   707 store the information for lexical values inside the regular expression,
   703 update it when we take derivatives, and collect the information
   708 update it when we take derivatives, and collect the information
   704 when finished with derivatives and call $\textit{mkeps}$ for deciding which
   709 when finished with derivatives and call $\textit{mkeps}$ for deciding which
   705 branch is POSIX, we can generate the parse tree in one pass, instead of
   710 branch is POSIX, we can generate the lexical value in one pass, instead of
   706 doing the remaining $n$ injections. This leads to Sulzmann and Lu's novel
   711 doing the remaining $n$ injections. This leads to Sulzmann and Lu's novel
   707 idea of using bitcodes in derivatives.
   712 idea of using bitcodes in derivatives.
   708 
   713 
   709 In the next section, we shall focus on the bitcoded algorithm and the
   714 In the next section, we shall focus on the bitcoded algorithm and the
   710 process of simplification of regular expressions. This is needed in
   715 process of simplification of regular expressions. This is needed in
   713 state-of-the-art.
   718 state-of-the-art.
   714 
   719 
   715 
   720 
   716 \section{Simplification of Regular Expressions}
   721 \section{Simplification of Regular Expressions}
   717 
   722 
   718 Using bitcodes to guide  parsing is not a novel idea. It was applied to
   723 Using bitcodes to guide  parsing is not a novel idea. It was applied to context
   719 context free grammars and then adapted by Henglein and Nielson for
   724 free grammars and then adapted by Henglein and Nielson for efficient regular
   720 efficient regular expression \comment{?}parsing using DFAs~\cite{nielson11bcre}.
   725 expression \comment{?}\comment{i have changed parsing into lexing,
   721 Sulzmann and Lu took this idea of bitcodes a step further by integrating
   726 however, parsing is the terminology Henglein and Nielson used} lexing using
   722 bitcodes into derivatives. The reason why we want to use bitcodes in
   727 DFAs~\cite{nielson11bcre}.  Sulzmann and Lu took this idea of bitcodes a step
   723 this project is that we want to introduce more aggressive
   728 further by integrating bitcodes into derivatives. The reason why we want to use
   724 simplification rules in order to keep the size of derivatives small
   729 bitcodes in this project is that we want to introduce more aggressive
   725 throughout. This is because the main drawback of building successive
   730 simplification rules in order to keep the size of derivatives small throughout.
   726 derivatives according to Brzozowski's definition is that they can grow
   731 This is because the main drawback of building successive derivatives according
   727 very quickly in size. This is mainly due to the fact that the derivative
   732 to Brzozowski's definition is that they can grow very quickly in size. This is
   728 operation often generates ``useless'' $\ZERO$s and $\ONE$s in
   733 mainly due to the fact that the derivative operation often generates
   729 derivatives.  As a result, if implemented naively both algorithms by
   734 ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if implemented
   730 Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
   735 naively both algorithms by Brzozowski and by Sulzmann and Lu are excruciatingly
   731 when starting with the regular expression $(a + aa)^*$ and building 12
   736 slow. For example when starting with the regular expression $(a + aa)^*$ and
   732 successive derivatives w.r.t.~the character $a$, one obtains a
   737 building 12 successive derivatives w.r.t.~the character $a$, one obtains a
   733 derivative regular expression with more than 8000 nodes (when viewed as
   738 derivative regular expression with more than 8000 nodes (when viewed as a
   734 a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse
   739 tree). Operations like $\textit{der}$ and $\nullable$ need to traverse such
   735 such trees and consequently the bigger the size of the derivative the
   740 trees and consequently the bigger the size of the derivative the slower the
   736 slower the algorithm. 
   741 algorithm. 
   737 
   742 
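
\noindent This blow-up is easy to observe with the functions sketched
earlier: the snippet below (again a sketch with hypothetical names) builds
the twelve derivatives of $(a + aa)^*$ and counts the nodes of each
intermediate regular expression.

\begin{verbatim}
// number of nodes of a regular expression when viewed as a tree
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2)          => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
  case STAR(r1)             => 1 + size(r1)
}

// 12 successive derivatives of (a + aa)* w.r.t. the character a
val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
val ders = List.iterate(r: Rexp, 13)(r1 => der('a', r1))
ders.map(size)   // the sizes grow quickly; as mentioned above, the
                 // twelfth derivative has more than 8000 nodes
\end{verbatim}
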
   738 Fortunately, one can simplify regular expressions after each derivative
   743 Fortunately, one can simplify regular expressions after each derivative
   739 step. Various simplifications of regular expressions are possible, such
   744 step. Various simplifications of regular expressions are possible, such
   740 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
   745 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
   741 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
   746 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
   768  \end{center}
   773  \end{center}
   769 
   774 
   770 \noindent
   775 \noindent
   771 A partial derivative of a regular expression $r$ is essentially a set of
   776 A partial derivative of a regular expression $r$ is essentially a set of
   772 regular expressions that are either $r$'s child expressions or a
   777 regular expressions that are either $r$'s child expressions or a
   773 concatenation of them. Antimirov has proved a tight bound of the size of
   778 concatenation of them. Antimirov has proved a tight bound of the sum of
       
   779 the size of
   774 \emph{all} partial derivatives no matter what the string looks like.
   780 \emph{all} partial derivatives no matter what the string looks like.
   775 Roughly speaking the size will be quadruple in the size of the regular
   781 Roughly speaking, this sum will be at most cubic in the size of the regular
   776 expression.\comment{Are you sure? I have just proved that the sum of
   782 expression.\comment{Are you sure? I have just proved that the sum of
   777 sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely
   783 sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely
   778 not the best bound.} If we want the size of derivatives in Sulzmann and
   784 not the best bound.}\comment{this is just a very rough guess i made
   779 Lu's algorithm to stay equal or below this bound, we would need more
   785 2 months ago. i will take your suggested bound here.}
       
   786 If we want the size of derivatives in Sulzmann and
       
   787 Lu's algorithm to stay below this bound, we would need more
   780 aggressive simplifications. Essentially we need to delete useless
   788 aggressive simplifications. Essentially we need to delete useless
   781 $\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
   789 $\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible.
   782 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
   790 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
   783 get $a\cdot c +  b \cdot c + b \cdot c$, and then simplified to just $a
   791 get $a\cdot c +  b \cdot c + b \cdot c$, and then simplified to just $a
   784 \cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+
   792 \cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+
   878 \begin{center}
   886 \begin{center}
   879 \begin{tabular}{lcl}
   887 \begin{tabular}{lcl}
   880   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
   888   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
   881                   & $\mid$ & $\textit{ONE}\;\;bs$\\
   889                   & $\mid$ & $\textit{ONE}\;\;bs$\\
   882                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
   890                   & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
   883                   & $\mid$ & $\textit{ALT}\;\;bs\,a_1 \, a_2$\\
   891                   & $\mid$ & $\textit{ALTS}\;\;bs\,a_1 \, a_2$\\
   884                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
   892                   & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
   885                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
   893                   & $\mid$ & $\textit{STAR}\;\;bs\,a$
   886 \end{tabular}    
   894 \end{tabular}    
   887 \end{center}  
   895 \end{center}  
   888 %(in \textit{ALT})
   896 %(in \textit{ALTS})
   889 
   897 
   890 \noindent
   898 \noindent
   891 where $bs$ stands for bitcodes, and $a$  for $\bold{a}$nnotated regular
   899 where $bs$ stands for bitcodes, and $a$  for $\bold{a}$nnotated regular
   892 expressions. We will show that these bitcodes encode information about
   900 expressions. We will show that these bitcodes encode information about
   893 the (POSIX) value that should be generated by the Sulzmann and Lu
   901 the (POSIX) value that should be generated by the Sulzmann and Lu
   904 \begin{tabular}{lcl}
   912 \begin{tabular}{lcl}
   905   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
   913   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
   906   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
   914   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
   907   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
   915   $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
   908   $(r_1 + r_2)^\uparrow$ & $\dn$ &
   916   $(r_1 + r_2)^\uparrow$ & $\dn$ &
   909          $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
   917          $\textit{ALTS}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
   910                             (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
   918                             (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
   911   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
   919   $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
   912          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
   920          $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
   913   $(r^*)^\uparrow$ & $\dn$ &
   921   $(r^*)^\uparrow$ & $\dn$ &
   914          $\textit{STAR}\;[]\,r^\uparrow$\\
   922          $\textit{STAR}\;[]\,r^\uparrow$\\
   928   $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
   936   $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
   929   $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
   937   $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
   930      $\textit{ONE}\,(bs\,@\,bs')$\\
   938      $\textit{ONE}\,(bs\,@\,bs')$\\
   931   $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
   939   $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
   932      $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
   940      $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
   933   $\textit{fuse}\;bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
   941   $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,a_1\,a_2)$ & $\dn$ &
   934      $\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\
   942      $\textit{ALTS}\,(bs\,@\,bs')\,a_1\,a_2$\\
   935   $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
   943   $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
   936      $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
   944      $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
   937   $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
   945   $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
   938      $\textit{STAR}\,(bs\,@\,bs')\,a$
   946      $\textit{STAR}\,(bs\,@\,bs')\,a$
   939 \end{tabular}    
   947 \end{tabular}    
   942 \noindent
   950 \noindent
   943 After internalising the regular expression, we perform successive
   951 After internalising the regular expression, we perform successive
   944 derivative operations on the annotated regular expressions. This
   952 derivative operations on the annotated regular expressions. This
   945 derivative operation is the same as what we had previously for the
   953 derivative operation is the same as what we had previously for the
   946 basic regular expressions, except that we need to take care of
   954 basic regular expressions, except that we need to take care of
   947 the bitcodes:\comment{You need to be consistent with  ALTS and ALT; ALT
   955 the bitcodes:\comment{You need to be consistent with  ALTS and ALT; ALT
   948 is just an abbreviation; derivations and so on are defined for
   956 is just an abbreviation; derivations and so on are defined for
   949 ALTS}\comment{no this is not the case, ALT for 2 regexes, ALTS for a
   957 ALTS}\comment{no this is not the case, ALT for 2 regexes, ALTS for a
   950 list...\textcolor{blue}{This does not make sense to me. CU}}
   958 list...\textcolor{blue}{This does not make sense to me. CU}}
   951 
   959 
   952  %\begin{definition}{bder}
   960  %\begin{definition}{bder}
   955   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   963   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   956   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   964   $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   957   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
   965   $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
   958         $\textit{if}\;c=d\; \;\textit{then}\;
   966         $\textit{if}\;c=d\; \;\textit{then}\;
   959          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
   967          \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
   960   $(\textit{ALT}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
   968   $(\textit{ALTS}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
   961         $\textit{ALT}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
   969         $\textit{ALTS}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
   962   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
   970   $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
   963      $\textit{if}\;\textit{bnullable}\,a_1$\\
   971      $\textit{if}\;\textit{bnullable}\,a_1$\\
   964   & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
   972   & &$\textit{then}\;\textit{ALTS}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
   965   & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
   973   & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
   966   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
   974   & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
   967   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
   975   $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
   968       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (a\,\backslash c))\,
   976       $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (a\,\backslash c))\,
   969        (\textit{STAR}\,[]\,a)$
   977        (\textit{STAR}\,[]\,a)$
   970 \end{tabular}    
   978 \end{tabular}    
   983 then the right branch of $ALTS$, which is $\textit{fuse} \; (\textit{bmkeps} \; a_1)\, (a_2
   991 then the right branch of $ALTS$, which is $\textit{fuse} \; (\textit{bmkeps} \; a_1)\, (a_2
   984 \backslash c)$, will collapse the regular expression $a_1$ (as it has
   992 \backslash c)$, will collapse the regular expression $a_1$ (as it has
   985 already been fully matched) and store the parsing information at the
   993 already been fully matched) and store the parsing information at the
   986 head of the regular expression $a_2 \backslash c$ by fusing to it. The
   994 head of the regular expression $a_2 \backslash c$ by fusing to it. The
   987 bitsequence $bs$, which was initially attached to the head of $SEQ$, has
   995 bitsequence $bs$, which was initially attached to the head of $SEQ$, has
   988 now been elevated to the top-level of $ALT$, as this information will be
   996 now been elevated to the top-level of $ALTS$, as this information will be
   989 needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
   997 needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
   990 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
   998 to $a_1$ or $ a_2$. After building these derivatives and maintaining all
   991 the lexing information, we complete the lexing by collecting the
   999 the lexing information, we complete the lexing by collecting the
   992 bitcodes using a generalised version of the $\textit{mkeps}$ function
  1000 bitcodes using a generalised version of the $\textit{mkeps}$ function
   993 for annotated regular expressions, called $\textit{bmkeps}$:
  1001 for annotated regular expressions, called $\textit{bmkeps}$:
   995 
  1003 
   996 %\begin{definition}[\textit{bmkeps}]\mbox{}
  1004 %\begin{definition}[\textit{bmkeps}]\mbox{}
   997 \begin{center}
  1005 \begin{center}
   998 \begin{tabular}{lcl}
  1006 \begin{tabular}{lcl}
   999   $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
  1007   $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
  1000   $\textit{bmkeps}\,(\textit{ALT}\;bs\,a_1\,a_2)$ & $\dn$ &
  1008   $\textit{bmkeps}\,(\textit{ALTS}\;bs\,a_1\,a_2)$ & $\dn$ &
  1001      $\textit{if}\;\textit{bnullable}\,a_1$\\
  1009      $\textit{if}\;\textit{bnullable}\,a_1$\\
  1002   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
  1010   & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
  1003   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
  1011   & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
  1004   $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
  1012   $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
  1005      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  1013      $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  1048 modification is that we use simplification rules inspired by Antimirov's
  1056 modification is that we use simplification rules inspired by Antimirov's
  1049 work on partial derivatives. They maintain the idea that only the first
  1057 work on partial derivatives. They maintain the idea that only the first
  1050 ``copy'' of a regular expression in an alternative contributes to the
  1058 ``copy'' of a regular expression in an alternative contributes to the
  1051 calculation of a POSIX value. All subsequent copies can be pruned away from
  1059 calculation of a POSIX value. All subsequent copies can be pruned away from
  1052 the regular expression. A recursive definition of our  simplification function 
  1060 the regular expression. A recursive definition of our  simplification function 
  1053 that looks somewhat similar to our Scala code is given below:\comment{Use $\ZERO$, $\ONE$ and so on. 
  1061 that looks somewhat similar to our Scala code is given below:
  1054 Is it $ALT$ or $ALTS$?}\\
  1062 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
  1063 %Is it $ALTS$ or $ALTS$?}\\
  1055 
  1064 
  1056 \begin{center}
  1065 \begin{center}
  1057   \begin{tabular}{@{}lcl@{}}
  1066   \begin{tabular}{@{}lcl@{}}
  1058    
  1067    
  1059   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
  1068   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
  1149 outlined in this section):
  1158 outlined in this section):
  1150 
  1159 
  1151 \begin{quote}\it
  1160 \begin{quote}\it
  1152   ``Correctness Claim: We further claim that the incremental parsing
  1161   ``Correctness Claim: We further claim that the incremental parsing
  1153   method in Figure~5 in combination with the simplification steps in
  1162   method in Figure~5 in combination with the simplification steps in
  1154   Figure 6 yields POSIX parse trees. We have tested this claim
  1163   Figure 6 yields POSIX parse trees. We have tested this claim
  1155   extensively by using the method in Figure~3 as a reference but yet
  1164   extensively by using the method in Figure~3 as a reference but yet
  1156   have to work out all proof details.''
  1165   have to work out all proof details.''
  1157 \end{quote}  
  1166 \end{quote}  
  1158 
  1167 
  1159 \noindent We would like to settle this correctness claim. It is relatively
  1168 \noindent (The so-called parse trees in this quote mean lexical values.)
       
  1169 We would like to settle this correctness claim. It is relatively
  1160 straightforward to establish that after one simplification step, the part of a
  1170 straightforward to establish that after one simplification step, the part of a
  1161 nullable derivative that corresponds to a POSIX value remains intact and can
  1171 nullable derivative that corresponds to a POSIX value remains intact and can
  1162 still be collected, in other words, we can show that\comment{Double-check....I
  1172 still be collected, in other words, we can show that\comment{Double-check....I
  1163 think this  is not the case}
  1173 think this  is not the case}
  1164 %\comment{If i remember correctly, you have proved this lemma.
  1174 %\comment{If i remember correctly, you have proved this lemma.
  1167 %you proved something like simplification does not affect $\textit{bmkeps}$ results?
  1177 %you proved something like simplification does not affect $\textit{bmkeps}$ results?
  1168 %Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
  1178 %Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
  1169 %to a regex. Maybe it works now.}
  1179 %to a regex. Maybe it works now.}
  1170 
  1180 
  1171 \begin{center}
  1181 \begin{center}
  1172 $\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;( a\; \textit{bnullable} and \textit{decode}(r, \textit{bmkeps}(a)) is a \textit{POSIX} value)$
  1182 	$\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{bsimp} \; a)$, provided $a$ is $\textit{bnullable}$
  1173 \end{center}
  1183 \end{center} 
  1174 \comment{\textcolor{blue}{I proved $bmkeps\,(bsimp\,a) = bmkeps\,a$ provided $a$ is 
       
  1175 $\textit{bnullable}$}}
       
  1176 
  1184 
  1177 \noindent
  1185 \noindent
  1178 as this basically comes down to proving that actions like removing the
  1186 as this basically comes down to proving that actions like removing the
  1179 additional $r$ in $r+r$ do not delete important POSIX information in
  1187 additional $r$ in $r+r$ do not delete important POSIX information in
  1180 a regular expression. The hard part of this proof is to establish that
  1188 a regular expression. The hard part of this proof is to establish that
  1181 
  1189 
  1182 \begin{center}
  1190 \begin{center}
  1183 	$\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$
  1191 	$ \textit{blexer}\_{simp}(r, \; s) =  \textit{blexer}(r, \; s)$
  1184 \end{center}\comment{This is not true either...look at the definition blexer/blexer-simp}
  1192 \end{center}\comment{This is not true either...look at the definition blexer/blexer-simp}
  1185 
  1193 
  1186 \noindent That is, if we take derivatives of the regular expression $r$ and then
  1194 \noindent That is, if we take derivatives of the regular expression $r$ and then
  1187 simplify them, and repeat this process until we exhaust the string, we get a
  1195 simplify them, and repeat this process until we exhaust the string, we get a
  1188 regular expression $r''$ ($\textit{LHS}$) that provides the POSIX matching
  1196 regular expression $r''$ ($\textit{LHS}$) that provides the POSIX matching
  1189 information, which is exactly the same as the result $r'$ ($\textit{RHS}$) of the
  1197 information, which is exactly the same as the result $r'$ ($\textit{RHS}$) of the
  1190 normal derivative algorithm that only takes derivatives repeatedly and has no
  1198 normal derivative algorithm that only takes derivatives repeatedly and has no
  1191 simplification at all.  This might seem at first glance very unintuitive, as
  1199 simplification at all.  This might seem at first glance very unintuitive, as
  1192 the $r'$ is exponentially larger than $r''$, but can be explained in the
  1200 the $r'$ could be exponentially larger than $r''$, but can be explained in the
  1193 following way: we are pruning away the possible matches that are not POSIX.
  1201 following way: we are pruning away the possible matches that are not POSIX.
  1194 Since there are exponentially non-POSIX matchings and only 1 POSIX matching, it
  1202 Since there could be exponentially many 
       
  1203 non-POSIX matchings and only 1 POSIX matching, it
  1195 is understandable that our $r''$ can be a lot smaller.  We can still provide
  1204 is understandable that our $r''$ can be a lot smaller.  We can still provide
  1196 the same POSIX value if there is one.  This is not as straightforward as the
  1205 the same POSIX value if there is one.  This is not as straightforward as the
  1197 previous proposition, as the two regular expressions $r'$ and $r''$ might have
  1206 previous proposition, as the two regular expressions $r'$ and $r''$ might have
  1198 become very different regular expressions.  The crucial point is to find the
  1207 become very different.  The crucial point is to find the
  1199 $\textit{POSIX}$  information of a regular expression and how it is modified,
  1208 $\textit{POSIX}$  information of a regular expression and how it is modified,
  1200 augmented and propagated 
  1209 augmented and propagated 
  1201 during simplification in parallel with the regular expression that
  1210 during simplification in parallel with the regular expression that
  1202 has not been simplified in the subsequent derivative operations.  To aid this,
  1211 has not been simplified in the subsequent derivative operations.  To aid this,
  1203 we use the helping function retrieve described by Sulzmann and Lu: \\definition
  1212 we use the helper function retrieve described by Sulzmann and Lu: \\definition
  1204 of retrieve TODO\comment{Did not read further}\\
  1213 of retrieve TODO\comment{Did not read further}\\
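
\noindent As a rough indication of what $\textit{retrieve}$ looks like, the
following Scala sketch is based on Sulzmann and Lu's description. The names
\texttt{AZERO}, \texttt{AONE}, etc.\ for annotated regular expressions are
hypothetical, the value datatype is the one sketched earlier, and the bits
$\Z$ and $\S$ follow the conventions of $\textit{internalise}$ and
$\textit{fuse}$ above.

\begin{verbatim}
abstract class Bit
case object Z extends Bit
case object S extends Bit

// annotated regular expressions, mirroring the grammar given above
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

// retrieve walks an annotated regular expression and a value in lock-step
// and collects the bits that encode that value
def retrieve(a: ARexp, v: Val): List[Bit] = (a, v) match {
  case (AONE(bs), Empty)                => bs
  case (ACHAR(bs, _), Chr(_))           => bs
  case (AALTS(bs, a1, _), Left(v1))     => bs ++ retrieve(a1, v1)
  case (AALTS(bs, _, a2), Right(v2))    => bs ++ retrieve(a2, v2)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) => bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil))       => bs ++ List(S)   // S ends a star
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>                 // Z: one more iteration
    bs ++ List(Z) ++ retrieve(a1, v1) ++ retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}
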
  1205 This function assembles the bitcode that corresponds to a parse tree for how
  1214 This function assembles the bitcode 
  1206 the current derivative matches the suffix of the string(the characters that
  1215 %that corresponds to a lexical value for how
  1207 have not yet appeared, but will appear as the successive derivatives go on,
  1216 %the current derivative matches the suffix of the string(the characters that
  1208 how do we get this "future" information? By the value $v$, which is
  1217 %have not yet appeared, but will appear as the successive derivatives go on.
  1209 computed by a pass of the algorithm that uses
  1218 %How do we get this "future" information? By the value $v$, which is
  1210 $inj$ as described in the previous section).  Sulzmann and Lu used this
  1219 %computed by a pass of the algorithm that uses
       
  1220 %$inj$ as described in the previous section).  
       
  1221 using information from both the derivative regular expression and the value.
       
  1222 Sulzmann and Lu used this
  1211 to connect the bitcoded algorithm to the older algorithm by the following
  1223 to connect the bitcoded algorithm to the older algorithm by the following
  1212 equation:
  1224 equation:
  1213  
  1225  
  1214  \begin{center} $\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\;
  1226  \begin{center} $\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\;
  1215 	 ((\textit{internalise}\; r)\backslash_{simp} c) v)$ 
  1227 	 ((\textit{internalise}\; r)\backslash_{simp} c) v)$ 
  1216  \end{center} 
  1228  \end{center} 
  1217  A little fact that needs to be stated to help comprehension:
  1229  A little fact that needs to be stated to aid comprehension:
  1218 \begin{center} 
  1230 \begin{center} 
  1219 	$r^\uparrow = a$ ($a$ stands for \textit{annotated}).
  1231 	$r^\uparrow = a$ ($a$ stands for \textit{annotated}).
  1220 \end{center} 
  1232 \end{center} 
  1221 Ausaf and Urban also used this fact to prove  the
  1233 Ausaf and Urban also used this fact to prove  the
  1222 correctness of the bitcoded algorithm without simplification.  Our purpose
  1234 correctness of the bitcoded algorithm without simplification.  Our purpose
  1224 \begin{center}
  1236 \begin{center}
  1225 $ \textit{retrieve} \;
  1237 $ \textit{retrieve} \;
  1226 a \; v \;=\; \textit{retrieve}  \; \textit{simp}(a) \; v'.$
  1238 a \; v \;=\; \textit{retrieve}  \; \textit{simp}(a) \; v'.$
  1227 \end{center}
  1239 \end{center}
  1228 The idea
  1240 The idea
  1229 is that using $v'$, a simplified version of $v$ that possibly had gone
  1241 is that using $v'$, a simplified version of $v$ that had gone
  1230 through the same simplification step as $\textit{simp}(a)$ we are
  1242 through the same simplification step as $\textit{simp}(a)$, we are
  1231 able to extract the bit-sequence that gives the same parsing
  1243 able to extract the bit-sequence that gives the same parsing
  1232 information as the unsimplified one.  After establishing this, we
  1244 information as the unsimplified one.  After establishing this, we
  1233 might be able to finally bridge the gap and prove
  1245 might be able to finally bridge the gap and prove
  1234 \begin{center}
  1246 \begin{center}
  1235 $\textit{retrieve} \; r   \backslash  s \; v = \;\textit{retrieve} \;
  1247 $\textit{retrieve} \; r^\uparrow   \backslash  s \; v = \;\textit{retrieve} \;
  1236 \textit{simp}(r)  \backslash  s \; v'$
  1248 \textit{bsimp}(r^\uparrow)  \backslash  s \; v'$
  1237 \end{center}
  1249 \end{center}
  1238 and subsequently
  1250 and subsequently
  1239 \begin{center}
  1251 \begin{center}
  1240 $\textit{retrieve} \; r \backslash  s \; v\; = \; \textit{retrieve} \;
  1252 $\textit{retrieve} \; r^\uparrow \backslash  s \; v\; = \; \textit{retrieve} \;
  1241 r  \backslash_{simp}   s \; v'$.
  1253 r^\uparrow  \backslash_{simp}   s \; v'$.
  1242 \end{center}
  1254 \end{center}
       
  1255 The $\textit{LHS}$ of the above equation is the bitcode we want.
  1243 This proves that our simplified
  1256 This proves that our simplified
  1244 version of the regular expression still contains all the bitcodes needed.
  1257 version of the regular expression still contains all the bitcodes needed.
  1245 
  1258 The task here is to find a way to compute the correct $v'$.
  1246 
  1259 
  1247 The second task is to speed up the more aggressive simplification.
  1260 The second task is to speed up the more aggressive simplification.
  1248 Currently it is slower than a naive simplification (the naive version as
  1261 Currently it is slower than a naive simplification (the naive version as
  1249 implemented in ADU can, of course, explode in some cases). So it needs to
  1262 implemented in ADU can, of course, explode in some cases). So it needs to
  1250 be explored how to make it faster. One possibility would be to explore
  1263 be explored how to make it faster. One possibility would be to explore