ninems/ninems.tex
\end{center}  

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by this
DFA should take time linear in the length of the string.
   182 
   182 
Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this exponential behaviour. But unfortunately, such regular
expressions are not just a few outliers. They are actually frequent
enough to have been given a name of their own---\emph{evil regular
expressions}. In empirical work, Davis et al.~report that they have
found thousands of such evil regular expressions in the JavaScript and
Python ecosystems \cite{Davis18}.
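The exponential number of backtracking possibilities can be observed directly. The following is a minimal sketch of a naive backtracking matcher specialised to $(a^*)^*\,b$ (our own illustration, not the code of any production engine); on a failing input consisting only of $a$'s, the number of match attempts doubles with every additional character:

```python
# A deliberately naive backtracking matcher for the "evil" regular
# expression (a*)*b, counting recursive match attempts.  This is an
# illustrative sketch, not how any production engine is implemented.

def match_outer(s, i, steps):
    # try to match (a*)* b against the suffix s[i:]
    steps[0] += 1
    # option 1: stop iterating the outer star and match the final 'b'
    if i < len(s) and s[i] == 'b' and i + 1 == len(s):
        return True
    # option 2: one more iteration of the outer star consumes a
    # non-empty run of 'a's; every split point is a backtracking choice
    j = i
    while j < len(s) and s[j] == 'a':
        j += 1
        if match_outer(s, j, steps):
            return True
    return False

def count_steps(n):
    # number of match attempts on the guaranteed-failing input a^n
    steps = [0]
    match_outer('a' * n, 0, steps)
    return steps[0]
```

On the string $a^n$ (with no $b$) this matcher makes exactly $2^n$ attempts before giving up, which is the blow-up described above.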
As a result, the regular expression matching engine needed to backtrack
over many choices. In this example, the time needed to process the
string is not exactly the classical exponential case, but rather
$O(n^2)$ with respect to the string length. But this was enough for the
home page of Stack Exchange to respond so slowly to the load balancer
that it assumed there must be some attack and therefore stopped the
servers from responding to requests. This made the whole site
unavailable. Another very recent example is a global outage of all
Cloudflare servers on 2 July 2019. A poorly written regular expression
exhibited exponential behaviour and exhausted CPUs that serve HTTP
traffic. Although the outage had several causes, at the heart was a
regular expression that was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}

The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are quite a few
research questions still unanswered and fast algorithms still need to be
developed (for example how to treat bounded repetitions, negation and
back-references efficiently).

%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but also in \emph{how}.  Consider for example a regular expression
\noindent
This is also supported by evidence collected by Kuklewicz
\cite{Kuklewicz} who noticed that a number of POSIX regular expression
matchers calculate incorrect results.

Our focus in this project is on an algorithm introduced by Sulzmann and
Lu in 2014 for regular expression matching according to the POSIX
strategy \cite{Sulzmann2014}. Their algorithm is based on an older
algorithm by Brzozowski from 1964 where he introduced the notion of
derivatives of regular expressions~\cite{Brzozowski1964}. We shall
briefly explain this algorithm next.

\section{The Algorithm by Brzozowski based on Derivatives of Regular
Expressions}

Suppose (basic) regular expressions are given by the following grammar:
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

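The two ingredients of this matcher, $\textit{nullable}$ and the derivative operation, can be sketched in Python as follows (the tuple encoding of regular expressions is our own choice for illustration):

```python
# A minimal sketch of Brzozowski's matcher for basic regular
# expressions, represented as nested tuples (our own encoding).

ZERO, ONE = ('zero',), ('one',)
def CHAR(c):      return ('char', c)
def ALT(r1, r2):  return ('alt', r1, r2)
def SEQ(r1, r2):  return ('seq', r1, r2)
def STAR(r):      return ('star', r)

def nullable(r):
    # can r match the empty string?
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    return False      # zero and char

def der(c, r):
    # Brzozowski derivative of r with respect to the character c
    tag = r[0]
    if tag in ('zero', 'one'): return ZERO
    if tag == 'char': return ONE if r[1] == c else ZERO
    if tag == 'alt':  return ALT(der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        if nullable(r[1]):
            return ALT(SEQ(der(c, r[1]), r[2]), der(c, r[2]))
        return SEQ(der(c, r[1]), r[2])
    if tag == 'star': return SEQ(der(c, r[1]), r)

def matches(r, s):
    # build the derivatives r_1, r_2, ... and test nullability at the end
    for c in s:
        r = der(c, r)
    return nullable(r)
```

Note that this matcher decides $(a^*)^*\,b$ without any backtracking: each character is processed exactly once.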
\section{Values and the Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a
YES/NO answer for whether a string is being matched by a regular
expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
to allow generation of an actual matching, called a \emph{value} or
sometimes also a \emph{lexical value}.  These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
|v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
substring $|v_1|$ and, respectively, $r_2$ matches the substring
$|v_2|$. Exactly how these two are matched is contained in the
children nodes $v_1$ and $v_2$ of the parent $\textit{Seq}$-node. 

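For illustration, the values from the table above can be represented as follows, together with a function $\textit{flatten}$ implementing the underlining operation $|v|$ that turns a value into the string it stands for (representation and names are our own):

```python
# A sketch of the values from the table above, again as nested tuples
# (the constructor names follow the text; the encoding is our own).

EMPTY = ('empty',)
def CHARV(c):     return ('charv', c)
def LEFT(v):      return ('left', v)
def RIGHT(v):     return ('right', v)
def SEQV(v1, v2): return ('seqv', v1, v2)
def STARS(vs):    return ('stars', vs)

def flatten(v):
    # the underlining function |v|: the string this value stands for
    tag = v[0]
    if tag == 'empty': return ''
    if tag == 'charv': return v[1]
    if tag in ('left', 'right'): return flatten(v[1])
    if tag == 'seqv':  return flatten(v[1]) + flatten(v[2])
    if tag == 'stars': return ''.join(flatten(u) for u in v[1])
```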
To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative nested inside the right
alternative. This suggests to record this matching as
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from left to
right, we build the derivatives $r_1$, $r_2$, \ldots according to the
characters $c_0$, $c_1$ until we exhaust the string and obtain the
derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r_0$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds the parse tree for how the empty
string has been matched by the (nullable) regular expression $r_n$. This
function is defined as

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
regular expression on the left-hand side. This will become
important later on.

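A sketch of $\textit{mkeps}$ in the same tuple representation as before (note how the alternative-clause prefers the left alternative whenever it is nullable):

```python
# A sketch of mkeps, following the definition above.  Regular
# expressions and values are nested tuples (our own encoding);
# nullable is repeated here to keep the sketch self-contained.

def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    if tag == 'seq': return nullable(r[1]) and nullable(r[2])
    return False  # zero and char

def mkeps(r):
    # the value for how the nullable r matches the empty string
    tag = r[0]
    if tag == 'one':  return ('empty',)
    if tag == 'alt':
        # prefer the left alternative if it is nullable (POSIX choice)
        if nullable(r[1]):
            return ('left', mkeps(r[1]))
        return ('right', mkeps(r[2]))
    if tag == 'seq':  return ('seqv', mkeps(r[1]), mkeps(r[2]))
    if tag == 'star': return ('stars', [])
```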
After this, we inject back the characters one by one in order to build
the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous parse tree $v_{i+1}$.
After injecting back $n$ characters, we get the parse tree for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows: 

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
context free grammars and then adapted by Henglein and Nielson for
efficient regular expression parsing using DFAs~\cite{nielson11bcre}.
Sulzmann and Lu took this idea of bitcodes a step further by integrating
bitcodes into derivatives. The reason why we want to use bitcodes in
this project is that we want to introduce more aggressive
simplification rules in order to keep the size of derivatives small
throughout. This is because the main drawback of building successive
derivatives according to Brzozowski's definition is that they can grow
very quickly in size. This is mainly due to the fact that the derivative
operation often generates ``useless'' $\ZERO$s and $\ONE$s in
derivatives.  As a result, if implemented naively both algorithms by
Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
when starting with the regular expression $(a + aa)^*$ and building 12
successive derivatives w.r.t.~the character $a$, one obtains a
derivative regular expression with more than 8000 nodes (when viewed as
a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse
such trees and consequently the bigger the size of the derivative the
slower the algorithm. 

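This blow-up for $(a + aa)^*$ can be reproduced with the unsimplified derivative operation. Exact node counts depend on the chosen representation, so the sketch below (our tuple encoding again) only asserts that the size grows into the thousands after twelve derivatives:

```python
# An experiment illustrating the size blow-up of unsimplified
# derivatives of (a + aa)*.  Regular expressions are nested tuples;
# der and nullable are repeated to keep the sketch self-contained.

def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    if tag == 'seq': return nullable(r[1]) and nullable(r[2])
    return False

def der(c, r):
    tag = r[0]
    if tag in ('zero', 'one'): return ('zero',)
    if tag == 'char': return ('one',) if r[1] == c else ('zero',)
    if tag == 'alt':  return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        if nullable(r[1]):
            return ('alt', ('seq', der(c, r[1]), r[2]), der(c, r[2]))
        return ('seq', der(c, r[1]), r[2])
    if tag == 'star': return ('seq', der(c, r[1]), r)

def size(r):
    # number of nodes when the expression is viewed as a tree
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def size_after(n):
    # size of the n-th derivative of (a + aa)* w.r.t. the character a
    r = ('star', ('alt', ('char', 'a'),
                  ('seq', ('char', 'a'), ('char', 'a'))))
    for _ in range(n):
        r = der('a', r)
    return size(r)
```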
Fortunately, one can simplify regular expressions after each derivative
step. Various simplifications of regular expressions are possible, such
as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
affect the answer for whether a regular expression matches a string or
not, but fortunately also do not affect the POSIX strategy of how
regular expressions match strings---although the latter is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}. 

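These simplification rules can be sketched as a bottom-up pass over a regular expression (tuple encoding as before; the $\ZERO \cdot r \Rightarrow \ZERO$ clause is an additional common rule, not one of those listed above):

```python
# A sketch of the basic simplification rules from the text, applied
# bottom-up.  Regular expressions are nested tuples (our encoding).

def simp(r):
    tag = r[0]
    if tag == 'alt':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('zero',): return r2          # 0 + r  =>  r
        if r2 == ('zero',): return r1          # r + 0  =>  r
        if r1 == r2:        return r1          # r + r  =>  r
        return ('alt', r1, r2)
    if tag == 'seq':
        r1, r2 = simp(r[1]), simp(r[2])
        if ('zero',) in (r1, r2): return ('zero',)  # extra rule: 0.r => 0
        if r1 == ('one',): return r2           # 1 . r  =>  r
        if r2 == ('one',): return r1           # r . 1  =>  r
        return ('seq', r1, r2)
    return r                                   # leaves stay as they are
```

Applying $\textit{simp}$ after every derivative step keeps the trees from the previous example much smaller, without changing the matching answer.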
Unfortunately, the simplification rules outlined above are not
sufficient to prevent a size explosion in all cases. We
believe a tighter bound can be achieved that prevents an explosion in
\emph{all} cases. Such a tighter bound is suggested by work of Antimirov who
proved that (partial) derivatives can be bounded by the number of
characters contained in the initial regular expression
\cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
expressions as follows:


\noindent
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s children expressions or a
concatenation of them. Antimirov has proved a tight bound on the size of
\emph{all} partial derivatives no matter what the string looks like.
Roughly speaking the size will be quadratic in the size of the regular
expression. If we want the size of derivatives in Sulzmann and Lu's
algorithm to stay at or below this bound, we would need more
aggressive simplifications. Essentially we need to delete useless
$\ZERO$s and $\ONE$s, as well as delete duplicates whenever possible.
For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to
get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a \cdot
c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a
+\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive simplification
rules helps us to achieve the same size bound as that of the partial
derivatives. In order to implement the idea of ``spilling out alternatives''
and to make them compatible with the $\textit{inj}$-mechanism, we use
\emph{bitcodes}. Bits and bitcodes (lists of bits) are just:

%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

%data, that a similar bound can be obtained for the derivatives in
%Sulzmann and Lu's algorithm. Let us give some details about this next.


\begin{center}
		$b ::=   S \mid  Z \qquad
bs ::= [] \mid b:bs    
$
\end{center}

\noindent
The names $S$ and $Z$ here are quite arbitrary in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes: 

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}    
\end{center} 

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
star iteration into $\S$, and the border where a local star terminates
into $\Z$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       \textit{else}\;\textit{None}$                       
\end{tabular}    
\end{center}    
%\end{definition}
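Under the conventions just described, the coding function can be sketched as follows (bits represented as the strings 'Z' and 'S'; the clauses for $\Right$, $\Seq$ and the empty $\Stars$ follow the textual description above, and values use our earlier tuple encoding):

```python
# A sketch of the coding function from values to bitcodes.

def code(v):
    tag = v[0]
    if tag == 'empty': return []            # character info is thrown away
    if tag == 'charv': return []
    if tag == 'left':  return ['Z'] + code(v[1])
    if tag == 'right': return ['S'] + code(v[1])
    if tag == 'seqv':  return code(v[1]) + code(v[2])   # no boundary bit
    if tag == 'stars':
        bs = []
        for u in v[1]:
            bs += ['S'] + code(u)           # S marks the next iteration
        return bs + ['Z']                   # Z marks where the star stops
```

Note the lossiness: a character value and the empty value both encode to the empty bitcode, which is why decoding needs the regular expression as extra information.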

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
                  & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
                  & $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}    
\end{center}  
%(in \textit{ALT})

\noindent
where $bs$ stands for bitcodes, and $a$ for $\bold{a}$nnotated regular
expressions. We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.

To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:
\end{tabular}    
\end{center}    
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
  $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
     $\textit{ONE}\,(bs\,@\,bs')$\\
  $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
     $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
  $\textit{fuse}\;bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
     $\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\
  $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
     $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
  $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
     $\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}    
\end{center}  

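As an illustration, annotated regular expressions and $\textit{fuse}$ can be transcribed into the following small sketch (our implementation is actually written in Scala; the constructor names below and the representation of bitcodes as tuples of the strings \texttt{"Z"} and \texttt{"S"} are our own choices for this sketch):

```python
from dataclasses import dataclass

# annotated regular expressions: every constructor carries a bitcode bs
@dataclass(frozen=True)
class AZERO: pass
@dataclass(frozen=True)
class AONE: bs: tuple
@dataclass(frozen=True)
class ACHAR: bs: tuple; c: str
@dataclass(frozen=True)
class AALT: bs: tuple; a1: object; a2: object
@dataclass(frozen=True)
class ASEQ: bs: tuple; a1: object; a2: object
@dataclass(frozen=True)
class ASTAR: bs: tuple; a: object

def fuse(bs, a):
    """attach the bitcode bs to the front of the annotation at the root"""
    if isinstance(a, AONE):  return AONE(bs + a.bs)
    if isinstance(a, ACHAR): return ACHAR(bs + a.bs, a.c)
    if isinstance(a, AALT):  return AALT(bs + a.bs, a.a1, a.a2)
    if isinstance(a, ASEQ):  return ASEQ(bs + a.bs, a.a1, a.a2)
    if isinstance(a, ASTAR): return ASTAR(bs + a.bs, a.a)
    return a  # AZERO carries no bits
```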
\noindent
After internalisation we do successive derivative operations on the
annotated regular expressions. This derivative operation is the same as
the one we previously had for the simple regular expressions, except
that we need to take special care of the bitcodes:

 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
  $(\textit{ALT}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
        $\textit{ALT}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\
  & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z]\, (a\,\backslash c))\,
       (\textit{STAR}\,[]\,a)$
\end{tabular}    
\end{center}    
%\end{definition}

\noindent
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
we need to attach an additional bit $\Z$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The $\textit{SEQ}$
clause is more subtle: when $a_1$ is $\textit{bnullable}$ (here
$\textit{bnullable}$ is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions; we therefore omit its
definition), the derivative becomes an alternative. Assume that
$\textit{bmkeps}$ correctly extracts the bitcode for how $a_1$ matches
the string prior to character $c$ (more on this later); then the right
branch of the $\textit{ALT}$, which is
$\textit{fuse}\;(\textit{bmkeps}\,a_1)\,(a_2\backslash c)$, collapses
the regular expression $a_1$ (as it has already been fully matched) and
stores the lexing information at the head of the regular expression
$a_2\backslash c$ by fusing it to the front. The bitcode $bs$, which was
initially attached to the head of the $\textit{SEQ}$, has now been
elevated to the top-level of the $\textit{ALT}$, as this information
will be needed whichever way the $\textit{SEQ}$ is matched---no matter
whether $c$ belongs to $a_1$ or $a_2$. After building these derivatives
and maintaining all the lexing information, we complete the lexing by
collecting the bitcodes using a generalised version of the
$\textit{mkeps}$ function for annotated regular expressions, called
$\textit{bmkeps}$:

%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(\textit{ALT}\;bs\,a_1\,a_2)$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
     $bs \,@\, [\S]$
\end{tabular}    
\end{center}    
%\end{definition}

   991 This function completes the parse tree information by 
   995 This function completes the value information by travelling along the
   992 travelling along the path on the regular expression that corresponds to a POSIX value snd collect all the bits, and
   996 path of the regular expression that corresponds to a POSIX value and
   993 using S to indicate the end of star iterations. If we take the bits produced by $bmkeps$ and decode it, 
   997 collecting all the bitcodes, and using $S$ to indicate the end of star
   994 we get the parse tree we need, the working flow looks like this:\\
   998 iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
       
   999 decode them, we get the value we expect. The corresponding lexing
       
  1000 algorithm looks as follows:
       
  1001 
   995 \begin{center}
\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\backslash s$ is the generalisation of the
derivative operation from characters to strings (just like the
derivatives for un-annotated regular expressions).

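Putting the definitions of this section together, the bitcoded matcher, minus the final $\textit{decode}$ step, can be sketched as follows (again a sketch with our own names, written here in Python rather than our Scala; $\textit{blex}$ returns the collected bitcode instead of a decoded value):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AZERO: pass
@dataclass(frozen=True)
class AONE: bs: tuple
@dataclass(frozen=True)
class ACHAR: bs: tuple; c: str
@dataclass(frozen=True)
class AALT: bs: tuple; a1: object; a2: object
@dataclass(frozen=True)
class ASEQ: bs: tuple; a1: object; a2: object
@dataclass(frozen=True)
class ASTAR: bs: tuple; a: object

def fuse(bs, a):
    if isinstance(a, AONE):  return AONE(bs + a.bs)
    if isinstance(a, ACHAR): return ACHAR(bs + a.bs, a.c)
    if isinstance(a, AALT):  return AALT(bs + a.bs, a.a1, a.a2)
    if isinstance(a, ASEQ):  return ASEQ(bs + a.bs, a.a1, a.a2)
    if isinstance(a, ASTAR): return ASTAR(bs + a.bs, a.a)
    return a

def bnullable(a):
    if isinstance(a, (AONE, ASTAR)): return True
    if isinstance(a, AALT): return bnullable(a.a1) or bnullable(a.a2)
    if isinstance(a, ASEQ): return bnullable(a.a1) and bnullable(a.a2)
    return False  # AZERO, ACHAR

def bmkeps(a):
    if isinstance(a, AONE): return a.bs
    if isinstance(a, AALT):
        return a.bs + (bmkeps(a.a1) if bnullable(a.a1) else bmkeps(a.a2))
    if isinstance(a, ASEQ): return a.bs + bmkeps(a.a1) + bmkeps(a.a2)
    return a.bs + ("S",)  # ASTAR: S ends the star iterations

def bder(c, a):
    if isinstance(a, ACHAR):
        return AONE(a.bs) if a.c == c else AZERO()
    if isinstance(a, AALT):
        return AALT(a.bs, bder(c, a.a1), bder(c, a.a2))
    if isinstance(a, ASEQ):
        if bnullable(a.a1):
            return AALT(a.bs, ASEQ((), bder(c, a.a1), a.a2),
                        fuse(bmkeps(a.a1), bder(c, a.a2)))
        return ASEQ(a.bs, bder(c, a.a1), a.a2)
    if isinstance(a, ASTAR):
        return ASEQ(a.bs, fuse(("Z",), bder(c, a.a)), ASTAR((), a.a))
    return AZERO()  # AZERO, AONE

def blex(a, s):
    """blexer without the final decode: the bitcode, or None on failure"""
    for c in s:
        a = bder(c, a)
    return bmkeps(a) if bnullable(a) else None
```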
The main point of the bitcodes and annotated regular expressions is that
we can apply rather aggressive (in terms of size) simplification rules
in order to keep derivatives small. We have developed such
``aggressive'' simplification rules and generated test data that show
that the expected bound can be achieved. Obviously we could only
partially cover the search space as there are infinitely many regular
expressions and strings. 

One modification we introduced is to allow a list of annotated regular
expressions in the \textit{ALTS} constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away
from the regular expression. A recursive definition of our
simplification function, which looks somewhat similar to our Scala code,
is given below:

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
  $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map}\;\textit{simp}\;as)) \; \textit{match} $ \\
   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse}\;bs\;a$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\ 
   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
\end{tabular}    
\end{center}    

  1057 \noindent
  1048 the regular expression is an alternative or sequence, 
  1058 The simplification does a pattern matching on the regular expression.
  1049 it will try to simplify its children regular expressions
  1059 When it detected that the regular expression is an alternative or
  1050 recursively and then see if one of the children turn 
  1060 sequence, it will try to simplify its children regular expressions
  1051 into $\ZERO$ or $\ONE$, which might trigger further simplification
  1061 recursively and then see if one of the children turn into $\ZERO$ or
  1052  at the current level. The most involved part is the $\textit{ALTS}$
  1062 $\ONE$, which might trigger further simplification at the current level.
  1053   clause, where we use two auxiliary functions 
  1063 The most involved part is the $\textit{ALTS}$ clause, where we use two
  1054  flatten and distinct to open up nested $\textit{ALTS}$ and 
  1064 auxiliary functions flatten and distinct to open up nested
  1055  reduce as many duplicates as possible.
  1065 $\textit{ALTS}$ and reduce as many duplicates as possible. Function
  1056  Function distinct  keeps the first occurring copy only and 
  1066 distinct  keeps the first occurring copy only and remove all later ones
  1057  remove all later ones when detected duplicates.
  1067 when detected duplicates. Function flatten opens up nested \textit{ALT}.
  1058  Function flatten opens up nested \textit{ALT}. Its recursive
  1068 Its recursive definition is given below:
  1059   definition is given below:
  1069 
  1060  \begin{center}
 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
\end{tabular}    
\end{center}  

\noindent
Here flatten behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. What it does is
basically removing parentheses, like changing $a+(b+c)$ into $a+b+c$.

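The two functions above can be transcribed rather directly (a Python sketch with our own names; the $\textit{ALTS}$ constructor carries a tuple of alternatives, and in the left-unit case we also keep the bitcode of the eliminated $\ONE$, while in the right-unit case it is dropped, as in the definition above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AZERO: pass
@dataclass(frozen=True)
class AONE: bs: tuple
@dataclass(frozen=True)
class ACHAR: bs: tuple; c: str
@dataclass(frozen=True)
class AALTS: bs: tuple; rs: tuple   # alternative with a list of children
@dataclass(frozen=True)
class ASEQ: bs: tuple; a1: object; a2: object
@dataclass(frozen=True)
class ASTAR: bs: tuple; a: object

def fuse(bs, a):
    if isinstance(a, AONE):  return AONE(bs + a.bs)
    if isinstance(a, ACHAR): return ACHAR(bs + a.bs, a.c)
    if isinstance(a, AALTS): return AALTS(bs + a.bs, a.rs)
    if isinstance(a, ASEQ):  return ASEQ(bs + a.bs, a.a1, a.a2)
    if isinstance(a, ASTAR): return ASTAR(bs + a.bs, a.a)
    return a

def flatten(rs):
    """open up nested ALTS and remove ZEROs"""
    out = []
    for a in rs:
        if isinstance(a, AZERO):
            continue
        elif isinstance(a, AALTS):
            out.extend(fuse(a.bs, x) for x in a.rs)
        else:
            out.append(a)
    return out

def distinct(rs):
    """keep only the first occurring copy of each duplicate"""
    seen, out = set(), []
    for a in rs:
        if a not in seen:
            seen.add(a)
            out.append(a)
    return out

def simp(a):
    if isinstance(a, ASEQ):
        a1, a2 = simp(a.a1), simp(a.a2)
        if isinstance(a1, AZERO) or isinstance(a2, AZERO):
            return AZERO()
        if isinstance(a1, AONE):
            return fuse(a.bs + a1.bs, a2)  # left unit; keep the ONE's bits
        if isinstance(a2, AONE):
            return fuse(a.bs, a1)          # right unit, as in the text
        return ASEQ(a.bs, a1, a2)
    if isinstance(a, AALTS):
        rs = distinct(flatten([simp(x) for x in a.rs]))
        if not rs: return AZERO()
        if len(rs) == 1: return fuse(a.bs, rs[0])
        return AALTS(a.bs, tuple(rs))
    return a  # ONE, CHAR, STAR: unchanged
```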
Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from the derivative w.r.t.~a character to the derivative
w.r.t.~a string:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
we obtain an optimised version of the algorithm:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small; for example,
with this simplification the derivative of our previous $(a + aa)^*$
example, which before grew to over 8000 nodes, is reduced to just 6
nodes and stays constant, no matter how long the input string is.


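This size behaviour can be checked with a small runnable experiment (a Python sketch under our own naming; note two deviations we make explicit: duplicates in an $\textit{ALTS}$ are compared after erasing the bitcodes, since copies differing only in their bits would otherwise survive the distinct step, and the exact node count depends on the chosen representation, so it differs from the figure of 6 quoted above while still staying constant):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AZERO: pass
@dataclass(frozen=True)
class AONE: bs: tuple
@dataclass(frozen=True)
class ACHAR: bs: tuple; c: str
@dataclass(frozen=True)
class AALTS: bs: tuple; rs: tuple
@dataclass(frozen=True)
class ASEQ: bs: tuple; a1: object; a2: object
@dataclass(frozen=True)
class ASTAR: bs: tuple; a: object

def fuse(bs, a):
    if isinstance(a, AONE):  return AONE(bs + a.bs)
    if isinstance(a, ACHAR): return ACHAR(bs + a.bs, a.c)
    if isinstance(a, AALTS): return AALTS(bs + a.bs, a.rs)
    if isinstance(a, ASEQ):  return ASEQ(bs + a.bs, a.a1, a.a2)
    if isinstance(a, ASTAR): return ASTAR(bs + a.bs, a.a)
    return a

def bnullable(a):
    if isinstance(a, (AONE, ASTAR)): return True
    if isinstance(a, AALTS): return any(bnullable(x) for x in a.rs)
    if isinstance(a, ASEQ): return bnullable(a.a1) and bnullable(a.a2)
    return False

def bmkeps(a):
    if isinstance(a, AONE): return a.bs
    if isinstance(a, AALTS):
        return a.bs + bmkeps(next(x for x in a.rs if bnullable(x)))
    if isinstance(a, ASEQ): return a.bs + bmkeps(a.a1) + bmkeps(a.a2)
    return a.bs + ("S",)  # ASTAR

def bder(c, a):
    if isinstance(a, ACHAR):
        return AONE(a.bs) if a.c == c else AZERO()
    if isinstance(a, AALTS):
        return AALTS(a.bs, tuple(bder(c, x) for x in a.rs))
    if isinstance(a, ASEQ):
        if bnullable(a.a1):
            return AALTS(a.bs, (ASEQ((), bder(c, a.a1), a.a2),
                                fuse(bmkeps(a.a1), bder(c, a.a2))))
        return ASEQ(a.bs, bder(c, a.a1), a.a2)
    if isinstance(a, ASTAR):
        return ASEQ(a.bs, fuse(("Z",), bder(c, a.a)), ASTAR((), a.a))
    return AZERO()  # AZERO, AONE

def erase(a):
    """strip all bitcodes, for comparing alternatives structurally"""
    if isinstance(a, AONE):  return AONE(())
    if isinstance(a, ACHAR): return ACHAR((), a.c)
    if isinstance(a, AALTS): return AALTS((), tuple(erase(x) for x in a.rs))
    if isinstance(a, ASEQ):  return ASEQ((), erase(a.a1), erase(a.a2))
    if isinstance(a, ASTAR): return ASTAR((), erase(a.a))
    return a

def flatten(rs):
    out = []
    for a in rs:
        if isinstance(a, AZERO): continue
        elif isinstance(a, AALTS): out.extend(fuse(a.bs, x) for x in a.rs)
        else: out.append(a)
    return out

def simp(a):
    if isinstance(a, ASEQ):
        a1, a2 = simp(a.a1), simp(a.a2)
        if isinstance(a1, AZERO) or isinstance(a2, AZERO): return AZERO()
        if isinstance(a1, AONE): return fuse(a.bs + a1.bs, a2)
        return ASEQ(a.bs, a1, a2)
    if isinstance(a, AALTS):
        seen, out = set(), []
        for x in flatten([simp(x) for x in a.rs]):
            if erase(x) not in seen:
                seen.add(erase(x)); out.append(x)  # first copy only
        if not out: return AZERO()
        if len(out) == 1: return fuse(a.bs, out[0])
        return AALTS(a.bs, tuple(out))
    return a

def size(a):
    if isinstance(a, AALTS): return 1 + sum(size(x) for x in a.rs)
    if isinstance(a, ASEQ):  return 1 + size(a.a1) + size(a.a2)
    if isinstance(a, ASTAR): return 1 + size(a.a)
    return 1

def bders_simp(s, a):
    for c in s:
        a = simp(bder(c, a))
    return a
```

Running this on $(a + aa)^*$ with ever longer strings of $a$s shows the derivative sizes reaching a small fixed point instead of blowing up.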
\section{Current Work}

We are currently engaged in two tasks related to this algorithm. The
first task is proving that our simplification rules actually do not
affect the POSIX value that should be generated by the algorithm
according to the specification of a POSIX value, and furthermore to
obtain a much tighter bound on the sizes of derivatives. The result is
that our algorithm should be correct and faster on all inputs. The
original blow-up, as observed in JavaScript, Python and Java, would be
excluded from happening in our algorithm. For this proof we use the
theorem prover Isabelle. Once completed, this result will advance the
state-of-the-art: Sulzmann and Lu wrote in their
paper~\cite{Sulzmann2014} about the bitcoded ``incremental parsing
method'' (that is the lexing algorithm outlined in this section):

\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method in Figure~5 in combination with the simplification steps in
  Figure 6 yields POSIX parse trees. We have tested this claim
  extensively by using the method in Figure~3 as a reference but yet
  have to work out all proof details.''
\end{quote}  

\noindent
We would settle this correctness claim. It is relatively straightforward
to establish that after one simplification step, the part of a nullable
derivative that corresponds to a POSIX value remains intact and can
still be collected, in other words, we can show that

\begin{center}
$\textit{bmkeps} \; r \;=\; \textit{bmkeps} \; (\textit{simp} \; r) \qquad (\textit{provided}\; r\; \textit{is nullable})$
\end{center}

\noindent
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in
a regular expression. The hard part of this proof is to establish that

\begin{center}
$\textit{blexer\_simp} \; r \; s \;=\; \textit{blexer} \; r\; s$
\end{center}

\noindent
That is, if we take derivatives of a regular expression $r$ and of its
simplified version, they still provide the same POSIX value, if there is
one. This is not as straightforward as the previous proposition, as the
two regular expressions $r$ and $\textit{simp}\; r$ might become very
different regular expressions after repeated application of
$\textit{simp}$ and the derivative operation. The crucial point is to
find the indispensable information of a regular expression, and how it
is kept intact during simplification so that the simplified regular
expression performs as well as its unsimplified counterpart in
subsequent derivative operations. To aid this, we use the helper
function retrieve described by Sulzmann and Lu:
\\definition of retrieve\\

This function assembles the bitcode that corresponds to a parse tree for
how the current derivative matches the suffix of the string (the
characters that have not yet been matched, but are stored in the value).
Sulzmann and Lu used this to connect the bitcoded algorithm to the older
algorithm by the following equation:

 \begin{center}
 $\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c)\; v)$
 \end{center}

 A little fact that needs to be stated to help comprehension:

 \begin{center}