ninems/ninems.tex
changeset 70 cab5eab1f6f1
parent 69 4c7173b7ddca
child 71 0573615e41a3
equal deleted inserted replaced
69:4c7173b7ddca 70:cab5eab1f6f1
     4 \usepackage{tikz-cd}
     4 \usepackage{tikz-cd}
     5 %\usepackage{algorithm}
     5 %\usepackage{algorithm}
     6 \usepackage{amsmath}
     6 \usepackage{amsmath}
     7 \usepackage[noend]{algpseudocode}
     7 \usepackage[noend]{algpseudocode}
     8 \usepackage{enumitem}
     8 \usepackage{enumitem}
       
     9 \usepackage{nccmath}
     9 
    10 
    10 \definecolor{darkblue}{rgb}{0,0,0.6}
    11 \definecolor{darkblue}{rgb}{0,0,0.6}
    11 \hypersetup{colorlinks=true,allcolors=darkblue}
    12 \hypersetup{colorlinks=true,allcolors=darkblue}
       
    13 \newcommand{\comment}[1]%
       
    14 {{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}
       
    15 
    12 % \documentclass{article}
    16 % \documentclass{article}
    13 %\usepackage[utf8]{inputenc}
    17 %\usepackage[utf8]{inputenc}
    14 %\usepackage[english]{babel}
    18 %\usepackage[english]{babel}
    15 %\usepackage{listings}
    19 %\usepackage{listings}
    16 % \usepackage{amsthm}
    20 % \usepackage{amsthm}
    80   looked at extended regular expressions, such as bounded repetitions,
    84   looked at extended regular expressions, such as bounded repetitions,
    81   negation and back-references.
    85   negation and back-references.
    82 \end{abstract}
    86 \end{abstract}
    83 
    87 
    84 \section{Introduction}
    88 \section{Introduction}
       
    89 
    85 
    90 
    86 This PhD-project is about regular expression matching and
    91 This PhD-project is about regular expression matching and
    87 lexing. Given the maturity of this topic, the reader might wonder:
    92 lexing. Given the maturity of this topic, the reader might wonder:
    88 Surely, regular expressions must have already been studied to death?
    93 Surely, regular expressions must have already been studied to death?
    89 What could possibly be \emph{not} known in this area? And surely all
    94 What could possibly be \emph{not} known in this area? And surely all
   445 algorithm by a second phase (the first phase being building successive
   450 algorithm by a second phase (the first phase being building successive
   446 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   451 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   447 is generated assuming the regular expression matches  the string. 
   452 is generated assuming the regular expression matches  the string. 
   448 Pictorially, the algorithm is as follows:
   453 Pictorially, the algorithm is as follows:
   449 
   454 
       
   455 \begin{ceqn}
   450 \begin{equation}\label{graph:2}
   456 \begin{equation}\label{graph:2}
   451 \begin{tikzcd}
   457 \begin{tikzcd}
   452 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   458 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   453 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   459 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   454 \end{tikzcd}
   460 \end{tikzcd}
   455 \end{equation}
   461 \end{equation}
       
   462 \end{ceqn}
   456 
   463 
   457 \noindent
   464 \noindent
   458 For convenience, we shall employ the following notations: the regular expression we
   465 For convenience, we shall employ the following notations: the regular expression we
   459 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
   466 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
   460 \ldots c_{n-1}$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
   467 \ldots c_{n-1}$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
   543 that the hole will be on $r_a$. So we recursively call $\inj\, 
   550 that the hole will be on $r_a$. So we recursively call $\inj\, 
   544 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
   551 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
   545 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
   552 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
   546 Other clauses can be understood in a similar way.
   553 Other clauses can be understood in a similar way.
   547 
   554 
   548 The following example gives a taste of $\textit{inj}$'s effect
   555 The following example gives a \comment{Other word: insight?}taste of $\textit{inj}$'s effect
   549 and how Sulzmann and Lu's algorithm works as a whole.
   556 and how Sulzmann and Lu's algorithm works as a whole.
   550  Suppose we have a
   557  Suppose we have a
   551 regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against
   558 regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against
   552 the string $abc$ (when $abc$ is written as a regular expression, the most
   559 the string $abc$ (when $abc$ is written as a regular expression, the most
   553 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
   560 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
   573       &                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
   580       &                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
   574 \end{tabular}
   581 \end{tabular}
   575 \end{center}
   582 \end{center}
   576 
   583 
   577 \noindent
   584 \noindent
   578 Now when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$ 
   585 In  case $r_3$ is nullable, we can call $mkeps$ 
   579 to construct a parse tree for how $r_3$ matched the string $abc$. 
   586 to construct a parse tree for how $r_3$ matched the string $abc$. 
   580 $mkeps$ gives the following value $v_3$: 
   587 $mkeps$ gives the following value $v_3$: 
   581 \begin{center}
   588 \begin{center}
   582 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
   589 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
   583 \end{center}
   590 \end{center}
   584 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined):
   591 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined):
       
   592 
   585 \begin{center}
   593 \begin{center}
   586    $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
   594    $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0  + 1 + 0)
   587   \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* ).$
   595   \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* ).$
   588   
       
   589  \end{center}
   596  \end{center}
   590  Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$
   597 
   591  (which corresponds to the initial sub-match $abc$)
   598 \noindent
   592   allows $mkeps$ to pick it up
   599  Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot
   593   because $mkeps$ is defined to always choose the left one when it is nullable.
   600  1) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows
   594   In the case of this example, $abc$ is preferred over $a$ or $ab$.
   601  $mkeps$ to pick it up because $mkeps$ is defined to always choose the
   595    This $\Left(\Left(\ldots))$ location is naturally generated by 
   602  left one when it is nullable. In the case of this example, $abc$ is
   596    two applications of the splitting clause
   603  preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is
   597    \begin{center}
   604  naturally generated by two applications of the splitting clause
       
   605 
       
   606 \begin{center}
   598      $(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
   607      $(r_1 \cdot r_2)\backslash c  (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
   599      \end{center}
   608 \end{center}
   600        By this clause, we put
   609        
   601 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. 
   610 \noindent
   602 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match.
   611 By this clause, we put $r_1 \backslash c \cdot r_2 $ at the
   603  Removing the outside $\Left(\Left(...))$, the inside sub-value 
   612 $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This
   604  \begin{center}
   613 allows $mkeps$ to always pick up among two matches the one with a longer
       
   614 initial sub-match. Removing the outside $\Left(\Left(...))$, the inside
       
   615 sub-value 
       
   616  
       
   617 \begin{center}
   605  $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
   618  $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
   606  \end{center}
   619 \end{center}
   607 tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$.
   620 
   608 We match $[]$ by a sequence of 2 nullable regular expressions. 
   621 \noindent
   609 The first one is an alternative, we take the rightmost alternative---whose language
   622 tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot
   610 contains the empty string. The second nullable regular 
   623 1 \cdot 1) \cdot r^*$. We match $[]$ by a sequence of 2 nullable regular
   611 expression is a Kleene star. $\Stars$ tells us how it
   624 expressions. The first one is an alternative, we take the rightmost
   612 generates the nullable regular expression: by 0 iterations to form $\epsilon$.
   625 alternative---whose language contains the empty string. The second
   613 Now $\textit{inj}$ injects characters back and
   626 nullable regular expression is a Kleene star. $\Stars$ tells us how it
   614  incrementally builds a parse tree based on $v_3$.
   627 generates the nullable regular expression: by 0 iterations to form
   615 Using the value $v_3$, the character c, and the regular expression $r_2$, 
   628 $\epsilon$. Now $\textit{inj}$ injects characters back and incrementally
   616 we can recover how $r_2$ matched the string $[c]$ : 
   629 builds a parse tree based on $v_3$. Using the value $v_3$, the character
   617  $\textit{inj} \; r_2 \; c \; v_3$ gives us
   630 c, and the regular expression $r_2$, we can recover how $r_2$ matched
       
   631 the string $[c]$ : $\textit{inj} \; r_2 \; c \; v_3$ gives us
   618  \begin{center}
   632  \begin{center}
   619  $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
   633  $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$
   620  \end{center}
   634  \end{center}
   621 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
   635 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
   622 \begin{center}
   636 \begin{center}
   644     \item[1)] just $a$ or
   658     \item[1)] just $a$ or
   645     \item[2)] string $ab$ or 
   659     \item[2)] string $ab$ or 
   646     \item[3)] string $abc$.
   660     \item[3)] string $abc$.
   647 \end{enumerate}
   661 \end{enumerate}
   648 \end{center}
   662 \end{center}
   649 In order to differentiate between these choices, 
   663 
   650 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. 
   664 \noindent
   651 Which one of these alternatives is chosen later does not affect their relative position because our algorithm 
   665 In order to differentiate between these choices, we just need to
   652 does not change this order. If this parsing information can be determined and does
   666 remember their positions--$a$ is on the left, $ab$ is in the middle ,
   653 not change because of later derivatives,
   667 and $abc$ is on the right. Which one of these alternatives is chosen
   654 there is no point in traversing this information twice. This leads to an optimisation---if we store the information for parse trees inside the regular expression, update it when we do derivative on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing the rest $n$ injections. This leads to Sulzmann and Lu's novel idea of using bit-codes in derivatives.
   668 later does not affect their relative position because our algorithm does
       
   669 not change this order. If this parsing information can be determined and
       
   670 does not change because of later derivatives, there is no point in
       
   671 traversing this information twice. This leads to an optimisation---if we
       
   672 store the information for parse trees inside the regular expression,
       
   673 update it when we do derivative on them, and collect the information
       
   674 when finished with derivatives and call $mkeps$ for deciding which
       
   675 branch is POSIX, we can generate the parse tree in one pass, instead of
       
   676 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel
       
   677 idea of using bit-codes in derivatives.
   655 
   678 
   656 In the next section, we shall focus on the bit-coded algorithm and the
   679 In the next section, we shall focus on the bit-coded algorithm and the
   657 process of simplification of regular expressions. This is needed in
   680 process of simplification of regular expressions. This is needed in
   658 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   681 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
   659 and Lu's algorithms.  This is where the PhD-project aims to advance the
   682 and Lu's algorithms.  This is where the PhD-project aims to advance the
   660 state-of-the-art.
   683 state-of-the-art.
   661 
   684 
   662 
   685 
   663 \section{Simplification of Regular Expressions}
   686 \section{Simplification of Regular Expressions}
   664 
   687 
   665 
   688 Using bitcodes to guide  parsing is not a novel idea. It was applied to
   666 Using bit-codes to guide  parsing is not a novel idea. It was applied to
       
   667 context free grammars and then adapted by Henglein and Nielson for
   689 context free grammars and then adapted by Henglein and Nielson for
   668 efficient regular expression parsing using DFAs \cite{nielson11bcre}. Sulzmann and
   690 efficient regular expression parsing using DFAs~\cite{nielson11bcre}.
   669 Lu took a step further by integrating bitcodes into derivatives.
   691 Sulzmann and Lu took this idea of bitcodes a step further by integrating
   670 The reason why we want to use bitcodes in 
   692 bitcodes into derivatives. The reason why we want to use bitcodes in
   671 this project is that we want to introduce aggressive 
   693 this project is that we want to introduce more aggressive
   672 simplifications. 
   694 simplifications in order to keep the size of derivatives small
   673 
   695 throughout. This is because the main drawback of building successive
   674 The main drawback of building
   696 derivatives according to Brzozowski's definition is that they can grow
   675 successive derivatives according to Brzozowski's definition is that they
   697 very quickly in size. This is mainly due to the fact that the derivative
   676 can grow very quickly in size. This is mainly due to the fact that the
   698 operation generates often ``useless'' $\ZERO$s and $\ONE$s in
   677 derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in
       
   678 derivatives.  As a result, if implemented naively both algorithms by
   699 derivatives.  As a result, if implemented naively both algorithms by
   679 Brzozowski and by Sulzmann and Lu are excruciatingly slow. 
   700 Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
   680 For example when starting with the regular expression $(a + aa)^*$ and building 12
   701 when starting with the regular expression $(a + aa)^*$ and building 12
   681 successive derivatives w.r.t.~the character $a$, one obtains a
   702 successive derivatives w.r.t.~the character $a$, one obtains a
   682 derivative regular expression with more than 8000 nodes (when viewed as
   703 derivative regular expression with more than 8000 nodes (when viewed as
   683 a tree). Operations like derivative and $\nullable$ need to traverse
   704 a tree). Operations like derivative and $\nullable$ need to traverse
   684 such trees and consequently the bigger the size of the derivative the
   705 such trees and consequently the bigger the size of the derivative the
   685 slower the algorithm. 
   706 slower the algorithm. 
   686 Fortunately, one can simplify regular expressions
   707 
   687 after each derivative step. Various simplifications of regular
   708 Fortunately, one can simplify regular expressions after each derivative
   688 expressions are possible, such as the simplifications of $\ZERO + r$, $r
   709 step. Various simplifications of regular expressions are possible, such
   689 + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
   710 as the simplifications of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
   690 simplifications do not affect the answer for whether a regular
   711 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
   691 expression matches a string or not, but fortunately also do not affect
   712 affect the answer for whether a regular expression matches a string or
   692 the POSIX strategy of how regular expressions match strings---although
   713 not, but fortunately also do not affect the POSIX strategy of how
   693 the latter is much harder to establish. 
   714 regular expressions match strings---although the latter is much harder
   694 The argument for complicating the data structures from basic regular
   715 to establish. \comment{Does not make sense.} The argument for
   695 expressions to those with bitcodes is that we can introduce
   716 complicating the data structures from basic regular expressions to those
   696 simplification without making the algorithm crash or overly complex to
   717 with bitcodes is that we can introduce simplification without making the
   697 reason about. The latter is crucial for a correctness proof.
   718 algorithm crash or overly complex to reason about. The latter is crucial
   698 Some initial results in this
   719 for a correctness proof. Some initial results in this regard have been
   699 regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
   720 obtained in \cite{AusafDyckhoffUrban2016}. 
   700 what has not been achieved yet is correctness for the bit-coded algorithm
   721 
   701 that involves simplifications and a very tight bound for the size. Such
   722 Unfortunately, the simplification rules outlined above  are not
   702 a tight bound is suggested by work of Antimirov who proved that
   723 sufficient to prevent an explosion for all regular expression. We
   703 (partial) derivatives can be bound by the number of characters contained
   724 believe a tighter bound can be achieved that prevents an explosion in
   704 in the initial regular expression \cite{Antimirov95}.
   725 all cases. Such a tighter bound is suggested by work of Antimirov who
   705 
   726 proved that (partial) derivatives can be bound by the number of
   706 Antimirov defined the \emph{partial derivatives} of regular expressions to be this:
   727 characters contained in the initial regular expression
   707 %TODO definition of partial derivatives
   728 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
       
   729 expressions as follows:
       
   730 
   708 \begin{center}
   731 \begin{center}
   709 \begin{tabular}{lcl}
   732 \begin{tabular}{lcl}
   710  $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
   733  $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
   711  $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\
   734  $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\
   712  $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{  1   \}  \; \textit{else} \; \emptyset$ \\ 
   735  $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{  1   \}  \; \textit{else} \; \emptyset$ \\ 
   713   $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \;  r_2$ \\
   736   $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \;  r_2$ \\
   714    $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \}  \cup pder \; c \; r_2 \; \textit{else} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \} $ \\ 
   737    $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 $\\
       
   738      & & $\textit{then} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \}  \cup pder \; c \; r_2 \;$\\
       
   739      & & $\textit{else} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \} $ \\ 
   715      $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in pder \; c \; r   \}  $ \\  
   740      $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in pder \; c \; r   \}  $ \\  
   716  \end{tabular}
   741  \end{tabular}
   717  \end{center}
   742  \end{center}
   718  A partial derivative of a regular expression $r$ is essentially a set of regular expressions that are either $r$'s children expressions or a concatenation of them. 
   743 
   719 Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression.  Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives.
   744 \noindent
       
   745 A partial derivative of a regular expression $r$ is essentially a set of
       
   746 regular expressions that are either $r$'s children expressions or a
       
   747 concatenation of them. Antimirov has proved a tight bound of the size of
       
   748 partial derivatives. \comment{That looks too preliminary to me.} Roughly
       
   749 speaking the size will not exceed the fourth power of the number of
       
   750 nodes in that regular expression. \comment{Improve: which
       
   751 simplifications?}Interestingly, we observed from experiment that after
       
   752 the simplification step, our regular expression has the same size or is
       
   753 smaller than the partial derivatives. This allows us to prove a tight
       
   754 bound on the size of regular expression during the running time of the
       
   755 algorithm if we can establish the connection between our simplification
       
   756 rules and partial derivatives.
   720 
   757 
   721  %We believe, and have generated test
   758  %We believe, and have generated test
   722 %data, that a similar bound can be obtained for the derivatives in
   759 %data, that a similar bound can be obtained for the derivatives in
   723 %Sulzmann and Lu's algorithm. Let us give some details about this next.
   760 %Sulzmann and Lu's algorithm. Let us give some details about this next.
   724 
   761 
   742   $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\
   779   $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\
   743   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
   780   $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
   744                                                  code(\Stars\,vs)$
   781                                                  code(\Stars\,vs)$
   745 \end{tabular}    
   782 \end{tabular}    
   746 \end{center} 
   783 \end{center} 
   747 Here code encodes a value into a bit-sequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode it back into value:\\
   784 
       
   785 Here code encodes a value into a bit-sequence by converting Left into
       
   786 $\Z$, Right into $\S$, the start point of a non-empty star iteration
       
   787 into $\S$, and the border where a local star terminates into $\Z$. This
       
   788 conversion is apparently lossy, as it throws away the character
       
   789 information, and does not decode the boundary between the two operands
       
   790 of the sequence constructor. Moreover, with only the bitcode we cannot
       
   791 even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or
       
   792 $\Stars$. The reason for choosing this compact way of storing
       
   793 information is that the relatively small size of bits can be easily
       
   794 moved around. In order to recover the bitcode back into values, we will
       
   795 need the regular expression as the extra information and decode it back
       
   796 into value:\\
       
   797 
       
   798 
   748 %\begin{definition}[Bitdecoding of Values]\mbox{}
   799 %\begin{definition}[Bitdecoding of Values]\mbox{}
   749 \begin{center}
   800 \begin{center}
   750 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   801 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
   751   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
   802   $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
   752   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
   803   $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
   773 \end{tabular}    
   824 \end{tabular}    
   774 \end{center}    
   825 \end{center}    
   775 %\end{definition}
   826 %\end{definition}
   776 
   827 
   777 
   828 
   778 Sulzmann and Lu's integrated the bitcodes into annotated regular expressions by attaching them to the head of every substructure of a regular expression\cite{Sulzmann2014}. They are
   829 Sulzmann and Lu's integrated the bitcodes into annotated regular
   779 defined by the following grammar:
   830 expressions by attaching them to the head of every substructure of a
       
   831 regular expression\cite{Sulzmann2014}. They are defined by the following
       
   832 grammar:
   780 
   833 
   781 \begin{center}
   834 \begin{center}
   782 \begin{tabular}{lcl}
   835 \begin{tabular}{lcl}
   783   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
   836   $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
   784                   & $\mid$ & $\textit{ONE}\;\;bs$\\
   837                   & $\mid$ & $\textit{ONE}\;\;bs$\\
   793 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
   846 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
   794 list of annotated regular expressions. These bitsequences encode
   847 list of annotated regular expressions. These bitsequences encode
   795 information about the (POSIX) value that should be generated by the
   848 information about the (POSIX) value that should be generated by the
   796 Sulzmann and Lu algorithm. 
   849 Sulzmann and Lu algorithm. 
   797 
   850 
   798 To do lexing using annotated regular expressions, we shall first transform the
   851 To do lexing using annotated regular expressions, we shall first
   799 usual (un-annotated) regular expressions into annotated regular
   852 transform the usual (un-annotated) regular expressions into annotated
   800 expressions:\\
   853 regular expressions. This operation is called \emph{internalisation} and
       
   854 defined as follows:
       
   855 
   801 %\begin{definition}
   856 %\begin{definition}
   802 \begin{center}
   857 \begin{center}
   803 \begin{tabular}{lcl}
   858 \begin{tabular}{lcl}
   804   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
   859   $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
   805   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
   860   $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
   813          $\textit{STAR}\;[]\,r^\uparrow$\\
   868          $\textit{STAR}\;[]\,r^\uparrow$\\
   814 \end{tabular}    
   869 \end{tabular}    
   815 \end{center}    
   870 \end{center}    
   816 %\end{definition}
   871 %\end{definition}
   817 
   872 
   818 Here $fuse$ is an auxiliary  function that helps to attach bits to the front of an annotated regular expression. Its definition goes as follows:
   873 \noindent
       
   874 In the fourth clause, $fuse$ is an auxiliary function that helps to attach bits to the
       
   875 front of an annotated regular expression. Its definition is as follows:
       
   876 
   819 \begin{center}
   877 \begin{center}
   820 \begin{tabular}{lcl}
   878 \begin{tabular}{lcl}
   821   $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
   879   $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
   822   $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ &
   880   $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ &
   823      $\textit{ONE}\,(bs\,@\,bs')$\\
   881      $\textit{ONE}\,(bs\,@\,bs')$\\
   830   $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
   888   $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
   831      $\textit{STAR}\,(bs\,@\,bs')\,a$
   889      $\textit{STAR}\,(bs\,@\,bs')\,a$
   832 \end{tabular}    
   890 \end{tabular}    
   833 \end{center}  
   891 \end{center}  
   834 
   892 
   835 After internalise we do successive derivative operations on the annotated regular expression.
   893 \noindent
   836  This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits :\\
   894 After internalise we do successive derivative operations on the
   837 %\begin{definition}{bder}
   895  annotated regular expression. This derivative operation is the same as
       
   896  what we previously have for the simple regular expressions, except that
       
   897  we take special care of the bits :\\
       
   898 
       
   899  %\begin{definition}{bder}
   838 \begin{center}
   900 \begin{center}
   839   \begin{tabular}{@{}lcl@{}}
   901   \begin{tabular}{@{}lcl@{}}
   840   $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   902   $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   841   $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   903   $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   842   $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
   904   $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
   875   $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
   937   $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
   876      $bs \,@\, [\S]$
   938      $bs \,@\, [\S]$
   877 \end{tabular}    
   939 \end{tabular}    
   878 \end{center}    
   940 \end{center}    
   879 %\end{definition}
   941 %\end{definition}
       
   942 
       
   943 \noindent
   880 This function completes the parse tree information by 
   944 This function completes the parse tree information by 
   881 travelling along the path on the regular expression that corresponds to a POSIX value snd collect all the bits, and
   945 travelling along the path on the regular expression that corresponds to a POSIX value snd collect all the bits, and
   882 using S to indicate the end of star iterations. If we take the bitsproduced by $bmkeps$ and decode it, 
   946 using S to indicate the end of star iterations. If we take the bitsproduced by $bmkeps$ and decode it, 
   883 we get the parse tree we need, the working flow looks like this:\\
   947 we get the parse tree we need, the working flow looks like this:\\
   884 \begin{center}
   948 \begin{center}
   940  flatten and distinct to open up nested ALT and reduce as many duplicates as possible.
  1004  flatten and distinct to open up nested ALT and reduce as many duplicates as possible.
   941  Function distinct  keeps the first occurring copy only and remove all later ones when detected duplicates.
  1005  Function distinct  keeps the first occurring copy only and remove all later ones when detected duplicates.
   942  Function flatten opens up nested ALT. Its recursive definition is given below:
  1006  Function flatten opens up nested ALT. Its recursive definition is given below:
   943  \begin{center}
  1007  \begin{center}
   944   \begin{tabular}{@{}lcl@{}}
  1008   \begin{tabular}{@{}lcl@{}}
   945   $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{ map fuse}( \textit{bs, \_} )  \textit{ as}) \; +\!+ \; \textit{flatten} \; as' $ \\
  1009   $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
       
  1010      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
   946   $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
  1011   $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
   947     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as' $ 
  1012     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
   948 \end{tabular}    
  1013 \end{tabular}    
   949 \end{center}  
  1014 \end{center}  
   950 
  1015 
   951  Here flatten behaves like the traditional functional programming flatten function,
  1016 \noindent
       
  1017  \comment{No: functional flatten  does not remove ZEROs}Here flatten behaves like the traditional functional programming flatten function,
   952  what it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$.
  1018  what it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$.
   953 
  1019 
   954 Suppose we apply simplification after each derivative step,
  1020 Suppose we apply simplification after each derivative step,
   955 and view these two operations as an atomic one: $a \backslash_{simp} c \dn \textit{simp}(a \backslash c)$.
  1021 and view these two operations as an atomic one: $a \backslash_{simp} c \dn \textit{simp}(a \backslash c)$.
   956 Then we can use the previous natural extension from derivative w.r.t   character to derivative w.r.t string:
  1022 Then we can use the previous natural extension from derivative w.r.t   character to derivative w.r.t string:
   976 This algorithm effectively keeps the regular expression size small, for example,
  1042 This algorithm effectively keeps the regular expression size small, for example,
   977 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is.
  1043 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is.
   978 
  1044 
   979 
  1045 
   980 
  1046 
   981 We are currently engaged in 2 tasks related to this algorithm. 
  1047 \section{Current Work}
       
  1048 
       
  1049 We are currently engaged in two tasks related to this algorithm. 
       
  1050 
       
  1051 \begin{itemize}
       
  1052 \item  
   982 The first one is proving that our simplification rules
  1053 The first one is proving that our simplification rules
   983 actually do not affect the POSIX value that should be generated by the
  1054 actually do not affect the POSIX value that should be generated by the
   984 algorithm according to the specification of a POSIX value
  1055 algorithm according to the specification of a POSIX value
   985  and furthermore obtain a much
  1056  and furthermore obtain a much
   986 tighter bound on the sizes of derivatives. The result is that our
  1057 tighter bound on the sizes of derivatives. The result is that our
  1001   have to work out all proof details.''
  1072   have to work out all proof details.''
  1002 \end{quote}  
  1073 \end{quote}  
  1003 
  1074 
  1004 \noindent
  1075 \noindent
  1005 We would settle the correctness claim.
  1076 We would settle the correctness claim.
  1006 It is relatively straightforward to establish that after 1 simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
  1077 It is relatively straightforward to establish that after one simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
  1007 bmkeps r = bmkeps simp r
  1078 \comment{that  only holds when r is nullable}bmkeps r = bmkeps simp r
  1008 as this basically comes down to proving actions like removing the additional $r$ in $r+r$  does not delete important POSIX information in a regular expression.
  1079 as this basically comes down to proving actions like removing the additional $r$ in $r+r$  does not delete important POSIX information in a regular expression.
  1009 The hardcore of this problem is to prove that
  1080 The hardcore of this problem is to prove that
  1010 bmkeps bders r = bmkeps bders simp r
  1081 bmkeps bders r = bmkeps bders simp r
  1011 That is, if we do derivative on regular expression r and the simplified version for, they can still prove the same POSIX value if there is one . This is not as straightforward as the previous proposition, as the two regular expression r and simp r  might become very different regular expressions after repeated application ofd simp and derivative.
  1082 That is, if we do derivative on regular expression r and the simplified version for, they can still prove the same POSIX value if there is one . This is not as straightforward as the previous proposition, as the two regular expression r and simp r  might become very different regular expressions after repeated application ofd simp and derivative.
  1012 The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification.
  1083 The crucial point is to find the \comment{What?}"gene" of a regular expression and how it is kept intact during simplification.
  1013 To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu:
  1084 To aid this, we are use the helping function retrieve described by Sulzmann and Lu:
  1014 \\definition of retrieve\\
  1085 \\definition of retrieve\\
  1015  This function assembled the bitcode that corresponds to a parse tree for how the current derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value).
  1086  This function assembled the bitcode that corresponds to a parse tree for how the current derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value).
  1016  Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\
  1087  Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\
  1017  $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\
  1088  $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\
  1018  A little fact that needs to be stated to help comprehension:\\
  1089  A little fact that needs to be stated to help comprehension:\\
  1026  $\textit{retrieve} \; r   \backslash  s \; v = \;\textit{retrieve} \; \textit{simp}(r)  \backslash  s \; v'$\\
  1097  $\textit{retrieve} \; r   \backslash  s \; v = \;\textit{retrieve} \; \textit{simp}(r)  \backslash  s \; v'$\\
  1027  and subsequently\\
  1098  and subsequently\\
  1028  $\textit{retrieve} \; r \backslash  s \; v\; = \; \textit{retrieve} \; r  \backslash_{simp}   s \; v'$.\\
  1099  $\textit{retrieve} \; r \backslash  s \; v\; = \; \textit{retrieve} \; r  \backslash_{simp}   s \; v'$.\\
  1029  This proves that our simplified version of regular expression still contains all the bitcodes needed.
  1100  This proves that our simplified version of regular expression still contains all the bitcodes needed.
  1030 
  1101 
  1031 The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplification(the naive version as implemented in ADU of course can explode in some cases).
  1102 \item
  1032 So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress.
  1103 The second task is to speed up the more aggressive simplification.
       
  1104 Currently it is slower than a naive simplification(the naive version as
       
  1105 implemented in ADU of course can explode in some cases). So it needs to
       
  1106 be explored how to make it faster. Our possibility would be to explore
       
  1107 again the connection to DFAs. This is very much work in progress.
       
  1108 \end{itemize}
  1033 
  1109 
  1034 \section{Conclusion}
  1110 \section{Conclusion}
  1035 
  1111 
  1036 In this PhD-project we are interested in fast algorithms for regular
  1112 In this PhD-project we are interested in fast algorithms for regular
  1037 expression matching. While this seems to be a ``settled'' area, in
  1113 expression matching. While this seems to be a ``settled'' area, in