ninems/ninems.tex
changeset 71 0573615e41a3
parent 70 cab5eab1f6f1
child 72 83b021fc7d29
that the hole will be on $r_a$. So we recursively call $\inj\,
r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value
$v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
Other clauses can be understood in a similar way.

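The sequence clause just described can be sketched in code. This is a minimal illustration using a hypothetical tuple encoding of regular expressions and values (not the notation used elsewhere in this report), covering only the $\textit{CHAR}$ and left-$\textit{SEQ}$ clauses:

```python
# Sketch of the SEQ clause of inj: the hole left by the derivative sits
# in the left component va, so injecting character c rebuilds
# Seq(inj(ra, c, va), vb), exactly as described above.

def inj(r, c, v):
    """Inject character c back into value v, guided by r (partial sketch)."""
    if r[0] == 'CHAR':                      # r = CHAR d, v = Empty
        return ('Char', c)
    if r[0] == 'SEQ':                       # r = ra . rb
        _, ra, rb = r
        if v[0] == 'Seq':                   # hole is in the left value
            _, va, vb = v
            return ('Seq', inj(ra, c, va), vb)
    raise ValueError('clause not covered in this sketch')

# Injecting 'a' into Seq(Empty, Char 'b') for the regex a . b
r = ('SEQ', ('CHAR', 'a'), ('CHAR', 'b'))
v = ('Seq', ('Empty',), ('Char', 'b'))
print(inj(r, 'a', v))    # ('Seq', ('Char', 'a'), ('Char', 'b'))
```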
The following example gives an insight into $\textit{inj}$'s effect
and how Sulzmann and Lu's algorithm works as a whole. Suppose we have a
regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against
the string $abc$ (when $abc$ is written as a regular expression, the most
standard way of expressing it is $a \cdot (b \cdot c)$. We omit
store the information for parse trees inside the regular expression,
update it when we take derivatives of them, and collect the information
when finished with derivatives and call $mkeps$ to decide which
branch is POSIX, we can generate the parse tree in one pass, instead of
performing the remaining $n$ injections. This leads to Sulzmann and Lu's novel
idea of using bitcodes in derivatives.

In the next section, we shall focus on the bit-coded algorithm and the
process of simplification of regular expressions. This is needed in
order to obtain \emph{fast} versions of Brzozowski's, and Sulzmann
and Lu's algorithms. This is where the PhD-project aims to advance the
as the simplifications of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
affect the answer for whether a regular expression matches a string or
not, but fortunately also do not affect the POSIX strategy of how
regular expressions match strings---although the latter is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
   721 
   718 
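The five rules just listed can be sketched as a small rewriting function. The tuple encoding of regular expressions below is hypothetical, chosen only for illustration:

```python
# Sketch of the simplification rules above: 0+r, r+0, 1.r, r.1 and
# r+r all rewrite to r; children are simplified first (bottom-up).

def simp(r):
    if r[0] == 'SEQ':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('ONE',):  return r2        # 1 . r  =>  r
        if r2 == ('ONE',):  return r1        # r . 1  =>  r
        return ('SEQ', r1, r2)
    if r[0] == 'ALT':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('ZERO',): return r2        # 0 + r  =>  r
        if r2 == ('ZERO',): return r1        # r + 0  =>  r
        if r1 == r2:        return r1        # r + r  =>  r
        return ('ALT', r1, r2)
    return r

# (1 . a) + (0 + a)  simplifies to just  a
r = ('ALT', ('SEQ', ('ONE',), ('CHAR', 'a')), ('ALT', ('ZERO',), ('CHAR', 'a')))
print(simp(r))    # ('CHAR', 'a')
```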
Unfortunately, the simplification rules outlined above are not
sufficient to prevent an explosion for all regular expressions. We
believe a tighter bound can be achieved that prevents an explosion in

\noindent
A partial derivative of a regular expression $r$ is essentially a set of
regular expressions that are either $r$'s child expressions or a
concatenation of them. Antimirov has proved a tight bound on the size of
partial derivatives: roughly speaking, the size will not exceed $t^2$,
where $t$ is the number of characters appearing in that regular
expression. If we want the size of derivatives to stay below this bound,
we would need more aggressive simplifications, such as opening up
alternatives to achieve the maximum level of duplicate cancellation.
For example, the parentheses in $(a+b) \cdot c + b\cdot c$ can be opened up to
get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to $a \cdot c + b \cdot c$.
Another example is from $(a^*+a) + (a^*+ 1) + (a +1)$ to $a^*+a+1$.
Adding these more aggressive simplification rules
helped us to achieve the same size bound as that of the partial derivatives.
To introduce these ``spilling out alternatives'' simplifications
and make the correctness proof easier,
we used bitcodes.
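The ``opening up'' of alternatives can be sketched as follows; the tuple encoding and the function names are hypothetical, used only to make the $(a+b)\cdot c + b\cdot c$ example above concrete:

```python
# Sketch of aggressive simplification: distribute sequences over
# alternatives on their left, flatten nested alternatives into one
# list of branches, then cancel duplicates (r + r => r).

def open_alts(r):
    """Return the list of alternative branches of r, with sequences
    distributed over alternatives on their left-hand side."""
    if r[0] == 'ALT':
        return open_alts(r[1]) + open_alts(r[2])
    if r[0] == 'SEQ':
        return [('SEQ', b, r[2]) for b in open_alts(r[1])]
    return [r]

def simp_open(r):
    seen, out = set(), []
    for b in open_alts(r):              # duplicates cancel here
        if b not in seen:
            seen.add(b)
            out.append(b)
    res = out[-1]                       # re-assemble right-nested ALTs
    for b in reversed(out[:-1]):
        res = ('ALT', b, res)
    return res

# (a+b).c + b.c  opens to  a.c + b.c + b.c,  then dedupes to  a.c + b.c
r = ('ALT',
     ('SEQ', ('ALT', ('CHAR', 'a'), ('CHAR', 'b')), ('CHAR', 'c')),
     ('SEQ', ('CHAR', 'b'), ('CHAR', 'c')))
print(simp_open(r))
```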

%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.

%We believe, and have generated test
%data, that a similar bound can be obtained for the derivatives in
%Sulzmann and Lu's algorithm. Let us give some details about this next.

Bitcodes look like this:
\begin{center}
$b ::= S \mid Z \qquad bs ::= [] \mid b:bs$
\end{center}
They are just a string of bits; the names $S$ and $Z$ here are quite
arbitrary, and we could use 0 and 1 or any other pair of binary symbols
instead. Bitcodes (or bit-sequences) are a compact form of parse trees.
Bitcodes are essentially incomplete values.
This can be straightforwardly seen in the following transformation:
\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
       \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
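The transformation from values to bitcodes can be sketched as below. The encoding of values as tuples and the choice of which bit marks which constructor are assumptions made for illustration only (conventions differ between presentations):

```python
# Sketch of code: flatten a value into a bit-sequence, recording only
# the choices taken. Here Z marks Left, S marks Right, stars record one
# S per iteration followed by a terminating Z; code(Empty) = [] as in
# the first clause of the table above.

def code(v):
    if v[0] == 'Empty':
        return []
    if v[0] == 'Char':                  # the character is in the regex
        return []
    if v[0] == 'Left':
        return ['Z'] + code(v[1])
    if v[0] == 'Right':
        return ['S'] + code(v[1])
    if v[0] == 'Seq':
        return code(v[1]) + code(v[2])
    if v[0] == 'Stars':
        return [b for u in v[1] for b in ['S'] + code(u)] + ['Z']

v = ('Seq', ('Right', ('Char', 'b')), ('Empty',))
print(code(v))    # ['S']
```

Notice that the characters themselves are not recorded: they can be recovered from the regular expression during decoding, which is why bitcodes are so compact.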

Sulzmann and Lu integrated the bitcodes into regular
expressions to create annotated regular expressions.
This is done by attaching bitcodes to the head of every substructure of a
regular expression\cite{Sulzmann2014}. Annotated regular expressions
are defined by the following
grammar:

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
                  & $\mid$ & $\textit{ONE}\;\;bs$\\
                  & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
                  & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
                  & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
                  & $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}
\end{center}

\noindent
where $bs$ stands for bit-sequences, and $as$ (in \textit{ALTS}) for a
list of annotated regular expressions. These bit-sequences encode
information about the (POSIX) value that should be generated by the
Sulzmann and Lu algorithm.
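The grammar above has a direct rendering in code. The following is a hypothetical Python transcription (field and class names are ours), showing that each constructor except $\textit{ZERO}$ carries a bit-sequence $bs$:

```python
# Annotated regular expressions as Python dataclasses: every
# constructor except ZERO carries a bit-sequence bs, mirroring the
# grammar above. ALTS holds a list of annotated regexes.

from dataclasses import dataclass, field
from typing import List


class ARexp:
    """Base class for annotated regular expressions."""


@dataclass
class ZERO(ARexp):
    pass


@dataclass
class ONE(ARexp):
    bs: List[str]


@dataclass
class CHAR(ARexp):
    bs: List[str]
    c: str


@dataclass
class ALTS(ARexp):
    bs: List[str]
    as_: List[ARexp]          # 'as' is a Python keyword, hence as_


@dataclass
class SEQ(ARexp):
    bs: List[str]
    a1: ARexp
    a2: ARexp


@dataclass
class STAR(ARexp):
    bs: List[str]
    a: ARexp


a = ALTS([], [CHAR(['Z'], 'a'), CHAR(['S'], 'b')])
print(a.as_[1].bs)    # ['S']
```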
To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
\end{tabular}
\end{center}
Here $(r^\uparrow)\backslash s$ is similar to what we have previously defined for
$r\backslash s$.

The main point of the bit-sequences and annotated regular expressions
is that we can apply rather aggressive (in terms of size)
simplification rules in order to keep derivatives small.

We have
developed such ``aggressive'' simplification rules and generated test
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
\end{tabular}
\end{center}
\noindent
Here flatten behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. What it does is basically
remove parentheses, for example changing $a+(b+c)$ into $a+b+c$.
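This behaviour of flatten can be sketched as follows, again over a hypothetical tuple encoding of annotated regular expressions:

```python
# Sketch of flatten over a list of annotated alternatives: nested
# ALTS are opened up (a+(b+c) becomes a+b+c) and ZEROs are removed,
# matching the description above. An ALTS node is ('ALTS', bs, children).

def flatten(as_list):
    out = []
    for a in as_list:
        if a[0] == 'ZERO':
            continue                    # ZEROs are removed
        if a[0] == 'ALTS':
            out.extend(flatten(a[2]))   # open up nested alternatives
        else:
            out.append(a)
    return out

# a + (b + c)  =>  [a, b, c]
alts = [('CHAR', [], 'a'),
        ('ALTS', [], [('CHAR', [], 'b'), ('CHAR', [], 'c')])]
print(flatten(alts))
```

(In the real bit-coded simplification the bits of a nested $\textit{ALTS}$ would also have to be fused onto its children; that step is omitted from this sketch.)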
Suppose we apply simplification after each derivative step, and view
these two operations as an atomic one: $a \backslash_{simp} c \dn
\textit{simp}(a \backslash c)$. Then we can use the previous natural
extension from derivative w.r.t.\ character to derivative w.r.t.\ string:
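The derive-then-simplify loop can be sketched as below, here over plain (un-annotated) regular expressions for brevity; the $\textit{der}$ and $\textit{simp}$ functions are illustrative stand-ins, not the bit-coded versions:

```python
# Sketch of the atomic step a \_simp c = simp(a \ c), folded over a
# whole string. Plain tuple-encoded regexes are used for illustration.

def nullable(r):
    return (r[0] == 'ONE' or r[0] == 'STAR'
            or (r[0] == 'ALT' and (nullable(r[1]) or nullable(r[2])))
            or (r[0] == 'SEQ' and nullable(r[1]) and nullable(r[2])))

def der(c, r):
    if r[0] in ('ZERO', 'ONE'):
        return ('ZERO',)
    if r[0] == 'CHAR':
        return ('ONE',) if r[1] == c else ('ZERO',)
    if r[0] == 'ALT':
        return ('ALT', der(c, r[1]), der(c, r[2]))
    if r[0] == 'SEQ':
        left = ('SEQ', der(c, r[1]), r[2])
        return ('ALT', left, der(c, r[2])) if nullable(r[1]) else left
    if r[0] == 'STAR':
        return ('SEQ', der(c, r[1]), r)

def simp(r):
    if r[0] == 'ALT':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('ZERO',): return r2
        if r2 == ('ZERO',): return r1
        if r1 == r2:        return r1
        return ('ALT', r1, r2)
    if r[0] == 'SEQ':
        r1, r2 = simp(r[1]), simp(r[2])
        if ('ZERO',) in (r1, r2): return ('ZERO',)
        if r1 == ('ONE',): return r2
        if r2 == ('ONE',): return r1
        return ('SEQ', r1, r2)
    return r

def ders_simp(r, s):
    for c in s:
        r = simp(der(c, r))       # one derivative step, then simplify
    return r

# (a . b) derived w.r.t. "ab", simplifying after each step, gives ONE
print(ders_simp(('SEQ', ('CHAR', 'a'), ('CHAR', 'b')), "ab"))
```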
  have to work out all proof details.''
\end{quote}

\noindent
We want to settle the correctness claim.
It is relatively straightforward to establish that after one
simplification step, the part of a nullable derivative that corresponds
to a POSIX value remains intact and can still be collected; in other
words,
\begin{center}
$\textit{bmkeps} \; r \;=\; \textit{bmkeps} \; (\textit{simp} \; r)$ \quad (provided $r$ is nullable)
\end{center}
as this basically comes down to proving that actions like removing the
additional $r$ in $r+r$ do not delete important POSIX information in a
regular expression.
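To make the role of $\textit{bmkeps}$ concrete, here is a sketch over the hypothetical tuple encoding of annotated regular expressions used earlier; picking the first nullable alternative models the POSIX (leftmost) choice:

```python
# Sketch of bmkeps: collect the bits of a nullable annotated regex,
# preferring the leftmost nullable alternative (the POSIX choice).
# Nodes: ('ONE', bs), ('CHAR', bs, c), ('ALTS', bs, children),
# ('SEQ', bs, a1, a2), ('STAR', bs, a).

def bnullable(a):
    t = a[0]
    if t == 'ONE':
        return True
    if t in ('ZERO', 'CHAR'):
        return False
    if t == 'ALTS':
        return any(bnullable(x) for x in a[2])
    if t == 'SEQ':
        return bnullable(a[2]) and bnullable(a[3])
    if t == 'STAR':
        return True

def bmkeps(a):
    t = a[0]
    if t == 'ONE':
        return a[1]
    if t == 'ALTS':                   # first nullable branch = POSIX
        for x in a[2]:
            if bnullable(x):
                return a[1] + bmkeps(x)
    if t == 'SEQ':
        return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if t == 'STAR':
        return a[1] + ['Z']           # one convention: Z ends the star

a = ('ALTS', [], [('CHAR', [], 'a'), ('ONE', ['S'])])
print(bmkeps(a))    # ['S']
```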
The core difficulty of this problem is to prove that
\begin{center}
$\textit{bmkeps} \; (\textit{bders}\; s\; r) \;=\; \textit{bmkeps} \; (\textit{bders}\; s\; (\textit{simp}\; r))$
\end{center}
That is, if we take derivatives of a regular expression $r$ and of its
simplified version, they can still provide the same POSIX value if there
is one. This is not as straightforward as the previous proposition, as
the two regular expressions $r$ and $\textit{simp}\; r$ might become
very different regular expressions after repeated application of
$\textit{simp}$ and derivative. The crucial point is to find the
indispensable information of a regular expression and how it is kept
intact during simplification, so that it performs as well as a regular
expression that has not been simplified in the subsequent derivative
operations. To aid this, we use the helping function retrieve described
by Sulzmann and Lu:
\\definition of retrieve\\
This function assembles the bitcode that corresponds to a parse tree for
how the current derivative matches the suffix of the string (the
characters that have not yet appeared, but are stored in the value).
Sulzmann and Lu used this to connect the bit-coded algorithm to the
older algorithm by the following equation:\\
$\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash c) \; v)$\\
A little fact that needs to be stated to help comprehension:\\
$r^\uparrow = a$ ($a$ stands for \textit{annotated}).\\
Fahad and Christian also used this fact to prove the correctness of the
bit-coded algorithm without simplification. Our purpose of using this,
however, is to try to establish\\
$\textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; (\textit{simp}\;a) \; v'.$\\
The idea is that using $v'$, a simplified version of $v$ that has
possibly gone through the same simplification steps as
$\textit{simp}\;a$, we are still able to extract the bit-sequence that
gives the same parsing information as the unsimplified one. After
establishing this, we might be able to finally bridge the gap by
proving\\
$\textit{retrieve} \; (r \backslash s) \; v \;=\; \textit{retrieve} \; ((\textit{simp}\;r) \backslash s) \; v'$\\
and subsequently\\
$\textit{retrieve} \; (r \backslash s) \; v \;=\; \textit{retrieve} \; (r \backslash_{simp} s) \; v'$.\\
This would prove that our simplified version of a regular expression
still contains all the bitcodes needed.