ninems/ninems.tex
changeset 80 d9d61a648292
parent 79 481c8000de6d
child 81 a0df84875788
equal deleted inserted replaced
79:481c8000de6d 80:d9d61a648292
   584 $abc$ in one star iteration, using the longest alternative $abc$ in the
   584 $abc$ in one star iteration, using the longest alternative $abc$ in the
   585 sub-expression $((((a+b)+ab)+c)+abc)$ (we use $r$ to denote this sub-expression
   585 sub-expression $((((a+b)+ab)+c)+abc)$ (we use $r$ to denote this sub-expression
   586 for conciseness). 
   586 for conciseness). 
   587 Before $\textit{inj}$ comes into play, 
   587 Before $\textit{inj}$ comes into play, 
   588 our lexer first builds derivative using string $abc$ (we simplified some regular expressions like
   588 our lexer first builds derivative using string $abc$ (we simplified some regular expressions like
   589 $0 \cdot b$ to $0$ for conciseness; we also omit parentheses if
   589 $\ZERO \cdot b$ to $\ZERO$ for conciseness; we also omit parentheses if
   590 they are clear from the context):
   590 they are clear from the context):
   591 %Similarly, we allow
   591 %Similarly, we allow
   592 %$\textit{ALT}$ to take a list of regular expressions as an argument
   592 %$\textit{ALT}$ to take a list of regular expressions as an argument
   593 %instead of just 2 operands to reduce the nested depth of
   593 %instead of just 2 operands to reduce the nested depth of
   594 %$\textit{ALT}$
   594 %$\textit{ALT}$
   744 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
   744 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular
   745 expressions as follows:
   745 expressions as follows:
   746 
   746 
   747 \begin{center}
   747 \begin{center}
   748 \begin{tabular}{lcl}
   748 \begin{tabular}{lcl}
   749  $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
   749  $\textit{pder} \; c \; \ZERO$ & $\dn$ & $\emptyset$\\
   750  $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\
   750  $\textit{pder} \; c \; \ONE$ & $\dn$ & $\emptyset$ \\
   751  $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{  1   \}  \; \textit{else} \; \emptyset$ \\ 
   751  $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{  \ONE   \}  \; \textit{else} \; \emptyset$ \\ 
   752   $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \;  r_2$ \\
   752   $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \;  r_2$ \\
   753    $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 $\\
   753    $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 $\\
   754      & & $\textit{then} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \}  \cup pder \; c \; r_2 \;$\\
   754      & & $\textit{then} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \}  \cup pder \; c \; r_2 \;$\\
   755      & & $\textit{else} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \} $ \\ 
   755      & & $\textit{else} \; \{  r \cdot r_2 \mid r \in pder \; c \; r_1   \} $ \\ 
   756      $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in pder \; c \; r   \}  $ \\  
   756      $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{  r' \cdot r^* \mid r' \in pder \; c \; r   \}  $ \\  
   927 \noindent
   927 \noindent
   928 After internalise we do successive derivative operations on the
   928 After internalise we do successive derivative operations on the
   929 annotated regular expressions. This derivative operation is the same as
   929 annotated regular expressions. This derivative operation is the same as
   930 what we previously have for the simple regular expressions, except that
   930 what we previously have for the simple regular expressions, except that
   931 we beed to take special care of the bitcodes:\comment{You need to be consitent with  ALTS and ALT; ALT is just 
   931 we beed to take special care of the bitcodes:\comment{You need to be consitent with  ALTS and ALT; ALT is just 
   932 an abbreviation; derivations and so on are defined for ALTS}
   932 an abbreviation; derivations and so on are defined for ALTS}\comment{no this is not the case,
       
   933 ALT for 2 regexes, ALTS for a list}
   933 
   934 
   934  %\begin{definition}{bder}
   935  %\begin{definition}{bder}
   935 \begin{center}
   936 \begin{center}
   936   \begin{tabular}{@{}lcl@{}}
   937   \begin{tabular}{@{}lcl@{}}
   937   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
   938   $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
  1037 
  1038 
  1038 \begin{center}
  1039 \begin{center}
  1039   \begin{tabular}{@{}lcl@{}}
  1040   \begin{tabular}{@{}lcl@{}}
  1040    
  1041    
  1041   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
  1042   $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
  1042    &&$\quad\textit{case} \; (0, \_) \Rightarrow  0$ \\
  1043    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
  1043    &&$\quad\textit{case} \; (\_, 0) \Rightarrow  0$ \\
  1044    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
  1044    &&$\quad\textit{case} \;  (1, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
  1045    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
  1045    &&$\quad\textit{case} \; (a_1', 1) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
  1046    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
  1046    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
  1047    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
  1047 
  1048 
  1048   $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
  1049   $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
  1049   &&$\quad\textit{case} \; [] \Rightarrow  0$ \\
  1050   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
  1050    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
  1051    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
  1051    &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALT}\;bs\;as'$\\ 
  1052    &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\ 
  1052 
  1053 
  1053    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
  1054    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
  1054 \end{tabular}    
  1055 \end{tabular}    
  1055 \end{center}    
  1056 \end{center}    
  1056 
  1057 
  1062 $\ONE$, which might trigger further simplification at the current level.
  1063 $\ONE$, which might trigger further simplification at the current level.
  1063 The most involved part is the $\textit{ALTS}$ clause, where we use two
  1064 The most involved part is the $\textit{ALTS}$ clause, where we use two
  1064 auxiliary functions flatten and distinct to open up nested
  1065 auxiliary functions flatten and distinct to open up nested
  1065 $\textit{ALTS}$ and reduce as many duplicates as possible. Function
  1066 $\textit{ALTS}$ and reduce as many duplicates as possible. Function
  1066 distinct  keeps the first occurring copy only and remove all later ones
  1067 distinct  keeps the first occurring copy only and remove all later ones
  1067 when detected duplicates. Function flatten opens up nested \textit{ALT}.
  1068 when detected duplicates. Function flatten opens up nested \textit{ALTS}.
  1068 Its recursive definition is given below:
  1069 Its recursive definition is given below:
  1069 
  1070 
  1070  \begin{center}
  1071  \begin{center}
  1071   \begin{tabular}{@{}lcl@{}}
  1072   \begin{tabular}{@{}lcl@{}}
  1072   $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
  1073   $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
  1073      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  1074      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  1074   $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
  1075   $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
  1075     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
  1076     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) 
  1076 \end{tabular}    
  1077 \end{tabular}    
  1077 \end{center}  
  1078 \end{center}  
  1088 w.r.t.~string:\comment{simp in  the [] case?}
  1089 w.r.t.~string:\comment{simp in  the [] case?}
  1089 
  1090 
  1090 \begin{center}
  1091 \begin{center}
  1091 \begin{tabular}{lcl}
  1092 \begin{tabular}{lcl}
  1092 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
  1093 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
  1093 $r \backslash [\,] $ & $\dn$ & $r$
  1094 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
  1094 \end{tabular}
  1095 \end{tabular}
  1095 \end{center}
  1096 \end{center}
  1096 
  1097 
  1097 \noindent
  1098 \noindent
  1098 we obtain an optimised version of the algorithm:
  1099 we obtain an optimised version of the algorithm:
  1140 
  1141 
  1141 \noindent We would settle this correctness claim. It is relatively
  1142 \noindent We would settle this correctness claim. It is relatively
  1142 straightforward to establish that after one simplification step, the part of a
  1143 straightforward to establish that after one simplification step, the part of a
  1143 nullable derivative that corresponds to a POSIX value remains intact and can
  1144 nullable derivative that corresponds to a POSIX value remains intact and can
  1144 still be collected, in other words, we can show that\comment{Double-check....I
  1145 still be collected, in other words, we can show that\comment{Double-check....I
  1145 think this  is not the case}
  1146 think this  is not the case}\comment{If i remember correctly, you have proved this lemma.
  1146 
  1147 I feel this is indeed not true because you might place arbitrary 
  1147 \begin{center}
  1148 bits on the regex r, however if this is the case, did i remember wrongly that
  1148 $\textit{bmkeps} \; r = \textit{bmkeps} \; \textit{simp} \; r\;( r\; \textit{nullable})$
  1149 you proved something like simplification does not affect $\textit{bmkeps}$ results?
       
  1150 Anyway, i have amended this a little bit so it does not allow arbitrary bits attached
       
  1151 to a regex. Maybe it works now.}
       
  1152 
       
  1153 \begin{center}
       
  1154 $\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;( a\; \textit{bnullable} and \textit{decode}(r, \textit{bmkeps}(a)) is a \textit{POSIX} value)$
  1149 \end{center}
  1155 \end{center}
  1150 
  1156 
  1151 \noindent
  1157 \noindent
  1152 as this basically comes down to proving actions like removing the
  1158 as this basically comes down to proving actions like removing the
  1153 additional $r$ in $r+r$  does not delete important POSIX information in
  1159 additional $r$ in $r+r$  does not delete important POSIX information in
  1157 	$\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$
  1163 	$\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$
  1158 \end{center}
  1164 \end{center}
  1159 
  1165 
  1160 \noindent That is, if we do derivative on regular expression $r$ and then
  1166 \noindent That is, if we do derivative on regular expression $r$ and then
  1161 simplify it, and repeat this process until we exhaust the string, we get a
  1167 simplify it, and repeat this process until we exhaust the string, we get a
  1162 regular expression $r''$ that provides the POSIX matching as the result $r'$ of
  1168 regular expression $r''$($\textit{LHS}$)  that provides the POSIX matching
  1163 the normal derivative algorithm which only does derivative operation repeatedly
  1169 information, which is exactly the same as the result $r'$($\textit{RHS}$ of the
  1164 and has no simplification at all.  This might seem at first glance very
  1170 normal derivative algorithm that only does derivative repeatedly and has no
  1165 unintuitive, as the $r'$ is exponentially larger than $r''$. But this can be
  1171 simplification at all.  This might seem at first glance very unintuitive, as
  1166 explained in the following way: we are pruning away the possible matches that
  1172 the $r'$ is exponentially larger than $r''$, but can be explained in the
  1167 are not POSIX. Since there are exponentially non-POSIX matchings and only 1
  1173 following way: we are pruning away the possible matches that are not POSIX.
  1168 POSIX matching, it is understandable that our $r''$ can be a lot smaller.  we
  1174 Since there are exponentially non-POSIX matchings and only 1 POSIX matching, it
  1169 can still provide the same POSIX value if there is one.  This is not as
  1175 is understandable that our $r''$ can be a lot smaller.  we can still provide
  1170 straightforward as the previous proposition, as the two regular expressions $r$
  1176 the same POSIX value if there is one.  This is not as straightforward as the
  1171 and $\textit{simp}\; r$ might become very different regular expressions after
  1177 previous proposition, as the two regular expressions $r'$ and $r''$ might have
  1172 repeated application of $\textit{simp}$ and derivative.  The crucial point is
  1178 become very different regular expressions.  The crucial point is to find the
  1173 to find the indispensable information of a regular expression and how it is
  1179 $\textit{POSIX}$  information of a regular expression and how it is modified,
  1174 kept intact during simplification so that it performs as good as a regular
  1180 augmented and propagated 
  1175 expression that has not been simplified in the subsequent derivative
  1181 during simplification in parallel with the regularr expression that
  1176 operations.  To aid this, we use the helping function retrieve described by
  1182 has not been simplified in the subsequent derivative operations.  To aid this,
  1177 Sulzmann and Lu: \\definition of retrieve\\
  1183 we use the helping function retrieve described by Sulzmann and Lu: \\definition
  1178  
  1184 of retrieve TODO\\
  1179 This function assembled the bitcode that corresponds to a parse tree for
  1185 This function assembles the bitcode that corresponds to a parse tree for how
  1180 how the current derivative matches the suffix of the string(the
  1186 the current derivative matches the suffix of the string(the characters that
  1181 characters that have not yet appeared, but is stored in the value).
  1187 have not yet appeared, but will appear as the successive derivatives go on,
  1182 Sulzmann and Lu used this to connect the bitcoded algorithm to the older
  1188 how do we get this "future" information? By the value $v$, which is
  1183 algorithm by the following equation:
  1189 computed by a pass of the algorithm that uses
       
  1190 $inj$ as described in the previous section).  Sulzmann and Lu used this
       
  1191 to connect the bitcoded algorithm to the older algorithm by the following
       
  1192 equation:
  1184  
  1193  
  1185  \begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\;
  1194  \begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\;
  1186 	 ((\textit{internalise}\; r)\backslash_{simp} c) v)$ \end{center} A
  1195 	 ((\textit{internalise}\; r)\backslash_{simp} c) v)$ 
  1187 	 little fact that needs to be stated to help comprehension:
  1196  \end{center} 
  1188 	 \begin{center} $r^\uparrow = a$($a$ stands for $\textit{annotated}).$
  1197  A little fact that needs to be stated to help comprehension:
  1189 	 \end{center} Ausaf and Urban also used this fact to prove  the
  1198 \begin{center} 
  1190 	 correctness of bitcoded algorithm without simplification.  Our purpose
  1199 	$r^\uparrow = a$($a$ stands for $\textit{annotated}).$
  1191 	 of using this, however, is try to establish \\ $ \textit{retrieve} \;
  1200 \end{center} 
  1192 	 a \; v \;=\; \textit{retrieve}  \; \textit{simp}(a) \; v'.$\\ The idea
  1201 Ausaf and Urban also used this fact to prove  the
  1193 	 is that using $v'$, a simplified version of $v$ that possibly had gone
  1202 correctness of bitcoded algorithm without simplification.  Our purpose
  1194 	 through the same simplification step as $\textit{simp}(a)$ we are
  1203 of using this, however, is to establish 
  1195 	 still  able to extract the bit-sequence that gives the same parsing
  1204 \begin{center}
  1196 	 information as the unsimplified one.  After establishing this, we
  1205 $ \textit{retrieve} \;
  1197 	 might be able to finally bridge the gap of proving\\
  1206 a \; v \;=\; \textit{retrieve}  \; \textit{simp}(a) \; v'.$
  1198 	 $\textit{retrieve} \; r   \backslash  s \; v = \;\textit{retrieve} \;
  1207 \end{center}
  1199 	 \textit{simp}(r)  \backslash  s \; v'$\\ and subsequently\\
  1208 The idea
  1200 	 $\textit{retrieve} \; r \backslash  s \; v\; = \; \textit{retrieve} \;
  1209 is that using $v'$, a simplified version of $v$ that possibly had gone
  1201 	 r  \backslash_{simp}   s \; v'$.\\ This proves that our simplified
  1210 through the same simplification step as $\textit{simp}(a)$ we are
  1202 	 version of regular expression still contains all the bitcodes needed.
  1211 able to extract the bit-sequence that gives the same parsing
       
  1212 information as the unsimplified one.  After establishing this, we
       
  1213 might be able to finally bridge the gap of proving
       
  1214 \begin{center}
       
  1215 $\textit{retrieve} \; r   \backslash  s \; v = \;\textit{retrieve} \;
       
  1216 \textit{simp}(r)  \backslash  s \; v'$
       
  1217 \end{center}
       
  1218 and subsequently
       
  1219 \begin{center}
       
  1220 $\textit{retrieve} \; r \backslash  s \; v\; = \; \textit{retrieve} \;
       
  1221 r  \backslash_{simp}   s \; v'$.
       
  1222 \end{center}
       
  1223 This proves that our simplified
       
  1224 version of regular expression still contains all the bitcodes needed.
  1203 
  1225 
  1204 
  1226 
  1205 The second task is to speed up the more aggressive simplification.
  1227 The second task is to speed up the more aggressive simplification.
  1206 Currently it is slower than a naive simplification(the naive version as
  1228 Currently it is slower than a naive simplification(the naive version as
  1207 implemented in ADU of course can explode in some cases). So it needs to
  1229 implemented in ADU of course can explode in some cases). So it needs to