etnms/etnms.tex
changeset 131 b6984212cf87
parent 130 4d6f54c478b5
child 132 69292c2b54d8
equal deleted inserted replaced
130:4d6f54c478b5 131:b6984212cf87
   359 $\textit{der}$ and $\nullable$ need to traverse such trees and
   359 $\textit{der}$ and $\nullable$ need to traverse such trees and
   360 consequently the bigger the size of the derivative the slower the
   360 consequently the bigger the size of the derivative the slower the
   361 algorithm. 
   361 algorithm. 
   362 
   362 
   363 Fortunately, one can simplify regular expressions after each derivative
   363 Fortunately, one can simplify regular expressions after each derivative
   364 step. Various simplifications of regular expressions are possible, such
   364 step. 
       
   365 Various simplifications of regular expressions are possible, such
   365 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
   366 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
   366 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
   367 \cdot \ONE$, and $r + r$ to just $r$.
   367 affect the answer for whether a regular expression matches a string or
   368 Suppose we apply simplification after each derivative step, and compose
   368 not, but fortunately also do not affect the POSIX strategy of how
   369 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
   369 regular expressions match strings---although the latter is much harder
   370 \textit{simp}(a \backslash c)$. Then we can build values without having
       
   371 a cumbersome regular expression, and  fortunately if we are careful
       
   372 enough in making some extra rectifications, the POSIX value of how
       
   373 regular expressions match strings will not be affected---although is much harder
   370 to establish. Some initial results in this regard have been
   374 to establish. Some initial results in this regard have been
   371 obtained in \cite{AusafDyckhoffUrban2016}. 
   375 obtained in \cite{AusafDyckhoffUrban2016}. 
   372 
   376 
       
   377 
   373 If we want the size of derivatives in Sulzmann and Lu's algorithm to
   378 If we want the size of derivatives in Sulzmann and Lu's algorithm to
   374 stay below this bound, we would need more aggressive simplifications.
   379 stay even lower, we would need more aggressive simplifications.
   375 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
   380 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
   376 deleting duplicates whenever possible. For example, the parentheses in
   381 deleting duplicates whenever possible. For example, the parentheses in
   377 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
   382 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
   378 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
   383 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
   379 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   384 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
   380 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us
   385 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us
   381 to achieve the same size bound as that of the partial derivatives. 
   386 to achieve a very tight size bound, namely,
   382 
   387  the same size bound as that of the \emph{partial derivatives}. 
   383 
   388 And we want to get rid of complex and fragile rectification of values.
   384 Suppose we apply simplification after each derivative step, and view
       
   385 these two operations as an atomic one: $a \backslash_{simp}\,c \dn
       
   386 \textit{simp}(a \backslash c)$. Then we can use the previous natural
       
   387 extension from derivative w.r.t.~character to derivative
       
   388 w.r.t.~string:%\comment{simp in  the [] case?}
       
   389 
       
   390 \begin{center}
       
   391 \begin{tabular}{lcl}
       
   392 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
   393 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
   394 \end{tabular}
       
   395 \end{center}
       
   396 
       
   397 \noindent
       
   398 we obtain an optimised version of the algorithm:
       
   399 
       
   400  \begin{center}
       
   401 \begin{tabular}{lcl}
       
   402   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
   403       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
   404   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   405   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   406   & & $\;\;\textit{else}\;\textit{None}$
       
   407 \end{tabular}
       
   408 \end{center}
       
   409 
       
   410 \noindent
       
   411 This algorithm keeps the regular expression size small, for example,
       
   412 with this simplification our previous $(a + aa)^*$ example's 8000 nodes
       
   413 will be reduced to just 6 and stays constant, no matter how long the
       
   414 input string is.
       
   415 
   389 
   416 
   390 
   417 In order to implement the idea of ``spilling out alternatives'' and to
   391 In order to implement the idea of ``spilling out alternatives'' and to
   418 make them compatible with the $\textit{inj}$-mechanism, we use
   392 make them compatible with the $\textit{inj}$-mechanism, we use
   419 \emph{bitcodes}. They were first introduced by Sulzmann and Lu.
   393 \emph{bitcodes}. They were first introduced by Sulzmann and Lu.
   767 \noindent
   741 \noindent
   768 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
   742 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
   769 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
   743 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
   770 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
   744 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
   771 
   745 
       
   746 Having defined the $\simp$ function,
       
   747 we can use the previous notation of  natural
       
   748 extension from derivative w.r.t.~character to derivative
       
   749 w.r.t.~string:%\comment{simp in  the [] case?}
       
   750 
       
   751 \begin{center}
       
   752 \begin{tabular}{lcl}
       
   753 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
   754 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
   755 \end{tabular}
       
   756 \end{center}
       
   757 
       
   758 \noindent
       
   759 to obtain an optimised version of the algorithm:
       
   760 
       
   761  \begin{center}
       
   762 \begin{tabular}{lcl}
       
   763   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
   764       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
   765   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
   766   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
   767   & & $\;\;\textit{else}\;\textit{None}$
       
   768 \end{tabular}
       
   769 \end{center}
       
   770 
       
   771 \noindent
       
   772 This algorithm keeps the regular expression size small, for example,
       
   773 with this simplification our previous $(a + aa)^*$ example's 8000 nodes
       
   774 will be reduced to just 6 and stays constant, no matter how long the
       
   775 input string is.
   772 
   776 
   773 
   777 
   774 
   778 
   775 \section{Current Work and Progress}
   779 \section{Current Work and Progress}
   776 For reasons beyond this report, it turns out that a complete set of
   780 For reasons beyond this report, it turns out that a complete set of
  1148    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
  1152    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
  1149    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
  1153    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
  1150    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
  1154    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
  1151    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
  1155    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
  1152 
  1156 
  1153   $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
  1157   $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
  1154   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
  1158   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
  1155    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
  1159    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
  1156    &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\ 
  1160    &&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\ 
  1157 
  1161 
  1158    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
  1162    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$