359 $\textit{der}$ and $\nullable$ need to traverse such trees and |
359 $\textit{der}$ and $\nullable$ need to traverse such trees and |
360 consequently the bigger the size of the derivative the slower the |
360 consequently the bigger the size of the derivative the slower the |
361 algorithm. |
361 algorithm. |
362 |
362 |
363 Fortunately, one can simplify regular expressions after each derivative |
363 Fortunately, one can simplify regular expressions after each derivative |
364 step. Various simplifications of regular expressions are possible, such |
364 step. |
|
365 Various simplifications of regular expressions are possible, such |
365 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
366 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
366 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not |
367 \cdot \ONE$, and $r + r$ to just $r$. |
367 affect the answer for whether a regular expression matches a string or |
368 Suppose we apply simplification after each derivative step, and compose |
368 not, but fortunately also do not affect the POSIX strategy of how |
369 these two operations together as an atomic one: $a \backslash_{simp}\,c \dn |
369 regular expressions match strings---although the latter is much harder |
370 \textit{simp}(a \backslash c)$. Then we can build values without having |
|
371 a cumbersome regular expression, and fortunately if we are careful |
|
372 enough in making some extra rectifications, the POSIX value of how |
|
373 regular expressions match strings will not be affected---although is much harder |
370 to establish. Some initial results in this regard have been |
374 to establish. Some initial results in this regard have been |
371 obtained in \cite{AusafDyckhoffUrban2016}. |
375 obtained in \cite{AusafDyckhoffUrban2016}. |
372 |
376 |
|
377 |
373 If we want the size of derivatives in Sulzmann and Lu's algorithm to |
378 If we want the size of derivatives in Sulzmann and Lu's algorithm to |
374 stay below this bound, we would need more aggressive simplifications. |
379 stay even lower, we would need more aggressive simplifications. |
375 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
380 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
376 deleting duplicates whenever possible. For example, the parentheses in |
381 deleting duplicates whenever possible. For example, the parentheses in |
377 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
382 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
378 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
383 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
379 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
384 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
380 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
385 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
381 to achieve the same size bound as that of the partial derivatives. |
386 to achieve a very tight size bound, namely, |
382 |
387 the same size bound as that of the \emph{partial derivatives}. |
383 |
388 And we want to get rid of complex and fragile rectification of values. |
384 Suppose we apply simplification after each derivative step, and view |
|
385 these two operations as an atomic one: $a \backslash_{simp}\,c \dn |
|
386 \textit{simp}(a \backslash c)$. Then we can use the previous natural |
|
387 extension from derivative w.r.t.~character to derivative |
|
388 w.r.t.~string:%\comment{simp in the [] case?} |
|
389 |
|
390 \begin{center} |
|
391 \begin{tabular}{lcl} |
|
392 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
393 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
394 \end{tabular} |
|
395 \end{center} |
|
396 |
|
397 \noindent |
|
398 we obtain an optimised version of the algorithm: |
|
399 |
|
400 \begin{center} |
|
401 \begin{tabular}{lcl} |
|
402 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
403 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
404 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
405 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
406 & & $\;\;\textit{else}\;\textit{None}$ |
|
407 \end{tabular} |
|
408 \end{center} |
|
409 |
|
410 \noindent |
|
411 This algorithm keeps the regular expression size small, for example, |
|
412 with this simplification our previous $(a + aa)^*$ example's 8000 nodes |
|
413 will be reduced to just 6 and stays constant, no matter how long the |
|
414 input string is. |
|
415 |
389 |
416 |
390 |
417 In order to implement the idea of ``spilling out alternatives'' and to |
391 In order to implement the idea of ``spilling out alternatives'' and to |
418 make them compatible with the $\textit{inj}$-mechanism, we use |
392 make them compatible with the $\textit{inj}$-mechanism, we use |
419 \emph{bitcodes}. They were first introduced by Sulzmann and Lu. |
393 \emph{bitcodes}. They were first introduced by Sulzmann and Lu. |
767 \noindent |
741 \noindent |
768 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
742 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
769 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
743 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
770 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
744 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
771 |
745 |
|
746 Having defined the $\simp$ function, |
|
747 we can use the previous notation of natural |
|
748 extension from derivative w.r.t.~character to derivative |
|
749 w.r.t.~string:%\comment{simp in the [] case?} |
|
750 |
|
751 \begin{center} |
|
752 \begin{tabular}{lcl} |
|
753 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
754 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
755 \end{tabular} |
|
756 \end{center} |
|
757 |
|
758 \noindent |
|
759 to obtain an optimised version of the algorithm: |
|
760 |
|
761 \begin{center} |
|
762 \begin{tabular}{lcl} |
|
763 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
764 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
765 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
766 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
767 & & $\;\;\textit{else}\;\textit{None}$ |
|
768 \end{tabular} |
|
769 \end{center} |
|
770 |
|
771 \noindent |
|
772 This algorithm keeps the regular expression size small, for example, |
|
773 with this simplification our previous $(a + aa)^*$ example's 8000 nodes |
|
774 will be reduced to just 6 and stays constant, no matter how long the |
|
775 input string is. |
772 |
776 |
773 |
777 |
774 |
778 |
775 \section{Current Work and Progress} |
779 \section{Current Work and Progress} |
776 For reasons beyond this report, it turns out that a complete set of |
780 For reasons beyond this report, it turns out that a complete set of |
1148 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
1152 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
1149 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
1153 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
1150 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
1154 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
1151 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ |
1155 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ |
1152 |
1156 |
1153 $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\ |
1157 $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\ |
1154 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
1158 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
1155 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
1159 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
1156 &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\ |
1160 &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\ |
1157 |
1161 |
1158 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
1162 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |