181 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
181 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
182 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
182 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
183 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
183 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
184 \end{tabular} |
184 \end{tabular} |
185 \end{center} |
185 \end{center} |
186 |
186 \noindent |
187 %Assuming the classic notion of a |
187 And defines how a regular expression evolves into |
188 %\emph{language} of a regular expression, written $L(\_)$, t |
188 a new regular expression after all the string it contains |
189 |
189 is chopped off a certain head character $c$. |
190 \noindent |
190 |
191 The main property of the derivative operation is that |
191 The main property of the derivative operation is that |
192 |
192 |
193 \begin{center} |
193 \begin{center} |
194 $c\!::\!s \in L(r)$ holds |
194 $c\!::\!s \in L(r)$ holds |
195 if and only if $s \in L(r\backslash c)$. |
195 if and only if $s \in L(r\backslash c)$. |
212 \[ |
212 \[ |
213 match\;s\;r \;\dn\; nullable(r\backslash s) |
213 match\;s\;r \;\dn\; nullable(r\backslash s) |
214 \] |
214 \] |
215 |
215 |
216 \noindent |
216 \noindent |
217 Assuming the a string is givane as a sequence of characters, say $c_0c_1..c_n$, |
217 Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, |
218 this algorithm presented graphically is as follows: |
218 this algorithm presented graphically is as follows: |
219 |
219 |
220 \begin{equation}\label{graph:*} |
220 \begin{equation}\label{graph:*} |
221 \begin{tikzcd} |
221 \begin{tikzcd} |
222 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} |
222 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} |
378 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
378 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
379 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
379 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
380 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
380 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
381 to achieve the same size bound as that of the partial derivatives. |
381 to achieve the same size bound as that of the partial derivatives. |
382 |
382 |
|
383 |
|
384 Suppose we apply simplification after each derivative step, and view |
|
385 these two operations as an atomic one: $a \backslash_{simp}\,c \dn |
|
386 \textit{simp}(a \backslash c)$. Then we can use the previous natural |
|
387 extension from derivative w.r.t.~character to derivative |
|
388 w.r.t.~string:%\comment{simp in the [] case?} |
|
389 |
|
390 \begin{center} |
|
391 \begin{tabular}{lcl} |
|
392 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
393 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
394 \end{tabular} |
|
395 \end{center} |
|
396 |
|
397 \noindent |
|
398 we obtain an optimised version of the algorithm: |
|
399 |
|
400 \begin{center} |
|
401 \begin{tabular}{lcl} |
|
402 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
403 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
404 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
405 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
406 & & $\;\;\textit{else}\;\textit{None}$ |
|
407 \end{tabular} |
|
408 \end{center} |
|
409 |
|
410 \noindent |
|
411 This algorithm keeps the regular expression size small, for example, |
|
412 with this simplification our previous $(a + aa)^*$ example's 8000 nodes |
|
413 will be reduced to just 6 and stays constant, no matter how long the |
|
414 input string is. |
|
415 |
|
416 |
383 In order to implement the idea of ``spilling out alternatives'' and to |
417 In order to implement the idea of ``spilling out alternatives'' and to |
384 make them compatible with the $\textit{inj}$-mechanism, we use |
418 make them compatible with the $\textit{inj}$-mechanism, we use |
385 \emph{bitcodes}. They were first introduced by Sulzmann and Lu. |
419 \emph{bitcodes}. They were first introduced by Sulzmann and Lu. |
386 Here bits and bitcodes (lists of bits) are defined as: |
420 Here bits and bitcodes (lists of bits) are defined as: |
387 |
421 |
388 \begin{center} |
422 \begin{center} |
389 $b ::= 1 \mid 0 \qquad |
423 $b ::= 1 \mid 0 \qquad |
390 bs ::= [] \mid b:bs |
424 bs ::= [] \mid b::bs |
391 $ |
425 $ |
392 \end{center} |
426 \end{center} |
393 |
427 |
394 \noindent |
428 \noindent |
395 The $1$ and $0$ are not in bold in order to avoid |
429 The $1$ and $0$ are not in bold in order to avoid |
396 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or |
430 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or |
397 bit-lists) can be used to encode values (or incomplete values) in a |
431 bit-lists) can be used to encode values (or potentially incomplete values) in a |
398 compact form. This can be straightforwardly seen in the following |
432 compact form. This can be straightforwardly seen in the following |
399 coding function from values to bitcodes: |
433 coding function from values to bitcodes: |
400 |
434 |
401 \begin{center} |
435 \begin{center} |
402 \begin{tabular}{lcl} |
436 \begin{tabular}{lcl} |
733 \noindent |
767 \noindent |
734 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
768 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
735 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
769 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
736 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
770 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
737 |
771 |
738 Suppose we apply simplification after each derivative step, and view |
|
739 these two operations as an atomic one: $a \backslash_{simp}\,c \dn |
|
740 \textit{simp}(a \backslash c)$. Then we can use the previous natural |
|
741 extension from derivative w.r.t.~character to derivative |
|
742 w.r.t.~string:%\comment{simp in the [] case?} |
|
743 |
|
744 \begin{center} |
|
745 \begin{tabular}{lcl} |
|
746 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
747 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
748 \end{tabular} |
|
749 \end{center} |
|
750 |
|
751 \noindent |
|
752 we obtain an optimised version of the algorithm: |
|
753 |
|
754 \begin{center} |
|
755 \begin{tabular}{lcl} |
|
756 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
757 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
758 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
759 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
760 & & $\;\;\textit{else}\;\textit{None}$ |
|
761 \end{tabular} |
|
762 \end{center} |
|
763 |
|
764 \noindent |
|
765 This algorithm keeps the regular expression size small, for example, |
|
766 with this simplification our previous $(a + aa)^*$ example's 8000 nodes |
|
767 will be reduced to just 6 and stays constant, no matter how long the |
|
768 input string is. |
|
769 |
772 |
770 |
773 |
771 |
774 |
772 \section{Current Work and Progress} |
775 \section{Current Work and Progress} |
773 For reasons beyond this report, it turns out that a complete set of |
776 For reasons beyond this report, it turns out that a complete set of |