734 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
734 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
735 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
735 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
736 & & $\;\;\textit{else}\;\textit{None}$ |
736 & & $\;\;\textit{else}\;\textit{None}$ |
737 \end{tabular} |
737 \end{tabular} |
738 \end{center} |
738 \end{center} |
|
739 Here $(r^\uparrow)\backslash s$ is similar to what we have previously defined for |
|
740 $r\backslash s$. |
739 |
741 |
740 The main point of the bitsequences and annotated regular expressions |
742 The main point of the bitsequences and annotated regular expressions |
741 is that we can apply rather aggressive (in terms of size) |
743 is that we can apply rather aggressive (in terms of size) |
742 simplification rules in order to keep derivatives small. |
744 simplification rules in order to keep derivatives small. |
743 |
745 |
782 recursively and then see if one of the children turn into 0 or 1, which might trigger further simplification |
784 recursively and then see if one of the children turn into 0 or 1, which might trigger further simplification |
783 at the current level. The most involved part is the ALTS clause, where we use two auxiliary functions |
785 at the current level. The most involved part is the ALTS clause, where we use two auxiliary functions |
784 flatten and distinct to open up nested ALT and reduce as many duplicates as possible. |
786 flatten and distinct to open up nested ALT and reduce as many duplicates as possible. |
785 Function distinct keeps the first occurring copy only and remove all later ones when detected duplicates. |
787 Function distinct keeps the first occurring copy only and remove all later ones when detected duplicates. |
786 Function flatten opens up nested ALT. Its recursive definition is given below: |
788 Function flatten opens up nested ALT. Its recursive definition is given below: |
787 \\ |
789 \begin{center} |
788 \[flatten ALT bs rs :: rss = (map fuse( bs,_) rs )@ flatten rss |
790 \begin{tabular}{@{}lcl@{}} |
789 flatten ZERO :: rss = flatten rss |
791 $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{ map fuse}( \textit{bs, \_} ) \textit{ as}) \; +\!+ \; \textit{flatten} \; as' $ \\ |
790 flatten r::rss = r @ flatten rss if r is of any other shape |
792 $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\ |
791 \] |
793 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as' $ |
792 |
794 \end{tabular} |
|
795 \end{center} |
|
796 |
793 Here flatten behaves like the traditional functional programming flatten function, |
797 Here flatten behaves like the traditional functional programming flatten function, |
794 what it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$. |
798 what it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$. |
795 |
799 |
796 If we apply simplification after each derivative step, we get an optimized version of the algorithm: |
800 Suppose we apply simplification after each derivative step, |
797 \\TODO definition of $blexer_simp$ |
801 and view these two operations as an atomic one: $a \backslash_{simp} c \dn \textit{simp}(a \backslash c)$. |
|
802 Then we can use the previous natural extension from derivative w.r.t character to derivative w.r.t string: |
|
803 |
|
804 \begin{center} |
|
805 \begin{tabular}{lcl} |
|
806 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp} c) \backslash_{simp} s$ \\ |
|
807 $r \backslash [\,] $ & $\dn$ & $r$ |
|
808 \end{tabular} |
|
809 \end{center} |
|
810 |
|
811 we get an optimized version of the algorithm: |
|
812 \begin{center} |
|
813 \begin{tabular}{lcl} |
|
814 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
815 $\textit{let}\;a = (r^\uparrow)\backslash_{simp} s\;\textit{in}$\\ |
|
816 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
817 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
818 & & $\;\;\textit{else}\;\textit{None}$ |
|
819 \end{tabular} |
|
820 \end{center} |
798 |
821 |
799 This algorithm effectively keeps the regular expression size small, for example, |
822 This algorithm effectively keeps the regular expression size small, for example, |
800 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is. |
823 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is. |
801 |
824 |
802 |
825 |
835 The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification. |
858 The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification. |
836 To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu: |
859 To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu: |
837 \\definition of retrieve\\ |
860 \\definition of retrieve\\ |
838 This function assembled the bitcode that corresponds to a parse tree for how the current derivative mathces the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
861 This function assembled the bitcode that corresponds to a parse tree for how the current derivative mathces the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
839 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
862 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
840 inj r c v = decode retrieve (bder c (internalise r)) v |
863 $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\ |
|
864 A little fact that needs to be stated to help comprehension:\\ |
|
865 $r^\uparrow = a$($a$ stands for $annotated$).\\ |
841 Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplificaiton. |
866 Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplificaiton. |
842 Our purpose of using this, however, is try to establish \\ |
867 Our purpose of using this, however, is try to establish \\ |
843 $ retrieve \; r \; v \;=\; retrieve \; simp(r) \; v'.$\\ |
868 $ \textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\ |
844 Although the r and v are different now, we can still extract the bitsequence that gies the same parsing information. |
869 The idea is that using $v'$, |
845 After establishing this, we might be able to finally bridge the gap of proving |
870 a simplified version of $v$ that possibly had gone through the same simplification step as $\textit{simp}(a)$ we are still able to extract the bitsequence that gievs the same parsing information as the unsimplified one. |
846 $retrieve \; r \; \backslash \;s \; v = \;retrieve \; simp(r) \; \backslash \; s \; v'$ |
871 After establishing this, we might be able to finally bridge the gap of proving\\ |
847 and subsequently |
872 $\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; \textit{simp}(r) \backslash s \; v'$\\ |
848 $retrieve \; r \; \backslash \; s \; v\; = \; retrieve \; r \; \backslash{with simplification} \; s \; v'$. |
873 and subsequently\\ |
|
874 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; r \backslash_{simp} s \; v'$.\\ |
849 This proves that our simplified version of regular expression still contains all the bitcodes neeeded. |
875 This proves that our simplified version of regular expression still contains all the bitcodes neeeded. |
850 |
876 |
851 The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplifiction(the naive version as implemented in ADU of course can explode in some cases). |
877 The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplifiction(the naive version as implemented in ADU of course can explode in some cases). |
852 So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress. |
878 So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress. |
853 |
879 |