738 idea that only the first ``copy'' of a regular expression in an |
738 idea that only the first ``copy'' of a regular expression in an |
739 alternative contributes to the calculation of a POSIX value. All |
739 alternative contributes to the calculation of a POSIX value. All |
740 subsequent copies can be pruned from the regular expression. |
740 subsequent copies can be pruned from the regular expression. |
741 |
741 |
742 A psuedocode version of our algorithm is given below:\\ |
742 A psuedocode version of our algorithm is given below:\\ |
743 |
743 \[simp r \dn r if r = ONE bs or CHAR bs c or STAR bs r |
744 simp r \defn r if r = ONE bs or CHAR bs c or STAR bs r |
744 simp SEQ bs r_1 r_2 \dn \\ |
745 simp SEQ bs r_1 r_2 \defn \\ |
|
746 case (simp(r_1), simp(r_2) ) of (0, _) => 0 |
745 case (simp(r_1), simp(r_2) ) of (0, _) => 0 |
747 (_,0) => 0 |
746 (_,0) => 0 |
748 (1, r) => fuse bs r |
747 (1, r) => fuse bs r |
749 (r,1) => fuse bs r |
748 (r,1) => fuse bs r |
750 (r_1, r_2) => SEQ bs r_1 r_2 |
749 (r_1, r_2) => SEQ bs r_1 r_2 |
751 simp ALT bs rs = distinct(flatten( map simp rs)) match |
750 simp ALT bs rs = distinct(flatten( map simp rs)) match |
752 case Nil => ZERO |
751 case Nil => ZERO |
753 case r::Nil => fuse bs r |
752 case r::Nil => fuse bs r |
754 case rs => ALT bs rs |
753 case rs => ALT bs rs |
|
754 \] |
755 |
755 |
756 The simplification does a pattern matching on the regular expression. When it detected that |
756 The simplification does a pattern matching on the regular expression. When it detected that |
757 the regular expression is an alternative or sequence, it will try to simplify its children regular expressions |
757 the regular expression is an alternative or sequence, it will try to simplify its children regular expressions |
758 recursively and then see if one of the children turn into 0 or 1, which might trigger further simplification |
758 recursively and then see if one of the children turn into 0 or 1, which might trigger further simplification |
759 at the current level. The most involved part is the ALTS clause, where we use two auxiliary functions |
759 at the current level. The most involved part is the ALTS clause, where we use two auxiliary functions |
760 flatten and distinct to open up nested ALT and reduce as many duplicates as possible. |
760 flatten and distinct to open up nested ALT and reduce as many duplicates as possible. |
761 Function distinct keeps the first occurring copy only and remove all later ones when detected duplicates. |
761 Function distinct keeps the first occurring copy only and remove all later ones when detected duplicates. |
762 Function flatten opens up nested ALT. Its recursive definition is given below: |
762 Function flatten opens up nested ALT. Its recursive definition is given below: |
763 \\ |
763 \\ |
764 flatten ALT bs rs :: rss = (map fuse( bs,_) rs )@ flatten rss |
764 \[flatten ALT bs rs :: rss = (map fuse( bs,_) rs )@ flatten rss |
765 flatten ZERO :: rss = flatten rss |
765 flatten ZERO :: rss = flatten rss |
766 flatten r::rss = r @ flatten rss if r is of any other shape |
766 flatten r::rss = r @ flatten rss if r is of any other shape |
|
767 \] |
767 |
768 |
768 Here flatten behaves like the traditional functional programming flatten function, |
769 Here flatten behaves like the traditional functional programming flatten function, |
769 what it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$. |
770 what it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$. |
770 |
771 |
771 If we apply simplification after each derivative step, we get an optimized version of the algorithm: |
772 If we apply simplification after each derivative step, we get an optimized version of the algorithm: |
772 \\TODO definition of blexer_simp |
773 \\TODO definition of $blexer_simp$ |
773 |
774 |
774 This algorithm effectively keeps the regular expression size small, for example, |
775 This algorithm effectively keeps the regular expression size small, for example, |
775 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is. |
776 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is. |
776 |
777 |
777 |
778 |
812 \\definition of retrieve\\ |
813 \\definition of retrieve\\ |
813 This function assembled the bitcode that corresponds to a parse tree for how the current derivative mathces the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
814 This function assembled the bitcode that corresponds to a parse tree for how the current derivative mathces the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
814 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
815 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
815 inj r c v = decode retrieve (bder c (internalise r)) v |
816 inj r c v = decode retrieve (bder c (internalise r)) v |
816 Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplificaiton. |
817 Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplificaiton. |
817 Our purpose of using this, however, is try to establish |
818 Our purpose of using this, however, is try to establish \\ |
818 retrieve r v = retrieve simp(r) v'. |
819 $ retrieve \; r \; v \;=\; retrieve \; simp(r) \; v'.$\\ |
819 Although the r and v are different now, we can still extract the bitsequence that gies the same parsing information. |
820 Although the r and v are different now, we can still extract the bitsequence that gies the same parsing information. |
820 After establishing this, we might be able to finally bridge the gap of proving |
821 After establishing this, we might be able to finally bridge the gap of proving |
821 retrieve r \backslash s v = retrieve simp(r) \backslash s v' |
822 $retrieve \; r \; \backslash \;s \; v = \;retrieve \; simp(r) \; \backslash \; s \; v'$ |
822 and subsequently |
823 and subsequently |
823 retrieve r \backslash s v = retrieve r \backslash{with simplification} s v'. |
824 $retrieve \; r \; \backslash \; s \; v\; = \; retrieve \; r \; \backslash{with simplification} \; s \; v'$. |
824 This proves that our simplified version of regular expression still contains all the bitcodes neeeded. |
825 This proves that our simplified version of regular expression still contains all the bitcodes neeeded. |
825 |
826 |
826 The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplifiction(the naive version as implemented in ADU of course can explode in some cases). |
827 The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplifiction(the naive version as implemented in ADU of course can explode in some cases). |
827 So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress. |
828 So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress. |
828 |
829 |