731 \textit{ALTS} constructor. This allows us to not just delete |
731 \textit{ALTS} constructor. This allows us to not just delete |
732 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also |
732 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also |
733 unnecessary ``copies'' of regular expressions (very similar to |
733 unnecessary ``copies'' of regular expressions (very similar to |
734 simplifying $r + r$ to just $r$, but in a more general |
734 simplifying $r + r$ to just $r$, but in a more general |
735 setting). |
735 setting). |
|
736 Another modification is that we use simplification rules |
|
737 inspired by Antimirov's work on partial derivatives. They maintain the |
|
738 idea that only the first ``copy'' of a regular expression in an |
|
739 alternative contributes to the calculation of a POSIX value. All |
|
740 subsequent copies can be pruned from the regular expression. |
|
741 |
736 A psuedocode version of our algorithm is given below:\\ |
742 A psuedocode version of our algorithm is given below:\\ |
737 |
743 |
738 simp r \defn r if r = ONE bs or CHAR bs c or STAR bs r |
744 simp r \defn r if r = ONE bs or CHAR bs c or STAR bs r |
739 simp SEQ bs r_1 r_2 \defn \\ |
745 simp SEQ bs r_1 r_2 \defn \\ |
740 case (simp(r_1), simp(r_2) ) of (0, _) => 0 |
746 case (simp(r_1), simp(r_2) ) of (0, _) => 0 |
766 \\TODO definition of blexer_simp |
772 \\TODO definition of blexer_simp |
767 |
773 |
768 This algorithm effectively keeps the regular expression size small, for example, |
774 This algorithm effectively keeps the regular expression size small, for example, |
769 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is. |
775 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is. |
770 |
776 |
771 Another modification is that we use simplification rules |
777 |
772 inspired by Antimirov's work on partial derivatives. They maintain the |
778 |
773 idea that only the first ``copy'' of a regular expression in an |
779 We are currently engaged in 2 tasks related to this algorithm. |
774 alternative contributes to the calculation of a POSIX value. All |
780 The first one is proving that our simplification rules |
775 subsequent copies can be pruned from the regular expression. |
|
776 |
|
777 |
|
778 We are currently engaged with proving that our simplification rules |
|
779 actually do not affect the POSIX value that should be generated by the |
781 actually do not affect the POSIX value that should be generated by the |
780 algorithm according to the specification of a POSIX value and |
782 algorithm according to the specification of a POSIX value |
781 furthermore that our derivatives stay small for all derivatives. For |
783 and furthermore obtain a much |
|
784 tighter bound on the sizes of derivatives. The result is that our |
|
785 algorithm should be correct and faster on all inputs. The original |
|
786 blow-up, as observed in JavaScript, Python and Java, would be excluded |
|
787 from happening in our algorithm.For |
782 this proof we use the theorem prover Isabelle. Once completed, this |
788 this proof we use the theorem prover Isabelle. Once completed, this |
783 result will advance the state-of-the-art: Sulzmann and Lu wrote in |
789 result will advance the state-of-the-art: Sulzmann and Lu wrote in |
784 their paper \cite{Sulzmann2014} about the bitcoded ``incremental |
790 their paper \cite{Sulzmann2014} about the bitcoded ``incremental |
785 parsing method'' (that is the matching algorithm outlined in this |
791 parsing method'' (that is the matching algorithm outlined in this |
786 section): |
792 section): |
792 extensively by using the method in Figure~3 as a reference but yet |
798 extensively by using the method in Figure~3 as a reference but yet |
793 have to work out all proof details.'' |
799 have to work out all proof details.'' |
794 \end{quote} |
800 \end{quote} |
795 |
801 |
796 \noindent |
802 \noindent |
797 We would settle the correctness claim and furthermore obtain a much |
803 We would settle the correctness claim. |
798 tighter bound on the sizes of derivatives. The result is that our |
804 It is relatively straightforward to establish that after 1 simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words, |
799 algorithm should be correct and faster on all inputs. The original |
805 bmkeps r = bmkeps simp r |
800 blow-up, as observed in JavaScript, Python and Java, would be excluded |
806 as this basically comes down to proving actions like removing the additional $r$ in $r+r$ does not delete imporatnt POSIX information in a regular expression. |
801 from happening in our algorithm. |
807 The hardcore of this problem is to prove that |
|
808 bmkeps bders r = bmkeps bders simp r |
|
809 That is, if we do derivative on regular expression r and the simplificed version fo r , they can still provie the same POSIZ calue if there is one . This is not as strightforward as the previous proposition, as the two regular expression r and simp r might become very different regular epxressions after repeated application ofd simp and derivative. |
|
810 The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification. |
|
811 To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu: |
|
812 \\definition of retrieve\\ |
|
813 This function assembled the bitcode that corresponds to a parse tree for how the current derivative mathces the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
|
814 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
|
815 inj r c v = decode retrieve (bder c (internalise r)) v |
|
816 Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplificaiton. |
|
817 Our purpose of using this, however, is try to establish |
|
818 retrieve r v = retrieve simp(r) v'. |
|
819 Although the r and v are different now, we can still extract the bitsequence that gies the same parsing information. |
|
820 After establishing this, we might be able to finally bridge the gap of proving |
|
821 retrieve r \backslash s v = retrieve simp(r) \backslash s v' |
|
822 and subsequently |
|
823 retrieve r \backslash s v = retrieve r \backslash{with simplification} s v'. |
|
824 This proves that our simplified version of regular expression still contains all the bitcodes neeeded. |
|
825 |
|
826 The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplifiction(the naive version as implemented in ADU of course can explode in some cases). |
|
827 So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress. |
802 |
828 |
803 \section{Conclusion} |
829 \section{Conclusion} |
804 |
830 |
805 In this PhD-project we are interested in fast algorithms for regular |
831 In this PhD-project we are interested in fast algorithms for regular |
806 expression matching. While this seems to be a ``settled'' area, in |
832 expression matching. While this seems to be a ``settled'' area, in |