ninems/ninems.tex
changeset 49 d256aabe88f3
parent 48 bbefcf7351f2
child 50 866eda9ba66a
equal deleted inserted replaced
48:bbefcf7351f2 49:d256aabe88f3
   731 \textit{ALTS} constructor. This allows us to not just delete
   731 \textit{ALTS} constructor. This allows us to not just delete
   732 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
   732 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
   733 unnecessary ``copies'' of regular expressions (very similar to
   733 unnecessary ``copies'' of regular expressions (very similar to
   734 simplifying $r + r$ to just $r$, but in a more general
   734 simplifying $r + r$ to just $r$, but in a more general
   735 setting). 
   735 setting). 
       
   736 Another modification is that we use simplification rules
       
   737 inspired by Antimirov's work on partial derivatives. They maintain the
       
   738 idea that only the first ``copy'' of a regular expression in an
       
   739 alternative contributes to the calculation of a POSIX value. All
       
   740 subsequent copies can be pruned from the regular expression.
       
   741 
   736 A psuedocode version of our algorithm is given below:\\
   742 A psuedocode version of our algorithm is given below:\\
   737 
   743 
   738 simp r \defn r if r = ONE bs or CHAR bs c or STAR bs r
   744 simp r \defn r if r = ONE bs or CHAR bs c or STAR bs r
   739 simp SEQ bs r_1 r_2 \defn \\
   745 simp SEQ bs r_1 r_2 \defn \\
   740 case (simp(r_1), simp(r_2) ) of (0, _) => 0
   746 case (simp(r_1), simp(r_2) ) of (0, _) => 0
   766 \\TODO definition of blexer_simp
   772 \\TODO definition of blexer_simp
   767 
   773 
   768 This algorithm effectively keeps the regular expression size small, for example,
   774 This algorithm effectively keeps the regular expression size small, for example,
   769 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is.
   775 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is.
   770 
   776 
   771 Another modification is that we use simplification rules
   777 
   772 inspired by Antimirov's work on partial derivatives. They maintain the
   778 
   773 idea that only the first ``copy'' of a regular expression in an
   779 We are currently engaged in 2 tasks related to this algorithm. 
   774 alternative contributes to the calculation of a POSIX value. All
   780 The first one is proving that our simplification rules
   775 subsequent copies can be pruned from the regular expression.
       
   776 
       
   777 
       
   778 We are currently engaged with proving that our simplification rules
       
   779 actually do not affect the POSIX value that should be generated by the
   781 actually do not affect the POSIX value that should be generated by the
   780 algorithm according to the specification of a POSIX value and
   782 algorithm according to the specification of a POSIX value
   781 furthermore that our derivatives stay small for all derivatives. For
   783  and furthermore obtain a much
       
   784 tighter bound on the sizes of derivatives. The result is that our
       
   785 algorithm should be correct and faster on all inputs.  The original
       
   786 blow-up, as observed in JavaScript, Python and Java, would be excluded
       
   787 from happening in our algorithm.For
   782 this proof we use the theorem prover Isabelle. Once completed, this
   788 this proof we use the theorem prover Isabelle. Once completed, this
   783 result will advance the state-of-the-art: Sulzmann and Lu wrote in
   789 result will advance the state-of-the-art: Sulzmann and Lu wrote in
   784 their paper \cite{Sulzmann2014} about the bitcoded ``incremental
   790 their paper \cite{Sulzmann2014} about the bitcoded ``incremental
   785 parsing method'' (that is the matching algorithm outlined in this
   791 parsing method'' (that is the matching algorithm outlined in this
   786 section):
   792 section):
   792   extensively by using the method in Figure~3 as a reference but yet
   798   extensively by using the method in Figure~3 as a reference but yet
   793   have to work out all proof details.''
   799   have to work out all proof details.''
   794 \end{quote}  
   800 \end{quote}  
   795 
   801 
   796 \noindent
   802 \noindent
   797 We would settle the correctness claim and furthermore obtain a much
   803 We would settle the correctness claim.
   798 tighter bound on the sizes of derivatives. The result is that our
   804 It is relatively straightforward to establish that after 1 simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
   799 algorithm should be correct and faster on all inputs.  The original
   805 bmkeps r = bmkeps simp r
   800 blow-up, as observed in JavaScript, Python and Java, would be excluded
   806 as this basically comes down to proving actions like removing the additional $r$ in $r+r$  does not delete imporatnt POSIX information in a regular expression.
   801 from happening in our algorithm.
   807 The hardcore of this problem is to prove that
       
   808 bmkeps bders r = bmkeps bders simp r
       
   809 That is, if we do derivative on regular expression r and the simplificed version fo r , they can still provie the same POSIZ calue if there is one . This is not as strightforward as the previous proposition, as the two regular expression r and simp r  might become very different regular epxressions after repeated application ofd simp and derivative.
       
   810 The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification.
       
   811 To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu:
       
   812 \\definition of retrieve\\
       
   813  This function assembled the bitcode that corresponds to a parse tree for how the current derivative mathces the suffix of the string(the characters that have not yet appeared, but is stored in the value).
       
   814  Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\
       
   815  inj r c v = decode retrieve (bder c (internalise r)) v
       
   816  Fahad and Christian also used this fact to prove  the correctness of bit-coded algorithm without simplificaiton.
       
   817  Our purpose of using this, however, is try to establish 
       
   818  retrieve r v = retrieve simp(r) v'.
       
   819  Although the r and v are different now,  we can still extract the bitsequence that gies the same parsing information.
       
   820  After establishing this, we might be able to finally bridge the gap of proving
       
   821  retrieve r \backslash s v = retrieve simp(r) \backslash s v'
       
   822  and subsequently
       
   823  retrieve r \backslash s v = retrieve r \backslash{with simplification} s v'.
       
   824  This proves that our simplified version of regular expression still contains all the bitcodes neeeded.
       
   825 
       
   826 The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplifiction(the naive version as implemented in ADU of course can explode in some cases).
       
   827 So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress.
   802 
   828 
   803 \section{Conclusion}
   829 \section{Conclusion}
   804 
   830 
   805 In this PhD-project we are interested in fast algorithms for regular
   831 In this PhD-project we are interested in fast algorithms for regular
   806 expression matching. While this seems to be a ``settled'' area, in
   832 expression matching. While this seems to be a ``settled'' area, in