cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:bbefcf7351f2
+:d256aabe88f3
 \textit{ALTS} constructor. This allows us to not just delete
 unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
 unnecessary ``copies'' of regular expressions (very similar to
 simplifying $r + r$ to just $r$, but in a more general
 setting).
+Another modification is that we use simplification rules
+inspired by Antimirov's work on partial derivatives. They maintain the
+idea that only the first ``copy'' of a regular expression in an
+alternative contributes to the calculation of a POSIX value. All
+subsequent copies can be pruned from the regular expression.
 A psuedocode version of our algorithm is given below:\\
 simp r \defn r if r = ONE bs or CHAR bs c or STAR bs r
 simp SEQ bs r_1 r_2 \defn \\
 case (simp(r_1), simp(r_2) ) of (0, _) => 0
 \\TODO definition of blexer_simp
 This algorithm effectively keeps the regular expression size small, for example,
 with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to only 6 and stay constant, however long the input string is.
-Another modification is that we use simplification rules
-inspired by Antimirov's work on partial derivatives. They maintain the
-idea that only the first ``copy'' of a regular expression in an
+We are currently engaged in 2 tasks related to this algorithm.
-alternative contributes to the calculation of a POSIX value. All
+The first one is proving that our simplification rules
-subsequent copies can be pruned from the regular expression.
-We are currently engaged with proving that our simplification rules
 actually do not affect the POSIX value that should be generated by the
-algorithm according to the specification of a POSIX value and
+algorithm according to the specification of a POSIX value
-furthermore that our derivatives stay small for all derivatives. For
+and furthermore obtain a much
+tighter bound on the sizes of derivatives. The result is that our
+algorithm should be correct and faster on all inputs.  The original
+blow-up, as observed in JavaScript, Python and Java, would be excluded
+from happening in our algorithm.For
 this proof we use the theorem prover Isabelle. Once completed, this
 result will advance the state-of-the-art: Sulzmann and Lu wrote in
 their paper \cite{Sulzmann2014} about the bitcoded ``incremental
 parsing method'' (that is the matching algorithm outlined in this
 section):
 extensively by using the method in Figure~3 as a reference but yet
 have to work out all proof details.''
 \end{quote}
 \noindent
-We would settle the correctness claim and furthermore obtain a much
+We would settle the correctness claim.
-tighter bound on the sizes of derivatives. The result is that our
+It is relatively straightforward to establish that after 1 simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words,
-algorithm should be correct and faster on all inputs.  The original
+bmkeps r = bmkeps simp r
-blow-up, as observed in JavaScript, Python and Java, would be excluded
+as this basically comes down to proving actions like removing the additional $r$ in $r+r$  does not delete imporatnt POSIX information in a regular expression.
-from happening in our algorithm.
+The hardcore of this problem is to prove that
+bmkeps bders r = bmkeps bders simp r
+That is, if we do derivative on regular expression r and the simplificed version fo r , they can still provie the same POSIZ calue if there is one . This is not as strightforward as the previous proposition, as the two regular expression r and simp r  might become very different regular epxressions after repeated application ofd simp and derivative.
+The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification.
+To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu:
+\\definition of retrieve\\
+This function assembled the bitcode that corresponds to a parse tree for how the current derivative mathces the suffix of the string(the characters that have not yet appeared, but is stored in the value).
+Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\
+inj r c v = decode retrieve (bder c (internalise r)) v
+Fahad and Christian also used this fact to prove  the correctness of bit-coded algorithm without simplificaiton.
+Our purpose of using this, however, is try to establish
+retrieve r v = retrieve simp(r) v'.
+Although the r and v are different now,  we can still extract the bitsequence that gies the same parsing information.
+After establishing this, we might be able to finally bridge the gap of proving
+retrieve r \backslash s v = retrieve simp(r) \backslash s v'
+and subsequently
+retrieve r \backslash s v = retrieve r \backslash{with simplification} s v'.
+This proves that our simplified version of regular expression still contains all the bitcodes neeeded.
+The second task is to speed up the more aggressive simplification. Currently it is slower than a naive simplifiction(the naive version as implemented in ADU of course can explode in some cases).
+So it needs to be explored how to make it faster. Our possibility would be to explore again the connection to DFAs. This is very much work in progress.
 \section{Conclusion}
 In this PhD-project we are interested in fast algorithms for regular
 expression matching. While this seems to be a ``settled'' area, in

changeset 49	d256aabe88f3
parent 48	bbefcf7351f2
child 50	866eda9ba66a