176 \end{center} |
176 \end{center} |
177 |
177 |
178 \noindent These are clearly abysmal and possibly surprising results. One |
178 \noindent These are clearly abysmal and possibly surprising results. One |
179 would expect these systems to do much better than that---after all, |
179 would expect these systems to do much better than that---after all, |
180 given a DFA and a string, deciding whether a string is matched by this |
180 given a DFA and a string, deciding whether a string is matched by this |
181 DFA should be linear? |
181 DFA should be linear in terms of the size of the regular expression and |
|
182 the string? |
182 |
183 |
183 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to |
184 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to |
184 exhibit this exponential behaviour. But unfortunately, such regular |
185 exhibit this super-linear behaviour. But unfortunately, such regular |
185 expressions are not just a few outliers. They are actually |
186 expressions are not just a few outliers. They are actually |
186 frequent enough to have a separate name created for |
187 frequent enough to have a separate name created for |
187 them---\emph{evil regular expressions}. In empiric work, Davis et al |
188 them---\emph{evil regular expressions}. In empiric work, Davis et al |
188 report that they have found thousands of such evil regular expressions |
189 report that they have found thousands of such evil regular expressions |
189 in the JavaScript and Python ecosystems \cite{Davis18}. |
190 in the JavaScript and Python ecosystems \cite{Davis18}. |
190 |
191 |
191 \comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in the Davis et al paper} |
192 \comment{Needs to be consistent: either exponential blowup; or quadratic blowup. Maybe you use the terminology superlinear, like in the Davis et al paper} |
192 This exponential blowup in matching algorithms sometimes causes |
193 This superlinear blowup in matching algorithms sometimes causes |
193 considerable grief in real life: for example on 20 July 2016 one evil |
194 considerable grief in real life: for example on 20 July 2016 one evil |
194 regular expression brought the webpage |
195 regular expression brought the webpage |
195 \href{http://stackexchange.com}{Stack Exchange} to its |
196 \href{http://stackexchange.com}{Stack Exchange} to its |
196 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}} |
197 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}} |
197 In this instance, a regular expression intended to just trim white |
198 In this instance, a regular expression intended to just trim white |
200 grind to a halt. This happened when a post with 20,000 white spaces was |
201 grind to a halt. This happened when a post with 20,000 white spaces was |
201 submitted, but importantly the white spaces were neither at the |
202 submitted, but importantly the white spaces were neither at the |
202 beginning nor at the end. |
203 beginning nor at the end. |
203 As a result, the regular expression matching |
204 As a result, the regular expression matching |
204 engine needed to backtrack over many choices. |
205 engine needed to backtrack over many choices. |
205 In this example, the time needed to process the string is not |
206 In this example, the time needed to process the string is |
206 exactly the classical exponential case, but rather $O(n^2)$ |
207 $O(n^2)$ |
207 with respect to the string length. But this is enough for the |
208 with respect to the string length. This quadratic |
208 home page of Stack Exchange to respond not fast enough to |
209 overhead is enough for the |
209 the load balancer, which thought that there must be some |
210 home page of Stack Exchange to respond so slowly to |
|
211 the load balancer that it thought there must be some |
210 attack and therefore stopped the servers from responding to |
212 attack and therefore stopped the servers from responding to |
211 requests. This made the whole site become unavailable. |
213 requests. This made the whole site become unavailable. |
212 Another very recent example is a global outage of all Cloudflare servers |
214 Another very recent example is a global outage of all Cloudflare servers |
213 on 2 July 2019. A poorly written regular expression exhibited |
215 on 2 July 2019. A poorly written regular expression exhibited |
214 exponential behaviour and exhausted CPUs that serve HTTP traffic. |
216 exponential behaviour and exhausted CPUs that serve HTTP traffic. |
430 sub-parts of a string. To see this, suppose a \emph{flatten} operation, |
432 sub-parts of a string. To see this, suppose a \emph{flatten} operation, |
431 written $|v|$ for values. We can use this function to extract the |
433 written $|v|$ for values. We can use this function to extract the |
432 underlying string of a value $v$. For example, $|\mathit{Seq} \, |
434 underlying string of a value $v$. For example, $|\mathit{Seq} \, |
433 (\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$. Using |
435 (\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$. Using |
434 flatten, we can describe how values encode \comment{Avoid the notion |
436 flatten, we can describe how values encode \comment{Avoid the notion |
435 parse trees! Also later!!}parse trees: $\Seq\,v_1\, v_2$ encodes a tree with two |
437 parse trees! Also later!!}lexical values: $\Seq\,v_1\, v_2$ encodes a tree with two |
436 children nodes that tells how the string $|v_1| @ |v_2|$ matches the |
438 children nodes that tells how the string $|v_1| @ |v_2|$ matches the |
437 regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$ and, |
439 regex $r_1 \cdot r_2$ whereby $r_1$ matches the substring $|v_1|$ and, |
438 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two |
440 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two |
439 are matched is contained in the children nodes $v_1$ and $v_2$ of parent |
441 are matched is contained in the children nodes $v_1$ and $v_2$ of parent |
440 $\textit{Seq}$. |
442 $\textit{Seq}$. |
491 $\textit{nullable}$ or not. If not, we know the string does not match |
493 $\textit{nullable}$ or not. If not, we know the string does not match |
492 $r$ and no value needs to be generated. If yes, we start building the |
494 $r$ and no value needs to be generated. If yes, we start building the |
493 values incrementally by \emph{injecting} back the characters into the |
495 values incrementally by \emph{injecting} back the characters into the |
494 earlier values $v_n, \ldots, v_0$. This is the second phase of the |
496 earlier values $v_n, \ldots, v_0$. This is the second phase of the |
495 algorithm from the right to left. For the first value $v_n$, we call the |
497 algorithm from the right to left. For the first value $v_n$, we call the |
496 function $\textit{mkeps}$, which builds the \comment{Avoid}parse tree |
498 function $\textit{mkeps}$, which builds the \comment{Avoid}lexical value |
497 for how the empty string has been matched by the (nullable) regular |
499 for how the empty string has been matched by the (nullable) regular |
498 expression $r_n$. This function is defined as |
500 expression $r_n$. This function is defined as |
499 |
501 |
500 \begin{center} |
502 \begin{center} |
501 \begin{tabular}{lcl} |
503 \begin{tabular}{lcl} |
515 also that in case of alternatives we give preference to the |
517 also that in case of alternatives we give preference to the |
516 regular expression on the left-hand side. This will become |
518 regular expression on the left-hand side. This will become |
517 important later on about what value is calculated. |
519 important later on about what value is calculated. |
518 |
520 |
519 After the $\mkeps$-call, we inject back the characters one by one in order to build |
521 After the $\mkeps$-call, we inject back the characters one by one in order to build |
520 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$ |
522 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
521 ($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$. |
523 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
522 After injecting back $n$ characters, we get the parse tree for how $r_0$ |
524 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
523 matches $s$. For this Sulzmann and Lu defined a function that reverses |
525 matches $s$. For this Sulzmann and Lu defined a function that reverses |
524 the ``chopping off'' of characters during the derivative phase. The |
526 the ``chopping off'' of characters during the derivative phase. The |
525 corresponding function is called \emph{injection}, written |
527 corresponding function is called \emph{injection}, written |
526 $\textit{inj}$; it takes three arguments: the first one is a regular |
528 $\textit{inj}$; it takes three arguments: the first one is a regular |
527 expression ${r_{i-1}}$, before the character is chopped off, the second |
529 expression ${r_{i-1}}$, before the character is chopped off, the second |
608 \end{tabular} |
610 \end{tabular} |
609 \end{center} |
611 \end{center} |
610 |
612 |
611 \noindent |
613 \noindent |
612 In case $r_3$ is nullable, we can call $\textit{mkeps}$ |
614 In case $r_3$ is nullable, we can call $\textit{mkeps}$ |
613 to construct a parse tree for how $r_3$ matched the string $abc$. |
615 to construct a lexical value for how $r_3$ matched the string $abc$. |
614 This function gives the following value $v_3$: |
616 This function gives the following value $v_3$: |
615 |
617 |
616 |
618 |
617 \begin{center} |
619 \begin{center} |
618 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$ |
620 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$ |
619 \end{center} |
621 \end{center} |
620 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined): |
622 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined): |
621 |
623 |
622 \begin{center}\comment{better layout} |
624 \begin{center}\comment{better layout} |
623 $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) |
625 \begin{tabular}{lcl} |
624 \cdot r^*) +((\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^*+(\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^* ).$ |
626 $( \underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*}$ & $+$ & $(\ZERO+\ZERO+\ZERO + \ONE + \ZERO) |
|
627 \cdot r^*) +$\\ |
|
628 $((\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^*$ & $+$ & $(\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^* ).$ |
|
629 \end{tabular} |
625 \end{center} |
630 \end{center} |
626 |
631 |
627 \noindent |
632 \noindent |
628 Note that the leftmost location of term $((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot |
633 Note that the leftmost location of term $((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot |
629 \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows |
634 \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows |
653 expressions. The first one is an alternative, we take the rightmost |
658 expressions. The first one is an alternative, we take the rightmost |
654 alternative---whose language contains the empty string. The second |
659 alternative---whose language contains the empty string. The second |
655 nullable regular expression is a Kleene star. $\Stars$ tells us how it |
660 nullable regular expression is a Kleene star. $\Stars$ tells us how it |
656 generates the nullable regular expression: by 0 iterations to form |
661 generates the nullable regular expression: by 0 iterations to form |
657 $\ONE$. Now $\textit{inj}$ injects characters back and incrementally |
662 $\ONE$. Now $\textit{inj}$ injects characters back and incrementally |
658 builds a parse tree based on $v_3$. Using the value $v_3$, the character |
663 builds a lexical value based on $v_3$. Using the value $v_3$, the character |
659 c, and the regular expression $r_2$, we can recover how $r_2$ matched |
664 c, and the regular expression $r_2$, we can recover how $r_2$ matched |
660 the string $[c]$ : $\textit{inj} \; r_2 \; c \; v_3$ gives us |
665 the string $[c]$ : $\textit{inj} \; r_2 \; c \; v_3$ gives us |
661 \begin{center} |
666 \begin{center} |
662 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$ |
667 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$ |
663 \end{center} |
668 \end{center} |
669 \begin{center} |
674 \begin{center} |
670 $r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r*$ |
675 $r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r*$ |
671 \end{center} |
676 \end{center} |
672 matched the string $bc$ before it split into 2 pieces. |
677 matched the string $bc$ before it split into 2 pieces. |
673 Finally, after injecting character $a$ back to $v_1$, |
678 Finally, after injecting character $a$ back to $v_1$, |
674 we get the parse tree |
679 we get the lexical value tree |
675 \begin{center} |
680 \begin{center} |
676 $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$ |
681 $v_0= \Stars [\Right(\Seq(a, \Seq(b, c)))]$ |
677 \end{center} |
682 \end{center} |
678 for how $r$ matched $abc$. This completes the algorithm. |
683 for how $r$ matched $abc$. This completes the algorithm. |
679 |
684 |
680 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |
685 %We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. |
681 Readers might have noticed that the parse tree information is actually |
686 Readers might have noticed that the lixical value information is actually |
682 already available when doing derivatives. For example, immediately after |
687 already available when doing derivatives. For example, immediately after |
683 the operation $\backslash a$ we know that if we want to match a string |
688 the operation $\backslash a$ we know that if we want to match a string |
684 that starts with $a$, we can either take the initial match to be |
689 that starts with $a$, we can either take the initial match to be |
685 |
690 |
686 \begin{center} |
691 \begin{center} |
697 and $abc$ is on the right. Which of these alternatives is chosen |
702 and $abc$ is on the right. Which of these alternatives is chosen |
698 later does not affect their relative position because the algorithm does |
703 later does not affect their relative position because the algorithm does |
699 not change this order. If this parsing information can be determined and |
704 not change this order. If this parsing information can be determined and |
700 does not change because of later derivatives, there is no point in |
705 does not change because of later derivatives, there is no point in |
701 traversing this information twice. This leads to an optimisation---if we |
706 traversing this information twice. This leads to an optimisation---if we |
702 store the information for parse trees inside the regular expression, |
707 store the information for lexical values inside the regular expression, |
703 update it when we do derivative on them, and collect the information |
708 update it when we do derivative on them, and collect the information |
704 when finished with derivatives and call $\textit{mkeps}$ for deciding which |
709 when finished with derivatives and call $\textit{mkeps}$ for deciding which |
705 branch is POSIX, we can generate the parse tree in one pass, instead of |
710 branch is POSIX, we can generate the lexical value in one pass, instead of |
706 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel |
711 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel |
707 idea of using bitcodes in derivatives. |
712 idea of using bitcodes in derivatives. |
708 |
713 |
709 In the next section, we shall focus on the bitcoded algorithm and the |
714 In the next section, we shall focus on the bitcoded algorithm and the |
710 process of simplification of regular expressions. This is needed in |
715 process of simplification of regular expressions. This is needed in |
713 state-of-the-art. |
718 state-of-the-art. |
714 |
719 |
715 |
720 |
716 \section{Simplification of Regular Expressions} |
721 \section{Simplification of Regular Expressions} |
717 |
722 |
718 Using bitcodes to guide parsing is not a novel idea. It was applied to |
723 Using bitcodes to guide parsing is not a novel idea. It was applied to context |
719 context free grammars and then adapted by Henglein and Nielson for |
724 free grammars and then adapted by Henglein and Nielson for efficient regular |
720 efficient regular expression \comment{?}parsing using DFAs~\cite{nielson11bcre}. |
725 expression \comment{?}\comment{i have changed parsing into lexing, |
721 Sulzmann and Lu took this idea of bitcodes a step further by integrating |
726 however, parsing is the terminology Henglein and Nielson used} lexing using |
722 bitcodes into derivatives. The reason why we want to use bitcodes in |
727 DFAs~\cite{nielson11bcre}. Sulzmann and Lu took this idea of bitcodes a step |
723 this project is that we want to introduce more aggressive |
728 further by integrating bitcodes into derivatives. The reason why we want to use |
724 simplification rules in order to keep the size of derivatives small |
729 bitcodes in this project is that we want to introduce more aggressive |
725 throughout. This is because the main drawback of building successive |
730 simplification rules in order to keep the size of derivatives small throughout. |
726 derivatives according to Brzozowski's definition is that they can grow |
731 This is because the main drawback of building successive derivatives according |
727 very quickly in size. This is mainly due to the fact that the derivative |
732 to Brzozowski's definition is that they can grow very quickly in size. This is |
728 operation generates often ``useless'' $\ZERO$s and $\ONE$s in |
733 mainly due to the fact that the derivative operation generates often |
729 derivatives. As a result, if implemented naively both algorithms by |
734 ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if implemented |
730 Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example |
735 naively both algorithms by Brzozowski and by Sulzmann and Lu are excruciatingly |
731 when starting with the regular expression $(a + aa)^*$ and building 12 |
736 slow. For example when starting with the regular expression $(a + aa)^*$ and |
732 successive derivatives w.r.t.~the character $a$, one obtains a |
737 building 12 successive derivatives w.r.t.~the character $a$, one obtains a |
733 derivative regular expression with more than 8000 nodes (when viewed as |
738 derivative regular expression with more than 8000 nodes (when viewed as a |
734 a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse |
739 tree). Operations like $\textit{der}$ and $\nullable$ need to traverse such |
735 such trees and consequently the bigger the size of the derivative the |
740 trees and consequently the bigger the size of the derivative the slower the |
736 slower the algorithm. |
741 algorithm. |
737 |
742 |
738 Fortunately, one can simplify regular expressions after each derivative |
743 Fortunately, one can simplify regular expressions after each derivative |
739 step. Various simplifications of regular expressions are possible, such |
744 step. Various simplifications of regular expressions are possible, such |
740 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
745 as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
741 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not |
746 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not |
768 \end{center} |
773 \end{center} |
769 |
774 |
770 \noindent |
775 \noindent |
771 A partial derivative of a regular expression $r$ is essentially a set of |
776 A partial derivative of a regular expression $r$ is essentially a set of |
772 regular expressions that are either $r$'s children expressions or a |
777 regular expressions that are either $r$'s children expressions or a |
773 concatenation of them. Antimirov has proved a tight bound of the size of |
778 concatenation of them. Antimirov has proved a tight bound of the sum of |
|
779 the size of |
774 \emph{all} partial derivatives no matter what the string looks like. |
780 \emph{all} partial derivatives no matter what the string looks like. |
775 Roughly speaking the size will be quadruple in the size of the regular |
781 Roughly speaking the size sum will be at most cubic in the size of the regular |
776 expression.\comment{Are you sure? I have just proved that the sum of |
782 expression.\comment{Are you sure? I have just proved that the sum of |
777 sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely |
783 sizes in $pder$ is less or equal $(1 + size\;r)^3$. And this is surely |
778 not the best bound.} If we want the size of derivatives in Sulzmann and |
784 not the best bound.}\comment{this is just a very rough guess i made |
779 Lu's algorithm to stay equal or below this bound, we would need more |
785 2 months ago. i will take your suggested bound here.} |
|
786 If we want the size of derivatives in Sulzmann and |
|
787 Lu's algorithm to stay below this bound, we would need more |
780 aggressive simplifications. Essentially we need to delete useless |
788 aggressive simplifications. Essentially we need to delete useless |
781 $\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible. |
789 $\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible. |
782 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to |
790 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to |
783 get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a |
791 get $a\cdot c + b \cdot c + b \cdot c$, and then simplified to just $a |
784 \cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ |
792 \cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ |
878 \begin{center} |
886 \begin{center} |
879 \begin{tabular}{lcl} |
887 \begin{tabular}{lcl} |
880 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
888 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
881 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
889 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
882 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
890 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
883 & $\mid$ & $\textit{ALT}\;\;bs\,a_1 \, a_2$\\ |
891 & $\mid$ & $\textit{ALTS}\;\;bs\,a_1 \, a_2$\\ |
884 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
892 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
885 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
893 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
886 \end{tabular} |
894 \end{tabular} |
887 \end{center} |
895 \end{center} |
888 %(in \textit{ALT}) |
896 %(in \textit{ALTS}) |
889 |
897 |
890 \noindent |
898 \noindent |
891 where $bs$ stands for bitcodes, and $a$ for $\bold{a}$nnotated regular |
899 where $bs$ stands for bitcodes, and $a$ for $\bold{a}$nnotated regular |
892 expressions. We will show that these bitcodes encode information about |
900 expressions. We will show that these bitcodes encode information about |
893 the (POSIX) value that should be generated by the Sulzmann and Lu |
901 the (POSIX) value that should be generated by the Sulzmann and Lu |
904 \begin{tabular}{lcl} |
912 \begin{tabular}{lcl} |
905 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\ |
913 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\ |
906 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\ |
914 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\ |
907 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\ |
915 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\ |
908 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
916 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
909 $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\, |
917 $\textit{ALTS}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\, |
910 (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\ |
918 (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\ |
911 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
919 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
912 $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\ |
920 $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\ |
913 $(r^*)^\uparrow$ & $\dn$ & |
921 $(r^*)^\uparrow$ & $\dn$ & |
914 $\textit{STAR}\;[]\,r^\uparrow$\\ |
922 $\textit{STAR}\;[]\,r^\uparrow$\\ |
928 $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\ |
936 $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\ |
929 $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ & |
937 $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ & |
930 $\textit{ONE}\,(bs\,@\,bs')$\\ |
938 $\textit{ONE}\,(bs\,@\,bs')$\\ |
931 $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ & |
939 $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ & |
932 $\textit{CHAR}\,(bs\,@\,bs')\,c$\\ |
940 $\textit{CHAR}\,(bs\,@\,bs')\,c$\\ |
933 $\textit{fuse}\;bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ & |
941 $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,a_1\,a_2)$ & $\dn$ & |
934 $\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\ |
942 $\textit{ALTS}\,(bs\,@\,bs')\,a_1\,a_2$\\ |
935 $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ & |
943 $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ & |
936 $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\ |
944 $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\ |
937 $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ & |
945 $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ & |
938 $\textit{STAR}\,(bs\,@\,bs')\,a$ |
946 $\textit{STAR}\,(bs\,@\,bs')\,a$ |
939 \end{tabular} |
947 \end{tabular} |
942 \noindent |
950 \noindent |
943 After internalising the regular expression, we perform successive |
951 After internalising the regular expression, we perform successive |
944 derivative operations on the annotated regular expressions. This |
952 derivative operations on the annotated regular expressions. This |
945 derivative operation is the same as what we had previously for the |
953 derivative operation is the same as what we had previously for the |
946 basic regular expressions, except that we beed to take care of |
954 basic regular expressions, except that we beed to take care of |
947 the bitcodes:\comment{You need to be consitent with ALTS and ALT; ALT |
955 the bitcodes:\comment{You need to be consitent with ALTS and ALTS; ALTS |
948 is just an abbreviation; derivations and so on are defined for |
956 is just an abbreviation; derivations and so on are defined for |
949 ALTS}\comment{no this is not the case, ALT for 2 regexes, ALTS for a |
957 ALTS}\comment{no this is not the case, ALT for 2 regexes, ALTS for a |
950 list...\textcolor{blue}{This does not make sense to me. CU}} |
958 list...\textcolor{blue}{This does not make sense to me. CU}} |
951 |
959 |
952 %\begin{definition}{bder} |
960 %\begin{definition}{bder} |
955 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
963 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
956 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
964 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
957 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & |
965 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & |
958 $\textit{if}\;c=d\; \;\textit{then}\; |
966 $\textit{if}\;c=d\; \;\textit{then}\; |
959 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
967 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
960 $(\textit{ALT}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
968 $(\textit{ALTS}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
961 $\textit{ALT}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\ |
969 $\textit{ALTS}\;bs\,(a_1\,\backslash c)\,(a_2\,\backslash c)$\\ |
962 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
970 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
963 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
971 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
964 & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\ |
972 & &$\textit{then}\;\textit{ALTS}\,bs\,(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2)$\\ |
965 & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\ |
973 & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))$\\ |
966 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ |
974 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ |
967 $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & |
975 $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & |
968 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, |
976 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, |
969 (\textit{STAR}\,[]\,r)$ |
977 (\textit{STAR}\,[]\,r)$ |
970 \end{tabular} |
978 \end{tabular} |
983 then the right branch of $ALTS$, which is $fuse \; bmkeps \; a_1 (a_2 |
991 then the right branch of $ALTS$, which is $fuse \; bmkeps \; a_1 (a_2 |
984 \backslash c)$ will collapse the regular expression $a_1$(as it has |
992 \backslash c)$ will collapse the regular expression $a_1$(as it has |
985 already been fully matched) and store the parsing information at the |
993 already been fully matched) and store the parsing information at the |
986 head of the regular expression $a_2 \backslash c$ by fusing to it. The |
994 head of the regular expression $a_2 \backslash c$ by fusing to it. The |
987 bitsequence $bs$, which was initially attached to the head of $SEQ$, has |
995 bitsequence $bs$, which was initially attached to the head of $SEQ$, has |
988 now been elevated to the top-level of $ALT$, as this information will be |
996 now been elevated to the top-level of $ALTS$, as this information will be |
989 needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs |
997 needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs |
990 to $a_1$ or $ a_2$. After building these derivatives and maintaining all |
998 to $a_1$ or $ a_2$. After building these derivatives and maintaining all |
991 the lexing information, we complete the lexing by collecting the |
999 the lexing information, we complete the lexing by collecting the |
992 bitcodes using a generalised version of the $\textit{mkeps}$ function |
1000 bitcodes using a generalised version of the $\textit{mkeps}$ function |
993 for annotated regular expressions, called $\textit{bmkeps}$: |
1001 for annotated regular expressions, called $\textit{bmkeps}$: |
995 |
1003 |
996 %\begin{definition}[\textit{bmkeps}]\mbox{} |
1004 %\begin{definition}[\textit{bmkeps}]\mbox{} |
997 \begin{center} |
1005 \begin{center} |
998 \begin{tabular}{lcl} |
1006 \begin{tabular}{lcl} |
999 $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\ |
1007 $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\ |
1000 $\textit{bmkeps}\,(\textit{ALT}\;bs\,a_1\,a_2)$ & $\dn$ & |
1008 $\textit{bmkeps}\,(\textit{ALTS}\;bs\,a_1\,a_2)$ & $\dn$ & |
1001 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
1009 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
1002 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\ |
1010 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\ |
1003 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\ |
1011 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\ |
1004 $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & |
1012 $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & |
1005 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
1013 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
1048 modification is that we use simplification rules inspired by Antimirov's |
1056 modification is that we use simplification rules inspired by Antimirov's |
1049 work on partial derivatives. They maintain the idea that only the first |
1057 work on partial derivatives. They maintain the idea that only the first |
1050 ``copy'' of a regular expression in an alternative contributes to the |
1058 ``copy'' of a regular expression in an alternative contributes to the |
1051 calculation of a POSIX value. All subsequent copies can be pruned away from |
1059 calculation of a POSIX value. All subsequent copies can be pruned away from |
1052 the regular expression. A recursive definition of our simplification function |
1060 the regular expression. A recursive definition of our simplification function |
1053 that looks somewhat similar to our Scala code is given below:\comment{Use $\ZERO$, $\ONE$ and so on. |
1061 that looks somewhat similar to our Scala code is given below: |
1054 Is it $ALT$ or $ALTS$?}\\ |
1062 %\comment{Use $\ZERO$, $\ONE$ and so on. |
|
1063 %Is it $ALTS$ or $ALTS$?}\\ |
1055 |
1064 |
1056 \begin{center} |
1065 \begin{center} |
1057 \begin{tabular}{@{}lcl@{}} |
1066 \begin{tabular}{@{}lcl@{}} |
1058 |
1067 |
1059 $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
1068 $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
1149 outlined in this section): |
1158 outlined in this section): |
1150 |
1159 |
1151 \begin{quote}\it |
1160 \begin{quote}\it |
1152 ``Correctness Claim: We further claim that the incremental parsing |
1161 ``Correctness Claim: We further claim that the incremental parsing |
1153 method in Figure~5 in combination with the simplification steps in |
1162 method in Figure~5 in combination with the simplification steps in |
1154 Figure 6 yields POSIX parse trees. We have tested this claim |
1163 Figure 6 yields POSIX parse tree. We have tested this claim |
1155 extensively by using the method in Figure~3 as a reference but yet |
1164 extensively by using the method in Figure~3 as a reference but yet |
1156 have to work out all proof details.'' |
1165 have to work out all proof details.'' |
1157 \end{quote} |
1166 \end{quote} |
1158 |
1167 |
1159 \noindent We like to settle this correctness claim. It is relatively |
1168 \noindent (The so-called parse trees in this quote means lexical values.) |
|
1169 We like to settle this correctness claim. It is relatively |
1160 straightforward to establish that after one simplification step, the part of a |
1170 straightforward to establish that after one simplification step, the part of a |
1161 nullable derivative that corresponds to a POSIX value remains intact and can |
1171 nullable derivative that corresponds to a POSIX value remains intact and can |
1162 still be collected, in other words, we can show that\comment{Double-check....I |
1172 still be collected, in other words, we can show that\comment{Double-check....I |
1163 think this is not the case} |
1173 think this is not the case} |
1164 %\comment{If i remember correctly, you have proved this lemma. |
1174 %\comment{If i remember correctly, you have proved this lemma. |
1167 %you proved something like simplification does not affect $\textit{bmkeps}$ results? |
1177 %you proved something like simplification does not affect $\textit{bmkeps}$ results? |
1168 %Anyway, i have amended this a little bit so it does not allow arbitrary bits attached |
1178 %Anyway, i have amended this a little bit so it does not allow arbitrary bits attached |
1169 %to a regex. Maybe it works now.} |
1179 %to a regex. Maybe it works now.} |
1170 |
1180 |
1171 \begin{center} |
1181 \begin{center} |
1172 $\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;( a\; \textit{bnullable} and \textit{decode}(r, \textit{bmkeps}(a)) is a \textit{POSIX} value)$ |
1182 $\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;($\textit{provided}$ \; a\; is \; \textit{bnullable} )$ |
1173 \end{center} |
1183 \end{center} |
1174 \comment{\textcolor{blue}{I proved $bmkeps\,(bsimp\,a) = bmkeps\,a$ provided $a$ is |
|
1175 $\textit{bnullable}$}} |
|
1176 |
1184 |
1177 \noindent |
1185 \noindent |
1178 as this basically comes down to proving actions like removing the |
1186 as this basically comes down to proving actions like removing the |
1179 additional $r$ in $r+r$ does not delete important POSIX information in |
1187 additional $r$ in $r+r$ does not delete important POSIX information in |
1180 a regular expression. The hard part of this proof is to establish that |
1188 a regular expression. The hard part of this proof is to establish that |
1181 |
1189 |
1182 \begin{center} |
1190 \begin{center} |
1183 $\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$ |
1191 $ \textit{blexer}\_{simp}(r, \; s) = \textit{blexer}(r, \; s)$ |
1184 \end{center}\comment{This is not true either...look at the definion blexer/blexer-simp} |
1192 \end{center}\comment{This is not true either...look at the definion blexer/blexer-simp} |
1185 |
1193 |
1186 \noindent That is, if we do derivative on regular expression $r$ and then |
1194 \noindent That is, if we do derivative on regular expression $r$ and then |
1187 simplify it, and repeat this process until we exhaust the string, we get a |
1195 simplify it, and repeat this process until we exhaust the string, we get a |
1188 regular expression $r''$($\textit{LHS}$) that provides the POSIX matching |
1196 regular expression $r''$($\textit{LHS}$) that provides the POSIX matching |
1189 information, which is exactly the same as the result $r'$($\textit{RHS}$ of the |
1197 information, which is exactly the same as the result $r'$($\textit{RHS}$ of the |
1190 normal derivative algorithm that only does derivative repeatedly and has no |
1198 normal derivative algorithm that only does derivative repeatedly and has no |
1191 simplification at all. This might seem at first glance very unintuitive, as |
1199 simplification at all. This might seem at first glance very unintuitive, as |
1192 the $r'$ is exponentially larger than $r''$, but can be explained in the |
1200 the $r'$ could be exponentially larger than $r''$, but can be explained in the |
1193 following way: we are pruning away the possible matches that are not POSIX. |
1201 following way: we are pruning away the possible matches that are not POSIX. |
1194 Since there are exponentially non-POSIX matchings and only 1 POSIX matching, it |
1202 Since there could be exponentially many |
|
1203 non-POSIX matchings and only 1 POSIX matching, it |
1195 is understandable that our $r''$ can be a lot smaller. we can still provide |
1204 is understandable that our $r''$ can be a lot smaller. we can still provide |
1196 the same POSIX value if there is one. This is not as straightforward as the |
1205 the same POSIX value if there is one. This is not as straightforward as the |
1197 previous proposition, as the two regular expressions $r'$ and $r''$ might have |
1206 previous proposition, as the two regular expressions $r'$ and $r''$ might have |
1198 become very different regular expressions. The crucial point is to find the |
1207 become very different. The crucial point is to find the |
1199 $\textit{POSIX}$ information of a regular expression and how it is modified, |
1208 $\textit{POSIX}$ information of a regular expression and how it is modified, |
1200 augmented and propagated |
1209 augmented and propagated |
1201 during simplification in parallel with the regularr expression that |
1210 during simplification in parallel with the regularr expression that |
1202 has not been simplified in the subsequent derivative operations. To aid this, |
1211 has not been simplified in the subsequent derivative operations. To aid this, |
1203 we use the helping function retrieve described by Sulzmann and Lu: \\definition |
1212 we use the helper function retrieve described by Sulzmann and Lu: \\definition |
1204 of retrieve TODO\comment{Did not read further}\\ |
1213 of retrieve TODO\comment{Did not read further}\\ |
1205 This function assembles the bitcode that corresponds to a parse tree for how |
1214 This function assembles the bitcode |
1206 the current derivative matches the suffix of the string(the characters that |
1215 %that corresponds to a lexical value for how |
1207 have not yet appeared, but will appear as the successive derivatives go on, |
1216 %the current derivative matches the suffix of the string(the characters that |
1208 how do we get this "future" information? By the value $v$, which is |
1217 %have not yet appeared, but will appear as the successive derivatives go on. |
1209 computed by a pass of the algorithm that uses |
1218 %How do we get this "future" information? By the value $v$, which is |
1210 $inj$ as described in the previous section). Sulzmann and Lu used this |
1219 %computed by a pass of the algorithm that uses |
|
1220 %$inj$ as described in the previous section). |
|
1221 using information from both the derivative regular expression and the value. |
|
1222 Sulzmann and Lu used this |
1211 to connect the bitcoded algorithm to the older algorithm by the following |
1223 to connect the bitcoded algorithm to the older algorithm by the following |
1212 equation: |
1224 equation: |
1213 |
1225 |
1214 \begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; |
1226 \begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; |
1215 ((\textit{internalise}\; r)\backslash_{simp} c) v)$ |
1227 ((\textit{internalise}\; r)\backslash_{simp} c) v)$ |
1216 \end{center} |
1228 \end{center} |
1217 A little fact that needs to be stated to help comprehension: |
1229 A little fact that needs to be stated to aid comprehension: |
1218 \begin{center} |
1230 \begin{center} |
1219 $r^\uparrow = a$($a$ stands for $\textit{annotated}).$ |
1231 $r^\uparrow = a$($a$ stands for $\textit{annotated}).$ |
1220 \end{center} |
1232 \end{center} |
1221 Ausaf and Urban also used this fact to prove the |
1233 Ausaf and Urban also used this fact to prove the |
1222 correctness of bitcoded algorithm without simplification. Our purpose |
1234 correctness of bitcoded algorithm without simplification. Our purpose |
1224 \begin{center} |
1236 \begin{center} |
1225 $ \textit{retrieve} \; |
1237 $ \textit{retrieve} \; |
1226 a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$ |
1238 a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$ |
1227 \end{center} |
1239 \end{center} |
1228 The idea |
1240 The idea |
1229 is that using $v'$, a simplified version of $v$ that possibly had gone |
1241 is that using $v'$, a simplified version of $v$ that had gone |
1230 through the same simplification step as $\textit{simp}(a)$ we are |
1242 through the same simplification step as $\textit{simp}(a)$, we are |
1231 able to extract the bit-sequence that gives the same parsing |
1243 able to extract the bit-sequence that gives the same parsing |
1232 information as the unsimplified one. After establishing this, we |
1244 information as the unsimplified one. After establishing this, we |
1233 might be able to finally bridge the gap of proving |
1245 might be able to finally bridge the gap of proving |
1234 \begin{center} |
1246 \begin{center} |
1235 $\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; |
1247 $\textit{retrieve} \; r^\uparrow \backslash s \; v = \;\textit{retrieve} \; |
1236 \textit{simp}(r) \backslash s \; v'$ |
1248 \textit{bsimp}(r^\uparrow) \backslash s \; v'$ |
1237 \end{center} |
1249 \end{center} |
1238 and subsequently |
1250 and subsequently |
1239 \begin{center} |
1251 \begin{center} |
1240 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; |
1252 $\textit{retrieve} \; r^\uparrow \backslash s \; v\; = \; \textit{retrieve} \; |
1241 r \backslash_{simp} s \; v'$. |
1253 r^\uparrow \backslash_{simp} s \; v'$. |
1242 \end{center} |
1254 \end{center} |
|
1255 The $\textit{LHS}$ of the above equation is the bitcode we want. |
1243 This proves that our simplified |
1256 This proves that our simplified |
1244 version of regular expression still contains all the bitcodes needed. |
1257 version of regular expression still contains all the bitcodes needed. |
1245 |
1258 The task here is to find a way to compute the correct $v'$. |
1246 |
1259 |
1247 The second task is to speed up the more aggressive simplification. |
1260 The second task is to speed up the more aggressive simplification. |
1248 Currently it is slower than a naive simplification(the naive version as |
1261 Currently it is slower than a naive simplification(the naive version as |
1249 implemented in ADU of course can explode in some cases). So it needs to |
1262 implemented in ADU of course can explode in some cases). So it needs to |
1250 be explored how to make it faster. Our possibility would be to explore |
1263 be explored how to make it faster. Our possibility would be to explore |