573 & & $\phantom{r_3 = (} ((0+1+0 + 0 + 0) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^* )$ |
580 & & $\phantom{r_3 = (} ((0+1+0 + 0 + 0) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^* )$ |
574 \end{tabular} |
581 \end{tabular} |
575 \end{center} |
582 \end{center} |
576 |
583 |
577 \noindent |
584 \noindent |
578 Now when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$ |
585 In case $r_3$ is nullable, we can call $mkeps$ |
579 to construct a parse tree for how $r_3$ matched the string $abc$. |
586 to construct a parse tree for how $r_3$ matched the string $abc$. |
580 $mkeps$ gives the following value $v_3$: |
587 $mkeps$ gives the following value $v_3$: |
581 \begin{center} |
588 \begin{center} |
582 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$ |
589 $\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$ |
583 \end{center} |
590 \end{center} |
584 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined): |
591 The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$(underlined): |
|
592 |
585 \begin{center} |
593 \begin{center} |
586 $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0 + 1 + 0) |
594 $( \underline{(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*} + (0+0+0 + 1 + 0) |
587 \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ).$ |
595 \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ).$ |
588 |
|
589 \end{center} |
596 \end{center} |
590 Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$ |
597 |
591 (which corresponds to the initial sub-match $abc$) |
598 \noindent |
592 allows $mkeps$ to pick it up |
599 Note that the leftmost location of term $((0+0+0 + 0 + 1 \cdot 1 \cdot |
593 because $mkeps$ is defined to always choose the left one when it is nullable. |
600 1) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows |
594 In the case of this example, $abc$ is preferred over $a$ or $ab$. |
601 $mkeps$ to pick it up because $mkeps$ is defined to always choose the |
595 This $\Left(\Left(\ldots))$ location is naturally generated by |
602 left one when it is nullable. In the case of this example, $abc$ is |
596 two applications of the splitting clause |
603 preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is |
597 \begin{center} |
604 naturally generated by two applications of the splitting clause |
|
605 |
|
606 \begin{center} |
598 $(r_1 \cdot r_2)\backslash c (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$ |
607 $(r_1 \cdot r_2)\backslash c (when \; r_1 \; nullable) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$ |
599 \end{center} |
608 \end{center} |
600 By this clause, we put |
609 |
601 $r_1 \backslash c \cdot r_2 $ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. |
610 \noindent |
602 This allows $mkeps$ to always pick up among two matches the one with a longer initial sub-match. |
611 By this clause, we put $r_1 \backslash c \cdot r_2 $ at the |
603 Removing the outside $\Left(\Left(...))$, the inside sub-value |
612 $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This |
604 \begin{center} |
613 allows $mkeps$ to always pick up among two matches the one with a longer |
|
614 initial sub-match. Removing the outside $\Left(\Left(...))$, the inside |
|
615 sub-value |
|
616 |
|
617 \begin{center} |
605 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$ |
618 $\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$ |
606 \end{center} |
619 \end{center} |
607 tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$. |
620 |
608 We match $[]$ by a sequence of 2 nullable regular expressions. |
621 \noindent |
609 The first one is an alternative, we take the rightmost alternative---whose language |
622 tells us how the empty string $[]$ is matched with $(0+0+0 + 0 + 1 \cdot |
610 contains the empty string. The second nullable regular |
623 1 \cdot 1) \cdot r^*$. We match $[]$ by a sequence of 2 nullable regular |
611 expression is a Kleene star. $\Stars$ tells us how it |
624 expressions. The first one is an alternative, we take the rightmost |
612 generates the nullable regular expression: by 0 iterations to form $\epsilon$. |
625 alternative---whose language contains the empty string. The second |
613 Now $\textit{inj}$ injects characters back and |
626 nullable regular expression is a Kleene star. $\Stars$ tells us how it |
614 incrementally builds a parse tree based on $v_3$. |
627 generates the nullable regular expression: by 0 iterations to form |
615 Using the value $v_3$, the character c, and the regular expression $r_2$, |
628 $\epsilon$. Now $\textit{inj}$ injects characters back and incrementally |
616 we can recover how $r_2$ matched the string $[c]$ : |
629 builds a parse tree based on $v_3$. Using the value $v_3$, the character |
617 $\textit{inj} \; r_2 \; c \; v_3$ gives us |
630 c, and the regular expression $r_2$, we can recover how $r_2$ matched |
|
631 the string $[c]$ : $\textit{inj} \; r_2 \; c \; v_3$ gives us |
618 \begin{center} |
632 \begin{center} |
619 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$ |
633 $v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, c))), \Stars [])),$ |
620 \end{center} |
634 \end{center} |
621 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get |
635 which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get |
622 \begin{center} |
636 \begin{center} |
644 \item[1)] just $a$ or |
658 \item[1)] just $a$ or |
645 \item[2)] string $ab$ or |
659 \item[2)] string $ab$ or |
646 \item[3)] string $abc$. |
660 \item[3)] string $abc$. |
647 \end{enumerate} |
661 \end{enumerate} |
648 \end{center} |
662 \end{center} |
649 In order to differentiate between these choices, |
663 |
650 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. |
664 \noindent |
651 Which one of these alternatives is chosen later does not affect their relative position because our algorithm |
665 In order to differentiate between these choices, we just need to |
652 does not change this order. If this parsing information can be determined and does |
666 remember their positions--$a$ is on the left, $ab$ is in the middle , |
653 not change because of later derivatives, |
667 and $abc$ is on the right. Which one of these alternatives is chosen |
654 there is no point in traversing this information twice. This leads to an optimisation---if we store the information for parse trees inside the regular expression, update it when we do derivative on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing the rest $n$ injections. This leads to Sulzmann and Lu's novel idea of using bit-codes in derivatives. |
668 later does not affect their relative position because our algorithm does |
|
669 not change this order. If this parsing information can be determined and |
|
670 does not change because of later derivatives, there is no point in |
|
671 traversing this information twice. This leads to an optimisation---if we |
|
672 store the information for parse trees inside the regular expression, |
|
673 update it when we do derivative on them, and collect the information |
|
674 when finished with derivatives and call $mkeps$ for deciding which |
|
675 branch is POSIX, we can generate the parse tree in one pass, instead of |
|
676 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel |
|
677 idea of using bit-codes in derivatives. |
655 |
678 |
656 In the next section, we shall focus on the bit-coded algorithm and the |
679 In the next section, we shall focus on the bit-coded algorithm and the |
657 process of simplification of regular expressions. This is needed in |
680 process of simplification of regular expressions. This is needed in |
658 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
681 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
659 and Lu's algorithms. This is where the PhD-project aims to advance the |
682 and Lu's algorithms. This is where the PhD-project aims to advance the |
660 state-of-the-art. |
683 state-of-the-art. |
661 |
684 |
662 |
685 |
663 \section{Simplification of Regular Expressions} |
686 \section{Simplification of Regular Expressions} |
664 |
687 |
665 |
688 Using bitcodes to guide parsing is not a novel idea. It was applied to |
666 Using bit-codes to guide parsing is not a novel idea. It was applied to |
|
667 context free grammars and then adapted by Henglein and Nielson for |
689 context free grammars and then adapted by Henglein and Nielson for |
668 efficient regular expression parsing using DFAs \cite{nielson11bcre}. Sulzmann and |
690 efficient regular expression parsing using DFAs~\cite{nielson11bcre}. |
669 Lu took a step further by integrating bitcodes into derivatives. |
691 Sulzmann and Lu took this idea of bitcodes a step further by integrating |
670 The reason why we want to use bitcodes in |
692 bitcodes into derivatives. The reason why we want to use bitcodes in |
671 this project is that we want to introduce aggressive |
693 this project is that we want to introduce more aggressive |
672 simplifications. |
694 simplifications in order to keep the size of derivatives small |
673 |
695 throughout. This is because the main drawback of building successive |
674 The main drawback of building |
696 derivatives according to Brzozowski's definition is that they can grow |
675 successive derivatives according to Brzozowski's definition is that they |
697 very quickly in size. This is mainly due to the fact that the derivative |
676 can grow very quickly in size. This is mainly due to the fact that the |
698 operation generates often ``useless'' $\ZERO$s and $\ONE$s in |
677 derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in |
|
678 derivatives. As a result, if implemented naively both algorithms by |
699 derivatives. As a result, if implemented naively both algorithms by |
679 Brzozowski and by Sulzmann and Lu are excruciatingly slow. |
700 Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example |
680 For example when starting with the regular expression $(a + aa)^*$ and building 12 |
701 when starting with the regular expression $(a + aa)^*$ and building 12 |
681 successive derivatives w.r.t.~the character $a$, one obtains a |
702 successive derivatives w.r.t.~the character $a$, one obtains a |
682 derivative regular expression with more than 8000 nodes (when viewed as |
703 derivative regular expression with more than 8000 nodes (when viewed as |
683 a tree). Operations like derivative and $\nullable$ need to traverse |
704 a tree). Operations like derivative and $\nullable$ need to traverse |
684 such trees and consequently the bigger the size of the derivative the |
705 such trees and consequently the bigger the size of the derivative the |
685 slower the algorithm. |
706 slower the algorithm. |
686 Fortunately, one can simplify regular expressions |
707 |
687 after each derivative step. Various simplifications of regular |
708 Fortunately, one can simplify regular expressions after each derivative |
688 expressions are possible, such as the simplifications of $\ZERO + r$, $r |
709 step. Various simplifications of regular expressions are possible, such |
689 + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These |
710 as the simplifications of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
690 simplifications do not affect the answer for whether a regular |
711 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not |
691 expression matches a string or not, but fortunately also do not affect |
712 affect the answer for whether a regular expression matches a string or |
692 the POSIX strategy of how regular expressions match strings---although |
713 not, but fortunately also do not affect the POSIX strategy of how |
693 the latter is much harder to establish. |
714 regular expressions match strings---although the latter is much harder |
694 The argument for complicating the data structures from basic regular |
715 to establish. \comment{Does not make sense.} The argument for |
695 expressions to those with bitcodes is that we can introduce |
716 complicating the data structures from basic regular expressions to those |
696 simplification without making the algorithm crash or overly complex to |
717 with bitcodes is that we can introduce simplification without making the |
697 reason about. The latter is crucial for a correctness proof. |
718 algorithm crash or overly complex to reason about. The latter is crucial |
698 Some initial results in this |
719 for a correctness proof. Some initial results in this regard have been |
699 regard have been obtained in \cite{AusafDyckhoffUrban2016}. However, |
720 obtained in \cite{AusafDyckhoffUrban2016}. |
700 what has not been achieved yet is correctness for the bit-coded algorithm |
721 |
701 that involves simplifications and a very tight bound for the size. Such |
722 Unfortunately, the simplification rules outlined above are not |
702 a tight bound is suggested by work of Antimirov who proved that |
723 sufficient to prevent an explosion for all regular expression. We |
703 (partial) derivatives can be bound by the number of characters contained |
724 believe a tighter bound can be achieved that prevents an explosion in |
704 in the initial regular expression \cite{Antimirov95}. |
725 all cases. Such a tighter bound is suggested by work of Antimirov who |
705 |
726 proved that (partial) derivatives can be bound by the number of |
706 Antimirov defined the \emph{partial derivatives} of regular expressions to be this: |
727 characters contained in the initial regular expression |
707 %TODO definition of partial derivatives |
728 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular |
|
729 expressions as follows: |
|
730 |
708 \begin{center} |
731 \begin{center} |
709 \begin{tabular}{lcl} |
732 \begin{tabular}{lcl} |
710 $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\ |
733 $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\ |
711 $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\ |
734 $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\ |
712 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{ 1 \} \; \textit{else} \; \emptyset$ \\ |
735 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{ 1 \} \; \textit{else} \; \emptyset$ \\ |
713 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \; r_2$ \\ |
736 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \; r_2$ \\ |
714 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} \cup pder \; c \; r_2 \; \textit{else} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} $ \\ |
737 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 $\\ |
|
738 & & $\textit{then} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} \cup pder \; c \; r_2 \;$\\ |
|
739 & & $\textit{else} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} $ \\ |
715 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in pder \; c \; r \} $ \\ |
740 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in pder \; c \; r \} $ \\ |
716 \end{tabular} |
741 \end{tabular} |
717 \end{center} |
742 \end{center} |
718 A partial derivative of a regular expression $r$ is essentially a set of regular expressions that are either $r$'s children expressions or a concatenation of them. |
743 |
719 Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression. Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives. |
744 \noindent |
|
745 A partial derivative of a regular expression $r$ is essentially a set of |
|
746 regular expressions that are either $r$'s children expressions or a |
|
747 concatenation of them. Antimirov has proved a tight bound of the size of |
|
748 partial derivatives. \comment{That looks too preliminary to me.} Roughly |
|
749 speaking the size will not exceed the fourth power of the number of |
|
750 nodes in that regular expression. \comment{Improve: which |
|
751 simplifications?}Interestingly, we observed from experiment that after |
|
752 the simplification step, our regular expression has the same size or is |
|
753 smaller than the partial derivatives. This allows us to prove a tight |
|
754 bound on the size of regular expression during the running time of the |
|
755 algorithm if we can establish the connection between our simplification |
|
756 rules and partial derivatives. |
720 |
757 |
721 %We believe, and have generated test |
758 %We believe, and have generated test |
722 %data, that a similar bound can be obtained for the derivatives in |
759 %data, that a similar bound can be obtained for the derivatives in |
723 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
760 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
724 |
761 |
742 $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\ |
779 $\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\ |
743 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\; |
780 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\; |
744 code(\Stars\,vs)$ |
781 code(\Stars\,vs)$ |
745 \end{tabular} |
782 \end{tabular} |
746 \end{center} |
783 \end{center} |
747 Here code encodes a value into a bit-sequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode it back into value:\\ |
784 |
|
785 Here code encodes a value into a bit-sequence by converting Left into |
|
786 $\Z$, Right into $\S$, the start point of a non-empty star iteration |
|
787 into $\S$, and the border where a local star terminates into $\Z$. This |
|
788 conversion is apparently lossy, as it throws away the character |
|
789 information, and does not decode the boundary between the two operands |
|
790 of the sequence constructor. Moreover, with only the bitcode we cannot |
|
791 even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or |
|
792 $\Stars$. The reason for choosing this compact way of storing |
|
793 information is that the relatively small size of bits can be easily |
|
794 moved around. In order to recover the bitcode back into values, we will |
|
795 need the regular expression as the extra information and decode it back |
|
796 into value:\\ |
|
797 |
|
798 |
748 %\begin{definition}[Bitdecoding of Values]\mbox{} |
799 %\begin{definition}[Bitdecoding of Values]\mbox{} |
749 \begin{center} |
800 \begin{center} |
750 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
801 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
751 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
802 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
752 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
803 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
1001 have to work out all proof details.'' |
1072 have to work out all proof details.'' |
1002 \end{quote} |
1073 \end{quote} |
1003 |
1074 |
1004 \noindent |
1075 \noindent |
1005 We would settle the correctness claim. |
1076 We would settle the correctness claim. |
1006 It is relatively straightforward to establish that after 1 simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words, |
1077 It is relatively straightforward to establish that after one simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words, |
1007 bmkeps r = bmkeps simp r |
1078 \comment{that only holds when r is nullable}bmkeps r = bmkeps simp r |
1008 as this basically comes down to proving actions like removing the additional $r$ in $r+r$ does not delete important POSIX information in a regular expression. |
1079 as this basically comes down to proving actions like removing the additional $r$ in $r+r$ does not delete important POSIX information in a regular expression. |
1009 The hardcore of this problem is to prove that |
1080 The hardcore of this problem is to prove that |
1010 bmkeps bders r = bmkeps bders simp r |
1081 bmkeps bders r = bmkeps bders simp r |
1011 That is, if we do derivative on regular expression r and the simplified version for, they can still prove the same POSIX value if there is one . This is not as straightforward as the previous proposition, as the two regular expression r and simp r might become very different regular expressions after repeated application ofd simp and derivative. |
1082 That is, if we do derivative on regular expression r and the simplified version for, they can still prove the same POSIX value if there is one . This is not as straightforward as the previous proposition, as the two regular expression r and simp r might become very different regular expressions after repeated application ofd simp and derivative. |
1012 The crucial point is to find the "gene" of a regular expression and how it is kept intact during simplification. |
1083 The crucial point is to find the \comment{What?}"gene" of a regular expression and how it is kept intact during simplification. |
1013 To aid this, we are utilizing the helping function retrieve described by Sulzmann and Lu: |
1084 To aid this, we are use the helping function retrieve described by Sulzmann and Lu: |
1014 \\definition of retrieve\\ |
1085 \\definition of retrieve\\ |
1015 This function assembled the bitcode that corresponds to a parse tree for how the current derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
1086 This function assembled the bitcode that corresponds to a parse tree for how the current derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
1016 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
1087 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
1017 $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\ |
1088 $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\ |
1018 A little fact that needs to be stated to help comprehension:\\ |
1089 A little fact that needs to be stated to help comprehension:\\ |