to include extended regular expressions and
simplification of internal data structures
eliminating the exponential behaviours.
|
\section{Motivation}


%----------------------------------------------------------------------------------------

\section{Contribution}

|
This work addresses the vulnerability of super-linear and
buggy regex implementations by combining
Brzozowski's derivatives with interactive theorem proving.
We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee of correctness and
running time in the form of an Isabelle/HOL proof.
We then strengthen the algorithm with an even more aggressive version of
simplification, and prove a time bound that is linear in the input length and
cubic in the regular expression size, using a technique by
Antimirov.
|
The main contribution of this thesis is a proven correct lexing algorithm
with formalized time bounds.
To the best of our knowledge, no lexing library based on Brzozowski derivatives
comes with a provable time guarantee;
claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu proposed an algorithm for which they
claim a linear running time.
This claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
Its authors formalized the correctness of the lexer, but not its complexity.
In their performance evaluation section, they analysed the running time
of matching $a$ against the string $\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$
and concluded that the algorithm is quadratic in the input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
lexing only 40 $a$'s took 5 minutes.

We believe that a proof of performance on general
inputs, rather than on specific examples, is a novel contribution.\\

|
\subsection{Related Work}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow~\parencite{Krauss2011}, and yet another one
in Coq is given by Coquand and Siles~\parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda~\parencite{RibeiroAgda2017}.

%We propose Brzozowski's derivatives as a solution to this problem.
% about Lexing Using Brzozowski derivatives
\section{Preliminaries}

|
Suppose we have an alphabet $\Sigma$; the set of all strings whose characters
are drawn from $\Sigma$
is written $\Sigma^*$.

We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system:
the basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid  \ONE
			 \mid  c
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2
			 \mid r^*
\]
|
The language, or set of strings, denoted by a regular expression is defined as
%TODO: FILL in the other defs
\begin{center}
\begin{tabular}{lcl}
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
\end{tabular}
\end{center}
This mapping is also called the ``language interpretation'' of regular expressions.
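\noindent
For instance, under this interpretation the regular expression $a \cdot (b + c)$
denotes the two-element set
\begin{center}
$L \; (a \cdot (b + c)) = \{ab,\; ac\}$.
\end{center}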
|


The Brzozowski derivative w.r.t.~a character $c$ is an operation on regular
expressions: it transforms a regular expression into a new one that denotes
the original strings starting with $c$, but with that head character $c$ removed.

Formally, we first define such a transformation on arbitrary sets of strings,
which we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
\end{center}
If the set $A$ happens to have some structure, for example
if it is the language of a regular expression $r$, then
the semantic derivative can be computed syntactically on $r$,
as we show next.
|
% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).

The derivative of a regular expression, denoted
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression,
computed by the following recursive clauses:

\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ &
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
|
\noindent
|
The $\nullable$ function tests whether the empty string $""$
is in the language of $r$:


\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
\end{center}
|
\noindent
The empty set does not contain any string and
therefore does not contain the empty string. The empty-string
regular expression contains the empty string
by definition. The character regular expression
denotes the singleton set containing only that character
and therefore does not contain the empty string.
An alternative regular expression (or ``or'' expression)
is nullable if at least one of its children is nullable.
A sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
naturally contains the empty string.
|

The meaning of the derivative operation can be given in terms of the language
interpretation: the language of $r\backslash c$ is exactly the semantic
derivative of the language of $r$, that is

\begin{center}
$L \; (r \backslash c) = \Der \; c \; (L \; r)$
\end{center}
|
\noindent
The derivative function, written $\backslash c$,
defines how a regular expression evolves into
a new one after the head character $c$ is
chopped off the strings it matches.
The most involved cases are the sequence case
and the star case.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be the one from which the head character
is chopped off.
The derivative of a star regular expression unwraps one iteration of
the regular expression and attaches the star regular expression
to the second element of the resulting sequence, to make sure a copy is
retained for possibly more iterations in later phases of lexing.
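\noindent
For example, applying these clauses to the sequence $a \cdot b$ and the star
$a^*$ with respect to the character $a$ gives (since $a$ is not nullable):

\begin{center}
\begin{tabular}{lcl}
$(a \cdot b)\backslash a$ & $=$ & $(a\backslash a)\cdot b \;=\; \ONE \cdot b$\\
$(a^*)\backslash a$       & $=$ & $(a\backslash a)\cdot a^* \;=\; \ONE \cdot a^*$\\
\end{tabular}
\end{center}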
|

The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}
|

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
|

\noindent
Assuming that a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
This is a beautifully simple definition.
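To make this pipeline concrete, the following is a minimal Scala sketch of the
matcher described above (the datatype and function names, such as
\textit{Rexp}, \textit{der} and \textit{matcher}, are illustrative choices for
this sketch only):

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches a single character
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// extend the derivative from characters to strings
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))

// example: (a + aa)* matches "aaa" but not "ab"
val raa = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
val test1 = matcher(raa, "aaa")   // true
val test2 = matcher(raa, "ab")    // false
\end{verbatim}

\noindent
The last two lines correspond to the YES/NO answer of \eqref{graph:*}.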
|

If we implement the above algorithm naively, however,
it can be excruciatingly slow.

|
\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\caption{EightThousandNodes} \label{fig:EightThousandNodes}
\end{figure}
|


%(8000 node data to be added here)
For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree), see
Figure~\ref{fig:EightThousandNodes}.
The reason why $(a + aa) ^*$ explodes so drastically is that without
pruning, the algorithm will keep records of all possible ways of matching:
\begin{center}
$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
\end{center}

\noindent
Each of the above alternative branches corresponds to a (partial) match:
$aa$, $a \quad a$ and $a \quad a \cdot (a)$ (the last one incomplete).
These different ways of matching grow exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that the expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees, and
consequently the bigger the size of the derivative, the slower the
algorithm becomes.
|

Brzozowski was quick to observe that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated and that the result is therefore not optimal.
He introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and thereby further decrease the size
of the intermediate regexes.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher without having to deal with cumbersome regular expressions.
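As a rough illustration of this composition, the following sketch builds on the
\textit{Rexp}, \textit{nullable} and \textit{der} definitions from the matcher
sketch above; the rules shown here are only a small subset of the
simplifications discussed below:

\begin{verbatim}
// remove obvious ZEROs and ONEs, and collapse identical alternatives
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s)  => if (r1s == r2s) r1s else ALT(r1s, r2s)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, r2s) => r2s
    case (r1s, ONE) => r1s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case _ => r
}

// one derivative step followed by simplification, extended to strings
def ders_simp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders_simp(cs, simp(der(c, r)))
}

def matcher_simp(r: Rexp, s: String): Boolean =
  nullable(ders_simp(s.toList, r))
\end{verbatim}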
|

If we want the size of derivatives in the algorithm to
stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
the same size bound as that of the \emph{partial derivatives}.
|

Building derivatives and then simplifying them: so far so good.
But what if we want to
do lexing instead of just getting a YES/NO answer?
This requires us to go back again to the world
without simplification first for a moment.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.
|
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


They first defined the datatypes for storing the
lexing information, called a \emph{value} or
sometimes also \emph{lexical value}.  These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}

|
\noindent
Building on top of Sulzmann and Lu's attempt to formalize the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally-specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having a correct result is still not enough;
we also want at least some degree of $\mathbf{efficiency}$.

|
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a)),\,\Right(Char(b))]$,
$\ldots$, and vice versa.
Even for a regular expression matching a particular string, there can
still be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following formula:
\begin{equation}
	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
\end{equation}
and a closed form formula can be calculated to be
\begin{equation}
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which is clearly exponential.
|
A lexer aiming to produce all possible values has an exponential
worst-case runtime. Therefore it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
\begin{itemize}
\item
match the leftmost regular expression when multiple options for matching
are available, and
\item
match a subpart as much as possible before proceeding
to the next token.
\end{itemize}
|


For example, the above example has the POSIX value
$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
The output of the algorithm we want is such a POSIX match
encoded as a value.
The reason why we are interested in $\POSIX$ values is that they can
be practically used in the lexing phase of a compiler front end.
For instance, when lexing a code snippet
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.
|

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}
|


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
to the characters $c_0$, $c_1$  until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
|
	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$
			& \textit{if} $\nullable(r_{1})$\\
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}
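\noindent
For example, for the nullable regular expression $(a + \ONE)\cdot b^*$ the
function chooses the right alternative (since $a$ is not nullable) and the
empty star:
\begin{center}
$\mkeps((a + \ONE)\cdot b^*) \;=\; \Seq\,(\Right(\Empty))\,(\Stars\,[])$
\end{center}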
|


\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX value is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject, and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:
|
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
|

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses do one thing: they identify the ``hole'' in the
value into which the character should be injected back.
For instance, consider the last clause, which injects back into a value
that will turn into a star value. We know the given value must be a
sequence value, whose first component corresponds to the child regex of
the star with the first character chopped off---an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value, with this new iteration
added to the front of the previous list of iterations, all under the
$Stars$ constructor.
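The following Scala sketch renders these values together with $\mkeps$ and
$\textit{inj}$, and composes them into the two-phase lexer of \eqref{graph:2}.
It builds on the \textit{Rexp}, \textit{nullable} and \textit{der} definitions
from the earlier matcher sketch; all names are again illustrative only:

\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// value for how a nullable regular expression matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}

// inject the character c back into a value for r\c, giving a value for r
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty)                 => Chr(c)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}

// phase 1 builds the derivatives, phase 2 injects the characters back
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil     => if (nullable(r)) mkeps(r) else sys.error("not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}

// example: lexing "ab" with (a + b)* gives
//   Stars(List(Left(Chr('a')), Right(Chr('b'))))
val v1 = lex(STAR(ALT(CHAR('a'), CHAR('b'))), "ab".toList)
\end{verbatim}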
|

We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well: they reflect
the size of the regular expressions by definition.

One can introduce simplification on the regexes and values, but has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivatives step and applying these rectifications in
each run during the injection phase.
One can prove that the POSIX value of how
regular expressions match strings will not be affected---although this is
much harder to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
|



%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification, or he may well have been
%stuck on an efficient simplification with a proof.
%He went on to explore the use of derivatives together with
%automata, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots v_n$,
and obtain the lexing information that should already be there while
doing derivatives in one pass, without a second phase of injection?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions called \emph{bitcodes}.
|

\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
		$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b::bs
$
\end{center}

|
\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}
\end{center}
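\noindent
For example, the value $\Stars\,[\Left(\Char\,a), \Right(\Char\,b)]$, which
describes how $(a+b)^*$ matches the string $ab$, is encoded as
\begin{center}
$\textit{code}(\Stars\,[\Left(\Char\,a), \Right(\Char\,b)]) = [1,0,1,1,0]$
\end{center}
\noindent
where the first and third bits (both $1$) mark the start of a star iteration,
the $0$ and the $1$ following them record the $\Left$ and $\Right$ choices,
and the final $0$ marks the end of the star.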
|

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of any non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
|
1152 |
|
1153 %\begin{definition}[Bitdecoding of Values]\mbox{} |
|
1154 \begin{center} |
|
1155 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
|
1156 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
|
1157 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
|
1158 $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
1159 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; |
|
1160 (\Left\,v, bs_1)$\\ |
|
1161 $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
1162 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; |
|
1163 (\Right\,v, bs_1)$\\ |
|
1164 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & |
|
1165 $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ |
|
1166 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ |
|
1167 & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ |
|
1168 $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ |
|
1169 $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & |
|
1170 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
1171 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ |
|
1172 & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ |
|
1173 |
|
1174 $\textit{decode}\,bs\,r$ & $\dn$ & |
|
1175 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
1176 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
|
1177 \textit{else}\;\textit{None}$ |
|
1178 \end{tabular} |
|
1179 \end{center} |
|
1180 %\end{definition} |
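\noindent
As long as a bitcode is paired with the regular expression its value belongs to,
decoding recovers the original value; for instance, decoding the bitcode from
the example above against $(a+b)^*$ gives back
\begin{center}
$\textit{decode}\,[1,0,1,1,0]\,((a+b)^*) = \textit{Some}\,(\Stars\,[\Left(\Char\,a), \Right(\Char\,b)])$.
\end{center}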
|

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have  an $as$ in the definitions, not just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\sum\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})
|

\noindent
where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.
|
1208 |
|
1209 To do lexing using annotated regular expressions, we shall first |
|
1210 transform the usual (un-annotated) regular expressions into annotated |
|
1211 regular expressions. This operation is called \emph{internalisation} and |
|
1212 defined as follows: |
|
1213 |
|
1214 %\begin{definition} |
|
1215 \begin{center} |
|
1216 \begin{tabular}{lcl} |
|
1217 $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ |
|
1218 $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ |
|
1219 $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ |
|
1220 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
|
1221 $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\, |
|
1222 \textit{fuse}\,[1]\,r_2^\uparrow]$\\ |
|
1223 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
|
1224 $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ |
|
1225 $(r^*)^\uparrow$ & $\dn$ & |
|
1226 $_{[]}(r^\uparrow)^*$\\ |
|
1227 \end{tabular} |
|
1228 \end{center} |
|
1229 %\end{definition} |
|

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
     $_{bs@bs'}\sum\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}
\end{center}
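\noindent
For example, internalising the alternative $a+b$ attaches the bits $[0]$ and
$[1]$ to the two branches, recording which branch a later derivative will have
taken:
\begin{center}
$(a+b)^\uparrow \;=\; {}_{[]}\sum[\,{}_{[0]}{\bf a},\; {}_{[1]}{\bf b}\,]$
\end{center}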
|

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:

|

\iffalse
 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       (\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       (_{bs}\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}
\fi
|

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [1] \; (a\,\backslash c))\cdot
       (_{[]}a^*)$
\end{tabular}
\end{center}

%\end{definition}
|
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence
and attach an additional bit $1$ to the front of $a \backslash c$,
indicating one more star iteration (this is the same bit that
$\textit{code}$ uses to mark the start of a star iteration).
The sequence clause is also more subtle: when $a_1$ is $\textit{bnullable}$
(\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions, so we omit its
definition), the derivative becomes an alternative with two branches.
Assuming that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
the right branch of this alternative, which is $\textit{fuse} \; (\bmkeps \; a_1)\; (a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the lexing information at the
head of the regular expression $a_2 \backslash c$ by fusing it to the front. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top level of $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
|


%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
     $bs \,@\, [0]$
\end{tabular}
\end{center}
%\end{definition}

|
\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:
|

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
|

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
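The following Scala sketch renders the annotated regular expressions together
with \textit{fuse}, internalisation, the bit-coded derivative and
$\textit{bmkeps}$. It builds on the plain \textit{Rexp} datatype from the first
matcher sketch; the final decoding step is omitted, so the function returns
the raw bitsequence rather than a value, and all names are again only
illustrative:

\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Int]) extends ARexp
case class ACHAR(bs: List[Int], c: Char) extends ARexp
case class AALTS(bs: List[Int], as: List[ARexp]) extends ARexp // generalised +
case class ASEQ(bs: List[Int], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Int], a: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: List[Int], a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ++ bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ++ bs1, a1)
}

// internalisation: lift a plain regular expression into an annotated one
def internalise(r: Rexp): ARexp = r match {
  case ZERO        => AZERO
  case ONE         => AONE(Nil)
  case CHAR(c)     => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(0), internalise(r1)),
                                      fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1)    => ASTAR(Nil, internalise(r1))
}

def bnullable(a: ARexp): Boolean = a match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALTS(_, as)    => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _)     => true
}

// collect the bitcode along the POSIX path of a bnullable expression
def bmkeps(a: ARexp): List[Int] = a match {
  case AONE(bs)            => bs
  case AALTS(bs, a1 :: as) => if (bnullable(a1)) bs ++ bmkeps(a1)
                              else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2)    => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _)        => bs ++ List(0)
}

// bit-coded derivative
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO            => AZERO
  case AONE(_)          => AZERO
  case ACHAR(bs, d)     => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as)    => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1)    => ASEQ(bs, fuse(List(1), bder(c, a1)), ASTAR(Nil, a1))
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bders(cs, bder(c, a))
}

// blexer without the final decode: Some(bitsequence) if r matches s
def blexer_bits(r: Rexp, s: String): Option[List[Int]] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) Some(bmkeps(a)) else None
}

// example: blexer_bits(STAR(ALT(CHAR('a'), CHAR('b'))), "ab")
//   returns Some(List(1, 0, 1, 1, 0)), the bitcode of the value from above
\end{verbatim}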
|

Now we introduce the simplifications, which are the reason for
introducing the bitcodes in the first place.

|
\subsection*{Simplification Rules}

This section introduces aggressive (in terms of size) simplification rules
on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as we have
generated test data showing
that a good tight bound can be achieved. However, we could only
partially cover the search space, as there are infinitely many regular
expressions and strings.
|
One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on.
%Is it $ALTS$ or $ALTS$?}\\

|
  \begin{center}
    \begin{tabular}{@{}lcl@{}}

      $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\

      $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\

   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

|
\noindent
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or
a sequence, it will try to simplify its children regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps the first occurring copy only and removes all later ones
when it detects duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:

|
 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}

|
\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. In terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.

|
Having defined the $\simp$ function,
we can use the previous notation of natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:%\comment{simp in  the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

|
\noindent
to obtain an optimised version of the algorithm:

 \begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

|
\noindent
This algorithm keeps the regular expression size small: for example,
with this simplification the derivative of our previous $(a + aa)^*$ example,
which previously grew to more than 8000 nodes, is reduced to just 6 nodes
and stays at that size, no matter how long the input string is.
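A Scala rendering of this simplification could look as follows. This is a
sketch only, building on the plain \textit{Rexp} datatype and on the
\textit{ARexp}, \textit{fuse}, \textit{internalise}, \textit{bder},
\textit{bnullable} and \textit{bmkeps} definitions sketched earlier; the
clauses differ slightly from the table above, in particular duplicates are
detected modulo the bit-annotations via an auxiliary \textit{erase} function,
and the bits attached to a $\ONE$ on the left of a sequence are preserved:

\begin{verbatim}
// erase the bit-annotations; used to spot copies differing only in their bits
def erase(a: ARexp): Rexp = a match {
  case AZERO               => ZERO
  case AONE(_)             => ONE
  case ACHAR(_, c)         => CHAR(c)
  case AALTS(_, Nil)       => ZERO
  case AALTS(_, List(a1))  => erase(a1)
  case AALTS(bs, a1 :: as) => ALT(erase(a1), erase(AALTS(bs, as)))
  case ASEQ(_, a1, a2)     => SEQ(erase(a1), erase(a2))
  case ASTAR(_, a1)        => STAR(erase(a1))
}

// open nested alternatives, drop AZEROs, fuse the outer bits inwards
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil                    => Nil
  case AZERO :: rest          => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ++ flatten(rest)
  case a :: rest              => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _)       => AZERO
    case (_, AZERO)       => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)
    case (a1s, a2s)       => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinctBy(erase) match {
    case Nil        => AZERO
    case a1 :: Nil  => fuse(bs, a1)
    case as1        => AALTS(bs, as1)
  }
  case _ => a
}

// interleave simplification with every derivative step
def bders_simp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bders_simp(cs, simp(bder(c, a)))
}

def blexer_simp_bits(r: Rexp, s: String): Option[List[Int]] = {
  val a = bders_simp(s.toList, internalise(r))
  if (bnullable(a)) Some(bmkeps(a)) else None
}

// with simplification interleaved, the derivatives of (a + aa)* stay small
// instead of growing to thousands of nodes
\end{verbatim}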
|


Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).