749 \noindent |
749 \noindent |
750 $\textit{inj}$ takes three arguments: a regular |
750 $\textit{inj}$ takes three arguments: a regular |
751 expression ${r_{i}}$, before the character is chopped off, |
751 expression ${r_{i}}$, before the character is chopped off, |
752 a character ${c_{i}}$, the character we want to inject back and |
752 a character ${c_{i}}$, the character we want to inject back and |
753 the third argument $v_{i+1}$ the value we want to inject into. |
753 the third argument $v_{i+1}$ the value we want to inject into. |
754 The result of this function is a new value $v_i$. |
754 The result of an application |
|
755 $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that |
|
756 \[ |
|
757 (s_i, r_i) \rightarrow v_i |
|
758 \] |
|
759 holds. |
755 The definition of $\textit{inj}$ is as follows: |
760 The definition of $\textit{inj}$ is as follows: |
756 \begin{center} |
761 \begin{center} |
757 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
762 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l} |
758 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
763 $\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\ |
759 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
764 $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\ |
760 $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ |
765 $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\ |
761 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
766 $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ & |
762 $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
767 $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\ |
763 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ |
768 $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & |
764 $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ |
769 $\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\ |
|
770 $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\ |
|
771 $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\ |
765 \end{tabular} |
772 \end{tabular} |
766 \end{center} |
773 \end{center} |
767 |
774 |
768 \noindent |
775 \noindent |
769 This definition is by recursion on the ``shape'' of regular |
776 The function does a recursion on |
770 expressions and values. |
777 the shape of regular |
771 The clauses do one thing--identifying the ``hole'' on a |
778 expression $r_i$ and value $v_{i+1}$. |
772 value to inject the character back into. |
779 Intuitively, each clause analyses |
773 For instance, in the last clause for injecting back to a value |
780 how $r_i$ could have transformed when being |
774 that would turn into a new star value that corresponds to a star, |
781 derived by $c$, identifying which subpart |
775 we know it must be a sequence value. And we know that the first |
782 of $v_{i+1}$ has the ``hole'' |
776 value of that sequence corresponds to the child regex of the star |
783 to inject the character back into. |
777 with the first character being chopped off--an iteration of the star |
784 Once the character is |
778 that had just been unfolded. This value is followed by the already |
785 injected back to that sub-value; |
779 matched star iterations we collected before. So we inject the character |
786 $\inj$ assembles all things together |
780 back to the first value and form a new value with this latest iteration |
787 to form a new value. |
781 being added to the previous list of iterations, all under the $\Stars$ |
788 |
782 top level. |
789 For instance, the last clause is an |
783 The POSIX value is maintained throughout the process: |
790 injection into a sequence value $v_{i+1}$ |
784 \begin{lemma} |
791 whose second child |
785 $(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$ |
792 value is a star, and the shape of the |
786 \end{lemma}\label{injPosix} |
793 regular expression $r_i$ before injection |
787 |
794 is a star. |
788 |
795 We therefore know |
789 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together, |
796 the derivative |
|
797 starts on a star and ends as a sequence: |
|
798 \[ |
|
799 (r^*) \backslash c \longrightarrow r\backslash c \cdot r^* |
|
800 \] |
|
801 during which an iteration of the star |
|
802 had just been unfolded, giving the below |
|
803 value inhabitation relation: |
|
804 \[ |
|
805 \vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*. |
|
806 \] |
|
807 The value list $vs$ corresponds to |
|
808 matched star iterations, |
|
809 and the ``hole'' lies in $v$ because |
|
810 \[ |
|
811 \vdash v: r\backslash c. |
|
812 \] |
|
813 Finally, |
|
814 $\inj \; r \;c \; v$ is prepended |
|
815 to the previous list of iterations, and then |
|
816 wrapped under the $\Stars$ |
|
817 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$. |
|
818 |
|
819 Recall that lemma |
|
820 \ref{mePosix} tells us |
|
821 $\mkeps$ always selects the POSIX matching among |
|
822 multiple values that flatten to the empty string. |
|
823 Now $\inj$ preserves the POSIXness, provided |
|
824 the value before injection is POSIX: |
|
825 \begin{lemma}\label{injPosix} |
|
826 If |
|
827 \[ |
|
828 (r \backslash c, s) \rightarrow v |
|
829 \] |
|
830 then |
|
831 \[ |
|
832 (r, c :: s) \rightarrow (\inj r \; c\; v). |
|
833 \] |
|
834 \end{lemma} |
|
835 \begin{proof} |
|
836 By induction on $r$. |
|
837 The involved cases are sequence and star. |
|
838 When $r = a \cdot b$, there could be |
|
839 three cases for the value $v$ satisfying $\vdash v:a\backslash c$. |
|
840 We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each |
|
841 case. |
|
842 \begin{itemize} |
|
843 \item |
|
844 $v = \Seq \; v_a \; v_b$.\\ |
|
845 The ``not nullable'' clause of the $\inj$ function is taken: |
|
846 \begin{center} |
|
847 \begin{tabular}{lcl} |
|
848 $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\ |
|
849 & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ |
|
850 \end{tabular} |
|
851 \end{center} |
|
852 We know that there exists a unique pair of |
|
853 $s_a$ and $s_b$ satisfaying |
|
854 $(a \backslash c, s_a) \rightarrow v_a$, |
|
855 $(b , s_b) \rightarrow v_b$, and |
|
856 $\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in |
|
857 L \; (a\backslash c) \land |
|
858 s_4 \in L \; b$. |
|
859 The last condition gives us |
|
860 $\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in |
|
861 L \; a \land |
|
862 s_4 \in L \; b$. |
|
863 By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds, |
|
864 and this gives us |
|
865 \[ |
|
866 (a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. |
|
867 \] |
|
868 \item |
|
869 $v = \Left \; (\Seq \; v_a \; v_b)$\\ |
|
870 The argument is almost identical to the above case, |
|
871 except that a different clause of $\inj$ is taken: |
|
872 \begin{center} |
|
873 \begin{tabular}{lcl} |
|
874 $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\ |
|
875 & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ |
|
876 \end{tabular} |
|
877 \end{center} |
|
878 With a similar reasoning, |
|
879 |
|
880 \[ |
|
881 (a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. |
|
882 \] |
|
883 again holds. |
|
884 \item |
|
885 $v = \Right \; v_b$\\ |
|
886 Again the injection result would be |
|
887 \begin{center} |
|
888 \begin{tabular}{lcl} |
|
889 $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\ |
|
890 & $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$ |
|
891 \end{tabular} |
|
892 \end{center} |
|
893 We know that $a$ must be nullable, |
|
894 allowing us to call $\mkeps$ and get |
|
895 \[ |
|
896 (a, []) \rightarrow \mkeps \; a. |
|
897 \] |
|
898 Also by inductive hypothesis |
|
899 \[ |
|
900 (b, c::s) \rightarrow \inj\; b \; c \; v_b |
|
901 \] |
|
902 holds. |
|
903 In addition, as |
|
904 $\Right \;v_b$ instead of $\Left \ldots$ is |
|
905 the POSIX value for $v$, it must be the case |
|
906 that $s \notin L \;( (a\backslash c)\cdot b)$. |
|
907 This tells us that |
|
908 \[ |
|
909 \nexists s_3 \; s_4. |
|
910 s_3 @s_4 = s \land s_3 \in L \; (a\backslash c) |
|
911 \land s_4 \in L \; b |
|
912 \] |
|
913 which translates to |
|
914 \[ |
|
915 \nexists s_3 \; s_4. \; s_3 \neq [] \land |
|
916 s_3 @s_4 = c::s \land s_3 \in L \; a |
|
917 \land s_4 \in L \; b. |
|
918 \] |
|
919 (Which basically says there cannot be a longer |
|
920 initial split for $s$ other than the empty string.) |
|
921 Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$ |
|
922 as the POSIX value for $a\cdot b$. |
|
923 \end{itemize} |
|
924 The star case can be proven similarly. |
|
925 \end{proof} |
|
926 \noindent |
|
927 Putting all the functions $\inj$, $\mkeps$, $\backslash$ together |
|
928 by following the procedure outlined in the diagram \ref{graph:inj}, |
790 and taking into consideration the possibility of a non-match, |
929 and taking into consideration the possibility of a non-match, |
791 we have a lexer with the following recursive definition: |
930 a lexer can be built with the following recursive definition: |
792 \begin{center} |
931 \begin{center} |
793 \begin{tabular}{lcl} |
932 \begin{tabular}{lcl} |
794 $\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ |
933 $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ |
795 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\ |
934 $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ |
796 & & $\quad \None \implies \None$\\ |
935 & & $\quad \phantom{\mid}\; \None \implies \None$\\ |
797 & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ |
936 & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$ |
798 \end{tabular} |
937 \end{tabular} |
799 \end{center} |
938 \end{center} |
800 \noindent |
939 \noindent |
801 The central property of the $\lexer$ is that it gives the correct result by |
940 The central property of the $\lexer$ is that it gives the correct result by |
802 $\POSIX$ standards: |
941 $\POSIX$ standards: |
803 \begin{theorem} |
942 \begin{theorem} |
804 The $\lexer$ based on derivatives and injections is both sound and complete: |
943 The $\lexer$ based on derivatives and injections is correct: |
805 \begin{tabular}{lcl} |
944 \begin{center} |
806 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\ |
945 \begin{tabular}{lcl} |
807 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$ |
946 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\ |
808 \end{tabular} |
947 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$ |
809 \end{theorem} |
948 \end{tabular} |
810 |
949 \end{center} |
811 |
950 \end{theorem} |
812 \begin{proof} |
951 \begin{proof} |
813 By induction on $s$. $r$ is allowed to be an arbitrary regular expression. |
952 By induction on $s$. $r$ is allowed to be an arbitrary regular expression. |
814 The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case |
953 The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case |
815 by lemma \ref{injPosix}. |
954 by lemma \ref{injPosix}. |
816 \end{proof} |
955 \end{proof} |
817 |
|
818 |
|
819 We now give a pictorial view of the algorithm ( |
|
820 For convenience, we employ the following notations: the regular |
|
821 expression we start with is $r_0$, and the given string $s$ is composed |
|
822 of characters $c_0 c_1 \ldots c_{n-1}$. The |
|
823 values built incrementally by \emph{injecting} back the characters into the |
|
824 earlier values are $v_n, \ldots, v_0$. Corresponding values and characters |
|
825 are always in the same subscript, i.e. $\vdash v_i : r_i$): |
|
826 |
|
827 \begin{ceqn} |
|
828 \begin{equation}\label{graph:2} |
|
829 \begin{tikzcd} |
|
830 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
|
831 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
|
832 \end{tikzcd} |
|
833 \end{equation} |
|
834 \end{ceqn} |
|
835 |
|
836 \noindent |
956 \noindent |
837 As we did earlier in this chapter on the matcher, one can |
957 As we did earlier in this chapter on the matcher, one can |
838 introduce simplification on the regex. |
958 introduce simplification on the regex. |
839 However, now we need to do a backward phase and make sure |
959 However, now one needs to do a backward phase and make sure |
840 the values align with the regular expressions. |
960 the values align with the regular expressions. |
841 Therefore one has to |
961 Therefore one has to |
842 be careful not to break the correctness, as the injection |
962 be careful not to break the correctness, as the injection |
843 function heavily relies on the structure of the regexes and values |
963 function heavily relies on the structure of the regexes and values |
844 being correct and matching each other. |
964 being correct and matching each other. |
845 It can be achieved by recording some extra rectification functions |
965 It can be achieved by recording some extra rectification functions |
846 during the derivatives step, and applying these rectifications in |
966 during the derivatives step, and applying these rectifications in |
847 each run during the injection phase. |
967 each run during the injection phase. |
848 |
968 With extra care |
849 \ChristianComment{Do I introduce the lexer with rectification here?} |
969 one can show that POSIXness will not be affected---although it is much harder |
850 And we can prove that the POSIX value of how |
|
851 regular expressions match strings will not be affected---although it is much harder |
|
852 to establish. |
970 to establish. |
853 Some initial results in this regard have been |
971 Some initial results in this regard have been |
854 obtained in \cite{AusafDyckhoffUrban2016}. |
972 obtained in \cite{AusafDyckhoffUrban2016}. |
855 |
973 |
856 However, even with these simplification rules, we could still end up in |
974 However, with all the simplification rules allowed |
857 trouble, when we encounter cases that require more involved and aggressive |
975 in an injection-based lexer, one could still end up in |
858 simplifications. |
976 trouble, when cases that require more involved and aggressive |
859 \section{A Case Requring More Aggressive Simplification} |
977 simplifications arise. |
|
978 \section{A Case Requring More Aggressive Simplifications} |
860 For example, when starting with the regular |
979 For example, when starting with the regular |
861 expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10) |
980 expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10) |
862 w.r.t.~the character $a$, one obtains a derivative regular expression |
981 w.r.t.~the character $a$, one obtains a derivative regular expression |
863 with more than 9000 nodes (when viewed as a tree) |
982 with more than 9000 nodes (when viewed as a tree) |
864 even with simplification. |
983 even with simplification. |
874 \end{axis} |
993 \end{axis} |
875 \end{tikzpicture} |
994 \end{tikzpicture} |
876 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$} |
995 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$} |
877 \end{figure}\label{fig:BetterWaterloo} |
996 \end{figure}\label{fig:BetterWaterloo} |
878 |
997 |
879 That is because our lexing algorithm currently keeps a lot of |
998 That is because Sulzmann and Lu's |
|
999 injection-based lexing algorithm keeps a lot of |
880 "useless" values that will not be used. |
1000 "useless" values that will not be used. |
881 These different ways of matching will grow exponentially with the string length. |
1001 These different ways of matching will grow exponentially with the string length. |
882 |
1002 Take |
883 For $r= (a^*\cdot a^*)^*$ and |
1003 \[ |
884 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$, |
1004 r= (a^*\cdot a^*)^* \quad and \quad |
885 if we do not allow any empty iterations in its lexical values, |
1005 s=\underbrace{aa\ldots a}_\text{n \textit{a}s} |
886 there will be $n - 1$ "splitting points" on $s$ we can independently choose to |
1006 \] |
887 split or not so that each sub-string |
1007 as an example. |
888 segmented by those chosen splitting points will form different iterations. |
1008 This is a highly ambiguous regular expression, with |
889 For example when $n=4$, we give out a few of the many possibilities of splitting: |
1009 many ways to split up the string into multiple segments for |
890 \begin{center} |
1010 different star iteratioins, |
891 \begin{tabular}{lcr} |
1011 and each segment will have multiple ways of splitting between |
892 $aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\ |
1012 the two $a^*$ sub-expressions. |
893 $a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\ |
1013 It is not surprising there are exponentially many |
894 $aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\ |
1014 distinct lexical values |
895 $a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\ |
1015 for such a pair of regular expression and string. |
896 $a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\ |
1016 A lexer without a good enough strategy to |
897 & $\textit{etc}.$ & |
1017 deduplicate will naturally |
898 \end{tabular} |
|
899 \end{center} |
|
900 \noindent |
|
901 And for each iteration, there are still multiple ways to split |
|
902 between the two $a^*$s. |
|
903 It is not surprising there are exponentially many lexical values |
|
904 that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and |
|
905 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$. |
|
906 A lexer to keep all the possible values will naturally |
|
907 have an exponential runtime on ambiguous regular expressions. |
1018 have an exponential runtime on ambiguous regular expressions. |
908 With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values |
1019 |
909 of a match. This means Sulzmann and Lu's injection-based algorithm |
|
910 exponential by nature. |
|
911 Somehow one has to make sure which |
1020 Somehow one has to make sure which |
912 lexical values are $\POSIX$ and must be kept in a lexing algorithm. |
1021 lexical values are $\POSIX$ and must be kept in a lexing algorithm. |
913 |
|
914 |
|
915 For example, the above $r= (a^*\cdot a^*)^*$ and |
1022 For example, the above $r= (a^*\cdot a^*)^*$ and |
916 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value |
1023 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value |
917 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. |
1024 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$. |
918 We want to keep this value only, and remove all the regular expression subparts |
1025 We want to keep this value only, and remove all the regular expression subparts |
919 not corresponding to this value during lexing. |
1026 not corresponding to this value during lexing. |