9 %why more aggressive simplifications are needed. |
9 %why more aggressive simplifications are needed. |
10 |
10 |
11 In this chapter, we define the basic notions |
11 In this chapter, we define the basic notions |
12 for regular languages and regular expressions. |
12 for regular languages and regular expressions. |
13 This is essentially a description in ``English'' |
13 This is essentially a description in ``English'' |
14 of our formalisation in Isabelle/HOL. |
14 the functions and datatypes of our formalisation in Isabelle/HOL. |
15 We also give the definition of what $\POSIX$ lexing means, |
15 We also define what $\POSIX$ lexing means, |
16 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} |
16 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} |
17 that produces the output conforming |
17 that produces the output conforming |
18 to the $\POSIX$ standard\footnote{In what follows |
18 to the $\POSIX$ standard\footnote{In what follows |
19 we choose to use the Isabelle-style notation |
19 we choose to use the Isabelle-style notation |
20 for function applications, where |
20 for function applications, where |
24 to make the text visually more concise.}. |
24 to make the text visually more concise.}. |
25 |
25 |
26 \section{Basic Concepts} |
26 \section{Basic Concepts} |
27 Formal language theory usually starts with an alphabet |
27 Formal language theory usually starts with an alphabet |
28 denoting a set of characters. |
28 denoting a set of characters. |
29 Here we just use the datatype of characters from Isabelle, |
29 Here we use the datatype of characters from Isabelle, |
30 which roughly corresponds to the ASCII characters. |
30 which roughly corresponds to the ASCII characters. |
31 In what follows, we shall leave the information about the alphabet |
31 In what follows, we shall leave the information about the alphabet |
32 implicit. |
32 implicit. |
33 Then using the usual bracket notation for lists, |
33 Then using the usual bracket notation for lists, |
34 we can define strings made up of characters as: |
34 we can define strings made up of characters as: |
38 \end{tabular} |
38 \end{tabular} |
39 \end{center} |
39 \end{center} |
40 where $c$ is a variable ranging over characters. |
40 where $c$ is a variable ranging over characters. |
41 The $::$ stands for list cons and $[]$ for the empty |
41 The $::$ stands for list cons and $[]$ for the empty |
42 list. |
42 list. |
43 A singleton list is sometimes written as $[c]$ for brevity. |
43 For brevity, a singleton list is sometimes written as $[c]$. |
44 Strings can be concatenated to form longer strings in the same |
44 Strings can be concatenated to form longer strings in the same |
45 way as we concatenate two lists, which we shall write as $s_1 @ s_2$. |
45 way we concatenate two lists, which we shall write as $s_1 @ s_2$. |
46 We omit the precise |
46 We omit the precise |
47 recursive definition here. |
47 recursive definition here. |
48 We overload this concatenation operator for two sets of strings: |
48 We overload this concatenation operator for two sets of strings: |
49 \begin{center} |
49 \begin{center} |
50 \begin{tabular}{lcl} |
50 \begin{tabular}{lcl} |
119 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\ |
119 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\ |
120 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\ |
120 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\ |
121 \end{tabular} |
121 \end{tabular} |
122 \end{center} |
122 \end{center} |
123 \noindent |
123 \noindent |
124 and in the end test whether the set |
124 and in the end, test whether the set |
125 has the empty string.\footnote{We use the infix notation $A\backslash c$ |
125 has the empty string.\footnote{We use the infix notation $A\backslash c$ |
126 instead of $\Der \; c \; A$ for brevity, as it is clear we are operating |
126 instead of $\Der \; c \; A$ for brevity, as it is clear we are operating |
127 on languages rather than regular expressions.} |
127 on languages rather than regular expressions.} |
128 |
128 |
129 In general, if we have a language $S$, |
129 In general, if we have a language $S$, |
154 \begin{proof} |
154 \begin{proof} |
155 There are two inclusions to prove: |
155 There are two inclusions to prove: |
156 \begin{itemize} |
156 \begin{itemize} |
157 \item{$\subseteq$}:\\ |
157 \item{$\subseteq$}:\\ |
158 The set |
158 The set |
159 \[ \{s \mid c :: s \in A*\} \] |
159 \[ S_1 = \{s \mid c :: s \in A*\} \] |
160 is enclosed in the set |
160 is enclosed in the set |
161 \[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \] |
161 \[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \] |
162 because whenever you have a string starting with a character |
162 This is because for any string $c::s$ satisfying $c::s \in A*$, |
163 in the language of a Kleene star $A*$, |
163 %whenever you have a string starting with a character |
164 then that character together with some sub-string |
164 %in the language of a Kleene star $A*$, |
165 immediately after it will form the first iteration, |
165 %then that |
166 and the rest of the string will |
166 the character $c$, together with a prefix of $s$ |
167 be still in $A*$. |
167 %immediately after $c$ |
|
168 forms the first iteration of $A*$, |
|
169 and the rest of the $s$ is also $A*$. |
|
170 This coincides with the definition of $S_2$. |
168 \item{$\supseteq$}:\\ |
171 \item{$\supseteq$}:\\ |
169 Note that |
172 Note that |
170 \[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \] |
173 \[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \] |
171 holds. |
174 holds. |
172 Also the following holds: |
175 Also the following holds: |
192 \mid r_1 + r_2 |
195 \mid r_1 + r_2 |
193 \mid r^* |
196 \mid r^* |
194 \] |
197 \] |
195 \noindent |
198 \noindent |
196 We call them basic because we will introduce |
199 We call them basic because we will introduce |
197 additional constructors in later chapters such as negation |
200 additional constructors in later chapters, such as negation |
198 and bounded repetitions. |
201 and bounded repetitions. |
199 We use $\ZERO$ for the regular expression that |
202 We use $\ZERO$ for the regular expression that |
200 matches no string, and $\ONE$ for the regular |
203 matches no string, and $\ONE$ for the regular |
201 expression that matches only the empty string.\footnote{ |
204 expression that matches only the empty string.\footnote{ |
202 Some authors |
205 Some authors |
221 \end{tabular} |
224 \end{tabular} |
222 \end{center} |
225 \end{center} |
223 \noindent |
226 \noindent |
224 %Now with language derivatives of a language and regular expressions and |
227 %Now with language derivatives of a language and regular expressions and |
225 %their language interpretations in place, we are ready to define derivatives on regular expressions. |
228 %their language interpretations in place, we are ready to define derivatives on regular expressions. |
226 With $L$ we are ready to introduce Brzozowski derivatives on regular expressions. |
229 With $L$, we are ready to introduce Brzozowski derivatives on regular expressions. |
227 We do so by first introducing what properties it should satisfy. |
230 We do so by first introducing what properties it should satisfy. |
228 |
231 |
229 \subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
232 \subsection{Brzozowski Derivatives and a Regular Expression Matcher} |
230 %Recall, the language derivative acts on a set of strings |
233 %Recall, the language derivative acts on a set of strings |
231 %and essentially chops off a particular character from |
234 %and essentially chops off a particular character from |
378 |
381 |
379 \begin{property} |
382 \begin{property} |
380 $s \in L \; r_{start} \iff [] \in L \; r_{end}$ |
383 $s \in L \; r_{start} \iff [] \in L \; r_{end}$ |
381 \end{property} |
384 \end{property} |
382 \noindent |
385 \noindent |
383 Next we give the recursive definition of derivative on |
386 Next, we give the recursive definition of derivative on |
384 regular expressions, so that it satisfies the properties above. |
387 regular expressions so that it satisfies the properties above. |
385 The derivative function, written $r\backslash c$, |
388 The derivative function, written $r\backslash c$, |
386 takes a regular expression $r$ and character $c$, and |
389 takes a regular expression $r$ and character $c$, and |
387 returns a new regular expression representing |
390 returns a new regular expression representing |
388 the original regular expression's language $L \; r$ |
391 the original regular expression's language $L \; r$ |
389 being taken the language derivative with respect to $c$. |
392 being taken the language derivative with respect to $c$. |
439 we can further unfold it as many times as needed: |
442 we can further unfold it as many times as needed: |
440 \[ |
443 \[ |
441 (r^*) \backslash c \dn (r \backslash c)\cdot r^*. |
444 (r^*) \backslash c \dn (r \backslash c)\cdot r^*. |
442 \] |
445 \] |
443 Again, |
446 Again, |
444 the structure is the same as the language derivative of Kleene star: |
447 the structure is the same as the language derivative of the Kleene star: |
445 \[ |
448 \[ |
446 \textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*) |
449 \textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*) |
447 \] |
450 \] |
448 In the above definition of $(r_1\cdot r_2) \backslash c$, |
451 In the above definition of $(r_1\cdot r_2) \backslash c$, |
449 the $\textit{if}$ clause's |
452 the $\textit{if}$ clause's |
520 \begin{definition} |
523 \begin{definition} |
521 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$ |
524 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$ |
522 \end{definition} |
525 \end{definition} |
523 |
526 |
524 \noindent |
527 \noindent |
525 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, |
528 Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, |
526 this algorithm presented graphically is as follows: |
529 this algorithm, presented graphically, is as follows: |
527 |
530 |
528 \begin{equation}\label{matcher} |
531 \begin{equation}\label{matcher} |
529 \begin{tikzcd} |
532 \begin{tikzcd} |
530 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & |
533 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & |
531 r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & |
534 r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & |
564 \noindent |
567 \noindent |
565 If we implement the above algorithm naively, however, |
568 If we implement the above algorithm naively, however, |
566 the algorithm can be excruciatingly slow, as shown in |
569 the algorithm can be excruciatingly slow, as shown in |
567 \ref{NaiveMatcher}. |
570 \ref{NaiveMatcher}. |
568 Note that both axes are in logarithmic scale. |
571 Note that both axes are in logarithmic scale. |
569 Around two dozens characters |
572 Around two dozen characters |
570 would already ``explode'' the matcher with the regular expression |
573 would already ``explode'' the matcher with the regular expression |
571 $(a^*)^*b$. |
574 $(a^*)^*b$. |
572 To improve this situation, we need to introduce simplification |
575 To improve this situation, we need to introduce simplification |
573 rules for the intermediate results, |
576 rules for the intermediate results, |
574 such as $r + r \rightarrow r$, |
577 such as $r + r \rightarrow r$, |
626 The running time of $\textit{ders}\_\textit{simp}$ |
629 The running time of $\textit{ders}\_\textit{simp}$ |
627 on the same example of Figure \ref{NaiveMatcher} |
630 on the same example of Figure \ref{NaiveMatcher} |
628 is now ``tame'' in terms of the length of inputs, |
631 is now ``tame'' in terms of the length of inputs, |
629 as shown in Figure \ref{BetterMatcher}. |
632 as shown in Figure \ref{BetterMatcher}. |
630 |
633 |
631 So far the story is use Brzozowski derivatives and |
634 So far, the story is use Brzozowski derivatives and |
632 simplify as much as possible and at the end test |
635 simplify as much as possible, and at the end test |
633 whether the empty string is recognised |
636 whether the empty string is recognised |
634 by the final derivative. |
637 by the final derivative. |
635 But what if we want to |
638 But what if we want to |
636 do lexing instead of just getting a true/false answer? |
639 do lexing instead of just getting a true/false answer? |
637 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and |
640 Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and |
725 multiple values for it. For example, both |
728 multiple values for it. For example, both |
726 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and |
729 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and |
727 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold |
730 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold |
728 and the values both flatten to $abc$. |
731 and the values both flatten to $abc$. |
729 Lexers therefore have to disambiguate and choose only |
732 Lexers therefore have to disambiguate and choose only |
730 one of the values be generated. $\POSIX$ is one of the |
733 one of the values to be generated. $\POSIX$ is one of the |
731 disambiguation strategies that is widely adopted. |
734 disambiguation strategies that is widely adopted. |
732 |
735 |
733 Ausaf et al.\parencite{AusafDyckhoffUrban2016} |
736 Ausaf et al.\parencite{AusafDyckhoffUrban2016} |
734 formalised the property |
737 formalised the property |
735 as a ternary relation. |
738 as a ternary relation. |
794 % |
797 % |
795 % |
798 % |
796 %\end{tikzpicture} |
799 %\end{tikzpicture} |
797 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch} |
800 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch} |
798 %\end{figure} |
801 %\end{figure} |
799 The above $\POSIX$ rules follows the intuition described below: |
802 The above $\POSIX$ rules follow the intuition described below: |
800 \begin{itemize} |
803 \begin{itemize} |
801 \item (Left Priority)\\ |
804 \item (Left Priority)\\ |
802 Match the leftmost regular expression when multiple options of matching |
805 Match the leftmost regular expression when multiple options for matching |
803 are available. See P+L and P+R where in P+R $s$ cannot |
806 are available. See P+L and P+R where in P+R $s$ cannot |
804 be in the language of $L \; r_1$. |
807 be in the language of $L \; r_1$. |
805 \item (Maximum munch)\\ |
808 \item (Maximum munch)\\ |
806 Always match a subpart as much as possible before proceeding |
809 Always match a subpart as much as possible before proceeding |
807 to the next part of the string. |
810 to the next part of the string. |
808 For example, when the string $s$ matches |
811 For example, when the string $s$ matches |
809 $r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split: |
812 $r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split: |
810 Then the split that matches a longer string for the first part |
813 Then the split that matches a longer string for the first part |
811 $r_{part1}$ is preferred by this maximum munch rule. |
814 $r_{part1}$ is preferred by this maximum munch rule. |
812 This is caused by the side-condition |
815 The side-condition |
813 \begin{center} |
816 \begin{center} |
814 $\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
817 $\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
815 s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$ |
818 s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$ |
816 \end{center} |
819 \end{center} |
817 in PS. |
820 in PS causes this. |
818 %(See |
821 %(See |
819 %\ref{munch} for an illustration). |
822 %\ref{munch} for an illustration). |
820 \end{itemize} |
823 \end{itemize} |
821 \noindent |
824 \noindent |
822 These disambiguation strategies can be |
825 These disambiguation strategies can be |
881 \[ |
884 \[ |
882 (s_1'', r_1) \rightarrow v_1'' |
885 (s_1'', r_1) \rightarrow v_1'' |
883 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s |
886 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s |
884 \] |
887 \] |
885 cannot possibly form a $\POSIX$ value either, because |
888 cannot possibly form a $\POSIX$ value either, because |
886 by definition there is a candidate |
889 by definition, there is a candidate |
887 with longer initial string |
890 with a longer initial string |
888 $s_1$. Therefore, we know that the POSIX |
891 $s_1$. Therefore, we know that the POSIX |
889 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching |
892 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching |
890 $s$ must have the |
893 $s$ must have the |
891 property that |
894 property that |
892 \[ |
895 \[ |
898 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, |
901 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, |
899 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$ |
902 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$ |
900 is the same as $\Seq(v_1, v_2)$. |
903 is the same as $\Seq(v_1, v_2)$. |
901 \end{proof} |
904 \end{proof} |
902 \noindent |
905 \noindent |
903 We have now defined what a POSIX value is and shown that it is unique, |
906 We have now defined what a POSIX value is and shown that it is unique. |
904 the problem is to generate |
907 The problem is to generate |
905 such a value in a lexing algorithm using derivatives. |
908 such a value in a lexing algorithm using derivatives. |
906 |
909 |
907 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
910 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
908 |
911 |
909 Sulzmann and Lu extended Brzozowski's |
912 Sulzmann and Lu extended Brzozowski's |
961 By induction on $r$. |
964 By induction on $r$. |
962 \end{proof} |
965 \end{proof} |
963 \noindent |
966 \noindent |
964 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one |
967 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one |
965 in reverse order as they were chopped off in the derivative phase. |
968 in reverse order as they were chopped off in the derivative phase. |
966 The fucntion for this is called $\inj$. This function |
969 The function for this is called $\inj$. This function |
967 operates on values, unlike $\backslash$ which operates on regular expressions. |
970 operates on values, unlike $\backslash$ which operates on regular expressions. |
968 In the diagram below, $v_i$ stands for the (POSIX) value |
971 In the diagram below, $v_i$ stands for the (POSIX) value |
969 for how the regular expression |
972 for how the regular expression |
970 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters |
973 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters |
971 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
974 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
983 \caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016}, |
986 \caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016}, |
984 matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$. |
987 matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$. |
985 The first phase involves taking successive derivatives w.r.t the characters $c_0$, |
988 The first phase involves taking successive derivatives w.r.t the characters $c_0$, |
986 $c_1$, and so on. These are the same operations as they have appeared in the matcher |
989 $c_1$, and so on. These are the same operations as they have appeared in the matcher |
987 \ref{matcher}. When the final derivative regular expression is nullable (contains the empty string), |
990 \ref{matcher}. When the final derivative regular expression is nullable (contains the empty string), |
988 then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches |
991 then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches |
989 the empty string , by always selecting the leftmost |
992 the empty string, by always selecting the leftmost |
990 nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they |
993 nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they |
991 appeared in the string, always preserving POSIXness.}\label{graph:inj} |
994 appeared in the string, always preserving POSIXness.}\label{graph:inj} |
992 \end{figure} |
995 \end{figure} |
993 \noindent |
996 \noindent |
994 The function $\textit{inj}$ as defined by Sulzmann and Lu |
997 The function $\textit{inj}$ as defined by Sulzmann and Lu |
995 takes three arguments: a regular |
998 takes three arguments: a regular |
1018 \end{center} |
1021 \end{center} |
1019 |
1022 |
1020 \noindent |
1023 \noindent |
1021 The function recurses on |
1024 The function recurses on |
1022 the shape of regular |
1025 the shape of regular |
1023 expressionsw and values. |
1026 expressions and values. |
1024 Intuitively, each clause analyses |
1027 Intuitively, each clause analyses |
1025 how $r_i$ could have transformed when being |
1028 how $r_i$ could have transformed when being |
1026 derived by $c$, identifying which subpart |
1029 derived by $c$, identifying which subpart |
1027 of $v_{i+1}$ has the ``hole'' |
1030 of $v_{i+1}$ has the ``hole'' |
1028 to inject the character back into. |
1031 to inject the character back into. |
1029 Once the character is |
1032 Once the character is |
1030 injected back to that sub-value; |
1033 injected back to that sub-value; |
1031 $\inj$ assembles all parts together |
1034 $\inj$ assembles all parts |
1032 to form a new value. |
1035 to form a new value. |
1033 |
1036 |
1034 For instance, the last clause is an |
1037 For instance, the last clause is an |
1035 injection into a sequence value $v_{i+1}$ |
1038 injection into a sequence value $v_{i+1}$ |
1036 whose second child |
1039 whose second child |
1037 value is a star, and the shape of the |
1040 value is a star and the shape of the |
1038 regular expression $r_i$ before injection |
1041 regular expression $r_i$ before injection |
1039 is a star. |
1042 is a star. |
1040 We therefore know |
1043 We therefore know |
1041 the derivative |
1044 the derivative |
1042 starts on a star and ends as a sequence: |
1045 starts on a star and ends as a sequence: |
1055 \[ |
1058 \[ |
1056 \vdash v: r\backslash c. |
1059 \vdash v: r\backslash c. |
1057 \] |
1060 \] |
1058 Finally, |
1061 Finally, |
1059 $\inj \; r \;c \; v$ is prepended |
1062 $\inj \; r \;c \; v$ is prepended |
1060 to the previous list of iterations, and then |
1063 to the previous list of iterations and then |
1061 wrapped under the $\Stars$ |
1064 wrapped under the $\Stars$ |
1062 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$. |
1065 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$. |
1063 |
1066 |
1064 Recall that lemma |
1067 Recall that lemma |
1065 \ref{mePosix} tells us that |
1068 \ref{mePosix} tells us that |
1086 $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\ |
1089 $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\ |
1087 & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ |
1090 & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ |
1088 \end{tabular} |
1091 \end{tabular} |
1089 \end{center} |
1092 \end{center} |
1090 We know that there exists a unique pair of |
1093 We know that there exists a unique pair of |
1091 $s_a$ and $s_b$ satisfaying |
1094 $s_a$ and $s_b$ satisfying |
1092 $(a \backslash c, s_a) \rightarrow v_a$, |
1095 $(a \backslash c, s_a) \rightarrow v_a$, |
1093 $(b , s_b) \rightarrow v_b$, and |
1096 $(b , s_b) \rightarrow v_b$, and |
1094 $\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in |
1097 $\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in |
1095 L \; (a\backslash c) \land |
1098 L \; (a\backslash c) \land |
1096 s_4 \in L \; b$. |
1099 s_4 \in L \; b$. |
1111 \begin{tabular}{lcl} |
1114 \begin{tabular}{lcl} |
1112 $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\ |
1115 $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\ |
1113 & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ |
1116 & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ |
1114 \end{tabular} |
1117 \end{tabular} |
1115 \end{center} |
1118 \end{center} |
1116 With a similar reasoning, |
1119 With similar reasoning, |
1117 |
1120 |
1118 \[ |
1121 \[ |
1119 (a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. |
1122 (a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. |
1120 \] |
1123 \] |
1121 again holds. |
1124 again holds. |
1131 We know that $a$ must be nullable, |
1134 We know that $a$ must be nullable, |
1132 allowing us to call $\mkeps$ and get |
1135 allowing us to call $\mkeps$ and get |
1133 \[ |
1136 \[ |
1134 (a, []) \rightarrow \mkeps \; a. |
1137 (a, []) \rightarrow \mkeps \; a. |
1135 \] |
1138 \] |
1136 Also by inductive hypothesis |
1139 Also, by inductive hypothesis |
1137 \[ |
1140 \[ |
1138 (b, c::s) \rightarrow \inj\; b \; c \; v_b |
1141 (b, c::s) \rightarrow \inj\; b \; c \; v_b |
1139 \] |
1142 \] |
1140 holds. |
1143 holds. |
1141 In addition, as |
1144 In addition, as |
1152 \[ |
1155 \[ |
1153 \nexists s_3 \; s_4. \; s_3 \neq [] \land |
1156 \nexists s_3 \; s_4. \; s_3 \neq [] \land |
1154 s_3 @s_4 = c::s \land s_3 \in L \; a |
1157 s_3 @s_4 = c::s \land s_3 \in L \; a |
1155 \land s_4 \in L \; b. |
1158 \land s_4 \in L \; b. |
1156 \] |
1159 \] |
1157 (Which basically says there cannot be a longer |
1160 (Which says there cannot be a longer |
1158 initial split for $s$ other than the empty string.) |
1161 initial split for $s$ other than the empty string.) |
1159 Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$ |
1162 Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$ |
1160 as the POSIX value for $a\cdot b$. |
1163 as the POSIX value for $a\cdot b$. |
1161 \end{itemize} |
1164 \end{itemize} |
1162 The star case can be proven similarly. |
1165 The star case can be proven similarly. |
1185 \end{tabular} |
1188 \end{tabular} |
1186 \end{center} |
1189 \end{center} |
1187 \end{theorem} |
1190 \end{theorem} |
1188 \begin{proof} |
1191 \begin{proof} |
1189 By induction on $s$. $r$ generalising over an arbitrary regular expression. |
1192 By induction on $s$. $r$ generalising over an arbitrary regular expression. |
1190 The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case |
1193 The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case |
1191 by lemma \ref{injPosix}. |
1194 by lemma \ref{injPosix}. |
1192 \end{proof} |
1195 \end{proof} |
1193 \noindent |
1196 \noindent |
1194 As we did earlier in this chapter with the matcher, one can |
1197 As we did earlier in this chapter with the matcher, one can |
1195 introduce simplification on the regular expression in each derivative step. |
1198 introduce simplification on the regular expression in each derivative step. |
1196 However, now one needs to do a backward phase and make sure |
1199 However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase) |
1197 the values align with the regular expression. |
1200 and ensure that |
|
1201 the values align with the regular expression at each step. |
1198 Therefore one has to |
1202 Therefore one has to |
1199 be careful not to break the correctness, as the injection |
1203 be careful not to break the correctness, as the injection |
1200 function heavily relies on the structure of |
1204 function heavily relies on the structure of |
1201 the regular expressions and values being aligned. |
1205 the regular expressions and values being aligned. |
1202 This can be achieved by recording some extra rectification functions |
1206 This can be achieved by recording some extra rectification functions |
1203 during the derivatives step, and applying these rectifications in |
1207 during the derivatives step and applying these rectifications in |
1204 each run during the injection phase. |
1208 each run during the injection phase. |
1205 With extra care |
1209 With extra care |
1206 one can show that POSIXness will not be affected |
1210 one can show that POSIXness will not be affected |
1207 by the simplifications listed here \cite{AusafDyckhoffUrban2016}. |
1211 by the simplifications listed here \cite{AusafDyckhoffUrban2016}. |
1208 \begin{center} |
1212 \begin{center} |
1221 $\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$ |
1225 $\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$ |
1222 |
1226 |
1223 \end{tabular} |
1227 \end{tabular} |
1224 \end{center} |
1228 \end{center} |
1225 |
1229 |
1226 However, with the simple-minded simplification rules allowed |
1230 However, one can still end up |
1227 in an injection-based lexer, one can still end up |
1231 with exploding derivatives, |
1228 with exploding derivatives. |
1232 even with the simple-minded simplification rules allowed |
1229 \section{A Case Requring More Aggressive Simplifications} |
1233 in an injection-based lexer. |
|
1234 \section{A Case Requiring More Aggressive Simplifications} |
1230 For example, when starting with the regular |
1235 For example, when starting with the regular |
1231 expression $(a^* \cdot a^*)^*$ and building just over |
1236 expression $(a^* \cdot a^*)^*$ and building just over |
1232 a dozen successive derivatives |
1237 a dozen successive derivatives |
1233 w.r.t.~the character $a$, one obtains a derivative regular expression |
1238 w.r.t.~the character $a$, one obtains a derivative regular expression |
1234 with millions of nodes (when viewed as a tree) |
1239 with millions of nodes (when viewed as a tree) |
1235 even with the simple-minded simplification. |
1240 even with the mentioned simplifications. |
1236 \begin{figure}[H] |
1241 \begin{figure}[H] |
1237 \begin{center} |
1242 \begin{center} |
1238 \begin{tikzpicture} |
1243 \begin{tikzpicture} |
1239 \begin{axis}[ |
1244 \begin{axis}[ |
1240 xlabel={$n$}, |
1245 xlabel={$n$}, |
1260 s=\underbrace{aa\ldots a}_\text{n \textit{a}s} |
1265 s=\underbrace{aa\ldots a}_\text{n \textit{a}s} |
1261 \] |
1266 \] |
1262 as an example. |
1267 as an example. |
1263 This is a highly ambiguous regular expression, with |
1268 This is a highly ambiguous regular expression, with |
1264 many ways to split up the string into multiple segments for |
1269 many ways to split up the string into multiple segments for |
1265 different star iteratioins, |
1270 different star iterations, |
1266 and for each segment |
1271 and for each segment |
1267 multiple ways of splitting between |
1272 multiple ways of splitting between |
1268 the two $a^*$ sub-expressions. |
1273 the two $a^*$ sub-expressions. |
1269 When $n$ is equal to $1$, there are two lexical values for |
1274 When $n$ is equal to $1$, there are two lexical values for |
1270 the match: |
1275 the match: |
1329 distinct lexical values that cannot be eliminated by |
1334 distinct lexical values that cannot be eliminated by |
1330 the simple-minded simplification of $\derssimp$. |
1335 the simple-minded simplification of $\derssimp$. |
1331 A lexer without a good enough strategy to |
1336 A lexer without a good enough strategy to |
1332 deduplicate will naturally |
1337 deduplicate will naturally |
1333 have an exponential runtime on highly |
1338 have an exponential runtime on highly |
1334 ambiguous regular expressions, because there |
1339 ambiguous regular expressions because there |
1335 are exponentially many matches. |
1340 are exponentially many matches. |
1336 For this particular example, it seems |
1341 For this particular example, it seems |
1337 that the number of distinct matches growth |
1342 that the number of distinct matches growth |
1338 speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length). |
1343 speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length). |
1339 |
1344 |
1340 On the other hand, the |
1345 On the other hand, the |
1341 $\POSIX$ value for $r= (a^*\cdot a^*)^*$ and |
1346 $\POSIX$ value for $r= (a^*\cdot a^*)^*$ and |
1342 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is |
1347 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is |
1343 \[ |
1348 \[ |
1344 \Stars\, |
1349 \Stars\, |
1345 [\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]] |
1350 [\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]. |
1346 \] |
1351 \] |
1347 and at any moment the subterms in a regular expression |
1352 At any moment, the subterms in a regular expression |
1348 that will result in a POSIX value is only |
1353 that will potentially result in a POSIX value is only |
1349 a minority among the many other terms, |
1354 a minority among the many other terms, |
1350 and one can remove ones that are absolutely not possible to |
1355 and one can remove ones that are not possible to |
1351 be POSIX. |
1356 be POSIX. |
1352 In the above example, |
1357 In the above example, |
1353 \begin{equation}\label{eqn:growth2} |
1358 \begin{equation}\label{eqn:growth2} |
1354 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
1359 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
1355 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}. |
1360 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}. |
1368 \begin{tabular}{lr} |
1373 \begin{tabular}{lr} |
1369 $\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\ |
1374 $\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\ |
1370 $\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$ |
1375 $\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$ |
1371 \end{tabular} |
1376 \end{tabular} |
1372 \end{center} |
1377 \end{center} |
1373 Other terms with an underlying value such as |
1378 Other terms with an underlying value, such as |
1374 \[ |
1379 \[ |
1375 \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] |
1380 \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])] |
1376 \] |
1381 \] |
1377 is simply too hopeless to contribute a POSIX lexical value, |
1382 is too hopeless to contribute a POSIX lexical value, |
1378 and is therefore thrown away. |
1383 and is therefore thrown away. |
1379 |
1384 |
1380 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
1385 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
1381 have come up with some simplification steps, however those steps |
1386 have come up with some simplification steps, however those steps |
1382 are not yet sufficiently strong so that they achieve the above effects. |
1387 are not yet sufficiently strong, so they achieve the above effects. |
1383 And even with these relatively mild simplifications the proof |
1388 And even with these relatively mild simplifications, the proof |
1384 is already quite a bit complicated than the theorem \ref{lexerCorrectness}. |
1389 is already quite a bit more complicated than the theorem \ref{lexerCorrectness}. |
1385 One would prove something like: |
1390 One would prove something like this: |
1386 \[ |
1391 \[ |
1387 \textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\; |
1392 \textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\; |
1388 \textit{then}\;\; (r, c::s) \rightarrow |
1393 \textit{then}\;\; (r, c::s) \rightarrow |
1389 \inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v) |
1394 \inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v). |
1390 \] |
1395 \] |
1391 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$ |
1396 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$ |
1392 not only has to return a simplified regular expression, |
1397 not only has to return a simplified regular expression, |
1393 but also what specific simplifications |
1398 but also what specific simplifications |
1394 has been done as a function on values |
1399 has been done as a function on values |