514 |
514 |
515 \noindent |
515 \noindent |
516 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, |
516 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, |
517 this algorithm presented graphically is as follows: |
517 this algorithm presented graphically is as follows: |
518 |
518 |
519 \begin{equation}\label{graph:successive_ders} |
519 \begin{equation}\label{matcher} |
520 \begin{tikzcd} |
520 \begin{tikzcd} |
521 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & |
521 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & |
522 r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & |
522 r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & |
523 \;\textrm{true}/\textrm{false} |
523 \;\textrm{true}/\textrm{false} |
524 \end{tikzcd} |
524 \end{tikzcd} |
533 \begin{proof} |
533 \begin{proof} |
534 By induction on $s$ using property of derivatives: |
534 By induction on $s$ using property of derivatives: |
535 lemma \ref{derDer}. |
535 lemma \ref{derDer}. |
536 \end{proof} |
536 \end{proof} |
537 \noindent |
537 \noindent |
538 \begin{center} |
|
539 \begin{figure} |
538 \begin{figure} |
|
539 \begin{center} |
540 \begin{tikzpicture} |
540 \begin{tikzpicture} |
541 \begin{axis}[ |
541 \begin{axis}[ |
542 xlabel={$n$}, |
542 xlabel={$n$}, |
543 ylabel={time in secs}, |
543 ylabel={time in secs}, |
544 ymode = log, |
544 %ymode = log, |
545 legend entries={Naive Matcher}, |
545 legend entries={Naive Matcher}, |
546 legend pos=north west, |
546 legend pos=north west, |
547 legend cell align=left] |
547 legend cell align=left] |
548 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data}; |
548 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data}; |
549 \end{axis} |
549 \end{axis} |
550 \end{tikzpicture} |
550 \end{tikzpicture} |
551 \caption{Matching the regular expression $(a^*)^*b$ against strings of the form |
551 \caption{Matching the regular expression $(a^*)^*b$ against strings of the form |
552 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
552 $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s} |
553 $ using Brzozowski's original algorithm}\label{NaiveMatcher} |
553 $ using Brzozowski's original algorithm}\label{NaiveMatcher} |
|
554 \end{center} |
554 \end{figure} |
555 \end{figure} |
555 \end{center} |
|
556 \noindent |
556 \noindent |
557 If we implement the above algorithm naively, however, |
557 If we implement the above algorithm naively, however, |
558 the algorithm can be excruciatingly slow, as shown in |
558 the algorithm can be excruciatingly slow, as shown in |
559 \ref{NaiveMatcher}. |
559 \ref{NaiveMatcher}. |
560 Note that both axes are in logarithmic scale. |
560 Note that both axes are in logarithmic scale. |
598 \begin{figure} |
598 \begin{figure} |
599 \begin{tikzpicture} |
599 \begin{tikzpicture} |
600 \begin{axis}[ |
600 \begin{axis}[ |
601 xlabel={$n$}, |
601 xlabel={$n$}, |
602 ylabel={time in secs}, |
602 ylabel={time in secs}, |
603 ymode = log, |
603 %ymode = log, |
604 xmode = log, |
604 %xmode = log, |
605 grid = both, |
605 grid = both, |
606 legend entries={Matcher With Simp}, |
606 legend entries={Matcher With Simp}, |
607 legend pos=north west, |
607 legend pos=north west, |
608 legend cell align=left] |
608 legend cell align=left] |
609 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data}; |
609 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data}; |
700 \noindent |
700 \noindent |
701 The condition $|v| \neq []$ in the premise of star's rule |
701 The condition $|v| \neq []$ in the premise of star's rule |
702 is to make sure that for a given pair of regular |
702 is to make sure that for a given pair of regular |
703 expression $r$ and string $s$, the number of values |
703 expression $r$ and string $s$, the number of values |
704 satisfying $|v| = s$ and $\vdash v:r$ is finite. |
704 satisfying $|v| = s$ and $\vdash v:r$ is finite. |
|
705 This additional condition was |
|
706 imposed by Ausaf and Urban to make their proofs easier. |
705 Given the same string and regular expression, there can be |
707 Given the same string and regular expression, there can be |
706 multiple values for it. For example, both |
708 multiple values for it. For example, both |
707 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and |
709 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and |
708 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold |
710 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold |
709 and the values both flatten to $abc$. |
711 and the values both flatten to $abc$. |
715 formalised the property |
717 formalised the property |
716 as a ternary relation. |
718 as a ternary relation. |
717 The $\POSIX$ value $v$ for a regular expression |
719 The $\POSIX$ value $v$ for a regular expression |
718 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
720 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified |
719 in the following set of rules\footnote{The names of the rules are used |
721 in the following set of rules\footnote{The names of the rules are used |
720 as they were originally given in \cite{AusafDyckhoffUrban2016}}: |
722 as they were originally given in \cite{AusafDyckhoffUrban2016} }: |
721 \noindent |
723 \begin{figure}[H] |
722 \begin{figure} |
|
723 \begin{mathpar} |
724 \begin{mathpar} |
724 \inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty} |
725 \inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty} |
725 |
726 |
726 \inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c} |
727 \inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c} |
727 |
728 |
739 \inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\ |
740 \inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\ |
740 |v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land |
741 |v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land |
741 s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \; |
742 s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \; |
742 (v::vs)} |
743 (v::vs)} |
743 \end{mathpar} |
744 \end{mathpar} |
744 \caption{POSIX Lexing Rules} |
745 \caption{The inductive POSIX Lexing Rules defined by Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}. |
|
746 The ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the |
|
747 value $v$ given a string $s$ and |
|
748 regular expression $r$. |
|
749 For example, this specification says that all matches for an alternative |
|
750 must always prefer a left value to a right one. |
|
751 } |
745 \end{figure} |
752 \end{figure} |
746 \noindent |
753 \noindent |
747 |
754 |
748 \begin{figure} |
755 \begin{figure} |
749 \begin{tikzpicture}[] |
756 \begin{tikzpicture}[] |
830 We shall give the details for proving the sequence case here. |
837 We shall give the details for proving the sequence case here. |
831 |
838 |
832 When we have |
839 When we have |
833 \[ |
840 \[ |
834 (s_1, r_1) \rightarrow v_1 \;\, and \;\, |
841 (s_1, r_1) \rightarrow v_1 \;\, and \;\, |
835 (s_2, r_2) \rightarrow v_2 \;\, and \;\,\\ |
842 (s_2, r_2) \rightarrow v_2 \;\, and \;\, |
836 \nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
843 \nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
837 s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2 |
844 s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2 |
838 \] |
845 \] |
839 we know that the last condition |
846 we know that the last condition |
840 excludes the possibility of a |
847 excludes the possibility of a |
940 for how the regular expression |
947 for how the regular expression |
941 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters |
948 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters |
942 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
949 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
943 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
950 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
944 matches $s$. |
951 matches $s$. |
|
952 \begin{figure}[H] |
|
953 \begin{center} |
945 \begin{ceqn} |
954 \begin{ceqn} |
946 \begin{equation}\label{graph:inj} |
|
947 \begin{tikzcd} |
955 \begin{tikzcd} |
948 r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
956 r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
949 v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] |
957 v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] |
950 \end{tikzcd} |
958 \end{tikzcd} |
951 \end{equation} |
|
952 \end{ceqn} |
959 \end{ceqn} |
|
960 \end{center} |
|
961 \caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016}, |
|
962 matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$. |
|
963 The first phase involves taking successive derivatives w.r.t the characters $c_0$, |
|
964 $c_1$, and so on. These are the same operations as they have appeared in the matcher |
|
965 \ref{matcher}. When the final derivative regular expression is nullable (contains the empty string), |
|
966 then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches |
|
967 the empty string , by always selecting the leftmost |
|
968 nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they |
|
969 appeared in the string, always preserving POSIXness.}\label{graph:inj} |
|
970 \end{figure} |
953 \noindent |
971 \noindent |
954 $\textit{inj}$ takes three arguments: a regular |
972 $\textit{inj}$ takes three arguments: a regular |
955 expression ${r_{i}}$, before the character is chopped off, |
973 expression ${r_{i}}$, before the character is chopped off, |
956 a character ${c_{i}}$, the character we want to inject back and |
974 a character ${c_{i}}$, the character we want to inject back and |
957 the third argument $v_{i+1}$ the value we want to inject into. |
975 the third argument $v_{i+1}$ the value we want to inject into. |
1186 w.r.t.~the character $a$, one obtains a derivative regular expression |
1204 w.r.t.~the character $a$, one obtains a derivative regular expression |
1187 with millions of nodes (when viewed as a tree) |
1205 with millions of nodes (when viewed as a tree) |
1188 even with simplification, which is not much better compared |
1206 even with simplification, which is not much better compared |
1189 with the naive version without any simplifications: |
1207 with the naive version without any simplifications: |
1190 \begin{figure}[H] |
1208 \begin{figure}[H] |
1191 \centering |
1209 \begin{center} |
1192 \begin{tikzpicture} |
1210 \begin{tikzpicture} |
1193 \begin{axis}[ |
1211 \begin{axis}[ |
1194 xlabel={$n$}, |
1212 xlabel={$n$}, |
1195 ylabel={size}, |
1213 ylabel={size}, |
1196 legend entries={Simple-Minded Simp, Naive Matcher}, |
1214 legend entries={Simple-Minded Simp, Naive Matcher}, |
1198 legend cell align=left] |
1216 legend cell align=left] |
1199 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; |
1217 \addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data}; |
1200 \addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data}; |
1218 \addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data}; |
1201 \end{axis} |
1219 \end{axis} |
1202 \end{tikzpicture} |
1220 \end{tikzpicture} |
|
1221 \end{center} |
1203 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$} |
1222 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$} |
1204 \end{figure}\label{fig:BetterWaterloo} |
1223 \end{figure}\label{fig:BetterWaterloo} |
1205 |
1224 |
1206 That is because Sulzmann and Lu's |
1225 That is because Sulzmann and Lu's |
1207 injection-based lexing algorithm keeps a lot of |
1226 injection-based lexing algorithm keeps a lot of |