619 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$ |
619 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$ |
620 \end{lemma} |
620 \end{lemma} |
621 \begin{proof} |
621 \begin{proof} |
622 By induction on $s$, $r$ and $v_1$. The inductive cases |
622 By induction on $s$, $r$ and $v_1$. The inductive cases |
623 are all the POSIX rules. |
623 are all the POSIX rules. |
624 Each case is proven by a combination of |
|
625 the induction rules for $\POSIX$ |
|
626 values and the inductive hypothesis. |
|
627 Probably the most cumbersome cases are |
624 Probably the most cumbersome cases are |
628 the sequence and star with non-empty iterations. |
625 the sequence and star with non-empty iterations. |
629 |
626 We shall give the details for proving the sequence case here. |
630 We give the reasoning about the sequence case as follows: |
627 |
631 When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, |
628 When we have |
632 we know that there could not be a longer string $r_1'$ such that $(s_1', r_1) \rightarrow v_1'$ |
629 \[ |
633 and $(s_2', r_2) \rightarrow v2'$ and $s_1' @s_2' = s$ all hold. |
630 (s_1, r_1) \rightarrow v_1 \;\, and \;\, |
634 For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot |
631 (s_2, r_2) \rightarrow v_2 \;\, and \;\,\\ |
635 possibly form a $\POSIX$ for $s$. |
632 \nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land |
636 If we have some other values $v_1'$ and $v_2'$ such that |
633 s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2 |
637 $(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$, |
634 \] |
638 Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, |
635 we know that the last condition |
639 which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$ |
636 excludes the possibility of a |
|
637 string $s_1'$ longer than $s_1$ such that |
|
638 \[ |
|
639 (s_1', r_1) \rightarrow v_1' \;\; |
|
640 and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s |
|
641 \] |
|
642 hold. |
|
643 A shorter string $s_1''$ with $s_2''$ satisfying |
|
644 \[ |
|
645 (s_1'', r_1) \rightarrow v_1'' |
|
646 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s |
|
647 \] |
|
648 cannot possibly form a $\POSIX$ value either, because |
|
649 by definition there is a candidate |
|
650 with longer initial string |
|
651 $s_1$. Therefore, we know that the POSIX |
|
652 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching |
|
653 $s$ must have the |
|
654 property that |
|
655 \[ |
|
656 |a| = s_1 \;\; and \;\; |b| = s_2. |
|
657 \] |
|
658 The goal is to prove that $a = v_1 $ and $b = v_2$. |
|
659 If we have some other POSIX values $v_{10}$ and $v_{20}$ such that |
|
660 $(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold, |
|
661 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, |
|
662 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$ |
640 is the same as $\Seq(v_1, v_2)$. |
663 is the same as $\Seq(v_1, v_2)$. |
641 \end{proof} |
664 \end{proof} |
642 Now we know what a $\POSIX$ value is; the problem is how do we achieve |
665 \noindent |
643 such a value in a lexing algorithm, using derivatives? |
666 Now we know what a $\POSIX$ value is and why it is unique; |
|
667 the problem is generating |
|
668 such a value in a lexing algorithm using derivatives. |
644 |
669 |
645 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
670 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm} |
646 |
671 |
647 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
672 Sulzmann and Lu extended Brzozowski's |
648 algorithm by a second phase (the first phase being building successive |
673 derivative-based matching |
649 derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string. |
674 to a lexing algorithm by a second pass |
|
675 after the initial phase of successive derivatives. |
|
676 This second phase generates a POSIX value |
|
677 if the regular expression matches the string. |
650 Two functions are involved: $\inj$ and $\mkeps$. |
678 Two functions are involved: $\inj$ and $\mkeps$. |
651 The function $\mkeps$ constructs a value from the last |
679 The first one used is $\mkeps$, which constructs a POSIX value from the last |
652 one of all the successive derivatives: |
680 derivative $r_n$: |
653 \begin{ceqn} |
681 \begin{ceqn} |
654 \begin{equation}\label{graph:mkeps} |
682 \begin{equation}\label{graph:mkeps} |
655 \begin{tikzcd} |
683 \begin{tikzcd} |
656 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\ |
684 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\ |
657 & & & v_n |
685 & & & v_n |
658 \end{tikzcd} |
686 \end{tikzcd} |
659 \end{equation} |
687 \end{equation} |
660 \end{ceqn} |
688 \end{ceqn} |
661 |
689 \noindent |
662 It tells us how can an empty string be matched by a |
690 In the above diagram, again we assume that |
663 regular expression, in a $\POSIX$ way: |
691 the input string $s$ is made of $n$ characters |
664 |
692 $c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$ |
|
693 is given label $0$ and after each character $c_i$ is taken off |
|
694 by the derivative operation the resulting derivative regular |
|
695 expressioin is $r_{i+1}$.The last derivative operation |
|
696 $\backslash c_{n-1}$ gives back $r_n$, which is transformed into |
|
697 a value $v_n$ by $\mkeps$. |
|
698 $v_n$ tells us how an empty string is matched by the (nullable) |
|
699 regular expression $r_n$, in a $\POSIX$ way. |
|
700 The definition of $\mkeps$ is |
665 \begin{center} |
701 \begin{center} |
666 \begin{tabular}{lcl} |
702 \begin{tabular}{lcl} |
667 $\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\ |
703 $\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\ |
668 $\mkeps \; r_{1}+r_{2}$ & $\dn$ |
704 $\mkeps \; (r_{1}+r_{2})$ & $\dn$ |
669 & \textit{if} ($\nullable \; r_{1})$\\ |
705 & $\textit{if}\; (\nullable \; r_{1}) \;\, |
670 & & \textit{then} $\Left (\mkeps \; r_{1})$\\ |
706 \textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ |
671 & & \textit{else} $\Right(\mkeps \; r_{2})$\\ |
707 & & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\ |
672 $\mkeps \; r_1\cdot r_2$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\ |
708 $\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\ |
673 $\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$ |
709 $\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$ |
674 \end{tabular} |
710 \end{tabular} |
675 \end{center} |
711 \end{center} |
676 |
712 |
677 |
713 |
678 \noindent |
714 \noindent |
679 We favour the left to match an empty string if there is a choice. |
715 We favour the left child $r_1$ of $r_1 + r_2$ |
|
716 to match an empty string if there is a choice. |
680 When there is a star for us to match the empty string, |
717 When there is a star for us to match the empty string, |
681 we give the $\Stars$ constructor an empty list, meaning |
718 we give the $\Stars$ constructor an empty list, meaning |
682 no iterations are taken. |
719 no iteration is taken. |
683 The result of a call to $\mkeps$ on a $\nullable$ $r$ would |
720 The result of a call to $\mkeps$ on a $\nullable$ $r$ would |
684 be a $\POSIX$ value corresponding to $r$: |
721 be a $\POSIX$ value corresponding to $r$: |
685 \begin{lemma} |
722 \begin{lemma}\label{mePosix} |
686 $\nullable(r) \implies (r, []) \rightarrow (\mkeps\; v)$ |
723 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$ |
687 \end{lemma}\label{mePosix} |
724 \end{lemma} |
688 |
725 \begin{proof} |
689 |
726 By induction on the shape of $r$. |
690 After the $\mkeps$-call, we inject back the characters one by one in order to build |
727 \end{proof} |
691 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
728 \noindent |
692 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
729 After the $\mkeps$-call, we inject back the characters one by one |
|
730 in reverse order as they were chopped off in the derivative phase. |
|
731 The fucntion for this is called $\inj$. $\inj$ and $\backslash$ |
|
732 are not exactly reverse operations of one another, as $\inj$ |
|
733 operates on values instead of regular |
|
734 expressions. |
|
735 In the diagram below, $v_i$ stands for the (POSIX) value |
|
736 for how the regular expression |
|
737 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters |
|
738 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
693 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
739 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
694 matches $s$. |
740 matches $s$. |
695 To do this, Sulzmann and Lu defined a function that reverses |
|
696 the ``chopping off'' of characters during the derivative phase. The |
|
697 corresponding function is called \emph{injection}, written |
|
698 $\textit{inj}$; it takes three arguments: the first one is a regular |
|
699 expression ${r_{i-1}}$, before the character is chopped off, the second |
|
700 is a character ${c_{i-1}}$, the character we want to inject and the |
|
701 third argument is the value ${v_i}$, into which one wants to inject the |
|
702 character (it corresponds to the regular expression after the character |
|
703 has been chopped off). The result of this function is a new value. |
|
704 \begin{ceqn} |
741 \begin{ceqn} |
705 \begin{equation}\label{graph:inj} |
742 \begin{equation}\label{graph:inj} |
706 \begin{tikzcd} |
743 \begin{tikzcd} |
707 r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
744 r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
708 v_1 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] |
745 v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] |
709 \end{tikzcd} |
746 \end{tikzcd} |
710 \end{equation} |
747 \end{equation} |
711 \end{ceqn} |
748 \end{ceqn} |
712 |
749 \noindent |
713 |
750 $\textit{inj}$ takes three arguments: a regular |
714 \noindent |
751 expression ${r_{i}}$, before the character is chopped off, |
715 The |
752 a character ${c_{i}}$, the character we want to inject back and |
716 definition of $\textit{inj}$ is as follows: |
753 the third argument $v_{i+1}$ the value we want to inject into. |
717 |
754 The result of this function is a new value $v_i$. |
|
755 The definition of $\textit{inj}$ is as follows: |
718 \begin{center} |
756 \begin{center} |
719 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
757 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
720 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
758 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
721 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
759 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
722 $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ |
760 $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ |