ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 567 28cb8089ec36
parent 564 3cbcd7cda0a9
child 568 7a579f5533f8
equal deleted inserted replaced
566:94604a5fd271 567:28cb8089ec36
   619 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
   619 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
   620 \end{lemma}
   620 \end{lemma}
   621 \begin{proof}
   621 \begin{proof}
   622 By induction on $s$, $r$ and $v_1$. The inductive cases
   622 By induction on $s$, $r$ and $v_1$. The inductive cases
   623 are all the POSIX rules. 
   623 are all the POSIX rules. 
   624 Each case is proven by a combination of
       
   625 the induction rules for $\POSIX$ 
       
   626 values and the inductive hypothesis.
       
   627 Probably the most cumbersome cases are 
   624 Probably the most cumbersome cases are 
   628 the sequence and star with non-empty iterations.
   625 the sequence and star with non-empty iterations.
   629 
   626 We shall give the details for proving the sequence case here.
   630 We give the reasoning about the sequence case as follows:
   627 
   631 When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, 
   628 When we have 
   632 we know that there could not be a longer string $r_1'$ such that $(s_1', r_1) \rightarrow v_1'$
   629 \[
   633 and $(s_2', r_2) \rightarrow v2'$ and $s_1' @s_2' = s$ all hold.
   630 	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
   634 For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
   631 	(s_2, r_2) \rightarrow v_2  \;\, and \;\,\\ 
   635 possibly form a $\POSIX$ for $s$.
   632 	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
   636 If we have some other values $v_1'$ and $v_2'$ such that 
   633 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
   637 $(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
   634 \]
   638 Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$, 
   635 we know that the last condition 
   639 which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
   636 excludes the possibility of a 
       
   637 string $s_1'$ longer than $s_1$ such that 
       
   638 \[
       
   639 (s_1', r_1) \rightarrow v_1'   \;\; 
       
   640 and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s 
       
   641 \]
       
   642 hold.
       
   643 A shorter string $s_1''$ with $s_2''$ satisfying
       
   644 \[
       
   645 (s_1'', r_1) \rightarrow v_1''
       
   646 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
       
   647 \]
       
   648 cannot possibly form a $\POSIX$ value either, because
       
   649 by definition there is a candidate
       
   650 with longer initial string
       
   651 $s_1$. Therefore, we know that the POSIX
       
   652 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
       
   653 $s$ must have the 
       
   654 property that 
       
   655 \[
       
   656 	|a| = s_1 \;\; and \;\; |b| = s_2.
       
   657 \]
       
   658 The goal is to prove that $a = v_1 $ and $b = v_2$.
       
   659 If we have some other POSIX values $v_{10}$ and $v_{20}$ such that 
       
   660 $(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
       
   661 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
       
   662 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
   640 is the same as $\Seq(v_1, v_2)$. 
   663 is the same as $\Seq(v_1, v_2)$. 
   641 \end{proof}
   664 \end{proof}
   642 Now we know what a $\POSIX$ value is; the problem is how do we achieve 
   665 \noindent
   643 such a value in a lexing algorithm, using derivatives?
   666 Now we know what a $\POSIX$ value is and why it is unique;
       
   667 the problem is generating 
       
   668 such a value in a lexing algorithm using derivatives.
   644 
   669 
   645 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
   670 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
   646 
   671 
   647 The contribution of Sulzmann and Lu is an extension of Brzozowski's
   672 Sulzmann and Lu extended Brzozowski's 
   648 algorithm by a second phase (the first phase being building successive
   673 derivative-based matching
   649 derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
   674 to a lexing algorithm by a second pass
       
   675 after the initial phase of successive derivatives.
       
   676 This second phase generates a POSIX value 
       
   677 if the regular expression matches the string.
   650 Two functions are involved: $\inj$ and $\mkeps$.
   678 Two functions are involved: $\inj$ and $\mkeps$.
   651 The function $\mkeps$ constructs a value from the last
   679 The first one used is $\mkeps$, which constructs a POSIX value from the last
   652 one of all the successive derivatives:
   680 derivative $r_n$:
   653 \begin{ceqn}
   681 \begin{ceqn}
   654 \begin{equation}\label{graph:mkeps}
   682 \begin{equation}\label{graph:mkeps}
   655 \begin{tikzcd}
   683 \begin{tikzcd}
   656 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
   684 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
   657 	        & 	              & 	            & v_n       
   685 	        & 	              & 	            & v_n       
   658 \end{tikzcd}
   686 \end{tikzcd}
   659 \end{equation}
   687 \end{equation}
   660 \end{ceqn}
   688 \end{ceqn}
   661 
   689 \noindent
   662 It tells us how can an empty string be matched by a 
   690 In the above diagram, again we assume that
   663 regular expression, in a $\POSIX$ way:
   691 the input string $s$ is made of $n$ characters
   664 
   692 $c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$
       
   693 is given label $0$ and after each character $c_i$ is taken off
       
   694 by the derivative operation the resulting derivative regular 
       
   695 expressioin is $r_{i+1}$.The last derivative operation 
       
   696 $\backslash c_{n-1}$ gives back $r_n$, which is transformed into 
       
   697 a value $v_n$ by $\mkeps$.
       
   698 $v_n$ tells us how an empty string is matched by the (nullable)
       
   699 regular expression $r_n$, in a $\POSIX$ way.
       
   700 The definition of $\mkeps$ is
   665 	\begin{center}
   701 	\begin{center}
   666 		\begin{tabular}{lcl}
   702 		\begin{tabular}{lcl}
   667 			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
   703 			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
   668 			$\mkeps \; r_{1}+r_{2}$	& $\dn$ 
   704 			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
   669 						& \textit{if} ($\nullable \; r_{1})$\\ 
   705 						& $\textit{if}\; (\nullable \; r_{1}) \;\,
   670 			& & \textit{then} $\Left (\mkeps \; r_{1})$\\ 
   706 							\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ 
   671 			& & \textit{else} $\Right(\mkeps \; r_{2})$\\
   707 						& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
   672 			$\mkeps \; r_1\cdot r_2$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
   708 			$\mkeps \; (r_1 \cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
   673 			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
   709 			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
   674 		\end{tabular}
   710 		\end{tabular}
   675 	\end{center}
   711 	\end{center}
   676 
   712 
   677 
   713 
   678 \noindent 
   714 \noindent 
   679 We favour the left to match an empty string if there is a choice.
   715 We favour the left child $r_1$ of $r_1 + r_2$ 
       
   716 to match an empty string if there is a choice.
   680 When there is a star for us to match the empty string,
   717 When there is a star for us to match the empty string,
   681 we give the $\Stars$ constructor an empty list, meaning
   718 we give the $\Stars$ constructor an empty list, meaning
   682 no iterations are taken.
   719 no iteration is taken.
   683 The result of a call to $\mkeps$ on a $\nullable$ $r$ would
   720 The result of a call to $\mkeps$ on a $\nullable$ $r$ would
   684 be a $\POSIX$ value corresponding to $r$:
   721 be a $\POSIX$ value corresponding to $r$:
   685 \begin{lemma}
   722 \begin{lemma}\label{mePosix}
   686 $\nullable(r) \implies (r, []) \rightarrow (\mkeps\; v)$
   723 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
   687 \end{lemma}\label{mePosix}
   724 \end{lemma}
   688 
   725 \begin{proof}
   689 
   726 	By induction on the shape of $r$.
   690 After the $\mkeps$-call, we inject back the characters one by one in order to build
   727 \end{proof}
   691 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
   728 \noindent
   692 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   729 After the $\mkeps$-call, we inject back the characters one by one
       
   730 in reverse order as they were chopped off in the derivative phase.
       
   731 The fucntion for this is called $\inj$. $\inj$ and $\backslash$
       
   732 are not exactly reverse operations of one another, as $\inj$ 
       
   733 operates on values instead of regular
       
   734 expressions.
       
   735 In the diagram below, $v_i$ stands for the (POSIX) value 
       
   736 for how the regular expression 
       
   737 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
       
   738 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
   693 After injecting back $n$ characters, we get the lexical value for how $r_0$
   739 After injecting back $n$ characters, we get the lexical value for how $r_0$
   694 matches $s$. 
   740 matches $s$. 
   695 To do this, Sulzmann and Lu defined a function that reverses
       
   696 the ``chopping off'' of characters during the derivative phase. The
       
   697 corresponding function is called \emph{injection}, written
       
   698 $\textit{inj}$; it takes three arguments: the first one is a regular
       
   699 expression ${r_{i-1}}$, before the character is chopped off, the second
       
   700 is a character ${c_{i-1}}$, the character we want to inject and the
       
   701 third argument is the value ${v_i}$, into which one wants to inject the
       
   702 character (it corresponds to the regular expression after the character
       
   703 has been chopped off). The result of this function is a new value. 
       
   704 \begin{ceqn}
   741 \begin{ceqn}
   705 \begin{equation}\label{graph:inj}
   742 \begin{equation}\label{graph:inj}
   706 \begin{tikzcd}
   743 \begin{tikzcd}
   707 r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
   744 r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
   708 v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
   745 v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
   709 \end{tikzcd}
   746 \end{tikzcd}
   710 \end{equation}
   747 \end{equation}
   711 \end{ceqn}
   748 \end{ceqn}
   712 
   749 \noindent
   713 
   750 $\textit{inj}$ takes three arguments: a regular
   714 \noindent
   751 expression ${r_{i}}$, before the character is chopped off, 
   715 The
   752 a character ${c_{i}}$, the character we want to inject back and 
   716 definition of $\textit{inj}$ is as follows: 
   753 the third argument $v_{i+1}$ the value we want to inject into. 
   717 
   754 The result of this function is a new value $v_i$.
       
   755 The definition of $\textit{inj}$ is as follows: 
   718 \begin{center}
   756 \begin{center}
   719 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   757 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   720   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   758   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   721   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
   759   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
   722   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
   760   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\