lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex

equal deleted inserted replaced

-:94604a5fd271
+:28cb8089ec36
 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
 \end{lemma}
 \begin{proof}
 By induction on $s$, $r$ and $v_1$. The inductive cases
 are all the POSIX rules.
-Each case is proven by a combination of
-the induction rules for $\POSIX$
-values and the inductive hypothesis.
 Probably the most cumbersome cases are
 the sequence and star with non-empty iterations.
+We shall give the details for proving the sequence case here.
-We give the reasoning about the sequence case as follows:
-When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
+When we have
-we know that there could not be a longer string $r_1'$ such that $(s_1', r_1) \rightarrow v_1'$
+\[
-and $(s_2', r_2) \rightarrow v2'$ and $s_1' @s_2' = s$ all hold.
+	(s_1, r_1) \rightarrow v_1 \;\, and \;\,
-For possible values of $s_1'$ and $s_2'$ where $s_1'$ is shorter, they cannot
+	(s_2, r_2) \rightarrow v_2  \;\, and \;\,\\
-possibly form a $\POSIX$ for $s$.
+	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
-If we have some other values $v_1'$ and $v_2'$ such that
+		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
-$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
+\]
-Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$,
+we know that the last condition
-which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
+excludes the possibility of a
+string $s_1'$ longer than $s_1$ such that
+\[
+(s_1', r_1) \rightarrow v_1'   \;\;
+and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s
+\]
+hold.
+A shorter string $s_1''$ with $s_2''$ satisfying
+\[
+(s_1'', r_1) \rightarrow v_1''
+\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
+\]
+cannot possibly form a $\POSIX$ value either, because
+by definition there is a candidate
+with longer initial string
+$s_1$. Therefore, we know that the POSIX
+value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
+$s$ must have the
+property that
+\[
+	|a| = s_1 \;\; and \;\; |b| = s_2.
+\]
+The goal is to prove that $a = v_1 $ and $b = v_2$.
+If we have some other POSIX values $v_{10}$ and $v_{20}$ such that
+$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
+then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
+which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
 is the same as $\Seq(v_1, v_2)$.
 \end{proof}
-Now we know what a $\POSIX$ value is; the problem is how do we achieve
+\noindent
-such a value in a lexing algorithm, using derivatives?
+Now we know what a $\POSIX$ value is and why it is unique;
+the problem is generating
+such a value in a lexing algorithm using derivatives.
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
-The contribution of Sulzmann and Lu is an extension of Brzozowski's
+Sulzmann and Lu extended Brzozowski's
-algorithm by a second phase (the first phase being building successive
+derivative-based matching
-derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
+to a lexing algorithm by a second pass
+after the initial phase of successive derivatives.
+This second phase generates a POSIX value
+if the regular expression matches the string.
 Two functions are involved: $\inj$ and $\mkeps$.
-The function $\mkeps$ constructs a value from the last
+The first one used is $\mkeps$, which constructs a POSIX value from the last
-one of all the successive derivatives:
+derivative $r_n$:
 \begin{ceqn}
 \begin{equation}\label{graph:mkeps}
 \begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
+r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
 	        & 	              & 	            & v_n
 \end{tikzcd}
 \end{equation}
 \end{ceqn}
+\noindent
-It tells us how can an empty string be matched by a
+In the above diagram, again we assume that
-regular expression, in a $\POSIX$ way:
+the input string $s$ is made of $n$ characters
+$c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$
+is given label $0$ and after each character $c_i$ is taken off
+by the derivative operation the resulting derivative regular
+expressioin is $r_{i+1}$.The last derivative operation
+$\backslash c_{n-1}$ gives back $r_n$, which is transformed into
+a value $v_n$ by $\mkeps$.
+$v_n$ tells us how an empty string is matched by the (nullable)
+regular expression $r_n$, in a $\POSIX$ way.
+The definition of $\mkeps$ is
 	\begin{center}
 		\begin{tabular}{lcl}
 			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
-			$\mkeps \; r_{1}+r_{2}$	& $\dn$
+			$\mkeps \; (r_{1}+r_{2})$	& $\dn$
-						& \textit{if} ($\nullable \; r_{1})$\\
+						& $\textit{if}\; (\nullable \; r_{1}) \;\,
-			& & \textit{then} $\Left (\mkeps \; r_{1})$\\
+							\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\
-			& & \textit{else} $\Right(\mkeps \; r_{2})$\\
+						& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
-			$\mkeps \; r_1\cdot r_2$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
+			$\mkeps \; (r_1 \cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
 			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
 		\end{tabular}
 	\end{center}
 \noindent
-We favour the left to match an empty string if there is a choice.
+We favour the left child $r_1$ of $r_1 + r_2$
+to match an empty string if there is a choice.
 When there is a star for us to match the empty string,
 we give the $\Stars$ constructor an empty list, meaning
-no iterations are taken.
+no iteration is taken.
 The result of a call to $\mkeps$ on a $\nullable$ $r$ would
 be a $\POSIX$ value corresponding to $r$:
-\begin{lemma}
+\begin{lemma}\label{mePosix}
-$\nullable(r) \implies (r, []) \rightarrow (\mkeps\; v)$
+$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
-\end{lemma}\label{mePosix}
+\end{lemma}
+\begin{proof}
+	By induction on the shape of $r$.
-After the $\mkeps$-call, we inject back the characters one by one in order to build
+\end{proof}
-the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
+\noindent
-($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+After the $\mkeps$-call, we inject back the characters one by one
+in reverse order as they were chopped off in the derivative phase.
+The fucntion for this is called $\inj$. $\inj$ and $\backslash$
+are not exactly reverse operations of one another, as $\inj$
+operates on values instead of regular
+expressions.
+In the diagram below, $v_i$ stands for the (POSIX) value
+for how the regular expression
+$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
+of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
 After injecting back $n$ characters, we get the lexical value for how $r_0$
 matches $s$.
-To do this, Sulzmann and Lu defined a function that reverses
-the ``chopping off'' of characters during the derivative phase. The
-corresponding function is called \emph{injection}, written
-$\textit{inj}$; it takes three arguments: the first one is a regular
-expression ${r_{i-1}}$, before the character is chopped off, the second
-is a character ${c_{i-1}}$, the character we want to inject and the
-third argument is the value ${v_i}$, into which one wants to inject the
-character (it corresponds to the regular expression after the character
-has been chopped off). The result of this function is a new value.
 \begin{ceqn}
 \begin{equation}\label{graph:inj}
 \begin{tikzcd}
-r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
+r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
-v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]
+v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]
 \end{tikzcd}
 \end{equation}
 \end{ceqn}
+\noindent
+$\textit{inj}$ takes three arguments: a regular
-\noindent
+expression ${r_{i}}$, before the character is chopped off,
-The
+a character ${c_{i}}$, the character we want to inject back and
-definition of $\textit{inj}$ is as follows:
+the third argument $v_{i+1}$ the value we want to inject into.
+The result of this function is a new value $v_i$.
+The definition of $\textit{inj}$ is as follows:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
 $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\

changeset 567	28cb8089ec36
parent 564	3cbcd7cda0a9
child 568	7a579f5533f8