lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex

equal deleted inserted replaced

-:28cb8089ec36
+:7a579f5533f8
 \noindent
 $\textit{inj}$ takes three arguments: a regular
 expression ${r_{i}}$, before the character is chopped off,
 a character ${c_{i}}$, the character we want to inject back and
 the third argument $v_{i+1}$ the value we want to inject into.
-The result of this function is a new value $v_i$.
+The result of an application
+$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
+\[
+	(s_i, r_i) \rightarrow v_i
+\]
+holds.
 The definition of $\textit{inj}$ is as follows:
 \begin{center}
-\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
-$\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+$\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
-$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
-$\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
-$\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  &
-$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
-$\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ &
-$\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+$\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
+$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
+$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
 \end{tabular}
 \end{center}
 \noindent
-This definition is by recursion on the ``shape'' of regular
+The function does a recursion on
-expressions and values.
+the shape of regular
-The clauses do one thing--identifying the ``hole'' on a
+expression $r_i$ and value $v_{i+1}$.
-value to inject the character back into.
+Intuitively, each clause analyses
-For instance, in the last clause for injecting back to a value
+how $r_i$ could have transformed when being
-that would turn into a new star value that corresponds to a star,
+derived by $c$, identifying which subpart
-we know it must be a sequence value. And we know that the first
+of $v_{i+1}$ has the ``hole''
-value of that sequence corresponds to the child regex of the star
+to inject the character back into.
-with the first character being chopped off--an iteration of the star
+Once the character is
-that had just been unfolded. This value is followed by the already
+injected back to that sub-value;
-matched star iterations we collected before. So we inject the character
+$\inj$ assembles all things together
-back to the first value and form a new value with this latest iteration
+to form a new value.
-being added to the previous list of iterations, all under the $\Stars$
-top level.
+For instance, the last clause is an
-The POSIX value is maintained throughout the process:
+injection into a sequence value $v_{i+1}$
-\begin{lemma}
+whose second child
-$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
+value is a star, and the shape of the
-\end{lemma}\label{injPosix}
+regular expression $r_i$ before injection
+is a star.
+We therefore know
-Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
+the derivative
+starts on a star and ends as a sequence:
+\[
+	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
+\]
+during which an iteration of the star
+had just been unfolded, giving the below
+value inhabitation relation:
+\[
+	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
+\]
+The value list $vs$ corresponds to
+matched star iterations,
+and the ``hole'' lies in $v$ because
+\[
+	\vdash v: r\backslash c.
+\]
+Finally,
+$\inj \; r \;c \; v$ is prepended
+to the previous list of iterations, and then
+wrapped under the $\Stars$
+constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
+Recall that lemma
+\ref{mePosix} tells us
+$\mkeps$ always selects the POSIX matching among
+multiple values that flatten to the empty string.
+Now $\inj$ preserves the POSIXness, provided
+the value before injection is POSIX:
+\begin{lemma}\label{injPosix}
+	If
+	\[
+		(r \backslash c, s) \rightarrow v
+	\]
+	then
+	\[
+		(r, c :: s) \rightarrow (\inj r \; c\; v).
+	\]
+\end{lemma}
+\begin{proof}
+	By induction on $r$.
+	The involved cases are sequence and star.
+	When $r = a \cdot b$, there could be
+	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
+	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
+	case.
+	\begin{itemize}
+		\item
+			$v = \Seq \; v_a \; v_b$.\\
+			The ``not nullable'' clause of the $\inj$ function is taken:
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
+					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
+				\end{tabular}
+			\end{center}
+			We know that there exists a unique pair of
+			$s_a$ and $s_b$ satisfaying
+				$(a \backslash c, s_a) \rightarrow v_a$,
+				$(b , s_b) \rightarrow v_b$, and
+				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
+				L \; (a\backslash c) \land
+				s_4 \in L \; b$.
+			The last condition gives us
+			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in
+				L \; a \land
+				s_4 \in L \; b$.
+			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
+			and this gives us
+			\[
+				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
+			\]
+		\item
+			$v = \Left \; (\Seq \; v_a \; v_b)$\\
+			The argument is almost identical to the above case,
+			except that a different clause of $\inj$ is taken:
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
+					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
+				\end{tabular}
+			\end{center}
+			With a similar reasoning,
+			\[
+				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
+			\]
+			again holds.
+		\item
+			$v = \Right \; v_b$\\
+			Again the injection result would be
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
+					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
+				\end{tabular}
+			\end{center}
+			We know that $a$ must be nullable,
+			allowing us to call $\mkeps$ and get
+			\[
+				(a, []) \rightarrow \mkeps \; a.
+			\]
+			Also by inductive hypothesis
+			\[
+				(b, c::s) \rightarrow \inj\; b \; c \; v_b
+			\]
+			holds.
+			In addition, as
+			$\Right \;v_b$  instead of $\Left \ldots$ is
+			the POSIX value for $v$, it must be the case
+			that $s \notin L \;( (a\backslash c)\cdot b)$.
+			This tells us that
+			\[
+				\nexists s_3 \; s_4.
+				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c)
+				\land s_4 \in L \; b
+			\]
+			which translates to
+			\[
+				\nexists s_3 \; s_4. \; s_3 \neq [] \land
+				s_3 @s_4 = c::s  \land s_3 \in L \; a
+				\land s_4 \in L \; b.
+			\]
+			(Which basically says there cannot be a longer
+			initial split for $s$ other than the empty string.)
+			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
+			as the POSIX value for $a\cdot b$.
+	\end{itemize}
+	The star case can be proven similarly.
+\end{proof}
+\noindent
+Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
+by following the procedure outlined in the diagram \ref{graph:inj},
 and taking into consideration the possibility of a non-match,
-we have a lexer with the following recursive definition:
+a lexer can be built with the following recursive definition:
 \begin{center}
 \begin{tabular}{lcl}
-$\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
+	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
-$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
+	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
-& & $\quad \None \implies \None$\\
+	& & $\quad \phantom{\mid}\; \None \implies \None$\\
-& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
+	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
 \end{tabular}
 \end{center}
 \noindent
 The central property of the $\lexer$ is that it gives the correct result by
 $\POSIX$ standards:
 \begin{theorem}
-	 The $\lexer$ based on derivatives and injections is both sound and complete:
+	The $\lexer$ based on derivatives and injections is correct:
-\begin{tabular}{lcl}
+	\begin{center}
-	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
+		\begin{tabular}{lcl}
-	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
+			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
-\end{tabular}
+			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
-\end{theorem}
+		\end{tabular}
+	\end{center}
+\end{theorem}
 \begin{proof}
 By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
 The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
 by lemma \ref{injPosix}.
 \end{proof}
-We now give a pictorial view of the algorithm (
-For convenience, we employ the following notations: the regular
-expression we start with is $r_0$, and the given string $s$ is composed
-of characters $c_0 c_1 \ldots c_{n-1}$. The
-values built incrementally by \emph{injecting} back the characters into the
-earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
-are always in the same subscript, i.e. $\vdash v_i : r_i$):
-\begin{ceqn}
-\begin{equation}\label{graph:2}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
 \noindent
 As we did earlier in this chapter on the matcher, one can
 introduce simplification on the regex.
-However, now we need to do a backward phase and make sure
+However, now one needs to do a backward phase and make sure
 the values align with the regular expressions.
 Therefore one has to
 be careful not to break the correctness, as the injection
 function heavily relies on the structure of the regexes and values
 being correct and matching each other.
 It can be achieved by recording some extra rectification functions
 during the derivatives step, and applying these rectifications in
 each run during the injection phase.
+With extra care
-\ChristianComment{Do I introduce the lexer with rectification here?}
+one can show that POSIXness will not be affected---although it is much harder
-And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although it is much harder
 to establish.
 Some initial results in this regard have been
 obtained in \cite{AusafDyckhoffUrban2016}.
-However, even with these simplification rules, we could still end up in
+However, with all the simplification rules allowed
-trouble, when we encounter cases that require more involved and aggressive
+in an injection-based lexer, one could still end up in
-simplifications.
+trouble, when cases that require more involved and aggressive
-\section{A Case Requring More Aggressive Simplification}
+simplifications arise.
+\section{A Case Requring More Aggressive Simplifications}
 For example, when starting with the regular
 expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
 w.r.t.~the character $a$, one obtains a derivative regular expression
 with more than 9000 nodes (when viewed as a tree)
 even with simplification.
 \end{axis}
 \end{tikzpicture}
 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
 \end{figure}\label{fig:BetterWaterloo}
-That is because our lexing algorithm currently keeps a lot of
+That is because Sulzmann and Lu's
+injection-based lexing algorithm keeps a lot of
 "useless" values that will not be used.
 These different ways of matching will grow exponentially with the string length.
+Take
-For $r= (a^*\cdot a^*)^*$ and
+\[
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
+	r= (a^*\cdot a^*)^* \quad and \quad
-if we do not allow any empty iterations in its lexical values,
+	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
-there will be $n - 1$ "splitting points" on $s$ we can independently choose to
+\]
-split or not so that each sub-string
+as an example.
-segmented by those chosen splitting points will form different iterations.
+This is a highly ambiguous regular expression, with
-For example when $n=4$, we give out a few of the many possibilities of splitting:
+many ways to split up the string into multiple segments for
-\begin{center}
+different star iteratioins,
-\begin{tabular}{lcr}
+and each segment will have multiple ways of splitting between
-$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
+the two $a^*$ sub-expressions.
-$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
+It is not surprising there are exponentially many
-$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
+distinct lexical values
-$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
+for such a pair of regular expression and string.
-$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
+A lexer without a good enough strategy to
-& $\textit{etc}.$ &
+deduplicate will naturally
-\end{tabular}
-\end{center}
-\noindent
-And for each iteration, there are still multiple ways to split
-between the two $a^*$s.
-It is not surprising there are exponentially many lexical values
-that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer to keep all the possible values will naturally
 have an exponential runtime on ambiguous regular expressions.
-With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
-of a match. This means Sulzmann and Lu's injection-based algorithm
-exponential by nature.
 Somehow one has to make sure which
 lexical values are $\POSIX$ and must be kept in a lexing algorithm.
 For example, the above $r= (a^*\cdot a^*)^*$  and
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
 We want to keep this value only, and remove all the regular expression subparts
 not corresponding to this value during lexing.

changeset 568	7a579f5533f8
parent 567	28cb8089ec36
child 573	454ced557605