ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 568 7a579f5533f8
parent 567 28cb8089ec36
child 573 454ced557605
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Thu Jul 14 14:57:32 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Sat Jul 16 18:34:46 2022 +0100
@@ -751,92 +751,212 @@
 expression ${r_{i}}$, before the character is chopped off, 
 a character ${c_{i}}$, the character we want to inject back and 
 the third argument $v_{i+1}$ the value we want to inject into. 
-The result of this function is a new value $v_i$.
+The result of an application 
+$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
+\[
+	(s_i, r_i) \rightarrow v_i
+\]
+holds.
 The definition of $\textit{inj}$ is as follows: 
 \begin{center}
-\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
+  $\textit{inj}\;(c)\;c\,Empty$            & $\dn$ & $\Char\,c$\\
+  $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left  \; (\textit{inj}\; r_1 \; c\,v)$\\
+  $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c  \; v)$\\
+  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$  & 
+  $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
+  $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & 
+  $\dn$  & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
+  $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$  & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
+  $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
 \end{tabular}
 \end{center}
 
 \noindent 
-This definition is by recursion on the ``shape'' of regular
-expressions and values. 
-The clauses do one thing--identifying the ``hole'' on a
-value to inject the character back into.
-For instance, in the last clause for injecting back to a value
-that would turn into a new star value that corresponds to a star,
-we know it must be a sequence value. And we know that the first 
-value of that sequence corresponds to the child regex of the star
-with the first character being chopped off--an iteration of the star
-that had just been unfolded. This value is followed by the already
-matched star iterations we collected before. So we inject the character 
-back to the first value and form a new value with this latest iteration
-being added to the previous list of iterations, all under the $\Stars$
-top level.
-The POSIX value is maintained throughout the process:
-\begin{lemma}
-$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
-\end{lemma}\label{injPosix}
+The function does a recursion on 
+the shape of regular
+expression $r_i$ and value $v_{i+1}$. 
+Intuitively, each clause analyses 
+how $r_i$ could have transformed when being 
+derived by $c$, identifying which subpart
+of $v_{i+1}$ has the ``hole'' 
+to inject the character back into.
+Once the character is
+injected back to that sub-value; 
+$\inj$ assembles all things together
+to form a new value.
+
+For instance, the last clause is an
+injection into a sequence value $v_{i+1}$
+whose second child
+value is a star, and the shape of the 
+regular expression $r_i$ before injection 
+is a star.
+We therefore know 
+the derivative 
+starts on a star and ends as a sequence:
+\[
+	(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
+\]
+during which an iteration of the star
+had just been unfolded, giving the below
+value inhabitation relation:
+\[
+	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
+\]
+The value list $vs$ corresponds to
+matched star iterations,
+and the ``hole'' lies in $v$ because
+\[
+	\vdash v: r\backslash c.
+\]
+Finally, 
+$\inj \; r \;c \; v$ is prepended
+to the previous list of iterations, and then
+wrapped under the $\Stars$
+constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
 
+Recall that lemma 
+\ref{mePosix} tells us
+$\mkeps$ always selects the POSIX matching among
+multiple values that flatten to the empty string.
+Now $\inj$ preserves the POSIXness, provided
+the value before injection is POSIX:
+\begin{lemma}\label{injPosix}
+	If
+	\[
+		(r \backslash c, s) \rightarrow v 
+	\]
+	then
+	\[
+		(r, c :: s) \rightarrow (\inj r \; c\; v).
+	\]
+\end{lemma}
+\begin{proof}
+	By induction on $r$.
+	The involved cases are sequence and star.
+	When $r = a \cdot b$, there could be
+	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
+	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
+	case.
+	\begin{itemize}
+		\item
+			$v = \Seq \; v_a \; v_b$.\\
+			The ``not nullable'' clause of the $\inj$ function is taken:
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
+					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
+				\end{tabular}
+			\end{center}
+			We know that there exists a unique pair of
+			$s_a$ and $s_b$ satisfaying	
+				$(a \backslash c, s_a) \rightarrow v_a$,
+				$(b , s_b) \rightarrow v_b$, and
+				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
+				L \; (a\backslash c) \land
+				s_4 \in L \; b$.
+			The last condition gives us
+			$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in 
+				L \; a \land
+				s_4 \in L \; b$.
+			By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
+			and this gives us
+			\[
+				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
+			\]
+		\item
+			$v = \Left \; (\Seq \; v_a \; v_b)$\\
+			The argument is almost identical to the above case,	
+			except that a different clause of $\inj$ is taken:
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
+					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
+				\end{tabular}
+			\end{center}
+			With a similar reasoning, 
 
-Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
+			\[
+				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
+			\]
+			again holds.
+		\item 
+			$v = \Right \; v_b$\\
+			Again the injection result would be 
+			\begin{center}
+				\begin{tabular}{lcl}
+					$\inj \; r \; c \; v$ &   $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
+					& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
+				\end{tabular}
+			\end{center}
+			We know that $a$ must be nullable,
+			allowing us to call $\mkeps$ and get
+			\[
+				(a, []) \rightarrow \mkeps \; a.
+			\]
+			Also by inductive hypothesis
+			\[
+				(b, c::s) \rightarrow \inj\; b \; c \; v_b
+			\]
+			holds.
+			In addition, as
+			$\Right \;v_b$  instead of $\Left \ldots$ is 
+			the POSIX value for $v$, it must be the case
+			that $s \notin L \;( (a\backslash c)\cdot b)$.
+			This tells us that 
+			\[
+				\nexists s_3 \; s_4.
+				s_3 @s_4 = s  \land s_3 \in L \; (a\backslash c) 
+				\land s_4 \in L \; b
+			\]
+			which translates to
+			\[
+				\nexists s_3 \; s_4. \; s_3 \neq [] \land
+				s_3 @s_4 = c::s  \land s_3 \in L \; a 
+				\land s_4 \in L \; b.
+			\]
+			(Which basically says there cannot be a longer 
+			initial split for $s$ other than the empty string.)
+			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
+			as the POSIX value for $a\cdot b$.
+	\end{itemize}
+	The star case can be proven similarly.
+\end{proof}
+\noindent
+Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
+by following the procedure outlined in the diagram \ref{graph:inj},
 and taking into consideration the possibility of a non-match,
-we have a lexer with the following recursive definition:
+a lexer can be built with the following recursive definition:
 \begin{center}
 \begin{tabular}{lcl}
-$\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
-$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
-& & $\quad \None \implies \None$\\
-& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
+	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
+	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
+	& & $\quad \phantom{\mid}\; \None \implies \None$\\
+	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
 \end{tabular}
 \end{center}
- \noindent
- The central property of the $\lexer$ is that it gives the correct result by
- $\POSIX$ standards:
- \begin{theorem}
-	 The $\lexer$ based on derivatives and injections is both sound and complete:
- \begin{tabular}{lcl}
-	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
-	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
- \end{tabular}
- \end{theorem}
- 
- 
- \begin{proof}
- By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
- The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
- by lemma \ref{injPosix}.
- \end{proof}
-
-
-We now give a pictorial view of the algorithm (
-For convenience, we employ the following notations: the regular
-expression we start with is $r_0$, and the given string $s$ is composed
-of characters $c_0 c_1 \ldots c_{n-1}$. The
-values built incrementally by \emph{injecting} back the characters into the
-earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
-are always in the same subscript, i.e. $\vdash v_i : r_i$):
-
-\begin{ceqn}
-\begin{equation}\label{graph:2}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
+\noindent
+The central property of the $\lexer$ is that it gives the correct result by
+$\POSIX$ standards:
+\begin{theorem}
+	The $\lexer$ based on derivatives and injections is correct: 
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
+			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
+		\end{tabular}
+	\end{center}
+\end{theorem} 
+\begin{proof}
+By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
+The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
+by lemma \ref{injPosix}.
+\end{proof}
 \noindent
 As we did earlier in this chapter on the matcher, one can 
 introduce simplification on the regex.
-However, now we need to do a backward phase and make sure
+However, now one needs to do a backward phase and make sure
 the values align with the regular expressions.
 Therefore one has to
 be careful not to break the correctness, as the injection 
@@ -845,18 +965,17 @@
 It can be achieved by recording some extra rectification functions
 during the derivatives step, and applying these rectifications in 
 each run during the injection phase.
-
-\ChristianComment{Do I introduce the lexer with rectification here?}
-And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although it is much harder
+With extra care
+one can show that POSIXness will not be affected---although it is much harder
 to establish. 
 Some initial results in this regard have been
 obtained in \cite{AusafDyckhoffUrban2016}. 
 
-However, even with these simplification rules, we could still end up in
-trouble, when we encounter cases that require more involved and aggressive
-simplifications.
-\section{A Case Requring More Aggressive Simplification}
+However, with all the simplification rules allowed
+in an injection-based lexer, one could still end up in
+trouble, when cases that require more involved and aggressive
+simplifications arise.
+\section{A Case Requring More Aggressive Simplifications}
 For example, when starting with the regular
 expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
 w.r.t.~the character $a$, one obtains a derivative regular expression
@@ -876,42 +995,30 @@
 \caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
 \end{figure}\label{fig:BetterWaterloo}
    
-That is because our lexing algorithm currently keeps a lot of 
+That is because Sulzmann and Lu's 
+injection-based lexing algorithm keeps a lot of 
 "useless" values that will not be used. 
 These different ways of matching will grow exponentially with the string length.
+Take 
+\[
+	r= (a^*\cdot a^*)^* \quad and \quad
+	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
+\]
+as an example.
+This is a highly ambiguous regular expression, with
+many ways to split up the string into multiple segments for
+different star iteratioins,
+and each segment will have multiple ways of splitting between 
+the two $a^*$ sub-expressions.
+It is not surprising there are exponentially many 
+distinct lexical values
+for such a pair of regular expression and string.
+A lexer without a good enough strategy to 
+deduplicate will naturally
+have an exponential runtime on ambiguous regular expressions.
 
-For $r= (a^*\cdot a^*)^*$ and  
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
-if we do not allow any empty iterations in its lexical values,
-there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
-split or not so that each sub-string
-segmented by those chosen splitting points will form different iterations.
-For example when $n=4$, we give out a few of the many possibilities of splitting:
-\begin{center}
-\begin{tabular}{lcr}
-$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
-$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
-$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
-$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
-$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
- & $\textit{etc}.$ &
- \end{tabular}
-\end{center}
-\noindent
-And for each iteration, there are still multiple ways to split
-between the two $a^*$s.
-It is not surprising there are exponentially many lexical values
-that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer to keep all the possible values will naturally 
-have an exponential runtime on ambiguous regular expressions.
-With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
-of a match. This means Sulzmann and Lu's injection-based algorithm 
- exponential by nature.
 Somehow one has to make sure which
  lexical values are $\POSIX$ and must be kept in a lexing algorithm.
-
-
  For example, the above $r= (a^*\cdot a^*)^*$  and 
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.