--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Thu Jul 14 14:57:32 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Sat Jul 16 18:34:46 2022 +0100
@@ -751,92 +751,212 @@
expression ${r_{i}}$, before the character is chopped off,
a character ${c_{i}}$, the character we want to inject back and
the third argument $v_{i+1}$ the value we want to inject into.
-The result of this function is a new value $v_i$.
+The result of an application
+$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
+\[
+ (s_i, r_i) \rightarrow v_i
+\]
+holds.
The definition of $\textit{inj}$ is as follows:
\begin{center}
-\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
- $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\
- $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
- $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
- $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
- $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
- $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
- $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
+ $\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\
+ $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\
+ $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\
+ $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ &
+ $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
+ $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ &
+ $\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
+ $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
+ $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
\end{tabular}
\end{center}
\noindent
-This definition is by recursion on the ``shape'' of regular
-expressions and values.
-The clauses do one thing--identifying the ``hole'' on a
-value to inject the character back into.
-For instance, in the last clause for injecting back to a value
-that would turn into a new star value that corresponds to a star,
-we know it must be a sequence value. And we know that the first
-value of that sequence corresponds to the child regex of the star
-with the first character being chopped off--an iteration of the star
-that had just been unfolded. This value is followed by the already
-matched star iterations we collected before. So we inject the character
-back to the first value and form a new value with this latest iteration
-being added to the previous list of iterations, all under the $\Stars$
-top level.
-The POSIX value is maintained throughout the process:
-\begin{lemma}
-$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
-\end{lemma}\label{injPosix}
+The function does a recursion on
+the shape of regular
+expression $r_i$ and value $v_{i+1}$.
+Intuitively, each clause analyses
+how $r_i$ could have transformed when being
+derived by $c$, identifying which subpart
+of $v_{i+1}$ has the ``hole''
+to inject the character back into.
+Once the character is
+injected back to that sub-value;
+$\inj$ assembles all things together
+to form a new value.
+
+For instance, the last clause is an
+injection into a sequence value $v_{i+1}$
+whose second child
+value is a star, and the shape of the
+regular expression $r_i$ before injection
+is a star.
+We therefore know
+the derivative
+starts on a star and ends as a sequence:
+\[
+ (r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
+\]
+during which an iteration of the star
+had just been unfolded, giving the below
+value inhabitation relation:
+\[
+ \vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
+\]
+The value list $vs$ corresponds to
+matched star iterations,
+and the ``hole'' lies in $v$ because
+\[
+ \vdash v: r\backslash c.
+\]
+Finally,
+$\inj \; r \;c \; v$ is prepended
+to the previous list of iterations, and then
+wrapped under the $\Stars$
+constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
+Recall that lemma
+\ref{mePosix} tells us
+$\mkeps$ always selects the POSIX matching among
+multiple values that flatten to the empty string.
+Now $\inj$ preserves the POSIXness, provided
+the value before injection is POSIX:
+\begin{lemma}\label{injPosix}
+ If
+ \[
+ (r \backslash c, s) \rightarrow v
+ \]
+ then
+ \[
+ (r, c :: s) \rightarrow (\inj r \; c\; v).
+ \]
+\end{lemma}
+\begin{proof}
+ By induction on $r$.
+ The involved cases are sequence and star.
+ When $r = a \cdot b$, there could be
+ three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
+ We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
+ case.
+ \begin{itemize}
+ \item
+ $v = \Seq \; v_a \; v_b$.\\
+ The ``not nullable'' clause of the $\inj$ function is taken:
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
+ & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
+ \end{tabular}
+ \end{center}
+ We know that there exists a unique pair of
+ $s_a$ and $s_b$ satisfaying
+ $(a \backslash c, s_a) \rightarrow v_a$,
+ $(b , s_b) \rightarrow v_b$, and
+ $\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
+ L \; (a\backslash c) \land
+ s_4 \in L \; b$.
+ The last condition gives us
+ $\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in
+ L \; a \land
+ s_4 \in L \; b$.
+ By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
+ and this gives us
+ \[
+ (a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
+ \]
+ \item
+ $v = \Left \; (\Seq \; v_a \; v_b)$\\
+ The argument is almost identical to the above case,
+ except that a different clause of $\inj$ is taken:
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
+ & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
+ \end{tabular}
+ \end{center}
+ With a similar reasoning,
-Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
+ \[
+ (a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
+ \]
+ again holds.
+ \item
+ $v = \Right \; v_b$\\
+ Again the injection result would be
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
+ & $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
+ \end{tabular}
+ \end{center}
+ We know that $a$ must be nullable,
+ allowing us to call $\mkeps$ and get
+ \[
+ (a, []) \rightarrow \mkeps \; a.
+ \]
+ Also by inductive hypothesis
+ \[
+ (b, c::s) \rightarrow \inj\; b \; c \; v_b
+ \]
+ holds.
+ In addition, as
+ $\Right \;v_b$ instead of $\Left \ldots$ is
+ the POSIX value for $v$, it must be the case
+ that $s \notin L \;( (a\backslash c)\cdot b)$.
+ This tells us that
+ \[
+ \nexists s_3 \; s_4.
+ s_3 @s_4 = s \land s_3 \in L \; (a\backslash c)
+ \land s_4 \in L \; b
+ \]
+ which translates to
+ \[
+ \nexists s_3 \; s_4. \; s_3 \neq [] \land
+ s_3 @s_4 = c::s \land s_3 \in L \; a
+ \land s_4 \in L \; b.
+ \]
+ (Which basically says there cannot be a longer
+ initial split for $s$ other than the empty string.)
+ Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
+ as the POSIX value for $a\cdot b$.
+ \end{itemize}
+ The star case can be proven similarly.
+\end{proof}
+\noindent
+Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
+by following the procedure outlined in the diagram \ref{graph:inj},
and taking into consideration the possibility of a non-match,
-we have a lexer with the following recursive definition:
+a lexer can be built with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
-$\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
-$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
-& & $\quad \None \implies \None$\\
-& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
+ $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
+ $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
+ & & $\quad \phantom{\mid}\; \None \implies \None$\\
+ & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
- \noindent
- The central property of the $\lexer$ is that it gives the correct result by
- $\POSIX$ standards:
- \begin{theorem}
- The $\lexer$ based on derivatives and injections is both sound and complete:
- \begin{tabular}{lcl}
- $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
- $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
- \end{tabular}
- \end{theorem}
-
-
- \begin{proof}
- By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
- The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
- by lemma \ref{injPosix}.
- \end{proof}
-
-
-We now give a pictorial view of the algorithm (
-For convenience, we employ the following notations: the regular
-expression we start with is $r_0$, and the given string $s$ is composed
-of characters $c_0 c_1 \ldots c_{n-1}$. The
-values built incrementally by \emph{injecting} back the characters into the
-earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
-are always in the same subscript, i.e. $\vdash v_i : r_i$):
-
-\begin{ceqn}
-\begin{equation}\label{graph:2}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
+\noindent
+The central property of the $\lexer$ is that it gives the correct result by
+$\POSIX$ standards:
+\begin{theorem}
+ The $\lexer$ based on derivatives and injections is correct:
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
+ $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
+ \end{tabular}
+ \end{center}
+\end{theorem}
+\begin{proof}
+By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
+The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
+by lemma \ref{injPosix}.
+\end{proof}
\noindent
As we did earlier in this chapter on the matcher, one can
introduce simplification on the regex.
-However, now we need to do a backward phase and make sure
+However, now one needs to do a backward phase and make sure
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection
@@ -845,18 +965,17 @@
It can be achieved by recording some extra rectification functions
during the derivatives step, and applying these rectifications in
each run during the injection phase.
-
-\ChristianComment{Do I introduce the lexer with rectification here?}
-And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although it is much harder
+With extra care
+one can show that POSIXness will not be affected---although it is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
-However, even with these simplification rules, we could still end up in
-trouble, when we encounter cases that require more involved and aggressive
-simplifications.
-\section{A Case Requring More Aggressive Simplification}
+However, with all the simplification rules allowed
+in an injection-based lexer, one could still end up in
+trouble, when cases that require more involved and aggressive
+simplifications arise.
+\section{A Case Requring More Aggressive Simplifications}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
@@ -876,42 +995,30 @@
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
\end{figure}\label{fig:BetterWaterloo}
-That is because our lexing algorithm currently keeps a lot of
+That is because Sulzmann and Lu's
+injection-based lexing algorithm keeps a lot of
"useless" values that will not be used.
These different ways of matching will grow exponentially with the string length.
+Take
+\[
+ r= (a^*\cdot a^*)^* \quad and \quad
+ s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
+\]
+as an example.
+This is a highly ambiguous regular expression, with
+many ways to split up the string into multiple segments for
+different star iteratioins,
+and each segment will have multiple ways of splitting between
+the two $a^*$ sub-expressions.
+It is not surprising there are exponentially many
+distinct lexical values
+for such a pair of regular expression and string.
+A lexer without a good enough strategy to
+deduplicate will naturally
+have an exponential runtime on ambiguous regular expressions.
-For $r= (a^*\cdot a^*)^*$ and
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
-if we do not allow any empty iterations in its lexical values,
-there will be $n - 1$ "splitting points" on $s$ we can independently choose to
-split or not so that each sub-string
-segmented by those chosen splitting points will form different iterations.
-For example when $n=4$, we give out a few of the many possibilities of splitting:
-\begin{center}
-\begin{tabular}{lcr}
-$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
-$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\, v_{iteration \,aaa}]$ (two iterations)\\
-$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\, v_{iteration \, aa}]$ (two iterations)\\
-$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
-$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\, v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
- & $\textit{etc}.$ &
- \end{tabular}
-\end{center}
-\noindent
-And for each iteration, there are still multiple ways to split
-between the two $a^*$s.
-It is not surprising there are exponentially many lexical values
-that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$ and
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-A lexer to keep all the possible values will naturally
-have an exponential runtime on ambiguous regular expressions.
-With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
-of a match. This means Sulzmann and Lu's injection-based algorithm
- exponential by nature.
Somehow one has to make sure which
lexical values are $\POSIX$ and must be kept in a lexing algorithm.
-
-
For example, the above $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.