--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Nov 11 00:23:53 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Sat Nov 12 00:37:23 2022 +0000
@@ -984,9 +984,10 @@
appeared in the string, always preserving POSIXness.}\label{graph:inj}
\end{figure}
\noindent
-$\textit{inj}$ takes three arguments: a regular
+The function $\textit{inj}$ as defined by Sulzmann and Lu
+takes three arguments: a regular
expression ${r_{i}}$, before the character is chopped off,
-a character ${c_{i}}$, the character we want to inject back and
+a character ${c_{i}}$ (the character we want to inject back) and
the third argument $v_{i+1}$, the value we want to inject into.
The result of an application
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
@@ -1010,9 +1011,9 @@
\end{center}
\noindent
-The function does a recursion on
+The function recurses on
the shape of regular
-expression $r_i$ and value $v_{i+1}$.
+expressions and values.
Intuitively, each clause analyses
how $r_i$ could have transformed when being
derived by $c$, identifying which subpart
@@ -1020,7 +1021,7 @@
to inject the character back into.
Once the character is
injected back into that sub-value,
-$\inj$ assembles all things together
+$\inj$ assembles all parts together
to form a new value.
For instance, the last clause is an
@@ -1054,25 +1055,18 @@
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
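+
+To give a more concrete picture, the following Scala sketch spells out the
+clauses of $\inj$, together with the datatypes it needs and Scala versions
+of $\nullable$ and $\mkeps$ from earlier in this chapter. It is meant
+purely as an illustration: the constructor names (\texttt{Rexp},
+\texttt{Val}, \texttt{Sequ}, \texttt{Chr} and so on) are our own choices,
+not part of the formalisation.
+\begin{lstlisting}[language=Scala, numbers=none]
+// regular expressions
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// values (Sequ and Chr play the roles of the Seq and Char values above)
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// mkeps: the POSIX value for a nullable r matching the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)
+}
+
+// inj: undo one derivative step inside the value
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
+  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
+  case (CHAR(_), Empty) => Chr(c)
+}
+\end{lstlisting}
+\noindent
+The two sequence clauses that match on \texttt{Left} and \texttt{Right}
+correspond to the alternative generated by the derivative of a sequence
+whose first component is nullable.
+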
Recall that lemma
-\ref{mePosix} tells us
-$\mkeps$ always selects the POSIX matching among
-multiple values that flatten to the empty string.
-Now $\inj$ preserves the POSIXness, provided
-the value before injection is POSIX:
+\ref{mePosix} tells us that
+$\mkeps$ always generates the POSIX value.
+The function $\inj$ preserves POSIXness, provided
+the value before injection is POSIX:
\begin{lemma}\label{injPosix}
- If
- \[
- (r \backslash c, s) \rightarrow v
- \]
- then
- \[
- (r, c :: s) \rightarrow (\inj r \; c\; v).
- \]
+  If $(r \backslash c, s) \rightarrow v$,
+  then $(r, c :: s) \rightarrow (\inj \; r \; c \; v)$.
\end{lemma}
\begin{proof}
By induction on $r$.
- The involved cases are sequence and star.
- When $r = a \cdot b$, there could be
+ The non-trivial cases are sequence and star.
+ When $r = a \cdot b$, there can be
three cases for the value $v$ satisfying $\vdash v:(a\cdot b)\backslash c$.
We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
case.
@@ -1161,10 +1155,8 @@
The star case can be proven similarly.
\end{proof}
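+\noindent
+To see the lemma on a small instance: let $r = a \cdot b$ and $c = a$.
+Then $r\backslash a = \ONE \cdot b$ and
+$(\ONE \cdot b, \; [b]) \rightarrow \Seq \; \Empty \; (\Char \; b)$,
+and the lemma gives
+\[
+  (a \cdot b, \; [a, b]) \rightarrow
+  \inj \; (a \cdot b) \; a \; (\Seq \; \Empty \; (\Char \; b))
+  = \Seq \; (\Char \; a) \; (\Char \; b),
+\]
+which is indeed the POSIX value for matching $[a, b]$ with $a \cdot b$.
+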
\noindent
-Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
-by following the procedure outlined in the diagram \ref{graph:inj},
-and taking into consideration the possibility of a non-match,
-a lexer can be built with the following recursive definition:
+Putting the functions $\backslash$, $\mkeps$ and $\inj$ together, Sulzmann and Lu
+obtained the following lexing algorithm, outlined in the diagram \ref{graph:inj}:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
@@ -1174,8 +1166,9 @@
\end{tabular}
\end{center}
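+\noindent
+Building on the Scala sketch of $\inj$ and $\mkeps$ given earlier, and
+writing \texttt{der} for the derivative operation $\backslash$, this
+definition can be rendered as follows (again only as an illustrative
+sketch, not the formalised definition):
+\begin{lstlisting}[language=Scala, numbers=none]
+// derivative of a regular expression w.r.t. a character
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// the lexer: build derivatives forwards, inject characters backwards
+def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
+  case Nil => if (nullable(r)) Some(mkeps(r)) else None
+  case c :: cs => lexer(der(c, r), cs) match {
+    case None => None
+    case Some(v) => Some(inj(r, c, v))
+  }
+}
+\end{lstlisting}
+\noindent
+For instance, \texttt{lexer(SEQ(CHAR('a'), CHAR('b')), "ab".toList)}
+evaluates to \texttt{Some(Sequ(Chr('a'), Chr('b')))}.
+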
\noindent
-The central property of the $\lexer$ is that it gives the correct result by
-$\POSIX$ standards:
+The central property of the $\lexer$ is that it gives the correct result
+according to the POSIX rules:
\begin{theorem}\label{lexerCorrectness}
The $\lexer$ based on derivatives and injections is correct:
\begin{center}
@@ -1186,40 +1179,53 @@
\end{center}
\end{theorem}
\begin{proof}
-By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
-The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
+By induction on $s$, generalising over an arbitrary regular expression $r$.
+The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}
\noindent
-As we did earlier in this chapter on the matcher, one can
-introduce simplification on the regular expression.
+As we did earlier in this chapter with the matcher, one can
+introduce simplification on the regular expression in each derivative step.
However, now one needs to do a backward phase and make sure
-the values align with the regular expressions.
+the values align with the regular expression.
Therefore one has to
be careful not to break the correctness, as the injection
-function heavily relies on the structure of the regular expressions and values
-being correct and matching each other.
-It can be achieved by recording some extra rectification functions
+function heavily relies on the structure of
+the regular expressions and values being aligned.
+This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications
during the injection phase.
With extra care
-one can show that POSIXness will not be affected---although it is much harder
-to establish.
-Some initial results in this regard have been
-obtained in \cite{AusafDyckhoffUrban2016}.
+one can show that POSIXness will not be affected
+by the simplifications listed below \cite{AusafDyckhoffUrban2016}.
+\begin{center}
+ \begin{tabular}{lcl}
+	$\simp \; (r_1 \cdot r_2)$ & $\dn$ & 
+ $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
+ & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
+ & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
+ & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
+ & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
+ & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
+	$\simp \; (r_1 + r_2)$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
+ & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
+ & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
+ & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
+	$\simp \; r$ & $\dn$ & $r \quad\quad (\textit{otherwise})$
+ \end{tabular}
+\end{center}
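+\noindent
+For illustration, the rules in the table correspond to the following
+Scala sketch on top of the \texttt{Rexp} datatype used earlier (the
+function name \texttt{simp} and the code are our own):
+\begin{lstlisting}[language=Scala, numbers=none]
+// simple-minded simplification, mirroring the clauses above
+def simp(r: Rexp): Rexp = r match {
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, r2s) => r2s
+    case (r1s, ONE) => r1s
+    case (r1s, r2s) => SEQ(r1s, r2s)
+  }
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, r2s) => r2s
+    case (r1s, ZERO) => r1s
+    case (r1s, r2s) => ALT(r1s, r2s)
+  }
+  case _ => r
+}
+\end{lstlisting}
+\noindent
+One way to realise the rectification idea mentioned above is to let the
+simplification return, besides the simplified regular expression, a
+function that repairs a value for the simplified regular expression into
+a value for the original one. The following sketch shows this for the
+alternative clauses only; the sequence clauses are handled analogously
+(the names are again our own):
+\begin{lstlisting}[language=Scala, numbers=none]
+// rectification functions: repair a value for the simplified
+// regular expression into one for the original regular expression
+def F_ID(v: Val): Val = v
+def F_LEFT(f: Val => Val): Val => Val = v => Left(f(v))
+def F_RIGHT(f: Val => Val): Val => Val = v => Right(f(v))
+def F_ALT(f1: Val => Val, f2: Val => Val): Val => Val = {
+  case Left(v)  => Left(f1(v))
+  case Right(v) => Right(f2(v))
+}
+
+def simpRect(r: Rexp): (Rexp, Val => Val) = r match {
+  case ALT(r1, r2) =>
+    val (r1s, f1) = simpRect(r1)
+    val (r2s, f2) = simpRect(r2)
+    (r1s, r2s) match {
+      case (ZERO, _) => (r2s, F_RIGHT(f2))   // left branch vanished
+      case (_, ZERO) => (r1s, F_LEFT(f1))    // right branch vanished
+      case _         => (ALT(r1s, r2s), F_ALT(f1, f2))
+    }
+  case _ => (r, F_ID)
+}
+\end{lstlisting}
+\noindent
+A simplifying lexer would then compute \texttt{simpRect(der(c, r))},
+recurse on the simplified regular expression, and apply the returned
+rectification function to the value before injecting \texttt{c} back.
+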
-However, with all the simplification rules allowed
-in an injection-based lexer, one could still end up in
-trouble, when cases that require more involved and aggressive
-simplifications arise.
+However, with the simple-minded simplification rules allowed
+in an injection-based lexer, one can still end up
+with exploding derivatives.
\section{A Case Requiring More Aggressive Simplifications}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building just over
a dozen successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with millions of nodes (when viewed as a tree)
-even with simplification, which is not much better compared
-with the naive version without any simplifications:
+even with the simple-minded simplification.
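+
+\noindent
+This blow-up can be observed with a few lines of Scala on top of the
+sketches above, where \texttt{size} is a node-count function we add for
+the experiment (the exact numbers depend on the details of the
+simplification used):
+\begin{lstlisting}[language=Scala, numbers=none]
+// number of nodes of a regular expression, viewed as a tree
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+// derivatives interleaved with the simple-minded simplification
+def dersSimp(s: List[Char], r: Rexp): Rexp =
+  s.foldLeft(r)((r1, c) => simp(der(c, r1)))
+
+val r = STAR(SEQ(STAR(CHAR('a')), STAR(CHAR('a'))))
+for (n <- 1 to 12)
+  println(n + ": " + size(dersSimp(List.fill(n)('a'), r)))
+\end{lstlisting}
+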
\begin{figure}[H]
\begin{center}
\begin{tikzpicture}
@@ -1241,7 +1247,7 @@
injection-based lexing algorithm keeps a lot of
"useless" values that will not be used.
These different ways of matching will grow exponentially with the string length.
-Take
+Consider the case
\[
r= (a^*\cdot a^*)^* \quad and \quad
s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
@@ -1256,11 +1262,11 @@
When $n$ is equal to $1$, there are two lexical values for
the match:
\[
- \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (value 1)
+	 \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v_1)
\]
and
\[
- \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (value 2)
+	 \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v_2)
\]
The derivative of $\derssimp \;s \; r$ is
\[
@@ -1272,25 +1278,25 @@
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
\]
-,
+
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
\]
-,
+
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
\]
-,
+
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
\]
-,
+
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
]
\]
-,
+
\[
\Stars \; [
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
@@ -1315,10 +1321,14 @@
It is not surprising that there are exponentially many
distinct lexical values that cannot be eliminated by
the simple-minded simplification of $\derssimp$.
-
A lexer without a good enough strategy to
deduplicate will naturally
-have an exponential runtime on ambiguous regular expressions.
+have an exponential runtime on highly
+ambiguous regular expressions, because there
+are exponentially many matches.
+For this particular example, the number of distinct
+matches seems to grow in proportion to the Catalan numbers,
+that is $(2n)!/(n!\,(n+1)!)$, where $n$ is the length of the input string.
On the other hand, the
$\POSIX$ value for $r= (a^*\cdot a^*)^*$ and
@@ -1360,7 +1370,7 @@
is simply too hopeless to contribute a POSIX lexical value,
and is therefore thrown away.
-Ausaf and Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}
+Ausaf et al. \cite{AusafDyckhoffUrban2016}
have come up with some simplification steps; however, those steps
are not yet strong enough to achieve the above effects.
And even with these relatively mild simplifications the proof
@@ -1379,7 +1389,8 @@
underlying the simplified regular expression
to the unsimplified one.
-We therefore choose a slightly different approach to
+We therefore choose a slightly different approach
+also described by Sulzmann and Lu to
get better simplifications, which uses
some augmented data structures compared to
plain regular expressions.
@@ -1388,8 +1399,6 @@
With annotated regular expressions,
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a
second phase altogether.
-In the meantime, we can also ensure that simplifications
-are easily handled without breaking the correctness of the algorithm.
We introduce this new datatype and the
corresponding algorithm in the next chapter.