lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex

equal deleted inserted replaced

-:4b1149fb5aec
+:c0c1ebe09c7d
 	the empty string , by always selecting the leftmost
 	nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they
 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
 \end{figure}
 \noindent
-$\textit{inj}$ takes three arguments: a regular
+The function $\textit{inj}$ as defined by Sulzmann and Lu
+takes three arguments: a regular
 expression ${r_{i}}$, before the character is chopped off,
-a character ${c_{i}}$, the character we want to inject back and
+a character ${c_{i}}$ (the character we want to inject back) and
 the third argument $v_{i+1}$ the value we want to inject into.
 The result of an application
 $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
 \[
 	(s_i, r_i) \rightarrow v_i
 $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$         & $\dn$  & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
 \end{tabular}
 \end{center}
 \noindent
-The function does a recursion on
+The function recurses on
 the shape of regular
-expression $r_i$ and value $v_{i+1}$.
+expressionsw  and values.
 Intuitively, each clause analyses
 how $r_i$ could have transformed when being
 derived by $c$, identifying which subpart
 of $v_{i+1}$ has the ``hole''
 to inject the character back into.
 Once the character is
 injected back to that sub-value;
-$\inj$ assembles all things together
+$\inj$ assembles all parts together
 to form a new value.
 For instance, the last clause is an
 injection into a sequence value $v_{i+1}$
 whose second child
 to the previous list of iterations, and then
 wrapped under the $\Stars$
 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
 Recall that lemma
-\ref{mePosix} tells us
+\ref{mePosix} tells us that
-$\mkeps$ always selects the POSIX matching among
+$\mkeps$ always generates the POSIX value.
-multiple values that flatten to the empty string.
+The function $\inj$ preserves the POSIXness, provided
-Now $\inj$ preserves the POSIXness, provided
+the value before injection is POSIX, namely
-the value before injection is POSIX:
 \begin{lemma}\label{injPosix}
-	If
+	If$(r \backslash c, s) \rightarrow v $,
-	\[
+	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
-		(r \backslash c, s) \rightarrow v
-	\]
-	then
-	\[
-		(r, c :: s) \rightarrow (\inj r \; c\; v).
-	\]
 \end{lemma}
 \begin{proof}
 	By induction on $r$.
-	The involved cases are sequence and star.
+	The non-trivial cases are sequence and star.
-	When $r = a \cdot b$, there could be
+	When $r = a \cdot b$, there can be
 	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
 	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
 	case.
 	\begin{itemize}
 		\item
 			as the POSIX value for $a\cdot b$.
 	\end{itemize}
 	The star case can be proven similarly.
 \end{proof}
 \noindent
-Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
+Putting all together, Sulzmann and Lu obtained the following algorithm
-by following the procedure outlined in the diagram \ref{graph:inj},
+outlined in the diagram \ref{graph:inj}:
-and taking into consideration the possibility of a non-match,
-a lexer can be built with the following recursive definition:
 \begin{center}
 \begin{tabular}{lcl}
 	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
 	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
 	& & $\quad \phantom{\mid}\; \None \implies \None$\\
 	& & $\quad \mid           \Some(v) \implies \Some(\inj \; r\; c\; v)$
 \end{tabular}
 \end{center}
 \noindent
-The central property of the $\lexer$ is that it gives the correct result by
+The central property of the $\lexer$ is that it gives the correct result
-$\POSIX$ standards:
+according to
+POSIX rules.
 \begin{theorem}\label{lexerCorrectness}
 	The $\lexer$ based on derivatives and injections is correct:
 	\begin{center}
 		\begin{tabular}{lcl}
 			$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
 			$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
 		\end{tabular}
 	\end{center}
 \end{theorem}
 \begin{proof}
-By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
+By induction on $s$. $r$ generalising over an arbitrary regular expression.
-The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
+The $[]$ case is proven by  an application of lemma \ref{mePosix}, and the inductive case
 by lemma \ref{injPosix}.
 \end{proof}
 \noindent
-As we did earlier in this chapter on the matcher, one can
+As we did earlier in this chapter with the matcher, one can
-introduce simplification on the regular expression.
+introduce simplification on the regular expression in each derivative step.
 However, now one needs to do a backward phase and make sure
-the values align with the regular expressions.
+the values align with the regular expression.
 Therefore one has to
 be careful not to break the correctness, as the injection
-function heavily relies on the structure of the regular expressions and values
+function heavily relies on the structure of
-being correct and matching each other.
+the regular expressions and values being aligned.
-It can be achieved by recording some extra rectification functions
+This can be achieved by recording some extra rectification functions
 during the derivatives step, and applying these rectifications in
 each run during the injection phase.
 With extra care
-one can show that POSIXness will not be affected---although it is much harder
+one can show that POSIXness will not be affected
-to establish.
+by the simplifications listed here \cite{AusafDyckhoffUrban2016}.
-Some initial results in this regard have been
+\begin{center}
-obtained in \cite{AusafDyckhoffUrban2016}.
+	\begin{tabular}{lcl}
+		$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
-However, with all the simplification rules allowed
+		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
-in an injection-based lexer, one could still end up in
+					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
-trouble, when cases that require more involved and aggressive
+					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
-simplifications arise.
+					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
+					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
+					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
+		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
+				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
+				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
+				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
+		$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
+	\end{tabular}
+\end{center}
+However, with the simple-minded simplification rules allowed
+in an injection-based lexer, one can still end up
+with exploding derivatives.
 \section{A Case Requring More Aggressive Simplifications}
 For example, when starting with the regular
 expression $(a^* \cdot a^*)^*$ and building just over
 a dozen successive derivatives
 w.r.t.~the character $a$, one obtains a derivative regular expression
 with millions of nodes (when viewed as a tree)
-even with simplification, which is not much better compared
+even with the simple-minded simplification.
-with the naive version without any simplifications:
 \begin{figure}[H]
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 That is because Sulzmann and Lu's
 injection-based lexing algorithm keeps a lot of
 "useless" values that will not be used.
 These different ways of matching will grow exponentially with the string length.
-Take
+Consider the case
 \[
 	r= (a^*\cdot a^*)^* \quad and \quad
 	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
 \]
 as an example.
 multiple ways of splitting between
 the two $a^*$ sub-expressions.
 When $n$ is equal to $1$, there are two lexical values for
 the match:
 \[
-	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (value 1)
+	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
 \]
 and
 \[
-	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (value 2)
+	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
 \]
 The derivative of $\derssimp \;s \; r$ is
 \[
 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
 \]
 correspond to value 1 and value 2, respectively.
 When $n=2$, the number goes up to 7:
 \[
 	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
 \]
-,
 \[
 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
 \]
-,
 \[
 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
 \]
-,
 \[
 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
 \]
-,
 \[
 	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
 		  ]
 \]
-,
 \[
 	\Stars \; [
 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
 	 	   \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
 		  ]
 seven distinct lexical values.
 It is not surprising that there are exponentially many
 distinct lexical values that cannot be eliminated by
 the simple-minded simplification of $\derssimp$.
 A lexer without a good enough strategy to
 deduplicate will naturally
-have an exponential runtime on ambiguous regular expressions.
+have an exponential runtime on highly
+ambiguous regular expressions, because there
+are exponentially many matches.
+For this particular example, it seems
+that the number of distinct matches growth
+speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
 On the other hand, the
 $\POSIX$ value for $r= (a^*\cdot a^*)^*$  and
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
 \[
 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
 \]
 is simply too hopeless to contribute a POSIX lexical value,
 and is therefore thrown away.
-Ausaf and Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}
+Ausaf et al. \cite{AusafDyckhoffUrban2016}
 have come up with some simplification steps, however those steps
 are not yet sufficiently strong so that they achieve the above effects.
 And even with these relatively mild simplifications the proof
 is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
 One would prove something like:
 has been done as a function on values
 showing how one can transform the value
 underlying the simplified regular expression
 to the unsimplified one.
-We therefore choose a slightly different approach to
+We therefore choose a slightly different approach
+also described by Sulzmann and Lu to
 get better simplifications, which uses
 some augmented data structures compared to
 plain regular expressions.
 We call them \emph{annotated}
 regular expressions.
 With annotated regular expressions,
 we can avoid creating the intermediate values $v_1,\ldots v_n$ and a
 second phase altogether.
-In the meantime, we can also ensure that simplifications
-are easily handled without breaking the correctness of the algorithm.
 We introduce this new datatype and the
 corresponding algorithm in the next chapter.

changeset 623	c0c1ebe09c7d
parent 622	4b1149fb5aec
child 626	1c8525061545