--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Nov 11 00:23:53 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Sat Nov 12 00:37:23 2022 +0000
@@ -984,9 +984,10 @@
appeared in the string, always preserving POSIXness.}\label{graph:inj}
\end{figure}
\noindent
-$\textit{inj}$ takes three arguments: a regular
+The function $\textit{inj}$ as defined by Sulzmann and Lu
+takes three arguments: a regular
expression ${r_{i}}$, before the character is chopped off,
-a character ${c_{i}}$, the character we want to inject back and
+a character ${c_{i}}$ (the character we want to inject back) and
the third argument $v_{i+1}$, the value we want to inject into.
The result of an application
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
@@ -1010,9 +1011,9 @@
\end{center}
\noindent
-The function does a recursion on
+The function recurses on
the shape of regular
-expression $r_i$ and value $v_{i+1}$.
+expressions and values.
Intuitively, each clause analyses
how $r_i$ could have transformed when being
derived by $c$, identifying which subpart
@@ -1020,7 +1021,7 @@
to inject the character back into.
Once the character is
injected back into that sub-value,
-$\inj$ assembles all things together
+$\inj$ assembles all parts together
to form a new value.
For instance, the last clause is an
@@ -1054,25 +1055,18 @@
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
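+
+To give a more concrete picture, the following Scala sketch spells out the
+clauses of $\inj$, together with the datatypes it needs and Scala versions
+of $\nullable$ and $\mkeps$ from earlier in this chapter. It is meant
+purely as an illustration: the constructor names (\texttt{Rexp},
+\texttt{Val}, \texttt{Sequ}, \texttt{Chr} and so on) are our own choices,
+not part of the formalisation.
+\begin{lstlisting}[language=Scala, numbers=none]
+// regular expressions
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+// values (Sequ and Chr play the roles of the Seq and Char values above)
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// mkeps: the POSIX value for a nullable r matching the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)
+}
+
+// inj: undo one derivative step inside the value
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
+  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
+  case (CHAR(_), Empty) => Chr(c)
+}
+\end{lstlisting}
+\noindent
+The two sequence clauses that match on \texttt{Left} and \texttt{Right}
+correspond to the alternative generated by the derivative of a sequence
+whose first component is nullable.
+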
Recall that lemma
-\ref{mePosix} tells us
-$\mkeps$ always selects the POSIX matching among
-multiple values that flatten to the empty string.
-Now $\inj$ preserves the POSIXness, provided
-the value before injection is POSIX:
+\ref{mePosix} tells us that
+$\mkeps$ always generates the POSIX value.
+The function $\inj$ preserves POSIXness, provided
+the value before injection is POSIX:
\begin{lemma}\label{injPosix}
- If
- \[
- (r \backslash c, s) \rightarrow v
- \]
- then
- \[
- (r, c :: s) \rightarrow (\inj r \; c\; v).
- \]
+  If $(r \backslash c, s) \rightarrow v$,
+  then $(r, c :: s) \rightarrow (\inj \; r \; c \; v)$.
\end{lemma}
\begin{proof}
By induction on $r$.
- The involved cases are sequence and star.
- When $r = a \cdot b$, there could be
+ The non-trivial cases are sequence and star.
+ When $r = a \cdot b$, there can be
three cases for the value $v$ satisfying $\vdash v:(a\cdot b)\backslash c$.
We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
case.
@@ -1161,10 +1155,8 @@
The star case can be proven similarly.
\end{proof}
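+\noindent
+To see the lemma on a small instance: let $r = a \cdot b$ and $c = a$.
+Then $r\backslash a = \ONE \cdot b$ and
+$(\ONE \cdot b, \; [b]) \rightarrow \Seq \; \Empty \; (\Char \; b)$,
+and the lemma gives
+\[
+  (a \cdot b, \; [a, b]) \rightarrow
+  \inj \; (a \cdot b) \; a \; (\Seq \; \Empty \; (\Char \; b))
+  = \Seq \; (\Char \; a) \; (\Char \; b),
+\]
+which is indeed the POSIX value for matching $[a, b]$ with $a \cdot b$.
+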
\noindent
-Putting all the functions $\inj$, $\mkeps$, $\backslash$ together
-by following the procedure outlined in the diagram \ref{graph:inj},
-and taking into consideration the possibility of a non-match,
-a lexer can be built with the following recursive definition:
+Putting the functions $\backslash$, $\mkeps$ and $\inj$ together, Sulzmann and Lu
+obtained the following lexing algorithm, outlined in the diagram \ref{graph:inj}:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
@@ -1174,8 +1166,9 @@
\end{tabular}
\end{center}
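+\noindent
+Building on the Scala sketch of $\inj$ and $\mkeps$ given earlier, and
+writing \texttt{der} for the derivative operation $\backslash$, this
+definition can be rendered as follows (again only as an illustrative
+sketch, not the formalised definition):
+\begin{lstlisting}[language=Scala, numbers=none]
+// derivative of a regular expression w.r.t. a character
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// the lexer: build derivatives forwards, inject characters backwards
+def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
+  case Nil => if (nullable(r)) Some(mkeps(r)) else None
+  case c :: cs => lexer(der(c, r), cs) match {
+    case None => None
+    case Some(v) => Some(inj(r, c, v))
+  }
+}
+\end{lstlisting}
+\noindent
+For instance, \texttt{lexer(SEQ(CHAR('a'), CHAR('b')), "ab".toList)}
+evaluates to \texttt{Some(Sequ(Chr('a'), Chr('b')))}.
+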
\noindent
-The central property of the $\lexer$ is that it gives the correct result by
-$\POSIX$ standards:
+The central property of the $\lexer$ is that it gives the correct result
+according to the POSIX rules:
\begin{theorem}\label{lexerCorrectness}
The $\lexer$ based on derivatives and injections is correct:
\begin{center}
@@ -1186,40 +1179,53 @@
\end{center}
\end{theorem}
\begin{proof}
-By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
-The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case
+By induction on $s$, generalising over an arbitrary regular expression $r$.
+The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}
\noindent
-As we did earlier in this chapter on the matcher, one can
-introduce simplification on the regular expression.
+As we did earlier in this chapter with the matcher, one can
+introduce simplification on the regular expression in each derivative step.
However, now one needs to do a backward phase and make sure
-the values align with the regular expressions.
+the values align with the regular expression.
Therefore one has to
be careful not to break the correctness, as the injection
-function heavily relies on the structure of the regular expressions and values
-being correct and matching each other.
-It can be achieved by recording some extra rectification functions
+function heavily relies on the structure of
+the regular expressions and values being aligned.
+This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications
during the injection phase.
With extra care
-one can show that POSIXness will not be affected---although it is much harder
-to establish.
-Some initial results in this regard have been
-obtained in \cite{AusafDyckhoffUrban2016}.
+one can show that POSIXness will not be affected
+by the simplifications listed below \cite{AusafDyckhoffUrban2016}.
+\begin{center}
+ \begin{tabular}{lcl}
+	$\simp \; (r_1 \cdot r_2)$ & $\dn$ & 
+ $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
+ & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
+ & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
+ & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
+ & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
+ & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
+	$\simp \; (r_1 + r_2)$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
+ & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
+ & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
+ & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
+	$\simp \; r$ & $\dn$ & $r \quad\quad (\textit{otherwise})$
+ \end{tabular}
+\end{center}
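+\noindent
+For illustration, the rules in the table correspond to the following
+Scala sketch on top of the \texttt{Rexp} datatype used earlier (the
+function name \texttt{simp} and the code are our own):
+\begin{lstlisting}[language=Scala, numbers=none]
+// simple-minded simplification, mirroring the clauses above
+def simp(r: Rexp): Rexp = r match {
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, r2s) => r2s
+    case (r1s, ONE) => r1s
+    case (r1s, r2s) => SEQ(r1s, r2s)
+  }
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, r2s) => r2s
+    case (r1s, ZERO) => r1s
+    case (r1s, r2s) => ALT(r1s, r2s)
+  }
+  case _ => r
+}
+\end{lstlisting}
+\noindent
+One way to realise the rectification idea mentioned above is to let the
+simplification return, besides the simplified regular expression, a
+function that repairs a value for the simplified regular expression into
+a value for the original one. The following sketch shows this for the
+alternative clauses only; the sequence clauses are handled analogously
+(the names are again our own):
+\begin{lstlisting}[language=Scala, numbers=none]
+// rectification functions: repair a value for the simplified
+// regular expression into one for the original regular expression
+def F_ID(v: Val): Val = v
+def F_LEFT(f: Val => Val): Val => Val = v => Left(f(v))
+def F_RIGHT(f: Val => Val): Val => Val = v => Right(f(v))
+def F_ALT(f1: Val => Val, f2: Val => Val): Val => Val = {
+  case Left(v)  => Left(f1(v))
+  case Right(v) => Right(f2(v))
+}
+
+def simpRect(r: Rexp): (Rexp, Val => Val) = r match {
+  case ALT(r1, r2) =>
+    val (r1s, f1) = simpRect(r1)
+    val (r2s, f2) = simpRect(r2)
+    (r1s, r2s) match {
+      case (ZERO, _) => (r2s, F_RIGHT(f2))   // left branch vanished
+      case (_, ZERO) => (r1s, F_LEFT(f1))    // right branch vanished
+      case _         => (ALT(r1s, r2s), F_ALT(f1, f2))
+    }
+  case _ => (r, F_ID)
+}
+\end{lstlisting}
+\noindent
+A simplifying lexer would then compute \texttt{simpRect(der(c, r))},
+recurse on the simplified regular expression, and apply the returned
+rectification function to the value before injecting \texttt{c} back.
+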
-However, with all the simplification rules allowed
-in an injection-based lexer, one could still end up in
-trouble, when cases that require more involved and aggressive
-simplifications arise.
+However, with the simple-minded simplification rules allowed
+in an injection-based lexer, one can still end up
+with exploding derivatives.
\section{A Case Requiring More Aggressive Simplifications}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building just over
a dozen successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with millions of nodes (when viewed as a tree)
-even with simplification, which is not much better compared
-with the naive version without any simplifications:
+even with the simple-minded simplification.
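+
+\noindent
+This blow-up can be observed with a few lines of Scala on top of the
+sketches above, where \texttt{size} is a node-count function we add for
+the experiment (the exact numbers depend on the details of the
+simplification used):
+\begin{lstlisting}[language=Scala, numbers=none]
+// number of nodes of a regular expression, viewed as a tree
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+// derivatives interleaved with the simple-minded simplification
+def dersSimp(s: List[Char], r: Rexp): Rexp =
+  s.foldLeft(r)((r1, c) => simp(der(c, r1)))
+
+val r = STAR(SEQ(STAR(CHAR('a')), STAR(CHAR('a'))))
+for (n <- 1 to 12)
+  println(n + ": " + size(dersSimp(List.fill(n)('a'), r)))
+\end{lstlisting}
+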
\begin{figure}[H]
\begin{center}
\begin{tikzpicture}
@@ -1241,7 +1247,7 @@
injection-based lexing algorithm keeps a lot of
"useless" values that will not be used.
These different ways of matching will grow exponentially with the string length.
-Take
+Consider the case
\[
r= (a^*\cdot a^*)^* \quad and \quad
s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
@@ -1256,11 +1262,11 @@
When $n$ is equal to $1$, there are two lexical values for
the match:
\[
- \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (value 1)
+	 \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v_1)
\]
and
\[
- \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (value 2)
+	 \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v_2)
\]
The derivative of $\derssimp \;s \; r$ is
\[
@@ -1272,25 +1278,25 @@
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
\]
-,
+
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
\]
-,
+
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
\]
-,
+
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
\]
-,
+
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
]
\]
-,
+
\[
\Stars \; [
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
@@ -1315,10 +1321,14 @@
It is not surprising that there are exponentially many
distinct lexical values that cannot be eliminated by
the simple-minded simplification of $\derssimp$.
-
A lexer without a good enough strategy to
deduplicate will naturally
-have an exponential runtime on ambiguous regular expressions.
+have an exponential runtime on highly
+ambiguous regular expressions, because there
+are exponentially many matches.
+For this particular example, the number of distinct
+matches seems to grow in proportion to the Catalan numbers,
+that is $(2n)!/(n!\,(n+1)!)$, where $n$ is the length of the input string.
On the other hand, the
$\POSIX$ value for $r= (a^*\cdot a^*)^*$ and
@@ -1360,7 +1370,7 @@
is simply too hopeless to contribute a POSIX lexical value,
and is therefore thrown away.
-Ausaf and Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}
+Ausaf et al. \cite{AusafDyckhoffUrban2016}
have come up with some simplification steps; however, those steps
are not yet strong enough to achieve the above effects.
And even with these relatively mild simplifications the proof
@@ -1379,7 +1389,8 @@
underlying the simplified regular expression
to the unsimplified one.
-We therefore choose a slightly different approach to
+We therefore choose a slightly different approach
+also described by Sulzmann and Lu to
get better simplifications, which uses
some augmented data structures compared to
plain regular expressions.
@@ -1388,8 +1399,6 @@
With annotated regular expressions,
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a
second phase altogether.
-In the meantime, we can also ensure that simplifications
-are easily handled without breaking the correctness of the algorithm.
We introduce this new datatype and the
corresponding algorithm in the next chapter.