ChengsongTanPhdThesis/Chapters/Inj.tex
changeset 637 e3752aac8ec2
parent 628 7af4e2420a8c
child 638 dd9dde2d902b
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Fri Dec 23 13:27:56 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Sat Dec 24 11:55:04 2022 +0000
@@ -11,8 +11,8 @@
 In this chapter, we define the basic notions 
 for regular languages and regular expressions.
 This is essentially a description in ``English''
-of our formalisation in Isabelle/HOL.
-We also give the definition of what $\POSIX$ lexing means, 
+the functions and datatypes of our formalisation in Isabelle/HOL.
+We also define what $\POSIX$ lexing means, 
 followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
 that produces the output conforming
 to the $\POSIX$ standard\footnote{In what follows 
@@ -26,7 +26,7 @@
 \section{Basic Concepts}
 Formal language theory usually starts with an alphabet 
 denoting a set of characters.
-Here we just use the datatype of characters from Isabelle,
+Here we use the datatype of characters from Isabelle,
 which roughly corresponds to the ASCII characters.
 In what follows, we shall leave the information about the alphabet
 implicit.
@@ -40,9 +40,9 @@
 where $c$ is a variable ranging over characters.
 The $::$ stands for list cons and $[]$ for the empty
 list.
-A singleton list is sometimes written as $[c]$ for brevity.
+For brevity, a singleton list is sometimes written as $[c]$.
 Strings can be concatenated to form longer strings in the same
-way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
+way we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise 
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
@@ -121,7 +121,7 @@
 \end{tabular}	
 \end{center}
 \noindent
-and in the end test whether the set
+and in the end, test whether the set
 has the empty string.\footnote{We use the infix notation $A\backslash c$
 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
 on languages rather than regular expressions.}
@@ -156,15 +156,18 @@
 \begin{itemize}
 \item{$\subseteq$}:\\
 The set 
-\[ \{s \mid c :: s \in A*\} \]
+\[ S_1 = \{s \mid c :: s \in A*\} \]
 is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
-because whenever you have a string starting with a character 
-in the language of a Kleene star $A*$, 
-then that character together with some sub-string
-immediately after it will form the first iteration, 
-and the rest of the string will 
-be still in $A*$.
+\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
+This is because for any string $c::s$ satisfying $c::s \in A*$,
+%whenever you have a string starting with a character 
+%in the language of a Kleene star $A*$, 
+%then that
+the character $c$, together with a prefix of $s$
+%immediately after $c$ 
+forms the first iteration of $A*$, 
+and the rest of the $s$ is also $A*$.
+This coincides with the definition of $S_2$.
 \item{$\supseteq$}:\\
 Note that
 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
@@ -194,7 +197,7 @@
 \]
 \noindent
 We call them basic because we will introduce
-additional constructors in later chapters such as negation
+additional constructors in later chapters, such as negation
 and bounded repetitions.
 We use $\ZERO$ for the regular expression that
 matches no string, and $\ONE$ for the regular
@@ -223,7 +226,7 @@
 \noindent
 %Now with language derivatives of a language and regular expressions and
 %their language interpretations in place, we are ready to define derivatives on regular expressions.
-With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
+With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
 We do so by first introducing what properties it should satisfy.
 
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
@@ -380,8 +383,8 @@
 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
 \end{property}
 \noindent
-Next we give the recursive definition of derivative on
-regular expressions, so that it satisfies the properties above.
+Next, we give the recursive definition of derivative on
+regular expressions so that it satisfies the properties above.
 The derivative function, written $r\backslash c$, 
 takes a regular expression $r$ and character $c$, and
 returns a new regular expression representing
@@ -441,7 +444,7 @@
 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
 \]
 Again,
-the structure is the same as the language derivative of Kleene star: 
+the structure is the same as the language derivative of the Kleene star: 
 \[
 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
 \]
@@ -522,8 +525,8 @@
 \end{definition}
 
 \noindent
-Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
-this algorithm presented graphically is as follows:
+Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, 
+this algorithm, presented graphically, is as follows:
 
 \begin{equation}\label{matcher}
 \begin{tikzcd}
@@ -540,7 +543,7 @@
 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
-	By induction on $s$ using property of derivatives:
+	By induction on $s$ using the property of derivatives:
 	lemma \ref{derDer}.
 \end{proof}
 \begin{figure}
@@ -566,7 +569,7 @@
 the algorithm can be excruciatingly slow, as shown in 
 \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
-Around two dozens characters
+Around two dozen characters
 would already ``explode'' the matcher with the regular expression 
 $(a^*)^*b$.
 To improve this situation, we need to introduce simplification
@@ -628,8 +631,8 @@
 is now ``tame''  in terms of the length of inputs,
 as shown in Figure \ref{BetterMatcher}.
 
-So far the story is use Brzozowski derivatives and
-simplify as much as possible and at the end test
+So far, the story is use Brzozowski derivatives and
+simplify as much as possible, and at the end test
 whether the empty string is recognised 
 by the final derivative.
 But what if we want to 
@@ -727,7 +730,7 @@
 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
 and the values both flatten to $abc$.
 Lexers therefore have to disambiguate and choose only
-one of the values be generated. $\POSIX$ is one of the
+one of the values to be generated. $\POSIX$ is one of the
 disambiguation strategies that is widely adopted.
 
 Ausaf et al.\parencite{AusafDyckhoffUrban2016} 
@@ -796,10 +799,10 @@
 %\end{tikzpicture}
 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
 %\end{figure}
-The above $\POSIX$ rules follows the intuition described below: 
+The above $\POSIX$ rules follow the intuition described below: 
 \begin{itemize}
 	\item (Left Priority)\\
-		Match the leftmost regular expression when multiple options of matching
+		Match the leftmost regular expression when multiple options for matching
 		are available. See P+L and P+R where in P+R $s$ cannot
 		be in the language of $L \; r_1$.
 	\item (Maximum munch)\\
@@ -809,12 +812,12 @@
 		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
 		Then the split that matches a longer string for the first part
 		$r_{part1}$ is preferred by this maximum munch rule.
-		This is caused by the side-condition 
+		The side-condition 
 		\begin{center}
 		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
 		\end{center}
-		in PS.
+		in PS causes this.
 		%(See
 		%\ref{munch} for an illustration).
 \end{itemize}
@@ -883,8 +886,8 @@
 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
 \]
 cannot possibly form a $\POSIX$ value either, because
-by definition there is a candidate
-with longer initial string
+by definition, there is a candidate
+with a longer initial string
 $s_1$. Therefore, we know that the POSIX
 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
 $s$ must have the 
@@ -900,8 +903,8 @@
 is the same as $\Seq(v_1, v_2)$. 
 \end{proof}
 \noindent
-We have now defined what a POSIX value is and shown that it is unique,
-the problem is to generate
+We have now defined what a POSIX value is and shown that it is unique.
+The problem is to generate
 such a value in a lexing algorithm using derivatives.
 
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
@@ -963,7 +966,7 @@
 \noindent
 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
 in reverse order as they were chopped off in the derivative phase.
-The fucntion for this is called $\inj$. This function 
+The function for this is called $\inj$. This function 
 operates on values, unlike $\backslash$ which operates on regular expressions.
 In the diagram below, $v_i$ stands for the (POSIX) value 
 for how the regular expression 
@@ -985,9 +988,9 @@
 	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
 	$c_1$, and so on. These are the same operations as they have appeared in the matcher
 	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
-	then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
-	the empty string , by always selecting the leftmost 
-	nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they 
+	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
+	the empty string, by always selecting the leftmost 
+	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they 
 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
 \end{figure}
 \noindent
@@ -1020,7 +1023,7 @@
 \noindent 
 The function recurses on 
 the shape of regular
-expressionsw  and values.
+expressions and values.
 Intuitively, each clause analyses 
 how $r_i$ could have transformed when being 
 derived by $c$, identifying which subpart
@@ -1028,13 +1031,13 @@
 to inject the character back into.
 Once the character is
 injected back to that sub-value; 
-$\inj$ assembles all parts together
+$\inj$ assembles all parts
 to form a new value.
 
 For instance, the last clause is an
 injection into a sequence value $v_{i+1}$
 whose second child
-value is a star, and the shape of the 
+value is a star and the shape of the 
 regular expression $r_i$ before injection 
 is a star.
 We therefore know 
@@ -1057,7 +1060,7 @@
 \]
 Finally, 
 $\inj \; r \;c \; v$ is prepended
-to the previous list of iterations, and then
+to the previous list of iterations and then
 wrapped under the $\Stars$
 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
 
@@ -1088,7 +1091,7 @@
 				\end{tabular}
 			\end{center}
 			We know that there exists a unique pair of
-			$s_a$ and $s_b$ satisfaying	
+			$s_a$ and $s_b$ satisfying	
 				$(a \backslash c, s_a) \rightarrow v_a$,
 				$(b , s_b) \rightarrow v_b$, and
 				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
@@ -1113,7 +1116,7 @@
 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
 				\end{tabular}
 			\end{center}
-			With a similar reasoning, 
+			With similar reasoning, 
 
 			\[
 				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
@@ -1133,7 +1136,7 @@
 			\[
 				(a, []) \rightarrow \mkeps \; a.
 			\]
-			Also by inductive hypothesis
+			Also, by inductive hypothesis
 			\[
 				(b, c::s) \rightarrow \inj\; b \; c \; v_b
 			\]
@@ -1154,7 +1157,7 @@
 				s_3 @s_4 = c::s  \land s_3 \in L \; a 
 				\land s_4 \in L \; b.
 			\]
-			(Which basically says there cannot be a longer 
+			(Which says there cannot be a longer 
 			initial split for $s$ other than the empty string.)
 			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
 			as the POSIX value for $a\cdot b$.
@@ -1187,20 +1190,21 @@
 \end{theorem} 
 \begin{proof}
 By induction on $s$. $r$ generalising over an arbitrary regular expression.
-The $[]$ case is proven by  an application of lemma \ref{mePosix}, and the inductive case
+The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
 by lemma \ref{injPosix}.
 \end{proof}
 \noindent
 As we did earlier in this chapter with the matcher, one can 
 introduce simplification on the regular expression in each derivative step.
-However, now one needs to do a backward phase and make sure
-the values align with the regular expression.
+However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
+and ensure that
+the values align with the regular expression at each step.
 Therefore one has to
 be careful not to break the correctness, as the injection 
 function heavily relies on the structure of 
 the regular expressions and values being aligned.
 This can be achieved by recording some extra rectification functions
-during the derivatives step, and applying these rectifications in 
+during the derivatives step and applying these rectifications in 
 each run during the injection phase.
 With extra care
 one can show that POSIXness will not be affected
@@ -1223,16 +1227,17 @@
 	\end{tabular}
 \end{center}
 
-However, with the simple-minded simplification rules allowed
-in an injection-based lexer, one can still end up
-with exploding derivatives.
-\section{A Case Requring More Aggressive Simplifications}
+However, one can still end up
+with exploding derivatives,
+even with the simple-minded simplification rules allowed
+in an injection-based lexer.
+\section{A Case Requiring More Aggressive Simplifications}
 For example, when starting with the regular
 expression $(a^* \cdot a^*)^*$ and building just over
 a dozen successive derivatives 
 w.r.t.~the character $a$, one obtains a derivative regular expression
 with millions of nodes (when viewed as a tree)
-even with the simple-minded simplification.
+even with the mentioned simplifications.
 \begin{figure}[H]
 \begin{center}
 \begin{tikzpicture}
@@ -1262,7 +1267,7 @@
 as an example.
 This is a highly ambiguous regular expression, with
 many ways to split up the string into multiple segments for
-different star iteratioins,
+different star iterations,
 and for each segment 
 multiple ways of splitting between 
 the two $a^*$ sub-expressions.
@@ -1331,7 +1336,7 @@
 A lexer without a good enough strategy to 
 deduplicate will naturally
 have an exponential runtime on highly
-ambiguous regular expressions, because there
+ambiguous regular expressions because there
 are exponentially many matches.
 For this particular example, it seems
 that the number of distinct matches growth
@@ -1342,12 +1347,12 @@
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
 \[
 	\Stars\,
-	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]
+	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
 \]
-and at any moment the  subterms in a regular expression
-that will result in a POSIX value is only
+At any moment, the  subterms in a regular expression
+that will potentially result in a POSIX value is only
 a minority among the many other terms,
-and one can remove ones that are absolutely not possible to 
+and one can remove ones that are not possible to 
 be POSIX.
 In the above example,
 \begin{equation}\label{eqn:growth2}
@@ -1370,23 +1375,23 @@
 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
 	\end{tabular}
 \end{center}
-Other terms with an underlying value such as
+Other terms with an underlying value, such as
 \[
 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
 \]
-is simply too hopeless to contribute a POSIX lexical value,
+is too hopeless to contribute a POSIX lexical value,
 and is therefore thrown away.
 
 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
 have come up with some simplification steps, however those steps
-are not yet sufficiently strong so that they achieve the above effects.
-And even with these relatively mild simplifications the proof
-is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
-One would prove something like: 
+are not yet sufficiently strong, so they achieve the above effects.
+And even with these relatively mild simplifications, the proof
+is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
+One would prove something like this: 
 \[
 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
 	\textit{then}\;\; (r, c::s) \rightarrow 
-	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v) 
+	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
 \]
 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
 not only has to return a simplified regular expression,