--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Tue Jul 05 00:42:06 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Wed Jul 13 08:27:28 2022 +0100
@@ -4,22 +4,32 @@
 
 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
 %and notations we 
-%use for describing the lexing algorithm by Sulzmann and Lu,
-%and then give the algorithm and its variant, and discuss
+% used for describing the lexing algorithm by Sulzmann and Lu,
+%and then give the algorithm and its variant and discuss
 %why more aggressive simplifications are needed. 
 
 In this chapter, we define the basic notions 
 for regular languages and regular expressions.
-This is essentially a description in "English"
-of your formalisation in Isabelle/HOL.
-We also give the definition of what $\POSIX$ lexing means.
+This is essentially a description in ``English"
+of our formalisation in Isabelle/HOL.
+We also give the definition of what $\POSIX$ lexing means, 
+followed by an algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} 
+that produces the output conforming
+to the $\POSIX$ standard.
+It is also worth mentioning that
+we choose to use the ML-style notation
+for function applications, where
+the parameters of a function is not enclosed
+inside a pair of parentheses (e.g. $f \;x \;y$
+instead of $f(x,\;y)$). This is mainly
+to make the text visually more concise.
 
 \section{Basic Concepts}
-Usually formal language theory starts with an alphabet 
+Usually, formal language theory starts with an alphabet 
 denoting a set of characters.
 Here we just use the datatype of characters from Isabelle,
 which roughly corresponds to the ASCII characters.
-In what follows we shall leave the information about the alphabet
+In what follows, we shall leave the information about the alphabet
 implicit.
 Then using the usual bracket notation for lists,
 we can define strings made up of characters: 
@@ -30,7 +40,7 @@
 \end{center}
 Where $c$ is a variable ranging over characters.
 Strings can be concatenated to form longer strings in the same
-way as we concatenate two lists, which we write as @.
+way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise 
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
@@ -48,8 +58,8 @@
 $A^{n+1}$ & $\dn$ & $A @ A^n$
 \end{tabular}
 \end{center}
-The union of all the natural number powers of a language   
-is usually defined as the Kleene star operator:
+The union of all powers of a language   
+can be used to define the Kleene star operator:
 \begin{center}
 \begin{tabular}{lcl}
  $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
@@ -57,19 +67,19 @@
 \end{center}
 
 \noindent
-However, to obtain a convenient induction principle 
+However, to obtain a more convenient induction principle 
 in Isabelle/HOL, 
 we instead define the Kleene star
 as an inductive set: 
 
 \begin{center}
 \begin{mathpar}
-\inferrule{}{[] \in A*\\}
+	\inferrule{\mbox{}}{[] \in A*\\}
 
-\inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
+\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
 \end{mathpar}
 \end{center}
-\ChristianComment{Yes, used the inferrule command in mathpar}
+\noindent
 We also define an operation of "chopping off" a character from
 a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
 \begin{center}
@@ -89,10 +99,10 @@
 which is essentially the left quotient $A \backslash L$ of $A$ against 
 the singleton language with $L = \{w\}$
 in formal language theory.
-However for the purposes here, the $\textit{Ders}$ definition with 
+However, for the purposes here, the $\textit{Ders}$ definition with 
 a single string is sufficient.
 
-With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
+With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
 we have a  few properties of how the language derivative can be defined using 
 sub-languages.
 \begin{lemma}
@@ -113,26 +123,28 @@
 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
 \end{lemma}
 \begin{proof}
+There are too inclusions to prove:
 \begin{itemize}
-\item{$\subseteq$}
-\noindent
+\item{$\subseteq$}:\\
 The set 
 \[ \{s \mid c :: s \in A*\} \]
 is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
+\[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
 because whenever you have a string starting with a character 
 in the language of a Kleene star $A*$, 
 then that character together with some sub-string
 immediately after it will form the first iteration, 
 and the rest of the string will 
 be still in $A*$.
-\item{$\supseteq$}
+\item{$\supseteq$}:\\
 Note that
 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
-and 
+hold.
+Also this holds:
 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
-where the $\textit{RHS}$ of the above equatioin can be rewritten
-as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set.
+where the $\textit{RHS}$ can be rewritten
+as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
+which of course contains $\Der \; c \; A @ A*$.
 \end{itemize}
 \end{proof}
 
@@ -141,7 +153,7 @@
 for regular languages, we need to first give definitions for regular expressions.
 
 \subsection{Regular Expressions and Their Meaning}
-The basic regular expressions  are defined inductively
+The \emph{basic regular expressions} are defined inductively
  by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
 			 \mid  c  
@@ -150,81 +162,127 @@
 			 \mid r^*         
 \]
 \noindent
-We call them basic because we might introduce
-more constructs later such as negation
+We call them basic because we will introduce
+additional constructors in later chapters such as negation
 and bounded repetitions.
-We defined the regular expression containing
-nothing as $\ZERO$, note that some authors
-also use $\phi$ for that.
-Similarly, the regular expression denoting the 
-singleton set with only $[]$ is sometimes 
-denoted by $\epsilon$, but we use $\ONE$ here.
-
-The language or set of strings denoted 
-by regular expressions are defined as
+We use $\ZERO$ for the regular expression that
+matches no string, and $\ONE$ for the regular
+expression that matches only the empty string\footnote{
+some authors
+also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
+but we prefer our notation}. 
+The sequence regular expression is written $r_1\cdot r_2$
+and sometimes we omit the dot if it is clear which
+regular expression is meant; the alternative
+is written $r_1 + r_2$.
+The \emph{language} or meaning of 
+a regular expression is defined recursively as
+a set of strings:
 %TODO: FILL in the other defs
 \begin{center}
 \begin{tabular}{lcl}
-$L \; (\ZERO)$ & $\dn$ & $\phi$\\
-$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
-$L \; (c)$ & $\dn$ & $\{[c]\}$\\
-$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
-$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
-$L \; (r^*)$ & $\dn$ & $ (L(r))^*$
+$L \; \ZERO$ & $\dn$ & $\phi$\\
+$L \; \ONE$ & $\dn$ & $\{[]\}$\\
+$L \; c$ & $\dn$ & $\{[c]\}$\\
+$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
+$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
+$L \; r^*$ & $\dn$ & $ (L\;r)*$
 \end{tabular}
 \end{center}
 \noindent
-Which is also called the "language interpretation" of
-a regular expression.
-
 Now with semantic derivatives of a language and regular expressions and
 their language interpretations in place, we are ready to define derivatives on regexes.
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
-
-\ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the 
-readers engaged with a story how we got to the definition of $\backslash$, rather 
-than first "overwhelming" them with the definition of $\nullable$.}
-
-The language derivative acts on a string set and chops off a character from
-all strings in that set, we want to define a derivative operation on regular expressions
-so that after derivative $L(r\backslash c)$ 
-will look as if it was obtained by doing a language derivative on $L(r)$:
+%Recall, the language derivative acts on a set of strings
+%and essentially chops off a particular character from
+%all strings in that set, Brzozowski defined a derivative operation on regular expressions
+%so that after derivative $L(r\backslash c)$ 
+%will look as if it was obtained by doing a language derivative on $L(r)$:
+Recall that the semantic derivative acts on a set of 
+strings. Brzozowski noticed that this operation
+can be ``mirrored" on regular expressions which
+he calls the derivative of a regular expression $r$
+with respect to a character $c$, written
+$r \backslash c$.
+He defined this operation such that the following property holds:
 \begin{center}
-\[
-r\backslash c \dn ?
-\]
-so that
-\[
-L(r \backslash c) = \Der \; c \; L(r) ?
-\]
-\end{center}
-So we mimic the equalities we have for $\Der$ on language concatenation
 
 \[
-\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\
+L(r \backslash c) = \Der \; c \; L(r)
 \]
-to get the derivative for sequence regular expressions:
-\[
-(r_1 \cdot r_2 ) \backslash c = \textit{if}\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c \textit{else} (r_1 \backslash c) \cdot r_2
-\]
+\end{center}
+\noindent
+For example in the sequence case we have 
+\begin{center}
+	\begin{tabular}{lcl}
+		$\Der \; c \; (A @ B)$ & $\dn$ & 
+		$ \textit{if} \;\,  [] \in A \; 
+		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
+		\Der \; c\; B$\\
+		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
+	\end{tabular}
+\end{center}
+\noindent
+This can be translated to  
+regular expressions in the following 
+manner:
+\begin{center}
+	\begin{tabular}{lcl}
+		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
+	\end{tabular}
+\end{center}
 
 \noindent
-and language Kleene star:
+And similarly, the Kleene star's semantic derivative
+can be expressed as
 \[
-\textit{Der} \;c \;A* = (\textit{Der}\; c A) @ (A*)
+	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
 \]
-to get derivative of the Kleene star regular expression:
+which translates to
 \[
-r^* \backslash c = (r \backslash c)\cdot r^*
+	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
 \]
-Note that although we can formalise the boolean predicate
-$[] \in L(r_1)$ without problems, if we want a function that works
-computationally, then we would have to define a function that tests
-whether an empty string is in the language of a regular expression.
-We call such a function $\nullable$:
-
-
-
+In the above definition of $(r_1\cdot r_2) \backslash c$,
+the $\textit{if}$ clause's
+boolean condition 
+$[] \in L(r_1)$ needs to be 
+somehow recursively computed.
+We call such a function that checks
+whether the empty string $[]$ is 
+in the language of a regular expression $\nullable$:
+\begin{center}
+		\begin{tabular}{lcl}
+			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
+			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
+			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
+			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
+		\end{tabular}
+\end{center}
+\noindent
+The $\ZERO$ regular expression
+does not contain any string and
+therefore is not \emph{nullable}.
+$\ONE$ is \emph{nullable} 
+by definition. 
+The character regular expression $c$
+corresponds to the singleton set $\{c\}$, 
+and therefore does not contain the empty string.
+The alternative regular expression is nullable
+if at least one of its children is nullable.
+The sequence regular expression
+would require both children to have the empty string
+to compose an empty string, and the Kleene star
+is always nullable because it naturally
+contains the empty string. 
+  
+The derivative function, written $r\backslash c$, 
+defines how a regular expression evolves into
+a new one after all the string it contains is acted on: 
+if it starts with $c$, then the character is chopped of,
+if not, that string is removed.
 \begin{center}
 \begin{tabular}{lcl}
 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
@@ -239,67 +297,32 @@
 \end{tabular}
 \end{center}
 \noindent
-The function derivative, written $r\backslash c$, 
-defines how a regular expression evolves into
-a new regular expression after all the string it contains
-is chopped off a certain head character $c$.
-The most involved cases are the sequence 
-and star case.
+The most involved cases are the sequence case
+and the star case.
 The sequence case says that if the first regular expression
-contains an empty string then the second component of the sequence
-might be chosen as the target regular expression to be chopped
-off its head character.
-The star regular expression's derivative unwraps the iteration of
-regular expression and attaches the star regular expression
-to the sequence's second element to make sure a copy is retained
-for possible more iterations in later phases of lexing.
-
-
-To test whether $[] \in L(r_1)$, we need the $\nullable$ function,
-which tests whether the empty string $""$ 
-is in the language of $r$:
-
-
-\begin{center}
-		\begin{tabular}{lcl}
-			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
-			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
-			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
-			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
-			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
-			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
-		\end{tabular}
-\end{center}
-\noindent
-The empty set does not contain any string and
-therefore not the empty string, the empty string 
-regular expression contains the empty string
-by definition, the character regular expression
-is the singleton that contains character only,
-and therefore does not contain the empty string,
-the alternative regular expression (or "or" expression)
-might have one of its children regular expressions
-being nullable and any one of its children being nullable
-would suffice. The sequence regular expression
-would require both children to have the empty string
-to compose an empty string and the Kleene star
-operation naturally introduced the empty string. 
-  
-We have the following property where the derivative on regular 
-expressions coincides with the derivative on a set of strings:
-
-\begin{lemma}
+contains an empty string, then the second component of the sequence
+needs to be considered, as its derivative will contribute to the
+result of this derivative.
+The star regular expression $r^*$'s derivative 
+unwraps one iteration of $r$, turns it into $r\backslash c$,
+and attaches the original $r^*$
+after $r\backslash c$, so that 
+we can further unfold it as many times as needed.
+We have the following correspondence between 
+derivatives on regular expressions and
+derivatives on a set of strings:
+\begin{lemma}\label{derDer}
 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
 \end{lemma}
 
 \noindent
 The main property of the derivative operation
-that enables us to reason about the correctness of
-an algorithm using derivatives is 
+(that enables us to reason about the correctness of
+derivative-based matching)
+is 
 
 \begin{lemma}\label{derStepwise}
-$c\!::\!s \in L(r)$ holds
-if and only if $s \in L(r\backslash c)$.
+	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
 \end{lemma}
 
 \noindent
@@ -314,12 +337,14 @@
 \end{center}
 
 \noindent
-When there is no ambiguity we will use  $\backslash$ to denote
+When there is no ambiguity, we will 
+omit the subscript and use $\backslash$ instead
+of $\backslash_r$ to denote
 string derivatives for brevity.
 Brzozowski's  regular-expression matcher algorithm can then be described as:
 
 \begin{definition}
-$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
+$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
 \end{definition}
 
 \noindent
@@ -336,17 +361,15 @@
  It can  be
 relatively  easily shown that this matcher is correct:
 \begin{lemma}
-$\textit{match} \; s\; r  = \textit{true} \Longleftrightarrow s \in L(r)$
+	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
-By the stepwise property of $\backslash$ (\ref{derStepwise})
+	By the stepwise property of derivatives (lemma \ref{derStepwise})
+	and lemma \ref{derDer}. 
 \end{proof}
 \noindent
-If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow.
-
-
-\begin{figure}
+\begin{center}
+	\begin{figure}
 \begin{tikzpicture}
 \begin{axis}[
     xlabel={$n$},
@@ -360,84 +383,93 @@
 \end{tikzpicture} 
 \caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
 \end{figure}
-   
+\end{center} 
 \noindent
-For this we need to introduce certain 
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow, as shown in 
+\ref{NaiveMatcher}.
+Note that both axes are in logarithmic scale.
+Around two dozens characters
+would already explode the matcher on regular expression 
+$(a^*)^*b$.
+For this, we need to introduce certain 
 rewrite rules for the intermediate results,
 such as $r + r \rightarrow r$,
 and make sure those rules do not change the 
 language of the regular expression.
-We have a simplification function (that is as simple as possible
-while having much power on making a regex simpler):
-\begin{verbatim}
-def simp(r: Rexp) : Rexp = r match {
-  case SEQ(r1, r2) => 
-    (simp(r1), simp(r2)) match {
-      case (ZERO, _) => ZERO
-      case (_, ZERO) => ZERO
-      case (ONE, r2s) => r2s
-      case (r1s, ONE) => r1s
-      case (r1s, r2s) => SEQ(r1s, r2s)
-    }
-  case ALTS(r1, r2) => {
-    (simp(r1), simp(r2)) match {
-      case (ZERO, r2s) => r2s
-      case (r1s, ZERO) => r1s
-      case (r1s, r2s) =>
-        if(r1s == r2s) r1s else ALTS(r1s, r2s)
-    }
-  }
-  case r => r
-}
-\end{verbatim}
-If we repeatedly incorporate these 
-rules during the matching algorithm, 
-we have a lexer with simplification:
-\begin{verbatim}
-def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
-  case Nil => simp(r)
-  case c :: cs => ders_simp(cs, simp(der(c, r)))
-}
-
-def simp_matcher(s: String, r: Rexp) : Boolean = 
-  nullable(ders_simp(s.toList, r))
-
-\end{verbatim}
-\noindent
-After putting in those rules, the example of \ref{NaiveMatcher}
-is now very tame in the length of inputs:
-
-
+One simpled-minded simplification function
+that achieves these requirements is given below:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
+		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
+					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
+					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
+					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
+					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
+					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
+		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
+				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
+				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
+				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
+		$\simp \; r$ & $\dn$ & $r$
+				   
+	\end{tabular}
+\end{center}
+If we repeatedly apply this simplification  
+function during the matching algorithm, 
+we have a matcher with simplification:
+\begin{center}
+	\begin{tabular}{lcl}
+		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
+		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
+		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
+	\end{tabular}
+\end{center}
+\begin{figure}
 \begin{tikzpicture}
 \begin{axis}[
     xlabel={$n$},
     ylabel={time in secs},
     ymode = log,
     xmode = log,
+    grid = both,
     legend entries={Matcher With Simp},  
     legend pos=north west,
     legend cell align=left]
 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
 \end{axis}
-\end{tikzpicture} \label{fig:BetterMatcher}
-
+\end{tikzpicture} 
+\caption{$(a^*)^*b$ 
+against 
+$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
+\end{figure}
+\noindent
+The running time of $\textit{ders}\_\textit{simp}$
+on the same example of \ref{NaiveMatcher}
+is now very tame in terms of the length of inputs,
+as shown in \ref{BetterMatcher}.
 
-\noindent
-Note how the x-axis is in logarithmic scale.
 Building derivatives and then testing the existence
 of empty string in the resulting regular expression's language,
-and add simplification rules when necessary.
+adding simplifications when necessary.
 So far, so good. But what if we want to 
 do lexing instead of just getting a YES/NO answer?
-\citeauthor{Sulzmann2014} first came up with a nice and 
+Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and 
 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
 
 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
-Here we present the hybrid phases of a regular expression lexing 
-algorithm using the function $\inj$, as given by Sulzmann and Lu.
-They first defined the datatypes for storing the 
-lexing information called a \emph{value} or
-sometimes also \emph{lexical value}.  These values and regular
+In this section, we present a two-phase regular expression lexing 
+algorithm.
+The first phase takes successive derivatives with 
+respect to the input string,
+and the second phase does the reverse, \emph{injecting} back
+characters, in the meantime constructing a lexing result.
+We will introduce the injection phase in detail slightly
+later, but as a preliminary we have to first define 
+the datatype for lexing results, 
+called \emph{value} or
+sometimes also \emph{lexical value}.  Values and regular
 expressions correspond to each other as illustrated in the following
 table:
 
@@ -466,56 +498,116 @@
 		\end{tabular}
 	\end{tabular}
 \end{center}
-
 \noindent
-We have a formal binary relation for telling whether the structure
-of a regular expression agrees with the value.
+A value has an underlying string, which 
+can be calculated by the ``flatten" function $|\_|$:
+\begin{center}
+	\begin{tabular}{lcl}
+		$|\Empty|$ & $\dn$ &  $[]$\\
+		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
+		$|\Seq(v_1, v_2)|$ & $ \dn$ & $ v_1| @ |v_2|$\\
+		$|\Left(v)|$ & $ \dn$ & $ |v|$\\
+		$|\Right(v)|$ & $ \dn$ & $ |v|$\\
+		$|\Stars([])|$ & $\dn$ & $[]$\\
+		$|\Stars(v::vs)|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
+	\end{tabular}
+\end{center}
+Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
+to indicate that a value $v$ could be generated from a lexing algorithm
+with input $r$. They call it the value inhabitation relation. 
 \begin{mathpar}
-\inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
-\inferrule{}{\vdash \Empty :  \ONE} \hspace{2em}
-\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
-\end{mathpar}
+	\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
+
+	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
+
+\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
+
+\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}
+
+\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}
 
-Building on top of Sulzmann and Lu's attempt to formalise the 
-notion of POSIX lexing rules \parencite{Sulzmann2014}, 
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
-POSIX matching as a ternary relation recursively defined in a
-natural deduction style.
-The formal definition of a $\POSIX$ value $v$ for a regular expression
+\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars(vs):r^*}
+\end{mathpar}
+\noindent
+The condition $|v| \neq []$ in the premise of star's rule
+is to make sure that for a given pair of regular 
+expression $r$ and string $s$, the number of values 
+satisfying $|v| = s$ and $\vdash v:r$ is finite.
+Given the same string and regular expression, there can be
+multiple values for it. For example, both
+$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
+$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
+and the values both flatten to $abc$.
+Lexers therefore have to disambiguate and choose only
+one of the values to output. $\POSIX$ is one of the
+disambiguation strategies that is widely adopted.
+
+Ausaf and Urban\parencite{AusafDyckhoffUrban2016} 
+formalised the property 
+as a ternary relation.
+The $\POSIX$ value $v$ for a regular expression
 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
-in the following set of rules:
-\ChristianComment{Will complete later}
-\newcommand*{\inference}[3][t]{%
-   \begingroup
-   \def\and{\\}%
-   \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
-   #2 \\
-   \hline
-   #3
-   \end{tabular}%
-   \endgroup
-}
-\begin{center}
-\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
-\end{center}
+in the following set of rules\footnote{The names of the rules are used
+as they were originally given in \cite{AusafDyckhoffUrban2016}}:
+\noindent
+\begin{figure}
+\begin{mathpar}
+	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
+		
+	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
+
+	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
+
+	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
+
+	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
+		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
+		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
+	\Seq \; v_1 \; v_2}
+
+	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
+
+	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
+		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
+		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
+	(v::vs)}
+\end{mathpar}
+\caption{POSIX Lexing Rules}
+\end{figure}
 \noindent
-The above $\POSIX$ rules could be explained intuitionally as
+The above $\POSIX$ rules follows the intuition described below: 
 \begin{itemize}
-\item
-match the leftmost regular expression when multiple options of matching
-are available  
-\item 
-always match a subpart as much as possible before proceeding
-to the next token.
+	\item (Left Priority)\\
+		Match the leftmost regular expression when multiple options of matching
+		are available.
+	\item (Maximum munch)\\
+		Always match a subpart as much as possible before proceeding
+		to the next token.
 \end{itemize}
-
-The reason why we are interested in $\POSIX$ values is that they can
-be practically used in the lexing phase of a compiler front end.
+\noindent
+These disambiguation strategies can be 
+quite practical.
 For instance, when lexing a code snippet 
-$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
-as an identifier rather than a keyword.
-\ChristianComment{Do I also introduce lexical values $LV$ here?}
-We know that $\POSIX$ values are also part of the normal values:
+\[ 
+	\textit{iffoo} = 3
+\]
+using the regular expression (with 
+keyword and identifier having their 
+usualy definitions on any formal
+language textbook, for instance
+keyword is a nonempty string starting with letters 
+followed by alphanumeric characters or underscores):
+\[
+	\textit{keyword} + \textit{identifier},
+\]
+we want $\textit{iffoo}$ to be recognized
+as an identifier rather than a keyword (if)
+followed by
+an identifier (foo).
+POSIX lexing achieves this.
+
+We know that a $\POSIX$ value is also a normal underlying
+value:
 \begin{lemma}
 $(r, s) \rightarrow v \implies \vdash v: r$
 \end{lemma}
@@ -527,10 +619,13 @@
 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
 \end{lemma}
 \begin{proof}
-By induction on $s$, $r$ and $v_1$. The induction principle is
-the \POSIX rules. Each case is proven by a combination of
-the induction rules for $\POSIX$ values and the inductive hypothesis.
-Probably the most cumbersome cases are the sequence and star with non-empty iterations.
+By induction on $s$, $r$ and $v_1$. The inductive cases
+are all the POSIX rules. 
+Each case is proven by a combination of
+the induction rules for $\POSIX$ 
+values and the inductive hypothesis.
+Probably the most cumbersome cases are 
+the sequence and star with non-empty iterations.
 
 We give the reasoning about the sequence case as follows:
 When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, 
@@ -544,15 +639,14 @@
 which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
 is the same as $\Seq(v_1, v_2)$. 
 \end{proof}
-Now we know what a $\POSIX$ value is, the problem is how do we achieve 
+Now we know what a $\POSIX$ value is; the problem is how do we achieve 
 such a value in a lexing algorithm, using derivatives?
 
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
 
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
-derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value 
-is generated if the regular expression matches the string.
+derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
 Two functions are involved: $\inj$ and $\mkeps$.
 The function $\mkeps$ constructs a value from the last
 one of all the successive derivatives:
@@ -570,13 +664,13 @@
 
 	\begin{center}
 		\begin{tabular}{lcl}
-			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
-			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
-			& \textit{if} $\nullable(r_{1})$\\ 
-			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
-			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
-			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
-			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
+			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
+			$\mkeps \; r_{1}+r_{2}$	& $\dn$ 
+						& \textit{if} ($\nullable \; r_{1})$\\ 
+			& & \textit{then} $\Left (\mkeps \; r_{1})$\\ 
+			& & \textit{else} $\Right(\mkeps \; r_{2})$\\
+			$\mkeps \; r_1\cdot r_2$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
+			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
 		\end{tabular}
 	\end{center}
 
@@ -669,9 +763,10 @@
  The central property of the $\lexer$ is that it gives the correct result by
  $\POSIX$ standards:
  \begin{theorem}
- \begin{tabular}{l}
- $\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\
- $\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$
+	 The $\lexer$ based on derivatives and injections is both sound and complete:
+ \begin{tabular}{lcl}
+	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
+	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
  \end{tabular}
  \end{theorem}
  
@@ -776,7 +871,7 @@
 of a match. This means Sulzmann and Lu's injection-based algorithm 
  exponential by nature.
 Somehow one has to make sure which
- lexical values are $\POSIX$ and need to be kept in a lexing algorithm.
+ lexical values are $\POSIX$ and must be kept in a lexing algorithm.
 
 
  For example, the above $r= (a^*\cdot a^*)^*$  and 
@@ -788,7 +883,7 @@
 Can we not create those intermediate values $v_1,\ldots v_n$,
 and get the lexing information that should be already there while
 doing derivatives in one pass, without a second injection phase?
-In the meantime, can we make sure that simplifications
+In the meantime, can we ensure that simplifications
 are easily handled without breaking the correctness of the algorithm?
 
 Sulzmann and Lu solved this problem by
changeset 564	3cbcd7cda0a9
parent 543	b2bea5968b89
child 567	28cb8089ec36