lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex

equal deleted inserted replaced

-:57e33978e55d
+:3cbcd7cda0a9
 \chapter{Regular Expressions and POSIX Lexing} % Main chapter title
 \label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
 %and notations we
-%use for describing the lexing algorithm by Sulzmann and Lu,
+% used for describing the lexing algorithm by Sulzmann and Lu,
-%and then give the algorithm and its variant, and discuss
+%and then give the algorithm and its variant and discuss
 %why more aggressive simplifications are needed.
 In this chapter, we define the basic notions
 for regular languages and regular expressions.
-This is essentially a description in "English"
+This is essentially a description in ``English"
-of your formalisation in Isabelle/HOL.
+of our formalisation in Isabelle/HOL.
-We also give the definition of what $\POSIX$ lexing means.
+We also give the definition of what $\POSIX$ lexing means,
+followed by an algorithm by Sulzmanna and Lu\parencite{Sulzmann2014}
+that produces the output conforming
+to the $\POSIX$ standard.
+It is also worth mentioning that
+we choose to use the ML-style notation
+for function applications, where
+the parameters of a function is not enclosed
+inside a pair of parentheses (e.g. $f \;x \;y$
+instead of $f(x,\;y)$). This is mainly
+to make the text visually more concise.
 \section{Basic Concepts}
-Usually formal language theory starts with an alphabet
+Usually, formal language theory starts with an alphabet
 denoting a set of characters.
 Here we just use the datatype of characters from Isabelle,
 which roughly corresponds to the ASCII characters.
-In what follows we shall leave the information about the alphabet
+In what follows, we shall leave the information about the alphabet
 implicit.
 Then using the usual bracket notation for lists,
 we can define strings made up of characters:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
 \end{tabular}
 \end{center}
 Where $c$ is a variable ranging over characters.
 Strings can be concatenated to form longer strings in the same
-way as we concatenate two lists, which we write as @.
+way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
 \begin{center}
 \begin{tabular}{lcl}
 \begin{tabular}{lcl}
 $A^0 $ & $\dn$ & $\{ [] \}$\\
 $A^{n+1}$ & $\dn$ & $A @ A^n$
 \end{tabular}
 \end{center}
-The union of all the natural number powers of a language
+The union of all powers of a language
-is usually defined as the Kleene star operator:
+can be used to define the Kleene star operator:
 \begin{center}
 \begin{tabular}{lcl}
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
 \end{tabular}
 \end{center}
 \noindent
-However, to obtain a convenient induction principle
+However, to obtain a more convenient induction principle
 in Isabelle/HOL,
 we instead define the Kleene star
 as an inductive set:
 \begin{center}
 \begin{mathpar}
-\inferrule{}{[] \in A*\\}
+	\inferrule{\mbox{}}{[] \in A*\\}
-\inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
+\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
 \end{mathpar}
 \end{center}
-\ChristianComment{Yes, used the inferrule command in mathpar}
+\noindent
 We also define an operation of "chopping off" a character from
 a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
 \end{center}
 \noindent
 which is essentially the left quotient $A \backslash L$ of $A$ against
 the singleton language with $L = \{w\}$
 in formal language theory.
-However for the purposes here, the $\textit{Ders}$ definition with
+However, for the purposes here, the $\textit{Ders}$ definition with
 a single string is sufficient.
-With the  sequencing, Kleene star, and $\textit{Der}$ operator on languages,
+With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
 we have a  few properties of how the language derivative can be defined using
 sub-languages.
 \begin{lemma}
 \[
 	\Der \; c \; (A @ B) =
 of $A$:
 \begin{lemma}
 $\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
 \end{lemma}
 \begin{proof}
+There are too inclusions to prove:
 \begin{itemize}
-\item{$\subseteq$}
+\item{$\subseteq$}:\\
-\noindent
 The set
 \[ \{s \mid c :: s \in A*\} \]
 is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
+\[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
 because whenever you have a string starting with a character
 in the language of a Kleene star $A*$,
 then that character together with some sub-string
 immediately after it will form the first iteration,
 and the rest of the string will
 be still in $A*$.
-\item{$\supseteq$}
+\item{$\supseteq$}:\\
 Note that
 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
-and
+hold.
+Also this holds:
 \[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
-where the $\textit{RHS}$ of the above equatioin can be rewritten
+where the $\textit{RHS}$ can be rewritten
-as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set.
+as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
+which of course contains $\Der \; c \; A @ A*$.
 \end{itemize}
 \end{proof}
 \noindent
 Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
 for regular languages, we need to first give definitions for regular expressions.
 \subsection{Regular Expressions and Their Meaning}
-The basic regular expressions  are defined inductively
+The \emph{basic regular expressions} are defined inductively
 by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
 			 \mid  c
 			 \mid  r_1 \cdot r_2
 			 \mid  r_1 + r_2
 			 \mid r^*
 \]
 \noindent
-We call them basic because we might introduce
+We call them basic because we will introduce
-more constructs later such as negation
+additional constructors in later chapters such as negation
 and bounded repetitions.
-We defined the regular expression containing
+We use $\ZERO$ for the regular expression that
-nothing as $\ZERO$, note that some authors
+matches no string, and $\ONE$ for the regular
-also use $\phi$ for that.
+expression that matches only the empty string\footnote{
-Similarly, the regular expression denoting the
+some authors
-singleton set with only $[]$ is sometimes
+also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
-denoted by $\epsilon$, but we use $\ONE$ here.
+but we prefer our notation}.
+The sequence regular expression is written $r_1\cdot r_2$
-The language or set of strings denoted
+and sometimes we omit the dot if it is clear which
-by regular expressions are defined as
+regular expression is meant; the alternative
+is written $r_1 + r_2$.
+The \emph{language} or meaning of
+a regular expression is defined recursively as
+a set of strings:
 %TODO: FILL in the other defs
 \begin{center}
 \begin{tabular}{lcl}
-$L \; (\ZERO)$ & $\dn$ & $\phi$\\
+$L \; \ZERO$ & $\dn$ & $\phi$\\
-$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
+$L \; \ONE$ & $\dn$ & $\{[]\}$\\
-$L \; (c)$ & $\dn$ & $\{[c]\}$\\
+$L \; c$ & $\dn$ & $\{[c]\}$\\
-$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
+$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
-$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
+$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
-$L \; (r^*)$ & $\dn$ & $ (L(r))^*$
+$L \; r^*$ & $\dn$ & $ (L\;r)*$
 \end{tabular}
 \end{center}
 \noindent
-Which is also called the "language interpretation" of
-a regular expression.
 Now with semantic derivatives of a language and regular expressions and
 their language interpretations in place, we are ready to define derivatives on regexes.
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
+%Recall, the language derivative acts on a set of strings
-\ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the
+%and essentially chops off a particular character from
-readers engaged with a story how we got to the definition of $\backslash$, rather
+%all strings in that set, Brzozowski defined a derivative operation on regular expressions
-than first "overwhelming" them with the definition of $\nullable$.}
+%so that after derivative $L(r\backslash c)$
+%will look as if it was obtained by doing a language derivative on $L(r)$:
-The language derivative acts on a string set and chops off a character from
+Recall that the semantic derivative acts on a set of
-all strings in that set, we want to define a derivative operation on regular expressions
+strings. Brzozowski noticed that this operation
-so that after derivative $L(r\backslash c)$
+can be ``mirrored" on regular expressions which
-will look as if it was obtained by doing a language derivative on $L(r)$:
+he calls the derivative of a regular expression $r$
-\begin{center}
+with respect to a character $c$, written
+$r \backslash c$.
+He defined this operation such that the following property holds:
+\begin{center}
 \[
-r\backslash c \dn ?
+L(r \backslash c) = \Der \; c \; L(r)
 \]
-so that
+\end{center}
+\noindent
+For example in the sequence case we have
+\begin{center}
+	\begin{tabular}{lcl}
+		$\Der \; c \; (A @ B)$ & $\dn$ &
+		$ \textit{if} \;\,  [] \in A \;
+		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
+		\Der \; c\; B$\\
+		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
+	\end{tabular}
+\end{center}
+\noindent
+This can be translated to
+regular expressions in the following
+manner:
+\begin{center}
+	\begin{tabular}{lcl}
+		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
+	\end{tabular}
+\end{center}
+\noindent
+And similarly, the Kleene star's semantic derivative
+can be expressed as
 \[
-L(r \backslash c) = \Der \; c \; L(r) ?
+	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
 \]
-\end{center}
+which translates to
-So we mimic the equalities we have for $\Der$ on language concatenation
 \[
-\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\
+	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
 \]
-to get the derivative for sequence regular expressions:
+In the above definition of $(r_1\cdot r_2) \backslash c$,
-\[
+the $\textit{if}$ clause's
-(r_1 \cdot r_2 ) \backslash c = \textit{if}\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c \textit{else} (r_1 \backslash c) \cdot r_2
+boolean condition
-\]
+$[] \in L(r_1)$ needs to be
+somehow recursively computed.
-\noindent
+We call such a function that checks
-and language Kleene star:
+whether the empty string $[]$ is
-\[
+in the language of a regular expression $\nullable$:
-\textit{Der} \;c \;A* = (\textit{Der}\; c A) @ (A*)
+\begin{center}
-\]
+		\begin{tabular}{lcl}
-to get derivative of the Kleene star regular expression:
+			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
-\[
+			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
-r^* \backslash c = (r \backslash c)\cdot r^*
+			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
-\]
+			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
-Note that although we can formalise the boolean predicate
+			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
-$[] \in L(r_1)$ without problems, if we want a function that works
+			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
-computationally, then we would have to define a function that tests
+		\end{tabular}
-whether an empty string is in the language of a regular expression.
+\end{center}
-We call such a function $\nullable$:
+\noindent
+The $\ZERO$ regular expression
+does not contain any string and
+therefore is not \emph{nullable}.
+$\ONE$ is \emph{nullable}
+by definition.
+The character regular expression $c$
+corresponds to the singleton set $\{c\}$,
+and therefore does not contain the empty string.
+The alternative regular expression is nullable
+if at least one of its children is nullable.
+The sequence regular expression
+would require both children to have the empty string
+to compose an empty string, and the Kleene star
+is always nullable because it naturally
+contains the empty string.
+The derivative function, written $r\backslash c$,
+defines how a regular expression evolves into
+a new one after all the string it contains is acted on:
+if it starts with $c$, then the character is chopped of,
+if not, that string is removed.
 \begin{center}
 \begin{tabular}{lcl}
 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
 		$d \backslash c$     & $\dn$ &
 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
 \end{tabular}
 \end{center}
 \noindent
-The function derivative, written $r\backslash c$,
+The most involved cases are the sequence case
-defines how a regular expression evolves into
+and the star case.
-a new regular expression after all the string it contains
-is chopped off a certain head character $c$.
-The most involved cases are the sequence
-and star case.
 The sequence case says that if the first regular expression
-contains an empty string then the second component of the sequence
+contains an empty string, then the second component of the sequence
-might be chosen as the target regular expression to be chopped
+needs to be considered, as its derivative will contribute to the
-off its head character.
+result of this derivative.
-The star regular expression's derivative unwraps the iteration of
+The star regular expression $r^*$'s derivative
-regular expression and attaches the star regular expression
+unwraps one iteration of $r$, turns it into $r\backslash c$,
-to the sequence's second element to make sure a copy is retained
+and attaches the original $r^*$
-for possible more iterations in later phases of lexing.
+after $r\backslash c$, so that
+we can further unfold it as many times as needed.
+We have the following correspondence between
-To test whether $[] \in L(r_1)$, we need the $\nullable$ function,
+derivatives on regular expressions and
-which tests whether the empty string $""$
+derivatives on a set of strings:
-is in the language of $r$:
+\begin{lemma}\label{derDer}
-\begin{center}
-		\begin{tabular}{lcl}
-			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
-			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
-			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
-			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
-			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
-			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
-		\end{tabular}
-\end{center}
-\noindent
-The empty set does not contain any string and
-therefore not the empty string, the empty string
-regular expression contains the empty string
-by definition, the character regular expression
-is the singleton that contains character only,
-and therefore does not contain the empty string,
-the alternative regular expression (or "or" expression)
-might have one of its children regular expressions
-being nullable and any one of its children being nullable
-would suffice. The sequence regular expression
-would require both children to have the empty string
-to compose an empty string and the Kleene star
-operation naturally introduced the empty string.
-We have the following property where the derivative on regular
-expressions coincides with the derivative on a set of strings:
-\begin{lemma}
 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
 \end{lemma}
 \noindent
 The main property of the derivative operation
-that enables us to reason about the correctness of
+(that enables us to reason about the correctness of
-an algorithm using derivatives is
+derivative-based matching)
+is
 \begin{lemma}\label{derStepwise}
-$c\!::\!s \in L(r)$ holds
+	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
-if and only if $s \in L(r\backslash c)$.
 \end{lemma}
 \noindent
 We can generalise the derivative operation shown above for single characters
 to strings as follows:
 $r \backslash [\,] $ & $\dn$ & $r$
 \end{tabular}
 \end{center}
 \noindent
-When there is no ambiguity we will use  $\backslash$ to denote
+When there is no ambiguity, we will
+omit the subscript and use $\backslash$ instead
+of $\backslash_r$ to denote
 string derivatives for brevity.
 Brzozowski's  regular-expression matcher algorithm can then be described as:
 \begin{definition}
-$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
+$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
 \end{definition}
 \noindent
 Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$,
 this algorithm presented graphically is as follows:
 \noindent
 It can  be
 relatively  easily shown that this matcher is correct:
 \begin{lemma}
-$\textit{match} \; s\; r  = \textit{true} \Longleftrightarrow s \in L(r)$
+	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
-By the stepwise property of $\backslash$ (\ref{derStepwise})
+	By the stepwise property of derivatives (lemma \ref{derStepwise})
+	and lemma \ref{derDer}.
 \end{proof}
 \noindent
-If we implement the above algorithm naively, however,
+\begin{center}
-the algorithm can be excruciatingly slow.
+	\begin{figure}
-\begin{figure}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 ylabel={time in secs},
 ymode = log,
 \addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
 \end{axis}
 \end{tikzpicture}
 \caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
 \end{figure}
+\end{center}
 \noindent
-For this we need to introduce certain
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow, as shown in
+\ref{NaiveMatcher}.
+Note that both axes are in logarithmic scale.
+Around two dozens characters
+would already explode the matcher on regular expression
+$(a^*)^*b$.
+For this, we need to introduce certain
 rewrite rules for the intermediate results,
 such as $r + r \rightarrow r$,
 and make sure those rules do not change the
 language of the regular expression.
-We have a simplification function (that is as simple as possible
+One simpled-minded simplification function
-while having much power on making a regex simpler):
+that achieves these requirements is given below:
-\begin{verbatim}
+\begin{center}
-def simp(r: Rexp) : Rexp = r match {
+	\begin{tabular}{lcl}
-case SEQ(r1, r2) =>
+		$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
-(simp(r1), simp(r2)) match {
+		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
-case (ZERO, _) => ZERO
+					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
-case (_, ZERO) => ZERO
+					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
-case (ONE, r2s) => r2s
+					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
-case (r1s, ONE) => r1s
+					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
-case (r1s, r2s) => SEQ(r1s, r2s)
+					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
-}
+		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
-case ALTS(r1, r2) => {
+				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
-(simp(r1), simp(r2)) match {
+				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
-case (ZERO, r2s) => r2s
+				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
-case (r1s, ZERO) => r1s
+		$\simp \; r$ & $\dn$ & $r$
-case (r1s, r2s) =>
-if(r1s == r2s) r1s else ALTS(r1s, r2s)
+	\end{tabular}
-}
+\end{center}
-}
+If we repeatedly apply this simplification
-case r => r
+function during the matching algorithm,
-}
+we have a matcher with simplification:
-\end{verbatim}
+\begin{center}
-If we repeatedly incorporate these
+	\begin{tabular}{lcl}
-rules during the matching algorithm,
+		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
-we have a lexer with simplification:
+		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
-\begin{verbatim}
+		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
-def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
+	\end{tabular}
-case Nil => simp(r)
+\end{center}
-case c :: cs => ders_simp(cs, simp(der(c, r)))
+\begin{figure}
-}
-def simp_matcher(s: String, r: Rexp) : Boolean =
-nullable(ders_simp(s.toList, r))
-\end{verbatim}
-\noindent
-After putting in those rules, the example of \ref{NaiveMatcher}
-is now very tame in the length of inputs:
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 ylabel={time in secs},
 ymode = log,
 xmode = log,
+grid = both,
 legend entries={Matcher With Simp},
 legend pos=north west,
 legend cell align=left]
 \addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
 \end{axis}
-\end{tikzpicture} \label{fig:BetterMatcher}
+\end{tikzpicture}
+\caption{$(a^*)^*b$
+against
-\noindent
+$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
-Note how the x-axis is in logarithmic scale.
+\end{figure}
+\noindent
+The running time of $\textit{ders}\_\textit{simp}$
+on the same example of \ref{NaiveMatcher}
+is now very tame in terms of the length of inputs,
+as shown in \ref{BetterMatcher}.
 Building derivatives and then testing the existence
 of empty string in the resulting regular expression's language,
-and add simplification rules when necessary.
+adding simplifications when necessary.
 So far, so good. But what if we want to
 do lexing instead of just getting a YES/NO answer?
-\citeauthor{Sulzmann2014} first came up with a nice and
+Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and
 elegant (arguably as beautiful as the definition of the original derivative) solution for this.
 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
-Here we present the hybrid phases of a regular expression lexing
+In this section, we present a two-phase regular expression lexing
-algorithm using the function $\inj$, as given by Sulzmann and Lu.
+algorithm.
-They first defined the datatypes for storing the
+The first phase takes successive derivatives with
-lexing information called a \emph{value} or
+respect to the input string,
-sometimes also \emph{lexical value}.  These values and regular
+and the second phase does the reverse, \emph{injecting} back
+characters, in the meantime constructing a lexing result.
+We will introduce the injection phase in detail slightly
+later, but as a preliminary we have to first define
+the datatype for lexing results,
+called \emph{value} or
+sometimes also \emph{lexical value}.  Values and regular
 expressions correspond to each other as illustrated in the following
 table:
 \begin{center}
 	\begin{tabular}{c@{\hspace{20mm}}c}
 			& $\mid$ & $\Right(v)$  \\
 			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
 		\end{tabular}
 	\end{tabular}
 \end{center}
+\noindent
-\noindent
+A value has an underlying string, which
-We have a formal binary relation for telling whether the structure
+can be calculated by the ``flatten" function $|\_|$:
-of a regular expression agrees with the value.
+\begin{center}
+	\begin{tabular}{lcl}
+		$|\Empty|$ & $\dn$ &  $[]$\\
+		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
+		$|\Seq(v_1, v_2)|$ & $ \dn$ & $ v_1| @ |v_2|$\\
+		$|\Left(v)|$ & $ \dn$ & $ |v|$\\
+		$|\Right(v)|$ & $ \dn$ & $ |v|$\\
+		$|\Stars([])|$ & $\dn$ & $[]$\\
+		$|\Stars(v::vs)|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
+	\end{tabular}
+\end{center}
+Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
+to indicate that a value $v$ could be generated from a lexing algorithm
+with input $r$. They call it the value inhabitation relation.
 \begin{mathpar}
-\inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
+	\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
-\inferrule{}{\vdash \Empty :  \ONE} \hspace{2em}
-\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
+	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
+\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
+\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}
+\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}
+\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars(vs):r^*}
 \end{mathpar}
+\noindent
-Building on top of Sulzmann and Lu's attempt to formalise the
+The condition $|v| \neq []$ in the premise of star's rule
-notion of POSIX lexing rules \parencite{Sulzmann2014},
+is to make sure that for a given pair of regular
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
+expression $r$ and string $s$, the number of values
-POSIX matching as a ternary relation recursively defined in a
+satisfying $|v| = s$ and $\vdash v:r$ is finite.
-natural deduction style.
+Given the same string and regular expression, there can be
-The formal definition of a $\POSIX$ value $v$ for a regular expression
+multiple values for it. For example, both
+$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
+$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
+and the values both flatten to $abc$.
+Lexers therefore have to disambiguate and choose only
+one of the values to output. $\POSIX$ is one of the
+disambiguation strategies that is widely adopted.
+Ausaf and Urban\parencite{AusafDyckhoffUrban2016}
+formalised the property
+as a ternary relation.
+The $\POSIX$ value $v$ for a regular expression
 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
-in the following set of rules:
+in the following set of rules\footnote{The names of the rules are used
-\ChristianComment{Will complete later}
+as they were originally given in \cite{AusafDyckhoffUrban2016}}:
-\newcommand*{\inference}[3][t]{%
+\noindent
-\begingroup
+\begin{figure}
-\def\and{\\}%
+\begin{mathpar}
-\begin{tabular}[#1]{@{\enspace}c@{\enspace}}
+	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
-#2 \\
-\hline
+	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
-#3
-\end{tabular}%
+	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
-\endgroup
-}
+	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
-\begin{center}
-\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
+	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
-\end{center}
+		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
-\noindent
+		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
-The above $\POSIX$ rules could be explained intuitionally as
+	\Seq \; v_1 \; v_2}
+	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
+	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
+		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
+		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
+	(v::vs)}
+\end{mathpar}
+\caption{POSIX Lexing Rules}
+\end{figure}
+\noindent
+The above $\POSIX$ rules follows the intuition described below:
 \begin{itemize}
-\item
+	\item (Left Priority)\\
-match the leftmost regular expression when multiple options of matching
+		Match the leftmost regular expression when multiple options of matching
-are available
+		are available.
-\item
+	\item (Maximum munch)\\
-always match a subpart as much as possible before proceeding
+		Always match a subpart as much as possible before proceeding
-to the next token.
+		to the next token.
 \end{itemize}
+\noindent
-The reason why we are interested in $\POSIX$ values is that they can
+These disambiguation strategies can be
-be practically used in the lexing phase of a compiler front end.
+quite practical.
 For instance, when lexing a code snippet
-$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
+\[
-as an identifier rather than a keyword.
+	\textit{iffoo} = 3
-\ChristianComment{Do I also introduce lexical values $LV$ here?}
+\]
-We know that $\POSIX$ values are also part of the normal values:
+using the regular expression (with
+keyword and identifier having their
+usualy definitions on any formal
+language textbook, for instance
+keyword is a nonempty string starting with letters
+followed by alphanumeric characters or underscores):
+\[
+	\textit{keyword} + \textit{identifier},
+\]
+we want $\textit{iffoo}$ to be recognized
+as an identifier rather than a keyword (if)
+followed by
+an identifier (foo).
+POSIX lexing achieves this.
+We know that a $\POSIX$ value is also a normal underlying
+value:
 \begin{lemma}
 $(r, s) \rightarrow v \implies \vdash v: r$
 \end{lemma}
 \noindent
 The good property about a $\POSIX$ value is that
 one can always uniquely determine the $\POSIX$ value for it:
 \begin{lemma}
 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
 \end{lemma}
 \begin{proof}
-By induction on $s$, $r$ and $v_1$. The induction principle is
+By induction on $s$, $r$ and $v_1$. The inductive cases
-the \POSIX rules. Each case is proven by a combination of
+are all the POSIX rules.
-the induction rules for $\POSIX$ values and the inductive hypothesis.
+Each case is proven by a combination of
-Probably the most cumbersome cases are the sequence and star with non-empty iterations.
+the induction rules for $\POSIX$
+values and the inductive hypothesis.
+Probably the most cumbersome cases are
+the sequence and star with non-empty iterations.
 We give the reasoning about the sequence case as follows:
 When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
 we know that there could not be a longer string $r_1'$ such that $(s_1', r_1) \rightarrow v_1'$
 and $(s_2', r_2) \rightarrow v2'$ and $s_1' @s_2' = s$ all hold.
 $(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
 Then by induction hypothesis $v_1' = v_1$ and $v_2'= v_2$,
 which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
 is the same as $\Seq(v_1, v_2)$.
 \end{proof}
-Now we know what a $\POSIX$ value is, the problem is how do we achieve
+Now we know what a $\POSIX$ value is; the problem is how do we achieve
 such a value in a lexing algorithm, using derivatives?
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
-derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value
+derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
-is generated if the regular expression matches the string.
 Two functions are involved: $\inj$ and $\mkeps$.
 The function $\mkeps$ constructs a value from the last
 one of all the successive derivatives:
 \begin{ceqn}
 \begin{equation}\label{graph:mkeps}
 It tells us how can an empty string be matched by a
 regular expression, in a $\POSIX$ way:
 	\begin{center}
 		\begin{tabular}{lcl}
-			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
+			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
-			$\mkeps(r_{1}+r_{2})$	& $\dn$
+			$\mkeps \; r_{1}+r_{2}$	& $\dn$
-			& \textit{if} $\nullable(r_{1})$\\
+						& \textit{if} ($\nullable \; r_{1})$\\
-			& & \textit{then} $\Left(\mkeps(r_{1}))$\\
+			& & \textit{then} $\Left (\mkeps \; r_{1})$\\
-			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
+			& & \textit{else} $\Right(\mkeps \; r_{2})$\\
-			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
+			$\mkeps \; r_1\cdot r_2$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
-			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
+			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
 		\end{tabular}
 	\end{center}
 \noindent
 \end{center}
 \noindent
 The central property of the $\lexer$ is that it gives the correct result by
 $\POSIX$ standards:
 \begin{theorem}
-\begin{tabular}{l}
+	 The $\lexer$ based on derivatives and injections is both sound and complete:
-$\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\
+\begin{tabular}{lcl}
-$\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$
+	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
+	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
 \end{tabular}
 \end{theorem}
 \begin{proof}
 have an exponential runtime on ambiguous regular expressions.
 With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
 of a match. This means Sulzmann and Lu's injection-based algorithm
 exponential by nature.
 Somehow one has to make sure which
-lexical values are $\POSIX$ and need to be kept in a lexing algorithm.
+lexical values are $\POSIX$ and must be kept in a lexing algorithm.
 For example, the above $r= (a^*\cdot a^*)^*$  and
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
 $ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
 not corresponding to this value during lexing.
 To do this, a two-phase algorithm with rectification is a bit too fragile.
 Can we not create those intermediate values $v_1,\ldots v_n$,
 and get the lexing information that should be already there while
 doing derivatives in one pass, without a second injection phase?
-In the meantime, can we make sure that simplifications
+In the meantime, can we ensure that simplifications
 are easily handled without breaking the correctness of the algorithm?
 Sulzmann and Lu solved this problem by
 introducing additional information to the
 regular expressions called \emph{bitcodes}.

changeset 564	3cbcd7cda0a9
parent 543	b2bea5968b89
child 567	28cb8089ec36