lexing: comparison ChengsongTanPhdThesis/Chapters/Inj.tex

equal deleted inserted replaced

-:17c7611fb0a9
+:4b1149fb5aec
 In this chapter, we define the basic notions
 for regular languages and regular expressions.
 This is essentially a description in ``English''
 of our formalisation in Isabelle/HOL.
 We also give the definition of what $\POSIX$ lexing means,
-followed by a lexing algorithm by Sulzmanna and Lu\parencite{Sulzmann2014}
+followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
 that produces the output conforming
-to the $\POSIX$ standard.
+to the $\POSIX$ standard\footnote{In what follows
-\footnote{In what follows
 we choose to use the Isabelle-style notation
 for function applications, where
 the parameters of a function are not enclosed
 inside a pair of parentheses (e.g. $f \;x \;y$
 instead of $f(x,\;y)$). This is mainly
-to make the text visually more concise.}
+to make the text visually more concise.}.
 \section{Basic Concepts}
-Usually, formal language theory starts with an alphabet
+Formal language theory usually starts with an alphabet
 denoting a set of characters.
 Here we just use the datatype of characters from Isabelle,
 which roughly corresponds to the ASCII characters.
 In what follows, we shall leave the information about the alphabet
 implicit.
 Then using the usual bracket notation for lists,
-we can define strings made up of characters:
+we can define strings made up of characters as:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
 \end{tabular}
 \end{center}
 where $c$ is a variable ranging over characters.
+The $::$ stands for list cons and $[]$ for the empty
+list.
+A singleton list is sometimes written as $[c]$ for brevity.
 Strings can be concatenated to form longer strings in the same
 way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
 $\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
 \end{tabular}
 \end{center}
 \noindent
 which is essentially the left quotient $A \backslash L$ of $A$ against
-the singleton language with $L = \{s\}$
+the singleton language with $L = \{s\}$.
-in formal language theory.
+However, for our purposes here, the $\textit{Ders}$ definition with
-However, for the purposes here, the $\textit{Ders}$ definition with
 a single string is sufficient.
 The reason for defining derivatives
-is that it provides a different approach
+is that they provide another approach
 to test membership of a string in
 a set of strings.
 For example, to test whether the string
 $bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
 respect to the string $bar$:
 \begin{center}
-\begin{tabular}{lclll}
+\begin{tabular}{lll}
 	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
-	$\{ar, rak\}$ &
+	$\{ar, rak\}$ \\
-	$\stackrel{\backslash a}{\rightarrow}$ &
+				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
-	\\
+				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
-	$\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ &
+				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
-	$\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
 \end{tabular}
 \end{center}
 \noindent
 and in the end test whether the set
-has the empty string \footnote{ we use the infix notation $A\backslash c$
+has the empty string.\footnote{We use the infix notation $A\backslash c$
 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
-on languages rather than regular expressions }.
+on languages rather than regular expressions.}
-In general, if we have a language $S_{start}$,
-then we can test whether $s$ is in $S_{start}$
+In general, if we have a language $S$,
+then we can test whether $s$ is in $S$
 by testing whether $[] \in S \backslash s$.
 With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
 we have a  few properties of how the language derivative can be defined using
 sub-languages.
 For example, for the sequence operator, we have
-something similar to the ``chain rule'' of the calculus derivative:
+something similar to a ``chain rule'':
 \begin{lemma}
 \[
 	\Der \; c \; (A @ B) =
 	\begin{cases}
 	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
 which of course contains $\Der \; c \; A @ A*$.
 \end{itemize}
 \end{proof}
 \noindent
-Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
+The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
-for regular languages, we need to first give definitions for regular expressions.
+for regular expressions.
+To introduce them, we need to first give definitions for regular expressions,
+which we shall do next.
 \subsection{Regular Expressions and Their Meaning}
 The \emph{basic regular expressions} are defined inductively
 by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
 We call them basic because we will introduce
 additional constructors in later chapters such as negation
 and bounded repetitions.
 We use $\ZERO$ for the regular expression that
 matches no string, and $\ONE$ for the regular
-expression that matches only the empty string\footnote{
+expression that matches only the empty string.\footnote{
-some authors
+Some authors
 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
-but we prefer our notation}.
+but we prefer our notation.}
 The sequence regular expression is written $r_1\cdot r_2$
 and sometimes we omit the dot if it is clear which
 regular expression is meant; the alternative
 is written $r_1 + r_2$.
 The \emph{language} or meaning of
 \begin{center}
 \begin{tabular}{lcl}
 $L \; \ZERO$ & $\dn$ & $\phi$\\
 $L \; \ONE$ & $\dn$ & $\{[]\}$\\
 $L \; c$ & $\dn$ & $\{[c]\}$\\
-$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
+$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
-$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
+$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
-$L \; r^*$ & $\dn$ & $ (L\;r)*$
+$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
 \end{tabular}
 \end{center}
 \noindent
-Now with language derivatives of a language and regular expressions and
+%Now with language derivatives of a language and regular expressions and
-their language interpretations in place, we are ready to define derivatives on regular expressions.
+%their language interpretations in place, we are ready to define derivatives on regular expressions.
+With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
+We do so by first introducing what properties it should satisfy.
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
 %Recall, the language derivative acts on a set of strings
 %and essentially chops off a particular character from
 %all strings in that set, Brzozowski defined a derivative operation on regular expressions
 %so that after derivative $L(r\backslash c)$
 %will look as if it was obtained by doing a language derivative on $L(r)$:
-Recall that the language derivative acts on a
+%Recall that the language derivative acts on a
-language (set of strings).
+%language (set of strings).
-One can decide whether a string $s$ belongs
+%One can decide whether a string $s$ belongs
-to a language $S$ by taking derivative with respect to
+%to a language $S$ by taking derivative with respect to
-that string and then checking whether the empty
+%that string and then checking whether the empty
-string is in the derivative:
+%string is in the derivative:
-\begin{center}
+%\begin{center}
-\parskip \baselineskip
+%\parskip \baselineskip
-\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
+%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
-\def\rlwd{.5pt}
+%\def\rlwd{.5pt}
-\newcommand\notate[3]{%
+%\newcommand\notate[3]{%
-\unskip\def\useanchorwidth{T}%
+%  \unskip\def\useanchorwidth{T}%
-\setbox0=\hbox{#1}%
+%  \setbox0=\hbox{#1}%
-\def\stackalignment{c}\stackunder[-6pt]{%
+%  \def\stackalignment{c}\stackunder[-6pt]{%
-\def\stackalignment{c}\stackunder[-1.5pt]{%
+%    \def\stackalignment{c}\stackunder[-1.5pt]{%
-\stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
+%      \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
-\rule{\rlwd}{#2\baselineskip}}}{%
+%    \rule{\rlwd}{#2\baselineskip}}}{%
-\strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
+%  \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
-}
+%}
-\Longstack{
+%\Longstack{
-\notate{$\{ \ldots ,\;$
+%\notate{$\{ \ldots ,\;$
-	\notate{s}{1}{$(c_1 :: s_1)$}
+%	\notate{s}{1}{$(c_1 :: s_1)$}
-	$, \; \ldots \}$
+%	$, \; \ldots \}$
-}{1}{$S_{start}$}
+%}{1}{$S_{start}$}
-}
+%}
-\Longstack{
+%\Longstack{
-	$\stackrel{\backslash c_1}{\longrightarrow}$
+%	$\stackrel{\backslash c_1}{\longrightarrow}$
-}
+%}
-\Longstack{
+%\Longstack{
-	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$}
+%	$\{ \ldots,\;$  \notate{$s_1$}{1}{$(c_2::s_2)$}
-	$,\; \ldots \}$
+%	$,\; \ldots \}$
-}
+%}
-\Longstack{
+%\Longstack{
-	$\stackrel{\backslash c_2}{\longrightarrow}$
+%	$\stackrel{\backslash c_2}{\longrightarrow}$
-}
+%}
-\Longstack{
+%\Longstack{
-	$\{ \ldots,\;  s_2
+%	$\{ \ldots,\;  s_2
-	,\; \ldots \}$
+%	,\; \ldots \}$
-}
+%}
-\Longstack{
+%\Longstack{
-	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $
+%	$ \xdashrightarrow{\backslash c_3\ldots\ldots} $
-}
+%}
-\Longstack{
+%\Longstack{
-	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
+%	\notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
-	S_{start}\backslash s$}
+%	S_{start}\backslash s$}
-}
+%}
-\end{center}
+%\end{center}
-\begin{center}
+%\begin{center}
-	$s \in S_{start} \iff [] \in S_{end}$
+%	$s \in S_{start} \iff [] \in S_{end}$
-\end{center}
+%\end{center}
-\noindent
+%\noindent
-Brzozowski noticed that this operation
+Brzozowski noticed that $\Der$
 can be ``mirrored'' on regular expressions which
 he calls the derivative of a regular expression $r$
 with respect to a character $c$, written
 $r \backslash c$. This infix operator
 takes an original regular expression $r$ as input
 \[
 	L \; (r \backslash c) = \Der \; c \; (L \; r)
 \]
 \end{property}
 \noindent
-Pictorially, this looks as follows:
+Pictorially, this looks as follows:\\
+\vspace{3mm}
 \parskip \baselineskip
 \def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
 \def\rlwd{.5pt}
 \newcommand\notate[3]{%
 	$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
 }
 \Longstack{
 	\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
 	\{\ldots,\;s_1,\;\ldots\}$}
-}
+}\\
-\\
+\vspace{ 3mm }
 The derivatives on regular expression can again be
 generalised to a string.
-One could compute $r\backslash s$  and test membership of $s$
+One could compute $r_{start} \backslash s$  and test membership of $s$
-in $L \; r$ by checking
+in $L \; r_{start}$ by checking
 whether the empty string is in the language of
-$r\backslash s$.
+$r_{end}$ ($r_{start}\backslash s$).\\
+\vspace{2mm}
 \Longstack{
 	\notate{$r_{start}$}{4}{
 		\Longstack{$L \; r_{start} = \{\ldots, \;$
 			\notate{$s$}{1}{$c_1::s_1$}
 		$, \ldots\} $
 \begin{property}
 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
 \end{property}
 \noindent
-Now we give the recursive definition of derivative on
+Next we give the recursive definition of derivative on
 regular expressions, so that it satisfies the properties above.
 The derivative function, written $r\backslash c$,
 takes a regular expression $r$ and character $c$, and
 returns a new regular expression representing
 the original regular expression's language $L \; r$
 \end{lemma}
 \begin{proof}
 	By induction on $r$.
 \end{proof}
 \noindent
-The main property of the derivative operation
+which is the main property of derivatives
-(that enables us to reason about the correctness of
+that enables us to reason about the correctness of
-derivative-based matching)
+derivative-based matching.
-is
-\noindent
 We can generalise the derivative operation shown above for single characters
 to strings as follows:
 \begin{center}
 \begin{tabular}{lcl}
 \noindent
 When there is no ambiguity, we will
 omit the subscript and use $\backslash$ instead
 of $\backslash_s$ to denote
 string derivatives for brevity.
-Brzozowski's  regular-expression matching algorithm can then be described as:
+Brzozowski's derivative-based
+regular-expression matching algorithm can then be described as:
 \begin{definition}
 $\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
 \end{definition}
 \;\textrm{true}/\textrm{false}
 \end{tikzcd}
 \end{equation}
 \noindent
-It can  be
+It is relatively
-relatively  easily shown that this matcher is correct:
+easy to show that this matcher is correct, namely
 \begin{lemma}
 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
 	By induction on $s$ using property of derivatives:
 	lemma \ref{derDer}.
 \end{proof}
-\noindent
 \begin{figure}
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={$n$},
 If we implement the above algorithm naively, however,
 the algorithm can be excruciatingly slow, as shown in
 \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
 Around two dozens characters
-would already explode the matcher on the regular expression
+would already ``explode'' the matcher with the regular expression
 $(a^*)^*b$.
-Too improve this situation, we need to introduce simplification
+To improve this situation, we need to introduce simplification
-rewrite rules for the intermediate results,
+rules for the intermediate results,
 such as $r + r \rightarrow r$,
 and make sure those rules do not change the
 language of the regular expression.
 One simpled-minded simplification function
-that achieves these requirements is given below:
+that achieves these requirements
+is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
 \begin{center}
 	\begin{tabular}{lcl}
 		$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
 		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
 					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
 The running time of $\textit{ders}\_\textit{simp}$
 on the same example of Figure \ref{NaiveMatcher}
 is now ``tame''  in terms of the length of inputs,
 as shown in Figure \ref{BetterMatcher}.
-So far the story is use Brzozowski derivatives,
+So far the story is use Brzozowski derivatives and
-simplifying where possible and at the end test
+simplify as much as possible and at the end test
-whether the empty string is recognised by the final derivative.
+whether the empty string is recognised
+by the final derivative.
 But what if we want to
 do lexing instead of just getting a true/false answer?
-Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and
+Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
-elegant (arguably as beautiful as the definition of the original derivative) solution for this.
+elegant (arguably as beautiful as the definition of the
+Brzozowski derivative) solution for this.
 \section{Values and the Lexing Algorithm by Sulzmann and Lu}
 In this section, we present a two-phase regular expression lexing
 algorithm.
 The first phase takes successive derivatives with
 		$|\Stars \; v::vs|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
 	\end{tabular}
 \end{center}
 Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
 to indicate that a value $v$ could be generated from a lexing algorithm
-with input $r$. They call it the value inhabitation relation.
+with input $r$. They call it the value inhabitation relation,
+defined by the rules.
 \begin{mathpar}
 	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
 	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}
 is to make sure that for a given pair of regular
 expression $r$ and string $s$, the number of values
 satisfying $|v| = s$ and $\vdash v:r$ is finite.
 This additional condition was
 imposed by Ausaf and Urban to make their proofs easier.
-Given the same string and regular expression, there can be
+Given a string and a regular expression, there can be
 multiple values for it. For example, both
 $\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
 and the values both flatten to $abc$.
 Lexers therefore have to disambiguate and choose only
-one of the values to output. $\POSIX$ is one of the
+one of the values be generated. $\POSIX$ is one of the
 disambiguation strategies that is widely adopted.
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016}
+Ausaf et al.\parencite{AusafDyckhoffUrban2016}
 formalised the property
 as a ternary relation.
 The $\POSIX$ value $v$ for a regular expression
 $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
-in the following set of rules\footnote{The names of the rules are used
+in the following rules\footnote{The names of the rules are used
-as they were originally given in \cite{AusafDyckhoffUrban2016} }:
+as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
-\begin{figure}[H]
+\begin{figure}[p]
 \begin{mathpar}
 	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
 	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
 	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
 		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
 		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
 	(v::vs)}
 \end{mathpar}
-\caption{The inductive POSIX Lexing Rules defined by Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}.
+\caption{The inductive POSIX rules given by Ausaf et al.
-The ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the
+	\cite{AusafDyckhoffUrban2016}.
+This ternary relation, written $(s, r) \rightarrow v$,
+formalises the POSIX constraints on the
 value $v$ given a string $s$ and
 regular expression $r$.
-For example, this specification says that all matches for an alternative
-must always prefer a left value to a right one.
 }
-\end{figure}
+\end{figure}\afterpage{\clearpage}
 \noindent
-\begin{figure}
+%\begin{figure}
-\begin{tikzpicture}[]
+%\begin{tikzpicture}[]
-\node [minimum width = 6cm, rectangle split, rectangle split horizontal,
+%    \node [minimum width = 6cm, rectangle split, rectangle split horizontal,
-	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
+%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
-	    (node1)
+%	    (node1)
-	    {$r_{token1}$
+%	    {$r_{token1}$
-	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
+%	    \nodepart{two}  $\;\;\; \quad r_{token2}\;\;\;\quad$ };
-	    %\node [left = 6.0cm of node1] (start1) {hi};
+%	    %\node [left = 6.0cm of node1] (start1) {hi};
-	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
+%	    \node [left = 0.2cm of node1] (middle) {$v.s.$};
-\node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
+%    \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
-	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
+%	    rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
-	    (node2)
+%	    (node2)
-	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
+%	    {$\quad\;\;\;r_{token1}\quad\;\;\;$
-	    \nodepart{two}  $r_{token2}$ };
+%	    \nodepart{two}  $r_{token2}$ };
-	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
+%	    \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
-		\node [above = 1.5cm of middle, minimum width = 6cm,
+%		\node [above = 1.5cm of middle, minimum width = 6cm,
-			rectangle, style={draw, rounded corners, inner sep=10pt}]
+%			rectangle, style={draw, rounded corners, inner sep=10pt}]
-			(topNode) {$s$};
+%			(topNode) {$s$};
-	    \path[->,draw]
+%	    \path[->,draw]
-	        (topNode) edge node {split $A$} (node2)
+%	        (topNode) edge node {split $A$} (node2)
-	        (topNode) edge node {split $B$} (node1)
+%	        (topNode) edge node {split $B$} (node1)
-		;
+%		;
+%
+%
-\end{tikzpicture}
+%\end{tikzpicture}
-\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
+%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
-\end{figure}
+%\end{figure}
 The above $\POSIX$ rules follows the intuition described below:
 \begin{itemize}
 	\item (Left Priority)\\
 		Match the leftmost regular expression when multiple options of matching
-		are available.
+		are available. See P+L and P+R where in P+R $s$ cannot
+		be in the language of $L \; r_1$.
 	\item (Maximum munch)\\
 		Always match a subpart as much as possible before proceeding
-		to the next token.
+		to the next part of the string.
 		For example, when the string $s$ matches
-		$r_{token1}\cdot r_{token2}$, and we have two ways $s$ can be split:
+		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
-		Then the split that matches a longer string for the first token
+		Then the split that matches a longer string for the first part
-		$r_{token1}$ is preferred by this maximum munch rule (See
+		$r_{part1}$ is preferred by this maximum munch rule.
-		\ref{munch} for an illustration).
+		This is caused by the side-condition
-\noindent
+		\begin{center}
+		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
+		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
+		\end{center}
+		in PS.
+		%(See
+		%\ref{munch} for an illustration).
 \end{itemize}
 \noindent
 These disambiguation strategies can be
 quite practical.
 For instance, when lexing a code snippet
 \[
 	\textit{iffoo} = 3
 \]
-using the regular expression (with
+using a regular expression
-keyword and identifier having their
+for keywords and
-usualy definitions on any formal
+identifiers:
-language textbook, for instance
+%(for example, keyword is a nonempty string starting with letters
-keyword is a nonempty string starting with letters
+%followed by alphanumeric characters or underscores):
-followed by alphanumeric characters or underscores):
+\[
-\[
+	r_{keyword} + r_{identifier}.
-	\textit{keyword} + \textit{identifier},
+\]
-\]
+If we want $\textit{iffoo}$ to be recognized
-we want $\textit{iffoo}$ to be recognized
+as an identifier
-as an identifier rather than a keyword (if)
+where identifiers are defined as usual (letters
+followed by letters, numbers or underscores),
+then a match with a keyword (if)
 followed by
-an identifier (foo).
+an identifier (foo) would be incorrect.
-POSIX lexing achieves this.
+POSIX lexing would generate what we want.
-We know that a $\POSIX$ value is also a normal underlying
+\noindent
-value:
+We know that a POSIX
+value for regular expression $r$ is inhabited by $r$.
 \begin{lemma}
 $(r, s) \rightarrow v \implies \vdash v: r$
 \end{lemma}
 \noindent
-The good property about a $\POSIX$ value is that
+The main property about a $\POSIX$ value is that
 given the same regular expression $r$ and string $s$,
 one can always uniquely determine the $\POSIX$ value for it:
 \begin{lemma}
 $\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
 \end{lemma}
 then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
 which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
 is the same as $\Seq(v_1, v_2)$.
 \end{proof}
 \noindent
-Now we know what a $\POSIX$ value is and why it is unique;
+We have now defined what a POSIX value is and shown that it is unique,
-the problem is generating
+the problem is to generate
 such a value in a lexing algorithm using derivatives.
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
 Sulzmann and Lu extended Brzozowski's
 derivative-based matching
-to a lexing algorithm by a second pass
+to a lexing algorithm by a second phase
 after the initial phase of successive derivatives.
 This second phase generates a POSIX value
 if the regular expression matches the string.
 Two functions are involved: $\inj$ and $\mkeps$.
-The first one used is $\mkeps$, which constructs a POSIX value from the last
+The function $\mkeps$ constructs a POSIX value from the last
 derivative $r_n$:
 \begin{ceqn}
 \begin{equation}\label{graph:mkeps}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
 \end{equation}
 \end{ceqn}
 \noindent
 In the above diagram, again we assume that
 the input string $s$ is made of $n$ characters
-$c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$
+$c_0c_1 \ldots c_{n-1}$
-is given label $0$ and after each character $c_i$ is taken off
+The last derivative operation
-by the derivative operation the resulting derivative regular
+$\backslash c_{n-1}$ generates the derivative $r_n$, for which
-expressioin is $r_{i+1}$.The last derivative operation
+$\mkeps$ produces the value $v_n$. This value
-$\backslash c_{n-1}$ gives back $r_n$, which is transformed into
+tells us how the empty string is matched by the (nullable)
-a value $v_n$ by $\mkeps$.
+regular expression $r_n$, in a POSIX way.
-$v_n$ tells us how an empty string is matched by the (nullable)
-regular expression $r_n$, in a $\POSIX$ way.
 The definition of $\mkeps$ is
 	\begin{center}
 		\begin{tabular}{lcl}
 			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
 			$\mkeps \; (r_{1}+r_{2})$	& $\dn$
 		\end{tabular}
 	\end{center}
 \noindent
-We favour the left child $r_1$ of $r_1 + r_2$
+The function prefers the left child $r_1$ of $r_1 + r_2$
 to match an empty string if there is a choice.
-When there is a star for us to match the empty string,
+When there is a star to match the empty string,
 we give the $\Stars$ constructor an empty list, meaning
 no iteration is taken.
-The result of a call to $\mkeps$ on a $\nullable$ $r$ would
+The result of $\mkeps$ on a $\nullable$ $r$
-be a $\POSIX$ value corresponding to $r$:
+is a POSIX value for $r$ and the empty string:
 \begin{lemma}\label{mePosix}
 $\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
 \end{lemma}
 \begin{proof}
-	By induction on the shape of $r$.
+	By induction on $r$.
 \end{proof}
 \noindent
-After the $\mkeps$-call, we inject back the characters one by one
+After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
 in reverse order as they were chopped off in the derivative phase.
-The fucntion for this is called $\inj$. $\inj$ and $\backslash$
+The fucntion for this is called $\inj$. This function
-are not exactly reverse operations of one another, as $\inj$
+operates on values, unlike $\backslash$ which operates on regular expressions.
-operates on values instead of regular
-expressions.
 In the diagram below, $v_i$ stands for the (POSIX) value
 for how the regular expression
 $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
 of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
 After injecting back $n$ characters, we get the lexical value for how $r_0$

changeset 622	4b1149fb5aec
parent 608	37b6fd310a16
child 623	c0c1ebe09c7d