--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Tue Nov 08 23:24:54 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Fri Nov 11 00:23:53 2022 +0000
@@ -13,32 +13,34 @@
This is essentially a description in ``English''
of our formalisation in Isabelle/HOL.
We also give the definition of what $\POSIX$ lexing means,
-followed by a lexing algorithm by Sulzmanna and Lu\parencite{Sulzmann2014}
+followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
that produces the output conforming
-to the $\POSIX$ standard.
-\footnote{In what follows
+to the $\POSIX$ standard\footnote{In what follows
we choose to use the Isabelle-style notation
for function applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g. $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
-to make the text visually more concise.}
+to make the text visually more concise.}.
\section{Basic Concepts}
-Usually, formal language theory starts with an alphabet
+Formal language theory usually starts with an alphabet
denoting a set of characters.
Here we just use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
In what follows, we shall leave the information about the alphabet
implicit.
Then using the usual bracket notation for lists,
-we can define strings made up of characters:
+we can define strings made up of characters as:
\begin{center}
\begin{tabular}{lcl}
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
\end{tabular}
\end{center}
where $c$ is a variable ranging over characters.
+The $::$ stands for list cons and $[]$ for the empty
+list.
+A singleton list is sometimes written as $[c]$ for brevity.
Strings can be concatenated to form longer strings in the same
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
@@ -98,42 +100,40 @@
\end{center}
\noindent
which is essentially the left quotient $A \backslash L$ of $A$ against
-the singleton language with $L = \{s\}$
-in formal language theory.
-However, for the purposes here, the $\textit{Ders}$ definition with
+the singleton language with $L = \{s\}$.
+However, for our purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.
The reason for defining derivatives
-is that it provides a different approach
+is that they provide another approach
to test membership of a string in
a set of strings.
For example, to test whether the string
$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
respect to the string $bar$:
\begin{center}
-\begin{tabular}{lclll}
+\begin{tabular}{lll}
$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
- $\{ar, rak\}$ &
- $\stackrel{\backslash a}{\rightarrow}$ &
- \\
- $\{r \}$ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$ &
- $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
+ $\{ar, rak\}$ \\
+ & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
+ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
+ %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
\end{tabular}
\end{center}
\noindent
and in the end test whether the set
-has the empty string \footnote{ we use the infix notation $A\backslash c$
+has the empty string.\footnote{We use the infix notation $A\backslash c$
instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
-on languages rather than regular expressions }.
-In general, if we have a language $S_{start}$,
-then we can test whether $s$ is in $S_{start}$
+on languages rather than regular expressions.}
+
+In general, if we have a language $S$,
+then we can test whether $s$ is in $S$
by testing whether $[] \in S \backslash s$.
-
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
we have a few properties of how the language derivative can be defined using
sub-languages.
For example, for the sequence operator, we have
-something similar to the ``chain rule'' of the calculus derivative:
+something similar to a ``chain rule'':
\begin{lemma}
\[
\Der \; c \; (A @ B) =
@@ -178,8 +178,10 @@
\end{proof}
\noindent
-Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
-for regular languages, we need to first give definitions for regular expressions.
+The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
+for regular expressions.
+To introduce them, we need to first give definitions for regular expressions,
+which we shall do next.
\subsection{Regular Expressions and Their Meaning}
The \emph{basic regular expressions} are defined inductively
@@ -196,10 +198,10 @@
and bounded repetitions.
We use $\ZERO$ for the regular expression that
matches no string, and $\ONE$ for the regular
-expression that matches only the empty string\footnote{
-some authors
+expression that matches only the empty string.\footnote{
+Some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
-but we prefer our notation}.
+but we prefer our notation.}
The sequence regular expression is written $r_1\cdot r_2$
and sometimes we omit the dot if it is clear which
regular expression is meant; the alternative
@@ -213,72 +215,75 @@
$L \; \ZERO$ & $\dn$ & $\phi$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
-$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
-$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
-$L \; r^*$ & $\dn$ & $ (L\;r)*$
+$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
+$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
+$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
\noindent
-Now with language derivatives of a language and regular expressions and
-their language interpretations in place, we are ready to define derivatives on regular expressions.
+%Now with language derivatives of a language and regular expressions and
+%their language interpretations in place, we are ready to define derivatives on regular expressions.
+With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
+We do so by first introducing what properties it should satisfy.
+
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
%Recall, the language derivative acts on a set of strings
%and essentially chops off a particular character from
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
%so that after derivative $L(r\backslash c)$
%will look as if it was obtained by doing a language derivative on $L(r)$:
-Recall that the language derivative acts on a
-language (set of strings).
-One can decide whether a string $s$ belongs
-to a language $S$ by taking derivative with respect to
-that string and then checking whether the empty
-string is in the derivative:
-\begin{center}
-\parskip \baselineskip
-\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
-\def\rlwd{.5pt}
-\newcommand\notate[3]{%
- \unskip\def\useanchorwidth{T}%
- \setbox0=\hbox{#1}%
- \def\stackalignment{c}\stackunder[-6pt]{%
- \def\stackalignment{c}\stackunder[-1.5pt]{%
- \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
- \rule{\rlwd}{#2\baselineskip}}}{%
- \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
-}
-\Longstack{
-\notate{$\{ \ldots ,\;$
- \notate{s}{1}{$(c_1 :: s_1)$}
- $, \; \ldots \}$
-}{1}{$S_{start}$}
-}
-\Longstack{
- $\stackrel{\backslash c_1}{\longrightarrow}$
-}
-\Longstack{
- $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
- $,\; \ldots \}$
-}
-\Longstack{
- $\stackrel{\backslash c_2}{\longrightarrow}$
-}
-\Longstack{
- $\{ \ldots,\; s_2
- ,\; \ldots \}$
-}
-\Longstack{
- $ \xdashrightarrow{\backslash c_3\ldots\ldots} $
-}
-\Longstack{
- \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
- S_{start}\backslash s$}
-}
-\end{center}
-\begin{center}
- $s \in S_{start} \iff [] \in S_{end}$
-\end{center}
-\noindent
-Brzozowski noticed that this operation
+%Recall that the language derivative acts on a
+%language (set of strings).
+%One can decide whether a string $s$ belongs
+%to a language $S$ by taking derivative with respect to
+%that string and then checking whether the empty
+%string is in the derivative:
+%\begin{center}
+%\parskip \baselineskip
+%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
+%\def\rlwd{.5pt}
+%\newcommand\notate[3]{%
+% \unskip\def\useanchorwidth{T}%
+% \setbox0=\hbox{#1}%
+% \def\stackalignment{c}\stackunder[-6pt]{%
+% \def\stackalignment{c}\stackunder[-1.5pt]{%
+% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
+% \rule{\rlwd}{#2\baselineskip}}}{%
+% \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
+%}
+%\Longstack{
+%\notate{$\{ \ldots ,\;$
+% \notate{s}{1}{$(c_1 :: s_1)$}
+% $, \; \ldots \}$
+%}{1}{$S_{start}$}
+%}
+%\Longstack{
+% $\stackrel{\backslash c_1}{\longrightarrow}$
+%}
+%\Longstack{
+% $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
+% $,\; \ldots \}$
+%}
+%\Longstack{
+% $\stackrel{\backslash c_2}{\longrightarrow}$
+%}
+%\Longstack{
+% $\{ \ldots,\; s_2
+% ,\; \ldots \}$
+%}
+%\Longstack{
+% $ \xdashrightarrow{\backslash c_3\ldots\ldots} $
+%}
+%\Longstack{
+% \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
+% S_{start}\backslash s$}
+%}
+%\end{center}
+%\begin{center}
+% $s \in S_{start} \iff [] \in S_{end}$
+%\end{center}
+%\noindent
+Brzozowski noticed that $\Der$
can be ``mirrored'' on regular expressions which
he calls the derivative of a regular expression $r$
with respect to a character $c$, written
@@ -298,7 +303,8 @@
\]
\end{property}
\noindent
-Pictorially, this looks as follows:
+Pictorially, this looks as follows:\\
+\vspace{3mm}
\parskip \baselineskip
\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
@@ -322,16 +328,17 @@
\Longstack{
\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
\{\ldots,\;s_1,\;\ldots\}$}
-}
-\\
+}\\
+\vspace{ 3mm }
+
The derivatives on regular expression can again be
generalised to a string.
-One could compute $r\backslash s$ and test membership of $s$
-in $L \; r$ by checking
+One could compute $r_{start} \backslash s$ and test membership of $s$
+in $L \; r_{start}$ by checking
whether the empty string is in the language of
-$r\backslash s$.
+$r_{end}$ ($r_{start}\backslash s$).\\
-
+\vspace{2mm}
\Longstack{
\notate{$r_{start}$}{4}{
\Longstack{$L \; r_{start} = \{\ldots, \;$
@@ -373,7 +380,7 @@
$s \in L \; r_{start} \iff [] \in L \; r_{end}$
\end{property}
\noindent
-Now we give the recursive definition of derivative on
+Next we give the recursive definition of derivative on
regular expressions, so that it satisfies the properties above.
The derivative function, written $r\backslash c$,
takes a regular expression $r$ and character $c$, and
@@ -485,13 +492,9 @@
By induction on $r$.
\end{proof}
\noindent
-The main property of the derivative operation
-(that enables us to reason about the correctness of
-derivative-based matching)
-is
-
-
-\noindent
+which is the main property of derivatives
+that enables us to reason about the correctness of
+derivative-based matching.
We can generalise the derivative operation shown above for single characters
to strings as follows:
@@ -507,7 +510,8 @@
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
-Brzozowski's regular-expression matching algorithm can then be described as:
+Brzozowski's derivative-based
+regular-expression matching algorithm can then be described as:
\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
@@ -526,8 +530,8 @@
\end{equation}
\noindent
- It can be
-relatively easily shown that this matcher is correct:
+ It is relatively
+easy to show that this matcher is correct, namely
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
@@ -535,7 +539,6 @@
By induction on $s$ using property of derivatives:
lemma \ref{derDer}.
\end{proof}
-\noindent
\begin{figure}
\begin{center}
\begin{tikzpicture}
@@ -560,15 +563,16 @@
\ref{NaiveMatcher}.
Note that both axes are in logarithmic scale.
Around two dozens characters
-would already explode the matcher on the regular expression
+would already ``explode'' the matcher with the regular expression
$(a^*)^*b$.
-Too improve this situation, we need to introduce simplification
-rewrite rules for the intermediate results,
+To improve this situation, we need to introduce simplification
+rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
One simpled-minded simplification function
-that achieves these requirements is given below:
+that achieves these requirements
+is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
\begin{center}
\begin{tabular}{lcl}
$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
@@ -620,13 +624,15 @@
is now ``tame'' in terms of the length of inputs,
as shown in Figure \ref{BetterMatcher}.
-So far the story is use Brzozowski derivatives,
-simplifying where possible and at the end test
-whether the empty string is recognised by the final derivative.
+So far the story is use Brzozowski derivatives and
+simplify as much as possible and at the end test
+whether the empty string is recognised
+by the final derivative.
But what if we want to
do lexing instead of just getting a true/false answer?
-Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and
-elegant (arguably as beautiful as the definition of the original derivative) solution for this.
+Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
+elegant (arguably as beautiful as the definition of the
+Brzozowski derivative) solution for this.
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing
@@ -686,7 +692,8 @@
\end{center}
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
to indicate that a value $v$ could be generated from a lexing algorithm
-with input $r$. They call it the value inhabitation relation.
+with input $r$. They call it the value inhabitation relation,
+defined by the rules.
\begin{mathpar}
\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
@@ -707,23 +714,23 @@
satisfying $|v| = s$ and $\vdash v:r$ is finite.
This additional condition was
imposed by Ausaf and Urban to make their proofs easier.
-Given the same string and regular expression, there can be
+Given a string and a regular expression, there can be
multiple values for it. For example, both
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
and the values both flatten to $abc$.
Lexers therefore have to disambiguate and choose only
-one of the values to output. $\POSIX$ is one of the
+one of the values be generated. $\POSIX$ is one of the
disambiguation strategies that is widely adopted.
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016}
+Ausaf et al.\parencite{AusafDyckhoffUrban2016}
formalised the property
as a ternary relation.
The $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
-in the following set of rules\footnote{The names of the rules are used
-as they were originally given in \cite{AusafDyckhoffUrban2016} }:
-\begin{figure}[H]
+in the following rules\footnote{The names of the rules are used
+as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
+\begin{figure}[p]
\begin{mathpar}
\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
@@ -745,58 +752,64 @@
s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
(v::vs)}
\end{mathpar}
-\caption{The inductive POSIX Lexing Rules defined by Ausaf, Dyckhoff and Urban \cite{AusafDyckhoffUrban2016}.
-The ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the
+\caption{The inductive POSIX rules given by Ausaf et al.
+ \cite{AusafDyckhoffUrban2016}.
+This ternary relation, written $(s, r) \rightarrow v$,
+formalises the POSIX constraints on the
value $v$ given a string $s$ and
regular expression $r$.
-For example, this specification says that all matches for an alternative
-must always prefer a left value to a right one.
}
-\end{figure}
+\end{figure}\afterpage{\clearpage}
\noindent
-\begin{figure}
-\begin{tikzpicture}[]
- \node [minimum width = 6cm, rectangle split, rectangle split horizontal,
- rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
- (node1)
- {$r_{token1}$
- \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ };
- %\node [left = 6.0cm of node1] (start1) {hi};
- \node [left = 0.2cm of node1] (middle) {$v.s.$};
- \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
- rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
- (node2)
- {$\quad\;\;\;r_{token1}\quad\;\;\;$
- \nodepart{two} $r_{token2}$ };
- \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
- \node [above = 1.5cm of middle, minimum width = 6cm,
- rectangle, style={draw, rounded corners, inner sep=10pt}]
- (topNode) {$s$};
- \path[->,draw]
- (topNode) edge node {split $A$} (node2)
- (topNode) edge node {split $B$} (node1)
- ;
-
-
-\end{tikzpicture}
-\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
-\end{figure}
+%\begin{figure}
+%\begin{tikzpicture}[]
+% \node [minimum width = 6cm, rectangle split, rectangle split horizontal,
+% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
+% (node1)
+% {$r_{token1}$
+% \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ };
+% %\node [left = 6.0cm of node1] (start1) {hi};
+% \node [left = 0.2cm of node1] (middle) {$v.s.$};
+% \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
+% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
+% (node2)
+% {$\quad\;\;\;r_{token1}\quad\;\;\;$
+% \nodepart{two} $r_{token2}$ };
+% \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
+% \node [above = 1.5cm of middle, minimum width = 6cm,
+% rectangle, style={draw, rounded corners, inner sep=10pt}]
+% (topNode) {$s$};
+% \path[->,draw]
+% (topNode) edge node {split $A$} (node2)
+% (topNode) edge node {split $B$} (node1)
+% ;
+%
+%
+%\end{tikzpicture}
+%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
+%\end{figure}
The above $\POSIX$ rules follows the intuition described below:
\begin{itemize}
\item (Left Priority)\\
Match the leftmost regular expression when multiple options of matching
- are available.
+ are available. See P+L and P+R where in P+R $s$ cannot
+ be in the language of $L \; r_1$.
\item (Maximum munch)\\
Always match a subpart as much as possible before proceeding
- to the next token.
+ to the next part of the string.
For example, when the string $s$ matches
- $r_{token1}\cdot r_{token2}$, and we have two ways $s$ can be split:
- Then the split that matches a longer string for the first token
- $r_{token1}$ is preferred by this maximum munch rule (See
- \ref{munch} for an illustration).
-\noindent
-
+ $r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
+ Then the split that matches a longer string for the first part
+ $r_{part1}$ is preferred by this maximum munch rule.
+ This is caused by the side-condition
+ \begin{center}
+ $\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
+ s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
+ \end{center}
+ in PS.
+ %(See
+ %\ref{munch} for an illustration).
\end{itemize}
\noindent
These disambiguation strategies can be
@@ -805,28 +818,31 @@
\[
\textit{iffoo} = 3
\]
-using the regular expression (with
-keyword and identifier having their
-usualy definitions on any formal
-language textbook, for instance
-keyword is a nonempty string starting with letters
-followed by alphanumeric characters or underscores):
+using a regular expression
+for keywords and
+identifiers:
+%(for example, keyword is a nonempty string starting with letters
+%followed by alphanumeric characters or underscores):
\[
- \textit{keyword} + \textit{identifier},
+ r_{keyword} + r_{identifier}.
\]
-we want $\textit{iffoo}$ to be recognized
-as an identifier rather than a keyword (if)
+If we want $\textit{iffoo}$ to be recognized
+as an identifier
+where identifiers are defined as usual (letters
+followed by letters, numbers or underscores),
+then a match with a keyword (if)
followed by
-an identifier (foo).
-POSIX lexing achieves this.
+an identifier (foo) would be incorrect.
+POSIX lexing would generate what we want.
-We know that a $\POSIX$ value is also a normal underlying
-value:
+\noindent
+We know that a POSIX
+value for regular expression $r$ is inhabited by $r$.
\begin{lemma}
$(r, s) \rightarrow v \implies \vdash v: r$
\end{lemma}
\noindent
-The good property about a $\POSIX$ value is that
+The main property about a $\POSIX$ value is that
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for it:
\begin{lemma}
@@ -877,20 +893,20 @@
is the same as $\Seq(v_1, v_2)$.
\end{proof}
\noindent
-Now we know what a $\POSIX$ value is and why it is unique;
-the problem is generating
+We have now defined what a POSIX value is and shown that it is unique,
+the problem is to generate
such a value in a lexing algorithm using derivatives.
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
Sulzmann and Lu extended Brzozowski's
derivative-based matching
-to a lexing algorithm by a second pass
+to a lexing algorithm by a second phase
after the initial phase of successive derivatives.
This second phase generates a POSIX value
if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
-The first one used is $\mkeps$, which constructs a POSIX value from the last
+The function $\mkeps$ constructs a POSIX value from the last
derivative $r_n$:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
@@ -903,14 +919,12 @@
\noindent
In the above diagram, again we assume that
the input string $s$ is made of $n$ characters
-$c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$
-is given label $0$ and after each character $c_i$ is taken off
-by the derivative operation the resulting derivative regular
-expressioin is $r_{i+1}$.The last derivative operation
-$\backslash c_{n-1}$ gives back $r_n$, which is transformed into
-a value $v_n$ by $\mkeps$.
-$v_n$ tells us how an empty string is matched by the (nullable)
-regular expression $r_n$, in a $\POSIX$ way.
+$c_0c_1 \ldots c_{n-1}$
+The last derivative operation
+$\backslash c_{n-1}$ generates the derivative $r_n$, for which
+$\mkeps$ produces the value $v_n$. This value
+tells us how the empty string is matched by the (nullable)
+regular expression $r_n$, in a POSIX way.
The definition of $\mkeps$ is
\begin{center}
\begin{tabular}{lcl}
@@ -926,26 +940,24 @@
\noindent
-We favour the left child $r_1$ of $r_1 + r_2$
+The function prefers the left child $r_1$ of $r_1 + r_2$
to match an empty string if there is a choice.
-When there is a star for us to match the empty string,
+When there is a star to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iteration is taken.
-The result of a call to $\mkeps$ on a $\nullable$ $r$ would
-be a $\POSIX$ value corresponding to $r$:
+The result of $\mkeps$ on a $\nullable$ $r$
+is a POSIX value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
\end{lemma}
\begin{proof}
- By induction on the shape of $r$.
+ By induction on $r$.
\end{proof}
\noindent
-After the $\mkeps$-call, we inject back the characters one by one
+After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
in reverse order as they were chopped off in the derivative phase.
-The fucntion for this is called $\inj$. $\inj$ and $\backslash$
-are not exactly reverse operations of one another, as $\inj$
-operates on values instead of regular
-expressions.
+The fucntion for this is called $\inj$. This function
+operates on values, unlike $\backslash$ which operates on regular expressions.
In the diagram below, $v_i$ stands for the (POSIX) value
for how the regular expression
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters