--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Tue Jul 05 00:42:06 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Wed Jul 13 08:27:28 2022 +0100
@@ -4,22 +4,32 @@
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
-%use for describing the lexing algorithm by Sulzmann and Lu,
-%and then give the algorithm and its variant, and discuss
+% used for describing the lexing algorithm by Sulzmann and Lu,
+%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.
In this chapter, we define the basic notions
for regular languages and regular expressions.
-This is essentially a description in "English"
-of your formalisation in Isabelle/HOL.
-We also give the definition of what $\POSIX$ lexing means.
+This is essentially a description in ``English"
+of our formalisation in Isabelle/HOL.
+We also give the definition of what $\POSIX$ lexing means,
+followed by an algorithm by Sulzmanna and Lu\parencite{Sulzmann2014}
+that produces the output conforming
+to the $\POSIX$ standard.
+It is also worth mentioning that
+we choose to use the ML-style notation
+for function applications, where
+the parameters of a function is not enclosed
+inside a pair of parentheses (e.g. $f \;x \;y$
+instead of $f(x,\;y)$). This is mainly
+to make the text visually more concise.
\section{Basic Concepts}
-Usually formal language theory starts with an alphabet
+Usually, formal language theory starts with an alphabet
denoting a set of characters.
Here we just use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
-In what follows we shall leave the information about the alphabet
+In what follows, we shall leave the information about the alphabet
implicit.
Then using the usual bracket notation for lists,
we can define strings made up of characters:
@@ -30,7 +40,7 @@
\end{center}
Where $c$ is a variable ranging over characters.
Strings can be concatenated to form longer strings in the same
-way as we concatenate two lists, which we write as @.
+way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
@@ -48,8 +58,8 @@
$A^{n+1}$ & $\dn$ & $A @ A^n$
\end{tabular}
\end{center}
-The union of all the natural number powers of a language
-is usually defined as the Kleene star operator:
+The union of all powers of a language
+can be used to define the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
@@ -57,19 +67,19 @@
\end{center}
\noindent
-However, to obtain a convenient induction principle
+However, to obtain a more convenient induction principle
in Isabelle/HOL,
we instead define the Kleene star
as an inductive set:
\begin{center}
\begin{mathpar}
-\inferrule{}{[] \in A*\\}
+ \inferrule{\mbox{}}{[] \in A*\\}
-\inferrule{\\s_1 \in A \land \; s_2 \in A*}{s_1 @ s_2 \in A*}
+\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}
-\ChristianComment{Yes, used the inferrule command in mathpar}
+\noindent
We also define an operation of "chopping off" a character from
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
\begin{center}
@@ -89,10 +99,10 @@
which is essentially the left quotient $A \backslash L$ of $A$ against
the singleton language with $L = \{w\}$
in formal language theory.
-However for the purposes here, the $\textit{Ders}$ definition with
+However, for the purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.
-With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
+With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
we have a few properties of how the language derivative can be defined using
sub-languages.
\begin{lemma}
@@ -113,26 +123,28 @@
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
\end{lemma}
\begin{proof}
+There are too inclusions to prove:
\begin{itemize}
-\item{$\subseteq$}
-\noindent
+\item{$\subseteq$}:\\
The set
\[ \{s \mid c :: s \in A*\} \]
is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2. s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
+\[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever you have a string starting with a character
in the language of a Kleene star $A*$,
then that character together with some sub-string
immediately after it will form the first iteration,
and the rest of the string will
be still in $A*$.
-\item{$\supseteq$}
+\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
-and
+hold.
+Also this holds:
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
-where the $\textit{RHS}$ of the above equatioin can be rewritten
-as \[ (\Der \; c\; A) @ A* \cup A' \], $A'$ being a possibly empty set.
+where the $\textit{RHS}$ can be rewritten
+as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
+which of course contains $\Der \; c \; A @ A*$.
\end{itemize}
\end{proof}
@@ -141,7 +153,7 @@
for regular languages, we need to first give definitions for regular expressions.
\subsection{Regular Expressions and Their Meaning}
-The basic regular expressions are defined inductively
+The \emph{basic regular expressions} are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
@@ -150,81 +162,127 @@
\mid r^*
\]
\noindent
-We call them basic because we might introduce
-more constructs later such as negation
+We call them basic because we will introduce
+additional constructors in later chapters such as negation
and bounded repetitions.
-We defined the regular expression containing
-nothing as $\ZERO$, note that some authors
-also use $\phi$ for that.
-Similarly, the regular expression denoting the
-singleton set with only $[]$ is sometimes
-denoted by $\epsilon$, but we use $\ONE$ here.
-
-The language or set of strings denoted
-by regular expressions are defined as
+We use $\ZERO$ for the regular expression that
+matches no string, and $\ONE$ for the regular
+expression that matches only the empty string\footnote{
+some authors
+also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
+but we prefer our notation}.
+The sequence regular expression is written $r_1\cdot r_2$
+and sometimes we omit the dot if it is clear which
+regular expression is meant; the alternative
+is written $r_1 + r_2$.
+The \emph{language} or meaning of
+a regular expression is defined recursively as
+a set of strings:
%TODO: FILL in the other defs
\begin{center}
\begin{tabular}{lcl}
-$L \; (\ZERO)$ & $\dn$ & $\phi$\\
-$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
-$L \; (c)$ & $\dn$ & $\{[c]\}$\\
-$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
-$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
-$L \; (r^*)$ & $\dn$ & $ (L(r))^*$
+$L \; \ZERO$ & $\dn$ & $\phi$\\
+$L \; \ONE$ & $\dn$ & $\{[]\}$\\
+$L \; c$ & $\dn$ & $\{[c]\}$\\
+$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
+$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
+$L \; r^*$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
\noindent
-Which is also called the "language interpretation" of
-a regular expression.
-
Now with semantic derivatives of a language and regular expressions and
their language interpretations in place, we are ready to define derivatives on regexes.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
-
-\ChristianComment{Hi this part I want to keep the ordering as is, so that it keeps the
-readers engaged with a story how we got to the definition of $\backslash$, rather
-than first "overwhelming" them with the definition of $\nullable$.}
-
-The language derivative acts on a string set and chops off a character from
-all strings in that set, we want to define a derivative operation on regular expressions
-so that after derivative $L(r\backslash c)$
-will look as if it was obtained by doing a language derivative on $L(r)$:
+%Recall, the language derivative acts on a set of strings
+%and essentially chops off a particular character from
+%all strings in that set, Brzozowski defined a derivative operation on regular expressions
+%so that after derivative $L(r\backslash c)$
+%will look as if it was obtained by doing a language derivative on $L(r)$:
+Recall that the semantic derivative acts on a set of
+strings. Brzozowski noticed that this operation
+can be ``mirrored" on regular expressions which
+he calls the derivative of a regular expression $r$
+with respect to a character $c$, written
+$r \backslash c$.
+He defined this operation such that the following property holds:
\begin{center}
-\[
-r\backslash c \dn ?
-\]
-so that
-\[
-L(r \backslash c) = \Der \; c \; L(r) ?
-\]
-\end{center}
-So we mimic the equalities we have for $\Der$ on language concatenation
\[
-\Der \; c \; (A @ B) = \textit{if} \; [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B\\
+L(r \backslash c) = \Der \; c \; L(r)
\]
-to get the derivative for sequence regular expressions:
-\[
-(r_1 \cdot r_2 ) \backslash c = \textit{if}\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c \textit{else} (r_1 \backslash c) \cdot r_2
-\]
+\end{center}
+\noindent
+For example in the sequence case we have
+\begin{center}
+ \begin{tabular}{lcl}
+ $\Der \; c \; (A @ B)$ & $\dn$ &
+ $ \textit{if} \;\, [] \in A \;
+ \textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
+ \Der \; c\; B$\\
+ & & $\textit{else}\; (\Der \; c \; A) @ B$\\
+ \end{tabular}
+\end{center}
+\noindent
+This can be translated to
+regular expressions in the following
+manner:
+\begin{center}
+ \begin{tabular}{lcl}
+ $(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+ & & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
+ \end{tabular}
+\end{center}
\noindent
-and language Kleene star:
+And similarly, the Kleene star's semantic derivative
+can be expressed as
\[
-\textit{Der} \;c \;A* = (\textit{Der}\; c A) @ (A*)
+ \textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
\]
-to get derivative of the Kleene star regular expression:
+which translates to
\[
-r^* \backslash c = (r \backslash c)\cdot r^*
+ (r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
-Note that although we can formalise the boolean predicate
-$[] \in L(r_1)$ without problems, if we want a function that works
-computationally, then we would have to define a function that tests
-whether an empty string is in the language of a regular expression.
-We call such a function $\nullable$:
-
-
-
+In the above definition of $(r_1\cdot r_2) \backslash c$,
+the $\textit{if}$ clause's
+boolean condition
+$[] \in L(r_1)$ needs to be
+somehow recursively computed.
+We call such a function that checks
+whether the empty string $[]$ is
+in the language of a regular expression $\nullable$:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
+ $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
+ $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
+ $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+ $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+ $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
+ \end{tabular}
+\end{center}
+\noindent
+The $\ZERO$ regular expression
+does not contain any string and
+therefore is not \emph{nullable}.
+$\ONE$ is \emph{nullable}
+by definition.
+The character regular expression $c$
+corresponds to the singleton set $\{c\}$,
+and therefore does not contain the empty string.
+The alternative regular expression is nullable
+if at least one of its children is nullable.
+The sequence regular expression
+would require both children to have the empty string
+to compose an empty string, and the Kleene star
+is always nullable because it naturally
+contains the empty string.
+
+The derivative function, written $r\backslash c$,
+defines how a regular expression evolves into
+a new one after all the string it contains is acted on:
+if it starts with $c$, then the character is chopped of,
+if not, that string is removed.
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
@@ -239,67 +297,32 @@
\end{tabular}
\end{center}
\noindent
-The function derivative, written $r\backslash c$,
-defines how a regular expression evolves into
-a new regular expression after all the string it contains
-is chopped off a certain head character $c$.
-The most involved cases are the sequence
-and star case.
+The most involved cases are the sequence case
+and the star case.
The sequence case says that if the first regular expression
-contains an empty string then the second component of the sequence
-might be chosen as the target regular expression to be chopped
-off its head character.
-The star regular expression's derivative unwraps the iteration of
-regular expression and attaches the star regular expression
-to the sequence's second element to make sure a copy is retained
-for possible more iterations in later phases of lexing.
-
-
-To test whether $[] \in L(r_1)$, we need the $\nullable$ function,
-which tests whether the empty string $""$
-is in the language of $r$:
-
-
-\begin{center}
- \begin{tabular}{lcl}
- $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
- $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
- $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
- $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
- $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
- $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
- \end{tabular}
-\end{center}
-\noindent
-The empty set does not contain any string and
-therefore not the empty string, the empty string
-regular expression contains the empty string
-by definition, the character regular expression
-is the singleton that contains character only,
-and therefore does not contain the empty string,
-the alternative regular expression (or "or" expression)
-might have one of its children regular expressions
-being nullable and any one of its children being nullable
-would suffice. The sequence regular expression
-would require both children to have the empty string
-to compose an empty string and the Kleene star
-operation naturally introduced the empty string.
-
-We have the following property where the derivative on regular
-expressions coincides with the derivative on a set of strings:
-
-\begin{lemma}
+contains an empty string, then the second component of the sequence
+needs to be considered, as its derivative will contribute to the
+result of this derivative.
+The star regular expression $r^*$'s derivative
+unwraps one iteration of $r$, turns it into $r\backslash c$,
+and attaches the original $r^*$
+after $r\backslash c$, so that
+we can further unfold it as many times as needed.
+We have the following correspondence between
+derivatives on regular expressions and
+derivatives on a set of strings:
+\begin{lemma}\label{derDer}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}
\noindent
The main property of the derivative operation
-that enables us to reason about the correctness of
-an algorithm using derivatives is
+(that enables us to reason about the correctness of
+derivative-based matching)
+is
\begin{lemma}\label{derStepwise}
-$c\!::\!s \in L(r)$ holds
-if and only if $s \in L(r\backslash c)$.
+ $c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
\end{lemma}
\noindent
@@ -314,12 +337,14 @@
\end{center}
\noindent
-When there is no ambiguity we will use $\backslash$ to denote
+When there is no ambiguity, we will
+omit the subscript and use $\backslash$ instead
+of $\backslash_r$ to denote
string derivatives for brevity.
Brzozowski's regular-expression matcher algorithm can then be described as:
\begin{definition}
-$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
+$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
\end{definition}
\noindent
@@ -336,17 +361,15 @@
It can be
relatively easily shown that this matcher is correct:
\begin{lemma}
-$\textit{match} \; s\; r = \textit{true} \Longleftrightarrow s \in L(r)$
+ $\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
-By the stepwise property of $\backslash$ (\ref{derStepwise})
+ By the stepwise property of derivatives (lemma \ref{derStepwise})
+ and lemma \ref{derDer}.
\end{proof}
\noindent
-If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow.
-
-
-\begin{figure}
+\begin{center}
+ \begin{figure}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
@@ -360,84 +383,93 @@
\end{tikzpicture}
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}
-
+\end{center}
\noindent
-For this we need to introduce certain
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow, as shown in
+\ref{NaiveMatcher}.
+Note that both axes are in logarithmic scale.
+Around two dozens characters
+would already explode the matcher on regular expression
+$(a^*)^*b$.
+For this, we need to introduce certain
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
-We have a simplification function (that is as simple as possible
-while having much power on making a regex simpler):
-\begin{verbatim}
-def simp(r: Rexp) : Rexp = r match {
- case SEQ(r1, r2) =>
- (simp(r1), simp(r2)) match {
- case (ZERO, _) => ZERO
- case (_, ZERO) => ZERO
- case (ONE, r2s) => r2s
- case (r1s, ONE) => r1s
- case (r1s, r2s) => SEQ(r1s, r2s)
- }
- case ALTS(r1, r2) => {
- (simp(r1), simp(r2)) match {
- case (ZERO, r2s) => r2s
- case (r1s, ZERO) => r1s
- case (r1s, r2s) =>
- if(r1s == r2s) r1s else ALTS(r1s, r2s)
- }
- }
- case r => r
-}
-\end{verbatim}
-If we repeatedly incorporate these
-rules during the matching algorithm,
-we have a lexer with simplification:
-\begin{verbatim}
-def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
- case Nil => simp(r)
- case c :: cs => ders_simp(cs, simp(der(c, r)))
-}
-
-def simp_matcher(s: String, r: Rexp) : Boolean =
- nullable(ders_simp(s.toList, r))
-
-\end{verbatim}
-\noindent
-After putting in those rules, the example of \ref{NaiveMatcher}
-is now very tame in the length of inputs:
-
-
+One simpled-minded simplification function
+that achieves these requirements is given below:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\simp \; r_1 \cdot r_2 $ & $ \dn$ &
+ $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
+ & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
+ & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
+ & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
+ & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
+ & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
+ $\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
+ & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
+ & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
+ & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
+ $\simp \; r$ & $\dn$ & $r$
+
+ \end{tabular}
+\end{center}
+If we repeatedly apply this simplification
+function during the matching algorithm,
+we have a matcher with simplification:
+\begin{center}
+ \begin{tabular}{lcl}
+ $\derssimp \; [] \; r$ & $\dn$ & $r$\\
+ $\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
+ $\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
+ \end{tabular}
+\end{center}
+\begin{figure}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
ylabel={time in secs},
ymode = log,
xmode = log,
+ grid = both,
legend entries={Matcher With Simp},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
-\end{tikzpicture} \label{fig:BetterMatcher}
-
+\end{tikzpicture}
+\caption{$(a^*)^*b$
+against
+$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
+\end{figure}
+\noindent
+The running time of $\textit{ders}\_\textit{simp}$
+on the same example of \ref{NaiveMatcher}
+is now very tame in terms of the length of inputs,
+as shown in \ref{BetterMatcher}.
-\noindent
-Note how the x-axis is in logarithmic scale.
Building derivatives and then testing the existence
of empty string in the resulting regular expression's language,
-and add simplification rules when necessary.
+adding simplifications when necessary.
So far, so good. But what if we want to
do lexing instead of just getting a YES/NO answer?
-\citeauthor{Sulzmann2014} first came up with a nice and
+Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the original derivative) solution for this.
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
-Here we present the hybrid phases of a regular expression lexing
-algorithm using the function $\inj$, as given by Sulzmann and Lu.
-They first defined the datatypes for storing the
-lexing information called a \emph{value} or
-sometimes also \emph{lexical value}. These values and regular
+In this section, we present a two-phase regular expression lexing
+algorithm.
+The first phase takes successive derivatives with
+respect to the input string,
+and the second phase does the reverse, \emph{injecting} back
+characters, in the meantime constructing a lexing result.
+We will introduce the injection phase in detail slightly
+later, but as a preliminary we have to first define
+the datatype for lexing results,
+called \emph{value} or
+sometimes also \emph{lexical value}. Values and regular
expressions correspond to each other as illustrated in the following
table:
@@ -466,56 +498,116 @@
\end{tabular}
\end{tabular}
\end{center}
-
\noindent
-We have a formal binary relation for telling whether the structure
-of a regular expression agrees with the value.
+A value has an underlying string, which
+can be calculated by the ``flatten" function $|\_|$:
+\begin{center}
+ \begin{tabular}{lcl}
+ $|\Empty|$ & $\dn$ & $[]$\\
+ $|\Char \; c|$ & $ \dn$ & $ [c]$\\
+ $|\Seq(v_1, v_2)|$ & $ \dn$ & $ v_1| @ |v_2|$\\
+ $|\Left(v)|$ & $ \dn$ & $ |v|$\\
+ $|\Right(v)|$ & $ \dn$ & $ |v|$\\
+ $|\Stars([])|$ & $\dn$ & $[]$\\
+ $|\Stars(v::vs)|$ & $\dn$ & $ |v| @ |\Stars(vs)|$
+ \end{tabular}
+\end{center}
+Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
+to indicate that a value $v$ could be generated from a lexing algorithm
+with input $r$. They call it the value inhabitation relation.
\begin{mathpar}
-\inferrule{}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
-\inferrule{}{\vdash \Empty : \ONE} \hspace{2em}
-\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
-\end{mathpar}
+ \inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}
+
+ \inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}
+
+\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}
+
+\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}
+
+\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}
-Building on top of Sulzmann and Lu's attempt to formalise the
-notion of POSIX lexing rules \parencite{Sulzmann2014},
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
-POSIX matching as a ternary relation recursively defined in a
-natural deduction style.
-The formal definition of a $\POSIX$ value $v$ for a regular expression
+\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars(vs):r^*}
+\end{mathpar}
+\noindent
+The condition $|v| \neq []$ in the premise of star's rule
+is to make sure that for a given pair of regular
+expression $r$ and string $s$, the number of values
+satisfying $|v| = s$ and $\vdash v:r$ is finite.
+Given the same string and regular expression, there can be
+multiple values for it. For example, both
+$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
+$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
+and the values both flatten to $abc$.
+Lexers therefore have to disambiguate and choose only
+one of the values to output. $\POSIX$ is one of the
+disambiguation strategies that is widely adopted.
+
+Ausaf and Urban\parencite{AusafDyckhoffUrban2016}
+formalised the property
+as a ternary relation.
+The $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
-in the following set of rules:
-\ChristianComment{Will complete later}
-\newcommand*{\inference}[3][t]{%
- \begingroup
- \def\and{\\}%
- \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
- #2 \\
- \hline
- #3
- \end{tabular}%
- \endgroup
-}
-\begin{center}
-\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3 s_4 s_5. s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$ }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
-\end{center}
+in the following set of rules\footnote{The names of the rules are used
+as they were originally given in \cite{AusafDyckhoffUrban2016}}:
+\noindent
+\begin{figure}
+\begin{mathpar}
+ \inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
+
+ \inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
+
+ \inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
+
+ \inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
+
+ \inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
+ \nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
+ s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
+ \Seq \; v_1 \; v_2}
+
+ \inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
+
+ \inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
+ |v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
+ s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
+ (v::vs)}
+\end{mathpar}
+\caption{POSIX Lexing Rules}
+\end{figure}
\noindent
-The above $\POSIX$ rules could be explained intuitionally as
+The above $\POSIX$ rules follows the intuition described below:
\begin{itemize}
-\item
-match the leftmost regular expression when multiple options of matching
-are available
-\item
-always match a subpart as much as possible before proceeding
-to the next token.
+ \item (Left Priority)\\
+ Match the leftmost regular expression when multiple options of matching
+ are available.
+ \item (Maximum munch)\\
+ Always match a subpart as much as possible before proceeding
+ to the next token.
\end{itemize}
-
-The reason why we are interested in $\POSIX$ values is that they can
-be practically used in the lexing phase of a compiler front end.
+\noindent
+These disambiguation strategies can be
+quite practical.
For instance, when lexing a code snippet
-$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
-as an identifier rather than a keyword.
-\ChristianComment{Do I also introduce lexical values $LV$ here?}
-We know that $\POSIX$ values are also part of the normal values:
+\[
+ \textit{iffoo} = 3
+\]
+using the regular expression (with
+keyword and identifier having their
+usualy definitions on any formal
+language textbook, for instance
+keyword is a nonempty string starting with letters
+followed by alphanumeric characters or underscores):
+\[
+ \textit{keyword} + \textit{identifier},
+\]
+we want $\textit{iffoo}$ to be recognized
+as an identifier rather than a keyword (if)
+followed by
+an identifier (foo).
+POSIX lexing achieves this.
+
+We know that a $\POSIX$ value is also a normal underlying
+value:
\begin{lemma}
$(r, s) \rightarrow v \implies \vdash v: r$
\end{lemma}
@@ -527,10 +619,13 @@
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
-By induction on $s$, $r$ and $v_1$. The induction principle is
-the \POSIX rules. Each case is proven by a combination of
-the induction rules for $\POSIX$ values and the inductive hypothesis.
-Probably the most cumbersome cases are the sequence and star with non-empty iterations.
+By induction on $s$, $r$ and $v_1$. The inductive cases
+are all the POSIX rules.
+Each case is proven by a combination of
+the induction rules for $\POSIX$
+values and the inductive hypothesis.
+Probably the most cumbersome cases are
+the sequence and star with non-empty iterations.
We give the reasoning about the sequence case as follows:
When we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$,
@@ -544,15 +639,14 @@
which means this "different" $\POSIX$ value $\Seq(v_1', v_2')$
is the same as $\Seq(v_1, v_2)$.
\end{proof}
-Now we know what a $\POSIX$ value is, the problem is how do we achieve
+Now we know what a $\POSIX$ value is; the problem is how do we achieve
such a value in a lexing algorithm, using derivatives?
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
-derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value
-is generated if the regular expression matches the string.
+derivatives---see \ref{graph:successive_ders}). This second phase generates a POSIX value if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
@@ -570,13 +664,13 @@
\begin{center}
\begin{tabular}{lcl}
- $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
- $\mkeps(r_{1}+r_{2})$ & $\dn$
- & \textit{if} $\nullable(r_{1})$\\
- & & \textit{then} $\Left(\mkeps(r_{1}))$\\
- & & \textit{else} $\Right(\mkeps(r_{2}))$\\
- $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
- $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
+ $\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
+ $\mkeps \; r_{1}+r_{2}$ & $\dn$
+ & \textit{if} ($\nullable \; r_{1})$\\
+ & & \textit{then} $\Left (\mkeps \; r_{1})$\\
+ & & \textit{else} $\Right(\mkeps \; r_{2})$\\
+ $\mkeps \; r_1\cdot r_2$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
+ $\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
\end{tabular}
\end{center}
@@ -669,9 +763,10 @@
The central property of the $\lexer$ is that it gives the correct result by
$\POSIX$ standards:
\begin{theorem}
- \begin{tabular}{l}
- $\lexer \; r \; s = \Some(v) \Longleftrightarrow (r, \; s) \rightarrow v$\\
- $\lexer \;r \; s = \None \Longleftrightarrow \neg(\exists v. (r, s) \rightarrow v)$
+ The $\lexer$ based on derivatives and injections is both sound and complete:
+ \begin{tabular}{lcl}
+ $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
+ $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
\end{tabular}
\end{theorem}
@@ -776,7 +871,7 @@
of a match. This means Sulzmann and Lu's injection-based algorithm
exponential by nature.
Somehow one has to make sure which
- lexical values are $\POSIX$ and need to be kept in a lexing algorithm.
+ lexical values are $\POSIX$ and must be kept in a lexing algorithm.
For example, the above $r= (a^*\cdot a^*)^*$ and
@@ -788,7 +883,7 @@
Can we not create those intermediate values $v_1,\ldots v_n$,
and get the lexing information that should be already there while
doing derivatives in one pass, without a second injection phase?
-In the meantime, can we make sure that simplifications
+In the meantime, can we ensure that simplifications
are easily handled without breaking the correctness of the algorithm?
Sulzmann and Lu solved this problem by