% Chapter Template
\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we
% used for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed.
In this chapter, we define the basic notions
for regular languages and regular expressions.
This is essentially a description in ``English''
the functions and datatypes of our formalisation in Isabelle/HOL.
We also define what $\POSIX$ lexing means,
followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
that produces the output conforming
to the $\POSIX$ standard\footnote{In what follows
we choose to use the Isabelle-style notation
for function applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g. $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.}.
\section{Basic Concepts}
Formal language theory usually starts with an alphabet
denoting a set of characters.
Here we use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
In what follows, we shall leave the information about the alphabet
implicit.
Then using the usual bracket notation for lists,
we can define strings made up of characters as:
\begin{center}
\begin{tabular}{lcl}
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
\end{tabular}
\end{center}
where $c$ is a variable ranging over characters.
The $::$ stands for list cons and $[]$ for the empty
list.
For brevity, a singleton list is sometimes written as $[c]$.
Strings can be concatenated to form longer strings in the same
way we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A @ A^n$
\end{tabular}
\end{center}
The union of all powers of a language
can be used to define the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}
\noindent
However, to obtain a more convenient induction principle
in Isabelle/HOL,
we instead define the Kleene star
as an inductive set:
\begin{center}
\begin{mathpar}
\inferrule{\mbox{}}{[] \in A*\\}
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}
\noindent
We also define an operation of "chopping off" a character from
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string
from all strings within a set $A$,
namely:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient $A \backslash L$ of $A$ against
the singleton language with $L = \{s\}$.
However, for our purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.
The reason for defining derivatives
is that they provide another approach
to test membership of a string in
a set of strings.
For example, to test whether the string
$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
respect to the string $bar$:
\begin{center}
\begin{tabular}{lll}
$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
$\{ar, rak\}$ \\
& $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
& $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
%& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
\end{tabular}
\end{center}
\noindent
and in the end, test whether the set
contains the empty string.\footnote{We use the infix notation $A\backslash c$
instead of $\Der \; c \; A$ for brevity, as it will always be
clear from the context that we are operating
on languages rather than regular expressions.}
In general, if we have a language $S$,
then we can test whether $s$ is in $S$
by testing whether $[] \in S \backslash s$.
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
we have a few properties of how the language derivative can be defined using
sub-languages.
For example, for the sequence operator, we have
something similar to a ``chain rule'':
\begin{lemma}
\[
\Der \; c \; (A @ B) =
\begin{cases}
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
(\Der \; c \; A) \, @ \, B, & \text{otherwise}
\end{cases}
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
and get to $B$.
The language derivative for $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
\end{lemma}
\begin{proof}
There are two inclusions to prove:
\begin{itemize}
\item{$\subseteq$}:\\
The set
\[ S_1 = \{s \mid c :: s \in A*\} \]
is enclosed in the set
\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
This is because for any string $c::s$ satisfying $c::s \in A*$,
%whenever you have a string starting with a character
%in the language of a Kleene star $A*$,
%then that
the character $c$, together with a prefix of $s$
%immediately after $c$
forms the first iteration of $A*$,
and the rest of the $s$ is also $A*$.
This coincides with the definition of $S_2$.
\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
holds.
Also the following holds:
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where the $\textit{RHS}$ can be rewritten
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
which of course contains $\Der \; c \; A @ A*$.
\end{itemize}
\end{proof}
\noindent
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
for regular expressions.
To introduce them, we need to first give definitions for regular expressions,
which we shall do next.
\subsection{Regular Expressions and Their Meaning}
The \emph{basic regular expressions} are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]
\noindent
We call them basic because we will introduce
additional constructors in later chapters, such as negation
and bounded repetitions.
We use $\ZERO$ for the regular expression that
matches no string, and $\ONE$ for the regular
expression that matches only the empty string.\footnote{
Some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
but we prefer this notation.}
The sequence regular expression is written $r_1\cdot r_2$
and sometimes we omit the dot if it is clear which
regular expression is meant; the alternative
is written $r_1 + r_2$.
The \emph{language} or meaning of
a regular expression is defined recursively as
a set of strings:
%TODO: FILL in the other defs
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\phi$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
\noindent
%Now with language derivatives of a language and regular expressions and
%their language interpretations in place, we are ready to define derivatives on regular expressions.
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
We do so by first introducing what properties they should satisfy.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
%Recall, the language derivative acts on a set of strings
%and essentially chops off a particular character from
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
%so that after derivative $L(r\backslash c)$
%will look as if it was obtained by doing a language derivative on $L(r)$:
%Recall that the language derivative acts on a
%language (set of strings).
%One can decide whether a string $s$ belongs
%to a language $S$ by taking derivative with respect to
%that string and then checking whether the empty
%string is in the derivative:
%\begin{center}
%\parskip \baselineskip
%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
%\def\rlwd{.5pt}
%\newcommand\notate[3]{%
% \unskip\def\useanchorwidth{T}%
% \setbox0=\hbox{#1}%
% \def\stackalignment{c}\stackunder[-6pt]{%
% \def\stackalignment{c}\stackunder[-1.5pt]{%
% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
% \rule{\rlwd}{#2\baselineskip}}}{%
% \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
%}
%\Longstack{
%\notate{$\{ \ldots ,\;$
% \notate{s}{1}{$(c_1 :: s_1)$}
% $, \; \ldots \}$
%}{1}{$S_{start}$}
%}
%\Longstack{
% $\stackrel{\backslash c_1}{\longrightarrow}$
%}
%\Longstack{
% $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
% $,\; \ldots \}$
%}
%\Longstack{
% $\stackrel{\backslash c_2}{\longrightarrow}$
%}
%\Longstack{
% $\{ \ldots,\; s_2
% ,\; \ldots \}$
%}
%\Longstack{
% $ \xdashrightarrow{\backslash c_3\ldots\ldots} $
%}
%\Longstack{
% \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
% S_{start}\backslash s$}
%}
%\end{center}
%\begin{center}
% $s \in S_{start} \iff [] \in S_{end}$
%\end{center}
%\noindent
Brzozowski noticed that $\Der$
can be ``mirrored'' on regular expressions which
he calls the derivative of a regular expression $r$
with respect to a character $c$, written
$r \backslash c$. This infix operator
takes regular expression $r$ as input
and a character as a right operand.
The derivative operation on regular expression
is defined such that the language of the derivative result
coincides with the language of the original
regular expression being taken
derivative with respect to the same characters, namely
\begin{property}
\[
L \; (r \backslash c) = \Der \; c \; (L \; r)
\]
\end{property}
\noindent
Pictorially, this looks as follows:\\
\vspace{3mm}
\parskip \baselineskip
\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
\def\rlwd{.5pt}
\newcommand\notate[3]{%
\unskip\def\useanchorwidth{T}%
\setbox0=\hbox{#1}%
\def\stackalignment{c}\stackunder[-6pt]{%
\def\stackalignment{c}\stackunder[-1.5pt]{%
\stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
\rule{\rlwd}{#2\baselineskip}}}{%
\strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
}
\Longstack{
\notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1,
\;\ldots\}$}
}
\Longstack{
$\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$
}
\Longstack{
\notate{$r\backslash c$}{1}{$L \; (r\backslash c)=
\{\ldots,\;s_1,\;\ldots\}$}
}\\
\vspace{ 3mm }
The derivatives on regular expression can again be
generalised to strings.
One could compute $r_{start} \backslash s$ and test membership of $s$
in $L \; r_{start}$ by checking
whether the empty string is in the language of
$r_{end}$ (that is $r_{start}\backslash s$).\\
\vspace{2mm}
\Longstack{
\notate{$r_{start}$}{4}{
\Longstack{$L \; r_{start} = \{\ldots, \;$
\notate{$s$}{1}{$c_1::s_1$}
$, \ldots\} $
}
}
}
\Longstack{
$\stackrel{\backslash c_1}{ \xrightarrow{\hspace*{1.8cm}} }$
}
\Longstack{
\notate{$r_1$}{3}{
$r_1 = r_{start}\backslash c_1$,
$L \; r_1 = $
\Longstack{
$\{ \ldots,\;$ \notate{$s_1$}{1}{$c_2::s_2$}
$,\; \ldots \}$
}
}
}
\Longstack{
$\stackrel{\backslash c_2}{\xrightarrow{\hspace*{1.8cm}}}$
}
\Longstack{
$r_2$
}
\Longstack{
$ \xdashrightarrow{\hspace*{0.3cm} \backslash c_3 \ldots \ldots \ldots \hspace*{0.3cm}} $
}
\Longstack{
\notate{$r_{end}$}{1}{
$L \; r_{end} = \{\ldots, \; [], \ldots\}$}
}
We have the property that
\begin{property}
$s \in L \; r_{start} \iff [] \in L \; r_{end}$
\end{property}
\noindent
Next, we give the recursive definition of derivative on
regular expressions so that it satisfies the properties above.
%The derivative function, written $r\backslash c$,
%takes a regular expression $r$ and character $c$, and
%returns a new regular expression representing
%the original regular expression's language $L \; r$
%being taken the language derivative with respect to $c$.
\begin{table}
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\caption{Derivative on Regular Expressions}
\label{table:der}
\end{table}
\noindent
The most involved cases are the sequence case
and the star case.
The sequence case says that if the first regular expression
contains an empty string, then the second component of the sequence
needs to be considered, as its derivative will contribute to the
result of this derivative:
\begin{center}
\begin{tabular}{lcl}
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
$\textit{if}\;\,([] \in L(r_1))\;
\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
\end{tabular}
\end{center}
\noindent
Notice how this closely resembles
the language derivative operation $\Der$:
\begin{center}
\begin{tabular}{lcl}
$\Der \; c \; (A @ B)$ & $\dn$ &
$ \textit{if} \;\, [] \in A \;
\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
\Der \; c\; B$\\
& & $\textit{else}\; (\Der \; c \; A) @ B$\\
\end{tabular}
\end{center}
\noindent
The derivative of the star regular expression $r^*$
unwraps one iteration of $r$, turns it into $r\backslash c$,
and attaches the original $r^*$
after $r\backslash c$, so that
we can further unfold it as many times as needed:
\[
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
Again,
the structure is the same as the language derivative of the Kleene star:
\[
\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
\]
In the above definition of $(r_1\cdot r_2) \backslash c$,
the $\textit{if}$ clause's
boolean condition
$[] \in L(r_1)$ needs to be
somehow recursively computed.
We call such a function that checks
whether the empty string $[]$ is
in the language of a regular expression $\nullable$:
\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The $\ZERO$ regular expression
does not contain any string and
therefore is not \emph{nullable}.
$\ONE$ is \emph{nullable}
by definition.
The character regular expression $c$
corresponds to the singleton set $\{c\}$,
and therefore does not contain the empty string.
The alternative regular expression is nullable
if at least one of its children is nullable.
The sequence regular expression
would require both children to have the empty string
to compose an empty string, and the Kleene star
is always nullable because it naturally
contains the empty string.
\noindent
We have the following two correspondences between
derivatives on regular expressions and
derivatives on a set of strings:
\begin{lemma}\label{derDer}
\mbox{}
\begin{itemize}
\item
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\item
$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
\end{itemize}
\end{lemma}
\begin{proof}
By induction on $r$.
\end{proof}
\noindent
which are the main properties of derivatives
that enables us later to reason about the correctness of
derivative-based matching.
We can generalise the derivative operation shown above for single characters
to strings as follows:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
When there is no ambiguity, we will
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
Brzozowski's derivative-based
regular-expression matching algorithm can then be described as:
\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
\end{definition}
\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
this algorithm, presented graphically, is as follows:
\begin{equation}\label{matcher}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] &
r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] &
\;\textrm{true}/\textrm{false}
\end{tikzcd}
\end{equation}
\noindent
It is relatively
easy to show that this matcher is correct, namely
\begin{lemma}
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
By induction on $s$ using the property of derivatives:
lemma \ref{derDer}.
\end{proof}
\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
ylabel={time in secs},
%ymode = log,
legend entries={Naive Matcher},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
$ using Brzozowski's original algorithm}\label{NaiveMatcher}
\end{center}
\end{figure}
\noindent
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow, as shown in
\ref{NaiveMatcher}.
Note that both axes are in logarithmic scale.
Around two dozen characters
this algorithm already ``explodes'' with the regular expression
$(a^*)^*b$.
To improve this situation, we need to introduce simplification
rules for the intermediate results,
such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
and make sure those rules do not change the
language of the regular expression.
One simple-minded simplification function
that achieves these requirements
is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
\begin{center}
\begin{tabular}{lcl}
$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
\end{tabular}
\end{center}
If we repeatedly apply this simplification
function during the matching algorithm,
we have a matcher with simplification:
\begin{center}
\begin{tabular}{lcl}
$\derssimp \; [] \; r$ & $\dn$ & $r$\\
$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
\end{tabular}
\end{center}
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
ylabel={time in secs},
%ymode = log,
%xmode = log,
grid = both,
legend entries={Matcher With Simp},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{$(a^*)^*b$
against
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
\end{figure}
\noindent
The running time of $\textit{ders}\_\textit{simp}$
on the same example of Figure \ref{NaiveMatcher}
is now ``tame'' in terms of the length of inputs,
as shown in Figure \ref{BetterMatcher}.
So far, the story is use Brzozowski derivatives and
simplify as much as possible, and at the end test
whether the empty string is recognised
by the final derivative.
But what if we want to
do lexing instead of just getting a true/false answer?
Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the definition of the
Brzozowski derivative) solution for this.
\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing
algorithm.
The first phase takes successive derivatives with
respect to the input string,
and the second phase does the reverse, \emph{injecting} back
characters, in the meantime constructing a lexing result.
We will introduce the injection phase in detail slightly
later, but as a preliminary we have to first define
the datatype for lexing results,
called \emph{value} or
sometimes also \emph{lexical value}.
Values and regular
expressions correspond to each other
as illustrated in the following
table:
\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char \; c$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left \;v$ \\
& $\mid$ & $\Right\;v$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}
\noindent
A value has an underlying string, which
can be calculated by the ``flatten" function $|\_|$:
\begin{center}
\begin{tabular}{lcl}
$|\Empty|$ & $\dn$ & $[]$\\
$|\Char \; c|$ & $ \dn$ & $ [c]$\\
$|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\
$|\Left \; v|$ & $ \dn$ & $ |v|$\\
$|\Right \; v|$ & $ \dn$ & $ |v|$\\
$|\Stars \; []|$ & $\dn$ & $[]$\\
$|\Stars \; v::vs|$ & $\dn$ & $ |v| @ |\Stars(vs)|$
\end{tabular}
\end{center}
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
to indicate that a value $v$ could be generated from a lexing algorithm
with input $r$. They call it the value inhabitation relation,
defined by the rules.
\begin{figure}[H]
\begin{mathpar}
\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars \; vs : r^*}
\end{mathpar}
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
\end{figure}
\noindent
The condition $|v| \neq []$ in the premise of star's rule
is to make sure that for a given pair of regular
expression $r$ and string $s$, the number of values
satisfying $|v| = s$ and $\vdash v:r$ is finite.
This additional condition was
imposed by Ausaf and Urban to make their proofs easier.
Given a string and a regular expression, there can be
multiple values for it. For example, both
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
and the values both flatten to $abc$.
Lexers therefore have to disambiguate and choose only
one of the values to be generated. $\POSIX$ is one of the
disambiguation strategies that is widely adopted.
Ausaf et al. \cite{AusafDyckhoffUrban2016}
formalised the property
as a ternary relation.
The $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
in the following rules\footnote{The names of the rules are used
as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
\begin{figure}[p]
\begin{mathpar}
\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
\Seq \; v_1 \; v_2}
\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
(v::vs)}
\end{mathpar}
\caption{The inductive POSIX rules given by Ausaf et al.
\cite{AusafDyckhoffUrban2016}.
This ternary relation, written $(s, r) \rightarrow v$,
formalises the POSIX constraints on the
value $v$ given a string $s$ and
regular expression $r$.
}
\end{figure}\afterpage{\clearpage}
\noindent
%\begin{figure}
%\begin{tikzpicture}[]
% \node [minimum width = 6cm, rectangle split, rectangle split horizontal,
% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
% (node1)
% {$r_{token1}$
% \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ };
% %\node [left = 6.0cm of node1] (start1) {hi};
% \node [left = 0.2cm of node1] (middle) {$v.s.$};
% \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
% (node2)
% {$\quad\;\;\;r_{token1}\quad\;\;\;$
% \nodepart{two} $r_{token2}$ };
% \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
% \node [above = 1.5cm of middle, minimum width = 6cm,
% rectangle, style={draw, rounded corners, inner sep=10pt}]
% (topNode) {$s$};
% \path[->,draw]
% (topNode) edge node {split $A$} (node2)
% (topNode) edge node {split $B$} (node1)
% ;
%
%
%\end{tikzpicture}
%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
%\end{figure}
The above $\POSIX$ rules follow the intuition described below:
\begin{itemize}
\item (Left Priority)\\
Match the leftmost regular expression when multiple options for matching
are available. See P+L and P+R where in P+R $s$ cannot
be in the language of $L \; r_1$.
\item (Maximum munch)\\
Always match a subpart as much as possible before proceeding
to the next part of the string.
For example, when the string $s$ matches
$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
Then the split that matches a longer string for the first part
$r_{part1}$ is preferred by this maximum munch rule.
The side-condition
\begin{center}
$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
\end{center}
in PS causes this.
%(See
%\ref{munch} for an illustration).
\end{itemize}
\noindent
These disambiguation strategies can be
quite practical.
For instance, when lexing a code snippet
\[
\textit{iffoo} = 3
\]
using a regular expression
for keywords and
identifiers:
%(for example, keyword is a nonempty string starting with letters
%followed by alphanumeric characters or underscores):
\[
r_{keyword} + r_{identifier}.
\]
If we want $\textit{iffoo}$ to be recognized
as an identifier
where identifiers are defined as usual (letters
followed by letters, numbers or underscores),
then a match with a keyword (if)
followed by
an identifier (foo) would be incorrect.
POSIX lexing generates what is included by lexing.
\noindent
We know that a POSIX
value for regular expression $r$ is inhabited by $r$.
\begin{lemma}
$(r, s) \rightarrow v \implies \vdash v: r$
\end{lemma}
\noindent
The main property about a $\POSIX$ value is that
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for it:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
By induction on $s$, $r$ and $v_1$. The inductive cases
are all the POSIX rules.
Probably the most cumbersome cases are
the sequence and star with non-empty iterations.
We shall give the details for proving the sequence case here.
When we have
\[
(s_1, r_1) \rightarrow v_1 \;\, and \;\,
(s_2, r_2) \rightarrow v_2 \;\, and \;\,
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
\]
we know that the last condition
excludes the possibility of a
string $s_1'$ longer than $s_1$ such that
\[
(s_1', r_1) \rightarrow v_1' \;\;
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s
\]
hold.
A shorter string $s_1''$ with $s_2''$ satisfying
\[
(s_1'', r_1) \rightarrow v_1''
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
\]
cannot possibly form a $\POSIX$ value either, because
by definition, there is a candidate
with a longer initial string
$s_1$. Therefore, we know that the POSIX
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
$s$ must have the
property that
\[
|a| = s_1 \;\; and \;\; |b| = s_2.
\]
The goal is to prove that $a = v_1 $ and $b = v_2$.
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
is the same as $\Seq(v_1, v_2)$.
\end{proof}
\noindent
We have now defined what a POSIX value is and shown that it is unique.
The problem is to generate
such a value in a lexing algorithm using derivatives.
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
Sulzmann and Lu extended Brzozowski's
derivative-based matching
to a lexing algorithm by a second phase
after the initial phase of successive derivatives.
This second phase generates a POSIX value
if the regular expression matches the string.
The algorithm uses two functions called $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a POSIX value from the last
derivative $r_n$:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}
\noindent
In the above diagram, again we assume that
the input string $s$ is made of $n$ characters
$c_0c_1 \ldots c_{n-1}$
The last derivative operation
$\backslash c_{n-1}$ generates the derivative $r_n$, for which
$\mkeps$ produces the value $v_n$. This value
tells us how the empty string is matched by the (nullable)
regular expression $r_n$, in a POSIX way.
The definition of $\mkeps$ is
\begin{center}
\begin{tabular}{lcl}
$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
$\mkeps \; (r_{1}+r_{2})$ & $\dn$
& $\textit{if}\; (\nullable \; r_{1}) \;\,
\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\
& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
\end{tabular}
\end{center}
\noindent
The function prefers the left child $r_1$ of $r_1 + r_2$
to match an empty string if there is a choice.
When there is a star to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iteration is taken.
The result of $\mkeps$ on a $\nullable$ $r$
is a POSIX value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
\end{lemma}
\begin{proof}
By induction on $r$.
\end{proof}
\noindent
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
in reverse order as they were chopped off in the derivative phase.
The function for this is called $\inj$. This function
operates on values, unlike $\backslash$ which operates on regular expressions.
In the diagram below, $v_i$ stands for the (POSIX) value
for how the regular expression
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$.
\begin{figure}[H]
\begin{center}
\begin{ceqn}
\begin{tikzcd}
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{ceqn}
\end{center}
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
The first phase involves taking successive derivatives w.r.t the characters $c_0$,
$c_1$, and so on. These are the same operations as they have appeared in the matcher
\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
the empty string, by always selecting the leftmost
nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they
appeared in the string, always preserving POSIXness.}\label{graph:inj}
\end{figure}
\noindent
The function $\textit{inj}$ as defined by Sulzmann and Lu
takes three arguments: a regular
expression ${r_{i}}$, before the character is chopped off,
a character ${c_{i}}$ (the character we want to inject back) and
the third argument $v_{i+1}$ the value we want to inject into.
The result of an application
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
\[
(s_i, r_i) \rightarrow v_i
\]
holds.
The definition of $\textit{inj}$ is as follows:
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
$\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\
$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ &
$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ &
$\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
\end{tabular}
\end{center}
\noindent
The function recurses on
the shape of regular
expressions and values.
Intuitively, each clause analyses
how $r_i$ could have transformed when being
derived by $c$, identifying which subpart
of $v_{i+1}$ has the ``hole''
to inject the character back into.
Once the character is
injected back to that sub-value;
$\inj$ assembles all parts
to form a new value.
For instance, the last clause is an
injection into a sequence value $v_{i+1}$
whose second child
value is a star and the shape of the
regular expression $r_i$ before injection
is a star.
We therefore know
the derivative
starts on a star and ends as a sequence:
\[
(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*
\]
during which an iteration of the star
had just been unfolded, giving the below
value inhabitation relation:
\[
\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
\]
The value list $vs$ corresponds to
matched star iterations,
and the ``hole'' lies in $v$ because
\[
\vdash v: r\backslash c.
\]
Finally,
$\inj \; r \;c \; v$ is prepended
to the previous list of iterations and then
wrapped under the $\Stars$
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
Recall that lemma
\ref{mePosix} tells us that
$\mkeps$ always generates the POSIX value.
The function $\inj$ preserves the POSIXness, provided
the value before injection is POSIX, namely
\begin{lemma}\label{injPosix}
If$(r \backslash c, s) \rightarrow v $,
then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
\end{lemma}
\begin{proof}
By induction on $r$.
The non-trivial cases are sequence and star.
When $r = a \cdot b$, there can be
three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
case.
\begin{itemize}
\item
$v = \Seq \; v_a \; v_b$.\\
The ``not nullable'' clause of the $\inj$ function is taken:
\begin{center}
\begin{tabular}{lcl}
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
\end{tabular}
\end{center}
We know that there exists a unique pair of
$s_a$ and $s_b$ satisfying
$(a \backslash c, s_a) \rightarrow v_a$,
$(b , s_b) \rightarrow v_b$, and
$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
L \; (a\backslash c) \land
s_4 \in L \; b$.
The last condition gives us
$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in
L \; a \land
s_4 \in L \; b$.
By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
and this gives us
\[
(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
\]
\item
$v = \Left \; (\Seq \; v_a \; v_b)$\\
The argument is almost identical to the above case,
except that a different clause of $\inj$ is taken:
\begin{center}
\begin{tabular}{lcl}
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
\end{tabular}
\end{center}
With similar reasoning,
\[
(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
\]
again holds.
\item
$v = \Right \; v_b$\\
Again the injection result would be
\begin{center}
\begin{tabular}{lcl}
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
\end{tabular}
\end{center}
We know that $a$ must be nullable,
allowing us to call $\mkeps$ and get
\[
(a, []) \rightarrow \mkeps \; a.
\]
Also, by inductive hypothesis
\[
(b, c::s) \rightarrow \inj\; b \; c \; v_b
\]
holds.
In addition, as
$\Right \;v_b$ instead of $\Left \ldots$ is
the POSIX value for $v$, it must be the case
that $s \notin L \;( (a\backslash c)\cdot b)$.
This tells us that
\[
\nexists s_3 \; s_4.
s_3 @s_4 = s \land s_3 \in L \; (a\backslash c)
\land s_4 \in L \; b
\]
which translates to
\[
\nexists s_3 \; s_4. \; s_3 \neq [] \land
s_3 @s_4 = c::s \land s_3 \in L \; a
\land s_4 \in L \; b.
\]
(Which says there cannot be a longer
initial split for $s$ other than the empty string.)
Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
as the POSIX value for $a\cdot b$.
\end{itemize}
The star case can be proven similarly.
\end{proof}
\noindent
Putting all together, Sulzmann and Lu obtained the following algorithm
outlined in the diagram \ref{graph:inj}:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
& & $\quad \phantom{\mid}\; \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
\noindent
The central property of the $\lexer$ is that it gives the correct result
according to
POSIX rules.
\begin{theorem}\label{lexerCorrectness}
The $\lexer$ based on derivatives and injections is correct:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
\end{tabular}
\end{center}
\end{theorem}
\begin{proof}
By induction on $s$. $r$ generalising over an arbitrary regular expression.
The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
by lemma \ref{injPosix}.
\end{proof}
\noindent
As we did earlier in this chapter with the matcher, one can
introduce simplification on the regular expression in each derivative step.
However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
and ensure that
the values align with the regular expression at each step.
Therefore one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of
the regular expressions and values being aligned.
This can be achieved by recording some extra rectification functions
during the derivatives step and applying these rectifications in
each run during the injection phase.
With extra care
one can show that POSIXness will not be affected
by the simplifications listed here \cite{AusafDyckhoffUrban2016}.
\begin{center}
\begin{tabular}{lcl}
$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
\end{tabular}
\end{center}
However, one can still end up
with exploding derivatives,
even with the simple-minded simplification rules allowed
in an injection-based lexer.
\section{A Case Requiring More Aggressive Simplifications}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building just over
a dozen successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with millions of nodes (when viewed as a tree)
even with the mentioned simplifications.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
ylabel={size},
legend entries={Simple-Minded Simp, Naive Matcher},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
\end{figure}\label{fig:BetterWaterloo}
That is because Sulzmann and Lu's
injection-based lexing algorithm keeps a lot of
"useless" values that will not be used.
These different ways of matching will grow exponentially with the string length.
Consider the case
\[
r= (a^*\cdot a^*)^* \quad and \quad
s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
\]
as an example.
This is a highly ambiguous regular expression, with
many ways to split up the string into multiple segments for
different star iterations,
and for each segment
multiple ways of splitting between
the two $a^*$ sub-expressions.
When $n$ is equal to $1$, there are two lexical values for
the match:
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
\]
and
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
\]
The derivative of $\derssimp \;s \; r$ is
\[
(a^*a^* + a^*)\cdot(a^*a^*)^*.
\]
The $a^*a^*$ and $a^*$ in the first child of the above sequence
correspond to value 1 and value 2, respectively.
When $n=2$, the number goes up to 7:
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
\]
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
\]
\[
\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
]
\]
\[
\Stars \; [
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
]
\]
and
\[
\Stars \; [
\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
]
\]
And $\derssimp \; aa \; (a^*a^*)^*$ is
\[
((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* +
(a^*a^* + a^*)\cdot(a^*a^*)^*.
\]
which removes two out of the seven terms corresponding to the
seven distinct lexical values.
It is not surprising that there are exponentially many
distinct lexical values that cannot be eliminated by
the simple-minded simplification of $\derssimp$.
A lexer without a good enough strategy to
deduplicate will naturally
have an exponential runtime on highly
ambiguous regular expressions because there
are exponentially many matches.
For this particular example, it seems
that the number of distinct matches growth
speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
On the other hand, the
$\POSIX$ value for $r= (a^*\cdot a^*)^*$ and
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
\[
\Stars\,
[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
\]
At any moment, the subterms in a regular expression
that will potentially result in a POSIX value is only
a minority among the many other terms,
and one can remove the ones that are not possible to
be POSIX.
In the above example,
\begin{equation}\label{eqn:growth2}
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* +
\underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.
\end{equation}
can be further simplified by
removing the underlined term first,
which would open up possibilities
of further simplification that removes the
underbraced part.
The result would be
\[
(\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.
\]
with corresponding values
\begin{center}
\begin{tabular}{lr}
$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\
$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
\end{tabular}
\end{center}
Other terms with an underlying value, such as
\[
\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
\]
do not to contribute a POSIX lexical value,
and therefore can be thrown away.
Ausaf et al. \cite{AusafDyckhoffUrban2016}
have come up with some simplification steps,
and incorporated the simplification into $\lexer$.
They call this lexer $\textit{lexer}_{simp}$ and proved that
\[
\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
\]
The function $\textit{lexer}_{simp}$
involves some fiddly manipulation of value rectification,
which we omit here.
however those steps
are not yet sufficiently strong, to achieve the above effects.
And even with these relatively mild simplifications, the proof
is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
One would need to prove something like this:
\[
\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\;
\textit{then}\;\; (r, c::s) \rightarrow
\inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
\]
instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
not only has to return a simplified regular expression,
but also what specific simplifications
have been done as a function on values
showing how one can transform the value
underlying the simplified regular expression
to the unsimplified one.
We therefore choose a slightly different approach
also described by Sulzmann and Lu to
get better simplifications, which uses
some augmented data structures compared to
plain regular expressions.
We call them \emph{annotated}
regular expressions.
With annotated regular expressions,
we can avoid creating the intermediate values $v_1,\ldots v_n$ and a
second phase altogether.
We introduce this new datatype and the
corresponding algorithm in the next chapter.