lexing: ChengsongTanPhdThesis/Chapters/Inj.tex@28cb8089ec36


% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we 
% used for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant and discuss
%why more aggressive simplifications are needed. 

In this chapter, we define the basic notions 
for regular languages and regular expressions.
This is essentially a description in ``English"
of our formalisation in Isabelle/HOL.
We also give the definition of what $\POSIX$ lexing means, 
followed by an algorithm by Sulzmanna and Lu\parencite{Sulzmann2014} 
that produces the output conforming
to the $\POSIX$ standard.
It is also worth mentioning that
we choose to use the ML-style notation
for function applications, where
the parameters of a function is not enclosed
inside a pair of parentheses (e.g. $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.

\section{Basic Concepts}
Usually, formal language theory starts with an alphabet 
denoting a set of characters.
Here we just use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
In what follows, we shall leave the information about the alphabet
implicit.
Then using the usual bracket notation for lists,
we can define strings made up of characters: 
\begin{center}
\begin{tabular}{lcl}
$\textit{s}$ & $\dn$ & $[] \; |\; c  :: s$
\end{tabular}
\end{center}
Where $c$ is a variable ranging over characters.
Strings can be concatenated to form longer strings in the same
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise 
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
The power of a language is defined recursively, using the 
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A @ A^n$
\end{tabular}
\end{center}
The union of all powers of a language   
can be used to define the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}

\noindent
However, to obtain a more convenient induction principle 
in Isabelle/HOL, 
we instead define the Kleene star
as an inductive set: 

\begin{center}
\begin{mathpar}
	\inferrule{\mbox{}}{[] \in A*\\}

\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}
\noindent
We also define an operation of "chopping off" a character from
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to "chopping off" a string from all strings within set $A$, 
namely:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient $A \backslash L$ of $A$ against 
the singleton language with $L = \{w\}$
in formal language theory.
However, for the purposes here, the $\textit{Ders}$ definition with 
a single string is sufficient.

With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
we have a  few properties of how the language derivative can be defined using 
sub-languages.
\begin{lemma}
\[
	\Der \; c \; (A @ B) =
	\begin{cases}
	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
	 \end{cases}	
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
and get to $B$.
The language $A*$'s derivative can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
\end{lemma}
\begin{proof}
There are too inclusions to prove:
\begin{itemize}
\item{$\subseteq$}:\\
The set 
\[ \{s \mid c :: s \in A*\} \]
is enclosed in the set
\[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever you have a string starting with a character 
in the language of a Kleene star $A*$, 
then that character together with some sub-string
immediately after it will form the first iteration, 
and the rest of the string will 
be still in $A*$.
\item{$\supseteq$}:\\
Note that
\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
hold.
Also this holds:
\[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where the $\textit{RHS}$ can be rewritten
as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
which of course contains $\Der \; c \; A @ A*$.
\end{itemize}
\end{proof}

\noindent
Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
for regular languages, we need to first give definitions for regular expressions.

\subsection{Regular Expressions and Their Meaning}
The \emph{basic regular expressions} are defined inductively
 by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*         
\]
\noindent
We call them basic because we will introduce
additional constructors in later chapters such as negation
and bounded repetitions.
We use $\ZERO$ for the regular expression that
matches no string, and $\ONE$ for the regular
expression that matches only the empty string\footnote{
some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
but we prefer our notation}. 
The sequence regular expression is written $r_1\cdot r_2$
and sometimes we omit the dot if it is clear which
regular expression is meant; the alternative
is written $r_1 + r_2$.
The \emph{language} or meaning of 
a regular expression is defined recursively as
a set of strings:
%TODO: FILL in the other defs
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\phi$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
$L \; r^*$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
\noindent
Now with semantic derivatives of a language and regular expressions and
their language interpretations in place, we are ready to define derivatives on regexes.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
%Recall, the language derivative acts on a set of strings
%and essentially chops off a particular character from
%all strings in that set, Brzozowski defined a derivative operation on regular expressions
%so that after derivative $L(r\backslash c)$ 
%will look as if it was obtained by doing a language derivative on $L(r)$:
Recall that the semantic derivative acts on a set of 
strings. Brzozowski noticed that this operation
can be ``mirrored" on regular expressions which
he calls the derivative of a regular expression $r$
with respect to a character $c$, written
$r \backslash c$.
He defined this operation such that the following property holds:
\begin{center}

\[
L(r \backslash c) = \Der \; c \; L(r)
\]
\end{center}
\noindent
For example in the sequence case we have 
\begin{center}
	\begin{tabular}{lcl}
		$\Der \; c \; (A @ B)$ & $\dn$ & 
		$ \textit{if} \;\,  [] \in A \; 
		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup 
		\Der \; c\; B$\\
		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
	\end{tabular}
\end{center}
\noindent
This can be translated to  
regular expressions in the following 
manner:
\begin{center}
	\begin{tabular}{lcl}
		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1)) r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
	\end{tabular}
\end{center}

\noindent
And similarly, the Kleene star's semantic derivative
can be expressed as
\[
	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
\]
which translates to
\[
	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
In the above definition of $(r_1\cdot r_2) \backslash c$,
the $\textit{if}$ clause's
boolean condition 
$[] \in L(r_1)$ needs to be 
somehow recursively computed.
We call such a function that checks
whether the empty string $[]$ is 
in the language of a regular expression $\nullable$:
\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
\end{center}
\noindent
The $\ZERO$ regular expression
does not contain any string and
therefore is not \emph{nullable}.
$\ONE$ is \emph{nullable} 
by definition. 
The character regular expression $c$
corresponds to the singleton set $\{c\}$, 
and therefore does not contain the empty string.
The alternative regular expression is nullable
if at least one of its children is nullable.
The sequence regular expression
would require both children to have the empty string
to compose an empty string, and the Kleene star
is always nullable because it naturally
contains the empty string. 
  
The derivative function, written $r\backslash c$, 
defines how a regular expression evolves into
a new one after all the string it contains is acted on: 
if it starts with $c$, then the character is chopped of,
if not, that string is removed.
\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The most involved cases are the sequence case
and the star case.
The sequence case says that if the first regular expression
contains an empty string, then the second component of the sequence
needs to be considered, as its derivative will contribute to the
result of this derivative.
The star regular expression $r^*$'s derivative 
unwraps one iteration of $r$, turns it into $r\backslash c$,
and attaches the original $r^*$
after $r\backslash c$, so that 
we can further unfold it as many times as needed.
We have the following correspondence between 
derivatives on regular expressions and
derivatives on a set of strings:
\begin{lemma}\label{derDer}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent
The main property of the derivative operation
(that enables us to reason about the correctness of
derivative-based matching)
is 

\begin{lemma}\label{derStepwise}
	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
\end{lemma}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity, we will 
omit the subscript and use $\backslash$ instead
of $\backslash_r$ to denote
string derivatives for brevity.
Brzozowski's  regular-expression matcher algorithm can then be described as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
\end{definition}

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
this algorithm presented graphically is as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
 It can  be
relatively  easily shown that this matcher is correct:
\begin{lemma}
	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
	By the stepwise property of derivatives (lemma \ref{derStepwise})
	and lemma \ref{derDer}. 
\end{proof}
\noindent
\begin{center}
	\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    legend entries={Naive Matcher},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture} 
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}
\end{center} 
\noindent
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow, as shown in 
\ref{NaiveMatcher}.
Note that both axes are in logarithmic scale.
Around two dozens characters
would already explode the matcher on regular expression 
$(a^*)^*b$.
For this, we need to introduce certain 
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the 
language of the regular expression.
One simpled-minded simplification function
that achieves these requirements is given below:
\begin{center}
	\begin{tabular}{lcl}
		$\simp \; r_1 \cdot r_2 $ & $ \dn$ & 
		$(\simp \; r_1,  \simp \; r_2) \; \textit{match}$\\
					  & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
					  & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
					  & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
					  & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
					  & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
		$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
				     & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
				     & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
				     & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
		$\simp \; r$ & $\dn$ & $r$
				   
	\end{tabular}
\end{center}
If we repeatedly apply this simplification  
function during the matching algorithm, 
we have a matcher with simplification:
\begin{center}
	\begin{tabular}{lcl}
		$\derssimp \; [] \; r$ & $\dn$ & $r$\\
		$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
		$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
	\end{tabular}
\end{center}
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    xmode = log,
    grid = both,
    legend entries={Matcher With Simp},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture} 
\caption{$(a^*)^*b$ 
against 
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
\end{figure}
\noindent
The running time of $\textit{ders}\_\textit{simp}$
on the same example of \ref{NaiveMatcher}
is now very tame in terms of the length of inputs,
as shown in \ref{BetterMatcher}.

Building derivatives and then testing the existence
of empty string in the resulting regular expression's language,
adding simplifications when necessary.
So far, so good. But what if we want to 
do lexing instead of just getting a YES/NO answer?
Sulzmanna and Lu \cite{Sulzmann2014} first came up with a nice and 
elegant (arguably as beautiful as the definition of the original derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing 
algorithm.
The first phase takes successive derivatives with 
respect to the input string,
and the second phase does the reverse, \emph{injecting} back
characters, in the meantime constructing a lexing result.
We will introduce the injection phase in detail slightly
later, but as a preliminary we have to first define 
the datatype for lexing results, 
called \emph{value} or
sometimes also \emph{lexical value}.  Values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}
\noindent
A value has an underlying string, which 
can be calculated by the ``flatten" function $|\_|$:
\begin{center}
	\begin{tabular}{lcl}
		$|\Empty|$ & $\dn$ &  $[]$\\
		$|\Char \; c|$ & $ \dn$ & $ [c]$\\
		$|\Seq(v_1, v_2)|$ & $ \dn$ & $ v_1| @ |v_2|$\\
		$|\Left(v)|$ & $ \dn$ & $ |v|$\\
		$|\Right(v)|$ & $ \dn$ & $ |v|$\\
		$|\Stars([])|$ & $\dn$ & $[]$\\
		$|\Stars(v::vs)|$ &  $\dn$ & $ |v| @ |\Stars(vs)|$
	\end{tabular}
\end{center}
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
to indicate that a value $v$ could be generated from a lexing algorithm
with input $r$. They call it the value inhabitation relation. 
\begin{mathpar}
	\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}

	\inferrule{\mbox{}}{\vdash \Empty :  \ONE} \hspace{2em}

\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}

\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}

\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}

\inferrule{\forall v \in vs. \vdash v:r \land  |v| \neq []}{\vdash \Stars(vs):r^*}
\end{mathpar}
\noindent
The condition $|v| \neq []$ in the premise of star's rule
is to make sure that for a given pair of regular 
expression $r$ and string $s$, the number of values 
satisfying $|v| = s$ and $\vdash v:r$ is finite.
Given the same string and regular expression, there can be
multiple values for it. For example, both
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
and the values both flatten to $abc$.
Lexers therefore have to disambiguate and choose only
one of the values to output. $\POSIX$ is one of the
disambiguation strategies that is widely adopted.

Ausaf and Urban\parencite{AusafDyckhoffUrban2016} 
formalised the property 
as a ternary relation.
The $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified 
in the following set of rules\footnote{The names of the rules are used
as they were originally given in \cite{AusafDyckhoffUrban2016}}:
\noindent
\begin{figure}
\begin{mathpar}
	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
		
	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}

	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}

	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}

	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
		\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
	\Seq \; v_1 \; v_2}

	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}

	\inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
		|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
		s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;
	(v::vs)}
\end{mathpar}
\caption{POSIX Lexing Rules}
\end{figure}
\noindent
The above $\POSIX$ rules follows the intuition described below: 
\begin{itemize}
	\item (Left Priority)\\
		Match the leftmost regular expression when multiple options of matching
		are available.
	\item (Maximum munch)\\
		Always match a subpart as much as possible before proceeding
		to the next token.
\end{itemize}
\noindent
These disambiguation strategies can be 
quite practical.
For instance, when lexing a code snippet 
\[ 
	\textit{iffoo} = 3
\]
using the regular expression (with 
keyword and identifier having their 
usualy definitions on any formal
language textbook, for instance
keyword is a nonempty string starting with letters 
followed by alphanumeric characters or underscores):
\[
	\textit{keyword} + \textit{identifier},
\]
we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword (if)
followed by
an identifier (foo).
POSIX lexing achieves this.

We know that a $\POSIX$ value is also a normal underlying
value:
\begin{lemma}
$(r, s) \rightarrow v \implies \vdash v: r$
\end{lemma}
\noindent
The good property about a $\POSIX$ value is that 
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for it:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
By induction on $s$, $r$ and $v_1$. The inductive cases
are all the POSIX rules. 
Probably the most cumbersome cases are 
the sequence and star with non-empty iterations.
We shall give the details for proving the sequence case here.

When we have 
\[
	(s_1, r_1) \rightarrow v_1 \;\, and \;\, 
	(s_2, r_2) \rightarrow v_2  \;\, and \;\,\\ 
	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
\]
we know that the last condition 
excludes the possibility of a 
string $s_1'$ longer than $s_1$ such that 
\[
(s_1', r_1) \rightarrow v_1'   \;\; 
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s 
\]
hold.
A shorter string $s_1''$ with $s_2''$ satisfying
\[
(s_1'', r_1) \rightarrow v_1''
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
\]
cannot possibly form a $\POSIX$ value either, because
by definition there is a candidate
with longer initial string
$s_1$. Therefore, we know that the POSIX
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
$s$ must have the 
property that 
\[
	|a| = s_1 \;\; and \;\; |b| = s_2.
\]
The goal is to prove that $a = v_1 $ and $b = v_2$.
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that 
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, 
which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
is the same as $\Seq(v_1, v_2)$. 
\end{proof}
\noindent
Now we know what a $\POSIX$ value is and why it is unique;
the problem is generating 
such a value in a lexing algorithm using derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

Sulzmann and Lu extended Brzozowski's 
derivative-based matching
to a lexing algorithm by a second pass
after the initial phase of successive derivatives.
This second phase generates a POSIX value 
if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The first one used is $\mkeps$, which constructs a POSIX value from the last
derivative $r_n$:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
	        & 	              & 	            & v_n       
\end{tikzcd}
\end{equation}
\end{ceqn}
\noindent
In the above diagram, again we assume that
the input string $s$ is made of $n$ characters
$c_0c_1 \ldots c_{n-1}$, and the input regular expression $r$
is given label $0$ and after each character $c_i$ is taken off
by the derivative operation the resulting derivative regular 
expressioin is $r_{i+1}$.The last derivative operation 
$\backslash c_{n-1}$ gives back $r_n$, which is transformed into 
a value $v_n$ by $\mkeps$.
$v_n$ tells us how an empty string is matched by the (nullable)
regular expression $r_n$, in a $\POSIX$ way.
The definition of $\mkeps$ is
	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps \; \ONE$ 		& $\dn$ & $\Empty$ \\
			$\mkeps \; (r_{1}+r_{2})$	& $\dn$ 
						& $\textit{if}\; (\nullable \; r_{1}) \;\,
							\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ 
						& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
			$\mkeps \; (r_1 \cdot r_2)$ 	& $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
			$\mkeps \; r^* $	        & $\dn$ & $\Stars\;[]$
		\end{tabular}
	\end{center}


\noindent 
We favour the left child $r_1$ of $r_1 + r_2$ 
to match an empty string if there is a choice.
When there is a star for us to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iteration is taken.
The result of a call to $\mkeps$ on a $\nullable$ $r$ would
be a $\POSIX$ value corresponding to $r$:
\begin{lemma}\label{mePosix}
$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
\end{lemma}
\begin{proof}
	By induction on the shape of $r$.
\end{proof}
\noindent
After the $\mkeps$-call, we inject back the characters one by one
in reverse order as they were chopped off in the derivative phase.
The fucntion for this is called $\inj$. $\inj$ and $\backslash$
are not exactly reverse operations of one another, as $\inj$ 
operates on values instead of regular
expressions.
In the diagram below, $v_i$ stands for the (POSIX) value 
for how the regular expression 
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. 
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
v_0           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}
\noindent
$\textit{inj}$ takes three arguments: a regular
expression ${r_{i}}$, before the character is chopped off, 
a character ${c_{i}}$, the character we want to inject back and 
the third argument $v_{i+1}$ the value we want to inject into. 
The result of this function is a new value $v_i$.
The definition of $\textit{inj}$ is as follows: 
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent 
This definition is by recursion on the ``shape'' of regular
expressions and values. 
The clauses do one thing--identifying the ``hole'' on a
value to inject the character back into.
For instance, in the last clause for injecting back to a value
that would turn into a new star value that corresponds to a star,
we know it must be a sequence value. And we know that the first 
value of that sequence corresponds to the child regex of the star
with the first character being chopped off--an iteration of the star
that had just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character 
back to the first value and form a new value with this latest iteration
being added to the previous list of iterations, all under the $\Stars$
top level.
The POSIX value is maintained throughout the process:
\begin{lemma}
$(r \backslash c, s) \rightarrow v \implies (r, c :: s) \rightarrow (\inj r \; c\; v)$
\end{lemma}\label{injPosix}


Putting all the functions $\inj$, $\mkeps$, $\backslash$ together,
and taking into consideration the possibility of a non-match,
we have a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if} (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer (r\backslash c) s) \textit{of} $\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
 \noindent
 The central property of the $\lexer$ is that it gives the correct result by
 $\POSIX$ standards:
 \begin{theorem}
	 The $\lexer$ based on derivatives and injections is both sound and complete:
 \begin{tabular}{lcl}
	 $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
	 $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
 \end{tabular}
 \end{theorem}
 
 
 \begin{proof}
 By induction on $s$. $r$ is allowed to be an arbitrary regular expression.
 The $[]$ case is proven by  lemma \ref{mePosix}, and the inductive case
 by lemma \ref{injPosix}.
 \end{proof}


We now give a pictorial view of the algorithm (
For convenience, we employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. The
values built incrementally by \emph{injecting} back the characters into the
earlier values are $v_n, \ldots, v_0$. Corresponding values and characters
are always in the same subscript, i.e. $\vdash v_i : r_i$):

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
As we did earlier in this chapter on the matcher, one can 
introduce simplification on the regex.
However, now we need to do a backward phase and make sure
the values align with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection 
function heavily relies on the structure of the regexes and values
being correct and matching each other.
It can be achieved by recording some extra rectification functions
during the derivatives step, and applying these rectifications in 
each run during the injection phase.

\ChristianComment{Do I introduce the lexer with rectification here?}
And we can prove that the POSIX value of how
regular expressions match strings will not be affected---although it is much harder
to establish. 
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}. 

However, even with these simplification rules, we could still end up in
trouble, when we encounter cases that require more involved and aggressive
simplifications.
\section{A Case Requring More Aggressive Simplification}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 9000 nodes (when viewed as a tree)
even with simplification.
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    legend entries={Naive Matcher},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture} 
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
\end{figure}\label{fig:BetterWaterloo}
   
That is because our lexing algorithm currently keeps a lot of 
"useless" values that will not be used. 
These different ways of matching will grow exponentially with the string length.

For $r= (a^*\cdot a^*)^*$ and  
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
if we do not allow any empty iterations in its lexical values,
there will be $n - 1$ "splitting points" on $s$ we can independently choose to 
split or not so that each sub-string
segmented by those chosen splitting points will form different iterations.
For example when $n=4$, we give out a few of the many possibilities of splitting:
\begin{center}
\begin{tabular}{lcr}
$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration)\\
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a} \,v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
 & $\textit{etc}.$ &
 \end{tabular}
\end{center}
\noindent
And for each iteration, there are still multiple ways to split
between the two $a^*$s.
It is not surprising there are exponentially many lexical values
that are distinct for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer to keep all the possible values will naturally 
have an exponential runtime on ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm will keep track of all different values
of a match. This means Sulzmann and Lu's injection-based algorithm 
 exponential by nature.
Somehow one has to make sure which
 lexical values are $\POSIX$ and must be kept in a lexing algorithm.


 For example, the above $r= (a^*\cdot a^*)^*$  and 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ example has the POSIX value
$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
We want to keep this value only, and remove all the regular expression subparts
not corresponding to this value during lexing.
To do this, a two-phase algorithm with rectification is a bit too fragile.
Can we not create those intermediate values $v_1,\ldots v_n$,
and get the lexing information that should be already there while
doing derivatives in one pass, without a second injection phase?
In the meantime, can we ensure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the 
regular expressions called \emph{bitcodes}.
author	Chengsong
	Thu, 14 Jul 2022 14:57:32 +0100
changeset 567	28cb8089ec36
parent 564	3cbcd7cda0a9
child 568	7a579f5533f8
permissions	-rwxr-xr-x