
% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we 
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed. 

In this chapter, we define the basic notions 
for regular languages and regular expressions.
We also give the definition of what $\POSIX$ lexing means.

\section{Basic Concepts}
Usually, formal language theory starts with an alphabet
denoting a set of characters.
Here we use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
Using the usual $[]$ notation for lists,
we can define strings made up of characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{string}$ & $\dn$ & $[] | c  :: cs$\\
& & $(c\; \text{has char type})$
\end{tabular}
\end{center}
Strings can be concatenated to form longer strings,
in the same way as we concatenate two lists,
which we denote by $@$. We omit the precise 
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
The power of a language is defined recursively, using the 
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A^n @ A$
\end{tabular}
\end{center}
The union of all the natural number powers of a language   
is defined as the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}

\noindent
However, to obtain a convenient induction principle 
in Isabelle/HOL, 
we instead define the Kleene star
as an inductive set: 

\begin{center}
\begin{mathpar}
\inferrule{}{[] \in A*}

\inferrule{s_1 \in A \\ s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}

We also define the operation of ``chopping off'' a character from
a language, which we call $\Der$, meaning ``derivative of a language'':
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string from all strings within a set $A$,
with the help of the concatenation operator:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient of $A$ by the 
singleton language $\{w\}$
in formal language theory.
For this dissertation the $\textit{Ders}$ definition with 
a single string suffices.
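
\noindent
For example, $\Der \; b \; \{[b],\, [b,a],\, [a,b]\} = \{[],\, [a]\}$ and, for the
string version, $\textit{Ders}\; [a,b]\; \{[a,b],\, [a,b,c]\} = \{[],\, [c]\}$.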

With the concatenation, Kleene star and $\textit{Der}$ operations on languages in place,
we can state a few properties describing how the derivative of a composed language
can be expressed in terms of the derivatives of its sub-languages.
\begin{lemma}
\[
	\Der \; c \; (A @ B) =
	\begin{cases}
	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , &  \text{if} \;  [] \in A  \\
	 (\Der \; c \; A) \,  @ \, B, & \text{otherwise}
	 \end{cases}	
\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, then $\Der$ can ``pierce''
through $A$ and also act on $B$.
The derivative of the language $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
\end{lemma}
\begin{proof}
\begin{itemize}
\item{$\subseteq$:}
\noindent
The set 
\[ \{s \mid c :: s \in A*\} \]
is included in the set
\[ \{s_1 @ s_2 \mid s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever a non-empty string in $A*$ starts with a character $c$,
that character together with some prefix of the remaining string
forms the first iteration (an element of $A$),
and the rest of the string is still in $A*$.
\item{$\supseteq$:}
Note that
\[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
and 
\[ \Der \; c \;  (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
where the right-hand side of the above equation can be rewritten
as $(\Der \; c\; A) @ A* \cup A'$, with $A'$ a possibly empty set.
Hence $(\Der\;c\;A) @ A*$ is included in $\Der\;c\;(A*)$.
\end{itemize}
\end{proof}

\noindent
Before we define the counterparts of $\textit{Der}$ and $\textit{Ders}$
for regular expressions, we first need to give the definition of regular expressions themselves.

\subsection{Regular Expressions and Their Meaning}
The basic regular expressions  are defined inductively
 by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*         
\]
\noindent
We call them basic because we will introduce
further constructs later, such as negation
and bounded repetitions.
We write $\ZERO$ for the regular expression denoting
the empty language; note that some authors
also use $\phi$ for it.
Similarly, the regular expression denoting the 
singleton set containing only the empty string $[]$ is sometimes 
written as $\epsilon$; we use $\ONE$ here.

The language, or set of strings, denoted 
by a regular expression is defined as:
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\phi$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \,@\, L \; (r_2)$\\
$L \; (r^*)$ & $\dn$ & $ (L(r))^*$
\end{tabular}
\end{center}
\noindent
This mapping is also called the ``language interpretation'' of
a regular expression.
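
For example, assuming $a$ and $b$ are character regular expressions,
\[ L\;(a \cdot (b + \ONE)) \;=\; \{[a]\} \,@\, (\{[b]\} \cup \{[]\}) \;=\; \{[a,b],\, [a]\}. \]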

Now, with semantic derivatives of languages and the language interpretation
of regular expressions in place, we are ready to define derivatives on regular expressions.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}


The language derivative acts on a set of strings and chops off a character from
all strings in that set; we now want to define a derivative operation on regular expressions
so that the language $L(r\backslash c)$ 
looks as if it was obtained by taking the language derivative of $L(r)$:
\begin{center}
\[
r\backslash c \dn ?
\]
such that
\[
L(r \backslash c) = \Der \; c \; L(r).
\]
\end{center}
So we mimic the equality we have for $\Der$ on language concatenation

\[
\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then}\; ((\Der \; c \; A) \,@\, B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) \,@\, B
\]
to get the derivative for sequence regular expressions:
\[
(r_1 \cdot r_2 ) \backslash c = \textit{if}\;([] \in L(r_1))\; \textit{then}\; (r_1 \backslash c) \cdot r_2 \,+\, r_2 \backslash c \;\textit{else}\; (r_1 \backslash c) \cdot r_2
\]

\noindent
and the equality for the Kleene star on languages
\[
\textit{Der} \;c \;(A*) = (\textit{Der}\; c\; A) \,@\, (A*)
\]
to get the derivative of the Kleene star regular expression:
\[
r^* \backslash c = (r \backslash c)\cdot r^*
\]
Note that while the boolean predicate
$[] \in L(r_1)$ can be formalised without problems, a function that is
meant to be executed needs a recursive test for
whether the empty string is in the language of a regular expression.
We call such a function $\nullable$; it is given below,
after the full definition of the derivative:


\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The derivative function, written $r\backslash c$, 
defines how a regular expression evolves into
a new regular expression after the head character $c$
has been chopped off all the strings it can match.
The most involved cases are the ones for sequence 
and star.
The sequence case says that if the first component of the sequence
can match the empty string, then the head character may also
be chopped off from the second component.
The star case unwraps one iteration: it takes the derivative of the
body of the star and attaches a copy of the star regular expression
behind it, so that further iterations remain
possible in later phases of the matching.
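
For example, taking the derivative of $(a \cdot b) + b$ w.r.t.~$a$ gives
\[ ((a \cdot b) + b)\backslash a \;=\; ((a\backslash a) \cdot b) + (b\backslash a) \;=\; (\ONE \cdot b) + \ZERO, \]
whose language $\{[b]\}$ is exactly $\Der\;a\;(L((a \cdot b) + b))$.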


To test whether $[] \in L(r_1)$ in the sequence clause above, we use the
function $\nullable$, which decides whether the empty string $[]$
is in the language of a regular expression $r$:


\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
\end{center}
\noindent
The empty set does not contain any string, so in particular not
the empty string. The regular expression $\ONE$ contains the empty
string by definition. A character regular expression denotes a
singleton set containing a one-character string, and
therefore does not contain the empty string.
An alternative regular expression (or ``or'' expression)
is nullable if at least one of its children is nullable.
A sequence requires both children to be able to match
the empty string in order to compose an empty string itself.
Finally, the Kleene star always matches the empty string,
with zero iterations.
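
\noindent
To make the above definitions concrete, the following Scala sketch gives
one possible executable rendering of $\nullable$ and $\backslash$; the
constructor names (\texttt{ZERO}, \texttt{ONE}, \texttt{CHAR}, \texttt{ALTS},
\texttt{SEQ}, \texttt{STAR}) are chosen to match the simplification code
shown later in this chapter:
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                     // empty language
case object ONE extends Rexp                      // empty string
case class CHAR(c: Char) extends Rexp             // single character
case class ALTS(r1: Rexp, r2: Rexp) extends Rexp  // alternative
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // sequence
case class STAR(r: Rexp) extends Rexp             // Kleene star

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALTS(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALTS(r1, r2) => ALTS(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALTS(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}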
  
We have the following property: the derivative on regular 
expressions indeed coincides with the derivative on sets of strings:

\begin{lemma}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent
The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is 

\begin{lemma}\label{derStepwise}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{lemma}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity we will use  $\backslash$ to denote
string derivatives for brevity.
Brzozowski's  regular-expression matcher algorithm can then be described as:

\begin{definition}
$\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)$
\end{definition}

\noindent
Assuming the string is given as a sequence of characters, say $c_0 c_1 \ldots c_{n-1}$, 
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
It can be shown
relatively easily that this matcher is correct:
\begin{lemma}
$\textit{match} \; s\; r  = \textit{true} \Longleftrightarrow s \in L(r)$
\end{lemma}
\begin{proof}
By induction on $s$, using the stepwise property of $\backslash$ (Lemma~\ref{derStepwise}).
\end{proof}
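\noindent
In Scala, this matcher can be sketched on top of the \texttt{der} and
\texttt{nullable} functions from the sketch above:
\begin{verbatim}
// string derivative: apply der for each character in turn
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))
\end{verbatim}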
\noindent
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow.


\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    legend entries={Naive Matcher},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture} 
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}
\end{figure}
   
\noindent
For this we need to introduce certain 
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the 
language of the regular expression.
We have a simplification function (that is as simple as possible
while having much power on making a regex simpler):
\begin{verbatim}
def simp(r: Rexp) : Rexp = r match {
  case SEQ(r1, r2) => 
    (simp(r1), simp(r2)) match {
      case (ZERO, _) => ZERO
      case (_, ZERO) => ZERO
      case (ONE, r2s) => r2s
      case (r1s, ONE) => r1s
      case (r1s, r2s) => SEQ(r1s, r2s)
    }
  case ALTS(r1, r2) => {
    (simp(r1), simp(r2)) match {
      case (ZERO, r2s) => r2s
      case (r1s, ZERO) => r1s
      case (r1s, r2s) =>
        if(r1s == r2s) r1s else ALTS(r1s, r2s)
    }
  }
  case r => r
}
\end{verbatim}
If we apply this simplification repeatedly 
during the matching algorithm, 
we obtain a matcher with simplification:
\begin{verbatim}
def ders_simp(s: List[Char], r: Rexp) : Rexp = s match {
  case Nil => simp(r)
  case c :: cs => ders_simp(cs, simp(der(c, r)))
}

def simp_matcher(s: String, r: Rexp) : Boolean = 
  nullable(ders_simp(s.toList, r))

\end{verbatim}
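\noindent
For example, the evil expression $(a^*)^*b$ from Figure~\ref{NaiveMatcher}
can be sketched as follows (assuming the \texttt{Rexp} constructors from before):
\begin{verbatim}
// (a*)* . b from the experiment above
val EVIL = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))

// returns false quickly, even for long inputs
simp_matcher("a" * 1000, EVIL)
\end{verbatim}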
\noindent
With these simplification rules in place, this example is now
very tame with respect to the length of the input:


\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    ymode = log,
    xmode = log,
    legend entries={Matcher With Simp},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching $(a^*)^*b$ with simplification against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterMatcher}
\end{figure}


\noindent
Note that both axes in Figure~\ref{fig:BetterMatcher} are in logarithmic scale.
The algorithm builds derivatives, simplifies them where possible,
and then tests whether the empty string is in the language of the
resulting regular expression.
So far, so good. But what if we want to 
do lexing instead of just getting a YES/NO answer?
\citeauthor{Sulzmann2014} came up with a nice and 
elegant solution for this
(arguably as beautiful as the definition of the original derivative).

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
Here we present Sulzmann and Lu's lexing algorithm, whose
two phases are connected by the function $\inj$.
They first defined a datatype for storing the 
lexing information, called a \emph{value} or
sometimes also \emph{lexical value}.  These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}

\noindent
There is a formal binary relation, written $\vdash v : r$, stating that
the structure of a value $v$ agrees with the regular expression $r$:
\begin{mathpar}
\inferrule{}{\vdash \Empty :  \ONE} \hspace{2em}
\inferrule{}{\vdash \Char(c) : c} \hspace{2em}
\inferrule{\vdash v_1 : r_1 \\ \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}\\
\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1) : r_1 + r_2} \hspace{2em}
\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2) : r_1 + r_2} \hspace{2em}
\inferrule{\forall v \in vs.\; \vdash v : r}{\vdash \Stars\,vs : r^*}
\end{mathpar}
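
\noindent
For example, $\vdash \Seq(\Char(a), \Left(\Empty)) : a \cdot (\ONE + b)$ holds:
this value records that the string $[a]$ was matched by picking the left
alternative $\ONE$ for the second component of the sequence.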

Building on top of Sulzmann and Lu's attempt to formalise the 
notion of POSIX lexing rules \parencite{Sulzmann2014}, 
Ausaf and Urban \parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation, recursively defined in a
natural deduction style.
The formal definition of a $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, written $(s, r) \rightarrow v$, is specified 
by a set of rules, of which we show here only the (most involved) rule for sequences;
the remaining rules can be found in \parencite{AusafDyckhoffUrban2016}:
\newcommand*{\inference}[3][t]{%
   \begingroup
   \def\and{\\}%
   \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
   #2 \\
   \hline
   #3
   \end{tabular}%
   \endgroup
}
\begin{center}
\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3\, s_4\, s_5.\; s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land s_3 \in L(r_1) \land s_4 \in L(r_2))$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$  }{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$ }
\end{center}
\noindent
The $\POSIX$ rules can be understood intuitively as:
\begin{itemize}
\item
match the leftmost alternative when multiple options for matching
are available;  
\item 
always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}

The reason why we are interested in $\POSIX$ values is that they can
be used practically in the lexing phase of a compiler front end.
For instance, when lexing a code snippet 
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognised
as an identifier rather than a keyword.

A good property of $\POSIX$ values is uniqueness:
given the same regular expression $r$ and string $s$,
one can always uniquely determine the $\POSIX$ value for them:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad  \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
By induction on $s$, $r$ and $v_1$, with the $\POSIX$ rules as
the induction principle. Each case is proven by a combination of
the inversion of the $\POSIX$ rules and the inductive hypothesis.
Probably the most cumbersome cases are the sequence case and the star case with non-empty iterations.

We sketch the reasoning for the sequence case as follows:
when we have $(s_1, r_1) \rightarrow v_1$ and $(s_2, r_2) \rightarrow v_2$, 
we know by the side condition of the sequence rule that there cannot be a longer prefix $s_1'$ such that $(s_1', r_1) \rightarrow v_1'$
and $(s_2', r_2) \rightarrow v_2'$ and $s_1' @ s_2' = s$ all hold.
Splits where $s_1'$ is shorter than $s_1$ cannot
form a $\POSIX$ value for $s$ either.
So if we have some other values $v_1'$ and $v_2'$ such that 
$(s_1, r_1) \rightarrow v_1'$ and $(s_2, r_2) \rightarrow v_2'$,
then by the induction hypotheses $v_1' = v_1$ and $v_2'= v_2$, 
which means this ``different'' $\POSIX$ value $\Seq(v_1', v_2')$
is the same as $\Seq(v_1, v_2)$. 
\end{proof}
Now that we know what a $\POSIX$ value is, the question is how to compute 
such a value in a lexing algorithm based on derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \ref{graph:successive_ders}). In this second phase, a POSIX value 
is generated if the regular expression matches the string.
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
of the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
	        & 	              & 	            & v_n       
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how an empty string can be matched by a 
regular expression in a $\POSIX$ way:

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
			& \textit{if} $\nullable(r_{1})$\\ 
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}
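
\noindent
A Scala sketch of a values datatype and of $\mkeps$ could look as follows
(the constructor names are our own choice; $\mkeps$ is deliberately partial,
as it is only ever called on nullable regular expressions):
\begin{verbatim}
abstract class Val
case object Empty extends Val                   // for ONE
case class Chr(c: Char) extends Val             // for CHAR
case class Sequ(v1: Val, v2: Val) extends Val   // for SEQ
case class Left(v: Val) extends Val             // left of ALTS
case class Right(v: Val) extends Val            // right of ALTS
case class Stars(vs: List[Val]) extends Val     // for STAR

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALTS(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}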


\noindent 
We favour the left child when there is a choice of how to match the empty string.
When a star matches the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
The result of a call to $\mkeps$ on a $\nullable$ regular expression $r$
is a $\POSIX$ value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
$\nullable(r) \implies ([], r) \rightarrow (\mkeps\; r)$
\end{lemma}


After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regular expression $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. 
To do this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off; the second
is the character ${c_{i-1}}$ we want to inject; and the
third argument is the value ${v_i}$ into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. 
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
The
definition of $\textit{inj}$ is as follows: 

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent 
This definition is by recursion on the ``shape'' of regular
expressions and values. 
Each clause does one thing: it identifies the ``hole'' in the
value into which the character needs to be injected back.
For instance, consider the last clause, which builds a new star value.
There the value must be a sequence whose first component
corresponds to the body of the star with the head character chopped off,
i.e.\ the iteration of the star that has just been unfolded,
and whose second component collects the already matched star iterations.
So we inject the character back into the first value and add this
latest iteration to the front of the previous list of iterations, all
under a $\Stars$ constructor.
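
\noindent
In Scala, a sketch of $\textit{inj}$ follows the mathematical definition
clause by clause (again a partial function, defined only on the
regular-expression/value shapes that can actually occur):
\begin{verbatim}
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALTS(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALTS(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}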
The POSIX value is maintained throughout the process:
\begin{lemma}\label{injPosix}
$(s, r \backslash c) \rightarrow v \implies (c :: s, r) \rightarrow (\inj\; r \; c\; v)$
\end{lemma}


Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
and taking into account the possibility of a non-match,
we obtain a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $=$ & $\textit{if}\; (\nullable \; r)\; \textit{then}\;  \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer \;(r\backslash c)\; s) \;\textit{of} $\\
& & $\quad \None \implies \None$\\
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
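\noindent
A direct Scala transcription of this definition, using \texttt{Option}
for the possibility of a non-match, might be:
\begin{verbatim}
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None => None
    case Some(v) => Some(inj(r, c, v))
  }
}
\end{verbatim}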
 \noindent
 The central property of $\lexer$ is that it gives the correct result
 according to $\POSIX$ rules:
 \begin{lemma}
 \begin{tabular}{l}
 $s \in L(r) \Longleftrightarrow  (\exists v. \; \lexer \; r \; s = \Some(v) \land (s, \; r) \rightarrow v)$\\
 $s \notin L(r) \Longleftrightarrow (\lexer \; r\; s = \None)$
 \end{tabular}
 \end{lemma}
 
 
 \begin{proof}
 By induction on $s$, where $r$ is allowed to be an arbitrary regular expression.
 The $[]$ case is proven by Lemma~\ref{mePosix}, and the inductive case
 by Lemma~\ref{injPosix}.
 \end{proof}


Pictorially, the algorithm is as follows.
For convenience, we employ the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. The
values built incrementally by \emph{injecting} back the characters into the
earlier values are $v_n, \ldots, v_0$. Corresponding values and regular expressions
carry the same subscript, i.e.\ $\vdash v_i : r_i$:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
As we did earlier in this chapter with the matcher, one can 
introduce simplification on the regular expressions.
However, now there is a backward phase in which the values
must stay aligned with the regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection 
function relies heavily on the structure of the regular expressions and values
matching each other.
This can be achieved by recording extra \emph{rectification} functions
during the derivative steps, and applying these rectifications at
the corresponding steps of the injection phase.

One can prove that the POSIX value describing how a
regular expression matches a string is not affected by such
simplifications, although this is much harder
to establish. 
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}. 

However, even with these simplification rules, we can still end up in
trouble when we encounter cases that require more involved and aggressive
simplifications.
\section{A Case Requiring More Aggressive Simplification}
For example, when starting with the regular
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 9000 nodes (when viewed as a tree)
even with simplification.
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={size},
    legend entries={Size of the derivative},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\end{axis}
\end{tikzpicture} 
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}
   
That is because our lexing algorithm currently keeps a lot of 
``useless'' values that will never be used. 
The number of different ways of matching grows exponentially with the string length.

For $r= (a^*\cdot a^*)^*$ and  
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$,
if we do not allow any empty iterations in the lexical values,
there are $n - 1$ ``splitting points'' in $s$, at each of which we can
independently choose to split or not, so that each sub-string
segmented by the chosen splitting points forms a different iteration.
For example, when $n=4$:
\begin{center}
\begin{tabular}{lcr}
$aaaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,aaaa}]$ (1 iteration, this iteration will be divided between the inner sequence $a^*\cdot a^*$)\\
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$ (two iterations)\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$ (two iterations)\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$ (three iterations)\\
$a \mid a \mid a\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, a},\, v_{iteration \, a}, \, v_{iteration \, a}]$ (four iterations)\\
 & $\textit{etc}.$ &
 \end{tabular}
\end{center}
\noindent
And for each iteration, there are still multiple ways to split
the sub-string between the two $a^*$s.
It is not surprising that there are exponentially many (at least $2^{n-1}$)
distinct lexical values for the regular expression and string pair
$r= (a^*\cdot a^*)^*$ and 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
A lexer that keeps all the possible values will naturally 
have an exponential runtime on such ambiguous regular expressions.
With just $\inj$ and $\mkeps$, the lexing algorithm keeps track of all the
different ways of matching. This makes Sulzmann and Lu's injection-based
algorithm exponential by nature.
Somehow one has to make sure that only the
lexical values which are $\POSIX$ are kept in a lexing algorithm.


 For example, the above pair $r= (a^*\cdot a^*)^*$ and 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
We want to keep only this value, and remove already during lexing all the
parts of the regular expression that do not correspond to it.
For this, a two-phase algorithm with rectification is a bit too fragile.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$
and extract the lexing information, which is already there, while
doing derivatives in one pass, without a second injection phase?
And can we, at the same time, make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the 
regular expressions called \emph{bitcodes}.