
% Chapter Template

\chapter{Regular Expressions and POSIX Lexing} % Main chapter title

\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we 
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed. 


\section{Basic Concepts and Notations for Strings, Languages, and Regular Expressions}
We have a primitive datatype \textit{char} for characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{char}$ & $\dn$ & $a \mid b \mid c \mid \ldots \mid z$\\
\end{tabular}
\end{center}
Strings are formed as lists of characters:
\begin{center}
\begin{tabular}{lcl}
$\textit{string}$ & $\dn$ & $[] \mid c :: cs$\\
& & $(\text{where } c \text{ is a \textit{char} and } cs \text{ a \textit{string}})$
\end{tabular}
\end{center}
Strings can be concatenated to form longer strings:
\begin{center}
\begin{tabular}{lcl}
$[] @ s_2$ & $\dn$ & $s_2$\\
$(c :: s_1) @ s_2$ & $\dn$ & $c :: (s_1 @ s_2)$
\end{tabular}
\end{center}

Concatenation can be lifted to sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A; s_B \in B \}$\\
\end{tabular}
\end{center}
We call this operation \emph{language concatenation}.
The power of a language is defined recursively, using the 
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A^n @ A$
\end{tabular}
\end{center}
The union of all the natural number powers of a language   
is denoted by the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
$\bigcup_{i \geq 0} A^i$ & $\denote$ & $A^*$\\
\end{tabular}
\end{center}

In Isabelle there is no easy way to use the $\bigcup_{i \geq 0}$ operator
directly, so we instead define the Kleene star
as an inductive set: 
\begin{center}
\begin{tabular}{lcl}
$[] \in A^*$  & &\\
$s_1 \in A \land \; s_2 \in A^* $ & $\implies$ & $s_1 @ s_2 \in A^*$\\
\end{tabular}
\end{center}
\subsection{Language Derivatives}
We also define an operation of chopping off a character from
a language:
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}

This can be generalised to chopping off a string from all strings within set $A$:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;w \;A$ & $\dn$ & $\{ s \mid w@s \in A \}$\\
\end{tabular}
\end{center}

which is essentially the left quotient $A \backslash L'$ of $A$ by
the singleton language $L' = \{w\}$
in formal language theory.
For this dissertation the $\textit{Ders}$ notation suffices; there is
no need for a more general quotient definition.
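
These operations can be illustrated in Scala on finite sets of strings.
This is a sketch for illustration only; the function names are our own choices:
\begin{verbatim}
// Chop a character, respectively a string, off every
// string in a language (restricted to finite sets).
def Der(c: Char, A: Set[String]): Set[String] =
  A.collect { case s if s.nonEmpty && s.head == c => s.tail }

def Ders(w: String, A: Set[String]): Set[String] =
  A.collect { case s if s.startsWith(w) => s.drop(w.length) }

// Example: Der('a', Set("abc", "acc", "b")) == Set("bc", "cc")
\end{verbatim}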

With the concatenation, Kleene star and $\textit{Der}$ operations on languages in place,
we can state how the derivative of a compound language is expressed in terms of
the derivatives of its sub-languages.
\begin{lemma}
$\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} \; ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B$
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, then $\Der$ can ``pierce''
through $A$ and reach $B$.

The language $A^*$'s derivative can be described using the language derivative
of $A$:
\begin{lemma}
$\textit{Der} \;c \;(A^*) = (\textit{Der}\; c\; A) @ (A^*)$
\end{lemma}
\begin{proof}
\begin{itemize}
\item{$\subseteq$:}
The set 
\[ \{s \mid c :: s \in A^*\} \]
is contained in the set
\[ \{s_1 @ s_2 \mid s_1,\, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A^* \} \]
because whenever a string in $A^*$ starts with the character $c$, then $c$ together
with some sub-string immediately after it forms the first iteration,
and the rest of the string will still be in $A^*$.
\item{$\supseteq$:}
Note that
\[ \Der \; c \; A^* = \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) \]
and 
\[ \Der \; c \;  (\{ [] \} \cup (A @ A^*) ) = \Der\; c \; (A @ A^*), \]
where the right-hand side of the last equation can be rewritten
as \[ ((\Der \; c\; A) @ A^*) \cup A', \] with $A'$ being either $\Der \; c \; A^*$
(when $[] \in A$) or the empty set. In either case the right-hand side
contains $(\Der \; c\; A) @ A^*$.
\end{itemize}
\end{proof}
Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart
for regular languages, we need to first give definitions for regular expressions.

\subsection{Regular Expressions and Their Language Interpretation}
Suppose we have an alphabet $\Sigma$; the set of all strings over $\Sigma$
is written $\Sigma^*$.

We use patterns to define sets of strings concisely, and regular expressions
are one such pattern system.
The basic regular expressions are defined inductively
 by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*         
\]

The language, or set of strings, denoted 
by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\emptyset$\\
$L \; (\ONE)$ & $\dn$ & $\{ [] \}$\\
$L \; (c)$ & $\dn$ & $\{ [c] \}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) @ L \; (r_2)$\\
$L \; (r^*)$ & $\dn$ & $(L \; (r))^*$\\
\end{tabular}
\end{center}
This mapping is also called the ``language interpretation'' of
a regular expression.

Now with semantic derivatives of languages and the language interpretation of
regular expressions in place, we are ready to define derivatives on regular expressions.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
The language derivative acts on a set of strings by chopping off a character from
all strings in that set. We want a corresponding derivative operation on regular
expressions, such that the language of the derivative $r\backslash c$ 
looks as if it was obtained by taking the language derivative of $L(r)$:
\[
L(r \backslash c) = \Der \; c \; L(r)
\]
To achieve this, we mimic the equality we have for $\Der$ on language concatenation

\[
\Der \; c \; (A @ B) = \textit{if} \;  [] \in A \; \textit{then} ((\Der \; c \; A) @ B ) \cup \Der \; c\; B \quad \textit{else}\; (\Der \; c \; A) @ B
\]
to get the derivative for sequence regular expressions:
\[
(r_1 \cdot r_2 ) \backslash c = \textit{if}\;[] \in L(r_1)\;\textit{then}\;(r_1 \backslash c) \cdot r_2 + r_2 \backslash c\;\textit{else}\; (r_1 \backslash c) \cdot r_2
\]

\noindent
and the equality for the language Kleene star
\[
\textit{Der} \;c \;A^* = (\textit{Der}\; c\; A) @ (A^*)
\]
to get the derivative of the Kleene star regular expression:
\[
r^* \backslash c = (r \backslash c)\cdot r^*
\]
Note that although we can formalise the boolean predicate
$[] \in L(r_1)$ without problems, if we want a function that is
computational, we have to define a function that tests
whether the empty string is in the language of a regular expression.
We call this function $\nullable$; it is defined after the derivative clauses below.


\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The derivative function, written $r\backslash c$, 
defines how a regular expression evolves into
a new regular expression when the head character $c$ is chopped off
all the strings it matches.
The most involved cases are the sequence 
and star cases.
The sequence case says that if the first regular expression
matches the empty string, then the second component of the sequence
may also be the one whose head character is chopped off.
The derivative of a star regular expression unwraps one iteration,
takes the derivative of the body and attaches the star regular expression
as the second element of a sequence, so that a copy is retained
for possible further iterations in later phases of lexing.


The $\nullable$ function tests whether the empty string $[]$ 
is in the language of $r$:


\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
\end{center}
\noindent
The empty set does not contain any string and
therefore in particular not the empty string. The empty string 
regular expression contains the empty string
by definition. The character regular expression
is the singleton containing only that character
and therefore does not contain the empty string.
An alternative regular expression (or ``or'' expression)
is nullable if at least one of its children is nullable,
whereas a sequence regular expression
requires both children to contain the empty string
in order to compose one. The Kleene star
contains the empty string by allowing zero iterations. 
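
As a concrete illustration, the above definitions can be written down in
Scala as follows. This is only a sketch for illustration; the constructor
names are our own choices:
\begin{verbatim}
// Regular expressions as an algebraic datatype, together with
// nullable and the derivative, following the clauses above.
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only ""
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}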
  
We have the following property where the derivative on regular 
expressions coincides with the derivative on a set of strings:

\begin{lemma}
$\textit{Der} \; c \; L(r) = L (r\backslash c)$
\end{lemma}

\noindent


The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is 

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s [\,] $ & $\dn$ & $r$\\
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity, we will for brevity write $\backslash$ also for
string derivatives.

Using string derivatives, we can define Brzozowski's regular-expression matching algorithm as:

\begin{definition}
$match\;s\;r \;\dn\; nullable(r\backslash s)$
\end{definition}
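
\noindent
In the Scala sketch from above, string derivatives and the matcher become:
\begin{verbatim}
// Derivative w.r.t. a string, and Brzozowski's matcher.
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean =
  nullable(ders(s.toList, r))
\end{verbatim}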

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$, 
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:successive_ders}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with  a regular expression  $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can  be
relatively  easily shown that this matcher is correct  (that is given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

This is a beautifully simple definition.

If implemented naively, however,
the algorithm can be excruciatingly slow. 


\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={JavaScript},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\end{tabular}    
\caption{Runtime for matching $(a^*)^*\,b$ with strings 
           of the form $\underbrace{aa..a}_{n}$.} \label{fig:EightThousandNodes}
\end{figure}


For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree); see
Figure~\ref{fig:EightThousandNodes} for the effect on matching runtime.
The reason why $(a + aa) ^*$ explodes so drastically is that without
pruning, the algorithm will keep records of all possible ways of matching:
\begin{center}
$(a + aa) ^* \backslash [aa] = (\ZERO + (\ZERO \cdot a + \ONE))\cdot(a + aa)^* \;+\; (\ONE + \ONE \cdot a) \cdot (a + aa)^*$
\end{center}

\noindent
The branches of the above derivative correspond to the different ways of
matching the input so far: $aa$ matched as one iteration, $a$ and $a$ matched
as two iterations, and $a$ followed by an incomplete iteration $aa$ of which
only the first character has been consumed (the $\ZERO$ branches record
failed attempts).
These different ways of matching grow exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that these expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm. 
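
The growth can be observed directly with the Scala sketch by counting the
nodes of successive derivatives of $(a + aa)^*$ w.r.t.~the character $a$
(the \texttt{size} function below is our own auxiliary definition):
\begin{verbatim}
// Count the nodes of a regular expression, viewed as a tree.
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
}

// Successive derivatives of (a + aa)* blow up quickly.
val evil = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
for (n <- 0 to 12)
  println(s"$n derivatives: size ${size(ders(List.fill(n)('a'), evil))}")
\end{verbatim}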

Brzozowski was quick to observe that during this process many useless
$\ONE$s and $\ZERO$s are generated, making the approach non-optimal.
He therefore introduced ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically 
different but language-equivalent sub-regexes and thereby decrease the size
of the intermediate regexes. 

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, composing
these two operations into an atomic one: $r \backslash_{simp}\,c \dn
\textit{simp}(r \backslash c)$. Then we can build
a matcher that keeps the regular expressions smaller.
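
Such a simplification function can be sketched in Scala as follows; the
exact set of rules here (removing $\ZERO$s and $\ONE$s and merging
syntactically equal alternatives) is only an illustration:
\begin{verbatim}
// Remove useless ZEROs and ONEs; merge equal alternatives.
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2
    case (s1, ZERO) => s1
    case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (_, ZERO) => ZERO
    case (ONE, s2) => s2
    case (s1, ONE) => s1
    case (s1, s2) => SEQ(s1, s2)
  }
  case r1 => r1
}

// Derivative fused with simplification: r \simp c = simp(r \ c)
def derSimp(c: Char, r: Rexp): Rexp = simp(der(c, r))
\end{verbatim}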

If we want the size of derivatives in the algorithm to
stay even lower, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$.  Such more aggressive simplification rules are needed to obtain
 a very tight size bound, possibly as low
  as that of the \emph{partial derivatives}\parencite{Antimirov1995}. 

So far we have a matcher that builds derivatives and simplifies them,
but what if we want to 
do lexing instead of just getting a YES/NO answer?
For this we have to return for a moment to the world 
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and 
elegant (arguably as beautiful as the original
derivatives definition) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


They first defined a datatype for storing the 
lexing information, called a \emph{value} or
sometimes also \emph{lexical value}.  Values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}
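
\noindent
In the Scala sketch, values can be represented by the following datatype
(\texttt{Sequ} and \texttt{Chr} are our own names, chosen to avoid clashes
with the regular-expression constructors):
\begin{verbatim}
// Values record how a regular expression matched a string.
abstract class Val
case object Empty extends Val                  // for ONE
case class Chr(c: Char) extends Val            // for a character regex
case class Sequ(v1: Val, v2: Val) extends Val  // for SEQ
case class Left(v: Val) extends Val            // left branch of ALT
case class Right(v: Val) extends Val           // right branch of ALT
case class Stars(vs: List[Val]) extends Val    // iterations of STAR
\end{verbatim}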

\noindent

Building on top of Sulzmann and Lu's attempt to formalise the 
notion of POSIX lexing rules \parencite{Sulzmann2014}, 
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With formally specified rules for what a POSIX match is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But a correct result alone is not enough;
we also want at least some degree of \emph{efficiency}.



One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite set of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$
Even for a regular expression matching one particular string, there can 
be more than one corresponding value.
Take the example where $r= (a^*\cdot a^*)^*$ and the string 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
If we do not allow any empty iterations in its lexical values,
there are $n - 1$ ``splitting points'' in $s$ at which we can choose to 
split or not, so that each sub-string
segmented by the chosen splitting points forms a separate iteration:
\begin{center}
\begin{tabular}{lcr}
$a \mid aaa $ & $\rightarrow$ & $\Stars\, [v_{iteration \,a},\,  v_{iteration \,aaa}]$\\
$aa \mid aa $ & $\rightarrow$ & $\Stars\, [v_{iteration \, aa},\,  v_{iteration \, aa}]$\\
$a \mid aa\mid a $ & $\rightarrow$ & $\Stars\, [v_{iteration \, a},\,  v_{iteration \, aa}, \, v_{iteration \, a}]$\\
 & $\textit{etc}.$ &
 \end{tabular}
\end{center}

For each such iteration, there are in addition multiple ways to split
the sub-string between the two $a^*$s.
It is therefore not surprising that there are exponentially many distinct
lexical values for the regex and string pair $r= (a^*\cdot a^*)^*$  and 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.

A lexer that keeps all possible values will naturally 
have an exponential runtime on such ambiguous regular expressions.
A lexing algorithm therefore has to decide which lexical value to keep and
output.
In practice, we are usually 
interested in POSIX values, which intuitively always
\begin{itemize}
\item
match the leftmost regular expression when multiple options of matching
are available  
\item 
always match a subpart as much as possible before proceeding
to the next token.
\end{itemize}
The formal definition of a $\POSIX$ value $v$ for a regular expression
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, is given
by the following set of rules (based on the formulation by
Ausaf, Dyckhoff and Urban \parencite{AusafDyckhoffUrban2016}):
\newcommand*{\inference}[3][t]{%
   \begingroup
   \def\and{\\}%
   \begin{tabular}[#1]{@{\enspace}c@{\enspace}}
   #2 \\
   \hline
   #3
   \end{tabular}%
   \endgroup
}
\begin{center}
\begin{tabular}{c}
\inference{}{$([], \ONE) \rightarrow \Empty$}\qquad
\inference{}{$([c], c) \rightarrow \Char(c)$}\medskip\\
\inference{$(s, r_1) \rightarrow v$}{$(s, r_1 + r_2) \rightarrow \Left(v)$}\qquad
\inference{$(s, r_2) \rightarrow v$ \and $s \notin L(r_1)$}{$(s, r_1 + r_2) \rightarrow \Right(v)$}\medskip\\
\inference{$s_1 @ s_2 = s$ \and $(\nexists s_3\, s_4\, s_5.\; s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r_1) \rightarrow v_3 \land (s_4, r_2) \rightarrow v_4)$ \and $(s_1, r_1) \rightarrow v_1$ \and $(s_2, r_2) \rightarrow v_2$}{$(s, r_1 \cdot r_2) \rightarrow \Seq(v_1, v_2)$}\medskip\\
\inference{}{$([], r^*) \rightarrow \Stars\,[]$}\qquad
\inference{$s_1 @ s_2 = s$ \and $s_1 \neq []$ \and $(s_1, r) \rightarrow v$ \and $(s_2, r^*) \rightarrow \Stars\,vs$ \and $(\nexists s_3\, s_4\, s_5.\; s_1 @ s_5 = s_3 \land s_5 \neq [] \land s_3 @ s_4 = s \land (s_3, r) \rightarrow v_3 \land (s_4, r^*) \rightarrow \Stars\,vs_4)$}{$(s, r^*) \rightarrow \Stars\,(v::vs)$}
\end{tabular}
\end{center}

The reason why we are interested in $\POSIX$ values is that they can
be practically used in the lexing phase of a compiler front end.
For instance, when lexing a code snippet 
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
as an identifier rather than a keyword.

For example, the above pair $r= (a^*\cdot a^*)^*$  and 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The algorithm we want should output such a POSIX match
encoded as a value.




The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:successive_ders}). In this second phase, a POSIX value 
is generated if the regular expression matches the string.
How can we construct a value out of regular expressions and character
sequences only?
Two functions are involved: $\inj$ and $\mkeps$.
The function $\mkeps$ constructs a value from the last
one of all the successive derivatives:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[d, "mkeps" description] \\
	        & 	              & 	            & v_n       
\end{tikzcd}
\end{equation}
\end{ceqn}

It tells us how the empty string can be matched by a 
regular expression, in a $\POSIX$ way:

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
			& \textit{if} $\nullable(r_{1})$\\ 
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}


\noindent 
We favour the left to match an empty string if there is a choice.
When there is a star for us to match the empty string,
we give the $\Stars$ constructor an empty list, meaning
no iterations are taken.
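
In the Scala sketch, $\mkeps$ becomes (like the clauses above, it is a
partial function, defined only on nullable regular expressions):
\begin{verbatim}
// Build the POSIX value for how r matches the empty string.
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}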


After the $\mkeps$-call, we inject the characters back one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX property is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. 
\begin{ceqn}
\begin{equation}\label{graph:inj}
\begin{tikzcd}
r_1 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"]  \arrow[d]  & r_{i+1}  \arrow[r, dashed] \arrow[d]        & r_n \arrow[d, "mkeps" description] \\
v_1           \arrow[u]                 & v_i  \arrow[l, dashed]                              & v_{i+1} \arrow[l,"inj_{r_i} c_i"]                 & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
The
definition of $\textit{inj}$ is as follows: 

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,(\Empty)$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,(\Left(v))$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,(\Right(v))$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,(\Left(\Seq(v_1,v_2)))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,(\Right(v))$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,(\Seq(v,\Stars\,vs))$         & $\dn$  & $\Stars\,((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values. 
Each clause does one thing: it identifies the ``hole'' in the
value into which the character has to be injected back.
For instance, consider the last clause, which builds a new star value.
The value being injected into must be a sequence value: its first 
component corresponds to the child regex of the star
with the head character chopped off, that is, to the iteration of the star
that has just been unfolded; its second component holds the star iterations
already matched. So we inject the character 
back into the first component and form a new star value with this latest iteration
added to the front of the previous list of iterations, all under the $\Stars$
constructor.
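
In the Scala sketch, injection is a pattern match on the pair of regular
expression and value, mirroring the clauses above:
\begin{verbatim}
// Inject character c back into value v; r is the regular
// expression before the derivative step.
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}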

Putting the functions $\inj$, $\mkeps$ and $\backslash$ together,
we obtain a lexer with the following recursive definition:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $\dn$ & $\mkeps \; r$\\
$\lexer \; r \;(c::s)$ & $\dn$ & $\inj \; r \; c \; (\lexer \; (r\backslash c) \; s)$
\end{tabular}
\end{center}
(Strictly speaking, the result needs to be wrapped in an option type so that
failure can be signalled when the final derivative turns out not to be nullable;
we elide this here for readability.)
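
In the Scala sketch, the option type makes the failure case explicit:
\begin{verbatim}
// The lexer: derivatives forward, injections backward.
// None signals that the string does not match r.
def lex(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lex(der(c, r), cs).map(v => inj(r, c, v))
}

def lexer(r: Rexp, s: String): Option[Val] = lex(r, s.toList)
\end{verbatim}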
 
Pictorially, the algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$ defined above, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$.



We have mentioned before that derivatives without simplification 
can get clumsy, and this is true for values as well: by definition they
reflect the size of the regular expression.

One can introduce simplification on the regexes and values, but has to
be careful not to break the correctness, as the injection 
function heavily relies on the structure of the regexes and values
matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps and applying these rectifications 
during the injection phase.
One can then prove that the POSIX value computed for how a
regular expression matches a string is not affected, although this is
much harder to establish. 
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}. 



%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification, or he may well be 
%stuck on an efficient simplification with proof.
%He went on to examine the use of derivatives together with 
%automaton, and did not try lexing using products.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$ altogether,
and obtain the lexing information that is already there while
doing derivatives, in one pass and without a second injection phase?
And can we at the same time make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the 
regular expressions called \emph{bitcodes}.