% Chapter Template+ −
+ −
\chapter{Regular Expressions and POSIX Lexing} % Main chapter title+ −
+ −
\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts+ −
%and notations we + −
% used for describing the lexing algorithm by Sulzmann and Lu,+ −
%and then give the algorithm and its variant and discuss+ −
%why more aggressive simplifications are needed. + −
+ −
In this chapter, we define the basic notions + −
for regular languages and regular expressions.+ −
This is essentially a description in ``English''
of our formalisation in Isabelle/HOL.
We also give the definition of what $\POSIX$ lexing means, 
followed by an algorithm by Sulzmann and Lu\parencite{Sulzmann2014} 
that produces output conforming
to the $\POSIX$ standard.
It is also worth mentioning that+ −
we choose to use the ML-style notation
for function applications, where
the arguments of a function are not enclosed
in parentheses (e.g.~$f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to keep the text visually concise.
+ −
\section{Basic Concepts}+ −
Usually, formal language theory starts with an alphabet + −
denoting a set of characters.+ −
Here we just use the datatype of characters from Isabelle,+ −
which roughly corresponds to the ASCII characters.+ −
In what follows, we shall leave the information about the alphabet+ −
implicit.+ −
Then using the usual bracket notation for lists,+ −
we can define strings made up of characters: + −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$+ −
\end{tabular}+ −
\end{center}+ −
where $c$ is a variable ranging over characters.
Strings can be concatenated to form longer strings in the same+ −
way as we concatenate two lists, which we shall write as $s_1 @ s_2$.+ −
We omit the precise + −
recursive definition here.+ −
We overload this concatenation operator for two sets of strings:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\+ −
\end{tabular}+ −
\end{center}+ −
We also call the above \emph{language concatenation}.+ −
The power of a language is defined recursively, using the + −
concatenation operator $@$:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$A^0 $ & $\dn$ & $\{ [] \}$\\+ −
$A^{n+1}$ & $\dn$ & $A @ A^n$+ −
\end{tabular}+ −
\end{center}+ −
The union of all powers of a language + −
can be used to define the Kleene star operator:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
However, to obtain a more convenient induction principle + −
in Isabelle/HOL, + −
we instead define the Kleene star+ −
as an inductive set: + −
+ −
\begin{center}+ −
\begin{mathpar}+ −
\inferrule{\mbox{}}{[] \in A*}
+ −
\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}+ −
\end{mathpar}+ −
\end{center}+ −
\noindent+ −
We also define an operation of ``chopping off'' a character from
a language, which we call $\Der$, meaning the \emph{derivative} of a language:
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\+ −
\end{tabular}+ −
\end{center}+ −
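\noindent
For instance, if $c$ and $d$ are two distinct characters, then
\[
\Der \; c \; \{[c],\; [c,d],\; [d,c]\} \;=\; \{[],\; [d]\}
\]
since only the strings starting with $c$ survive, with that leading $c$ chopped off.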
\noindent+ −
This can be generalised to ``chopping off'' a string from all strings within the set $A$, 
namely:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
which is essentially the left quotient of $A$ 
by the singleton language $L = \{s\}$
in formal language theory.
However, for our purposes here, the $\textit{Ders}$ definition with 
a single string is sufficient.
+ −
With the concatenation, Kleene star, and $\textit{Der}$ operations on languages in place,
we can state a few properties describing how the derivative of a composed language
can be expressed in terms of the derivatives of its sub-languages.
\begin{lemma}+ −
\[+ −
\Der \; c \; (A @ B) =+ −
\begin{cases}+ −
((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\+ −
(\Der \; c \; A) \, @ \, B, & \text{otherwise}+ −
\end{cases} + −
\]+ −
\end{lemma}+ −
\noindent+ −
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.
The derivative of the language $A*$ can be described using the derivative
of $A$:
\begin{lemma}+ −
$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\+ −
\end{lemma}+ −
\begin{proof}+ −
There are two inclusions to prove:
\begin{itemize}+ −
\item{$\subseteq$}:\\+ −
The set + −
\[ \{s \mid c :: s \in A*\} \]+ −
is included in the set
\[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
because whenever a string in the language of the Kleene star $A*$ 
starts with the character $c$, 
then that character together with some prefix of the remaining string 
forms the first iteration (which is in $A$), 
and the rest of the string 
is still in $A*$.
\item{$\supseteq$}:\\+ −
Note that
\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
holds, because $A* = \{ [] \} \cup (A @ A*)$.
Moreover
\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
holds, since the derivative of $\{ [] \}$ is empty.
By the previous lemma, the right-hand side is either $(\Der \; c\; A) \, @ \, A*$ or
$((\Der \; c\; A) \, @ \, A*) \cup (\Der \; c \; (A*))$; in both cases it
contains $(\Der \; c \; A) \, @ \, A*$.
\end{itemize}+ −
\end{proof}+ −
+ −
\noindent+ −
Before we define the $\textit{Der}$ and $\textit{Ders}$ counterpart+ −
for regular languages, we need to first give definitions for regular expressions.+ −
+ −
\subsection{Regular Expressions and Their Meaning}+ −
The \emph{basic regular expressions} are defined inductively+ −
by the following grammar:+ −
\[ r ::= \ZERO \mid \ONE+ −
\mid c + −
\mid r_1 \cdot r_2+ −
\mid r_1 + r_2 + −
\mid r^* + −
\]+ −
\noindent+ −
We call them basic because we will introduce+ −
additional constructors in later chapters such as negation+ −
and bounded repetitions.+ −
We use $\ZERO$ for the regular expression that+ −
matches no string, and $\ONE$ for the regular+ −
expression that matches only the empty string\footnote{+ −
Some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$,
but we prefer our notation.}. 
The sequence regular expression is written $r_1\cdot r_2$+ −
and sometimes we omit the dot if it is clear which+ −
regular expression is meant; the alternative+ −
is written $r_1 + r_2$.+ −
The \emph{language} or meaning of + −
a regular expression is defined recursively as+ −
a set of strings:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$L \; \ZERO$ & $\dn$ & $\emptyset$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 \,@\, L \; r_2$\\
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
\end{tabular}+ −
\end{center}+ −
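\noindent
To give a concrete feel for these definitions, the following Scala snippet is a minimal
sketch (our own, purely for illustration; the formalisation in this thesis is carried out
in Isabelle/HOL) of how the six constructors can be represented as a datatype:
\begin{verbatim}
// Basic regular expressions as a Scala datatype (illustrative sketch).
abstract class Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE extends Rexp                     // matches only the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative  r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence     r1 . r2
case class STAR(r: Rexp) extends Rexp            // star         r*
\end{verbatim}
\noindent
The later code sketches in this chapter build on this datatype; the constructor names
are our own choice.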
\noindent+ −
With the semantic derivative of a language and the meaning of regular expressions
in place, we are ready to define derivatives of regular expressions.
\subsection{Brzozowski Derivatives and a Regular Expression Matcher}+ −
%Recall, the language derivative acts on a set of strings+ −
%and essentially chops off a particular character from+ −
%all strings in that set, Brzozowski defined a derivative operation on regular expressions+ −
%so that after derivative $L(r\backslash c)$ + −
%will look as if it was obtained by doing a language derivative on $L(r)$:+ −
Recall that the semantic derivative acts on a set of + −
strings. Brzozowski noticed that this operation+ −
can be ``mirrored" on regular expressions which+ −
he calls the derivative of a regular expression $r$+ −
with respect to a character $c$, written+ −
$r \backslash c$.+ −
He defined this operation such that the following property holds:+ −
\begin{center}+ −
+ −
\[+ −
L(r \backslash c) = \Der \; c \; L(r)+ −
\]+ −
\end{center}+ −
\noindent+ −
For example in the sequence case we have + −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\Der \; c \; (A @ B)$ & $\dn$ & + −
$ \textit{if} \;\, [] \in A \; + −
\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup + −
\Der \; c\; B$\\+ −
& & $\textit{else}\; (\Der \; c \; A) @ B$\\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
This can be translated to + −
regular expressions in the following + −
manner:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,[] \in L(r_1)\;\, \textit{then}\;\, (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
And similarly, the Kleene star's semantic derivative+ −
can be expressed as+ −
\[+ −
\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)+ −
\]+ −
which translates to+ −
\[+ −
(r^*) \backslash c \dn (r \backslash c)\cdot r^*.+ −
\]+ −
In the above definition of $(r_1\cdot r_2) \backslash c$,
the boolean condition 
$[] \in L(r_1)$ of the $\textit{if}$ clause needs to be 
computed recursively on the regular expression.
We call the function that checks
whether the empty string $[]$ is 
in the language of a regular expression $\nullable$:
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ + −
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\+ −
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\+ −
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\+ −
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\+ −
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
The $\ZERO$ regular expression+ −
does not contain any string and+ −
therefore is not \emph{nullable}.+ −
$\ONE$ is \emph{nullable} + −
by definition. + −
The character regular expression $c$
corresponds to the singleton set $\{[c]\}$, 
and therefore does not contain the empty string.
The alternative regular expression is nullable
if at least one of its children is nullable.
The sequence regular expression
requires both children to be nullable,
because the empty string can only be split into two empty strings.
The Kleene star
is always nullable because it contains the empty string by definition. 
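\noindent
The $\nullable$ function can be transcribed almost verbatim into the Scala datatype
sketched earlier (again only an illustration of the definition above):
\begin{verbatim}
// Tests whether r can match the empty string (sketch).
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}
\end{verbatim}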
+ −
The derivative function, written $r\backslash c$, 
describes how a regular expression is transformed into
a new one by acting on the strings it contains: 
every string starting with $c$ has that character chopped off,
and every other string is removed.
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ + −
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\+ −
$d \backslash c$ & $\dn$ & + −
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\+ −
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\+ −
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\+ −
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\+ −
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\+ −
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
The most involved cases are the sequence case+ −
and the star case.+ −
The sequence case says that if the first regular expression
is nullable (that is, contains the empty string), then the second component of the sequence
also needs to be considered, as its derivative will contribute to the
result of this derivative.
The star regular expression $r^*$'s derivative + −
unwraps one iteration of $r$, turns it into $r\backslash c$,+ −
and attaches the original $r^*$+ −
after $r\backslash c$, so that + −
we can further unfold it as many times as needed.+ −
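\noindent
In the Scala sketch, the derivative operation is again a one-to-one transcription of the
clauses above (illustrative only):
\begin{verbatim}
// Brzozowski derivative of r with respect to the character c (sketch).
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}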
We have the following correspondence between + −
derivatives on regular expressions and+ −
derivatives on a set of strings:+ −
\begin{lemma}\label{derDer}+ −
$\textit{Der} \; c \; L(r) = L (r\backslash c)$+ −
\end{lemma}+ −
+ −
\noindent+ −
The main property of the derivative operation+ −
(that enables us to reason about the correctness of+ −
derivative-based matching)+ −
is + −
+ −
\begin{lemma}\label{derStepwise}+ −
$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.+ −
\end{lemma}+ −
+ −
\noindent+ −
We can generalise the derivative operation shown above for single characters+ −
to strings as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
When there is no ambiguity, we will 
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
Brzozowski's regular-expression matcher algorithm can then be described as:+ −
+ −
\begin{definition}+ −
$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$+ −
\end{definition}+ −
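\noindent
In the Scala sketch, the string derivative and this matcher can be written as follows
(the name \texttt{matcher} is used because \texttt{match} is a reserved word in Scala):
\begin{verbatim}
// Derivative with respect to a whole string (sketch).
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

// Brzozowski matcher: derive by the string, then test nullability.
def matcher(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))

// For example, matcher("aab", SEQ(STAR(CHAR('a')), CHAR('b'))) yields true.
\end{verbatim}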
+ −
\noindent+ −
Assuming the string is given as a sequence of characters, say $c_0c_1\ldots c_{n-1}$, 
this algorithm can be presented graphically as follows:
+ −
\begin{equation}\label{graph:successive_ders}+ −
\begin{tikzcd}+ −
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}+ −
\end{tikzcd}+ −
\end{equation}+ −
+ −
\noindent+ −
It can be+ −
relatively easily shown that this matcher is correct:+ −
\begin{lemma}+ −
$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$+ −
\end{lemma}+ −
\begin{proof}+ −
By the stepwise property of derivatives (lemma \ref{derStepwise})+ −
and lemma \ref{derDer}. + −
\end{proof}+ −
\noindent+ −
\begin{center}+ −
\begin{figure}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
ylabel={time in secs},+ −
ymode = log,+ −
legend entries={Naive Matcher}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};+ −
\end{axis}+ −
\end{tikzpicture} + −
\caption{Matching $(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{NaiveMatcher}+ −
\end{figure}+ −
\end{center} + −
\noindent+ −
If we implement the above algorithm naively, however,
it can be excruciatingly slow, as shown in 
Figure~\ref{NaiveMatcher}.
Note that the vertical axis is in logarithmic scale.
Around two dozen characters
are already enough to blow up the matcher on the regular expression 
$(a^*)^*b$.
To avoid this, we need to introduce certain 
rewrite rules for the intermediate results,
such as $r + r \rightarrow r$,
and make sure those rules do not change the 
language of the regular expression.
One simple-minded simplification function
that satisfies these requirements is given below:
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\simp \; (r_1 \cdot r_2) $ & $ \dn$ & 
$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
$\simp \; (r_1 + r_2)$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
$\simp \; r$ & $\dn$ & $r \quad \textit{(in all other cases)}$
+ −
\end{tabular}+ −
\end{center}+ −
If we apply this simplification 
function after each derivative step, 
we obtain a matcher with simplification:
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\derssimp \; [] \; r$ & $\dn$ & $r$\\+ −
$\derssimp \; (c :: cs) \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$+ −
\end{tabular}+ −
\end{center}+ −
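\noindent
A Scala transcription of $\simp$ and of the simplifying matcher might look as follows
(a sketch of the definitions above, not the Isabelle formalisation):
\begin{verbatim}
// Simple-minded simplification: removes ZERO and ONE where possible (sketch).
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, r2s) => r2s
    case (r1s, ONE) => r1s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s)  => ALT(r1s, r2s)
  }
  case other => other
}

// Interleave simplification with the derivative steps.
def ders_simp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders_simp(cs, simp(der(c, r)))
}

def matcher_simp(s: String, r: Rexp): Boolean =
  nullable(ders_simp(s.toList, r))
\end{verbatim}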
\begin{figure}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
ylabel={time in secs},+ −
ymode = log,+ −
xmode = log,+ −
grid = both,+ −
legend entries={Matcher With Simp}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};+ −
\end{axis}+ −
\end{tikzpicture} + −
\caption{$(a^*)^*b$ + −
against + −
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}+ −
\end{figure}+ −
\noindent+ −
The running time of $\derssimp$
on the same example as in Figure~\ref{NaiveMatcher}
is now very tame with respect to the length of the input,
as shown in Figure~\ref{BetterMatcher}.
+ −
Building successive derivatives and then testing whether
the empty string is in the language of the resulting regular expression,
applying simplifications along the way, therefore gives us a reasonable matcher.
So far, so good. But what if we want to 
do lexing instead of just getting a YES/NO answer?
Sulzmann and Lu \cite{Sulzmann2014} came up with a nice and 
elegant solution for this (arguably as beautiful as the original definition of derivatives).
+ −
\section{Values and the Lexing Algorithm by Sulzmann and Lu}+ −
In this section, we present a two-phase regular expression lexing + −
algorithm.+ −
The first phase takes successive derivatives with + −
respect to the input string,+ −
and the second phase does the reverse, \emph{injecting} back+ −
characters, in the meantime constructing a lexing result.+ −
We will introduce the injection phase in detail slightly+ −
later, but as a preliminary we have to first define + −
the datatype for lexing results, + −
called \emph{value} or+ −
sometimes also \emph{lexical value}. Values and regular+ −
expressions correspond to each other as illustrated in the following+ −
table:+ −
+ −
\begin{center}+ −
\begin{tabular}{c@{\hspace{20mm}}c}+ −
\begin{tabular}{@{}rrl@{}}+ −
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\+ −
$r$ & $::=$ & $\ZERO$\\+ −
& $\mid$ & $\ONE$ \\+ −
& $\mid$ & $c$ \\+ −
& $\mid$ & $r_1 \cdot r_2$\\+ −
& $\mid$ & $r_1 + r_2$ \\+ −
\\+ −
& $\mid$ & $r^*$ \\+ −
\end{tabular}+ −
&+ −
\begin{tabular}{@{\hspace{0mm}}rrl@{}}+ −
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\+ −
$v$ & $::=$ & \\+ −
& & $\Empty$ \\+ −
& $\mid$ & $\Char(c)$ \\+ −
& $\mid$ & $\Seq\,v_1\, v_2$\\+ −
& $\mid$ & $\Left(v)$ \\+ −
& $\mid$ & $\Right(v)$ \\+ −
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\+ −
\end{tabular}+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
A value has an underlying string, which + −
can be calculated by the ``flatten" function $|\_|$:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$|\Empty|$ & $\dn$ & $[]$\\+ −
$|\Char \; c|$ & $ \dn$ & $ [c]$\\+ −
$|\Seq(v_1, v_2)|$ & $ \dn$ & $ |v_1| @ |v_2|$\\
$|\Left(v)|$ & $ \dn$ & $ |v|$\\+ −
$|\Right(v)|$ & $ \dn$ & $ |v|$\\+ −
$|\Stars([])|$ & $\dn$ & $[]$\\+ −
$|\Stars(v::vs)|$ & $\dn$ & $ |v| @ |\Stars(vs)|$+ −
\end{tabular}+ −
\end{center}+ −
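\noindent
For illustration, the values and the flatten function can be sketched in Scala as follows
(the constructors \texttt{Chr} and \texttt{Sequ} are named so as not to clash with Scala's
built-in \texttt{Char} and \texttt{Seq}; our \texttt{Left} and \texttt{Right} shadow the
library names of the same spelling):
\begin{verbatim}
// Values recording how a regular expression matched a string (sketch).
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// The underlying string of a value.
def flatten(v: Val): List[Char] = v match {
  case Empty        => Nil
  case Chr(c)       => List(c)
  case Sequ(v1, v2) => flatten(v1) ::: flatten(v2)
  case Left(v1)     => flatten(v1)
  case Right(v1)    => flatten(v1)
  case Stars(vs)    => vs.flatMap(flatten)
}
\end{verbatim}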
Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
to indicate that a value $v$ could be generated by a lexing algorithm
with input $r$. They call it the value inhabitation relation. 
\begin{mathpar}+ −
\inferrule{\mbox{}}{\vdash \Char(c) : \mathbf{c}} \hspace{2em}+ −
+ −
\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}+ −
+ −
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq(v_1, v_2) : (r_1 \cdot r_2)}+ −
+ −
\inferrule{\vdash v_1 : r_1}{\vdash \Left(v_1):r_1+r_2}+ −
+ −
\inferrule{\vdash v_2 : r_2}{\vdash \Right(v_2):r_1 + r_2}+ −
+ −
\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars(vs):r^*}+ −
\end{mathpar}+ −
\noindent+ −
The condition $|v| \neq []$ in the premise of star's rule+ −
is to make sure that for a given pair of regular + −
expression $r$ and string $s$, the number of values + −
satisfying $|v| = s$ and $\vdash v:r$ is finite.+ −
Given the same string and regular expression, there can be
multiple values for them. For example, both
$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and+ −
$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold+ −
and the values both flatten to $abc$.+ −
Lexers therefore have to disambiguate and choose only+ −
one of the values to output. $\POSIX$ is one of the+ −
disambiguation strategies that is widely adopted.+ −
+ −
Ausaf and Urban\parencite{AusafDyckhoffUrban2016} 
formalised this POSIX disambiguation strategy
as a ternary relation.
The $\POSIX$ value $v$ for a regular expression+ −
$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified + −
in the following set of rules\footnote{The names of the rules are used+ −
as they were originally given in \cite{AusafDyckhoffUrban2016}}:+ −
\noindent+ −
\begin{figure}+ −
\begin{mathpar}+ −
\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}+ −
+ −
\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}+ −
+ −
\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}+ −
+ −
\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}+ −
+ −
\inferrule[PS]{(s_1, r_1) \rightarrow v_1 \\ (s_2, r_2)\rightarrow v_2\\
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land + −
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow+ −
\Seq \; v_1 \; v_2}+ −
+ −
\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}+ −
+ −
\inferrule[P*]{(s_1, r) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\
|v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land+ −
s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \;+ −
(v::vs)}+ −
\end{mathpar}+ −
\caption{POSIX Lexing Rules}+ −
\end{figure}+ −
\noindent+ −
The above $\POSIX$ rules follow the intuition described below: 
\begin{itemize}+ −
\item (Left Priority)\\+ −
Match the leftmost regular expression when multiple options of matching+ −
are available.+ −
\item (Maximum munch)\\+ −
Always match a subpart as much as possible before proceeding+ −
to the next token.+ −
\end{itemize}+ −
\noindent+ −
These disambiguation strategies can be + −
quite practical.+ −
For instance, when lexing a code snippet + −
\[ + −
\textit{iffoo} = 3+ −
\]+ −
using the regular expression (with 
\textit{keyword} and \textit{identifier} having their 
usual definitions from formal
language textbooks, for instance
\textit{keyword} standing for a fixed set of reserved words such as \textit{if}, 
and \textit{identifier} for a non-empty string starting with a letter, 
followed by letters, digits or underscores):
\[+ −
\textit{keyword} + \textit{identifier},+ −
\]+ −
we want $\textit{iffoo}$ to be recognized+ −
as an identifier rather than a keyword (if)+ −
followed by+ −
an identifier (foo).+ −
POSIX lexing achieves this.+ −
+ −
We know that a $\POSIX$ value is also an inhabitant of the
corresponding regular expression:
\begin{lemma}
$(s, r) \rightarrow v \implies \vdash v: r$
\end{lemma}+ −
\noindent+ −
A good property of $\POSIX$ values is that, 
given the same regular expression $r$ and string $s$,
the $\POSIX$ value for them is uniquely determined:
\begin{lemma}+ −
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$+ −
\end{lemma}+ −
\begin{proof}+ −
By induction on $s$, $r$ and $v_1$. The inductive cases+ −
are all the POSIX rules. + −
Probably the most cumbersome cases are + −
the sequence and star with non-empty iterations.+ −
We shall give the details for proving the sequence case here.+ −
+ −
When we have + −
\[+ −
(s_1, r_1) \rightarrow v_1 \;\, and \;\, + −
(s_2, r_2) \rightarrow v_2 \;\, and \;\,\\ + −
\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land + −
s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2+ −
\]+ −
we know that the last condition + −
excludes the possibility of a + −
string $s_1'$ longer than $s_1$ such that + −
\[+ −
(s_1', r_1) \rightarrow v_1' \;\; + −
and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s + −
\]+ −
hold.+ −
A shorter string $s_1''$ with $s_2''$ satisfying+ −
\[+ −
(s_1'', r_1) \rightarrow v_1''+ −
\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s + −
\]+ −
cannot possibly form a $\POSIX$ value either, because+ −
by definition there is a candidate+ −
with longer initial string+ −
$s_1$. Therefore, we know that the POSIX+ −
value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching+ −
$s$ must have the + −
property that + −
\[+ −
|a| = s_1 \;\; and \;\; |b| = s_2.+ −
\]+ −
The goal is to prove that $a = v_1 $ and $b = v_2$.+ −
If we have some other POSIX values $v_{10}$ and $v_{20}$ such that + −
$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,+ −
then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, + −
which means this ``other'' $\POSIX$ value $\Seq(v_{10}, v_{20})$
is the same as $\Seq(v_1, v_2)$. + −
\end{proof}+ −
\noindent+ −
Now we know what a $\POSIX$ value is and why it is unique;
the remaining problem is how to generate 
such a value in a lexing algorithm using derivatives.
+ −
\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}+ −
+ −
Sulzmann and Lu extended Brzozowski's + −
derivative-based matching+ −
to a lexing algorithm by a second pass+ −
after the initial phase of successive derivatives.+ −
This second phase generates a POSIX value + −
if the regular expression matches the string.+ −
Two functions are involved: $\inj$ and $\mkeps$.+ −
The first one used is $\mkeps$, which constructs a POSIX value from the last+ −
derivative $r_n$:+ −
\begin{ceqn}+ −
\begin{equation}\label{graph:mkeps}+ −
\begin{tikzcd}+ −
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\+ −
& & & v_n + −
\end{tikzcd}+ −
\end{equation}+ −
\end{ceqn}+ −
\noindent+ −
In the above diagram, we again assume that
the input string $s$ is made of $n$ characters
$c_0c_1 \ldots c_{n-1}$. The input regular expression $r$
is labelled $r_0$, and after each character $c_i$ is taken off
by the derivative operation the resulting derivative regular 
expression is $r_{i+1}$. The last derivative operation 
$\backslash c_{n-1}$ gives back $r_n$, which is transformed into 
a value $v_n$ by $\mkeps$.
The value $v_n$ tells us how the empty string is matched by the (nullable)
regular expression $r_n$, in a $\POSIX$ way.
The definition of $\mkeps$ is+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\+ −
$\mkeps \; (r_{1}+r_{2})$ & $\dn$ + −
& $\textit{if}\; (\nullable \; r_{1}) \;\,+ −
\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ + −
& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\+ −
$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\+ −
$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$+ −
\end{tabular}+ −
\end{center}+ −
+ −
+ −
\noindent + −
We favour the left child $r_1$ of $r_1 + r_2$ + −
to match an empty string if there is a choice.+ −
When there is a star for us to match the empty string,+ −
we give the $\Stars$ constructor an empty list, meaning+ −
no iteration is taken.+ −
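In the Scala sketch, $\mkeps$ reads as follows (it is a partial function,
only ever called on nullable regular expressions):
\begin{verbatim}
// Constructs the POSIX value for the empty string (sketch).
// Only defined on nullable regular expressions.
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1))
                      else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}
\noindent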
The result of a call to $\mkeps$ on a $\nullable$ regular expression $r$
is a $\POSIX$ value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
$\nullable\; r \implies ([], r) \rightarrow (\mkeps\; r)$
\end{lemma}+ −
\begin{proof}+ −
By induction on the shape of $r$.+ −
\end{proof}+ −
\noindent+ −
After the $\mkeps$-call, we inject back the characters one by one,
in the reverse order of how they were chopped off in the derivative phase.
The function for this is called $\inj$. Note that $\inj$ and $\backslash$
are not exactly inverse operations of one another, as $\inj$ 
operates on values instead of regular
expressions.
In the diagram below, $v_i$ stands for the (POSIX) value 
describing how the regular expression 
$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
of $s$ (i.e.\ $s_i = c_i \ldots c_{n-1}$); it is constructed from the previous 
lexical value $v_{i+1}$ by injecting the character $c_i$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. 
\begin{ceqn}+ −
\begin{equation}\label{graph:inj}+ −
\begin{tikzcd}+ −
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\+ −
v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] + −
\end{tikzcd}+ −
\end{equation}+ −
\end{ceqn}+ −
\noindent+ −
The function $\textit{inj}$ takes three arguments: the regular
expression ${r_{i}}$ before the character was chopped off, 
the character ${c_{i}}$ we want to inject back, and 
the value $v_{i+1}$ into which we want to inject the character. 
The result of an application + −
$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that+ −
\[+ −
(s_i, r_i) \rightarrow v_i+ −
\]+ −
holds.+ −
The definition of $\textit{inj}$ is as follows: + −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}+ −
$\textit{inj}\;(c)\;c\;\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\+ −
$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\+ −
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ & + −
$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\+ −
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & + −
$\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\+ −
$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\+ −
$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\+ −
\end{tabular}+ −
\end{center}+ −
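\noindent
A Scala sketch of $\inj$, mirroring the clauses above (like $\mkeps$ it is partial,
being only applied to values that can actually arise from a derivative):
\begin{verbatim}
// Injects the character c back into the value v built for r\c (sketch).
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty)                 => Chr(d)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}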
+ −
\noindent + −
The function recurses on 
the shapes of the regular
expression $r_i$ and the value $v_{i+1}$. 
Intuitively, each clause analyses 
how $r_i$ was transformed when taking 
the derivative by $c$, and identifies which subpart
of $v_{i+1}$ has the ``hole'' 
to inject the character back into.
Once the character is
injected back into that sub-value, 
$\inj$ assembles everything
into a new value.
+ −
For instance, the last clause is an+ −
injection into a sequence value $v_{i+1}$+ −
whose second child+ −
value is a star, and the shape of the + −
regular expression $r_i$ before injection + −
is a star.+ −
We therefore know that 
the derivative step 
started with a star and resulted in a sequence:
\[+ −
(r^*) \backslash c \longrightarrow r\backslash c \cdot r^*+ −
\]+ −
during which an iteration of the star+ −
had just been unfolded, giving the below+ −
value inhabitation relation:+ −
\[+ −
\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.+ −
\]+ −
The value list $vs$ corresponds to+ −
matched star iterations,+ −
and the ``hole'' lies in $v$ because+ −
\[+ −
\vdash v: r\backslash c.+ −
\]+ −
Finally, + −
$\inj \; r \;c \; v$ is prepended+ −
to the previous list of iterations, and then+ −
wrapped under the $\Stars$+ −
constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.+ −
+ −
Recall that lemma 
\ref{mePosix} tells us that
$\mkeps$ always selects the POSIX value among
the values that flatten to the empty string.
The function $\inj$ preserves POSIXness, provided
the value before injection is POSIX:
\begin{lemma}\label{injPosix}+ −
If+ −
\[+ −
(s, r \backslash c) \rightarrow v 
\]+ −
then+ −
\[+ −
(c :: s, r) \rightarrow (\inj \; r \; c\; v).
\]+ −
\end{lemma}+ −
\begin{proof}+ −
By induction on $r$.+ −
The involved cases are sequence and star.+ −
When $r = a \cdot b$, there are
three possible cases for the value $v$ satisfying $\vdash v:(a \cdot b)\backslash c$.
We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each+ −
case.+ −
\begin{itemize}+ −
\item+ −
$v = \Seq \; v_a \; v_b$.\\+ −
The ``not nullable'' clause of the $\inj$ function is taken:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\+ −
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$+ −
\end{tabular}+ −
\end{center}+ −
We know that there exists a unique pair of
$s_a$ and $s_b$ satisfying 
$(s_a, a \backslash c) \rightarrow v_a$,
$(s_b, b) \rightarrow v_b$, and
$\nexists s_3 \; s_4.\; s_3 \neq [] \land s_3 @ s_4 = s_b \land s_a @ s_3 \in 
L \; (a\backslash c) \land
s_4 \in L \; b$.
The last condition gives us
$\nexists s_3 \; s_4.\; s_3 \neq [] \land s_3 @ s_4 = s_b \land (c :: s_a )@ s_3 \in 
L \; a \land
s_4 \in L \; b$.
By induction hypothesis, $(c::s_a, a) \rightarrow \inj \; a \; c \; v_a $ holds,
and this gives us
\[
((c::s_a)@s_b, a\cdot b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
\]+ −
\item+ −
$v = \Left \; (\Seq \; v_a \; v_b)$\\+ −
The argument is almost identical to the above case, + −
except that a different clause of $\inj$ is taken:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\+ −
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$+ −
\end{tabular}+ −
\end{center}+ −
With a similar reasoning, + −
+ −
\[+ −
((c::s_a)@s_b, a\cdot b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
\]+ −
again holds.+ −
\item + −
$v = \Right \; v_b$\\+ −
Again the injection result would be + −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\+ −
& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$+ −
\end{tabular}+ −
\end{center}+ −
We know that $a$ must be nullable,+ −
allowing us to call $\mkeps$ and get+ −
\[+ −
([], a) \rightarrow \mkeps \; a.
\]+ −
Also by inductive hypothesis+ −
\[+ −
(c::s, b) \rightarrow \inj\; b \; c \; v_b
\]+ −
holds.+ −
In addition, as+ −
$\Right \;v_b$ instead of $\Left \ldots$ is + −
the POSIX value for $v$, it must be the case+ −
that $s \notin L \;( (a\backslash c)\cdot b)$.+ −
This tells us that + −
\[+ −
\nexists s_3 \; s_4.+ −
s_3 @s_4 = s \land s_3 \in L \; (a\backslash c) + −
\land s_4 \in L \; b+ −
\]+ −
which translates to+ −
\[+ −
\nexists s_3 \; s_4. \; s_3 \neq [] \land+ −
s_3 @s_4 = c::s \land s_3 \in L \; a + −
\land s_4 \in L \; b.+ −
\]+ −
(This basically says that the only way to split $c::s$ between $a$ and $b$ 
is with the empty string as the initial part.)
Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$+ −
as the POSIX value for $a\cdot b$.+ −
\end{itemize}+ −
The star case can be proven similarly.+ −
\end{proof}+ −
\noindent+ −
Putting all the functions $\inj$, $\mkeps$, $\backslash$ together+ −
by following the procedure outlined in the diagram \ref{graph:inj},+ −
and taking into consideration the possibility of a non-match,+ −
a lexer can be built with the following recursive definition:+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\+ −
$\lexer \; r \;(c::s)$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
& & $\quad \phantom{\mid}\; \None \implies \None$\\+ −
& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$+ −
\end{tabular}+ −
\end{center}+ −
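\noindent
Putting the pieces together in the Scala sketch gives the following lexer
(again only an illustration of the definition above):
\begin{verbatim}
// Derivative-and-injection lexer: Some(value) if r matches s, None otherwise.
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil     => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs) match {
    case None    => None
    case Some(v) => Some(inj(r, c, v))
  }
}
\end{verbatim}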
\noindent+ −
The central property of the $\lexer$ is that it gives the correct result by+ −
$\POSIX$ standards:+ −
\begin{theorem}+ −
The $\lexer$ based on derivatives and injections is correct: + −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (s, \; r) \rightarrow v$\\
$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. \; (s, r) \rightarrow v)$
\end{tabular}+ −
\end{center}+ −
\end{theorem} + −
\begin{proof}+ −
By induction on $s$. $r$ is allowed to be an arbitrary regular expression.+ −
The $[]$ case is proven by lemma \ref{mePosix}, and the inductive case+ −
by lemma \ref{injPosix}.+ −
\end{proof}+ −
\noindent+ −
As we did earlier in this chapter for the matcher, one can 
introduce simplification on the regular expressions.
However, the lexer now has a backward phase, and one needs to make sure
that the values still align with the (simplified) regular expressions.
Therefore one has to
be careful not to break the correctness, as the injection 
function relies heavily on the structure of the regular expressions and values
matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications 
during the injection phase.
With extra care
one can show that POSIXness will not be affected---although it is much harder
to establish. 
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}. 
+ −
However, even with all the simplification rules allowed
in an injection-based lexer, one can still end up in
trouble when cases arise that require more involved and aggressive
simplifications.
\section{A Case Requiring More Aggressive Simplifications}
For example, when starting with the regular+ −
expression $(a^* \cdot a^*)^*$ and building a few successive derivatives (around 10)+ −
w.r.t.~the character $a$, one obtains a derivative regular expression+ −
with more than 9000 nodes (when viewed as a tree)+ −
even with simplification.+ −
\begin{figure}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
ylabel={size},+ −
legend entries={Size of derivative}, 
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};+ −
\end{axis}+ −
\end{tikzpicture} + −
\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}
+ −
That is because Sulzmann and Lu's 
injection-based lexing algorithm keeps track of a lot of 
``useless'' ways of matching, which will never be used in the final value. 
The number of these different ways of matching can grow exponentially with the string length.
Take + −
\[+ −
r= (a^*\cdot a^*)^* \quad \textit{and} \quad
s=\underbrace{aa\ldots a}_\text{n \textit{a}s}+ −
\]+ −
as an example.+ −
This is a highly ambiguous regular expression, with
many ways to split up the string into segments for the
different star iterations,
and each segment in turn has multiple ways of being split between 
the two $a^*$ sub-expressions.
It is not surprising there are exponentially many + −
distinct lexical values+ −
for such a pair of regular expression and string.+ −
A lexer without a good enough strategy to + −
deduplicate will naturally+ −
have an exponential runtime on ambiguous regular expressions.+ −
+ −
Somehow one has to identify which
lexical values are $\POSIX$ and keep only those during lexing.
For example, the above pair $r= (a^*\cdot a^*)^*$ and 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
We want to keep this value only, and remove all the regular expression subparts+ −
not corresponding to this value during lexing.+ −
To do this, a two-phase algorithm with rectification is a bit too fragile.+ −
Can we avoid creating those intermediate values $v_1,\ldots, v_n$,
and obtain the lexing information, which is in principle already there while
doing derivatives, in one pass, without a second injection phase?
Can we also ensure that simplifications
are easily handled without breaking the correctness of the algorithm?
+ −
Sulzmann and Lu solved this problem by
annotating regular expressions with additional information 
called \emph{bitcodes}.
+ −
+ −
+ −
+ −