% Chapter Template\chapter{Regular Expressions and POSIX Lexing} % Main chapter title\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts%and notations we % used for describing the lexing algorithm by Sulzmann and Lu,%and then give the algorithm and its variant and discuss%why more aggressive simplifications are needed. In this chapter, we define the basic notions for regular languages and regular expressions.This is essentially a description in ``English''of our formalisation in Isabelle/HOL.We also give the definition of what $\POSIX$ lexing means, followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} that produces the output conformingto the $\POSIX$ standard\footnote{In what follows we choose to use the Isabelle-style notationfor function applications, wherethe parameters of a function are not enclosedinside a pair of parentheses (e.g. $f \;x \;y$instead of $f(x,\;y)$). This is mainlyto make the text visually more concise.}.\section{Basic Concepts}Formal language theory usually starts with an alphabet denoting a set of characters.Here we just use the datatype of characters from Isabelle,which roughly corresponds to the ASCII characters.In what follows, we shall leave the information about the alphabetimplicit.Then using the usual bracket notation for lists,we can define strings made up of characters as: \begin{center}\begin{tabular}{lcl}$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$\end{tabular}\end{center}where $c$ is a variable ranging over characters.The $::$ stands for list cons and $[]$ for the emptylist.A singleton list is sometimes written as $[c]$ for brevity.Strings can be concatenated to form longer strings in the sameway as we concatenate two lists, which we shall write as $s_1 @ s_2$.We omit the precise recursive definition here.We overload this concatenation operator for two sets of strings:\begin{center}\begin{tabular}{lcl}$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\\end{tabular}\end{center}We also call the above \emph{language concatenation}.The power of a language is defined recursively, using the concatenation operator $@$:\begin{center}\begin{tabular}{lcl}$A^0 $ & $\dn$ & $\{ [] \}$\\$A^{n+1}$ & $\dn$ & $A @ A^n$\end{tabular}\end{center}The union of all powers of a language can be used to define the Kleene star operator:\begin{center}\begin{tabular}{lcl} $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\\end{tabular}\end{center}\noindentHowever, to obtain a more convenient induction principle in Isabelle/HOL, we instead define the Kleene staras an inductive set: \begin{center}\begin{mathpar} \inferrule{\mbox{}}{[] \in A*\\}\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}\end{mathpar}\end{center}\noindentWe also define an operation of "chopping off" a character froma language, which we call $\Der$, meaning \emph{Derivative} (for a language):\begin{center}\begin{tabular}{lcl}$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\\end{tabular}\end{center}\noindentThis can be generalised to ``chopping off'' a string from all strings within a set $A$, namely:\begin{center}\begin{tabular}{lcl}$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\\end{tabular}\end{center}\noindentwhich is essentially the left quotient $A \backslash L$ of $A$ against the singleton language with $L = \{s\}$.However, for our purposes here, the $\textit{Ders}$ definition with a single string is sufficient.The reason for defining derivativesis that they provide another approachto test membership of a string in a set of strings. For example, to test whether the string$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set withrespect to the string $bar$:\begin{center}\begin{tabular}{lll} $S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ & $\{ar, rak\}$ \\ & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\ & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\ %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\\end{tabular} \end{center}\noindentand in the end test whether the sethas the empty string.\footnote{We use the infix notation $A\backslash c$ instead of $\Der \; c \; A$ for brevity, as it is clear we are operatingon languages rather than regular expressions.}In general, if we have a language $S$,then we can test whether $s$ is in $S$by testing whether $[] \in S \backslash s$.With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,we have a few properties of how the language derivative can be defined using sub-languages.For example, for the sequence operator, we havesomething similar to a ``chain rule'':\begin{lemma}\[ \Der \; c \; (A @ B) = \begin{cases} ((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\ (\Der \; c \; A) \, @ \, B, & \text{otherwise} \end{cases} \]\end{lemma}\noindentThis lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through itand get to $B$.The language derivative for $A*$ can be described using the language derivativeof $A$:\begin{lemma}$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\\end{lemma}\begin{proof}There are two inclusions to prove:\begin{itemize}\item{$\subseteq$}:\\The set \[ \{s \mid c :: s \in A*\} \]is enclosed in the set\[ \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]because whenever you have a string starting with a character in the language of a Kleene star $A*$, then that character together with some sub-stringimmediately after it will form the first iteration, and the rest of the string will be still in $A*$.\item{$\supseteq$}:\\Note that\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]holds.Also the following holds:\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]where the $\textit{RHS}$ can be rewrittenas \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]which of course contains $\Der \; c \; A @ A*$.\end{itemize}\end{proof}\noindentThe clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$for regular expressions.To introduce them, we need to first give definitions for regular expressions,which we shall do next.\subsection{Regular Expressions and Their Meaning}The \emph{basic regular expressions} are defined inductively by the following grammar:\[ r ::= \ZERO \mid \ONE \mid c \mid r_1 \cdot r_2 \mid r_1 + r_2 \mid r^* \]\noindentWe call them basic because we will introduceadditional constructors in later chapters such as negationand bounded repetitions.We use $\ZERO$ for the regular expression thatmatches no string, and $\ONE$ for the regularexpression that matches only the empty string.\footnote{Some authorsalso use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$but we prefer our notation.} The sequence regular expression is written $r_1\cdot r_2$and sometimes we omit the dot if it is clear whichregular expression is meant; the alternativeis written $r_1 + r_2$.The \emph{language} or meaning of a regular expression is defined recursively asa set of strings:%TODO: FILL in the other defs\begin{center}\begin{tabular}{lcl}$L \; \ZERO$ & $\dn$ & $\phi$\\$L \; \ONE$ & $\dn$ & $\{[]\}$\\$L \; c$ & $\dn$ & $\{[c]\}$\\$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\$L \; (r^*)$ & $\dn$ & $ (L\;r)*$\end{tabular}\end{center}\noindent%Now with language derivatives of a language and regular expressions and%their language interpretations in place, we are ready to define derivatives on regular expressions.With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.We do so by first introducing what properties it should satisfy.\subsection{Brzozowski Derivatives and a Regular Expression Matcher}%Recall, the language derivative acts on a set of strings%and essentially chops off a particular character from%all strings in that set, Brzozowski defined a derivative operation on regular expressions%so that after derivative $L(r\backslash c)$ %will look as if it was obtained by doing a language derivative on $L(r)$:%Recall that the language derivative acts on a %language (set of strings).%One can decide whether a string $s$ belongs%to a language $S$ by taking derivative with respect to%that string and then checking whether the empty %string is in the derivative:%\begin{center}%\parskip \baselineskip%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}%\def\rlwd{.5pt}%\newcommand\notate[3]{%% \unskip\def\useanchorwidth{T}%% \setbox0=\hbox{#1}%% \def\stackalignment{c}\stackunder[-6pt]{%% \def\stackalignment{c}\stackunder[-1.5pt]{%% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%% \rule{\rlwd}{#2\baselineskip}}}{%% \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%%}%\Longstack{%\notate{$\{ \ldots ,\;$ % \notate{s}{1}{$(c_1 :: s_1)$} % $, \; \ldots \}$ %}{1}{$S_{start}$} %}%\Longstack{% $\stackrel{\backslash c_1}{\longrightarrow}$%}%\Longstack{% $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$} % $,\; \ldots \}$%}%\Longstack{% $\stackrel{\backslash c_2}{\longrightarrow}$ %}%\Longstack{% $\{ \ldots,\; s_2% ,\; \ldots \}$%}%\Longstack{% $ \xdashrightarrow{\backslash c_3\ldots\ldots} $ %}%\Longstack{% \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} = % S_{start}\backslash s$}%}%\end{center}%\begin{center}% $s \in S_{start} \iff [] \in S_{end}$%\end{center}%\noindentBrzozowski noticed that $\Der$can be ``mirrored'' on regular expressions whichhe calls the derivative of a regular expression $r$with respect to a character $c$, written$r \backslash c$. This infix operatortakes an original regular expression $r$ as inputand a character as a right operand andoutputs a result, which is a new regular expression.The derivative operation on regular expressionis defined such that the language of the derivative result coincides with the language of the original regular expression being taken derivative with respect to the same character:\begin{property}\[ L \; (r \backslash c) = \Der \; c \; (L \; r)\]\end{property}\noindentPictorially, this looks as follows:\\\vspace{3mm}\parskip \baselineskip\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}\def\rlwd{.5pt}\newcommand\notate[3]{% \unskip\def\useanchorwidth{T}% \setbox0=\hbox{#1}% \def\stackalignment{c}\stackunder[-6pt]{% \def\stackalignment{c}\stackunder[-1.5pt]{% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{% \rule{\rlwd}{#2\baselineskip}}}{% \strut\kern8pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%}\Longstack{ \notate{$r$}{1}{$L \; r = \{\ldots, \;c::s_1, \;\ldots\}$}}\Longstack{ $\stackrel{\backslash c}{\xrightarrow{\hspace*{8cm}}}$}\Longstack{ \notate{$r\backslash c$}{1}{$L \; (r\backslash c)= \{\ldots,\;s_1,\;\ldots\}$}}\\ \vspace{ 3mm }The derivatives on regular expression can again be generalised to a string.One could compute $r_{start} \backslash s$ and test membership of $s$in $L \; r_{start}$ by checkingwhether the empty string is in the language of $r_{end}$ ($r_{start}\backslash s$).\\\vspace{2mm}\Longstack{ \notate{$r_{start}$}{4}{ \Longstack{$L \; r_{start} = \{\ldots, \;$ \notate{$s$}{1}{$c_1::s_1$} $, \ldots\} $ } } }\Longstack{ $\stackrel{\backslash c_1}{ \xrightarrow{\hspace*{1.8cm}} }$}\Longstack{ \notate{$r_1$}{3}{ $r_1 = r_{start}\backslash c_1$, $L \; r_1 = $ \Longstack{ $\{ \ldots,\;$ \notate{$s_1$}{1}{$c_2::s_2$} $,\; \ldots \}$ }}}\Longstack{ $\stackrel{\backslash c_2}{\xrightarrow{\hspace*{1.8cm}}}$ }\Longstack{ $r_2$ }\Longstack{ $ \xdashrightarrow{\hspace*{0.3cm} \backslash c_3 \ldots \ldots \ldots \hspace*{0.3cm}} $ }\Longstack{ \notate{$r_{end}$}{1}{ $L \; r_{end} = \{\ldots, \; [], \ldots\}$}}\begin{property} $s \in L \; r_{start} \iff [] \in L \; r_{end}$\end{property}\noindentNext we give the recursive definition of derivative onregular expressions, so that it satisfies the properties above.The derivative function, written $r\backslash c$, takes a regular expression $r$ and character $c$, andreturns a new regular expression representingthe original regular expression's language $L \; r$being taken the language derivative with respect to $c$.\begin{table} \begin{center}\begin{tabular}{lcl} $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ $d \backslash c$ & $\dn$ & $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\ & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\\end{tabular} \end{center}\caption{Derivative on Regular Expressions}\label{table:der}\end{table}\noindentThe most involved cases are the sequence caseand the star case.The sequence case says that if the first regular expressioncontains an empty string, then the second component of the sequenceneeds to be considered, as its derivative will contribute to theresult of this derivative:\begin{center} \begin{tabular}{lcl} $(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & $\textit{if}\;\,([] \in L(r_1))\; \textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\ & & $\textit{else} \; (r_1 \backslash c) \cdot r_2$ \end{tabular}\end{center}\noindentNotice how this closely resemblesthe language derivative operation $\Der$:\begin{center} \begin{tabular}{lcl} $\Der \; c \; (A @ B)$ & $\dn$ & $ \textit{if} \;\, [] \in A \; \textit{then} \;\, ((\Der \; c \; A) @ B ) \cup \Der \; c\; B$\\ & & $\textit{else}\; (\Der \; c \; A) @ B$\\ \end{tabular}\end{center}\noindentThe derivative of the star regular expression $r^*$ unwraps one iteration of $r$, turns it into $r\backslash c$,and attaches the original $r^*$after $r\backslash c$, so that we can further unfold it as many times as needed:\[ (r^*) \backslash c \dn (r \backslash c)\cdot r^*.\]Again,the structure is the same as the language derivative of Kleene star: \[ \textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)\]In the above definition of $(r_1\cdot r_2) \backslash c$,the $\textit{if}$ clause'sboolean condition $[] \in L(r_1)$ needs to be somehow recursively computed.We call such a function that checkswhether the empty string $[]$ is in the language of a regular expression $\nullable$:\begin{center} \begin{tabular}{lcl} $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ \end{tabular}\end{center}\noindentThe $\ZERO$ regular expressiondoes not contain any string andtherefore is not \emph{nullable}.$\ONE$ is \emph{nullable} by definition. The character regular expression $c$corresponds to the singleton set $\{c\}$, and therefore does not contain the empty string.The alternative regular expression is nullableif at least one of its children is nullable.The sequence regular expressionwould require both children to have the empty stringto compose an empty string, and the Kleene staris always nullable because it naturallycontains the empty string. \noindentWe have the following correspondence between derivatives on regular expressions andderivatives on a set of strings:\begin{lemma}\label{derDer} \mbox{} \begin{itemize} \item$\textit{Der} \; c \; L(r) = L (r\backslash c)$\item $c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$. \end{itemize}\end{lemma}\begin{proof} By induction on $r$.\end{proof}\noindentwhich is the main property of derivativesthat enables us to reason about the correctness ofderivative-based matching.We can generalise the derivative operation shown above for single charactersto strings as follows:\begin{center}\begin{tabular}{lcl}$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\$r \backslash_s [\,] $ & $\dn$ & $r$\end{tabular}\end{center}\noindentWhen there is no ambiguity, we will omit the subscript and use $\backslash$ insteadof $\backslash_s$ to denotestring derivatives for brevity.Brzozowski's derivative-basedregular-expression matching algorithm can then be described as:\begin{definition}$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$\end{definition}\noindentAssuming the string is given as a sequence of characters, say $c_0c_1..c_n$, this algorithm presented graphically is as follows:\begin{equation}\label{matcher}\begin{tikzcd}r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{true}/\textrm{false}\end{tikzcd}\end{equation}\noindent It is relativelyeasy to show that this matcher is correct, namely\begin{lemma} $\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$\end{lemma}\begin{proof} By induction on $s$ using property of derivatives: lemma \ref{derDer}.\end{proof}\begin{figure}\begin{center}\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, ylabel={time in secs}, %ymode = log, legend entries={Naive Matcher}, legend pos=north west, legend cell align=left]\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};\end{axis}\end{tikzpicture} \caption{Matching the regular expression $(a^*)^*b$ against strings of the form$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using Brzozowski's original algorithm}\label{NaiveMatcher}\end{center}\end{figure}\noindentIf we implement the above algorithm naively, however,the algorithm can be excruciatingly slow, as shown in \ref{NaiveMatcher}.Note that both axes are in logarithmic scale.Around two dozens characterswould already ``explode'' the matcher with the regular expression $(a^*)^*b$.To improve this situation, we need to introduce simplificationrules for the intermediate results,such as $r + r \rightarrow r$,and make sure those rules do not change the language of the regular expression.One simpled-minded simplification functionthat achieves these requirements is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):\begin{center} \begin{tabular}{lcl} $\simp \; r_1 \cdot r_2 $ & $ \dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\ & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\ & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\ & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\ & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\ & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\ $\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\ & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\ & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\ & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\ $\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$ \end{tabular}\end{center}If we repeatedly apply this simplification function during the matching algorithm, we have a matcher with simplification:\begin{center} \begin{tabular}{lcl} $\derssimp \; [] \; r$ & $\dn$ & $r$\\ $\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\ $\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$ \end{tabular}\end{center}\begin{figure}\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, ylabel={time in secs}, %ymode = log, %xmode = log, grid = both, legend entries={Matcher With Simp}, legend pos=north west, legend cell align=left]\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};\end{axis}\end{tikzpicture} \caption{$(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}\end{figure}\noindentThe running time of $\textit{ders}\_\textit{simp}$on the same example of Figure \ref{NaiveMatcher}is now ``tame'' in terms of the length of inputs,as shown in Figure \ref{BetterMatcher}.So far the story is use Brzozowski derivatives andsimplify as much as possible and at the end testwhether the empty string is recognised by the final derivative.But what if we want to do lexing instead of just getting a true/false answer?Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and elegant (arguably as beautiful as the definition of the Brzozowski derivative) solution for this.\section{Values and the Lexing Algorithm by Sulzmann and Lu}In this section, we present a two-phase regular expression lexing algorithm.The first phase takes successive derivatives with respect to the input string,and the second phase does the reverse, \emph{injecting} backcharacters, in the meantime constructing a lexing result.We will introduce the injection phase in detail slightlylater, but as a preliminary we have to first define the datatype for lexing results, called \emph{value} orsometimes also \emph{lexical value}. Values and regularexpressions correspond to each other as illustrated in the followingtable:\begin{center} \begin{tabular}{c@{\hspace{20mm}}c} \begin{tabular}{@{}rrl@{}} \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ $r$ & $::=$ & $\ZERO$\\ & $\mid$ & $\ONE$ \\ & $\mid$ & $c$ \\ & $\mid$ & $r_1 \cdot r_2$\\ & $\mid$ & $r_1 + r_2$ \\ \\ & $\mid$ & $r^*$ \\ \end{tabular} & \begin{tabular}{@{\hspace{0mm}}rrl@{}} \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\ $v$ & $::=$ & \\ & & $\Empty$ \\ & $\mid$ & $\Char \; c$ \\ & $\mid$ & $\Seq\,v_1\, v_2$\\ & $\mid$ & $\Left \;v$ \\ & $\mid$ & $\Right\;v$ \\ & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ \end{tabular} \end{tabular}\end{center}\noindentA value has an underlying string, which can be calculated by the ``flatten" function $|\_|$:\begin{center} \begin{tabular}{lcl} $|\Empty|$ & $\dn$ & $[]$\\ $|\Char \; c|$ & $ \dn$ & $ [c]$\\ $|\Seq \; v_1, \;v_2|$ & $ \dn$ & $ v_1| @ |v_2|$\\ $|\Left \; v|$ & $ \dn$ & $ |v|$\\ $|\Right \; v|$ & $ \dn$ & $ |v|$\\ $|\Stars \; []|$ & $\dn$ & $[]$\\ $|\Stars \; v::vs|$ & $\dn$ & $ |v| @ |\Stars(vs)|$ \end{tabular}\end{center}Sulzmann and Lu used a binary predicate, written $\vdash v:r $,to indicate that a value $v$ could be generated from a lexing algorithmwith input $r$. They call it the value inhabitation relation,defined by the rules.\begin{figure}[H]\label{fig:inhab}\begin{mathpar} \inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em} \inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars \; vs : r^*}\end{mathpar}\caption{The inhabitation relation for values and regular expressions}\end{figure}\noindentThe condition $|v| \neq []$ in the premise of star's ruleis to make sure that for a given pair of regular expression $r$ and string $s$, the number of values satisfying $|v| = s$ and $\vdash v:r$ is finite.This additional condition wasimposed by Ausaf and Urban to make their proofs easier.Given a string and a regular expression, there can bemultiple values for it. For example, both$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ holdand the values both flatten to $abc$.Lexers therefore have to disambiguate and choose onlyone of the values be generated. $\POSIX$ is one of thedisambiguation strategies that is widely adopted.Ausaf et al.\parencite{AusafDyckhoffUrban2016} formalised the property as a ternary relation.The $\POSIX$ value $v$ for a regular expression$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified in the following rules\footnote{The names of the rules are usedas they were originally given in \cite{AusafDyckhoffUrban2016}.}:\begin{figure}[p]\begin{mathpar} \inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty} \inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c} \inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1} \inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2} \inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow \Seq \; v_1 \; v_2} \inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])} \inferrule[P*]{(s_1, v) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\ |v| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \; (v::vs)}\end{mathpar}\caption{The inductive POSIX rules given by Ausaf et al. \cite{AusafDyckhoffUrban2016}.This ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on thevalue $v$ given a string $s$ and regular expression $r$.}\end{figure}\afterpage{\clearpage}\noindent%\begin{figure}%\begin{tikzpicture}[]% \node [minimum width = 6cm, rectangle split, rectangle split horizontal, % rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]% (node1)% {$r_{token1}$% \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ };% %\node [left = 6.0cm of node1] (start1) {hi};% \node [left = 0.2cm of node1] (middle) {$v.s.$};% \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal, % rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]% (node2)% {$\quad\;\;\;r_{token1}\quad\;\;\;$% \nodepart{two} $r_{token2}$ };% \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};% \node [above = 1.5cm of middle, minimum width = 6cm, % rectangle, style={draw, rounded corners, inner sep=10pt}] % (topNode) {$s$};% \path[->,draw]% (topNode) edge node {split $A$} (node2)% (topNode) edge node {split $B$} (node1)% ;% %%\end{tikzpicture}%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}%\end{figure}The above $\POSIX$ rules follows the intuition described below: \begin{itemize} \item (Left Priority)\\ Match the leftmost regular expression when multiple options of matching are available. See P+L and P+R where in P+R $s$ cannot be in the language of $L \; r_1$. \item (Maximum munch)\\ Always match a subpart as much as possible before proceeding to the next part of the string. For example, when the string $s$ matches $r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split: Then the split that matches a longer string for the first part $r_{part1}$ is preferred by this maximum munch rule. This is caused by the side-condition \begin{center} $\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$ \end{center} in PS. %(See %\ref{munch} for an illustration).\end{itemize}\noindentThese disambiguation strategies can be quite practical.For instance, when lexing a code snippet \[ \textit{iffoo} = 3\]using a regular expression for keywords and identifiers:%(for example, keyword is a nonempty string starting with letters %followed by alphanumeric characters or underscores):\[ r_{keyword} + r_{identifier}.\]If we want $\textit{iffoo}$ to be recognizedas an identifierwhere identifiers are defined as usual (lettersfollowed by letters, numbers or underscores),then a match with a keyword (if)followed byan identifier (foo) would be incorrect.POSIX lexing would generate what we want.\noindentWe know that a POSIX value for regular expression $r$ is inhabited by $r$.\begin{lemma}$(r, s) \rightarrow v \implies \vdash v: r$\end{lemma}\noindentThe main property about a $\POSIX$ value is that given the same regular expression $r$ and string $s$,one can always uniquely determine the $\POSIX$ value for it:\begin{lemma}$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$\end{lemma}\begin{proof}By induction on $s$, $r$ and $v_1$. The inductive casesare all the POSIX rules. Probably the most cumbersome cases are the sequence and star with non-empty iterations.We shall give the details for proving the sequence case here.When we have \[ (s_1, r_1) \rightarrow v_1 \;\, and \;\, (s_2, r_2) \rightarrow v_2 \;\, and \;\, \nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2\]we know that the last condition excludes the possibility of a string $s_1'$ longer than $s_1$ such that \[(s_1', r_1) \rightarrow v_1' \;\; and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s \]hold.A shorter string $s_1''$ with $s_2''$ satisfying\[(s_1'', r_1) \rightarrow v_1''\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s \]cannot possibly form a $\POSIX$ value either, becauseby definition there is a candidatewith longer initial string$s_1$. Therefore, we know that the POSIXvalue $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching$s$ must have the property that \[ |a| = s_1 \;\; and \;\; |b| = s_2.\]The goal is to prove that $a = v_1 $ and $b = v_2$.If we have some other POSIX values $v_{10}$ and $v_{20}$ such that $(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$is the same as $\Seq(v_1, v_2)$. \end{proof}\noindentWe have now defined what a POSIX value is and shown that it is unique,the problem is to generatesuch a value in a lexing algorithm using derivatives.\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}Sulzmann and Lu extended Brzozowski's derivative-based matchingto a lexing algorithm by a second phase after the initial phase of successive derivatives.This second phase generates a POSIX value if the regular expression matches the string.Two functions are involved: $\inj$ and $\mkeps$.The function $\mkeps$ constructs a POSIX value from the lastderivative $r_n$:\begin{ceqn}\begin{equation}\label{graph:mkeps}\begin{tikzcd}r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\ & & & v_n \end{tikzcd}\end{equation}\end{ceqn}\noindentIn the above diagram, again we assume thatthe input string $s$ is made of $n$ characters$c_0c_1 \ldots c_{n-1}$ The last derivative operation $\backslash c_{n-1}$ generates the derivative $r_n$, for which$\mkeps$ produces the value $v_n$. This valuetells us how the empty string is matched by the (nullable)regular expression $r_n$, in a POSIX way.The definition of $\mkeps$ is \begin{center} \begin{tabular}{lcl} $\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\ $\mkeps \; (r_{1}+r_{2})$ & $\dn$ & $\textit{if}\; (\nullable \; r_{1}) \;\, \textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\ & & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\ $\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\ $\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$ \end{tabular} \end{center}\noindent The function prefers the left child $r_1$ of $r_1 + r_2$ to match an empty string if there is a choice.When there is a star to match the empty string,we give the $\Stars$ constructor an empty list, meaningno iteration is taken.The result of $\mkeps$ on a $\nullable$ $r$ is a POSIX value for $r$ and the empty string:\begin{lemma}\label{mePosix}$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$\end{lemma}\begin{proof} By induction on $r$.\end{proof}\noindentAfter the $\mkeps$-call, Sulzmann and Lu inject back the characters one by onein reverse order as they were chopped off in the derivative phase.The fucntion for this is called $\inj$. This function operates on values, unlike $\backslash$ which operates on regular expressions.In the diagram below, $v_i$ stands for the (POSIX) value for how the regular expression $r_i$ matches the string $s_i$ consisting of the last $n-i$ charactersof $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.After injecting back $n$ characters, we get the lexical value for how $r_0$matches $s$. \begin{figure}[H]\begin{center} \begin{ceqn}\begin{tikzcd}r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed] \end{tikzcd}\end{ceqn}\end{center}\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016}, matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$. The first phase involves taking successive derivatives w.r.t the characters $c_0$, $c_1$, and so on. These are the same operations as they have appeared in the matcher \ref{matcher}. When the final derivative regular expression is nullable (contains the empty string), then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches the empty string , by always selecting the leftmost nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they appeared in the string, always preserving POSIXness.}\label{graph:inj}\end{figure}\noindentThe function $\textit{inj}$ as defined by Sulzmann and Lutakes three arguments: a regularexpression ${r_{i}}$, before the character is chopped off, a character ${c_{i}}$ (the character we want to inject back) and the third argument $v_{i+1}$ the value we want to inject into. The result of an application $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that\[ (s_i, r_i) \rightarrow v_i\]holds.The definition of $\textit{inj}$ is as follows: \begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l} $\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\ $\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\ $\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\ $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ & $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\ $\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ & $\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\ $\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\ $\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\\end{tabular}\end{center}\noindent The function recurses on the shape of regularexpressionsw and values.Intuitively, each clause analyses how $r_i$ could have transformed when being derived by $c$, identifying which subpartof $v_{i+1}$ has the ``hole'' to inject the character back into.Once the character isinjected back to that sub-value; $\inj$ assembles all parts togetherto form a new value.For instance, the last clause is aninjection into a sequence value $v_{i+1}$whose second childvalue is a star, and the shape of the regular expression $r_i$ before injection is a star.We therefore know the derivative starts on a star and ends as a sequence:\[ (r^*) \backslash c \longrightarrow r\backslash c \cdot r^*\]during which an iteration of the starhad just been unfolded, giving the belowvalue inhabitation relation:\[ \vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.\]The value list $vs$ corresponds tomatched star iterations,and the ``hole'' lies in $v$ because\[ \vdash v: r\backslash c.\]Finally, $\inj \; r \;c \; v$ is prependedto the previous list of iterations, and thenwrapped under the $\Stars$constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.Recall that lemma \ref{mePosix} tells us that$\mkeps$ always generates the POSIX value.The function $\inj$ preserves the POSIXness, providedthe value before injection is POSIX, namely\begin{lemma}\label{injPosix} If$(r \backslash c, s) \rightarrow v $, then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.\end{lemma}\begin{proof} By induction on $r$. The non-trivial cases are sequence and star. When $r = a \cdot b$, there can be three cases for the value $v$ satisfying $\vdash v:a\backslash c$. We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each case. \begin{itemize} \item $v = \Seq \; v_a \; v_b$.\\ The ``not nullable'' clause of the $\inj$ function is taken: \begin{center} \begin{tabular}{lcl} $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\ & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ \end{tabular} \end{center} We know that there exists a unique pair of $s_a$ and $s_b$ satisfaying $(a \backslash c, s_a) \rightarrow v_a$, $(b , s_b) \rightarrow v_b$, and $\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in L \; (a\backslash c) \land s_4 \in L \; b$. The last condition gives us $\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in L \; a \land s_4 \in L \; b$. By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds, and this gives us \[ (a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. \] \item $v = \Left \; (\Seq \; v_a \; v_b)$\\ The argument is almost identical to the above case, except that a different clause of $\inj$ is taken: \begin{center} \begin{tabular}{lcl} $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\ & $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$ \end{tabular} \end{center} With a similar reasoning, \[ (a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. \] again holds. \item $v = \Right \; v_b$\\ Again the injection result would be \begin{center} \begin{tabular}{lcl} $\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\ & $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$ \end{tabular} \end{center} We know that $a$ must be nullable, allowing us to call $\mkeps$ and get \[ (a, []) \rightarrow \mkeps \; a. \] Also by inductive hypothesis \[ (b, c::s) \rightarrow \inj\; b \; c \; v_b \] holds. In addition, as $\Right \;v_b$ instead of $\Left \ldots$ is the POSIX value for $v$, it must be the case that $s \notin L \;( (a\backslash c)\cdot b)$. This tells us that \[ \nexists s_3 \; s_4. s_3 @s_4 = s \land s_3 \in L \; (a\backslash c) \land s_4 \in L \; b \] which translates to \[ \nexists s_3 \; s_4. \; s_3 \neq [] \land s_3 @s_4 = c::s \land s_3 \in L \; a \land s_4 \in L \; b. \] (Which basically says there cannot be a longer initial split for $s$ other than the empty string.) Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$ as the POSIX value for $a\cdot b$. \end{itemize} The star case can be proven similarly.\end{proof}\noindentPutting all together, Sulzmann and Lu obtained the following algorithmoutlined in the diagram \ref{graph:inj}:\begin{center}\begin{tabular}{lcl} $\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\ $\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\ & & $\quad \phantom{\mid}\; \None \implies \None$\\ & & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$\end{tabular}\end{center}\noindentThe central property of the $\lexer$ is that it gives the correct resultaccording toPOSIX rules. \begin{theorem}\label{lexerCorrectness} The $\lexer$ based on derivatives and injections is correct: \begin{center} \begin{tabular}{lcl} $\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\ $\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$ \end{tabular} \end{center}\end{theorem} \begin{proof}By induction on $s$. $r$ generalising over an arbitrary regular expression.The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive caseby lemma \ref{injPosix}.\end{proof}\noindentAs we did earlier in this chapter with the matcher, one can introduce simplification on the regular expression in each derivative step.However, now one needs to do a backward phase and make surethe values align with the regular expression.Therefore one has tobe careful not to break the correctness, as the injection function heavily relies on the structure of the regular expressions and values being aligned.This can be achieved by recording some extra rectification functionsduring the derivatives step, and applying these rectifications in each run during the injection phase.With extra careone can show that POSIXness will not be affectedby the simplifications listed here \cite{AusafDyckhoffUrban2016}. \begin{center} \begin{tabular}{lcl} $\simp \; r_1 \cdot r_2 $ & $ \dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\ & & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\ & & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\ & & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\ & & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\ & & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\ $\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\ & & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\ & & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\ & & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\ $\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$ \end{tabular}\end{center}However, with the simple-minded simplification rules allowedin an injection-based lexer, one can still end upwith exploding derivatives.\section{A Case Requring More Aggressive Simplifications}For example, when starting with the regularexpression $(a^* \cdot a^*)^*$ and building just overa dozen successive derivatives w.r.t.~the character $a$, one obtains a derivative regular expressionwith millions of nodes (when viewed as a tree)even with the simple-minded simplification.\begin{figure}[H]\begin{center}\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, ylabel={size}, legend entries={Simple-Minded Simp, Naive Matcher}, legend pos=north west, legend cell align=left]\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};\end{axis}\end{tikzpicture} \end{center}\caption{Size of $(a^*\cdot a^*)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\end{figure}\label{fig:BetterWaterloo}That is because Sulzmann and Lu's injection-based lexing algorithm keeps a lot of "useless" values that will not be used. These different ways of matching will grow exponentially with the string length.Consider the case\[ r= (a^*\cdot a^*)^* \quad and \quad s=\underbrace{aa\ldots a}_\text{n \textit{a}s}\]as an example.This is a highly ambiguous regular expression, withmany ways to split up the string into multiple segments fordifferent star iteratioins,and for each segment multiple ways of splitting between the two $a^*$ sub-expressions.When $n$ is equal to $1$, there are two lexical values forthe match:\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)\]and\[ \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)\]The derivative of $\derssimp \;s \; r$ is\[ (a^*a^* + a^*)\cdot(a^*a^*)^*.\]The $a^*a^*$ and $a^*$ in the first child of the above sequencecorrespond to value 1 and value 2, respectively.When $n=2$, the number goes up to 7:\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]\]\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]\]\[ \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]\]\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]\]\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) ] \]\[ \Stars \; [ \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) ] \]and\[ \Stars \; [ \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []) ] \]And $\derssimp \; aa \; (a^*a^*)^*$ would be\[ ((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*.\]which removes two out of the seven terms corresponding to theseven distinct lexical values.It is not surprising that there are exponentially many distinct lexical values that cannot be eliminated by the simple-minded simplification of $\derssimp$. A lexer without a good enough strategy to deduplicate will naturallyhave an exponential runtime on highlyambiguous regular expressions, because thereare exponentially many matches.For this particular example, it seemsthat the number of distinct matches growthspeed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).On the other hand, the $\POSIX$ value for $r= (a^*\cdot a^*)^*$ and $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is \[ \Stars\, [\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]\]and at any moment the subterms in a regular expressionthat will result in a POSIX value is onlya minority among the many other terms,and one can remove ones that are absolutely not possible to be POSIX.In the above example,\begin{equation}\label{eqn:growth2} ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\end{equation}can be further simplified by removing the underlined term first,which would open up possibilitiesof further simplification that removes theunderbraced part.The result would be \[ (\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*.\]with corresponding values\begin{center} \begin{tabular}{lr} $\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\ $\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$ \end{tabular}\end{center}Other terms with an underlying value such as\[ \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]\]is simply too hopeless to contribute a POSIX lexical value,and is therefore thrown away.Ausaf et al. \cite{AusafDyckhoffUrban2016} have come up with some simplification steps, however those stepsare not yet sufficiently strong so that they achieve the above effects.And even with these relatively mild simplifications the proofis already quite a bit complicated than the theorem \ref{lexerCorrectness}.One would prove something like: \[ \textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\; \textit{then}\;\; (r, c::s) \rightarrow \inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v) \]instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$not only has to return a simplified regular expression,but also what specific simplifications has been done as a function on valuesshowing how one can transform the valueunderlying the simplified regular expressionto the unsimplified one.We therefore choose a slightly different approachalso described by Sulzmann and Lu toget better simplifications, which usessome augmented data structures compared to plain regular expressions.We call them \emph{annotated}regular expressions.With annotated regular expressions,we can avoid creating the intermediate values $v_1,\ldots v_n$ and a second phase altogether.We introduce this new datatype and the corresponding algorithm in the next chapter.