% Chapter Template
\chapter{Regular Expressions and POSIX Lexing-Preliminaries} % Main chapter title
\label{Inj}
% In chapter 2 \ref{Chapter2} we will introduce the concepts
% and notations we used for describing the lexing algorithm by Sulzmann and Lu,
% and then give the algorithm and its variant and discuss
% why more aggressive simplifications are needed.

We start with a technical overview of the
catastrophic backtracking problem,
motivating rigorous approaches to regular expression matching and lexing.
Then we introduce basic notations and definitions of our problem.

\section{Technical Overview}
Consider for example the regular expression $(a^*)^*\,b$ and
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length of around 30 $a$'s,
the decision takes a large amount of time to finish.
This is disproportionate to the simplicity of the input
(see graphs in figure \ref{fig:aStarStarb}).
The algorithms clearly show exponential behaviour, which, as can be seen,
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions mitigate this behaviour by several orders of magnitude,
but are still slow compared
with the approach we are going to use in this thesis.

This superlinear blowup in regular expression engines
has caused grief in ``real life'' where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
In this instance, a regular expression intended to just trim whitespace
from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.
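This exponential behaviour is easy to reproduce with the JDK's backtracking
matcher. The following small Scala sketch (the object and variable names are
ours) times $(a^*)^*\,b$ against strings of $n$ $a$'s; on Java 8 the running
time grows rapidly with each additional character:
\lstset{basicstyle=\fontsize{8.5}{9}\ttfamily, language={}}
\begin{lstlisting}
import java.util.regex.Pattern

// Time the JDK's backtracking matcher on (a*)*b against a^n.
// The pattern can never match (the final b is missing), yet the
// matcher explores exponentially many ways of splitting the a's
// between the inner and the outer star.
object EvilRegex {
  def main(args: Array[String]): Unit = {
    val pattern = Pattern.compile("(a*)*b")
    for (n <- 10 to 30) {
      val start = System.nanoTime
      pattern.matcher("a" * n).matches()   // always false
      val secs = (System.nanoTime - start) / 1e9
      println(f"n = $n%2d: $secs%.3f secs")
    }
  }
}
\end{lstlisting}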
\begin{figure}[p]\begin{center}\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={JavaScript}, legend pos=north west, legend cell align=left]\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};\end{axis}\end{tikzpicture} &\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, %ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={Python}, legend pos=north west, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};\end{axis}\end{tikzpicture}\\ \begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={Java 8}, legend pos=north west, legend cell align=left]\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};\end{axis}\end{tikzpicture} &\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, %ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={Dart}, legend pos=north west, legend cell align=left]\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};\end{axis}\end{tikzpicture}\\ \begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=35, ytick={0,5,...,30}, scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={Swift}, legend pos=north west, legend cell align=left]\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};\end{axis}\end{tikzpicture} & \begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, %ylabel={time in secs}, enlargelimits=true, %xtick={0,5000,...,40000}, %xmax=40000, %ymax=35, restrict x to domain*=0:40000, restrict y to domain*=0:35, %ytick={0,5,...,30}, %scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={Java9+}, legend pos=north west, legend cell align=left]\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};\end{axis}\end{tikzpicture}\\ \begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, ylabel={time in secs}, enlargelimits=true, %xtick={0,5,...,30}, %xmax=33, %ymax=35, restrict x to domain*=0:60000, restrict y to domain*=0:35, %ytick={0,5,...,30}, %scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend entries={Scala}, legend pos=north west, legend cell align=left]\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};\end{axis}\end{tikzpicture} & \begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,-0.05)}}, %ylabel={time in secs}, enlargelimits=true, %xtick={0,5000,...,40000}, %xmax=40000, %ymax=35, restrict x to domain*=0:60000, restrict y to domain*=0:45, %ytick={0,5,...,30}, %scaled ticks=false, axis lines=left, width=5cm, height=4cm, legend style={cells={align=left}}, legend entries={Isabelle \\ Extracted}, legend pos=north west, legend cell 
align=left]
\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
The reason for their fast growth %superlinear behaviour
is that they do a depth-first-search using NFAs.
If the string does not match, the regular expression matching engine
starts to explore all possibilities.
The last two graphs are for our implementation in Scala, one manual
and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
Our lexer performs better in this case, and is formally verified.
Despite being almost identical, the codegen-generated lexer
% is generated directly from Isabelle using $\textit{codegen}$,
is slower than the manually written version since the code synthesised by
$\textit{codegen}$ does not use native integer or string
types of the target language.
%Note that we are comparing our \emph{lexer} against other languages' matchers.
}\label{fig:aStarStarb}
\end{figure}
\afterpage{\clearpage}
A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited catastrophic backtracking
and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very rarely,
but they occur often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al. \cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
are a real problem rather than just ``a parlour trick''.

The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness
and a finiteness property for the size of derivatives
(all of which are formalised in Isabelle/HOL).
Such properties are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in figure \ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our approach based on Brzozowski derivatives and formal proofs.

In this chapter, we define the basic notions
for regular languages and regular expressions.
This is essentially a description in ``English''
of the functions and datatypes of our formalisation in Isabelle/HOL.
We also define what $\POSIX$ lexing means,
followed by the first lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014}
that produces the output conforming
to the $\POSIX$ standard.
This is a preliminary chapter which describes the results of
Sulzmann and Lu and Ausaf et al.\ \cite{AusafDyckhoffUrban2016},
but the proof details and function definitions are needed to motivate our work,
as we show in chapter \ref{Bitcoded2} how the proofs break down when
simplifications are applied.
%TODO: Actually show this in chapter 4.

In what follows
we choose to use the Isabelle-style notation
for function and datatype constructor applications, where
the parameters of a function are not enclosed
inside a pair of parentheses (e.g.\ $f \;x \;y$
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.

\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have clearly been studied for many, many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
w.r.t.\ the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.

%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for programmers to write regular expressions.
Depending on the types of constructs,
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are syntactic sugar, that is,
shorthand notations
that save the programmer a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
Similarly, the range operator
%used to express the alternative
%of all characters between its operands,
is just a concise way
of expressing an alternative of consecutive characters:
\[
	[0-9] \stackrel{means}{=} (0 | 1 | \ldots | 9 )
\]
The
wildcard character '$.$' is used to refer to any single character except the newline:
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make the regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the Kleene star, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
The attraction of bounded repetitions is that they can be
used to avoid a size blow-up: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression:
\[
	\underbrace{r\ldots r}_\text{n copies of r}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.

The problem with matching such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
either become very slow
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
in C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above countdown regular expression,
a small $n$ (say 20) would result in a program representing a
DFA
with millions of states.}) for large counters.
A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{\{n\}}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
and when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The red states are
``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
The state $q_3$ is the last (accepting) state, requiring 0
more characters.
Depending on the prefix of the
input string read so far,
the states $q_1$, $q_2$ and $q_3$
may or may not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all different scenarios in which ``countdown'' states are active.}

Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library
of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.\ \cite{xml2015},
more than half of the XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million.
A smaller sample XSD they gave
is:
\lstset{
	basicstyle=\fontsize{8.5}{9}\ttfamily,
  language=XML,
  morekeywords={encoding,
    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
}
\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{lstlisting}
This can be seen as the regex
$(ab^{\{2\ldots 12\}})^{\{0 \ldots 65535\}}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
for bounded regular expressions:
For example, in the regular expression matching library in the Go
language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that they simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large, the memory requirement could become
infeasible, and a regex engine
like Go's will reject such patterns straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

    \node (q0) [state, initial] {$0$};
    \node (q1) [state, right = of q0] {$1$};
    %\node (q2) [state, right = of q1] {$2$};
    \node (qdots) [right = of q1] {$\ldots$};
    \node (qn) [state, right = of qdots] {$n$};
    \node (qn1) [state, right = of qn] {$n+1$};
    \node (qn2) [state, right = of qn1] {$n+2$};
    \node (qn3) [state, accepting, right = of qn2] {$n+3$};
    \path [-stealth, thick]
    (q0) edge [loop above] node {a} ()
    (q0) edge node {a} (q1)
    %(q1) edge node {.} (q2)
    (q1) edge node {.} (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
%   \node[state,initial] (q_0)   {$0$};
%   \node[state, ] (q_1) [right=of q_0] {$1$};
%   \node[state, ] (q_2) [right=of q_1] {$2$};
%   \node[state,
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
%    \path[->]
%    (q_0) edge  node {a} (q_1)
%    	  edge [loop below] node {a,b} ()
%    (q_1) edge  node  {a,b} (q_2)
%    (q_2) edge  node  {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
	that NFA simulation can consume large
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
	strings of the form $aaa\ldots aaaabc$.
	When traversing in a breadth-first manner,
	all states from $0$ to $n+1$ will become active.}
\label{fig:inj}
\end{figure}
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion
and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabits of string input per second
even with large counters \cite{Becchi08}.
These practical solutions do not come with
formal guarantees, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
%But formal reasoning about these automata especially in Isabelle
%can be challenging
%and un-intuitive.
%Therefore, we take correctness and runtime claims made about these solutions
%with a grain of salt.

In the work reported in \cite{FoSSaCS2023} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only small
inductive datatypes and recursive functions,
making them handy to deal with in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.

\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is choosing
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called a ``depth-first search'',
is efficient in many cases, but could end up
with exponential run time.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}
%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply
%construct the $\NFA$ states based on the structure of the input regular expression.
%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.

Consider the following regular expression where the sequence
operator is omitted for brevity:
\begin{center}
	$r_1r_2r_3r_4$
\end{center}
In this example,
one could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' by their respective numbers.
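Practical regex engines expose this numbering as \emph{capturing groups}.
As a small illustration, the following Scala snippet uses Java's
\texttt{java.util.regex} API (the pattern and input are our own toy example)
to retrieve the sub-strings matched by each numbered group:
\lstset{basicstyle=\fontsize{8.5}{9}\ttfamily, language={}}
\begin{lstlisting}
import java.util.regex.Pattern

// Groups are numbered by the order of their opening parentheses;
// group(0) is the whole match, group(i) the i-th sub-expression.
val m = Pattern.compile("((a+)(b+))(c+)").matcher("aabbbcc")
if (m.matches()) {
  println(m.group(1))  // "aabbb"  -- the outermost group
  println(m.group(2))  // "aa"
  println(m.group(3))  // "bbb"
  println(m.group(4))  // "cc"
}
\end{lstlisting}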
%These sub-expressions are called "capturing groups".
To do so, one uses the syntax $\backslash i$
to denote that we want the sub-string
of the input matched by the $i$-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
\end{center}
Once the sub-string $s_i$ for the sub-expression $r_i$
has been fixed, there is no variability on what the back-reference
label $\backslash i$ can be---it is tied to $s_i$.
The overall string will look like $\ldots s_i \ldots s_i \ldots $
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
for back-references is
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of back-references is
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.

Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to express well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
	$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes
goes beyond regular languages
once back-references are included.
In fact, back-references allow regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by
context-free grammars \cite{campeanu2003formal}.
Such a language is contained in the context-sensitive level
of the hierarchy of formal languages.
Also, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm including backtracking.
What is unexpected is that even in the cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in figure \ref{fig:aStarStarb}.

Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) those with linear-time guarantees like Go and Rust.
The downside with them is that
they impose restrictions
on the regular expressions (not allowing back-references,
and bounding repetition counts by an ad hoc limit, etc.).
And (ii) those that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
They can potentially ``grind to a halt''
on some very simple cases, resulting in
ReDoS attacks if exposed to the internet.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.
%We would like to have regex engines that can
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.

\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is widely adopted in many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But feeding this example to the different engines
listed at regex101 \footnote{
	regex101 is an online regular expression matcher which
	provides API for trying out regular expression
	engines of multiple popular programming languages like
Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries are not
correctly implementing the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
	``The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
People have recognised that the implementation complexity of POSIX
rules also comes from
the specification not being very precise.
The developers of the regexp package of Go
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
	``The POSIX rule is computationally prohibitive
	and not even well-defined.''
\end{quote}
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
	\begin{itemize}
		\item
			regular expressions (REs) take the leftmost starting match, and the longest match starting there
		\item
			earlier subpatterns have leftmost-longest priority over later subpatterns
		\item
			higher-level subpatterns have leftmost-longest priority over their component subpatterns
		\item
			REs have right associative concatenation which can be changed with parenthesis
		\item
			parenthesized subexpressions return the match from their last usage
		\item
			text of component subexpressions must be contained in the text of the
			higher-level subexpressions
		\item
			if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions
		\item
			if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string
	\end{itemize}
\end{quote}
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ribeiro and Du Bois \cite{RibeiroAgda2017}
have formalised the notion of bit-coded regular expressions
and proved their relations with simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al.
\cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also prove
that their specification coincides with an earlier (unformalised)
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needed to be substantially
revised in order for the correctness proof to go through.
Their original specification and proof were unfixable
according to Ausaf et al.

In the next section, we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, which the main point of this thesis builds on.
%We give a taste of what they
%are like and why they are suitable for regular expression
%matching and lexing.

\section{Formal Specification of POSIX Matching
and Brzozowski Derivatives}
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Brzozowski \cite{Brzozowski1964} first introduced the
concept of a \emph{derivative} of a regular expression in 1964.
The derivative of a regular expression $r$
with respect to a character $c$ is written $r \backslash c$.
This operation tells us what $r$ transforms into
if we ``chop'' off a particular character $c$
from all strings in the language of $r$ (defined
later as $L \; r$).
Derivatives have the property
that $s \in L \; (r\backslash c)$ if and only if
$c::s \in L \; r$ where $::$ stands for list prepending.
With this property, derivatives can give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
There are several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.
In addition, one can extend derivatives to bounded repetitions
relatively straightforwardly.
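As a hint of how naturally derivatives translate into functional code, the
matching idea just described can be written in a few lines of Scala. This is
only a preview sketch using our own names; the clauses for $\nullable$ and
$\backslash$ are formally defined later in this chapter:
\lstset{basicstyle=\fontsize{8.5}{9}\ttfamily, language={}}
\begin{lstlisting}
// A minimal derivative-based matcher (sketch):
// s is in L(r) iff nullable(r after taking derivatives w.r.t. s).
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only []
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(s) => SEQ(der(c, s), STAR(s))
}

def matches(s: String, r: Rexp): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{lstlisting}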
For bounded repetitions, for example, the derivative clause can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$ & $\dn$ & $(r \backslash c) \cdot r^{\{n-1\}} \quad (\text{when } n > 0)$\\
	\end{tabular}
\end{center}
\noindent
Experimental results suggest that unlike DFA-based solutions
for bounded regular expressions,
derivatives can cope with
large counters
quite well.

There have also been
extensions of derivatives to other regex constructs.
For example, Owens et al.\ include the derivatives
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
$/* \ldots */$ (see \cite{Owens2008}).
Another extension for derivatives is
regular expressions with look-aheads, done
by Miyazaki and Minamide
\cite{Takayuki2019}.
%We therefore use Brzozowski derivatives on regular expressions
%lexing

Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
in generating a formally verified algorithm for lexing: the clauses and property
can be easily expressed in a functional programming language
or converted to theorem prover code.
Perhaps this is the reason why derivatives have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years (\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012}
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published by Brzozowski.

However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
Second, derivative-based matchers need to be more efficient in terms
of the sizes of derivatives.
Elegant and beautiful
as many implementations are,
they can still be quite slow.
For example, Sulzmann and Lu
claim a linear running time of their proposed algorithm,
but that was falsified by our experiments. The running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$
is made for the Verbatim lexer \cite{Verbatim}, %TODO: give references
which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not their complexity result.
In the performance evaluation section, they analyzed the run time
of matching $a$ with the string
\begin{center}
	$\underbrace{a \ldots a}_{\text{n a's}}$.
\end{center}
\noindent
They concluded that the algorithm is quadratic in terms of
the length of the input string.
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.

\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
problem with the yes/no answer
by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function.
This injection function in a sense undoes the ``damage''
of the derivatives chopping off characters.
In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

The version of the algorithm proven correct
suffers, however, heavily from a
second difficulty, where derivatives can
grow to arbitrarily big sizes.
For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we usually run out of memory on a
typical computer. Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in
the work by Ausaf et al.\ to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.

Therefore we want to look in this thesis at a second
algorithm by Sulzmann and Lu where they
overcame this ``growth problem'' \cite{Sulzmann2014}.
In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation)

\begin{quote}\it
	``Correctness Claim: We further claim that the incremental parsing
	method [..] in combination with the simplification steps [..]
	yields POSIX parse trees. We have tested this claim
	extensively [..] but yet
	have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
Ausaf and Urban made some initial progress towards the
full correctness proof but still had to leave out the optimisation
Sulzmann and Lu proposed.
Ausaf wrote \cite{Ausaf},
\begin{quote}\it
	``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer.
	Alas due to time constraints we are unable to do so here.''
\end{quote}
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of the correctness of those simplifications.

One of the most recent pieces of work in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work for us and we will compare it later with
the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
We will also study this work in a later section.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.

\section{Basic Concepts}
Formal language theory usually starts with an alphabet
denoting a set of characters.
Here we use the datatype of characters from Isabelle,
which roughly corresponds to the ASCII characters.
In what follows, we shall leave the information about the alphabet
implicit.
Then using the usual bracket notation for lists,
we can define strings made up of characters as:
\begin{center}
\begin{tabular}{lcl}
$\textit{s}$ & $\dn$ & $[] \; |\; c :: s$
\end{tabular}
\end{center}
where $c$ is a variable ranging over characters.
The $::$ stands for list cons and $[]$ for the empty
list.
For brevity, a singleton list is sometimes written as $[c]$.
Strings can be concatenated to form longer strings in the same
way we concatenate two lists, which we shall write as $s_1 @ s_2$.
We omit the precise
recursive definition here.
We overload this concatenation operator for two sets of strings:
\begin{center}
\begin{tabular}{lcl}
$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
\end{tabular}
\end{center}
We also call the above \emph{language concatenation}.
The power of a language is defined recursively, using the
concatenation operator $@$:
\begin{center}
\begin{tabular}{lcl}
$A^0 $ & $\dn$ & $\{ [] \}$\\
$A^{n+1}$ & $\dn$ & $A @ A^n$
\end{tabular}
\end{center}
The union of all powers of a language
can be used to define the Kleene star operator:
\begin{center}
\begin{tabular}{lcl}
 $A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
\end{tabular}
\end{center}

\noindent
However, to obtain a more convenient induction principle
in Isabelle/HOL,
we instead define the Kleene star
as an inductive set:

\begin{center}
\begin{mathpar}
	\inferrule{\mbox{}}{[] \in A*\\}

	\inferrule{s_1 \in A \;\; s_2 \in A*}{s_1 @ s_2 \in A*}
\end{mathpar}
\end{center}
\noindent
We also define an operation of ``chopping off'' a character from
a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
\begin{center}
\begin{tabular}{lcl}
$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
\end{tabular}
\end{center}
\noindent
This can be generalised to ``chopping off'' a string
from all strings within a set $A$, namely:
\begin{center}
\begin{tabular}{lcl}
$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
\end{tabular}
\end{center}
\noindent
which is essentially the left quotient $A \backslash L$ of $A$ against
the singleton language with $L = \{s\}$.
However, for our purposes here, the $\textit{Ders}$ definition with
a single string is sufficient.

The reason for defining derivatives
is that they provide another approach
to test membership of a string in
a set of strings.
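Operationally, this membership test amounts to chopping off characters one
by one. A toy Scala sketch over finite sets of strings (the function names
are ours) makes this concrete:
\lstset{basicstyle=\fontsize{8.5}{9}\ttfamily, language={}}
\begin{lstlisting}
// Der c A = { s | c::s is in A }, on a finite set of strings
def Der(c: Char, A: Set[String]): Set[String] =
  for (s <- A if s.nonEmpty && s.head == c) yield s.tail

// s is in A  iff  [] is in A after chopping off s character by
// character
def contains(A: Set[String], s: String): Boolean =
  s.foldLeft(A)((acc, c) => Der(c, acc)).contains("")

// e.g. contains(Set("foo", "bar", "brak"), "bar") == true
\end{lstlisting}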
For example, to test whether the string
$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivatives of the set with
respect to the string $bar$:
\begin{center}
\begin{tabular}{lll}
	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
	$\{ar, rak\}$ \\
				 & $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
				 & $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
				 %& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
\end{tabular}
\end{center}
\noindent
and in the end, test whether the set
contains the empty string.\footnote{We use the infix notation $A\backslash c$
	instead of $\Der \; c \; A$ for brevity, as it will always be
	clear from the context that we are operating
on languages rather than regular expressions.}

In general, if we have a language $S$,
then we can test whether $s$ is in $S$
by testing whether $[] \in S \backslash s$.
With the sequencing, Kleene star, and $\textit{Der}$ operator on languages
in place, we can state a few properties describing how the language
derivative interacts with these operations.
For example, for the sequence operator, we have
something similar to a ``chain rule'':
\begin{lemma}
	\[
		\Der \; c \; (A @ B) =
		\begin{cases}
			((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
			(\Der \; c \; A) \, @ \, B, & \text{otherwise}
		\end{cases}
	\]
\end{lemma}
\noindent
This lemma states that if $A$ contains the empty string, $\Der$ can ``pierce'' through it
and get to $B$.

The language derivative for $A*$ can be described using the language derivative
of $A$:
\begin{lemma}
	$\textit{Der} \;c \;(A*) = (\textit{Der}\; c A) @ (A*)$\\
\end{lemma}
\begin{proof}
	There are two inclusions to prove:
	\begin{itemize}
		\item{$\subseteq$}:\\
			The set
			\[ S_1 = \{s \mid c :: s \in A*\} \]
			is included in the set
			\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}.
			\]
			This is because for any string $c::s$ satisfying $c::s \in A*$,
			the character $c$, together with a prefix of $s$,
			forms the first iteration of $A*$,
			and the rest of $s$ is again in $A*$.
			This coincides with the definition of $S_2$.
		\item{$\supseteq$}:\\
			Note that
			\[ \Der \; c \; (A*) = \Der \; c \; (\{ [] \} \cup (A @ A*) ) \]
			holds.
			Also the following holds:
			\[ \Der \; c \; (\{ [] \} \cup (A @ A*) ) = \Der\; c \; (A @ A*) \]
			where the $\textit{RHS}$ can be rewritten
			as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
			which of course contains $(\Der \; c \; A) \,@\, A*$.
	\end{itemize}
\end{proof}

\noindent
The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
for regular expressions.
To introduce them, we need to first give definitions for regular expressions,
which we shall do next.

\subsection{Regular Expressions and Their Meaning}
The \emph{basic regular expressions} are defined inductively
by the following grammar:
\[			r ::= \ZERO \mid \ONE
			 \mid c
			 \mid r_1 \cdot r_2
			 \mid r_1 + r_2
			 \mid r^*
\]
\noindent
We call them basic because we will introduce
additional constructors in later chapters, such as negation
and bounded repetitions.
We use $\ZERO$ for the regular expression that
matches no string, and $\ONE$ for the regular
expression that matches only the empty string.\footnote{Some authors
also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
but we prefer this notation.}
The sequence regular expression is written $r_1\cdot r_2$
and sometimes we omit the dot if it is clear which
regular expression is meant; the alternative
is written $r_1 + r_2$.
The \emph{language} or meaning of
a regular expression is defined recursively as
a set of strings:
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\varnothing$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
$L \; (r^*)$ & $\dn$ & $ (L\;r)*$
\end{tabular}
\end{center}
\noindent
With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
We do so by first introducing what properties they should satisfy.

\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
Brzozowski noticed that $\Der$
can be ``mirrored'' on regular expressions which
he calls the derivative of a regular expression $r$
with respect to a character $c$, written
$r \backslash c$. This infix operator
takes a regular expression $r$ as its left operand
and a character $c$ as its right operand.
The derivative operation on regular expressions
is defined such that the language of the derivative
coincides with the language derivative of the
original regular expression:
\begin{property}
\[
	L \; (r \backslash c) = \Der \; c \; (L \; r)
\]
\end{property}
\noindent
Next, we give the recursive definition of derivatives on
regular expressions so that it satisfies the property above.
\begin{table}
	\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ &
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
	\end{center}
\caption{Derivative on Regular Expressions}
\label{table:der}
\end{table}
\noindent
The most involved cases are the sequence case
and the star case.
The sequence case says that if the first regular expression
contains an empty string, then the second component of the sequence
needs to be considered, as its derivative will contribute to the
result of this derivative:
\begin{center}
	\begin{tabular}{lcl}
		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
		$\textit{if}\;\,([] \in L(r_1))\;
		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
	\end{tabular}
\end{center}
\noindent
Notice how this closely resembles
the language derivative operation $\Der$:
\begin{center}
	\begin{tabular}{lcl}
		$\Der \; c \; (A @ B)$ & $\dn$ &
		$ \textit{if} \;\, [] \in A \;
		\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup \Der \; c\; B$\\
		& & $\textit{else}\; (\Der \; c \; A) @ B$\\
	\end{tabular}
\end{center}
\noindent
The derivative of the star regular expression $r^*$
unwraps one iteration of $r$, turns it into $r\backslash c$,
and attaches the original $r^*$
after $r\backslash c$, so that we can further unfold it as many times as needed:
\[
	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
\]
Again,
the structure is the same as the language derivative of the Kleene star:
\[
	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
\]
In the above definition of $(r_1\cdot r_2) \backslash
c$,
the $\textit{if}$ clause's
boolean condition
$[] \in L(r_1)$ needs to be
somehow recursively computed.
We call such a function that checks
whether the empty string $[]$ is
in the language of a regular expression $\nullable$:
\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
\end{center}
\noindent
The $\ZERO$ regular expression
does not contain any string and
therefore is not \emph{nullable}.
$\ONE$ is \emph{nullable}
by definition.
The character regular expression $c$
corresponds to the singleton set $\{[c]\}$,
and therefore does not contain the empty string.
The alternative regular expression is nullable
if at least one of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
to compose an empty string, and the Kleene star
is always nullable because it naturally
contains the empty string.

\noindent
We have the following two correspondences between
derivatives on regular expressions and
derivatives on a set of strings:
\begin{lemma}\label{derDer}
	\mbox{}
	\begin{itemize}
		\item
			$\textit{Der} \; c \; L(r) = L (r\backslash c)$
		\item
			$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
	\end{itemize}
\end{lemma}
\begin{proof}
	By induction on $r$.
\end{proof}
\noindent
These are the main properties of derivatives
that enable us later to reason about the correctness of
derivative-based matching.
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
$r \backslash_s [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
When there is no ambiguity, we will
omit the subscript and use $\backslash$ instead
of $\backslash_s$ to denote
string derivatives for brevity.
Brzozowski's derivative-based
regular-expression matching algorithm can then be described as:

\begin{definition}
	$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
\end{definition}

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
this algorithm, presented graphically, is as follows:

\begin{equation}\label{matcher}
\begin{tikzcd}
	r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] &
	r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] &
	\;\textrm{true}/\textrm{false}
\end{tikzcd}
\end{equation}

\noindent
It is relatively
easy to show that this matcher is correct, namely
\begin{lemma}
	$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
\end{lemma}
\begin{proof}
	By induction on $s$ using the property of derivatives:
	lemma \ref{derDer}.
\end{proof}
\begin{figure}
\begin{center}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    ylabel={time in secs},
    %ymode = log,
    legend entries={Naive Matcher},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{Matching the regular expression $(a^*)^*b$ against strings of the form
$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using Brzozowski's original algorithm}\label{NaiveMatcher}
\end{center}
\end{figure}
\noindent
If we implement the above algorithm naively, however,
the algorithm can be as slow as backtracking lexers, as
To improve this situation, we need to introduce simplification rules for the intermediate results, such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$, and make sure those rules do not change the language of the regular expression. One simple-minded simplification function that achieves these requirements is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
\begin{center}
\begin{tabular}{lcl}
$\simp \; (r_1 \cdot r_2) $ & $ \dn$ & $(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
$\simp \; (r_1 + r_2)$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \;\textit{match}$\\
& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
$\simp \; r$ & $\dn$ & $r\quad (\textit{otherwise})$
\end{tabular}
\end{center}
If we repeatedly apply this simplification function during the matching algorithm, we obtain a matcher with simplification:
\begin{center}
\begin{tabular}{lcl}
$\derssimp \; [] \; r$ & $\dn$ & $r$\\
$\derssimp \; (c :: cs) \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
\end{tabular}
\end{center}
\begin{figure}
\begin{tikzpicture}
\begin{axis}[
 xlabel={$n$},
 ylabel={time in secs},
 grid = both,
 legend entries={Matcher With Simp},
 legend pos=north west,
 legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
\end{axis}
\end{tikzpicture}
\caption{$(a^*)^*b$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ using $\textit{matcher}_{simp}$}\label{BetterMatcher}
\end{figure}
\noindent
The running time of $\derssimp$ on the same example as in Figure \ref{NaiveMatcher} is now ``tame'' with respect to the length of the input, as shown in Figure \ref{BetterMatcher}.
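In the same Scala sketch, the simplification function and the improved matcher can be rendered as follows (again with our own names). Note that, like the mathematical definition above, this sketch does not yet remove duplicates such as $r + r$:
\begin{verbatim}
// simple-minded simplification, mirroring the clauses above
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)   => ZERO
    case (_, ZERO)   => ZERO
    case (ONE, r2s)  => r2s
    case (r1s, ONE)  => r1s
    case (r1s, r2s)  => SEQ(r1s, r2s)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s)  => ALT(r1s, r2s)
  }
  case _ => r
}

// derivatives interleaved with simplification
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}

def matcherSimp(r: Rexp, s: String): Boolean =
  nullable(dersSimp(s.toList, r))
\end{verbatim}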
So far, the story is: use Brzozowski derivatives, simplify as much as possible along the way, and at the end test whether the empty string is recognised by the final derivative. But what if we want to do lexing instead of just obtaining a true/false answer? Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and elegant (arguably as beautiful as the definition of the Brzozowski derivative) solution for this.

\section{Values and the Lexing Algorithm by Sulzmann and Lu}
In this section, we present a two-phase regular expression lexing algorithm. The first phase takes successive derivatives with respect to the input string, and the second phase does the reverse, \emph{injecting} back characters, in the meantime constructing a lexing result. We will introduce the injection phase in detail slightly later, but as a preliminary we have to first define the datatype for lexing results, called \emph{value} or sometimes also \emph{lexical value}. Values and regular expressions correspond to each other as illustrated in the following table:
\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & $\Empty$ \\
& $\mid$ & $\Char \; c$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left \;v$ \\
& $\mid$ & $\Right\;v$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}
\noindent
A value has an underlying string, which can be calculated by the ``flatten'' function $|\_|$:
\begin{center}
\begin{tabular}{lcl}
$|\Empty|$ & $\dn$ & $[]$\\
$|\Char \; c|$ & $ \dn$ & $ [c]$\\
$|\Seq \; v_1 \;v_2|$ & $ \dn$ & $ |v_1| @ |v_2|$\\
$|\Left \; v|$ & $ \dn$ & $ |v|$\\
$|\Right \; v|$ & $ \dn$ & $ |v|$\\
$|\Stars \; []|$ & $\dn$ & $[]$\\
$|\Stars \; (v::vs)|$ & $\dn$ & $ |v| @ |\Stars \; vs|$
\end{tabular}
\end{center}
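Continuing the Scala sketch, values and the flatten function might look as follows. The case classes \texttt{Left} and \texttt{Right} are our own and shadow Scala's built-in constructors of \texttt{Either}:
\begin{verbatim}
// values recording how a regular expression matched a string
abstract class Val
case object Empty extends Val                  // for ONE
case class Chr(c: Char) extends Val            // for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val  // for SEQ
case class Left(v: Val) extends Val            // left branch of ALT
case class Right(v: Val) extends Val           // right branch of ALT
case class Stars(vs: List[Val]) extends Val    // iterations of STAR

// the "flatten" function |_|: the underlying string of a value
def flat(v: Val): List[Char] = v match {
  case Empty        => Nil
  case Chr(c)       => List(c)
  case Sequ(v1, v2) => flat(v1) ::: flat(v2)
  case Left(v1)     => flat(v1)
  case Right(v1)    => flat(v1)
  case Stars(vs)    => vs.flatMap(flat)
}
\end{verbatim}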
Sulzmann and Lu used a binary predicate, written $\vdash v:r$, to indicate that a value $v$ could be generated by a lexing algorithm with input $r$. They call it the value inhabitation relation, defined by the rules in Figure \ref{fig:inhab}.
\begin{figure}[H]
\begin{mathpar}
\inferrule{\mbox{}}{\vdash \Char \; c : c} \hspace{2em}
\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}
\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1\; v_2 : r_1 \cdot r_2}
\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
\inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars \; vs : r^*}
\end{mathpar}
\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
\end{figure}
\noindent
The condition $|v| \neq []$ in the premise of the star rule makes sure that, for a given pair of regular expression $r$ and string $s$, the number of values $v$ satisfying $|v| = s$ and $\vdash v:r$ is finite. This additional condition was imposed by Ausaf and Urban to make their proofs easier. Given a string and a regular expression, there can be multiple inhabited values with that string as underlying string. For example, both $\vdash \Seq\;(\Left \; ab)\;(\Right \; c):(ab+a)(bc+c)$ and $\vdash \Seq\;(\Right\; a)\;(\Left \; bc ):(ab+a)(bc+c)$ hold, and both values flatten to $abc$. Lexers therefore have to disambiguate and choose only one of the values to be generated. $\POSIX$ is one of the disambiguation strategies that is widely adopted.

Ausaf et al. \cite{AusafDyckhoffUrban2016} formalised the POSIX property as a ternary relation. The $\POSIX$ value $v$ for a regular expression $r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified by the rules in Figure \ref{fig:POSIXDef}\footnote{The names of the rules are used as they were originally given in \cite{AusafDyckhoffUrban2016}.}.
\begin{figure}[p]
\begin{mathpar}
\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
\inferrule[PS]{(s_1, r_1) \rightarrow v_1 \\ (s_2, r_2)\rightarrow v_2\\ \nexists s_3 \; s_4.\; s_3 \neq [] \land s_3 @ s_4 = s_2 \land s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow \Seq \; v_1 \; v_2}
\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars\,[]}
\inferrule[P*]{(s_1, r) \rightarrow v \\ (s_2, r^*) \rightarrow \Stars \; vs \\ |v| \neq []\\ \nexists s_3 \; s_4.\; s_3 \neq [] \land s_3@s_4 = s_2 \land s_1@s_3 \in L \; r \land s_4 \in L \; r^*}{(s_1@s_2, r^*)\rightarrow \Stars \; (v::vs)}
\end{mathpar}
\caption{The inductive POSIX rules given by Ausaf et al. \cite{AusafDyckhoffUrban2016}. This ternary relation, written $(s, r) \rightarrow v$, formalises the POSIX constraints on the value $v$ given a string $s$ and regular expression $r$.}\label{fig:POSIXDef}
\end{figure}
\afterpage{\clearpage}
\noindent
The above $\POSIX$ rules follow the intuition described below:
\begin{itemize}
\item (Left Priority)\\
Match the leftmost alternative when multiple options for matching are available. See rules P+L and P+R, where in P+R the string $s$ must not be in the language of $r_1$.
\item (Maximum Munch)\\
Always match a subpart as much as possible before proceeding to the next part of the string. For example, when the string $s$ matches $r_{part1}\cdot r_{part2}$ and there are two ways $s$ can be split, then the split that matches a longer prefix with $r_{part1}$ is preferred. The side-condition
\begin{center}
$\nexists s_3 \; s_4.\; s_3 \neq [] \land s_3 @ s_4 = s_2 \land s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
\end{center}
in PS enforces this.
\end{itemize}
\noindent
These disambiguation strategies can be quite practical. For instance, consider lexing the code snippet
\[ \textit{iffoo} = 3\]
using a regular expression for keywords and identifiers:
\[ r_{keyword} + r_{identifier}.\]
If identifiers are defined as usual (a letter followed by letters, numbers or underscores), then we want $\textit{iffoo}$ to be recognised as an identifier; a match consisting of a keyword (\textit{if}) followed by an identifier (\textit{foo}) would be incorrect. The maximum munch rule of POSIX lexing ensures exactly this: the longer initial match, $\textit{iffoo}$ as a single identifier, is preferred.

\noindent
A POSIX value for a string and a regular expression $r$ is indeed a value inhabited by $r$:
\begin{lemma}
$(s, r) \rightarrow v \implies \vdash v: r$
\end{lemma}
\noindent
The main property of a $\POSIX$ value is that, given the same regular expression $r$ and string $s$, one can always uniquely determine it:
\begin{lemma}
$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2 \quad \textit{then} \; v_1 = v_2$
\end{lemma}
\begin{proof}
By induction on $s$, $r$ and $v_1$. The inductive cases are all the POSIX rules. Probably the most cumbersome cases are the sequence and star with non-empty iterations. We give the details for the sequence case here. When we have
\[ (s_1, r_1) \rightarrow v_1 \;\,\textit{and}\;\, (s_2, r_2) \rightarrow v_2 \;\,\textit{and}\;\, \nexists s_3 \; s_4.\; s_3 \neq [] \land s_3 @ s_4 = s_2 \land s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2,\]
the last condition excludes the possibility of a string $s_1'$ longer than $s_1$ such that
\[(s_1', r_1) \rightarrow v_1' \;\;\textit{and}\;\; (s_2', r_2) \rightarrow v_2'\;\;\textit{and}\;\; s_1' @s_2' = s \]
hold. A shorter string $s_1''$ with $s_2''$ satisfying
\[(s_1'', r_1) \rightarrow v_1''\;\;\textit{and}\;\; (s_2'', r_2) \rightarrow v_2'' \;\;\textit{and}\;\; s_1'' @s_2'' = s \]
cannot possibly form a $\POSIX$ value either, because by definition there is a candidate with a longer initial string, namely $s_1$. Therefore, we know that the $\POSIX$ value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching $s$ must have the property that
\[ |a| = s_1 \;\;\textit{and}\;\; |b| = s_2.\]
The goal is to prove that $a = v_1$ and $b = v_2$. If we have some other $\POSIX$ values $v_{10}$ and $v_{20}$ such that $(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold, then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$, which means this ``other'' $\POSIX$ value $\Seq\; v_{10}\; v_{20}$ is the same as $\Seq\; v_1\; v_2$.
\end{proof}
\noindent
We have now defined what a POSIX value is and shown that it is unique. The problem is to generate such a value in a lexing algorithm based on derivatives.

\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
Sulzmann and Lu extended Brzozowski's derivative-based matching to a lexing algorithm by adding a second phase after the initial phase of successive derivatives. This second phase generates a POSIX value if the regular expression matches the string. The algorithm uses two functions called $\inj$ and $\mkeps$. The function $\mkeps$ constructs a POSIX value from the last derivative $r_n$:
\begin{ceqn}
\begin{equation}\label{graph:mkeps}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
& & & v_n
\end{tikzcd}
\end{equation}
\end{ceqn}
\noindent
In the above diagram, we again assume that the input string $s$ consists of $n$ characters $c_0c_1 \ldots c_{n-1}$. The last derivative operation $\backslash c_{n-1}$ generates the derivative $r_n$, for which $\mkeps$ produces the value $v_n$. This value tells us how the empty string is matched by the (nullable) regular expression $r_n$, in a POSIX way. The definition of $\mkeps$ is
\begin{center}
\begin{tabular}{lcl}
$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
$\mkeps \; (r_{1}+r_{2})$ & $\dn$ & $\textit{if}\; (\nullable \; r_{1}) \;\, \textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\
& & $\phantom{\textit{if}}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
$\mkeps \; (r^*) $ & $\dn$ & $\Stars\;[]$
\end{tabular}
\end{center}
\noindent
The function prefers the left child $r_1$ of $r_1 + r_2$ to match the empty string if there is a choice. When a star matches the empty string, we give the $\Stars$ constructor an empty list, meaning no iteration is taken. The result of $\mkeps$ on a $\nullable$ regular expression $r$ is a POSIX value for $r$ and the empty string:
\begin{lemma}\label{mePosix}
$\nullable\; r \implies ([], r) \rightarrow (\mkeps\; r)$
\end{lemma}
\begin{proof}
By induction on $r$.
\end{proof}
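In the Scala sketch, $\mkeps$ becomes the following partial function, defined only on nullable regular expressions:
\begin{verbatim}
// mkeps: the POSIX value for how a nullable r matches the empty
// string (partial: ZERO and CHAR are not nullable)
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}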
\noindent
After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one, in reverse order compared to how they were chopped off in the derivative phase. The function for this is called $\inj$. This function operates on values, unlike $\backslash$ which operates on regular expressions. In the diagram below, $v_i$ stands for the (POSIX) value describing how the regular expression $r_i$ matches the string $s_i$ consisting of the last $n-i$ characters of $s$ (i.e.~$s_i = c_i \ldots c_{n-1}$); it is computed from the previous value $v_{i+1}$. After injecting back $n$ characters, we obtain the value describing how $r_0$ matches $s$.
\begin{figure}[H]
\begin{center}
\begin{ceqn}
\begin{tikzcd}
r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{ceqn}
\end{center}
\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016}, matching the regular expression $r_0$ against a string of the form $[c_0, c_1, \ldots, c_{n-1}]$. The first phase takes successive derivatives w.r.t.~the characters $c_0$, $c_1$, and so on; these are the same operations as in the matcher shown in (\ref{matcher}). When the final derivative is nullable (contains the empty string), the second phase starts. First, $\mkeps$ generates a POSIX value describing how $r_n$ matches the empty string, by always selecting the leftmost nullable regular expression. After that, $\inj$ ``injects'' back the characters in reverse order compared to how they appeared in the string, always preserving POSIXness.}\label{graph:inj}
\end{figure}
\noindent
The function $\textit{inj}$ as defined by Sulzmann and Lu takes three arguments: the regular expression $r_{i}$ before the character was chopped off, the character $c_{i}$ we want to inject back, and the value $v_{i+1}$ into which we inject. The result of an application $\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
\[ (s_i, r_i) \rightarrow v_i\]
holds. The definition of $\textit{inj}$ is as follows:
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
$\textit{inj}\;(c)\;c\;\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\;v)$\\
$\textit{inj}\;(r_1 + r_2)\;c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ & $\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2))$ & $\dn$ & $\Seq \; (\textit{inj}\;r_1\;c\;v_1)\; v_2$\\
$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
\end{tabular}
\end{center}
\noindent
The function recurses on the shape of the regular expression and the value. Intuitively, each clause analyses how $r_i$ could have been transformed when being derived by $c$, identifying which subpart of $v_{i+1}$ has the ``hole'' into which the character is injected back. Once the character is injected into that sub-value, $\inj$ assembles all parts to form a new value. For instance, the last clause is an injection into a sequence value $v_{i+1}$ whose second child value is a star, while the shape of the regular expression $r_i$ before injection is a star. We therefore know that the derivative step started on a star and ended in a sequence:
\[ r^* \backslash c \longrightarrow (r\backslash c) \cdot r^*\]
during which an iteration of the star has just been unfolded, giving the value inhabitation relation
\[ \vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.\]
The value list $vs$ corresponds to the already matched star iterations, and the ``hole'' lies in $v$ because
\[ \vdash v: r\backslash c.\]
Finally, $\inj \; r \;c \; v$ is prepended to the previous list of iterations and then wrapped under the $\Stars$ constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.

Putting everything together, Sulzmann and Lu obtained the following algorithm, outlined in the injection-based lexing diagram of Figure \ref{graph:inj}:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; [] $ & $\dn$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
$\lexer \; r \;(c::s)$ & $\dn$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
& & $\quad \phantom{\mid}\; \None \Rightarrow \None$\\
& & $\quad \mid \Some(v) \Rightarrow \Some(\inj \; r\; c\; v)$
\end{tabular}
\end{center}
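The corresponding Scala sketch of $\inj$ and $\lexer$ is given below; the pattern match of \texttt{inj} is deliberately partial, covering exactly the value shapes that can arise from a derivative:
\begin{verbatim}
// inj reverses one derivative step on the value level: given r (before
// the derivative), the character c that was taken away, and a value
// for r\c, it produces a value for r itself
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty)                 => Chr(d)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}

// the lexer: derivatives forward, mkeps at the end, injections backward
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil     => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs).map(v => inj(r, c, v))
}
\end{verbatim}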
\subsection{Examples of How Injection and the Lexer Work}
We now provide a few examples of how $\inj$ and $\lexer$ work by showing their values in each recursive call on some concrete inputs. We start with the call $\lexer \; ((a+aa)^*\cdot c)\;\; aac$. Note that the character value $\Char \;c$ is abbreviated as $c$ for readability. Similarly, a sub-expression of the last derivative is abbreviated as $r_{=0}$\footnote{It is equal to $((\ZERO + (\ZERO a+ \ZERO))\cdot(a+aa)^* + (\ZERO + \ZERO a)\cdot (a+aa)^*)+((\ZERO+\ZERO a)\cdot(a+aa)^* + (\ZERO+\ZERO a)\cdot(a+aa)^*)$.}; its language is equivalent to that of $\ZERO$, so displaying it fully expanded is not crucial: it will never be injected into.
\begin{center}
\begin{tabular}{lcl}
$(a+\textcolor{magenta}{a}\textcolor{blue}{a})^* \cdot \textcolor{red}{c}$ & $\stackrel{\backslash \textcolor{magenta}{a}}{\rightarrow}$ & $((\ONE + \textcolor{magenta}{\ONE} \textcolor{blue}{a})\cdot (a+aa)^*)\cdot \textcolor{red}{c}+\ZERO$\\
& $\stackrel{\backslash \textcolor{blue}{a}}{\rightarrow}$ & $(((\ZERO+(\ZERO a+ \textcolor{blue}{\ONE}))\cdot (a+aa)^* + (\ONE+\ONE a)\cdot (a+aa)^* )\cdot \textcolor{red}{\mathbf{c}} + \ZERO)+\ZERO$\\
& $\stackrel{\backslash \textcolor{red}{c}}{\rightarrow}$ & $((r_{=0}\cdot c + \textcolor{red}{\ONE})+\ZERO)+\ZERO$\\
& $\stackrel{\mkeps}{\rightarrow}$ & $\Left (\Left \; (\Right \; \textcolor{red}{\Empty}))$ \\
& $\stackrel{\inj \;\textcolor{red}{c} }{\rightarrow}$ & $\Left \; (\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; \textcolor{blue}{\Empty})) \; \Stars \, [])) \; \textcolor{red}{c}))$\\
& $\stackrel{\inj \;\textcolor{blue}{a}}{\rightarrow}$ & $\Left\; (\Seq \; (\Seq\; (\Right \; (\Seq \; \textcolor{magenta}{\Empty }\; \textcolor{blue}{ \mathbf{a}}) ) \;\Stars \,[]) \; \textcolor{red}{c})$\\
& $\stackrel{\inj \;\textcolor{magenta}{a}}{\rightarrow}$ & $\Seq \; (\Stars \; [\Right \; (\Seq \; \mathbf{\textcolor{magenta}{a}} \; \textcolor{blue}{a})]) \; \textcolor{red}{ c}$
\end{tabular}
\end{center}
\noindent
We have assigned a different colour to each character, as well as to its corresponding location in the values and regular expressions. The most recently injected character is marked with a bold font.
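Under the Scala sketch, this worked example corresponds to the following call; the expected result is our reading of the last line of the trace above:
\begin{verbatim}
// the worked example: (a + aa)* . c matched against "aac"
val r = SEQ(STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a')))), CHAR('c'))

// expected result, reading off the last line of the trace:
//   Some(Sequ(Stars(List(Right(Sequ(Chr(a), Chr(a))))), Chr(c)))
println(lexer(r, "aac".toList))
\end{verbatim}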
To show the details of how $\inj$ works, we zoom in on the second injection above and spell out the recursive calls involved:
\begin{center}
\begin{tabular}{l}
$\inj \; (((\ONE + \ONE \textcolor{blue}{a})\cdot (a+aa)^*)\cdot c + \ZERO) \;\quad \textcolor{blue}{a}$\\
$\quad\quad (\Left \; (\Left \; (\Seq \;(\Left \; (\Seq \; (\Right \; (\Right\; \textcolor{blue}{\Empty})) \; \Stars \, [])) \; c)))$\\
$=\Left \; (\inj \; (((\ONE + \ONE \textcolor{blue}{a})\cdot (a+aa)^*)\cdot c) \;\quad \textcolor{blue}{a} \;\quad (\Left \; (\Seq\; (\Left \; (\Seq \; (\Right \; (\Right\; \textcolor{blue}{\Empty})) \; \Stars \, [])) \; c)))$\\
$=\Left \; (\Seq \; (\inj \; ((\ONE + \ONE \textcolor{blue}{a})\cdot(a+aa)^*) \;\quad \textcolor{blue}{a} \;\quad (\Left \; (\Seq \; (\Right \; (\Right\;\textcolor{blue}{\Empty})) \; \Stars \, []))) \; c)$\\
$=\Left \; (\Seq \; (\Seq \; (\inj \; (\ONE + \ONE \textcolor{blue}{a}) \;\quad \textcolor{blue}{a} \;\quad (\Right \;(\Right \; \textcolor{blue}{\Empty}))) \; \Stars \,[])\; c)$\\
$=\Left \; (\Seq \; (\Seq \; (\Right \; (\inj \; (\ONE \cdot \textcolor{blue}{a}) \;\quad \textcolor{blue}{a} \;\quad (\Right \;\textcolor{blue}{\Empty}))) \; \Stars \, [])\; c)$\\
$=\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; (\mkeps \; \ONE)\;(\inj \;\textcolor{blue}{a} \; \textcolor{blue}{a} \; \textcolor{blue}{\Empty}))) \; \Stars \, [])\; c)$\\
$=\Left \; (\Seq \; (\Seq \; (\Right \; (\Seq \; \Empty \; \textcolor{blue}{a})) \; \Stars \, [])\; c)$
\end{tabular}
\end{center}
\noindent
We now introduce the proofs related to $\inj$ and $\lexer$. These proofs were originally given by Ausaf et al.\ in their 2016 work \cite{AusafDyckhoffUrban2016}; some of the lemmas are also discussed in more detail in Ausaf's thesis \cite{Ausaf}. Nevertheless, we introduce these proofs here to make this thesis self-contained, because we will show later that the inductive invariants used in them break once more aggressive simplifications are introduced. Recall that Lemma \ref{mePosix} tells us that $\mkeps$ always generates a POSIX value. The function $\inj$ preserves POSIXness, provided the value before injection is POSIX:
\begin{lemma}\label{injPosix}
If $(s, r \backslash c) \rightarrow v$, then $(c :: s, r) \rightarrow (\inj \; r \; c\; v)$.
\end{lemma}
\begin{proof}
By induction on $r$. The non-trivial cases are sequence and star. When $r = a \cdot b$, there are three possible shapes of the value $v$ satisfying $\vdash v: (a\cdot b)\backslash c$. We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each case.
\begin{itemize}
\item $v = \Seq \; v_a \; v_b$.\\
The ``not nullable'' clause of the $\inj$ function applies:
\begin{center}
\begin{tabular}{lcl}
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
\end{tabular}
\end{center}
We know that there exists a unique pair of strings $s_a$ and $s_b$ satisfying $(s_a, a \backslash c) \rightarrow v_a$, $(s_b, b) \rightarrow v_b$, and $\nexists s_3 \; s_4.\; s_3 \neq [] \land s_3 @ s_4 = s_b \land s_a @ s_3 \in L \; (a\backslash c) \land s_4 \in L \; b$.
The last condition gives us $\nexists s_3 \; s_4.\; s_3 \neq [] \land s_3 @ s_4 = s_b \land (c :: s_a)@ s_3 \in L \; a \land s_4 \in L \; b$. By induction hypothesis, $(c::s_a, a) \rightarrow \inj \; a \; c \; v_a$ holds, and this gives us
\[ ((c::s_a)@s_b, a\cdot b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b. \]
\item $v = \Left \; (\Seq \; v_a \; v_b)$.\\
The argument is almost identical to the case above, except that a different clause of $\inj$ applies:
\begin{center}
\begin{tabular}{lcl}
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
\end{tabular}
\end{center}
With similar reasoning,
\[ ((c::s_a)@s_b, a\cdot b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b \]
holds again.
\item $v = \Right \; v_b$.\\
Here the injection result is
\begin{center}
\begin{tabular}{lcl}
$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Right \; v_b) $\\
& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
\end{tabular}
\end{center}
We know that $a$ must be nullable, allowing us to call $\mkeps$ and obtain
\[ ([], a) \rightarrow \mkeps \; a. \]
Also, by induction hypothesis
\[ (c::s, b) \rightarrow \inj\; b \; c \; v_b \]
holds. In addition, since the POSIX value for $s$ and $(a\cdot b)\backslash c$ is $\Right \;v_b$ rather than $\Left \ldots$, it must be the case that $s \notin L \;( (a\backslash c)\cdot b)$. This tells us that
\[ \nexists s_3 \; s_4.\; s_3 @s_4 = s \land s_3 \in L \; (a\backslash c) \land s_4 \in L \; b, \]
which translates to
\[ \nexists s_3 \; s_4. \; s_3 \neq [] \land s_3 @s_4 = c::s \land s_3 \in L \; a \land s_4 \in L \; b \]
(which says that the initial part of $c::s$ matched by $a$ cannot be anything other than the empty string). Therefore $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$ is the POSIX value for $c::s$ and $a\cdot b$.
\end{itemize}
The star case can be proven similarly.
\end{proof}
\noindent
The central property of $\lexer$ is that it gives the correct result with respect to the POSIX rules:
\begin{theorem}\label{lexerCorrectness}
The $\lexer$ based on derivatives and injections is correct:
\begin{center}
\begin{tabular}{lcl}
$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (s, r) \rightarrow v$\\
$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v.\; (s, r) \rightarrow v)$
\end{tabular}
\end{center}
\end{theorem}
\begin{proof}
By induction on $s$, generalising over $r$. The $[]$ case is proven by an application of Lemma \ref{mePosix}, and the inductive case by Lemma \ref{injPosix}.
\end{proof}
\noindent
As we did earlier in this chapter with the matcher, one can introduce simplification of the regular expression in each derivative step. However, because of lexing, one needs a backward phase (with respect to the forward derivative phase), and in it the values must align with the regular expression at each step. One therefore has to be careful not to break correctness, as the injection function relies heavily on the structure of the regular expressions and values being aligned. This can be achieved by recording extra \emph{rectification} functions during the derivative steps and applying these rectifications during the injection phase. With extra care one can show that POSIXness is not affected by the simplification rules given earlier in this chapter \cite{AusafDyckhoffUrban2016}.
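To give an impression of what such rectification functions could look like, below is a sketch (our own, following the general recipe of Ausaf et al. \cite{AusafDyckhoffUrban2016}) for the alternative case only; the sequence case is analogous:
\begin{verbatim}
// Sketch: simplification that additionally returns a rectification
// function on values. If simpRect(r) == (rs, f) and v is a POSIX value
// for rs, then f(v) is the corresponding value for the original r.
def simpRect(r: Rexp): (Rexp, Val => Val) = r match {
  case ALT(r1, r2) =>
    val (r1s, f1) = simpRect(r1)
    val (r2s, f2) = simpRect(r2)
    (r1s, r2s) match {
      // left branch simplified away: the value must come from the right
      case (ZERO, _) => (r2s, v => Right(f2(v)))
      // right branch simplified away: the value must come from the left
      case (_, ZERO) => (r1s, v => Left(f1(v)))
      // both branches kept: rectify underneath the Left/Right constructor
      case _ => (ALT(r1s, r2s), (v: Val) => v match {
        case Left(w)  => Left(f1(w))
        case Right(w) => Right(f2(w))  // alternative values are Left/Right
      })
    }
  case _ => (r, v => v)  // no simplification, identity rectification
}
\end{verbatim}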
However, even with these simple-minded simplification rules, an injection-based lexer can still end up with exploding derivatives.

\section{A Case Requiring More Aggressive Simplifications}
For example, when starting with the regular expression $(a^* \cdot a^*)^*$ and building just over a dozen successive derivatives w.r.t.~the character $a$, one obtains a derivative regular expression with millions of nodes (when viewed as a tree), even with the mentioned simplifications.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture}
\begin{axis}[
 xlabel={$n$},
 ylabel={size},
 legend entries={Simple-Minded Simp, Naive Matcher},
 legend pos=north west,
 legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Size of the derivatives of $(a^*\cdot a^*)^*$ w.r.t.~$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}\label{fig:BetterWaterloo}
\end{figure}
That is because Sulzmann and Lu's injection-based lexing algorithm keeps a lot of ``useless'' values that will never be used.
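The blowup shown in Figure \ref{fig:BetterWaterloo} can be reproduced with the Scala sketch by measuring the size of successive simplified derivatives; the \texttt{size} function below counts the nodes of a regular expression viewed as a tree:
\begin{verbatim}
// node count of a regular expression, viewed as a tree
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2)          => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
  case STAR(r1)             => 1 + size(r1)
}

// sizes of simplified derivatives of (a*.a*)* w.r.t. a^n
val r0 = STAR(SEQ(STAR(CHAR('a')), STAR(CHAR('a'))))
for (n <- 1 to 12)
  println(s"$n: ${size(dersSimp(List.fill(n)('a'), r0))}")
\end{verbatim}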
These different ways of matching grow exponentially with the string length. Consider
\[ r= (a^*\cdot a^*)^* \quad \textit{and} \quad s=\underbrace{aa\ldots a}_\text{n \textit{a}s}\]
as an example. This is a highly ambiguous regular expression: there are many ways to split the string into segments for different star iterations, and for each segment multiple ways of splitting it between the two $a^*$ sub-expressions. When $n$ is equal to $1$, there are two lexical values for the match:
\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v_1)\]
and
\[ \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v_2)\]
The simplified derivative $\derssimp \;s \; r$ is
\[ (a^*a^* + a^*)\cdot(a^*a^*)^*.\]
The $a^*a^*$ and $a^*$ in the first child of the above sequence correspond to $v_1$ and $v_2$, respectively. When $n=2$, the number of values goes up to 7:
\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]\]
\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]\]
\[ \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]\]
\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]\]
\[ \Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) ] \]
\[ \Stars \; [ \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]) ] \]
and
\[ \Stars \; [ \Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]), \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []) ] \]
The corresponding simplified derivative $\derssimp \; aa \; (a^*a^*)^*$ is
\[ ((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*,\]
which contains only five alternatives; the simplification has removed two of the seven terms corresponding to the seven distinct lexical values. It is not surprising that there are exponentially many distinct lexical values that cannot be eliminated by the simple-minded simplification of $\derssimp$.
A lexer without a good enough strategy for removing such duplicates will naturally have an exponential runtime on highly ambiguous regular expressions, because there are exponentially many matches. For this particular example, the number of distinct matches seems to grow at a rate proportional to $(2n)!/(n!\,(n+1)!)$, the Catalan numbers, where $n$ is the input length. On the other hand, the $\POSIX$ value for $r= (a^*\cdot a^*)^*$ and $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is simply
\[ \Stars\, [\Seq \; (\Stars\,[\underbrace{\Char\; a,\ldots,\Char\; a}_\text{n iterations}])\; (\Stars\,[])].\]
At any moment, the subterms in a derivative that can potentially contribute to a POSIX value are only a minority among all terms, and one can remove the ones that cannot possibly be POSIX. In the above example, the derivative
\begin{equation}\label{eqn:growth2}
((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}
\end{equation}
can be simplified further by first removing the duplicate of $A$; this makes the two top-level summands identical and opens up the possibility of removing the underbraced right summand as a duplicate as well. The result would be
\[ (\underbrace{a^*a^*}_\text{term 1} + \underbrace{a^*}_\text{term 2})\cdot(a^*a^*)^*\]
with corresponding values
\begin{center}
\begin{tabular}{lr}
$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\
$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
\end{tabular}
\end{center}
Terms corresponding to other values, such as
\[ \Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])],\]
do not contribute to the POSIX lexical value and can therefore be thrown away. Ausaf et al. \cite{AusafDyckhoffUrban2016} came up with some simplification steps and incorporated them into $\lexer$. They call the resulting lexer $\textit{lexer}_{simp}$ and proved that
\[ \lexer \; r\; s = \textit{lexer}_{simp} \; r \; s.\]
The function $\textit{lexer}_{simp}$ involves some fiddly manipulation of value rectification, which we omit here. However, those simplification steps are not yet strong enough to achieve the effects described above. And even with these relatively mild simplifications, the proof is already quite a bit more complicated than that of Theorem \ref{lexerCorrectness}. One would need to prove something like this:
\[ \textit{If}\; (s, \textit{snd} \; (\textit{simp} \; (r\backslash c))) \rightarrow v \;\; \textit{then}\;\; (c::s, r) \rightarrow \inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; (r \backslash c)))\; v)\]
instead of the simple Lemma \ref{injPosix}. Now $\textit{simp}$ not only has to return a simplified regular expression, but also a record of which simplifications have been applied, in the form of a function on values that transforms a value for the simplified regular expression back into one for the unsimplified one. We therefore choose a slightly different approach, also described by Sulzmann and Lu, to get better simplifications; it uses augmented data structures compared to plain regular expressions, which we call \emph{annotated} regular expressions. With annotated regular expressions, we can avoid creating the intermediate values $v_1,\ldots,v_n$ and the second injection phase altogether. We introduce this new datatype and the corresponding algorithm in the next chapter.