% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}
\def\DFA{\textit{DFA}}
\def\bmkeps{\textit{bmkeps}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like

Regular expressions are widely used in computer science: be it in IDEs
with syntax highlighting and auto-completion, command line tools like
$\mathit{grep}$ that facilitate text processing, network intrusion
detection systems that reject suspicious traffic, or compiler front
ends: there is always an important phase of the task that involves
matching a regular expression against a string. Given their usefulness
and ubiquity, one might assume that modern regular expression matching
implementations are mature and fully studied. Indeed, if you take the
regex engine of a popular programming language, you can supply it with
your own regexes and strings, and obtain matching/lexing information
such as which sub-part of the regex matches which sub-part of the
string. These lexing libraries are on average quite fast.
%TODO: get source for SNORT/BRO's regex matching engine/speed
For example, the regex engines used by some network intrusion
detection systems are able to process megabytes or even gigabytes of
network traffic per second.

Why, then, do we need our own version if the existing algorithms are
already blindingly fast? It turns out that this is not always the
case. Take $(a^*)^*\,b$ and ask whether strings of the form $aa..a$
match this regular expression. Obviously this is not the case---the
expected $b$ in the last position is missing. One would expect that
modern regular expression matching engines can find this out very
quickly.
Alas, if one tries this example in JavaScript, Python or Java 8 with
strings of 28 $a$'s, one discovers that the decision takes around 30
seconds, and considerably longer when adding a few more $a$'s, as the
graphs below show:

\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$}, x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$}, x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$}, x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}

This is clearly exponential behaviour, and it is triggered by a
relatively simple regex pattern.
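The blowup is easy to reproduce on the JVM. The following small Scala
program is a sketch of how the Java~8 measurements above can be
obtained (the exact numbers will of course vary with hardware and JVM
version); it times Java's built-in \texttt{java.util.regex} engine on
exactly the inputs from the graphs:

\begin{verbatim}
import java.util.regex.Pattern

object EvilRegex extends App {
  val evil = Pattern.compile("(a*)*b")
  for (n <- 20 to 30) {
    val input = "a" * n              // n a's, no final b
    val start = System.nanoTime()
    evil.matcher(input).matches()    // forced to backtrack
    val secs = (System.nanoTime() - start) / 1e9
    println(f"n = $n%2d: $secs%.2f secs")
  }
}
\end{verbatim}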
This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was
submitted, but importantly the white spaces were neither at the
beginning nor at the end. As a result, the regular expression matching
engine needed to backtrack over many choices. In this example, the
time needed to process the string was $O(n^2)$ with respect to the
string length. This quadratic overhead was enough for the homepage of
Stack Exchange to respond so slowly that the load balancer assumed a
$\mathit{DoS}$ attack and therefore stopped the servers from
responding to any requests. This made the whole site unavailable.

A more recent example is a global outage of all Cloudflare servers on
2 July 2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the
outage had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
%TODO: data points for some new versions of languages
These problems with regular expressions are not isolated events that
happen very occasionally; they are in fact quite widespread. They
occur so often that they have been given a name:
Regular-Expression-Denial-Of-Service (ReDoS) attacks. Davis et
al.~\parencite{Davis18} detected more than 1000 super-linear (SL)
regular expressions in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions are more than
``a parlour trick'': they are a problem that requires more research
attention.

\section{Why are current regex matching algorithms slow?}

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear? How can one explain
the super-linear behaviour of the regex matching engines we have? The
time cost of a regex matching algorithm in general involves two
phases: the construction phase, in which the algorithm builds some
suitable data structure from the input regex $r$ (we denote its time
cost by $P_1(r)$), and the lexing phase, in which the input string $s$
is read and the data structure representing the regex $r$ is operated
on (we denote its time cost by $P_2(r, s)$).\\
In the case of a $\mathit{DFA}$, we have $P_2(r, s) = O(|s|)$, because
we take at most $|s|$ steps, and each step takes at most one
transition: a deterministic finite automaton by definition has at most
one state active and takes at most one transition upon receiving an
input symbol. But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.\\
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{$n$ copies
of $r$}}$. The lexing cost $P_2(r, s)$ is bounded by $O(|r|\cdot|s|)$
if we do not backtrack. If backtracking is used, however, the
worst-case time bound bloats to $O(|r| \cdot 2^{|s|})$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.

\subsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools, written in $C$ and
$\mathit{JAVA}$ respectively, that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions and
configurations to such a lexer generator, and obtains an output
program encoding a minimized $\mathit{DFA}$ that can be compiled and
run. The good thing about $\mathit{DFA}$s is that, once generated,
they are fast and stable, unlike backtracking algorithms. However,
they do not scale well with bounded repetitions.\\
A bounded repetition, usually written in the form $r^{\{c\}}$ (where
$c$ is a constant natural number), denotes a regular expression
accepting strings that can be divided into $c$ substrings, where each
substring is matched by $r$.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$, an $\mathit{NFA}$
describing it would look like:

\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
          edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}

The red states are ``countdown states'', which count down the number
of characters still needed, in addition to the current string, for a
successful match. For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, has just
consumed the ``delimiter'' $a$ in the middle, and needs to match 2
more iterations of $(a|b)$ to complete. State $q_2$, on the other
hand, can be viewed as a state after $q_1$ has consumed 1 character,
and waits for 1 more character to complete. $q_3$ is the last state,
requiring 0 more characters, and is accepting. Depending on the suffix
of the input string up to the current read location, the states $q_1$,
$q_2$ and $q_3$ may or may not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would contain at least
$2^3$ non-equivalent states that cannot be merged, because the subset
construction during determinisation will generate all the elements in
the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. Generalising this to
regular expressions with larger bounded repetition numbers, regexes
shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
require at least $2^{n+1}$ states if $r$ contains more than 1 string.
This is needed to represent all the different scenarios in which
``countdown'' states are active. For such regexes, tools like
$\mathit{JFLEX}$ generate gigantic $\mathit{DFA}$s or simply run out
of memory. For this reason, regex libraries that support bounded
repetitions often choose to use the $\mathit{NFA}$ approach.

\subsection{The $\mathit{NFA}$ approach to regex matching}
One can simulate an $\mathit{NFA}$ in two ways. The first is to keep
track of all active states after consuming a character, and to update
that set of states iteratively. This can be viewed as a breadth-first
search of the $\mathit{NFA}$ for a path terminating at an accepting
state. Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this type
of $\mathit{NFA}$ simulation, which guarantees a linear runtime in
terms of input string length.
%TODO:try out these lexers
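As an illustration, below is a small Scala sketch of this state-set
simulation for the $\mathit{NFA}$ above. The transition relation
\texttt{delta} is hard-coded for this particular automaton, and the
input alphabet is assumed to be $\{a,b\}$:

\begin{verbatim}
// all successor states of state q for input character c
def delta(q: Int, c: Char): Set[Int] = (q, c) match {
  case (0, 'a') => Set(0, 1)  // loop on q0, or take the middle a
  case (0, 'b') => Set(0)
  case (1, _)   => Set(2)     // countdown: 2 more characters needed
  case (2, _)   => Set(3)     // countdown: 1 more character needed
  case _        => Set()      // q3 has no outgoing transitions
}

def accepts(s: String): Boolean = {
  val finals = s.foldLeft(Set(0)) {        // start in q0 only
    (states, c) => states.flatMap(delta(_, c))
  }
  finals.contains(3)                       // q3 is the accepting state
}

// accepts("aab") == true, accepts("bab") == false
\end{verbatim}

At each step the whole set of active states is advanced at once, so
the runtime stays linear in the length of the input (times the number
of states), at the price of carrying the state set around.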
The other way to use an $\mathit{NFA}$ for matching is to choose a
single transition each time, keep all the other options in a queue or
stack, and backtrack if that choice eventually fails. This method,
often called depth-first search, is efficient in many cases, but can
end up with exponential run time.\\
%TODO:COMPARE java python lexer speed with Rust and Go
The reason backtracking algorithms are used in languages like Java and
Python is that they support back-references.

\subsection{Back References in Regexes: a Non-Regular Feature}
If we have a regular expression like this (the sequence operator is
omitted for brevity):
\begin{center}
	$r_1(r_2(r_3r_4))$
\end{center}
we can label sub-expressions of interest by parenthesising them and
numbering them by the order in which their opening parentheses appear.
One possible way of parenthesising and labelling is given below:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions $r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are
labelled by 1 to 4: label $1$ refers to the entire expression
$(r_1(r_2(r_3)(r_4)))$, label $2$ to $r_2(r_3)(r_4)$, and so on. These
sub-expressions are called ``capturing groups''. We can use the
following syntax to denote that we want a string just matched by a
sub-expression (capturing group) to appear again at a certain
location, exactly as it was:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
The backslash followed by the number $i$ denotes such a so-called
``back-reference''. Let $e$ be an expression made of regular
expressions and back-references, containing the expression $e_i$ as
its $i$-th capturing group. The semantics of back-references can be
recursively written as:
\begin{center}
	\begin{tabular}{c}
		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(e_i)$\\
		$s_i\; \text{is the match of ($e$, $s$)'s $i$-th capturing group}\}$
	\end{tabular}
\end{center}
As a concrete example, $((a|b|c|\ldots|z)^*)\backslash 1$ matches
strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.
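Such back-references are readily available in mainstream engines. For
instance, with Java's \texttt{java.util.regex} (used here from Scala;
\texttt{\char`\\1} refers back to whatever capturing group 1 matched):

\begin{verbatim}
import java.util.regex.Pattern

val doubled = Pattern.compile("([a-z]*)\\1")
println(doubled.matcher("bobo").matches())    // true:  group 1 = "bo"
println(doubled.matcher("weewee").matches())  // true:  group 1 = "wee"
println(doubled.matcher("bob").matches())     // false: not of form ww
\end{verbatim}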
Back-references are a construct in the ``regex'' standard that
programmers find quite useful, but they are not regular any more. In
fact, back-references allow regexes to express languages that cannot
be contained in the context-free languages either. For example, the
back-reference expression $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free
grammars \parencite{campeanu2003formal}. Such a language is contained
in the context-sensitive level of the formal language hierarchy.
Matching expressions with back-references is
NP-complete \parencite{alfred2014algorithms}, and a non-backtracking,
efficient solution is not known to exist.
%TODO:read a bit more about back reference algorithms
It seems that languages like Java and Python made the trade-off of
supporting back-references at the expense of having to backtrack, even
for regexes not involving back-references.\\
Summing up, we can categorise existing practical regex libraries into
ones with linear-time guarantees, like those of Go and Rust, which
impose restrictions on the user input (no back-references; bounded
repetitions cannot exceed 1000, etc.), and ones that allow the
programmer much more freedom, but grind to a halt on some
non-negligible portion of inputs.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?

\section{Buggy Regex Engines}

Another problem with these libraries is that they offer no correctness
guarantee. In some cases they either fail to generate a lexing result
when there is a match, or produce the wrong match. So regex libraries
not only suffer from exponential backtracking, but also from undesired
(or even buggy) outputs.
%TODO: comment from who
Kuklewicz \parencite{KuklewiczHaskell} commented that most regex
libraries do not correctly implement the POSIX (maximal-munch) rule of
regular expression matching. This experience is echoed by the author's
experiments with a few online regex testers. A concrete example is the
regex
\begin{verbatim}
(((((a*a*)b*)b){20})*)c
\end{verbatim}
and the string
\begin{verbatim}
baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
\end{verbatim}
This seemingly complex regex simply says ``some $a$'s followed by some
$b$'s, then followed by 1 single $b$, iterated 20 times, and finally
followed by a $c$''. A POSIX match would involve the entire string,
``eating up'' all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly
This regex triggers catastrophic backtracking in languages like Python
and Java, whereas it gives a non-POSIX and uninformative match in
languages like Go or .NET: the match consisting of only the character
$c$. As Grathwohl \parencite{grathwohl2014crash} commented,
\begin{center}
	``The POSIX strategy is more complicated than the greedy because of
	the dependence on information about the length of matched strings in the
	various subexpressions.''
\end{center}

%\section{How people solve problems with regexes}
When a regular expression does not behave as intended, people usually
try to rewrite the regex into some equivalent form, or they try to
avoid the possibly problematic patterns
completely \parencite{Davis18}, an approach with many false positives.
Interactive tools for ``debugging'' regular expressions are also quite
popular: regexploit \parencite{regexploit2021} and
regex101 \parencite{regex101}, to name a few. There is also static
analysis work on regular expressions that aims to detect potentially
exponential regex patterns. Rathnayake and
Thielecke \parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential behaviour in
backtracking matchers. Weideman \parencite{Weideman2017Static} came up
with non-linear polynomial worst-case estimates for regexes, attack
strings that exploit the worst-case scenario, and ``attack automata''
that generate attack strings.
%Arguably these methods limits the programmers' freedom
%or productivity when all they want is to come up with a regex
%that solves the text processing problem.
%TODO:also the regex101 debugger

\section{Our Solution: a Formal Specification of POSIX Matching and Brzozowski Derivatives}
Is it possible to have a regex lexing algorithm with proven
correctness and proven time complexity, which moreover allows easy
extensions to constructs like bounded repetitions, negation,
lookarounds, and even back-references?
We propose Brzozowski's derivatives of regular expressions as a
solution to this. In the last fifteen or so years, Brzozowski's
derivatives of regular expressions have sparked quite a bit of
interest in the functional programming and theorem prover communities.
The beauty of Brzozowski's derivatives \parencite{Brzozowski1964} is
that they are neatly expressible in any functional language, and
easily definable and reasoned about in theorem provers---the
definitions just consist of inductive datatypes and simple recursive
functions. Moreover, an algorithm based on them, by Sulzmann and
Lu \parencite{Sulzmann2014}, allows easy extension to extended regular
expressions, and the simplification of its internal data structures
eliminates the exponential behaviours.

%----------------------------------------------------------------------------------------

\section{Our Contribution}

This work addresses the vulnerability of super-linear and buggy regex
implementations by combining Brzozowski's derivatives with interactive
theorem proving. We give an improved version of Sulzmann and Lu's
bit-coded algorithm using derivatives, which comes with a formal
guarantee in terms of correctness and running time as an Isabelle/HOL
proof. We then improve the algorithm with an even stronger version of
simplification, and prove a time bound linear in the length of the
input and cubic in the size of the regular expression, using a
technique by Antimirov.

The main contribution of this thesis is a proven-correct lexing
algorithm with formalised time bounds. To the best of our knowledge,
there is no lexing library using Brzozowski derivatives that comes
with a provable time guarantee; claims about running time are usually
speculative and backed by thin empirical evidence.
%TODO: give references
For example, Sulzmann and Lu proposed an algorithm for which they
claimed a linear running time. But that claim was falsified by our
experiments: the running time is actually $\Omega(2^n)$ in the worst
case. A similar claim about a theoretical runtime of $O(n^2)$ is made
for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives. Its
authors formalised the correctness of the lexer, but not its
complexity. In their performance evaluation section, they simply
analysed the running time of matching $a$ with the string
$\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$ and concluded that the
algorithm is quadratic in terms of input length. When we tried out
their extracted OCaml code with our example $(a+aa)^*$, the time it
took to lex just 40 $a$'s was 5 minutes. We believe our result, a
proof of performance on general inputs rather than on specific
examples, is a novel contribution.\\

\subsection{Related Work}
We are aware of a mechanised correctness proof of Brzozowski's
derivative-based matcher in HOL4 by Owens and
Slind~\parencite{Owens2008}. Another one, in Isabelle/HOL, is part of
the work by Krauss and Nipkow \parencite{Krauss2011}. Yet another, in
Coq, is given by Coquand and Siles \parencite{Coquand2012}. Also,
Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.

%We propose Brzozowski's derivatives as a solution to this problem.
% about Lexing Using Brzozowski derivatives

\section{Preliminaries}

Suppose we have an alphabet $\Sigma$; the set of strings whose
characters are drawn from $\Sigma$ is written $\Sigma^*$. We use
patterns to define sets of strings concisely.
Regular expressions are one such pattern system. The basic regular
expressions are defined inductively by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2
			 \mid r^*
\]
The language, or set of strings, denoted by a regular expression is
defined as:
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\varnothing$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; r_1 \wedge s_2 \in L \; r_2\}$\\
$L \; (r^*)$ & $\dn$ & $\bigcup_{n \geq 0} (L \; r)^n$\\
\end{tabular}
\end{center}
This mapping $L$ is also called the ``language interpretation'' of
regular expressions.

The Brzozowski derivative w.r.t.\ a character $c$ is an operation on
regexes: it transforms a regex into a new one denoting the strings of
the original language with the head character $c$ removed. Formally,
we first define such a transformation on an arbitrary set of strings,
which we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in \textit{StringSet}\}$
\end{center}
If the set $\textit{StringSet}$ happens to have some structure, for
example if it is regular (i.e.\ the language of some regular
expression), then the semantic derivative can be computed directly on
the regular expression itself.
% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).

The derivative of a regular expression, written $r \backslash c$, is a
function that takes a regular expression $r$ and a character $c$, and
returns another regular expression $r'$, computed by the following
recursive function:

\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ &
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

\noindent
The $\nullable$ function tests whether the empty string $""$ is in the
language of $r$:

\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
\end{center}

\noindent
The empty set does not contain any string and therefore not the empty
string; the empty string regular expression contains the empty string
by definition; the character regular expression is the singleton
containing only that character, and therefore does not contain the
empty string; for the alternative regular expression (the ``or''
expression) any one of its children being nullable suffices.
The sequence regular expression requires both children to contain the
empty string in order to compose an empty string, and the Kleene star
naturally introduces the empty string.

The meaning of regular expression derivatives is given by the language
interpretation: the language of $r\backslash c$ is exactly the
semantic derivative of the language of $r$, that is
\begin{center}
$L \; (r\backslash c) = \Der \; c \; (L \; r)$.
\end{center}
\noindent
In other words, the derivative function $\backslash c$ describes how a
regular expression evolves into a new regular expression after the
head character $c$ is chopped off all the strings it contains. The
most involved cases are the sequence and star cases. The sequence case
says that if the first component contains the empty string, then the
second component of the sequence may be the one from which the head
character is chopped off. The star case unwraps one iteration of the
regular expression and attaches the star regular expression to its
back again, to make sure that 0 or more further iterations can follow
the unfolded iteration.

The main property of the derivative operation that enables us to
reason about the correctness of an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single
characters to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]

\noindent
Assuming the string is given as a sequence of characters, say
$c_0c_1 \ldots c_{n-1}$, this algorithm presented graphically is as
follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string, and then use
\textit{nullable} to test whether the result can match the empty
string. It can be relatively easily shown that this matcher is correct
(that is, given an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES
if and only if $s \in L(r_0)$). A beautiful and simple definition.

If we implement the above algorithm naively, however, it can be
excruciatingly slow. For example, when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative, the slower the
algorithm.
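The matcher just described fits into a few lines of functional code.
Below is a direct Scala transcription of the definitions above (the
constructor names are ours); the later sketches in this chapter build
on these definitions:

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}

Running, for instance,
\texttt{matcher(STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a')))), "a" * 12)}
makes the size explosion described above directly observable.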
Brzozowski was quick to notice that during this process a lot of
useless $\ONE$s and $\ZERO$s are generated, making the derivatives far
from optimal. He therefore introduced ``similarity rules'', such as
$P+(Q+R) = (P+Q)+R$, to merge syntactically different but
language-equivalent sub-regexes and thereby decrease the size of the
intermediate regexes. More simplifications are possible, such as
deleting duplicates and opening up nested alternatives to trigger even
more simplifications. Suppose we apply simplification after each
derivative step, and compose these two operations together as an
atomic one: $a \backslash_{simp}\,c \dn \textit{simp}(a \backslash c)$.
Then we can build a matcher without cumbersome regular expressions.

If we want the size of derivatives in the algorithm to stay even
lower, we need more aggressive simplifications: essentially we need to
delete useless $\ZERO$s and $\ONE$s, as well as delete duplicates
whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get
$a\cdot c + b \cdot c + b\cdot c$, which can then be simplified to
just $a \cdot c + b \cdot c$. Another example is simplifying
$(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. These more
aggressive simplification rules help us to achieve a very tight size
bound, namely the same size bound as that of the \emph{partial
derivatives}.

Building derivatives and then simplifying them: so far so good. But
what if we want to do lexing instead of just obtaining a YES/NO
answer? This requires us to go back for a moment to the setting
without simplification. Sulzmann and Lu~\cite{Sulzmann2014} came up
with a nice and elegant (arguably as beautiful as the original
definition of derivatives) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}

They first defined the datatype for storing the lexing information,
called a \emph{value} or sometimes also a \emph{lexical value}. Values
and regular expressions correspond to each other as illustrated in the
following table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}

\noindent
Building on top of Sulzmann and Lu's attempt to formalise the notion
of POSIX lexing rules \parencite{Sulzmann2014}, Ausaf and
Urban \parencite{AusafDyckhoffUrban2016} modelled POSIX matching as a
ternary relation recursively defined in a natural deduction style.
With the formally specified rules for what a POSIX match is, they
proved in Isabelle/HOL that the algorithm gives correct results. But
having a correct result is still not enough; we also want
$\mathbf{efficiency}$.

One regular expression can have multiple lexical values.
For example, for the regular expression $(a+b)^*$ there is an infinite
list of values corresponding to it: $\Stars\,[]$,
$\Stars\,[\Left(\Char(a))]$, $\Stars\,[\Right(\Char(b))]$,
$\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$, $\ldots$, and so on.

Even for a regular expression matching a particular string, there can
be more than one corresponding value. Take the example where
$r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_{\text{$n$ $a$'s}}$. The number of
different ways of matching, if no value under a star is allowed to be
flattened to the empty string, can be given by the following formula:
\begin{center}
	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed-form formula can be calculated to be
\begin{equation}
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which is clearly exponential. A lexer that aims to generate all
possible values therefore has an exponential worst-case runtime, and
it is impractical to try to generate all possible matches in a run. In
practice, we are usually interested in the POSIX value, which,
intuitively, always matches the leftmost regular expression when there
is a choice, and always matches a sub-part as much as possible before
proceeding to the next token. For example, the above example has the
POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_{\text{$n$ iterations}}], \Stars\,[])]$.
The output of the algorithm we want is such a POSIX value.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX
value is generated if the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
For convenience, we shall employ the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is
composed of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase,
from left to right, we build the derivatives $r_1$, $r_2$, \ldots\
according to the characters $c_0$, $c_1$ until we exhaust the string
and obtain the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$, and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value for how
the empty string has been matched by the (nullable) regular expression
$r_n$.
This function is defined as

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$
			& \textit{if} $\nullable(r_{1})$\\
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}

\noindent
After the $\mkeps$-call, we inject back the characters one by one in
order to build the lexical value $v_i$ for how the regex $r_i$ matches
the string $s_i$ ($s_i = c_i \ldots c_{n-1}$) from the previous
lexical value $v_{i+1}$. After injecting back $n$ characters, we get
the lexical value for how $r_0$ matches $s$. The property of being a
POSIX value is maintained throughout the process. For this Sulzmann
and Lu defined a function that reverses the ``chopping off'' of
characters during the derivative phase. The corresponding function is
called \emph{injection}, written $\textit{inj}$; it takes three
arguments: the first one is a regular expression ${r_{i-1}}$, before
the character is chopped off; the second is the character ${c_{i-1}}$
we want to inject; and the third argument is the value ${v_i}$ into
which one wants to inject the character (it corresponds to the regular
expression after the character has been chopped off). The result of
this function is a new value. The definition of $\textit{inj}$ is as
follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent
This definition is by recursion on the ``shape'' of regular
expressions and values. The clauses do essentially one thing: identify
the ``hole'' in the value into which the character should be injected
back. Consider, for instance, the last clause, which injects back into
a value corresponding to a star. The value must be a sequence value,
whose first component corresponds to the child regex of the star with
its first character chopped off: an iteration of the star that has
just been unfolded. This value is followed by the already matched star
iterations collected before. So we inject the character back into the
first value and form a new value, with this new iteration added to the
previous list of iterations, all under the $\Stars$ top level.
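Both phases are again easily expressed in Scala. The sketch below
(building on the \texttt{Rexp} definitions given earlier) transcribes
the tables above; note that, like $\mkeps$ and $\textit{inj}$
themselves, the functions are partial, and the wrapper \texttt{lexer}
assumes that the string actually matches:

\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// derivatives going right, injections coming back left
def lexer(r: Rexp, s: List[Char]): Val = s match {
  case Nil => mkeps(r)   // assumes r is nullable at this point
  case c :: cs => inj(r, c, lexer(der(c, r), cs))
}
\end{verbatim}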
We mentioned before that derivatives without simplification can get
clumsy, and this is true for values as well: by definition they
reflect the size of the regular expressions. One can introduce
simplification on regexes and values, but one has to be careful not to
break correctness, as the injection function relies heavily on the
structure of the regexes and values being correct and matching each
other. This can be achieved by recording some extra rectification
functions during the derivative steps, and applying these
rectifications during the injection phase. One can then prove that the
POSIX value of how regular expressions match strings is not
affected---although this is much harder to establish. Some initial
results in this regard have been obtained
in \cite{AusafDyckhoffUrban2016}.

%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with
%automaton, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$
altogether, and obtain the lexing information, which is in principle
already there, while doing the derivatives in one pass, without a
second phase of injection? In the meantime, can we make sure that
simplifications are easily handled without breaking the correctness of
the algorithm? Sulzmann and Lu solved this problem by introducing
additional information into the regular expressions, called
\emph{bitcodes}.

\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
		$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b::bs
$
\end{center}

\noindent
The $1$ and $0$ are not in bold in order to avoid confusion with the
regular expressions $\ZERO$ and $\ONE$. Bitcodes (or bit-lists) can be
used to encode values (or potentially incomplete values) in a compact
form. This can be straightforwardly seen in the following coding
function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a
non-empty star iteration by $1$. The border where a star terminates is
marked by $0$. This coding is lossy, as it throws away the information
about characters, and also does not encode the ``boundary'' between
two sequence values. Moreover, from the bitcode alone we cannot even
tell whether the $1$s and $0$s stand for $\Left/\Right$ or $\Stars$.
The reason for choosing this compact way of storing information is
that the relatively small size of bits can be easily manipulated and
``moved around'' in a regular expression. In order to recover values,
we need the corresponding regular expression as extra information.
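In Scala, the coding function is a one-to-one transcription of the
table above (with bits represented as plain integers in this sketch):

\begin{verbatim}
type Bits = List[Int]   // bits 0 and 1 as plain integers

def code(v: Val): Bits = v match {
  case Empty => Nil
  case Chr(_) => Nil                 // characters are recoverable
  case Left(v1) => 0 :: code(v1)     //   from the regular expression
  case Right(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)         // 0 closes a star
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}
\end{verbatim}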
This means the decoding function is defined as:

%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\

  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
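A Scala sketch of the decoder follows the same structure: the helper
consumes bits while walking the regular expression, and the wrapper
checks that no bits are left over (ill-formed bitcodes simply raise a
match error in this sketch):

\begin{verbatim}
def decodeAux(bs: Bits, r: Rexp): (Val, Bits) = (bs, r) match {
  case (bs, ONE) => (Empty, bs)
  case (bs, CHAR(c)) => (Chr(c), bs)
  case (0 :: bs1, ALT(r1, _)) =>
    val (v, bs2) = decodeAux(bs1, r1); (Left(v), bs2)
  case (1 :: bs1, ALT(_, r2)) =>
    val (v, bs2) = decodeAux(bs1, r2); (Right(v), bs2)
  case (bs, SEQ(r1, r2)) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (0 :: bs1, STAR(_)) => (Stars(Nil), bs1)
  case (1 :: bs1, STAR(r1)) =>
    val (v, bs2) = decodeAux(bs1, r1)
    decodeAux(bs2, STAR(r1)) match {
      case (Stars(vs), bs3) => (Stars(v :: vs), bs3)
    }
}

def decode(bs: Bits, r: Rexp): Option[Val] = decodeAux(bs, r) match {
  case (v, Nil) => Some(v)
  case _ => None
}
\end{verbatim}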
Sulzmann and Lu then integrated the bitcodes into regular expressions
to create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:
%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\sum\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions. The
alternative constructor ($\sum$) has been generalised to accept a list
of annotated regular expressions rather than just two. We will show
that these bitcodes encode information about the (POSIX) value that
should be generated by the Sulzmann and Lu algorithm.

To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation}
and defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}

\noindent
We use up-arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In
the fourth clause, $\textit{fuse}$ is an auxiliary function that helps
to attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
     $_{bs@bs'}\sum\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}
\end{center}

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of the
bitcodes:

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
       (_{[]}a^*)$
\end{tabular}
\end{center}
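In Scala, the annotated regular expressions and these functions can be
sketched as below (\texttt{bnullable} and \texttt{bmkeps} are
transcribed after their formal definitions in the next paragraphs;
inside one Scala object such forward references are unproblematic):

\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ::: bs1)
  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
}

def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(0), internalise(r1)),
                    fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}
\end{verbatim}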
%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect
to $c$, we need to unfold it into a sequence, and attach an additional
bit $0$ to the front of $a \backslash c$ to indicate that there is one
more star iteration. The sequence clause is more subtle: the
interesting case is when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions; we therefore omit
the definition). Assuming that $\textit{bmkeps}$ correctly extracts
the bitcode for how $a_1$ matches the string prior to character $c$
(more on this later), the right branch of the alternative, which is
$\textit{fuse} \; (\bmkeps \; a_1)\, (a_2\backslash c)$, collapses the
regular expression $a_1$ (as it has already been fully matched) and
stores the lexing information at the head of the regular expression
$a_2 \backslash c$ by fusing it on. The bit-sequence $\textit{bs}$,
which was initially attached to the first element of the sequence
$a_1 \cdot a_2$, has now been elevated to the top level of $\sum$, as
this information will be needed whichever way the sequence is
matched---no matter whether $c$ belongs to $a_1$ or $a_2$. After
building these derivatives and maintaining all the lexing information,
we complete the lexing by collecting the bitcodes using a generalised
version of the $\textit{mkeps}$ function for annotated regular
expressions, called $\textit{bmkeps}$:

%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
     $bs \,@\, [0]$
\end{tabular}
\end{center}
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the
derivative operation from characters to strings (just like the
derivatives for un-annotated regular expressions).
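For completeness of the running Scala sketch, \texttt{bnullable},
\texttt{bmkeps} and the bitcoded lexer are transcribed below from the
definitions above (again a sketch, not the verified Isabelle/HOL
formalisation):

\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ::: bmkeps(a1)
    else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _) => bs ::: List(0)
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = s.toList.foldLeft(internalise(r))((a1, c) => bder(c, a1))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}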
Remember that one of the important reasons why we introduced bitcodes
is that they make simplification more structured, and therefore make
correctness easier to guarantee.

\subsection*{Our Simplification Rules}

In this section we introduce aggressive (in terms of size)
simplification rules on annotated regular expressions in order to keep
derivatives small. Such simplifications are promising, as we have
generated test data showing that a good tight bound can be achieved.
Obviously we can only partially cover the search space, as there are
infinitely many regular expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting).
Another modification is that we use simplification rules inspired by
Antimirov's work on partial derivatives. They maintain the idea that
only the first ``copy'' of a regular expression in an alternative
contributes to the calculation of a POSIX value. All subsequent copies
can be pruned away from the regular expression. A recursive definition
of our simplification function that looks somewhat similar to our
Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on.
%Is it $ALTS$ or $ALTS$?}\\

\begin{center}
  \begin{tabular}{@{}lcl@{}}

  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\

  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow  _{bs}\sum \textit{as'}$\\

   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it tries to simplify its children regular expressions
recursively, and then checks whether one of the children has turned
into $\ZERO$ or $\ONE$, which might trigger further simplification at
the current level. The most involved part is the $\sum$ clause, where
we use two auxiliary functions, $\textit{flatten}$ and
$\textit{distinct}$, to open up nested alternatives and reduce as many
duplicates as possible. The function $\textit{distinct}$ keeps only
the first occurrence of each duplicate and removes all later ones. The
function $\textit{flatten}$ opens up nested $\sum$s. Its recursive
definition is given below:

 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional
programming flatten function, except that it also removes $\ZERO$s. In
terms of regular expressions, it removes parentheses, for example
changing $a+(b+c)$ into $a+b+c$.
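In Scala, the simplification phase can be sketched as follows. Two
deviations from the table above keep the bitcodes intact in this
sketch: the $\ONE$'s own bits are fused together with the outer ones,
and the symmetric $(a_1', \ONE)$ clause is omitted, since
\textit{fuse} can only attach bits at the front. For
$\textit{distinct}$ we simply reuse Scala's \texttt{List.distinct},
which keeps first occurrences:

\begin{verbatim}
def flats(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: as1 => flats(as1)         // delete ZEROs
  case AALTS(bs, as1) :: as2 =>           // open up nested sums
    as1.map(fuse(bs, _)) ::: flats(as2)
  case a :: as1 => a :: flats(as1)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s)
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flats(as.map(simp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case a1 => a1
}
\end{verbatim}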
Having defined the $\simp$ function, we can use the previous notation
of natural extension from derivatives w.r.t.~characters to derivatives
w.r.t.~strings:
%\comment{simp in the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
to obtain an optimised version of the algorithm:

 \begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small: for instance,
with this simplification the derivatives of our previous $(a + aa)^*$
example, which previously grew to over 8000 nodes, stay at just 6
nodes, no matter how long the input string is.

Derivatives give a simple solution to the problem of matching a string
$s$ with a regular expression $r$: if the derivative of $r$ w.r.t.\
(in succession) all the characters of the string matches the empty
string, then $r$ matches $s$ (and {\em vice versa}). However, there
are two difficulties with derivative-based matchers. First,
Brzozowski's original matcher only generates a yes/no answer for
whether a regular expression matches a string or not. This is too
little information in the context of lexing, where separate tokens
must be identified and also classified (for example as keywords or
identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching algorithm.
Their extended version generates additional information on \emph{how}
a regular expression matches a string following the POSIX rules for
regular expression matching. They achieve this by adding a second
``phase'' to Brzozowski's algorithm involving an injection function.
In our own earlier work, we provided the formal specification of what
POSIX matching means and proved in Isabelle/HOL the correctness of
Sulzmann and Lu's extended algorithm
accordingly \cite{AusafDyckhoffUrban2016}.

The second difficulty is that Brzozowski's derivatives can grow to
arbitrarily big sizes. For example if we start with the regular
expression $(a+aa)^*$ and take successive derivatives according to the
character $a$, we end up with a sequence of ever-growing derivatives
like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer (we shall define shortly the precise details of our
regular expressions and the derivative operation).
Clearly, the notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular expressions, for
example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r \Rightarrow r$,
$\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow r$. While such
simple-minded simplifications have been proved in our earlier work to
preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.

Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bitcoded regular expressions.
In this version, POSIX values are represented as bit-sequences and
such sequences are incrementally generated when derivatives are
calculated. The compact representation of bit-sequences and regular
expressions allows them to define a more ``aggressive'' simplification
method that keeps the size of the derivatives finite, no matter what
the length of the string is. They make some informal claims about the
correctness and linear behaviour of this version, but do not provide
any supporting proof arguments, not even ``pencil-and-paper''
arguments. They write about their bitcoded \emph{incremental parsing
method} (that is, the algorithm to be formalised in this thesis):

\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet have to work out all proof details.''
\cite[Page 14]{Sulzmann2014}
\end{quote}

%----------------------------------------------------------------------------------------
%----------------------------------------------------------------------------------------
%----------------------------------------------------------------------------------------
%----------------------------------------------------------------------------------------