\documentclass[a4paper,UKenglish]{lipics}
\usepackage{graphicx}
\usepackage{pgfplots}
\usepackage{tikz-cd}
%\usepackage{algorithm}
\usepackage{amsmath}
\usepackage[noend]{algpseudocode}
\usepackage{enumitem}
\usepackage{nccmath}

\definecolor{darkblue}{rgb}{0,0,0.6}
\hypersetup{colorlinks=true,allcolors=darkblue}
\newcommand{\comment}[1]{{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}

\title{POSIX Regular Expression Matching and Lexing}
\author{Chengsong Tan}
\affil{King's College London\\
London, UK\\
\texttt{chengsong.tan@kcl.ac.uk}}
\authorrunning{Chengsong Tan}
\Copyright{Chengsong Tan}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\erase{\textit{erase}}
\def\bders{\textit{bders}}
\def\lexer{\mathit{lexer}}
\def\blexer{\textit{blexer}}
\def\fuse{\textit{fuse}}
\def\flatten{\textit{flatten}}
\def\map{\textit{map}}
\def\blexers{\mathit{blexer\_simp}}
\def\simp{\mathit{simp}}
\def\mkeps{\mathit{mkeps}}
\def\bmkeps{\textit{bmkeps}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\flex{\textit{flex}}
\def\rup{r^\uparrow}
\def\retrieve{\textit{retrieve}}
\def\AALTS{\textit{AALTS}}
\def\AONE{\textit{AONE}}

\algnewcommand\algorithmicswitch{\textbf{switch}}
\algnewcommand\algorithmiccase{\textbf{case}}
\algnewcommand\algorithmicassert{\texttt{assert}}
\algnewcommand\Assert[1]{\State \algorithmicassert(#1)}
\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}
\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}
\algtext*{EndSwitch}
\algtext*{EndCase}

\begin{document}
\maketitle

\begin{abstract}
Brzozowski introduced in 1964 a beautifully simple algorithm for regular expression matching based on the notion of derivatives of regular expressions. In 2014, Sulzmann and Lu extended this algorithm so that it not only gives a YES/NO answer for whether a regular expression matches a string, but, in case it does, also answers \emph{how} it matches the string. This is important for applications such as lexing (tokenising a string). The problem is to make the algorithm by Sulzmann and Lu fast on all inputs without breaking its correctness. Being fast depends on a complete set of simplification rules, some of which have been put forward by Sulzmann and Lu. We have extended their rules in order to obtain a tight bound on the size of regular expressions. We have tested these extended rules, but have not yet formally established their correctness.
We have also not yet looked at extended regular expressions, such as bounded repetitions, negation and back-references.
\end{abstract}

\section{Recapitulation of Concepts From the Last Report}

\subsection*{Regular Expressions and Derivatives}

Suppose (basic) regular expressions are given by the following grammar:
\[ r ::= \ZERO \mid \ONE \mid c \mid r_1 \cdot r_2 \mid r_1 + r_2 \mid r^* \]

\noindent
The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of $\nullable$ defined below.

\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ & $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

\noindent
The main property of the derivative operation is that
\begin{center}
$c\!::\!s \in L(r)$ holds if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters to strings as follows:
\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:
\[\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)\]

\noindent
Assuming that a string is given as a sequence of characters, say $c_0c_1\ldots c_{n-1}$, this algorithm can be presented graphically as follows:
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive derivatives until we exhaust the string and then use \textit{nullable} to test whether the result can match the empty string. It can be shown relatively easily that this matcher is correct (that is, given an $s = c_0\ldots c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a YES/NO answer for whether a string is being matched by a regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm to allow generation of an actual matching, called a \emph{value} or sometimes also \emph{lexical value}.
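\noindent
Before we describe values in detail, it is worth noting that the matcher described so far can be implemented directly in a functional language. The following is a minimal Scala sketch (the constructor and function names are my own choices, not necessarily those of any particular implementation):

{\small
\begin{verbatim}
// regular expressions as an inductive datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative w.r.t. a single character
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a string, and the matcher itself
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(s: String, r: Rexp): Boolean = nullable(ders(s.toList, r))
\end{verbatim}}

\noindent
For instance, \texttt{matcher("ab", SEQ(CHAR('a'), CHAR('b')))} evaluates to \texttt{true}.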
These values and regular expressions correspond to each other as illustrated in the following table:
\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
 & $\mid$ & $\ONE$ \\
 & $\mid$ & $c$ \\
 & $\mid$ & $r_1 \cdot r_2$\\
 & $\mid$ & $r_1 + r_2$ \\
\\
 & $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
 & & $\Empty$ \\
 & $\mid$ & $\Char(c)$ \\
 & $\mid$ & $\Seq\,v_1\, v_2$\\
 & $\mid$ & $\Left(v)$ \\
 & $\mid$ & $\Right(v)$ \\
 & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent
The contribution of Sulzmann and Lu is an extension of Brzozowski's algorithm by a second phase (the first phase being building successive derivatives---see \eqref{graph:*}). In this second phase, a POSIX value is generated in case the regular expression matches the string. Pictorially, the Sulzmann and Lu algorithm is as follows:
\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}

\noindent
For convenience, we shall employ the following notations: the regular expression we start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from left to right, we build the derivatives $r_1$, $r_2$, \ldots{} according to the characters $c_0$, $c_1$, \ldots{} until we exhaust the string and obtain the derivative $r_n$. We test whether this derivative is $\textit{nullable}$ or not. If not, we know the string does not match $r$ and no value needs to be generated. If yes, we start building the values incrementally by \emph{injecting} back the characters into the earlier values $v_n, \ldots, v_0$. This is the second phase of the algorithm, from right to left. For the first value $v_n$, we call the function $\textit{mkeps}$, which builds the lexical value for how the empty string has been matched by the (nullable) regular expression $r_n$. This function is defined as
\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$ & \textit{if} $\nullable(r_{1})$\\
 & & \textit{then} $\Left(\mkeps(r_{1}))$\\
 & & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}

\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build the lexical value $v_i$ for how the regular expression $r_i$ matches the string $s_i$ ($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$. After injecting back $n$ characters, we get the lexical value for how $r_0$ matches $s$. For this, Sulzmann and Lu defined a function that reverses the ``chopping off'' of characters during the derivative phase. The corresponding function is called \emph{injection}, written $\textit{inj}$; it takes three arguments: the first one is a regular expression ${r_{i-1}}$, before the character is chopped off, the second is a character ${c_{i-1}}$, the character we want to inject, and the third argument is the value ${v_i}$, into which one wants to inject the character (it corresponds to the regular expression after the character has been chopped off).
The result of this function is a new value. The definition of $\textit{inj}$ is as follows:
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
\noindent This definition is by recursion on the ``shape'' of regular expressions and values.

\subsection*{Simplification of Regular Expressions}

The main drawback of building successive derivatives according to Brzozowski's definition is that they can grow very quickly in size. This is mainly due to the fact that the derivative operation often generates ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if implemented naively, both algorithms by Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example, when starting with the regular expression $(a + aa)^*$ and building 12 successive derivatives w.r.t.~the character $a$, one obtains a derivative regular expression with more than 8000 nodes (when viewed as a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse such trees and consequently the bigger the size of the derivative, the slower the algorithm.

Fortunately, one can simplify regular expressions after each derivative step. Various simplifications of regular expressions are possible, such as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not affect the answer for whether a regular expression matches a string or not, but fortunately also do not affect the POSIX strategy of how regular expressions match strings---although the latter is much harder to establish. Some initial results in this regard have been obtained in \cite{AusafDyckhoffUrban2016}.

If we want the size of derivatives in Sulzmann and Lu's algorithm to stay below the size bound of Antimirov's partial derivatives, we need more aggressive simplifications. Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as delete duplicates whenever possible. For example, the parentheses in $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive simplification rules helps us to achieve the same size bound as that of the partial derivatives.

In order to implement the idea of ``spilling out alternatives'' and to make it compatible with the $\textit{inj}$-mechanism, we use \emph{bitcodes}. They were first introduced by Sulzmann and Lu. Here bits and bitcodes (lists of bits) are defined as:
\begin{center}
$b ::= S \mid Z \qquad bs ::= [] \mid b:bs$
\end{center}

\noindent
The $S$ and $Z$ are arbitrary names for the bits in order to avoid confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or bit-lists) can be used to encode values (or incomplete values) in a compact form.
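\noindent
As an aside, the value-based second phase described above can also be rendered in Scala, continuing the earlier sketch (the names are again my own; \texttt{mkeps} and \texttt{inj} are partial functions, being called only in the situations covered by the clauses shown):

{\small
\begin{verbatim}
// values, following the grammar above; Left and Right deliberately
// shadow Scala's built-in names, Chr avoids clashing with CHAR
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// mkeps: how a nullable regular expression matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

// inj: reverse one derivative step on the value
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}
\end{verbatim}}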
That values can be encoded compactly as bitcodes can be seen straightforwardly in the following coding function from values to bitcodes:
\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
$\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
$\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
$\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\; code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting $\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty star iteration into $\Z$, and the border where a local star terminates into $\S$. This coding is lossy, as it throws away the information about characters, and also does not encode the ``boundary'' between two sequence values. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily manipulated and ``moved around'' in a regular expression. In order to recover values, we will need the corresponding regular expression as extra information. This means the decoding function is defined as:
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
$\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; (\Left\,v, bs_1)$\\
$\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; (\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
 & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
 & & \hspace{35mm}$\textit{in}\;(\Stars\,(v\!::\!vs), bs_2)$\bigskip\\
$\textit{decode}\,bs\,r$ & $\dn$ & $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

Sulzmann and Lu integrated the bitcodes into regular expressions to create annotated regular expressions \cite{Sulzmann2014}. \emph{Annotated regular expressions} are defined by the following grammar:
\begin{center}
\begin{tabular}{lcl}
$\textit{a}$ & $::=$ & $\textit{ZERO}$\\
 & $\mid$ & $\textit{ONE}\;\;bs$\\
 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
 & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
 & $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}
\end{center}

\noindent
where $bs$ stands for bitcodes, $a$ for \textbf{a}nnotated regular expressions and $as$ for a list of annotated regular expressions. The alternative constructor ($\textit{ALTS}$) has been generalised to accept a list of annotated regular expressions rather than just two.
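\noindent
These definitions again have a direct Scala rendering; the following sketch continues the earlier one (the \texttt{A}-prefixed constructor names are my own, chosen to echo the $\AALTS$/$\AONE$ notation used later and to avoid clashes with the plain constructors; \texttt{decodeAux} mirrors $\textit{decode}'$ and is partial in the same way):

{\small
\begin{verbatim}
// bits and bitcodes
abstract class Bit
case object Z extends Bit
case object S extends Bit
type Bits = List[Bit]

// annotated regular expressions: every constructor carries a bitcode
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// decode a bitcode back into a value, guided by a plain regular expression
def decodeAux(bs: Bits, r: Rexp): (Val, Bits) = (r, bs) match {
  case (ONE, _)     => (Empty, bs)
  case (CHAR(c), _) => (Chr(c), bs)
  case (ALT(r1, r2), Z :: bs1) =>
    val (v, bs2) = decodeAux(bs1, r1)
    (Left(v), bs2)
  case (ALT(r1, r2), S :: bs1) =>
    val (v, bs2) = decodeAux(bs1, r2)
    (Right(v), bs2)
  case (SEQ(r1, r2), _) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (STAR(r1), Z :: bs1) =>                   // Z: one more iteration
    val (v, bs2) = decodeAux(bs1, r1)
    val (Stars(vs), bs3) = decodeAux(bs2, STAR(r1))
    (Stars(v :: vs), bs3)
  case (STAR(_), S :: bs1) => (Stars(Nil), bs1)  // S: the star ends here
}

def decode(bs: Bits, r: Rexp): Option[Val] = decodeAux(bs, r) match {
  case (v, Nil) => Some(v)
  case _        => None
}
\end{verbatim}}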
We will show that these bitcodes encode information about the (POSIX) value that should be generated by the Sulzmann and Lu algorithm.

To do lexing using annotated regular expressions, we shall first transform the usual (un-annotated) regular expressions into annotated regular expressions. This operation is called \emph{internalisation} and is defined as follows:
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
$(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
$(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ & $\textit{ALTS}\;[]\,[(\textit{fuse}\,[\Z]\,r_1^\uparrow),\, (\textit{fuse}\,[\S]\,r_2^\uparrow)]$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ & $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
$(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$\\
\end{tabular}
\end{center}

\noindent
We use up-arrows here to indicate that the basic un-annotated regular expressions are ``lifted up'' into something slightly more complex. In the fourth clause, $\textit{fuse}$ is an auxiliary function that helps to attach bits to the front of an annotated regular expression. Its definition is as follows:
\begin{center}
\begin{tabular}{lcl}
$\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
$\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ & $\textit{ONE}\,(bs\,@\,bs')$\\
$\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ & $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
$\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,as)$ & $\dn$ & $\textit{ALTS}\,(bs\,@\,bs')\,as$\\
$\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ & $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
$\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ & $\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}
\end{center}

\noindent
After internalising the regular expression, we perform successive derivative operations on the annotated regular expressions.
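\noindent
In the Scala sketch, these two functions look as follows (mirroring the definitions just given):

{\small
\begin{verbatim}
// attach bits to the front of an annotated regular expression
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

// internalise: lift a plain regular expression into an annotated one
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(Z), internalise(r1)),
                    fuse(List(S), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}}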
The derivative operation on annotated regular expressions is the same as the one for basic regular expressions, except that we need to take care of the bitcodes:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}{\bf d})\,\backslash c$ & $\dn$ & $\textit{if}\;c=d\;\textit{then}\;_{bs}\ONE\;\textit{else}\;\ZERO$\\
$(_{bs}\oplus \;as)\,\backslash c$ & $\dn$ & $_{bs}\oplus\;(as.map(\backslash c))$\\
$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a_1$\\
 & &$\textit{then}\;_{bs}\oplus\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
 & &$\phantom{\textit{then}\;_{bs}\oplus\,[}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
$(_{bs}a^*)\,\backslash c$ & $\dn$ & $_{bs}\;((\textit{fuse}\, [\Z]\, (a\,\backslash c))\cdot (_{[]}a^*))$
\end{tabular}
\end{center}

\noindent
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence, we need to attach an additional bit $\Z$ to the front of $a \backslash c$ to indicate that there is one more star iteration. Also the $SEQ$ clause is more subtle: when $a_1$ is $\textit{bnullable}$ (here \textit{bnullable} is exactly the same as $\textit{nullable}$, except that it works on annotated regular expressions, therefore we omit the definition), the derivative becomes an alternative between the two ways in which the sequence can proceed.
Assume that $\bmkeps$ correctly extracts the bitcode for how $a_1$ matches the string prior to character $c$ (more on this later); then the right branch of the alternative, which is $\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\backslash c)$, will collapse the regular expression $a_1$ (as it has already been fully matched) and store the parsing information at the head of the regular expression $a_2 \backslash c$ by fusing it to it. The bitsequence $bs$, which was initially attached to the head of $SEQ$, has now been elevated to the top-level of $ALTS$, as this information will be needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs to $a_1$ or $a_2$. After building these derivatives and maintaining all the lexing information, we complete the lexing by collecting the bitcodes using a generalised version of the $\textit{mkeps}$ function for annotated regular expressions, called $\textit{bmkeps}$:
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(\textit{ALTS}\;bs\,(a\!::\!as))$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a$\\
 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
 & &$\textit{else}\;\textit{bmkeps}\,(\textit{ALTS}\;bs\,as)$\\
$\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
$\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ & $bs \,@\, [\S]$
\end{tabular}
\end{center}

\noindent
This function completes the value information by travelling along the path of the regular expression that corresponds to a POSIX value and collecting all the bitcodes, using $\S$ to indicate the end of star iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and decode them, we get the value we expect. The corresponding lexing algorithm looks as follows:
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative operation from characters to strings (just like the derivatives for un-annotated regular expressions).

\subsection*{Our Simplification Rules}

The main point of the bitcodes and annotated regular expressions is that we can apply rather aggressive (in terms of size) simplification rules in order to keep derivatives small. We have developed such ``aggressive'' simplification rules and generated test data that show that the expected bound can be achieved. Obviously we could only partially cover the search space, as there are infinitely many regular expressions and strings.

One modification we introduced is to allow a list of annotated regular expressions in the \textit{ALTS} constructor. This allows us to not just delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also unnecessary ``copies'' of regular expressions (very similar to simplifying $r + r$ to just $r$, but in a more general setting). Another modification is that we use simplification rules inspired by Antimirov's work on partial derivatives. They maintain the idea that only the first ``copy'' of a regular expression in an alternative contributes to the calculation of a POSIX value. All subsequent copies can be pruned away from the regular expression.
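\noindent
Putting the pieces of the bitcoded algorithm together, a Scala sketch of \texttt{bnullable}, \texttt{bmkeps}, the annotated derivative and \textit{blexer} may look as follows (it relies on \texttt{fuse}, \texttt{internalise} and \texttt{decode} from the earlier sketches; as before this is my rendering, not the original code):

{\small
\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// collect the bitcode of the POSIX way of matching the empty string
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1) else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _) => bs ++ List(S)
}

// derivative on annotated regular expressions
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}}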
A recursive definition of our simplification function that looks somewhat similar to our Scala code is given below:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
$\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map}\;\textit{simp}\;as)) \; \textit{match} $ \\
 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;bs\;a$ \\
 &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\
$\textit{simp} \; a$ & $\dn$ & $a \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
The simplification performs a pattern match on the regular expression. When it detects that the regular expression is an alternative or a sequence, it tries to simplify its children regular expressions recursively and then checks whether one of the children turns into $\ZERO$ or $\ONE$, which might trigger further simplification at the current level. The most involved part is the $\textit{ALTS}$ clause, where we use two auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested $\textit{ALTS}$ and reduce as many duplicates as possible. Function $\textit{distinct}$ keeps only the first occurring copy and removes all later ones when duplicates are detected. Function $\textit{flatten}$ opens up nested \textit{ALTS}. Its recursive definition is given below:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flatten} \; []$ & $\dn$ & $[]$\\
$\textit{flatten} \; ((\textit{ALTS}\;bs\,as) :: as')$ & $\dn$ & $(\textit{map} \; (\textit{fuse}\;bs)\; as) \; @ \; \textit{flatten} \; as' $ \\
$\textit{flatten} \; (\textit{ZERO} :: as')$ & $\dn$ & $ \textit{flatten} \; as' $ \\
$\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
\end{tabular}
\end{center}

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten function, except that it also removes $\ZERO$s. In terms of regular expressions, it removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.

Suppose we apply simplification after each derivative step and view these two operations as an atomic one: $a \backslash_{simp}\,c \dn \textit{simp}(a \backslash c)$.
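\noindent
A Scala sketch of this simplification, together with the simplifying derivative just introduced and the optimised lexer that is defined formally in the next paragraph, may look as follows. Note that this sketch keeps the bits of a $\ONE$ when fusing it away, as in the worked example later; the right-$\ONE$ clause is omitted here, since its bits would have to be appended at the end rather than fused to the front:

{\small
\begin{verbatim}
// open up nested AALTS (fusing their bits into the children) and drop AZEROs
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ++ flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)  // keep the ONE's bits
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case _ => a
}

// derivative and simplification as one atomic step
def bderSimp(c: Char, a: ARexp): ARexp = simp(bder(c, a))

def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bdersSimp(cs, bderSimp(c, a))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}}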
Then we can use the natural extension from derivatives w.r.t.~characters to derivatives w.r.t.~strings:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and obtain an optimised version of the algorithm:
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small: for example, with this simplification our previous $(a + aa)^*$ example's 8000 nodes will be reduced to just 6, and the size stays constant no matter how long the input string is.

\section{Introduction}

While we believe derivatives of regular expressions, written $r\backslash s$, are a beautiful concept (in terms of ease of implementing them in functional programming languages and in terms of reasoning about them formally), they have one major drawback: every derivative step can make regular expressions grow drastically in size. This in turn has a negative effect on the runtime of the corresponding lexing algorithms. Consider for example the regular expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The corresponding derivative already contains 8668 nodes when the derivative is given as a tree. The reason for the poor runtime of the derivative-based lexing algorithms is that they need to traverse such trees over and over again. The solution is to find a complete set of simplification rules that keep the sizes of derivatives uniformly small.

For reasons beyond this report, it turns out that a complete set of simplification rules depends on values being encoded as bitsequences.\footnote{Values are the results the lexing algorithms generate; they encode how a regular expression matched a string.} We already know that the lexing algorithm using bitsequences but \emph{without} simplification is correct, albeit horribly slow. Therefore in the past six months I have been trying to prove that the algorithm using bitsequences plus our simplification rules is also correct. Formally this amounts to showing that
\begin{equation}\label{mainthm}
\blexers \; r \; s = \blexer \;r\;s
\end{equation}

\noindent
whereby $\blexers$ simplifies (makes derivatives smaller) in each step, whereas with $\blexer$ the size can grow exponentially. This would be an important milestone for my thesis, because we already have a very good idea how to establish that our set of simplification rules keeps the size of derivatives below a relatively tight bound.

In order to prove the main theorem in \eqref{mainthm}, we need to prove that the two functions produce the same output.
The definition of these two functions is shown below.
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{lcl}
$\blexers \; r \, s$ & $\dn$ & $\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\
 & & $\; \; \textit{if} \; \textit{bnullable}(a)$\\
 & & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\; \textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In these definitions, $r^\uparrow$ is the internalisation (coding) function, which is the same in both cases; similarly, \textit{decode} and \textit{bmkeps} are the same in both cases. Our main theorem~\eqref{mainthm} therefore boils down to proving the following two propositions (depending on which branch the if-else clause takes). They establish that the derivatives \emph{with} simplification do not change the computed result:
\begin{itemize}
\item[(a)] If a string $s$ is in the language $L(r)$, then \\
$\textit{bmkeps}\, ((r^\uparrow)\backslash_{simp}\,s) = \textit{bmkeps}\, ((r^\uparrow)\backslash s)$,
\item[(b)] If a string $s$ is \emph{not} in the language $L(r)$, then $\rup \backslash_{simp} \,s$ is not nullable.
\end{itemize}

\noindent
We have already proved the second part in Isabelle. This is actually not too difficult because we can show that simplification does not change the language of regular expressions. If we can prove the first part, that is, that the bitsequence algorithm with simplification produces the same result as the one without simplification, then we are done. Unfortunately that part requires more effort, because simplification does not only need to \emph{not} change the language, but also not change the value (that is, the computed result).

\subsubsection*{Existing Proof}

For this we have started with looking at the original proof that established that the bitsequence algorithm produces the same result as the algorithm not using bitsequences. Formally this proof established
\begin{equation}\label{lexer}
\blexer \; (r^\uparrow)\, s = \lexer \;r \;s
\end{equation}

\noindent
The proof uses two ``tricks''.
One is that it uses a \flex-function
\begin{center}
\begin{tabular}{lcl}
$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (\inj \; r \; c \; v)) \;s$ \\
$\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$
\end{tabular}
\end{center}

\noindent
and then proves for the right-hand side in \eqref{lexer} that
\begin{center}
$\lexer \;r\; s = \flex \;r\;\textit{id}\; s \;(\mkeps \; (r\backslash s))$.
\end{center}

\noindent
The $\flex$-function essentially does lexing by stacking up injection functions while doing derivatives, explicitly showing the order in which characters are injected back in each step.

The crux of the existing proof is how $\flex$ relates to injection, namely
\begin{center}
$\flex \; r \; \textit{id} \; (s@[c]) \; v = \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; v)$.
\end{center}

\noindent
This property allows one to rewrite an induction hypothesis like
\begin{center}
$ \flex \; r\; \textit{id}\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;)\; r$
\end{center}

\subsubsection{Retrieve Function}

The crucial point is to find the $\textit{POSIX}$ information of a regular expression and how it is modified, augmented and propagated during simplification, in parallel with the regular expression that has not been simplified in the subsequent derivative operations. To aid this, we use the helper function $\retrieve$ described by Sulzmann and Lu:
\begin{center}
\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
$\textit{retrieve}\,(\textit{ONE}\,bs)\,\Empty$ & $\dn$ & $bs$\\
$\textit{retrieve}\,(\textit{CHAR}\,bs\,c)\,(\Char\,d)$ & $\dn$ & $bs$\\
$\textit{retrieve}\,(\textit{ALTS}\,bs\,(a\!::\!as))\,(\Left\,v)$ & $\dn$ & $bs \,@\, \textit{retrieve}\,a\,v$\\
$\textit{retrieve}\,(\textit{ALTS}\,bs\,(a\!::\!as))\,(\Right\,v)$ & $\dn$ & $bs \,@\, \textit{retrieve}\,(\textit{ALTS}\,[]\,as)\,v$\\
$\textit{retrieve}\,(\textit{SEQ}\,bs\,a_1\,a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ & $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
$\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,[])$ & $\dn$ & $bs \,@\, [\S]$\\
$\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
\multicolumn{3}{l}{\hspace{3cm}$bs \,@\, [\Z] \,@\, \textit{retrieve}\,a\,v\,@\, \textit{retrieve}\,(\textit{STAR}\,[]\,a)\,(\Stars\,vs)$}\\
\end{tabular}
\end{center}

\noindent
This function assembles the bitcode that corresponds to a lexical value for how the current derivative matches the suffix of the string (the characters that have not yet been injected back), using information from both the annotated regular expression and the value.
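\noindent
A Scala rendering of $\retrieve$, under the same assumptions as the earlier sketches (the function is partial: the value must correspond to the erased regular expression):

{\small
\begin{verbatim}
// retrieve: extract the bitcode of the value v, guided by the
// annotated regular expression a
def retrieve(a: ARexp, v: Val): Bits = (a, v) match {
  case (AONE(bs), Empty) => bs
  case (ACHAR(bs, _), Chr(_)) => bs
  case (AALTS(bs, a1 :: _), Left(v1)) => bs ++ retrieve(a1, v1)
  case (AALTS(bs, _ :: as), Right(v2)) => bs ++ retrieve(AALTS(Nil, as), v2)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
    bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil)) => bs ++ List(S)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ++ List(Z) ++ retrieve(a1, v1) ++ retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}}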
Sulzmann and Lu proposed this function, but did not prove anything about it. Ausaf and Urban used it to connect the bitcoded algorithm to the older algorithm by the following equation:
\begin{center}
$\inj \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((r^\uparrow)\backslash c)\;v)\;r$
\end{center}

\noindent
whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf and Urban also used this fact to prove the correctness of the bitcoded algorithm without simplification. Our purpose of using this, however, is to establish
\begin{center}
$ \textit{retrieve} \;a \; v \;=\; \textit{retrieve} \; (\textit{simp}\,a) \; v'.$
\end{center}
The idea is that using $v'$, a simplified version of $v$ that has gone through the same simplification steps as $\textit{simp}(a)$, we are able to extract the bitcode that gives the same parsing information as the unsimplified one. However, we noticed that constructing such a $v'$ from $v$ is not so straightforward. The point is that such a relationship between simplified and unsimplified regular expressions and values might finally allow us to bridge the gap.

By using a property of $\retrieve$, the right-hand side of the induction hypothesis displayed above becomes $\textit{decode}\, (\textit{retrieve}\, (r^\uparrow \backslash(s @ [c]))\; v)\, r$, and this gives the main lemma result:
\begin{center}
$ \flex \;r\; \textit{id} \; (s@[c]) \; v = \textit{decode}\,(\textit{retrieve}\, (\rup \backslash (s@[c])) \;v)\, r$
\end{center}

\noindent
To use this lemma result for our correctness proof, simply replace the $v$ in the right-hand side of the above equality with $\mkeps\;(r\backslash (s@[c]))$, and apply the lemma that
\begin{center}
$\textit{decode} \; (\bmkeps \; \rup) \; r = \textit{decode} \; (\textit{retrieve} \; \rup \; (\mkeps\,r)) \;r$
\end{center}

\noindent
We get the correctness of our bit-coded algorithm:
\begin{center}
$\flex \;r\; \textit{id} \; s \; (\mkeps \; (r\backslash s)) = \textit{decode} \; (\bmkeps \; (\rup\backslash s)) \; r$
\end{center}

\noindent
The bridge between the above chain of equalities is the use of $\retrieve$. If we want to use a similar technique for the simplified version of the algorithm, we face the problem that in the above equalities $\retrieve \; a \; v$ is not always defined. For example, $\retrieve \; {}_0(_1a+_0a) \; (\Left\,(\Char\,a))$ is defined, but not $\retrieve \; (_{01}a) \;(\Left\,(\Char\,a))$, though we can extract the same POSIX bits from the two annotated regular expressions. The latter case might occur when we try to retrieve from a simplified regular expression using the same value as for the unsimplified one. This is because $\Left(\ldots)$ corresponds to an alternative structure $r_1+r_2$, which is no longer present in the simplified regular expression. That means, if we want to prove that
\begin{center}
$\textit{decode} \; (\bmkeps \; (\rup\backslash s)) \; r = \textit{decode} \; (\bmkeps \; (\rup\backslash_{simp} s)) \; r$
\end{center}

\noindent
holds by using $\retrieve$, we probably need to prove an equality like the one below:
\begin{center}
$\retrieve \; (\rup\backslash_{simp} s) \; (\mkeps(f(r\backslash s)))=\textit{retrieve} \; (\rup\backslash s) \; (\mkeps(r\backslash s))$
\end{center}
\noindent
Here $f$ rectifies $r\backslash s$ so that the value $\mkeps(f(r\backslash s))$ becomes something simpler, making the $\retrieve$ function defined.

\subsubsection{Ways to Rectify the Value}

One way to do this is to prove the following:
\begin{center}
$\retrieve \; (\rup\backslash_{simp} s) \; (\mkeps(\simp(r\backslash s)))=\textit{retrieve} \; (\rup\backslash s) \; (\mkeps(r\backslash s))$
\end{center}

\noindent
The reason we choose $\simp$ as $f$ is that $\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash \, s)$ have the same shape:
\begin{center}
$\erase (\rup\backslash_{simp} \, s) = \erase(\simp(\rup\backslash s))$
\end{center}

\noindent
Here $\erase$ removes the bitcodes of an annotated regular expression and keeps only the original regular expression (just like ``erasing'' the bits); its definition is omitted. $\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash s)$ are very closely related, but not identical.

\subsubsection{Example for $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$}

For example, let $r$ be the regular expression $(a+b)(a+a^*)$ and $s$ the string $aa$. Then both $\erase (\rup\backslash_{simp} \, s)$ and $\erase (\simp (\rup\backslash s))$ are $\ONE + a^*$. However, without $\erase$,
\begin{center}
$\rup\backslash_{simp} \, s$ is equal to $_0(_0\ONE +_{10}a^*)$
\end{center}
\noindent
whereas
\begin{center}
$\simp(\rup\backslash s)$ is equal to $(_{00}\ONE +_{010}a^*)$.
\end{center}

\noindent
(For the sake of visual simplicity, we use numbers to denote the bits in bitcodes: $\S$ is written as subscript $_1$ and $\Z$ as $_0$.) What makes the difference?

During the first derivative operation, $\rup\backslash a=(_0\ONE + \ZERO)(_0a + _1a^*)$ is a sequence regular expression with two components, the first one, $_0\ONE + \ZERO$, being nullable.
Recall the definition of the simplification function:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
$\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map}\;\textit{simp}\;as)) \; \textit{match} $ \\
 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;bs\;a$ \\
 &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\
$\textit{simp} \; a$ & $\dn$ & $a \qquad \textit{otherwise}$
\end{tabular}
\end{center}

\noindent
and the definition of $\flatten$:
\begin{center}
\begin{tabular}{c c c}
$\flatten \; []$ & $\dn$ & $[]$\\
$\flatten \; (\ZERO\!::\!rs)$ & $\dn$ & $\flatten \; rs$\\
$\flatten \;((_{\textit{bs}_1}\oplus \textit{rs}_1) ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) \,@\, \flatten(rs)$\\
$\flatten \; (r :: rs)$ & $\dn$ & $r::\flatten(rs)$
\end{tabular}
\end{center}

\noindent
If we call $\simp$ on $\rup\backslash a$, just as $\backslash_{simp}$ requires, then we go through the third clause of the sequence case: $\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$. The $\ZERO$ in $(_0\ONE + \ZERO)$ is thrown away by $\flatten$ and $_0\ONE$ is merged into $(_0a + _1a^*)$ by simply putting its bit ($_0$) to the front of the second component: ${\bf_0}(_0a + _1a^*)$.
After a second derivative operation, namely $(_0(_0a + _1a^*))\backslash a$, we get
\begin{center}
$_0(_0 \ONE + _1(_0\ONE \cdot a^*))$,
\end{center}
\noindent
and this simplifies to $_0(_0 \ONE + _{10} a^*)$: the inner sequence collapses by the third clause of the sequence case, and the alternative is kept by the last clause of the alternative case, $\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$. The outermost bit $_0$ stays with the outermost regular expression, rather than being fused to its child regular expressions, as we will later see happens to $\simp(\rup\backslash \, s)$.

If we choose not to simplify in the midst of derivative operations, but only do it at the end after the string has been exhausted, namely $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$, then at the {\bf second} derivative $(\rup\backslash a)\,{\bf\backslash a}$ we go through the sequence clause of $\backslash$:
\begin{center}
\begin{tabular}{lcl}
$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & $(\textit{when} \; \textit{bnullable}\,a_1)$\\
 & &$\textit{ALTS}\,bs\,[(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
 & &$\phantom{\textit{ALTS}\,bs\,[}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
\end{tabular}
\end{center}
\noindent
because $\rup\backslash a = (_0\ONE + \ZERO)(_0a + _1a^*)$ is a sequence with the first component being nullable (unsimplified, unlike in the first round of running $\backslash_{simp}$). Therefore $((_0\ONE + \ZERO)(_0a + _1a^*))\backslash a$ splits into
\begin{center}
$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_0\ONE \cdot a^*]))$.
\end{center}
\noindent
After these two successive derivatives without simplification, we apply $\simp$ to this regular expression, which goes through the alternative clause, where each component is simplified, giving us the list
\begin{center}
$[\ZERO, \; _0(_0\ONE + _{10}a^*)]$.
\end{center}
\noindent
This list is then ``flattened'': $\ZERO$ is thrown away by $\textit{flatten}$, and $_0(_0\ONE + _{10}a^*)$ is opened up to make a list consisting of two separate elements, $_{00}\ONE$ and $_{010}a^*$; note that $\flatten$ fuses the bit $_0$ to the front of $_0\ONE$ and $_{10}a^*$. In a nutshell, the order of simplification causes the bits to be moved around differently.

\subsubsection{A Failed Attempt To Remedy the Problem Above}

A simple class of regular expression and string pairs that trigger the difference between $\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash s)$ can be deduced from the above example:
\[D = \{(c_1c_2,\; r_1\cdot r_2) \mid c_1\in L(r_1),\; r_1 \;\text{not nullable},\; c_2 \in L(r_2),\; r_2 \backslash c_2 \;\text{is of shape alternative},\; c_1c_2 \notin L(r_1)\}\]
and we can use this class to construct a set of examples exhibiting the different behaviour of the two operations. Note, however, that despite the bits being moved around in the regular expression (a difference in bitcodes), the structure of the unannotated regular expression after one final simplification is exactly the same as after the same sequence of derivative operations with simplification along the way.
The function $\fuse$ is the culprit: it causes the order in which alternatives are opened up to be remembered, and this is where the difference between $\simp(\rup \backslash s)$ and $\rup \backslash_{simp} \,s$ eventually shows up. We cannot simply leave the bits out, however, as they are essential for the simplification: $\flatten$ needs them.

\subsection{Another Proof Strategy}

In parallel, I have tried another method using $\retrieve$. The problem here is that we used $\retrieve$ for the key induction:
\begin{center}
$\textit{decode}\,(\retrieve\, (\rup\backslash (s @ [c]))\; v)\; r = \textit{decode}\,(\retrieve\, (\rup\backslash s)\; (\inj\, (r\backslash s)\, c\, v))\; r$
\end{center}
\noindent
Here, $\textit{decode}$ recovers a value that corresponds to a (possibly partial) match from bits, and the bits are extracted by $\retrieve$; the key value $v$ that guides $\retrieve$ is $\mkeps\,(r\backslash s)$, $\inj\, r\, c\, (\mkeps\,(r\backslash s))$, and so on. For the simplified version we would need a function (call it $\textit{vsimp}$) that simplifies the value in lockstep with the regular expression, so as to produce a value suitable for decoding:
\begin{center}
$\textit{Some}\,(\flex\; r\;\textit{id}\;(s@[c])\;v) = \textit{Some}\,(\flex\; r\;\textit{id}\;s\;(\inj\,(r\backslash s)\,c\,v))$
\end{center}
\noindent
Another way, which Christian came up with and which might circumvent the problem of finding a suitable value, is to not state a $\textit{vsimp}$ function at all, but instead to include all possible values that a regular expression is able to produce in a set, and to prove that both $r$ and its simplified counterpart are able to produce the bits that correspond to the POSIX value obtained by feeding the same initial regular expression $r$ and string $s$ to the two functions $\textit{ders}$ and $\textit{ders\_simp}$.

Namely, if we define the equivalence relation $\sim_{m\epsilon}$ between two regular expressions $r_1$ and $r_2$ as follows:
\begin{center}
$r_1 \sim_{m\epsilon} r_2 \iff \bmkeps(r_1)= \bmkeps(r_2)$
\end{center}
\noindent
(in other words, $r_1$ and $r_2$ produce the same output under the function $\bmkeps$), then the first goal might be restated as
\begin{center}
$(r^\uparrow)\backslash_{simp}\, s \sim_{m\epsilon} (r^\uparrow)\backslash s$.
\end{center}
\noindent
Abbreviating a derivative step by $d$ and a simplification step by $s$ (applied from right to left to $r$), I tried to establish equivalences between regular expressions such as $dddr$ and $dddsr$; but right now I am only able to establish $dsr \sim_{m\epsilon} dr$, using structural induction on $r$. The equivalences involving multiple derivative operations are harder to prove. Two attempts have been made.

(1) Induction on the number of derivative operations (or, in other words, on the length of the string $s$): the inductive hypothesis was initially specified as ``for an arbitrary regular expression $r$ and all strings $s$ in the language of $r$ whose length does not exceed $n$, $\textit{ders}\,s\,r \sim_{m\epsilon} \textit{ders\_simp}\,s\,r$'', and the proof goal may be stated as the same property for strings whose length does not exceed $n+1$. The problem here is that although we can easily break a string $s$ of length $n+1$ into $s_1@[c]$, it is not that easy to use the induction hypothesis as a stepping stone, because $s_1$ may well not be in the language $L(r)$. This inhibits us from obtaining the fact that $\textit{ders}\,s_1\,r \sim_{m\epsilon} \textit{ders\_simp}\,s_1\,r$. Further exploration is needed to amend this hypothesis so that it also covers the situation where $s_1$ does not lead to a nullable derivative. For example, what information (bits? values?)
can be extracted from the regular expression $\textit{ders}\,s_1\,r$, so that we can compute or predict the possible result of $\bmkeps$ after another derivative operation? What function $f$ can be used to carry out this task? The possible way of exploration can be perceived more directly through the following requirements: find a function $f$ and a function $g$ such that, for $s = s_1@[c]$,
\begin{center}
\begin{tabular}{lcl}
$f(\textit{bders}\; s_1\; r)$ & $=$ & $\textit{re}_1$\\
$f(\textit{bders\_simp}\; s_1\; r)$ & $=$ & $\textit{re}_2$\\
$\bmkeps(\textit{bders}\; s\; r)$ & $=$ & $g(\textit{re}_1,c)$\\
$\bmkeps(\textit{bders\_simp}\; s\; r)$ & $=$ & $g(\textit{re}_2,c)$\\
$g(\textit{re}_1,c)$ & $=$ & $g(\textit{re}_2,c)$
\end{tabular}
\end{center}
\noindent
The inductive hypothesis would be ``for all strings $s_1$ of length $\leq n$: $f(\textit{bders}\;s_1\;r) = \textit{re}_1$ and $f(\textit{bders\_simp}\;s_1\;r) = \textit{re}_2$''; proving this would be a lemma for the main proof, which for $s = s_1@[c]$ states that $\bmkeps(\textit{bders}\;s\;r) = g(\textit{re}_1,c)$ and $\bmkeps(\textit{bders\_simp}\;s\;r) = g(\textit{re}_2,c)$. For the lemma to be provable, $f$ needs to be a recursive property: it needs to store not only the ``nullable after one character'' information, but also the ``nullable after two characters'' information, and so on, so that it is able to predict what $f$ will compute after a derivative operation; in other words, it needs to be ``infinitely recursive''. To prove the lemma, that is, to get ``for all strings $s_1$ of length $\leq n+1$: $f(\textit{bders}\;s_1\;r) = \textit{re}_3$ and $f(\textit{bders\_simp}\;s_1\;r) = \textit{re}_4$'' from ``for all strings $s_1$ of length $\leq n$: $f(\textit{bders}\;s_1\;r) = \textit{re}_1$ and $f(\textit{bders\_simp}\;s_1\;r) = \textit{re}_2$'', it might be best to construct an auxiliary function $h$ such that $h(\textit{re}_1, c) = \textit{re}_3$ and $h(\textit{re}_2, c) = \textit{re}_4$, with $\textit{re}_3 = f(\textit{bder}\;c\;(\textit{bders}\;s_1\;r))$ and $\textit{re}_4 = f(\simp(\textit{bder}\;c\;(\textit{bders\_simp}\;s_1\;r)))$. The key point here is that we are not only interested in what $\textit{bders}\;s\;r$ produces under $\bmkeps$, but also in how it behaves after one further derivative operation and then $\bmkeps$, after two derivative operations, and so on. In essence, we would be preserving the regular expression itself under the function $f$, in a less compact form than the regular expression itself: we would not just be recording but also interpreting what the regular expression matches. In other words, we need to prove properties of $\textit{bders\_simp}\;s\;r$ beyond the $\bmkeps$ result, i.e., not just for the nullable ones, but also for those containing remaining characters.

(2) We observed the fact that, writing $d$ for a derivative step and $s$ for a simplification step as before, $\erase(sdddddr) = \erase(sdsdsdsr)$; that is to say, despite the bits being moved around in the regular expression (a difference in bitcodes), the structure of the unannotated regular expression after one final simplification is exactly the same as after the same sequence of derivative operations with simplification along the way. However, without $\erase$ the above equality does not always hold: for the regular expression $(a+b)(a+a^*)$ and the string $aa$ we have seen above that the two sides differ. This equivalence-class method might still have the potential for proving the main goal, but it is not finished yet; in parallel I have tried the other method using $\retrieve$ described above.

\noindent\rule[0.5ex]{\linewidth}{1pt}

This PhD-project is about regular expression matching and lexing. Given the maturity of this topic, the reader might wonder: Surely, regular expressions must have already been studied to death? What could possibly be \emph{not} known in this area? And surely all implemented algorithms for regular expression matching are blindingly fast?

Unfortunately these preconceptions are not supported by evidence: Take for example the regular expression $(a^*)^*\,b$ and ask whether strings of the form $aa..a$ match this regular expression. Obviously this is not the case---the expected $b$ in the last position is missing. One would expect that modern regular expression matching engines can find this out very quickly.
Alas, if one tries this example in JavaScript, Python or Java 8 with strings of 28 $a$'s, one discovers that this decision takes around 30 seconds, and considerably longer when adding a few more $a$'s, as the graphs below show:

\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
 xlabel={$n$},
 x label style={at={(1.05,-0.05)}},
 ylabel={time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=5cm,
 height=4cm,
 legend entries={JavaScript},
 legend pos=north west,
 legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
 xlabel={$n$},
 x label style={at={(1.05,-0.05)}},
 %ylabel={time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=5cm,
 height=4cm,
 legend entries={Python},
 legend pos=north west,
 legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
 xlabel={$n$},
 x label style={at={(1.05,-0.05)}},
 %ylabel={time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
 width=5cm,
 height=4cm,
 legend entries={Java 8},
 legend pos=north west,
 legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}

\noindent These are clearly abysmal and possibly surprising results. One would expect these systems to do much better than that---after all, given a DFA and a string, deciding whether a string is matched by this DFA should be linear in the size of the regular expression and the string.

Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to exhibit this super-linear behaviour. But unfortunately, such regular expressions are not just a few outliers. They are actually frequent enough that a separate name has been created for them---\emph{evil regular expressions}. In empirical work, Davis et al report that they have found thousands of such evil regular expressions in the JavaScript and Python ecosystems \cite{Davis18}. A static-analysis approach that is both sound and complete exists~\cite{17Bir}, but the running time on certain examples from the RegExLib and Snort regular-expression libraries is unacceptable. Therefore the problem of efficiency still remains.

This super-linear blowup in matching algorithms sometimes causes considerable grief in real life: for example on 20 July 2016 one evil regular expression brought the webpage \href{http://stackexchange.com}{Stack Exchange} to its knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}} In this instance, a regular expression intended to just trim white spaces from the beginning and the end of a line actually consumed massive amounts of CPU resources---causing web servers to grind to a halt. This happened when a post with 20,000 white spaces was submitted, but importantly the white spaces were neither at the beginning nor at the end. As a result, the regular expression matching engine needed to backtrack over many choices. In this example, the time needed to process the string was $O(n^2)$ with respect to the string length.
This quadratic overhead was enough for the homepage of Stack Exchange to respond so slowly that the load balancer assumed there must be some attack and therefore stopped the servers from responding to any requests. This made the whole site unavailable. Another very recent example is a global outage of all Cloudflare servers on 2 July 2019. A poorly written regular expression exhibited exponential behaviour and exhausted the CPUs that serve HTTP traffic. Although the outage had several causes, at the heart was a regular expression that was used to monitor network traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}

The underlying problem is that many ``real life'' regular expression matching engines do not use DFAs for matching. This is because they support regular expressions that are not covered by the classical automata theory, and in this more general setting there are quite a few research questions still unanswered and fast algorithms still need to be developed (for example how to treat bounded repetitions, negation and back-references efficiently).
%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
%how do they avoid dfas exponential states if they use them for fast matching?

There is also another under-researched problem to do with regular expressions and lexing, i.e.~the process of breaking up strings into sequences of tokens according to some regular expressions. In this setting one is not just interested in whether or not a regular expression matches a string, but also in \emph{how}. Consider for example a regular expression $r_{key}$ for recognising keywords such as \textit{if}, \textit{then} and so on; and a regular expression $r_{id}$ for recognising identifiers (say, a single character followed by characters or numbers). One can then form the compound regular expression $(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how should the string \textit{iffoo} be tokenised? It could be tokenised as a keyword followed by an identifier, or the entire string as a single identifier. Similarly, how should the string \textit{if} be tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would ``fire''---so is it an identifier or a keyword? While in applications there is a well-known strategy to decide these questions, called POSIX matching, only relatively recently have precise definitions of what POSIX matching actually means been formalised \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Such a definition has also been given by Sulzmann and Lu \cite{Sulzmann2014}, but the corresponding correctness proof turned out to be faulty \cite{AusafDyckhoffUrban2016}. Roughly, POSIX matching means matching the longest initial substring. In the case of a tie, the initial sub-match is chosen according to some priorities attached to the regular expressions (e.g.~keywords have a higher priority than identifiers). This sounds rather simple, but according to Grathwohl et al \cite[Page 36]{CrashCourse2014} this is not the case.
They wrote:
\begin{quote}\it{}
``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{quote}
\noindent
This is also supported by evidence collected by Kuklewicz \cite{Kuklewicz} who noticed that a number of POSIX regular expression matchers calculate incorrect results.

Our focus in this project is on an algorithm introduced by Sulzmann and Lu in 2014 for regular expression matching according to the POSIX strategy \cite{Sulzmann2014}. Their algorithm is based on an older algorithm by Brzozowski from 1964 where he introduced the notion of derivatives of regular expressions~\cite{Brzozowski1964}. We shall briefly explain this algorithm next.

\section{The Algorithm by Brzozowski based on Derivatives of Regular Expressions}

Suppose (basic) regular expressions are given by the following grammar:
\[ r ::= \ZERO \mid \ONE \mid c \mid r_1 \cdot r_2 \mid r_1 + r_2 \mid r^* \]
\noindent
The intended meaning of the constructors is as follows: $\ZERO$ cannot match any string, $\ONE$ can match the empty string, the character regular expression $c$ can match the character $c$, and so on.

The ingenious contribution by Brzozowski is the notion of \emph{derivatives} of regular expressions. The idea behind this notion is as follows: suppose a regular expression $r$ can match a string of the form $c\!::\!s$ (that is a list of characters starting with $c$), what does the regular expression look like that can match just $s$? Brzozowski gave a neat answer to this question. He started with the definition of $\nullable$:
\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
This function simply tests whether the empty string is in $L(r)$. He then defined the following operation on regular expressions, written $r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ & $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
%Assuming the classic notion of a
%\emph{language} of a regular expression, written $L(\_)$, t
\noindent
The main property of the derivative operation is that
\begin{center}
$c\!::\!s \in L(r)$ holds if and only if $s \in L(r\backslash c)$.
\end{center}
\noindent
For us the main advantage is that derivatives can be straightforwardly implemented in any functional programming language, and are easily definable and reasoned about in theorem provers---the definitions just consist of inductive datatypes and simple recursive functions.
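\medskip
\noindent
To illustrate this point, below is a minimal sketch of Brzozowski's matcher in Scala, the language of our prototype implementation. The constructor and function names are our own choices, and the code is meant as an illustration rather than the definitive implementation; the string-derivative function \texttt{ders} and the matcher anticipate the definitions given shortly.
\begin{verbatim}
// A sketch of Brzozowski's matcher, following the definitions above.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a string: take derivatives character by character
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))

// matches("aaab", SEQ(STAR(CHAR('a')), CHAR('b')))  gives true
\end{verbatim}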
Moreover, the notion of derivatives can be easily generalised to cover extended regular expression constructors such as the not-regular expression, written $\neg\,r$, or bounded repetitions (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so straightforwardly realised within the classic automata approach. For the moment however, we focus only on the usual basic regular expressions.

Now if we want to find out whether a string $s$ matches with a regular expression $r$, we can build the derivatives of $r$ w.r.t.\ (in succession) all the characters of the string $s$. Finally, we test whether the resulting regular expression can match the empty string. If yes, then $r$ matches $s$; if not, it does not. To implement this idea we can generalise the derivative operation to strings like this:
\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
and then define the regular-expression matching algorithm as:
\[match\;s\;r \;\dn\; nullable(r\backslash s)\]
\noindent
This algorithm looks graphically as follows:
\begin{equation}\label{graph:matcher}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}
\noindent
where we start with a regular expression $r_0$, build successive derivatives until we exhaust the string and then use \textit{nullable} to test whether the result can match the empty string. It can be relatively easily shown that this matcher is correct (that is, given an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

\section{Values and the Algorithm by Sulzmann and Lu}

One limitation of Brzozowski's algorithm is that it only produces a YES/NO answer for whether a string is being matched by a regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm to allow generation of an actual matching, called a \emph{value} or sometimes also \emph{lexical value}. These values and regular expressions correspond to each other as illustrated in the following table:
\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
 & $\mid$ & $\ONE$ \\
 & $\mid$ & $c$ \\
 & $\mid$ & $r_1 \cdot r_2$\\
 & $\mid$ & $r_1 + r_2$ \\
\\
 & $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
 & & $\Empty$ \\
 & $\mid$ & $\Char(c)$ \\
 & $\mid$ & $\Seq\,v_1\, v_2$\\
 & $\mid$ & $\Left(v)$ \\
 & $\mid$ & $\Right(v)$ \\
 & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}
\noindent
No value corresponds to $\ZERO$; $\Empty$ corresponds to $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the sequence regular expression and so on. The idea of values is to encode a kind of lexical value for how the sub-parts of a regular expression match the sub-parts of a string. To see this, suppose there is a \emph{flatten} operation, written $|v|$, for values. We can use this function to extract the underlying string of a value $v$. For example, $|\mathit{Seq} \,(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$.
Using flatten, we can describe how values encode lexical values: $\Seq\,v_1\,v_2$ encodes a tree with two children nodes that tell how the string $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$, whereby $r_1$ matches the substring $|v_1|$ and, respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched is contained in the children nodes $v_1$ and $v_2$ of the parent $\textit{Seq}$.

To give a concrete example of how values work, consider the string $xy$ and the regular expression $(x + (y + xy))^*$. We can view this regular expression as a tree, and if the string $xy$ is matched by two Star ``iterations'', then the $x$ is matched by the left-most alternative in this tree and the $y$ by the left alternative nested inside the right alternative. This suggests to record this matching as
\begin{center}
$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
\end{center}
\noindent
where $\Stars \; [\ldots]$ records all the iterations; and $\Left$, respectively $\Right$, which alternative is used. The value for matching $xy$ in a single ``iteration'', i.e.~the POSIX value, would look as follows
\begin{center}
$\Stars\,[\Right(\Right(\Seq\,(\Char\,x)\,(\Char\,y)))]$
\end{center}
\noindent
where $\Stars$ has only a single-element list for the single iteration, and $\Seq$ indicates that $xy$ is matched by the sequence regular expression in the third alternative (hence the two $\Right$s).

The contribution of Sulzmann and Lu is an extension of Brzozowski's algorithm by a second phase (the first phase being building successive derivatives---see \eqref{graph:matcher}). In this second phase, a POSIX value is generated in case the regular expression matches the string. Pictorially, the Sulzmann and Lu algorithm is as follows:
\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}
\noindent
For convenience, we shall employ the following notations: the regular expression we start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from left to right, we build the derivatives $r_1$, $r_2$, \ldots{} according to the characters $c_0$, $c_1$, \ldots{} until we exhaust the string and obtain the derivative $r_n$. We test whether this derivative is $\textit{nullable}$ or not. If not, we know the string does not match $r$ and no value needs to be generated. If yes, we start building the values incrementally by \emph{injecting} back the characters into the earlier values $v_n, \ldots, v_0$. This is the second phase of the algorithm, from right to left. For the first value $v_n$, we call the function $\textit{mkeps}$, which builds the lexical value for how the empty string has been matched by the (nullable) regular expression $r_n$. This function is defined as
\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$ & \textit{if} $\nullable(r_{1})$\\
 & & \textit{then} $\Left(\mkeps(r_{1}))$\\
 & & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}
\noindent
There are no cases for $\ZERO$ and $c$, since these regular expressions cannot match the empty string. Note also that in case of alternatives we give preference to the regular expression on the left-hand side.
This will become important later on in determining which value is calculated. After the $\mkeps$-call, we inject back the characters one by one in order to build the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ ($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$. After injecting back $n$ characters, we get the lexical value for how $r_0$ matches $s$. For this Sulzmann and Lu defined a function that reverses the ``chopping off'' of characters during the derivative phase. The corresponding function is called \emph{injection}, written $\textit{inj}$; it takes three arguments: the first one is a regular expression ${r_{i-1}}$, before the character is chopped off; the second is a character ${c_{i-1}}$, the character we want to inject; and the third argument is the value ${v_i}$, into which one wants to inject the character (it corresponds to the regular expression after the character has been chopped off). The result of this function is a new value. The definition of $\textit{inj}$ is as follows:
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 $\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
 $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
 $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
 $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
 $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
 $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
\noindent
This definition is by recursion on the ``shape'' of regular expressions and values. To understand this definition better, consider the situation when we build the derivative of the regular expression $r_{i-1}$. For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a ``hole'' in $r_i$ and its corresponding value $v_i$. To calculate $v_{i-1}$, we need to locate where that hole is and fill it. We can find this location by comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$ is of shape $\Left(\Seq(v_1,v_2))$, we know immediately that $r_a$ must be nullable and hence
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
\noindent
because otherwise, if $r_a$ were not nullable, we would have
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\]
\noindent
and the value $v_i$ would be of shape $\Seq(\ldots)$, contradicting the fact that $v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left branch of
\[ (r_a \cdot r_b)\backslash c = \underline{(r_a\backslash c) \cdot r_b} \,+\, r_b\backslash c\]
(underlined) is taken instead of the right one. This means $c$ was chopped off from $r_a$ rather than $r_b$. We have therefore found out that the hole is in $r_a$, so we recursively call $\inj\, r_a\,c\,v_1$ to fill that hole in $v_1$. After injection, the value $v_{i-1}$ for $r_{i-1} = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_1)\,v_2$. The other clauses can be understood in a similar way.
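\medskip
\noindent
The value datatype, $\mkeps$ and $\inj$ can be rendered in Scala as follows; this is a sketch building on the previous one, with the recursive \texttt{lex} function at the end combining the two phases (the names are our own):
\begin{verbatim}
// Values, mkeps and inj, mirroring the clauses above.
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val    // shadows scala.Left locally
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// partial: ZERO and CHAR cannot match the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// derivatives forward, then mkeps and injections backward
def lex(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lex(der(c, r), cs).map(v => inj(r, c, v))
}

// Example from above: matching "xy" against (x + (y + xy))* gives
//   lex(STAR(ALT(CHAR('x'), ALT(CHAR('y'),
//       SEQ(CHAR('x'), CHAR('y'))))), "xy".toList)
// = Some(Stars(List(Right(Right(Sequ(Chr('x'), Chr('y')))))))
\end{verbatim}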
The following example gives an insight into $\textit{inj}$'s effect and how Sulzmann and Lu's algorithm works as a whole. Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against the string $abc$ (when $abc$ is written as a regular expression, the standard way of expressing it is $a \cdot (b \cdot c)$, but we usually omit the parentheses and dots here for better readability). This algorithm returns a POSIX value, which means it will produce the longest matching. Consequently, it matches the string $abc$ in one star iteration, using the longest alternative $abc$ of the sub-expression under the star (we shall use $r$ to denote this sub-expression for conciseness):
\[\underbrace{((((a+b)+ab)+c)+abc)}_{r}\]
\noindent
Before $\textit{inj}$ is called, our lexer first builds the derivatives using the string $abc$ (we simplified some regular expressions like $\ZERO \cdot b$ to $\ZERO$ for conciseness; we also omit parentheses if they are clear from the context):
\begin{center}
\begin{tabular}{lcl}
$r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$\\
 & $\xrightarrow{\backslash b}$ & $r_2 = (\ZERO+\ZERO+\ONE \cdot \ONE + \ZERO + \ONE \cdot \ONE \cdot c) \cdot r^* +(\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^*$\\
 & $\xrightarrow{\backslash c}$ & $r_3 = ((\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^* + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*) + $\\
 & & $\phantom{r_3 = (} ((\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^* + (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^* )$
\end{tabular}
\end{center}
\noindent
Since $r_3$ is nullable, we can call $\textit{mkeps}$ to construct a lexical value for how $r_3$ matched the string $abc$. This function gives the following value $v_3$:
\begin{center}
$\Left(\Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty,\Empty))), \Stars [])))$
\end{center}
The outer $\Left(\Left(\ldots))$ tells us the leftmost nullable part of $r_3$ (underlined):
\begin{center}
\begin{tabular}{l@{\hspace{2mm}}l}
 & $\big(\underline{(\ZERO+\ZERO+\ZERO+ \ZERO+ \ONE \cdot \ONE \cdot \ONE) \cdot r^*} \;+\; (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^*\big)$ \smallskip\\
 $+$ & $\big((\ZERO+\ONE+\ZERO + \ZERO + \ZERO) \cdot r^* \;+\; (\ZERO+\ZERO+\ZERO + \ONE + \ZERO) \cdot r^* \big)$
\end{tabular}
\end{center}
\noindent
Note that the leftmost location of the term $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^*$ (which corresponds to the initial sub-match $abc$) allows $\textit{mkeps}$ to pick it up, because $\textit{mkeps}$ is defined to always choose the left one when it is nullable. In the case of this example, $abc$ is preferred over $a$ or $ab$. This $\Left(\Left(\ldots))$ location is generated by two applications of the splitting clause
\begin{center}
$(r_1 \cdot r_2)\backslash c \;\;(\textit{when} \; r_1 \; \textit{nullable}) \, = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c.$
\end{center}
\noindent
By this clause, we put $r_1 \backslash c \cdot r_2 $ at the \textit{front} and $r_2 \backslash c$ at the \textit{back}. This allows $\textit{mkeps}$ to always pick up, among two matches, the one with the longer initial sub-match. Removing the outside $\Left(\Left(...))$, the inside sub-value
\begin{center}
$\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Empty))), \Stars [])$
\end{center}
\noindent
tells us how the empty string $[]$ is matched with $(\ZERO+\ZERO+\ZERO + \ZERO + \ONE \cdot \ONE \cdot \ONE) \cdot r^*$. We match $[]$ by a sequence of two nullable regular expressions.
The first one is an alternative; for it we take the rightmost branch $\ONE \cdot \ONE \cdot \ONE$, whose language contains the empty string. The second nullable regular expression is a Kleene star, and $\Stars\,[]$ tells us how it matches the empty string: with zero iterations.

Now $\textit{inj}$ injects characters back and incrementally builds a lexical value based on $v_3$. Using the value $v_3$, the character $c$, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$: $\textit{inj} \; r_2 \; c \; v_3$ gives us
\begin{center}
$v_2 = \Left(\Seq(\Right(\Seq(\Empty, \Seq(\Empty, \Char\,c))), \Stars [])),$
\end{center}
which tells us how $r_2$ matched $[c]$. After this we inject back the character $b$, and get
\begin{center}
$v_1 = \Seq(\Right(\Seq(\Empty, \Seq(\Char\,b, \Char\,c))), \Stars [])$
\end{center}
for how
\begin{center}
$r_1= (\ONE+\ZERO+\ONE \cdot b + \ZERO + \ONE \cdot b \cdot c) \cdot r^*$
\end{center}
matched the string $bc$ before it was split into two substrings. Finally, after injecting the character $a$ back into $v_1$, we get the lexical value tree
\begin{center}
$v_0= \Stars [\Right(\Seq(\Char\,a, \Seq(\Char\,b, \Char\,c)))]$
\end{center}
for how $r^*$ matched $abc$. This completes the algorithm.

Readers might have noticed that the lexical value information is actually already available when doing the derivatives. For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$, we can take the initial match to be
\begin{center}
\begin{enumerate}
    \item[1)] just $a$, or
    \item[2)] the string $ab$, or
    \item[3)] the string $abc$.
\end{enumerate}
\end{center}
\noindent
In order to differentiate between these choices, we just need to remember their positions---$a$ is on the left, $ab$ is in the middle, and $abc$ is on the right. Which of these alternatives is chosen later does not affect their relative position, because the algorithm does not change this order. If this parsing information can be determined and does not change because of later derivatives, there is no point in traversing this information twice. This leads to an optimisation---if we store the information for lexical values inside the regular expression, update it when we take a derivative, and collect the information when finished with the derivatives by calling $\textit{mkeps}$ to decide which branch is POSIX, we can generate the lexical value in one pass, instead of doing the remaining $n$ injections. This leads to Sulzmann and Lu's novel idea of using bitcodes in derivatives.

In the next section, we shall focus on the bitcoded algorithm and the process of simplification of regular expressions. This is needed in order to obtain \emph{fast} versions of Brzozowski's, and Sulzmann and Lu's algorithms. This is where the PhD-project aims to advance the state-of-the-art.

\section{Simplification of Regular Expressions}

Using bitcodes to guide parsing is not a novel idea. It was applied to context-free grammars and then adapted by Henglein and Nielson for efficient regular expression lexing using DFAs~\cite{nielson11bcre}. Sulzmann and Lu took this idea of bitcodes a step further by integrating bitcodes into derivatives. The reason why we want to use bitcodes in this project is that we want to introduce more aggressive simplification rules in order to keep the size of derivatives small throughout.
This is because the main drawback of building successive derivatives according to Brzozowski's definition is that they can grow very quickly in size. This is mainly due to the fact that the derivative operation often generates ``useless'' $\ZERO$s and $\ONE$s in the derivatives. As a result, if implemented naively, both algorithms by Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example, when starting with the regular expression $(a + aa)^*$ and building 12 successive derivatives w.r.t.~the character $a$, one obtains a derivative regular expression with more than 8000 nodes (when viewed as a tree). Operations like $\textit{der}$ and $\nullable$ need to traverse such trees and consequently the bigger the size of the derivative, the slower the algorithm.

Fortunately, one can simplify regular expressions after each derivative step. Various simplifications of regular expressions are possible, such as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not affect the answer for whether a regular expression matches a string or not, but fortunately also do not affect the POSIX strategy of how regular expressions match strings---although the latter is much harder to establish. Some initial results in this regard have been obtained in \cite{AusafDyckhoffUrban2016}.

Unfortunately, the simplification rules outlined above are not sufficient to prevent a size explosion in all cases. We believe a tighter bound can be achieved that prevents an explosion in \emph{all} cases. Such a tighter bound is suggested by work of Antimirov who proved that (partial) derivatives can be bounded by the number of characters contained in the initial regular expression \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular expressions as follows:
\begin{center}
\begin{tabular}{lcl}
 $\textit{pder} \; c \; \ZERO$ & $\dn$ & $\emptyset$\\
 $\textit{pder} \; c \; \ONE$ & $\dn$ & $\emptyset$ \\
 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{ \ONE \} \; \textit{else} \; \emptyset$ \\
 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $\textit{pder} \; c \; r_1 \cup \textit{pder} \; c \; r_2$ \\
 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; \nullable \; r_1 $\\
 & & $\textit{then} \; \{ r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1 \} \cup \textit{pder} \; c \; r_2 \;$\\
 & & $\textit{else} \; \{ r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1 \} $ \\
 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in \textit{pder} \; c \; r \} $ \\
\end{tabular}
\end{center}
\noindent
A partial derivative of a regular expression $r$ is essentially a set of regular expressions that are either $r$'s children expressions or concatenations of them. Antimirov proved a tight bound on the sum of the sizes of \emph{all} partial derivatives, no matter what the string looks like: roughly speaking, this sum is at most cubic in the size of the regular expression.

If we want the size of derivatives in Sulzmann and Lu's algorithm to stay below this bound, we need more aggressive simplifications. Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as deleting duplicates whenever possible. For example, the parentheses in $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these more aggressive simplification rules helps us to achieve the same size bound as that of the partial derivatives.
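\medskip
\noindent
For comparison, Antimirov's partial derivatives can be sketched in Scala as sets of regular expressions; this is our own rendering of the clauses above, reusing the \texttt{Rexp} datatype from the first sketch:
\begin{verbatim}
// Antimirov's partial derivatives: a set of regular expressions.
def pder(c: Char, r: Rexp): Set[Rexp] = r match {
  case ZERO => Set()
  case ONE => Set()
  case CHAR(d) => if (c == d) Set(ONE) else Set()
  case ALT(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case SEQ(r1, r2) =>
    val left = pder(c, r1).map(SEQ(_, r2))
    if (nullable(r1)) left ++ pder(c, r2) else left
  case STAR(r1) => pder(c, r1).map(SEQ(_, STAR(r1)))
}
\end{verbatim}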
In order to implement the idea of ``spilling out alternatives'' and to make it compatible with the $\textit{inj}$-mechanism, we use \emph{bitcodes}. Bits and bitcodes (lists of bits) are just:
%This allows us to prove a tight
%bound on the size of regular expression during the running time of the
%algorithm if we can establish the connection between our simplification
%rules and partial derivatives.
%We believe, and have generated test
%data, that a similar bound can be obtained for the derivatives in
%Sulzmann and Lu's algorithm. Let us give some details about this next.
\begin{center}
$b ::= S \mid Z \qquad bs ::= [] \mid b:bs$
\end{center}
\noindent
The $S$ and $Z$ are arbitrary names for the bits in order to avoid confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or bit-lists) can be used to encode values (or incomplete values) in a compact form. This can be straightforwardly seen in the following coding function from values to bitcodes:
\begin{center}
\begin{tabular}{lcl}
 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
 $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
 $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
 $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
 $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\; code(\Stars\,vs)$
\end{tabular}
\end{center}
\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting $\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty star iteration into $\Z$, and the border where a local star terminates into $\S$ (the same convention as in the derivative and $\textit{bmkeps}$ operations below). This coding is lossy, as it throws away the information about characters, and also does not encode the ``boundary'' between two sequence values. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $\Left/\Right$ or for $\Stars$. The reason for choosing this compact way of storing information is that bits are small and can be easily manipulated and ``moved around'' in a regular expression. In order to recover values, we will need the corresponding regular expression as extra information.
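\medskip
\noindent
In Scala, bits and the coding function look as follows; this is a sketch (ours) that builds on the earlier value datatype:
\begin{verbatim}
// Bits and the (lossy) coding function from values to bitcodes.
abstract class Bit
case object Z extends Bit
case object S extends Bit

def code(v: Val): List[Bit] = v match {
  case Empty => Nil
  case Chr(_) => Nil                  // characters are not recorded
  case Left(v1) => Z :: code(v1)
  case Right(v1) => S :: code(v1)
  case Sequ(v1, v2) => code(v1) ++ code(v2)
  case Stars(Nil) => List(S)          // S: end of the iterations
  case Stars(v1 :: vs) => Z :: (code(v1) ++ code(Stars(vs)))
}
\end{verbatim}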
Because of this, the decoding function takes both a bitcode and a regular expression, and is defined as:
%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
 $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; (\Left\,v, bs_1)$\\
 $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; (\Right\,v, bs_1)$\\
 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
 & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
 $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
 $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
 & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
 $\textit{decode}\,bs\,r$ & $\dn$ & $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}

Sulzmann and Lu integrated the bitcodes into regular expressions to create annotated regular expressions \cite{Sulzmann2014}. \emph{Annotated regular expressions} are defined by the following grammar:
\begin{center}
\begin{tabular}{lcl}
 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\
 & $\mid$ & $\textit{ONE}\;\;bs$\\
 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
 & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
 & $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}
\end{center}
\noindent
where $bs$ stands for bitcodes, $a$ for \textbf{a}nnotated regular expressions, and $as$ for a list of annotated regular expressions. The alternative constructor ($\textit{ALTS}$) has been generalised to accept a list of annotated regular expressions rather than just two. We will show that these bitcodes encode information about the (POSIX) value that should be generated by the Sulzmann and Lu algorithm.

To do lexing using annotated regular expressions, we shall first transform the usual (un-annotated) regular expressions into annotated regular expressions. This operation is called \emph{internalisation} and is defined as follows:
%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
 $(r_1 + r_2)^\uparrow$ & $\dn$ & $\textit{ALTS}\;[]\,List((\textit{fuse}\,[\Z]\,r_1^\uparrow),\, (\textit{fuse}\,[\S]\,r_2^\uparrow))$\\
 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
 $(r^*)^\uparrow$ & $\dn$ & $\textit{STAR}\;[]\,r^\uparrow$
\end{tabular}
\end{center}
%\end{definition}
\noindent
We use up-arrows here to indicate that the basic un-annotated regular expressions are ``lifted up'' into something slightly more complex.
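\medskip
\noindent
Before turning to the operations on annotated regular expressions, here is a sketch of the decoding functions in Scala (ours; the definitions above are the authoritative ones):
\begin{verbatim}
// decode2 plays the role of decode' above: it consumes bits guided
// by the regular expression and returns the decoded value together
// with the remaining bits (partial: fails on ill-formed bitcodes).
def decode2(bs: List[Bit], r: Rexp): (Val, List[Bit]) = (r, bs) match {
  case (ONE, _) => (Empty, bs)
  case (CHAR(c), _) => (Chr(c), bs)
  case (ALT(r1, _), Z :: rest) =>
    val (v, bs1) = decode2(rest, r1); (Left(v), bs1)
  case (ALT(_, r2), S :: rest) =>
    val (v, bs1) = decode2(rest, r2); (Right(v), bs1)
  case (SEQ(r1, r2), _) =>
    val (v1, bs1) = decode2(bs, r1)
    val (v2, bs2) = decode2(bs1, r2);
    (Sequ(v1, v2), bs2)
  case (STAR(_), S :: rest) => (Stars(Nil), rest)  // end of iterations
  case (STAR(r1), Z :: rest) =>                    // one more iteration
    val (v, bs1) = decode2(rest, r1)
    decode2(bs1, STAR(r1)) match {
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
      case _ => sys.error("malformed bitcode")
    }
}

def decode(bs: List[Bit], r: Rexp): Option[Val] =
  decode2(bs, r) match {
    case (v, Nil) => Some(v)
    case _ => None
  }
\end{verbatim}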
In the fourth clause of internalisation, $\textit{fuse}$ is an auxiliary function that helps to attach bits to the front of an annotated regular expression. Its definition is as follows:
\begin{center}
\begin{tabular}{lcl}
 $\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
 $\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ & $\textit{ONE}\,(bs\,@\,bs')$\\
 $\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ & $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
 $\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,as)$ & $\dn$ & $\textit{ALTS}\,(bs\,@\,bs')\,as$\\
 $\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ & $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
 $\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ & $\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}
\end{center}
\noindent
After internalising the regular expression, we perform successive derivative operations on the annotated regular expressions. This derivative operation is the same as what we had previously for the basic regular expressions, except that we need to take care of the bitcodes:
%\begin{definition}{bder}
\begin{center}
\begin{tabular}{@{}lcl@{}}
 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & $\textit{if}\;c=d\;\textit{then}\;\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
 $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a_1$\\
 & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
 & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
 $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z]\,(a\,\backslash c))\, (\textit{STAR}\,[]\,a)$
\end{tabular}
\end{center}
%\end{definition}
\noindent
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence, we need to attach an additional bit $\Z$ to the front of $a \backslash c$ to indicate that there is one more star iteration. The $\textit{SEQ}$ clause is more subtle: it matters whether $a_1$ is $\textit{bnullable}$ (here \textit{bnullable} is exactly the same as $\textit{nullable}$, except that it is for annotated regular expressions; therefore we omit its definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how $a_1$ matches the string prior to the character $c$ (more on this later); then the right branch of the $\textit{ALTS}$, which is $\textit{fuse} \; (\textit{bmkeps}\,a_1)\, (a_2\backslash c)$, collapses the regular expression $a_1$ (as it has already been fully matched) and stores the parsing information at the head of the regular expression $a_2 \backslash c$ by fusing it to it. The bit-sequence $bs$, which was initially attached to the head of the $\textit{SEQ}$, has now been elevated to the top-level of the $\textit{ALTS}$, as this information will be needed whichever way the $\textit{SEQ}$ is matched---no matter whether $c$ belongs to $a_1$ or $a_2$.
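\medskip
\noindent
The annotated regular expressions, together with \textit{fuse}, internalisation, \textit{bnullable} and the bitcoded derivative, can be sketched in Scala as follows. This is our own rendering; note that \texttt{bder} refers to \texttt{bmkeps}, which is defined in the next section, so these definitions are meant to be compiled together (e.g.\ inside one object):
\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], a: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: List[Bit], a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

// the (_)^up operation in the text
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(Z), internalise(r1)),
                    fuse(List(S), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}

def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// the bitcoded derivative (uses bmkeps from the next section)
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
}
\end{verbatim}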
After building these derivatives and maintaining all the lexing information, we complete the lexing by collecting the bitcodes using a generalised version of the $\textit{mkeps}$ function for annotated regular expressions, called $\textit{bmkeps}$:
%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
 $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
 $\textit{bmkeps}\,(\textit{ALTS}\;bs\,(a::as))$ & $\dn$ & $\textit{if}\;\textit{bnullable}\,a$\\
 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
 & &$\textit{else}\;\textit{bmkeps}\,(\textit{ALTS}\;bs\,as)$\\
 $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
 $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ & $bs \,@\, [\S]$
\end{tabular}
\end{center}
%\end{definition}
\noindent
This function completes the value information by travelling along the path of the regular expression that corresponds to a POSIX value and collecting all the bitcodes, using $\S$ to indicate the end of star iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and decode them, we get the value we expect. The corresponding lexing algorithm looks as follows:
\begin{center}
\begin{tabular}{lcl}
 $\textit{blexer}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative operation from characters to strings (just like the derivatives for un-annotated regular expressions).

The main point of the bitcodes and annotated regular expressions is that we can apply rather aggressive (in terms of size) simplification rules in order to keep derivatives small. We have developed such ``aggressive'' simplification rules and generated test data that show that the expected bound can be achieved. Obviously we could only partially cover the search space, as there are infinitely many regular expressions and strings.

One modification we introduced is to allow a list of annotated regular expressions in the \textit{ALTS} constructor. This allows us to not just delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also unnecessary ``copies'' of regular expressions (very similar to simplifying $r + r$ to just $r$, but in a more general setting). Another modification is that we use simplification rules inspired by Antimirov's work on partial derivatives. They maintain the idea that only the first ``copy'' of a regular expression in an alternative contributes to the calculation of a POSIX value. All subsequent copies can be pruned away from the regular expression.
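\medskip
\noindent
In Scala, $\textit{bmkeps}$ and the bitcoded lexer can be sketched as follows (our rendering; \texttt{bders} generalises \texttt{bder} to strings):
\begin{verbatim}
// Collect the bitcode along the leftmost nullable path.
def bmkeps(a: ARexp): List[Bit] = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ++ bmkeps(a1)
    else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _) => bs ++ List(S)
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}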
A recursive definition of our simplification function, which looks somewhat similar to our Scala code, is given below:
\begin{center}
\begin{tabular}{@{}lcl@{}}
 $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
 $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; \textit{simp} \; as)) \; \textit{match} $ \\
 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\;bs\;a$ \\
 &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\
 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
\noindent
Here $\ZERO$ and $\ONE$ stand for the annotated versions of these regular expressions. The simplification function pattern-matches on the regular expression. When it detects that the regular expression is an alternative or a sequence, it first simplifies its children regular expressions recursively and then checks whether one of the children has turned into $\ZERO$ or $\ONE$, which might trigger further simplification at the current level. The most involved part is the $\textit{ALTS}$ clause, where we use two auxiliary functions, $\textit{flatten}$ and $\textit{distinct}$, to open up nested $\textit{ALTS}$ and to reduce as many duplicates as possible. The function $\textit{distinct}$ keeps only the first occurring copy and removes all later ones when it detects duplicates. The function $\textit{flatten}$ opens up nested \textit{ALTS}. Its recursive definition is given below:
\begin{center}
\begin{tabular}{@{}lcl@{}}
 $\textit{flatten} \; ((\textit{ALTS}\;bs\,as) :: as')$ & $\dn$ & $(\textit{map} \; (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
 $\textit{flatten} \; (\textit{ZERO} :: as')$ & $\dn$ & $ \textit{flatten} \; as' $ \\
 $\textit{flatten} \; (a :: as')$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
\end{tabular}
\end{center}
\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten function, except that it also removes $\ZERO$s. In terms of regular expressions, it removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
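\medskip
\noindent
A Scala sketch of this simplification is given below. It uses the standard library's \texttt{distinct}, which keeps the first occurrence of each duplicate; the $r \cdot \ONE$ clause from the table is omitted here because \texttt{fuse} can only attach bits at the front, while the $\ONE$-on-the-left clause carefully preserves the bits of the $\ONE$:
\begin{verbatim}
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)           // drop ZEROs
  case AALTS(bs, as1) :: rest =>                // open up nested ALTS
    as1.map(fuse(bs, _)) ++ flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case _ => a
}
\end{verbatim}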
Suppose we now apply simplification after each derivative step, and view these two operations as an atomic one: $a \backslash_{simp}\,c \dn \textit{simp}(a \backslash c)$. Then we can use the previous natural extension from derivatives w.r.t.~characters to derivatives w.r.t.~strings:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
and obtain an optimised version of the algorithm:
\begin{center}
\begin{tabular}{lcl}
 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
This algorithm keeps the regular expression size small: for example, with this simplification the more than 8000 nodes of our previous $(a + aa)^*$ example are reduced to just 6, and the size stays constant no matter how long the input string is.

\section{Current Work}

We are currently engaged in two tasks related to this algorithm. The first task is proving that our simplification rules actually do not affect the POSIX value that should be generated by the algorithm according to the specification of a POSIX value, and, furthermore, obtaining a much tighter bound on the sizes of derivatives. The result would be that our algorithm is correct and fast on all inputs; the original blow-up, as observed in JavaScript, Python and Java, would be excluded from happening in our algorithm. For this proof we use the theorem prover Isabelle. Once completed, this result will advance the state-of-the-art: Sulzmann and Lu wrote in their paper~\cite{Sulzmann2014} about the bitcoded ``incremental parsing method'' (that is the lexing algorithm outlined in this section):
\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method in Figure~5 in combination with the simplification steps in
  Figure 6 yields POSIX parse tree [our lexical values]. We have tested this
  claim extensively by using the method in Figure~3 as a reference but yet
  have to work out all proof details.''
\end{quote}
\noindent
We would like to settle this correctness claim. It is relatively straightforward to establish that after one simplification step, the part of a nullable derivative that corresponds to a POSIX value remains intact and can still be collected; in other words, we can show that
\begin{center}
$\textit{bmkeps} \; a = \textit{bmkeps} \; (\textit{simp} \; a)\qquad\text{provided $a$ is $\textit{bnullable}$,}$
\end{center}
\noindent
as this basically comes down to proving that actions like removing the additional $r$ in $r+r$ do not delete important POSIX information in a regular expression.
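\medskip
\noindent
In Scala the optimised lexer simply interleaves \texttt{simp} with each derivative step. The sketch below also contains a small sanity check of the property just stated on one concrete example; both the code and the example are our own, not taken from Sulzmann and Lu:
\begin{verbatim}
// Interleave simplification with each derivative step.
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}

// Sanity check: simplification should not change the bmkeps result
// of a bnullable derivative, e.g. for r = ab + ab and string "ab".
val r1 = ALT(SEQ(CHAR('a'), CHAR('b')), SEQ(CHAR('a'), CHAR('b')))
val a1 = bders("ab".toList, internalise(r1))
assert(bnullable(a1) && bmkeps(a1) == bmkeps(simp(a1)))
\end{verbatim}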
The hard part of this proof is to establish that
\begin{center}
$\textit{blexer\_simp}\;r\;s = \textit{blexer}\;r\;s$
\end{center}
\noindent
That is, if we take derivatives of the regular expression $r$ and simplify after each step, repeating this process until we exhaust the string, we obtain a regular expression $r''$ (on the LHS) that provides the same POSIX matching information as the result $r'$ (on the RHS) of the plain algorithm, which only takes derivatives and does no simplification at all. This might seem very unintuitive at first glance, as $r'$ could be exponentially larger than $r''$, but it can be explained in the following way: we are pruning away the possible matches that are not POSIX. Since there could be exponentially many non-POSIX matchings and only one POSIX matching, it is understandable that $r''$ can be a lot smaller while still providing the same POSIX value if there is one. This is not as straightforward as the previous proposition, as the two regular expressions $r'$ and $r''$ might have become very different. The crucial point is to identify the POSIX information in a regular expression and to track how it is modified, augmented and propagated during simplification, in parallel with the regular expression that has not been simplified in the subsequent derivative operations. To aid this, we use the helper function $\textit{retrieve}$ described by Sulzmann and Lu:
\begin{center}
\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
 $\textit{retrieve}\,(\textit{ONE}\,bs)\,\Empty$ & $\dn$ & $bs$\\
 $\textit{retrieve}\,(\textit{CHAR}\,bs\,c)\,(\Char\,d)$ & $\dn$ & $bs$\\
 $\textit{retrieve}\,(\textit{ALTS}\,bs\,(a::as))\,(\Left\,v)$ & $\dn$ & $bs \,@\, \textit{retrieve}\,a\,v$\\
 $\textit{retrieve}\,(\textit{ALTS}\,bs\,(a::as))\,(\Right\,v)$ & $\dn$ & $bs \,@\, \textit{retrieve}\,(\textit{ALTS}\,[]\,as)\,v$\\
 $\textit{retrieve}\,(\textit{SEQ}\,bs\,a_1\,a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ & $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
 $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,[])$ & $\dn$ & $bs \,@\, [\S]$\\
 $\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
 \multicolumn{3}{l}{\hspace{3cm}$bs \,@\, [\Z] \,@\, \textit{retrieve}\,a\,v\,@\, \textit{retrieve}\,(\textit{STAR}\,[]\,a)\,(\Stars\,vs)$}
\end{tabular}
\end{center}
This function assembles the bitcode that corresponds to a lexical value for how the current derivative matches the suffix of the string (the characters that have not yet appeared, but will appear as the successive derivatives go on; this ``future'' information is provided by the value $v$, which is computed by a pass of the algorithm that uses $\textit{inj}$ as described in the previous section), using information from both the annotated regular expression and the value. Sulzmann and Lu proposed this function, but did not prove anything about it. Ausaf and Urban used it to connect the bitcoded algorithm to the older algorithm by the following equation:
\begin{center}
$\textit{inj}\;r\;c\;v = \textit{decode} \; (\textit{retrieve}\; ((r^\uparrow)\backslash c)\;v)\;r$
\end{center}
\noindent
whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf and Urban also used this fact to prove the correctness of the bitcoded algorithm without simplification.
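\medskip
\noindent
In Scala, \textit{retrieve} can be sketched as follows (our rendering of the clauses above):
\begin{verbatim}
// Assemble a bitcode from an annotated regular expression and a
// value that fits its erased form (partial on mismatching pairs).
def retrieve(a: ARexp, v: Val): List[Bit] = (a, v) match {
  case (AONE(bs), Empty) => bs
  case (ACHAR(bs, _), Chr(_)) => bs
  case (AALTS(bs, a1 :: _), Left(v1)) => bs ++ retrieve(a1, v1)
  case (AALTS(bs, _ :: as), Right(v1)) =>
    bs ++ retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
    bs ++ retrieve(a1, v1) ++ retrieve(a2, v2)
  case (ASTAR(bs, _), Stars(Nil)) => bs ++ List(S)
  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
    bs ++ List(Z) ++ retrieve(a1, v1) ++
      retrieve(ASTAR(Nil, a1), Stars(vs))
}
\end{verbatim}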
Our purpose of using this function, however, is to establish
\begin{center}
$ \textit{retrieve} \;a \; v \;=\; \textit{retrieve} \; (\textit{simp}\,a) \; v'.$
\end{center}
The idea is that using $v'$, a simplified version of $v$ that has gone through the same simplification steps as $\textit{simp}(a)$, we are able to extract a bitcode that gives the same parsing information as the unsimplified one. However, we noticed that constructing such a $v'$ from $v$ is not so straightforward. The point of this is that we might be able to finally bridge the gap by proving
\begin{center}
$\textit{retrieve} \; (r^\uparrow \backslash s) \; v = \;\textit{retrieve} \; (\textit{simp}(r^\uparrow) \backslash s) \; v'$
\end{center}
\noindent
and subsequently
\begin{center}
$\textit{retrieve} \; (r^\uparrow \backslash s) \; v\; = \; \textit{retrieve} \; (r^\uparrow \backslash_{simp} \, s) \; v'.$
\end{center}
\noindent
The LHS of the above equation is the bitcode we want. This would prove that our simplified version of the regular expression still contains all the bitcodes needed. The task here is to find a way to compute the correct $v'$.

The second task is to speed up the more aggressive simplification. Currently it is slower than the original naive simplification by Ausaf and Urban (which of course can ``explode'' in some cases). It is therefore not surprising that it is also much slower than the regular expression engines in popular programming languages such as Java and Python on most inputs on which these engines behave linearly. For example, just rewriting the example regular expression from the beginning of this report, $(a^*)^*\,b$, into $a^*\,b$ eliminates the ambiguity in the matching and makes the matching time linear with respect to the input string size. This allows the DFA approach to become blindingly fast, and to dwarf the speed of our current implementation. For example, here is a comparison of the Java regex engine and our implementation on this example.
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n*1000$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=9,
    scaled ticks=true,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Bitcoded Algorithm},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n*1000$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=9,
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $a^*\,b$ with strings of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}
The Java regex engine can match strings of thousands of characters in a few milliseconds, whereas our current algorithm becomes excruciatingly slow on inputs of this size. In theory the running time is linear, but this does not appear to be the case in the actual implementation. It therefore still needs to be explored how to make our algorithm faster on all inputs. It could be that the recursive calls needed to manipulate the bits are causing the slowdown.
A possible solution is to rewrite the recursive functions in tail-recursive form. Another possibility would be to explore again the connection to DFAs in order to speed up the algorithm on sub-calls that are small enough. This is very much work in progress.

\section{Conclusion}

In this PhD-project we are interested in fast algorithms for regular expression matching. While this seems to be a ``settled'' area, in fact interesting research questions are popping up as soon as one steps outside the classic automata theory (for example in terms of what kinds of regular expressions are supported). The reason why it is interesting for us to look at the derivative approach introduced by Brzozowski for regular expression matching, and then much further developed by Sulzmann and Lu, is that derivatives can elegantly deal with some of the regular expressions that are of interest in ``real life''. This includes the not-regular expression, written $\neg\,r$ (that is, all strings that are not recognised by $r$), but also bounded regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is also hope that the derivatives can provide another angle for how to deal more efficiently with back-references, which are one of the reasons why regular expression engines in JavaScript, Python and Java choose not to implement the classic automata approach of transforming regular expressions into NFAs and then DFAs---because we simply do not know how such back-references can be represented by DFAs. We also plan to implement the bitcoded algorithm in some imperative language like C to see if the inefficiency of the Scala implementation is language-specific. To make this research more comprehensive we also plan to contrast our (faster) version of the bitcoded algorithm with the Symbolic Regex Matcher, RE2, the Rust regex engine, and the static analysis approach, by implementing them in the same language and then comparing their performance.

\bibliographystyle{plain}
\bibliography{root}

\end{document}