% Chapter 1
\chapter{Introduction} % Main chapter title
\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1}
%----------------------------------------------------------------------------------------
% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}
%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
\newcommand{\bderssimp}[2]{#1 \backslash_{bsimp} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimp} #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
\newcommand{\rsimp}[1]{\textit{rsimp}(#1)}
\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
\newcommand{\rdistinct}[2]{\textit{distinct} \; \textit{#1} \; #2}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{\textit{createdByStar}(#1)}}
\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}
\newcommand{\rder}[2]{#2 \backslash #1}
\def\AZERO{\textit{AZERO}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\bmkeps{\textit{bmkeps}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
\def\map{\textit{map}}
%\def\vsuf{\textit{vsuf}}
%\def\sflataux{\textit{sflat}\_\textit{aux}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable}(#1)}
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)\downarrow_r}
\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}
\def\RZERO{\mathbf{0}_r }
\def\RONE{\mathbf{1}_r}
\newcommand\RCHAR[1]{\mathbf{#1}_r}
\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\oplus #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{vsuf} \;#1\;#2}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.
%TODO: look up snort rules to use here--give readers idea of what regexes look like
Regular expressions are widely used in computer science:
be it in text editors\parencite{atomEditor} with syntax highlighting and auto-completion,
command line tools like $\mathit{grep}$ that facilitate easy
text processing, network intrusion
detection systems that reject suspicious traffic, or compiler
front ends---the majority of the solutions to these tasks
involve lexing with regular
expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
The regex engine of any popular programming language
lets you supply your own regexes and strings,
and returns matching/lexing information such as how a
sub-part of the regex matches a sub-part of the string.
These lexing libraries are on average quite fast.
%TODO: get source for SNORT/BRO's regex matching engine/speed
For example, the regex engines some network intrusion detection
systems use are able to process
megabytes or even gigabytes of network traffic per second.
Why do we need our own version, if the existing algorithms are
blindingly fast already? It turns out that this is not always the case.
Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings of around 28
$a$'s, one discovers that this decision takes around 30 seconds and
considerably longer when adding a few more $a$'s, as the graphs
below show:
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Python},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java 8},
legend pos=north west,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns.
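The slowdown is easy to reproduce. The following Scala snippet is a minimal
sketch that calls Java's built-in backtracking regex engine through
\texttt{scala.util.matching.Regex}; the exact timings will of course depend
on the machine and JVM used:
\begin{verbatim}
import scala.util.matching.Regex

object Catastrophic {
  def main(args: Array[String]): Unit = {
    val evil: Regex = "(a*)*b".r       // the regular expression (a*)* b
    for (n <- 20 to 28) {
      val input = "a" * n              // n a's, deliberately without a final b
      val start = System.nanoTime()
      evil.findFirstIn(input)          // backtracks over all ways of splitting
      val secs = (System.nanoTime() - start) / 1e9
      println(f"n = $n%2d   time = $secs%.2f s")
    }
  }
}
\end{verbatim}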
This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU-resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable.
A more
recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the
outage had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually quite widespread.
They occur so often that they have been given a
name---Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\parencite{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and the npm and pypi repositories.
They therefore concluded that evil regular expressions
are more than ``a parlour trick'': they are a problem that
requires
more research attention.
\section{Why are current algorithms for regexes slow?}
%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two phases:
the construction phase, in which the algorithm builds a
suitable data structure from the input regex $r$; we denote
its time cost by $P_1(r)$.
And the lexing
phase, in which the input string $s$ is read and the data structure
representing the regex $r$ is operated on. We denote the time
it takes by $P_2(r, s)$.\\
In the case of a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition---
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later. \\
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
The cost $P_2(r, s)$ is bounded by $|r|\cdot|s|$ if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $|r| \cdot 2^{|s|}$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.
\subsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ respectively that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.\\
Bounded repetitions, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denote regular expressions accepting strings
that can be divided into $c$ substrings, where each
substring is in $L(r)$.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
\node[state, red] (q_1) [right=of q_0] {$q_1$};
\node[state, red] (q_2) [right=of q_1] {$q_2$};
\node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
\path[->]
(q_0) edge node {a} (q_1)
edge [loop below] node {a,b} ()
(q_1) edge node {a,b} (q_2)
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
The red states are ``countdown states'' which count down
the number of characters still needed
to complete a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
$q_3$ is the last state, requiring 0 more characters, and is accepting.
Depending on the part of the
input string that has been read so far,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements of the power set $\mathit{Pow}(\{q_1, q_2, q_3\})$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
require at least $2^{n+1}$ states if $r$ contains
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools such as $\mathit{JFLEX}$
would generate gigantic $\mathit{DFA}$s or
report out-of-memory errors.
For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.
\subsection{The $\mathit{NFA}$ approach to regex matching}
One can simulate an $\mathit{NFA}$ in two ways:
one is by keeping track of all the active states after consuming
a character, and updating that set of states iteratively.
This can be viewed as a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation, and guarantee a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called a ``depth-first search'',
is efficient in many cases, but can end up
with exponential run time.\\
%TODO:COMPARE java python lexer speed with Rust and Go
The reason why languages like
Java and Python use backtracking algorithms is that they support back-references.
\subsection{Back References in Regexes---The Non-Regular Part}
If we have a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
$r_1(r_2(r_3r_4))$
\end{center}
We could label sub-expressions of interest
by parenthesizing them and giving
them a number by the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions $r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4: $1$ refers to the entire expression
$(r_1(r_2(r_3)(r_4)))$, $2$ refers to $r_2(r_3)(r_4)$, etc.
These sub-expressions are called ``capturing groups''.
We can use the following syntax to denote that we want a string just matched by a
sub-expression (capturing group) to appear at a certain location again,
exactly as it was:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
The backslash and the number $i$ are used to denote such
so-called ``back-references''.
Let $e$ be an expression made of regular expressions
and back-references, and let $e$ contain the expression $e_i$
as its $i$-th capturing group.
The semantics of a back-reference can be recursively
written as:
\begin{center}
	\begin{tabular}{c}
		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e),\quad s_i \in L(e_i),$\\
		$s_i\; \text{is the match of the $i$-th capturing group of ($e$, $s$)}\}$
	\end{tabular}
\end{center}
The concrete example
$((a|b|c|\ldots|z)^*)\backslash 1$
would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.\\
Back-references are a construct in the ``regex'' standard
that programmers find quite useful, but they are not
regular any more.
In fact, they allow regexes to express
languages that are not even contained in the class of context-free
languages.
For example, the back-reference expression $(a^*)b\backslash 1 b \backslash 1$
describes the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars~\parencite{campeanu2003formal}.
Such a language lies in the context-sensitive level
of the formal language hierarchy.
Solving the matching problem for regular expressions with back-references
is NP-complete~\parencite{alfred2014algorithms} and a non-backtracking,
efficient solution is not known to exist.
%TODO:read a bit more about back reference algorithms
It seems that languages like Java and Python made the trade-off
to support back-references at the expense of having to backtrack,
even in the case of regexes not involving back-references.\\
Summing this up, we can categorise existing
practical regex libraries into ones with linear
time guarantees, like Go and Rust, which impose restrictions
on the user input (not allowing back-references,
bounded repetitions cannot exceed 1000, etc.), and ones
that allow the programmer much freedom, but grind to a halt
on some non-negligible portion of inputs.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
\section{Buggy Regex Engines}
Another issue with these libraries is that there
is no correctness guarantee.
In some cases they either fail to generate a lexing result when there is a match,
or give the wrong match.
It turns out that regex libraries not only suffer from
exponential backtracking problems,
but also from undesired (or even buggy) outputs.
%TODO: comment from who
Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
This experience is echoed by the author's own
experiments with a few online regex testers:
A concrete example would be
the regex
\begin{verbatim}
(((((a*a*)b*)b){20})*)c
\end{verbatim}
and the string
\begin{verbatim}
baabaabababaabaaaaaaaaababaa
aababababaaaabaaabaaaaaabaab
aabababaababaaaaaaaaababaaaa
babababaaaaaaaaaaaaac
\end{verbatim}
This seemingly complex regex simply says ``some $a$'s
followed by some $b$'s, then a single $b$;
this iterates 20 times; finally there is a $c$''.
A POSIX match would involve the entire string, ``eating up''
all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly
This regex triggers catastrophic backtracking in
languages like Python and Java,
whereas it gives a non-POSIX and uninformative
match in languages like Go or .NET---a match consisting of only the
single character $c$.
As Grathwohl\parencite{grathwohl2014crash} commented,
\begin{center}
``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{center}
%\section{How people solve problems with regexes}
When a regular expression does not behave as intended,
people usually try to rewrite the regex into some equivalent form,
or they try to avoid the possibly problematic patterns completely~\parencite{Davis18},
a process in which many false positives arise.
Animated tools to ``debug'' regular expressions
are also quite popular, regexploit~\parencite{regexploit2021} and regex101~\parencite{regex101}
to name a few.
There is also static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.
%Arguably these methods limits the programmers' freedom
%or productivity when all they want is to come up with a regex
%that solves the text processing problem.
%TODO:also the regex101 debugger
\section{Our Solution---Formal Specification of POSIX and Brzozowski Derivatives}
Is it possible to have a regex lexing algorithm with proven correctness and
time complexity, which allows easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references?
We propose Brzozowski derivatives on regular expressions as
a solution to this.
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities. The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions.
An algorithm based on them by
Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
to include extended regular expressions and
simplification of the internal data structures,
eliminating the exponential behaviours.
%----------------------------------------------------------------------------------------
\section{Our Contribution}
This work addresses the vulnerability of super-linear and
buggy regex implementations by the combination
of Brzozowski's derivatives and interactive theorem proving.
We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee in terms of correctness and
running time as an Isabelle/HOL proof.
Then we improve the algorithm with an even stronger version of
simplification, and prove a time bound linear in the length of the input and
cubic in the size of the regular expression, using a technique by
Antimirov.
The main contribution of this thesis is a proven correct lexing algorithm
with formalized time bounds.
To the best of our knowledge, there are no lexing libraries using Brzozowski derivatives
that come with a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu proposed an algorithm for which they
claimed a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not the complexity.
In the performance evaluation section, they simply analyzed the run time
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.
We believe our result, a proof of performance on general
inputs rather than on specific examples, is a novel contribution.\\
\subsection{Related Work}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
%We propose Brzozowski's derivatives as a solution to this problem.
% about Lexing Using Brzozowski derivatives
\section{Preliminaries}
Suppose we have an alphabet $\Sigma$, the strings whose characters
are from $\Sigma$
can be expressed as $\Sigma^*$.
We use patterns to define a set of strings concisely. Regular expressions
are one of such patterns systems:
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
\mid c
\mid r_1 \cdot r_2
\mid r_1 + r_2
\mid r^*
\]
The language, or set of strings, denoted by a regular expression is defined as
\begin{center}
		\begin{tabular}{lcl}
			$L \; \ZERO$ & $\dn$ & $\{\}$\\
			$L \; \ONE$ & $\dn$ & $\{[]\}$\\
			$L \; c$ & $\dn$ & $\{[c]\}$\\
			$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
			$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; r_1 \wedge s_2 \in L \; r_2\}$\\
			$L \; (r^*)$ & $\dn$ & $ \bigcup_{n \geq 0} (L \; r)^n$\\
		\end{tabular}
\end{center}
This is also called the ``language interpretation'' of regular expressions.
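In an implementation, this grammar corresponds to a small algebraic datatype.
A minimal Scala sketch (the constructor names are our own choice) is:
\begin{verbatim}
// basic regular expressions as an algebraic datatype (a sketch)
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class STAR(r: Rexp) extends Rexp            // r*
\end{verbatim}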
The Brzozowski derivative w.r.t.~a character $c$ is an operation on a regular expression $r$;
it transforms $r$ into a new regular expression denoting
the strings in $L(r)$ that start with $c$, but with this head character $c$ removed.
Formally, we first define such a transformation on any set of strings, which
we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in \textit{StringSet}\}$
\end{center}
If the $\textit{StringSet}$ happens to have some structure, for example
if it is regular, then the derivative can be computed syntactically on
a regular expression denoting it.
% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).
The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression $r'$,
which is computed by the following recursive function:
\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
$d \backslash c$ & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The $\nullable$ function tests whether the empty string $""$
is in the language of $r$:
\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty set does not contain any string and
therefore not the empty string; the empty-string
regular expression contains the empty string
by definition; the character regular expression
denotes the singleton set containing that character only,
and therefore does not contain the empty string.
An alternative regular expression (or ``or'' expression)
is nullable if at least one of its children is nullable.
A sequence regular expression
requires both children to contain the empty string
in order to compose an empty string, while the Kleene star
naturally contains the empty string.
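Both functions can be transcribed directly into Scala. The following is a
sketch, using the \texttt{Rexp} datatype from the earlier sketch:
\begin{verbatim}
// nullable and the derivative, transcribed from the definitions above
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}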
We can relate the derivative of a regular expression to the semantic
derivative of its language: for every regular expression $r$ and character
$c$ we have
\begin{center}
$L \; (r \backslash c) = \Der \; c \; (L \; r)$.
\end{center}
\noindent
The derivative function, written $\backslash c$,
defines how a regular expression evolves into
a new one after the head character $c$ is
chopped off all the strings it contains.
The most involved cases are the sequence case
and the star case.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be chosen as the regular expression from which the head character
is chopped off.
The star case unwraps one iteration of the
regular expression and attaches the star regular expression
to the back of it again, to make sure there can be 0 or more iterations
following this unfolded iteration.
The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is
\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}
\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:
\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
and then define Brzozowski's regular-expression matching algorithm as:
\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
\noindent
Assuming that a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm presented graphically is as follows:
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}
\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
This is a beautiful and simple definition.
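In code, the whole matcher is only a few more lines on top of the earlier
sketch of \texttt{der} and \texttt{nullable}:
\begin{verbatim}
// string derivatives and Brzozowski's matcher (a sketch)
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))

// e.g. matcher(SEQ(STAR(CHAR('a')), CHAR('b')), "aaab") evaluates to true
\end{verbatim}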
If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow. For example, when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm.
Brzozowski was quick to observe that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, making the derivatives far from optimal.
He therefore introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and further decrease the size
of the intermediate regexes.
More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together into an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher that avoids cumbersome regular expressions.
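A sketch of this kind of simplification for the plain \texttt{Rexp} datatype,
removing useless $\ZERO$s and $\ONE$s and merging syntactically equal
alternatives (the exact set of rules shown here is our own choice), could
look as follows:
\begin{verbatim}
// a simple-minded simplification and its composition with the derivative
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (_, ZERO) => ZERO
    case (ONE, s2) => s2
    case (s1, ONE) => s1
    case (s1, s2)  => SEQ(s1, s2)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2
    case (s1, ZERO) => s1
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case _ => r
}

def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}
\end{verbatim}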
If we want the size of derivatives in the algorithm to
stay even lower, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
deleting duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
the same size bound as that of the \emph{partial derivatives}.
Building derivatives and then simplifying them: so far so good. But what if we want to
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back again, for a moment, to the world
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
They first defined the datatypes for storing the
lexing information called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:
\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$ & $\ZERO$\\
& $\mid$ & $\ONE$ \\
& $\mid$ & $c$ \\
& $\mid$ & $r_1 \cdot r_2$\\
& $\mid$ & $r_1 + r_2$ \\
\\
& $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$ & \\
& & $\Empty$ \\
& $\mid$ & $\Char(c)$ \\
& $\mid$ & $\Seq\,v_1\, v_2$\\
& $\mid$ & $\Left(v)$ \\
& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}
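A Scala sketch of the value datatype, mirroring the table above (the
constructor names are our own; \texttt{Left} and \texttt{Right} here shadow
Scala's built-in \texttt{Either} constructors, which is harmless in this
sketch):
\begin{verbatim}
// values recording how a regular expression matched a string (a sketch)
abstract class Val
case object Empty extends Val                   // for ONE
case class Chr(c: Char) extends Val             // for a single character
case class Sequ(v1: Val, v2: Val) extends Val   // for a sequence
case class Left(v: Val) extends Val             // left branch of an alternative
case class Right(v: Val) extends Val            // right branch of an alternative
case class Stars(vs: List[Val]) extends Val     // the iterations of a star
\end{verbatim}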
\noindent
Building on top of Sulzmann and Lu's attempt to formalize the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally specified rules for what a POSIX match is,
they proved in Isabelle/HOL that the algorithm gives correct results.
But having a correct result is still not enough:
we also want at least some degree of $\mathbf{efficiency}$.
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and so on.
Even when a regular expression matches a particular string, there can
still be more than one value witnessing this match.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following formula:
\begin{center}
$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed form formula can be calculated to be
\begin{equation}
C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.
A lexer aimed at computing all possible values has an exponential
worst-case runtime. Therefore it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
match the leftmost regular expression when there is a choice,
and always match a sub-part as much as possible before proceeding
to the next token. For example, the above example has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from a lexing algorithm is such a POSIX match
encoded as a value.
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:
\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}
\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$ & $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}
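A direct Scala transcription of $\mkeps$ (using the \texttt{Rexp} and
\texttt{Val} datatypes and the \texttt{nullable} function from the earlier
sketches):
\begin{verbatim}
// mkeps: a POSIX value for how a nullable regex matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}
\end{verbatim}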
\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX value is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\mkeps(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses basically do one thing---identify the ``holes'' in the
value into which the character should be injected back.
For instance, consider the last clause, which injects back into a value
corresponding to a star: we know this value must be a sequence value,
and that its first component corresponds to the child regex of the star
with its first character chopped off---an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations collected before. So we inject the character
back into the first value and form a new value, with this new iteration
added to the front of the previous list of iterations, all under the $\Stars$
constructor.
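The injection function and the resulting two-phase lexer can be sketched in
Scala as follows (again building on the earlier sketches; the name
\texttt{lex} is our own):
\begin{verbatim}
// injection: put character c back into the value built for the derivative
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty)                 => Chr(c)
  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
}

// the two phases put together: derivatives forward, injections backward
def lex(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil     => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lex(der(c, r), cs).map(v => inj(r, c, v))
}
\end{verbatim}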
We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well---by definition they reflect
the size of the regular expressions.
One can introduce simplification on regexes and values, but one has to
be careful not to break correctness, as the injection
function relies heavily on the structure of the regexes and values
matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications
during the injection phase.
One can prove that the POSIX value of how
regular expressions match strings will not be affected---although this is much harder
to establish.
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with
%automaton, and did not try lexing using derivatives.
We want to get rid of the complex and fragile rectification of values.
Can we avoid creating those intermediate values $v_1,\ldots, v_n$,
and obtain the lexing information
while doing derivatives in one pass, without a second injection phase?
At the same time, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?
Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions, called \emph{bitcodes}.
\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:
\begin{center}
$b ::= 1 \mid 0 \qquad
bs ::= [] \mid b::bs
$
\end{center}
\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:
\begin{center}
\begin{tabular}{lcl}
$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
$\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
$\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
$\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
code(\Stars\,vs)$
\end{tabular}
\end{center}
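Using plain integers $0$ and $1$ for bits, the coding function can be
sketched in Scala as:
\begin{verbatim}
// bitcodes and the (lossy) coding function from values to bitcodes
type Bits = List[Int]

def code(v: Val): Bits = v match {
  case Empty           => Nil
  case Chr(_)          => Nil
  case Left(v1)        => 0 :: code(v1)
  case Right(v1)       => 1 :: code(v1)
  case Sequ(v1, v2)    => code(v1) ::: code(v2)
  case Stars(Nil)      => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}
\end{verbatim}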
\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
$\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
(\Left\,v, bs_1)$\\
$\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
(\Right\,v, bs_1)$\\
$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
$\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
$\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
$\textit{decode}\,bs\,r$ & $\dn$ &
$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
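A Scala sketch of the decoding function above; \texttt{decodeAux} returns the
decoded value together with the left-over bits, and \texttt{decode} checks
that no bits remain (the error handling for ill-formed bitcodes is our own
choice):
\begin{verbatim}
// decoding a bitcode back into a value, guided by the regular expression
def decodeAux(bs: Bits, r: Rexp): (Val, Bits) = (r, bs) match {
  case (ONE, _)     => (Empty, bs)
  case (CHAR(c), _) => (Chr(c), bs)
  case (ALT(r1, _), 0 :: rest) =>
    val (v, bs1) = decodeAux(rest, r1)
    (Left(v), bs1)
  case (ALT(_, r2), 1 :: rest) =>
    val (v, bs1) = decodeAux(rest, r2)
    (Right(v), bs1)
  case (SEQ(r1, r2), _) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (STAR(_), 0 :: rest) => (Stars(Nil), rest)
  case (STAR(r1), 1 :: rest) =>
    val (v, bs1) = decodeAux(rest, r1)
    decodeAux(bs1, STAR(r1)) match {
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
      case _                => sys.error("ill-formed bitcode")
    }
}

def decode(bs: Bits, r: Rexp): Option[Val] = decodeAux(bs, r) match {
  case (v, Nil) => Some(v)
  case _        => None
}
\end{verbatim}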
Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}
\begin{center}
\begin{tabular}{lcl}
$\textit{a}$ & $::=$ & $\ZERO$\\
& $\mid$ & $_{bs}\ONE$\\
& $\mid$ & $_{bs}{\bf c}$\\
& $\mid$ & $_{bs}\sum\,as$\\
& $\mid$ & $_{bs}a_1\cdot a_2$\\
& $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})
\noindent
where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.
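In Scala, annotated regular expressions can be sketched as the following
datatype (reusing the \texttt{Bits} type from before; the constructor names
are our own):
\begin{verbatim}
// annotated regular expressions: every node carries a bitcode bs
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp
\end{verbatim}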
To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:
%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
$(r_1 + r_2)^\uparrow$ & $\dn$ &
$_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
\textit{fuse}\,[1]\,r_2^\uparrow]$\\
$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
$(r^*)^\uparrow$ & $\dn$ &
$_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}
\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:
\begin{center}
\begin{tabular}{lcl}
$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
$_{bs @ bs'}\ONE$\\
$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
$_{bs@bs'}{\bf c}$\\
$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
$_{bs@bs'}\sum\textit{as}$\\
$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
$_{bs@bs'}a_1 \cdot a_2$\\
$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
$_{bs @ bs'}a^*$
\end{tabular}
\end{center}
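Both definitions translate directly into Scala; a sketch using the datatypes
from the earlier sketches:
\begin{verbatim}
// fuse prepends a bitcode to the top-most bits of an annotated regex
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ::: bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as)    => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1)    => ASTAR(bs ::: bs1, a1)
}

// internalisation: lift a plain regex into an annotated one
def internalise(r: Rexp): ARexp = r match {
  case ZERO        => AZERO
  case ONE         => AONE(Nil)
  case CHAR(c)     => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(0), internalise(r1)),
                                      fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1)    => ASTAR(Nil, internalise(r1))
}
\end{verbatim}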
\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expression. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:
\iffalse
%\begin{definition}{bder}
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
$(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
$(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
$(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
$\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
& &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
& &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
$(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
$\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
(\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
$(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
$(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
_{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
$(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
$_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
$(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
& &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
& &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
$(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
$_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
(_{bs}\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}
\fi
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
_{bs}\ONE\;\textit{else}\;\ZERO$\\
$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
$_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a_1$\\
& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}((\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
       (_{[]}a^*))$
\end{tabular}
\end{center}
%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The sequence clause
is also more subtle---it distinguishes whether $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions; we therefore omit the
definition). Assuming that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
the right branch of the alternative, which is $\textit{fuse} \; (\bmkeps \; a_1)\, (a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it to it. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top level of $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
$\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
$\textit{if}\;\textit{bnullable}\,a$\\
& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
$bs \,@\, [0]$
\end{tabular}
\end{center}
%\end{definition}
\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
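The bitcoded functions described above can be collected into the following
Scala sketch (\texttt{fuse}, \texttt{internalise} and \texttt{decode} are
from the earlier sketches):
\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALTS(_, as)    => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _)     => true
}

def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs)            => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ::: bmkeps(a1) else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2)    => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _)        => bs ::: List(0)
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO            => AZERO
  case AONE(_)          => AZERO
  case ACHAR(bs, d)     => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as)    => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1)    =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}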
Remember that one of the important reasons we introduced bitcodes
is that they can make simplification more structured, and therefore
easier to prove correct.
\subsection*{Our Simplification Rules}
In this section we introduce aggressive (in terms of size) simplification rules
on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as we have
generated test data showing
that a tight size bound can be achieved. Obviously we can only
partially cover the search space, as there are infinitely many regular
expressions and strings.
One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on.
%Is it $ALTS$ or $ALTS$?}\\
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
&&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\
$\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
\noindent
The simplification does a pattern match on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its children regular expressions
recursively, and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and reduce as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy and removes all later
duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
$\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}
\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
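A rough Scala transcription of $\textit{flatten}$ and of the simplification
function is given below; \texttt{distinct} is the standard method on Scala
lists, which keeps only the first occurrence of each element. (In this
sketch the bits of a left component that simplifies to a bitcoded $\ONE$ are
fused together with $bs$, and the symmetric clause for a $\ONE$ on the right
is omitted.)
\begin{verbatim}
// open up nested alternatives and delete ZEROs
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil                    => Nil
  case AZERO :: rest          => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case a :: rest              => a :: flatten(rest)
}

// simplification of annotated regular expressions
def bsimp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (bsimp(a1), bsimp(a2)) match {
    case (AZERO, _)      => AZERO
    case (_, AZERO)      => AZERO
    case (AONE(bs1), s2) => fuse(bs ::: bs1, s2)
    case (s1, s2)        => ASEQ(bs, s1, s2)
  }
  case AALTS(bs, as) => flatten(as.map(bsimp)).distinct match {
    case Nil       => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1       => AALTS(bs, as1)
  }
  case _ => a
}
\end{verbatim}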
Having defined the $\simp$ function,
we can use the previous notation of natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:%\comment{simp in the [] case?}
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
\noindent
to obtain an optimised version of the algorithm:
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
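In code this amounts to interleaving \texttt{bsimp} with each derivative
step (a sketch, reusing the functions from the earlier sketches):
\begin{verbatim}
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil     => a
  case c :: cs => bdersSimp(cs, bsimp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}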
\noindent
This algorithm keeps the regular expression size small: for example,
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
are reduced to just 6, and the size stays constant no matter how long the
input string is.
Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function. In our own earlier work we provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.
The second difficulty is that Brzozowski's derivatives can
grow to arbitrarily big sizes. For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}
\noindent where after around 35 steps we run out of memory on a
typical computer (we shall define shortly the precise details of our
regular expressions and the derivative operation). Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in our
earlier work to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bitcoded
regular expressions. In this version, POSIX values are
represented as bitsequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bitsequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bitcoded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this paper):
\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method [..] in combination with the simplification steps [..]
yields POSIX parse trees. We have tested this claim
extensively [..] but yet
have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
%----------------------------------------------------------------------------------------
%----------------------------------------------------------------------------------------
%----------------------------------------------------------------------------------------
%----------------------------------------------------------------------------------------