--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Fri May 26 23:42:15 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex Sun Jun 18 17:54:52 2023 +0100
@@ -1235,6 +1235,12 @@
is function $\erase$'s cases.
\end{proof}
\noindent
+Intuitively the lemma can be understood as all lexical value
+information being preserved and recoverable throughout each lexing step.
+We shall see in the next chapter that this no longer
+holds with simplifications which takes away certain sub-expressions
+corresponding to non-POSIX lexical values.
+
Before we move on to the next
helper function, we rewrite $\blexer$ in
the following way which makes later proofs more convenient:
@@ -1340,7 +1346,7 @@
\noindent
With this pivotal lemma we can now link $\flex$ and $\blexer$
and finally give the correctness of $\blexer$--it outputs the same result as $\lexer$:
-\begin{theorem}
+\begin{theorem}\label{blexerCorrect}
$\blexer\; r \; s = \lexer \; r \; s$
\end{theorem}
\begin{proof}
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Fri May 26 23:42:15 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex Sun Jun 18 17:54:52 2023 +0100
@@ -8,15 +8,121 @@
%simplifications and therefore introduce our version of the bitcoded algorithm and
%its correctness proof in
%Chapter 3\ref{Chapter3}.
+\section{Overview}
+
+This chapter
+is the point from which novel contributions of this PhD project are introduced
+in detail,
+and previous
+chapters are essential background work for setting the scene of the formal proof we
+are about to describe.
+In particular, the correctness theorem
+of the un-optimised bit-coded lexer $\blexer$ in
+chapter \ref{Bitcoded1} formalised by Ausaf et al.
+relies on lemma \ref{retrieveStepwise} that says
+any value can be retrieved in a stepwise manner:
+\begin{center}
+ $\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$
+\end{center}
+This no longer holds once we introduce simplifications.
+To control the size of regular expressions during derivatives,
+one has to eliminate redundant sub-expression with some
+procedure we call $\textit{simp}$,
+and $\textit{simp}$ is defined as
+:
+Having defined the $\textit{bsimp}$ function,
+we add it as a phase after a derivative is taken.
+\begin{center}
+ \begin{tabular}{lcl}
+ $r \backslash_{bsimp} c$ & $\dn$ & $\textit{bsimp}(r \backslash c)$
+ \end{tabular}
+\end{center}
+%Following previous notations
+%when extending from derivatives w.r.t.~character to derivative
+%w.r.t.~string, we define the derivative that nests simplifications
+%with derivatives:%\comment{simp in the [] case?}
+We extend this from characters to strings:
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\
+$r \backslash_{bsimps} [\,] $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+
+\noindent
+The lexer that extracts bitcodes from the
+derivatives with simplifications from our $\simp$ function
+is called $\blexersimp$:
+
+\begin{center}
+
+\begin{center}
+ \begin{tabular}{lcl}
+ $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
+ \end{tabular}
+\end{center}
+\begin{tabular}{lcl}
+ $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+ $\textit{let}\;a = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\
+ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+\noindent
+the redundant sub-expressions after each derivative operation
+allows the exact structure of each intermediate result to be preserved,
+so that pairs of inhabitation relations in the form $\vdash v : r_{c} $ and
+$\vdash v^{c} : r $ hold (if we allow the abbreviation $r_{c} \dn r\backslash c$
+and $v^{c} \dn \inj \;r \; c \; v$).
+
+
+Define the
+But $\blexersimp$ introduces simplification after the derivative
+to reduce redundancy,
+yielding $r \backslash c$
+This allows
+
+The proof details are necessary materials for this thesis
+because it provides necessary context to explain why we need a
+new framework for the proof of $\blexersimp$, which involves
+simplifications that cause structural changes to the regular expression.
+a new formal proof of the correctness of $\blexersimp$, where the
+proof of $\blexer$
+is not applicatble in the sense that we cannot straightforwardly extend the
+proof of theorem \ref{blexerCorrect} because lemma \ref{flex_retrieve} does
+not hold anymore.
+This is because the structural induction on the stepwise correctness
+of $\inj$ breaks due to each pair of $r_i$ and $v_i$ described
+in chapter \ref{Inj} and \ref{Bitcoded1} no longer correspond to
+each other.
In this chapter we introduce simplifications
for annotated regular expressions that can be applied to
each intermediate derivative result. This allows
us to make $\blexer$ much more efficient.
Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions,
but their simplification functions could have been more efficient and in some cases needed fixing.
+
+
+
+
+From this chapter we start with the main contribution of this thesis, which
+
+o
+In particular, the $\blexer$ proof relies on a lockstep POSIX
+correspondence between the lexical value and the
+regular expression in each derivative and injection.
+
+
+which is essential for getting an understanding this thesis
+in chapter \ref{Bitcoded1}, which is necessary for understanding why
+the proof
+
+In this chapter,
+
%We contrast our simplification function
%with Sulzmann and Lu's, indicating the simplicity of our algorithm.
%This is another case for the usefulness
@@ -28,6 +134,10 @@
%$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system.
%
\section{Simplifications by Sulzmann and Lu}
+The algorithms $\lexer$ and $\blexer$ work beautifully as functional
+programs, but not as practical code. One main reason for the slowness is due
+to the size of intermediate representations--the derivative regular expressions
+tend to grow unbounded if the matching involved a large number of possible matches.
Consider the derivatives of the following example $(a^*a^*)^*$:
%and $(a^* + (aa)^*)^*$:
\begin{center}
@@ -438,7 +548,7 @@
we add it as a phase after a derivative is taken.
\begin{center}
\begin{tabular}{lcl}
- $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$
+ $r \backslash_{bsimp} c$ & $\dn$ & $\textit{bsimp}(r \backslash c)$
\end{tabular}
\end{center}
%Following previous notations
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex Fri May 26 23:42:15 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Sun Jun 18 17:54:52 2023 +0100
@@ -23,6 +23,1046 @@
instead of $f(x,\;y)$). This is mainly
to make the text visually more concise.}.
+
+\section{Technical Overview}
+
+Consider for example the regular expression $(a^*)^*\,b$ and
+strings of the form $aa..a$. These strings cannot be matched by this regular
+expression: Obviously the expected $b$ in the last
+position is missing. One would assume that modern regular expression
+matching engines can find this out very quickly. Surprisingly, if one tries
+this example in JavaScript, Python or Java 8, even with small strings,
+say of lenght of around 30 $a$'s,
+the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
+The algorithms clearly show exponential behaviour, and as can be seen
+is triggered by some relatively simple regular expressions.
+Java 9 and newer
+versions improve this behaviour somewhat, but are still slow compared
+with the approach we are going to use in this thesis.
+
+
+
+This superlinear blowup in regular expression engines
+has caused grief in ``real life'' where it is
+given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
+For example, on 20 July 2016 one evil
+regular expression brought the webpage
+\href{http://stackexchange.com}{Stack Exchange} to its
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
+In this instance, a regular expression intended to just trim white
+spaces from the beginning and the end of a line actually consumed
+massive amounts of CPU resources---causing the web servers to grind to a
+halt. In this example, the time needed to process
+the string was
+$O(n^2)$ with respect to the string length $n$. This
+quadratic overhead was enough for the homepage of Stack Exchange to
+respond so slowly that the load balancer assumed a $\mathit{DoS}$
+attack and therefore stopped the servers from responding to any
+requests. This made the whole site become unavailable.
+
+\begin{figure}[p]
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={JavaScript},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Python},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
+\end{axis}
+\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Java 8},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Dart},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\end{axis}
+\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Swift},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
+\end{axis}
+\end{tikzpicture}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=true,
+ %xtick={0,5000,...,40000},
+ %xmax=40000,
+ %ymax=35,
+ restrict x to domain*=0:40000,
+ restrict y to domain*=0:35,
+ %ytick={0,5,...,30},
+ %scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+ legend entries={Java9+},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{2}{c}{Graphs}
+\end{tabular}
+\end{center}
+\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
+ of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
+ The reason for their superlinear behaviour is that they do a depth-first-search
+ using NFAs.
+ If the string does not match, the regular expression matching
+ engine starts to explore all possibilities.
+}\label{fig:aStarStarb}
+\end{figure}\afterpage{\clearpage}
+
+A more recent example is a global outage of all Cloudflare servers on 2 July
+2019. A poorly written regular expression exhibited catastrophic backtracking
+and exhausted CPUs that serve HTTP traffic. Although the outage
+had several causes, at the heart was a regular expression that
+was used to monitor network
+traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
+These problems with regular expressions
+are not isolated events that happen
+very rarely,
+%but actually widespread.
+%They occur so often that they have a
+but they occur actually often enough that they have a
+name: Regular-Expression-Denial-Of-Service (ReDoS)
+attacks.
+Davis et al. \cite{Davis18} detected more
+than 1000 evil regular expressions
+in Node.js, Python core libraries, npm and pypi.
+They therefore concluded that evil regular expressions
+are a real problem rather than just "a parlour trick".
+
+The work in this thesis aims to address this issue
+with the help of formal proofs.
+We describe a lexing algorithm based
+on Brzozowski derivatives with verified correctness
+and a finiteness property for the size of derivatives
+(which are all done in Isabelle/HOL).
+Such properties %guarantee the absence of
+are an important step in preventing
+catastrophic backtracking once and for all.
+We will give more details in the next sections
+on (i) why the slow cases in graph \ref{fig:aStarStarb}
+can occur in traditional regular expression engines
+and (ii) why we choose our
+approach based on Brzozowski derivatives and formal proofs.
+
+
+\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
+Regular expressions and regular expression matchers
+have clearly been studied for many, many years.
+Theoretical results in automata theory state
+that basic regular expression matching should be linear
+w.r.t the input.
+This assumes that the regular expression
+$r$ was pre-processed and turned into a
+deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
+By basic we mean textbook definitions such as the one
+below, involving only regular expressions for characters, alternatives,
+sequences, and Kleene stars:
+\[
+ r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
+\]
+Modern regular expression matchers used by programmers,
+however,
+support much richer constructs, such as bounded repetitions,
+negations,
+and back-references.
+To differentiate, we use the word \emph{regex} to refer
+to those expressions with richer constructs while reserving the
+term \emph{regular expression}
+for the more traditional meaning in formal languages theory.
+We follow this convention
+in this thesis.
+In the future, we aim to support all the popular features of regexes,
+but for this work we mainly look at basic regular expressions
+and bounded repetitions.
+
+
+
+%Most modern regex libraries
+%the so-called PCRE standard (Peral Compatible Regular Expressions)
+%has the back-references
+Regexes come with a number of constructs
+that make it more convenient for
+programmers to write regular expressions.
+Depending on the types of constructs
+the task of matching and lexing with them
+will have different levels of complexity.
+Some of those constructs are syntactic sugars that are
+simply short hand notations
+that save the programmers a few keystrokes.
+These will not cause problems for regex libraries.
+For example the
+non-binary alternative involving three or more choices just means:
+\[
+ (a | b | c) \stackrel{means}{=} ((a + b)+ c)
+\]
+Similarly, the range operator
+%used to express the alternative
+%of all characters between its operands,
+is just a concise way
+of expressing an alternative of consecutive characters:
+\[
+ [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
+\]
+for an alternative. The
+wildcard character '$.$' is used to refer to any single character,
+\[
+ . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
+\]
+except the newline.
+
+\subsection{Bounded Repetitions}
+More interesting are bounded repetitions, which can
+make the regular expressions much
+more compact.
+Normally there are four kinds of bounded repetitions:
+$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
+(where $n$ and $m$ are constant natural numbers).
+Like the star regular expressions, the set of strings or language
+a bounded regular expression can match
+is defined using the power operation on sets:
+\begin{center}
+ \begin{tabular}{lcl}
+ $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
+ $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
+ $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
+ $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
+ \end{tabular}
+\end{center}
+The attraction of bounded repetitions is that they can be
+used to avoid a size blow up: for example $r^{\{n\}}$
+is a shorthand for
+the much longer regular expression:
+\[
+ \underbrace{r\ldots r}_\text{n copies of r}.
+\]
+%Therefore, a naive algorithm that simply unfolds
+%them into their desugared forms
+%will suffer from at least an exponential runtime increase.
+
+
+The problem with matching
+such bounded repetitions
+is that tools based on the classic notion of
+automata need to expand $r^{\{n\}}$ into $n$ connected
+copies of the automaton for $r$. This leads to very inefficient matching
+algorithms or algorithms that consume large amounts of memory.
+Implementations using $\DFA$s will
+in such situations
+either become excruciatingly slow
+(for example Verbatim++ \cite{Verbatimpp}) or run
+out of memory (for example $\mathit{LEX}$ and
+$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
+in C and JAVA that generate $\mathit{DFA}$-based
+lexers. The user provides a set of regular expressions
+and configurations, and then
+gets an output program encoding a minimized $\mathit{DFA}$
+that can be compiled and run.
+When given the above countdown regular expression,
+a small $n$ (say 20) would result in a program representing a
+DFA
+with millions of states.}) for large counters.
+A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$
+where the minimal DFA requires at least $2^{n+1}$ states.
+For example, when $n$ is equal to 2,
+the corresponding $\mathit{NFA}$ looks like:
+\vspace{6mm}
+\begin{center}
+\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
+ \node[state,initial] (q_0) {$q_0$};
+ \node[state, red] (q_1) [right=of q_0] {$q_1$};
+ \node[state, red] (q_2) [right=of q_1] {$q_2$};
+ \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
+ \path[->]
+ (q_0) edge node {a} (q_1)
+ edge [loop below] node {a,b} ()
+ (q_1) edge node {a,b} (q_2)
+ (q_2) edge node {a,b} (q_3);
+\end{tikzpicture}
+\end{center}
+and when turned into a DFA by the subset construction
+requires at least $2^3$ states.\footnote{The
+red states are "countdown states" which count down
+the number of characters needed in addition to the current
+string to make a successful match.
+For example, state $q_1$ indicates a match that has
+gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
+and just consumed the "delimiter" $a$ in the middle, and
+needs to match 2 more iterations of $(a|b)$ to complete.
+State $q_2$ on the other hand, can be viewed as a state
+after $q_1$ has consumed 1 character, and just waits
+for 1 more character to complete.
+The state $q_3$ is the last (accepting) state, requiring 0
+more characters.
+Depending on the suffix of the
+input string up to the current read location,
+the states $q_1$ and $q_2$, $q_3$
+may or may
+not be active.
+A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
+contain at least $2^3$ non-equivalent states that cannot be merged,
+because the subset construction during determinisation will generate
+all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
+Generalizing this to regular expressions with larger
+bounded repetitions number, we have that
+regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
+would require at least $2^{n+1}$ states, if $r$ itself contains
+more than 1 string.
+This is to represent all different
+scenarios in which "countdown" states are active.}
+
+
+Bounded repetitions are important because they
+tend to occur frequently in practical use,
+for example in the regex library RegExLib, in
+the rules library of Snort \cite{Snort1999}\footnote{
+Snort is a network intrusion detection (NID) tool
+for monitoring network traffic.
+The network security community curates a list
+of malicious patterns written as regexes,
+which is used by Snort's detection engine
+to match against network traffic for any hostile
+activities such as buffer overflow attacks.},
+as well as in XML Schema definitions (XSDs).
+According to Bj\"{o}rklund et al \cite{xml2015},
+more than half of the
+XSDs they found on the Maven.org central repository
+have bounded regular expressions in them.
+Often the counters are quite large, with the largest being
+close to ten million.
+A smaller sample XSD they gave
+is:
+\lstset{
+ basicstyle=\fontsize{8.5}{9}\ttfamily,
+ language=XML,
+ morekeywords={encoding,
+ xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
+}
+\begin{lstlisting}
+<sequence minOccurs="0" maxOccurs="65535">
+ <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{lstlisting}
+This can be seen as the regex
+$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
+regular expressions
+satisfying certain constraints (such as
+satisfying the floating point number format).
+It is therefore quite unsatisfying that
+some regular expressions matching libraries
+impose adhoc limits
+for bounded regular expressions:
+For example, in the regular expression matching library in the Go
+language the regular expression $a^{1001}$ is not permitted, because no counter
+can be above 1000, and in the built-in Rust regular expression library
+expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
+for being too big.
+As Becchi and Crawley \cite{Becchi08} have pointed out,
+the reason for these restrictions
+is that they simulate a non-deterministic finite
+automata (NFA) with a breadth-first search.
+This way the number of active states could
+be equal to the counter number.
+When the counters are large,
+the memory requirement could become
+infeasible, and a regex engine
+like in Go will reject this pattern straight away.
+\begin{figure}[H]
+\begin{center}
+\begin{tikzpicture} [node distance = 2cm, on grid, auto]
+
+ \node (q0) [state, initial] {$0$};
+ \node (q1) [state, right = of q0] {$1$};
+ %\node (q2) [state, right = of q1] {$2$};
+ \node (qdots) [right = of q1] {$\ldots$};
+ \node (qn) [state, right = of qdots] {$n$};
+ \node (qn1) [state, right = of qn] {$n+1$};
+ \node (qn2) [state, right = of qn1] {$n+2$};
+ \node (qn3) [state, accepting, right = of qn2] {$n+3$};
+
+\path [-stealth, thick]
+ (q0) edge [loop above] node {a} ()
+ (q0) edge node {a} (q1)
+ %(q1) edge node {.} (q2)
+ (q1) edge node {.} (qdots)
+ (qdots) edge node {.} (qn)
+ (qn) edge node {.} (qn1)
+ (qn1) edge node {b} (qn2)
+ (qn2) edge node {$c$} (qn3);
+\end{tikzpicture}
+%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
+% \node[state,initial] (q_0) {$0$};
+% \node[state, ] (q_1) [right=of q_0] {$1$};
+% \node[state, ] (q_2) [right=of q_1] {$2$};
+% \node[state,
+% \node[state, accepting, ](q_3) [right=of q_2] {$3$};
+% \path[->]
+% (q_0) edge node {a} (q_1)
+% edge [loop below] node {a,b} ()
+% (q_1) edge node {a,b} (q_2)
+% (q_2) edge node {a,b} (q_3);
+%\end{tikzpicture}
+\end{center}
+\caption{The example given by Becchi and Crawley
+ that NFA simulation can consume large
+ amounts of memory: $.^*a.^{\{n\}}bc$ matching
+ strings of the form $aaa\ldots aaaabc$.
+ When traversing in a breadth-first manner,
+all states from 0 till $n+1$ will become active.}
+\end{figure}
+%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
+%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
+%in terms of input string length.
+%TODO:try out these lexers
+These problems can of course be solved in matching algorithms where
+automata go beyond the classic notion and for instance include explicit
+counters \cite{Turo_ov__2020}.
+These solutions can be quite efficient,
+with the ability to process
+gigabits of strings input per second
+even with large counters \cite{Becchi08}.
+These practical solutions do not come with
+formal guarantees, and as pointed out by
+Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
+%But formal reasoning about these automata especially in Isabelle
+%can be challenging
+%and un-intuitive.
+%Therefore, we take correctness and runtime claims made about these solutions
+%with a grain of salt.
+
+In the work reported in \cite{FoSSaCS2023} and here,
+we add better support using derivatives
+for bounded regular expression $r^{\{n\}}$.
+Our results
+extend straightforwardly to
+repetitions with intervals such as
+$r^{\{n\ldots m\}}$.
+The merit of Brzozowski derivatives (more on this later)
+on this problem is that
+it can be naturally extended to support bounded repetitions.
+Moreover these extensions are still made up of only small
+inductive datatypes and recursive functions,
+making it handy to deal with them in theorem provers.
+%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
+%straightforwardly extended to deal with bounded regular expressions
+%and moreover the resulting code still consists of only simple
+%recursive functions and inductive datatypes.
+Finally, bounded regular expressions do not destroy our finite
+boundedness property, which we shall prove later on.
+
+
+
+
+
+\subsection{Back-References}
+The other way to simulate an $\mathit{NFA}$ for matching is choosing
+a single transition each time, keeping all the other options in
+a queue or stack, and backtracking if that choice eventually
+fails.
+This method, often called a "depth-first-search",
+is efficient in many cases, but could end up
+with exponential run time.
+The backtracking method is employed in regex libraries
+that support \emph{back-references}, for example
+in Java and Python.
+%\section{Back-references and The Terminology Regex}
+
+%When one constructs an $\NFA$ out of a regular expression
+%there is often very little to be done in the first phase, one simply
+%construct the $\NFA$ states based on the structure of the input regular expression.
+
+%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
+%one by keeping track of all active states after consuming
+%a character, and update that set of states iteratively.
+%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
+%for a path terminating
+%at an accepting state.
+
+
+
+Consider the following regular expression where the sequence
+operator is omitted for brevity:
+\begin{center}
+ $r_1r_2r_3r_4$
+\end{center}
+In this example,
+one could label sub-expressions of interest
+by parenthesizing them and giving
+them a number in the order in which their opening parentheses appear.
+One possible way of parenthesizing and labelling is:
+\begin{center}
+ $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
+\end{center}
+The sub-expressions
+$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
+by 1 to 4, and can be ``referred back'' by their respective numbers.
+%These sub-expressions are called "capturing groups".
+To do so, one uses the syntax $\backslash i$
+to denote that we want the sub-string
+of the input matched by the i-th
+sub-expression to appear again,
+exactly the same as it first appeared:
+\begin{center}
+$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
+\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
+\end{center}
+Once the sub-string $s_i$ for the sub-expression $r_i$
+has been fixed, there is no variability on what the back-reference
+label $\backslash i$ can be---it is tied to $s_i$.
+The overall string will look like $\ldots s_i \ldots s_i \ldots $
+%The backslash and number $i$ are the
+%so-called "back-references".
+%Let $e$ be an expression made of regular expressions
+%and back-references. $e$ contains the expression $e_i$
+%as its $i$-th capturing group.
+%The semantics of back-reference can be recursively
+%written as:
+%\begin{center}
+% \begin{tabular}{c}
+% $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
+% $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
+% \end{tabular}
+%\end{center}
+A concrete example
+for back-references is
+\begin{center}
+$(.^*)\backslash 1$,
+\end{center}
+which matches
+strings that can be split into two identical halves,
+for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
+Note that this is different from
+repeating the sub-expression verbatim like
+\begin{center}
+ $(.^*)(.^*)$,
+\end{center}
+which does not impose any restrictions on what strings the second
+sub-expression $.^*$
+might match.
+Another example for back-references is
+\begin{center}
+$(.)(.)\backslash 2\backslash 1$
+\end{center}
+which matches four-character palindromes
+like $abba$, $x??x$ and so on.
+
+Back-references are a regex construct
+that programmers find quite useful.
+According to Becchi and Crawley \cite{Becchi08},
+6\% of Snort rules (up until 2008) use them.
+The most common use of back-references
+is to express well-formed html files,
+where back-references are convenient for matching
+opening and closing tags like
+\begin{center}
+ $\langle html \rangle \ldots \langle / html \rangle$
+\end{center}
+A regex describing such a format
+is
+\begin{center}
+ $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
+\end{center}
+Despite being useful, the expressive power of regexes
+go beyond regular languages
+once back-references are included.
+In fact, they allow the regex construct to express
+languages that cannot be contained in context-free
+languages either.
+For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
+expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
+which cannot be expressed by
+context-free grammars \cite{campeanu2003formal}.
+Such a language is contained in the context-sensitive hierarchy
+of formal languages.
+Also solving the matching problem involving back-references
+is known to be NP-complete \parencite{alfred2014algorithms}.
+Regex libraries supporting back-references such as
+PCRE \cite{pcre} therefore have to
+revert to a depth-first search algorithm including backtracking.
+What is unexpected is that even in the cases
+not involving back-references, there is still
+a (non-negligible) chance they might backtrack super-linearly,
+as shown in the graphs in figure \ref{fig:aStarStarb}.
+
+Summing up, we can categorise existing
+practical regex libraries into two kinds:
+(i) The ones with linear
+time guarantees like Go and Rust. The downside with them is that
+they impose restrictions
+on the regular expressions (not allowing back-references,
+bounded repetitions cannot exceed an ad hoc limit etc.).
+And (ii) those
+that allow large bounded regular expressions and back-references
+at the expense of using backtracking algorithms.
+They can potentially ``grind to a halt''
+on some very simple cases, resulting
+ReDoS attacks if exposed to the internet.
+
+The problems with both approaches are the motivation for us
+to look again at the regular expression matching problem.
+Another motivation is that regular expression matching algorithms
+that follow the POSIX standard often contain errors and bugs
+as we shall explain next.
+
+%We would like to have regex engines that can
+%deal with the regular part (e.g.
+%bounded repetitions) of regexes more
+%efficiently.
+%Also we want to make sure that they do it correctly.
+%It turns out that such aim is not so easy to achieve.
+ %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
+% For example, the Rust regex engine claims to be linear,
+% but does not support lookarounds and back-references.
+% The GoLang regex library does not support over 1000 repetitions.
+% Java and Python both support back-references, but shows
+%catastrophic backtracking behaviours on inputs without back-references(
+%when the language is still regular).
+ %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
+ %TODO: verify the fact Rust does not allow 1000+ reps
+
+
+
+
+%The time cost of regex matching algorithms in general
+%involve two different phases, and different things can go differently wrong on
+%these phases.
+%$\DFA$s usually have problems in the first (construction) phase
+%, whereas $\NFA$s usually run into trouble
+%on the second phase.
+
+
+\section{Error-prone POSIX Implementations}
+Very often there are multiple ways of matching a string
+with a regular expression.
+In such cases the regular expressions matcher needs to
+disambiguate.
+The more widely used strategy is called POSIX,
+which roughly speaking always chooses the longest initial match.
+The POSIX strategy is widely adopted in many regular expression matchers
+because it is a natural strategy for lexers.
+However, many implementations (including the C libraries
+used by Linux and OS X distributions) contain bugs
+or do not meet the specification they claim to adhere to.
+Kuklewicz maintains a unit test repository which lists some
+problems with existing regular expression engines \cite{KuklewiczHaskell}.
+In some cases, they either fail to generate a
+result when there exists a match,
+or give results that are inconsistent with the POSIX standard.
+A concrete example is the regex:
+\begin{center}
+ $(aba + ab + a)^* \text{and the string} \; ababa$
+\end{center}
+The correct POSIX match for the above
+involves the entire string $ababa$,
+split into two Kleene star iterations, namely $[ab], [aba]$ at positions
+$[0, 2), [2, 5)$
+respectively.
+But feeding this example to the different engines
+listed at regex101 \footnote{
+ regex101 is an online regular expression matcher which
+ provides API for trying out regular expression
+ engines of multiple popular programming languages like
+Java, Python, Go, etc.} \parencite{regex101}.
+yields
+only two incomplete matches: $[aba]$ at $[0, 3)$
+and $a$ at $[4, 5)$.
+Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
+commented that most regex libraries are not
+correctly implementing the central POSIX
+rule, called the maximum munch rule.
+Grathwohl \parencite{grathwohl2014crash} wrote:
+\begin{quote}\it
+ ``The POSIX strategy is more complicated than the
+ greedy because of the dependence on information about
+ the length of matched strings in the various subexpressions.''
+\end{quote}
+%\noindent
+People have recognised that the implementation complexity of POSIX rules also come from
+the specification being not very precise.
+The developers of the regexp package of Go
+\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
+commented that
+\begin{quote}\it
+``
+The POSIX rule is computationally prohibitive
+and not even well-defined.
+``
+\end{quote}
+There are many informal summaries of this disambiguation
+strategy, which are often quite long and delicate.
+For example Kuklewicz \cite{KuklewiczHaskell}
+described the POSIX rule as (section 1, last paragraph):
+\begin{quote}
+ \begin{itemize}
+ \item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there
+earlier subpatterns have leftmost-longest priority over later subpatterns\\
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
+\item
+REs have right associative concatenation which can be changed with parenthesis\\
+\item
+parenthesized subexpressions return the match from their last usage\\
+\item
+text of component subexpressions must be contained in the text of the
+higher-level subexpressions\\
+\item
+if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+\item
+if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+\end{itemize}
+\end{quote}
+%The text above
+%is trying to capture something very precise,
+%and is crying out for formalising.
+Ribeiro and Du Bois \cite{RibeiroAgda2017} have
+formalised the notion of bit-coded regular expressions
+and proved their relations with simple regular expressions in
+the dependently-typed proof assistant Agda.
+They also proved the soundness and completeness of a matching algorithm
+based on the bit-coded regular expressions.
+Ausaf et al. \cite{AusafDyckhoffUrban2016}
+are the first to
+give a quite simple formalised POSIX
+specification in Isabelle/HOL, and also prove
+that their specification coincides with an earlier (unformalised)
+POSIX specification given by Okui and Suzuki \cite{Okui10}.
+They then formally proved the correctness of
+a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
+with regards to that specification.
+They also found that the informal POSIX
+specification by Sulzmann and Lu needs to be substantially
+revised in order for the correctness proof to go through.
+Their original specification and proof were unfixable
+according to Ausaf et al.
+
+
+In the next section, we will briefly
+introduce Brzozowski derivatives and Sulzmann
+and Lu's algorithm, which the main point of this thesis builds on.
+%We give a taste of what they
+%are like and why they are suitable for regular expression
+%matching and lexing.
+\section{Formal Specification of POSIX Matching
+and Brzozowski Derivatives}
+%Now we start with the central topic of the thesis: Brzozowski derivatives.
+Brzozowski \cite{Brzozowski1964} first introduced the
+concept of a \emph{derivative} of regular expression in 1964.
+The derivative of a regular expression $r$
+with respect to a character $c$, is written as $r \backslash c$.
+This operation tells us what $r$ transforms into
+if we ``chop'' off a particular character $c$
+from all strings in the language of $r$ (defined
+later as $L \; r$).
+%To give a flavour of Brzozowski derivatives, we present
+%two straightforward clauses from it:
+%\begin{center}
+% \begin{tabular}{lcl}
+% $d \backslash c$ & $\dn$ &
+% $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+%$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+% \end{tabular}
+%\end{center}
+%\noindent
+%The first clause says that for the regular expression
+%denoting a singleton set consisting of a single-character string $\{ d \}$,
+%we check the derivative character $c$ against $d$,
+%returning a set containing only the empty string $\{ [] \}$
+%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
+%The second clause states that to obtain the regular expression
+%representing all strings' head character $c$ being chopped off
+%from $r_1 + r_2$, one simply needs to recursively take derivative
+%of $r_1$ and $r_2$ and then put them together.
+Derivatives have the property
+that $s \in L \; (r\backslash c)$ if and only if
+$c::s \in L \; r$ where $::$ stands for list prepending.
+%This property can be used on regular expressions
+%matching and lexing--to test whether a string $s$ is in $L \; r$,
+%one simply takes derivatives of $r$ successively with
+%respect to the characters (in the correct order) in $s$,
+%and then test whether the empty string is in the last regular expression.
+With this property, derivatives can give a simple solution
+to the problem of matching a string $s$ with a regular
+expression $r$: if the derivative of $r$ w.r.t.\ (in
+succession) all the characters of the string matches the empty string,
+then $r$ matches $s$ (and {\em vice versa}).
+%This makes formally reasoning about these properties such
+%as correctness and complexity smooth and intuitive.
+There are several mechanised proofs of this property in various theorem
+provers,
+for example one by Owens and Slind \cite{Owens2008} in HOL4,
+another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
+yet another in Coq by Coquand and Siles \cite{Coquand2012}.
+
+In addition, one can extend derivatives to bounded repetitions
+relatively straightforwardly. For example, the derivative for
+this can be defined as:
+\begin{center}
+ \begin{tabular}{lcl}
+ $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot
+ r^{\{n-1\}} (\text{when} n > 0)$\\
+ \end{tabular}
+\end{center}
+\noindent
+Experimental results suggest that unlike DFA-based solutions
+for bounded regular expressions,
+derivatives can cope
+large counters
+quite well.
+
+There have also been
+extensions of derivatives to other regex constructs.
+For example, Owens et al include the derivatives
+for the \emph{NOT} regular expression, which is
+able to concisely express C-style comments of the form
+$/* \ldots */$ (see \cite{Owens2008}).
+Another extension for derivatives is
+regular expressions with look-aheads, done
+by Miyazaki and Minamide
+\cite{Takayuki2019}.
+%We therefore use Brzozowski derivatives on regular expressions
+%lexing
+
+
+
+Given the above definitions and properties of
+Brzozowski derivatives, one quickly realises their potential
+in generating a formally verified algorithm for lexing: the clauses and property
+can be easily expressed in a functional programming language
+or converted to theorem prover
+code, with great ease.
+Perhaps this is the reason why derivatives have sparked quite a bit of interest
+in the functional programming and theorem prover communities in the last
+fifteen or so years (
+\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
+\cite{Chen12} and \cite{Coquand2012}
+to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
+after they were first published by Brzozowski.
+
+
+However, there are two difficulties with derivative-based matchers:
+First, Brzozowski's original matcher only generates a yes/no answer
+for whether a regular expression matches a string or not. This is too
+little information in the context of lexing where separate tokens must
+be identified and also classified (for example as keywords
+or identifiers).
+Second, derivative-based matchers need to be more efficient in terms
+of the sizes of derivatives.
+Elegant and beautiful
+as many implementations are,
+they can be excruciatingly slow.
+For example, Sulzmann and Lu
+claim a linear running time of their proposed algorithm,
+but that was falsified by our experiments. The running time
+is actually $\Omega(2^n)$ in the worst case.
+A similar claim about a theoretical runtime of $O(n^2)$
+is made for the Verbatim \cite{Verbatim}
+%TODO: give references
+lexer, which calculates POSIX matches and is based on derivatives.
+They formalized the correctness of the lexer, but not their complexity result.
+In the performance evaluation section, they analyzed the run time
+of matching $a$ with the string
+\begin{center}
+ $\underbrace{a \ldots a}_{\text{n a's}}$.
+\end{center}
+\noindent
+They concluded that the algorithm is quadratic in terms of
+the length of the input string.
+When we tried out their extracted OCaml code with the example $(a+aa)^*$,
+the time it took to match a string of 40 $a$'s was approximately 5 minutes.
+
+
+\subsection{Sulzmann and Lu's Algorithm}
+Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
+problem with the yes/no answer
+by cleverly extending Brzozowski's matching
+algorithm. Their extended version generates additional information on
+\emph{how} a regular expression matches a string following the POSIX
+rules for regular expression matching. They achieve this by adding a
+second ``phase'' to Brzozowski's algorithm involving an injection
+function. This injection function in a sense undoes the ``damage''
+of the derivatives chopping off characters.
+In earlier work, Ausaf et al provided the formal
+specification of what POSIX matching means and proved in Isabelle/HOL
+the correctness
+of this extended algorithm accordingly
+\cite{AusafDyckhoffUrban2016}.
+
+The version of the algorithm proven correct
+suffers however heavily from a
+second difficulty, where derivatives can
+grow to arbitrarily big sizes.
+For example if we start with the
+regular expression $(a+aa)^*$ and take
+successive derivatives according to the character $a$, we end up with
+a sequence of ever-growing derivatives like
+
+\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
+\begin{center}
+\begin{tabular}{rll}
+$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
+& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
+\end{tabular}
+\end{center}
+
+\noindent where after around 35 steps we usually run out of memory on a
+typical computer. Clearly, the
+notation involving $\ZERO$s and $\ONE$s already suggests
+simplification rules that can be applied to regular regular
+expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
+\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
+r$. While such simple-minded simplifications have been proved in
+the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
+algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
+\emph{not} help with limiting the growth of the derivatives shown
+above: the growth is slowed, but the derivatives can still grow rather
+quickly beyond any finite bound.
+
+Therefore we want to look in this thesis at a second
+algorithm by Sulzmann and Lu where they
+overcame this ``growth problem'' \cite{Sulzmann2014}.
+In this version, POSIX values are
+represented as bit sequences and such sequences are incrementally generated
+when derivatives are calculated. The compact representation
+of bit sequences and regular expressions allows them to define a more
+``aggressive'' simplification method that keeps the size of the
+derivatives finite no matter what the length of the string is.
+They make some informal claims about the correctness and linear behaviour
+of this version, but do not provide any supporting proof arguments, not
+even ``pencil-and-paper'' arguments. They write about their bit-coded
+\emph{incremental parsing method} (that is the algorithm to be formalised
+in this dissertation)
+
+
+
+ \begin{quote}\it
+ ``Correctness Claim: We further claim that the incremental parsing
+ method [..] in combination with the simplification steps [..]
+ yields POSIX parse trees. We have tested this claim
+ extensively [..] but yet
+ have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
+\end{quote}
+Ausaf and Urban made some initial progress towards the
+full correctness proof but still had to leave out the optimisation
+Sulzmann and Lu proposed.
+Ausaf wrote \cite{Ausaf},
+ \begin{quote}\it
+``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
+\end{quote}
+This thesis implements the aggressive simplifications envisioned
+by Ausaf and Urban,
+together with a formal proof of the correctness of those simplifications.
+
+
+One of the most recent work in the context of lexing
+%with this issue
+is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
+This is relevant work for us and we will compare it later with
+our derivative-based matcher we are going to present.
+There is also some newer work called
+Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
+but deterministic finite automaton instead.
+We will also study this work in a later section.
+%An example that gives problem to automaton approaches would be
+%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
+%It requires at least $2^{n+1}$ states to represent
+%as a DFA.
+
+
\section{Basic Concepts}
Formal language theory usually starts with an alphabet
denoting a set of characters.
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri May 26 23:42:15 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Jun 18 17:54:52 2023 +0100
@@ -216,8 +216,8 @@
\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
\end{center}
Using this, regular expression matchers and lexers are able to extract
-for example the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
-Consequently, they are an indispensible component in text processing tools
+the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
+Consequently, they are an indispensible components in text processing tools
of software systems such as compilers, IDEs, and firewalls.
The study of regular expressions is ongoing due to an
@@ -303,23 +303,207 @@
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
-However, the author noted that lexers are rarely correct according to this standard.
-Being able to formally define and capture the idea of POSIX rules and matching/lexing algorithms and prove
-the correctness of such algorithms against the POSIX semantics definitions
-would be valuable.
+However, the author noted that various lexers that claim to be POSIX
+are rarely correct according to this standard.
+There are numerous occasions where programmers realised the subtlety and
+difficulty to implement correctly, one such quote from Go's regexp library author
+\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
+\begin{quote}\it
+``
+The POSIX rule is computationally prohibitive
+and not even well-defined.
+``
+\end{quote}
+Being able to formally define and capture the idea of
+POSIX rules and prove
+the correctness of regular expression matching/lexing
+algorithms against the POSIX semantics definitions
+is valuable.
+
-Formal proofs are machine checked programs
-such as the ones written in Isabelle/HOL, is a powerful means
-for computer scientists to be certain about correctness of their algorithms and correctness properties.
-This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions
-and proof rules that are known to be correct.
-The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such
-methods as their implementation and complexity analysis tend to be error-prone.
-
-There have been many interesting steps taken in the direction of formal proofs and regular expressions
-lexing and matching.
+Formal proofs are
+machine checked programs
+ %such as the ones written in Isabelle/HOL, is a powerful means
+for computer scientists to be certain
+about the correctness of their algorithms.
+This is done by
+recursively checking that every fact in a proof script
+is either an axiom or a fact that is derived from
+known axioms or verified facts.
+%The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such
+%methods as their implementation and complexity analysis tend to be error-prone.
+Formal proofs provides an unprecendented level of asssurance
+that an algorithm will perform as expected under all inputs.
+The software systems that help people interactively build and check
+such proofs are called theorem-provers or proof assitants.
+Many theorem-provers have been developed, such as Mizar,
+Isabelle/HOL, HOL-Light, HOL4,
+Coq, Agda, Idris, Lean and so on.
+Isabelle/HOL is a theorem prover with a simple type theory
+and powerful automated proof generators like sledgehammer.
+We chose to use Isabelle/HOL for its powerful automation
+and ease and simplicity in expressing regular expressions and
+regular languages.
+%Some of those employ
+%dependent types like Mizar, Coq, Agda, Lean and Idris.
+%Some support a constructivism approach, such as Coq.
+Formal proofs on regular expression matching and lexing
+complements the efforts in
+industry which tend to focus on overall speed
+with techniques like parallelization (FPGA paper), tackling
+the problem of catastrophic backtracking
+in an ad-hoc manner (cloudflare and stackexchange article).
+
+There have been many interesting steps in the theorem-proving community
+about formalising regular expressions and lexing.
+One flavour is to build from the regular expression an automaton, and determine
+acceptance in terms of the resulting
+state after executing the input string on that automaton.
+Automata formalisations are in general harder and more cumbersome to deal
+with for theorem provers than working directly on regular expressions.
+One such example is by Nipkow \cite{Nipkow1998}.
+%They
+%made everything recursive (for example the next state function),
+As a core idea, they
+used a list of booleans to name each state so that
+after composing sub-automata together, renaming the states to maintain
+the distinctness of each state is recursive and simple.
+The result was the obvious lemmas incorporating
+``a painful amount of detail'' in their formalisation.
+Sometimes the automata are represented as graphs.
+But graphs are not inductive datatypes.
+Having to set the induction principle on the number of nodes
+in a graph makes formal reasoning non-intuitive and convoluted,
+resulting in large formalisations \cite{Lammich2012}.
+When combining two graphs, one also needs to make sure that the nodes in
+both graphs are distinct, which almost always involve
+renaming of the nodes.
+A theorem-prover which provides dependent types such as Coq
+can alleviate the issue of representing graph nodes
+\cite{Doczkal2013}. There the representation of nodes is made
+easier by the use of $\textit{FinType}$.
+Another representation for automata are matrices.
+But the induction for them is not as straightforward either.
+There are some more clever representations, for example one by Paulson
+using hereditarily finite sets \cite{Paulson2015}.
+There the problem with combining graphs can be solved better.
+%but we believe that such clever tricks are not very obvious for
+%the John-average-Isabelle-user.
+
+The approach that operates directly on regular expressions circumvents the problem of
+conversion between a regular expression and an automaton, thereby avoiding representation
+problems altogether, despite that regular expressions may be seen as a variant of a
+non-deterministic finite automaton (ref Laurikari tagged NFA paper).
+To matching a string, a sequence of algebraic transformations called
+(Brzozowski) $\textit{derivatives}$ (ref Brzozowski) is carried out on that regular expression.
+Each derivative takes a character and a regular expression,
+and returns a new regular expression whose language is closely related to
+the original regular expression's language:
+strings prefixed with that input character will have their head removed
+and strings not prefixed
+with that character will be eliminated.
+After taking derivatives with respect to all the characters the string is
+exhausted. Then an algorithm checks whether the empty string is in that final
+regular expression's language.
+If so, a match exists and the string is in the language of the input regular expression.
+
+Again this process can be seen as the simulation of an NFA running on a string,
+but the recursive nature of the datatypes and functions involved makes
+derivatives a perfect object of study for theorem provers.
+That is why there has been numerous formalisations of regular expressions
+and Brzozowski derivatives in the functional programming and
+theorem proving community (a large list of refs to derivatives formalisation publications).
+Ribeiro and Du Bois \cite{RibeiroAgda2017} have
+formalised the notion of bit-coded regular expressions
+and proved their relations with simple regular expressions in
+the dependently-typed proof assistant Agda.
+They also proved the soundness and completeness of a matching algorithm
+based on the bit-coded regular expressions. Their algorithm is a decision procedure
+that gives a Yes/No answer, which does not produce
+lexical values.
+%X also formalised derivatives and regular expressions, producing "parse trees".
+%(Some person who's a big name in formal methods)
+
+
+The variant of the problem we are looking at centers around
+an algorithm (which we call $\blexer$) developed by Sulzmann and Lu \ref{Sulzmann2014}.
+The reason we chose to look at $\blexer$ and its simplifications
+is because it allows a lexical tree to be generated
+by some elegant and subtle procedure based on Brzozowski derivatives.
+The procedures are made of recursive functions and inductive datatypes just like derivatives,
+allowing intuitive and concise formal reasoning with theorem provers.
+Most importantly, $\blexer$ opens up a path to an optimized version
+of $\blexersimp$ possibilities to improve
+performance with simplifications that aggressively change the structure of regular expressions.
+While most derivative-based methods
+rely on structures to be maintained to allow induction to
+go through.
+For example, Egolf et al. \ref{Verbatim} have developed a verified lexer
+with derivatives, but as soon as they started introducing
+optimizations such as memoization, they reverted to constructing
+DFAs first.
+Edelmann \ref{Edelmann2020} explored similar optimizations in his
+work on verified LL(1) parsing, with additional enhancements with data structures like
+zippers.
+
+%Sulzmann and Lu have worked out an algorithm
+%that is especially suited for verification
+%which utilized the fact
+%that whenever ambiguity occurs one may duplicate a sub-expression and use
+%different copies to describe different matching choices.
+The idea behind the correctness of $\blexer$ is simple: during a derivative,
+multiple matches might be possible, where an alternative with multiple children
+each corresponding to a
+different match is created. In the end of
+a lexing process one always picks up the leftmost alternative, which is guarnateed
+to be a POSIX value.
+This is done by consistently keeping sub-regular expressions in an alternative
+with longer submatches
+to the left of other copies (
+Remember that POSIX values are roughly the values with the longest inital
+submatch).
+The idea behind the optimized version of $\blexer$, which we call $\blexersimp$,
+is that since we only take the leftmost copy, then all re-occurring copies can be
+eliminated without losing the POSIX property, and this can be done with
+children of alternatives at different levels by merging them together.
+Proving $\blexersimp$ requires a different
+proof strategy compared to that by Ausaf \cite{FahadThesis}.
+We invent a rewriting relation as an
+inductive predicate which captures
+a strong enough invariance that ensures correctness,
+which commutes with the derivative operation.
+This predicate allows a simple
+induction on the input string to go through.
+
+%This idea has been repeatedly used in different variants of lexing
+%algorithms in their paper, one of which involves bit-codes. The bit-coded
+%derivative-based algorithm even allows relatively aggressive
+%%simplification rules which cause
+%structural changes that render the induction used in the correctness
+%proofs unusable.
+%More details will be given in \ref{Bitcoded2} including the
+%new correctness proof which involves a new inductive predicate which allows
+%rule induction to go through.
+
+
+
+
+
+
+
+%first character is removed
+%state of the automaton after matching that character
+%where nodes are represented as
+%a sub-expression (for example tagged NFA
+%Working on regular expressions
+%Because of these problems with automata, we prefer regular expressions
+%and derivatives rather than an automata (or graph-based) approach which explicitly converts between
+%the regular expression and a different data structure.
+%
+%
+%The key idea
(ends)
@@ -345,1043 +529,6 @@
%TODO: get source for SNORT/BRO's regex matching engine/speed
-Consider for example the regular expression $(a^*)^*\,b$ and
-strings of the form $aa..a$. These strings cannot be matched by this regular
-expression: Obviously the expected $b$ in the last
-position is missing. One would assume that modern regular expression
-matching engines can find this out very quickly. Surprisingly, if one tries
-this example in JavaScript, Python or Java 8, even with small strings,
-say of lenght of around 30 $a$'s,
-the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
-The algorithms clearly show exponential behaviour, and as can be seen
-is triggered by some relatively simple regular expressions.
-Java 9 and newer
-versions improve this behaviour somewhat, but are still slow compared
-with the approach we are going to use in this thesis.
-
-
-
-This superlinear blowup in regular expression engines
-has caused grief in ``real life'' where it is
-given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
-For example, on 20 July 2016 one evil
-regular expression brought the webpage
-\href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
-In this instance, a regular expression intended to just trim white
-spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU resources---causing the web servers to grind to a
-halt. In this example, the time needed to process
-the string was
-$O(n^2)$ with respect to the string length $n$. This
-quadratic overhead was enough for the homepage of Stack Exchange to
-respond so slowly that the load balancer assumed a $\mathit{DoS}$
-attack and therefore stopped the servers from responding to any
-requests. This made the whole site become unavailable.
-
-\begin{figure}[p]
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={JavaScript},
- legend pos=north west,
- legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
-\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Python},
- legend pos=north west,
- legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
-\end{axis}
-\end{tikzpicture}\\
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Java 8},
- legend pos=north west,
- legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Dart},
- legend pos=north west,
- legend cell align=left]
-\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
-\end{axis}
-\end{tikzpicture}\\
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Swift},
- legend pos=north west,
- legend cell align=left]
-\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
-\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=true,
- %xtick={0,5000,...,40000},
- %xmax=40000,
- %ymax=35,
- restrict x to domain*=0:40000,
- restrict y to domain*=0:35,
- %ytick={0,5,...,30},
- %scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Java9+},
- legend pos=north west,
- legend cell align=left]
-\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{2}{c}{Graphs}
-\end{tabular}
-\end{center}
-\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
- of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
- The reason for their superlinear behaviour is that they do a depth-first-search
- using NFAs.
- If the string does not match, the regular expression matching
- engine starts to explore all possibilities.
-}\label{fig:aStarStarb}
-\end{figure}\afterpage{\clearpage}
-
-A more recent example is a global outage of all Cloudflare servers on 2 July
-2019. A poorly written regular expression exhibited catastrophic backtracking
-and exhausted CPUs that serve HTTP traffic. Although the outage
-had several causes, at the heart was a regular expression that
-was used to monitor network
-traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
-These problems with regular expressions
-are not isolated events that happen
-very rarely,
-%but actually widespread.
-%They occur so often that they have a
-but they occur actually often enough that they have a
-name: Regular-Expression-Denial-Of-Service (ReDoS)
-attacks.
-Davis et al. \cite{Davis18} detected more
-than 1000 evil regular expressions
-in Node.js, Python core libraries, npm and pypi.
-They therefore concluded that evil regular expressions
-are a real problem rather than just "a parlour trick".
-
-The work in this thesis aims to address this issue
-with the help of formal proofs.
-We describe a lexing algorithm based
-on Brzozowski derivatives with verified correctness
-and a finiteness property for the size of derivatives
-(which are all done in Isabelle/HOL).
-Such properties %guarantee the absence of
-are an important step in preventing
-catastrophic backtracking once and for all.
-We will give more details in the next sections
-on (i) why the slow cases in graph \ref{fig:aStarStarb}
-can occur in traditional regular expression engines
-and (ii) why we choose our
-approach based on Brzozowski derivatives and formal proofs.
-
-
-\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
-Regular expressions and regular expression matchers
-have clearly been studied for many, many years.
-Theoretical results in automata theory state
-that basic regular expression matching should be linear
-w.r.t the input.
-This assumes that the regular expression
-$r$ was pre-processed and turned into a
-deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
-By basic we mean textbook definitions such as the one
-below, involving only regular expressions for characters, alternatives,
-sequences, and Kleene stars:
-\[
- r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
-\]
-Modern regular expression matchers used by programmers,
-however,
-support much richer constructs, such as bounded repetitions,
-negations,
-and back-references.
-To differentiate, we use the word \emph{regex} to refer
-to those expressions with richer constructs while reserving the
-term \emph{regular expression}
-for the more traditional meaning in formal languages theory.
-We follow this convention
-in this thesis.
-In the future, we aim to support all the popular features of regexes,
-but for this work we mainly look at basic regular expressions
-and bounded repetitions.
-
-
-
-%Most modern regex libraries
-%the so-called PCRE standard (Peral Compatible Regular Expressions)
-%has the back-references
-Regexes come with a number of constructs
-that make it more convenient for
-programmers to write regular expressions.
-Depending on the types of constructs
-the task of matching and lexing with them
-will have different levels of complexity.
-Some of those constructs are syntactic sugars that are
-simply short hand notations
-that save the programmers a few keystrokes.
-These will not cause problems for regex libraries.
-For example the
-non-binary alternative involving three or more choices just means:
-\[
- (a | b | c) \stackrel{means}{=} ((a + b)+ c)
-\]
-Similarly, the range operator
-%used to express the alternative
-%of all characters between its operands,
-is just a concise way
-of expressing an alternative of consecutive characters:
-\[
- [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
-\]
-for an alternative. The
-wildcard character '$.$' is used to refer to any single character,
-\[
- . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
-\]
-except the newline.
-
-\subsection{Bounded Repetitions}
-More interesting are bounded repetitions, which can
-make the regular expressions much
-more compact.
-Normally there are four kinds of bounded repetitions:
-$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
-(where $n$ and $m$ are constant natural numbers).
-Like the star regular expressions, the set of strings or language
-a bounded regular expression can match
-is defined using the power operation on sets:
-\begin{center}
- \begin{tabular}{lcl}
- $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
- $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
- $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
- $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
- \end{tabular}
-\end{center}
-The attraction of bounded repetitions is that they can be
-used to avoid a size blow up: for example $r^{\{n\}}$
-is a shorthand for
-the much longer regular expression:
-\[
- \underbrace{r\ldots r}_\text{n copies of r}.
-\]
-%Therefore, a naive algorithm that simply unfolds
-%them into their desugared forms
-%will suffer from at least an exponential runtime increase.
-
-
-The problem with matching
-such bounded repetitions
-is that tools based on the classic notion of
-automata need to expand $r^{\{n\}}$ into $n$ connected
-copies of the automaton for $r$. This leads to very inefficient matching
-algorithms or algorithms that consume large amounts of memory.
-Implementations using $\DFA$s will
-in such situations
-either become excruciatingly slow
-(for example Verbatim++ \cite{Verbatimpp}) or run
-out of memory (for example $\mathit{LEX}$ and
-$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
-in C and JAVA that generate $\mathit{DFA}$-based
-lexers. The user provides a set of regular expressions
-and configurations, and then
-gets an output program encoding a minimized $\mathit{DFA}$
-that can be compiled and run.
-When given the above countdown regular expression,
-a small $n$ (say 20) would result in a program representing a
-DFA
-with millions of states.}) for large counters.
-A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$
-where the minimal DFA requires at least $2^{n+1}$ states.
-For example, when $n$ is equal to 2,
-the corresponding $\mathit{NFA}$ looks like:
-\vspace{6mm}
-\begin{center}
-\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
- \node[state,initial] (q_0) {$q_0$};
- \node[state, red] (q_1) [right=of q_0] {$q_1$};
- \node[state, red] (q_2) [right=of q_1] {$q_2$};
- \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
- \path[->]
- (q_0) edge node {a} (q_1)
- edge [loop below] node {a,b} ()
- (q_1) edge node {a,b} (q_2)
- (q_2) edge node {a,b} (q_3);
-\end{tikzpicture}
-\end{center}
-and when turned into a DFA by the subset construction
-requires at least $2^3$ states.\footnote{The
-red states are "countdown states" which count down
-the number of characters needed in addition to the current
-string to make a successful match.
-For example, state $q_1$ indicates a match that has
-gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
-and just consumed the "delimiter" $a$ in the middle, and
-needs to match 2 more iterations of $(a|b)$ to complete.
-State $q_2$ on the other hand, can be viewed as a state
-after $q_1$ has consumed 1 character, and just waits
-for 1 more character to complete.
-The state $q_3$ is the last (accepting) state, requiring 0
-more characters.
-Depending on the suffix of the
-input string up to the current read location,
-the states $q_1$ and $q_2$, $q_3$
-may or may
-not be active.
-A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
-contain at least $2^3$ non-equivalent states that cannot be merged,
-because the subset construction during determinisation will generate
-all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
-Generalizing this to regular expressions with larger
-bounded repetitions number, we have that
-regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
-would require at least $2^{n+1}$ states, if $r$ itself contains
-more than 1 string.
-This is to represent all different
-scenarios in which "countdown" states are active.}
-
-
-Bounded repetitions are important because they
-tend to occur frequently in practical use,
-for example in the regex library RegExLib, in
-the rules library of Snort \cite{Snort1999}\footnote{
-Snort is a network intrusion detection (NID) tool
-for monitoring network traffic.
-The network security community curates a list
-of malicious patterns written as regexes,
-which is used by Snort's detection engine
-to match against network traffic for any hostile
-activities such as buffer overflow attacks.},
-as well as in XML Schema definitions (XSDs).
-According to Bj\"{o}rklund et al \cite{xml2015},
-more than half of the
-XSDs they found on the Maven.org central repository
-have bounded regular expressions in them.
-Often the counters are quite large, with the largest being
-close to ten million.
-A smaller sample XSD they gave
-is:
-\lstset{
- basicstyle=\fontsize{8.5}{9}\ttfamily,
- language=XML,
- morekeywords={encoding,
- xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
-}
-\begin{lstlisting}
-<sequence minOccurs="0" maxOccurs="65535">
- <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
- <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-</sequence>
-\end{lstlisting}
-This can be seen as the regex
-$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
-regular expressions
-satisfying certain constraints (such as
-satisfying the floating point number format).
-It is therefore quite unsatisfying that
-some regular expressions matching libraries
-impose adhoc limits
-for bounded regular expressions:
-For example, in the regular expression matching library in the Go
-language the regular expression $a^{1001}$ is not permitted, because no counter
-can be above 1000, and in the built-in Rust regular expression library
-expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
-for being too big.
-As Becchi and Crawley \cite{Becchi08} have pointed out,
-the reason for these restrictions
-is that they simulate a non-deterministic finite
-automata (NFA) with a breadth-first search.
-This way the number of active states could
-be equal to the counter number.
-When the counters are large,
-the memory requirement could become
-infeasible, and a regex engine
-like in Go will reject this pattern straight away.
-\begin{figure}[H]
-\begin{center}
-\begin{tikzpicture} [node distance = 2cm, on grid, auto]
-
- \node (q0) [state, initial] {$0$};
- \node (q1) [state, right = of q0] {$1$};
- %\node (q2) [state, right = of q1] {$2$};
- \node (qdots) [right = of q1] {$\ldots$};
- \node (qn) [state, right = of qdots] {$n$};
- \node (qn1) [state, right = of qn] {$n+1$};
- \node (qn2) [state, right = of qn1] {$n+2$};
- \node (qn3) [state, accepting, right = of qn2] {$n+3$};
-
-\path [-stealth, thick]
- (q0) edge [loop above] node {a} ()
- (q0) edge node {a} (q1)
- %(q1) edge node {.} (q2)
- (q1) edge node {.} (qdots)
- (qdots) edge node {.} (qn)
- (qn) edge node {.} (qn1)
- (qn1) edge node {b} (qn2)
- (qn2) edge node {$c$} (qn3);
-\end{tikzpicture}
-%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
-% \node[state,initial] (q_0) {$0$};
-% \node[state, ] (q_1) [right=of q_0] {$1$};
-% \node[state, ] (q_2) [right=of q_1] {$2$};
-% \node[state,
-% \node[state, accepting, ](q_3) [right=of q_2] {$3$};
-% \path[->]
-% (q_0) edge node {a} (q_1)
-% edge [loop below] node {a,b} ()
-% (q_1) edge node {a,b} (q_2)
-% (q_2) edge node {a,b} (q_3);
-%\end{tikzpicture}
-\end{center}
-\caption{The example given by Becchi and Crawley
- that NFA simulation can consume large
- amounts of memory: $.^*a.^{\{n\}}bc$ matching
- strings of the form $aaa\ldots aaaabc$.
- When traversing in a breadth-first manner,
-all states from 0 till $n+1$ will become active.}
-\end{figure}
-%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
-%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
-%in terms of input string length.
-%TODO:try out these lexers
-These problems can of course be solved in matching algorithms where
-automata go beyond the classic notion and for instance include explicit
-counters \cite{Turo_ov__2020}.
-These solutions can be quite efficient,
-with the ability to process
-gigabits of strings input per second
-even with large counters \cite{Becchi08}.
-These practical solutions do not come with
-formal guarantees, and as pointed out by
-Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
-%But formal reasoning about these automata especially in Isabelle
-%can be challenging
-%and un-intuitive.
-%Therefore, we take correctness and runtime claims made about these solutions
-%with a grain of salt.
-
-In the work reported in \cite{FoSSaCS2023} and here,
-we add better support using derivatives
-for bounded regular expression $r^{\{n\}}$.
-Our results
-extend straightforwardly to
-repetitions with intervals such as
-$r^{\{n\ldots m\}}$.
-The merit of Brzozowski derivatives (more on this later)
-on this problem is that
-it can be naturally extended to support bounded repetitions.
-Moreover these extensions are still made up of only small
-inductive datatypes and recursive functions,
-making it handy to deal with them in theorem provers.
-%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
-%straightforwardly extended to deal with bounded regular expressions
-%and moreover the resulting code still consists of only simple
-%recursive functions and inductive datatypes.
-Finally, bounded regular expressions do not destroy our finite
-boundedness property, which we shall prove later on.
-
-
-
-
-
-\subsection{Back-References}
-The other way to simulate an $\mathit{NFA}$ for matching is choosing
-a single transition each time, keeping all the other options in
-a queue or stack, and backtracking if that choice eventually
-fails.
-This method, often called a "depth-first-search",
-is efficient in many cases, but could end up
-with exponential run time.
-The backtracking method is employed in regex libraries
-that support \emph{back-references}, for example
-in Java and Python.
-%\section{Back-references and The Terminology Regex}
-
-%When one constructs an $\NFA$ out of a regular expression
-%there is often very little to be done in the first phase, one simply
-%construct the $\NFA$ states based on the structure of the input regular expression.
-
-%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
-%one by keeping track of all active states after consuming
-%a character, and update that set of states iteratively.
-%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
-%for a path terminating
-%at an accepting state.
-
-
-
-Consider the following regular expression where the sequence
-operator is omitted for brevity:
-\begin{center}
- $r_1r_2r_3r_4$
-\end{center}
-In this example,
-one could label sub-expressions of interest
-by parenthesizing them and giving
-them a number in the order in which their opening parentheses appear.
-One possible way of parenthesizing and labelling is:
-\begin{center}
- $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
-\end{center}
-The sub-expressions
-$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
-by 1 to 4, and can be ``referred back'' by their respective numbers.
-%These sub-expressions are called "capturing groups".
-To do so, one uses the syntax $\backslash i$
-to denote that we want the sub-string
-of the input matched by the i-th
-sub-expression to appear again,
-exactly the same as it first appeared:
-\begin{center}
-$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
-\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
-\end{center}
-Once the sub-string $s_i$ for the sub-expression $r_i$
-has been fixed, there is no variability on what the back-reference
-label $\backslash i$ can be---it is tied to $s_i$.
-The overall string will look like $\ldots s_i \ldots s_i \ldots $
-%The backslash and number $i$ are the
-%so-called "back-references".
-%Let $e$ be an expression made of regular expressions
-%and back-references. $e$ contains the expression $e_i$
-%as its $i$-th capturing group.
-%The semantics of back-reference can be recursively
-%written as:
-%\begin{center}
-% \begin{tabular}{c}
-% $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
-% $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
-% \end{tabular}
-%\end{center}
-A concrete example
-for back-references is
-\begin{center}
-$(.^*)\backslash 1$,
-\end{center}
-which matches
-strings that can be split into two identical halves,
-for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
-Note that this is different from
-repeating the sub-expression verbatim like
-\begin{center}
- $(.^*)(.^*)$,
-\end{center}
-which does not impose any restrictions on what strings the second
-sub-expression $.^*$
-might match.
-Another example for back-references is
-\begin{center}
-$(.)(.)\backslash 2\backslash 1$
-\end{center}
-which matches four-character palindromes
-like $abba$, $x??x$ and so on.
-
-Back-references are a regex construct
-that programmers find quite useful.
-According to Becchi and Crawley \cite{Becchi08},
-6\% of Snort rules (up until 2008) use them.
-The most common use of back-references
-is to express well-formed html files,
-where back-references are convenient for matching
-opening and closing tags like
-\begin{center}
- $\langle html \rangle \ldots \langle / html \rangle$
-\end{center}
-A regex describing such a format
-is
-\begin{center}
- $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
-\end{center}
-Despite being useful, the expressive power of regexes
-go beyond regular languages
-once back-references are included.
-In fact, they allow the regex construct to express
-languages that cannot be contained in context-free
-languages either.
-For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
-expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by
-context-free grammars \cite{campeanu2003formal}.
-Such a language is contained in the context-sensitive hierarchy
-of formal languages.
-Also solving the matching problem involving back-references
-is known to be NP-complete \parencite{alfred2014algorithms}.
-Regex libraries supporting back-references such as
-PCRE \cite{pcre} therefore have to
-revert to a depth-first search algorithm including backtracking.
-What is unexpected is that even in the cases
-not involving back-references, there is still
-a (non-negligible) chance they might backtrack super-linearly,
-as shown in the graphs in figure \ref{fig:aStarStarb}.
-
-Summing up, we can categorise existing
-practical regex libraries into two kinds:
-(i) The ones with linear
-time guarantees like Go and Rust. The downside with them is that
-they impose restrictions
-on the regular expressions (not allowing back-references,
-bounded repetitions cannot exceed an ad hoc limit etc.).
-And (ii) those
-that allow large bounded regular expressions and back-references
-at the expense of using backtracking algorithms.
-They can potentially ``grind to a halt''
-on some very simple cases, resulting
-ReDoS attacks if exposed to the internet.
-
-The problems with both approaches are the motivation for us
-to look again at the regular expression matching problem.
-Another motivation is that regular expression matching algorithms
-that follow the POSIX standard often contain errors and bugs
-as we shall explain next.
-
-%We would like to have regex engines that can
-%deal with the regular part (e.g.
-%bounded repetitions) of regexes more
-%efficiently.
-%Also we want to make sure that they do it correctly.
-%It turns out that such aim is not so easy to achieve.
- %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
-% For example, the Rust regex engine claims to be linear,
-% but does not support lookarounds and back-references.
-% The GoLang regex library does not support over 1000 repetitions.
-% Java and Python both support back-references, but shows
-%catastrophic backtracking behaviours on inputs without back-references(
-%when the language is still regular).
- %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
- %TODO: verify the fact Rust does not allow 1000+ reps
-
-
-
-
-%The time cost of regex matching algorithms in general
-%involve two different phases, and different things can go differently wrong on
-%these phases.
-%$\DFA$s usually have problems in the first (construction) phase
-%, whereas $\NFA$s usually run into trouble
-%on the second phase.
-
-
-\section{Error-prone POSIX Implementations}
-Very often there are multiple ways of matching a string
-with a regular expression.
-In such cases the regular expressions matcher needs to
-disambiguate.
-The more widely used strategy is called POSIX,
-which roughly speaking always chooses the longest initial match.
-The POSIX strategy is widely adopted in many regular expression matchers
-because it is a natural strategy for lexers.
-However, many implementations (including the C libraries
-used by Linux and OS X distributions) contain bugs
-or do not meet the specification they claim to adhere to.
-Kuklewicz maintains a unit test repository which lists some
-problems with existing regular expression engines \cite{KuklewiczHaskell}.
-In some cases, they either fail to generate a
-result when there exists a match,
-or give results that are inconsistent with the POSIX standard.
-A concrete example is the regex:
-\begin{center}
- $(aba + ab + a)^* \text{and the string} \; ababa$
-\end{center}
-The correct POSIX match for the above
-involves the entire string $ababa$,
-split into two Kleene star iterations, namely $[ab], [aba]$ at positions
-$[0, 2), [2, 5)$
-respectively.
-But feeding this example to the different engines
-listed at regex101 \footnote{
- regex101 is an online regular expression matcher which
- provides API for trying out regular expression
- engines of multiple popular programming languages like
-Java, Python, Go, etc.} \parencite{regex101}.
-yields
-only two incomplete matches: $[aba]$ at $[0, 3)$
-and $a$ at $[4, 5)$.
-Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
-commented that most regex libraries are not
-correctly implementing the central POSIX
-rule, called the maximum munch rule.
-Grathwohl \parencite{grathwohl2014crash} wrote:
-\begin{quote}\it
- ``The POSIX strategy is more complicated than the
- greedy because of the dependence on information about
- the length of matched strings in the various subexpressions.''
-\end{quote}
-%\noindent
-People have recognised that the implementation complexity of POSIX rules also come from
-the specification being not very precise.
-The developers of the regexp package of Go
-\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
-commented that
-\begin{quote}\it
-``
-The POSIX rule is computationally prohibitive
-and not even well-defined.
-``
-\end{quote}
-There are many informal summaries of this disambiguation
-strategy, which are often quite long and delicate.
-For example Kuklewicz \cite{KuklewiczHaskell}
-described the POSIX rule as (section 1, last paragraph):
-\begin{quote}
- \begin{itemize}
- \item
-regular expressions (REs) take the leftmost starting match, and the longest match starting there
-earlier subpatterns have leftmost-longest priority over later subpatterns\\
-\item
-higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
-\item
-REs have right associative concatenation which can be changed with parenthesis\\
-\item
-parenthesized subexpressions return the match from their last usage\\
-\item
-text of component subexpressions must be contained in the text of the
-higher-level subexpressions\\
-\item
-if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
-\item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
-\end{itemize}
-\end{quote}
-%The text above
-%is trying to capture something very precise,
-%and is crying out for formalising.
-Ribeiro and Du Bois \cite{RibeiroAgda2017} have
-formalised the notion of bit-coded regular expressions
-and proved their relations with simple regular expressions in
-the dependently-typed proof assistant Agda.
-They also proved the soundness and completeness of a matching algorithm
-based on the bit-coded regular expressions.
-Ausaf et al. \cite{AusafDyckhoffUrban2016}
-are the first to
-give a quite simple formalised POSIX
-specification in Isabelle/HOL, and also prove
-that their specification coincides with an earlier (unformalised)
-POSIX specification given by Okui and Suzuki \cite{Okui10}.
-They then formally proved the correctness of
-a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
-with regards to that specification.
-They also found that the informal POSIX
-specification by Sulzmann and Lu needs to be substantially
-revised in order for the correctness proof to go through.
-Their original specification and proof were unfixable
-according to Ausaf et al.
-
-
-In the next section, we will briefly
-introduce Brzozowski derivatives and Sulzmann
-and Lu's algorithm, which the main point of this thesis builds on.
-%We give a taste of what they
-%are like and why they are suitable for regular expression
-%matching and lexing.
-\section{Formal Specification of POSIX Matching
-and Brzozowski Derivatives}
-%Now we start with the central topic of the thesis: Brzozowski derivatives.
-Brzozowski \cite{Brzozowski1964} first introduced the
-concept of a \emph{derivative} of regular expression in 1964.
-The derivative of a regular expression $r$
-with respect to a character $c$, is written as $r \backslash c$.
-This operation tells us what $r$ transforms into
-if we ``chop'' off a particular character $c$
-from all strings in the language of $r$ (defined
-later as $L \; r$).
-%To give a flavour of Brzozowski derivatives, we present
-%two straightforward clauses from it:
-%\begin{center}
-% \begin{tabular}{lcl}
-% $d \backslash c$ & $\dn$ &
-% $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-%$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-% \end{tabular}
-%\end{center}
-%\noindent
-%The first clause says that for the regular expression
-%denoting a singleton set consisting of a single-character string $\{ d \}$,
-%we check the derivative character $c$ against $d$,
-%returning a set containing only the empty string $\{ [] \}$
-%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
-%The second clause states that to obtain the regular expression
-%representing all strings' head character $c$ being chopped off
-%from $r_1 + r_2$, one simply needs to recursively take derivative
-%of $r_1$ and $r_2$ and then put them together.
-Derivatives have the property
-that $s \in L \; (r\backslash c)$ if and only if
-$c::s \in L \; r$ where $::$ stands for list prepending.
-%This property can be used on regular expressions
-%matching and lexing--to test whether a string $s$ is in $L \; r$,
-%one simply takes derivatives of $r$ successively with
-%respect to the characters (in the correct order) in $s$,
-%and then test whether the empty string is in the last regular expression.
-With this property, derivatives can give a simple solution
-to the problem of matching a string $s$ with a regular
-expression $r$: if the derivative of $r$ w.r.t.\ (in
-succession) all the characters of the string matches the empty string,
-then $r$ matches $s$ (and {\em vice versa}).
-%This makes formally reasoning about these properties such
-%as correctness and complexity smooth and intuitive.
-There are several mechanised proofs of this property in various theorem
-provers,
-for example one by Owens and Slind \cite{Owens2008} in HOL4,
-another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
-yet another in Coq by Coquand and Siles \cite{Coquand2012}.
-
-In addition, one can extend derivatives to bounded repetitions
-relatively straightforwardly. For example, the derivative for
-this can be defined as:
-\begin{center}
- \begin{tabular}{lcl}
- $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot
- r^{\{n-1\}} (\text{when} n > 0)$\\
- \end{tabular}
-\end{center}
-\noindent
-Experimental results suggest that unlike DFA-based solutions
-for bounded regular expressions,
-derivatives can cope
-large counters
-quite well.
-
-There have also been
-extensions of derivatives to other regex constructs.
-For example, Owens et al include the derivatives
-for the \emph{NOT} regular expression, which is
-able to concisely express C-style comments of the form
-$/* \ldots */$ (see \cite{Owens2008}).
-Another extension for derivatives is
-regular expressions with look-aheads, done
-by Miyazaki and Minamide
-\cite{Takayuki2019}.
-%We therefore use Brzozowski derivatives on regular expressions
-%lexing
-
-
-
-Given the above definitions and properties of
-Brzozowski derivatives, one quickly realises their potential
-in generating a formally verified algorithm for lexing: the clauses and property
-can be easily expressed in a functional programming language
-or converted to theorem prover
-code, with great ease.
-Perhaps this is the reason why derivatives have sparked quite a bit of interest
-in the functional programming and theorem prover communities in the last
-fifteen or so years (
-\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
-\cite{Chen12} and \cite{Coquand2012}
-to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
-after they were first published by Brzozowski.
-
-
-However, there are two difficulties with derivative-based matchers:
-First, Brzozowski's original matcher only generates a yes/no answer
-for whether a regular expression matches a string or not. This is too
-little information in the context of lexing where separate tokens must
-be identified and also classified (for example as keywords
-or identifiers).
-Second, derivative-based matchers need to be more efficient in terms
-of the sizes of derivatives.
-Elegant and beautiful
-as many implementations are,
-they can be excruciatingly slow.
-For example, Sulzmann and Lu
-claim a linear running time of their proposed algorithm,
-but that was falsified by our experiments. The running time
-is actually $\Omega(2^n)$ in the worst case.
-A similar claim about a theoretical runtime of $O(n^2)$
-is made for the Verbatim \cite{Verbatim}
-%TODO: give references
-lexer, which calculates POSIX matches and is based on derivatives.
-They formalized the correctness of the lexer, but not their complexity result.
-In the performance evaluation section, they analyzed the run time
-of matching $a$ with the string
-\begin{center}
- $\underbrace{a \ldots a}_{\text{n a's}}$.
-\end{center}
-\noindent
-They concluded that the algorithm is quadratic in terms of
-the length of the input string.
-When we tried out their extracted OCaml code with the example $(a+aa)^*$,
-the time it took to match a string of 40 $a$'s was approximately 5 minutes.
-
-
-\subsection{Sulzmann and Lu's Algorithm}
-Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
-problem with the yes/no answer
-by cleverly extending Brzozowski's matching
-algorithm. Their extended version generates additional information on
-\emph{how} a regular expression matches a string following the POSIX
-rules for regular expression matching. They achieve this by adding a
-second ``phase'' to Brzozowski's algorithm involving an injection
-function. This injection function in a sense undoes the ``damage''
-of the derivatives chopping off characters.
-In earlier work, Ausaf et al provided the formal
-specification of what POSIX matching means and proved in Isabelle/HOL
-the correctness
-of this extended algorithm accordingly
-\cite{AusafDyckhoffUrban2016}.
-
-The version of the algorithm proven correct
-suffers however heavily from a
-second difficulty, where derivatives can
-grow to arbitrarily big sizes.
-For example if we start with the
-regular expression $(a+aa)^*$ and take
-successive derivatives according to the character $a$, we end up with
-a sequence of ever-growing derivatives like
-
-\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
-\begin{center}
-\begin{tabular}{rll}
-$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
-& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
-& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
-& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
-& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
-\end{tabular}
-\end{center}
-
-\noindent where after around 35 steps we usually run out of memory on a
-typical computer. Clearly, the
-notation involving $\ZERO$s and $\ONE$s already suggests
-simplification rules that can be applied to regular regular
-expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
-\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
-r$. While such simple-minded simplifications have been proved in
-the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
-algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
-\emph{not} help with limiting the growth of the derivatives shown
-above: the growth is slowed, but the derivatives can still grow rather
-quickly beyond any finite bound.
-
-Therefore we want to look in this thesis at a second
-algorithm by Sulzmann and Lu where they
-overcame this ``growth problem'' \cite{Sulzmann2014}.
-In this version, POSIX values are
-represented as bit sequences and such sequences are incrementally generated
-when derivatives are calculated. The compact representation
-of bit sequences and regular expressions allows them to define a more
-``aggressive'' simplification method that keeps the size of the
-derivatives finite no matter what the length of the string is.
-They make some informal claims about the correctness and linear behaviour
-of this version, but do not provide any supporting proof arguments, not
-even ``pencil-and-paper'' arguments. They write about their bit-coded
-\emph{incremental parsing method} (that is the algorithm to be formalised
-in this dissertation)
-
-
-
- \begin{quote}\it
- ``Correctness Claim: We further claim that the incremental parsing
- method [..] in combination with the simplification steps [..]
- yields POSIX parse trees. We have tested this claim
- extensively [..] but yet
- have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
-\end{quote}
-Ausaf and Urban made some initial progress towards the
-full correctness proof but still had to leave out the optimisation
-Sulzmann and Lu proposed.
-Ausaf wrote \cite{Ausaf},
- \begin{quote}\it
-``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
-\end{quote}
-This thesis implements the aggressive simplifications envisioned
-by Ausaf and Urban,
-together with a formal proof of the correctness of those simplifications.
-
-
-One of the most recent work in the context of lexing
-%with this issue
-is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
-This is relevant work for us and we will compare it later with
-our derivative-based matcher we are going to present.
-There is also some newer work called
-Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
-but deterministic finite automaton instead.
-We will also study this work in a later section.
-%An example that gives problem to automaton approaches would be
-%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
-%It requires at least $2^{n+1}$ states to represent
-%as a DFA.
-
-
%----------------------------------------------------------------------------------------
\section{Contribution}
{\color{red} \rule{\linewidth}{0.5mm}}