% Chapter Template

\chapter{Related Work} % Main chapter title

\label{RelatedWork}

%In this chapter, we introduce
%work relevant to this thesis.

\section{Regular Expressions, Derivatives and POSIX Lexing}

%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}

Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
Since then they have become a fundamental concept in formal languages and automata theory \cite{Sakarovitch2009}.
Brzozowski defined derivatives of regular expressions in his PhD thesis
in 1964 \cite{Brzozowski1964}, in which he proved that the number of
derivatives of a regular expression is finite modulo the ACI-axioms.
It is worth pointing out that this result does not directly
translate to our finiteness proof, and our proof does not depend on it.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents derivative terms in a way that allows efficient de-duplication,
and we do not rely on an equivalence checker that identifies ACI-equivalent terms.

Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
They first introduced the elegant and simple ideas of injection-based lexing
and bit-coded lexing.
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, which were based on the ideas
by Frisch and Cardelli \cite{Frisch2004}, were later
found to have unfillable gaps by Ausaf \cite{Ausaf},
who came up with an alternative proof inspired by Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to regular expressions with shuffle \cite{Sulzmann2015},
which can describe certain regular languages much more succinctly than basic regular expressions.

Regular expressions and lexers have been a popular topic among
the theorem proving and functional programming communities.
In the next section we list lexers
and matchers that come with a machine-checked correctness proof.

\subsection{Matchers and Lexers with Mechanised Proofs}

We are aware of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. Another one
in Coq is given by Coquand and Siles \parencite{Coquand2012},
and Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent works, to the best of our knowledge, are the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatim} lexers.
The Verbatim++ lexer adds many correctness-preserving optimisations to the Verbatim lexer
and is therefore quite fast on many inputs.
The problem is that they chose to use DFAs to speed things up, and dealing with
bounded repetitions is a bottleneck for DFAs.
This thesis builds on the formal specification of POSIX
rules and the formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
Our work on bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with in theorem provers \cite{Nipkow1998}.
One way to represent automata is as graphs, but graphs are not an inductive datatype.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning unintuitive and convoluted,
resulting in large formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct.
If they are not distinct, the nodes need to be renamed.
Using Coq, which provides dependent types, can potentially make things slightly easier \cite{Doczkal2013}.
Another representation for automata is as matrices, but induction over them is not straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are more clever representations, for example the one by Paulson using hereditarily finite sets \cite{Paulson2015},
where the problem with combining graphs can be solved better.
%but we believe that such clever tricks are not very obvious for
%the John-average-Isabelle-user.

\subsection{Different Definitions of POSIX Rules}

There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} developed a notion of
\emph{non-problematic values}, which is a slight variation
of the values defined by the inhabitation relation in Figure~\ref{fig:inhab}.
They then defined an ordering between values and showed that
the maximal element of those values corresponds to the output
of their GREEDY lexing algorithm.
Okui and Suzuki \cite{Okui10} allow iterations of values to flatten to the empty
string in the star inhabitation rule in Figure~\ref{fig:inhab}.
They refer to the more restrictive version used in this thesis (which was defined by Ausaf et al.
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
An interesting link between the work by Ausaf et al.
and that of Okui and Suzuki is that their formalisations
of POSIX values are distinct, and yet provably equivalent. See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the corresponding formalisation.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves have also given POSIX definitions \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish by a pencil-and-paper proof that their
algorithm outputs the minimal element.
But because the ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

\subsection{Static Analysis of Evil Regex Patterns}

When faced with catastrophic backtracking, programmers sometimes have to
rewrite their regexes in an ad hoc fashion, or they try to avoid
possibly problematic patterns altogether, an approach that leads to many
false positives \parencite{Davis18}.
Knowing which patterns should be avoided can therefore be helpful.
Animated tools to ``debug'' regular expressions, such as
\parencite{regexploit2021} and \parencite{regex101}, are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns.
Rathnayake and Thielecke \parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with non-linear polynomial worst-case estimates
for regexes, attack strings that exploit the worst-case scenario, and ``attack automata'' that generate
attack strings.

\section{Optimisations}

Perhaps the biggest problem preventing derivative-based lexing from being more widely adopted
is that it tends not to be blindingly fast in practice: it is unable to reach throughputs
of gigabytes per second, which was the kind of application we had in mind when we initially started looking at the topic.
Regular expression matchers used in tools such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rate (which can be up to a few gigabits per second) against thousands of rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.

FPGA-based implementations such as \cite{Sidhu2001} have the advantage of being
reconfigurable and parallel, but suffer from lower clock frequencies
and poorer scalability.
Traditional automata-based approaches that use a DFA instead of an NFA
benefit from the fact that only a single transition is needed for each input character \cite{Becchi08};
the lower memory bandwidth requirement leads to faster performance.
However, DFAs suffer from exponential blow-ups on bounded repetitions.
Compression techniques, such as those in \cite{Kumar2006} and \cite{Becchi2007},
have been used to reduce the memory footprint of such automata.
Variations of plain NFAs and DFAs, such as counting-set automata \cite{Turonova2020}, have been
proposed to deal better with bounded repetitions.

%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
Another direction of optimisation for derivative-based approaches is defining string derivatives
directly, without recursively decomposing them into character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we directly
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because the input is processed at a larger granularity.
It would also be interesting to explore whether the same can be done
for $\inj$, so that we can generate a lexical value rather than merely obtain a matcher.
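To make the idea of computing string derivatives directly a bit more concrete, below is a minimal
sketch in Scala of a string derivative that treats the alternative case directly. The datatype and
function names (\texttt{Rexp}, \texttt{der}, \texttt{ders}, \texttt{dersDirect}) are chosen purely
for illustration and do not reflect the implementation used in this thesis; in particular, no
simplification is performed.

\begin{verbatim}
// a minimal regular-expression datatype, for illustration only
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the usual single-character derivative
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// character-by-character string derivative: ((r\c1)\c2)\...\cn
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// "direct" string derivative: for alternatives the whole remaining
// string is pushed into both branches at once, computing
// (r1 + r2)\s as (r1\s + r2\s) instead of peeling off one
// character of the alternative at a time
def dersDirect(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, _)         => r
  case (_, ALT(r1, r2)) => ALT(dersDirect(s, r1), dersDirect(s, r2))
  case (c :: cs, _)     => dersDirect(cs, der(c, r))
}
\end{verbatim}

In this sketch only the alternative case benefits: sequences and stars still fall back to the
character-by-character definition. A full treatment would also have to push the string under
sequences and stars, which requires more care.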
\subsection{Derivatives and Zippers}

A zipper is a data structure designed to focus on and navigate between local parts of a tree.
The idea is that operations on a large tree often only deal with
local regions at a time.
It would therefore be wasteful to traverse the entire tree if an operation
involves only a small fraction of it.
Zippers were first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text editor buffers
and proof system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
We discuss several works on parsing, derivatives
and zippers below.

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm is linear in the length of the input string.
The idea of using Brzozowski derivatives for general context-free parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem when using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
is cubic in the size of the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: to handle ambiguity, their zippers have multiple focuses
instead of the single focus of a traditional zipper.
Their algorithm was, however, not formalised.

\subsection{Back-References}

We introduced regular expressions with back-references
in Chapter~\ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back-References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the problem of rewbr matching, which is therefore NP-complete.
Given the hardness of the problem, progress on matching arbitrary rewbrs in full generality
has been slow, and theoretical work on them is fairly recent.

Campeanu et al. studied regexes with back-references
in the context of formal language theory in their 2003 work \cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes, proving that the languages denoted by them
are properly contained in the context-sensitive
languages.
More interesting questions, such as whether Perl-like regexes
can express the language of palindromes ($\{w \mid w = w^R\}$),
are discussed in \cite{campeanu2009patterns} and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the expressive power of back-references.
He showed several undecidability and descriptional complexity results for back-references,
concluding that they add
great power to certain programming tasks but are difficult to harness.
An interesting question would then be
whether we can add restrictions to them, so that they become expressive enough for practical uses such
as matching HTML files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.
Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives on back-references.
See \cite{BERGLUND2022} for a survey
of these works and a comparison between different
flavours of back-reference syntax (for example, whether references can be circular and whether labels can be repeated).