diff -r b797c9a709d9 -r 1c8525061545 ChengsongTanPhdThesis/Chapters/RelatedWork.tex
--- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Thu Nov 17 23:13:57 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Mon Nov 21 23:56:15 2022 +0000
@@ -5,7 +5,211 @@
 \label{RelatedWork}
 
 In this chapter, we introduce
-the work relevant to this thesis.
+work relevant to this thesis.
+
+\section{Regular Expressions, Derivatives and POSIX Lexing}
+
+%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
+Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
+Since then they have become a fundamental concept in
+formal languages and automata theory \cite{Sakarovitch2009}.
+Brzozowski defined derivatives on regular expressions in his PhD thesis
+in 1964 \cite{Brzozowski1964}, in which he proved that the number
+of derivatives of a regular expression is finite modulo the ACI-axioms.
+It is worth pointing out that this result does not directly
+translate to our finiteness proof,
+nor does our proof depend on it for its validity.
+The key observation is that our version of Sulzmann and Lu's algorithm
+\cite{Sulzmann2014} represents
+derivative terms in a way that allows efficient de-duplication,
+without the need for an equivalence checker that exploits
+ACI-equivalent terms.
+
+Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
+They introduced the elegant and simple ideas of injection-based lexing
+and bit-coded lexing.
+In follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
+incorporated these ideas
+into a tool called \emph{dreml}.
+The pencil-and-paper proofs in \cite{Sulzmann2014}, based on the ideas
+by Frisch and Cardelli \cite{Frisch2004}, were later
+found to have unfillable gaps by Ausaf \cite{Ausaf},
+who came up with an alternative proof inspired by
+Vansummeren \cite{Vansummeren2006}.
+Sulzmann and Thiemann extended Brzozowski derivatives to
+regular expressions with shuffle \cite{Sulzmann2015},
+which can describe certain languages much more succinctly than
+basic regular expressions.
+
+
+
+Regular expressions and lexers have been a popular topic in
+the theorem proving and functional programming communities.
+In the next subsection we give a list of lexers
+and matchers that come with a machine-checked correctness proof.
+
+\subsection{Matchers and Lexers with Mechanised Proofs}
+We are aware
+of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
+Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
+of the work by Krauss and Nipkow \parencite{Krauss2011}. A third one
+in Coq is given by Coquand and Siles \parencite{Coquand2012}.
+Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
+To the best of our knowledge, the most recent works are the Verbatim \cite{Verbatim}
+and Verbatim++ \cite{Verbatim} lexers.
+The Verbatim++ lexer adds many correctness-preserving
+optimisations to the Verbatim lexer,
+and is therefore quite fast on many inputs.
+The problem is that they chose to use DFAs to speed things up,
+and dealing with bounded repetitions is a bottleneck for DFAs.
+
+
+This thesis builds on the formal specifications of POSIX
+rules and formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
+The work on bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.
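+
+To make the role of bounded repetitions concrete, the following
+Scala sketch shows how Brzozowski derivatives extend to a bounded
+repetition constructor. It is only an illustrative sketch: the
+constructor and function names follow the usual textbook definitions,
+not necessarily the exact ones of our formalisation.
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+case class NTIMES(r: Rexp, n: Int) extends Rexp  // r{n}
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+  case NTIMES(r1, n) => n == 0 || nullable(r1)
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+  // a counter is decremented; no unfolding into n copies is needed
+  case NTIMES(r1, n) =>
+    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
+}
+\end{verbatim}
+
+The \texttt{NTIMES} clause merely decrements a counter, which is why
+derivative-based matchers avoid the state-space blow-up that bounded
+repetitions cause for DFA-based lexers such as Verbatim++.
+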
+Automata formalisations are in general harder and more cumbersome to deal
+with for theorem provers \cite{Nipkow1998}.
+One way to represent automata is to use graphs, but graphs are not inductive datatypes.
+Having to base the induction principle on the number of nodes
+in a graph makes formal reasoning non-intuitive and convoluted.
+When combining two graphs, one also needs to make sure that the nodes in
+both graphs are distinct;
+if they are not distinct, then the nodes need to be renamed.
+Using Coq, which provides dependent types, can potentially make things slightly easier
+\cite{Doczkal2013}.
+Another representation for automata is matrices,
+but the induction for them is not straightforward either.
+Both approaches have been used in the past and resulted in huge formalisations.
+There are some more clever representations, for example one by Paulson
+using hereditarily finite sets.
+There the problem with combining graphs can be solved better,
+but we believe that such clever tricks are not very obvious to
+the average Isabelle user.
+
+\subsection{Different Definitions of POSIX Rules}
+There are different ways to formalise values and POSIX matching.
+Frisch and Cardelli \cite{Frisch2004} developed a notion of
+\emph{non-problematic values}, which is a slight variation
+of the values defined by the inhabitation relation in Figure~\ref{fig:inhab}.
+They then defined an ordering between values and showed that
+the maximal element of those values corresponds to the output
+of their GREEDY lexing algorithm.
+
+Okui and Suzuki \cite{Okui10} allow iterations of values to
+flatten to the empty
+string in the star inhabitation rule of
+Figure~\ref{fig:inhab}.
+They refer to the more restrictive version used
+in this thesis (which was defined by Ausaf et al.
+\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
+The very interesting link between the work by Ausaf et al.
+and Okui and Suzuki is that they have distinct formalisations
+of POSIX values, and yet they are provably equivalent! See
+\cite{AusafDyckhoffUrban2016} for details of the
+alternative definitions given by Okui and Suzuki and the formalisation.
+%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.
+
+Sulzmann and Lu themselves have come up with POSIX definitions \cite{Sulzmann2014}.
+In their paper they defined an ordering between values with respect
+to regular expressions, and tried to establish by a pencil-and-paper proof
+that their algorithm outputs the minimal element.
+But because their ordering relation takes regular expressions as parameters,
+the transitivity of the ordering does not go through.
+
+
+\section{Optimisations}
+Perhaps the biggest problem preventing derivative-based lexing
+from being more widely adopted
+is that it tends not to be blindingly fast in practice: it is unable to
+reach throughputs of gigabytes per second, which was the
+application we had in mind when we initially started looking at the topic.
+Commercial
+regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
+are capable of inspecting payloads
+at line rates (which can be up to a few gigabits per second) against
+thousands of rules \cite{communityRules}.
+For our algorithm to be more attractive for practical use, we
+need more correctness-preserving optimisations.
+%So far our focus has been mainly on the bit-coded algorithm,
+%but the injection-based lexing algorithm
+%could also be sped up in various ways:
+%
+One such optimisation is defining string derivatives
+directly, without recursively decomposing them into
+character-by-character derivatives.
+For example, instead of
+replacing $(r_1 + r_2)\backslash (c::cs)$ by
+$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
+calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
+This has the potential to speed up matching because the input is
+processed at a larger granularity.
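+
+The following Scala sketch illustrates the idea, reusing the
+\texttt{Rexp} constructors and \texttt{der} from the earlier sketch;
+the function \texttt{sder} is only meant as an illustration of the
+optimisation, not as a definitive implementation.
+
+\begin{verbatim}
+// derivative by a whole string: alternatives distribute over
+// the string derivative, so the string is pushed into both
+// branches at once instead of peeling off single characters
+def sder(s: List[Char], r: Rexp): Rexp = r match {
+  case ALT(r1, r2) => ALT(sder(s, r1), sder(s, r2))
+  case _ => s match {
+    case Nil => r
+    case c :: cs => sder(cs, der(c, r))
+  }
+}
+\end{verbatim}
+
+Note that the fallback case may expose new alternatives after a single
+character derivative, into which the remainder of the string is then
+again pushed as a whole.
+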
+It would also be interesting to explore whether this can be done
+for $\inj$ as well, so that we generate a lexical value
+rather than merely a matcher.
+
+
+\subsection{Derivatives and Zippers}
+A zipper is a data structure designed to focus on
+and navigate between local parts of a tree.
+The idea is that operations on a large tree often deal with
+local regions one at a time,
+so it would be a waste to
+traverse the entire tree if
+the operation only
+involves a small fraction of it.
+A zipper instead puts the focus on the relevant subtree and
+turns the other parts of the tree into a context.
+Zippers were first formally described by Huet \cite{HuetZipper}.
+Typical applications of zippers involve text editor buffers
+and proof system databases.
+One observation about our derivative-based lexing algorithm is that
+the derivative operation sometimes traverses the entire regular expression
+unnecessarily.
+In our setting, the idea is therefore to compactify the representation
+of derivatives with zippers, thereby making our algorithm faster.
+Below we introduce several works on parsing, derivatives
+and zippers.
+
+Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
+\cite{Zippy2020}.
+They adopted zippers to improve the speed, and argued that the runtime
+complexity of their algorithm is linear with respect to the input string.
+
+The idea of using Brzozowski derivatives for general context-free
+parsing was first implemented
+by Might et al. \cite{Might2011}.
+They used memoization and a fixpoint construction to eliminate infinite recursion,
+which is a major problem when using derivatives on context-free grammars.
+The initial version was quite slow, namely exponential in the size of the input.
+Adams et al. \cite{Adams2016} improved the speed and argued that their version
+is cubic with respect to the input.
+Darragh and Adams \cite{Darragh2020} further improved the performance
+by using zippers in an innovative way: their zippers have multiple focuses,
+instead of the single focus of a traditional zipper, in order to handle ambiguity.
+Their algorithm was not formalised, though.
+
+
+We have obtained some initial results with zippers, but more care needs to be taken
+to make sure
+that the lexer output conforms to the POSIX standard. Formalising the correctness of
+such a zipper-based lexer will probably require an imperative
+framework with separation logic, as it involves
+mutable data structures; this is also interesting to look into.
+
+To further optimise the algorithm,
+we plan to add a ``deduplicate'' function that tells
+whether two regular expressions denote the same
+language, using
+an efficient and verified equivalence checker.
+In this way, the internal data structures can be pruned more aggressively,
+leading to better simplifications and ultimately faster algorithms.
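+
+As an illustration only, the following Scala sketch shows such an
+equivalence check in the style of a bisimulation between derivatives.
+It is a naive prototype rather than the efficient, verified checker we
+have in mind, and its termination relies on the (placeholder)
+simplification function \texttt{simp} keeping the set of reachable
+derivative pairs finite.
+
+\begin{verbatim}
+// placeholder simplification; a real one must rewrite modulo
+// the ACI-laws so that only finitely many derivatives arise
+def simp(r: Rexp): Rexp = r match {
+  case ALT(r1, r2) if r1 == r2 => simp(r1)
+  case ALT(ZERO, r2) => simp(r2)
+  case ALT(r1, ZERO) => simp(r1)
+  case SEQ(ZERO, _) => ZERO
+  case SEQ(_, ZERO) => ZERO
+  case SEQ(ONE, r2) => simp(r2)
+  case SEQ(r1, ONE) => simp(r1)
+  case _ => r
+}
+
+// two regular expressions are equivalent iff no pair of
+// iterated derivatives disagrees on nullability
+def equiv(r1: Rexp, r2: Rexp, alphabet: List[Char]): Boolean = {
+  val seen = scala.collection.mutable.Set[(Rexp, Rexp)]()
+  val todo = scala.collection.mutable.Queue((r1, r2))
+  while (todo.nonEmpty) {
+    val (p, q) = todo.dequeue()
+    if (!seen((p, q))) {
+      if (nullable(p) != nullable(q)) return false
+      seen += ((p, q))
+      for (c <- alphabet)
+        todo.enqueue((simp(der(c, p)), simp(der(c, q))))
+    }
+  }
+  true
+}
+\end{verbatim}
+
+With such a check, duplicated alternatives denoting the same language
+could be pruned even when they are not syntactically identical.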
+
+
 \section{Work on Back-References}
@@ -15,9 +219,10 @@
 With Back References) in this section for brevity.
 It has been shown by Aho \cite{Aho1990}
 that the k-vertex cover problem can be reduced
-to the problem of rewbrs matching problem, and is therefore NP-complete.
-Given the depth of the problem, the
-theoretical work on them is
+to the problem of rewbrs matching, and is therefore NP-complete.
+Given the depth of the problem, progress towards matching
+arbitrary rewbrs in full generality has been slow, with
+theoretical work on them being
 fairly recent.
 
 Campaneu et al. studied regexes with back-references
@@ -58,13 +263,6 @@
 can labels be repeated etc.).
 
 
-\subsection{Matchers and Lexers with Mechanised Proofs}
-We are aware
-of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
-Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
-in Coq is given by Coquand and Siles \parencite{Coquand2012}.
-Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
 
 \subsection{Static Analysis of Evil Regex Patterns}
 When a regular expression does not behave as intended,
@@ -112,84 +310,3 @@
 of catastrophic backtracking.
 
 
-\section{Derivatives and Zippers}
-Zipper is a data structure designed to focus on
-and navigate between local parts of a tree.
-The idea is that often operations on a large tree only deal with
-local regions each time.
-It was first formally described by Huet \cite{HuetZipper}.
-Typical applications of zippers involve text editor buffers
-and proof system databases.
-In our setting, the idea is to compactify the representation
-of derivatives with zippers, thereby making our algorithm faster.
-We draw our inspirations from several works on parsing, derivatives
-and zippers.
-
-Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
-\cite{Zippy2020}.
-They adopted zippers to improve the speed, and argued that the runtime
-complexity of their algorithm was linear with respect to the input string.
-
-The idea of using Brzozowski derivatives on general context-free
-parsing was first implemented
-by Might et al. \ref{Might2011}.
-They used memoization and fixpoint construction to eliminate infinite recursion,
-which is a major problem for using derivatives on context-free grammars.
-The initial version was quite slow----exponential on the size of the input.
-Adams et al. \ref{Adams2016} improved the speed and argued that their version
-was cubic with respect to the input.
-Darragh and Adams \cite{10.1145/3408990} further improved the performance
-by using zippers in an innovative way--their zippers had multiple focuses
-instead of just one in a traditional zipper to handle ambiguity.
-Their algorithm was not formalised, though.
-
-We aim to integrate the zipper data structure into our lexer.
-The idea is very simple: using a zipper with multiple focuses
-just like Darragh did, we could represent
-\[
-	x\cdot r + y \cdot r + \ldots
-\]
-as
-\[
-	(x+y + \ldots) \cdot r.
-\]
-This would greatly reduce the amount of space needed
-when we have many terms like $x\cdot r$.
-
-into the current lexer.
-This allows the lexer to not traverse the entire
-derivative tree generated
-in every intermediate step, but only a small segment
-that is currently active.
-
-Some initial results have been obtained, but more care needs to be taken to make sure
-that the lexer output conforms to the POSIX standard. Formalising correctness of
-such a lexer using zippers will probably require using an imperative
-framework with separation logic as it involves
-mutable data structures, which is also interesting to look into.
-
-To further optimise the algorithm,
-I plan to add a ``deduplicate'' function that tells
-whether two regular expressions denote the same
-language using
-an efficient and verified equivalence checker.
-In this way, the internal data structures can be pruned more aggressively,
-leading to better simplifications and ultimately faster algorithms.
-
-
-Some initial results
-We first give a brief introduction to what zippers are,
-and other works
-that apply zippers to derivatives
-When dealing with large trees, it would be a waste to
-traverse the entire tree if
-the operation only
-involves a small fraction of it.
-The idea is to put the focus on that subtree, turning other parts
-of the tree into a context
-
-
-One observation about our derivative-based lexing algorithm is that
-the derivative operation sometimes traverses the entire regular expression
-unnecessarily:
-