--- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Thu Nov 17 23:13:57 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Mon Nov 21 23:56:15 2022 +0000
@@ -5,7 +5,211 @@
\label{RelatedWork}
In this chapter, we introduce
-the work relevant to this thesis.
+work relevant to this thesis.
+
+\section{Regular Expressions, Derivatives and POSIX Lexing}
+
+%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
+Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
+Since then they have become a fundamental concept in
+formal languages and automata theory \cite{Sakarovitch2009}.
+Brzozowski defined derivatives of regular expressions in 1964
+\cite{Brzozowski1964}, where he proved that the number of
+derivatives of a regular expression is finite modulo the ACI-axioms.
+It is worth pointing out that this result does not directly
+translate to our finiteness proof, nor does our proof depend on it.
+The key observation is that our version of Sulzmann and Lu's algorithm
+\cite{Sulzmann2014} represents
+derivative terms in a way that allows efficient de-duplication,
+and we do not make use of an equivalence checker that exploits
+ACI-equivalences between terms.
+
+Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
+They first introduced the elegant and simple idea of injection-based lexing
+and bit-coded lexing.
+In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
+incorporated these ideas
+into a tool called \emph{dreml}.
+The pencil-and-paper proofs in \cite{Sulzmann2014}, which are based on ideas
+by Frisch and Cardelli \cite{Frisch2004}, were later
+found to have unfillable gaps by Ausaf \cite{Ausaf},
+who came up with an alternative proof inspired by
+Vansummeren \cite{Vansummeren2006}.
+Sulzmann and Thiemann extended Brzozowski derivatives to
+regular shuffle expressions \cite{Sulzmann2015},
+which can describe certain languages much more succinctly than basic
+regular expressions.
+
+
+
+Regular expressions and lexers have been a popular topic in
+the theorem-proving and functional-programming communities.
+In the next subsection we list lexers
+and matchers that come with a machine-checked correctness proof.
+
+\subsection{Matchers and Lexers with Mechanised Proofs}
+We are aware
+of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
+Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
+of the work by Krauss and Nipkow \parencite{Krauss2011}. Another one
+in Coq is given by Coquand and Siles \parencite{Coquand2012}.
+Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
+The most recent works, to the best of our knowledge, are the Verbatim \cite{Verbatim}
+and Verbatim++ \cite{Verbatim} lexers.
+The Verbatim++ lexer adds many correctness-preserving
+optimisations to the Verbatim lexer,
+and is therefore quite fast on many inputs.
+The problem is that they chose to use DFAs to speed things up,
+and for DFAs dealing with bounded repetitions is a bottleneck.
+
+
+This thesis builds on the formal specifications of POSIX
+rules and formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
+The work on bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.
+
+Automata formalisations are in general harder and more cumbersome to deal
+with for theorem provers \cite{Nipkow1998}.
+One way to represent automata is as graphs, but graphs are not an inductive datatype.
+Having to base the induction principle on the number of nodes
+in a graph makes formal reasoning non-intuitive and convoluted.
+When combining two graphs, one also needs to make sure that the nodes of
+both graphs are distinct; if they are not, the nodes have to be renamed.
+Using Coq, which provides dependent types, can potentially make things slightly easier
+\cite{Doczkal2013}.
+Another representation for automata is matrices,
+but the induction principles for them are not straightforward either.
+Both approaches have been used in the past and resulted in sizeable formalisations.
+There are more clever representations, for example one by Paulson
+using hereditarily finite sets.
+There the problem with combining graphs can be handled better,
+but we believe that such clever tricks are not very obvious to
+the average Isabelle user.
+
+\subsection{Different Definitions of POSIX Rules}
+There are different ways to formalise values and POSIX matching.
+Frisch and Cardelli \cite{Frisch2004} have developed a notion of
+\emph{non-problematic values}, which is a slight variation
+of the values defined by the inhabitation relation in Figure~\ref{fig:inhab}.
+They then defined an ordering between values, and showed that
+the maximal element of those values corresponds to the output
+of their GREEDY lexing algorithm.
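+To illustrate the difference informally (this is only an illustration and
+does not use the value notation of this thesis): for the regular expression
+$(a + ab)\cdot(ba + a)$ and the input string $aba$, a greedy strategy
+commits to the left alternative $a$ in the first component and then matches
+$ba$ in the second, whereas the POSIX value lets the first component consume
+the longer prefix $ab$ and the second component match the remaining $a$.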
+
+Okui and Suzuki \cite{Okui10} allow iterations of values to
+flatten to the empty
+string in the star inhabitation rule of
+Figure~\ref{fig:inhab}.
+They refer to the more restrictive version used
+in this thesis (which was defined by Ausaf et al.
+\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
+The interesting link between the work by Ausaf et al.
+and that of Okui and Suzuki is that, although their formalisations
+of POSIX values are distinct, they are provably equivalent. See
+\cite{AusafDyckhoffUrban2016} for details of the
+alternative definition given by Okui and Suzuki and its formalisation.
+%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.
+
+Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
+In their paper they defined an ordering between values with respect
+to regular expressions, and tried to establish by a pencil-and-paper proof
+that their algorithm outputs the maximal element of this ordering.
+But having an ordering relation that takes regular expressions as parameters
+means that the transitivity of their ordering does not go through.
+
+
+\section{Optimisations}
+Perhaps the biggest problem that prevents derivative-based lexing
+from being more widely adopted
+is that it tends not to be blindingly fast in practice: it cannot
+reach throughputs of gigabytes per second, which is the kind of
+application we had in mind when we initially started looking at the topic.
+Regular expression matchers used in network intrusion detection systems
+such as Snort \cite{Snort1999} and Bro \cite{Bro}
+are capable of inspecting payloads
+at line rates (which can be up to a few gigabits per second) against
+thousands of rules \cite{communityRules}.
+For our algorithm to be more attractive for practical use, we
+need more correctness-preserving optimisations.
+%So far our focus has been mainly on the bit-coded algorithm,
+%but the injection-based lexing algorithm
+%could also be sped up in various ways:
+%
+One such optimisation is defining string derivatives
+directly, without recursively decomposing them into
+character-by-character derivatives. For example, instead of
+replacing $(r_1 + r_2)\backslash (c::cs)$ by
+$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
+calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
+This has the potential to speed up matching because the input is
+processed at a larger granularity.
+One interesting question is whether the same can be done
+for $\inj$, so that we can generate a lexical value
+rather than merely obtain a matcher.
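+The following sketch, written in Scala with a bare-bones regular expression
+datatype (these definitions are for illustration only and are not the ones
+used in this thesis), shows the idea: since derivatives distribute over
+alternatives, the whole string can be pushed into both branches at once.
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// standard character-by-character derivative
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// string derivative: for an alternative the whole string s is pushed
+// into both branches at once, instead of peeling off one character at
+// a time at the top level
+def ders(s: List[Char], r: Rexp): Rexp = (s, r) match {
+  case (Nil, _)         => r
+  case (_, ALT(r1, r2)) => ALT(ders(s, r1), ders(s, r2))
+  case (c :: cs, _)     => ders(cs, der(c, r))
+}
+\end{verbatim}
+
+The alternative case avoids rebuilding and re-traversing the top-level
+alternative after every single character; the remaining cases fall back
+to the usual character-by-character definition.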
+
+
+\subsection{Derivatives and Zippers}
+A zipper is a data structure designed to focus on
+and navigate between local parts of a tree.
+The idea is that operations on a large tree often only deal with
+a local region at a time.
+It would therefore be a waste to
+traverse the entire tree if
+the operation only
+involves a small fraction of it.
+Zippers were first formally described by Huet \cite{HuetZipper}.
+Typical applications of zippers include text editor buffers
+and proof system databases.
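+A minimal binary-tree zipper in Scala, given here only as an illustration
+(it is not the representation used for derivatives in the works discussed
+below), looks as follows: the focus is a subtree, and the context records
+the path back to the root, so that local edits avoid rebuilding the whole
+tree.
+
+\begin{verbatim}
+sealed trait Tree
+case class Leaf(v: Int) extends Tree
+case class Node(l: Tree, r: Tree) extends Tree
+
+// the context records how the focus was reached and what was left behind
+sealed trait Ctx
+case object Top extends Ctx
+case class InL(parent: Ctx, right: Tree) extends Ctx // went left, kept right sibling
+case class InR(left: Tree, parent: Ctx) extends Ctx  // went right, kept left sibling
+
+case class Zipper(focus: Tree, ctx: Ctx) {
+  def downLeft: Option[Zipper] = focus match {
+    case Node(l, r) => Some(Zipper(l, InL(ctx, r)))
+    case _          => None
+  }
+  def downRight: Option[Zipper] = focus match {
+    case Node(l, r) => Some(Zipper(r, InR(l, ctx)))
+    case _          => None
+  }
+  def up: Option[Zipper] = ctx match {
+    case InL(parent, r) => Some(Zipper(Node(focus, r), parent))
+    case InR(l, parent) => Some(Zipper(Node(l, focus), parent))
+    case Top            => None
+  }
+  // only the focused subtree is replaced; the context is left untouched
+  def replace(t: Tree): Zipper = Zipper(t, ctx)
+}
+\end{verbatim}
+
+Navigating down stores the untouched sibling in the context; navigating up
+plugs the (possibly modified) focus back into its parent.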
+In our setting, the idea is to compactify the representation
+of derivatives with zippers, thereby making our algorithm faster.
+Below we survey several works on parsing, derivatives
+and zippers.
+
+Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
+\cite{Zippy2020}.
+They adopted zippers to improve the speed, and argued that the runtime
+complexity of their algorithm is linear in the length of the input string.
+
+The idea of using Brzozowski derivatives for general context-free
+parsing was first implemented
+by Might et al. \cite{Might2011}.
+They used memoization and a fixpoint construction to eliminate infinite recursion,
+which is a major problem when using derivatives on context-free grammars.
+The initial version was quite slow: exponential in the size of the input.
+Adams et al. \cite{Adams2016} improved the speed and argued that their version
+runs in cubic time with respect to the input.
+Darragh and Adams \cite{Darragh2020} further improved the performance
+by using zippers in an innovative way: their zippers have multiple focuses,
+instead of the single focus of a traditional zipper, in order to handle ambiguity.
+Their algorithm was not formalised, though.
+
+
+Some initial results on combining zippers with our lexer have been obtained,
+but more care needs to be taken to make sure
+that the lexer output conforms to the POSIX standard. Formalising the correctness of
+such a zipper-based lexer will probably require an imperative
+framework with separation logic, as it involves
+mutable data structures; this is also interesting to look into.
+
+To further optimise the algorithm,
+we plan to add a ``deduplicate'' function that decides
+whether two regular expressions denote the same
+language, using
+an efficient and verified equivalence checker.
+In this way, the internal data structures can be pruned more aggressively,
+leading to better simplifications and ultimately a faster algorithm.
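+A possible shape for such a step, sketched in Scala (the function and the
+equivalence checker \texttt{equiv} are hypothetical; any verified decision
+procedure for language equivalence could be plugged in), keeps only one
+representative per equivalence class among a list of alternatives.
+
+\begin{verbatim}
+// Sketch only: `equiv` stands for a verified checker that decides
+// whether two regular expressions denote the same language.
+def dedup[R](rs: List[R], equiv: (R, R) => Boolean): List[R] =
+  rs.foldLeft(List.empty[R]) { (acc, r) =>
+    // drop r if an equivalent alternative has already been kept
+    if (acc.exists(a => equiv(a, r))) acc else acc :+ r
+  }
+\end{verbatim}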
+
+
+
+One observation about our derivative-based lexing algorithm is that
+the derivative operation sometimes traverses the entire regular expression
+unnecessarily, even though only a small part of it changes in each step.
+The zipper idea of putting the focus on the part that changes, and turning
+the rest of the tree into a context, is therefore a natural fit for our setting.
+
\section{Work on Back-References}
@@ -15,9 +219,10 @@
With Back References) in this section for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
-to the problem of rewbrs matching problem, and is therefore NP-complete.
-Given the depth of the problem, the
-theoretical work on them is
+to the problem of rewbrs matching, which is therefore NP-complete.
+Given the depth of the problem, progress on matching arbitrary rewbrs
+in full generality has been slow, with
+theoretical work on them being
fairly recent.
Campaneu et al. studied regexes with back-references
@@ -58,13 +263,6 @@
can labels be repeated etc.).
-\subsection{Matchers and Lexers with Mechanised Proofs}
-We are aware
-of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
-Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
-in Coq is given by Coquand and Siles \parencite{Coquand2012}.
-Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
\subsection{Static Analysis of Evil Regex Patterns}
When a regular expression does not behave as intended,
@@ -112,84 +310,3 @@
of catastrophic backtracking.
-\section{Derivatives and Zippers}
-Zipper is a data structure designed to focus on
-and navigate between local parts of a tree.
-The idea is that often operations on a large tree only deal with
-local regions each time.
-It was first formally described by Huet \cite{HuetZipper}.
-Typical applications of zippers involve text editor buffers
-and proof system databases.
-In our setting, the idea is to compactify the representation
-of derivatives with zippers, thereby making our algorithm faster.
-We draw our inspirations from several works on parsing, derivatives
-and zippers.
-
-Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
-\cite{Zippy2020}.
-They adopted zippers to improve the speed, and argued that the runtime
-complexity of their algorithm was linear with respect to the input string.
-
-The idea of using Brzozowski derivatives on general context-free
-parsing was first implemented
-by Might et al. \ref{Might2011}.
-They used memoization and fixpoint construction to eliminate infinite recursion,
-which is a major problem for using derivatives on context-free grammars.
-The initial version was quite slow----exponential on the size of the input.
-Adams et al. \ref{Adams2016} improved the speed and argued that their version
-was cubic with respect to the input.
-Darragh and Adams \cite{10.1145/3408990} further improved the performance
-by using zippers in an innovative way--their zippers had multiple focuses
-instead of just one in a traditional zipper to handle ambiguity.
-Their algorithm was not formalised, though.
-
-We aim to integrate the zipper data structure into our lexer.
-The idea is very simple: using a zipper with multiple focuses
-just like Darragh did, we could represent
-\[
- x\cdot r + y \cdot r + \ldots
-\]
-as
-\[
- (x+y + \ldots) \cdot r.
-\]
-This would greatly reduce the amount of space needed
-when we have many terms like $x\cdot r$.
-
-into the current lexer.
-This allows the lexer to not traverse the entire
-derivative tree generated
-in every intermediate step, but only a small segment
-that is currently active.
-
-Some initial results have been obtained, but more care needs to be taken to make sure
-that the lexer output conforms to the POSIX standard. Formalising correctness of
-such a lexer using zippers will probably require using an imperative
-framework with separation logic as it involves
-mutable data structures, which is also interesting to look into.
-
-To further optimise the algorithm,
-I plan to add a ``deduplicate'' function that tells
-whether two regular expressions denote the same
-language using
-an efficient and verified equivalence checker.
-In this way, the internal data structures can be pruned more aggressively,
-leading to better simplifications and ultimately faster algorithms.
-
-
-Some initial results
-We first give a brief introduction to what zippers are,
-and other works
-that apply zippers to derivatives
-When dealing with large trees, it would be a waste to
-traverse the entire tree if
-the operation only
-involves a small fraction of it.
-The idea is to put the focus on that subtree, turning other parts
-of the tree into a context
-
-
-One observation about our derivative-based lexing algorithm is that
-the derivative operation sometimes traverses the entire regular expression
-unnecessarily:
-