\chapter{Related Work} % Main chapter title

\label{RelatedWork}

In this chapter, we introduce
the work relevant to this thesis.

\section{Regular Expressions, Derivatives and POSIX Lexing}

%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
Since then they have become a fundamental concept in
formal languages and automata theory \cite{Sakarovitch2009}.
Brzozowski defined derivatives of regular expressions in his PhD thesis
in 1964 \cite{Brzozowski1964}, in which he proved that the number
of derivatives of a regular expression is finite modulo the ACI-axioms
(associativity, commutativity and idempotence of alternatives).
It is worth pointing out that this result does not directly
translate to our
finiteness proof,
and our proof does not depend on it.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents
derivative terms in a way that allows efficient de-duplication,
so we do not need an equivalence checker that identifies ACI-equivalent
terms.
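To illustrate what we mean by efficient de-duplication, here is a minimal
sketch in Scala; the datatype and the function below are ours for
illustration and are not the definitions from our formalisation. Because
alternatives are kept as a flat list, duplicates arising from the
ACI-axioms can be removed in one linear pass, without any pairwise
equivalence checking:

\begin{verbatim}
// Regular expressions with a flat, n-ary alternative constructor
// (an illustrative sketch, not our formalised definitions)
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp   // n-ary alternatives
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// Remove duplicate alternatives, keeping the first occurrence of
// each term; `distinct` preserves the order of first occurrences,
// which is what a POSIX lexer needs.
def dedup(r: Rexp): Rexp = r match {
  case ALTS(rs)    => ALTS(rs.map(dedup).distinct)
  case SEQ(r1, r2) => SEQ(dedup(r1), dedup(r2))
  case STAR(r1)    => STAR(dedup(r1))
  case _           => r
}
\end{verbatim}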

Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
They first introduced the elegant and simple idea of injection-based lexing
and bit-coded lexing.
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas
into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, based on the ideas
by Frisch and Cardelli \cite{Frisch2004}, were later
found to have unfillable gaps by Ausaf \cite{Ausaf},
who came up with an alternative proof inspired by
Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to
shuffling regular expressions \cite{Sulzmann2015},
which are a very succinct way of describing basic regular expressions.

Regular expressions and lexers have been a popular topic in
the theorem proving and functional programming communities.
In the following we give a list of lexers
and matchers that come with a machine-checked correctness proof.

\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one, in Isabelle/HOL, is part
of the work by Krauss and Nipkow \parencite{Krauss2011}; another,
in Coq, is given by Coquand and Siles \parencite{Coquand2012};
and Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent works, to the best of our knowledge, are the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatim} lexers.
The Verbatim++ lexer adds many correctness-preserving
optimisations to the Verbatim lexer,
and is therefore quite fast on many inputs.
The drawback is that they chose to use DFAs to speed things up,
for which dealing with bounded repetitions is a bottleneck.

This thesis builds on the formal specification of the POSIX
rules and the formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
The treatment of bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with in theorem provers \cite{Nipkow1998}.
One way to represent automata is as graphs, but graphs are not inductive datatypes.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct;
if they are not, the nodes need to be renamed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.
Another representation of automata is as matrices,
but induction over them is not straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets.
There the problem of combining graphs can be handled better,
but we believe that such clever tricks are not very obvious to
the average Isabelle user.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} developed a notion of
\emph{non-problematic values}, which is a slight variation
of the values defined by the inhabitation relation in Figure~\ref{fig:inhab}.
They then defined an ordering between values and showed that
the maximal element of those values corresponds to the output
of their GREEDY lexing algorithm.

Okui and Suzuki \cite{Okui10} allow iterations of values to
flatten to the empty
string in the star inhabitation rule of
Figure~\ref{fig:inhab}.
They refer to the more restrictive version used
in this thesis (which was defined by Ausaf et al.
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
The very interesting link between the work by Ausaf et al.
and that of Okui and Suzuki is that their formalisations
of POSIX values are distinct, and yet provably equivalent. See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the formalisation.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves also came up with a POSIX definition \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish by a pencil-and-paper
proof that their algorithm outputs the minimal element.
But having the ordering relation take regular expressions as parameters
causes the transitivity of their ordering to not go through.


\section{Optimisations}
Perhaps the biggest problem preventing derivative-based lexing
from being more widely adopted
is that it tends not to be blindingly fast in practice, unable to
reach throughputs of gigabytes per second, which was the kind of
application we had in mind when we initially started looking at this topic.
Commercial
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.
%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
One such optimisation is defining string derivatives
directly, without recursively decomposing them into
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition~\ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because the input is
processed at a larger granularity.
It would also be interesting to explore whether this can be done
for $\inj$ as well, so that we can generate a lexical value
rather than simply obtain a matcher.
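As an illustration of the string-derivative idea, the following is a
minimal sketch in Scala, assuming a standard regular expression datatype
and the usual character derivative; the definitions are illustrative and
not from our formalisation. Alternatives receive the whole remaining
string in one step:

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// the usual character-by-character derivative
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO | ONE  => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// string derivative: alternatives are given the whole remaining
// string at once; all other cases fall back to the
// character-by-character definition
def ders(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, _)         => r
  case (_, ALT(r1, r2)) => ALT(ders(s, r1), ders(s, r2))
  case (c :: cs, _)     => ders(cs, der(c, r))
}
\end{verbatim}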
\subsection{Derivatives and Zippers}
A zipper is a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on a large tree often only deal with
local regions at a time,
and it would therefore be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
Instead, the focus is put on the relevant subtree,
turning the other parts of the tree into a context.
Zippers were first formally described by Huet \cite{HuetZipper}.
Typical applications of zippers involve text editor buffers
and proof system databases.
One observation about our derivative-based lexing algorithm is that
the derivative operation sometimes traverses the entire regular expression
unnecessarily.
In our setting, the idea is therefore to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
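As a concrete picture, the following is a minimal Huet-style zipper over
regular expressions in Scala; the names (\texttt{Ctx}, \texttt{plug}, and
so on) are ours for illustration and not from our formalisation:

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// A context records the path from the root to the focused subterm.
abstract class Ctx
case object Top extends Ctx
case class AltL(right: Rexp, up: Ctx) extends Ctx // in left ALT branch
case class AltR(left: Rexp, up: Ctx) extends Ctx  // in right ALT branch
case class SeqL(right: Rexp, up: Ctx) extends Ctx
case class SeqR(left: Rexp, up: Ctx) extends Ctx

// A zipper is a focused subterm together with its context.
case class Zipper(focus: Rexp, ctx: Ctx)

// Editing happens at the focus; the full term is rebuilt only when
// needed, by walking the context upwards.
def plug(z: Zipper): Rexp = z.ctx match {
  case Top         => z.focus
  case AltL(r, up) => plug(Zipper(ALT(z.focus, r), up))
  case AltR(l, up) => plug(Zipper(ALT(l, z.focus), up))
  case SeqL(r, up) => plug(Zipper(SEQ(z.focus, r), up))
  case SeqR(l, up) => plug(Zipper(SEQ(l, z.focus), up))
}
\end{verbatim}

For example, descending into the left branch of \texttt{ALT(r1, r2)}
yields \texttt{Zipper(r1, AltL(r2, Top))}; a derivative step that only
affects \texttt{r1} can then be performed without touching \texttt{r2}.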

We now survey several works on parsing, derivatives
and zippers.

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm is linear in the length of the input string.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem when using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic in the size of the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers had multiple focuses,
instead of the single one of a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.

We aim to integrate such a zipper data structure into our lexer.
The idea is simple: using a zipper with multiple focuses,
just as Darragh and Adams did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x + y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$,
and would allow the lexer to not traverse the entire
derivative tree generated
in every intermediate step, but only the small segment
that is currently active.

Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising the correctness of
such a lexer using zippers will probably require an imperative
framework with separation logic, as it involves
mutable data structures; this is also interesting to look into.

To further optimise the algorithm,
we plan to add a ``deduplicate'' function that determines
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
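A minimal sketch of such a checker in Scala is shown below, using the
classical bisimulation construction over derivatives; it assumes the
\texttt{Rexp} datatype with \texttt{nullable} and \texttt{der} from the
earlier sketch and is not part of our formalisation. Note that for the
check to terminate, derivatives have to be normalised (for example with
the de-duplication function sketched before), since the set of
derivatives is finite only modulo the ACI-axioms:

\begin{verbatim}
// Two regular expressions are equivalent iff no pair of their
// iterated derivatives disagrees on nullability (a bisimulation).
// Assumes Rexp, nullable and der as in the earlier sketch; for
// termination, der must normalise its results modulo ACI.
def equiv(r1: Rexp, r2: Rexp, alphabet: Set[Char]): Boolean = {
  def bisim(todo: List[(Rexp, Rexp)],
            seen: Set[(Rexp, Rexp)]): Boolean =
    todo match {
      case Nil => true
      case (a, b) :: rest if seen((a, b)) => bisim(rest, seen)
      case (a, b) :: rest =>
        if (nullable(a) != nullable(b)) false
        else {
          val next = alphabet.toList.map(c => (der(c, a), der(c, b)))
          bisim(next ++ rest, seen + ((a, b)))
        }
    }
  bisim(List((r1, r2)), Set())
}
\end{verbatim}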
\section{Work on Back-References}
We introduced regular expressions with back-references
in Chapter~\ref{Introduction}.
In this section we adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the rewbr matching problem, which is therefore NP-complete.
Given the depth of the problem, progress on matching
arbitrary rewbrs in full generality has been slow, with
theoretical work on them being
fairly recent.

C\^ampeanu et al. studied regexes with back-references
in the context of formal language theory in
their 2003 work \cite{campeanu2003formal}.

does not grow indefinitely),
then they cannot claim with confidence having solved the problem
of catastrophic backtracking.

|