in 1964 \cite{Brzozowski1964}, in which he proved the finiteness of the number
of regular expression derivatives modulo the ACI-axioms.
It is worth pointing out that this result does not directly
translate to our
finiteness proof,
and our proof does not depend on it.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents
derivative terms in a way that allows efficient de-duplication,
and we do not make use of an equivalence checker that exploits
ACI-equivalent terms.
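To illustrate the kind of de-duplication meant here, consider the
following sketch in Scala (the names \texttt{ALTS}, \texttt{flts} and
\texttt{simp} are ours and only illustrative, not the exact definitions
used later in this thesis): alternatives are kept as flattened lists,
so associativity is handled by flattening and idempotence by a simple
distinctness pass, without ever invoking an ACI-equivalence checker on
whole terms.
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                 // matches no string
case object ONE extends Rexp                  // matches the empty string
case class CHR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp  // n-ary alternative

// flatten nested alternatives (associativity) and drop ZEROs
def flts(rs: List[Rexp]): List[Rexp] = rs.flatMap {
  case ALTS(rs1) => flts(rs1)
  case ZERO      => Nil
  case r         => List(r)
}

// remove duplicate alternatives (idempotence: r + r = r)
def simp(r: Rexp): Rexp = r match {
  case ALTS(rs) => flts(rs.map(simp)).distinct match {
    case Nil      => ZERO
    case List(r1) => r1
    case rs1      => ALTS(rs1)
  }
  case r1 => r1
}
\end{verbatim}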

In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas
into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, based on the ideas
by Frisch and Cardelli \cite{Frisch2004}, were later
found to have unfillable gaps by Ausaf et al. \cite{AusafDyckhoffUrban2016},
who came up with an alternative proof inspired by
Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to
shuffling regular expressions \cite{Sulzmann2015},
which are a very succinct way of describing regular expressions.

Regular expressions and lexers have been a popular topic among
the theorem proving and functional programming community.
One example is the mechanised correctness proof of Brzozowski's
derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. Another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent works on verified lexers, to the best of our knowledge,
are the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatimpp} lexers by Egolf et al.
Verbatim is based on derivatives but does not simplify them.
Therefore it can be very slow on evil regular expressions.
The Verbatim++ lexer adds many correctness-preserving
optimisations to the Verbatim lexer,
and is therefore quite fast on many inputs.
The problem is that Egolf et al. chose to use traditional DFAs to speed up lexing,
but then dealing with bounded repetitions becomes a real bottleneck.
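A standard example of this bottleneck (our illustration, not one taken
from Egolf et al.) is the pattern
$(a+b)^* \cdot a \cdot (a+b)^{\{n\}}$: any DFA for it has to remember
which of the last $n+1$ characters read were $a$'s, and therefore needs
at least $2^n$ states, even though the pattern itself is tiny.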

This thesis builds on the formal specifications of POSIX
rules and formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
The work on bounded repetitions presented here is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with in theorem provers \cite{Nipkow1998}.
One way to represent them is to use graphs, but graphs are not inductive datatypes.
Having to define the induction principle on the number of nodes
is rather cumbersome,
resulting in large formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct;
if they are not distinct, then the nodes need to be renamed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.

Another representation for automata is matrices,
but induction over them is not very straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets \cite{Paulson2015},
where the problem of combining graphs can be handled better.
%but we believe that such clever tricks are not very obvious for
%the John-average-Isabelle-user.
Because of these problems with automata, we prefer regular expressions
and derivatives, which can be implemented in theorem provers and
functional programming languages with ease.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Cardelli and Frisch \cite{Frisch2004} have developed a notion of
\emph{non-problematic values}, which is a slight variation
of the values used in this thesis.
They refer to the more restrictive version as used
in this thesis (which was defined by Ausaf et al.
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
The very interesting link between the work by Ausaf et al.
and Okui and Suzuki is that they have distinct formalisations
of POSIX values, and yet they define the same notion. See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the formalisation
described in this thesis.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish that their
algorithm outputs the minimal element by a pencil-and-paper proof.
But because the ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

\subsection{Static Analysis of Evil Regex Patterns}
When faced with catastrophic backtracking,
programmers sometimes
have to rewrite their regexes in an ad hoc fashion.
Knowing which patterns should be avoided
can be helpful.
Tools for this
often over-approximate evil regular expression patterns,
and there can be many false positives \cite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} and \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour with backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-time estimates
for regexes, ``attack strings'' that exploit the worst-time
scenario, and an ``attack automaton'' that generates
attack strings.

\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing
from being more widely adopted
is that it tends not to be very fast in practice, unable to
reach throughputs of gigabytes per second, which is the kind of
performance we have in mind when working on this topic.
Commercial
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of regex rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.

FPGA-based implementations such as \cite{Sidhu2001}
have the advantage of being
reconfigurable and parallel, but suffer from lower clock frequencies
and limited scalability.
Traditional automaton approaches that use DFAs instead of NFAs
benefit from the fact that only a single transition is needed
for each input character \cite{Becchi08}; the lower memory bandwidth leads
to faster performance.
However, they suffer from exponential memory requirements on bounded repetitions.
Compression techniques are used, such as those in \cite{Kumar2006} and
\cite{Becchi2007}.
Variations of pure NFAs or DFAs, like the counting-set automata of \cite{Turonova2020},
have been
proposed to better deal with bounded repetitions,
but they are usually not used in formalised proofs.

%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
One possible optimisation is to take derivatives with respect to
several characters at a time, instead of
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because input is
processed at a larger granularity.
One interesting point to explore is whether this can be done
for $\inj$ as well, so that we can generate a lexical value
rather than simply obtain a matcher.
It is also not yet clear how this idea can be extended to sequences and stars.
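For alternatives alone, the idea can be sketched as follows in Scala
(the datatype and function names are ours, for illustration only); the
alternative case pushes the whole remaining string into both branches,
while all other cases fall back to single-character steps:
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                   // matches no string
case object ONE extends Rexp                    // matches the empty string
case class CHR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the usual single-character derivative
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// string derivative: distribute the whole string over alternatives,
// otherwise peel off one character at a time
def ders(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, r1) => r1
  case (_, ALT(r1, r2)) => ALT(ders(s, r1), ders(s, r2))
  case (c :: cs, r1) => ders(cs, der(c, r1))
}
\end{verbatim}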
\subsection{Derivatives and Zippers}
Zippers are a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on large trees often only deal with
local regions at a time.
Therefore it would be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
Zippers were first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text editor buffers
and proof system databases.
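The core of the idea can be seen in the following minimal Huet-style
zipper for binary trees (our sketch in Scala; Huet's original is in
OCaml and more general):
\begin{verbatim}
sealed trait Tree
case class Leaf(v: Int) extends Tree
case class Node(l: Tree, r: Tree) extends Tree

// the context records the path from the focus back to the root,
// keeping the siblings that were passed on the way down
sealed trait Ctx
case object Top extends Ctx
case class WentLeft(parent: Ctx, right: Tree) extends Ctx
case class WentRight(left: Tree, parent: Ctx) extends Ctx

case class Zipper(focus: Tree, ctx: Ctx) {
  def downLeft: Zipper = focus match {
    case Node(l, r) => Zipper(l, WentLeft(ctx, r))
    case _          => this
  }
  def up: Zipper = ctx match {
    case WentLeft(parent, right) => Zipper(Node(focus, right), parent)
    case WentRight(left, parent) => Zipper(Node(left, focus), parent)
    case Top                     => this
  }
  // editing at the focus is constant-time; the rest of the tree is shared
  def replace(t: Tree): Zipper = Zipper(t, ctx)
}
\end{verbatim}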

In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
Below we describe several works on parsing, derivatives
and zippers.

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm is linear with respect to the input string,
although they did not provide a formal proof of this claim.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and fixpoint constructions to eliminate infinite recursion,
which is a major problem when using derivatives with context-free grammars.
Their initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers have multiple focuses
instead of the single one of a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.

\subsection{Back-References}
We have already introduced back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the problem of rewbrs matching; matching with rewbrs is therefore NP-complete.
Given the depth of the problem, progress on matching
arbitrary rewbrs in full generality has been slow, with
theoretical work on them being
fairly recent.

Campeanu et al. studied rewbrs
in the context of formal language theory in
\cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
proving that the languages denoted by them
are properly contained in the context-sensitive
languages.
More interesting questions, such as
whether the languages denoted by Perl-like regexes
can express palindromes ($\{w \mid w = w^R\}$),
are discussed in \cite{campeanu2009patterns}
and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the
expressive power of back-references. He showed several
undecidability and descriptional complexity results
for back-references, concluding that they add
great power to certain programming tasks but are difficult to comprehend.
An interesting question would then be
whether we can add restrictions to them,
so that they become expressive enough for practical use, such
as matching html files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.

Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives with back-references.

%See \cite{BERGLUND2022} for a survey
%of these works and a comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%whether labels can be repeated etc.).