ChengsongTanPhdThesis/Chapters/RelatedWork.tex
changeset 640 bd1354127574
parent 632 99eceee5c4ac
child 668 3831621d7b14
--- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Fri Dec 30 23:41:44 2022 +0000
@@ -19,7 +19,7 @@
 It is worth pointing out that this result does not directly
 translate to our
 finiteness proof, 
-and our proof does not depend on it to be valid.
+and our proof does not depend on it.
 The key observation is that our version of Sulzmann and Lu's algorithm
 \cite{Sulzmann2014} represents 
 derivative terms in a way that allows efficient de-duplication,
@@ -35,12 +35,12 @@
 into a tool called \emph{dreml}.
 The pencil-and-paper proofs in \cite{Sulzmann2014} based on the ideas
 by Frisch and Cardelli \cite{Frisch2004} were later
-found to have unfillable gaps by Ausaf \cite{Ausaf},
+found to have unfillable gaps by Ausaf et al. \cite{AusafDyckhoffUrban2016},
 who came up with an alternative proof inspired by 
 Vansummeren \cite{Vansummeren2006}.
 Sulzmann and Thiemann extended the Brzozowski derivatives to 
 shuffling regular expressions \cite{Sulzmann2015},
-which are a very succinct way of describing basic regular expressions.
+which are a very succinct way of describing regular languages.
 
 
 
@@ -56,18 +56,20 @@
 of the work by Krauss and Nipkow \parencite{Krauss2011}.  Another one
 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
 Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
-The most recent works to our best knowledge are the Verbatim \cite{Verbatim}
-and Verbatim++ \cite{Verbatim} lexers.
+The most recent works on verified lexers, to the best of our knowledge, are the Verbatim \cite{Verbatim}
+and Verbatim++ \cite{Verbatimpp} lexers by Egolf et al.
+Verbatim is based on derivatives but does not simplify them. 
+Therefore it can be very slow on evil regular expressions.
 The Verbatim++ lexer adds many correctness-preserving 
 optimisations to the Verbatim lexer,
 and is therefore quite fast on many inputs.
-The problem is that they chose to use DFA to speed up things,
-for which dealing with bounded repetitions is a bottleneck.
+The problem is that Egolf et al. chose to use traditional DFAs to speed up lexing,
+but then dealing with bounded repetitions becomes a real bottleneck.
 
 
 This thesis builds on the formal specifications of POSIX
 rules and formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
-The bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.
+Our treatment of bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.
 
 Automata formalisations are in general harder and more cumbersome to deal
 with for theorem provers \cite{Nipkow1998}.
@@ -79,7 +81,8 @@
 both graphs are distinct. 
 If they are not distinct, then renaming of the nodes is needed.
 Using Coq, which provides dependent types, can potentially make things slightly easier
-\cite{Doczkal2013}
+\cite{Doczkal2013}.
+
 Another representation for automata is via matrices. 
 But induction over matrices is not straightforward either.
 Both approaches have been used in the past and resulted in huge formalisations.
@@ -88,6 +91,9 @@
 There the problem with combining graphs can be solved better. 
 %but we believe that such clever tricks are not very obvious for 
 %the John-average-Isabelle-user.
+Because of these problems with automata, we prefer regular expressions
+and derivatives, which can be implemented in theorem provers and functional programming languages
+with ease.
 
 \subsection{Different Definitions of POSIX Rules}
 There are different ways to formalise values and POSIX matching.
@@ -107,37 +113,41 @@
 \cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
 The very interesting link between the work by Ausaf et al.
 and Okui and Suzuki is that they have distinct formalisations
-of POSIX values, and yet they are provably equivalent! See
+of POSIX values, and yet they define the same notion.  See
 \cite{AusafDyckhoffUrban2016} for details of the
-alternative definitions given by Okui and Suzuki and the formalisation.
+alternative definitions given by Okui and Suzuki and the formalisation
+described in this thesis.
 %TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.
 
-Sulzmann and Lu themselves have come up with POSIX definitions \cite{Sulzmann2014}.
+Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
 In their paper they defined an ordering between values with respect
 to regular expressions, and tried to establish that their
 algorithm outputs the minimal element by a pencil-and-paper proof.
 But because the ordering relation takes regular expressions as parameters,
 the transitivity of their ordering does not go through.
 
+\subsection{Static Analysis of Evil Regex Patterns} 
+
+
 When faced with catastrophic backtracking, 
 programmers sometimes
 have to rewrite their regexes in an ad hoc fashion.
 Knowing which patterns should be avoided
 can be helpful.
-\subsection{Static Analysis of Evil Regex Patterns} 
-or they try to avoid the possibly problematic patterns completely,
-for which many false positives exist\parencite{Davis18}.
+Tools for this purpose
+often over-approximate the set of evil regular expression patterns,
+which leads to many false positives \cite{Davis18}.
 Animated tools to ``debug'' regular expressions, such as
 \parencite{regexploit2021} and \parencite{regex101}, are also popular.
 We are also aware of static analysis work on regular expressions that
 aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
 \parencite{Rathnayake2014StaticAF} proposed an algorithm
 that detects regular expressions triggering exponential
-behavious on backtracking matchers.
+behaviour with backtracking matchers.
 Weideman \parencite{Weideman2017Static} came up with 
 non-linear polynomial worst-time estimates
-for regexes, attack string that exploit the worst-time 
-scenario, and "attack automata" that generates
+for regexes and ``attack strings'' that exploit the worst-time
+scenario, and introduced an ``attack automaton'' that generates
 attack strings.
 
 
@@ -146,14 +156,14 @@
 \section{Optimisations}
 Perhaps the biggest problem that prevents derivative-based lexing 
 from being more widely adopted
-is that they tend to be not blindingly fast in practice, unable to 
+is that it tends not to be very fast in practice, unable to 
 reach throughputs of gigabytes per second, as needed for the
-application we had in mind when we initially started looking at the topic.
+applications we have had in mind since we started looking at the topic.
 Commercial
 regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
 are capable of inspecting payloads
 at line rates (which can be up to a few gigabits per second) against 
-thousands of rules \cite{communityRules}.
+thousands of regex rules \cite{communityRules}.
 For our algorithm to be more attractive for practical use, we
 need more correctness-preserving optimisations.
 
@@ -161,16 +171,17 @@
 have the advantages of being
 reconfigurable and parallel, but suffer from lower clock frequencies
 and limited scalability.
-Traditional automaton approaches that use a DFA instead of NFA
+Traditional automaton approaches that use DFAs instead of NFAs
 benefit from the fact that only a single transition is needed 
 for each input character \cite{Becchi08}. Lower memory bandwidth leads
 to faster performance.
-However, they suffer from exponential blow-ups on bounded repetitions.
+However, they suffer from exponential memory requirements on bounded repetitions.
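+A classic example is $(a + b)^* \cdot a \cdot (a + b)^{\{n\}}$: any DFA for this
+regular expression needs at least $2^n$ states, since it must remember which of
+the last $n+1$ characters it has read were $a$s, whereas the regular expression
+itself (and a corresponding NFA) grows only linearly in $n$.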
 Compression techniques are used, such as those in \cite{Kumar2006} and
 \cite{Becchi2007}.
 Variations of pure NFAs or DFAs like counting-set automata \cite{Turonova2020}
 have been
 proposed to better deal with bounded repetitions.
+But they are usually not used in formalised proofs.
 
 %So far our focus has been mainly on the bit-coded algorithm,
 %but the injection-based lexing algorithm 
@@ -184,46 +195,48 @@
 $((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
 calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
 This has the potential to speed up matching because input is 
-processed in larger granularity.
-One interesting thing is to explore whether this can be done
+processed at a larger granularity.
+One interesting question is whether the same can be done
 to $\inj$ as well, so that we can generate a lexical value 
 rather than simply get a matcher.
+It is also not yet clear how this idea can be extended to sequences and stars.
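+
+To make this idea concrete, the following Scala sketch uses the textbook
+definitions of $\textit{nullable}$ and the character derivative (not the
+bit-coded algorithm studied in this thesis) and pushes the whole remaining
+string into the branches of an alternative, so that the $+$-node is traversed
+once per string rather than once per character; only alternatives are treated
+specially in this sketch.
+
+\begin{lstlisting}[language=Scala]
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// textbook character derivative
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// baseline: the alternative node is rebuilt for every character of s
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+// larger granularity: uses (r1 + r2)\s = r1\s + r2\s to push the whole
+// remaining string into each branch of an alternative at once
+def dersAlt(s: List[Char], r: Rexp): Rexp = r match {
+  case ALT(r1, r2) => ALT(dersAlt(s, r1), dersAlt(s, r2))
+  case _ => s match {
+    case Nil => r
+    case c :: cs => dersAlt(cs, der(c, r))
+  }
+}
+\end{lstlisting}
+
+Both functions compute derivatives denoting the same language; the difference
+lies only in the order in which the derivative terms are built and traversed.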
 
 
 \subsection{Derivatives and Zippers}
-Zipper is a data structure designed to focus on 
+Zippers are a data structure designed to focus on 
 and navigate between local parts of a tree.
-The idea is that often operations on a large tree only deal with
+The idea is that operations on large trees often only deal with
 local regions each time.
 Therefore it would be a waste to 
 traverse the entire tree if
 the operation only
 involves a small fraction of it.
-It was first formally described by Huet \cite{Huet1997}.
+Zippers were first formally described by Huet \cite{Huet1997}.
 Typical applications of zippers involve text editor buffers
 and proof system databases.
 In our setting, the idea is to compactify the representation
 of derivatives with zippers, thereby making our algorithm faster.
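+
+To recall Huet's idea, the following is a minimal zipper for binary trees in
+Scala (purely illustrative; the zippers used in the works discussed below are
+over more complex parse structures): a zipper pairs the subtree currently in
+focus with a context recording the path back to the root, so moving the focus
+and editing at the focus do not rebuild the untouched parts of the tree.
+
+\begin{lstlisting}[language=Scala]
+sealed trait Tree[A]
+case class Leaf[A](a: A) extends Tree[A]
+case class Node[A](l: Tree[A], r: Tree[A]) extends Tree[A]
+
+// the context: the path from the focus back to the root, together
+// with the sibling subtrees passed on the way down
+sealed trait Ctx[A]
+case class Top[A]() extends Ctx[A]
+case class InL[A](up: Ctx[A], right: Tree[A]) extends Ctx[A] // went left
+case class InR[A](left: Tree[A], up: Ctx[A]) extends Ctx[A]  // went right
+
+case class Zipper[A](focus: Tree[A], ctx: Ctx[A]) {
+  def downLeft: Option[Zipper[A]] = focus match {
+    case Node(l, r) => Some(Zipper(l, InL(ctx, r)))
+    case _          => None
+  }
+  def up: Option[Zipper[A]] = ctx match {
+    case InL(c, r) => Some(Zipper(Node(focus, r), c))
+    case InR(l, c) => Some(Zipper(Node(l, focus), c))
+    case Top()     => None
+  }
+  // replace the subtree in focus without touching the rest of the tree
+  def replace(t: Tree[A]): Zipper[A] = copy(focus = t)
+}
+\end{lstlisting}
+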
-We introduce several works on parsing, derivatives
+Below we describe several works on parsing, derivatives
 and zippers.
 
 Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
 \cite{Zippy2020}.
 They adopted zippers to improve the speed, and argued that the runtime
 complexity of their algorithm was linear with respect to the input string.
+However, they did not provide a formal proof of this claim.
 
 The idea of using Brzozowski derivatives for general context-free 
 parsing was first implemented
 by Might et al. \cite{Might2011}. 
-They used memoization and fixpoint construction to eliminate infinite recursion,
-which is a major problem for using derivatives on context-free grammars.
-The initial version was quite slow----exponential on the size of the input.
+They used memoization and fixpoint constructions to eliminate infinite recursion,
+which is a major problem for using derivatives with context-free grammars.
+Their initial version was quite slow, namely exponential in the size of the input.
 Adams et al. \cite{Adams2016} improved the speed and argued that their version
 was cubic with respect to the input.
 Darragh and Adams \cite{Darragh2020} further improved the performance
 by using zippers in an innovative way: their zippers have multiple focuses,
 rather than the single focus of a traditional zipper, in order to handle ambiguity.
-Their algorithm was not formalised, though.
+Their algorithm was, however, not formalised.
 
 
 
@@ -235,33 +248,33 @@
 With Back References) for brevity.
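+A classic example of a rewbr is \verb!((a|b)*)\1!, which matches the copy
+language $\{ww \mid w \in \{a,b\}^*\}$, a language that is not even context-free.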
 It has been shown by Aho \cite{Aho1990}
 that the k-vertex cover problem can be reduced
-to the problem of rewbrs matching, and is therefore NP-complete.
+to the problem of matching with rewbrs, which is therefore NP-complete.
 Given the depth of the problem, progress on matching with arbitrary rewbrs
-of arbitrary rewbrs matching has been slow, with
+in full generality has been slow, with
 theoretical work on them being 
 fairly recent. 
 
-Campaneu et al. studied regexes with back-references
+Campeanu et al. studied rewbrs 
 in the context of formal language theory in 
-their 2003 work \cite{campeanu2003formal}.
+\cite{campeanu2003formal}.
 They devised a pumping lemma for Perl-like regexes, 
 proving that the languages denoted by them
 are properly contained in the class of context-sensitive
 languages.
 More interesting questions such as 
-whether the language denoted by Perl-like regexes
+whether Perl-like regexes
 can express palindromes ($\{w \mid w = w^R\}$)
 are discussed in \cite{campeanu2009patterns} 
 and \cite{CAMPEANU2009Intersect}.
 Freydenberger \cite{Frey2013} investigated the 
 expressive power of back-references. He showed several 
 undecidability and descriptional complexity results 
-of back-references, concluding that they add
-great power to certain programming tasks but are difficult to harness.
+for back-references, concluding that they add
+great power to certain programming tasks but are difficult to comprehend.
 An interesting question would then be
 whether we can add restrictions to them, 
-so that they become expressive enough for practical use such
-as html files, but not too powerful.
+so that they become expressive enough for practical use, such
+as matching HTML files, but not too powerful.
 Freydenberger and Schmid \cite{FREYDENBERGER20191}
 introduced the notion of deterministic
 regular expressions with back-references to achieve
@@ -271,12 +284,12 @@
 Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
 investigated the time complexity of different variants
 of back-references.
-We are not aware of any work that uses derivatives on back-references.
+We are not aware of any work that uses derivatives with back-references.
 
-See \cite{BERGLUND2022} for a survey
-of these works and comparison between different
-flavours  of back-references syntax (e.g. whether references can be circular, 
-can labels be repeated etc.).
+%See \cite{BERGLUND2022} for a survey
+%of these works and comparison between different
+%flavours  of back-references syntax (e.g. whether references can be circular, 
+%can labels be repeated etc.).