ChengsongTanPhdThesis/Chapters/RelatedWork.tex
changeset 627 94db2636a296
parent 626 1c8525061545
child 628 7af4e2420a8c

Automata formalisations are in general harder and more cumbersome
for theorem provers to deal with \cite{Nipkow1998}.
One way to represent automata is as graphs, but graphs are not an inductive datatype.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted,
resulting in large formalisations \cite{Lammich2012}. 
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct. 
If they are not distinct, then renaming of the nodes is needed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.
Another representation for automata is matrices. 
But the induction for them is not as straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some cleverer representations, for example one by Paulson 
using hereditarily finite sets \cite{Paulson2015}. 
There the problem with combining graphs can be solved better. 
%but we believe that such clever tricks are not very obvious for 
%the John-average-Isabelle-user.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} have developed a notion of
\emph{non-problematic values}, which is a slight variation
% ...
algorithm outputs the minimal element by a pencil-and-paper proof.
But since their ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

       
\subsection{Static Analysis of Evil Regex Patterns} 
When a regular expression does not behave as intended,
people usually try to rewrite the regex into some equivalent form,
or they try to avoid the possibly problematic patterns completely,
a practice for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} and \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with 
non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case 
scenario, and ``attack automata'' that generate
attack strings.

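To see concretely what these analyses detect, the pattern $(a^+)^+\cdot b$ is a
standard evil example: on inputs $a^n$ with no trailing $b$, a backtracking
matcher tries exponentially many ways of splitting the $a$'s between the two
plusses. The following is a minimal sketch (our own illustration, not code
from the cited works) that makes the blow-up visible on Java's backtracking
matcher:

\begin{lstlisting}[language=Scala]
// Evil pattern (a+)+b: backtracking explores exponentially many
// splittings of a^n before concluding there is no match.
import java.util.regex.Pattern

object EvilRegexDemo {
  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a+)+b")
    for (n <- 10 to 26 by 2) {
      val input = "a" * n           // a^n, deliberately without the final b
      val t0 = System.nanoTime()
      evil.matcher(input).matches() // always false, but slower and slower
      val ms = (System.nanoTime() - t0) / 1e6
      println(f"n = $n%2d: $ms%9.2f ms")  // roughly doubles with each step
    }
  }
}
\end{lstlisting}
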
\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing 
from being more widely adopted
is that it tends not to be blindingly fast in practice, unable to 
reach throughputs like gigabytes per second, which is the
% ...
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against 
thousands of rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.
       

FPGA-based implementations such as \cite{Sidhu2001} 
have the advantage of being
reconfigurable and parallel, but suffer from lower clock frequencies
and limited scalability.
Traditional automaton approaches that use a DFA instead of an NFA
benefit from the fact that only a single transition is needed 
for each input character \cite{Becchi08}; the lower memory bandwidth
requirement leads to faster performance.
However, they suffer from exponential blow-ups on bounded repetitions.
Compression techniques are used, such as those in \cite{Kumar2006} and
\cite{Becchi2007}.
Variations of pure NFAs or DFAs, like the counting-set automata
of \cite{Turonova2020}, have been
proposed to better deal with bounded repetitions.
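To make the single-transition-per-character point concrete, here is a minimal
sketch of a table-driven DFA inner loop (the automaton for $a^*b$; the table
layout and names are our own, not taken from the cited implementations):

\begin{lstlisting}[language=Scala]
// Table-driven DFA for a*b: exactly one array lookup per input character.
// States: 0 = reading a's, 1 = accept (just read the final b), 2 = sink.
object DfaDemo {
  val table: Array[Array[Int]] = {
    val t = Array.fill(3, 256)(2)   // default transition: dead state
    t(0)('a') = 0                   // stay while reading a's
    t(0)('b') = 1                   // the final b accepts
    t
  }
  def matches(s: String): Boolean = {
    var state = 0
    var i = 0
    while (i < s.length) { state = table(state)(s.charAt(i)); i += 1 }
    state == 1
  }
  def main(args: Array[String]): Unit =
    println((matches("aaab"), matches("abab")))  // (true,false)
}
\end{lstlisting}

The price is the transition table: with bounded repetitions the DFA must
record the counter value in its state, and several counters multiply, which is
where the blow-ups mentioned above come from.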
%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm 
%could also be sped up in various ways:
%
Another direction of optimisation for derivative-based approaches
is defining string derivatives
directly, without recursively decomposing them into 
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
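As a rough sketch (our own illustration over a simplified datatype, not the
bit-coded version), string derivatives pushed directly into alternatives look
as follows:

\begin{lstlisting}[language=Scala]
// Simplified regular expressions and their derivatives.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
}

def der(r: Rexp, c: Char): Rexp = r match {
  case ZERO | ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(r1, c), der(r2, c))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(r1, c), r2), der(r2, c))
    else SEQ(der(r1, c), r2)
}

// String derivative: alternatives take the whole string in one step,
// instead of rebuilding the ALT node after every single character.
def ders(r: Rexp, s: List[Char]): Rexp = (r, s) match {
  case (_, Nil) => r
  case (ALT(r1, r2), _) => ALT(ders(r1, s), ders(r2, s))
  case (_, c :: cs) => ders(der(r, c), cs)
}
\end{lstlisting}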
% ...
local regions each time.
Therefore it would be a waste to 
traverse the entire tree if
the operation only
involves a small fraction of it.
The zipper was first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text-editor buffers
and proof-system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
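For intuition, a Huet-style zipper pairs the subtree in focus with the
one-hole context around it, so an edit touches only the focus while the rest
of the tree is shared. A minimal sketch over binary trees (our own
illustration, not our regular-expression zipper):

\begin{lstlisting}[language=Scala]
// Huet-style zipper: focused subtree plus the path back to the root.
sealed trait Tree
case class Leaf(v: Int) extends Tree
case class Node(l: Tree, r: Tree) extends Tree

sealed trait Ctx
case object Top extends Ctx
case class GoneLeft(up: Ctx, right: Tree) extends Ctx
case class GoneRight(left: Tree, up: Ctx) extends Ctx

case class Zipper(focus: Tree, ctx: Ctx) {
  def left: Zipper = focus match {
    case Node(l, r) => Zipper(l, GoneLeft(ctx, r))
    case _ => this
  }
  def right: Zipper = focus match {
    case Node(l, r) => Zipper(r, GoneRight(l, ctx))
    case _ => this
  }
  def up: Zipper = ctx match {
    case GoneLeft(c, r)  => Zipper(Node(focus, r), c)
    case GoneRight(l, c) => Zipper(Node(l, focus), c)
    case Top => this
  }
  // Edits are local: only the focus changes, the context is reused.
  def modify(f: Tree => Tree): Zipper = Zipper(f(focus), ctx)
}
\end{lstlisting}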
We introduce several works on parsing, derivatives
% ...
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm was linear with respect to the input string.

The idea of using Brzozowski derivatives for general context-free 
parsing was first implemented
by Might et al. \cite{Might2011}. 
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem for using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers had multiple focuses
instead of just one as in a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.
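To see where the infinite recursion comes from, consider a left-recursive rule
such as $E ::= E + T \mid T$: the natural recursive definitions of nullability
and derivatives of $E$ immediately call themselves on $E$. Below is a minimal
sketch (the grammar, names and code are our own, not Might et al.'s) of the
iterative fixpoint that replaces the looping recursion for nullability:

\begin{lstlisting}[language=Scala]
// Nullability of a CFG by fixpoint iteration: start from "nothing is
// nullable" and grow the set until it stabilises. A naive recursive
// definition would loop on the left-recursive rule E ::= E + T.
sealed trait Sym
case class Tm(c: Char) extends Sym        // terminal
case class Nt(name: String) extends Sym   // nonterminal

val grammar: Map[String, List[List[Sym]]] = Map(
  "E" -> List(List(Nt("E"), Tm('+'), Nt("T")), List(Nt("T"))),
  "T" -> List(List(Tm('a')), Nil))        // T ::= a | epsilon

def nullables(g: Map[String, List[List[Sym]]]): Set[String] = {
  var nul = Set.empty[String]
  var changed = true
  while (changed) {
    changed = false
    for ((a, alts) <- g if !nul(a))
      if (alts.exists(_.forall { case Tm(_) => false; case Nt(b) => nul(b) })) {
        nul += a
        changed = true
      }
  }
  nul
}
// nullables(grammar) == Set("T", "E")
\end{lstlisting}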



\subsection{Back-References}
We introduced regular expressions with back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back-References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex-cover problem can be reduced
to the problem of rewbrs matching, which is therefore NP-complete.
Given the depth of the problem, progress at the full generality
of arbitrary rewbrs matching has been slow, with
% ...
flavours of back-reference syntax (e.g.\ whether references can be circular,
whether labels can be repeated, etc.).

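To get a feel for why matching rewbrs is so much harder than matching pure
regular expressions, recall the standard example (not specific to any of the
cited works) that a single back-reference already captures a
non-context-free language:
\[
L\big(\,((a|b)^*)\,\backslash 1\,\big) \;=\; \{\, ww \mid w \in \{a,b\}^* \,\},
\]
where the group captures some string $w$ and the back-reference $\backslash 1$
then demands a second copy of exactly the same $w$.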