In this chapter, we introduce
the work relevant to this thesis.


\section{Work on Back-References}
We introduced regular expressions with back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) in this section for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the rewbr matching problem, which is therefore NP-complete.
Given the depth of the problem, the
theoretical work on them is
fairly recent.
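To illustrate the power that back-references add, consider a standard example
(ours, not taken from the cited works): under full-match semantics, the rewbr
\[
(\,(a+b)^{*}\,)\backslash 1,
\]
where $\backslash 1$ matches a second copy of whatever the bracketed group
captured, accepts exactly the copy language $\{ww \mid w \in \{a,b\}^{*}\}$,
which is not even context-free.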


Campeanu et al. studied regexes with back-references
in the context of formal language theory in
their 2003 work \cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
as well as undecidability and descriptional complexity results
for back-references, concluding that they add
great power to certain programming tasks but are difficult to harness.
An interesting question would then be
whether we can add restrictions to them,
so that they become expressive enough for practical tasks such
as processing HTML files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.


Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
|
We are not aware of any work that uses derivatives on back-references.

See \cite{BERGLUND2022} for a survey
of these works and a comparison between different
flavours of back-reference syntax (e.g.\ whether references can be circular,
whether labels can be repeated, etc.).
of the work by Krauss and Nipkow \parencite{Krauss2011}.
Another formalisation in Coq is given by Coquand and Siles \parencite{Coquand2012},
and Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.


\subsection{Static Analysis of Evil Regex Patterns}
When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form,
or they try to avoid the possibly problematic patterns completely,
an approach for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021, regex101} are also popular.
does not come with certain basic complexity-related
guarantees (for example, that the internal data structure size
does not grow indefinitely),
then they cannot claim with confidence to have solved the problem
of catastrophic backtracking.


\section{Derivatives and Zippers}
|
A zipper is a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on a large tree often
only deal with local regions at a time.
It was first formally described by Huet \cite{HuetZipper}.
Typical applications of zippers include text editor buffers
and proof system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
We draw our inspiration from several works on parsing, derivatives
and zippers.
|

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm was linear with respect to the input string.

|
The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem for using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{10.1145/3408990} further improved the performance
by using zippers in an innovative way: their zippers had multiple focuses,
instead of the single focus of a traditional zipper, to handle ambiguity.
Their algorithm was not formalised, though.

|
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just as Darragh and Adams did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms of the form $x\cdot r$.

|
This allows the lexer to traverse, at each intermediate step,
not the entire derivative tree generated so far,
but only the small segment that is currently active.

|
Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising the correctness of
such a lexer using zippers will probably require an imperative
framework with separation logic, as it involves
mutable data structures; this is also interesting to look into.

|
To further optimise the algorithm,
I plan to add a ``deduplicate'' function that tells
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.


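As a rough illustration of how such a checker can operate (a sketch of ours, not the verified implementation mentioned above), two regular expressions denote the same language precisely when no pair of their simultaneous derivatives disagrees on nullability; with enough simplification, only finitely many derivative pairs arise:

```python
# Sketch of a derivative-based equivalence check (our own illustration,
# not the verified checker referred to above).  Regexes are nested tuples.

ZERO, ONE = ("zero",), ("one",)

def char(c):
    return ("chr", c)

def alt(r, s):
    # Smart constructor: basic simplifications keep derivatives small.
    if r == ZERO: return s
    if s == ZERO: return r
    if r == s: return r
    return ("alt", r, s)

def seq(r, s):
    if r == ZERO or s == ZERO: return ZERO
    if r == ONE: return s
    if s == ONE: return r
    return ("seq", r, s)

def star(r):
    return ("star", r)

def nullable(r):
    tag = r[0]
    if tag in ("one", "star"): return True
    if tag in ("zero", "chr"): return False
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"): return ZERO
    if tag == "chr": return ONE if r[1] == c else ZERO
    if tag == "alt": return alt(der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        d = seq(der(c, r[1]), r[2])
        return alt(d, der(c, r[2])) if nullable(r[1]) else d
    return seq(der(c, r[1]), r)  # star

def equiv(r, s, alphabet):
    # r and s are equivalent iff no reachable pair of simultaneous
    # derivatives disagrees on nullability.  Termination here relies on
    # the simplifications above; a verified checker would normalise more
    # aggressively (e.g. up to associativity/commutativity/idempotence).
    seen, todo = set(), [(r, s)]
    while todo:
        p, q = todo.pop()
        if (p, q) in seen:
            continue
        seen.add((p, q))
        if nullable(p) != nullable(q):
            return False
        todo.extend((der(c, p), der(c, q)) for c in alphabet)
    return True

a = char("a")
# a* is equivalent to 1 + a . a*, but not to the single character a:
same = equiv(star(a), alt(ONE, seq(a, star(a))), "a")   # True
diff = equiv(star(a), a, "a")                           # False
```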
|
We first give a brief introduction to what zippers are,
and to other works
that apply zippers to derivatives.
When dealing with large trees, it would be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
The idea is to put the focus on that subtree, turning the other parts
of the tree into a context.
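The idea can be made concrete with a small sketch (our own code, with our own names, following Huet's presentation only loosely): a location pairs the subtree in focus with the context needed to rebuild the rest of the tree, so local edits never touch the parts outside the focus.

```python
# A minimal tree zipper (our own illustrative code, loosely following
# Huet's presentation).  A location pairs the subtree in focus with the
# context, i.e. the path of ancestors and the siblings left behind.

class Leaf:
    def __init__(self, value):
        self.value = value

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Location:
    def __init__(self, focus, path):
        self.focus = focus  # the subtree currently in focus
        self.path = path    # list of (side, sibling) pairs, innermost first

    def down_left(self):
        return Location(self.focus.left, [("L", self.focus.right)] + self.path)

    def down_right(self):
        return Location(self.focus.right, [("R", self.focus.left)] + self.path)

    def up(self):
        (side, sibling), rest = self.path[0], self.path[1:]
        if side == "L":
            return Location(Node(self.focus, sibling), rest)
        return Location(Node(sibling, self.focus), rest)

    def replace(self, subtree):
        # Edit the focused subtree in O(1); the context is untouched.
        return Location(subtree, self.path)

def rebuild(loc):
    # Zip the tree back up from the focus.
    while loc.path:
        loc = loc.up()
    return loc.focus

tree = Node(Node(Leaf(1), Leaf(2)), Leaf(3))
loc = Location(tree, []).down_left().down_right()   # focus on Leaf(2)
edited = rebuild(loc.replace(Leaf(42)))             # tree with 2 replaced by 42
```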
|


One observation about our derivative-based lexing algorithm is that
the derivative operation sometimes traverses the entire regular expression
unnecessarily:
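A tiny experiment (our own sketch) makes this observation concrete: deriving an $n$-way alternation visits every alternative, even when the next character matches none of them.

```python
# Our own sketch of the observation above: taking the derivative of an
# n-way alternation c1 + c2 + ... + cn walks the whole expression, even
# when the character matches none of the alternatives.

def alts(s):
    # Right-nested alternation of the characters of s.
    r = ("chr", s[-1])
    for c in reversed(s[:-1]):
        r = ("alt", ("chr", c), r)
    return r

def der(c, r, count):
    count[0] += 1  # one tick per node the derivative touches
    if r[0] == "chr":
        return ("one",) if r[1] == c else ("zero",)
    return ("alt", der(c, r[1], count), der(c, r[2], count))

count = [0]
d = der("z", alts("abcdefghij"), count)   # 10 alternatives, none match "z"
# count[0] == 19: all 9 alternation nodes and all 10 character nodes visited
```

A zipper would let us rebuild only the focused part of such an expression, which is precisely the saving we are after.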