% Chapter Template

\chapter{Related Work} % Main chapter title

\label{RelatedWork}

In this chapter, we review the work relevant to this thesis.


\section{Work on Back-References}
We introduced regular expressions with back-references
in Chapter~\ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back-References) in this section for brevity.
Aho \cite{Aho1990} showed
that the $k$-vertex-cover problem can be reduced
to the rewbr matching problem, which is therefore NP-complete.
Given the hardness of the problem, the
theoretical work on rewbrs is
fairly recent.

Câmpeanu et al. studied regexes with back-references
in the context of formal language theory in
their 2003 work \cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
proving that the languages denoted by them
are properly contained in the context-sensitive
languages.
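For instance, the rewbr $((a|b)^*)\backslash 1$ denotes the copy language
$\{ww \mid w \in \{a,b\}^*\}$, which is not context-free; rewbrs thus reach
strictly beyond the regular and context-free languages, while staying, by
this result, within the context-sensitive ones.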
More interesting questions, such as
whether Perl-like regexes
can express the language of palindromes ($\{w \mid w = w^R\}$),
are discussed in \cite{campeanu2009patterns}
and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the
expressive power of back-references. He showed several
undecidability and descriptional complexity results
for back-references, concluding that they add
great power to certain programming tasks but are difficult to harness.
An interesting question would then be
whether we can restrict them
so that they remain expressive enough for practical use cases, such
as matching HTML, while avoiding the full hardness of unrestricted back-references.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.

Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives on back-references.

See \cite{BERGLUND2022} for a survey
of these works and a comparison between different
flavours of back-reference syntax (e.g.\ whether references can be circular,
whether labels can be repeated, etc.).

\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}, a further one
in Coq is given by Coquand and Siles \parencite{Coquand2012},
and Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.

\subsection{Static Analysis of Evil Regex Patterns}
When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form,
or they try to avoid the possibly problematic patterns completely,
an approach for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions, such as
\parencite{regexploit2021} and \parencite{regex101}, are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.

Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound
linear in the input size and
cubic in the regular expression size, using the partial-derivative technique by
Antimirov~\cite{Antimirov95}.
Once formalised, this would guarantee the absence of all super-linear behaviour.
We are working out the
details.
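
To sketch the idea (in Scala, over a toy regex datatype of our own; this is
an illustration, not our formalisation): where a Brzozowski derivative returns
a single, possibly large regular expression, an Antimirov partial derivative
returns a \emph{set} of regular expressions, and the set of all partial
derivatives reachable from a regex is linearly bounded in its size.
\begin{verbatim}
// A minimal sketch of Antimirov partial derivatives (toy datatype).
enum Rexp:
  case Zero                     // matches nothing
  case One                      // matches the empty string
  case Chr(c: Char)
  case Alt(r1: Rexp, r2: Rexp)
  case Cat(r1: Rexp, r2: Rexp)
  case Star(r: Rexp)

import Rexp.*

def nullable(r: Rexp): Boolean = r match
  case Zero        => false
  case One         => true
  case Chr(_)      => false
  case Alt(r1, r2) => nullable(r1) || nullable(r2)
  case Cat(r1, r2) => nullable(r1) && nullable(r2)
  case Star(_)     => true

// Partial derivative: a set of regexes rather than a single regex.
def pder(c: Char, r: Rexp): Set[Rexp] = r match
  case Zero | One  => Set()
  case Chr(d)      => if c == d then Set(One) else Set()
  case Alt(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case Cat(r1, r2) =>
    val left = pder(c, r1).map(Cat(_, r2))
    if nullable(r1) then left ++ pder(c, r2) else left
  case Star(r1)    => pder(c, r1).map(Cat(_, Star(r1)))

// Matching keeps a set of regexes, one pder step per character.
def matches(r: Rexp, s: String): Boolean =
  s.foldLeft(Set(r))((rs, c) => rs.flatMap(pder(c, _))).exists(nullable)
\end{verbatim}
Because every reachable set draws from this linearly bounded universe, the
matcher's internal state cannot grow unboundedly, which is the key fact
behind the bound we hope to formalise.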

To the best of our knowledge, no lexing libraries using Brzozowski derivatives
come with similar complexity-related bounds;
claims about running time are usually speculative and backed by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity-related
guarantees (for example that the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.

\section{Derivatives and Zippers}
A zipper is a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on a large tree often only deal with
a local region at a time.
It was first formally described by Huet \cite{HuetZipper}.
Typical applications of zippers involve text editor buffers
and proof system databases.
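
As a concrete illustration, here is a minimal Scala sketch of Huet's
construction over a toy binary tree (our own example, independent of regular
expressions): the context records the path from the root to the focused
subtree, so moving the focus and editing it locally never touch the rest of
the tree.
\begin{verbatim}
enum Tree:
  case Leaf(label: Int)
  case Node(left: Tree, right: Tree)

// The context remembers how we descended and keeps the siblings
// we passed, so the tree can be rebuilt when moving back up.
enum Context:
  case Top                                    // focus is at the root
  case InLeft(parent: Context, right: Tree)   // we went left
  case InRight(left: Tree, parent: Context)   // we went right

import Tree.*, Context.*

// A zipper is the focused subtree together with its context.
final case class Zipper(focus: Tree, ctx: Context):
  def downLeft: Option[Zipper] = focus match
    case Node(l, r) => Some(Zipper(l, InLeft(ctx, r)))
    case Leaf(_)    => None
  def up: Option[Zipper] = ctx match
    case InLeft(p, r)  => Some(Zipper(Node(focus, r), p))
    case InRight(l, p) => Some(Zipper(Node(l, focus), p))
    case Top           => None
  // Editing the focus is O(1): the rest of the tree is untouched.
  def replace(t: Tree): Zipper = Zipper(t, ctx)
\end{verbatim}
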
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
We draw our inspiration from several works on parsing, derivatives
and zippers.

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm was linear with respect to the input string.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and fixpoint construction to eliminate infinite recursion,
which is a major problem for using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{10.1145/3408990} further improved the performance
by using zippers in an innovative way: to handle ambiguity, their zippers
have multiple focuses instead of the single focus of a traditional zipper.
Their algorithm was not formalised, though.

We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just like Darragh and Adams did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$.
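
How such factored terms might be represented is sketched below (our own
illustration, reusing the toy Rexp datatype from the earlier sketch; the
actual lexer may represent this differently):
\begin{verbatim}
// Multiple "focuses" x, y, ... in front of one shared tail r,
// standing for x.r + y.r + ... without copying r.
final case class SharedSeq(focuses: List[Rexp], tail: Rexp):
  // Expand into the explicit sum x.r + y.r + ...
  // (assumes at least one focus).
  def expand: Rexp =
    focuses.map(Rexp.Cat(_, tail)).reduceLeft(Rexp.Alt(_, _))
\end{verbatim}
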
Incorporating zippers into the current lexer
also allows it to avoid traversing the entire
derivative tree generated
at every intermediate step, visiting only the small segment
that is currently active.

Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising the correctness of
such a zipper-based lexer will probably require an imperative
framework with separation logic, as it involves
mutable data structures; this is also interesting to look into.

To further optimise the algorithm,
we plan to add a ``deduplicate'' function that decides
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
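
A rough, unverified sketch of such a checker in the same toy setting
(reusing pder and nullable from the earlier sketch) explores pairs of
Antimirov states in bisimulation style; two expressions are inequivalent
exactly when some reachable pair disagrees on nullability:
\begin{verbatim}
def equiv(r1: Rexp, r2: Rexp, alphabet: Set[Char]): Boolean =
  var seen = Set.empty[(Set[Rexp], Set[Rexp])]
  var todo = List((Set(r1), Set(r2)))
  while todo.nonEmpty do
    val (s1, s2) = todo.head
    todo = todo.tail
    if !seen.contains((s1, s2)) then
      seen = seen + ((s1, s2))
      // corresponding states must agree on acceptance ...
      if s1.exists(nullable) != s2.exists(nullable) then
        return false
      // ... and so must all their successors
      for c <- alphabet do
        todo = (s1.flatMap(pder(c, _)), s2.flatMap(pder(c, _))) :: todo
  true
\end{verbatim}
Termination holds because the reachable states are subsets of the finite set
of partial derivatives; a verified version would have to formalise this
argument.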

One observation about our derivative-based lexing algorithm is that
the derivative operation sometimes traverses the entire regular expression
unnecessarily, even when the operation only involves a small fraction of it.
When dealing with such large trees, it would be a waste to traverse the
entire tree; the idea of the zipper is to put the focus on the relevant
subtree, turning the other parts of the tree into a context.
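
For example, writing $r\backslash c$ for the derivative of $r$ with respect
to the character $c$, if $r_1$ is not nullable then
\[
(r_1 \cdot r_2 \cdots r_n)\backslash c \;=\; (r_1\backslash c)\cdot r_2 \cdots r_n,
\]
so only the leftmost sub-expression changes. A naive implementation
nevertheless rebuilds and re-traverses the whole sequence in every step,
whereas a zipper can keep $r_2 \cdots r_n$ as an unchanged context around
the focus $r_1$.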