% Chapter Template

\chapter{Related Work} % Main chapter title

\label{RelatedWork}

%In this chapter, we introduce
%work relevant to this thesis.

\section{Regular Expressions, Derivatives and POSIX Lexing}

%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
Since then they have become a fundamental concept in
formal languages and automata theory \cite{Sakarovitch2009}.
Brzozowski defined derivatives on regular expressions in his PhD thesis
in 1964 \cite{Brzozowski1964}, in which he proved that the number of
derivatives of a regular expression is finite modulo the ACI-axioms.
It is worth pointing out that this result does not directly
translate to our finiteness proof,
and our proof does not depend on it.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents
derivative terms in a way that allows efficient de-duplication,
and we do not make use of an equivalence checker that exploits
ACI-equivalences between terms.
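
To make the discussion of derivatives concrete, the following sketch (our own
Scala rendering for illustration, not the verified Isabelle/HOL development)
shows Brzozowski's nullability test and derivative operation, and hints at
where ACI-equivalent duplicates arise when derivatives are iterated without
simplification.
\begin{verbatim}
// Minimal sketch of Brzozowski derivatives (illustration only).
sealed trait Rexp
case object ZERO extends Rexp                   // matches no string
case object ONE extends Rexp                    // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// nullable(r) holds iff r can match the empty string
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// der(c, r): the derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// Iterating der over a string gives a matcher.
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}
Without simplification, the nested ALT-terms produced by the alternative and
sequence cases are precisely the terms that are equal only modulo
associativity, commutativity and idempotence; representing them so that they
can be de-duplicated efficiently is what our version of the algorithm aims at.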

Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
They first introduced the elegant and simple idea of injection-based lexing
and bit-coded lexing.
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas
into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, based on the ideas
by Frisch and Cardelli \cite{Frisch2004}, were later
found to have unfillable gaps by Ausaf et al.~\cite{AusafDyckhoffUrban2016},
who came up with an alternative proof inspired by
Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to
shuffling regular expressions \cite{Sulzmann2015},
which are a very succinct way of describing certain regular languages.

Regular expressions and lexers have been a popular topic in
the theorem proving and functional programming communities.
In the next section we give a list of lexers
and matchers that come with a machine-checked correctness proof.

\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. Another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent work on verified lexers, to the best of our knowledge, consists
of the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatimpp} lexers by Egolf et al.
Verbatim is based on derivatives but does not simplify them.
Therefore it can be very slow on evil regular expressions.
The Verbatim++ lexer adds many correctness-preserving
optimisations to the Verbatim lexer,
and is therefore quite fast on many inputs.
The problem is that Egolf et al.\ chose to use traditional DFAs to speed up lexing,
but with DFAs, dealing with bounded repetitions becomes a real bottleneck.

This thesis builds on the formal specifications of POSIX
rules and formal proofs by Ausaf et al.~\cite{AusafDyckhoffUrban2016}.
The treatment of bounded repetitions presented here is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with for theorem provers \cite{Nipkow1998}.
One way to represent them is as graphs, but graphs are not inductive datatypes.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted,
resulting in large formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct.
If they are not distinct, then renaming of the nodes is needed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.

Another representation of automata uses matrices.
But induction over them is not straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets \cite{Paulson2015}.
There the problem of combining graphs can be handled better.
%but we believe that such clever tricks are not very obvious for
%the John-average-Isabelle-user.
Because of these problems with automata, we prefer regular expressions
and derivatives, which can be implemented in theorem provers and functional
programming languages with ease.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} developed a notion of
\emph{non-problematic values} which is a slight variation
of the values defined by the inhabitation relation in Figure~\ref{fig:inhab}.
They then defined an ordering between values, and showed that
the maximal element of those values corresponds to the output
of their GREEDY lexing algorithm.

Okui and Suzuki \cite{Okui10} allow iterations of values to
flatten to the empty
string in the star inhabitation rule in
Figure~\ref{fig:inhab}.
They refer to the more restrictive version used
in this thesis (defined by Ausaf et al.~\cite{AusafDyckhoffUrban2016})
as \emph{canonical values}.
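
For comparison, the following Scala sketch (our own notation, only loosely
following Ausaf et al.; it is not a verbatim copy of any of the cited
formalisations) shows a value datatype together with a predicate singling out
canonical values, i.e.\ values in which no star iteration flattens to the
empty string.
\begin{verbatim}
// Sketch of lexical values (names are ours; LeftV/RightV play the
// role of the Left/Right constructors in the text).
sealed trait Val
case object Empty extends Val                   // witnesses ONE
case class Chr(c: Char) extends Val             // witnesses a character
case class Sequ(v1: Val, v2: Val) extends Val   // witnesses a sequence
case class LeftV(v: Val) extends Val            // alternative, left branch
case class RightV(v: Val) extends Val           // alternative, right branch
case class Stars(vs: List[Val]) extends Val     // one value per star iteration

// The string a value stands for ("flattening").
def flatten(v: Val): String = v match {
  case Empty        => ""
  case Chr(c)       => c.toString
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case LeftV(v1)    => flatten(v1)
  case RightV(v1)   => flatten(v1)
  case Stars(vs)    => vs.map(flatten).mkString
}

// Canonical values: no star iteration contributes the empty string.
def canonical(v: Val): Boolean = v match {
  case Stars(vs)    => vs.forall(u => canonical(u) && flatten(u).nonEmpty)
  case Sequ(v1, v2) => canonical(v1) && canonical(v2)
  case LeftV(v1)    => canonical(v1)
  case RightV(v1)   => canonical(v1)
  case _            => true
}
\end{verbatim}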
A very interesting link between the work by Ausaf et al.\ and that of
Okui and Suzuki is that they have distinct formalisations
of POSIX values, and yet they define the same notion. See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the formalisation
described in this thesis.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish that their
algorithm outputs the minimal element by a pencil-and-paper proof.
But because the ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

\subsection{Static Analysis of Evil Regex Patterns}

When faced with catastrophic backtracking,
programmers sometimes
have to rewrite their regexes in an ad hoc fashion.
Knowing which patterns should be avoided
can be helpful.
Tools for this purpose
often over-approximate evil regular expression patterns,
and there can be many false positives \cite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} and \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour with backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-case time estimates
for regexes and ``attack strings'' that exploit the worst-case
scenario, and introduced an ``attack automaton'' that generates
such attack strings.
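
To illustrate the kind of pattern these analyses try to flag, the following
sketch (our own illustration, relying only on the standard
\texttt{java.util.regex} backtracking engine available from Scala) exhibits
catastrophic backtracking on the classic evil pattern \texttt{(a+)+b}.
\begin{verbatim}
// Illustration only: a classic "evil" pattern for backtracking engines.
// (a+)+b forces the matcher to try exponentially many ways of splitting
// a run of a's between the inner and outer plus before failing.
import java.util.regex.Pattern

val evil = Pattern.compile("(a+)+b")

def timeMs(n: Int): Long = {
  val input = "a" * n              // n a's and no final b: matching must fail
  val start = System.nanoTime()
  evil.matcher(input).matches()    // exponential in n on a backtracking engine
  (System.nanoTime() - start) / 1000000
}

@main def redosDemo(): Unit =
  // running times grow exponentially; n around 30 already takes seconds,
  // whereas a derivative- or automaton-based matcher stays linear here
  for (n <- 16 to 28 by 4) println(s"n = $n: ${timeMs(n)} ms")
\end{verbatim}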

\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing
from being more widely adopted
is that such lexers tend to be not very fast in practice, unable to
reach throughputs of gigabytes per second, which is the kind of
application we had in mind when we started looking at this topic.
Commercial
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of regex rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.

FPGA-based implementations such as \cite{Sidhu2001}
have the advantages of being
reconfigurable and parallel, but suffer from lower clock frequencies
and limited scalability.
Traditional automaton approaches that use DFAs instead of NFAs
benefit from the fact that only a single transition is needed
for each input character \cite{Becchi08}. The lower memory bandwidth
requirement leads to faster performance.
However, they suffer from exponential memory requirements on bounded repetitions.
Compression techniques are used, such as those in \cite{Kumar2006} and
\cite{Becchi2007}.
Variations of pure NFAs or DFAs, such as counting-set automata \cite{Turonova2020},
have been
proposed to better deal with bounded repetitions.
But they are usually not used in formalised proofs.
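
To see why bounded repetitions are problematic for approaches that expand them
away, the following sketch (our own illustration, reusing the \texttt{Rexp}
constructors from the derivative sketch earlier in this chapter) unfolds
$r\{n\}$ into $n$ sequential copies; for the classic pattern
$(a|b)^*\,a\,(a|b)\{n\}$ the expanded expression grows with $n$, while a
minimal DFA for it needs on the order of $2^n$ states.
\begin{verbatim}
// Sketch: naive expansion of a bounded repetition r{n} into n copies
// (reuses the Rexp datatype from the derivative sketch above).
def ntimes(r: Rexp, n: Int): Rexp =
  if (n <= 0) ONE
  else (1 until n).foldLeft(r)((acc, _) => SEQ(acc, r))

// Size of a regular expression, counting constructors.
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2)          => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
  case STAR(r1)             => 1 + size(r1)
}

// (a|b)*a(a|b){n}: the counter n is written with O(log n) digits in the
// pattern, but the expanded regex grows linearly in n and its minimal
// DFA needs about 2^n states.
def evilCounting(n: Int): Rexp = {
  val ab = ALT(CHAR('a'), CHAR('b'))
  SEQ(STAR(ab), SEQ(CHAR('a'), ntimes(ab, n)))
}
\end{verbatim}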

%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
Another direction of optimisation for derivative-based approaches
is defining string derivatives
directly, without recursively decomposing them into
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in the definition in Table~\ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because input is
processed at a larger granularity.
One interesting direction to explore is whether this can be done
for $\inj$ as well, so that we can generate a lexical value
rather than just a matcher.
It is also not yet clear how this idea can be extended to sequences and stars.
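
A possible shape of such a string derivative is sketched below (our own
sketch, reusing \texttt{Rexp} and \texttt{der} from the derivative sketch
earlier in this chapter); only the alternative case pushes the whole string
inside, while sequences and stars still fall back to character-by-character
derivatives, mirroring the open question just mentioned.
\begin{verbatim}
// Sketch: a string derivative that distributes whole strings over
// alternatives instead of peeling off one character at a time.
def sders(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, _)         => r
  case (_, ALT(r1, r2)) => ALT(sders(s, r1), sders(s, r2))
  case (c :: cs, _)     => sders(cs, der(c, r))   // fall back to der
}
\end{verbatim}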

\subsection{Derivatives and Zippers}
Zippers are a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on large trees often only deal with
a local region at a time.
Therefore it would be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
Zippers were first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text editor buffers
and proof system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
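
For readers unfamiliar with zippers, the following is a minimal Huet-style
zipper over binary trees in Scala (our own illustration; the zippers used by
Edelmann et al.\ and by Darragh and Adams over grammars and derivatives are
considerably more involved).
\begin{verbatim}
// A zipper is a focused subtree plus the path of contexts back to the
// root.  Moving the focus and editing locally are O(1); only the spine
// is rebuilt when we move back up.
sealed trait Tree
case class Leaf(label: String) extends Tree
case class Node(left: Tree, right: Tree) extends Tree

sealed trait Ctx
case object Top extends Ctx
case class InLeft(parent: Ctx, right: Tree) extends Ctx  // went left, keep right sibling
case class InRight(left: Tree, parent: Ctx) extends Ctx  // went right, keep left sibling

case class Zipper(focus: Tree, ctx: Ctx) {
  def goLeft: Zipper = focus match {
    case Node(l, r) => Zipper(l, InLeft(ctx, r))
    case _          => this
  }
  def goRight: Zipper = focus match {
    case Node(l, r) => Zipper(r, InRight(l, ctx))
    case _          => this
  }
  def goUp: Zipper = ctx match {
    case InLeft(p, r)  => Zipper(Node(focus, r), p)
    case InRight(l, p) => Zipper(Node(l, focus), p)
    case Top           => this
  }
  def replace(t: Tree): Zipper = copy(focus = t)  // local edit, rest untouched
}
\end{verbatim}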
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
632
diff
changeset
|
219 |
Below we describe several works on parsing, derivatives |
626 | 220 |
and zippers. |
221 |
||
222 |
Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives |
|
223 |
\cite{Zippy2020}. |
|
224 |
They adopted zippers to improve the speed, and argued that the runtime |
|
225 |
complexity of their algorithm was linear with respect to the input string. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
632
diff
changeset
|
226 |
They did not provide a formal proof for this. |

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al.~\cite{Might2011}.
They used memoization and fixpoint constructions to eliminate infinite recursion,
which is a major problem for using derivatives with context-free grammars.
Their initial version was quite slow, with a runtime exponential in the size of the input.
Adams et al.~\cite{Adams2016} improved the speed and argued that their version
was cubic in the size of the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers have multiple focuses,
instead of the single focus of a traditional zipper, in order to handle ambiguity.
Their algorithm was, however, not formalised.

\subsection{Back-References}
We introduced regular expressions with back-references
in Chapter~\ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) for brevity.
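
As a concrete example (ours, using the standard \texttt{java.util.regex}
back-reference syntax available from Scala), the rewbr
\texttt{((a|b)+)\textbackslash 1} matches exactly the ``copy'' language
$\{ww \mid w \in \{a,b\}^+\}$, which no back-reference-free regular expression
can describe.
\begin{verbatim}
// Illustration: a rewbr for the copy language { ww | w in {a,b}+ }.
import java.util.regex.Pattern

val copy = Pattern.compile("((a|b)+)\\1")  // group 1, then the same string again

@main def rewbrDemo(): Unit = {
  println(copy.matcher("abab").matches())   // true:  w = "ab"
  println(copy.matcher("abaab").matches())  // false: odd length, no such w
}
\end{verbatim}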
It has been shown by Aho \cite{Aho1990}
that the $k$-vertex cover problem can be reduced
to the problem of rewbr matching, and therefore matching with rewbrs is NP-complete.
Given the hardness of the problem, progress on matching arbitrary rewbrs
in full generality has been slow, with
theoretical work on them being
fairly recent.

Campeanu et al.\ studied rewbrs
in the context of formal language theory in
\cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
proving that the languages denoted by them
are properly contained in the context-sensitive
languages.
More interesting questions, such as
whether the languages denoted by Perl-like regexes
can express the language of palindromes ($\{w \mid w = w^R\}$),
are discussed in \cite{campeanu2009patterns}
and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the
expressive power of back-references. He showed several
undecidability and descriptional complexity results
for back-references, concluding that they add
great power to certain programming tasks but are difficult to comprehend.
An interesting question would then be
whether we can add restrictions to them,
so that they become expressive enough for practical use, such
as matching HTML files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.

Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives with back-references.

%See \cite{BERGLUND2022} for a survey
%of these works and comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%can labels be repeated etc.).