% Chapter Template

\chapter{Future Work and Conclusion} % Main chapter title

\label{Future}

In this thesis, in order to address ReDoS attacks
once and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
lexing algorithm with simplifications \cite{Sulzmann2014}
in Isabelle/HOL.
We have fixed inefficiencies in the original simplification
algorithm, and established its correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al. in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because the simplification function alters the structure of
regular expressions.
Despite having to try out several workarounds and being
stuck for months, or even years, looking for proofs,
we were delighted to discover a simple yet
effective proof method:
modelling individual
simplification steps as small-step rewriting rules and proving
equivalence between terms linked by these rules.
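
To give a flavour of these rules: one of them, stated here schematically
in ad-hoc notation rather than in the exact form of chapter \ref{Bitcoded2},
removes $\mathbf{0}$ (the regular expression matching nothing) from a list of
alternatives,
\[
\sum\,(rs_1 \,@\, [\mathbf{0}] \,@\, rs_2) \;\leadsto\; \sum\,(rs_1 \,@\, rs_2),
\]
and the central invariant is that whenever $r \leadsto r'$, both sides
decode to the same POSIX value, so simplification never changes the
lexing result.
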
In addition, we have proved a formal size bound on the
regular expressions arising during lexing. The technique was to establish
``closed forms'', informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
for the derivatives of compound regular expressions,
and to use those closed forms to control the size.
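For instance, for sequences the closed form is roughly of the shape
\[
(r_1 \cdot r_2)\backslash s \;\simeq\;
(r_1\backslash s)\cdot r_2 \;+\; \sum_{s'} (r_2\backslash s'),
\]
where $s'$ ranges over certain suffixes of $s$ and $\simeq$ denotes
equality modulo simplification (the precise statement is given in
chapter \ref{Finite}); the size of a derivative can then be bounded by
bounding each summand separately.
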
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
\noindent
Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound
linear in the input size and
cubic in the regular expression size, using a technique by
Antimirov~\cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all super-linear behaviour.
We are working out the
details.

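The key fact we hope to exploit is Antimirov's result that the set of
partial derivatives of a regular expression $r$ under \emph{all} strings is
finite and small,
\[
\Big|\bigcup_{s}\mathit{pders}\;s\;r\Big| \;\leq\; |r|_\Sigma + 1,
\]
where $|r|_\Sigma$ counts the occurrences of alphabet symbols in $r$
(we write $\mathit{pders}$ here for the function computing partial derivatives).
This keeps the number of distinct subterms, and hence the size of the
internal data structures, under control.
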
To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity-related
guarantees (for example that the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.

We believe our proof method is not specific to this
algorithm, and we intend to extend this approach
to prove the correctness of a faster version
of the lexer. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95},
we plan to reduce the size bound so that it becomes polynomial
with respect to the size of the regular expression.
%We have tried out a variety of other optimisations
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.

We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea of using rewriting relations
was the hardest part for us, as it requires a deep understanding of what information
is preserved during the simplification process.
Having the first formalisation gave us much more confidence: it took
us much less time to obtain the closed forms and the size bound, although they
are more involved than the correctness proof.
As has been demonstrated numerous times,
the usefulness of formal approaches cannot be overstated:
they not only allowed us to find bugs in Sulzmann and Lu's simplification functions,
but also helped us set realistic expectations about the performance
of algorithms. We believed in the beginning that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ essentially a constant factor for any fixed $r$.
We were quite surprised that this did not go through despite a lot of effort.
This led us to discover the example in chapter \ref{Finite} where
the factor depending on $r$ can in fact be exponentially large,
so no fixed $c$ suffices.
We therefore learned the necessity of introducing the stronger simplifications
in chapter \ref{Cubic}.
Without formal proofs we could not have found this out so soon.

In the future, we plan to incorporate many extensions
to this work.

\section{Future Work}
The most obvious next step is to formalise the cubic bound and
the correctness of $\blexerStrong$
from chapter \ref{Cubic}.
A cubic bound ($O(\llbracket r\rrbracket^3\cdot |s|)$) with respect to
the regular expression size would fulfil
the linear complexity claim (in the length of the input string) made by Sulzmann and Lu.

With such a size bound established theoretically, the next challenge would be
to generate code that is competitive with commercial
matchers. For that, a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just like Darragh \cite{Darragh2020} did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$.
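
A minimal sketch of this sharing idea in Scala is given below (the
datatype is hypothetical and much simplified compared to our actual
annotated regular expressions):

\begin{verbatim}
// simplified regular expressions (no bitcodes, no zipper context)
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// x*r + y*r + ... stored as the heads x, y, ... plus one shared tail r
case class SharedSeq(heads: List[Rexp], tail: Rexp)

// expanding the shared form back to an ordinary regular expression
def expand(s: SharedSeq): Rexp = SEQ(ALTS(s.heads), s.tail)
\end{verbatim}

\noindent
The point is that the tail $r$ is stored once rather than once per
alternative, mirroring the multi-focus zippers above.
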
Some initial results have been obtained, but
significant changes are needed to make sure
that the lexer output conforms to the POSIX standard.
We aim to make use of Okui and Suzuki's labelling
system \cite{Okui10} to ensure that regular expressions represented as zippers
always maintain the POSIX orderings.
%Formalising correctness of
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.

To further optimise the algorithm,
we plan to add a deduplication function that decides
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker like the one by Krauss and Nipkow \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
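
The function itself is straightforward once an equivalence test is
available; a sketch (generic in the type of regular expressions, with
\texttt{equiv} standing in for the verified checker) could look like:

\begin{verbatim}
// keep only the first representative of each equivalence class;
// `equiv` is assumed to be a (verified) language-equivalence test
def dedup[R](rs: List[R], equiv: (R, R) => Boolean): List[R] =
  rs.foldLeft(List.empty[R]) { (kept, r) =>
    if (kept.exists(equiv(_, r))) kept else kept :+ r
  }
\end{verbatim}

\noindent
The interesting part is not the list traversal but making \texttt{equiv}
both efficient and formally verified.
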
Traditional automaton approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer only deals with a single regular expression at a time.
Lexing with multiple regular expressions at once might open up more
possibilities for simplification.

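In our setting this would amount to deriving all rules simultaneously; a
hypothetical sketch (reusing the simplified \texttt{Rexp} from above) is:

\begin{verbatim}
// tag each lexing rule with an identifier and combine them into one
// alternation, so a single derivative pass processes all rules at once
case class Rule(id: String, r: Rexp)

def combine(rules: List[Rule]): Rexp =
  ALTS(rules.map(_.r))
\end{verbatim}

\noindent
The bookkeeping needed to map a match back to the rule that produced it
is omitted here; in the bit-coded setting the annotations could carry the
rule identifier.
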
As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can accelerate the
matching speed. It would be interesting to measure the memory
bandwidth of our derivative-based matching and improve it accordingly.

Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore the performance enhancements offered by memoization in our algorithm
in a correctness-preserving way.
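
As a first step, the derivative function itself could be memoised; a
generic sketch (the concrete key type and derivative function are
placeholders, not our actual definitions) is:

\begin{verbatim}
import scala.collection.mutable

// wrap any function with a cache; correctness is preserved as long
// as the wrapped function is pure
class Memo[A, B](f: A => B) extends (A => B) {
  private val cache = mutable.Map.empty[A, B]
  def apply(a: A): B = cache.getOrElseUpdate(a, f(a))
}

// e.g. val derMemo = new Memo[(Char, Rexp), Rexp]((der _).tupled)
\end{verbatim}

\noindent
The formal challenge is to prove that such caching does not change the
lexer's output, which is why we emphasise correctness preservation.
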
The monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
algorithm seems quite interesting.
We aim to learn from their refinement framework,
which generates high-performance code with proofs
that can be broken down into uncluttered small steps.

%We establish the correctness
%claim made by them about the
%, which were only proven with pencil and paper.

%\section{Work on Back-References}
%We introduced regular expressions with back-references
%in chapter \ref{Introduction}.
%We adopt the common practice of calling them rewbrs (Regular Expressions
%With Back References) in this section for brevity.
%It has been shown by Aho \cite{Aho1990}
%that the k-vertex cover problem can be reduced
%to the problem of rewbrs matching, and is therefore NP-complete.
%Given the depth of the problem, the progress made at the full generality
%of arbitrary rewbrs matching has been slow, with
%theoretical work on them being
%fairly recent.
%
%Campaneu et al. studied regexes with back-references
%in the context of formal languages theory in
%their 2003 work \cite{campeanu2003formal}.
%They devised a pumping lemma for Perl-like regexes,
%proving that the languages denoted by them
%are properly contained in the context-sensitive
%languages.
%More interesting questions such as
%whether the language denoted by Perl-like regexes
%can express palindromes ($\{w \mid w = w^R\}$)
%are discussed in \cite{campeanu2009patterns}
%and \cite{CAMPEANU2009Intersect}.
%Freydenberger \cite{Frey2013} investigated the
%expressive power of back-references. He showed several
%undecidability and descriptional complexity results
%for back-references, concluding that they add
%great power to certain programming tasks but are difficult to harness.
%An interesting question would then be
%whether we can add restrictions to them,
%so that they become expressive enough for practical use such
%as html files, but not too powerful.
%Freydenberger and Schmid \cite{FREYDENBERGER20191}
%introduced the notion of deterministic
%regular expressions with back-references to achieve
%a better balance between expressiveness and tractability.
%
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
%investigated the time complexity of different variants
%of back-references.
%We are not aware of any work that uses derivatives on back-references.
%
%See \cite{BERGLUND2022} for a survey
%of these works and a comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%whether labels can be repeated etc.).
%
%\subsection{Matchers and Lexers with Mechanised Proofs}
%We are aware
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
%of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
%
%\subsection{Static Analysis of Evil Regex Patterns}
%When a regular expression does not behave as intended,
%people usually try to rewrite the regex to some equivalent form
%or they try to avoid the possibly problematic patterns completely,
%for which many false positives exist \parencite{Davis18}.
%Animated tools to ``debug'' regular expressions such as
%\parencite{regexploit2021} \parencite{regex101} are also popular.
%We are also aware of static analysis work on regular expressions that
%aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
%\parencite{Rathnayake2014StaticAF} proposed an algorithm
%that detects regular expressions triggering exponential
%behaviour on backtracking matchers.
%Weideman \parencite{Weideman2017Static} came up with
%non-linear polynomial worst-time estimates
%for regexes, attack strings that exploit the worst-time
%scenario, and ``attack automata'' that generate
%attack strings.
%
%
%\section{Optimisations}
%Perhaps the biggest problem that prevents derivative-based lexing
%from being more widely adopted
%is that it is still not blindingly fast in practice, unable to
%reach throughputs like gigabytes per second, which commercial
%regular expression matchers such as Snort and Bro are able to achieve.
%For our algorithm to be more attractive for practical use, we
%need more correctness-preserving optimisations.
%%So far our focus has been mainly on the bit-coded algorithm,
%%but the injection-based lexing algorithm
%%could also be sped up in various ways:
%%
%One such optimisation is defining string derivatives
%directly, without recursively decomposing them into
%character-by-character derivatives. For example, instead of
%replacing $(r_1 + r_2)\backslash (c::cs)$ by
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
%
%\subsection{Derivatives and Zippers}
%A zipper is a data structure designed to focus on
%and navigate between local parts of a tree.
%The idea is that often operations on a large tree only deal with
%local regions each time.
%It was first formally described by Huet \cite{HuetZipper}.
%Typical applications of zippers involve text editor buffers
%and proof system databases.
%In our setting, the idea is to compactify the representation
%of derivatives with zippers, thereby making our algorithm faster.
%We draw our inspiration from several works on parsing, derivatives
%and zippers.
%
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
%\cite{Zippy2020}.
%They adopted zippers to improve the speed, and argued that the runtime
%complexity of their algorithm was linear with respect to the input string.
%
%The idea of using Brzozowski derivatives for general context-free
%parsing was first implemented
%by Might et al. \cite{Might2011}.
%They used memoization and a fixpoint construction to eliminate infinite recursion,
%which is a major problem for using derivatives on context-free grammars.
%The initial version was quite slow: exponential in the size of the input.
%Adams et al. \cite{Adams2016} improved the speed and argued that their version
%was cubic with respect to the input.
%Darragh and Adams \cite{10.1145/3408990} further improved the performance
%by using zippers in an innovative way: their zippers had multiple focuses
%instead of just one as in a traditional zipper, in order to handle ambiguity.
%Their algorithm was not formalised, though.
%
%This allows the lexer to not traverse the entire
%derivative tree generated
%in every intermediate step, but only a small segment
%that is currently active.
%
%
%We first give a brief introduction to what zippers are,
%and other works
%that apply zippers to derivatives.
%When dealing with large trees, it would be a waste to
%traverse the entire tree if
%the operation only
%involves a small fraction of it.
%The idea is to put the focus on that subtree, turning other parts
%of the tree into a context.
%
%One observation about our derivative-based lexing algorithm is that
%the derivative operation sometimes traverses the entire regular expression
%unnecessarily: