
17 3241b1e71633 hi Chengsong parents: diff changeset	1	\documentclass[a4paper,UKenglish]{lipics}
3241b1e71633 hi Chengsong parents: diff changeset	2	\usepackage{graphic}
3241b1e71633 hi Chengsong parents: diff changeset	3	\usepackage{data}
3241b1e71633 hi Chengsong parents: diff changeset	4	% \documentclass{article}
3241b1e71633 hi Chengsong parents: diff changeset	5	%\usepackage[utf8]{inputenc}
3241b1e71633 hi Chengsong parents: diff changeset	6	%\usepackage[english]{babel}
3241b1e71633 hi Chengsong parents: diff changeset	7	%\usepackage{listings}
3241b1e71633 hi Chengsong parents: diff changeset	8	% \usepackage{amsthm}
3241b1e71633 hi Chengsong parents: diff changeset	9	% \usepackage{hyperref}
3241b1e71633 hi Chengsong parents: diff changeset	10	% \usepackage[margin=0.5in]{geometry}
3241b1e71633 hi Chengsong parents: diff changeset	11	%\usepackage{pmboxdraw}
3241b1e71633 hi Chengsong parents: diff changeset	12
3241b1e71633 hi Chengsong parents: diff changeset	13	\title{POSIX Regular Expression Matching and Lexing}
\author[1]{Anonymous}
3241b1e71633 hi Chengsong parents: diff changeset	15
3241b1e71633 hi Chengsong parents: diff changeset	16	\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
3241b1e71633 hi Chengsong parents: diff changeset	17	\newcommand{\ZERO}{\mbox{\bf 0}}
3241b1e71633 hi Chengsong parents: diff changeset	18	\newcommand{\ONE}{\mbox{\bf 1}}
3241b1e71633 hi Chengsong parents: diff changeset	19	\def\lexer{\mathit{lexer}}
3241b1e71633 hi Chengsong parents: diff changeset	20	\def\mkeps{\mathit{mkeps}}
3241b1e71633 hi Chengsong parents: diff changeset	21	\def\inj{\mathit{inj}}
3241b1e71633 hi Chengsong parents: diff changeset	22	\def\Empty{\mathit{Empty}}
3241b1e71633 hi Chengsong parents: diff changeset	23	\def\Left{\mathit{Left}}
3241b1e71633 hi Chengsong parents: diff changeset	24	\def\Right{\mathit{Right}}
3241b1e71633 hi Chengsong parents: diff changeset	25	\def\Stars{\mathit{Stars}}
3241b1e71633 hi Chengsong parents: diff changeset	26	\def\Char{\mathit{Char}}
3241b1e71633 hi Chengsong parents: diff changeset	27	\def\Seq{\mathit{Seq}}
3241b1e71633 hi Chengsong parents: diff changeset	28	\def\Der{\mathit{Der}}
3241b1e71633 hi Chengsong parents: diff changeset	29	\def\nullable{\mathit{nullable}}
3241b1e71633 hi Chengsong parents: diff changeset	30	\def\Z{\mathit{Z}}
3241b1e71633 hi Chengsong parents: diff changeset	31	\def\S{\mathit{S}}
3241b1e71633 hi Chengsong parents: diff changeset	32
3241b1e71633 hi Chengsong parents: diff changeset	33	%\theoremstyle{theorem}
3241b1e71633 hi Chengsong parents: diff changeset	34	%\newtheorem{theorem}{Theorem}
3241b1e71633 hi Chengsong parents: diff changeset	35	%\theoremstyle{lemma}
3241b1e71633 hi Chengsong parents: diff changeset	36	%\newtheorem{lemma}{Lemma}
3241b1e71633 hi Chengsong parents: diff changeset	37	%\newcommand{\lemmaautorefname}{Lemma}
3241b1e71633 hi Chengsong parents: diff changeset	38	%\theoremstyle{definition}
3241b1e71633 hi Chengsong parents: diff changeset	39	%\newtheorem{definition}{Definition}
3241b1e71633 hi Chengsong parents: diff changeset	40
3241b1e71633 hi Chengsong parents: diff changeset	41
3241b1e71633 hi Chengsong parents: diff changeset	42	\begin{document}
3241b1e71633 hi Chengsong parents: diff changeset	43
3241b1e71633 hi Chengsong parents: diff changeset	44	\maketitle
3241b1e71633 hi Chengsong parents: diff changeset	45	\begin{abstract}
3241b1e71633 hi Chengsong parents: diff changeset	46	Brzozowski introduced in 1964 a beautifully simple algorithm for
3241b1e71633 hi Chengsong parents: diff changeset	47	regular expression matching based on the notion of derivatives of
3241b1e71633 hi Chengsong parents: diff changeset	48	regular expressions. In 2014, Sulzmann and Lu extended this
3241b1e71633 hi Chengsong parents: diff changeset	49	algorithm to not just give a YES/NO answer for whether or not a regular
3241b1e71633 hi Chengsong parents: diff changeset	50	expression matches a string, but in case it matches also \emph{how}
3241b1e71633 hi Chengsong parents: diff changeset	51	it matches the string. This is important for applications such as
3241b1e71633 hi Chengsong parents: diff changeset	52	lexing (tokenising a string). The problem is to make the algorithm
3241b1e71633 hi Chengsong parents: diff changeset	53	by Sulzmann and Lu fast on all inputs without breaking its
3241b1e71633 hi Chengsong parents: diff changeset	54	correctness.
3241b1e71633 hi Chengsong parents: diff changeset	55	\end{abstract}
3241b1e71633 hi Chengsong parents: diff changeset	56
3241b1e71633 hi Chengsong parents: diff changeset	57	\section{Introduction}
3241b1e71633 hi Chengsong parents: diff changeset	58
3241b1e71633 hi Chengsong parents: diff changeset	59	This PhD-project is about regular expression matching and
3241b1e71633 hi Chengsong parents: diff changeset	60	lexing. Given the maturity of this topic, the reader might wonder:
3241b1e71633 hi Chengsong parents: diff changeset	61	Surely, regular expressions must have already been studied to death?
3241b1e71633 hi Chengsong parents: diff changeset	62	What could possibly be \emph{not} known in this area? And surely all
3241b1e71633 hi Chengsong parents: diff changeset	63	implemented algorithms for regular expression matching are blindingly
3241b1e71633 hi Chengsong parents: diff changeset	64	fast?
3241b1e71633 hi Chengsong parents: diff changeset	65
3241b1e71633 hi Chengsong parents: diff changeset	66
3241b1e71633 hi Chengsong parents: diff changeset	67
Unfortunately these preconceptions are not supported by evidence. Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously they do not match, because the expected $b$ in the
last position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with a string of 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
below show:
3241b1e71633 hi Chengsong parents: diff changeset	78
3241b1e71633 hi Chengsong parents: diff changeset	79	\begin{center}
3241b1e71633 hi Chengsong parents: diff changeset	80	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
3241b1e71633 hi Chengsong parents: diff changeset	81	\begin{tikzpicture}
3241b1e71633 hi Chengsong parents: diff changeset	82	\begin{axis}[
3241b1e71633 hi Chengsong parents: diff changeset	83	xlabel={$n$},
3241b1e71633 hi Chengsong parents: diff changeset	84	x label style={at={(1.05,-0.05)}},
3241b1e71633 hi Chengsong parents: diff changeset	85	ylabel={time in secs},
3241b1e71633 hi Chengsong parents: diff changeset	86	enlargelimits=false,
3241b1e71633 hi Chengsong parents: diff changeset	87	xtick={0,5,...,30},
3241b1e71633 hi Chengsong parents: diff changeset	88	xmax=33,
3241b1e71633 hi Chengsong parents: diff changeset	89	ymax=35,
3241b1e71633 hi Chengsong parents: diff changeset	90	ytick={0,5,...,30},
3241b1e71633 hi Chengsong parents: diff changeset	91	scaled ticks=false,
3241b1e71633 hi Chengsong parents: diff changeset	92	axis lines=left,
3241b1e71633 hi Chengsong parents: diff changeset	93	width=5cm,
3241b1e71633 hi Chengsong parents: diff changeset	94	height=4cm,
3241b1e71633 hi Chengsong parents: diff changeset	95	legend entries={JavaScript},
3241b1e71633 hi Chengsong parents: diff changeset	96	legend pos=north west,
3241b1e71633 hi Chengsong parents: diff changeset	97	legend cell align=left]
3241b1e71633 hi Chengsong parents: diff changeset	98	\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
3241b1e71633 hi Chengsong parents: diff changeset	99	\end{axis}
3241b1e71633 hi Chengsong parents: diff changeset	100	\end{tikzpicture}
3241b1e71633 hi Chengsong parents: diff changeset	101	&
3241b1e71633 hi Chengsong parents: diff changeset	102	\begin{tikzpicture}
3241b1e71633 hi Chengsong parents: diff changeset	103	\begin{axis}[
3241b1e71633 hi Chengsong parents: diff changeset	104	xlabel={$n$},
3241b1e71633 hi Chengsong parents: diff changeset	105	x label style={at={(1.05,-0.05)}},
3241b1e71633 hi Chengsong parents: diff changeset	106	%ylabel={time in secs},
3241b1e71633 hi Chengsong parents: diff changeset	107	enlargelimits=false,
3241b1e71633 hi Chengsong parents: diff changeset	108	xtick={0,5,...,30},
3241b1e71633 hi Chengsong parents: diff changeset	109	xmax=33,
3241b1e71633 hi Chengsong parents: diff changeset	110	ymax=35,
3241b1e71633 hi Chengsong parents: diff changeset	111	ytick={0,5,...,30},
3241b1e71633 hi Chengsong parents: diff changeset	112	scaled ticks=false,
3241b1e71633 hi Chengsong parents: diff changeset	113	axis lines=left,
3241b1e71633 hi Chengsong parents: diff changeset	114	width=5cm,
3241b1e71633 hi Chengsong parents: diff changeset	115	height=4cm,
3241b1e71633 hi Chengsong parents: diff changeset	116	legend entries={Python},
3241b1e71633 hi Chengsong parents: diff changeset	117	legend pos=north west,
3241b1e71633 hi Chengsong parents: diff changeset	118	legend cell align=left]
3241b1e71633 hi Chengsong parents: diff changeset	119	\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
3241b1e71633 hi Chengsong parents: diff changeset	120	\end{axis}
3241b1e71633 hi Chengsong parents: diff changeset	121	\end{tikzpicture}
3241b1e71633 hi Chengsong parents: diff changeset	122	&
3241b1e71633 hi Chengsong parents: diff changeset	123	\begin{tikzpicture}
3241b1e71633 hi Chengsong parents: diff changeset	124	\begin{axis}[
3241b1e71633 hi Chengsong parents: diff changeset	125	xlabel={$n$},
3241b1e71633 hi Chengsong parents: diff changeset	126	x label style={at={(1.05,-0.05)}},
3241b1e71633 hi Chengsong parents: diff changeset	127	%ylabel={time in secs},
3241b1e71633 hi Chengsong parents: diff changeset	128	enlargelimits=false,
3241b1e71633 hi Chengsong parents: diff changeset	129	xtick={0,5,...,30},
3241b1e71633 hi Chengsong parents: diff changeset	130	xmax=33,
3241b1e71633 hi Chengsong parents: diff changeset	131	ymax=35,
3241b1e71633 hi Chengsong parents: diff changeset	132	ytick={0,5,...,30},
3241b1e71633 hi Chengsong parents: diff changeset	133	scaled ticks=false,
3241b1e71633 hi Chengsong parents: diff changeset	134	axis lines=left,
3241b1e71633 hi Chengsong parents: diff changeset	135	width=5cm,
3241b1e71633 hi Chengsong parents: diff changeset	136	height=4cm,
3241b1e71633 hi Chengsong parents: diff changeset	137	legend entries={Java 8},
3241b1e71633 hi Chengsong parents: diff changeset	138	legend pos=north west,
3241b1e71633 hi Chengsong parents: diff changeset	139	legend cell align=left]
3241b1e71633 hi Chengsong parents: diff changeset	140	\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
3241b1e71633 hi Chengsong parents: diff changeset	141	\end{axis}
3241b1e71633 hi Chengsong parents: diff changeset	142	\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
3241b1e71633 hi Chengsong parents: diff changeset	144	of the form $\underbrace{aa..a}_{n}$.}
3241b1e71633 hi Chengsong parents: diff changeset	145	\end{tabular}
3241b1e71633 hi Chengsong parents: diff changeset	146	\end{center}
3241b1e71633 hi Chengsong parents: diff changeset	147
\noindent These are clearly abysmal and possibly surprising results.
One would expect these systems to do much better than that. After
all, given a DFA and a string, deciding whether the string is matched by
this DFA should take time linear in the length of the string.
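
A rough count hints at where such running times can come from. As a
simplifying assumption (the engines measured above differ in their
details), suppose a backtracking matcher tries, in the worst case, every
way of splitting the string $\underbrace{aa..a}_{n}$ into non-empty
blocks, one block per iteration of the outer star in $(a^*)^*\,b$,
before it can report failure. Each of the $n-1$ gaps between two
neighbouring $a$'s either is or is not a block boundary, so the number
of such splits is

\begin{center}
$2^{\,n-1}$, \quad for example $2^{27} = 134\,217\,728$ for $n = 28$,
\end{center}

\noindent
and every additional $a$ doubles this number.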
3241b1e71633 hi Chengsong parents: diff changeset	152
Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this ``exponential behaviour''. Unfortunately, such regular
expressions are not just a few ``outliers''; they are
frequent enough that a separate name has been created for
them: \emph{evil regular expressions}. In empirical work, Davis et al.\
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}.
3241b1e71633 hi Chengsong parents: diff changeset	160
This exponential blowup sometimes causes real pain in ``real life'':
for example, on 20 July 2016 one evil regular expression brought the
webpage \href{http://stackexchange.com}{Stack Exchange} to its knees.
3241b1e71633 hi Chengsong parents: diff changeset	164	In this instance, a regular expression intended to just trim white
3241b1e71633 hi Chengsong parents: diff changeset	165	spaces from the beginning and the end of a line actually consumed
3241b1e71633 hi Chengsong parents: diff changeset	166	massive amounts of CPU-resources and because of this the web servers
3241b1e71633 hi Chengsong parents: diff changeset	167	ground to a halt. This happened when a post with 20,000 white spaces
3241b1e71633 hi Chengsong parents: diff changeset	168	was submitted, but importantly the white spaces were neither at the
3241b1e71633 hi Chengsong parents: diff changeset	169	beginning nor at the end. As a result, the regular expression matching
3241b1e71633 hi Chengsong parents: diff changeset	170	engine needed to backtrack over many choices.
3241b1e71633 hi Chengsong parents: diff changeset	171
3241b1e71633 hi Chengsong parents: diff changeset	172	The underlying problem is that many ``real life'' regular expression
3241b1e71633 hi Chengsong parents: diff changeset	173	matching engines do not use DFAs for matching. This is because they
3241b1e71633 hi Chengsong parents: diff changeset	174	support regular expressions that are not covered by the classical
3241b1e71633 hi Chengsong parents: diff changeset	175	automata theory, and in this more general setting there are quite a
3241b1e71633 hi Chengsong parents: diff changeset	176	few research questions still unanswered and fast algorithms still need
3241b1e71633 hi Chengsong parents: diff changeset	177	to be developed.
3241b1e71633 hi Chengsong parents: diff changeset	178
There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but, if it matches, also in \emph{how} it
matches the string. Consider for example a regular expression
$r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
and so on; and a regular expression $r_{id}$ for recognising
identifiers (say, a single character followed by characters or
numbers). One can then form the compound regular expression
$(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how
should the string \textit{iffoo} be tokenised? It could be tokenised
as a keyword followed by an identifier, or the entire string as a
single identifier. Similarly, how should the string \textit{if} be
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire'', so is it an identifier or a keyword? While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, precise definitions of what POSIX matching actually means
have been formalised only relatively recently
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
POSIX matching means to match the longest initial substring, with
possible ties resolved according to some priorities attached to the
regular expressions (e.g.~keywords have a higher priority than
identifiers). This sounds rather simple, but according to Grathwohl et
al.\ \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
3241b1e71633 hi Chengsong parents: diff changeset	204
3241b1e71633 hi Chengsong parents: diff changeset	205	\begin{quote}
3241b1e71633 hi Chengsong parents: diff changeset	206	\it{}``The POSIX strategy is more complicated than the greedy because of
3241b1e71633 hi Chengsong parents: diff changeset	207	the dependence on information about the length of matched strings in the
3241b1e71633 hi Chengsong parents: diff changeset	208	various subexpressions.''
3241b1e71633 hi Chengsong parents: diff changeset	209	\end{quote}
3241b1e71633 hi Chengsong parents: diff changeset	210
3241b1e71633 hi Chengsong parents: diff changeset	211	\noindent
3241b1e71633 hi Chengsong parents: diff changeset	212	This is also supported by evidence collected by Kuklewicz
3241b1e71633 hi Chengsong parents: diff changeset	213	\cite{Kuklewicz} who noticed that a number of POSIX regular expression
3241b1e71633 hi Chengsong parents: diff changeset	214	matchers calculate incorrect results.
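
To make the informal POSIX rule above concrete, consider again the two
strings from the tokenising example, assuming (as suggested above) that
$r_{key}$ has a higher priority than $r_{id}$. The following outcomes
are a sketch derived only from the informal longest-match and priority
rules, not from any of the cited formal definitions:

\begin{center}
\begin{tabular}{lll}
\textbf{String} & \textbf{Tokenisation} & \textbf{Reason}\\
\textit{iffoo} & one identifier \textit{iffoo} &
  $r_{id}$ matches all five characters, $r_{key}$ only two\\
\textit{if} & one keyword \textit{if} &
  both match two characters; the priority decides\\
\end{tabular}
\end{center}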
3241b1e71633 hi Chengsong parents: diff changeset	215
3241b1e71633 hi Chengsong parents: diff changeset	216	Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
3241b1e71633 hi Chengsong parents: diff changeset	217	regular expression matching according to the POSIX strategy
3241b1e71633 hi Chengsong parents: diff changeset	218	\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
3241b1e71633 hi Chengsong parents: diff changeset	219	Brzozowski from 1964 where he introduced the notion of derivatives of
3241b1e71633 hi Chengsong parents: diff changeset	220	regular expressions \cite{Brzozowski1964}. We shall briefly explain
3241b1e71633 hi Chengsong parents: diff changeset	221	the algorithms next.
3241b1e71633 hi Chengsong parents: diff changeset	222
3241b1e71633 hi Chengsong parents: diff changeset	223	\section{The Algorithms by Brzozowski, and Sulzmann and Lu}
3241b1e71633 hi Chengsong parents: diff changeset	224
3241b1e71633 hi Chengsong parents: diff changeset	225	Suppose regular expressions are given by the following grammar (for
3241b1e71633 hi Chengsong parents: diff changeset	226	the moment ignore the grammar for values on the right-hand side):
3241b1e71633 hi Chengsong parents: diff changeset	227
3241b1e71633 hi Chengsong parents: diff changeset	228
3241b1e71633 hi Chengsong parents: diff changeset	229	\begin{center}
3241b1e71633 hi Chengsong parents: diff changeset	230	\begin{tabular}{c@{\hspace{20mm}}c}
3241b1e71633 hi Chengsong parents: diff changeset	231	\begin{tabular}{@{}rrl@{}}
3241b1e71633 hi Chengsong parents: diff changeset	232	\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
3241b1e71633 hi Chengsong parents: diff changeset	233	$r$ & $::=$ & $\ZERO$\\
3241b1e71633 hi Chengsong parents: diff changeset	234	& $\mid$ & $\ONE$ \\
3241b1e71633 hi Chengsong parents: diff changeset	235	& $\mid$ & $c$ \\
3241b1e71633 hi Chengsong parents: diff changeset	236	& $\mid$ & $r_1 \cdot r_2$\\
3241b1e71633 hi Chengsong parents: diff changeset	237	& $\mid$ & $r_1 + r_2$ \\
3241b1e71633 hi Chengsong parents: diff changeset	238	\\
3241b1e71633 hi Chengsong parents: diff changeset	239	& $\mid$ & $r^*$ \\
3241b1e71633 hi Chengsong parents: diff changeset	240	\end{tabular}
3241b1e71633 hi Chengsong parents: diff changeset	241	&
3241b1e71633 hi Chengsong parents: diff changeset	242	\begin{tabular}{@{\hspace{0mm}}rrl@{}}
3241b1e71633 hi Chengsong parents: diff changeset	243	\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
3241b1e71633 hi Chengsong parents: diff changeset	244	$v$ & $::=$ & \\
3241b1e71633 hi Chengsong parents: diff changeset	245	& & $\Empty$ \\
3241b1e71633 hi Chengsong parents: diff changeset	246	& $\mid$ & $\Char(c)$ \\
3241b1e71633 hi Chengsong parents: diff changeset	247	& $\mid$ & $\Seq\,v_1\, v_2$\\
3241b1e71633 hi Chengsong parents: diff changeset	248	& $\mid$ & $\Left(v)$ \\
3241b1e71633 hi Chengsong parents: diff changeset	249	& $\mid$ & $\Right(v)$ \\
& $\mid$ & $\Stars\,[v_1,\ldots,v_n]$ \\
3241b1e71633 hi Chengsong parents: diff changeset	251	\end{tabular}
3241b1e71633 hi Chengsong parents: diff changeset	252	\end{tabular}
3241b1e71633 hi Chengsong parents: diff changeset	253	\end{center}
3241b1e71633 hi Chengsong parents: diff changeset	254
3241b1e71633 hi Chengsong parents: diff changeset	255	\noindent
3241b1e71633 hi Chengsong parents: diff changeset	256	The intended meaning of the regular expressions is as usual: $\ZERO$
3241b1e71633 hi Chengsong parents: diff changeset	257	cannot match any string, $\ONE$ can match the empty string, the
3241b1e71633 hi Chengsong parents: diff changeset	258	character regular expression $c$ can match the character $c$, and so
3241b1e71633 hi Chengsong parents: diff changeset	259	on. The brilliant contribution by Brzozowski is the notion of
3241b1e71633 hi Chengsong parents: diff changeset	260	\emph{derivatives} of regular expressions. The idea behind this
3241b1e71633 hi Chengsong parents: diff changeset	261	notion is as follows: suppose a regular expression $r$ can match a
3241b1e71633 hi Chengsong parents: diff changeset	262	string of the form $c\!::\! s$ (that is a list of characters starting
3241b1e71633 hi Chengsong parents: diff changeset	263	with $c$), what does the regular expression look like that can match
3241b1e71633 hi Chengsong parents: diff changeset	264	just $s$? Brzozowski gave a neat answer to this question. He defined
3241b1e71633 hi Chengsong parents: diff changeset	265	the following operation on regular expressions, written
3241b1e71633 hi Chengsong parents: diff changeset	266	$r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):
3241b1e71633 hi Chengsong parents: diff changeset	267
3241b1e71633 hi Chengsong parents: diff changeset	268	\begin{center}
3241b1e71633 hi Chengsong parents: diff changeset	269	\begin{tabular}{lcl}
3241b1e71633 hi Chengsong parents: diff changeset	270	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
3241b1e71633 hi Chengsong parents: diff changeset	271	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
3241b1e71633 hi Chengsong parents: diff changeset	272	$d \backslash c$ & $\dn$ &
3241b1e71633 hi Chengsong parents: diff changeset	273	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
3241b1e71633 hi Chengsong parents: diff changeset	274	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if}\;\nullable(r_1)$\\
& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
3241b1e71633 hi Chengsong parents: diff changeset	279	\end{tabular}
3241b1e71633 hi Chengsong parents: diff changeset	280	\end{center}
3241b1e71633 hi Chengsong parents: diff changeset	281
3241b1e71633 hi Chengsong parents: diff changeset	282	\noindent
3241b1e71633 hi Chengsong parents: diff changeset	283	In this definition $\nullable(\_)$ stands for a simple recursive
3241b1e71633 hi Chengsong parents: diff changeset	284	function that tests whether a regular expression can match the empty
3241b1e71633 hi Chengsong parents: diff changeset	285	string (its definition is omitted). Assuming the classic notion of a
3241b1e71633 hi Chengsong parents: diff changeset	286	\emph{language} of a regular expression, written $L(\_)$, the main
3241b1e71633 hi Chengsong parents: diff changeset	287	property of the derivative operation is that
3241b1e71633 hi Chengsong parents: diff changeset	288
3241b1e71633 hi Chengsong parents: diff changeset	289	\begin{center}
3241b1e71633 hi Chengsong parents: diff changeset	290	$c\!::\!s \in L(r)$ holds
3241b1e71633 hi Chengsong parents: diff changeset	291	if and only if $s \in L(r\backslash c)$.
3241b1e71633 hi Chengsong parents: diff changeset	292	\end{center}
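
\noindent
For reference, the function $\nullable(\_)$ can be sketched as follows.
This is the standard textbook definition and is given here as an
assumption about the omitted one; it returns a boolean for each form of
regular expression from the grammar above:

\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$\\
$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$\\
$\nullable(c)$ & $\dn$ & $\mathit{false}$\\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$\\
$\nullable(r_1 \cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$\\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$\\
\end{tabular}
\end{center}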
3241b1e71633 hi Chengsong parents: diff changeset	293
3241b1e71633 hi Chengsong parents: diff changeset	294	\noindent
The beauty of derivatives is that they lead to a really simple regular
expression matching algorithm: To find out whether a string $s$
matches a regular expression $r$, build the derivatives of $r$
w.r.t.\ (in succession) all the characters of the string $s$. Finally,
test whether the resulting regular expression can match the empty
string. If it can, then $r$ matches $s$; otherwise it does not.
For us the main advantage is that derivatives can be
straightforwardly implemented in any functional programming language,
and are easily definable and reasoned about in theorem provers, since the
definitions just consist of inductive datatypes and simple recursive
functions. Moreover, the notion of derivatives can be easily
generalised to cover extended regular expression constructors such as
the not-regular expression, written $\neg\,r$, or bounded repetitions
(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
straightforwardly realised within the classic automata approach.
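
As a small worked example of this procedure (the choice of regular
expression and the name $\mathit{matches}$ are assumptions made here for
illustration only), the matcher can be written as
$\mathit{matches}(r, s) \dn \nullable(r\backslash s)$, where the
derivative is extended to strings by $r\backslash [] \dn r$ and
$r\backslash (c\!::\!s) \dn (r\backslash c)\backslash s$. For
$r = (a \cdot b)^*$ and the string $ab$ the clauses of the derivative
give

\begin{center}
\begin{tabular}{lcl}
$r\backslash a$ & $=$ & $((a \cdot b)\backslash a) \cdot r$\\
                & $=$ & $(\ONE \cdot b) \cdot r$\\
$(r\backslash a)\backslash b$ & $=$ & $((\ONE \cdot b)\backslash b) \cdot r$\\
                & $=$ & $(\ZERO \cdot b + \ONE) \cdot r$\\
\end{tabular}
\end{center}

\noindent
The last regular expression can match the empty string: the
alternative $\ONE$ does so, and so does the star $r$. Hence
$(a \cdot b)^*$ matches $ab$.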
3241b1e71633 hi Chengsong parents: diff changeset	310
3241b1e71633 hi Chengsong parents: diff changeset	311
3241b1e71633 hi Chengsong parents: diff changeset	312	One limitation, however, of Brzozowski's algorithm is that it only
3241b1e71633 hi Chengsong parents: diff changeset	313	produces a YES/NO answer for whether a string is being matched by a
3241b1e71633 hi Chengsong parents: diff changeset	314	regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
3241b1e71633 hi Chengsong parents: diff changeset	315	algorithm to allow generation of an actual matching, called a
3241b1e71633 hi Chengsong parents: diff changeset	316	\emph{value}---see the grammar above for its definition. Assuming a
3241b1e71633 hi Chengsong parents: diff changeset	317	regular expression matches a string, values encode the information of
3241b1e71633 hi Chengsong parents: diff changeset	318	\emph{how} the string is matched by the regular expression---that is,
3241b1e71633 hi Chengsong parents: diff changeset	319	which part of the string is matched by which part of the regular
3241b1e71633 hi Chengsong parents: diff changeset	320	expression. To illustrate this consider the string $xy$ and the
3241b1e71633 hi Chengsong parents: diff changeset	321	regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two star
``iterations'', then the $x$ is matched by the left-most alternative
in this tree and the $y$ by the right-left alternative. This suggests
recording this matching as
3241b1e71633 hi Chengsong parents: diff changeset	326
3241b1e71633 hi Chengsong parents: diff changeset	327	\begin{center}
3241b1e71633 hi Chengsong parents: diff changeset	328	$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
3241b1e71633 hi Chengsong parents: diff changeset	329	\end{center}
3241b1e71633 hi Chengsong parents: diff changeset	330
3241b1e71633 hi Chengsong parents: diff changeset	331	\noindent
where $\Stars$ records how many
iterations were used, and $\Left$ and $\Right$ record which
alternative was used. The value for
matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
would look as follows
3241b1e71633 hi Chengsong parents: diff changeset	337
3241b1e71633 hi Chengsong parents: diff changeset	338	\begin{center}
$\Stars\,[\Right(\Right(\Seq\,(\Char\,x)\,(\Char\,y)))]$
3241b1e71633 hi Chengsong parents: diff changeset	340	\end{center}
3241b1e71633 hi Chengsong parents: diff changeset	341
3241b1e71633 hi Chengsong parents: diff changeset	342	\noindent
3241b1e71633 hi Chengsong parents: diff changeset	343	where $\Stars$ has only a single-element list for the single iteration
3241b1e71633 hi Chengsong parents: diff changeset	344	and $\Seq$ indicates that $xy$ is matched by a sequence regular
3241b1e71633 hi Chengsong parents: diff changeset	345	expression.
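
One way to see that both values describe a matching of the same string
is to flatten them. The following function, written $|v|$, is a sketch
of the usual flattening operation on values; its name and notation are
assumptions made here, with $[]$ standing for the empty string and $@$
for string concatenation:

\begin{center}
\begin{tabular}{lcl}
$|\Empty|$ & $\dn$ & $[]$\\
$|\Char(c)|$ & $\dn$ & $[c]$\\
$|\Left(v)|$ & $\dn$ & $|v|$\\
$|\Right(v)|$ & $\dn$ & $|v|$\\
$|\Seq\,v_1\,v_2|$ & $\dn$ & $|v_1| \,@\, |v_2|$\\
$|\Stars\,[v_1,\ldots,v_n]|$ & $\dn$ & $|v_1| \,@ \ldots @\, |v_n|$\\
\end{tabular}
\end{center}

\noindent
For the first value above this gives
$|\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]|
= |\Char\,x| \,@\, |\Char\,y| = xy$; flattening the second value
yields $xy$ as well.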
3241b1e71633 hi Chengsong parents: diff changeset	346
3241b1e71633 hi Chengsong parents: diff changeset	347	The contribution of Sulzmann and Lu is an extension of Brzozowski's
3241b1e71633 hi Chengsong parents: diff changeset	348	algorithm by a second phase (the first phase being building successive
3241b1e71633 hi Chengsong parents: diff changeset	349	derivatives). In this second phase, for every successful match the
3241b1e71633 hi Chengsong parents: diff changeset	350	corresponding POSIX value is computed. As mentioned before, from this
3241b1e71633 hi Chengsong parents: diff changeset	351	value we can extract the information \emph{how} a regular expression
3241b1e71633 hi Chengsong parents: diff changeset	352	matched a string. We omit the details here on how Sulzmann and Lu
3241b1e71633 hi Chengsong parents: diff changeset	353	achieved this~(see \cite{Sulzmann2014}). Rather, we shall focus next on the
3241b1e71633 hi Chengsong parents: diff changeset	354	process of simplification of regular expressions, which is needed in
3241b1e71633 hi Chengsong parents: diff changeset	355	order to obtain \emph{fast} versions of Brzozowski's, and Sulzmann
3241b1e71633 hi Chengsong parents: diff changeset	356	and Lu's, algorithms. This is where the PhD-project hopes to advance
3241b1e71633 hi Chengsong parents: diff changeset	357	the state-of-the-art.
3241b1e71633 hi Chengsong parents: diff changeset	358
3241b1e71633 hi Chengsong parents: diff changeset	359
3241b1e71633 hi Chengsong parents: diff changeset	360	\section{Simplification of Regular Expressions}
3241b1e71633 hi Chengsong parents: diff changeset	361
3241b1e71633 hi Chengsong parents: diff changeset	362	The main drawback of building successive derivatives according to
3241b1e71633 hi Chengsong parents: diff changeset	363	Brzozowski's definition is that they can grow very quickly in size.
3241b1e71633 hi Chengsong parents: diff changeset	364	This is mainly due to the fact that the derivative operation often
3241b1e71633 hi Chengsong parents: diff changeset	365	generates ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result,
3241b1e71633 hi Chengsong parents: diff changeset	366	if implemented naively, both algorithms by Brzozowski and by Sulzmann
3241b1e71633 hi Chengsong parents: diff changeset	367	and Lu are excruciatingly slow. For example, when starting with the
3241b1e71633 hi Chengsong parents: diff changeset	368	regular expression $(a + aa)^*$ and building 12 successive derivatives
3241b1e71633 hi Chengsong parents: diff changeset	369	w.r.t.~the character $a$, one obtains a derivative regular expression
3241b1e71633 hi Chengsong parents: diff changeset	370	with more than 8000 nodes (when viewed as a tree). Operations like
3241b1e71633 hi Chengsong parents: diff changeset	371	derivative and $\nullable$ need to traverse such trees and
3241b1e71633 hi Chengsong parents: diff changeset	372	consequently the bigger the size of the derivative the slower the
3241b1e71633 hi Chengsong parents: diff changeset	373	algorithm. Fortunately, one can simplify regular expressions after
3241b1e71633 hi Chengsong parents: diff changeset	374	each derivative step. Various simplifications of regular expressions
3241b1e71633 hi Chengsong parents: diff changeset	375	are possible, such as the simplifications of $\ZERO + r$,
3241b1e71633 hi Chengsong parents: diff changeset	376	$r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
3241b1e71633 hi Chengsong parents: diff changeset	377	$r$. These simplifications do not affect the answer for whether a
3241b1e71633 hi Chengsong parents: diff changeset	378	regular expression matches a string or not, but fortunately also do
3241b1e71633 hi Chengsong parents: diff changeset	379	not affect the POSIX strategy of how regular expressions match
3241b1e71633 hi Chengsong parents: diff changeset	380	strings---although the latter is much harder to establish. Some
3241b1e71633 hi Chengsong parents: diff changeset	381	initial results in this regard have been obtained in
3241b1e71633 hi Chengsong parents: diff changeset	382	\cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
3241b1e71633 hi Chengsong parents: diff changeset	383	is a very tight bound for the size. Such a tight bound is suggested by
3241b1e71633 hi Chengsong parents: diff changeset	384	the work of Antimirov, who proved that (partial) derivatives can be bounded
3241b1e71633 hi Chengsong parents: diff changeset	385	by the number of characters contained in the initial regular
3241b1e71633 hi Chengsong parents: diff changeset	386	expression \cite{Antimirov95}. We believe, and have generated test
3241b1e71633 hi Chengsong parents: diff changeset	387	data, that a similar bound can be obtained for the derivatives in
3241b1e71633 hi Chengsong parents: diff changeset	388	Sulzmann and Lu's algorithm. Let us give some details about this next.
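
To make the effect of these simplifications concrete, the following is
a minimal Python sketch, not the implementation evaluated in this
work. The tuple encoding of regular expressions and the names
\texttt{der}, \texttt{simp}, \texttt{size} and \texttt{ders} are
assumptions made for this illustration; \texttt{simp} applies only the
five rules listed above.

```python
# Regular expressions as nested tuples (an encoding chosen for this sketch):
#   ("ZERO",) ("ONE",) ("CHAR", c) ("ALT", r1, r2) ("SEQ", r1, r2) ("STAR", r)
ZERO, ONE = ("ZERO",), ("ONE",)

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    if tag in ("ZERO", "CHAR"):
        return False
    if tag in ("ONE", "STAR"):
        return True
    if tag == "ALT":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # SEQ

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("ZERO", "ONE"):
        return ZERO
    if tag == "CHAR":
        return ONE if r[1] == c else ZERO
    if tag == "ALT":
        return ("ALT", der(c, r[1]), der(c, r[2]))
    if tag == "SEQ":
        left = ("SEQ", der(c, r[1]), r[2])
        return ("ALT", left, der(c, r[2])) if nullable(r[1]) else left
    return ("SEQ", der(c, r[1]), r)                 # STAR

def simp(r):
    """Apply the five rules from the text, bottom-up."""
    tag = r[0]
    if tag == "ALT":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO:
            return r2                               # 0 + r -> r
        if r2 == ZERO or r1 == r2:
            return r1                               # r + 0 -> r,  r + r -> r
        return ("ALT", r1, r2)
    if tag == "SEQ":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ONE:
            return r2                               # 1 . r -> r
        if r2 == ONE:
            return r1                               # r . 1 -> r
        return ("SEQ", r1, r2)
    return r

def size(r):
    """Number of nodes of r when viewed as a tree."""
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def ders(s, r, simplify=False):
    """Successive derivatives of r w.r.t. the characters of s."""
    for c in s:
        r = der(c, r)
        if simplify:
            r = simp(r)
    return r

a = ("CHAR", "a")
r = ("STAR", ("ALT", a, ("SEQ", a, a)))             # (a + aa)*
plain = ders("a" * 12, r)
small = ders("a" * 12, r, simplify=True)
print(size(plain), size(small))
```

\noindent
On this example the simplified derivative is smaller than the
unsimplified one and gives the same matching answer, but these five
rules alone may still let the derivatives grow.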
3241b1e71633 hi Chengsong parents: diff changeset	389
3241b1e71633 hi Chengsong parents: diff changeset	390	We first followed Sulzmann and Lu's idea of introducing
3241b1e71633 hi Chengsong parents: diff changeset	391	\emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
3241b1e71633 hi Chengsong parents: diff changeset	392	defined by the following grammar:
3241b1e71633 hi Chengsong parents: diff changeset	393
3241b1e71633 hi Chengsong parents: diff changeset	394	\begin{center}
3241b1e71633 hi Chengsong parents: diff changeset	395	\begin{tabular}{lcl}
3241b1e71633 hi Chengsong parents: diff changeset	396	$\textit{a}$ & $::=$ & $\textit{ZERO}$\\
3241b1e71633 hi Chengsong parents: diff changeset	397	& $\mid$ & $\textit{ONE}\;\;bs$\\
3241b1e71633 hi Chengsong parents: diff changeset	398	& $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
3241b1e71633 hi Chengsong parents: diff changeset	399	& $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
3241b1e71633 hi Chengsong parents: diff changeset	400	& $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
3241b1e71633 hi Chengsong parents: diff changeset	401	& $\mid$ & $\textit{STAR}\;\;bs\,a$
3241b1e71633 hi Chengsong parents: diff changeset	402	\end{tabular}
3241b1e71633 hi Chengsong parents: diff changeset	403	\end{center}
3241b1e71633 hi Chengsong parents: diff changeset	404
3241b1e71633 hi Chengsong parents: diff changeset	405	\noindent
3241b1e71633 hi Chengsong parents: diff changeset	406	where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
3241b1e71633 hi Chengsong parents: diff changeset	407	list of annotated regular expressions. These bitsequences encode
3241b1e71633 hi Chengsong parents: diff changeset	408	information about the (POSIX) value that should be generated by the
3241b1e71633 hi Chengsong parents: diff changeset	409	Sulzmann and Lu algorithm. There are operations that can transform the
3241b1e71633 hi Chengsong parents: diff changeset	410	usual (un-annotated) regular expressions into annotated regular
3241b1e71633 hi Chengsong parents: diff changeset	411	expressions, and there are operations for encoding/decoding values to
3241b1e71633 hi Chengsong parents: diff changeset	412	or from bitsequences. For example, the encoding function for values is
3241b1e71633 hi Chengsong parents: diff changeset	413	defined as follows:
3241b1e71633 hi Chengsong parents: diff changeset	414
3241b1e71633 hi Chengsong parents: diff changeset	415	\begin{center}
3241b1e71633 hi Chengsong parents: diff changeset	416	\begin{tabular}{lcl}
3241b1e71633 hi Chengsong parents: diff changeset	417	$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
3241b1e71633 hi Chengsong parents: diff changeset	418	$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
3241b1e71633 hi Chengsong parents: diff changeset	419	$\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: \textit{code}(v)$\\
3241b1e71633 hi Chengsong parents: diff changeset	420	$\textit{code}(\Right\,v)$ & $\dn$ & $\S :: \textit{code}(v)$\\
3241b1e71633 hi Chengsong parents: diff changeset	421	$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $\textit{code}(v_1) \,@\, \textit{code}(v_2)$\\
3241b1e71633 hi Chengsong parents: diff changeset	422	$\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
3241b1e71633 hi Chengsong parents: diff changeset	423	$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: \textit{code}(v) \;@\;
3241b1e71633 hi Chengsong parents: diff changeset	424	\textit{code}(\Stars\,vs)$
3241b1e71633 hi Chengsong parents: diff changeset	425	\end{tabular}
3241b1e71633 hi Chengsong parents: diff changeset	426	\end{center}
3241b1e71633 hi Chengsong parents: diff changeset	427
3241b1e71633 hi Chengsong parents: diff changeset	428	\noindent
3241b1e71633 hi Chengsong parents: diff changeset	429	where $\Z$ and $\S$ are arbitrary names for the ``bits'' in the
3241b1e71633 hi Chengsong parents: diff changeset	430	bitsequences. Although this encoding is ``lossy'' in the sense of not
3241b1e71633 hi Chengsong parents: diff changeset	431	recording all details of a value, Sulzmann and Lu have defined the
3241b1e71633 hi Chengsong parents: diff changeset	432	decoding function such that with additional information (namely the
3241b1e71633 hi Chengsong parents: diff changeset	433	corresponding regular expression) a value can be precisely extracted
3241b1e71633 hi Chengsong parents: diff changeset	434	from a bitsequence.
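
The interplay of encoding and decoding can be sketched in a few lines
of Python. The function \texttt{code} follows the clauses above; the
function \texttt{decode} is a reconstruction in the spirit of Sulzmann
and Lu's decoding function, not a transcription of it. The tuple
encodings of values and regular expressions, and the assumption that
the alternatives of the example nest to the right as in
$(x + (y + xy))^*$, are made for this illustration only.

```python
# Values:  ("Empty",) ("Char", c) ("Left", v) ("Right", v)
#          ("Seq", v1, v2) ("Stars", (v1, ..., vn))
# Regexes: ("ONE",) ("CHAR", c) ("ALT", r1, r2) ("SEQ", r1, r2) ("STAR", r)
# Bits are the strings "Z" and "S".

def code(v):
    """Encode a value as a list of bits, following the clauses in the text."""
    tag = v[0]
    if tag in ("Empty", "Char"):
        return []
    if tag == "Left":
        return ["Z"] + code(v[1])
    if tag == "Right":
        return ["S"] + code(v[1])
    if tag == "Seq":
        return code(v[1]) + code(v[2])
    bits = []                       # Stars: a Z before each iteration ...
    for item in v[1]:
        bits += ["Z"] + code(item)
    return bits + ["S"]             # ... and a final S ending the list

def decode_aux(bs, r):
    """Rebuild a value from bits, guided by r; returns (value, unused bits)."""
    tag = r[0]
    if tag == "ONE":
        return ("Empty",), bs
    if tag == "CHAR":
        return ("Char", r[1]), bs   # the character comes from r, not the bits
    if tag == "ALT":
        went_left = bs[0] == "Z"
        v, rest = decode_aux(bs[1:], r[1] if went_left else r[2])
        return ("Left" if went_left else "Right", v), rest
    if tag == "SEQ":
        v1, rest = decode_aux(bs, r[1])
        v2, rest = decode_aux(rest, r[2])
        return ("Seq", v1, v2), rest
    if tag == "STAR":
        items = []
        while bs[0] == "Z":
            v, bs = decode_aux(bs[1:], r[1])
            items.append(v)
        return ("Stars", tuple(items)), bs[1:]
    raise ValueError("ZERO has no values")

def decode(bs, r):
    """Return the decoded value, or None if bits are left over."""
    v, rest = decode_aux(bs, r)
    return v if not rest else None

x, y = ("CHAR", "x"), ("CHAR", "y")
r = ("STAR", ("ALT", x, ("ALT", y, ("SEQ", x, y))))   # (x + (y + xy))*

two_iterations = ("Stars", (("Left", ("Char", "x")),
                            ("Right", ("Left", ("Char", "y")))))
one_iteration = ("Stars", (("Right", ("Right",
                            ("Seq", ("Char", "x"), ("Char", "y")))),))

print(code(two_iterations))
print(code(one_iteration))
```

\noindent
The bitsequence alone does not mention the characters $x$ and $y$;
\texttt{decode} recovers them from the regular expression, which is
why the encoding can afford to be ``lossy''.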
3241b1e71633 hi Chengsong parents: diff changeset	435
3241b1e71633 hi Chengsong parents: diff changeset	436	The main point of the bitsequences and annotated regular expressions
3241b1e71633 hi Chengsong parents: diff changeset	437	is that we can apply rather aggressive (in terms of size)
3241b1e71633 hi Chengsong parents: diff changeset	438	simplification rules in order to keep derivatives small. We have
3241b1e71633 hi Chengsong parents: diff changeset	439	developed such ``aggressive'' simplification rules and generated test
3241b1e71633 hi Chengsong parents: diff changeset	440	data that show that the expected bound can be achieved. Obviously we
3241b1e71633 hi Chengsong parents: diff changeset	441	could only partially cover the search space as there are infinitely
3241b1e71633 hi Chengsong parents: diff changeset	442	many regular expressions and strings. One modification we introduced
3241b1e71633 hi Chengsong parents: diff changeset	443	is to allow a list of annotated regular expressions in the
3241b1e71633 hi Chengsong parents: diff changeset	444	\textit{ALTS} constructor. This allows us to not just delete
3241b1e71633 hi Chengsong parents: diff changeset	445	unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
3241b1e71633 hi Chengsong parents: diff changeset	446	unnecessary ``copies'' of regular expressions (very similar to
3241b1e71633 hi Chengsong parents: diff changeset	447	simplifying $r + r$ to just $r$, but in a more general
3241b1e71633 hi Chengsong parents: diff changeset	448	setting). Another modification is that we use simplification rules
3241b1e71633 hi Chengsong parents: diff changeset	449	inspired by Antimirov's work on partial derivatives. They maintain the
3241b1e71633 hi Chengsong parents: diff changeset	450	idea that only the first ``copy'' of a regular expression in an
3241b1e71633 hi Chengsong parents: diff changeset	451	alternative contributes to the calculation of a POSIX value. All
3241b1e71633 hi Chengsong parents: diff changeset	452	subsequent copies can be pruned from the regular expression.
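
The following Python sketch illustrates how a list-based
\textit{ALTS} constructor makes such pruning straightforward. It is a
simplified illustration under stated assumptions, not the rule set
under verification: the tuple encoding, the helper names
\texttt{fuse}, \texttt{erase} and \texttt{simp}, and the choice to
compare copies after erasing their bitsequences are all hypothetical,
and only the \textit{ALTS} case is treated.

```python
# Annotated regular expressions as tuples, bitsequences as tuples of "Z"/"S":
#   ("ZERO",) ("ONE", bs) ("CHAR", bs, c) ("ALTS", bs, (a1, ..., an))
#   ("SEQ", bs, a1, a2) ("STAR", bs, a)
ZERO = ("ZERO",)

def fuse(bs, a):
    """Prepend the bits bs to the bitsequence stored at the root of a."""
    if a[0] == "ZERO":
        return a
    return (a[0], bs + a[1]) + a[2:]

def erase(a):
    """Forget all bitsequences, so that copies can be compared."""
    tag = a[0]
    if tag == "ZERO":
        return a
    if tag == "ALTS":
        return (tag, tuple(erase(x) for x in a[2]))
    if tag in ("SEQ", "STAR"):
        return (tag,) + tuple(erase(x) for x in a[2:])
    return (tag,) + a[2:]                    # ONE and CHAR

def simp(a):
    """Flatten nested ALTS, drop ZEROs and keep only the first copy."""
    if a[0] != "ALTS":
        return a                             # other cases omitted in this sketch
    bs, alternatives = a[1], a[2]
    flat = []
    for x in map(simp, alternatives):
        if x[0] == "ZERO":
            continue                         # a ZERO alternative never matches
        if x[0] == "ALTS":
            flat.extend(fuse(x[1], y) for y in x[2])   # splice nested list
        else:
            flat.append(x)
    seen, kept = set(), []
    for x in flat:
        key = erase(x)
        if key not in seen:                  # later copies are pruned
            seen.add(key)
            kept.append(x)
    if not kept:
        return ZERO
    if len(kept) == 1:
        return fuse(bs, kept[0])
    return ("ALTS", bs, tuple(kept))

example = ("ALTS", (), (
    ("CHAR", ("Z",), "a"),
    ZERO,
    ("ALTS", ("S",), (("CHAR", ("Z",), "a"), ("CHAR", ("S",), "b"))),
))
print(simp(example))
```

\noindent
In the example the second copy of $a$, which sits inside a nested
\textit{ALTS}, is removed, while the bits of the surviving
alternatives still record the path by which they were reached.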
3241b1e71633 hi Chengsong parents: diff changeset	453
3241b1e71633 hi Chengsong parents: diff changeset	454	We are currently engaged in proving that our simplification rules
3241b1e71633 hi Chengsong parents: diff changeset	455	actually do not affect the POSIX value that should be generated by the
3241b1e71633 hi Chengsong parents: diff changeset	456	algorithm according to the specification of a POSIX value and
3241b1e71633 hi Chengsong parents: diff changeset	457	furthermore that all our derivatives stay small. For
3241b1e71633 hi Chengsong parents: diff changeset	458	this proof we use the theorem prover Isabelle. Once completed, this
3241b1e71633 hi Chengsong parents: diff changeset	459	result will advance the state-of-the-art: Sulzmann and Lu wrote in
3241b1e71633 hi Chengsong parents: diff changeset	460	their paper \cite{Sulzmann2014} about the bitcoded ``incremental
3241b1e71633 hi Chengsong parents: diff changeset	461	parsing method'' (that is the matching algorithm outlined in this
3241b1e71633 hi Chengsong parents: diff changeset	462	section):
3241b1e71633 hi Chengsong parents: diff changeset	463
3241b1e71633 hi Chengsong parents: diff changeset	464	\begin{quote}\it
3241b1e71633 hi Chengsong parents: diff changeset	465	``Correctness Claim: We further claim that the incremental parsing
3241b1e71633 hi Chengsong parents: diff changeset	466	method in Figure~5 in combination with the simplification steps in
3241b1e71633 hi Chengsong parents: diff changeset	467	Figure 6 yields POSIX parse trees. We have tested this claim
3241b1e71633 hi Chengsong parents: diff changeset	468	extensively by using the method in Figure~3 as a reference but yet
3241b1e71633 hi Chengsong parents: diff changeset	469	have to work out all proof details.''
3241b1e71633 hi Chengsong parents: diff changeset	470	\end{quote}
3241b1e71633 hi Chengsong parents: diff changeset	471
3241b1e71633 hi Chengsong parents: diff changeset	472	\noindent
3241b1e71633 hi Chengsong parents: diff changeset	473	We would settle the correctness claim and furthermore obtain a much
3241b1e71633 hi Chengsong parents: diff changeset	474	tighter bound on the sizes of derivatives. The result is that our
3241b1e71633 hi Chengsong parents: diff changeset	475	algorithm should be correct and faster on all inputs. The original
3241b1e71633 hi Chengsong parents: diff changeset	476	blow-up, as observed in JavaScript, Python and Java, would be excluded
3241b1e71633 hi Chengsong parents: diff changeset	477	from happening in our algorithm.
3241b1e71633 hi Chengsong parents: diff changeset	478
3241b1e71633 hi Chengsong parents: diff changeset	479	\section{Conclusion}
3241b1e71633 hi Chengsong parents: diff changeset	480
3241b1e71633 hi Chengsong parents: diff changeset	481	In this PhD-project we are interested in fast algorithms for regular
3241b1e71633 hi Chengsong parents: diff changeset	482	expression matching. While this seems to be a ``settled'' area, in
3241b1e71633 hi Chengsong parents: diff changeset	483	fact interesting research questions are popping up as soon as one steps
3241b1e71633 hi Chengsong parents: diff changeset	484	outside the classic automata theory (for example in terms of what kind
3241b1e71633 hi Chengsong parents: diff changeset	485	of regular expressions are supported). The reason why it is
3241b1e71633 hi Chengsong parents: diff changeset	486	interesting for us to look at the derivative approach introduced by
3241b1e71633 hi Chengsong parents: diff changeset	487	Brzozowski for regular expression matching, and then much further
3241b1e71633 hi Chengsong parents: diff changeset	488	developed by Sulzmann and Lu, is that derivatives can elegantly deal
3241b1e71633 hi Chengsong parents: diff changeset	489	with some of the regular expressions that are of interest in ``real
3241b1e71633 hi Chengsong parents: diff changeset	490	life''. This includes the not-regular expression, written $\neg\,r$
3241b1e71633 hi Chengsong parents: diff changeset	491	(that is, all strings that are not recognised by $r$), but also bounded
3241b1e71633 hi Chengsong parents: diff changeset	492	regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is
3241b1e71633 hi Chengsong parents: diff changeset	493	also hope that the derivatives can provide another angle for how to
3241b1e71633 hi Chengsong parents: diff changeset	494	deal more efficiently with back-references, which are one of the
3241b1e71633 hi Chengsong parents: diff changeset	495	reasons why regular expression engines in JavaScript, Python and Java
3241b1e71633 hi Chengsong parents: diff changeset	496	choose not to implement the classic automata approach of transforming
3241b1e71633 hi Chengsong parents: diff changeset	497	regular expressions into NFAs and then DFAs---because we simply do not
3241b1e71633 hi Chengsong parents: diff changeset	498	know how such back-references can be represented by DFAs.
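
To illustrate why derivatives handle such constructors elegantly, the
clauses for $\neg\,r$ and $r^{\{n\}}$ can be stated in a few lines.
They are the standard definitions rather than something established
in this paper, and the notation $\textit{der}\,c\,r$ for the
derivative of $r$ with respect to the character $c$ is assumed for
this sketch:

\begin{center}
\begin{tabular}{lcl}
$\nullable(\neg\,r)$ & $\dn$ & $\textit{not}\;(\nullable(r))$\\
$\textit{der}\,c\,(\neg\,r)$ & $\dn$ & $\neg\,(\textit{der}\,c\,r)$\\
$\nullable(r^{\{n\}})$ & $\dn$ & $n = 0 \;\vee\; \nullable(r)$\\
$\textit{der}\,c\,(r^{\{n\}})$ & $\dn$ & $\ZERO$ \quad if $n = 0$\\
 & & $(\textit{der}\,c\,r)\cdot r^{\{n-1\}}$ \quad otherwise
\end{tabular}
\end{center}

\noindent
In both cases the derivative is again a regular expression built from
the same constructors, so no complement automaton and no unfolding of
the counter into $n$ copies of $r$ is needed.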
3241b1e71633 hi Chengsong parents: diff changeset	499
3241b1e71633 hi Chengsong parents: diff changeset	500
3241b1e71633 hi Chengsong parents: diff changeset	501	\bibliographystyle{plain}
3241b1e71633 hi Chengsong parents: diff changeset	502	\bibliography{root}
3241b1e71633 hi Chengsong parents: diff changeset	503
3241b1e71633 hi Chengsong parents: diff changeset	504
3241b1e71633 hi Chengsong parents: diff changeset	505	\end{document}

author	Christian Urban <urbanc@in.tum.de>
	Tue, 25 Jun 2019 22:43:21 +0100
changeset 18	4a9c9085fb85
parent 17	3241b1e71633
child 22	feffec3af1a1
permissions	-rw-r--r--