\documentclass[a4paper,UKenglish]{lipics}
\usepackage{graphic}
\usepackage{data}
\usepackage{tikz-cd}
\usepackage{algorithm}
\usepackage{amsmath}
\usepackage[noend]{algpseudocode}
\usepackage{enumitem}
% \documentclass{article}
%\usepackage[utf8]{inputenc}
%\usepackage[english]{babel}
%\usepackage{listings}
% \usepackage{amsthm}
% \usepackage{hyperref}
% \usepackage[margin=0.5in]{geometry}
%\usepackage{pmboxdraw}

\title{POSIX Regular Expression Matching and Lexing}
\author{Chengsong Tan}
\affil{King's College London\\
London, UK\\
\texttt{chengsong.tan@kcl.ac.uk}}
\authorrunning{Chengsong Tan}
\Copyright{Chengsong Tan}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}

%\theoremstyle{theorem}
%\newtheorem{theorem}{Theorem}
%\theoremstyle{lemma}
%\newtheorem{lemma}{Lemma}
%\newcommand{\lemmaautorefname}{Lemma}
%\theoremstyle{definition}
%\newtheorem{definition}{Definition}
\algnewcommand\algorithmicswitch{\textbf{switch}}
\algnewcommand\algorithmiccase{\textbf{case}}
\algnewcommand\algorithmicassert{\texttt{assert}}
\algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
% New "environments"
\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
\algtext*{EndSwitch}%
\algtext*{EndCase}%

\begin{document}

\maketitle

\begin{abstract}
Brzozowski introduced in 1964 a beautifully simple algorithm for
regular expression matching based on the notion of derivatives of
regular expressions. In 2014, Sulzmann and Lu extended this
algorithm to not just give a YES/NO answer for whether or not a
regular expression matches a string, but in case it matches to also
answer with \emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. We have already developed some
simplification rules for this, but have not yet proved that they
preserve the correctness of the algorithm. We also have not yet
looked at extended regular expressions, such as bounded repetitions,
negation and back-references.
\end{abstract}

\section{Introduction}

This PhD-project is about regular expression matching and
lexing. Given the maturity of this topic, the reader might wonder:
Surely, regular expressions must have already been studied to death?
What could possibly be \emph{not} known in this area? And surely all
implemented algorithms for regular expression matching are blindingly
fast?

Unfortunately these preconceptions are not supported by evidence: Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously they do not match---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings like 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
below show:

\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether a string is matched by this
DFA should be linear in the length of the string.

Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this ``exponential behaviour''. Unfortunately, such regular
expressions are not just a few ``outliers'': they are frequent enough
that a separate name has been created for
them---\emph{evil regular expressions}. In empirical work, Davis et al
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}.

This exponential blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU-resources and because of this the web servers
ground to a halt. This happened when a post with 20,000 white spaces was
submitted, but importantly the white spaces were neither at the
beginning nor at the end. As a result, the regular expression matching
engine needed to backtrack over many choices. The underlying problem is
that many ``real life'' regular expression matching engines do not use
DFAs for matching. This is because they support regular expressions that
are not covered by the classical automata theory, and in this more
general setting there are quite a few research questions still
unanswered and fast algorithms still need to be developed (for example
how to include bounded repetitions, negation and back-references).

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but if it matches also in \emph{how} it
matches the string. Consider for example a regular expression
$r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
and so on; and a regular expression $r_{id}$ for recognising
identifiers (say, a single character followed by characters or
numbers). One can then form the compound regular expression
$(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how
should the string \textit{iffoo} be tokenised? It could be tokenised
as a keyword followed by an identifier, or the entire string as a
single identifier. Similarly, how should the string \textit{if} be
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire''---so is it an identifier or a keyword? While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, only relatively recently precise definitions of what POSIX
matching actually means have been formalised
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}.
Such a definition has also been given by Sulzmann and Lu \cite{Sulzmann2014}, but the
corresponding correctness proof turned out to be faulty \cite{AusafDyckhoffUrban2016}.
Roughly, POSIX matching means matching the longest initial substring.
In the case of a tie, the initial submatch is chosen according to some priorities attached to the
regular expressions (e.g.~keywords have a higher priority than
identifiers). This sounds rather simple, but according to Grathwohl et
al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:

\begin{quote}
\it{}``The POSIX strategy is more complicated than the greedy because of
the dependence on information about the length of matched strings in the
various subexpressions.''
\end{quote}

\noindent
This is also supported by evidence collected by Kuklewicz
\cite{Kuklewicz} who noticed that a number of POSIX regular expression
matchers calculate incorrect results.

Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
regular expression matching according to the POSIX strategy
\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
Brzozowski from 1964 where he introduced the notion of derivatives of
regular expressions \cite{Brzozowski1964}. We shall briefly explain
this algorithm next.

\section{The Algorithm by Brzozowski based on Derivatives of Regular
  Expressions}

Suppose (basic) regular expressions are given by the following grammar:
\[ r ::=   \ZERO \mid  \ONE
         \mid  c
         \mid  r_1 \cdot r_2
         \mid  r_1 + r_2
         \mid r^*
\]

\noindent
The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.

The ingenious contribution by Brzozowski is the notion of
\emph{derivatives} of regular expressions. The idea behind this
notion is as follows: suppose a regular expression $r$ can match a
string of the form $c\!::\! s$ (that is a list of characters starting
with $c$), what does the regular expression look like that can match
just $s$? Brzozowski gave a neat answer to this question. He started
with the definition of $\nullable$:
\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$         & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
This function simply tests whether the empty string is in $L(r)$.
He then defined
the following operation on regular expressions, written
$r\backslash c$ (the derivative of $r$ w.r.t.~the character $c$):

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
$d \backslash c$     & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
  & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
  & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

%Assuming the classic notion of a
%\emph{language} of a regular expression, written $L(\_)$, t

\noindent
The main property of the derivative operation is that

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
For us the main advantage is that derivatives can be
straightforwardly implemented in any functional programming language,
and are easily definable and reasoned about in theorem provers---the
definitions just consist of inductive datatypes and simple recursive
functions. Moreover, the notion of derivatives can be easily
generalised to cover extended regular expression constructors such as
the not-regular expression, written $\neg\,r$, or bounded repetitions
(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
straightforwardly realised within the classic automata approach.
For the moment however, we focus only on the usual basic regular expressions.


Now if we want to find out whether a string $s$ matches a regular
expression $r$, we build the derivatives of $r$ w.r.t.\ (in succession)
all the characters of the string $s$. Finally, we test whether the
resulting regular expression can match the empty string. If yes, then
$r$ matches $s$; if no, it does not. To implement this idea
we can generalise the derivative operation to strings like this:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define the regular-expression matching algorithm as:
\[
match\;s\;r \;\dn\; \nullable(r\backslash s)
\]

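
\noindent
To make the definitions concrete, here is a small sketch of this matcher in
Scala (the transcription and the constructor names \texttt{ZERO}, \texttt{ONE},
\texttt{CHAR}, \texttt{ALT}, \texttt{SEQ} and \texttt{STAR} are our own choices
for illustration; it is not meant as a definitive implementation):

\begin{verbatim}
// Regular expressions as an algebraic datatype
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// nullable: can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// der: the derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// ders: the derivative w.r.t. a string (list of characters)
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// the matcher; e.g. matches(SEQ(CHAR('a'), CHAR('b')), "ab") gives true
def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}
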
\noindent
This algorithm can be illustrated graphically as follows:
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is given
an $s$ and an $r$, it generates YES if and only if $s \in L(r)$).


\section{Values and the Algorithm by Sulzmann and Lu}

One limitation, however, of Brzozowski's algorithm is that it only
produces a YES/NO answer for whether a string is being matched by a
regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
algorithm to allow generation of an actual matching, called a
\emph{value}. Values and regular expressions correspond to each
other as illustrated in the following table:


\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
$r$ & $::=$  & $\ZERO$\\
    & $\mid$ & $\ONE$   \\
    & $\mid$ & $c$          \\
    & $\mid$ & $r_1 \cdot r_2$\\
    & $\mid$ & $r_1 + r_2$   \\
    \\
    & $\mid$ & $r^*$         \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
$v$ & $::=$  & \\
    &        & $\Empty$   \\
    & $\mid$ & $\Char(c)$          \\
    & $\mid$ & $\Seq\,v_1\, v_2$\\
    & $\mid$ & $\Left(v)$   \\
    & $\mid$ & $\Right(v)$  \\
    & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent
There is no value corresponding to $\ZERO$; $\Empty$ corresponds to
$\ONE$; $\Seq$ to the sequence regular expression and so on. The idea of
values is to encode parse trees. To see this, assume a \emph{flatten}
operation, written $|v|$, which extracts the underlying
string of a value $v$. For example, $|\Seq \, (\Char\,x) \,
(\Char\,y)|$ is the string $xy$. We omit the straightforward
definition of flatten. Using flatten, we can describe how values encode
parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$
matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and,
respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
is contained in the sub-structure of $v_1$ and $v_2$.
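
As an aside, values and the flatten operation can be sketched in Scala as
follows (continuing the datatype sketch from the previous section; the
constructors \texttt{Chr} and \texttt{Sequ} are renamed slightly so that they
do not clash with Scala's built-in \texttt{Char} and \texttt{Seq}):

\begin{verbatim}
// Values encode how a regular expression matched a string
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// flatten, written |v| in the text: the underlying string of a value
def flatten(v: Val): String = v match {
  case Empty => ""
  case Chr(c) => c.toString
  case Left(v1) => flatten(v1)
  case Right(v1) => flatten(v1)
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case Stars(vs) => vs.map(flatten).mkString
}
\end{verbatim}
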
To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the alternative reached by going right and then
left in this tree. This suggests recording this matching as

\begin{center}
$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
\end{center}

\noindent
where $\Stars$ records how many
iterations were used; and $\Left$, respectively $\Right$, which
alternative is used. The value for
matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
would look as follows

\begin{center}
$\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
\end{center}

\noindent
where $\Stars$ has only a single-element list for the single iteration
and $\Seq$ indicates that $xy$ is matched by a sequence regular
expression.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated assuming the regular expression matches the string.
Pictorially, the algorithm is as follows:

\begin{center}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{center}

\noindent
For convenience, we shall use the following notation: the regular expression we
start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
\ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots according to
the characters $c_0$, $c_1$,\ldots until we exhaust the string and
arrive at the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
parse tree incrementally by \emph{injecting} back the characters into
the values $v_n, \ldots, v_0$. For this we first call the function
$\textit{mkeps}$, which builds the parse tree for how the empty string
has matched the (nullable) regular expression $r_n$. This function is defined
as
\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$          & $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$   & $\dn$
  & \textit{if} $\nullable(r_{1})$\\
  & & \textit{then} $\Left(\mkeps(r_{1}))$\\
  & & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$  & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$           & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}
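
Assuming the \texttt{Rexp} and \texttt{Val} datatypes sketched earlier, a
Scala version of $\mkeps$ could look as follows (again only a sketch):

\begin{verbatim}
// mkeps is only called on nullable regular expressions,
// so ZERO and CHAR need no clause
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}
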

After this, we inject back the characters one by one in order to build
the parse tree $v_i$ for how the regex $r_i$ matches the string
$s_i$ ($s_i = c_i \ldots c_n$) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
A correctness proof using induction can be routinely established.

It is instructive to see how this algorithm works by a little example.
Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to
match it against the string $abc$ (when $abc$ is written as a regular
expression, the most standard way of expressing it would be $a \cdot (b
\cdot c)$; we omit the parentheses and dots here for readability). By the
POSIX rules the lexer should go for the longest matching, i.e.~it should
match the string $abc$ in one star iteration, using the longest string
$abc$ in the sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this
sub-expression for conciseness). Here is how the lexer achieves a parse
tree for this matching. First, we build successive derivatives until we
exhaust the string, as illustrated here (we simplify some regular
expressions like $0 \cdot b$ to $0$ for conciseness; we also allow
$\textit{ALT}$ to take a list of regular expressions as an argument
instead of just two operands to reduce the nesting depth of
$\textit{ALT}$s):

\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
\[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\]
\[r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* )
\]

Now instead of using $\nullable$ to give a YES, we call $\mkeps$ to construct a parse tree for how $r_3$ matched the string $abc$. $\mkeps$ gives the following value $v_3$:\\
$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty))))), Stars [])))$\\
This corresponds to the leftmost term
$(0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^*$\\
in $r_3$. Note that its leftmost location allows $\mkeps$ to choose it as the first candidate that meets the requirement of being nullable. This location is naturally generated by the splitting clause\\
$(r_1 \cdot r_2)\backslash c = (r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$ \quad (when $r_1$ is nullable).\\
By this clause, we put
$(r_1 \backslash c) \cdot r_2$ at the $\textit{front}$ and $r_2 \backslash c$ at the $\textit{back}$. This allows $\mkeps$ to always pick, among two matches, the one with the longer prefix. The value\\
$Left(Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, Empty))))), Stars [])))$\\
tells us how the empty string matches the final regular expression after doing all the derivatives: among the regular expressions in\\
$((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* )$,\\
we choose the leftmost nullable one, which is composed of a sequence of an alternative and a star. In that alternative $0+0+0 + 0 + 1 \cdot 1 \cdot 1$ we take the rightmost choice.

Using the value $v_3$, the character $c$, and the regular expression $r_2$, we can recover how $r_2$ matched the string $[c]$: we inject $c$ back into $v_3$ and get\\
$v_2 = Left(Seq(Right(Right(Right(Seq(Empty, Seq(Empty, c))))), Stars []))$,\\
which tells us how $r_2$ matched $c$. After this we inject back the character $b$, and get\\
$v_1 = Seq(Right(Right(Right(Seq(Empty, Seq(b, c))))), Stars [])$\\
for how $r_1= (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*$ matched the string $bc$ before it split into two pieces. Finally, after injecting the character $a$ back into $v_1$, we get the parse tree $v_0= Stars [Right(Right(Right(Seq(a, Seq(b, c)))))]$ for how $r^*$ matched $abc$.
We omit the details of the injection function, which is provided in Sulzmann and Lu's paper \cite{Sulzmann2014}.

Readers might have noticed that the parse tree information is actually already available when doing derivatives. For example, immediately after the operation $\backslash a$ we know that if we want to match a string that starts with $a$, we can take the initial match to be either
\begin{enumerate}
\item[1)] just $a$ or
\item[2)] the string $ab$ or
\item[3)] the string $abc$.
\end{enumerate}
In order to differentiate between these choices, we just need to remember their positions---$a$ is on the left, $ab$ is in the middle, and $abc$ is on the right. Which one of these alternatives is chosen later does not affect their relative position because our algorithm does not change this order. There is no need to traverse this information twice. This leads to a new approach to lexing---if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operations on them, and collect the information when finished with derivatives and call $\mkeps$ for deciding which branch is POSIX, then we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation. This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.

In the next section, we shall focus on the bit-coded algorithm and the natural
process of simplification of regular expressions using bit-codes, which is needed in
order to obtain \emph{fast} versions of Brzozowski's and Sulzmann
and Lu's algorithms. This is where the PhD-project hopes to advance
the state-of-the-art.


\section{Simplification of Regular Expressions}
Using bit-codes to guide parsing is not a new idea.
It was applied to context-free grammars and then adapted by Henglein and Nielsen for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took this a step further by integrating bit-codes into derivatives.

The argument for complicating the data structures from basic regular expressions to those with bitcodes
is that we can introduce simplification without making the algorithm crash or impossible to reason about.
The reason we need simplification is the shortcoming of a naive algorithm that uses Brzozowski's definition alone:
the main drawback of building successive derivatives according to
Brzozowski's definition is that they can grow very quickly in size.
This is mainly due to the fact that the derivative operation often generates
``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result,
if implemented naively both algorithms by Brzozowski and by Sulzmann
and Lu are excruciatingly slow. For example when starting with the
regular expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
derivative and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm. Fortunately, one can simplify regular expressions after
each derivative step. Various simplifications of regular expressions
are possible, such as the simplifications of $\ZERO + r$,
$r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
$r$. These simplifications do not affect the answer for whether a
regular expression matches a string or not, but fortunately also do
not affect the POSIX strategy of how regular expressions match
strings---although the latter is much harder to establish. Some
initial results in this regard have been obtained in
\cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
is a very tight bound for the size. Such a tight bound is suggested by
work of Antimirov who proved that (partial) derivatives can be bounded
by the number of characters contained in the initial regular
expression \cite{Antimirov95}.

Antimirov defined the ``partial derivatives'' of regular expressions as follows:
%TODO definition of partial derivatives
\begin{center}
\begin{tabular}{lcl}
$\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
$\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\
$\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \textit{then} \; \{ 1 \} \; \textit{else} \; \emptyset$ \\
$\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $\textit{pder} \; c \; r_1 \cup \textit{pder} \; c \; r_2$ \\
$\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; \nullable \; r_1 \; \textit{then} \; \{ r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1 \} \cup \textit{pder} \; c \; r_2 \; \textit{else} \; \{ r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1 \} $ \\
$\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in \textit{pder} \; c \; r \} $ \\
\end{tabular}
\end{center}
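
Assuming the \texttt{Rexp} datatype from our earlier sketch, partial
derivatives can be transcribed into Scala roughly as follows (sets of regular
expressions are represented here by Scala's \texttt{Set}; the transcription is
only illustrative):

\begin{verbatim}
// Antimirov's partial derivative of r w.r.t. the character c
def pder(c: Char, r: Rexp): Set[Rexp] = r match {
  case ZERO => Set()
  case ONE => Set()
  case CHAR(d) => if (c == d) Set(ONE) else Set()
  case ALT(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case SEQ(r1, r2) =>
    if (nullable(r1)) pder(c, r1).map(SEQ(_, r2)) ++ pder(c, r2)
    else pder(c, r1).map(SEQ(_, r2))
  case STAR(r1) => pder(c, r1).map(SEQ(_, STAR(r1)))
}
\end{verbatim}
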
\noindent
The result of $\textit{pder}$ is essentially a set of regular expressions that come from the sub-structure of the original regular expression.
Antimirov proved a nice bound on the size of partial derivatives: roughly speaking, the size will not exceed the fourth power of the number of nodes in the original regular expression. Interestingly, we observed in experiments that, after the simplification step, our regular expressions have the same size as, or are smaller than, the partial derivatives. This would allow us to prove a tight bound on the size of the regular expressions during the run of the algorithm, provided we can establish the connection between our simplification rules and partial derivatives.

%We believe, and have generated test
%data, that a similar bound can be obtained for the derivatives in
%Sulzmann and Lu's algorithm. Let us give some details about this next.

Bit-codes look like this:
\[ b ::= S \mid Z \qquad
bs ::= [] \mid b:bs
\]
They are just lists of bits; the names $S$ and $Z$ are arbitrary---we could just as well use $0$ and $1$. Bit-codes are a compact form of parse trees.
Here is how values and bit-codes are related:
bit-codes are essentially incomplete values.
This can be straightforwardly seen in the following transformation:
\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}
\end{center}
where $\Z$ and $\S$ are arbitrary names for the bits in the
bitsequences.
Here $\textit{code}$ encodes a value into a bitsequence by converting $\Left$ into $\Z$, $\Right$ into $\S$, the start of each non-empty star iteration into $\Z$, and the border where a local star terminates into $\S$. This conversion is lossy, as it throws away the character information and does not record the boundary between the two operands of a sequence. Moreover, from the bitcode alone we cannot even tell whether the $\S$s and $\Z$s stand for $\Left$/$\Right$ or for $\Stars$. The reason for choosing this compact way of storing information is that bits are small and can easily be moved around during the lexing process. In order to recover a value from a bitcode, we need the regular expression as extra information and decode the bits back into a value:
%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ &
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
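
The encoding direction can be sketched in Scala as follows, assuming the
\texttt{Val} datatype from earlier and representing bits by a small datatype
with constructors \texttt{Z} and \texttt{S} (this is only our transcription;
the corresponding \texttt{decode} can be written from the table above in the
same clause-by-clause manner):

\begin{verbatim}
// bits and bit-sequences
abstract class Bit
case object Z extends Bit
case object S extends Bit
type Bits = List[Bit]

// code: turn a value into a (lossy) bit-sequence
def code(v: Val): Bits = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Left(v1) => Z :: code(v1)
  case Right(v1) => S :: code(v1)
  case Sequ(v1, v2) => code(v1) ++ code(v2)
  case Stars(Nil) => List(S)
  case Stars(v1 :: vs) => Z :: (code(v1) ++ code(Stars(vs)))
}
\end{verbatim}
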

Sulzmann and Lu integrated the bitcodes into annotated regular expressions by attaching them to the head of every substructure of a regular expression~\cite{Sulzmann2014}. Annotated regular expressions are
defined by the following grammar:

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
               & $\mid$ & $\textit{ONE}\;\;bs$\\
               & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
               & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
               & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
               & $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}
\end{center}

\noindent
where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
list of annotated regular expressions. These bitsequences encode
information about the (POSIX) value that should be generated by the
Sulzmann and Lu algorithm.
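
Rendered as a Scala datatype (a sketch reusing the \texttt{Bits} type from
above; a binary alternative is simply represented as a two-element
\texttt{AALTS}), this grammar becomes:

\begin{verbatim}
// Annotated regular expressions: every node (except AZERO)
// carries a bit-sequence bs
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp
\end{verbatim}
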
To do lexing using annotated regular expressions, we shall first transform the
usual (un-annotated) regular expressions into annotated regular
expressions. This transformation is written $(\_)^\uparrow$ and called \textit{internalise}:
%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
  $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
  $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
         $\textit{ALT}\;[]\,(\textit{fuse}\,[\Z]\,r_1^\uparrow)\,
                            (\textit{fuse}\,[\S]\,r_2^\uparrow)$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $\textit{STAR}\;[]\,r^\uparrow$\\
\end{tabular}
\end{center}
%\end{definition}

Here $\textit{fuse}$ is an auxiliary function that attaches bits to the front of an annotated regular expression. Its definition is as follows:
\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\,bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
  $\textit{fuse}\,bs\,(\textit{ONE}\,bs')$ & $\dn$ &
     $\textit{ONE}\,(bs\,@\,bs')$\\
  $\textit{fuse}\,bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
     $\textit{CHAR}\,(bs\,@\,bs')\,c$\\
  $\textit{fuse}\,bs\,(\textit{ALT}\,bs'\,a_1\,a_2)$ & $\dn$ &
     $\textit{ALT}\,(bs\,@\,bs')\,a_1\,a_2$\\
  $\textit{fuse}\,bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
     $\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
  $\textit{fuse}\,bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
     $\textit{STAR}\,(bs\,@\,bs')\,a$
\end{tabular}
\end{center}

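Assuming the datatypes sketched so far, $\textit{fuse}$ and the internalisation
operation $(\_)^\uparrow$ might be transcribed into Scala as follows (the
binary $\textit{ALT}$ of the definition above is represented here as a
two-element \texttt{AALTS}; this is only a sketch):

\begin{verbatim}
// fuse attaches bits to the front of an annotated regular expression
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

// internalise: from plain to annotated regular expressions
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(Z), internalise(r1)),
                    fuse(List(S), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}
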
After internalising, we perform successive derivative operations on the annotated regular expression. This derivative operation is the same as the one defined previously for plain regular expressions, except that we take special care of the bits:
%\begin{definition}{bder}
\begin{center}
\begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
  $(\textit{ALT}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
        $\textit{ALT}\,bs\,(a_1\backslash c)\,(a_2\backslash c)$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;\textit{ALT}\,bs\,(\textit{SEQ}\,[]\,(a_1\backslash c)\,a_2)$\\
  & &$\phantom{\textit{then}\;\textit{ALT}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\backslash c))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z]\, (a\backslash c))\,
       (\textit{STAR}\,[]\,a)$
\end{tabular}
\end{center}
%\end{definition}
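
A Scala sketch of this derivative over the datatypes from our earlier sketches
could look as follows (it uses the functions \texttt{bnullable} and
\texttt{bmkeps}, which are only discussed below, mirroring the order of the
definitions in the text):

\begin{verbatim}
// derivative of an annotated regular expression w.r.t. a character
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
}
\end{verbatim}
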
For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence, we attach an additional bit $\Z$ to the front of $a \backslash c$ to indicate that there is one more star iteration.
The $\textit{SEQ}$ clause is more subtle: when $a_1$ is $\textit{bnullable}$ (here $\textit{bnullable}$ is exactly the same as $\nullable$, except that it applies to annotated regular expressions, so we omit its definition), and assuming that $\textit{bmkeps}$ correctly extracts the bitcode for how $a_1$ matches the string prior to the character $c$ (more on this later), the right branch of the alternative, namely $\textit{fuse} \; (\textit{bmkeps} \; a_1)\, (a_2 \backslash c)$, collapses the regular expression $a_1$ (as it has already been fully matched) and stores the parsing information at the head of the regular expression $a_2 \backslash c$ by fusing it there. The bitsequence $bs$, which was initially attached to the head of the $\textit{SEQ}$, has now been elevated to the top level of the $\textit{ALT}$,
as this information will be needed whichever way the $\textit{SEQ}$ is matched---no matter whether $c$ belongs to $a_1$ or to $a_2$.
After carefully doing these derivatives and maintaining all the parsing information, we complete the parsing by collecting the bits using a special $\mkeps$ function for annotated regular expressions, called $\textit{bmkeps}$:

%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a_1$\\
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{SEQ}\,bs\,a_1\,a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
     $bs \,@\, [\S]$
\end{tabular}
\end{center}
%\end{definition}
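
Over the list-based \texttt{AALTS} of our sketches, \texttt{bnullable} and
\texttt{bmkeps} could be sketched as follows (a straightforward generalisation
of the binary clauses above; the names are ours):

\begin{verbatim}
// bnullable mirrors nullable on annotated regular expressions
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// bmkeps collects the bits along the first bnullable alternative
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, as) => bs ++ bmkeps(as.find(bnullable).get)
  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
  case ASTAR(bs, _) => bs ++ List(S)
}
\end{verbatim}
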
This function completes the parse tree information by
travelling along the path in the regular expression that corresponds to a POSIX value, collecting all the bits on the way and
using $\S$ to indicate the end of star iterations. If we take the bits produced by $\textit{bmkeps}$ and decode them,
we get the parse tree we need. The overall workflow looks like this:
\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
Here $(r^\uparrow)\backslash s$ is similar to what we previously defined for
$r\backslash s$, only lifted to annotated regular expressions.
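
Putting the pieces of our sketches together, the top-level bitcoded lexer might
look as follows in Scala (this assumes a \texttt{decode} function of type
\texttt{(Bits, Rexp) => Option[Val]} that mirrors the decoding table given
earlier; again only a sketch):

\begin{verbatim}
// extend bder to strings, in the same way ders extends der
def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

// blexer: internalise, take derivatives, collect bits, decode
def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}
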
The main point of the bitsequences and annotated regular expressions
is that we can apply rather aggressive (in terms of size)
simplification rules in order to keep derivatives small.

We have
developed such ``aggressive'' simplification rules and generated test
data that show that the expected bound can be achieved. Obviously we
could only partially cover the search space as there are infinitely
many regular expressions and strings. One modification we introduced
is to allow a list of annotated regular expressions in the
\textit{ALTS} constructor. This allows us to not just delete
unnecessary $\ZERO$s and $\ONE$s from regular expressions, but also
unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general
setting).
Another modification is that we use simplification rules
inspired by Antimirov's work on partial derivatives. They maintain the
idea that only the first ``copy'' of a regular expression in an
alternative contributes to the calculation of a POSIX value. All
subsequent copies can be pruned from the regular expression.

A recursive definition of the simplification function, in Scala-like pseudocode, is given below:
\begin{center}
\begin{tabular}{@{}lcl@{}}
  $\textit{simp} \; a$ & $\dn$ & $\textit{a} \quad \textit{if} \; a = (\textit{ONE} \; bs) \; \textit{or}\; (\textit{CHAR} \, bs \; c) \; \textit{or}\; (\textit{STAR}\; bs\; a_1)$\\
  $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $(\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match}$ \\
   &&$\quad\textit{case} \; (0, \_) \Rightarrow 0$ \\
   &&$\quad\textit{case} \; (\_, 0) \Rightarrow 0$ \\
   &&$\quad\textit{case} \; (1, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
   &&$\quad\textit{case} \; (a_1', 1) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
  $\textit{simp} \; (\textit{ALT}\;bs\,as)$ & $\dn$ & $\textit{distinct}(\textit{flatten}(\textit{map} \; \textit{simp} \; as)) \; \textit{match}$ \\
   &&$\quad\textit{case} \; [] \Rightarrow 0$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse}\; bs\; a$ \\
   &&$\quad\textit{case} \; as' \Rightarrow \textit{ALT}\; bs\; as'$
\end{tabular}
\end{center}

|
796 |
the regular expression is an alternative or sequence, it will try to simplify its children regular expressions |
|
797 |
recursively and then see if one of the children turn into 0 or 1, which might trigger further simplification |
|
798 |
at the current level. The most involved part is the ALTS clause, where we use two auxiliary functions |
|
799 |
flatten and distinct to open up nested ALT and reduce as many duplicates as possible. |
|
48 | 800 |
Function distinct keeps the first occurring copy only and remove all later ones when detected duplicates. |
801 |
Function flatten opens up nested ALT. Its recursive definition is given below: |
|
53 | 802 |
\begin{center} |
803 |
\begin{tabular}{@{}lcl@{}} |
|
804 |
$\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{ map fuse}( \textit{bs, \_} ) \textit{ as}) \; +\!+ \; \textit{flatten} \; as' $ \\ |
|
805 |
$\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\ |
|
806 |
$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as' $ |
|
807 |
\end{tabular} |
|
808 |
\end{center} |
|
809 |
||
48 | 810 |
Here flatten behaves like the traditional functional programming flatten function, |
811 |
what it does is basically removing parentheses like changing $a+(b+c)$ into $a+b+c$. |
|
47 | 812 |
|
53 | 813 |

Suppose we apply simplification after each derivative step and view these two operations as one
atomic operation: $a \backslash_{simp} c \dn \textit{simp}(a \backslash c)$.
Then we can use, as before, the natural extension from the derivative w.r.t.~a character to the
derivative w.r.t.~a string:
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp} c) \backslash_{simp} s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
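\noindent
For example, for a two-character string $[c_1, c_2]$ this definition unfolds to
\begin{center}
$r \backslash_{simp} [c_1, c_2] \;=\; \textit{simp}\,((\textit{simp}\,(r \backslash c_1)) \backslash c_2)$
\end{center}
\noindent
which makes explicit that $\textit{simp}$ is applied once after every single character, rather
than only at the end.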
We thus obtain an optimised version of the algorithm:
\begin{center}
\begin{tabular}{lcl}
$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
$\textit{let}\;a = (r^\uparrow)\backslash_{simp} s\;\textit{in}$\\
& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
& & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm effectively keeps the size of the derivatives small: with this simplification,
the 8000 nodes in our earlier $(a + aa)^*$ example are reduced to only 6 and the size stays
constant, however long the input string is.
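
In code, the pipeline ``take a derivative, then simplify, repeat'' is just a fold over the
input string. The following Scala sketch shows the shape of $\textit{blexer\_simp}$; it reuses
the \texttt{ARexp} datatype and \texttt{simp} from the sketch above and assumes Scala
counterparts \texttt{bder}, \texttt{bnullable}, \texttt{bmkeps}, \texttt{internalise} and
\texttt{decode} of the functions described earlier in the text (as well as types \texttt{Rexp}
and \texttt{Val} for plain regular expressions and values), so it is schematic rather than a
complete implementation.

\begin{verbatim}
// derivative w.r.t. a string, simplifying after every single step
def bdersSimp(a: ARexp, s: List[Char]): ARexp = s match {
  case Nil     => a
  case c :: cs => bdersSimp(simp(bder(c, a)), cs)
}

// the optimised lexer: internalise, take simplifying derivatives, then
// extract and decode the bitsequence if the final derivative is nullable
def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(internalise(r), s.toList)
  if (bnullable(a)) Some(decode(bmkeps(a), r)) else None
}
\end{verbatim}

\noindent
The only difference to the lexer without simplification is the extra call to \texttt{simp}
inside \texttt{bdersSimp}; this call is what keeps the intermediate derivatives small.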

We are currently engaged in two tasks related to this algorithm. The first is to prove that
our simplification rules do not affect the POSIX value that should be generated by the
algorithm according to the specification of a POSIX value, and furthermore to obtain a much
tighter bound on the sizes of derivatives. The result would be that our algorithm is correct
and fast on all inputs; the blow-up, as observed in JavaScript, Python and Java, would be
excluded from happening in our algorithm. For this proof we use the theorem prover Isabelle.
Once completed, this result will advance the state of the art: Sulzmann and Lu wrote in their
paper \cite{Sulzmann2014} about the bitcoded ``incremental parsing method'' (that is the
matching algorithm outlined in this section):
\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method in Figure~5 in combination with the simplification steps in
Figure~6 yields POSIX parse trees. We have tested this claim
extensively by using the method in Figure~3 as a reference but yet
have to work out all proof details.''
\end{quote}

\noindent
We would like to settle this correctness claim. It is relatively straightforward to establish
that after one simplification step the part of the derivative that corresponds to a POSIX value
remains intact and can still be collected; in other words, for a $\textit{bnullable}$ annotated
regular expression $a$ we have
\begin{center}
$\textit{bmkeps}\; a \;=\; \textit{bmkeps}\; (\textit{simp}\; a)$
\end{center}
\noindent
as this essentially comes down to proving that actions such as removing the duplicate $r$ in
$r+r$ do not delete the POSIX information contained in a regular expression.
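
As a small, informal illustration of this one-step property (under the assumption that $a$ is
already simplified, $\textit{bnullable}$ and not itself an alternative), consider the annotated
analogue of $r+r$, namely $\textit{ALT}\;bs\;[a, a]$. Using the standard clauses for
$\textit{bmkeps}$, which on an alternative return the top-level bits followed by the bits of
the first $\textit{bnullable}$ child, and the fact that $\textit{fuse}$ merely prepends bits,
we obtain
\begin{center}
$\textit{bmkeps}\,(\textit{ALT}\;bs\;[a, a])
 \;=\; bs \,+\!+\, \textit{bmkeps}\,a
 \;=\; \textit{bmkeps}\,(\textit{fuse}\;bs\;a)
 \;=\; \textit{bmkeps}\,(\textit{simp}\,(\textit{ALT}\;bs\;[a, a]))$
\end{center}
\noindent
where the last step uses that $\textit{simp}$ removes the duplicate and, being left with a
single child, fuses $bs$ onto it.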

The hard part of this problem is to prove that
\begin{center}
$\textit{bmkeps} \; (r \backslash s) \;=\; \textit{bmkeps} \; (r \backslash_{simp} s)$
\end{center}
\noindent
that is, if we take the derivatives of a regular expression $r$ with and without interleaved
simplification, both results still yield the same POSIX value, if there is one. This is not as
straightforward as the previous property, because $r \backslash s$ and $r \backslash_{simp} s$
can become very different regular expressions after repeated applications of $\textit{simp}$
and the derivative operation. The crucial point is to find the ``gene'' of a regular expression
and how it is kept intact during simplification. To aid this, we use the helper function
$\textit{retrieve}$ described by Sulzmann and Lu:
\\definition of retrieve\\
This function assembles the bitcode that corresponds to a parse tree describing how the current
derivative matches the suffix of the string (that is, the characters still to be processed,
which are recorded in the value). Sulzmann and Lu used it to connect the bitcoded algorithm to
the older, value-based algorithm by the following equation:
\begin{center}
$\textit{inj} \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash c) \; v) \; r$
\end{center}
\noindent
A small fact that helps comprehension here is that $r^\uparrow = a$, where $a$ stands for the
\emph{annotated} version of $r$, so $\textit{internalise}\;r$ is just $r^\uparrow$.
Fahad and Christian also used this fact to prove the correctness of the bitcoded algorithm
without simplification.
Our purpose in using it, however, is to try to establish
\begin{center}
$ \textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; (\textit{simp}\;a) \; v'.$
\end{center}
\noindent
The idea is that, using $v'$, a simplified version of $v$ that has gone through the same
simplification steps as $\textit{simp}(a)$, we are still able to extract a bitsequence carrying
the same parsing information as the one obtained from the unsimplified regular expression.
After establishing this, we may be able to bridge the gap by proving
\begin{center}
$\textit{retrieve} \; (r \backslash s) \; v \;=\; \textit{retrieve} \; (\textit{simp}(r) \backslash s) \; v'$
\end{center}
\noindent
and subsequently
\begin{center}
$\textit{retrieve} \; (r \backslash s) \; v \;=\; \textit{retrieve} \; (r \backslash_{simp} s) \; v'.$
\end{center}
\noindent
This would show that our simplified annotated regular expression still contains all the
bitcodes that are needed.
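
\noindent
Put diagrammatically, the intended chain of results is
\begin{center}
$\textit{retrieve} \; a \; v = \textit{retrieve} \; (\textit{simp}\;a) \; v'
\;\;\Longrightarrow\;\;
\textit{retrieve} \; (r \backslash s) \; v = \textit{retrieve} \; (r \backslash_{simp} s) \; v'
\;\;\Longrightarrow\;\;
\textit{bmkeps} \; (r \backslash s) = \textit{bmkeps} \; (r \backslash_{simp} s)$
\end{center}
\noindent
where the single-step property is first lifted to arbitrarily many interleaved derivative and
simplification steps, and the last step extracts the bitsequence of the POSIX value from both
sides.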

The second task is to speed up the more aggressive simplification. Currently it is slower than
the naive simplification (which, as implemented in ADU, can of course explode in some cases),
so we need to explore how to make it faster. One possibility is to look again at the connection
to DFAs. This is very much work in progress.

\section{Conclusion}

In this PhD-project we are interested in fast algorithms for regular
expression matching. While this seems to be a ``settled'' area, in
fact interesting research questions are popping up as soon as one steps
outside the classic automata theory (for example in terms of what kind
of regular expressions are supported). The reason why it is
interesting for us to look at the derivative approach introduced by
Brzozowski for regular expression matching, and then much further
developed by Sulzmann and Lu, is that derivatives can elegantly deal
with some of the regular expressions that are of interest in ``real
life''. This includes the not-regular expression, written $\neg\,r$
(that is all strings that are not recognised by $r$), but also bounded
regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is
also hope that the derivatives can provide another angle for how to
deal more efficiently with back-references, which are one of the
reasons why regular expression engines in JavaScript, Python and Java
choose to not implement the classic automata approach of transforming
regular expressions into NFAs and then DFAs---because we simply do not
know how such back-references can be represented by DFAs.
\bibliographystyle{plain}
\bibliography{root}

\end{document}