lexing: ChengsongTanPhdThesis/Chapters/Inj.tex@deb35fd780fe (annotated)

532 cc54ce075db5 restructured Chengsong parents: diff changeset	1	% Chapter Template
cc54ce075db5 restructured Chengsong parents: diff changeset	2
cc54ce075db5 restructured Chengsong parents: diff changeset	3	\chapter{Regular Expressions and POSIX Lexing} % Main chapter title
cc54ce075db5 restructured Chengsong parents: diff changeset	4
cc54ce075db5 restructured Chengsong parents: diff changeset	5	\label{Inj} % In chapter 2 \ref{Chapter2} we will introduce the concepts
cc54ce075db5 restructured Chengsong parents: diff changeset	6	%and notations we
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	7	% used for describing the lexing algorithm by Sulzmann and Lu,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	8	%and then give the algorithm and its variant and discuss
532 cc54ce075db5 restructured Chengsong parents: diff changeset	9	%why more aggressive simplifications are needed.
cc54ce075db5 restructured Chengsong parents: diff changeset	10
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	11	In this chapter, we define the basic notions
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	12	for regular languages and regular expressions.
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	13	This is essentially a description in ``English''
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	14	the functions and datatypes of our formalisation in Isabelle/HOL.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	15	We also define what $\POSIX$ lexing means,
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	16	followed by the first lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	17	that produces the output conforming
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	18	to the $\POSIX$ standard\footnote{In what follows
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	19	we choose to use the Isabelle-style notation
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	20	for function applications, where
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	21	the parameters of a function are not enclosed
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	22	inside a pair of parentheses (e.g. $f \;x \;y$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	23	instead of $f(x,\;y)$). This is mainly
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	24	to make the text visually more concise.}.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	25
649 ef2b8abcbc55 more Chengsong parents: 646 diff changeset	26
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	27	\section{Technical Overview}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	28
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	29	Consider for example the regular expression $(a^)^\,b$ and
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	30	strings of the form $aa..a$. These strings cannot be matched by this regular
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	31	expression: Obviously the expected $b$ in the last
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	32	position is missing. One would assume that modern regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	33	matching engines can find this out very quickly. Surprisingly, if one tries
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	34	this example in JavaScript, Python or Java 8, even with small strings,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	35	say of lenght of around 30 $a$'s,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	36	the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	37	The algorithms clearly show exponential behaviour, and as can be seen
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	38	is triggered by some relatively simple regular expressions.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	39	Java 9 and newer
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	40	versions improve this behaviour somewhat, but are still slow compared
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	41	with the approach we are going to use in this thesis.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	42
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	43
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	44
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	45	This superlinear blowup in regular expression engines
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	46	has caused grief in ``real life'' where it is
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	47	given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	48	For example, on 20 July 2016 one evil
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	49	regular expression brought the webpage
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	50	\href{http://stackexchange.com}{Stack Exchange} to its
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	51	knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	52	In this instance, a regular expression intended to just trim white
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	53	spaces from the beginning and the end of a line actually consumed
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	54	massive amounts of CPU resources---causing the web servers to grind to a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	55	halt. In this example, the time needed to process
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	56	the string was
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	57	$O(n^2)$ with respect to the string length $n$. This
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	58	quadratic overhead was enough for the homepage of Stack Exchange to
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	59	respond so slowly that the load balancer assumed a $\mathit{DoS}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	60	attack and therefore stopped the servers from responding to any
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	61	requests. This made the whole site become unavailable.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	62
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	63	\begin{figure}[p]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	64	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	65	\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	66	\begin{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	67	\begin{axis}[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	68	xlabel={$n$},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	69	x label style={at={(1.05,-0.05)}},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	70	ylabel={time in secs},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	71	enlargelimits=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	72	xtick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	73	xmax=33,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	74	ymax=35,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	75	ytick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	76	scaled ticks=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	77	axis lines=left,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	78	width=5cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	79	height=4cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	80	legend entries={JavaScript},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	81	legend pos=north west,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	82	legend cell align=left]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	83	\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	84	\end{axis}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	85	\end{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	86	&
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	87	\begin{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	88	\begin{axis}[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	89	xlabel={$n$},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	90	x label style={at={(1.05,-0.05)}},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	91	%ylabel={time in secs},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	92	enlargelimits=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	93	xtick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	94	xmax=33,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	95	ymax=35,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	96	ytick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	97	scaled ticks=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	98	axis lines=left,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	99	width=5cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	100	height=4cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	101	legend entries={Python},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	102	legend pos=north west,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	103	legend cell align=left]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	104	\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	105	\end{axis}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	106	\end{tikzpicture}\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	107	\begin{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	108	\begin{axis}[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	109	xlabel={$n$},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	110	x label style={at={(1.05,-0.05)}},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	111	ylabel={time in secs},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	112	enlargelimits=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	113	xtick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	114	xmax=33,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	115	ymax=35,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	116	ytick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	117	scaled ticks=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	118	axis lines=left,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	119	width=5cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	120	height=4cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	121	legend entries={Java 8},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	122	legend pos=north west,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	123	legend cell align=left]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	124	\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	125	\end{axis}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	126	\end{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	127	&
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	128	\begin{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	129	\begin{axis}[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	130	xlabel={$n$},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	131	x label style={at={(1.05,-0.05)}},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	132	%ylabel={time in secs},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	133	enlargelimits=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	134	xtick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	135	xmax=33,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	136	ymax=35,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	137	ytick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	138	scaled ticks=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	139	axis lines=left,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	140	width=5cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	141	height=4cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	142	legend entries={Dart},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	143	legend pos=north west,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	144	legend cell align=left]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	145	\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	146	\end{axis}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	147	\end{tikzpicture}\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	148	\begin{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	149	\begin{axis}[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	150	xlabel={$n$},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	151	x label style={at={(1.05,-0.05)}},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	152	ylabel={time in secs},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	153	enlargelimits=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	154	xtick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	155	xmax=33,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	156	ymax=35,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	157	ytick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	158	scaled ticks=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	159	axis lines=left,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	160	width=5cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	161	height=4cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	162	legend entries={Swift},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	163	legend pos=north west,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	164	legend cell align=left]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	165	\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	166	\end{axis}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	167	\end{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	168	&
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	169	\begin{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	170	\begin{axis}[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	171	xlabel={$n$},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	172	x label style={at={(1.05,-0.05)}},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	173	%ylabel={time in secs},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	174	enlargelimits=true,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	175	%xtick={0,5000,...,40000},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	176	%xmax=40000,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	177	%ymax=35,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	178	restrict x to domain*=0:40000,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	179	restrict y to domain*=0:35,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	180	%ytick={0,5,...,30},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	181	%scaled ticks=false,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	182	axis lines=left,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	183	width=5cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	184	height=4cm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	185	legend entries={Java9+},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	186	legend pos=north west,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	187	legend cell align=left]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	188	\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	189	\end{axis}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	190	\end{tikzpicture}\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	191	\multicolumn{2}{c}{Graphs}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	192	\end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	193	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	194	\caption{Graphs showing runtime for matching $(a^)^\,b$ with strings
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	195	of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	196	The reason for their superlinear behaviour is that they do a depth-first-search
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	197	using NFAs.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	198	If the string does not match, the regular expression matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	199	engine starts to explore all possibilities.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	200	}\label{fig:aStarStarb}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	201	\end{figure}\afterpage{\clearpage}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	202
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	203	A more recent example is a global outage of all Cloudflare servers on 2 July
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	204	2019. A poorly written regular expression exhibited catastrophic backtracking
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	205	and exhausted CPUs that serve HTTP traffic. Although the outage
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	206	had several causes, at the heart was a regular expression that
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	207	was used to monitor network
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	208	traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	209	These problems with regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	210	are not isolated events that happen
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	211	very rarely,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	212	%but actually widespread.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	213	%They occur so often that they have a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	214	but they occur actually often enough that they have a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	215	name: Regular-Expression-Denial-Of-Service (ReDoS)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	216	attacks.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	217	Davis et al. \cite{Davis18} detected more
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	218	than 1000 evil regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	219	in Node.js, Python core libraries, npm and pypi.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	220	They therefore concluded that evil regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	221	are a real problem rather than just "a parlour trick".
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	222
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	223	The work in this thesis aims to address this issue
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	224	with the help of formal proofs.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	225	We describe a lexing algorithm based
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	226	on Brzozowski derivatives with verified correctness
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	227	and a finiteness property for the size of derivatives
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	228	(which are all done in Isabelle/HOL).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	229	Such properties %guarantee the absence of
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	230	are an important step in preventing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	231	catastrophic backtracking once and for all.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	232	We will give more details in the next sections
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	233	on (i) why the slow cases in graph \ref{fig:aStarStarb}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	234	can occur in traditional regular expression engines
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	235	and (ii) why we choose our
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	236	approach based on Brzozowski derivatives and formal proofs.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	237
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	238
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	239	\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	240	Regular expressions and regular expression matchers
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	241	have clearly been studied for many, many years.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	242	Theoretical results in automata theory state
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	243	that basic regular expression matching should be linear
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	244	w.r.t the input.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	245	This assumes that the regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	246	$r$ was pre-processed and turned into a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	247	deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	248	By basic we mean textbook definitions such as the one
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	249	below, involving only regular expressions for characters, alternatives,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	250	sequences, and Kleene stars:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	251	\[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	252	r ::= c \| r_1 + r_2 \| r_1 \cdot r_2 \| r^*
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	253	\]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	254	Modern regular expression matchers used by programmers,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	255	however,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	256	support much richer constructs, such as bounded repetitions,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	257	negations,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	258	and back-references.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	259	To differentiate, we use the word \emph{regex} to refer
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	260	to those expressions with richer constructs while reserving the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	261	term \emph{regular expression}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	262	for the more traditional meaning in formal languages theory.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	263	We follow this convention
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	264	in this thesis.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	265	In the future, we aim to support all the popular features of regexes,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	266	but for this work we mainly look at basic regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	267	and bounded repetitions.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	268
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	269
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	270
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	271	%Most modern regex libraries
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	272	%the so-called PCRE standard (Peral Compatible Regular Expressions)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	273	%has the back-references
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	274	Regexes come with a number of constructs
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	275	that make it more convenient for
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	276	programmers to write regular expressions.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	277	Depending on the types of constructs
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	278	the task of matching and lexing with them
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	279	will have different levels of complexity.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	280	Some of those constructs are syntactic sugars that are
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	281	simply short hand notations
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	282	that save the programmers a few keystrokes.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	283	These will not cause problems for regex libraries.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	284	For example the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	285	non-binary alternative involving three or more choices just means:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	286	\[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	287	(a \| b \| c) \stackrel{means}{=} ((a + b)+ c)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	288	\]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	289	Similarly, the range operator
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	290	%used to express the alternative
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	291	%of all characters between its operands,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	292	is just a concise way
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	293	of expressing an alternative of consecutive characters:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	294	\[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	295	[0~-9]\stackrel{means}{=} (0 \| 1 \| \ldots \| 9 )
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	296	\]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	297	for an alternative. The
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	298	wildcard character '$.$' is used to refer to any single character,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	299	\[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	300	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	301	\]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	302	except the newline.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	303
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	304	\subsection{Bounded Repetitions}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	305	More interesting are bounded repetitions, which can
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	306	make the regular expressions much
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	307	more compact.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	308	Normally there are four kinds of bounded repetitions:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	309	$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	310	(where $n$ and $m$ are constant natural numbers).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	311	Like the star regular expressions, the set of strings or language
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	312	a bounded regular expression can match
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	313	is defined using the power operation on sets:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	314	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	315	\begin{tabular}{lcl}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	316	$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	317	$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	318	$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	319	$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	320	\end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	321	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	322	The attraction of bounded repetitions is that they can be
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	323	used to avoid a size blow up: for example $r^{\{n\}}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	324	is a shorthand for
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	325	the much longer regular expression:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	326	\[
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	327	\underbrace{r\ldots r}_\text{n copies of r}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	328	\]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	329	%Therefore, a naive algorithm that simply unfolds
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	330	%them into their desugared forms
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	331	%will suffer from at least an exponential runtime increase.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	332
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	333
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	334	The problem with matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	335	such bounded repetitions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	336	is that tools based on the classic notion of
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	337	automata need to expand $r^{\{n\}}$ into $n$ connected
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	338	copies of the automaton for $r$. This leads to very inefficient matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	339	algorithms or algorithms that consume large amounts of memory.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	340	Implementations using $\DFA$s will
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	341	in such situations
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	342	either become excruciatingly slow
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	343	(for example Verbatim++ \cite{Verbatimpp}) or run
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	344	out of memory (for example $\mathit{LEX}$ and
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	345	$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	346	in C and JAVA that generate $\mathit{DFA}$-based
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	347	lexers. The user provides a set of regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	348	and configurations, and then
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	349	gets an output program encoding a minimized $\mathit{DFA}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	350	that can be compiled and run.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	351	When given the above countdown regular expression,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	352	a small $n$ (say 20) would result in a program representing a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	353	DFA
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	354	with millions of states.}) for large counters.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	355	A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	356	where the minimal DFA requires at least $2^{n+1}$ states.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	357	For example, when $n$ is equal to 2,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	358	the corresponding $\mathit{NFA}$ looks like:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	359	\vspace{6mm}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	360	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	361	\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	362	\node[state,initial] (q_0) {$q_0$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	363	\node[state, red] (q_1) [right=of q_0] {$q_1$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	364	\node[state, red] (q_2) [right=of q_1] {$q_2$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	365	\node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	366	\path[->]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	367	(q_0) edge node {a} (q_1)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	368	edge [loop below] node {a,b} ()
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	369	(q_1) edge node {a,b} (q_2)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	370	(q_2) edge node {a,b} (q_3);
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	371	\end{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	372	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	373	and when turned into a DFA by the subset construction
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	374	requires at least $2^3$ states.\footnote{The
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	375	red states are "countdown states" which count down
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	376	the number of characters needed in addition to the current
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	377	string to make a successful match.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	378	For example, state $q_1$ indicates a match that has
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	379	gone past the $(a\|b)^$ part of $(a\|b)^a(a\|b)^{\{2\}}$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	380	and just consumed the "delimiter" $a$ in the middle, and
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	381	needs to match 2 more iterations of $(a\|b)$ to complete.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	382	State $q_2$ on the other hand, can be viewed as a state
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	383	after $q_1$ has consumed 1 character, and just waits
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	384	for 1 more character to complete.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	385	The state $q_3$ is the last (accepting) state, requiring 0
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	386	more characters.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	387	Depending on the suffix of the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	388	input string up to the current read location,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	389	the states $q_1$ and $q_2$, $q_3$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	390	may or may
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	391	not be active.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	392	A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	393	contain at least $2^3$ non-equivalent states that cannot be merged,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	394	because the subset construction during determinisation will generate
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	395	all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	396	Generalizing this to regular expressions with larger
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	397	bounded repetitions number, we have that
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	398	regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	399	would require at least $2^{n+1}$ states, if $r$ itself contains
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	400	more than 1 string.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	401	This is to represent all different
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	402	scenarios in which "countdown" states are active.}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	403
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	404
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	405	Bounded repetitions are important because they
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	406	tend to occur frequently in practical use,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	407	for example in the regex library RegExLib, in
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	408	the rules library of Snort \cite{Snort1999}\footnote{
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	409	Snort is a network intrusion detection (NID) tool
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	410	for monitoring network traffic.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	411	The network security community curates a list
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	412	of malicious patterns written as regexes,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	413	which is used by Snort's detection engine
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	414	to match against network traffic for any hostile
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	415	activities such as buffer overflow attacks.},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	416	as well as in XML Schema definitions (XSDs).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	417	According to Bj\"{o}rklund et al \cite{xml2015},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	418	more than half of the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	419	XSDs they found on the Maven.org central repository
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	420	have bounded regular expressions in them.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	421	Often the counters are quite large, with the largest being
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	422	close to ten million.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	423	A smaller sample XSD they gave
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	424	is:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	425	\lstset{
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	426	basicstyle=\fontsize{8.5}{9}\ttfamily,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	427	language=XML,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	428	morekeywords={encoding,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	429	xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	430	}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	431	\begin{lstlisting}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	432	<sequence minOccurs="0" maxOccurs="65535">
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	433	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	434	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	435	</sequence>
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	436	\end{lstlisting}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	437	This can be seen as the regex
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	438	$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	439	regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	440	satisfying certain constraints (such as
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	441	satisfying the floating point number format).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	442	It is therefore quite unsatisfying that
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	443	some regular expressions matching libraries
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	444	impose adhoc limits
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	445	for bounded regular expressions:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	446	For example, in the regular expression matching library in the Go
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	447	language the regular expression $a^{1001}$ is not permitted, because no counter
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	448	can be above 1000, and in the built-in Rust regular expression library
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	449	expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	450	for being too big.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	451	As Becchi and Crawley \cite{Becchi08} have pointed out,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	452	the reason for these restrictions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	453	is that they simulate a non-deterministic finite
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	454	automata (NFA) with a breadth-first search.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	455	This way the number of active states could
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	456	be equal to the counter number.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	457	When the counters are large,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	458	the memory requirement could become
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	459	infeasible, and a regex engine
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	460	like in Go will reject this pattern straight away.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	461	\begin{figure}[H]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	462	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	463	\begin{tikzpicture} [node distance = 2cm, on grid, auto]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	464
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	465	\node (q0) [state, initial] {$0$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	466	\node (q1) [state, right = of q0] {$1$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	467	%\node (q2) [state, right = of q1] {$2$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	468	\node (qdots) [right = of q1] {$\ldots$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	469	\node (qn) [state, right = of qdots] {$n$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	470	\node (qn1) [state, right = of qn] {$n+1$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	471	\node (qn2) [state, right = of qn1] {$n+2$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	472	\node (qn3) [state, accepting, right = of qn2] {$n+3$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	473
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	474	\path [-stealth, thick]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	475	(q0) edge [loop above] node {a} ()
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	476	(q0) edge node {a} (q1)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	477	%(q1) edge node {.} (q2)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	478	(q1) edge node {.} (qdots)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	479	(qdots) edge node {.} (qn)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	480	(qn) edge node {.} (qn1)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	481	(qn1) edge node {b} (qn2)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	482	(qn2) edge node {$c$} (qn3);
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	483	\end{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	484	%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	485	% \node[state,initial] (q_0) {$0$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	486	% \node[state, ] (q_1) [right=of q_0] {$1$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	487	% \node[state, ] (q_2) [right=of q_1] {$2$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	488	% \node[state,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	489	% \node[state, accepting, ](q_3) [right=of q_2] {$3$};
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	490	% \path[->]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	491	% (q_0) edge node {a} (q_1)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	492	% edge [loop below] node {a,b} ()
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	493	% (q_1) edge node {a,b} (q_2)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	494	% (q_2) edge node {a,b} (q_3);
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	495	%\end{tikzpicture}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	496	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	497	\caption{The example given by Becchi and Crawley
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	498	that NFA simulation can consume large
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	499	amounts of memory: $.^*a.^{\{n\}}bc$ matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	500	strings of the form $aaa\ldots aaaabc$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	501	When traversing in a breadth-first manner,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	502	all states from 0 till $n+1$ will become active.}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	503	\end{figure}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	504	%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	505	%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	506	%in terms of input string length.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	507	%TODO:try out these lexers
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	508	These problems can of course be solved in matching algorithms where
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	509	automata go beyond the classic notion and for instance include explicit
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	510	counters \cite{Turo_ov__2020}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	511	These solutions can be quite efficient,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	512	with the ability to process
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	513	gigabits of strings input per second
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	514	even with large counters \cite{Becchi08}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	515	These practical solutions do not come with
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	516	formal guarantees, and as pointed out by
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	517	Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	518	%But formal reasoning about these automata especially in Isabelle
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	519	%can be challenging
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	520	%and un-intuitive.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	521	%Therefore, we take correctness and runtime claims made about these solutions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	522	%with a grain of salt.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	523
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	524	In the work reported in \cite{FoSSaCS2023} and here,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	525	we add better support using derivatives
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	526	for bounded regular expression $r^{\{n\}}$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	527	Our results
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	528	extend straightforwardly to
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	529	repetitions with intervals such as
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	530	$r^{\{n\ldots m\}}$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	531	The merit of Brzozowski derivatives (more on this later)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	532	on this problem is that
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	533	it can be naturally extended to support bounded repetitions.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	534	Moreover these extensions are still made up of only small
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	535	inductive datatypes and recursive functions,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	536	making it handy to deal with them in theorem provers.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	537	%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	538	%straightforwardly extended to deal with bounded regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	539	%and moreover the resulting code still consists of only simple
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	540	%recursive functions and inductive datatypes.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	541	Finally, bounded regular expressions do not destroy our finite
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	542	boundedness property, which we shall prove later on.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	543
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	544
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	545
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	546
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	547
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	548	\subsection{Back-References}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	549	The other way to simulate an $\mathit{NFA}$ for matching is choosing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	550	a single transition each time, keeping all the other options in
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	551	a queue or stack, and backtracking if that choice eventually
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	552	fails.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	553	This method, often called a "depth-first-search",
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	554	is efficient in many cases, but could end up
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	555	with exponential run time.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	556	The backtracking method is employed in regex libraries
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	557	that support \emph{back-references}, for example
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	558	in Java and Python.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	559	%\section{Back-references and The Terminology Regex}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	560
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	561	%When one constructs an $\NFA$ out of a regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	562	%there is often very little to be done in the first phase, one simply
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	563	%construct the $\NFA$ states based on the structure of the input regular expression.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	564
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	565	%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	566	%one by keeping track of all active states after consuming
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	567	%a character, and update that set of states iteratively.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	568	%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	569	%for a path terminating
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	570	%at an accepting state.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	571
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	572
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	573
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	574	Consider the following regular expression where the sequence
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	575	operator is omitted for brevity:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	576	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	577	$r_1r_2r_3r_4$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	578	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	579	In this example,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	580	one could label sub-expressions of interest
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	581	by parenthesizing them and giving
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	582	them a number in the order in which their opening parentheses appear.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	583	One possible way of parenthesizing and labelling is:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	584	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	585	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	586	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	587	The sub-expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	588	$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	589	by 1 to 4, and can be ``referred back'' by their respective numbers.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	590	%These sub-expressions are called "capturing groups".
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	591	To do so, one uses the syntax $\backslash i$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	592	to denote that we want the sub-string
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	593	of the input matched by the i-th
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	594	sub-expression to appear again,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	595	exactly the same as it first appeared:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	596	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	597	$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	598	\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	599	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	600	Once the sub-string $s_i$ for the sub-expression $r_i$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	601	has been fixed, there is no variability on what the back-reference
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	602	label $\backslash i$ can be---it is tied to $s_i$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	603	The overall string will look like $\ldots s_i \ldots s_i \ldots $
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	604	%The backslash and number $i$ are the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	605	%so-called "back-references".
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	606	%Let $e$ be an expression made of regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	607	%and back-references. $e$ contains the expression $e_i$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	608	%as its $i$-th capturing group.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	609	%The semantics of back-reference can be recursively
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	610	%written as:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	611	%\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	612	% \begin{tabular}{c}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	613	% $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	614	% $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	615	% \end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	616	%\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	617	A concrete example
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	618	for back-references is
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	619	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	620	$(.^*)\backslash 1$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	621	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	622	which matches
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	623	strings that can be split into two identical halves,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	624	for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	625	Note that this is different from
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	626	repeating the sub-expression verbatim like
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	627	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	628	$(.^)(.^)$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	629	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	630	which does not impose any restrictions on what strings the second
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	631	sub-expression $.^*$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	632	might match.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	633	Another example for back-references is
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	634	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	635	$(.)(.)\backslash 2\backslash 1$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	636	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	637	which matches four-character palindromes
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	638	like $abba$, $x??x$ and so on.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	639
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	640	Back-references are a regex construct
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	641	that programmers find quite useful.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	642	According to Becchi and Crawley \cite{Becchi08},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	643	6\% of Snort rules (up until 2008) use them.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	644	The most common use of back-references
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	645	is to express well-formed html files,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	646	where back-references are convenient for matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	647	opening and closing tags like
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	648	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	649	$\langle html \rangle \ldots \langle / html \rangle$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	650	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	651	A regex describing such a format
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	652	is
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	653	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	654	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	655	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	656	Despite being useful, the expressive power of regexes
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	657	go beyond regular languages
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	658	once back-references are included.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	659	In fact, they allow the regex construct to express
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	660	languages that cannot be contained in context-free
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	661	languages either.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	662	For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	663	expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	664	which cannot be expressed by
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	665	context-free grammars \cite{campeanu2003formal}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	666	Such a language is contained in the context-sensitive hierarchy
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	667	of formal languages.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	668	Also solving the matching problem involving back-references
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	669	is known to be NP-complete \parencite{alfred2014algorithms}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	670	Regex libraries supporting back-references such as
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	671	PCRE \cite{pcre} therefore have to
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	672	revert to a depth-first search algorithm including backtracking.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	673	What is unexpected is that even in the cases
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	674	not involving back-references, there is still
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	675	a (non-negligible) chance they might backtrack super-linearly,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	676	as shown in the graphs in figure \ref{fig:aStarStarb}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	677
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	678	Summing up, we can categorise existing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	679	practical regex libraries into two kinds:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	680	(i) The ones with linear
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	681	time guarantees like Go and Rust. The downside with them is that
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	682	they impose restrictions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	683	on the regular expressions (not allowing back-references,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	684	bounded repetitions cannot exceed an ad hoc limit etc.).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	685	And (ii) those
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	686	that allow large bounded regular expressions and back-references
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	687	at the expense of using backtracking algorithms.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	688	They can potentially ``grind to a halt''
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	689	on some very simple cases, resulting
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	690	ReDoS attacks if exposed to the internet.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	691
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	692	The problems with both approaches are the motivation for us
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	693	to look again at the regular expression matching problem.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	694	Another motivation is that regular expression matching algorithms
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	695	that follow the POSIX standard often contain errors and bugs
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	696	as we shall explain next.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	697
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	698	%We would like to have regex engines that can
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	699	%deal with the regular part (e.g.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	700	%bounded repetitions) of regexes more
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	701	%efficiently.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	702	%Also we want to make sure that they do it correctly.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	703	%It turns out that such aim is not so easy to achieve.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	704	%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	705	% For example, the Rust regex engine claims to be linear,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	706	% but does not support lookarounds and back-references.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	707	% The GoLang regex library does not support over 1000 repetitions.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	708	% Java and Python both support back-references, but shows
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	709	%catastrophic backtracking behaviours on inputs without back-references(
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	710	%when the language is still regular).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	711	%TODO: test performance of Rust on (((((aa)b)b){20}))c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	712	%TODO: verify the fact Rust does not allow 1000+ reps
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	713
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	714
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	715
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	716
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	717	%The time cost of regex matching algorithms in general
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	718	%involve two different phases, and different things can go differently wrong on
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	719	%these phases.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	720	%$\DFA$s usually have problems in the first (construction) phase
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	721	%, whereas $\NFA$s usually run into trouble
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	722	%on the second phase.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	723
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	724
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	725	\section{Error-prone POSIX Implementations}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	726	Very often there are multiple ways of matching a string
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	727	with a regular expression.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	728	In such cases the regular expressions matcher needs to
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	729	disambiguate.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	730	The more widely used strategy is called POSIX,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	731	which roughly speaking always chooses the longest initial match.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	732	The POSIX strategy is widely adopted in many regular expression matchers
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	733	because it is a natural strategy for lexers.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	734	However, many implementations (including the C libraries
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	735	used by Linux and OS X distributions) contain bugs
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	736	or do not meet the specification they claim to adhere to.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	737	Kuklewicz maintains a unit test repository which lists some
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	738	problems with existing regular expression engines \cite{KuklewiczHaskell}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	739	In some cases, they either fail to generate a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	740	result when there exists a match,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	741	or give results that are inconsistent with the POSIX standard.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	742	A concrete example is the regex:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	743	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	744	$(aba + ab + a)^* \text{and the string} \; ababa$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	745	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	746	The correct POSIX match for the above
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	747	involves the entire string $ababa$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	748	split into two Kleene star iterations, namely $[ab], [aba]$ at positions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	749	$[0, 2), [2, 5)$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	750	respectively.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	751	But feeding this example to the different engines
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	752	listed at regex101 \footnote{
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	753	regex101 is an online regular expression matcher which
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	754	provides API for trying out regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	755	engines of multiple popular programming languages like
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	756	Java, Python, Go, etc.} \parencite{regex101}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	757	yields
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	758	only two incomplete matches: $[aba]$ at $[0, 3)$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	759	and $a$ at $[4, 5)$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	760	Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	761	commented that most regex libraries are not
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	762	correctly implementing the central POSIX
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	763	rule, called the maximum munch rule.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	764	Grathwohl \parencite{grathwohl2014crash} wrote:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	765	\begin{quote}\it
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	766	``The POSIX strategy is more complicated than the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	767	greedy because of the dependence on information about
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	768	the length of matched strings in the various subexpressions.''
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	769	\end{quote}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	770	%\noindent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	771	People have recognised that the implementation complexity of POSIX rules also come from
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	772	the specification being not very precise.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	773	The developers of the regexp package of Go
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	774	\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	775	commented that
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	776	\begin{quote}\it
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	777	``
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	778	The POSIX rule is computationally prohibitive
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	779	and not even well-defined.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	780	``
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	781	\end{quote}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	782	There are many informal summaries of this disambiguation
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	783	strategy, which are often quite long and delicate.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	784	For example Kuklewicz \cite{KuklewiczHaskell}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	785	described the POSIX rule as (section 1, last paragraph):
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	786	\begin{quote}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	787	\begin{itemize}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	788	\item
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	789	regular expressions (REs) take the leftmost starting match, and the longest match starting there
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	790	earlier subpatterns have leftmost-longest priority over later subpatterns\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	791	\item
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	792	higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	793	\item
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	794	REs have right associative concatenation which can be changed with parenthesis\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	795	\item
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	796	parenthesized subexpressions return the match from their last usage\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	797	\item
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	798	text of component subexpressions must be contained in the text of the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	799	higher-level subexpressions\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	800	\item
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	801	if "p" and "q" can never match the same text then "p\|q" and "q\|p" are equivalent, up to trivial renumbering of captured subexpressions\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	802	\item
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	803	if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	804	\end{itemize}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	805	\end{quote}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	806	%The text above
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	807	%is trying to capture something very precise,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	808	%and is crying out for formalising.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	809	Ribeiro and Du Bois \cite{RibeiroAgda2017} have
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	810	formalised the notion of bit-coded regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	811	and proved their relations with simple regular expressions in
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	812	the dependently-typed proof assistant Agda.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	813	They also proved the soundness and completeness of a matching algorithm
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	814	based on the bit-coded regular expressions.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	815	Ausaf et al. \cite{AusafDyckhoffUrban2016}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	816	are the first to
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	817	give a quite simple formalised POSIX
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	818	specification in Isabelle/HOL, and also prove
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	819	that their specification coincides with an earlier (unformalised)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	820	POSIX specification given by Okui and Suzuki \cite{Okui10}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	821	They then formally proved the correctness of
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	822	a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	823	with regards to that specification.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	824	They also found that the informal POSIX
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	825	specification by Sulzmann and Lu needs to be substantially
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	826	revised in order for the correctness proof to go through.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	827	Their original specification and proof were unfixable
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	828	according to Ausaf et al.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	829
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	830
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	831	In the next section, we will briefly
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	832	introduce Brzozowski derivatives and Sulzmann
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	833	and Lu's algorithm, which the main point of this thesis builds on.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	834	%We give a taste of what they
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	835	%are like and why they are suitable for regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	836	%matching and lexing.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	837	\section{Formal Specification of POSIX Matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	838	and Brzozowski Derivatives}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	839	%Now we start with the central topic of the thesis: Brzozowski derivatives.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	840	Brzozowski \cite{Brzozowski1964} first introduced the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	841	concept of a \emph{derivative} of regular expression in 1964.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	842	The derivative of a regular expression $r$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	843	with respect to a character $c$, is written as $r \backslash c$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	844	This operation tells us what $r$ transforms into
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	845	if we ``chop'' off a particular character $c$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	846	from all strings in the language of $r$ (defined
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	847	later as $L \; r$).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	848	%To give a flavour of Brzozowski derivatives, we present
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	849	%two straightforward clauses from it:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	850	%\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	851	% \begin{tabular}{lcl}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	852	% $d \backslash c$ & $\dn$ &
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	853	% $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	854	%$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	855	% \end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	856	%\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	857	%\noindent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	858	%The first clause says that for the regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	859	%denoting a singleton set consisting of a single-character string $\{ d \}$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	860	%we check the derivative character $c$ against $d$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	861	%returning a set containing only the empty string $\{ [] \}$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	862	%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	863	%The second clause states that to obtain the regular expression
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	864	%representing all strings' head character $c$ being chopped off
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	865	%from $r_1 + r_2$, one simply needs to recursively take derivative
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	866	%of $r_1$ and $r_2$ and then put them together.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	867	Derivatives have the property
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	868	that $s \in L \; (r\backslash c)$ if and only if
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	869	$c::s \in L \; r$ where $::$ stands for list prepending.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	870	%This property can be used on regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	871	%matching and lexing--to test whether a string $s$ is in $L \; r$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	872	%one simply takes derivatives of $r$ successively with
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	873	%respect to the characters (in the correct order) in $s$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	874	%and then test whether the empty string is in the last regular expression.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	875	With this property, derivatives can give a simple solution
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	876	to the problem of matching a string $s$ with a regular
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	877	expression $r$: if the derivative of $r$ w.r.t.\ (in
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	878	succession) all the characters of the string matches the empty string,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	879	then $r$ matches $s$ (and {\em vice versa}).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	880	%This makes formally reasoning about these properties such
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	881	%as correctness and complexity smooth and intuitive.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	882	There are several mechanised proofs of this property in various theorem
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	883	provers,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	884	for example one by Owens and Slind \cite{Owens2008} in HOL4,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	885	another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	886	yet another in Coq by Coquand and Siles \cite{Coquand2012}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	887
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	888	In addition, one can extend derivatives to bounded repetitions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	889	relatively straightforwardly. For example, the derivative for
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	890	this can be defined as:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	891	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	892	\begin{tabular}{lcl}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	893	$r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	894	r^{\{n-1\}} (\text{when} n > 0)$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	895	\end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	896	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	897	\noindent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	898	Experimental results suggest that unlike DFA-based solutions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	899	for bounded regular expressions,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	900	derivatives can cope
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	901	large counters
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	902	quite well.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	903
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	904	There have also been
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	905	extensions of derivatives to other regex constructs.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	906	For example, Owens et al include the derivatives
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	907	for the \emph{NOT} regular expression, which is
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	908	able to concisely express C-style comments of the form
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	909	$/* \ldots */$ (see \cite{Owens2008}).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	910	Another extension for derivatives is
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	911	regular expressions with look-aheads, done
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	912	by Miyazaki and Minamide
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	913	\cite{Takayuki2019}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	914	%We therefore use Brzozowski derivatives on regular expressions
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	915	%lexing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	916
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	917
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	918
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	919	Given the above definitions and properties of
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	920	Brzozowski derivatives, one quickly realises their potential
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	921	in generating a formally verified algorithm for lexing: the clauses and property
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	922	can be easily expressed in a functional programming language
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	923	or converted to theorem prover
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	924	code, with great ease.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	925	Perhaps this is the reason why derivatives have sparked quite a bit of interest
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	926	in the functional programming and theorem prover communities in the last
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	927	fifteen or so years (
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	928	\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	929	\cite{Chen12} and \cite{Coquand2012}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	930	to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	931	after they were first published by Brzozowski.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	932
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	933
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	934	However, there are two difficulties with derivative-based matchers:
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	935	First, Brzozowski's original matcher only generates a yes/no answer
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	936	for whether a regular expression matches a string or not. This is too
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	937	little information in the context of lexing where separate tokens must
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	938	be identified and also classified (for example as keywords
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	939	or identifiers).
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	940	Second, derivative-based matchers need to be more efficient in terms
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	941	of the sizes of derivatives.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	942	Elegant and beautiful
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	943	as many implementations are,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	944	they can be excruciatingly slow.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	945	For example, Sulzmann and Lu
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	946	claim a linear running time of their proposed algorithm,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	947	but that was falsified by our experiments. The running time
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	948	is actually $\Omega(2^n)$ in the worst case.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	949	A similar claim about a theoretical runtime of $O(n^2)$
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	950	is made for the Verbatim \cite{Verbatim}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	951	%TODO: give references
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	952	lexer, which calculates POSIX matches and is based on derivatives.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	953	They formalized the correctness of the lexer, but not their complexity result.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	954	In the performance evaluation section, they analyzed the run time
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	955	of matching $a$ with the string
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	956	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	957	$\underbrace{a \ldots a}_{\text{n a's}}$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	958	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	959	\noindent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	960	They concluded that the algorithm is quadratic in terms of
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	961	the length of the input string.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	962	When we tried out their extracted OCaml code with the example $(a+aa)^*$,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	963	the time it took to match a string of 40 $a$'s was approximately 5 minutes.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	964
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	965
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	966	\subsection{Sulzmann and Lu's Algorithm}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	967	Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	968	problem with the yes/no answer
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	969	by cleverly extending Brzozowski's matching
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	970	algorithm. Their extended version generates additional information on
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	971	\emph{how} a regular expression matches a string following the POSIX
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	972	rules for regular expression matching. They achieve this by adding a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	973	second ``phase'' to Brzozowski's algorithm involving an injection
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	974	function. This injection function in a sense undoes the ``damage''
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	975	of the derivatives chopping off characters.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	976	In earlier work, Ausaf et al provided the formal
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	977	specification of what POSIX matching means and proved in Isabelle/HOL
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	978	the correctness
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	979	of this extended algorithm accordingly
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	980	\cite{AusafDyckhoffUrban2016}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	981
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	982	The version of the algorithm proven correct
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	983	suffers however heavily from a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	984	second difficulty, where derivatives can
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	985	grow to arbitrarily big sizes.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	986	For example if we start with the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	987	regular expression $(a+aa)^*$ and take
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	988	successive derivatives according to the character $a$, we end up with
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	989	a sequence of ever-growing derivatives like
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	990
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	991	\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	992	\begin{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	993	\begin{tabular}{rll}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	994	$(a + aa)^$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	995	& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	996	& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	997	& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	998	& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	999	\end{tabular}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1000	\end{center}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1001
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1002	\noindent where after around 35 steps we usually run out of memory on a
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1003	typical computer. Clearly, the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1004	notation involving $\ZERO$s and $\ONE$s already suggests
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1005	simplification rules that can be applied to regular regular
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1006	expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1007	\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1008	r$. While such simple-minded simplifications have been proved in
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1009	the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1010	algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1011	\emph{not} help with limiting the growth of the derivatives shown
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1012	above: the growth is slowed, but the derivatives can still grow rather
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1013	quickly beyond any finite bound.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1014
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1015	Therefore we want to look in this thesis at a second
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1016	algorithm by Sulzmann and Lu where they
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1017	overcame this ``growth problem'' \cite{Sulzmann2014}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1018	In this version, POSIX values are
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1019	represented as bit sequences and such sequences are incrementally generated
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1020	when derivatives are calculated. The compact representation
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1021	of bit sequences and regular expressions allows them to define a more
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1022	``aggressive'' simplification method that keeps the size of the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1023	derivatives finite no matter what the length of the string is.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1024	They make some informal claims about the correctness and linear behaviour
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1025	of this version, but do not provide any supporting proof arguments, not
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1026	even ``pencil-and-paper'' arguments. They write about their bit-coded
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1027	\emph{incremental parsing method} (that is the algorithm to be formalised
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1028	in this dissertation)
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1029
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1030
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1031
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1032	\begin{quote}\it
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1033	``Correctness Claim: We further claim that the incremental parsing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1034	method [..] in combination with the simplification steps [..]
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1035	yields POSIX parse trees. We have tested this claim
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1036	extensively [..] but yet
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1037	have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1038	\end{quote}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1039	Ausaf and Urban made some initial progress towards the
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1040	full correctness proof but still had to leave out the optimisation
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1041	Sulzmann and Lu proposed.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1042	Ausaf wrote \cite{Ausaf},
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1043	\begin{quote}\it
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1044	``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1045	\end{quote}
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1046	This thesis implements the aggressive simplifications envisioned
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1047	by Ausaf and Urban,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1048	together with a formal proof of the correctness of those simplifications.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1049
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1050
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1051	One of the most recent work in the context of lexing
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1052	%with this issue
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1053	is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1054	This is relevant work for us and we will compare it later with
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1055	our derivative-based matcher we are going to present.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1056	There is also some newer work called
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1057	Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1058	but deterministic finite automaton instead.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1059	We will also study this work in a later section.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1060	%An example that gives problem to automaton approaches would be
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1061	%the regular expression $(a\|b)^*a(a\|b)^{\{n\}}$.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1062	%It requires at least $2^{n+1}$ states to represent
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1063	%as a DFA.
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1064
ef2b8abcbc55 more Chengsong parents: 646 diff changeset	1065
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1066	\section{Basic Concepts}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1067	Formal language theory usually starts with an alphabet
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1068	denoting a set of characters.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1069	Here we use the datatype of characters from Isabelle,
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1070	which roughly corresponds to the ASCII characters.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1071	In what follows, we shall leave the information about the alphabet
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1072	implicit.
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1073	Then using the usual bracket notation for lists,
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1074	we can define strings made up of characters as:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1075	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1076	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1077	$\textit{s}$ & $\dn$ & $[] \; \|\; c :: s$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1078	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	1079	\end{center}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1080	where $c$ is a variable ranging over characters.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1081	The $::$ stands for list cons and $[]$ for the empty
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1082	list.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1083	For brevity, a singleton list is sometimes written as $[c]$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1084	Strings can be concatenated to form longer strings in the same
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1085	way we concatenate two lists, which we shall write as $s_1 @ s_2$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1086	We omit the precise
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1087	recursive definition here.
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1088	We overload this concatenation operator for two sets of strings:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1089	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1090	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1091	$A @ B $ & $\dn$ & $\{s_A @ s_B \mid s_A \in A \land s_B \in B \}$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1092	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	1093	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1094	We also call the above \emph{language concatenation}.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1095	The power of a language is defined recursively, using the
cc54ce075db5 restructured Chengsong parents: diff changeset	1096	concatenation operator $@$:
cc54ce075db5 restructured Chengsong parents: diff changeset	1097	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1098	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	1099	$A^0 $ & $\dn$ & $\{ [] \}$\\
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1100	$A^{n+1}$ & $\dn$ & $A @ A^n$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1101	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	1102	\end{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1103	The union of all powers of a language
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1104	can be used to define the Kleene star operator:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1105	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1106	\begin{tabular}{lcl}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1107	$A*$ & $\dn$ & $\bigcup_{i \geq 0} A^i$ \\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1108	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	1109	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1110
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1111	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1112	However, to obtain a more convenient induction principle
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1113	in Isabelle/HOL,
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1114	we instead define the Kleene star
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1115	as an inductive set:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1116
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1117	\begin{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1118	\begin{mathpar}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1119	\inferrule{\mbox{}}{[] \in A*\\}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1120
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1121	\inferrule{s_1 \in A \;\; s_2 \in A}{s_1 @ s_2 \in A}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1122	\end{mathpar}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1123	\end{center}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1124	\noindent
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1125	We also define an operation of "chopping off" a character from
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1126	a language, which we call $\Der$, meaning \emph{Derivative} (for a language):
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1127	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1128	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	1129	$\textit{Der} \;c \;A$ & $\dn$ & $\{ s \mid c :: s \in A \}$\\
cc54ce075db5 restructured Chengsong parents: diff changeset	1130	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	1131	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1132	\noindent
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1133	This can be generalised to ``chopping off'' a string
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1134	from all strings within a set $A$,
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1135	namely:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1136	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1137	\begin{tabular}{lcl}
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1138	$\textit{Ders} \;s \;A$ & $\dn$ & $\{ s' \mid s@s' \in A \}$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1139	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	1140	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1141	\noindent
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1142	which is essentially the left quotient $A \backslash L$ of $A$ against
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1143	the singleton language with $L = \{s\}$.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1144	However, for our purposes here, the $\textit{Ders}$ definition with
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1145	a single string is sufficient.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1146
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1147	The reason for defining derivatives
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1148	is that they provide another approach
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1149	to test membership of a string in
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1150	a set of strings.
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1151	For example, to test whether the string
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1152	$bar$ is contained in the set $\{foo, bar, brak\}$, one can take derivative of the set with
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1153	respect to the string $bar$:
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1154	\begin{center}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1155	\begin{tabular}{lll}
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1156	$S = \{foo, bar, brak\}$ & $ \stackrel{\backslash b}{\rightarrow }$ &
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1157	$\{ar, rak\}$ \\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1158	& $\stackrel{\backslash a}{\rightarrow}$ & $\{r \}$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1159	& $\stackrel{\backslash r}{\rightarrow}$ & $\{[]\}$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1160	%& $\stackrel{[] \in S \backslash bar}{\longrightarrow}$ & $bar \in S$\\
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1161	\end{tabular}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1162	\end{center}
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1163	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1164	and in the end, test whether the set
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1165	contains the empty string.\footnote{We use the infix notation $A\backslash c$
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1166	instead of $\Der \; c \; A$ for brevity, as it will always be
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1167	clear from the context that we are operating
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1168	on languages rather than regular expressions.}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1169
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1170	In general, if we have a language $S$,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1171	then we can test whether $s$ is in $S$
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1172	by testing whether $[] \in S \backslash s$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1173	With the sequencing, Kleene star, and $\textit{Der}$ operator on languages,
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1174	we have a few properties of how the language derivative can be defined using
cc54ce075db5 restructured Chengsong parents: diff changeset	1175	sub-languages.
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1176	For example, for the sequence operator, we have
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1177	something similar to a ``chain rule'':
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1178	\begin{lemma}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1179	\[
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1180	\Der \; c \; (A @ B) =
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1181	\begin{cases}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1182	((\Der \; c \; A) \, @ \, B ) \cup (\Der \; c\; B) , & \text{if} \; [] \in A \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1183	(\Der \; c \; A) \, @ \, B, & \text{otherwise}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1184	\end{cases}
aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1185	\]
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1186	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	1187	\noindent
cc54ce075db5 restructured Chengsong parents: diff changeset	1188	This lemma states that if $A$ contains the empty string, $\Der$ can "pierce" through it
cc54ce075db5 restructured Chengsong parents: diff changeset	1189	and get to $B$.
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1190	The language derivative for $A*$ can be described using the language derivative
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1191	of $A$:
cc54ce075db5 restructured Chengsong parents: diff changeset	1192	\begin{lemma}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1193	$\textit{Der} \;c \;(A) = (\textit{Der}\; c A) @ (A)$\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1194	\end{lemma}
cc54ce075db5 restructured Chengsong parents: diff changeset	1195	\begin{proof}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1196	There are two inclusions to prove:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1197	\begin{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1198	\item{$\subseteq$}:\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1199	The set
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1200	\[ S_1 = \{s \mid c :: s \in A*\} \]
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1201	is enclosed in the set
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1202	\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\; s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1203	This is because for any string $c::s$ satisfying $c::s \in A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1204	%whenever you have a string starting with a character
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1205	%in the language of a Kleene star $A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1206	%then that
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1207	the character $c$, together with a prefix of $s$
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1208	%immediately after $c$
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1209	forms the first iteration of $A*$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1210	and the rest of the $s$ is also $A*$.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1211	This coincides with the definition of $S_2$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1212	\item{$\supseteq$}:\\
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1213	Note that
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1214	\[ \Der \; c \; (A) = \Der \; c \; (\{ [] \} \cup (A @ A) ) \]
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1215	holds.
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1216	Also the following holds:
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1217	\[ \Der \; c \; (\{ [] \} \cup (A @ A) ) = \Der\; c \; (A @ A) \]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1218	where the $\textit{RHS}$ can be rewritten
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1219	as \[ (\Der \; c\; A) @ A* \cup (\Der \; c \; (A*)) \]
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1220	which of course contains $\Der \; c \; A @ A*$.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1221	\end{itemize}
cc54ce075db5 restructured Chengsong parents: diff changeset	1222	\end{proof}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1223
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1224	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1225	The clever idea of Brzozowski was to find counterparts of $\Der$ and $\Ders$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1226	for regular expressions.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1227	To introduce them, we need to first give definitions for regular expressions,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1228	which we shall do next.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1229
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1230	\subsection{Regular Expressions and Their Meaning}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1231	The \emph{basic regular expressions} are defined inductively
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1232	by the following grammar:
cc54ce075db5 restructured Chengsong parents: diff changeset	1233	\[ r ::= \ZERO \mid \ONE
cc54ce075db5 restructured Chengsong parents: diff changeset	1234	\mid c
cc54ce075db5 restructured Chengsong parents: diff changeset	1235	\mid r_1 \cdot r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	1236	\mid r_1 + r_2
cc54ce075db5 restructured Chengsong parents: diff changeset	1237	\mid r^*
cc54ce075db5 restructured Chengsong parents: diff changeset	1238	\]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1239	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1240	We call them basic because we will introduce
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1241	additional constructors in later chapters, such as negation
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1242	and bounded repetitions.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1243	We use $\ZERO$ for the regular expression that
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1244	matches no string, and $\ONE$ for the regular
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1245	expression that matches only the empty string.\footnote{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1246	Some authors
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1247	also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1248	but we prefer this notation.}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1249	The sequence regular expression is written $r_1\cdot r_2$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1250	and sometimes we omit the dot if it is clear which
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1251	regular expression is meant; the alternative
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1252	is written $r_1 + r_2$.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1253	The \emph{language} or meaning of
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1254	a regular expression is defined recursively as
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1255	a set of strings:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1256	%TODO: FILL in the other defs
cc54ce075db5 restructured Chengsong parents: diff changeset	1257	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1258	\begin{tabular}{lcl}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1259	$L \; \ZERO$ & $\dn$ & $\phi$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1260	$L \; \ONE$ & $\dn$ & $\{[]\}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1261	$L \; c$ & $\dn$ & $\{[c]\}$\\
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1262	$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1263	$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; r_1 @ L \; r_2$\\
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1264	$L \; (r^)$ & $\dn$ & $ (L\;r)$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1265	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	1266	\end{center}
536 aff7bf93b9c7 comments addressed all Chengsong parents: 532 diff changeset	1267	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1268	%Now with language derivatives of a language and regular expressions and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1269	%their language interpretations in place, we are ready to define derivatives on regular expressions.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1270	With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1271	We do so by first introducing what properties they should satisfy.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1272
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1273	\subsection{Brzozowski Derivatives and a Regular Expression Matcher}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1274	%Recall, the language derivative acts on a set of strings
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1275	%and essentially chops off a particular character from
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1276	%all strings in that set, Brzozowski defined a derivative operation on regular expressions
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1277	%so that after derivative $L(r\backslash c)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1278	%will look as if it was obtained by doing a language derivative on $L(r)$:
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1279	%Recall that the language derivative acts on a
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1280	%language (set of strings).
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1281	%One can decide whether a string $s$ belongs
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1282	%to a language $S$ by taking derivative with respect to
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1283	%that string and then checking whether the empty
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1284	%string is in the derivative:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1285	%\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1286	%\parskip \baselineskip
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1287	%\def\myupbracefill#1{\rotatebox{90}{\stretchto{\{}{#1}}}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1288	%\def\rlwd{.5pt}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1289	%\newcommand\notate[3]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1290	% \unskip\def\useanchorwidth{T}%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1291	% \setbox0=\hbox{#1}%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1292	% \def\stackalignment{c}\stackunder[-6pt]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1293	% \def\stackalignment{c}\stackunder[-1.5pt]{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1294	% \stackunder[-2pt]{\strut #1}{\myupbracefill{\wd0}}}{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1295	% \rule{\rlwd}{#2\baselineskip}}}{%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1296	% \strut\kern7pt$\hookrightarrow$\rlap{ \footnotesize#3}}\ignorespaces%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1297	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1298	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1299	%\notate{$\{ \ldots ,\;$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1300	% \notate{s}{1}{$(c_1 :: s_1)$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1301	% $, \; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1302	%}{1}{$S_{start}$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1303	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1304	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1305	% $\stackrel{\backslash c_1}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1306	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1307	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1308	% $\{ \ldots,\;$ \notate{$s_1$}{1}{$(c_2::s_2)$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1309	% $,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1310	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1311	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1312	% $\stackrel{\backslash c_2}{\longrightarrow}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1313	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1314	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1315	% $\{ \ldots,\; s_2
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1316	% ,\; \ldots \}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1317	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1318	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1319	% $ \xdashrightarrow{\backslash c_3\ldots\ldots} $
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1320	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1321	%\Longstack{
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1322	% \notate{$\{\ldots, [], \ldots\}$}{1}{$S_{end} =
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1323	% S_{start}\backslash s$}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1324	%}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1325	%\end{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1326	%\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1327	% $s \in S_{start} \iff [] \in S_{end}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1328	%\end{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1329	%\noindent
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1330	Brzozowski noticed that $\Der$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1331	can be ``mirrored'' on regular expressions which
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1332	he calls the derivative of a regular expression $r$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1333	with respect to a character $c$, written
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1334	$r \backslash c$. This infix operator
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1335	takes regular expression $r$ as input
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1336	and a character as a right operand.
577 f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1337	The derivative operation on regular expression
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1338	is defined such that the language of the derivative result
f47fc4840579 thesis chap2 Chengsong parents: 573 diff changeset	1339	coincides with the language of the original
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1340	regular expression being taken
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1341	derivative with respect to the same characters, namely
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1342	\begin{property}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1343
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1344	\[
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1345	L \; (r \backslash c) = \Der \; c \; (L \; r)
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1346	\]
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1347	\end{property}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1348	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1349	Next, we give the recursive definition of derivative on
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1350	regular expressions so that it satisfies the properties above.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1351	%The derivative function, written $r\backslash c$,
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1352	%takes a regular expression $r$ and character $c$, and
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1353	%returns a new regular expression representing
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1354	%the original regular expression's language $L \; r$
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1355	%being taken the language derivative with respect to $c$.
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	1356	\begin{table}
1c8525061545 finished! Chengsong parents: 623 diff changeset	1357	\begin{center}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1358	\begin{tabular}{lcl}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1359	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1360	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1361	$d \backslash c$ & $\dn$ &
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1362	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1363	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1364	$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, [] \in L(r_1)$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1365	& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1366	& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1367	$(r^)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^$\\
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1368	\end{tabular}
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	1369	\end{center}
1c8525061545 finished! Chengsong parents: 623 diff changeset	1370	\caption{Derivative on Regular Expressions}
1c8525061545 finished! Chengsong parents: 623 diff changeset	1371	\label{table:der}
1c8525061545 finished! Chengsong parents: 623 diff changeset	1372	\end{table}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1373	\noindent
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1374	The most involved cases are the sequence case
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1375	and the star case.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1376	The sequence case says that if the first regular expression
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1377	contains an empty string, then the second component of the sequence
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1378	needs to be considered, as its derivative will contribute to the
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1379	result of this derivative:
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1380	\begin{center}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1381	\begin{tabular}{lcl}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1382	$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1383	$\textit{if}\;\,([] \in L(r_1))\;
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1384	\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1385	& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1386	\end{tabular}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1387	\end{center}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1388	\noindent
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1389	Notice how this closely resembles
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1390	the language derivative operation $\Der$:
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1391	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1392	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1393	$\Der \; c \; (A @ B)$ & $\dn$ &
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1394	$ \textit{if} \;\, [] \in A \;
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1395	\textit{then} \;\, ((\Der \; c \; A) @ B ) \cup
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1396	\Der \; c\; B$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1397	& & $\textit{else}\; (\Der \; c \; A) @ B$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1398	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1399	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1400	\noindent
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1401	The derivative of the star regular expression $r^*$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1402	unwraps one iteration of $r$, turns it into $r\backslash c$,
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1403	and attaches the original $r^*$
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1404	after $r\backslash c$, so that
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1405	we can further unfold it as many times as needed:
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1406	\[
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1407	(r^) \backslash c \dn (r \backslash c)\cdot r^.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1408	\]
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1409	Again,
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1410	the structure is the same as the language derivative of the Kleene star:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1411	\[
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1412	\textit{Der} \;c \;(A) \dn (\textit{Der}\; c A) @ (A)
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1413	\]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1414	In the above definition of $(r_1\cdot r_2) \backslash c$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1415	the $\textit{if}$ clause's
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1416	boolean condition
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1417	$[] \in L(r_1)$ needs to be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1418	somehow recursively computed.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1419	We call such a function that checks
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1420	whether the empty string $[]$ is
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1421	in the language of a regular expression $\nullable$:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1422	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1423	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1424	$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1425	$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1426	$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1427	$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1428	$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1429	$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1430	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1431	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1432	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1433	The $\ZERO$ regular expression
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1434	does not contain any string and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1435	therefore is not \emph{nullable}.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1436	$\ONE$ is \emph{nullable}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1437	by definition.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1438	The character regular expression $c$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1439	corresponds to the singleton set $\{c\}$,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1440	and therefore does not contain the empty string.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1441	The alternative regular expression is nullable
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1442	if at least one of its children is nullable.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1443	The sequence regular expression
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1444	would require both children to have the empty string
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1445	to compose an empty string, and the Kleene star
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1446	is always nullable because it naturally
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1447	contains the empty string.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1448	\noindent
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1449	We have the following two correspondences between
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1450	derivatives on regular expressions and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1451	derivatives on a set of strings:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1452	\begin{lemma}\label{derDer}
608 37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	1453	\mbox{}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1454	\begin{itemize}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1455	\item
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1456	$\textit{Der} \; c \; L(r) = L (r\backslash c)$
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1457	\item
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1458	$c\!::\!s \in L(r)$ \textit{iff} $s \in L(r\backslash c)$.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1459	\end{itemize}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1460	\end{lemma}
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1461	\begin{proof}
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1462	By induction on $r$.
35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1463	\end{proof}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1464	\noindent
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1465	which are the main properties of derivatives
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1466	that enables us later to reason about the correctness of
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1467	derivative-based matching.
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1468	We can generalise the derivative operation shown above for single characters
cc54ce075db5 restructured Chengsong parents: diff changeset	1469	to strings as follows:
cc54ce075db5 restructured Chengsong parents: diff changeset	1470
cc54ce075db5 restructured Chengsong parents: diff changeset	1471	\begin{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1472	\begin{tabular}{lcl}
cc54ce075db5 restructured Chengsong parents: diff changeset	1473	$r \backslash_s (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash_s s$ \\
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1474	$r \backslash_s [\,] $ & $\dn$ & $r$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1475	\end{tabular}
cc54ce075db5 restructured Chengsong parents: diff changeset	1476	\end{center}
cc54ce075db5 restructured Chengsong parents: diff changeset	1477
cc54ce075db5 restructured Chengsong parents: diff changeset	1478	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1479	When there is no ambiguity, we will
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1480	omit the subscript and use $\backslash$ instead
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1481	of $\backslash_s$ to denote
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1482	string derivatives for brevity.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1483	Brzozowski's derivative-based
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1484	regular-expression matching algorithm can then be described as:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1485
cc54ce075db5 restructured Chengsong parents: diff changeset	1486	\begin{definition}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1487	$\textit{match}\;s\;r \;\dn\; \nullable \; (r\backslash s)$
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1488	\end{definition}
cc54ce075db5 restructured Chengsong parents: diff changeset	1489
cc54ce075db5 restructured Chengsong parents: diff changeset	1490	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1491	Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1492	this algorithm, presented graphically, is as follows:
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1493
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1494	\begin{equation}\label{matcher}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1495	\begin{tikzcd}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1496	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1497	r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] &
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1498	\;\textrm{true}/\textrm{false}
532 cc54ce075db5 restructured Chengsong parents: diff changeset	1499	\end{tikzcd}
cc54ce075db5 restructured Chengsong parents: diff changeset	1500	\end{equation}
cc54ce075db5 restructured Chengsong parents: diff changeset	1501
cc54ce075db5 restructured Chengsong parents: diff changeset	1502	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1503	It is relatively
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1504	easy to show that this matcher is correct, namely
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1505	\begin{lemma}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1506	$\textit{match} \; s\; r = \textit{true} \; \textit{iff} \; s \in L(r)$
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1507	\end{lemma}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1508	\begin{proof}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1509	By induction on $s$ using the property of derivatives:
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1510	lemma \ref{derDer}.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1511	\end{proof}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1512	\begin{figure}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1513	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1514	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1515	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1516	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1517	ylabel={time in secs},
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1518	%ymode = log,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1519	legend entries={Naive Matcher},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1520	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1521	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1522	\addplot[red,mark=*, mark options={fill=white}] table {NaiveMatcher.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1523	\end{axis}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1524	\end{tikzpicture}
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1525	\caption{Matching the regular expression $(a^)^b$ against strings of the form
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1526	$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1527	$ using Brzozowski's original algorithm}\label{NaiveMatcher}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1528	\end{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1529	\end{figure}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1530	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1531	If we implement the above algorithm naively, however,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1532	the algorithm can be excruciatingly slow, as shown in
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1533	\ref{NaiveMatcher}.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1534	Note that both axes are in logarithmic scale.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1535	Around two dozen characters
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1536	this algorithm already ``explodes'' with the regular expression
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1537	$(a^)^b$.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1538	To improve this situation, we need to introduce simplification
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1539	rules for the intermediate results,
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1540	such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1541	and make sure those rules do not change the
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1542	language of the regular expression.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1543	One simple-minded simplification function
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1544	that achieves these requirements
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1545	is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1546	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1547	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1548	$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1549	$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1550	& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1551	& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1552	& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1553	& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1554	& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1555	$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1556	& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1557	& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1558	& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1559	$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1560
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1561	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1562	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1563	If we repeatedly apply this simplification
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1564	function during the matching algorithm,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1565	we have a matcher with simplification:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1566	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1567	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1568	$\derssimp \; [] \; r$ & $\dn$ & $r$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1569	$\derssimp \; c :: cs \; r$ & $\dn$ & $\derssimp \; cs \; (\simp \; (r \backslash c))$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1570	$\textit{matcher}_{simp}\; s \; r $ & $\dn$ & $\nullable \; (\derssimp \; s\;r)$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1571	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1572	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1573	\begin{figure}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1574	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1575	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1576	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1577	ylabel={time in secs},
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1578	%ymode = log,
ce4e5151a836 more Chengsong parents: 591 diff changeset	1579	%xmode = log,
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1580	grid = both,
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1581	legend entries={Matcher With Simp},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1582	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1583	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1584	\addplot[red,mark=*, mark options={fill=white}] table {BetterMatcher.data};
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1585	\end{axis}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1586	\end{tikzpicture}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1587	\caption{$(a^)^b$
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1588	against
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1589	$\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$ Using $\textit{matcher}_{simp}$}\label{BetterMatcher}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1590	\end{figure}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1591	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1592	The running time of $\textit{ders}\_\textit{simp}$
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1593	on the same example of Figure \ref{NaiveMatcher}
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1594	is now ``tame'' in terms of the length of inputs,
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1595	as shown in Figure \ref{BetterMatcher}.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1596
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1597	So far, the story is use Brzozowski derivatives and
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1598	simplify as much as possible, and at the end test
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1599	whether the empty string is recognised
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1600	by the final derivative.
583 4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1601	But what if we want to
4aabb0629e4b chap4 Chengsong parents: 579 diff changeset	1602	do lexing instead of just getting a true/false answer?
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1603	Sulzmann and Lu \cite{Sulzmann2014} first came up with a nice and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1604	elegant (arguably as beautiful as the definition of the
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1605	Brzozowski derivative) solution for this.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1606
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1607	\section{Values and the Lexing Algorithm by Sulzmann and Lu}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1608	In this section, we present a two-phase regular expression lexing
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1609	algorithm.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1610	The first phase takes successive derivatives with
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1611	respect to the input string,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1612	and the second phase does the reverse, \emph{injecting} back
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1613	characters, in the meantime constructing a lexing result.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1614	We will introduce the injection phase in detail slightly
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1615	later, but as a preliminary we have to first define
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1616	the datatype for lexing results,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1617	called \emph{value} or
608 37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	1618	sometimes also \emph{lexical value}.
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	1619	Values and regular
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	1620	expressions correspond to each other
37b6fd310a16 added related work chap Chengsong parents: 601 diff changeset	1621	as illustrated in the following
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1622	table:
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1623
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1624	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1625	\begin{tabular}{c@{\hspace{20mm}}c}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1626	\begin{tabular}{@{}rrl@{}}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1627	\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1628	$r$ & $::=$ & $\ZERO$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1629	& $\mid$ & $\ONE$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1630	& $\mid$ & $c$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1631	& $\mid$ & $r_1 \cdot r_2$\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1632	& $\mid$ & $r_1 + r_2$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1633	\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1634	& $\mid$ & $r^*$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1635	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1636	&
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1637	\begin{tabular}{@{\hspace{0mm}}rrl@{}}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1638	\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1639	$v$ & $::=$ & \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1640	& & $\Empty$ \\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1641	& $\mid$ & $\Char \; c$ \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1642	& $\mid$ & $\Seq\,v_1\, v_2$\\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1643	& $\mid$ & $\Left \;v$ \\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1644	& $\mid$ & $\Right\;v$ \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1645	& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1646	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1647	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1648	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1649	\noindent
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1650	A value has an underlying string, which
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1651	can be calculated by the ``flatten" function $\|\_\|$:
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1652	\begin{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1653	\begin{tabular}{lcl}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1654	$\|\Empty\|$ & $\dn$ & $[]$\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1655	$\|\Char \; c\|$ & $ \dn$ & $ [c]$\\
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1656	$\|\Seq \; v_1, \;v_2\|$ & $ \dn$ & $ v_1\| @ \|v_2\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1657	$\|\Left \; v\|$ & $ \dn$ & $ \|v\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1658	$\|\Right \; v\|$ & $ \dn$ & $ \|v\|$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1659	$\|\Stars \; []\|$ & $\dn$ & $[]$\\
b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1660	$\|\Stars \; v::vs\|$ & $\dn$ & $ \|v\| @ \|\Stars(vs)\|$
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1661	\end{tabular}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1662	\end{center}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1663	Sulzmann and Lu used a binary predicate, written $\vdash v:r $,
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1664	to indicate that a value $v$ could be generated from a lexing algorithm
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1665	with input $r$. They call it the value inhabitation relation,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1666	defined by the rules.
628 7af4e2420a8c ready to submit~~ Chengsong parents: 626 diff changeset	1667	\begin{figure}[H]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1668	\begin{mathpar}
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1669	\inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1670
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1671	\inferrule{\mbox{}}{\vdash \Empty : \ONE} \hspace{2em}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1672
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1673	\inferrule{\vdash v_1 : r_1 \;\; \vdash v_2 : r_2 }{\vdash \Seq \; v_1,\; v_2 : (r_1 \cdot r_2)}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1674
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1675	\inferrule{\vdash v_1 : r_1}{\vdash \Left \; v_1 : r_1+r_2}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1676
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1677	\inferrule{\vdash v_2 : r_2}{\vdash \Right \; v_2:r_1 + r_2}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1678
591 b2d0de6aee18 more polishing integrated comments chap2 Chengsong parents: 585 diff changeset	1679	\inferrule{\forall v \in vs. \vdash v:r \land \|v\| \neq []}{\vdash \Stars \; vs : r^*}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1680	\end{mathpar}
628 7af4e2420a8c ready to submit~~ Chengsong parents: 626 diff changeset	1681	\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab}
626 1c8525061545 finished! Chengsong parents: 623 diff changeset	1682	\end{figure}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1683	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1684	The condition $\|v\| \neq []$ in the premise of star's rule
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1685	is to make sure that for a given pair of regular
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1686	expression $r$ and string $s$, the number of values
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1687	satisfying $\|v\| = s$ and $\vdash v:r$ is finite.
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1688	This additional condition was
ce4e5151a836 more Chengsong parents: 591 diff changeset	1689	imposed by Ausaf and Urban to make their proofs easier.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1690	Given a string and a regular expression, there can be
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1691	multiple values for it. For example, both
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1692	$\vdash \Seq(\Left \; ab)(\Right \; c):(ab+a)(bc+c)$ and
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1693	$\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1694	and the values both flatten to $abc$.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1695	Lexers therefore have to disambiguate and choose only
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1696	one of the values to be generated. $\POSIX$ is one of the
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1697	disambiguation strategies that is widely adopted.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1698
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1699	Ausaf et al. \cite{AusafDyckhoffUrban2016}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1700	formalised the property
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1701	as a ternary relation.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1702	The $\POSIX$ value $v$ for a regular expression
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1703	$r$ and string $s$, denoted as $(s, r) \rightarrow v$, can be specified
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1704	in the following rules\footnote{The names of the rules are used
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1705	as they were originally given in \cite{AusafDyckhoffUrban2016}.}:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1706	\begin{figure}[p]
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1707	\begin{mathpar}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1708	\inferrule[P1]{\mbox{}}{([], \ONE) \rightarrow \Empty}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1709
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1710	\inferrule[PC]{\mbox{}}{([c], c) \rightarrow \Char \; c}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1711
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1712	\inferrule[P+L]{(s,r_1)\rightarrow v_1}{(s, r_1+r_2)\rightarrow \Left \; v_1}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1713
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1714	\inferrule[P+R]{(s,r_2)\rightarrow v_2\\ s \notin L \; r_1}{(s, r_1+r_2)\rightarrow \Right \; v_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1715
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1716	\inferrule[PS]{(s_1, v_1) \rightarrow r_1 \\ (s_2, v_2)\rightarrow r_2\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1717	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1718	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2}{(s_1 @ s_2, r_1\cdot r_2) \rightarrow
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1719	\Seq \; v_1 \; v_2}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1720
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1721	\inferrule[P{[]}]{\mbox{}}{([], r^*) \rightarrow \Stars([])}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1722
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1723	\inferrule[P]{(s_1, v) \rightarrow v \\ (s_2, r^) \rightarrow \Stars \; vs \\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1724	\|v\| \neq []\\ \nexists s_3 \; s_4. s_3 \neq [] \land s_3@s_4 = s_2 \land
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1725	s_1@s_3 \in L \; r \land s_4 \in L \; r^}{(s_1@s_2, r^)\rightarrow \Stars \;
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1726	(v::vs)}
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1727	\end{mathpar}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1728	\caption{The inductive POSIX rules given by Ausaf et al.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1729	\cite{AusafDyckhoffUrban2016}.
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1730	This ternary relation, written $(s, r) \rightarrow v$,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1731	formalises the POSIX constraints on the
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1732	value $v$ given a string $s$ and
ce4e5151a836 more Chengsong parents: 591 diff changeset	1733	regular expression $r$.
ce4e5151a836 more Chengsong parents: 591 diff changeset	1734	}
646 56057198e4f5 intro Chengsong parents: 639 diff changeset	1735	\label{fig:POSIXDef}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1736	\end{figure}\afterpage{\clearpage}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1737	\noindent
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1738
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1739	%\begin{figure}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1740	%\begin{tikzpicture}[]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1741	% \node [minimum width = 6cm, rectangle split, rectangle split horizontal,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1742	% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1743	% (node1)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1744	% {$r_{token1}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1745	% \nodepart{two} $\;\;\; \quad r_{token2}\;\;\;\quad$ };
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1746	% %\node [left = 6.0cm of node1] (start1) {hi};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1747	% \node [left = 0.2cm of node1] (middle) {$v.s.$};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1748	% \node [minimum width = 6cm, left = 0.2cm of middle, rectangle split, rectangle split horizontal,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1749	% rectangle split parts=2, rectangle split part fill={red!30,blue!20}, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1750	% (node2)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1751	% {$\quad\;\;\;r_{token1}\quad\;\;\;$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1752	% \nodepart{two} $r_{token2}$ };
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1753	% \node [below = 0.1cm of node2] (text1) {\checkmark preferred by POSIX};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1754	% \node [above = 1.5cm of middle, minimum width = 6cm,
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1755	% rectangle, style={draw, rounded corners, inner sep=10pt}]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1756	% (topNode) {$s$};
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1757	% \path[->,draw]
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1758	% (topNode) edge node {split $A$} (node2)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1759	% (topNode) edge node {split $B$} (node1)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1760	% ;
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1761	%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1762	%
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1763	%\end{tikzpicture}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1764	%\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1765	%\end{figure}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1766	The above $\POSIX$ rules follow the intuition described below:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1767	\begin{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1768	\item (Left Priority)\\
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1769	Match the leftmost regular expression when multiple options for matching
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1770	are available. See P+L and P+R where in P+R $s$ cannot
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1771	be in the language of $L \; r_1$.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1772	\item (Maximum munch)\\
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1773	Always match a subpart as much as possible before proceeding
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1774	to the next part of the string.
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	1775	For example, when the string $s$ matches
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1776	$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1777	Then the split that matches a longer string for the first part
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1778	$r_{part1}$ is preferred by this maximum munch rule.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1779	The side-condition
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1780	\begin{center}
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1781	$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1782	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1783	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1784	in PS causes this.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1785	%(See
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1786	%\ref{munch} for an illustration).
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1787	\end{itemize}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1788	\noindent
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1789	These disambiguation strategies can be
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1790	quite practical.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1791	For instance, when lexing a code snippet
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1792	\[
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1793	\textit{iffoo} = 3
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1794	\]
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1795	using a regular expression
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1796	for keywords and
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1797	identifiers:
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1798	%(for example, keyword is a nonempty string starting with letters
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1799	%followed by alphanumeric characters or underscores):
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1800	\[
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1801	r_{keyword} + r_{identifier}.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1802	\]
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1803	If we want $\textit{iffoo}$ to be recognized
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1804	as an identifier
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1805	where identifiers are defined as usual (letters
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1806	followed by letters, numbers or underscores),
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1807	then a match with a keyword (if)
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1808	followed by
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1809	an identifier (foo) would be incorrect.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1810	POSIX lexing generates what is included by lexing.
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1811
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1812	\noindent
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1813	We know that a POSIX
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1814	value for regular expression $r$ is inhabited by $r$.
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1815	\begin{lemma}
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1816	$(r, s) \rightarrow v \implies \vdash v: r$
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1817	\end{lemma}
5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	1818	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1819	The main property about a $\POSIX$ value is that
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1820	given the same regular expression $r$ and string $s$,
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1821	one can always uniquely determine the $\POSIX$ value for it:
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1822	\begin{lemma}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1823	$\textit{if} \,(s, r) \rightarrow v_1 \land (s, r) \rightarrow v_2\quad \textit{then} \; v_1 = v_2$
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1824	\end{lemma}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1825	\begin{proof}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1826	By induction on $s$, $r$ and $v_1$. The inductive cases
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1827	are all the POSIX rules.
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1828	Probably the most cumbersome cases are
3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1829	the sequence and star with non-empty iterations.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1830	We shall give the details for proving the sequence case here.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1831
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1832	When we have
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1833	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1834	(s_1, r_1) \rightarrow v_1 \;\, and \;\,
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1835	(s_2, r_2) \rightarrow v_2 \;\, and \;\,
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1836	\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1837	s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1838	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1839	we know that the last condition
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1840	excludes the possibility of a
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1841	string $s_1'$ longer than $s_1$ such that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1842	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1843	(s_1', r_1) \rightarrow v_1' \;\;
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1844	and\;\; (s_2', r_2) \rightarrow v_2'\;\; and \;\;s_1' @s_2' = s
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1845	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1846	hold.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1847	A shorter string $s_1''$ with $s_2''$ satisfying
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1848	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1849	(s_1'', r_1) \rightarrow v_1''
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1850	\;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1851	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1852	cannot possibly form a $\POSIX$ value either, because
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1853	by definition, there is a candidate
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1854	with a longer initial string
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1855	$s_1$. Therefore, we know that the POSIX
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1856	value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1857	$s$ must have the
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1858	property that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1859	\[
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1860	\|a\| = s_1 \;\; and \;\; \|b\| = s_2.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1861	\]
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1862	The goal is to prove that $a = v_1 $ and $b = v_2$.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1863	If we have some other POSIX values $v_{10}$ and $v_{20}$ such that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1864	$(s_1, r_1) \rightarrow v_{10}$ and $(s_2, r_2) \rightarrow v_{20}$ hold,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1865	then by induction hypothesis $v_{10} = v_1$ and $v_{20}= v_2$,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1866	which means this "other" $\POSIX$ value $\Seq(v_{10}, v_{20})$
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	1867	is the same as $\Seq(v_1, v_2)$.
7cf9f17aa179 more Chengsong parents: 538 diff changeset	1868	\end{proof}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1869	\noindent
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1870	We have now defined what a POSIX value is and shown that it is unique.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1871	The problem is to generate
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1872	such a value in a lexing algorithm using derivatives.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1873
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1874	\subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1875
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1876	Sulzmann and Lu extended Brzozowski's
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1877	derivative-based matching
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1878	to a lexing algorithm by a second phase
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1879	after the initial phase of successive derivatives.
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1880	This second phase generates a POSIX value
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1881	if the regular expression matches the string.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1882	The algorithm uses two functions called $\inj$ and $\mkeps$.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1883	The function $\mkeps$ constructs a POSIX value from the last
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1884	derivative $r_n$:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1885	\begin{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1886	\begin{equation}\label{graph:mkeps}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1887	\begin{tikzcd}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1888	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed, "\ldots"] & r_n \arrow[d, "mkeps" description] \\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1889	& & & v_n
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1890	\end{tikzcd}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1891	\end{equation}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1892	\end{ceqn}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1893	\noindent
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1894	In the above diagram, again we assume that
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1895	the input string $s$ is made of $n$ characters
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1896	$c_0c_1 \ldots c_{n-1}$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1897	The last derivative operation
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1898	$\backslash c_{n-1}$ generates the derivative $r_n$, for which
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1899	$\mkeps$ produces the value $v_n$. This value
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1900	tells us how the empty string is matched by the (nullable)
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1901	regular expression $r_n$, in a POSIX way.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1902	The definition of $\mkeps$ is
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1903	\begin{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1904	\begin{tabular}{lcl}
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1905	$\mkeps \; \ONE$ & $\dn$ & $\Empty$ \\
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1906	$\mkeps \; (r_{1}+r_{2})$ & $\dn$
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1907	& $\textit{if}\; (\nullable \; r_{1}) \;\,
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1908	\textit{then}\;\, \Left \; (\mkeps \; r_{1})$\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1909	& & $\phantom{if}\; \textit{else}\;\, \Right \;(\mkeps \; r_{2})$\\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1910	$\mkeps \; (r_1 \cdot r_2)$ & $\dn$ & $\Seq\;(\mkeps\;r_1)\;(\mkeps \; r_2)$\\
564 3cbcd7cda0a9 more Chengsong parents: 543 diff changeset	1911	$\mkeps \; r^* $ & $\dn$ & $\Stars\;[]$
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1912	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1913	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1914
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1915
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1916	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1917	The function prefers the left child $r_1$ of $r_1 + r_2$
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1918	to match an empty string if there is a choice.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1919	When there is a star to match the empty string,
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1920	we give the $\Stars$ constructor an empty list, meaning
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1921	no iteration is taken.
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1922	The result of $\mkeps$ on a $\nullable$ $r$
4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1923	is a POSIX value for $r$ and the empty string:
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1924	\begin{lemma}\label{mePosix}
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	1925	$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1926	\end{lemma}
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1927	\begin{proof}
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1928	By induction on $r$.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1929	\end{proof}
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1930	\noindent
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1931	After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1932	in reverse order as they were chopped off in the derivative phase.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1933	The function for this is called $\inj$. This function
622 4b1149fb5aec incorporated more comments, bib Chengsong parents: 608 diff changeset	1934	operates on values, unlike $\backslash$ which operates on regular expressions.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1935	In the diagram below, $v_i$ stands for the (POSIX) value
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1936	for how the regular expression
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1937	$r_i$ matches the string $s_i$ consisting of the last $n-i$ characters
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1938	of $s$ (i.e. $s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1939	After injecting back $n$ characters, we get the lexical value for how $r_0$
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1940	matches $s$.
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1941	\begin{figure}[H]
ce4e5151a836 more Chengsong parents: 591 diff changeset	1942	\begin{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1943	\begin{ceqn}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1944	\begin{tikzcd}
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1945	r_0 \arrow[r, dashed] \arrow[d]& r_i \arrow[r, "\backslash c_i"] \arrow[d] & r_{i+1} \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1946	v_0 \arrow[u] & v_i \arrow[l, dashed] & v_{i+1} \arrow[l,"inj_{r_i} c_i"] & v_n \arrow[l, dashed]
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1947	\end{tikzcd}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1948	\end{ceqn}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1949	\end{center}
ce4e5151a836 more Chengsong parents: 591 diff changeset	1950	\caption{The two-phase lexing algorithm by Sulzmann and Lu \cite{AusafDyckhoffUrban2016},
ce4e5151a836 more Chengsong parents: 591 diff changeset	1951	matching the regular expression $r_0$ and string of the form $[c_0, c_1, \ldots, c_{n-1}]$.
ce4e5151a836 more Chengsong parents: 591 diff changeset	1952	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
ce4e5151a836 more Chengsong parents: 591 diff changeset	1953	$c_1$, and so on. These are the same operations as they have appeared in the matcher
ce4e5151a836 more Chengsong parents: 591 diff changeset	1954	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1955	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1956	the empty string, by always selecting the leftmost
e3752aac8ec2 more Chengsong parents: 628 diff changeset	1957	nullable regular expression. After that, $\inj$ ``injects'' back the character in reverse order as they
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	1958	appeared in the string, always preserving POSIXness.}\label{graph:inj}
ce4e5151a836 more Chengsong parents: 591 diff changeset	1959	\end{figure}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1960	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1961	The function $\textit{inj}$ as defined by Sulzmann and Lu
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1962	takes three arguments: a regular
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1963	expression ${r_{i}}$, before the character is chopped off,
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1964	a character ${c_{i}}$ (the character we want to inject back) and
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1965	the third argument $v_{i+1}$ the value we want to inject into.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1966	The result of an application
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1967	$\inj \; r_i \; c_i \; v_{i+1}$ is a new value $v_i$ such that
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1968	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1969	(s_i, r_i) \rightarrow v_i
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1970	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1971	holds.
567 28cb8089ec36 more updaates Chengsong parents: 564 diff changeset	1972	The definition of $\textit{inj}$ is as follows:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1973	\begin{center}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1974	\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{5mm}}l}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1975	$\textit{inj}\;(c)\;c\,Empty$ & $\dn$ & $\Char\,c$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1976	$\textit{inj}\;(r_1 + r_2)\;c\; (\Left\; v)$ & $\dn$ & $\Left \; (\textit{inj}\; r_1 \; c\,v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1977	$\textit{inj}\;(r_1 + r_2)\,c\; (\Right\;v)$ & $\dn$ & $\Right \; (\textit{inj}\;r_2\;c \; v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1978	$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Seq \; v_1 \; v_2)$ & $\dn$ &
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1979	$\Seq \; (\textit{inj}\;r_1\;c\;v_1) \; v_2$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1980	$\textit{inj}\;(r_1 \cdot r_2)\; c\;(\Left \; (\Seq \; v_1\;v_2) )$ &
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1981	$\dn$ & $\Seq \; (\textit{inj}\,r_1\,c\,v_1)\; v_2$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1982	$\textit{inj}\;(r_1 \cdot r_2)\; c\; (\Right\; v)$ & $\dn$ & $\Seq\; (\textit{mkeps}\; r_1) \; (\textit{inj} \; r_2\;c\;v)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1983	$\textit{inj}\;(r^*)\; c \; (\Seq \; v\; (\Stars\;vs))$ & $\dn$ & $\Stars\;\,((\textit{inj}\;r\;c\;v)\,::\,vs)$\\
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1984	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1985	\end{center}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1986
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	1987	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	1988	The function recurses on
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1989	the shape of regular
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1990	expressions and values.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1991	Intuitively, each clause analyses
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1992	how $r_i$ could have transformed when being
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1993	derived by $c$, identifying which subpart
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1994	of $v_{i+1}$ has the ``hole''
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1995	to inject the character back into.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1996	Once the character is
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1997	injected back to that sub-value;
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	1998	$\inj$ assembles all parts
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	1999	to form a new value.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2000
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2001	For instance, the last clause is an
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2002	injection into a sequence value $v_{i+1}$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2003	whose second child
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2004	value is a star and the shape of the
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2005	regular expression $r_i$ before injection
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2006	is a star.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2007	We therefore know
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2008	the derivative
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2009	starts on a star and ends as a sequence:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2010	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2011	(r^) \backslash c \longrightarrow r\backslash c \cdot r^
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2012	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2013	during which an iteration of the star
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2014	had just been unfolded, giving the below
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2015	value inhabitation relation:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2016	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2017	\vdash \Seq \; v \; (\Stars \; vs) : (r\backslash c) \cdot r^*.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2018	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2019	The value list $vs$ corresponds to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2020	matched star iterations,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2021	and the ``hole'' lies in $v$ because
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2022	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2023	\vdash v: r\backslash c.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2024	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2025	Finally,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2026	$\inj \; r \;c \; v$ is prepended
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2027	to the previous list of iterations and then
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2028	wrapped under the $\Stars$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2029	constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2030
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2031	Recall that lemma
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2032	\ref{mePosix} tells us that
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2033	$\mkeps$ always generates the POSIX value.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2034	The function $\inj$ preserves the POSIXness, provided
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2035	the value before injection is POSIX, namely
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2036	\begin{lemma}\label{injPosix}
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2037	If$(r \backslash c, s) \rightarrow v $,
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2038	then $(r, c :: s) \rightarrow (\inj r \; c\; v)$.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2039	\end{lemma}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2040	\begin{proof}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2041	By induction on $r$.
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2042	The non-trivial cases are sequence and star.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2043	When $r = a \cdot b$, there can be
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2044	three cases for the value $v$ satisfying $\vdash v:a\backslash c$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2045	We give the reasoning why $\inj \; r \; c \; v$ is POSIX in each
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2046	case.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2047	\begin{itemize}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2048	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2049	$v = \Seq \; v_a \; v_b$.\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2050	The ``not nullable'' clause of the $\inj$ function is taken:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2051	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2052	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2053	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Seq \; v_a \; v_b) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2054	& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2055	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2056	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2057	We know that there exists a unique pair of
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2058	$s_a$ and $s_b$ satisfying
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2059	$(a \backslash c, s_a) \rightarrow v_a$,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2060	$(b , s_b) \rightarrow v_b$, and
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2061	$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2062	L \; (a\backslash c) \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2063	s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2064	The last condition gives us
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2065	$\nexists s_3 \; s_4. s_3 \neq [] \land (c :: s_a )@ s_3 \in
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2066	L \; a \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2067	s_4 \in L \; b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2068	By induction hypothesis, $(a, c::s_a) \rightarrow \inj \; a \; c \; v_a $ holds,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2069	and this gives us
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2070	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2071	(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2072	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2073	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2074	$v = \Left \; (\Seq \; v_a \; v_b)$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2075	The argument is almost identical to the above case,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2076	except that a different clause of $\inj$ is taken:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2077	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2078	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2079	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; (\Left \; (\Seq \; v_a \; v_b)) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2080	& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2081	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2082	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2083	With similar reasoning,
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2084
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2085	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2086	(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2087	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2088	again holds.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2089	\item
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2090	$v = \Right \; v_b$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2091	Again the injection result would be
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2092	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2093	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2094	$\inj \; r \; c \; v$ & $=$ & $ \inj \;\; (a \cdot b) \;\; c \;\; \Right \; (v_b) $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2095	& $=$ & $\Seq \; (\mkeps \; a) \; (\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2096	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2097	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2098	We know that $a$ must be nullable,
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2099	allowing us to call $\mkeps$ and get
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2100	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2101	(a, []) \rightarrow \mkeps \; a.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2102	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2103	Also, by inductive hypothesis
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2104	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2105	(b, c::s) \rightarrow \inj\; b \; c \; v_b
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2106	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2107	holds.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2108	In addition, as
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2109	$\Right \;v_b$ instead of $\Left \ldots$ is
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2110	the POSIX value for $v$, it must be the case
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2111	that $s \notin L \;( (a\backslash c)\cdot b)$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2112	This tells us that
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2113	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2114	\nexists s_3 \; s_4.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2115	s_3 @s_4 = s \land s_3 \in L \; (a\backslash c)
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2116	\land s_4 \in L \; b
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2117	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2118	which translates to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2119	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2120	\nexists s_3 \; s_4. \; s_3 \neq [] \land
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2121	s_3 @s_4 = c::s \land s_3 \in L \; a
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2122	\land s_4 \in L \; b.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2123	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2124	(Which says there cannot be a longer
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2125	initial split for $s$ other than the empty string.)
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2126	Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2127	as the POSIX value for $a\cdot b$.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2128	\end{itemize}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2129	The star case can be proven similarly.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2130	\end{proof}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2131	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2132	Putting all together, Sulzmann and Lu obtained the following algorithm
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2133	outlined in the diagram \ref{graph:inj}:
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2134	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2135	\begin{tabular}{lcl}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2136	$\lexer \; r \; [] $ & $=$ & $\textit{if} \; (\nullable \; r)\; \textit{then}\; \Some(\mkeps \; r) \; \textit{else} \; \None$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2137	$\lexer \; r \;c::s$ & $=$ & $\textit{case}\; (\lexer \; (r\backslash c) \; s) \;\textit{of}\; $\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2138	& & $\quad \phantom{\mid}\; \None \implies \None$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2139	& & $\quad \mid \Some(v) \implies \Some(\inj \; r\; c\; v)$
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2140	\end{tabular}
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2141	\end{center}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2142	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2143	The central property of the $\lexer$ is that it gives the correct result
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2144	according to
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2145	POSIX rules.
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2146	\begin{theorem}\label{lexerCorrectness}
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2147	The $\lexer$ based on derivatives and injections is correct:
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2148	\begin{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2149	\begin{tabular}{lcl}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2150	$\lexer \; r \; s = \Some(v)$ & $ \Longleftrightarrow$ & $ (r, \; s) \rightarrow v$\\
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2151	$\lexer \;r \; s = \None $ & $\Longleftrightarrow$ & $ \neg(\exists v. (r, s) \rightarrow v)$
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2152	\end{tabular}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2153	\end{center}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2154	\end{theorem}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2155	\begin{proof}
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2156	By induction on $s$. $r$ generalising over an arbitrary regular expression.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2157	The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2158	by lemma \ref{injPosix}.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2159	\end{proof}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2160	\noindent
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2161	As we did earlier in this chapter with the matcher, one can
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2162	introduce simplification on the regular expression in each derivative step.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2163	However, due to lexing, one needs to do a backward phase (w.r.t the forward derivative phase)
e3752aac8ec2 more Chengsong parents: 628 diff changeset	2164	and ensure that
e3752aac8ec2 more Chengsong parents: 628 diff changeset	2165	the values align with the regular expression at each step.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2166	Therefore one has to
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2167	be careful not to break the correctness, as the injection
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2168	function heavily relies on the structure of
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2169	the regular expressions and values being aligned.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2170	This can be achieved by recording some extra rectification functions
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2171	during the derivatives step and applying these rectifications in
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2172	each run during the injection phase.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2173	With extra care
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2174	one can show that POSIXness will not be affected
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2175	by the simplifications listed here \cite{AusafDyckhoffUrban2016}.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2176	\begin{center}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2177	\begin{tabular}{lcl}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2178	$\simp \; r_1 \cdot r_2 $ & $ \dn$ &
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2179	$(\simp \; r_1, \simp \; r_2) \; \textit{match}$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2180	& & $\quad \case \; (\ZERO, \_) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2181	& & $\quad \case \; (\_, \ZERO) \Rightarrow \ZERO$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2182	& & $\quad \case \; (\ONE, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2183	& & $\quad \case \; (r_1', \ONE) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2184	& & $\quad \case \; (r_1', r_2') \Rightarrow r_1'\cdot r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2185	$\simp \; r_1 + r_2$ & $\dn$ & $(\simp \; r_1, \simp \; r_2) \textit{match}$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2186	& & $\quad \; \case \; (\ZERO, r_2') \Rightarrow r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2187	& & $\quad \; \case \; (r_1', \ZERO) \Rightarrow r_1'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2188	& & $\quad \; \case \; (r_1', r_2') \Rightarrow r_1' + r_2'$\\
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2189	$\simp \; r$ & $\dn$ & $r\quad\quad (otherwise)$
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2190
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2191	\end{tabular}
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2192	\end{center}
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2193
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2194	However, one can still end up
e3752aac8ec2 more Chengsong parents: 628 diff changeset	2195	with exploding derivatives,
e3752aac8ec2 more Chengsong parents: 628 diff changeset	2196	even with the simple-minded simplification rules allowed
e3752aac8ec2 more Chengsong parents: 628 diff changeset	2197	in an injection-based lexer.
e3752aac8ec2 more Chengsong parents: 628 diff changeset	2198	\section{A Case Requiring More Aggressive Simplifications}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2199	For example, when starting with the regular
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	2200	expression $(a^* \cdot a^)^$ and building just over
4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	2201	a dozen successive derivatives
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2202	w.r.t.~the character $a$, one obtains a derivative regular expression
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	2203	with millions of nodes (when viewed as a tree)
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2204	even with the mentioned simplifications.
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	2205	\begin{figure}[H]
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	2206	\begin{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2207	\begin{tikzpicture}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	2208	\begin{axis}[
7cf9f17aa179 more Chengsong parents: 538 diff changeset	2209	xlabel={$n$},
7cf9f17aa179 more Chengsong parents: 538 diff changeset	2210	ylabel={size},
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	2211	legend entries={Simple-Minded Simp, Naive Matcher},
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2212	legend pos=north west,
7cf9f17aa179 more Chengsong parents: 538 diff changeset	2213	legend cell align=left]
7cf9f17aa179 more Chengsong parents: 538 diff changeset	2214	\addplot[red,mark=*, mark options={fill=white}] table {BetterWaterloo.data};
585 4969ef817d92 chap4 more Chengsong parents: 583 diff changeset	2215	\addplot[blue,mark=*, mark options={fill=white}] table {BetterWaterloo1.data};
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2216	\end{axis}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	2217	\end{tikzpicture}
601 ce4e5151a836 more Chengsong parents: 591 diff changeset	2218	\end{center}
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2219	\caption{Size of $(a^\cdot a^)^*$ against $\protect\underbrace{aa\ldots a}_\text{n \textit{a}s}$}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	2220	\end{figure}\label{fig:BetterWaterloo}
7cf9f17aa179 more Chengsong parents: 538 diff changeset	2221
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2222	That is because Sulzmann and Lu's
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2223	injection-based lexing algorithm keeps a lot of
541 5bf9f94c02e1 some comments implemented Chengsong parents: 539 diff changeset	2224	"useless" values that will not be used.
539 7cf9f17aa179 more Chengsong parents: 538 diff changeset	2225	These different ways of matching will grow exponentially with the string length.
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2226	Consider the case
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2227	\[
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2228	r= (a^\cdot a^)^* \quad and \quad
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2229	s=\underbrace{aa\ldots a}_\text{n \textit{a}s}
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2230	\]
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2231	as an example.
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2232	This is a highly ambiguous regular expression, with
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2233	many ways to split up the string into multiple segments for
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2234	different star iterations,
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2235	and for each segment
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2236	multiple ways of splitting between
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2237	the two $a^*$ sub-expressions.
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2238	When $n$ is equal to $1$, there are two lexical values for
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2239	the match:
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2240	\[
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2241	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])] \quad (v1)
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2242	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2243	and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2244	\[
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2245	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])] \quad (v2)
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2246	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2247	The derivative of $\derssimp \;s \; r$ is
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2248	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2249	(a^a^ + a^)\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2250	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2251	The $a^a^$ and $a^*$ in the first child of the above sequence
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2252	correspond to value 1 and value 2, respectively.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2253	When $n=2$, the number goes up to 7:
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2254	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2255	\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2256	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2257
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2258	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2259	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2260	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2261
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2262	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2263	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2264	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2265
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2266	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2267	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []), \Seq \; (\Stars \; [\Char\;a])\; (\Stars\; []) ]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2268	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2269
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2270	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2271	\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; []),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2272	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2273	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2274	\]
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2275
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2276	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2277	\Stars \; [
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2278	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2279	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2280	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2281	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2282	and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2283	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2284	\Stars \; [
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2285	\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a]),
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2286	\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2287	]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2288	\]
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	2289	And $\derssimp \; aa \; (a^a^)^*$ is
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2290	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2291	((a^a^ + a^)+a^)\cdot(a^a^)^* +
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2292	(a^a^ + a^)\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2293	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2294	which removes two out of the seven terms corresponding to the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2295	seven distinct lexical values.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2296
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2297	It is not surprising that there are exponentially many
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2298	distinct lexical values that cannot be eliminated by
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2299	the simple-minded simplification of $\derssimp$.
568 7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2300	A lexer without a good enough strategy to
7a579f5533f8 more chapter2 modifications Chengsong parents: 567 diff changeset	2301	deduplicate will naturally
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2302	have an exponential runtime on highly
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2303	ambiguous regular expressions because there
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2304	are exponentially many matches.
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2305	For this particular example, it seems
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2306	that the number of distinct matches growth
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2307	speed is proportional to $(2n)!/(n!(n+1)!)$ ($n$ being the input length).
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2308
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2309	On the other hand, the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2310	$\POSIX$ value for $r= (a^\cdot a^)^*$ and
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2311	$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2312	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2313	\Stars\,
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2314	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2315	\]
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2316	At any moment, the subterms in a regular expression
e3752aac8ec2 more Chengsong parents: 628 diff changeset	2317	that will potentially result in a POSIX value is only
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2318	a minority among the many other terms,
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	2319	and one can remove the ones that are not possible to
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2320	be POSIX.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2321	In the above example,
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	2322	\begin{equation}\label{eqn:growth2}
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2323	((a^a^ + \underbrace{a^}_\text{A})+\underbrace{a^}_\text{duplicate of A})\cdot(a^a^)^* +
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2324	\underbrace{(a^a^ + a^)\cdot(a^a^)^}_\text{further simp removes this}.
579 35df9cdd36ca more chap3 Chengsong parents: 577 diff changeset	2325	\end{equation}
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2326	can be further simplified by
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2327	removing the underlined term first,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2328	which would open up possibilities
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2329	of further simplification that removes the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2330	underbraced part.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2331	The result would be
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2332	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2333	(\underbrace{a^a^}_\text{term 1} + \underbrace{a^}_\text{term 2})\cdot(a^a^)^.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2334	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2335	with corresponding values
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2336	\begin{center}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2337	\begin{tabular}{lr}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2338	$\Stars \; [\Seq \; (\Stars \; [\Char \; a, \Char \; a])\; (\Stars \; [])]$ & $(\text{term 1})$\\
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2339	$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])] $ & $(\text{term 2})$
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2340	\end{tabular}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2341	\end{center}
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2342	Other terms with an underlying value, such as
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2343	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2344	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2345	\]
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	2346	do not to contribute a POSIX lexical value,
dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	2347	and therefore can be thrown away.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2348
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2349	Ausaf et al. \cite{AusafDyckhoffUrban2016}
639 80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2350	have come up with some simplification steps,
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2351	and incorporated the simplification into $\lexer$.
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2352	They call this lexer $\textit{lexer}_{simp}$ and proved that
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2353	\[
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2354	\lexer \; r\; s = \textit{lexer}_{simp} \; r \; s
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2355	\]
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2356	The function $\textit{lexer}_{simp}$
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2357	involves some fiddly manipulation of value rectification,
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2358	which we omit here.
80cc6dc4c98b until chap 7 Chengsong parents: 638 diff changeset	2359	however those steps
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	2360	are not yet sufficiently strong, to achieve the above effects.
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2361	And even with these relatively mild simplifications, the proof
e3752aac8ec2 more Chengsong parents: 628 diff changeset	2362	is already quite a bit more complicated than the theorem \ref{lexerCorrectness}.
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	2363	One would need to prove something like this:
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2364	\[
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2365	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow v \;\;
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2366	\textit{then}\;\; (r, c::s) \rightarrow
637 e3752aac8ec2 more Chengsong parents: 628 diff changeset	2367	\inj\;\, r\, \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v).
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2368	\]
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2369	instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2370	not only has to return a simplified regular expression,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2371	but also what specific simplifications
638 dd9dde2d902b comments till chap4 Chengsong parents: 637 diff changeset	2372	have been done as a function on values
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2373	showing how one can transform the value
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2374	underlying the simplified regular expression
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2375	to the unsimplified one.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2376
623 c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2377	We therefore choose a slightly different approach
c0c1ebe09c7d finished injchap2 Chengsong parents: 622 diff changeset	2378	also described by Sulzmann and Lu to
573 454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2379	get better simplifications, which uses
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2380	some augmented data structures compared to
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2381	plain regular expressions.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2382	We call them \emph{annotated}
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2383	regular expressions.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2384	With annotated regular expressions,
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2385	we can avoid creating the intermediate values $v_1,\ldots v_n$ and a
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2386	second phase altogether.
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2387	We introduce this new datatype and the
454ced557605 chapter2 finished polishing Chengsong parents: 568 diff changeset	2388	corresponding algorithm in the next chapter.
538 8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2389
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2390
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2391
8016a2480704 intro and chap2 Chengsong parents: 536 diff changeset	2392

author	Chengsong
	Tue, 27 Jun 2023 01:09:36 +0100
changeset 651	deb35fd780fe
parent 649	ef2b8abcbc55
child 657	00171b627b8d
permissions	-rwxr-xr-x