lexing: ChengsongTanPhdThesis/Chapters/Chapter1.tex@a0f27e21b42c (annotated)

468 a0f27e21b42c all texrelated Chengsong parents: diff changeset	1	% Chapter 1
a0f27e21b42c all texrelated Chengsong parents: diff changeset	2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	3	\chapter{Introduction} % Main chapter title
a0f27e21b42c all texrelated Chengsong parents: diff changeset	4
a0f27e21b42c all texrelated Chengsong parents: diff changeset	5	\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	6
a0f27e21b42c all texrelated Chengsong parents: diff changeset	7	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	8
a0f27e21b42c all texrelated Chengsong parents: diff changeset	9	% Define some commands to keep the formatting separated from the content
a0f27e21b42c all texrelated Chengsong parents: diff changeset	10	\newcommand{\keyword}[1]{\textbf{#1}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	11	\newcommand{\tabhead}[1]{\textbf{#1}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	12	\newcommand{\code}[1]{\texttt{#1}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	13	\newcommand{\file}[1]{\texttt{\bfseries#1}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	14	\newcommand{\option}[1]{\texttt{\itshape#1}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	15
a0f27e21b42c all texrelated Chengsong parents: diff changeset	16
a0f27e21b42c all texrelated Chengsong parents: diff changeset	17	\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
a0f27e21b42c all texrelated Chengsong parents: diff changeset	18	\newcommand{\ZERO}{\mbox{\bf 0}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	19	\newcommand{\ONE}{\mbox{\bf 1}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	20	\def\lexer{\mathit{lexer}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	21	\def\mkeps{\mathit{mkeps}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	22
a0f27e21b42c all texrelated Chengsong parents: diff changeset	23	\def\DFA{\textit{DFA}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	24	\def\bmkeps{\textit{bmkeps}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	25	\def\retrieve{\textit{retrieve}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	26	\def\blexer{\textit{blexer}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	27	\def\flex{\textit{flex}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	28	\def\inj{\mathit{inj}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	29	\def\Empty{\mathit{Empty}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	30	\def\Left{\mathit{Left}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	31	\def\Right{\mathit{Right}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	32	\def\Stars{\mathit{Stars}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	33	\def\Char{\mathit{Char}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	34	\def\Seq{\mathit{Seq}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	35	\def\Der{\mathit{Der}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	36	\def\nullable{\mathit{nullable}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	37	\def\Z{\mathit{Z}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	38	\def\S{\mathit{S}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	39	\def\rup{r^\uparrow}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	40	\def\simp{\mathit{simp}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	41	\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	42	\def\map{\mathit{map}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	43	\def\distinct{\mathit{distinct}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	44	\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	45	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	46	%This part is about regular expressions, Brzozowski derivatives,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	47	%and a bit-coded lexing algorithm with proven correctness and time bounds.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	48	Regular expressions are widely used in modern day programming tasks.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	49	Be it IDE with syntax hightlighting and auto completion,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	50	command line tools like $\mathit{grep}$ that facilitates easy
a0f27e21b42c all texrelated Chengsong parents: diff changeset	51	processing of text by search and replace, network intrusion
a0f27e21b42c all texrelated Chengsong parents: diff changeset	52	detection systems that rejects suspicious traffic, or compiler
a0f27e21b42c all texrelated Chengsong parents: diff changeset	53	front ends--there is always an important phase of the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	54	task that involves matching a regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	55	exression with a string.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	56	Given its usefulness and ubiquity, one would imagine that
a0f27e21b42c all texrelated Chengsong parents: diff changeset	57	modern regular expression matching implementations
a0f27e21b42c all texrelated Chengsong parents: diff changeset	58	are mature and fully-studied.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	59
a0f27e21b42c all texrelated Chengsong parents: diff changeset	60	If you go to a popular programming language's
a0f27e21b42c all texrelated Chengsong parents: diff changeset	61	regex engine,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	62	you can supply it with regex and strings of your own,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	63	and get matching/lexing information such as how a
a0f27e21b42c all texrelated Chengsong parents: diff changeset	64	sub-part of the regex matches a sub-part of the string.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	65	These lexing libraries are on average quite fast.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	66	%TODO: get source for SNORT/BRO's regex matching engine/speed
a0f27e21b42c all texrelated Chengsong parents: diff changeset	67	For example, the regex engines some network intrusion detection
a0f27e21b42c all texrelated Chengsong parents: diff changeset	68	systems use are able to process
a0f27e21b42c all texrelated Chengsong parents: diff changeset	69	megabytes or even gigabytes of network traffic per second.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	70
a0f27e21b42c all texrelated Chengsong parents: diff changeset	71	Why do we need to have our version, if the algorithms are
a0f27e21b42c all texrelated Chengsong parents: diff changeset	72	blindingly fast already? Well it turns out it is not always the case.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	73
a0f27e21b42c all texrelated Chengsong parents: diff changeset	74
a0f27e21b42c all texrelated Chengsong parents: diff changeset	75	Take $(a^)^\,b$ and ask whether
a0f27e21b42c all texrelated Chengsong parents: diff changeset	76	strings of the form $aa..a$ match this regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	77	expression. Obviously this is not the case---the expected $b$ in the last
a0f27e21b42c all texrelated Chengsong parents: diff changeset	78	position is missing. One would expect that modern regular expression
a0f27e21b42c all texrelated Chengsong parents: diff changeset	79	matching engines can find this out very quickly. Alas, if one tries
a0f27e21b42c all texrelated Chengsong parents: diff changeset	80	this example in JavaScript, Python or Java 8 with strings like 28
a0f27e21b42c all texrelated Chengsong parents: diff changeset	81	$a$'s, one discovers that this decision takes around 30 seconds and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	82	takes considerably longer when adding a few more $a$'s, as the graphs
a0f27e21b42c all texrelated Chengsong parents: diff changeset	83	below show:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	84
a0f27e21b42c all texrelated Chengsong parents: diff changeset	85	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	86	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	87	\begin{tikzpicture}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	88	\begin{axis}[
a0f27e21b42c all texrelated Chengsong parents: diff changeset	89	xlabel={$n$},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	90	x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	91	ylabel={time in secs},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	92	enlargelimits=false,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	93	xtick={0,5,...,30},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	94	xmax=33,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	95	ymax=35,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	96	ytick={0,5,...,30},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	97	scaled ticks=false,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	98	axis lines=left,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	99	width=5cm,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	100	height=4cm,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	101	legend entries={JavaScript},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	102	legend pos=north west,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	103	legend cell align=left]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	104	\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
a0f27e21b42c all texrelated Chengsong parents: diff changeset	105	\end{axis}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	106	\end{tikzpicture}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	107	&
a0f27e21b42c all texrelated Chengsong parents: diff changeset	108	\begin{tikzpicture}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	109	\begin{axis}[
a0f27e21b42c all texrelated Chengsong parents: diff changeset	110	xlabel={$n$},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	111	x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	112	%ylabel={time in secs},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	113	enlargelimits=false,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	114	xtick={0,5,...,30},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	115	xmax=33,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	116	ymax=35,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	117	ytick={0,5,...,30},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	118	scaled ticks=false,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	119	axis lines=left,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	120	width=5cm,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	121	height=4cm,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	122	legend entries={Python},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	123	legend pos=north west,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	124	legend cell align=left]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	125	\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
a0f27e21b42c all texrelated Chengsong parents: diff changeset	126	\end{axis}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	127	\end{tikzpicture}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	128	&
a0f27e21b42c all texrelated Chengsong parents: diff changeset	129	\begin{tikzpicture}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	130	\begin{axis}[
a0f27e21b42c all texrelated Chengsong parents: diff changeset	131	xlabel={$n$},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	132	x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	133	%ylabel={time in secs},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	134	enlargelimits=false,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	135	xtick={0,5,...,30},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	136	xmax=33,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	137	ymax=35,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	138	ytick={0,5,...,30},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	139	scaled ticks=false,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	140	axis lines=left,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	141	width=5cm,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	142	height=4cm,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	143	legend entries={Java 8},
a0f27e21b42c all texrelated Chengsong parents: diff changeset	144	legend pos=north west,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	145	legend cell align=left]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	146	\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
a0f27e21b42c all texrelated Chengsong parents: diff changeset	147	\end{axis}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	148	\end{tikzpicture}\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	149	\multicolumn{3}{c}{Graphs: Runtime for matching $(a^)^\,b$ with strings
a0f27e21b42c all texrelated Chengsong parents: diff changeset	150	of the form $\underbrace{aa..a}_{n}$.}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	151	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	152	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	153
a0f27e21b42c all texrelated Chengsong parents: diff changeset	154
a0f27e21b42c all texrelated Chengsong parents: diff changeset	155	This is clearly exponential behaviour, and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	156	is triggered by some relatively simple regex patterns.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	157
a0f27e21b42c all texrelated Chengsong parents: diff changeset	158
a0f27e21b42c all texrelated Chengsong parents: diff changeset	159	This superlinear blowup in matching algorithms sometimes cause
a0f27e21b42c all texrelated Chengsong parents: diff changeset	160	considerable grief in real life: for example on 20 July 2016 one evil
a0f27e21b42c all texrelated Chengsong parents: diff changeset	161	regular expression brought the webpage
a0f27e21b42c all texrelated Chengsong parents: diff changeset	162	\href{http://stackexchange.com}{Stack Exchange} to its
a0f27e21b42c all texrelated Chengsong parents: diff changeset	163	knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	164	In this instance, a regular expression intended to just trim white
a0f27e21b42c all texrelated Chengsong parents: diff changeset	165	spaces from the beginning and the end of a line actually consumed
a0f27e21b42c all texrelated Chengsong parents: diff changeset	166	massive amounts of CPU-resources---causing web servers to grind to a
a0f27e21b42c all texrelated Chengsong parents: diff changeset	167	halt. This happened when a post with 20,000 white spaces was submitted,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	168	but importantly the white spaces were neither at the beginning nor at
a0f27e21b42c all texrelated Chengsong parents: diff changeset	169	the end. As a result, the regular expression matching engine needed to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	170	backtrack over many choices. In this example, the time needed to process
a0f27e21b42c all texrelated Chengsong parents: diff changeset	171	the string was $O(n^2)$ with respect to the string length. This
a0f27e21b42c all texrelated Chengsong parents: diff changeset	172	quadratic overhead was enough for the homepage of Stack Exchange to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	173	respond so slowly that the load balancer assumed a $\mathit{DoS}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	174	attack and therefore stopped the servers from responding to any
a0f27e21b42c all texrelated Chengsong parents: diff changeset	175	requests. This made the whole site become unavailable.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	176	A more
a0f27e21b42c all texrelated Chengsong parents: diff changeset	177	recent example is a global outage of all Cloudflare servers on 2 July
a0f27e21b42c all texrelated Chengsong parents: diff changeset	178	2019. A poorly written regular expression exhibited exponential
a0f27e21b42c all texrelated Chengsong parents: diff changeset	179	behaviour and exhausted CPUs that serve HTTP traffic. Although the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	180	outage had several causes, at the heart was a regular expression that
a0f27e21b42c all texrelated Chengsong parents: diff changeset	181	was used to monitor network
a0f27e21b42c all texrelated Chengsong parents: diff changeset	182	traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	183	%TODO: data points for some new versions of languages
a0f27e21b42c all texrelated Chengsong parents: diff changeset	184	These problems with regular expressions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	185	are not isolated events that happen
a0f27e21b42c all texrelated Chengsong parents: diff changeset	186	very occasionally, but actually quite widespread.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	187	They occur so often that they get a
a0f27e21b42c all texrelated Chengsong parents: diff changeset	188	name--Regular-Expression-Denial-Of-Service (ReDoS)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	189	attack.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	190	Davis et al. \parencite{Davis18} detected more
a0f27e21b42c all texrelated Chengsong parents: diff changeset	191	than 1000 super-linear (SL) regular expressions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	192	in Node.js, Python core libraries, and npm and pypi.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	193	They therefore concluded that evil regular expressions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	194	are problems more than "a parlour trick", but one that
a0f27e21b42c all texrelated Chengsong parents: diff changeset	195	requires
a0f27e21b42c all texrelated Chengsong parents: diff changeset	196	more research attention.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	197
a0f27e21b42c all texrelated Chengsong parents: diff changeset	198	\section{Why are current algorithm for regexes slow?}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	199
a0f27e21b42c all texrelated Chengsong parents: diff changeset	200	Theoretical results say that regular expression matching
a0f27e21b42c all texrelated Chengsong parents: diff changeset	201	should be linear with respect to the input.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	202	You could construct
a0f27e21b42c all texrelated Chengsong parents: diff changeset	203	an NFA via Thompson construction, and simulate running it.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	204	This would be linear.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	205	Or you could determinize the NFA into a DFA, and minimize that
a0f27e21b42c all texrelated Chengsong parents: diff changeset	206	DFA. Once you have the DFA, the running time is also linear, requiring just
a0f27e21b42c all texrelated Chengsong parents: diff changeset	207	one scanning pass of the input.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	208
a0f27e21b42c all texrelated Chengsong parents: diff changeset	209	But modern regex libraries in popular language engines
a0f27e21b42c all texrelated Chengsong parents: diff changeset	210	often want to support richer constructs
a0f27e21b42c all texrelated Chengsong parents: diff changeset	211	such as bounded repetitions and back-references
a0f27e21b42c all texrelated Chengsong parents: diff changeset	212	rather than the most basic regular expressions.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	213	%put in illustrations
a0f27e21b42c all texrelated Chengsong parents: diff changeset	214	%a{1}{3}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	215	These make a DFA construction impossible because
a0f27e21b42c all texrelated Chengsong parents: diff changeset	216	of an exponential states explosion.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	217	They also want to support lexing rather than just matching
a0f27e21b42c all texrelated Chengsong parents: diff changeset	218	for tasks that involves text processing.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	219
a0f27e21b42c all texrelated Chengsong parents: diff changeset	220	Existing regex libraries either pose restrictions on the user input, or does
a0f27e21b42c all texrelated Chengsong parents: diff changeset	221	not give linear running time guarantee.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	222	%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	223	For example, the Rust regex engine claims to be linear,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	224	but does not support lookarounds and back-references.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	225	The GoLang regex library does not support over 1000 repetitions.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	226	Java and Python both support back-references, but shows
a0f27e21b42c all texrelated Chengsong parents: diff changeset	227	catastrophic backtracking behaviours on inputs without back-references(
a0f27e21b42c all texrelated Chengsong parents: diff changeset	228	when the language is still regular).
a0f27e21b42c all texrelated Chengsong parents: diff changeset	229	%TODO: test performance of Rust on (((((aa)b)b){20}))c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
a0f27e21b42c all texrelated Chengsong parents: diff changeset	230	%TODO: verify the fact Rust does not allow 1000+ reps
a0f27e21b42c all texrelated Chengsong parents: diff changeset	231	%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	232	Another thing about the these libraries is that there
a0f27e21b42c all texrelated Chengsong parents: diff changeset	233	is no correctness guarantee.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	234	In some cases they either fails to generate a lexing result when there is a match,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	235	or gives the wrong way of matching.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	236
a0f27e21b42c all texrelated Chengsong parents: diff changeset	237
a0f27e21b42c all texrelated Chengsong parents: diff changeset	238	It turns out that regex libraries not only suffer from
a0f27e21b42c all texrelated Chengsong parents: diff changeset	239	exponential backtracking problems,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	240	but also undesired (or even buggy) outputs.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	241	%TODO: comment from who
a0f27e21b42c all texrelated Chengsong parents: diff changeset	242	xxx commented that most regex libraries are not
a0f27e21b42c all texrelated Chengsong parents: diff changeset	243	correctly implementing the POSIX (maximum-munch)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	244	rule of regular expression matching.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	245	A concrete example would be
a0f27e21b42c all texrelated Chengsong parents: diff changeset	246	the regex
a0f27e21b42c all texrelated Chengsong parents: diff changeset	247	\begin{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	248	(((((aa)b)b){20}))c
a0f27e21b42c all texrelated Chengsong parents: diff changeset	249	\end{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	250	and the string
a0f27e21b42c all texrelated Chengsong parents: diff changeset	251	\begin{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	252	baabaabababaabaaaaaaaaababaa
a0f27e21b42c all texrelated Chengsong parents: diff changeset	253	aababababaaaabaaabaaaaaabaab
a0f27e21b42c all texrelated Chengsong parents: diff changeset	254	aabababaababaaaaaaaaababaaaa
a0f27e21b42c all texrelated Chengsong parents: diff changeset	255	babababaaaaaaaaaaaaac
a0f27e21b42c all texrelated Chengsong parents: diff changeset	256	\end{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	257
a0f27e21b42c all texrelated Chengsong parents: diff changeset	258	This seemingly complex regex simply says "some $a$'s
a0f27e21b42c all texrelated Chengsong parents: diff changeset	259	followed by some $b$'s then followed by 1 single $b$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	260	and this iterates 20 times, finally followed by a $c$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	261	And a POSIX match would involve the entire string,"eating up"
a0f27e21b42c all texrelated Chengsong parents: diff changeset	262	all the $b$'s in it.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	263	%TODO: give a coloured example of how this matches POSIXly
a0f27e21b42c all texrelated Chengsong parents: diff changeset	264
a0f27e21b42c all texrelated Chengsong parents: diff changeset	265	This regex would trigger catastrophic backtracking in
a0f27e21b42c all texrelated Chengsong parents: diff changeset	266	languages like Python and Java,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	267	whereas it gives a correct but uninformative (non-POSIX)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	268	match in languages like Go or .NET--The match with only
a0f27e21b42c all texrelated Chengsong parents: diff changeset	269	character $c$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	270
a0f27e21b42c all texrelated Chengsong parents: diff changeset	271	Backtracking or depth-first search might give us
a0f27e21b42c all texrelated Chengsong parents: diff changeset	272	exponential running time, and quite a few tools
a0f27e21b42c all texrelated Chengsong parents: diff changeset	273	avoid that by determinising the $\mathit{NFA}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	274	into a $\mathit{DFA}$ and minimizes it.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	275	For example, $\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
a0f27e21b42c all texrelated Chengsong parents: diff changeset	276	in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based
a0f27e21b42c all texrelated Chengsong parents: diff changeset	277	lexers.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	278	However, they do not scale well with bounded repetitions.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	279	Bounded repetitions, usually written in the form
a0f27e21b42c all texrelated Chengsong parents: diff changeset	280	$r^{\{c\}}$ (where $c$ is a constant natural number),
a0f27e21b42c all texrelated Chengsong parents: diff changeset	281	denotes a regular expression accepting strings
a0f27e21b42c all texrelated Chengsong parents: diff changeset	282	that can be divided into $c$ substrings, and each
a0f27e21b42c all texrelated Chengsong parents: diff changeset	283	substring is in $r$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	284	%TODO: draw example NFA.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	285	For the regular expression $(a\|b)^*a(a\|b)^{\{2\}}$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	286	an $\mathit{NFA}$ describing it would look like:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	287	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	288	\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	289	\node[state,initial] (q_0) {$q_0$};
a0f27e21b42c all texrelated Chengsong parents: diff changeset	290	\node[state, red] (q_1) [right=of q_0] {$q_1$};
a0f27e21b42c all texrelated Chengsong parents: diff changeset	291	\node[state, red] (q_2) [right=of q_1] {$q_2$};
a0f27e21b42c all texrelated Chengsong parents: diff changeset	292	\node[state,accepting](q_3) [right=of q_2] {$q_3$};
a0f27e21b42c all texrelated Chengsong parents: diff changeset	293	\path[->]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	294	(q_0) edge node {a} (q_1)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	295	edge [loop below] node {a,b} ()
a0f27e21b42c all texrelated Chengsong parents: diff changeset	296	(q_1) edge node {a,b} (q_2)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	297	edge [loop above] node {0} ()
a0f27e21b42c all texrelated Chengsong parents: diff changeset	298	(q_2) edge node {a,b} (q_3);
a0f27e21b42c all texrelated Chengsong parents: diff changeset	299	\end{tikzpicture}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	300	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	301	The red states are "counter states" which counts down
a0f27e21b42c all texrelated Chengsong parents: diff changeset	302	the number of characters needed in addition to the current
a0f27e21b42c all texrelated Chengsong parents: diff changeset	303	string to make a successful match.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	304	For example, state $q_1$ indicates a match that has
a0f27e21b42c all texrelated Chengsong parents: diff changeset	305	gone past the $(a\|b)$ part of $(a\|b)^*a(a\|b)^{\{2\}}$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	306	and just consumed the "delimiter" $a$ in the middle, and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	307	need to match 2 more iterations of $a\|b$ to complete.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	308	State $q_2$ on the other hand, can be viewed as a state
a0f27e21b42c all texrelated Chengsong parents: diff changeset	309	after $q_1$ has consumed 1 character, and just waits
a0f27e21b42c all texrelated Chengsong parents: diff changeset	310	for 1 more character to complete.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	311	Depending on the actual characters appearing in the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	312	input string, the states $q_1$ and $q_2$ may or may
a0f27e21b42c all texrelated Chengsong parents: diff changeset	313	not be active, independent from each other.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	314	A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
a0f27e21b42c all texrelated Chengsong parents: diff changeset	315	contain at least 4 non-equivalent states that cannot be merged,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	316	because subset states indicating which of $q_1$ and $q_2$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	317	are active are at least four: $\phi$, $\{q_1\}$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	318	$\{q_2\}$, $\{q_1, q_2\}$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	319	Generalizing this to regular expressions with larger
a0f27e21b42c all texrelated Chengsong parents: diff changeset	320	bounded repetitions number, we have $r^*ar^{\{n\}}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	321	in general would require at least $2^n$ states
a0f27e21b42c all texrelated Chengsong parents: diff changeset	322	in a $\mathit{DFA}$. This is to represent all different
a0f27e21b42c all texrelated Chengsong parents: diff changeset	323	configurations of "countdown" states.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	324	For those regexes, tools such as $\mathit{JFLEX}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	325	would generate gigantic $\mathit{DFA}$'s or even
a0f27e21b42c all texrelated Chengsong parents: diff changeset	326	give out memory errors.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	327
a0f27e21b42c all texrelated Chengsong parents: diff changeset	328	For this reason, regex libraries that support
a0f27e21b42c all texrelated Chengsong parents: diff changeset	329	bounded repetitions often choose to use the $\mathit{NFA}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	330	approach.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	331	One can simulate the $\mathit{NFA}$ running in two ways:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	332	one by keeping track of all active states after consuming
a0f27e21b42c all texrelated Chengsong parents: diff changeset	333	a character, and update that set of states iteratively.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	334	This is a breadth-first-search of the $\mathit{NFA}$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	335	for a path terminating
a0f27e21b42c all texrelated Chengsong parents: diff changeset	336	at an accepting state.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	337	Languages like $\mathit{GO}$ and $\mathit{RUST}$ use this
a0f27e21b42c all texrelated Chengsong parents: diff changeset	338	type of $\mathit{NFA}$ simulation, and guarantees a linear runtime
a0f27e21b42c all texrelated Chengsong parents: diff changeset	339	in terms of input string length.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	340	The other way to use $\mathit{NFA}$ for matching is to take
a0f27e21b42c all texrelated Chengsong parents: diff changeset	341	a single state in a path each time, and backtrack if that path
a0f27e21b42c all texrelated Chengsong parents: diff changeset	342	fails. This is a depth-first-search that could end up
a0f27e21b42c all texrelated Chengsong parents: diff changeset	343	with exponential run time.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	344	The reason behind backtracking algorithms in languages like
a0f27e21b42c all texrelated Chengsong parents: diff changeset	345	Java and Python is that they support back-references.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	346	\subsection{Back References in Regex--Non-Regular part}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	347	If we label sub-expressions by parenthesizing them and give
a0f27e21b42c all texrelated Chengsong parents: diff changeset	348	them a number by the order their opening parenthesis appear,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	349	$\underset{1}{(}\ldots\underset{2}{(}\ldots\underset{3}{(}\ldots\underset{4}{(}\ldots)\ldots)\ldots)\ldots)$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	350	We can use the following syntax to denote that we want a string just matched by a
a0f27e21b42c all texrelated Chengsong parents: diff changeset	351	sub-expression to appear at a certain location again exactly:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	352	$(.*)\backslash 1$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	353	would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	354
a0f27e21b42c all texrelated Chengsong parents: diff changeset	355	Back-reference is a construct in the "regex" standard
a0f27e21b42c all texrelated Chengsong parents: diff changeset	356	that programmers found quite useful, but not exactly
a0f27e21b42c all texrelated Chengsong parents: diff changeset	357	regular any more.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	358	In fact, that allows the regex construct to express
a0f27e21b42c all texrelated Chengsong parents: diff changeset	359	languages that cannot be contained in context-free
a0f27e21b42c all texrelated Chengsong parents: diff changeset	360	languages
a0f27e21b42c all texrelated Chengsong parents: diff changeset	361	For example, the back reference $(a^*)\backslash 1 \backslash 1$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	362	expresses the language $\{a^na^na^n\mid n \in N\}$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	363	which cannot be expressed by context-free grammars.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	364	To express such a language one would need context-sensitive
a0f27e21b42c all texrelated Chengsong parents: diff changeset	365	grammars, and matching/parsing for such grammars is NP-hard in
a0f27e21b42c all texrelated Chengsong parents: diff changeset	366	general.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	367	%TODO:cite reference for NP-hardness of CSG matching
a0f27e21b42c all texrelated Chengsong parents: diff changeset	368	For such constructs, the most intuitive way of matching is
a0f27e21b42c all texrelated Chengsong parents: diff changeset	369	using backtracking algorithms, and there are no known algorithms
a0f27e21b42c all texrelated Chengsong parents: diff changeset	370	that does not backtrack.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	371	%TODO:read a bit more about back reference algorithms
a0f27e21b42c all texrelated Chengsong parents: diff changeset	372
a0f27e21b42c all texrelated Chengsong parents: diff changeset	373
a0f27e21b42c all texrelated Chengsong parents: diff changeset	374
a0f27e21b42c all texrelated Chengsong parents: diff changeset	375
a0f27e21b42c all texrelated Chengsong parents: diff changeset	376	\section{Our Solution--Brzozowski Derivatives}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	377
a0f27e21b42c all texrelated Chengsong parents: diff changeset	378
a0f27e21b42c all texrelated Chengsong parents: diff changeset	379
a0f27e21b42c all texrelated Chengsong parents: diff changeset	380	Is it possible to have a regex lexing algorithm with proven correctness and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	381	time complexity, which allows easy extensions to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	382	constructs like
a0f27e21b42c all texrelated Chengsong parents: diff changeset	383	bounded repetitions, negation, lookarounds, and even back-references?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	384
a0f27e21b42c all texrelated Chengsong parents: diff changeset	385	We propose Brzozowski's derivatives as a solution to this problem.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	386
a0f27e21b42c all texrelated Chengsong parents: diff changeset	387	The main contribution of this thesis is a proven correct lexing algorithm
a0f27e21b42c all texrelated Chengsong parents: diff changeset	388	with formalized time bounds.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	389	To our best knowledge, there is no lexing libraries using Brzozowski derivatives
a0f27e21b42c all texrelated Chengsong parents: diff changeset	390	that have a provable time guarantee,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	391	and claims about running time are usually speculative and backed by thin empirical
a0f27e21b42c all texrelated Chengsong parents: diff changeset	392	evidence.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	393	%TODO: give references
a0f27e21b42c all texrelated Chengsong parents: diff changeset	394	For example, Sulzmann and Lu had proposed an algorithm in which they
a0f27e21b42c all texrelated Chengsong parents: diff changeset	395	claim a linear running time.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	396	But that was falsified by our experiments and the running time
a0f27e21b42c all texrelated Chengsong parents: diff changeset	397	is actually $\Omega(2^n)$ in the worst case.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	398	A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
a0f27e21b42c all texrelated Chengsong parents: diff changeset	399	%TODO: give references
a0f27e21b42c all texrelated Chengsong parents: diff changeset	400	lexer, which calculates POSIX matches and is based on derivatives.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	401	They formalized the correctness of the lexer, but not the complexity.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	402	In the performance evaluation section, they simply analyzed the run time
a0f27e21b42c all texrelated Chengsong parents: diff changeset	403	of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	404	and concluded that the algorithm is quadratic in terms of input length.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	405	When we tried out their extracted OCaml code with our example $(a+aa)^*$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	406	the time it took to lex only 40 $a$'s was 5 minutes.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	407	We therefore believe our results of a proof of performance on general
a0f27e21b42c all texrelated Chengsong parents: diff changeset	408	inputs rather than specific examples a novel contribution.\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	409
a0f27e21b42c all texrelated Chengsong parents: diff changeset	410	\section{Preliminaries about Lexing Using Brzozowski derivatives}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	411	In the last fifteen or so years, Brzozowski's derivatives of regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	412	expressions have sparked quite a bit of interest in the functional
a0f27e21b42c all texrelated Chengsong parents: diff changeset	413	programming and theorem prover communities.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	414	The beauty of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	415	Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
a0f27e21b42c all texrelated Chengsong parents: diff changeset	416	expressible in any functional language, and easily definable and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	417	reasoned about in theorem provers---the definitions just consist of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	418	inductive datatypes and simple recursive functions.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	419
a0f27e21b42c all texrelated Chengsong parents: diff changeset	420	Suppose we have an alphabet $\Sigma$, the strings whose characters
a0f27e21b42c all texrelated Chengsong parents: diff changeset	421	are from $\Sigma$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	422	can be expressed as $\Sigma^*$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	423
a0f27e21b42c all texrelated Chengsong parents: diff changeset	424	We use patterns to define a set of strings concisely. Regular expressions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	425	are one of such patterns systems:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	426	The basic regular expressions are defined inductively
a0f27e21b42c all texrelated Chengsong parents: diff changeset	427	by the following grammar:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	428	\[ r ::= \ZERO \mid \ONE
a0f27e21b42c all texrelated Chengsong parents: diff changeset	429	\mid c
a0f27e21b42c all texrelated Chengsong parents: diff changeset	430	\mid r_1 \cdot r_2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	431	\mid r_1 + r_2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	432	\mid r^*
a0f27e21b42c all texrelated Chengsong parents: diff changeset	433	\]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	434
a0f27e21b42c all texrelated Chengsong parents: diff changeset	435	The language or set of strings defined by regular expressions are defined as
a0f27e21b42c all texrelated Chengsong parents: diff changeset	436	%TODO: FILL in the other defs
a0f27e21b42c all texrelated Chengsong parents: diff changeset	437	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	438	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	439	$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	440	$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	441	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	442	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	443	Which are also called the "language interpretation".
a0f27e21b42c all texrelated Chengsong parents: diff changeset	444
a0f27e21b42c all texrelated Chengsong parents: diff changeset	445
a0f27e21b42c all texrelated Chengsong parents: diff changeset	446
a0f27e21b42c all texrelated Chengsong parents: diff changeset	447	The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	448	where the operation transforms the regex to a new one containing
a0f27e21b42c all texrelated Chengsong parents: diff changeset	449	strings without the head character $c$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	450
a0f27e21b42c all texrelated Chengsong parents: diff changeset	451	Formally, we define first such a transformation on any string set, which
a0f27e21b42c all texrelated Chengsong parents: diff changeset	452	we call semantic derivative:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	453	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	454	$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	455	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	456	Mathematically, it can be expressed as the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	457
a0f27e21b42c all texrelated Chengsong parents: diff changeset	458	If the $\textit{StringSet}$ happen to have some structure, for example,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	459	if it is regular, then we have that it
a0f27e21b42c all texrelated Chengsong parents: diff changeset	460
a0f27e21b42c all texrelated Chengsong parents: diff changeset	461	The the derivative of regular expression, denoted as
a0f27e21b42c all texrelated Chengsong parents: diff changeset	462	$r \backslash c$, is a function that takes parameters
a0f27e21b42c all texrelated Chengsong parents: diff changeset	463	$r$ and $c$, and returns another regular expression $r'$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	464	which is computed by the following recursive function:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	465
a0f27e21b42c all texrelated Chengsong parents: diff changeset	466	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	467	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	468	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	469	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	470	$d \backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	471	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	472	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	473	$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	474	& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	475	& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	476	$(r^)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	477	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	478	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	479	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	480	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	481
a0f27e21b42c all texrelated Chengsong parents: diff changeset	482	The $\nullable$ function tests whether the empty string $""$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	483	is in the language of $r$:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	484
a0f27e21b42c all texrelated Chengsong parents: diff changeset	485
a0f27e21b42c all texrelated Chengsong parents: diff changeset	486	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	487	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	488	$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	489	$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	490	$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	491	$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	492	$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	493	$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	494	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	495	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	496	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	497	The empty set does not contain any string and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	498	therefore not the empty string, the empty string
a0f27e21b42c all texrelated Chengsong parents: diff changeset	499	regular expression contains the empty string
a0f27e21b42c all texrelated Chengsong parents: diff changeset	500	by definition, the character regular expression
a0f27e21b42c all texrelated Chengsong parents: diff changeset	501	is the singleton that contains character only,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	502	and therefore does not contain the empty string,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	503	the alternative regular expression(or "or" expression)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	504	might have one of its children regular expressions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	505	being nullable and any one of its children being nullable
a0f27e21b42c all texrelated Chengsong parents: diff changeset	506	would suffice. The sequence regular expression
a0f27e21b42c all texrelated Chengsong parents: diff changeset	507	would require both children to have the empty string
a0f27e21b42c all texrelated Chengsong parents: diff changeset	508	to compose an empty string and the Kleene star
a0f27e21b42c all texrelated Chengsong parents: diff changeset	509	operation naturally introduced the empty string.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	510
a0f27e21b42c all texrelated Chengsong parents: diff changeset	511	We can give the meaning of regular expressions derivatives
a0f27e21b42c all texrelated Chengsong parents: diff changeset	512	by language interpretation:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	513
a0f27e21b42c all texrelated Chengsong parents: diff changeset	514
a0f27e21b42c all texrelated Chengsong parents: diff changeset	515
a0f27e21b42c all texrelated Chengsong parents: diff changeset	516
a0f27e21b42c all texrelated Chengsong parents: diff changeset	517	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	518	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	519	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	520	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	521	$d \backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	522	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	523	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	524	$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	525	& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	526	& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	527	$(r^)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	528	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	529	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	530	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	531	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	532	The function derivative, written $\backslash c$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	533	defines how a regular expression evolves into
a0f27e21b42c all texrelated Chengsong parents: diff changeset	534	a new regular expression after all the string it contains
a0f27e21b42c all texrelated Chengsong parents: diff changeset	535	is chopped off a certain head character $c$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	536	The most involved cases are the sequence
a0f27e21b42c all texrelated Chengsong parents: diff changeset	537	and star case.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	538	The sequence case says that if the first regular expression
a0f27e21b42c all texrelated Chengsong parents: diff changeset	539	contains an empty string then second component of the sequence
a0f27e21b42c all texrelated Chengsong parents: diff changeset	540	might be chosen as the target regular expression to be chopped
a0f27e21b42c all texrelated Chengsong parents: diff changeset	541	off its head character.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	542	The star regular expression unwraps the iteration of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	543	regular expression and attack the star regular expression
a0f27e21b42c all texrelated Chengsong parents: diff changeset	544	to its back again to make sure there are 0 or more iterations
a0f27e21b42c all texrelated Chengsong parents: diff changeset	545	following this unfolded iteration.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	546
a0f27e21b42c all texrelated Chengsong parents: diff changeset	547
a0f27e21b42c all texrelated Chengsong parents: diff changeset	548	The main property of the derivative operation
a0f27e21b42c all texrelated Chengsong parents: diff changeset	549	that enables us to reason about the correctness of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	550	an algorithm using derivatives is
a0f27e21b42c all texrelated Chengsong parents: diff changeset	551
a0f27e21b42c all texrelated Chengsong parents: diff changeset	552	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	553	$c\!::\!s \in L(r)$ holds
a0f27e21b42c all texrelated Chengsong parents: diff changeset	554	if and only if $s \in L(r\backslash c)$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	555	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	556
a0f27e21b42c all texrelated Chengsong parents: diff changeset	557	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	558	We can generalise the derivative operation shown above for single characters
a0f27e21b42c all texrelated Chengsong parents: diff changeset	559	to strings as follows:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	560
a0f27e21b42c all texrelated Chengsong parents: diff changeset	561	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	562	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	563	$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	564	$r \backslash [\,] $ & $\dn$ & $r$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	565	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	566	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	567
a0f27e21b42c all texrelated Chengsong parents: diff changeset	568	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	569	and then define Brzozowski's regular-expression matching algorithm as:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	570
a0f27e21b42c all texrelated Chengsong parents: diff changeset	571	\[
a0f27e21b42c all texrelated Chengsong parents: diff changeset	572	match\;s\;r \;\dn\; nullable(r\backslash s)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	573	\]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	574
a0f27e21b42c all texrelated Chengsong parents: diff changeset	575	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	576	Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	577	this algorithm presented graphically is as follows:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	578
a0f27e21b42c all texrelated Chengsong parents: diff changeset	579	\begin{equation}\label{graph:*}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	580	\begin{tikzcd}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	581	r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	582	\end{tikzcd}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	583	\end{equation}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	584
a0f27e21b42c all texrelated Chengsong parents: diff changeset	585	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	586	where we start with a regular expression $r_0$, build successive
a0f27e21b42c all texrelated Chengsong parents: diff changeset	587	derivatives until we exhaust the string and then use \textit{nullable}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	588	to test whether the result can match the empty string. It can be
a0f27e21b42c all texrelated Chengsong parents: diff changeset	589	relatively easily shown that this matcher is correct (that is given
a0f27e21b42c all texrelated Chengsong parents: diff changeset	590	an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
a0f27e21b42c all texrelated Chengsong parents: diff changeset	591
a0f27e21b42c all texrelated Chengsong parents: diff changeset	592	Beautiful and simple definition.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	593
a0f27e21b42c all texrelated Chengsong parents: diff changeset	594	If we implement the above algorithm naively, however,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	595	the algorithm can be excruciatingly slow. For example, when starting with the regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	596	expression $(a + aa)^*$ and building 12 successive derivatives
a0f27e21b42c all texrelated Chengsong parents: diff changeset	597	w.r.t.~the character $a$, one obtains a derivative regular expression
a0f27e21b42c all texrelated Chengsong parents: diff changeset	598	with more than 8000 nodes (when viewed as a tree). Operations like
a0f27e21b42c all texrelated Chengsong parents: diff changeset	599	$\backslash$ and $\nullable$ need to traverse such trees and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	600	consequently the bigger the size of the derivative the slower the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	601	algorithm.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	602
a0f27e21b42c all texrelated Chengsong parents: diff changeset	603	Brzozowski was quick in finding that during this process a lot useless
a0f27e21b42c all texrelated Chengsong parents: diff changeset	604	$\ONE$s and $\ZERO$s are generated and therefore not optimal.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	605	He also introduced some "similarity rules" such
a0f27e21b42c all texrelated Chengsong parents: diff changeset	606	as $P+(Q+R) = (P+Q)+R$ to merge syntactically
a0f27e21b42c all texrelated Chengsong parents: diff changeset	607	different but language-equivalent sub-regexes to further decrease the size
a0f27e21b42c all texrelated Chengsong parents: diff changeset	608	of the intermediate regexes.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	609
a0f27e21b42c all texrelated Chengsong parents: diff changeset	610	More simplifications are possible, such as deleting duplicates
a0f27e21b42c all texrelated Chengsong parents: diff changeset	611	and opening up nested alternatives to trigger even more simplifications.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	612	And suppose we apply simplification after each derivative step, and compose
a0f27e21b42c all texrelated Chengsong parents: diff changeset	613	these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
a0f27e21b42c all texrelated Chengsong parents: diff changeset	614	\textit{simp}(a \backslash c)$. Then we can build
a0f27e21b42c all texrelated Chengsong parents: diff changeset	615	a matcher without having cumbersome regular expressions.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	616
a0f27e21b42c all texrelated Chengsong parents: diff changeset	617
a0f27e21b42c all texrelated Chengsong parents: diff changeset	618	If we want the size of derivatives in the algorithm to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	619	stay even lower, we would need more aggressive simplifications.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	620	Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
a0f27e21b42c all texrelated Chengsong parents: diff changeset	621	deleting duplicates whenever possible. For example, the parentheses in
a0f27e21b42c all texrelated Chengsong parents: diff changeset	622	$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
a0f27e21b42c all texrelated Chengsong parents: diff changeset	623	\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
a0f27e21b42c all texrelated Chengsong parents: diff changeset	624	example is simplifying $(a^+a) + (a^+ \ONE) + (a +\ONE)$ to just
a0f27e21b42c all texrelated Chengsong parents: diff changeset	625	$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
a0f27e21b42c all texrelated Chengsong parents: diff changeset	626	to achieve a very tight size bound, namely,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	627	the same size bound as that of the \emph{partial derivatives}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	628
a0f27e21b42c all texrelated Chengsong parents: diff changeset	629	Building derivatives and then simplify them.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	630	So far so good. But what if we want to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	631	do lexing instead of just a YES/NO answer?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	632	This requires us to go back again to the world
a0f27e21b42c all texrelated Chengsong parents: diff changeset	633	without simplification first for a moment.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	634	Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	635	elegant(arguably as beautiful as the original
a0f27e21b42c all texrelated Chengsong parents: diff changeset	636	derivatives definition) solution for this.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	637
a0f27e21b42c all texrelated Chengsong parents: diff changeset	638	\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	639
a0f27e21b42c all texrelated Chengsong parents: diff changeset	640
a0f27e21b42c all texrelated Chengsong parents: diff changeset	641	They first defined the datatypes for storing the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	642	lexing information called a \emph{value} or
a0f27e21b42c all texrelated Chengsong parents: diff changeset	643	sometimes also \emph{lexical value}. These values and regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	644	expressions correspond to each other as illustrated in the following
a0f27e21b42c all texrelated Chengsong parents: diff changeset	645	table:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	646
a0f27e21b42c all texrelated Chengsong parents: diff changeset	647	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	648	\begin{tabular}{c@{\hspace{20mm}}c}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	649	\begin{tabular}{@{}rrl@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	650	\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	651	$r$ & $::=$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	652	& $\mid$ & $\ONE$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	653	& $\mid$ & $c$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	654	& $\mid$ & $r_1 \cdot r_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	655	& $\mid$ & $r_1 + r_2$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	656	\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	657	& $\mid$ & $r^*$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	658	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	659	&
a0f27e21b42c all texrelated Chengsong parents: diff changeset	660	\begin{tabular}{@{\hspace{0mm}}rrl@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	661	\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	662	$v$ & $::=$ & \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	663	& & $\Empty$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	664	& $\mid$ & $\Char(c)$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	665	& $\mid$ & $\Seq\,v_1\, v_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	666	& $\mid$ & $\Left(v)$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	667	& $\mid$ & $\Right(v)$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	668	& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	669	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	670	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	671	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	672
a0f27e21b42c all texrelated Chengsong parents: diff changeset	673	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	674	One regular expression can have multiple lexical values. For example
a0f27e21b42c all texrelated Chengsong parents: diff changeset	675	for the regular expression $(a+b)^*$, it has a infinite list of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	676	values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	677	$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	678	$\ldots$, and vice versa.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	679	Even for the regular expression matching a certain string, there could
a0f27e21b42c all texrelated Chengsong parents: diff changeset	680	still be more than one value corresponding to it.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	681	Take the example where $r= (a^\cdot a^)^*$ and the string
a0f27e21b42c all texrelated Chengsong parents: diff changeset	682	$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	683	The number of different ways of matching
a0f27e21b42c all texrelated Chengsong parents: diff changeset	684	without allowing any value under a star to be flattened
a0f27e21b42c all texrelated Chengsong parents: diff changeset	685	to an empty string can be given by the following formula:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	686	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	687	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	688	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	689	and a closed form formula can be calculated to be
a0f27e21b42c all texrelated Chengsong parents: diff changeset	690	\begin{equation}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	691	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	692	\end{equation}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	693	which is clearly in exponential order.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	694	A lexer aimed at getting all the possible values has an exponential
a0f27e21b42c all texrelated Chengsong parents: diff changeset	695	worst case runtime. Therefore it is impractical to try to generate
a0f27e21b42c all texrelated Chengsong parents: diff changeset	696	all possible matches in a run. In practice, we are usually
a0f27e21b42c all texrelated Chengsong parents: diff changeset	697	interested about POSIX values, which by intuition always
a0f27e21b42c all texrelated Chengsong parents: diff changeset	698	match the leftmost regular expression when there is a choice
a0f27e21b42c all texrelated Chengsong parents: diff changeset	699	and always match a sub part as much as possible before proceeding
a0f27e21b42c all texrelated Chengsong parents: diff changeset	700	to the next token. For example, the above example has the POSIX value
a0f27e21b42c all texrelated Chengsong parents: diff changeset	701	$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	702	The output of an algorithm we want would be a POSIX matching
a0f27e21b42c all texrelated Chengsong parents: diff changeset	703	encoded as a value.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	704	The contribution of Sulzmann and Lu is an extension of Brzozowski's
a0f27e21b42c all texrelated Chengsong parents: diff changeset	705	algorithm by a second phase (the first phase being building successive
a0f27e21b42c all texrelated Chengsong parents: diff changeset	706	derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
a0f27e21b42c all texrelated Chengsong parents: diff changeset	707	is generated in case the regular expression matches the string.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	708	Pictorially, the Sulzmann and Lu algorithm is as follows:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	709
a0f27e21b42c all texrelated Chengsong parents: diff changeset	710	\begin{ceqn}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	711	\begin{equation}\label{graph:2}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	712	\begin{tikzcd}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	713	r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	714	v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	715	\end{tikzcd}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	716	\end{equation}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	717	\end{ceqn}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	718
a0f27e21b42c all texrelated Chengsong parents: diff changeset	719
a0f27e21b42c all texrelated Chengsong parents: diff changeset	720	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	721	For convenience, we shall employ the following notations: the regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	722	expression we start with is $r_0$, and the given string $s$ is composed
a0f27e21b42c all texrelated Chengsong parents: diff changeset	723	of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	724	left to right, we build the derivatives $r_1$, $r_2$, \ldots according
a0f27e21b42c all texrelated Chengsong parents: diff changeset	725	to the characters $c_0$, $c_1$ until we exhaust the string and obtain
a0f27e21b42c all texrelated Chengsong parents: diff changeset	726	the derivative $r_n$. We test whether this derivative is
a0f27e21b42c all texrelated Chengsong parents: diff changeset	727	$\textit{nullable}$ or not. If not, we know the string does not match
a0f27e21b42c all texrelated Chengsong parents: diff changeset	728	$r$ and no value needs to be generated. If yes, we start building the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	729	values incrementally by \emph{injecting} back the characters into the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	730	earlier values $v_n, \ldots, v_0$. This is the second phase of the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	731	algorithm from the right to left. For the first value $v_n$, we call the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	732	function $\textit{mkeps}$, which builds a POSIX lexical value
a0f27e21b42c all texrelated Chengsong parents: diff changeset	733	for how the empty string has been matched by the (nullable) regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	734	expression $r_n$. This function is defined as
a0f27e21b42c all texrelated Chengsong parents: diff changeset	735
a0f27e21b42c all texrelated Chengsong parents: diff changeset	736	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	737	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	738	$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	739	$\mkeps(r_{1}+r_{2})$ & $\dn$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	740	& \textit{if} $\nullable(r_{1})$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	741	& & \textit{then} $\Left(\mkeps(r_{1}))$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	742	& & \textit{else} $\Right(\mkeps(r_{2}))$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	743	$\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	744	$mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	745	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	746	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	747
a0f27e21b42c all texrelated Chengsong parents: diff changeset	748
a0f27e21b42c all texrelated Chengsong parents: diff changeset	749	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	750	After the $\mkeps$-call, we inject back the characters one by one in order to build
a0f27e21b42c all texrelated Chengsong parents: diff changeset	751	the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	752	($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	753	After injecting back $n$ characters, we get the lexical value for how $r_0$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	754	matches $s$. The POSIX value is maintained throught out the process.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	755	For this Sulzmann and Lu defined a function that reverses
a0f27e21b42c all texrelated Chengsong parents: diff changeset	756	the ``chopping off'' of characters during the derivative phase. The
a0f27e21b42c all texrelated Chengsong parents: diff changeset	757	corresponding function is called \emph{injection}, written
a0f27e21b42c all texrelated Chengsong parents: diff changeset	758	$\textit{inj}$; it takes three arguments: the first one is a regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	759	expression ${r_{i-1}}$, before the character is chopped off, the second
a0f27e21b42c all texrelated Chengsong parents: diff changeset	760	is a character ${c_{i-1}}$, the character we want to inject and the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	761	third argument is the value ${v_i}$, into which one wants to inject the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	762	character (it corresponds to the regular expression after the character
a0f27e21b42c all texrelated Chengsong parents: diff changeset	763	has been chopped off). The result of this function is a new value. The
a0f27e21b42c all texrelated Chengsong parents: diff changeset	764	definition of $\textit{inj}$ is as follows:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	765
a0f27e21b42c all texrelated Chengsong parents: diff changeset	766	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	767	\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	768	$\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	769	$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	770	$\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	771	$\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	772	$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	773	$\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	774	$\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	775	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	776	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	777
a0f27e21b42c all texrelated Chengsong parents: diff changeset	778	\noindent This definition is by recursion on the ``shape'' of regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	779	expressions and values.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	780	The clauses basically do one thing--identifying the ``holes'' on
a0f27e21b42c all texrelated Chengsong parents: diff changeset	781	value to inject the character back into.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	782	For instance, in the last clause for injecting back to a value
a0f27e21b42c all texrelated Chengsong parents: diff changeset	783	that would turn into a new star value that corresponds to a star,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	784	we know it must be a sequence value. And we know that the first
a0f27e21b42c all texrelated Chengsong parents: diff changeset	785	value of that sequence corresponds to the child regex of the star
a0f27e21b42c all texrelated Chengsong parents: diff changeset	786	with the first character being chopped off--an iteration of the star
a0f27e21b42c all texrelated Chengsong parents: diff changeset	787	that had just been unfolded. This value is followed by the already
a0f27e21b42c all texrelated Chengsong parents: diff changeset	788	matched star iterations we collected before. So we inject the character
a0f27e21b42c all texrelated Chengsong parents: diff changeset	789	back to the first value and form a new value with this new iteration
a0f27e21b42c all texrelated Chengsong parents: diff changeset	790	being added to the previous list of iterations, all under the $Stars$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	791	top level.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	792
a0f27e21b42c all texrelated Chengsong parents: diff changeset	793	We have mentioned before that derivatives without simplification
a0f27e21b42c all texrelated Chengsong parents: diff changeset	794	can get clumsy, and this is true for values as well--they reflect
a0f27e21b42c all texrelated Chengsong parents: diff changeset	795	the regular expressions size by definition.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	796
a0f27e21b42c all texrelated Chengsong parents: diff changeset	797	One can introduce simplification on the regex and values, but have to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	798	be careful in not breaking the correctness as the injection
a0f27e21b42c all texrelated Chengsong parents: diff changeset	799	function heavily relies on the structure of the regexes and values
a0f27e21b42c all texrelated Chengsong parents: diff changeset	800	being correct and match each other.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	801	It can be achieved by recording some extra rectification functions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	802	during the derivatives step, and applying these rectifications in
a0f27e21b42c all texrelated Chengsong parents: diff changeset	803	each run during the injection phase.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	804	And we can prove that the POSIX value of how
a0f27e21b42c all texrelated Chengsong parents: diff changeset	805	regular expressions match strings will not be affected---although is much harder
a0f27e21b42c all texrelated Chengsong parents: diff changeset	806	to establish. Some initial results in this regard have been
a0f27e21b42c all texrelated Chengsong parents: diff changeset	807	obtained in \cite{AusafDyckhoffUrban2016}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	808
a0f27e21b42c all texrelated Chengsong parents: diff changeset	809	%Brzozowski, after giving the derivatives and simplification,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	810	%did not explore lexing with simplification or he may well be
a0f27e21b42c all texrelated Chengsong parents: diff changeset	811	%stuck on an efficient simplificaiton with a proof.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	812	%He went on to explore the use of derivatives together with
a0f27e21b42c all texrelated Chengsong parents: diff changeset	813	%automaton, and did not try lexing using derivatives.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	814
a0f27e21b42c all texrelated Chengsong parents: diff changeset	815	We want to get rid of complex and fragile rectification of values.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	816	Can we not create those intermediate values $v_1,\ldots v_n$,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	817	and get the lexing information that should be already there while
a0f27e21b42c all texrelated Chengsong parents: diff changeset	818	doing derivatives in one pass, without a second phase of injection?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	819	In the meantime, can we make sure that simplifications
a0f27e21b42c all texrelated Chengsong parents: diff changeset	820	are easily handled without breaking the correctness of the algorithm?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	821
a0f27e21b42c all texrelated Chengsong parents: diff changeset	822	Sulzmann and Lu solved this problem by
a0f27e21b42c all texrelated Chengsong parents: diff changeset	823	introducing additional informtaion to the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	824	regular expressions called \emph{bitcodes}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	825
a0f27e21b42c all texrelated Chengsong parents: diff changeset	826	\subsection*{Bit-coded Algorithm}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	827	Bits and bitcodes (lists of bits) are defined as:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	828
a0f27e21b42c all texrelated Chengsong parents: diff changeset	829	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	830	$b ::= 1 \mid 0 \qquad
a0f27e21b42c all texrelated Chengsong parents: diff changeset	831	bs ::= [] \mid b::bs
a0f27e21b42c all texrelated Chengsong parents: diff changeset	832	$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	833	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	834
a0f27e21b42c all texrelated Chengsong parents: diff changeset	835	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	836	The $1$ and $0$ are not in bold in order to avoid
a0f27e21b42c all texrelated Chengsong parents: diff changeset	837	confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
a0f27e21b42c all texrelated Chengsong parents: diff changeset	838	bit-lists) can be used to encode values (or potentially incomplete values) in a
a0f27e21b42c all texrelated Chengsong parents: diff changeset	839	compact form. This can be straightforwardly seen in the following
a0f27e21b42c all texrelated Chengsong parents: diff changeset	840	coding function from values to bitcodes:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	841
a0f27e21b42c all texrelated Chengsong parents: diff changeset	842	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	843	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	844	$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	845	$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	846	$\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	847	$\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	848	$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	849	$\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	850	$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
a0f27e21b42c all texrelated Chengsong parents: diff changeset	851	code(\Stars\,vs)$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	852	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	853	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	854
a0f27e21b42c all texrelated Chengsong parents: diff changeset	855	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	856	Here $\textit{code}$ encodes a value into a bitcodes by converting
a0f27e21b42c all texrelated Chengsong parents: diff changeset	857	$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
a0f27e21b42c all texrelated Chengsong parents: diff changeset	858	star iteration by $1$. The border where a local star terminates
a0f27e21b42c all texrelated Chengsong parents: diff changeset	859	is marked by $0$. This coding is lossy, as it throws away the information about
a0f27e21b42c all texrelated Chengsong parents: diff changeset	860	characters, and also does not encode the ``boundary'' between two
a0f27e21b42c all texrelated Chengsong parents: diff changeset	861	sequence values. Moreover, with only the bitcode we cannot even tell
a0f27e21b42c all texrelated Chengsong parents: diff changeset	862	whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
a0f27e21b42c all texrelated Chengsong parents: diff changeset	863	reason for choosing this compact way of storing information is that the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	864	relatively small size of bits can be easily manipulated and ``moved
a0f27e21b42c all texrelated Chengsong parents: diff changeset	865	around'' in a regular expression. In order to recover values, we will
a0f27e21b42c all texrelated Chengsong parents: diff changeset	866	need the corresponding regular expression as an extra information. This
a0f27e21b42c all texrelated Chengsong parents: diff changeset	867	means the decoding function is defined as:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	868
a0f27e21b42c all texrelated Chengsong parents: diff changeset	869
a0f27e21b42c all texrelated Chengsong parents: diff changeset	870	%\begin{definition}[Bitdecoding of Values]\mbox{}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	871	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	872	\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	873	$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	874	$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	875	$\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	876	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
a0f27e21b42c all texrelated Chengsong parents: diff changeset	877	(\Left\,v, bs_1)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	878	$\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	879	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
a0f27e21b42c all texrelated Chengsong parents: diff changeset	880	(\Right\,v, bs_1)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	881	$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	882	$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	883	& & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	884	& & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	885	$\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	886	$\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	887	$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	888	& & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	889	& & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	890
a0f27e21b42c all texrelated Chengsong parents: diff changeset	891	$\textit{decode}\,bs\,r$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	892	$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	893	& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
a0f27e21b42c all texrelated Chengsong parents: diff changeset	894	\textit{else}\;\textit{None}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	895	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	896	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	897	%\end{definition}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	898
a0f27e21b42c all texrelated Chengsong parents: diff changeset	899	Sulzmann and Lu's integrated the bitcodes into regular expressions to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	900	create annotated regular expressions \cite{Sulzmann2014}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	901	\emph{Annotated regular expressions} are defined by the following
a0f27e21b42c all texrelated Chengsong parents: diff changeset	902	grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	903
a0f27e21b42c all texrelated Chengsong parents: diff changeset	904	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	905	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	906	$\textit{a}$ & $::=$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	907	& $\mid$ & $_{bs}\ONE$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	908	& $\mid$ & $_{bs}{\bf c}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	909	& $\mid$ & $_{bs}\sum\,as$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	910	& $\mid$ & $_{bs}a_1\cdot a_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	911	& $\mid$ & $_{bs}a^*$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	912	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	913	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	914	%(in \textit{ALTS})
a0f27e21b42c all texrelated Chengsong parents: diff changeset	915
a0f27e21b42c all texrelated Chengsong parents: diff changeset	916	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	917	where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	918	expressions and $as$ for a list of annotated regular expressions.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	919	The alternative constructor($\sum$) has been generalized to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	920	accept a list of annotated regular expressions rather than just 2.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	921	We will show that these bitcodes encode information about
a0f27e21b42c all texrelated Chengsong parents: diff changeset	922	the (POSIX) value that should be generated by the Sulzmann and Lu
a0f27e21b42c all texrelated Chengsong parents: diff changeset	923	algorithm.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	924
a0f27e21b42c all texrelated Chengsong parents: diff changeset	925
a0f27e21b42c all texrelated Chengsong parents: diff changeset	926	To do lexing using annotated regular expressions, we shall first
a0f27e21b42c all texrelated Chengsong parents: diff changeset	927	transform the usual (un-annotated) regular expressions into annotated
a0f27e21b42c all texrelated Chengsong parents: diff changeset	928	regular expressions. This operation is called \emph{internalisation} and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	929	defined as follows:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	930
a0f27e21b42c all texrelated Chengsong parents: diff changeset	931	%\begin{definition}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	932	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	933	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	934	$(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	935	$(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	936	$(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	937	$(r_1 + r_2)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	938	$_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	939	\textit{fuse}\,[1]\,r_2^\uparrow]$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	940	$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	941	$_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	942	$(r^*)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	943	$_{[]}(r^\uparrow)^*$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	944	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	945	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	946	%\end{definition}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	947
a0f27e21b42c all texrelated Chengsong parents: diff changeset	948	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	949	We use up arrows here to indicate that the basic un-annotated regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	950	expressions are ``lifted up'' into something slightly more complex. In the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	951	fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	952	attach bits to the front of an annotated regular expression. Its
a0f27e21b42c all texrelated Chengsong parents: diff changeset	953	definition is as follows:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	954
a0f27e21b42c all texrelated Chengsong parents: diff changeset	955	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	956	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	957	$\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	958	$\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	959	$_{bs @ bs'}\ONE$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	960	$\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	961	$_{bs@bs'}{\bf c}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	962	$\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	963	$_{bs@bs'}\sum\textit{as}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	964	$\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	965	$_{bs@bs'}a_1 \cdot a_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	966	$\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	967	$_{bs @ bs'}a^*$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	968	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	969	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	970
a0f27e21b42c all texrelated Chengsong parents: diff changeset	971	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	972	After internalising the regular expression, we perform successive
a0f27e21b42c all texrelated Chengsong parents: diff changeset	973	derivative operations on the annotated regular expressions. This
a0f27e21b42c all texrelated Chengsong parents: diff changeset	974	derivative operation is the same as what we had previously for the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	975	basic regular expressions, except that we beed to take care of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	976	the bitcodes:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	977
a0f27e21b42c all texrelated Chengsong parents: diff changeset	978
a0f27e21b42c all texrelated Chengsong parents: diff changeset	979	\iffalse
a0f27e21b42c all texrelated Chengsong parents: diff changeset	980	%\begin{definition}{bder}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	981	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	982	\begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	983	$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	984	$(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	985	$(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	986	$\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated Chengsong parents: diff changeset	987	\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	988	$(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	989	$\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	990	$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	991	$\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	992	& &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	993	& &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	994	& &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	995	$(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	996	$\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	997	(\textit{STAR}\,[]\,r)$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	998	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	999	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1000	%\end{definition}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1001
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1002	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1003	\begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1004	$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1005	$(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1006	$(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1007	$\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1008	_{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1009	$(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1010	$_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1011	$(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1012	$\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1013	& &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1014	& &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1015	& &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1016	$(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1017	$_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1018	(_{bs}\textit{STAR}\,[]\,r)$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1019	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1020	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1021	%\end{definition}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1022	\fi
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1023
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1024	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1025	\begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1026	$(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1027	$(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1028	$(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1029	$\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1030	_{bs}\ONE\;\textit{else}\;\ZERO$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1031	$(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1032	$_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1033	$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1034	$\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1035	& &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1036	& &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1037	& &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1038	$(_{bs}a^*)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1039	$_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1040	(_{[]}r^*))$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1041	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1042	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1043
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1044	%\end{definition}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1045	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1046	For instance, when we do derivative of $_{bs}a^*$ with respect to c,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1047	we need to unfold it into a sequence,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1048	and attach an additional bit $0$ to the front of $r \backslash c$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1049	to indicate that there is one more star iteration. Also the sequence clause
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1050	is more subtle---when $a_1$ is $\textit{bnullable}$ (here
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1051	\textit{bnullable} is exactly the same as $\textit{nullable}$, except
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1052	that it is for annotated regular expressions, therefore we omit the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1053	definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1054	$a_1$ matches the string prior to character $c$ (more on this later),
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1055	then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \; a_1 (a_2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1056	\backslash c)$ will collapse the regular expression $a_1$(as it has
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1057	already been fully matched) and store the parsing information at the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1058	head of the regular expression $a_2 \backslash c$ by fusing to it. The
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1059	bitsequence $\textit{bs}$, which was initially attached to the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1060	first element of the sequence $a_1 \cdot a_2$, has
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1061	now been elevated to the top-level of $\sum$, as this information will be
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1062	needed whichever way the sequence is matched---no matter whether $c$ belongs
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1063	to $a_1$ or $ a_2$. After building these derivatives and maintaining all
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1064	the lexing information, we complete the lexing by collecting the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1065	bitcodes using a generalised version of the $\textit{mkeps}$ function
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1066	for annotated regular expressions, called $\textit{bmkeps}$:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1067
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1068
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1069	%\begin{definition}[\textit{bmkeps}]\mbox{}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1070	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1071	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1072	$\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1073	$\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1074	$\textit{if}\;\textit{bnullable}\,a$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1075	& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1076	& &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1077	$\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1078	$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1079	$\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1080	$bs \,@\, [0]$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1081	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1082	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1083	%\end{definition}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1084
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1085	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1086	This function completes the value information by travelling along the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1087	path of the regular expression that corresponds to a POSIX value and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1088	collecting all the bitcodes, and using $S$ to indicate the end of star
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1089	iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1090	decode them, we get the value we expect. The corresponding lexing
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1091	algorithm looks as follows:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1092
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1093	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1094	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1095	$\textit{blexer}\;r\,s$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1096	$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1097	& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1098	& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1099	& & $\;\;\textit{else}\;\textit{None}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1100	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1101	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1102
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1103	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1104	In this definition $\_\backslash s$ is the generalisation of the derivative
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1105	operation from characters to strings (just like the derivatives for un-annotated
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1106	regular expressions).
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1107
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1108	Remember tha one of the important reasons we introduced bitcodes
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1109	is that they can make simplification more structured and therefore guaranteeing
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1110	the correctness.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1111
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1112	\subsection*{Our Simplification Rules}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1113
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1114	In this section we introduce aggressive (in terms of size) simplification rules
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1115	on annotated regular expressions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1116	in order to keep derivatives small. Such simplifications are promising
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1117	as we have
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1118	generated test data that show
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1119	that a good tight bound can be achieved. Obviously we could only
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1120	partially cover the search space as there are infinitely many regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1121	expressions and strings.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1122
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1123	One modification we introduced is to allow a list of annotated regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1124	expressions in the $\sum$ constructor. This allows us to not just
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1125	delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1126	also unnecessary ``copies'' of regular expressions (very similar to
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1127	simplifying $r + r$ to just $r$, but in a more general setting). Another
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1128	modification is that we use simplification rules inspired by Antimirov's
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1129	work on partial derivatives. They maintain the idea that only the first
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1130	``copy'' of a regular expression in an alternative contributes to the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1131	calculation of a POSIX value. All subsequent copies can be pruned away from
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1132	the regular expression. A recursive definition of our simplification function
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1133	that looks somewhat similar to our Scala code is given below:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1134	%\comment{Use $\ZERO$, $\ONE$ and so on.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1135	%Is it $ALTS$ or $ALTS$?}\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1136
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1137	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1138	\begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1139
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1140	$\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1141	&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1142	&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1143	&&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1144	&&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1145	&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1146
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1147	$\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1148	&&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1149	&&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1150	&&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1151
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1152	$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1153	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1154	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1155
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1156	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1157	The simplification does a pattern matching on the regular expression.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1158	When it detected that the regular expression is an alternative or
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1159	sequence, it will try to simplify its children regular expressions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1160	recursively and then see if one of the children turn into $\ZERO$ or
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1161	$\ONE$, which might trigger further simplification at the current level.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1162	The most involved part is the $\sum$ clause, where we use two
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1163	auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1164	alternatives and reduce as many duplicates as possible. Function
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1165	$\textit{distinct}$ keeps the first occurring copy only and remove all later ones
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1166	when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1167	Its recursive definition is given below:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1168
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1169	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1170	\begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1171	$\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1172	(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1173	$\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1174	$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1175	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1176	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1177
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1178	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1179	Here $\textit{flatten}$ behaves like the traditional functional programming flatten
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1180	function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1181	removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1182
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1183	Having defined the $\simp$ function,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1184	we can use the previous notation of natural
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1185	extension from derivative w.r.t.~character to derivative
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1186	w.r.t.~string:%\comment{simp in the [] case?}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1187
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1188	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1189	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1190	$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1191	$r \backslash_{simp} [\,] $ & $\dn$ & $r$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1192	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1193	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1194
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1195	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1196	to obtain an optimised version of the algorithm:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1197
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1198	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1199	\begin{tabular}{lcl}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1200	$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1201	$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1202	& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1203	& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1204	& & $\;\;\textit{else}\;\textit{None}$
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1205	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1206	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1207
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1208	\noindent
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1209	This algorithm keeps the regular expression size small, for example,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1210	with this simplification our previous $(a + aa)^*$ example's 8000 nodes
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1211	will be reduced to just 6 and stays constant, no matter how long the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1212	input string is.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1213
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1214
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1215
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1216	Derivatives give a simple solution
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1217	to the problem of matching a string $s$ with a regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1218	expression $r$: if the derivative of $r$ w.r.t.\ (in
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1219	succession) all the characters of the string matches the empty string,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1220	then $r$ matches $s$ (and {\em vice versa}).
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1221
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1222
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1223
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1224	However, there are two difficulties with derivative-based matchers:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1225	First, Brzozowski's original matcher only generates a yes/no answer
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1226	for whether a regular expression matches a string or not. This is too
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1227	little information in the context of lexing where separate tokens must
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1228	be identified and also classified (for example as keywords
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1229	or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1230	difficulty by cleverly extending Brzozowski's matching
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1231	algorithm. Their extended version generates additional information on
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1232	\emph{how} a regular expression matches a string following the POSIX
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1233	rules for regular expression matching. They achieve this by adding a
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1234	second ``phase'' to Brzozowski's algorithm involving an injection
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1235	function. In our own earlier work we provided the formal
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1236	specification of what POSIX matching means and proved in Isabelle/HOL
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1237	the correctness
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1238	of Sulzmann and Lu's extended algorithm accordingly
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1239	\cite{AusafDyckhoffUrban2016}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1240
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1241	The second difficulty is that Brzozowski's derivatives can
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1242	grow to arbitrarily big sizes. For example if we start with the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1243	regular expression $(a+aa)^*$ and take
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1244	successive derivatives according to the character $a$, we end up with
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1245	a sequence of ever-growing derivatives like
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1246
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1247	\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1248	\begin{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1249	\begin{tabular}{rll}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1250	$(a + aa)^$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1251	& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1252	& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1253	& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1254	& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1255	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1256	\end{center}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1257
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1258	\noindent where after around 35 steps we run out of memory on a
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1259	typical computer (we shall define shortly the precise details of our
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1260	regular expressions and the derivative operation). Clearly, the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1261	notation involving $\ZERO$s and $\ONE$s already suggests
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1262	simplification rules that can be applied to regular regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1263	expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1264	\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1265	r$. While such simple-minded simplifications have been proved in our
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1266	earlier work to preserve the correctness of Sulzmann and Lu's
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1267	algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1268	\emph{not} help with limiting the growth of the derivatives shown
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1269	above: the growth is slowed, but the derivatives can still grow rather
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1270	quickly beyond any finite bound.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1271
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1272
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1273	Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1274	\cite{Sulzmann2014} where they introduce bitcoded
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1275	regular expressions. In this version, POSIX values are
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1276	represented as bitsequences and such sequences are incrementally generated
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1277	when derivatives are calculated. The compact representation
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1278	of bitsequences and regular expressions allows them to define a more
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1279	``aggressive'' simplification method that keeps the size of the
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1280	derivatives finite no matter what the length of the string is.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1281	They make some informal claims about the correctness and linear behaviour
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1282	of this version, but do not provide any supporting proof arguments, not
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1283	even ``pencil-and-paper'' arguments. They write about their bitcoded
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1284	\emph{incremental parsing method} (that is the algorithm to be formalised
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1285	in this paper):
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1286
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1287
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1288	\begin{quote}\it
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1289	``Correctness Claim: We further claim that the incremental parsing
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1290	method [..] in combination with the simplification steps [..]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1291	yields POSIX parse trees. We have tested this claim
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1292	extensively [..] but yet
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1293	have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1294	\end{quote}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1295
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1296
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1297
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1298
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1299	\section{Backgound}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1300	%Regular expression matching and lexing has been
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1301	% widely-used and well-implemented
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1302	%in software industry.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1303	%TODO: expand the above into a full paragraph
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1304	%TODO: look up snort rules to use here--give readers idea of what regexes look like
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1305
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1306
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1307	Theoretical results say that regular expression matching
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1308	should be linear with respect to the input.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1309	Under a certain class of regular expressions and inputs though,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1310	practical implementations suffer from non-linear or even
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1311	exponential running time,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1312	allowing a ReDoS (regular expression denial-of-service ) attack.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1313
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1314
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1315	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1316
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1317	\section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1318
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1319	The reason behind is that regex libraries in popular language engines
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1320	often want to support richer constructs
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1321	than the most basic regular expressions, and lexing rather than matching
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1322	is needed for sub-match extraction.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1323
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1324	There is also static analysis work on regular expression that
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1325	have potential expoential behavious. Rathnayake and Thielecke
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1326	\parencite{Rathnayake2014StaticAF} proposed an algorithm
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1327	that detects regular expressions triggering exponential
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1328	behavious on backtracking matchers.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1329	People also developed static analysis methods for
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1330	generating non-linear polynomial worst-time estimates
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1331	for regexes, attack string that exploit the worst-time
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1332	scenario, and "attack automata" that generates
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1333	attack strings.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1334	For a comprehensive analysis, please refer to Weideman's thesis
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1335	\parencite{Weideman2017Static}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1336
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1337	\subsection{DFA Approach}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1338
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1339	Exponential states.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1340
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1341	\subsection{NFA Approach}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1342	Backtracking.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1343
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1344
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1345
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1346	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1347
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1348	\section{Our Approach}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1349	In the last fifteen or so years, Brzozowski's derivatives of regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1350	expressions have sparked quite a bit of interest in the functional
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1351	programming and theorem prover communities. The beauty of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1352	Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1353	expressible in any functional language, and easily definable and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1354	reasoned about in theorem provers---the definitions just consist of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1355	inductive datatypes and simple recursive functions. Derivatives of a
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1356	regular expression, written $r \backslash c$, give a simple solution
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1357	to the problem of matching a string $s$ with a regular
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1358	expression $r$: if the derivative of $r$ w.r.t.\ (in
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1359	succession) all the characters of the string matches the empty string,
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1360	then $r$ matches $s$ (and {\em vice versa}).
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1361
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1362
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1363	This work aims to address the above vulnerability by the combination
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1364	of Brzozowski's derivatives and interactive theorem proving. We give an
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1365	improved version of Sulzmann and Lu's bit-coded algorithm using
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1366	derivatives, which come with a formal guarantee in terms of correctness and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1367	running time as an Isabelle/HOL proof.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1368	Then we improve the algorithm with an even stronger version of
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1369	simplification, and prove a time bound linear to input and
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1370	cubic to regular expression size using a technique by
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1371	Antimirov.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1372
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1373	\subsection{Existing Work}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1374	We are aware
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1375	of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1376	Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1377	of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1378	in Coq is given by Coquand and Siles \parencite{Coquand2012}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1379	Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1380
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1381	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1382
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1383	\section{What this Template Includes}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1384
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1385	\subsection{Folders}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1386
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1387	This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1388
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1389	\keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1390
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1391	\keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. Each chapter should go in its own separate \file{.tex} file and they can be split as:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1392	\begin{itemize}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1393	\item Chapter 1: Introduction to the thesis topic
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1394	\item Chapter 2: Background information and theory
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1395	\item Chapter 3: (Laboratory) experimental setup
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1396	\item Chapter 4: Details of experiment 1
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1397	\item Chapter 5: Details of experiment 2
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1398	\item Chapter 6: Discussion of the experimental results
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1399	\item Chapter 7: Conclusion and future directions
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1400	\end{itemize}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1401	This chapter layout is specialised for the experimental sciences, your discipline may be different.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1402
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1403	\keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1404
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1405	\subsection{Files}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1406
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1407	Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1408
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1409	\keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1410
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1411	\keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1412
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1413	\keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1414
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1415	\keyword{main.tex} -- this is an important file. This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis!
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1416
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1417	Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1418
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1419	\keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1420
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1421	\keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1422
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1423	\keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1424
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1425	\keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1426
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1427	\keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1428
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1429	\keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1430
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1431	\keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1432
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1433	So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1434
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1435	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1436
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1437	\section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1438
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1439	You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1440
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1441	Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1442
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1443	Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb\|\href{url}{name}\| and only leave the name.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1444
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1445	When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper!
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1446
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1447	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1448
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1449	\section{The \code{main.tex} File Explained}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1450
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1451	The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1452
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1453	Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1454
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1455	Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1456
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1457	Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1458
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1459	Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor).
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1460
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1461	The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1462
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1463	The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1464
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1465	The next page contains a one line dedication. Who will you dedicate your thesis to?
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1466
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1467	Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1468
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1469	After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1470
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1471	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1472
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1473	\section{Thesis Features and Conventions}\label{ThesisConventions}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1474
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1475	To get the best out of this template, there are a few conventions that you may want to follow.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1476
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1477	One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1478
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1479	\subsection{Printing Format}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1480
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1481	This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1482
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1483	The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1484
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1485	The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1486
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1487	\subsection{Using US Letter Paper}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1488
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1489	The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1490
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1491	Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1492
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1493	\subsection{References}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1494
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1495	The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. Many reference managers allow you to simply drag the reference into the document as you type.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1496
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1497	Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1498
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1499	The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations).
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1500
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1501	\subsubsection{A Note on bibtex}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1502
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1503	The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1504
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1505	\subsection{Tables}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1506
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1507	Tables are an important way of displaying your results, below is an example table which was generated with this code:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1508
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1509	{\small
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1510	\begin{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1511	\begin{table}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1512	\caption{The effects of treatments X and Y on the four groups studied.}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1513	\label{tab:treatments}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1514	\centering
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1515	\begin{tabular}{l l l}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1516	\toprule
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1517	\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1518	\midrule
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1519	1 & 0.2 & 0.8\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1520	2 & 0.17 & 0.7\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1521	3 & 0.24 & 0.75\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1522	4 & 0.68 & 0.3\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1523	\bottomrule\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1524	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1525	\end{table}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1526	\end{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1527	}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1528
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1529	\begin{table}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1530	\caption{The effects of treatments X and Y on the four groups studied.}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1531	\label{tab:treatments}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1532	\centering
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1533	\begin{tabular}{l l l}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1534	\toprule
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1535	\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1536	\midrule
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1537	1 & 0.2 & 0.8\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1538	2 & 0.17 & 0.7\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1539	3 & 0.24 & 0.75\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1540	4 & 0.68 & 0.3\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1541	\bottomrule\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1542	\end{tabular}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1543	\end{table}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1544
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1545	You can reference tables with \verb\|\ref{<label>}\| where the label is defined within the table environment. See \file{Chapter1.tex} for an example of the label and citation (e.g. Table~\ref{tab:treatments}).
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1546
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1547	\subsection{Figures}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1548
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1549	There will hopefully be many figures in your thesis (that should be placed in the \emph{Figures} folder). The way to insert figures into your thesis is to use a code template like this:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1550	\begin{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1551	\begin{figure}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1552	\centering
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1553	\includegraphics{Figures/Electron}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1554	\decoRule
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1555	\caption[An Electron]{An electron (artist's impression).}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1556	\label{fig:Electron}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1557	\end{figure}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1558	\end{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1559	Also look in the source file. Putting this code into the source file produces the picture of the electron that you can see in the figure below.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1560
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1561	\begin{figure}[th]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1562	\centering
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1563	\includegraphics{Figures/Electron}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1564	\decoRule
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1565	\caption[An Electron]{An electron (artist's impression).}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1566	\label{fig:Electron}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1567	\end{figure}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1568
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1569	Sometimes figures don't always appear where you write them in the source. The placement depends on how much space there is on the page for the figure. Sometimes there is not enough room to fit a figure directly where it should go (in relation to the text) and so \LaTeX{} puts it at the top of the next page. Positioning figures is the job of \LaTeX{} and so you should only worry about making them look good!
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1570
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1571	Figures usually should have captions just in case you need to refer to them (such as in Figure~\ref{fig:Electron}). The \verb\|\caption\| command contains two parts, the first part, inside the square brackets is the title that will appear in the \emph{List of Figures}, and so should be short. The second part in the curly brackets should contain the longer and more descriptive caption text.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1572
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1573	The \verb\|\decoRule\| command is optional and simply puts an aesthetic horizontal line below the image. If you do this for one image, do it for all of them.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1574
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1575	\LaTeX{} is capable of using images in pdf, jpg and png format.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1576
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1577	\subsection{Typesetting mathematics}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1578
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1579	If your thesis is going to contain heavy mathematical content, be sure that \LaTeX{} will make it look beautiful, even though it won't be able to solve the equations for you.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1580
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1581	The \enquote{Not So Short Introduction to \LaTeX} (available on \href{http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf}{CTAN}) should tell you everything you need to know for most cases of typesetting mathematics. If you need more information, a much more thorough mathematical guide is available from the AMS called, \enquote{A Short Math Guide to \LaTeX} and can be downloaded from:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1582	\url{ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1583
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1584	There are many different \LaTeX{} symbols to remember, luckily you can find the most common symbols in \href{http://ctan.org/pkg/comprehensive}{The Comprehensive \LaTeX~Symbol List}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1585
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1586	You can write an equation, which is automatically given an equation number by \LaTeX{} like this:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1587	\begin{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1588	\begin{equation}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1589	E = mc^{2}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1590	\label{eqn:Einstein}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1591	\end{equation}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1592	\end{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1593
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1594	This will produce Einstein's famous energy-matter equivalence equation:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1595	\begin{equation}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1596	E = mc^{2}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1597	\label{eqn:Einstein}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1598	\end{equation}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1599
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1600	All equations you write (which are not in the middle of paragraph text) are automatically given equation numbers by \LaTeX{}. If you don't want a particular equation numbered, use the unnumbered form:
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1601	\begin{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1602	\[ a^{2}=4 \]
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1603	\end{verbatim}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1604
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1605	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1606
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1607	\section{Sectioning and Subsectioning}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1608
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1609	You should break your thesis up into nice, bite-sized sections and subsections. \LaTeX{} automatically builds a table of Contents by looking at all the \verb\|\chapter{}\|, \verb\|\section{}\| and \verb\|\subsection{}\| commands you write in the source.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1610
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1611	The Table of Contents should only list the sections to three (3) levels. A \verb\|chapter{}\| is level zero (0). A \verb\|\section{}\| is level one (1) and so a \verb\|\subsection{}\| is level two (2). In your thesis it is likely that you will even use a \verb\|subsubsection{}\|, which is level three (3). The depth to which the Table of Contents is formatted is set within \file{MastersDoctoralThesis.cls}. If you need this changed, you can do it in \file{main.tex}.
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1612
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1613	%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1614
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1615	\section{In Closing}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1616
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1617	You have reached the end of this mini-guide. You can now rename or overwrite this pdf file and begin writing your own \file{Chapter1.tex} and the rest of your thesis. The easy work of setting up the structure and framework has been taken care of for you. It's now your job to fill it out!
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1618
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1619	Good luck and have lots of fun!
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1620
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1621	\begin{flushright}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1622	Guide written by ---\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1623	Sunil Patel: \href{http://www.sunilpatel.co.uk}{www.sunilpatel.co.uk}\\
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1624	Vel: \href{http://www.LaTeXTemplates.com}{LaTeXTemplates.com}
a0f27e21b42c all texrelated Chengsong parents: diff changeset	1625	\end{flushright}

author	Chengsong
	Fri, 25 Mar 2022 21:49:53 +0000
changeset 468	a0f27e21b42c
child 469	e5dd8cc0aa82
permissions	-rwxr-xr-x