lexing: PhdThesisRealOne/LaTeXTemplates_masters-doctoral-thesis_v2/Chapters/Chapter1.tex@26a5e640cdd7 (annotated)

456 26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	1	% Chapter 1
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	2
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	3	\chapter{Introduction} % Main chapter title
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	4
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	5	\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	6
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	7	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	8
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	9	% Define some commands to keep the formatting separated from the content
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	10	\newcommand{\keyword}[1]{\textbf{#1}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	11	\newcommand{\tabhead}[1]{\textbf{#1}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	12	\newcommand{\code}[1]{\texttt{#1}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	13	\newcommand{\file}[1]{\texttt{\bfseries#1}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	14	\newcommand{\option}[1]{\texttt{\itshape#1}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	15
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	16
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	17	\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	18	\newcommand{\ZERO}{\mbox{\bf 0}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	19	\newcommand{\ONE}{\mbox{\bf 1}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	20	\def\lexer{\mathit{lexer}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	21	\def\mkeps{\mathit{mkeps}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	22
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	23	\def\DFA{\textit{DFA}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	24	\def\bmkeps{\textit{bmkeps}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	25	\def\retrieve{\textit{retrieve}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	26	\def\blexer{\textit{blexer}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	27	\def\flex{\textit{flex}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	28	\def\inj{\mathit{inj}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	29	\def\Empty{\mathit{Empty}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	30	\def\Left{\mathit{Left}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	31	\def\Right{\mathit{Right}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	32	\def\Stars{\mathit{Stars}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	33	\def\Char{\mathit{Char}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	34	\def\Seq{\mathit{Seq}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	35	\def\Der{\mathit{Der}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	36	\def\nullable{\mathit{nullable}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	37	\def\Z{\mathit{Z}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	38	\def\S{\mathit{S}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	39	\def\rup{r^\uparrow}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	40	\def\simp{\mathit{simp}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	41	\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	42	\def\map{\mathit{map}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	43	\def\distinct{\mathit{distinct}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	44	\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	45	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	46
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	47
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	48
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	49
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	50	Regular expression matching and lexing has been
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	51	widely-used and well-implemented
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	52	in software industry.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	53	If you go to a popular programming language's
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	54	regex engine,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	55	you can supply it with regex and strings of your own,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	56	and get matching/lexing information such as how a
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	57	sub-part of the regex matches a sub-part of the string.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	58	These lexing libraries are on average quite fast.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	59	For example, the regex engines some network intrusion detection
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	60	systems use are able to process
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	61	megabytes or even gigabytes of network traffic per second.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	62
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	63	Why do we need to have our version, if the algorithms work well on
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	64	average?
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	65
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	66
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	67	Take $(a^)^\,b$ and ask whether
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	68	strings of the form $aa..a$ match this regular
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	69	expression. Obviously this is not the case---the expected $b$ in the last
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	70	position is missing. One would expect that modern regular expression
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	71	matching engines can find this out very quickly. Alas, if one tries
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	72	this example in JavaScript, Python or Java 8 with strings like 28
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	73	$a$'s, one discovers that this decision takes around 30 seconds and
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	74	takes considerably longer when adding a few more $a$'s, as the graphs
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	75	below show:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	76
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	77	\begin{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	78	\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	79	\begin{tikzpicture}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	80	\begin{axis}[
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	81	xlabel={$n$},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	82	x label style={at={(1.05,-0.05)}},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	83	ylabel={time in secs},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	84	enlargelimits=false,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	85	xtick={0,5,...,30},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	86	xmax=33,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	87	ymax=35,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	88	ytick={0,5,...,30},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	89	scaled ticks=false,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	90	axis lines=left,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	91	width=5cm,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	92	height=4cm,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	93	legend entries={JavaScript},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	94	legend pos=north west,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	95	legend cell align=left]
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	96	\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	97	\end{axis}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	98	\end{tikzpicture}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	99	&
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	100	\begin{tikzpicture}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	101	\begin{axis}[
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	102	xlabel={$n$},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	103	x label style={at={(1.05,-0.05)}},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	104	%ylabel={time in secs},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	105	enlargelimits=false,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	106	xtick={0,5,...,30},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	107	xmax=33,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	108	ymax=35,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	109	ytick={0,5,...,30},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	110	scaled ticks=false,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	111	axis lines=left,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	112	width=5cm,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	113	height=4cm,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	114	legend entries={Python},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	115	legend pos=north west,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	116	legend cell align=left]
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	117	\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	118	\end{axis}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	119	\end{tikzpicture}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	120	&
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	121	\begin{tikzpicture}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	122	\begin{axis}[
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	123	xlabel={$n$},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	124	x label style={at={(1.05,-0.05)}},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	125	%ylabel={time in secs},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	126	enlargelimits=false,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	127	xtick={0,5,...,30},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	128	xmax=33,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	129	ymax=35,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	130	ytick={0,5,...,30},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	131	scaled ticks=false,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	132	axis lines=left,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	133	width=5cm,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	134	height=4cm,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	135	legend entries={Java 8},
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	136	legend pos=north west,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	137	legend cell align=left]
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	138	\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	139	\end{axis}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	140	\end{tikzpicture}\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	141	\multicolumn{3}{c}{Graphs: Runtime for matching $(a^)^\,b$ with strings
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	142	of the form $\underbrace{aa..a}_{n}$.}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	143	\end{tabular}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	144	\end{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	145
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	146
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	147	This is clearly exponential behaviour, and
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	148	is triggered by some relatively simple regex patterns.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	149
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	150
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	151
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	152	The opens up the possibility of
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	153	a ReDoS (regular expression denial-of-service ) attack.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	154
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	155	Theoretical results say that regular expression matching
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	156	should be linear with respect to the input. You could construct
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	157	an NFA via Thompson construction, and simulate running it.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	158	This would be linear.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	159	Or you could determinize the NFA into a DFA, and minimize that
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	160	DFA. Once you have the DFA, the running time is also linear, requiring just
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	161	one scanning pass of the input.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	162
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	163	But modern regex libraries in popular language engines
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	164	often want to support richer constructs
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	165	than the most basic regular expressions such as bounded repetitions
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	166	and back references.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	167	%put in illustrations
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	168	%a{1}{3}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	169	These make a DFA construction impossible because
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	170	of an exponential states explosion.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	171	They also want to support lexing rather than just matching
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	172	for tasks that involves text processing.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	173
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	174	Existing regex libraries either pose restrictions on the user input, or does
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	175	not give linear running time guarantee.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	176	%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	177	For example, the Rust regex engine claims to be linear,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	178	but does not support lookarounds and back-references.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	179	The GoLang regex library does not support over 1000 repetitions.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	180	Java and Python both support back-references, but shows
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	181	catastrophic backtracking behaviours on inputs without back-references(
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	182	when the language is still regular).
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	183	%TODO: test performance of Rust on (((((aa)b)b){20}))c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	184	%TODO: verify the fact Rust does not allow 1000+ reps
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	185	%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	186	Another thing about the these libraries is that there
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	187	is no correctness guarantee.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	188	In some cases they either fails to generate a lexing result when there is a match,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	189	or gives the wrong way of matching.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	190
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	191
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	192	This superlinear blowup in matching algorithms sometimes causes
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	193	considerable grief in real life: for example on 20 July 2016 one evil
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	194	regular expression brought the webpage
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	195	\href{http://stackexchange.com}{Stack Exchange} to its
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	196	%knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	197	In this instance, a regular expression intended to just trim white
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	198	spaces from the beginning and the end of a line actually consumed
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	199	massive amounts of CPU-resources---causing web servers to grind to a
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	200	halt. This happened when a post with 20,000 white spaces was submitted,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	201	but importantly the white spaces were neither at the beginning nor at
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	202	the end. As a result, the regular expression matching engine needed to
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	203	backtrack over many choices. In this example, the time needed to process
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	204	the string was $O(n^2)$ with respect to the string length. This
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	205	quadratic overhead was enough for the homepage of Stack Exchange to
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	206	respond so slowly that the load balancer assumed there must be some
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	207	attack and therefore stopped the servers from responding to any
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	208	requests. This made the whole site become unavailable. Another very
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	209	recent example is a global outage of all Cloudflare servers on 2 July
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	210	2019. A poorly written regular expression exhibited exponential
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	211	behaviour and exhausted CPUs that serve HTTP traffic. Although the
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	212	outage had several causes, at the heart was a regular expression that
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	213	was used to monitor network
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	214	%traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	215
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	216	Is it possible to have a regex lexing algorithm with proven correctness and
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	217	time complexity, which allows easy extensions to
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	218	constructs like
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	219	bounded repetitions, negation, lookarounds, and even back-references?
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	220
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	221	We propose Brzozowski's derivatives as a solution to this problem.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	222
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	223
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	224	\section{Why Brzozowski}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	225	In the last fifteen or so years, Brzozowski's derivatives of regular
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	226	expressions have sparked quite a bit of interest in the functional
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	227	programming and theorem prover communities.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	228	The beauty of
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	229	Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	230	expressible in any functional language, and easily definable and
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	231	reasoned about in theorem provers---the definitions just consist of
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	232	inductive datatypes and simple recursive functions.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	233
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	234	Suppose we have an alphabet $\Sigma$, the strings whose characters
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	235	are from $\Sigma$
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	236	can be expressed as $\Sigma^*$.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	237
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	238	We use patterns to define a set of strings concisely. Regular expressions
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	239	are one of such patterns systems:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	240	The basic regular expressions are defined inductively
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	241	by the following grammar:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	242	\[ r ::= \ZERO \mid \ONE
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	243	\mid c
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	244	\mid r_1 \cdot r_2
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	245	\mid r_1 + r_2
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	246	\mid r^*
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	247	\]
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	248
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	249	The language or set of strings defined by regular expressions are defined as
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	250	%TODO: FILL in the other defs
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	251	\begin{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	252	\begin{tabular}{lcl}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	253	$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	254	$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	255	\end{tabular}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	256	\end{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	257	Which are also called the "language interpretation".
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	258
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	259
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	260
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	261	The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	262	where the operation transforms the regex to a new one containing
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	263	strings without the head character $c$.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	264
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	265	Formally, we define first such a transformation on any string set, which
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	266	we call semantic derivative:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	267	\begin{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	268	$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	269	\end{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	270	Mathematically, it can be expressed as the
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	271
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	272	If the $\textit{StringSet}$ happen to have some structure, for example,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	273	if it is regular, then we have that it
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	274
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	275	The the derivative of regular expression, denoted as
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	276	$r \backslash c$, is a function that takes parameters
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	277	$r$ and $c$, and returns another regular expression $r'$,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	278	which is computed by the following recursive function:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	279
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	280	\begin{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	281	\begin{tabular}{lcl}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	282	$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	283	$\ONE \backslash c$ & $\dn$ & $\ZERO$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	284	$d \backslash c$ & $\dn$ &
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	285	$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	286	$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	287	$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	288	& & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	289	& & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	290	$(r^)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	291	\end{tabular}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	292	\end{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	293	\noindent
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	294	\noindent
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	295
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	296	The $\nullable$ function tests whether the empty string $""$
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	297	is in the language of $r$:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	298
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	299
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	300	\begin{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	301	\begin{tabular}{lcl}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	302	$\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	303	$\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	304	$\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	305	$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	306	$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	307	$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	308	\end{tabular}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	309	\end{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	310	\noindent
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	311	The empty set does not contain any string and
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	312	therefore not the empty string, the empty string
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	313	regular expression contains the empty string
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	314	by definition, the character regular expression
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	315	is the singleton that contains character only,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	316	and therefore does not contain the empty string,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	317	the alternative regular expression(or "or" expression)
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	318	might have one of its children regular expressions
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	319	being nullable and any one of its children being nullable
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	320	would suffice. The sequence regular expression
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	321	would require both children to have the empty string
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	322	to compose an empty string and the Kleene star
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	323	operation naturally introduced the empty string.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	324
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	325	We can give the meaning of regular expressions derivatives
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	326	by language interpretation:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	327
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	328
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	329
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	330
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	331
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	332	Derivatives give a simple solution
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	333	to the problem of matching a string $s$ with a regular
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	334	expression $r$: if the derivative of $r$ w.r.t.\ (in
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	335	succession) all the characters of the string matches the empty string,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	336	then $r$ matches $s$ (and {\em vice versa}).
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	337
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	338
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	339
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	340	However, there are two difficulties with derivative-based matchers:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	341	First, Brzozowski's original matcher only generates a yes/no answer
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	342	for whether a regular expression matches a string or not. This is too
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	343	little information in the context of lexing where separate tokens must
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	344	be identified and also classified (for example as keywords
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	345	or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	346	difficulty by cleverly extending Brzozowski's matching
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	347	algorithm. Their extended version generates additional information on
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	348	\emph{how} a regular expression matches a string following the POSIX
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	349	rules for regular expression matching. They achieve this by adding a
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	350	second ``phase'' to Brzozowski's algorithm involving an injection
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	351	function. In our own earlier work we provided the formal
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	352	specification of what POSIX matching means and proved in Isabelle/HOL
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	353	the correctness
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	354	of Sulzmann and Lu's extended algorithm accordingly
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	355	\cite{AusafDyckhoffUrban2016}.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	356
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	357	The second difficulty is that Brzozowski's derivatives can
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	358	grow to arbitrarily big sizes. For example if we start with the
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	359	regular expression $(a+aa)^*$ and take
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	360	successive derivatives according to the character $a$, we end up with
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	361	a sequence of ever-growing derivatives like
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	362
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	363	\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	364	\begin{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	365	\begin{tabular}{rll}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	366	$(a + aa)^$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	367	& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	368	& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	369	& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	370	& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	371	\end{tabular}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	372	\end{center}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	373
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	374	\noindent where after around 35 steps we run out of memory on a
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	375	typical computer (we shall define shortly the precise details of our
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	376	regular expressions and the derivative operation). Clearly, the
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	377	notation involving $\ZERO$s and $\ONE$s already suggests
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	378	simplification rules that can be applied to regular regular
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	379	expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	380	\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	381	r$. While such simple-minded simplifications have been proved in our
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	382	earlier work to preserve the correctness of Sulzmann and Lu's
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	383	algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	384	\emph{not} help with limiting the growth of the derivatives shown
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	385	above: the growth is slowed, but the derivatives can still grow rather
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	386	quickly beyond any finite bound.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	387
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	388
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	389	Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	390	\cite{Sulzmann2014} where they introduce bitcoded
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	391	regular expressions. In this version, POSIX values are
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	392	represented as bitsequences and such sequences are incrementally generated
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	393	when derivatives are calculated. The compact representation
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	394	of bitsequences and regular expressions allows them to define a more
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	395	``aggressive'' simplification method that keeps the size of the
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	396	derivatives finite no matter what the length of the string is.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	397	They make some informal claims about the correctness and linear behaviour
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	398	of this version, but do not provide any supporting proof arguments, not
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	399	even ``pencil-and-paper'' arguments. They write about their bitcoded
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	400	\emph{incremental parsing method} (that is the algorithm to be formalised
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	401	in this paper):
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	402
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	403
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	404	\begin{quote}\it
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	405	``Correctness Claim: We further claim that the incremental parsing
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	406	method [..] in combination with the simplification steps [..]
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	407	yields POSIX parse trees. We have tested this claim
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	408	extensively [..] but yet
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	409	have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	410	\end{quote}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	411
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	412
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	413
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	414
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	415	\section{Backgound}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	416	%Regular expression matching and lexing has been
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	417	% widely-used and well-implemented
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	418	%in software industry.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	419	%TODO: expand the above into a full paragraph
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	420	%TODO: look up snort rules to use here--give readers idea of what regexes look like
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	421
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	422
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	423	Theoretical results say that regular expression matching
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	424	should be linear with respect to the input.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	425	Under a certain class of regular expressions and inputs though,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	426	practical implementations suffer from non-linear or even
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	427	exponential running time,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	428	allowing a ReDoS (regular expression denial-of-service ) attack.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	429
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	430
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	431	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	432
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	433	\section{Existing Practical Approaches}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	434
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	435	The reason behind is that regex libraries in popular language engines
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	436	often want to support richer constructs
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	437	than the most basic regular expressions, and lexing rather than matching
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	438	is needed for sub-match extraction.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	439
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	440
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	441
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	442	\subsection{DFA Approach}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	443
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	444	Exponential states.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	445
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	446	\subsection{NFA Approach}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	447	Backtracking.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	448
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	449
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	450
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	451	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	452
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	453	\section{Our Approach}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	454	In the last fifteen or so years, Brzozowski's derivatives of regular
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	455	expressions have sparked quite a bit of interest in the functional
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	456	programming and theorem prover communities. The beauty of
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	457	Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	458	expressible in any functional language, and easily definable and
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	459	reasoned about in theorem provers---the definitions just consist of
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	460	inductive datatypes and simple recursive functions. Derivatives of a
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	461	regular expression, written $r \backslash c$, give a simple solution
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	462	to the problem of matching a string $s$ with a regular
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	463	expression $r$: if the derivative of $r$ w.r.t.\ (in
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	464	succession) all the characters of the string matches the empty string,
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	465	then $r$ matches $s$ (and {\em vice versa}).
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	466
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	467
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	468	This work aims to address the above vulnerability by the combination
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	469	of Brzozowski's derivatives and interactive theorem proving. We give an
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	470	improved version of Sulzmann and Lu's bit-coded algorithm using
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	471	derivatives, which come with a formal guarantee in terms of correctness and
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	472	running time as an Isabelle/HOL proof.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	473	Then we improve the algorithm with an even stronger version of
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	474	simplification, and prove a time bound linear to input and
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	475	cubic to regular expression size using a technique by
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	476	Antimirov.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	477
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	478	\subsection{Existing Work}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	479	We are aware
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	480	of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	481	Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	482	of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	483	in Coq is given by Coquand and Siles \parencite{Coquand2012}.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	484	Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	485
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	486	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	487
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	488	\section{What this Template Includes}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	489
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	490	\subsection{Folders}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	491
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	492	This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	493
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	494	\keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	495
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	496	\keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. Each chapter should go in its own separate \file{.tex} file and they can be split as:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	497	\begin{itemize}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	498	\item Chapter 1: Introduction to the thesis topic
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	499	\item Chapter 2: Background information and theory
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	500	\item Chapter 3: (Laboratory) experimental setup
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	501	\item Chapter 4: Details of experiment 1
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	502	\item Chapter 5: Details of experiment 2
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	503	\item Chapter 6: Discussion of the experimental results
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	504	\item Chapter 7: Conclusion and future directions
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	505	\end{itemize}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	506	This chapter layout is specialised for the experimental sciences, your discipline may be different.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	507
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	508	\keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	509
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	510	\subsection{Files}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	511
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	512	Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	513
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	514	\keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	515
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	516	\keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	517
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	518	\keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	519
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	520	\keyword{main.tex} -- this is an important file. This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis!
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	521
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	522	Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	523
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	524	\keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	525
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	526	\keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	527
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	528	\keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	529
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	530	\keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	531
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	532	\keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	533
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	534	\keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	535
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	536	\keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	537
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	538	So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	539
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	540	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	541
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	542	\section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	543
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	544	You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	545
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	546	Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	547
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	548	Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb\|\href{url}{name}\| and only leave the name.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	549
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	550	When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper!
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	551
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	552	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	553
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	554	\section{The \code{main.tex} File Explained}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	555
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	556	The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	557
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	558	Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	559
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	560	Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	561
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	562	Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	563
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	564	Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor).
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	565
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	566	The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	567
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	568	The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	569
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	570	The next page contains a one line dedication. Who will you dedicate your thesis to?
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	571
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	572	Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	573
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	574	After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	575
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	576	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	577
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	578	\section{Thesis Features and Conventions}\label{ThesisConventions}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	579
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	580	To get the best out of this template, there are a few conventions that you may want to follow.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	581
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	582	One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	583
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	584	\subsection{Printing Format}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	585
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	586	This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	587
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	588	The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	589
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	590	The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	591
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	592	\subsection{Using US Letter Paper}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	593
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	594	The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	595
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	596	Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	597
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	598	\subsection{References}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	599
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	600	The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. Many reference managers allow you to simply drag the reference into the document as you type.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	601
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	602	Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	603
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	604	The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations).
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	605
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	606	\subsubsection{A Note on bibtex}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	607
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	608	The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	609
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	610	\subsection{Tables}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	611
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	612	Tables are an important way of displaying your results, below is an example table which was generated with this code:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	613
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	614	{\small
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	615	\begin{verbatim}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	616	\begin{table}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	617	\caption{The effects of treatments X and Y on the four groups studied.}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	618	\label{tab:treatments}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	619	\centering
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	620	\begin{tabular}{l l l}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	621	\toprule
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	622	\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	623	\midrule
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	624	1 & 0.2 & 0.8\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	625	2 & 0.17 & 0.7\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	626	3 & 0.24 & 0.75\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	627	4 & 0.68 & 0.3\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	628	\bottomrule\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	629	\end{tabular}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	630	\end{table}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	631	\end{verbatim}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	632	}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	633
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	634	\begin{table}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	635	\caption{The effects of treatments X and Y on the four groups studied.}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	636	\label{tab:treatments}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	637	\centering
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	638	\begin{tabular}{l l l}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	639	\toprule
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	640	\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	641	\midrule
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	642	1 & 0.2 & 0.8\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	643	2 & 0.17 & 0.7\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	644	3 & 0.24 & 0.75\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	645	4 & 0.68 & 0.3\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	646	\bottomrule\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	647	\end{tabular}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	648	\end{table}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	649
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	650	You can reference tables with \verb\|\ref{<label>}\| where the label is defined within the table environment. See \file{Chapter1.tex} for an example of the label and citation (e.g. Table~\ref{tab:treatments}).
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	651
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	652	\subsection{Figures}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	653
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	654	There will hopefully be many figures in your thesis (that should be placed in the \emph{Figures} folder). The way to insert figures into your thesis is to use a code template like this:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	655	\begin{verbatim}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	656	\begin{figure}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	657	\centering
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	658	\includegraphics{Figures/Electron}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	659	\decoRule
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	660	\caption[An Electron]{An electron (artist's impression).}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	661	\label{fig:Electron}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	662	\end{figure}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	663	\end{verbatim}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	664	Also look in the source file. Putting this code into the source file produces the picture of the electron that you can see in the figure below.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	665
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	666	\begin{figure}[th]
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	667	\centering
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	668	\includegraphics{Figures/Electron}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	669	\decoRule
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	670	\caption[An Electron]{An electron (artist's impression).}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	671	\label{fig:Electron}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	672	\end{figure}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	673
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	674	Sometimes figures don't always appear where you write them in the source. The placement depends on how much space there is on the page for the figure. Sometimes there is not enough room to fit a figure directly where it should go (in relation to the text) and so \LaTeX{} puts it at the top of the next page. Positioning figures is the job of \LaTeX{} and so you should only worry about making them look good!
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	675
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	676	Figures usually should have captions just in case you need to refer to them (such as in Figure~\ref{fig:Electron}). The \verb\|\caption\| command contains two parts, the first part, inside the square brackets is the title that will appear in the \emph{List of Figures}, and so should be short. The second part in the curly brackets should contain the longer and more descriptive caption text.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	677
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	678	The \verb\|\decoRule\| command is optional and simply puts an aesthetic horizontal line below the image. If you do this for one image, do it for all of them.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	679
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	680	\LaTeX{} is capable of using images in pdf, jpg and png format.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	681
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	682	\subsection{Typesetting mathematics}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	683
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	684	If your thesis is going to contain heavy mathematical content, be sure that \LaTeX{} will make it look beautiful, even though it won't be able to solve the equations for you.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	685
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	686	The \enquote{Not So Short Introduction to \LaTeX} (available on \href{http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf}{CTAN}) should tell you everything you need to know for most cases of typesetting mathematics. If you need more information, a much more thorough mathematical guide is available from the AMS called, \enquote{A Short Math Guide to \LaTeX} and can be downloaded from:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	687	\url{ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	688
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	689	There are many different \LaTeX{} symbols to remember, luckily you can find the most common symbols in \href{http://ctan.org/pkg/comprehensive}{The Comprehensive \LaTeX~Symbol List}.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	690
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	691	You can write an equation, which is automatically given an equation number by \LaTeX{} like this:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	692	\begin{verbatim}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	693	\begin{equation}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	694	E = mc^{2}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	695	\label{eqn:Einstein}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	696	\end{equation}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	697	\end{verbatim}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	698
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	699	This will produce Einstein's famous energy-matter equivalence equation:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	700	\begin{equation}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	701	E = mc^{2}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	702	\label{eqn:Einstein}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	703	\end{equation}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	704
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	705	All equations you write (which are not in the middle of paragraph text) are automatically given equation numbers by \LaTeX{}. If you don't want a particular equation numbered, use the unnumbered form:
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	706	\begin{verbatim}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	707	\[ a^{2}=4 \]
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	708	\end{verbatim}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	709
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	710	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	711
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	712	\section{Sectioning and Subsectioning}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	713
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	714	You should break your thesis up into nice, bite-sized sections and subsections. \LaTeX{} automatically builds a table of Contents by looking at all the \verb\|\chapter{}\|, \verb\|\section{}\| and \verb\|\subsection{}\| commands you write in the source.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	715
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	716	The Table of Contents should only list the sections to three (3) levels. A \verb\|chapter{}\| is level zero (0). A \verb\|\section{}\| is level one (1) and so a \verb\|\subsection{}\| is level two (2). In your thesis it is likely that you will even use a \verb\|subsubsection{}\|, which is level three (3). The depth to which the Table of Contents is formatted is set within \file{MastersDoctoralThesis.cls}. If you need this changed, you can do it in \file{main.tex}.
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	717
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	718	%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	719
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	720	\section{In Closing}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	721
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	722	You have reached the end of this mini-guide. You can now rename or overwrite this pdf file and begin writing your own \file{Chapter1.tex} and the rest of your thesis. The easy work of setting up the structure and framework has been taken care of for you. It's now your job to fill it out!
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	723
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	724	Good luck and have lots of fun!
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	725
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	726	\begin{flushright}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	727	Guide written by ---\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	728	Sunil Patel: \href{http://www.sunilpatel.co.uk}{www.sunilpatel.co.uk}\\
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	729	Vel: \href{http://www.LaTeXTemplates.com}{LaTeXTemplates.com}
26a5e640cdd7 realPhdThesis Chengsong parents: diff changeset	730	\end{flushright}

author	Chengsong
	Sun, 20 Mar 2022 23:32:08 +0000
changeset 456	26a5e640cdd7
child 465	2e7c7111c0be
permissions	-rwxr-xr-x