|
1 % Chapter 1 |
|
2 |
|
3 \chapter{Introduction} % Main chapter title |
|
4 |
|
5 \label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1} |
|
6 |
|
7 %---------------------------------------------------------------------------------------- |
|
8 |
|
9 % Define some commands to keep the formatting separated from the content |
|
10 \newcommand{\keyword}[1]{\textbf{#1}} |
|
11 \newcommand{\tabhead}[1]{\textbf{#1}} |
|
12 \newcommand{\code}[1]{\texttt{#1}} |
|
13 \newcommand{\file}[1]{\texttt{\bfseries#1}} |
|
14 \newcommand{\option}[1]{\texttt{\itshape#1}} |
|
15 |
|
16 |
|
17 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}% |
|
18 \newcommand{\ZERO}{\mbox{\bf 0}} |
|
19 \newcommand{\ONE}{\mbox{\bf 1}} |
|
20 \def\lexer{\mathit{lexer}} |
|
21 \def\mkeps{\mathit{mkeps}} |
|
22 |
|
23 \def\DFA{\textit{DFA}} |
|
24 \def\bmkeps{\textit{bmkeps}} |
|
25 \def\retrieve{\textit{retrieve}} |
|
26 \def\blexer{\textit{blexer}} |
|
27 \def\flex{\textit{flex}} |
|
28 \def\inj{\mathit{inj}} |
|
29 \def\Empty{\mathit{Empty}} |
|
30 \def\Left{\mathit{Left}} |
|
31 \def\Right{\mathit{Right}} |
|
32 \def\Stars{\mathit{Stars}} |
|
33 \def\Char{\mathit{Char}} |
|
34 \def\Seq{\mathit{Seq}} |
|
35 \def\Der{\mathit{Der}} |
|
36 \def\nullable{\mathit{nullable}} |
|
37 \def\Z{\mathit{Z}} |
|
38 \def\S{\mathit{S}} |
|
39 \def\rup{r^\uparrow} |
|
40 \def\simp{\mathit{simp}} |
|
41 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}} |
|
42 \def\map{\mathit{map}} |
|
43 \def\distinct{\mathit{distinct}} |
|
44 \def\blexersimp{\mathit{blexer}\_\mathit{simp}} |
|
45 %---------------------------------------------------------------------------------------- |
|
46 |
|
47 |
|
48 |
|
49 |
|
Regular expression matching and lexing are widely used and well supported
in the software industry.
The regex engine of any popular programming language
lets you supply your own regular expressions and strings,
and returns matching and lexing information, such as which
sub-part of the regex matches which sub-part of the string.
These lexing libraries are on average quite fast.
For example, the regex engines used by some network intrusion detection
systems are able to process
megabytes or even gigabytes of network traffic per second.
|
62 |
|
Why, then, do we need our own matching and lexing algorithm
if the existing ones work well on average?
|
65 |
|
66 |
|
67 Take $(a^*)^*\,b$ and ask whether |
|
68 strings of the form $aa..a$ match this regular |
|
69 expression. Obviously this is not the case---the expected $b$ in the last |
|
70 position is missing. One would expect that modern regular expression |
|
71 matching engines can find this out very quickly. Alas, if one tries |
|
this example in JavaScript, Python or Java 8 with strings of just 28
$a$'s, one discovers that this decision takes around 30 seconds and
considerably longer when a few more $a$'s are added, as the graphs
below show:
|
76 |
|
77 \begin{center} |
|
78 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
79 \begin{tikzpicture} |
|
80 \begin{axis}[ |
|
81 xlabel={$n$}, |
|
82 x label style={at={(1.05,-0.05)}}, |
|
83 ylabel={time in secs}, |
|
84 enlargelimits=false, |
|
85 xtick={0,5,...,30}, |
|
86 xmax=33, |
|
87 ymax=35, |
|
88 ytick={0,5,...,30}, |
|
89 scaled ticks=false, |
|
90 axis lines=left, |
|
91 width=5cm, |
|
92 height=4cm, |
|
93 legend entries={JavaScript}, |
|
94 legend pos=north west, |
|
95 legend cell align=left] |
|
96 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
|
97 \end{axis} |
|
98 \end{tikzpicture} |
|
99 & |
|
100 \begin{tikzpicture} |
|
101 \begin{axis}[ |
|
102 xlabel={$n$}, |
|
103 x label style={at={(1.05,-0.05)}}, |
|
104 %ylabel={time in secs}, |
|
105 enlargelimits=false, |
|
106 xtick={0,5,...,30}, |
|
107 xmax=33, |
|
108 ymax=35, |
|
109 ytick={0,5,...,30}, |
|
110 scaled ticks=false, |
|
111 axis lines=left, |
|
112 width=5cm, |
|
113 height=4cm, |
|
114 legend entries={Python}, |
|
115 legend pos=north west, |
|
116 legend cell align=left] |
|
117 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
118 \end{axis} |
|
119 \end{tikzpicture} |
|
120 & |
|
121 \begin{tikzpicture} |
|
122 \begin{axis}[ |
|
123 xlabel={$n$}, |
|
124 x label style={at={(1.05,-0.05)}}, |
|
125 %ylabel={time in secs}, |
|
126 enlargelimits=false, |
|
127 xtick={0,5,...,30}, |
|
128 xmax=33, |
|
129 ymax=35, |
|
130 ytick={0,5,...,30}, |
|
131 scaled ticks=false, |
|
132 axis lines=left, |
|
133 width=5cm, |
|
134 height=4cm, |
|
135 legend entries={Java 8}, |
|
136 legend pos=north west, |
|
137 legend cell align=left] |
|
138 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
139 \end{axis} |
|
140 \end{tikzpicture}\\ |
|
141 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings |
|
142 of the form $\underbrace{aa..a}_{n}$.} |
|
143 \end{tabular} |
|
144 \end{center} |
|
145 |
|
146 |
|
This is clearly exponential behaviour, and it is triggered by some
relatively simple regex patterns.

Such behaviour opens up the possibility of
a ReDoS (regular expression denial-of-service) attack.
|
154 |
|
Theoretical results say that regular expression matching
should be linear with respect to the length of the input
(for a fixed regular expression). One could construct
an NFA via the Thompson construction and simulate it on the input;
this takes time linear in the input.
Alternatively, one could determinise the NFA into a DFA and minimise that
DFA. Once the DFA is built, the running time is also linear, requiring just
one scanning pass over the input.
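To illustrate the latter point: a DFA can be represented as a transition
function together with a set of accepting states, and matching is then a
single left-to-right pass over the input. The following Scala sketch is
purely illustrative; the two-state automaton at the end is a made-up
example, not one derived from any particular regular expression.

\begin{verbatim}
// Illustrative sketch: matching with a DFA is one pass over the input.
// States are Ints, delta is the transition function, fins the accepting states.
def dfaMatch(start: Int, delta: (Int, Char) => Int,
             fins: Set[Int], s: String): Boolean =
  fins.contains(s.foldLeft(start)(delta))

// A made-up two-state DFA accepting exactly the strings that end in 'b'.
val delta: (Int, Char) => Int = (q, c) => if (c == 'b') 1 else 0
val ok = dfaMatch(0, delta, Set(1), "aab")   // true, after a single scan
\end{verbatim}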
|
162 |
|
But modern regex libraries in popular language engines
often want to support constructs richer
than the most basic regular expressions, such as bounded repetitions
and back-references.
%put in illustrations
%a{1}{3}
These make a direct DFA construction impractical or outright impossible:
bounded repetitions blow up the number of states exponentially, and
back-references take the accepted language beyond the regular languages
altogether.
Such libraries also want to support lexing rather than just matching
for tasks that involve text processing.
|
173 |
|
Existing regex libraries either pose restrictions on the user input or do
not give a linear running-time guarantee.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
For example, the Rust regex engine claims to be linear,
but does not support lookarounds and back-references.
The Go regex library does not support more than 1000 repetitions.
Java and Python both support back-references, but show
catastrophic backtracking behaviour even on inputs without back-references
(that is, when the language described is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
Another problem with these libraries is that they come with
no correctness guarantee:
in some cases they fail to produce a lexing result even though there is a
match, or they match in a way different from what one would expect.
|
190 |
|
191 |
|
192 This superlinear blowup in matching algorithms sometimes causes |
|
193 considerable grief in real life: for example on 20 July 2016 one evil |
|
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
|
197 In this instance, a regular expression intended to just trim white |
|
198 spaces from the beginning and the end of a line actually consumed |
|
199 massive amounts of CPU-resources---causing web servers to grind to a |
|
200 halt. This happened when a post with 20,000 white spaces was submitted, |
|
201 but importantly the white spaces were neither at the beginning nor at |
|
202 the end. As a result, the regular expression matching engine needed to |
|
203 backtrack over many choices. In this example, the time needed to process |
|
204 the string was $O(n^2)$ with respect to the string length. This |
|
205 quadratic overhead was enough for the homepage of Stack Exchange to |
|
206 respond so slowly that the load balancer assumed there must be some |
|
207 attack and therefore stopped the servers from responding to any |
|
requests. This made the whole site unavailable. Another
recent example is the global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the
outage had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
|
215 |
|
216 Is it possible to have a regex lexing algorithm with proven correctness and |
|
217 time complexity, which allows easy extensions to |
|
218 constructs like |
|
219 bounded repetitions, negation, lookarounds, and even back-references? |
|
220 |
|
221 We propose Brzozowski's derivatives as a solution to this problem. |
|
222 |
|
223 |
|
\section{Why Brzozowski's Derivatives}
|
225 In the last fifteen or so years, Brzozowski's derivatives of regular |
|
226 expressions have sparked quite a bit of interest in the functional |
|
227 programming and theorem prover communities. |
|
228 The beauty of |
|
229 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly |
|
230 expressible in any functional language, and easily definable and |
|
231 reasoned about in theorem provers---the definitions just consist of |
|
232 inductive datatypes and simple recursive functions. |
|
233 |
|
Suppose we have an alphabet $\Sigma$; the set of all strings whose characters
are drawn from $\Sigma$ is written $\Sigma^*$.

We use patterns to define sets of strings concisely, and regular expressions
are one such pattern system.
The basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
         \mid c
         \mid r_1 \cdot r_2
         \mid r_1 + r_2
         \mid r^*
\]

The language, or set of strings, denoted by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\{\}$\\
$L \; \ONE$ & $\dn$ & $\{ [] \}$\\
$L \; c$ & $\dn$ & $\{ [c] \}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $L \; r_1 \cup L \; r_2$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $\{ s_1 @ s_2 \mid s_1 \in L \; r_1,\, s_2 \in L \; r_2 \}$\\
$L \; (r^*)$ & $\dn$ & $\{ s_1 @ \ldots @ s_n \mid n \geq 0,\, s_i \in L \; r \}$\\
\end{tabular}
\end{center}
where $[]$ is the empty string, $[c]$ the single-character string and $@$
string concatenation.
This function $L$ is also called the ``language interpretation'' of regular
expressions.
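To make the grammar above concrete, it can be written down almost verbatim
as a datatype in a functional language. The following is a minimal sketch in
Scala (the choice of language and the constructor names are ours; they simply
mirror the six cases of the grammar):

\begin{verbatim}
// Sketch: the basic regular expressions as a Scala datatype.
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // zero or more copies of r
\end{verbatim}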
|
258 |
|
259 |
|
260 |
|
The Brzozowski derivative of a regular expression with respect to a character
$c$ transforms the regular expression into a new one whose language consists
of exactly those strings of the original language that start with $c$, but
with this leading $c$ chopped off.

Formally, we first define such a transformation on an arbitrary set of
strings, which we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in \textit{StringSet}\}$
\end{center}
If the $\textit{StringSet}$ happens to have some structure, for example
if it is regular (that is, definable by a regular expression), then its
semantic derivative is also regular and can be computed syntactically on the
regular expression itself.

The derivative of a regular expression $r$ with respect to a character $c$,
written $r \backslash c$, is such a syntactic operation: it takes $r$ and $c$
and returns another regular expression, computed by the following recursive
function:
|
279 |
|
280 \begin{center} |
|
281 \begin{tabular}{lcl} |
|
282 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
283 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
284 $d \backslash c$ & $\dn$ & |
|
285 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
286 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \; \nullable(r_1)$\\
|
288 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
289 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
290 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
291 \end{tabular} |
|
292 \end{center} |
|
\noindent
|
295 |
|
The $\nullable$ function tests whether the empty string $[]$
is in the language of $r$:
|
298 |
|
299 |
|
300 \begin{center} |
|
301 \begin{tabular}{lcl} |
|
302 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
303 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
304 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
305 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
306 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
307 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
308 \end{tabular} |
|
309 \end{center} |
|
\noindent
The empty set does not contain any string, so in particular it does not
contain the empty string; the regular expression $\ONE$ contains the empty
string by definition; a character regular expression denotes a singleton set
containing only a one-character string, and therefore does not contain the
empty string; an alternative (or ``or'') regular expression is nullable as
soon as one of its children is nullable; a sequence regular expression
requires both children to be nullable, since the empty string can only be
split into two empty strings; and the Kleene star is always nullable, because
it matches the empty string by taking zero iterations.
|
The derivative operation is related to the semantic derivative through the
language interpretation:
\begin{center}
$L \; (r\backslash c) = \Der \; c \; (L \; r)$
\end{center}

|
332 Derivatives give a simple solution |
|
333 to the problem of matching a string $s$ with a regular |
|
334 expression $r$: if the derivative of $r$ w.r.t.\ (in |
|
335 succession) all the characters of the string matches the empty string, |
|
336 then $r$ matches $s$ (and {\em vice versa}). |
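In code, this matching procedure is just a fold of the derivative over the
characters of the string, followed by a nullability test. The sketch below
again builds on the previous Scala snippets.

\begin{verbatim}
// Sketch: the derivative-based matcher.
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))

// The evil regular expression (a*)* b from the introduction:
val evil = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
val res  = matches(evil, "aaa")   // false: the required b at the end is missing
\end{verbatim}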
|
337 |
|
338 |
|
339 |
|
340 However, there are two difficulties with derivative-based matchers: |
|
341 First, Brzozowski's original matcher only generates a yes/no answer |
|
342 for whether a regular expression matches a string or not. This is too |
|
343 little information in the context of lexing where separate tokens must |
|
344 be identified and also classified (for example as keywords |
|
345 or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this |
|
346 difficulty by cleverly extending Brzozowski's matching |
|
347 algorithm. Their extended version generates additional information on |
|
348 \emph{how} a regular expression matches a string following the POSIX |
|
349 rules for regular expression matching. They achieve this by adding a |
|
350 second ``phase'' to Brzozowski's algorithm involving an injection |
|
351 function. In our own earlier work we provided the formal |
|
352 specification of what POSIX matching means and proved in Isabelle/HOL |
|
353 the correctness |
|
354 of Sulzmann and Lu's extended algorithm accordingly |
|
355 \cite{AusafDyckhoffUrban2016}. |
|
356 |
|
357 The second difficulty is that Brzozowski's derivatives can |
|
358 grow to arbitrarily big sizes. For example if we start with the |
|
359 regular expression $(a+aa)^*$ and take |
|
360 successive derivatives according to the character $a$, we end up with |
|
361 a sequence of ever-growing derivatives like |
|
362 |
|
363 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} |
|
364 \begin{center} |
|
365 \begin{tabular}{rll} |
|
366 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
367 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
368 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ |
|
369 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
370 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
|
371 \end{tabular} |
|
372 \end{center} |
|
373 |
|
374 \noindent where after around 35 steps we run out of memory on a |
|
375 typical computer (we shall define shortly the precise details of our |
|
376 regular expressions and the derivative operation). Clearly, the |
|
377 notation involving $\ZERO$s and $\ONE$s already suggests |
|
simplification rules that can be applied to regular
|
379 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
|
380 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
|
381 r$. While such simple-minded simplifications have been proved in our |
|
382 earlier work to preserve the correctness of Sulzmann and Lu's |
|
383 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
|
384 \emph{not} help with limiting the growth of the derivatives shown |
|
385 above: the growth is slowed, but the derivatives can still grow rather |
|
386 quickly beyond any finite bound. |
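These simple-minded rewrites (together with their obvious symmetric
variants) can be turned into a small recursive function that is interleaved
with the derivative steps. The following Scala sketch shows one way of doing
this, again building on the earlier snippets.

\begin{verbatim}
// Sketch: simple-minded simplification, applied bottom-up after each derivative.
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO                  // 0 . r  =>  0
    case (_, ZERO)  => ZERO
    case (ONE, s2)  => s2                    // 1 . r  =>  r
    case (s1, ONE)  => s1
    case (s1, s2)   => SEQ(s1, s2)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                    // 0 + r  =>  r
    case (s1, ZERO) => s1
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)   // r + r  =>  r
  }
  case _ => r
}

// Interleave simplification with the derivative steps.
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}
\end{verbatim}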
|
387 |
|
388 |
|
389 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm |
|
390 \cite{Sulzmann2014} where they introduce bitcoded |
|
391 regular expressions. In this version, POSIX values are |
|
392 represented as bitsequences and such sequences are incrementally generated |
|
393 when derivatives are calculated. The compact representation |
|
394 of bitsequences and regular expressions allows them to define a more |
|
395 ``aggressive'' simplification method that keeps the size of the |
|
396 derivatives finite no matter what the length of the string is. |
|
397 They make some informal claims about the correctness and linear behaviour |
|
398 of this version, but do not provide any supporting proof arguments, not |
|
399 even ``pencil-and-paper'' arguments. They write about their bitcoded |
|
400 \emph{incremental parsing method} (that is the algorithm to be formalised |
|
in this thesis):
|
402 |
|
403 |
|
404 \begin{quote}\it |
|
405 ``Correctness Claim: We further claim that the incremental parsing |
|
406 method [..] in combination with the simplification steps [..] |
|
407 yields POSIX parse trees. We have tested this claim |
|
408 extensively [..] but yet |
|
409 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} |
|
410 \end{quote} |
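Although the details of this bitcoded version only appear later, the
underlying data structure can already be sketched: every node of the regular
expression carries a sequence of bits (built from two bit constructors,
written $\Z$ and $\S$) that records how the POSIX value is to be
reconstructed. The Scala sketch below shows only the shape of such annotated
regular expressions; the constructor names are ours and all operations on
them are left out.

\begin{verbatim}
// Sketch only: annotated (bitcoded) regular expressions. Every constructor
// except AZERO carries a bitsequence used to rebuild the POSIX value.
abstract class Bit
case object Z extends Bit
case object S extends Bit

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALT(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
\end{verbatim}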
|
411 |
|
412 |
|
413 |
|
414 |
|
\section{Background}
|
416 %Regular expression matching and lexing has been |
|
417 % widely-used and well-implemented |
|
418 %in software industry. |
|
419 %TODO: expand the above into a full paragraph |
|
420 %TODO: look up snort rules to use here--give readers idea of what regexes look like |
|
421 |
|
422 |
|
Theoretical results say that regular expression matching
should be linear with respect to the length of the input.
For certain classes of regular expressions and inputs, though,
practical implementations suffer from non-linear or even
exponential running time,
opening the door to ReDoS (regular expression denial-of-service) attacks.
|
429 |
|
430 |
|
431 %---------------------------------------------------------------------------------------- |
|
432 |
|
433 \section{Existing Practical Approaches} |
|
434 |
|
The reason is that regex libraries in popular language engines
often want to support constructs richer
than the most basic regular expressions, and that lexing rather than plain
matching is needed for sub-match extraction.
|
439 |
|
440 |
|
441 |
|
442 \subsection{DFA Approach} |
|
443 |
|
Translating extended constructs such as bounded repetitions into a DFA can
lead to an exponential blow-up in the number of states.
|
445 |
|
446 \subsection{NFA Approach} |
|
Simulating an NFA by backtracking, as many practical engines do, can lead to
exponential running time, as the $(a^*)^*\,b$ example above shows.
|
448 |
|
449 |
|
450 |
|
451 %---------------------------------------------------------------------------------------- |
|
452 |
|
453 \section{Our Approach} |
|
As outlined above, Brzozowski's derivatives are neatly expressible in any
functional language, are easy to define and reason about in theorem provers,
and give a simple derivative-based solution to the matching problem.
|
466 |
|
467 |
|
This work aims to address the above vulnerability by combining
Brzozowski's derivatives with interactive theorem proving. We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee of correctness and of its
running time, established as an Isabelle/HOL proof.
We then strengthen the algorithm with an even more aggressive version of
simplification, and prove a time bound that is linear in the length of the
input and cubic in the size of the regular expression, using a technique by
Antimirov.
|
477 |
|
478 \subsection{Existing Work} |
|
479 We are aware |
|
480 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by |
|
481 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part |
|
482 of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one |
|
483 in Coq is given by Coquand and Siles \parencite{Coquand2012}. |
|
484 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. |
|
485 |
|
486 %---------------------------------------------------------------------------------------- |
|
487 |
|
488 \section{What this Template Includes} |
|
489 |
|
490 \subsection{Folders} |
|
491 |
|
492 This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory: |
|
493 |
|
494 \keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory. |
|
495 |
|
496 \keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. Each chapter should go in its own separate \file{.tex} file and they can be split as: |
|
497 \begin{itemize} |
|
498 \item Chapter 1: Introduction to the thesis topic |
|
499 \item Chapter 2: Background information and theory |
|
500 \item Chapter 3: (Laboratory) experimental setup |
|
501 \item Chapter 4: Details of experiment 1 |
|
502 \item Chapter 5: Details of experiment 2 |
|
503 \item Chapter 6: Discussion of the experimental results |
|
504 \item Chapter 7: Conclusion and future directions |
|
505 \end{itemize} |
|
506 This chapter layout is specialised for the experimental sciences, your discipline may be different. |
|
507 |
|
508 \keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document. |
|
509 |
|
510 \subsection{Files} |
|
511 |
|
512 Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about: |
|
513 |
|
514 \keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do. |
|
515 |
|
516 \keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis. |
|
517 |
|
518 \keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version. |
|
519 |
|
520 \keyword{main.tex} -- this is an important file. This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis! |
|
521 |
|
522 Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include: |
|
523 |
|
524 \keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. |
|
525 |
|
526 \keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis. |
|
527 |
|
528 \keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file. |
|
529 |
|
530 \keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section. |
|
531 |
|
532 \keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file. |
|
533 |
|
534 \keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section. |
|
535 |
|
536 \keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. |
|
537 |
|
538 So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them. |
|
539 |
|
540 %---------------------------------------------------------------------------------------- |
|
541 |
|
542 \section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile} |
|
543 |
|
544 You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment. |
|
545 |
|
546 Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots |
|
547 |
|
548 Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb|\href{url}{name}| and only leave the name. |
|
549 |
|
550 When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper! |
|
551 |
|
552 %---------------------------------------------------------------------------------------- |
|
553 |
|
554 \section{The \code{main.tex} File Explained} |
|
555 |
|
556 The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it. |
|
557 |
|
558 Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block. |
|
559 |
|
560 Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from. |
|
561 |
|
562 Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space. |
|
563 |
|
564 Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor). |
|
565 |
|
566 The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols. |
|
567 |
|
568 The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes. |
|
569 |
|
570 The next page contains a one line dedication. Who will you dedicate your thesis to? |
|
571 |
|
572 Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder. |
|
573 |
|
574 After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place. |
|
575 |
|
576 %---------------------------------------------------------------------------------------- |
|
577 |
|
578 \section{Thesis Features and Conventions}\label{ThesisConventions} |
|
579 |
|
580 To get the best out of this template, there are a few conventions that you may want to follow. |
|
581 |
|
582 One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method. |
|
583 |
|
584 \subsection{Printing Format} |
|
585 |
|
586 This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution. |
|
587 |
|
588 The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side. |
|
589 |
|
590 The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}. |
|
591 |
|
592 \subsection{Using US Letter Paper} |
|
593 |
|
594 The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}. |
|
595 |
|
596 Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins. |
|
597 |
|
598 \subsection{References} |
|
599 |
|
600 The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. Many reference managers allow you to simply drag the reference into the document as you type. |
|
601 |
|
602 Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.} |
|
603 |
|
604 The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations). |
|
605 |
|
606 \subsubsection{A Note on bibtex} |
|
607 |
|
608 The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors. |
|
609 |
|
610 \subsection{Tables} |
|
611 |
|
612 Tables are an important way of displaying your results, below is an example table which was generated with this code: |
|
613 |
|
614 {\small |
|
615 \begin{verbatim} |
|
616 \begin{table} |
|
617 \caption{The effects of treatments X and Y on the four groups studied.} |
|
618 \label{tab:treatments} |
|
619 \centering |
|
620 \begin{tabular}{l l l} |
|
621 \toprule |
|
622 \tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\ |
|
623 \midrule |
|
624 1 & 0.2 & 0.8\\ |
|
625 2 & 0.17 & 0.7\\ |
|
626 3 & 0.24 & 0.75\\ |
|
627 4 & 0.68 & 0.3\\ |
|
628 \bottomrule\\ |
|
629 \end{tabular} |
|
630 \end{table} |
|
631 \end{verbatim} |
|
632 } |
|
633 |
|
634 \begin{table} |
|
635 \caption{The effects of treatments X and Y on the four groups studied.} |
|
636 \label{tab:treatments} |
|
637 \centering |
|
638 \begin{tabular}{l l l} |
|
639 \toprule |
|
640 \tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\ |
|
641 \midrule |
|
642 1 & 0.2 & 0.8\\ |
|
643 2 & 0.17 & 0.7\\ |
|
644 3 & 0.24 & 0.75\\ |
|
645 4 & 0.68 & 0.3\\ |
|
646 \bottomrule\\ |
|
647 \end{tabular} |
|
648 \end{table} |
|
649 |
|
650 You can reference tables with \verb|\ref{<label>}| where the label is defined within the table environment. See \file{Chapter1.tex} for an example of the label and citation (e.g. Table~\ref{tab:treatments}). |
|
651 |
|
652 \subsection{Figures} |
|
653 |
|
654 There will hopefully be many figures in your thesis (that should be placed in the \emph{Figures} folder). The way to insert figures into your thesis is to use a code template like this: |
|
655 \begin{verbatim} |
|
656 \begin{figure} |
|
657 \centering |
|
658 \includegraphics{Figures/Electron} |
|
659 \decoRule |
|
660 \caption[An Electron]{An electron (artist's impression).} |
|
661 \label{fig:Electron} |
|
662 \end{figure} |
|
663 \end{verbatim} |
|
664 Also look in the source file. Putting this code into the source file produces the picture of the electron that you can see in the figure below. |
|
665 |
|
666 \begin{figure}[th] |
|
667 \centering |
|
668 \includegraphics{Figures/Electron} |
|
669 \decoRule |
|
670 \caption[An Electron]{An electron (artist's impression).} |
|
671 \label{fig:Electron} |
|
672 \end{figure} |
|
673 |
|
674 Sometimes figures don't always appear where you write them in the source. The placement depends on how much space there is on the page for the figure. Sometimes there is not enough room to fit a figure directly where it should go (in relation to the text) and so \LaTeX{} puts it at the top of the next page. Positioning figures is the job of \LaTeX{} and so you should only worry about making them look good! |
|
675 |
|
676 Figures usually should have captions just in case you need to refer to them (such as in Figure~\ref{fig:Electron}). The \verb|\caption| command contains two parts, the first part, inside the square brackets is the title that will appear in the \emph{List of Figures}, and so should be short. The second part in the curly brackets should contain the longer and more descriptive caption text. |
|
677 |
|
678 The \verb|\decoRule| command is optional and simply puts an aesthetic horizontal line below the image. If you do this for one image, do it for all of them. |
|
679 |
|
680 \LaTeX{} is capable of using images in pdf, jpg and png format. |
|
681 |
|
682 \subsection{Typesetting mathematics} |
|
683 |
|
684 If your thesis is going to contain heavy mathematical content, be sure that \LaTeX{} will make it look beautiful, even though it won't be able to solve the equations for you. |
|
685 |
|
686 The \enquote{Not So Short Introduction to \LaTeX} (available on \href{http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf}{CTAN}) should tell you everything you need to know for most cases of typesetting mathematics. If you need more information, a much more thorough mathematical guide is available from the AMS called, \enquote{A Short Math Guide to \LaTeX} and can be downloaded from: |
|
687 \url{ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf} |
|
688 |
|
689 There are many different \LaTeX{} symbols to remember, luckily you can find the most common symbols in \href{http://ctan.org/pkg/comprehensive}{The Comprehensive \LaTeX~Symbol List}. |
|
690 |
|
691 You can write an equation, which is automatically given an equation number by \LaTeX{} like this: |
|
692 \begin{verbatim} |
|
693 \begin{equation} |
|
694 E = mc^{2} |
|
695 \label{eqn:Einstein} |
|
696 \end{equation} |
|
697 \end{verbatim} |
|
698 |
|
699 This will produce Einstein's famous energy-matter equivalence equation: |
|
700 \begin{equation} |
|
701 E = mc^{2} |
|
702 \label{eqn:Einstein} |
|
703 \end{equation} |
|
704 |
|
705 All equations you write (which are not in the middle of paragraph text) are automatically given equation numbers by \LaTeX{}. If you don't want a particular equation numbered, use the unnumbered form: |
|
706 \begin{verbatim} |
|
707 \[ a^{2}=4 \] |
|
708 \end{verbatim} |
|
709 |
|
710 %---------------------------------------------------------------------------------------- |
|
711 |
|
712 \section{Sectioning and Subsectioning} |
|
713 |
|
714 You should break your thesis up into nice, bite-sized sections and subsections. \LaTeX{} automatically builds a table of Contents by looking at all the \verb|\chapter{}|, \verb|\section{}| and \verb|\subsection{}| commands you write in the source. |
|
715 |
|
716 The Table of Contents should only list the sections to three (3) levels. A \verb|chapter{}| is level zero (0). A \verb|\section{}| is level one (1) and so a \verb|\subsection{}| is level two (2). In your thesis it is likely that you will even use a \verb|subsubsection{}|, which is level three (3). The depth to which the Table of Contents is formatted is set within \file{MastersDoctoralThesis.cls}. If you need this changed, you can do it in \file{main.tex}. |
|
717 |
|
718 %---------------------------------------------------------------------------------------- |
|
719 |
|
720 \section{In Closing} |
|
721 |
|
722 You have reached the end of this mini-guide. You can now rename or overwrite this pdf file and begin writing your own \file{Chapter1.tex} and the rest of your thesis. The easy work of setting up the structure and framework has been taken care of for you. It's now your job to fill it out! |
|
723 |
|
724 Good luck and have lots of fun! |
|
725 |
|
726 \begin{flushright} |
|
727 Guide written by ---\\ |
|
728 Sunil Patel: \href{http://www.sunilpatel.co.uk}{www.sunilpatel.co.uk}\\ |
|
729 Vel: \href{http://www.LaTeXTemplates.com}{LaTeXTemplates.com} |
|
730 \end{flushright} |