PhdThesisRealOne/LaTeXTemplates_masters-doctoral-thesis_v2/Chapters/Chapter1.tex
changeset 456 26a5e640cdd7
child 465 2e7c7111c0be
equal deleted inserted replaced
453:d68b9db4c9ec 456:26a5e640cdd7
       
     1 % Chapter 1
       
     2 
       
     3 \chapter{Introduction} % Main chapter title
       
     4 
       
     5 \label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1} 
       
     6 
       
     7 %----------------------------------------------------------------------------------------
       
     8 
       
     9 % Define some commands to keep the formatting separated from the content 
       
    10 \newcommand{\keyword}[1]{\textbf{#1}}
       
    11 \newcommand{\tabhead}[1]{\textbf{#1}}
       
    12 \newcommand{\code}[1]{\texttt{#1}}
       
    13 \newcommand{\file}[1]{\texttt{\bfseries#1}}
       
    14 \newcommand{\option}[1]{\texttt{\itshape#1}}
       
    15 
       
    16 
       
    17 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
    18 \newcommand{\ZERO}{\mbox{\bf 0}}
       
    19 \newcommand{\ONE}{\mbox{\bf 1}}
       
    20 \def\lexer{\mathit{lexer}}
       
    21 \def\mkeps{\mathit{mkeps}}
       
    22 
       
    23 \def\DFA{\textit{DFA}}
       
    24 \def\bmkeps{\textit{bmkeps}}
       
    25 \def\retrieve{\textit{retrieve}}
       
    26 \def\blexer{\textit{blexer}}
       
    27 \def\flex{\textit{flex}}
       
    28 \def\inj{\mathit{inj}}
       
    29 \def\Empty{\mathit{Empty}}
       
    30 \def\Left{\mathit{Left}}
       
    31 \def\Right{\mathit{Right}}
       
    32 \def\Stars{\mathit{Stars}}
       
    33 \def\Char{\mathit{Char}}
       
    34 \def\Seq{\mathit{Seq}}
       
    35 \def\Der{\mathit{Der}}
       
    36 \def\nullable{\mathit{nullable}}
       
    37 \def\Z{\mathit{Z}}
       
    38 \def\S{\mathit{S}}
       
    39 \def\rup{r^\uparrow}
       
    40 \def\simp{\mathit{simp}}
       
    41 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
       
    42 \def\map{\mathit{map}}
       
    43 \def\distinct{\mathit{distinct}}
       
    44 \def\blexersimp{\mathit{blexer}\_\mathit{simp}}
       
    45 %----------------------------------------------------------------------------------------
       
    46 
       
    47 
       
    48 
       
    49 
       
    50 Regular expression matching and lexing has been 
       
    51  widely-used and well-implemented
       
    52 in software industry. 
       
    53 If you go to a popular programming language's
       
    54 regex engine,
       
    55 you can supply it with regex and strings of your own,
       
    56 and get matching/lexing  information such as how a 
       
    57 sub-part of the regex matches a sub-part of the string.
       
    58 These lexing libraries are on average quite fast.
       
    59 For example, the regex engines some network intrusion detection
       
    60 systems use are able to process
       
    61  megabytes or even gigabytes of network traffic per second.
       
    62  
       
    63 Why do we need to have our version, if the algorithms work well on 
       
    64 average?
       
    65 
       
    66 
       
    67 Take $(a^*)^*\,b$ and ask whether
       
    68 strings of the form $aa..a$ match this regular
       
    69 expression. Obviously this is not the case---the expected $b$ in the last
       
    70 position is missing. One would expect that modern regular expression
       
    71 matching engines can find this out very quickly. Alas, if one tries
       
    72 this example in JavaScript, Python or Java 8 with strings like 28
       
    73 $a$'s, one discovers that this decision takes around 30 seconds and
       
    74 takes considerably longer when adding a few more $a$'s, as the graphs
       
    75 below show:
       
    76 
       
    77 \begin{center}
       
    78 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
    79 \begin{tikzpicture}
       
    80 \begin{axis}[
       
    81     xlabel={$n$},
       
    82     x label style={at={(1.05,-0.05)}},
       
    83     ylabel={time in secs},
       
    84     enlargelimits=false,
       
    85     xtick={0,5,...,30},
       
    86     xmax=33,
       
    87     ymax=35,
       
    88     ytick={0,5,...,30},
       
    89     scaled ticks=false,
       
    90     axis lines=left,
       
    91     width=5cm,
       
    92     height=4cm, 
       
    93     legend entries={JavaScript},  
       
    94     legend pos=north west,
       
    95     legend cell align=left]
       
    96 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
    97 \end{axis}
       
    98 \end{tikzpicture}
       
    99   &
       
   100 \begin{tikzpicture}
       
   101 \begin{axis}[
       
   102     xlabel={$n$},
       
   103     x label style={at={(1.05,-0.05)}},
       
   104     %ylabel={time in secs},
       
   105     enlargelimits=false,
       
   106     xtick={0,5,...,30},
       
   107     xmax=33,
       
   108     ymax=35,
       
   109     ytick={0,5,...,30},
       
   110     scaled ticks=false,
       
   111     axis lines=left,
       
   112     width=5cm,
       
   113     height=4cm, 
       
   114     legend entries={Python},  
       
   115     legend pos=north west,
       
   116     legend cell align=left]
       
   117 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   118 \end{axis}
       
   119 \end{tikzpicture}
       
   120   &
       
   121 \begin{tikzpicture}
       
   122 \begin{axis}[
       
   123     xlabel={$n$},
       
   124     x label style={at={(1.05,-0.05)}},
       
   125     %ylabel={time in secs},
       
   126     enlargelimits=false,
       
   127     xtick={0,5,...,30},
       
   128     xmax=33,
       
   129     ymax=35,
       
   130     ytick={0,5,...,30},
       
   131     scaled ticks=false,
       
   132     axis lines=left,
       
   133     width=5cm,
       
   134     height=4cm, 
       
   135     legend entries={Java 8},  
       
   136     legend pos=north west,
       
   137     legend cell align=left]
       
   138 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   139 \end{axis}
       
   140 \end{tikzpicture}\\
       
   141 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   142            of the form $\underbrace{aa..a}_{n}$.}
       
   143 \end{tabular}    
       
   144 \end{center}  
       
   145 
       
   146 
       
   147 This is clearly exponential behaviour, and 
       
   148 is triggered by some relatively simple regex patterns.
       
   149 
       
   150 
       
   151 
       
   152 The opens up the possibility of
       
   153  a ReDoS (regular expression denial-of-service ) attack.
       
   154 
       
   155 Theoretical results say that regular expression matching
       
   156 should be linear with respect to the input. You could construct
       
   157 an NFA via Thompson construction, and simulate  running it.
       
   158 This would be linear.
       
   159 Or you could determinize the NFA into a DFA, and minimize that
       
   160 DFA. Once you have the DFA, the running time is also linear, requiring just 
       
   161 one scanning pass of the input.
       
   162 
       
   163 But modern  regex libraries in popular language engines
       
   164  often want to support richer constructs
       
   165 than  the most basic regular expressions such as bounded repetitions
       
   166 and back references. 
       
   167 %put in illustrations
       
   168 %a{1}{3}
       
   169 These make a DFA construction impossible because
       
   170 of an exponential states explosion.
       
   171  They also want to support lexing rather than just matching
       
   172  for tasks that involves text processing.
       
   173  
       
   174  Existing regex libraries either pose restrictions on the user input, or does 
       
   175  not give linear running time guarantee.
       
   176  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
       
   177  For example, the Rust regex engine claims to be linear, 
       
   178  but does not support lookarounds and back-references.
       
   179  The GoLang regex library does not support over 1000 repetitions.  
       
   180  Java and Python both support back-references, but shows
       
   181 catastrophic backtracking behaviours on inputs without back-references(
       
   182 when the language is still regular).
       
   183  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
       
   184  %TODO: verify the fact Rust does not allow 1000+ reps
       
   185  %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
       
   186  Another thing about the these libraries is that there
       
   187  is no correctness guarantee.
       
   188  In some cases they either fails to generate a lexing result when there is a match,
       
   189  or gives the wrong way of matching.
       
   190  
       
   191 
       
   192 This superlinear blowup in matching algorithms sometimes causes
       
   193 considerable grief in real life: for example on 20 July 2016 one evil
       
   194 regular expression brought the webpage
       
   195 \href{http://stackexchange.com}{Stack Exchange} to its
       
   196 %knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
       
   197 In this instance, a regular expression intended to just trim white
       
   198 spaces from the beginning and the end of a line actually consumed
       
   199 massive amounts of CPU-resources---causing web servers to grind to a
       
   200 halt. This happened when a post with 20,000 white spaces was submitted,
       
   201 but importantly the white spaces were neither at the beginning nor at
       
   202 the end. As a result, the regular expression matching engine needed to
       
   203 backtrack over many choices. In this example, the time needed to process
       
   204 the string was $O(n^2)$ with respect to the string length. This
       
   205 quadratic overhead was enough for the homepage of Stack Exchange to
       
   206 respond so slowly that the load balancer assumed there must be some
       
   207 attack and therefore stopped the servers from responding to any
       
   208 requests. This made the whole site become unavailable. Another very
       
   209 recent example is a global outage of all Cloudflare servers on 2 July
       
   210 2019. A poorly written regular expression exhibited exponential
       
   211 behaviour and exhausted CPUs that serve HTTP traffic. Although the
       
   212 outage had several causes, at the heart was a regular expression that
       
   213 was used to monitor network
       
   214 %traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
       
   215  
       
   216  Is it possible to have a regex lexing algorithm with proven correctness and 
       
   217  time complexity, which allows easy extensions to
       
   218   constructs like 
       
   219  bounded repetitions, negation,  lookarounds, and even back-references? 
       
   220  
       
   221  We propose Brzozowski's derivatives as a solution to this problem.
       
   222  
       
   223  
       
   224  \section{Why Brzozowski}
       
   225  In the last fifteen or so years, Brzozowski's derivatives of regular
       
   226 expressions have sparked quite a bit of interest in the functional
       
   227 programming and theorem prover communities.  
       
   228 The beauty of
       
   229 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
       
   230 expressible in any functional language, and easily definable and
       
   231 reasoned about in theorem provers---the definitions just consist of
       
   232 inductive datatypes and simple recursive functions. 
       
   233 
       
   234 Suppose we have an alphabet $\Sigma$, the strings  whose characters
       
   235 are from $\Sigma$
       
   236 can be expressed as $\Sigma^*$.
       
   237 
       
   238 We use patterns to define a set of strings concisely. Regular expressions
       
   239 are one of such patterns systems:
       
   240 The basic regular expressions  are defined inductively
       
   241  by the following grammar:
       
   242 \[			r ::=   \ZERO \mid  \ONE
       
   243 			 \mid  c  
       
   244 			 \mid  r_1 \cdot r_2
       
   245 			 \mid  r_1 + r_2   
       
   246 			 \mid r^*         
       
   247 \]
       
   248 
       
   249 The language or set of strings defined by regular expressions are defined as
       
   250 %TODO: FILL in the other defs
       
   251 \begin{center}
       
   252 \begin{tabular}{lcl}
       
   253 $L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
       
   254 $L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\
       
   255 \end{tabular}
       
   256 \end{center}
       
   257 Which are also called the "language interpretation".
       
   258 
       
   259 
       
   260 
       
   261 The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
       
   262 where the operation transforms the regex to a new one containing
       
   263 strings without the head character $c$.
       
   264 
       
   265 Formally, we define first such a transformation on any string set, which
       
   266 we call semantic derivative:
       
   267 \begin{center}
       
   268 $\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
       
   269 \end{center}
       
   270 Mathematically, it can be expressed as the 
       
   271 
       
   272 If the $\textit{StringSet}$ happen to have some structure, for example,
       
   273 if it is regular, then we have that it
       
   274 
       
   275 The the derivative of regular expression, denoted as
       
   276 $r \backslash c$, is a function that takes parameters
       
   277 $r$ and $c$, and returns another regular expression $r'$,
       
   278 which is computed by the following recursive function:
       
   279 
       
   280 \begin{center}
       
   281 \begin{tabular}{lcl}
       
   282 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   283 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   284 		$d \backslash c$     & $\dn$ & 
       
   285 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   286 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   287 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   288 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   289 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   290 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   291 \end{tabular}
       
   292 \end{center}
       
   293 \noindent
       
   294 \noindent
       
   295 
       
   296 The $\nullable$ function tests whether the empty string $""$ 
       
   297 is in the language of $r$:
       
   298 
       
   299 
       
   300 \begin{center}
       
   301 		\begin{tabular}{lcl}
       
   302 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   303 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   304 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   305 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   306 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   307 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   308 		\end{tabular}
       
   309 \end{center}
       
   310 \noindent
       
   311 The empty set does not contain any string and
       
   312 therefore not the empty string, the empty string 
       
   313 regular expression contains the empty string
       
   314 by definition, the character regular expression
       
   315 is the singleton that contains character only,
       
   316 and therefore does not contain the empty string,
       
   317 the alternative regular expression(or "or" expression)
       
   318 might have one of its children regular expressions
       
   319 being nullable and any one of its children being nullable
       
   320 would suffice. The sequence regular expression
       
   321 would require both children to have the empty string
       
   322 to compose an empty string and the Kleene star
       
   323 operation naturally introduced the empty string.
       
   324 
       
   325 We can give the meaning of regular expressions derivatives
       
   326 by language interpretation:
       
   327 
       
   328 
       
   329  
       
   330  
       
   331  
       
   332 Derivatives give a simple solution
       
   333 to the problem of matching a string $s$ with a regular
       
   334 expression $r$: if the derivative of $r$ w.r.t.\ (in
       
   335 succession) all the characters of the string matches the empty string,
       
   336 then $r$ matches $s$ (and {\em vice versa}).  
       
   337 
       
   338 
       
   339 
       
   340 However, there are two difficulties with derivative-based matchers:
       
   341 First, Brzozowski's original matcher only generates a yes/no answer
       
   342 for whether a regular expression matches a string or not.  This is too
       
   343 little information in the context of lexing where separate tokens must
       
   344 be identified and also classified (for example as keywords
       
   345 or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
       
   346 difficulty by cleverly extending Brzozowski's matching
       
   347 algorithm. Their extended version generates additional information on
       
   348 \emph{how} a regular expression matches a string following the POSIX
       
   349 rules for regular expression matching. They achieve this by adding a
       
   350 second ``phase'' to Brzozowski's algorithm involving an injection
       
   351 function.  In our own earlier work we provided the formal
       
   352 specification of what POSIX matching means and proved in Isabelle/HOL
       
   353 the correctness
       
   354 of Sulzmann and Lu's extended algorithm accordingly
       
   355 \cite{AusafDyckhoffUrban2016}.
       
   356 
       
   357 The second difficulty is that Brzozowski's derivatives can 
       
   358 grow to arbitrarily big sizes. For example if we start with the
       
   359 regular expression $(a+aa)^*$ and take
       
   360 successive derivatives according to the character $a$, we end up with
       
   361 a sequence of ever-growing derivatives like 
       
   362 
       
   363 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
       
   364 \begin{center}
       
   365 \begin{tabular}{rll}
       
   366 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   367 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   368 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
       
   369 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   370 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
       
   371 \end{tabular}
       
   372 \end{center}
       
   373  
       
   374 \noindent where after around 35 steps we run out of memory on a
       
   375 typical computer (we shall define shortly the precise details of our
       
   376 regular expressions and the derivative operation).  Clearly, the
       
   377 notation involving $\ZERO$s and $\ONE$s already suggests
       
   378 simplification rules that can be applied to regular regular
       
   379 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
       
   380 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
       
   381 r$. While such simple-minded simplifications have been proved in our
       
   382 earlier work to preserve the correctness of Sulzmann and Lu's
       
   383 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
       
   384 \emph{not} help with limiting the growth of the derivatives shown
       
   385 above: the growth is slowed, but the derivatives can still grow rather
       
   386 quickly beyond any finite bound.
       
   387 
       
   388 
       
   389 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
       
   390 \cite{Sulzmann2014} where they introduce bitcoded
       
   391 regular expressions. In this version, POSIX values are
       
   392 represented as bitsequences and such sequences are incrementally generated
       
   393 when derivatives are calculated. The compact representation
       
   394 of bitsequences and regular expressions allows them to define a more
       
   395 ``aggressive'' simplification method that keeps the size of the
       
   396 derivatives finite no matter what the length of the string is.
       
   397 They make some informal claims about the correctness and linear behaviour
       
   398 of this version, but do not provide any supporting proof arguments, not
       
   399 even ``pencil-and-paper'' arguments. They write about their bitcoded
       
   400 \emph{incremental parsing method} (that is the algorithm to be formalised
       
   401 in this paper):
       
   402 
       
   403 
       
   404 \begin{quote}\it
       
   405   ``Correctness Claim: We further claim that the incremental parsing
       
   406   method [..] in combination with the simplification steps [..]
       
   407   yields POSIX parse trees. We have tested this claim
       
   408   extensively [..] but yet
       
   409   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
       
   410 \end{quote}  
       
   411 
       
   412 
       
   413 
       
   414 
       
   415 \section{Backgound}
       
   416 %Regular expression matching and lexing has been 
       
   417 % widely-used and well-implemented
       
   418 %in software industry. 
       
   419 %TODO: expand the above into a full paragraph
       
   420 %TODO: look up snort rules to use here--give readers idea of what regexes look like
       
   421 
       
   422 
       
   423 Theoretical results say that regular expression matching
       
   424 should be linear with respect to the input.
       
   425 Under a certain class of regular expressions and inputs though,
       
   426 practical implementations  suffer from non-linear or even 
       
   427 exponential running time,
       
   428 allowing a ReDoS (regular expression denial-of-service ) attack.
       
   429 
       
   430 
       
   431 %----------------------------------------------------------------------------------------
       
   432 
       
   433 \section{Existing Practical Approaches}
       
   434 
       
   435 The reason behind is that regex libraries in popular language engines
       
   436  often want to support richer constructs
       
   437 than  the most basic regular expressions, and lexing rather than matching
       
   438 is needed for sub-match extraction.
       
   439 
       
   440 
       
   441 
       
   442 \subsection{DFA Approach}
       
   443 
       
   444 Exponential states.
       
   445 
       
   446 \subsection{NFA Approach}
       
   447 Backtracking.
       
   448 
       
   449 
       
   450 
       
   451 %----------------------------------------------------------------------------------------
       
   452 
       
   453 \section{Our Approach}
       
   454 In the last fifteen or so years, Brzozowski's derivatives of regular
       
   455 expressions have sparked quite a bit of interest in the functional
       
   456 programming and theorem prover communities.  The beauty of
       
   457 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
       
   458 expressible in any functional language, and easily definable and
       
   459 reasoned about in theorem provers---the definitions just consist of
       
   460 inductive datatypes and simple recursive functions.  Derivatives of a
       
   461 regular expression, written $r \backslash c$, give a simple solution
       
   462 to the problem of matching a string $s$ with a regular
       
   463 expression $r$: if the derivative of $r$ w.r.t.\ (in
       
   464 succession) all the characters of the string matches the empty string,
       
   465 then $r$ matches $s$ (and {\em vice versa}).  
       
   466 
       
   467 
       
   468 This work aims to address the above vulnerability by the combination
       
   469 of Brzozowski's derivatives and interactive theorem proving. We give an 
       
   470 improved version of  Sulzmann and Lu's bit-coded algorithm using 
       
   471 derivatives, which come with a formal guarantee in terms of correctness and 
       
   472 running time as an Isabelle/HOL proof.
       
   473 Then we improve the algorithm with an even stronger version of 
       
   474 simplification, and prove a time bound linear to input and
       
   475 cubic to regular expression size using a technique by
       
   476 Antimirov.
       
   477 
       
   478 \subsection{Existing Work}
       
   479 We are aware
       
   480 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
       
   481 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
       
   482 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
       
   483 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
       
   484 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
       
   485 
       
   486 %----------------------------------------------------------------------------------------
       
   487 
       
   488 \section{What this Template Includes}
       
   489 
       
   490 \subsection{Folders}
       
   491 
       
   492 This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory:
       
   493 
       
   494 \keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory.
       
   495 
       
   496 \keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. Each chapter should go in its own separate \file{.tex} file and they can be split as:
       
   497 \begin{itemize}
       
   498 \item Chapter 1: Introduction to the thesis topic
       
   499 \item Chapter 2: Background information and theory
       
   500 \item Chapter 3: (Laboratory) experimental setup
       
   501 \item Chapter 4: Details of experiment 1
       
   502 \item Chapter 5: Details of experiment 2
       
   503 \item Chapter 6: Discussion of the experimental results
       
   504 \item Chapter 7: Conclusion and future directions
       
   505 \end{itemize}
       
   506 This chapter layout is specialised for the experimental sciences, your discipline may be different.
       
   507 
       
   508 \keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document.
       
   509 
       
   510 \subsection{Files}
       
   511 
       
   512 Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about:
       
   513 
       
   514 \keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do.
       
   515 
       
   516 \keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis. 
       
   517 
       
   518 \keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version.
       
   519 
       
   520 \keyword{main.tex} -- this is an important file. This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis!
       
   521 
       
   522 Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include:
       
   523 
       
   524 \keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
       
   525 
       
   526 \keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis.
       
   527 
       
   528 \keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file.
       
   529 
       
   530 \keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section.
       
   531 
       
   532 \keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file.
       
   533 
       
   534 \keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section.
       
   535 
       
   536 \keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
       
   537 
       
   538 So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them.
       
   539 
       
   540 %----------------------------------------------------------------------------------------
       
   541 
       
   542 \section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile}
       
   543 
       
   544 You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment.
       
   545 
       
   546 Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots
       
   547 
       
   548 Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb|\href{url}{name}| and only leave the name.
       
   549 
       
   550 When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper!
       
   551 
       
   552 %----------------------------------------------------------------------------------------
       
   553 
       
   554 \section{The \code{main.tex} File Explained}
       
   555 
       
   556 The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it.
       
   557 
       
   558 Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block.
       
   559 
       
   560 Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from.
       
   561 
       
   562 Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space.
       
   563 
       
   564 Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor).
       
   565 
       
   566 The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols.
       
   567 
       
   568 The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes.
       
   569 
       
   570 The next page contains a one line dedication. Who will you dedicate your thesis to?
       
   571 
       
   572 Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder.
       
   573 
       
   574 After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place.
       
   575 
       
   576 %----------------------------------------------------------------------------------------
       
   577 
       
   578 \section{Thesis Features and Conventions}\label{ThesisConventions}
       
   579 
       
   580 To get the best out of this template, there are a few conventions that you may want to follow.
       
   581 
       
   582 One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method.
       
   583 
       
   584 \subsection{Printing Format}
       
   585 
       
   586 This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution.
       
   587 
       
   588 The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side.
       
   589 
       
   590 The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}.
       
   591 
       
   592 \subsection{Using US Letter Paper}
       
   593 
       
   594 The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}.
       
   595 
       
   596 Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins.
       
   597 
       
   598 \subsection{References}
       
   599 
       
   600 The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. Many reference managers allow you to simply drag the reference into the document as you type.
       
   601 
       
   602 Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.}
       
   603 
       
   604 The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations).
       
   605 
       
   606 \subsubsection{A Note on bibtex}
       
   607 
       
   608 The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors.
       
   609 
       
   610 \subsection{Tables}
       
   611 
       
   612 Tables are an important way of displaying your results, below is an example table which was generated with this code:
       
   613 
       
   614 {\small
       
   615 \begin{verbatim}
       
   616 \begin{table}
       
   617 \caption{The effects of treatments X and Y on the four groups studied.}
       
   618 \label{tab:treatments}
       
   619 \centering
       
   620 \begin{tabular}{l l l}
       
   621 \toprule
       
   622 \tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
       
   623 \midrule
       
   624 1 & 0.2 & 0.8\\
       
   625 2 & 0.17 & 0.7\\
       
   626 3 & 0.24 & 0.75\\
       
   627 4 & 0.68 & 0.3\\
       
   628 \bottomrule\\
       
   629 \end{tabular}
       
   630 \end{table}
       
   631 \end{verbatim}
       
   632 }
       
   633 
       
   634 \begin{table}
       
   635 \caption{The effects of treatments X and Y on the four groups studied.}
       
   636 \label{tab:treatments}
       
   637 \centering
       
   638 \begin{tabular}{l l l}
       
   639 \toprule
       
   640 \tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
       
   641 \midrule
       
   642 1 & 0.2 & 0.8\\
       
   643 2 & 0.17 & 0.7\\
       
   644 3 & 0.24 & 0.75\\
       
   645 4 & 0.68 & 0.3\\
       
   646 \bottomrule\\
       
   647 \end{tabular}
       
   648 \end{table}
       
   649 
       
   650 You can reference tables with \verb|\ref{<label>}| where the label is defined within the table environment. See \file{Chapter1.tex} for an example of the label and citation (e.g. Table~\ref{tab:treatments}).
       
   651 
       
   652 \subsection{Figures}
       
   653 
       
   654 There will hopefully be many figures in your thesis (that should be placed in the \emph{Figures} folder). The way to insert figures into your thesis is to use a code template like this:
       
   655 \begin{verbatim}
       
   656 \begin{figure}
       
   657 \centering
       
   658 \includegraphics{Figures/Electron}
       
   659 \decoRule
       
   660 \caption[An Electron]{An electron (artist's impression).}
       
   661 \label{fig:Electron}
       
   662 \end{figure}
       
   663 \end{verbatim}
       
   664 Also look in the source file. Putting this code into the source file produces the picture of the electron that you can see in the figure below.
       
   665 
       
   666 \begin{figure}[th]
       
   667 \centering
       
   668 \includegraphics{Figures/Electron}
       
   669 \decoRule
       
   670 \caption[An Electron]{An electron (artist's impression).}
       
   671 \label{fig:Electron}
       
   672 \end{figure}
       
   673 
       
   674 Sometimes figures don't always appear where you write them in the source. The placement depends on how much space there is on the page for the figure. Sometimes there is not enough room to fit a figure directly where it should go (in relation to the text) and so \LaTeX{} puts it at the top of the next page. Positioning figures is the job of \LaTeX{} and so you should only worry about making them look good!
       
   675 
       
   676 Figures usually should have captions just in case you need to refer to them (such as in Figure~\ref{fig:Electron}). The \verb|\caption| command contains two parts, the first part, inside the square brackets is the title that will appear in the \emph{List of Figures}, and so should be short. The second part in the curly brackets should contain the longer and more descriptive caption text.
       
   677 
       
   678 The \verb|\decoRule| command is optional and simply puts an aesthetic horizontal line below the image. If you do this for one image, do it for all of them.
       
   679 
       
   680 \LaTeX{} is capable of using images in pdf, jpg and png format.
       
   681 
       
   682 \subsection{Typesetting mathematics}
       
   683 
       
   684 If your thesis is going to contain heavy mathematical content, be sure that \LaTeX{} will make it look beautiful, even though it won't be able to solve the equations for you.
       
   685 
       
   686 The \enquote{Not So Short Introduction to \LaTeX} (available on \href{http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf}{CTAN}) should tell you everything you need to know for most cases of typesetting mathematics. If you need more information, a much more thorough mathematical guide is available from the AMS called, \enquote{A Short Math Guide to \LaTeX} and can be downloaded from:
       
   687 \url{ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf}
       
   688 
       
   689 There are many different \LaTeX{} symbols to remember, luckily you can find the most common symbols in \href{http://ctan.org/pkg/comprehensive}{The Comprehensive \LaTeX~Symbol List}.
       
   690 
       
   691 You can write an equation, which is automatically given an equation number by \LaTeX{} like this:
       
   692 \begin{verbatim}
       
   693 \begin{equation}
       
   694 E = mc^{2}
       
   695 \label{eqn:Einstein}
       
   696 \end{equation}
       
   697 \end{verbatim}
       
   698 
       
   699 This will produce Einstein's famous energy-matter equivalence equation:
       
   700 \begin{equation}
       
   701 E = mc^{2}
       
   702 \label{eqn:Einstein}
       
   703 \end{equation}
       
   704 
       
   705 All equations you write (which are not in the middle of paragraph text) are automatically given equation numbers by \LaTeX{}. If you don't want a particular equation numbered, use the unnumbered form:
       
   706 \begin{verbatim}
       
   707 \[ a^{2}=4 \]
       
   708 \end{verbatim}
       
   709 
       
   710 %----------------------------------------------------------------------------------------
       
   711 
       
   712 \section{Sectioning and Subsectioning}
       
   713 
       
   714 You should break your thesis up into nice, bite-sized sections and subsections. \LaTeX{} automatically builds a table of Contents by looking at all the \verb|\chapter{}|, \verb|\section{}|  and \verb|\subsection{}| commands you write in the source.
       
   715 
       
   716 The Table of Contents should only list the sections to three (3) levels. A \verb|chapter{}| is level zero (0). A \verb|\section{}| is level one (1) and so a \verb|\subsection{}| is level two (2). In your thesis it is likely that you will even use a \verb|subsubsection{}|, which is level three (3). The depth to which the Table of Contents is formatted is set within \file{MastersDoctoralThesis.cls}. If you need this changed, you can do it in \file{main.tex}.
       
   717 
       
   718 %----------------------------------------------------------------------------------------
       
   719 
       
   720 \section{In Closing}
       
   721 
       
   722 You have reached the end of this mini-guide. You can now rename or overwrite this pdf file and begin writing your own \file{Chapter1.tex} and the rest of your thesis. The easy work of setting up the structure and framework has been taken care of for you. It's now your job to fill it out!
       
   723 
       
   724 Good luck and have lots of fun!
       
   725 
       
   726 \begin{flushright}
       
   727 Guide written by ---\\
       
   728 Sunil Patel: \href{http://www.sunilpatel.co.uk}{www.sunilpatel.co.uk}\\
       
   729 Vel: \href{http://www.LaTeXTemplates.com}{LaTeXTemplates.com}
       
   730 \end{flushright}