ChengsongTanPhdThesis/Chapters/Chapter1.tex
       
     1 % Chapter 1
       
     2 
       
     3 \chapter{Introduction} % Main chapter title
       
     4 
       
     5 \label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1} 
       
     6 
       
     7 %----------------------------------------------------------------------------------------
       
     8 
       
     9 % Define some commands to keep the formatting separated from the content 
       
    10 \newcommand{\keyword}[1]{\textbf{#1}}
       
    11 \newcommand{\tabhead}[1]{\textbf{#1}}
       
    12 \newcommand{\code}[1]{\texttt{#1}}
       
    13 \newcommand{\file}[1]{\texttt{\bfseries#1}}
       
    14 \newcommand{\option}[1]{\texttt{\itshape#1}}
       
    15 
       
    16 
       
    17 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
    18 \newcommand{\ZERO}{\mbox{\bf 0}}
       
    19 \newcommand{\ONE}{\mbox{\bf 1}}
       
    20 \def\lexer{\mathit{lexer}}
       
    21 \def\mkeps{\mathit{mkeps}}
       
    22 
       
    23 \def\DFA{\textit{DFA}}
       
    24 \def\bmkeps{\textit{bmkeps}}
       
    25 \def\retrieve{\textit{retrieve}}
       
    26 \def\blexer{\textit{blexer}}
       
    27 \def\flex{\textit{flex}}
       
    28 \def\inj{\mathit{inj}}
       
    29 \def\Empty{\mathit{Empty}}
       
    30 \def\Left{\mathit{Left}}
       
    31 \def\Right{\mathit{Right}}
       
    32 \def\Stars{\mathit{Stars}}
       
    33 \def\Char{\mathit{Char}}
       
    34 \def\Seq{\mathit{Seq}}
       
    35 \def\Der{\mathit{Der}}
       
    36 \def\nullable{\mathit{nullable}}
       
    37 \def\Z{\mathit{Z}}
       
    38 \def\S{\mathit{S}}
       
    39 \def\rup{r^\uparrow}
       
    40 \def\simp{\mathit{simp}}
       
    41 \def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
       
    42 \def\map{\mathit{map}}
       
    43 \def\distinct{\mathit{distinct}}
       
    44 \def\blexersimp{\mathit{blexer}\_\mathit{simp}}
       
    45 %----------------------------------------------------------------------------------------
       
    46 %This part is about regular expressions, Brzozowski derivatives,
       
    47 %and a bit-coded lexing algorithm with proven correctness and time bounds.
       
Regular expressions are widely used in modern-day programming tasks.
Be it IDEs with syntax highlighting and auto-completion,
command line tools like $\mathit{grep}$ that facilitate easy
processing of text by search and replace, network intrusion
detection systems that reject suspicious traffic, or compiler
front ends---there is always an important phase of the
task that involves matching a regular
expression against a string.
Given its usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
       
    59 
       
If you use a popular programming language's
regex engine,
you can supply it with regexes and strings of your own,
and get matching/lexing information such as how a
sub-part of the regex matches a sub-part of the string.
These lexing libraries are on average quite fast.
%TODO: get source for SNORT/BRO's regex matching engine/speed
For example, the regex engines of some network intrusion detection
systems are able to process
megabytes or even gigabytes of network traffic per second.

Why do we need our own version, if the existing algorithms are
blindingly fast already? Well, it turns out that they are not always fast.
       
    73 
       
    74 
       
Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings of 28
$a$'s, one discovers that this decision takes around 30 seconds and
considerably longer when adding a few more $a$'s, as the graphs
below show:
       
    84 
       
    85 \begin{center}
       
    86 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
    87 \begin{tikzpicture}
       
    88 \begin{axis}[
       
    89     xlabel={$n$},
       
    90     x label style={at={(1.05,-0.05)}},
       
    91     ylabel={time in secs},
       
    92     enlargelimits=false,
       
    93     xtick={0,5,...,30},
       
    94     xmax=33,
       
    95     ymax=35,
       
    96     ytick={0,5,...,30},
       
    97     scaled ticks=false,
       
    98     axis lines=left,
       
    99     width=5cm,
       
   100     height=4cm, 
       
   101     legend entries={JavaScript},  
       
   102     legend pos=north west,
       
   103     legend cell align=left]
       
   104 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
       
   105 \end{axis}
       
   106 \end{tikzpicture}
       
   107   &
       
   108 \begin{tikzpicture}
       
   109 \begin{axis}[
       
   110     xlabel={$n$},
       
   111     x label style={at={(1.05,-0.05)}},
       
   112     %ylabel={time in secs},
       
   113     enlargelimits=false,
       
   114     xtick={0,5,...,30},
       
   115     xmax=33,
       
   116     ymax=35,
       
   117     ytick={0,5,...,30},
       
   118     scaled ticks=false,
       
   119     axis lines=left,
       
   120     width=5cm,
       
   121     height=4cm, 
       
   122     legend entries={Python},  
       
   123     legend pos=north west,
       
   124     legend cell align=left]
       
   125 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
       
   126 \end{axis}
       
   127 \end{tikzpicture}
       
   128   &
       
   129 \begin{tikzpicture}
       
   130 \begin{axis}[
       
   131     xlabel={$n$},
       
   132     x label style={at={(1.05,-0.05)}},
       
   133     %ylabel={time in secs},
       
   134     enlargelimits=false,
       
   135     xtick={0,5,...,30},
       
   136     xmax=33,
       
   137     ymax=35,
       
   138     ytick={0,5,...,30},
       
   139     scaled ticks=false,
       
   140     axis lines=left,
       
   141     width=5cm,
       
   142     height=4cm, 
       
   143     legend entries={Java 8},  
       
   144     legend pos=north west,
       
   145     legend cell align=left]
       
   146 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   147 \end{axis}
       
   148 \end{tikzpicture}\\
       
   149 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
   150            of the form $\underbrace{aa..a}_{n}$.}
       
   151 \end{tabular}    
       
   152 \end{center}  
       
   153 
       
   154 
       
This is clearly exponential behaviour, and it
is triggered by relatively simple regex patterns.
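This behaviour is easy to reproduce. The following small Python experiment (our illustration, not one of the measurements above) shows the runtime of Python's backtracking matcher roughly doubling with each additional $a$:

```python
# A small experiment (ours, not from the measurements above) showing the
# exponential blow-up of Python's backtracking matcher on (a*)*b.
import re
import time

pattern = re.compile(r'(a*)*b')

def match_time(n):
    """Time re.fullmatch on n a's (the match must fail: the b is missing)."""
    s = 'a' * n
    start = time.perf_counter()
    result = pattern.fullmatch(s)
    return result, time.perf_counter() - start

# The result is always None, but the time roughly doubles with each extra a.
for n in (16, 18, 20):
    result, secs = match_time(n)
    assert result is None
    print(n, secs)
```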
       
   157 
       
   158 
       
This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable.
A more
recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the
outage had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually quite widespread.
They occur so often that they have been given a
name---Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\parencite{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, npm and PyPI.
They therefore concluded that evil regular expressions
are more than ``a parlour trick'': they are a problem that
requires more research attention.
       
   197 
       
\section{Why are current regex algorithms slow?}
       
   199 
       
Theoretical results say that regular expression matching
should be linear with respect to the input.
One could construct
an NFA via the Thompson construction and simulate running it.
This simulation would be linear in the length of the input.
Or one could determinise the NFA into a DFA and minimise that
DFA. Once the DFA is built, matching takes linear time as well, requiring just
one scanning pass over the input.
       
   208 
       
But modern regex libraries in popular language engines
often want to support richer constructs,
such as bounded repetitions and back-references,
rather than just the most basic regular expressions.
%put in illustrations
%a{1}{3}
These make a DFA construction infeasible because
of an exponential explosion in the number of states.
They also want to support lexing rather than just matching
for tasks that involve text processing.

Existing regex libraries either pose restrictions on the user input, or do
not give a linear running time guarantee.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
For example, the Rust regex engine claims to be linear,
but does not support lookarounds and back-references.
The GoLang regex library does not support more than 1000 repetitions.
Java and Python both support back-references, but show
catastrophic backtracking behaviour on certain inputs without back-references
(that is, when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
Another problem with these libraries is that there
is no correctness guarantee.
In some cases they either fail to generate a lexing result when there is a match,
or give the wrong way of matching.
       
   237 
       
It turns out that regex libraries not only suffer from
exponential backtracking problems,
but also produce undesired (or even buggy) outputs.
%TODO: comment from who
xxx commented that most regex libraries do not
correctly implement the POSIX (maximum-munch)
rule of regular expression matching.
A concrete example is
the regex
       
   247 \begin{verbatim}
       
   248 (((((a*a*)b*)b){20})*)c
       
   249 \end{verbatim}
       
   250 and the string
       
   251 \begin{verbatim}
       
   252 baabaabababaabaaaaaaaaababaa
       
   253 aababababaaaabaaabaaaaaabaab
       
   254 aabababaababaaaaaaaaababaaaa
       
   255 babababaaaaaaaaaaaaac
       
   256 \end{verbatim}
       
   257 
       
This seemingly complex regex simply says ``some $a$'s
followed by some $b$'s then followed by a single $b$,
iterated 20 times, and finally followed by a $c$''.
A POSIX match would involve the entire string, ``eating up''
all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly

This regex triggers catastrophic backtracking in
languages like Python and Java,
whereas it gives a correct but uninformative (non-POSIX)
match in languages like Go or .NET---the match consisting of only the
character $c$.
       
   270 
       
Backtracking or depth-first search might give us
exponential running time, and quite a few tools
avoid that by determinising the $\mathit{NFA}$
into a $\mathit{DFA}$ and minimising it.
For example, $\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
for $C$ and $\mathit{Java}$ that generate $\mathit{DFA}$-based
lexers.
However, they do not scale well with bounded repetitions.
Bounded repetitions, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denote a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring is in $r$.
       
   284 %TODO: draw example NFA.
       
   285 For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
       
   286 an $\mathit{NFA}$ describing it would look like:
       
   287 \begin{center}
       
   288 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
       
   289    \node[state,initial] (q_0)   {$q_0$}; 
       
   290    \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
       
   291    \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
       
   292    \node[state,accepting](q_3) [right=of q_2] {$q_3$};
       
   293     \path[->] 
       
   294     (q_0) edge  node {a} (q_1)
       
   295     	  edge [loop below] node {a,b} ()
       
   296     (q_1) edge  node  {a,b} (q_2)
       
       
   298     (q_2) edge  node  {a,b} (q_3);
       
   299 \end{tikzpicture}
       
   300 \end{center}
       
The red states are ``counter states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
just consumed the ``delimiter'' $a$ in the middle, and
needs to match two more iterations of $a|b$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed one character, and waits
for one more character to complete.
Depending on the actual characters appearing in the
input string, the states $q_1$ and $q_2$ may or may
not be active, independently from each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least four non-equivalent states that cannot be merged,
because the subset states indicating which of $q_1$ and $q_2$
are active are at least four: $\emptyset$, $\{q_1\}$,
$\{q_2\}$, $\{q_1, q_2\}$.
Generalising this to regular expressions with a larger
bounded repetition number, the regular expression $r^*ar^{\{n\}}$
in general requires at least $2^n$ states
in a $\mathit{DFA}$, in order to represent all the different
configurations of ``countdown'' states.
For those regexes, tools such as $\mathit{JFLEX}$
would generate gigantic $\mathit{DFA}$'s or even
run out of memory.
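This blow-up can be checked concretely. The following Python sketch (our illustration; the particular NFA encoding is an assumption, not taken from any tool) builds the NFA for $r^*ar^{\{n\}}$ with $r = a|b$ and counts the states produced by the subset construction:

```python
# Subset construction for the NFA of (a|b)^* a (a|b)^{n}, counting the
# reachable DFA states to illustrate the 2^n blow-up (our sketch).
def nfa_delta(state, ch, n):
    """State 0 loops on a,b and moves to 1 on a; states 1..n count down."""
    if state == 0:
        return {0, 1} if ch == 'a' else {0}
    if state <= n:
        return {state + 1}      # state n+1 is accepting and has no successors
    return set()

def dfa_size(n):
    """Number of reachable states of the subset-construction DFA."""
    start = frozenset({0})
    seen, worklist = {start}, [start]
    while worklist:
        current = worklist.pop()
        for ch in 'ab':
            succ = frozenset(t for q in current for t in nfa_delta(q, ch, n))
            if succ and succ not in seen:
                seen.add(succ)
                worklist.append(succ)
    return len(seen)

# The reachable subsets are {0} together with any subset of the countdown
# states, so the DFA needs at least 2^n states.
for n in (2, 4, 8):
    assert dfa_size(n) >= 2 ** n
```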
       
   327 
       
For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.
One can simulate an $\mathit{NFA}$ running in two ways.
The first is to keep track of all active states after consuming
a character, and to update that set of states iteratively.
This amounts to a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{GO}$ and $\mathit{RUST}$ use this
type of $\mathit{NFA}$ simulation, and guarantee a linear runtime
in terms of input string length.
The other way to use an $\mathit{NFA}$ for matching is to take
a single state in a path each time, and backtrack if that path
fails. This is a depth-first search that could end up
with exponential run time.
The reason for backtracking algorithms in languages like
Java and Python is that they support back-references.
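The breadth-first simulation can be sketched in a few lines. The following Python function (our illustration; the transition-table encoding is an assumption) processes each input character exactly once, which is why the runtime is linear in the string length:

```python
# Breadth-first NFA simulation (our sketch): keep the set of active
# states and update it once per input character.
def simulate_nfa(start, accepting, delta, s):
    """delta maps (state, char) to the set of successor states."""
    states = {start}
    for ch in s:
        states = {t for q in states for t in delta.get((q, ch), set())}
        if not states:              # no active state left: reject early
            return False
    return bool(states & accepting)

# NFA for (a|b)^* a: state 0 loops on a,b and also moves to 1 on a.
delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
assert simulate_nfa(0, {1}, delta, 'ba')
assert not simulate_nfa(0, {1}, delta, 'bb')
```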
       
   346 \subsection{Back References in Regex--Non-Regular part}
       
If we label sub-expressions by parenthesising them and give
them a number by the order in which their opening parentheses appear,
$\underset{1}{(}\ldots\underset{2}{(}\ldots\underset{3}{(}\ldots\underset{4}{(}\ldots)\ldots)\ldots)\ldots)$
then we can use the following syntax to denote that we want a string just matched by a
sub-expression to appear again exactly at a certain location:
$(.*)\backslash 1$
matches strings like $\mathit{bobo}$ and $\mathit{weewee}$.
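For instance, Python's re library supports back-references, and one can quickly confirm (an illustration of the construct, not code from this thesis) that $(.*)\backslash 1$ matches exactly the ``squares'' $ww$:

```python
# Back-reference demo: (.*)\1 matches exactly strings of the form w w.
import re

square = re.compile(r'(.*)\1')

assert square.fullmatch('bobo')        # w = 'bo'
assert square.fullmatch('weewee')      # w = 'wee'
assert square.fullmatch('')            # w = ''
assert not square.fullmatch('bob')     # odd length, cannot be w w
```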
       
   354 
       
Back-references are a construct in the ``regex'' standard
that programmers find quite useful, but they are not
regular any more.
In fact, back-references allow regexes to express
languages that are not even context-free.
For example, the back-reference expression $(a^*)b\backslash 1 b\backslash 1$
expresses the language $\{a^n b a^n b a^n \mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars.
To express such a language one would need context-sensitive
grammars, and matching/parsing for such grammars is NP-hard in
general.
%TODO:cite reference for NP-hardness of CSG matching
For such constructs, the most intuitive way of matching is
to use backtracking algorithms, and there are no known algorithms
that do not backtrack.
%TODO:read a bit more about back reference algorithms
       
   371 %TODO:read a bit more about back reference algorithms
       
   372 
       
   373 
       
   374 
       
   375 
       
\section{Our Solution---Brzozowski Derivatives}
       
   377 
       
   378 
       
   379  
       
   380  Is it possible to have a regex lexing algorithm with proven correctness and 
       
   381  time complexity, which allows easy extensions to
       
   382   constructs like 
       
   383  bounded repetitions, negation,  lookarounds, and even back-references? 
       
   384  
       
   385  We propose Brzozowski's derivatives as a solution to this problem.
       
   386  
       
The main contribution of this thesis is a proven-correct lexing algorithm
with formalised time bounds.
To the best of our knowledge, there are no lexing libraries using Brzozowski derivatives
that have a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu proposed an algorithm for which they
claim a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalised the correctness of the lexer, but not its complexity.
In the performance evaluation section, they simply analysed the run time
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.
We therefore believe that our proof of performance on general
inputs, rather than on specific examples, is a novel contribution.\\
       
   409 
       
   410  \section{Preliminaries about Lexing Using Brzozowski derivatives}
       
   411  In the last fifteen or so years, Brzozowski's derivatives of regular
       
   412 expressions have sparked quite a bit of interest in the functional
       
   413 programming and theorem prover communities.  
       
   414 The beauty of
       
   415 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
       
   416 expressible in any functional language, and easily definable and
       
   417 reasoned about in theorem provers---the definitions just consist of
       
   418 inductive datatypes and simple recursive functions. 
       
   419 
       
Suppose we have an alphabet $\Sigma$; the set of strings whose characters
are drawn from $\Sigma$
is written $\Sigma^*$.
       
   423 
       
We use patterns to define a set of strings concisely. Regular expressions
are one such pattern system.
The basic regular expressions are defined inductively
by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2
			 \mid r^*
\]

The language, or set of strings, denoted by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\emptyset$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; r_1 \cdot r_2$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; r_1, \, s_2 \in L \; r_2\}$\\
$L \; r^*$ & $\dn$ & $(L \; r)^*$\\
\end{tabular}
\end{center}
This mapping is also called the ``language interpretation'' of regular expressions.
       
   444 
       
   445 
       
   446 
       
The Brzozowski derivative w.r.t.\ a character $c$ is an operation on a regex,
which transforms the regex into a new one whose language contains the
strings of the original language with the head character $c$ removed.

Formally, we first define such a transformation on any set of strings, which
we call the semantic derivative:
\begin{center}
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in \textit{StringSet}\}$
\end{center}
If the $\textit{StringSet}$ happens to have some structure, for example
if it is regular, then its derivative is regular as well and can be
computed syntactically on the regular expression.

The derivative of a regular expression, denoted
$r \backslash c$, is a function that takes parameters
$r$ and $c$ and returns another regular expression $r'$,
computed by the following recursive function:
       
   465 
       
   466 \begin{center}
       
   467 \begin{tabular}{lcl}
       
   468 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
       
   469 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
       
   470 		$d \backslash c$     & $\dn$ & 
       
   471 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   472 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   473 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
       
   474 	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
       
   475 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
       
   476 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
       
   477 \end{tabular}
       
   478 \end{center}
       
\noindent
       
   481 
       
   482 The $\nullable$ function tests whether the empty string $""$ 
       
   483 is in the language of $r$:
       
   484 
       
   485 
       
   486 \begin{center}
       
   487 		\begin{tabular}{lcl}
       
   488 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
       
   489 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
       
   490 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
       
   491 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
       
   492 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
       
   493 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
       
   494 		\end{tabular}
       
   495 \end{center}
       
   496 \noindent
       
The empty set does not contain any string, and
therefore not the empty string. The empty-string
regular expression contains the empty string
by definition. The character regular expression
denotes the singleton language containing just that character,
and therefore does not contain the empty string.
The alternative regular expression (or ``or'' expression)
is nullable if either of its child regular expressions
is nullable. The sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
operation naturally introduces the empty string.
       
   510 
       
We can give the meaning of the derivative operation
in terms of the language interpretation:
\begin{center}
$L \; (r \backslash c) \;=\; \Der \; c \; (L \; r)$
\end{center}
\noindent
       
That is, the derivative function, written $\backslash c$,
describes how a regular expression evolves into
a new regular expression when the head character $c$
is chopped off all the strings it contains.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be chosen as the target regular expression whose head
character is chopped off.
The star case unwraps one iteration of
the regular expression and attaches the star regular expression
to the back of it again, to make sure there can be zero or more iterations
following this unfolded iteration.
       
   546 
       
   547 
       
   548 The main property of the derivative operation
       
   549 that enables us to reason about the correctness of
       
   550 an algorithm using derivatives is 
       
   551 
       
   552 \begin{center}
       
   553 $c\!::\!s \in L(r)$ holds
       
   554 if and only if $s \in L(r\backslash c)$.
       
   555 \end{center}
       
   556 
       
   557 \noindent
       
   558 We can generalise the derivative operation shown above for single characters
       
   559 to strings as follows:
       
   560 
       
   561 \begin{center}
       
   562 \begin{tabular}{lcl}
       
   563 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
       
   564 $r \backslash [\,] $ & $\dn$ & $r$
       
   565 \end{tabular}
       
   566 \end{center}
       
   567 
       
   568 \noindent
       
   569 and then define Brzozowski's  regular-expression matching algorithm as:
       
   570 
       
   571 \[
       
   572 match\;s\;r \;\dn\; nullable(r\backslash s)
       
   573 \]
       
   574 
       
   575 \noindent
       
Assuming a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:
       
   578 
       
   579 \begin{equation}\label{graph:*}
       
   580 \begin{tikzcd}
       
   581 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
       
   582 \end{tikzcd}
       
   583 \end{equation}
       
\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
shown relatively easily that this matcher is correct (that is, given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).

A beautiful and simple definition.
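This matcher can be sketched directly in Scala, the language also used for the implementations in this thesis. The constructor and function names below are our own for this illustration, not a definitive implementation:

```scala
// A sketch of Brzozowski's matcher: derivatives plus a final
// nullability test, without any simplification.
abstract class Rexp
case object ZERO extends Rexp                      // matches nothing
case object ONE extends Rexp                       // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the Brzozowski derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// extension of the derivative from characters to strings
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
```

For instance, with $r = (a + aa)^*$ the call \texttt{matches(r, "aaa")} yields \texttt{true}, while \texttt{matches(r, "ab")} yields \texttt{false}.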
       
If we implement the above algorithm naively, however,
it can be excruciatingly slow. For example, when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm.
       
Brzozowski was quick to observe that during this process many useless
$\ONE$s and $\ZERO$s are generated, making the process not optimal.
He also introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and further decrease the size
of the intermediate regexes.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher without cumbersome regular expressions.
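To make the composition $a \backslash_{simp} c$ concrete, here is a sketch in Scala with a deliberately basic \textit{simp} that only deletes $\ZERO$s and $\ONE$s and merges syntactically identical alternatives. This is an assumption for illustration only, far weaker than the simplification developed later:

```scala
// A sketch of interleaving a basic simplification with the derivative.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// delete useless ZEROs and ONEs; merge identical alternatives
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s) => if (r1s == r2s) r1s else ALT(r1s, r2s)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (_, ZERO) => ZERO
    case (ONE, r2s) => r2s
    case (r1s, ONE) => r1s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case r1 => r1
}

// number of nodes of a regular expression viewed as a tree
def size(r: Rexp): Int = r match {
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
  case _ => 1
}

def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// the same, but with simp applied after every derivative step
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}
```

Even this weak \textit{simp} already reduces the size of the 12 derivatives of $(a + aa)^*$ compared to the naive derivatives, although, as discussed next, much stronger rules are needed for a tight size bound.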
       
If we want the size of derivatives in the algorithm to
stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, which can then be simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
the same size bound as that of the \emph{partial derivatives}.
       
Building derivatives and then simplifying them: so far so good. But what if we want to
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back again to the world
without simplification first for a moment.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.
       
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}

They first defined the datatypes for storing the
lexing information, called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:
       
\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}
       
\noindent
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and vice versa.
Even for a regular expression matching a particular string, there could
still be more than one value corresponding to it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following formula:
\begin{center}
	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed form formula can be calculated to be
\begin{equation}
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.
A lexer aimed at generating all possible values therefore has an exponential
worst-case runtime, and it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
match the leftmost regular expression when there is a choice
and always match a subpart as long as possible before proceeding
to the next token. For example, the above example has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from an algorithm is such a POSIX match
encoded as a value.
       
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated if the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}
       
\noindent
For convenience, we shall employ the following notation: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase, from
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm, from right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
       
	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$
			& \textit{if} $\nullable(r_{1})$\\
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}
       
\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The POSIX value is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off; the second
is a character ${c_{i-1}}$, the character we want to inject; and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:
       
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}
       
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses basically do one thing---identify the ``holes'' in the
value into which the character is injected back.
For instance, consider the last clause, which injects back into a value
corresponding to a star: we know this value must be a sequence value,
whose first component corresponds to the child regex of the star
with the first character chopped off---an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations we collected before. So we inject the character
back into the first value and form a new value with this new iteration
added to the previous list of iterations, all under the $\Stars$
top level.
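The two phases can be pieced together in Scala. The following is a sketch following the definitions of $\mkeps$ and $\textit{inj}$ above (the value constructors \texttt{Chr} and \texttt{Sequ} are renamed to avoid clashing with the regular-expression constructors; not a definitive implementation):

```scala
// A sketch of the two-phase lexer: derivatives forward, mkeps at the
// end of the string, then injections backward.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// value for how a nullable r matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
  case _ => sys.error("not nullable")
}

// inject c back into v (built for the regex after the derivative)
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (CHAR(_), Empty) => Chr(c)
  case _ => sys.error("unexpected shape")
}

def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil => if (nullable(r)) mkeps(r) else sys.error("not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
```

For example, \texttt{lex} applied to $a \cdot b$ and the string $ab$ produces the value $\Seq(\Char(a), \Char(b))$.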
       
We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well---by definition they
reflect the size of the regular expressions.

One can introduce simplification on the regexes and values, but one has to
be careful not to break correctness, as the injection
function heavily relies on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps, and applying these rectifications in
each run during the injection phase.
We can prove that the POSIX value of how
regular expressions match strings will not be affected---although this is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
       
%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be 
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with 
%automaton, and did not try lexing using derivatives.
       
We want to get rid of the complex and fragile rectification of values.
Can we avoid creating those intermediate values $v_1,\ldots, v_n$ altogether,
and obtain the lexing information that should already be there while
doing derivatives in one pass, without a second phase of injection?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information to the
regular expressions called \emph{bitcodes}.
       
\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
		$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b::bs    
$
\end{center}
       
\noindent
The $1$ and $0$ are not in bold in order to avoid 
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes: 
       
\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}    
\end{center} 
       
\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:
       
%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\                           
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,(v\!::\!vs), bs_2)$\bigskip\\
  
  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$                       
\end{tabular}    
\end{center}    
%\end{definition}
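The coding and decoding functions can be sketched in Scala as follows (the value constructors \texttt{Chr} and \texttt{Sequ} are renamed to avoid a clash with the regular-expression constructors, and bits are represented as integers; a sketch, not a definitive implementation):

```scala
// A sketch of code and decode; decode needs the regular expression
// as extra information in order to interpret the bits.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def code(v: Val): List[Int] = v match {
  case Empty => Nil
  case Chr(_) => Nil                      // characters are thrown away
  case Left(v1) => 0 :: code(v1)
  case Right(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)              // 0 marks the end of a star
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}

// decode' above: returns the decoded value and the remaining bits
def decodeAux(bs: List[Int], r: Rexp): (Val, List[Int]) = (r, bs) match {
  case (ONE, _) => (Empty, bs)
  case (CHAR(c), _) => (Chr(c), bs)
  case (ALT(r1, _), 0 :: rest) =>
    val (v, bs1) = decodeAux(rest, r1); (Left(v), bs1)
  case (ALT(_, r2), 1 :: rest) =>
    val (v, bs1) = decodeAux(rest, r2); (Right(v), bs1)
  case (SEQ(r1, r2), _) =>
    val (v1, bs1) = decodeAux(bs, r1)
    val (v2, bs2) = decodeAux(bs1, r2)
    (Sequ(v1, v2), bs2)
  case (STAR(_), 0 :: rest) => (Stars(Nil), rest)
  case (STAR(r1), 1 :: rest) =>
    val (v, bs1) = decodeAux(rest, r1)
    decodeAux(bs1, r) match {
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
      case _ => sys.error("bad bitcode")
    }
  case _ => sys.error("bad bitcode")
}

def decode(bs: List[Int], r: Rexp): Option[Val] = decodeAux(bs, r) match {
  case (v, Nil) => Some(v)
  case _ => None
}
```

As a sanity check, for any value $v$ of a regular expression $r$ we expect \texttt{decode(code(v), r)} to return \texttt{Some(v)}.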
       
Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
       
\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\sum\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}    
\end{center}  
%(in \textit{ALTS})
       
\noindent
where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.
       
To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:
       
%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}    
\end{center}    
%\end{definition}
       
\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:
       
\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
     $_{bs@bs'}\sum\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}    
\end{center}  
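A sketch of $\textit{fuse}$ and internalisation in Scala may be helpful; the constructor names (\texttt{AZERO}, \texttt{AONE}, and so on) are our own for this illustration, with bits represented as integers:

```scala
// Annotated regular expressions and the fuse/internalise sketch.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Int]) extends ARexp
case class ACHAR(bs: List[Int], c: Char) extends ARexp
case class AALTS(bs: List[Int], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Int], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Int], a: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: List[Int], a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ::: bs1)
  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
}

// lift an un-annotated regular expression (the up-arrow operation)
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(0), internalise(r1)),
                                      fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
```

For instance, internalising $a + b$ yields an alternative whose branches carry the bits $[0]$ and $[1]$ respectively.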
       
\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:
       
       
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [1] \; (a\,\backslash c))\cdot
       (_{[]}a^*)$
\end{tabular}    
\end{center}    
       
%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $1$ to the front of $a \backslash c$
to indicate that there is one more star iteration (this matches the bit
used by $\textit{code}$ for a non-empty star iteration). The sequence clause
is also more subtle---the interesting case is when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it is for annotated regular expressions, therefore we omit the
definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later);
then the right branch of the alternative, which is $\textit{fuse} \; (\bmkeps \;  a_1) \; (a_2
\backslash c)$, will collapse the regular expression $a_1$ (as it has
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it on. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top-level of $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:
       
%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\sum\,(a::\textit{as}))$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
     $bs \,@\, [0]$
\end{tabular}    
\end{center}    
%\end{definition}
       
\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:
       
\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
       
\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).

Remember that one of the important reasons we introduced bitcodes
is that they can make simplification more structured, and therefore
guarantee correctness.
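The pieces above can be put together into a sketch of the bit-coded lexer in Scala. For brevity this version returns the bitcode collected by $\textit{bmkeps}$ instead of decoding it back into a value, and it marks an unfolded star iteration with bit $1$, in line with the $\textit{code}$ and $\textit{decode}$ functions; it is an illustration, not a definitive implementation:

```scala
// A sketch of the bit-coded lexer (without simplification); returns
// the final bitcode rather than the decoded value.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Int]) extends ARexp
case class ACHAR(bs: List[Int], c: Char) extends ARexp
case class AALTS(bs: List[Int], as: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Int], a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: List[Int], a: ARexp) extends ARexp

def fuse(bs: List[Int], a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ::: bs1)
  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
}

def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) => AALTS(Nil, List(fuse(List(0), internalise(r1)),
                                      fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}

def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// only called on bnullable annotated regular expressions
def bmkeps(a: ARexp): List[Int] = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ::: bmkeps(a1) else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _) => bs ::: List(0)    // 0 marks the end of the star
  case _ => sys.error("not nullable")
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(1), bder(c, a1)), ASTAR(Nil, a1))  // 1 per iteration
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexerBits(r: Rexp, s: String): Option[List[Int]] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) Some(bmkeps(a)) else None
}
```

For example, lexing the string $aa$ with $a^*$ yields the bitcode $[1,1,0]$, which is exactly $\textit{code}$ applied to the value $\Stars\,[\Char(a), \Char(a)]$.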
       
\subsection*{Our Simplification Rules}

In this section we introduce aggressive (in terms of size) simplification rules
on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as we have
generated test data showing
that a good tight bound can be achieved. Obviously we can only
partially cover the search space, as there are infinitely many regular
expressions and strings.
       
  1123 One modification we introduced is to allow a list of annotated regular
       
  1124 expressions in the $\sum$ constructor. This allows us to not just
       
  1125 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
       
  1126 also unnecessary ``copies'' of regular expressions (very similar to
       
  1127 simplifying $r + r$ to just $r$, but in a more general setting). Another
       
  1128 modification is that we use simplification rules inspired by Antimirov's
       
  1129 work on partial derivatives. They maintain the idea that only the first
       
  1130 ``copy'' of a regular expression in an alternative contributes to the
       
  1131 calculation of a POSIX value. All subsequent copies can be pruned away from
       
  1132 the regular expression. A recursive definition of our  simplification function 
       
  1133 that looks somewhat similar to our Scala code is given below:
       
  1134 %\comment{Use $\ZERO$, $\ONE$ and so on. 
       
  1135 %Is it $ALTS$ or $ALTS$?}\\
       
  1136 
       
  1137 \begin{center}
       
  1138   \begin{tabular}{@{}lcl@{}}
       
  1139    
       
  1140   $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
       
  1141    &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
       
  1142    &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
       
  1143    &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
       
  1144    &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
       
  1145    &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
       
  1146 
       
  1147   $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
       
  1148   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
       
  1149    &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
       
  1150    &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
       
  1151 
       
  1152    $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
       
  1153 \end{tabular}    
       
  1154 \end{center}    
       
  1155 
       
  1156 \noindent
       
The simplification function pattern-matches on the regular expression.
When it encounters an alternative or a sequence, it first simplifies
the children recursively and then checks whether one of them has turned
into $\ZERO$ or $\ONE$, which might trigger further simplification at
the current level. The most involved part is the $\sum$ clause, where
we use two auxiliary functions, $\textit{flatten}$ and
$\textit{distinct}$, to open up nested alternatives and remove as many
duplicates as possible. The function $\textit{distinct}$ keeps only the
first occurring copy of a duplicated regular expression and removes all
later ones. The function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:
       
  1168 
       
  1169  \begin{center}
       
  1170   \begin{tabular}{@{}lcl@{}}
       
  1171   $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
       
  1172      (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
       
  1173   $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
       
  1174     $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
       
  1175 \end{tabular}    
       
  1176 \end{center}  
       
  1177 
       
  1178 \noindent
       
  1179 Here $\textit{flatten}$ behaves like the traditional functional programming flatten
       
  1180 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
       
  1181 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
       
  1182 
       
Having defined the $\simp$ function, we can, as before, extend the
derivative w.r.t.~a character to a derivative
w.r.t.~a string:%\comment{simp in  the [] case?}
       
  1187 
       
  1188 \begin{center}
       
  1189 \begin{tabular}{lcl}
       
  1190 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
       
  1191 $r \backslash_{simp} [\,] $ & $\dn$ & $r$
       
  1192 \end{tabular}
       
  1193 \end{center}
       
  1194 
       
  1195 \noindent
       
  1196 to obtain an optimised version of the algorithm:
       
  1197 
       
  1198  \begin{center}
       
  1199 \begin{tabular}{lcl}
       
  1200   $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
       
  1201       $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
       
  1202   & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
       
  1203   & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
       
  1204   & & $\;\;\textit{else}\;\textit{None}$
       
  1205 \end{tabular}
       
  1206 \end{center}
       
  1207 
       
  1208 \noindent
       
This algorithm keeps the size of regular expressions small: with this
simplification, the derivative in our previous $(a + aa)^*$ example is
reduced from 8000 nodes to just 6, and its size stays constant no
matter how long the input string is.
       
  1213 
       
  1214 
       
  1215  
       
       
  1221 
       
  1222 
       
  1223 
       
  1224 However, there are two difficulties with derivative-based matchers:
       
  1225 First, Brzozowski's original matcher only generates a yes/no answer
       
  1226 for whether a regular expression matches a string or not.  This is too
       
  1227 little information in the context of lexing where separate tokens must
       
  1228 be identified and also classified (for example as keywords
       
  1229 or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
       
  1230 difficulty by cleverly extending Brzozowski's matching
       
  1231 algorithm. Their extended version generates additional information on
       
  1232 \emph{how} a regular expression matches a string following the POSIX
       
  1233 rules for regular expression matching. They achieve this by adding a
       
  1234 second ``phase'' to Brzozowski's algorithm involving an injection
       
  1235 function.  In our own earlier work we provided the formal
       
  1236 specification of what POSIX matching means and proved in Isabelle/HOL
       
  1237 the correctness
       
  1238 of Sulzmann and Lu's extended algorithm accordingly
       
  1239 \cite{AusafDyckhoffUrban2016}.
       
  1240 
       
  1241 The second difficulty is that Brzozowski's derivatives can 
       
  1242 grow to arbitrarily big sizes. For example if we start with the
       
  1243 regular expression $(a+aa)^*$ and take
       
  1244 successive derivatives according to the character $a$, we end up with
       
  1245 a sequence of ever-growing derivatives like 
       
  1246 
       
  1247 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
       
  1248 \begin{center}
       
  1249 \begin{tabular}{rll}
       
  1250 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
  1251 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
  1252 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
       
  1253 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
  1254 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
       
  1255 \end{tabular}
       
  1256 \end{center}
       
  1257  
       
  1258 \noindent where after around 35 steps we run out of memory on a
       
  1259 typical computer (we shall define shortly the precise details of our
       
  1260 regular expressions and the derivative operation).  Clearly, the
       
  1261 notation involving $\ZERO$s and $\ONE$s already suggests
       
simplification rules that can be applied to regular
       
  1263 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
       
  1264 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
       
  1265 r$. While such simple-minded simplifications have been proved in our
       
  1266 earlier work to preserve the correctness of Sulzmann and Lu's
       
  1267 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
       
  1268 \emph{not} help with limiting the growth of the derivatives shown
       
  1269 above: the growth is slowed, but the derivatives can still grow rather
       
  1270 quickly beyond any finite bound.
       
  1271 
       
  1272 
       
  1273 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
       
  1274 \cite{Sulzmann2014} where they introduce bitcoded
       
  1275 regular expressions. In this version, POSIX values are
       
  1276 represented as bitsequences and such sequences are incrementally generated
       
  1277 when derivatives are calculated. The compact representation
       
  1278 of bitsequences and regular expressions allows them to define a more
       
  1279 ``aggressive'' simplification method that keeps the size of the
       
  1280 derivatives finite no matter what the length of the string is.
       
  1281 They make some informal claims about the correctness and linear behaviour
       
  1282 of this version, but do not provide any supporting proof arguments, not
       
  1283 even ``pencil-and-paper'' arguments. They write about their bitcoded
       
  1284 \emph{incremental parsing method} (that is the algorithm to be formalised
       
in this thesis):
       
  1286 
       
  1287 
       
  1288 \begin{quote}\it
       
  1289   ``Correctness Claim: We further claim that the incremental parsing
       
  1290   method [..] in combination with the simplification steps [..]
       
  1291   yields POSIX parse trees. We have tested this claim
       
  1292   extensively [..] but yet
       
  1293   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
       
  1294 \end{quote}  
       
  1295 
       
  1296 
       
  1297 
       
  1298 
       
\section{Background}
       
  1300 %Regular expression matching and lexing has been 
       
  1301 % widely-used and well-implemented
       
  1302 %in software industry. 
       
  1303 %TODO: expand the above into a full paragraph
       
  1304 %TODO: look up snort rules to use here--give readers idea of what regexes look like
       
  1305 
       
  1306 
       
Theoretical results say that regular expression matching
should be linear with respect to the length of the input.
For certain classes of regular expressions and inputs, however,
practical implementations suffer from non-linear or even
exponential running time,
making a ReDoS (regular expression denial-of-service) attack possible.
       
  1313 
       
  1314 
       
  1315 %----------------------------------------------------------------------------------------
       
  1316 
       
  1317 \section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking}
       
  1318 
       
The reason behind this is that regex libraries in popular language
engines often want to support richer constructs
than the most basic regular expressions, and need lexing rather than
matching for sub-match extraction.
       
  1323 
       
There is also static analysis work on regular expressions that
have potentially exponential behaviour. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Static analysis methods have also been developed for
generating non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.
       
  1334 For a comprehensive analysis, please refer to Weideman's thesis 
       
  1335 \parencite{Weideman2017Static}.
       
  1336 
       
  1337 \subsection{DFA Approach}
       
  1338 
       
Converting the regular expression into a DFA first guarantees matching
time linear in the length of the input, but the construction itself can
lead to an exponential number of states.
       
  1340 
       
  1341 \subsection{NFA Approach}
       
Simulating the NFA by trying the alternatives one after the other leads
to backtracking, which can take exponential time in the worst case.
       
  1343 
       
  1344 
       
  1345 
       
  1346 %----------------------------------------------------------------------------------------
       
  1347 
       
  1348 \section{Our Approach}
       
  1349 In the last fifteen or so years, Brzozowski's derivatives of regular
       
  1350 expressions have sparked quite a bit of interest in the functional
       
  1351 programming and theorem prover communities.  The beauty of
       
  1352 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
       
  1353 expressible in any functional language, and easily definable and
       
  1354 reasoned about in theorem provers---the definitions just consist of
       
  1355 inductive datatypes and simple recursive functions.  Derivatives of a
       
  1356 regular expression, written $r \backslash c$, give a simple solution
       
  1357 to the problem of matching a string $s$ with a regular
       
  1358 expression $r$: if the derivative of $r$ w.r.t.\ (in
       
  1359 succession) all the characters of the string matches the empty string,
       
  1360 then $r$ matches $s$ (and {\em vice versa}).  
       
  1361 
       
  1362 
       
This work aims to address the above vulnerability by combining
Brzozowski's derivatives with interactive theorem proving. We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee of correctness and
running time in the form of an Isabelle/HOL proof.
We then strengthen the algorithm with an even more aggressive version of
simplification, and prove a time bound that is linear in the length of
the input and cubic in the size of the regular expression, using a
technique by Antimirov.
       
  1372 
       
  1373 \subsection{Existing Work}
       
  1374 We are aware
       
  1375 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
       
  1376 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
       
  1377 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
       
  1378 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
       
  1379 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
       
  1380 
       
  1381 %----------------------------------------------------------------------------------------
       
  1382 
       
  1383 \section{What this Template Includes}
       
  1384 
       
  1385 \subsection{Folders}
       
  1386 
       
  1387 This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory:
       
  1388 
       
  1389 \keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory.
       
  1390 
       
  1391 \keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. Each chapter should go in its own separate \file{.tex} file and they can be split as:
       
  1392 \begin{itemize}
       
  1393 \item Chapter 1: Introduction to the thesis topic
       
  1394 \item Chapter 2: Background information and theory
       
  1395 \item Chapter 3: (Laboratory) experimental setup
       
  1396 \item Chapter 4: Details of experiment 1
       
  1397 \item Chapter 5: Details of experiment 2
       
  1398 \item Chapter 6: Discussion of the experimental results
       
  1399 \item Chapter 7: Conclusion and future directions
       
  1400 \end{itemize}
       
  1401 This chapter layout is specialised for the experimental sciences, your discipline may be different.
       
  1402 
       
  1403 \keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document.
       
  1404 
       
  1405 \subsection{Files}
       
  1406 
       
  1407 Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about:
       
  1408 
       
  1409 \keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do.
       
  1410 
       
  1411 \keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis. 
       
  1412 
       
  1413 \keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version.
       
  1414 
       
  1415 \keyword{main.tex} -- this is an important file. This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis!
       
  1416 
       
  1417 Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include:
       
  1418 
       
  1419 \keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
       
  1420 
       
  1421 \keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis.
       
  1422 
       
  1423 \keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file.
       
  1424 
       
  1425 \keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section.
       
  1426 
       
  1427 \keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file.
       
  1428 
       
  1429 \keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section.
       
  1430 
       
  1431 \keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
       
  1432 
       
  1433 So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them.
       
  1434 
       
  1435 %----------------------------------------------------------------------------------------
       
  1436 
       
  1437 \section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile}
       
  1438 
       
  1439 You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment.
       
  1440 
       
  1441 Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots
       
  1442 
       
  1443 Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb|\href{url}{name}| and only leave the name.
       
  1444 
       
  1445 When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper!
       
  1446 
       
  1447 %----------------------------------------------------------------------------------------
       
  1448 
       
  1449 \section{The \code{main.tex} File Explained}
       
  1450 
       
  1451 The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it.
       
  1452 
       
  1453 Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block.
       
  1454 
       
  1455 Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from.
       
  1456 
       
  1457 Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space.
       
  1458 
       
  1459 Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor).
       
  1460 
       
  1461 The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols.
       
  1462 
       
  1463 The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes.
       
  1464 
       
  1465 The next page contains a one line dedication. Who will you dedicate your thesis to?
       
  1466 
       
  1467 Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder.
       
  1468 
       
  1469 After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place.
       
  1470 
       
  1471 %----------------------------------------------------------------------------------------
       
  1472 
       
  1473 \section{Thesis Features and Conventions}\label{ThesisConventions}
       
  1474 
       
  1475 To get the best out of this template, there are a few conventions that you may want to follow.
       
  1476 
       
  1477 One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method.
       
  1478 
       
  1479 \subsection{Printing Format}
       
  1480 
       
  1481 This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution.
       
  1482 
       
  1483 The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side.
       
  1484 
       
  1485 The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}.
       
  1486 
       
  1487 \subsection{Using US Letter Paper}
       
  1488 
       
  1489 The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}.
       
  1490 
       
  1491 Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins.
       
  1492 
       
  1493 \subsection{References}
       
  1494 
       
  1495 The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. Many reference managers allow you to simply drag the reference into the document as you type.
       
  1496 
       
  1497 Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.}
       
  1498 
       
  1499 The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations).
       
  1500 
       
  1501 \subsubsection{A Note on bibtex}
       
  1502 
       
  1503 The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors.
       
\subsection{Tables}

Tables are an important way of displaying your results. Below is an example table, which was generated with this code:

{\small
\begin{verbatim}
\begin{table}
\caption{The effects of treatments X and Y on the four groups studied.}
\label{tab:treatments}
\centering
\begin{tabular}{l l l}
\toprule
\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
\midrule
1 & 0.2 & 0.8\\
2 & 0.17 & 0.7\\
3 & 0.24 & 0.75\\
4 & 0.68 & 0.3\\
\bottomrule
\end{tabular}
\end{table}
\end{verbatim}
}
       
\begin{table}
\caption{The effects of treatments X and Y on the four groups studied.}
\label{tab:treatments}
\centering
\begin{tabular}{l l l}
\toprule
\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
\midrule
1 & 0.2 & 0.8\\
2 & 0.17 & 0.7\\
3 & 0.24 & 0.75\\
4 & 0.68 & 0.3\\
\bottomrule
\end{tabular}
\end{table}
       
You can reference tables with \verb|\ref{<label>}| where the label is defined within the table environment. See \file{Chapter1.tex} for an example of the label and citation (e.g.\ Table~\ref{tab:treatments}).
       
\subsection{Figures}

There will hopefully be many figures in your thesis (they should be placed in the \emph{Figures} folder). The way to insert figures into your thesis is to use a code template like this:
\begin{verbatim}
\begin{figure}
\centering
\includegraphics{Figures/Electron}
\decoRule
\caption[An Electron]{An electron (artist's impression).}
\label{fig:Electron}
\end{figure}
\end{verbatim}
Also look in the source file. Putting this code into the source file produces the picture of the electron that you can see in the figure below.
       
\begin{figure}[th]
\centering
\includegraphics{Figures/Electron}
\decoRule
\caption[An Electron]{An electron (artist's impression).}
\label{fig:Electron}
\end{figure}

Figures do not always appear exactly where you write them in the source. The placement depends on how much space there is on the page for the figure. Sometimes there is not enough room to fit a figure directly where it should go (in relation to the text), so \LaTeX{} puts it at the top of the next page. Positioning figures is the job of \LaTeX{}, so you should only worry about making them look good!
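If you do need to influence where a figure goes, the optional argument of the \texttt{figure} environment accepts the standard \LaTeX{} placement specifiers (this is stock \LaTeX{} behaviour, not something specific to this template):

\begin{verbatim}
% h = here, t = top of page, b = bottom of page,
% p = dedicated float page; ! relaxes LaTeX's placement rules
\begin{figure}[htbp]
...
\end{figure}
\end{verbatim}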
       
Figures usually should have captions, just in case you need to refer to them (such as in Figure~\ref{fig:Electron}). The \verb|\caption| command contains two parts: the first part, inside the square brackets, is the title that will appear in the \emph{List of Figures}, and so should be short. The second part, in the curly brackets, should contain the longer and more descriptive caption text.

The \verb|\decoRule| command is optional and simply puts an aesthetic horizontal line below the image. If you do this for one image, do it for all of them.

\LaTeX{} is capable of using images in pdf, jpg and png format.
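The \verb|\includegraphics| command also accepts the standard \texttt{graphicx} sizing options, which are often needed to make a figure fit the text width (standard \texttt{graphicx} behaviour, independent of this template):

\begin{verbatim}
\includegraphics[width=0.8\textwidth]{Figures/Electron}
\end{verbatim}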
       
\subsection{Typesetting mathematics}

If your thesis is going to contain heavy mathematical content, rest assured that \LaTeX{} will make it look beautiful, even though it won't be able to solve the equations for you.

The \enquote{Not So Short Introduction to \LaTeX} (available on \href{http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf}{CTAN}) should tell you everything you need to know for most cases of typesetting mathematics. If you need more information, a much more thorough mathematical guide is available from the AMS, called \enquote{A Short Math Guide to \LaTeX}, and can be downloaded from:
\url{ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf}

There are many different \LaTeX{} symbols to remember; luckily, you can find the most common ones in \href{http://ctan.org/pkg/comprehensive}{The Comprehensive \LaTeX~Symbol List}.
       
You can write an equation, which is automatically given an equation number by \LaTeX{}, like this:
\begin{verbatim}
\begin{equation}
E = mc^{2}
\label{eqn:Einstein}
\end{equation}
\end{verbatim}

This will produce Einstein's famous energy-matter equivalence equation:
\begin{equation}
E = mc^{2}
\label{eqn:Einstein}
\end{equation}
       
All equations you write (which are not in the middle of paragraph text) are automatically given equation numbers by \LaTeX{}. If you don't want a particular equation numbered, use the unnumbered form:
\begin{verbatim}
\[ a^{2}=4 \]
\end{verbatim}
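If the preamble loads the \texttt{amsmath} package (a common assumption; check \file{main.tex}), the starred environments give the same unnumbered behaviour for aligned, multi-line displays:

\begin{verbatim}
\begin{align*}
(a + b)^{2} &= a^{2} + 2ab + b^{2} \\
            &= a^{2} + b^{2} + 2ab
\end{align*}
\end{verbatim}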
       
%----------------------------------------------------------------------------------------

\section{Sectioning and Subsectioning}

You should break your thesis up into nice, bite-sized sections and subsections. \LaTeX{} automatically builds a Table of Contents by looking at all the \verb|\chapter{}|, \verb|\section{}| and \verb|\subsection{}| commands you write in the source.

The Table of Contents should only list the sections to three (3) levels. A \verb|\chapter{}| is level zero (0). A \verb|\section{}| is level one (1) and so a \verb|\subsection{}| is level two (2). In your thesis it is likely that you will even use a \verb|\subsubsection{}|, which is level three (3). The depth to which the Table of Contents is formatted is set within \file{MastersDoctoralThesis.cls}. If you need this changed, you can do it in \file{main.tex}.
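The standard \LaTeX{} counter controlling this depth is \texttt{tocdepth}; overriding it from \file{main.tex} typically looks like the following (how the class itself sets the default may differ):

\begin{verbatim}
% 0 = chapters, 1 = sections, 2 = subsections, 3 = subsubsections
\setcounter{tocdepth}{3}
\end{verbatim}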
       
%----------------------------------------------------------------------------------------

\section{In Closing}

You have reached the end of this mini-guide. You can now rename or overwrite this pdf file and begin writing your own \file{Chapter1.tex} and the rest of your thesis. The easy work of setting up the structure and framework has been taken care of for you. It's now your job to fill it out!

Good luck and have lots of fun!

\begin{flushright}
Guide written by ---\\
Sunil Patel: \href{http://www.sunilpatel.co.uk}{www.sunilpatel.co.uk}\\
Vel: \href{http://www.LaTeXTemplates.com}{LaTeXTemplates.com}
\end{flushright}