% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}


\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}

\def\DFA{\textit{DFA}}
\def\bmkeps{\textit{bmkeps}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.
Regular expressions are widely used in modern-day programming tasks.
Be it IDEs with syntax highlighting and auto-completion,
command line tools like $\mathit{grep}$ that facilitate easy
processing of text by search and replace, network intrusion
detection systems that reject suspicious traffic, or compiler
front ends---there is always an important phase of the
task that involves matching a regular
expression with a string.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.

The regex engine of any popular programming language
lets you supply regexes and strings of your own,
and returns matching/lexing information such as which
sub-part of the regex matches which sub-part of the string.
These lexing libraries are on average quite fast.
%TODO: get source for SNORT/BRO's regex matching engine/speed
For example, the regex engines used by some network intrusion detection
systems are able to process
megabytes or even gigabytes of network traffic per second.

Why, then, do we need our own version if the existing algorithms are
blindingly fast already? It turns out that this is not always the case.


Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings of 28
$a$'s, one discovers that this decision takes around 30 seconds and
takes considerably longer when adding a few more $a$'s, as the graphs
below show:

\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}


This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns.


This superlinear blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable.
A more
recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the
outage had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually quite widespread.
They occur so often that they have been given a
name---Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\parencite{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi.
They therefore concluded that evil regular expressions
are more than ``a parlour trick'': they are a problem that
requires more research attention.

\section{Why are current algorithms for regexes slow?}

Theoretical results say that regular expression matching
should be linear with respect to the input.
One could construct
an NFA via the Thompson construction, and simulate running it;
this simulation is linear in the length of the input.
Or one could determinise the NFA into a DFA, and minimise that
DFA. Once the DFA is built, the running time is also linear, requiring just
one scanning pass over the input.

But modern regex libraries in popular language engines
often want to support richer constructs,
such as bounded repetitions and back-references,
rather than only the most basic regular expressions.
%put in illustrations
%a{1}{3}
Such constructs can make the DFA approach infeasible because
of an exponential explosion in the number of states.
These libraries also want to support lexing rather than just matching
for tasks that involve text processing.

Existing regex libraries either pose restrictions on the user input, or do
not give a linear running time guarantee.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
For example, the Rust regex engine claims to be linear,
but does not support lookarounds and back-references.
The GoLang regex library does not support over 1000 repetitions.
Java and Python both support back-references, but show
catastrophic backtracking behaviour even on inputs without back-references
(when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
Another problem with these libraries is that there
is no correctness guarantee.
In some cases they either fail to generate a lexing result when there is a match,
or produce the wrong match.


It turns out that regex libraries not only suffer from
exponential backtracking problems,
but also produce undesired (or even buggy) outputs.
%TODO: comment from who
It has been commented that most regex libraries do not
correctly implement the POSIX (maximal-munch)
rule of regular expression matching.
A concrete example is
the regex
\begin{verbatim}
(((((a*a*)b*)b){20})*)c
\end{verbatim}
and the string
\begin{verbatim}
baabaabababaabaaaaaaaaababaa
aababababaaaabaaabaaaaaabaab
aabababaababaaaaaaaaababaaaa
babababaaaaaaaaaaaaac
\end{verbatim}

This seemingly complex regex simply says ``some $a$'s
followed by some $b$'s, then followed by one single $b$,
and this iterates 20 times, finally followed by a $c$''.
A POSIX match would involve the entire string, ``eating up''
all the $b$'s in it.
%TODO: give a coloured example of how this matches POSIXly

This regex triggers catastrophic backtracking in
languages like Python and Java,
whereas it gives a correct but uninformative (non-POSIX)
match in languages like Go or .NET---the match consisting of only the
character $c$.

Backtracking, or depth-first search, might give us
exponential running time, and quite a few tools
avoid that by determinising the $\mathit{NFA}$
into a $\mathit{DFA}$ and minimising it.
For example, $\mathit{lex}$ and $\mathit{JFlex}$ are tools
for $C$ and $\mathit{Java}$ that generate $\mathit{DFA}$-based
lexers.
However, they do not scale well with bounded repetitions.
Bounded repetitions, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denote regular expressions accepting strings
that can be divided into $c$ substrings, where each
substring is in $L(r)$.
%TODO: draw example NFA.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like this:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state,accepting](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
          edge [loop below] node {a,b} ()
    (q_1) edge  node {a,b} (q_2)
    (q_2) edge  node {a,b} (q_3);
\end{tikzpicture}
\end{center}
The red states are ``counter states'' which count down
the number of characters still needed, in addition to the current
string, to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match two more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed one character, and it just waits
for one more character to complete.
Depending on the actual characters appearing in the
input string, the states $q_1$ and $q_2$ may or may
not be active, independently from each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least four non-equivalent states that cannot be merged,
because there are at least four subset states indicating which of
$q_1$ and $q_2$ are active: $\emptyset$, $\{q_1\}$,
$\{q_2\}$, $\{q_1, q_2\}$.
Generalising this to regular expressions with larger
bounded repetition numbers, an expression of the shape $r^*ar^{\{n\}}$
in general requires at least $2^n$ states
in a $\mathit{DFA}$, in order to represent all the different
configurations of ``countdown'' states.
For those regexes, tools such as $\mathit{JFlex}$
would generate gigantic $\mathit{DFA}$'s or even
run out of memory.

For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.
One can simulate an $\mathit{NFA}$ in two ways.
The first is to keep track of all active states after consuming
a character, and to update that set of states iteratively.
This is a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation, and guarantee a linear runtime
in terms of input string length.
The other way to use an $\mathit{NFA}$ for matching is to take
a single state in a path each time, and backtrack if that path
fails. This is a depth-first search that can end up
with exponential run time.
The reason behind the backtracking algorithms in languages like
Java and Python is that they support back-references.
\subsection{Back-references in Regexes---the Non-Regular Part}
If we label sub-expressions by parenthesising them and give
them a number according to the order in which their opening parentheses appear,
$\underset{1}{(}\ldots\underset{2}{(}\ldots\underset{3}{(}\ldots\underset{4}{(}\ldots)\ldots)\ldots)\ldots)$,
then we can use the following syntax to require that a string matched by a
sub-expression appears again, exactly, at a certain location:
$(.*)\backslash 1$
would match strings like $\mathit{bobo}$, $\mathit{weewee}$ and so on.

Back-references are a construct in the ``regex'' standard
that programmers find quite useful, but that is not
regular any more.
In fact, back-references allow regexes to express
languages that cannot even be captured by context-free
languages.
For example, the back-reference expression $(a^*)b\backslash 1 b\backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars.
To express such a language one would need context-sensitive
grammars, and matching/parsing for such grammars is NP-hard in
general.
%TODO:cite reference for NP-hardness of CSG matching
For such constructs, the most intuitive way of matching is to
use backtracking algorithms, and there are no known algorithms
that avoid backtracking.
%TODO:read a bit more about back reference algorithms




\section{Our Solution---Brzozowski Derivatives}



Is it possible to have a regex lexing algorithm with proven correctness and
proven time complexity, one that allows easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references?

We propose Brzozowski's derivatives as a solution to this problem.

The main contribution of this thesis is a proven-correct lexing algorithm
with formalised time bounds.
To the best of our knowledge, there are no lexing libraries using Brzozowski
derivatives that come with a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu proposed an algorithm for which they
claimed a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalised the correctness of the lexer, but not its complexity.
In the performance evaluation section, they analysed the run time
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.
We therefore believe that our proof of performance on general
inputs, rather than on specific examples, is a novel contribution.\\

\section{Preliminaries about Lexing Using Brzozowski Derivatives}
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.
The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions.

Given an alphabet $\Sigma$, the set of all strings whose characters
are drawn from $\Sigma$
is written $\Sigma^*$.

We use patterns to define sets of strings concisely; regular expressions
are one such pattern system.
Basic regular expressions are defined inductively
by the following grammar:
\[ r ::= \ZERO \mid \ONE
         \mid c
         \mid r_1 \cdot r_2
         \mid r_1 + r_2
         \mid r^*
\]

The language, or set of strings, denoted by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; \ZERO$ & $\dn$ & $\emptyset$\\
$L \; \ONE$ & $\dn$ & $\{[]\}$\\
$L \; c$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $\{s_1 @ s_2 \mid s_1 \in L \; r_1 \wedge s_2 \in L \; r_2\}$\\
$L \; (r^*)$ & $\dn$ & $(L \; r)^*$\\
\end{tabular}
\end{center}
This is also called the ``language interpretation'' of regular expressions.



The Brzozowski derivative w.r.t.\ a character $c$ is an operation on regexes:
it transforms a regex into a new one whose language consists of
the strings of the original language that start with $c$, but with this
head character $c$ removed.

Formally, we first define such a transformation on an arbitrary set of
strings, which we call the semantic derivative:
\begin{center}
$\Der \; c\; A = \{s \mid c :: s \in A\}$
\end{center}
where $A$ is a set of strings.
If the set $A$ happens to have some structure, for example
if it is regular, then its semantic derivative is regular as well and
can be computed syntactically on a regular expression describing $A$.

The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression $r$
and a character $c$, and returns another regular expression,
computed by the following recursive function:

\begin{center}
\begin{tabular}{lcl}
$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
$d \backslash c$     & $\dn$ &
$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

\noindent
The $\nullable$ function tests whether the empty string $""$
is in the language of $r$:


\begin{center}
\begin{tabular}{lcl}
$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}
\noindent
The empty-set regular expression does not contain any string, and
therefore not the empty string. The empty-string
regular expression contains the empty string
by definition. The character regular expression
denotes the singleton language containing that character only,
and therefore does not contain the empty string. The alternative
regular expression (or ``or'' expression)
is nullable if one of its children is nullable---any
one of its children being nullable suffices.
The sequence regular expression
requires both children to contain the empty string
in order to compose an empty string, and the Kleene star
operation naturally introduces the empty string.

We can give the meaning of regular expression derivatives
by the language interpretation: the language of the derivative
is exactly the semantic derivative of the language, that is,

\begin{center}
$L(r \backslash c) = \Der\;c\;(L(r))$.
\end{center}

\noindent
The derivative function, written $\backslash c$,
defines how a regular expression evolves into
a new regular expression when the head character $c$ is
chopped off all the strings it matches.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be the target regular expression whose head character is
chopped off.
The star case unwraps one iteration of the
regular expression and attaches the star regular expression
to its back again, to make sure that zero or more iterations
can follow this unfolded iteration.


The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; \nullable(r\backslash s)
\]

\noindent
Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$,
this algorithm presented graphically is as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
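
For illustration, this matcher fits in a few lines of Scala. The
following is a minimal sketch; the datatype and function names are our
own choices, not those of any library:

\begin{verbatim}
// regular expressions as an inductive datatype
abstract class Rexp
case object ZERO extends Rexp                  // matches nothing
case object ONE extends Rexp                   // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// whether r can match the empty string
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// extension to strings, and the matcher itself
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}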

This is a beautiful and simple definition.

If we implement the above algorithm naively, however,
the algorithm can be excruciatingly slow. For example, when starting with the regular
expression $(a + aa)^*$ and building 12 successive derivatives
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree). Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm.

Brzozowski was quick to notice that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, making the naive algorithm far from optimal.
He therefore introduced some ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and further decrease the size
of the intermediate regexes.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step, and compose
these two operations together as an atomic one: $r \backslash_{simp}\,c \dn
\textit{simp}(r \backslash c)$. Then we can build
a matcher without cumbersome regular expressions.
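
As an illustration, one simple version of such a simplification step in
Scala might look as follows; this is only a sketch of the basic
$\ZERO$/$\ONE$ rules and the $r+r$ rule, reusing the \texttt{Rexp}
definitions from the sketch above:

\begin{verbatim}
// one step of simplification: remove useless ZEROs and ONEs,
// and collapse syntactically equal alternatives
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s) => if (r1s == r2s) r1s else ALT(r1s, r2s)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (_, ZERO) => ZERO
    case (ONE, r2s) => r2s
    case (r1s, ONE) => r1s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case r => r
}

// derivative composed with simplification, and the improved matcher
def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => dersSimp(cs, simp(der(c, r)))
}

def matcherSimp(r: Rexp, s: String): Boolean =
  nullable(dersSimp(s.toList, r))
\end{verbatim}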


If we want the size of derivatives in the algorithm to
stay even lower, we need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
delete duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
the same size bound as that of the \emph{partial derivatives}.

Building derivatives and then simplifying them: so far so good.
But what if we want to
do lexing instead of just getting a YES/NO answer?
This requires us to go back again to the world
without simplification first for a moment.
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


They first defined the datatypes for storing the
lexing information, called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$  \\
		\end{tabular}
	\end{tabular}
\end{center}

\noindent
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and so on.
Even for a regular expression matching a particular string, there can
still be more than one value corresponding to the match.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching
without allowing any value under a star to be flattened
to an empty string can be given by the following formula:
\begin{center}
	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
\end{center}
and a closed-form formula can be calculated to be
\begin{equation}
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which is clearly of exponential order.
A lexer aimed at producing all the possible values therefore has an exponential
worst-case runtime, and it is impractical to try to generate
all possible matches in a run. In practice, we are usually
interested in POSIX values, which by intuition always
match the leftmost regular expression when there is a choice,
and always match a sub-part as much as possible before proceeding
to the next token. For example, the above example has the POSIX value
$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from a lexing algorithm is such a POSIX match
encoded as a value.
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
is generated in case the regular expression matches the string.
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots according
to the characters $c_0$, $c_1$ until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
$\mkeps(r_{1}+r_{2})$	& $\dn$
& \textit{if} $\nullable(r_{1})$\\
& & \textit{then} $\Left(\mkeps(r_{1}))$\\
& & \textit{else} $\Right(\mkeps(r_{2}))$\\
$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}
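
\noindent
In Scala, values and $\mkeps$ can be sketched as follows; the
constructor names are our own choices (\texttt{Chr} and \texttt{Sequ}
avoid clashes with Scala's built-in \texttt{Char} and our earlier
\texttt{SEQ}):

\begin{verbatim}
// values recording how a regex matched a string
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val      // Chr, not Char: avoids clash
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val      // shadows scala.Left locally
case class Right(v: Val) extends Val     // shadows scala.Right locally
case class Stars(vs: List[Val]) extends Val

// mkeps: a POSIX value for how a nullable regex matches ""
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}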


\noindent
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. The property of being a POSIX value is maintained throughout the process.
For this, Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off; the second
is the character ${c_{i-1}}$ we want to inject; and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
The clauses basically do one thing---identify the ``hole'' in a
value into which the character can be injected back.
For instance, consider the last clause, which injects back into a value
corresponding to a star. We know the value must be a sequence value,
whose first component corresponds to the child regex of the star
with the first character chopped off---an iteration of the star
that has just been unfolded. This value is followed by the already
matched star iterations collected before. So we inject the character
back into the first value and form a new value, with this new iteration
added to the previous list of iterations, all under the $\Stars$
top level.
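
A sketch of $\textit{inj}$, and of the resulting two-phase lexer, in
Scala (assuming the \texttt{Rexp} and \texttt{Val} sketches above):

\begin{verbatim}
// inject character c back into value v (a value for der(c, r)),
// obtaining a value for r itself
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// the two-phase lexer: derivatives forwards, injections backwards
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r)
    else throw new Exception("lexing error: no match")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}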

We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well---by definition they
reflect the size of the regular expressions.

One can introduce simplification on the regexes and values, but one has to
be careful not to break the correctness, as the injection
function heavily relies on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative step, and applying these rectifications
during the injection phase.
One can then prove that the POSIX value of how
regular expressions match strings is not affected---although this is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.

%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with
%automaton, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots v_n$,
and obtain the lexing information that is already there while
doing derivatives, in one pass and without a second injection phase?
In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
attaching additional information to the
regular expressions, called \emph{bitcodes}.

\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b::bs
$
\end{center}

\noindent
The $1$ and $0$ are not in bold in order to avoid
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
Here $\textit{code}$ encodes a value into a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of a non-empty
star iteration by $1$. The border where a local star terminates
is marked by $0$. This coding is lossy, as it throws away the information about
characters, and also does not encode the ``boundary'' between two
sequence values. Moreover, with only the bitcode we cannot even tell
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
reason for choosing this compact way of storing information is that the
relatively small size of bits can be easily manipulated and ``moved
around'' in a regular expression. In order to recover values, we will
need the corresponding regular expression as extra information. This
means the decoding function is defined as:


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\

  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
%\end{definition}
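
\noindent
A Scala sketch of the coding and decoding functions, building on the
\texttt{Val} sketch from before; bits are represented as \texttt{Int}s
for readability, and \texttt{decode2} plays the role of
$\textit{decode}'$:

\begin{verbatim}
type Bits = List[Int]   // each element is 0 or 1

// encode a value as a bitcode
def code(v: Val): Bits = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Left(v1) => 0 :: code(v1)
  case Right(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}

// decode a bitcode back into a value, guided by the regex r;
// returns the value together with the still unconsumed bits
def decode2(bs: Bits, r: Rexp): (Val, Bits) = (bs, r) match {
  case (_, ONE) => (Empty, bs)
  case (_, CHAR(c)) => (Chr(c), bs)
  case (0 :: bs1, ALT(r1, _)) =>
    { val (v, bs2) = decode2(bs1, r1); (Left(v), bs2) }
  case (1 :: bs1, ALT(_, r2)) =>
    { val (v, bs2) = decode2(bs1, r2); (Right(v), bs2) }
  case (_, SEQ(r1, r2)) =>
    { val (v1, bs1) = decode2(bs, r1)
      val (v2, bs2) = decode2(bs1, r2)
      (Sequ(v1, v2), bs2) }
  case (0 :: bs1, STAR(_)) => (Stars(Nil), bs1)
  case (1 :: bs1, STAR(r1)) =>
    { val (v, bs2) = decode2(bs1, r1)
      val (Stars(vs), bs3) = decode2(bs2, STAR(r1))
      (Stars(v :: vs), bs3) }
}

def decode(bs: Bits, r: Rexp): Option[Val] =
  decode2(bs, r) match { case (v, Nil) => Some(v); case _ => None }
\end{verbatim}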

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:%\comment{ALTS should have  an $as$ in the definitions, not just $a_1$ and $a_2$}

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\sum\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}
\end{center}
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}
\end{center}
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
     $_{bs@bs'}\sum\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}
\end{center}
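
\noindent
Annotated regular expressions, $\textit{fuse}$ and internalisation can
be sketched in Scala as follows (the constructor names are again our
own choices):

\begin{verbatim}
// annotated regular expressions: each node (except AZERO)
// carries a bitcode bs at the front
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// attach bits to the front of an annotated regex
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs2) => AONE(bs ::: bs2)
  case ACHAR(bs2, c) => ACHAR(bs ::: bs2, c)
  case AALTS(bs2, as) => AALTS(bs ::: bs2, as)
  case ASEQ(bs2, a1, a2) => ASEQ(bs ::: bs2, a1, a2)
  case ASTAR(bs2, a1) => ASTAR(bs ::: bs2, a1)
}

// internalise: lift a plain regex into an annotated one
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(0), internalise(r1)),
                    fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}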

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expressions. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:


\iffalse
 %\begin{definition}{bder}
\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
  	 $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
       (\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
  	 $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
       (_{bs}\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
%\end{definition}
\fi

\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
  	 $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}(\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
       (_{[]}a^*)$
\end{tabular}
\end{center}

%\end{definition}
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $a \backslash c$
to indicate that there is one more star iteration. The sequence clause
is more subtle---when $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions; therefore we omit the
definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later),
then the right branch of the alternative, which is $\textit{fuse} \; (\bmkeps \; a_1)\; (a_2
\backslash c)$, will collapse the regular expression $a_1$ (as it has
already been fully matched) and store the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it to the front. The
bitsequence $\textit{bs}$, which was initially attached to the
first element of the sequence $a_1 \cdot a_2$, has
now been elevated to the top level of $\sum$, as this information will be
needed whichever way the sequence is matched---no matter whether $c$ belongs
to $a_1$ or $a_2$. After building these derivatives and maintaining all
the lexing information, we complete the lexing by collecting the
bitcodes using a generalised version of the $\textit{mkeps}$ function
for annotated regular expressions, called $\textit{bmkeps}$:


%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\sum (a::\textit{as}))$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
     $bs \,@\, [0]$
\end{tabular}
\end{center}
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the generalisation of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
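
Putting these pieces together, a Scala sketch of $\textit{bnullable}$,
$\textit{bmkeps}$, the bit-coded derivative and $\textit{blexer}$ might
look like this (assuming the earlier sketches):

\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// bitcode for how a bnullable regex matches the empty string
// (only defined on bnullable regexes, as in the maths above)
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ::: bmkeps(a1)
    else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _) => bs ::: List(0)
}

// derivative on annotated regexes
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}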

Remember that one of the important reasons we introduced bitcodes
is that they can make simplification more structured, and therefore
make it possible to guarantee correctness.

\subsection*{Our Simplification Rules}

In this section we introduce aggressive (in terms of size) simplification rules
on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as we have
generated test data showing
that a good tight bound can be achieved. Obviously we could only
partially cover the search space, as there are infinitely many regular
expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our simplification function
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on.
%Is it $ALTS$ or $ALTS$?}\\

\begin{center}
  \begin{tabular}{@{}lcl@{}}

  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\

  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
   &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse}\;bs\;a$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow  _{bs}\sum \textit{as'}$\\

   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
|
\noindent
The simplification function pattern-matches on the regular expression.
When it encounters an alternative or a sequence, it first simplifies the
children recursively and then checks whether one of them has turned into
$\ZERO$ or $\ONE$, which may trigger further simplification at the
current level. The most involved part is the $\sum$ clause, where we use
two auxiliary functions, $\textit{flatten}$ and $\textit{distinct}$, to
open up nested alternatives and to remove as many duplicates as
possible. Function $\textit{distinct}$ keeps only the first occurrence
of each regular expression and removes all later duplicates. Function
$\textit{flatten}$ opens up nested $\sum$s; its recursive definition is
given below:
|
1168 |
|
1169 \begin{center} |
|
1170 \begin{tabular}{@{}lcl@{}} |
|
1171 $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
|
1172 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
|
1173 $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ |
|
1174 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) |
|
1175 \end{tabular} |
|
1176 \end{center} |
|
1177 |
|
\noindent
Here $\textit{flatten}$ behaves like the traditional flatten function
from functional programming, except that it also removes $\ZERO$s. In
terms of regular expressions, it removes parentheses, for example
changing $a+(b+c)$ into $a+b+c$.
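For concreteness, the clauses above can be transcribed almost directly
into Scala, the language of our prototype. The sketch below is only
illustrative: the constructor names and the representation of bitcodes
as boolean lists are assumptions made here, it glosses over where
exactly the bits of an eliminated $\ONE$ end up, and $\textit{distinct}$
is realised by Scala's built-in \texttt{distinct} on lists, which keeps
only the first occurrence of each element.

{\small
\begin{verbatim}
// illustrative datatype for annotated regular expressions;
// bitcodes are represented as boolean lists
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Boolean]) extends ARexp
case class ACHAR(bs: List[Boolean], c: Char) extends ARexp
case class AALTS(bs: List[Boolean], rs: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Boolean], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Boolean], r: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: List[Boolean], r: ARexp): ARexp = r match {
  case AZERO => AZERO
  case AONE(cs) => AONE(bs ++ cs)
  case ACHAR(cs, c) => ACHAR(bs ++ cs, c)
  case AALTS(cs, rs) => AALTS(bs ++ cs, rs)
  case ASEQ(cs, r1, r2) => ASEQ(bs ++ cs, r1, r2)
  case ASTAR(cs, r1) => ASTAR(bs ++ cs, r1)
}

// open up nested alternatives and delete AZEROs
def flatten(rs: List[ARexp]): List[ARexp] = rs match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, rs1) :: rest => rs1.map(fuse(bs, _)) ++ flatten(rest)
  case r :: rest => r :: flatten(rest)
}

def simp(r: ARexp): ARexp = r match {
  case ASEQ(bs, r1, r2) => (simp(r1), simp(r2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(_), r2s) => fuse(bs, r2s)   // as in the table above
    case (r1s, AONE(_)) => fuse(bs, r1s)   // as in the table above
    case (r1s, r2s) => ASEQ(bs, r1s, r2s)
  }
  case AALTS(bs, rs) => flatten(rs.map(simp)).distinct match {
    case Nil => AZERO
    case r1 :: Nil => fuse(bs, r1)
    case rs1 => AALTS(bs, rs1)
  }
  case _ => r
}
\end{verbatim}
}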
|
1182 |
|
Having defined the $\simp$ function,
we can extend, as before, the derivative w.r.t.~a character
to the derivative w.r.t.~a string, this time applying
simplification after each step:
|
1187 |
|
1188 \begin{center} |
|
1189 \begin{tabular}{lcl} |
|
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(\textit{simp}\,(r \backslash c)) \backslash_{simp}\, s$ \\
|
1191 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
1192 \end{tabular} |
|
1193 \end{center} |
|
1194 |
|
1195 \noindent |
|
1196 to obtain an optimised version of the algorithm: |
|
1197 |
|
1198 \begin{center} |
|
1199 \begin{tabular}{lcl} |
|
1200 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
1201 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
1202 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
1203 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
1204 & & $\;\;\textit{else}\;\textit{None}$ |
|
1205 \end{tabular} |
|
1206 \end{center} |
|
1207 |
|
\noindent
This algorithm keeps the derivatives small: with this simplification,
the 8000 nodes of our previous $(a + aa)^*$ example are reduced to just
6, and the size then stays constant, no matter how long the input
string is.
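As an illustrative sketch, the optimised lexer can be written in Scala
as below, where \texttt{internalise} (for $r^\uparrow$), \texttt{bder},
\texttt{bnullable}, \texttt{bmkeps} and \texttt{decode} are assumed to
be the functions from the earlier sections (the names here are
placeholders for our prototype's actual definitions):

{\small
\begin{verbatim}
// derivative w.r.t. a string, with simplification interleaved
def bders_simp(a: ARexp, s: List[Char]): ARexp = s match {
  case Nil => a
  case c :: cs => bders_simp(simp(bder(c, a)), cs)
}

// the optimised lexer: None signals a non-match
def blexer_simp(r: Rexp, s: String): Option[Val] = {
  val a = bders_simp(internalise(r), s.toList)
  if (bnullable(a)) Some(decode(bmkeps(a), r)) else None
}
\end{verbatim}
}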
|
1213 |
|
1214 |
|
1215 |
|
1216 Derivatives give a simple solution |
|
1217 to the problem of matching a string $s$ with a regular |
|
1218 expression $r$: if the derivative of $r$ w.r.t.\ (in |
|
1219 succession) all the characters of the string matches the empty string, |
|
1220 then $r$ matches $s$ (and {\em vice versa}). |
|
1221 |
|
1222 |
|
1223 |
|
1224 However, there are two difficulties with derivative-based matchers: |
|
1225 First, Brzozowski's original matcher only generates a yes/no answer |
|
1226 for whether a regular expression matches a string or not. This is too |
|
1227 little information in the context of lexing where separate tokens must |
|
1228 be identified and also classified (for example as keywords |
|
1229 or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this |
|
1230 difficulty by cleverly extending Brzozowski's matching |
|
1231 algorithm. Their extended version generates additional information on |
|
1232 \emph{how} a regular expression matches a string following the POSIX |
|
1233 rules for regular expression matching. They achieve this by adding a |
|
1234 second ``phase'' to Brzozowski's algorithm involving an injection |
|
1235 function. In our own earlier work we provided the formal |
|
1236 specification of what POSIX matching means and proved in Isabelle/HOL |
|
1237 the correctness |
|
1238 of Sulzmann and Lu's extended algorithm accordingly |
|
1239 \cite{AusafDyckhoffUrban2016}. |
|
1240 |
|
1241 The second difficulty is that Brzozowski's derivatives can |
|
1242 grow to arbitrarily big sizes. For example if we start with the |
|
1243 regular expression $(a+aa)^*$ and take |
|
1244 successive derivatives according to the character $a$, we end up with |
|
1245 a sequence of ever-growing derivatives like |
|
1246 |
|
1247 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} |
|
1248 \begin{center} |
|
1249 \begin{tabular}{rll} |
|
1250 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
1251 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
1252 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ |
|
1253 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
1254 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
|
1255 \end{tabular} |
|
1256 \end{center} |
|
1257 |
|
1258 \noindent where after around 35 steps we run out of memory on a |
|
1259 typical computer (we shall define shortly the precise details of our |
|
1260 regular expressions and the derivative operation). Clearly, the |
|
1261 notation involving $\ZERO$s and $\ONE$s already suggests |
|
simplification rules that can be applied to regular
|
1263 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
|
1264 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
|
1265 r$. While such simple-minded simplifications have been proved in our |
|
1266 earlier work to preserve the correctness of Sulzmann and Lu's |
|
1267 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
|
1268 \emph{not} help with limiting the growth of the derivatives shown |
|
1269 above: the growth is slowed, but the derivatives can still grow rather |
|
1270 quickly beyond any finite bound. |
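This growth is easy to reproduce. The following self-contained Scala
sketch (the names are again only for illustration) computes the sizes
of successive derivatives of $(a + aa)^*$ w.r.t.\ the character $a$:

{\small
\begin{verbatim}
// plain (un-annotated) regular expressions
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative w.r.t. a character, without simplification
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// number of nodes in a regular expression
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
}

val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))  // (a + aa)*
// print the sizes of the first 20 derivatives w.r.t. 'a'
println(Iterator.iterate(r)(der('a', _)).take(20).map(size).toList)
\end{verbatim}
}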
|
1271 |
|
1272 |
|
1273 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm |
|
1274 \cite{Sulzmann2014} where they introduce bitcoded |
|
1275 regular expressions. In this version, POSIX values are |
|
1276 represented as bitsequences and such sequences are incrementally generated |
|
1277 when derivatives are calculated. The compact representation |
|
1278 of bitsequences and regular expressions allows them to define a more |
|
1279 ``aggressive'' simplification method that keeps the size of the |
|
1280 derivatives finite no matter what the length of the string is. |
|
1281 They make some informal claims about the correctness and linear behaviour |
|
1282 of this version, but do not provide any supporting proof arguments, not |
|
1283 even ``pencil-and-paper'' arguments. They write about their bitcoded |
|
1284 \emph{incremental parsing method} (that is the algorithm to be formalised |
|
in this thesis):
|
1286 |
|
1287 |
|
1288 \begin{quote}\it |
|
1289 ``Correctness Claim: We further claim that the incremental parsing |
|
1290 method [..] in combination with the simplification steps [..] |
|
1291 yields POSIX parse trees. We have tested this claim |
|
1292 extensively [..] but yet |
|
1293 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} |
|
1294 \end{quote} |
|
1295 |
|
1296 |
|
1297 |
|
1298 |
|
\section{Background}
|
1300 %Regular expression matching and lexing has been |
|
1301 % widely-used and well-implemented |
|
1302 %in software industry. |
|
1303 %TODO: expand the above into a full paragraph |
|
1304 %TODO: look up snort rules to use here--give readers idea of what regexes look like |
|
1305 |
|
1306 |
|
Theoretical results say that regular expression matching can be done in
time linear in the length of the input.
For certain classes of regular expressions and inputs, however,
practical implementations suffer from non-linear or even
exponential running time,
allowing for ReDoS (regular expression denial-of-service) attacks.
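Such behaviour can be observed directly. The sketch below times Java's
built-in backtracking engine, called from Scala, on the $(a^*)^*\,b$
example from the beginning of this chapter; at which input length the
running time explodes depends on the particular engine and its version:

{\small
\begin{verbatim}
import java.util.regex.Pattern

// Java's backtracking engine on the "evil" expression (a*)*b;
// on affected engines the last iterations may take very long
val evil = Pattern.compile("(a*)*b")

for (n <- 15 to 30 by 5) {
  val input = "a" * n
  val t0 = System.nanoTime()
  evil.matcher(input).matches()
  val ms = (System.nanoTime() - t0) / 1000000.0
  println(s"n = $n: $ms ms")
}
\end{verbatim}
}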
|
1313 |
|
1314 |
|
1315 %---------------------------------------------------------------------------------------- |
|
1316 |
|
1317 \section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking} |
|
1318 |
|
The reason behind this is that regex libraries in popular language
engines often support richer constructs
than the most basic regular expressions, and need lexing rather than
matching for sub-match extraction.
|
1323 |
|
There is also static analysis work on regular expressions that
exhibit potentially exponential behaviour. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Static analysis methods have also been developed for
generating non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.
For a comprehensive analysis, we refer to Weideman's thesis
\parencite{Weideman2017Static}.
|
1336 |
|
1337 \subsection{DFA Approach} |
|
1338 |
|
One way to avoid backtracking altogether is to convert the regular
expression into a DFA before matching. The problem is that this
construction can blow up the number of states exponentially: a classic
example is $(a+b)^*\,a\,(a+b)^n$, whose minimal DFA needs at least
$2^n$ states, because the automaton has to remember the last $n+1$
characters it has read. Such automata can be too large to build or
store in practice.
|
1340 |
|
1341 \subsection{NFA Approach} |
|
The alternative is to simulate an NFA directly. Practical engines
typically do this by a depth-first search with backtracking, which
avoids the state explosion but can take exponential time: on a failed
match the matcher may explore exponentially many ways of splitting the
input between the sub-expressions, which is exactly the catastrophic
behaviour behind ReDoS.
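To make this concrete, below is a minimal backtracking matcher written
in continuation-passing style, reusing the \texttt{Rexp} datatype from
the earlier sketch; it is only an illustration, but it already exhibits
the exponential running time on $(a^*)^*\,b$:

{\small
\begin{verbatim}
// bt(r, s, k) tries every way r can match a prefix of s and
// passes the remaining suffix to the continuation k
def bt(r: Rexp, s: List[Char], k: List[Char] => Boolean): Boolean =
  r match {
    case ZERO => false
    case ONE => k(s)
    case CHAR(c) => s match {
      case d :: rest if c == d => k(rest)
      case _ => false
    }
    case ALT(r1, r2) => bt(r1, s, k) || bt(r2, s, k)
    case SEQ(r1, r2) => bt(r1, s, rest => bt(r2, rest, k))
    case STAR(r1) =>
      // try the empty iteration first, then one more iteration;
      // the guard forces progress, so the matcher terminates,
      // but it may still explore exponentially many splits
      k(s) || bt(r1, s, rest => rest.length < s.length && bt(r, rest, k))
  }

def matches(r: Rexp, s: String): Boolean = bt(r, s.toList, _.isEmpty)

// (a*)*b needs exponentially many backtracking steps on a^n
val evil = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
println(matches(evil, "a" * 25))  // false, but already very slow
\end{verbatim}
}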
|
1343 |
|
1344 |
|
1345 |
|
1346 %---------------------------------------------------------------------------------------- |
|
1347 |
|
1348 \section{Our Approach} |
|
1349 In the last fifteen or so years, Brzozowski's derivatives of regular |
|
1350 expressions have sparked quite a bit of interest in the functional |
|
1351 programming and theorem prover communities. The beauty of |
|
1352 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly |
|
1353 expressible in any functional language, and easily definable and |
|
1354 reasoned about in theorem provers---the definitions just consist of |
|
inductive datatypes and simple recursive functions. As already
explained above, the derivative operation $r \backslash c$ reduces
the problem of matching a string with a regular expression to
taking successive derivatives and finally testing for nullability.
|
1361 |
|
1362 |
|
This work aims to address the ReDoS vulnerability described above by
combining Brzozowski's derivatives with interactive theorem proving. We
give an improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with formal guarantees about correctness and
running time in the form of an Isabelle/HOL proof.
We then strengthen the algorithm with an even more aggressive version
of simplification, and prove a time bound that is linear in the input
and cubic in the regular expression size, using a technique inspired by
Antimirov's partial derivatives.
|
1372 |
|
1373 \subsection{Existing Work} |
|
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based
matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one, in Isabelle/HOL, is
part of the work by Krauss and Nipkow \parencite{Krauss2011}; one
in Coq is given by Coquand and Siles \parencite{Coquand2012};
and Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
|
1380 |
|
1381 %---------------------------------------------------------------------------------------- |
|
1382 |
|