PhdThesisRealOne/LaTeXTemplates_masters-doctoral-thesis_v2/Chapters/Chapter1.tex
author Chengsong
Thu, 24 Mar 2022 20:52:34 +0000
changeset 465 2e7c7111c0be
parent 456 26a5e640cdd7
permissions -rwxr-xr-x
ha
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     1
% Chapter 1
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     2
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     3
\chapter{Introduction} % Main chapter title
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     4
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     5
\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1} 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     6
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     7
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     8
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
     9
% Define some commands to keep the formatting separated from the content 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    10
\newcommand{\keyword}[1]{\textbf{#1}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    11
\newcommand{\tabhead}[1]{\textbf{#1}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    12
\newcommand{\code}[1]{\texttt{#1}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    13
\newcommand{\file}[1]{\texttt{\bfseries#1}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    14
\newcommand{\option}[1]{\texttt{\itshape#1}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    15
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    16
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    17
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    18
\newcommand{\ZERO}{\mbox{\bf 0}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    19
\newcommand{\ONE}{\mbox{\bf 1}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    20
\def\lexer{\mathit{lexer}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    21
\def\mkeps{\mathit{mkeps}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    22
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    23
\def\DFA{\textit{DFA}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    24
\def\bmkeps{\textit{bmkeps}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    25
\def\retrieve{\textit{retrieve}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    26
\def\blexer{\textit{blexer}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    27
\def\flex{\textit{flex}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    28
\def\inj{\mathit{inj}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    29
\def\Empty{\mathit{Empty}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    30
\def\Left{\mathit{Left}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    31
\def\Right{\mathit{Right}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    32
\def\Stars{\mathit{Stars}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    33
\def\Char{\mathit{Char}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    34
\def\Seq{\mathit{Seq}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    35
\def\Der{\mathit{Der}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    36
\def\nullable{\mathit{nullable}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    37
\def\Z{\mathit{Z}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    38
\def\S{\mathit{S}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    39
\def\rup{r^\uparrow}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    40
\def\simp{\mathit{simp}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    41
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    42
\def\map{\mathit{map}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    43
\def\distinct{\mathit{distinct}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    44
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    45
%----------------------------------------------------------------------------------------
465
Chengsong
parents: 456
diff changeset
    46
%This part is about regular expressions, Brzozowski derivatives,
Chengsong
parents: 456
diff changeset
    47
%and a bit-coded lexing algorithm with proven correctness and time bounds.
Chengsong
parents: 456
diff changeset
    48
Regular expressions are widely used in modern day programming tasks.
Chengsong
parents: 456
diff changeset
    49
Be it IDE with syntax hightlighting and auto completion, 
Chengsong
parents: 456
diff changeset
    50
command line tools like $\mathit{grep}$ that facilitates easy 
Chengsong
parents: 456
diff changeset
    51
processing of text by search and replace, network intrusion
Chengsong
parents: 456
diff changeset
    52
detection systems that rejects suspicious traffic, or compiler
Chengsong
parents: 456
diff changeset
    53
front-ends--there is always an important phase of the
Chengsong
parents: 456
diff changeset
    54
task that involves matching a regular 
Chengsong
parents: 456
diff changeset
    55
exression with a string.
Chengsong
parents: 456
diff changeset
    56
Given its usefulness and ubiquity, one would imagine that
Chengsong
parents: 456
diff changeset
    57
modern regular expression matching implementations
Chengsong
parents: 456
diff changeset
    58
are mature and fully-studied.
456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    59
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    60
If you go to a popular programming language's
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    61
regex engine,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    62
you can supply it with regex and strings of your own,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    63
and get matching/lexing  information such as how a 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    64
sub-part of the regex matches a sub-part of the string.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    65
These lexing libraries are on average quite fast.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    66
For example, the regex engines some network intrusion detection
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    67
systems use are able to process
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    68
 megabytes or even gigabytes of network traffic per second.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    69
 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    70
Why do we need to have our version, if the algorithms work well on 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    71
average?
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    72
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    73
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    74
Take $(a^*)^*\,b$ and ask whether
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    75
strings of the form $aa..a$ match this regular
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    76
expression. Obviously this is not the case---the expected $b$ in the last
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    77
position is missing. One would expect that modern regular expression
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    78
matching engines can find this out very quickly. Alas, if one tries
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    79
this example in JavaScript, Python or Java 8 with strings like 28
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    80
$a$'s, one discovers that this decision takes around 30 seconds and
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    81
takes considerably longer when adding a few more $a$'s, as the graphs
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    82
below show:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    83
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    84
\begin{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    85
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    86
\begin{tikzpicture}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    87
\begin{axis}[
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    88
    xlabel={$n$},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    89
    x label style={at={(1.05,-0.05)}},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    90
    ylabel={time in secs},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    91
    enlargelimits=false,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    92
    xtick={0,5,...,30},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    93
    xmax=33,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    94
    ymax=35,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    95
    ytick={0,5,...,30},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    96
    scaled ticks=false,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    97
    axis lines=left,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    98
    width=5cm,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
    99
    height=4cm, 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   100
    legend entries={JavaScript},  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   101
    legend pos=north west,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   102
    legend cell align=left]
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   103
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   104
\end{axis}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   105
\end{tikzpicture}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   106
  &
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   107
\begin{tikzpicture}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   108
\begin{axis}[
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   109
    xlabel={$n$},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   110
    x label style={at={(1.05,-0.05)}},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   111
    %ylabel={time in secs},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   112
    enlargelimits=false,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   113
    xtick={0,5,...,30},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   114
    xmax=33,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   115
    ymax=35,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   116
    ytick={0,5,...,30},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   117
    scaled ticks=false,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   118
    axis lines=left,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   119
    width=5cm,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   120
    height=4cm, 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   121
    legend entries={Python},  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   122
    legend pos=north west,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   123
    legend cell align=left]
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   124
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   125
\end{axis}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   126
\end{tikzpicture}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   127
  &
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   128
\begin{tikzpicture}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   129
\begin{axis}[
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   130
    xlabel={$n$},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   131
    x label style={at={(1.05,-0.05)}},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   132
    %ylabel={time in secs},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   133
    enlargelimits=false,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   134
    xtick={0,5,...,30},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   135
    xmax=33,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   136
    ymax=35,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   137
    ytick={0,5,...,30},
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   138
    scaled ticks=false,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   139
    axis lines=left,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   140
    width=5cm,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   141
    height=4cm, 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   142
    legend entries={Java 8},  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   143
    legend pos=north west,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   144
    legend cell align=left]
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   145
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   146
\end{axis}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   147
\end{tikzpicture}\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   148
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   149
           of the form $\underbrace{aa..a}_{n}$.}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   150
\end{tabular}    
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   151
\end{center}  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   152
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   153
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   154
This is clearly exponential behaviour, and 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   155
is triggered by some relatively simple regex patterns.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   156
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   157
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   158
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   159
The opens up the possibility of
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   160
 a ReDoS (regular expression denial-of-service ) attack.
465
Chengsong
parents: 456
diff changeset
   161
 \section{Why Backtracking Algorithm for Regexes?}
456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   162
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   163
Theoretical results say that regular expression matching
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   164
should be linear with respect to the input. You could construct
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   165
an NFA via Thompson construction, and simulate  running it.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   166
This would be linear.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   167
Or you could determinize the NFA into a DFA, and minimize that
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   168
DFA. Once you have the DFA, the running time is also linear, requiring just 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   169
one scanning pass of the input.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   170
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   171
But modern  regex libraries in popular language engines
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   172
 often want to support richer constructs
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   173
than  the most basic regular expressions such as bounded repetitions
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   174
and back references. 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   175
%put in illustrations
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   176
%a{1}{3}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   177
These make a DFA construction impossible because
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   178
of an exponential states explosion.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   179
 They also want to support lexing rather than just matching
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   180
 for tasks that involves text processing.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   181
 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   182
 Existing regex libraries either pose restrictions on the user input, or does 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   183
 not give linear running time guarantee.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   184
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   185
 For example, the Rust regex engine claims to be linear, 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   186
 but does not support lookarounds and back-references.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   187
 The GoLang regex library does not support over 1000 repetitions.  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   188
 Java and Python both support back-references, but shows
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   189
catastrophic backtracking behaviours on inputs without back-references(
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   190
when the language is still regular).
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   191
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   192
 %TODO: verify the fact Rust does not allow 1000+ reps
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   193
 %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   194
 Another thing about the these libraries is that there
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   195
 is no correctness guarantee.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   196
 In some cases they either fails to generate a lexing result when there is a match,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   197
 or gives the wrong way of matching.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   198
 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   199
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   200
This superlinear blowup in matching algorithms sometimes causes
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   201
considerable grief in real life: for example on 20 July 2016 one evil
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   202
regular expression brought the webpage
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   203
\href{http://stackexchange.com}{Stack Exchange} to its
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   204
%knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   205
In this instance, a regular expression intended to just trim white
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   206
spaces from the beginning and the end of a line actually consumed
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   207
massive amounts of CPU-resources---causing web servers to grind to a
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   208
halt. This happened when a post with 20,000 white spaces was submitted,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   209
but importantly the white spaces were neither at the beginning nor at
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   210
the end. As a result, the regular expression matching engine needed to
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   211
backtrack over many choices. In this example, the time needed to process
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   212
the string was $O(n^2)$ with respect to the string length. This
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   213
quadratic overhead was enough for the homepage of Stack Exchange to
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   214
respond so slowly that the load balancer assumed there must be some
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   215
attack and therefore stopped the servers from responding to any
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   216
requests. This made the whole site become unavailable. Another very
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   217
recent example is a global outage of all Cloudflare servers on 2 July
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   218
2019. A poorly written regular expression exhibited exponential
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   219
behaviour and exhausted CPUs that serve HTTP traffic. Although the
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   220
outage had several causes, at the heart was a regular expression that
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   221
was used to monitor network
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   222
%traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
465
Chengsong
parents: 456
diff changeset
   223
Chengsong
parents: 456
diff changeset
   224
It turns out that regex libraries not only suffer from 
Chengsong
parents: 456
diff changeset
   225
exponential backtracking problems, 
Chengsong
parents: 456
diff changeset
   226
but also undesired (or even buggy) outputs.
Chengsong
parents: 456
diff changeset
   227
%TODO: comment from who
Chengsong
parents: 456
diff changeset
   228
xxx commented that most regex libraries are not
Chengsong
parents: 456
diff changeset
   229
correctly implementing the POSIX (maximum-munch)
Chengsong
parents: 456
diff changeset
   230
rule of regular expression matching.
Chengsong
parents: 456
diff changeset
   231
A concrete example would be 
Chengsong
parents: 456
diff changeset
   232
the regex
Chengsong
parents: 456
diff changeset
   233
\begin{verbatim}
Chengsong
parents: 456
diff changeset
   234
(((((a*a*)b*)b){20})*)c
Chengsong
parents: 456
diff changeset
   235
\end{verbatim}
Chengsong
parents: 456
diff changeset
   236
and the string
Chengsong
parents: 456
diff changeset
   237
\begin{verbatim}
Chengsong
parents: 456
diff changeset
   238
baabaabababaabaaaaaaaaababaa
Chengsong
parents: 456
diff changeset
   239
aababababaaaabaaabaaaaaabaab
Chengsong
parents: 456
diff changeset
   240
aabababaababaaaaaaaaababaaaa
Chengsong
parents: 456
diff changeset
   241
babababaaaaaaaaaaaaac
Chengsong
parents: 456
diff changeset
   242
\end{verbatim}
Chengsong
parents: 456
diff changeset
   243
Chengsong
parents: 456
diff changeset
   244
This seemingly complex regex simply says "some $a$'s
Chengsong
parents: 456
diff changeset
   245
followed by some $b$'s then followed by 1 single $b$,
Chengsong
parents: 456
diff changeset
   246
and this iterates 20 times, finally followed by a $c$.
Chengsong
parents: 456
diff changeset
   247
And a POSIX match would involve the entire string,"eating up"
Chengsong
parents: 456
diff changeset
   248
all the $b$'s in it.
Chengsong
parents: 456
diff changeset
   249
%TODO: give a coloured example of how this matches POSIXly
Chengsong
parents: 456
diff changeset
   250
Chengsong
parents: 456
diff changeset
   251
This regex would trigger catastrophic backtracking in 
Chengsong
parents: 456
diff changeset
   252
languages like Python and Java,
Chengsong
parents: 456
diff changeset
   253
whereas it gives a correct but uninformative (non-POSIX)
Chengsong
parents: 456
diff changeset
   254
match in languages like Go or .NET--The match with only 
Chengsong
parents: 456
diff changeset
   255
character $c$.
Chengsong
parents: 456
diff changeset
   256
Chengsong
parents: 456
diff changeset
   257
Backtracking or depth-first search might give us
Chengsong
parents: 456
diff changeset
   258
exponential running time, and quite a few tools 
Chengsong
parents: 456
diff changeset
   259
avoid that by determinising the $\mathit{NFA}$ 
Chengsong
parents: 456
diff changeset
   260
into a $\mathit{DFA}$ and minimizes it.
Chengsong
parents: 456
diff changeset
   261
For example, $\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
Chengsong
parents: 456
diff changeset
   262
in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based
Chengsong
parents: 456
diff changeset
   263
lexers.
Chengsong
parents: 456
diff changeset
   264
However, they do not scale well with bounded repetitions.
Chengsong
parents: 456
diff changeset
   265
Bounded repetitions, usually written in the form
Chengsong
parents: 456
diff changeset
   266
$r^{\{c\}}$ (where $c$ is a constant natural number),
Chengsong
parents: 456
diff changeset
   267
denotes a regular expression accepting strings
Chengsong
parents: 456
diff changeset
   268
that can be divided into $c$ substrings, and each 
Chengsong
parents: 456
diff changeset
   269
substring is in $r$. 
Chengsong
parents: 456
diff changeset
   270
%TODO: draw example NFA.
Chengsong
parents: 456
diff changeset
   271
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
Chengsong
parents: 456
diff changeset
   272
an $\mathit{NFA}$ describing it would look like:
Chengsong
parents: 456
diff changeset
   273
\begin{center}
Chengsong
parents: 456
diff changeset
   274
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
Chengsong
parents: 456
diff changeset
   275
   \node[state,initial] (q_0)   {$q_0$}; 
Chengsong
parents: 456
diff changeset
   276
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
Chengsong
parents: 456
diff changeset
   277
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
Chengsong
parents: 456
diff changeset
   278
   \node[state,accepting](q_3) [right=of q_2] {$q_3$};
Chengsong
parents: 456
diff changeset
   279
    \path[->] 
Chengsong
parents: 456
diff changeset
   280
    (q_0) edge  node {a} (q_1)
Chengsong
parents: 456
diff changeset
   281
    	  edge [loop below] node {a,b} ()
Chengsong
parents: 456
diff changeset
   282
    (q_1) edge  node  {a,b} (q_2)
Chengsong
parents: 456
diff changeset
   283
          edge [loop above] node {0} ()
Chengsong
parents: 456
diff changeset
   284
    (q_2) edge  node  {a,b} (q_3);
Chengsong
parents: 456
diff changeset
   285
\end{tikzpicture}
Chengsong
parents: 456
diff changeset
   286
\end{center}
Chengsong
parents: 456
diff changeset
   287
The red states are "counter states" which counts down 
Chengsong
parents: 456
diff changeset
   288
the number of characters needed in addition to the current
Chengsong
parents: 456
diff changeset
   289
string to make a successful match.
Chengsong
parents: 456
diff changeset
   290
For example, state $q_1$ indicates a match that has
Chengsong
parents: 456
diff changeset
   291
gone past the $(a|b)$ part of $(a|b)^*a(a|b)^{\{2\}}$,
Chengsong
parents: 456
diff changeset
   292
and just consumed the "delimiter" $a$ in the middle, and 
Chengsong
parents: 456
diff changeset
   293
need to match 2 more iterations of $a|b$ to complete.
Chengsong
parents: 456
diff changeset
   294
State $q_2$ on the other hand, can be viewed as a state
Chengsong
parents: 456
diff changeset
   295
after $q_1$ has consumed 1 character, and just waits
Chengsong
parents: 456
diff changeset
   296
for 1 more character to complete.
Chengsong
parents: 456
diff changeset
   297
Depending on the actual characters appearing in the 
Chengsong
parents: 456
diff changeset
   298
input string, the states $q_1$ and $q_2$ may or may
Chengsong
parents: 456
diff changeset
   299
not be active, independent from each other.
Chengsong
parents: 456
diff changeset
   300
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
Chengsong
parents: 456
diff changeset
   301
contain at least 4 non-equivalent states that cannot be merged, 
Chengsong
parents: 456
diff changeset
   302
because subset states indicating which of $q_1$ and $q_2$
Chengsong
parents: 456
diff changeset
   303
are active are at least four: $\phi$, $\{q_1\}$, 
Chengsong
parents: 456
diff changeset
   304
$\{q_2\}$, $\{q_1, q_2\}$.
Chengsong
parents: 456
diff changeset
   305
Generalizing this to regular expressions with larger
Chengsong
parents: 456
diff changeset
   306
bounded repetitions number, we have $r^*ar^{\{n\}}$
Chengsong
parents: 456
diff changeset
   307
in general would require at least $2^n$ states
Chengsong
parents: 456
diff changeset
   308
in a $\mathit{DFA}$. This is to represent all different 
Chengsong
parents: 456
diff changeset
   309
configurations of "countdown" states.
Chengsong
parents: 456
diff changeset
   310
For those regexes, tools such as $\mathit{JFLEX}$ 
Chengsong
parents: 456
diff changeset
   311
would generate gigantic $\mathit{DFA}$'s or even
Chengsong
parents: 456
diff changeset
   312
give out memory errors.
Chengsong
parents: 456
diff changeset
   313
Chengsong
parents: 456
diff changeset
   314
For this reason, regex libraries that support 
Chengsong
parents: 456
diff changeset
   315
bounded repetitions often choose to use the $\mathit{NFA}$ 
Chengsong
parents: 456
diff changeset
   316
approach.
Chengsong
parents: 456
diff changeset
   317
One can simulate the $\mathit{NFA}$ running in two ways:
Chengsong
parents: 456
diff changeset
   318
one by keeping track of all active states after consuming 
Chengsong
parents: 456
diff changeset
   319
a character, and update that set of states iteratively.
Chengsong
parents: 456
diff changeset
   320
This is a breadth-first-search of the $\mathit{NFA}$.
Chengsong
parents: 456
diff changeset
   321
for a path terminating
Chengsong
parents: 456
diff changeset
   322
at an accepting state.
Chengsong
parents: 456
diff changeset
   323
Languages like $\mathit{GO}$ and $\mathit{RUST}$ use this
Chengsong
parents: 456
diff changeset
   324
type of $\mathit{NFA}$ simulation, and guarantees a linear runtime
Chengsong
parents: 456
diff changeset
   325
in terms of input string length.
Chengsong
parents: 456
diff changeset
   326
The other way to use $\mathit{NFA}$ for matching is to take 
Chengsong
parents: 456
diff changeset
   327
a single state in a path each time, and backtrack if that path
Chengsong
parents: 456
diff changeset
   328
fails. This is a depth-first-search that could end up
Chengsong
parents: 456
diff changeset
   329
with exponential run time.
Chengsong
parents: 456
diff changeset
   330
The reason behind backtracking algorithms in languages like
Chengsong
parents: 456
diff changeset
   331
Java and Python is that they support back-references.
Chengsong
parents: 456
diff changeset
   332
\subsection{Back References in Regex--Non-Regular part}
Chengsong
parents: 456
diff changeset
   333
If we label sub-expressions by parenthesizing them and give 
Chengsong
parents: 456
diff changeset
   334
them a number by the order their opening parenthesis appear,
Chengsong
parents: 456
diff changeset
   335
$\underset{1}{(}\ldots\underset{2}{(}\ldots\underset{3}{(}\ldots\underset{4}{(}\ldots)\ldots)\ldots)\ldots)$
Chengsong
parents: 456
diff changeset
   336
We can use the following syntax to denote that we want a string just matched by a 
Chengsong
parents: 456
diff changeset
   337
sub-expression to appear at a certain location again exactly:
Chengsong
parents: 456
diff changeset
   338
$(.*)\backslash 1$
Chengsong
parents: 456
diff changeset
   339
would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.
Chengsong
parents: 456
diff changeset
   340
Chengsong
parents: 456
diff changeset
   341
Back-reference is a construct in the "regex" standard
Chengsong
parents: 456
diff changeset
   342
that programmers found quite useful, but not exactly 
Chengsong
parents: 456
diff changeset
   343
regular any more.
Chengsong
parents: 456
diff changeset
   344
In fact, that allows the regex construct to express 
Chengsong
parents: 456
diff changeset
   345
languages that cannot be contained in context-free
Chengsong
parents: 456
diff changeset
   346
languages
Chengsong
parents: 456
diff changeset
   347
For example, the back reference $(a^*)\backslash 1 \backslash 1$
Chengsong
parents: 456
diff changeset
   348
expresses the language $\{a^na^na^n\mid n \in N\}$,
Chengsong
parents: 456
diff changeset
   349
which cannot be expressed by context-free grammars.
Chengsong
parents: 456
diff changeset
   350
To express such a language one would need context-sensitive
Chengsong
parents: 456
diff changeset
   351
grammars, and matching/parsing for such grammars is NP-hard in 
Chengsong
parents: 456
diff changeset
   352
general.
Chengsong
parents: 456
diff changeset
   353
%TODO:cite reference for NP-hardness of CSG matching
Chengsong
parents: 456
diff changeset
   354
For such constructs, the most intuitive way of matching is
Chengsong
parents: 456
diff changeset
   355
using backtracking algorithms, and there are no known algorithms 
Chengsong
parents: 456
diff changeset
   356
that does not backtrack.
Chengsong
parents: 456
diff changeset
   357
%TODO:read a bit more about back reference algorithms
Chengsong
parents: 456
diff changeset
   358
Chengsong
parents: 456
diff changeset
   359
Chengsong
parents: 456
diff changeset
   360
Chengsong
parents: 456
diff changeset
   361
Chengsong
parents: 456
diff changeset
   362
\section{Our Solution--Brzozowski Derivatives}
Chengsong
parents: 456
diff changeset
   363
Chengsong
parents: 456
diff changeset
   364
456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   365
 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   366
 Is it possible to have a regex lexing algorithm with proven correctness and 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   367
 time complexity, which allows easy extensions to
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   368
  constructs like 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   369
 bounded repetitions, negation,  lookarounds, and even back-references? 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   370
 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   371
 We propose Brzozowski's derivatives as a solution to this problem.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   372
 
465
Chengsong
parents: 456
diff changeset
   373
The main contribution of this thesis is a proven correct lexing algorithm
Chengsong
parents: 456
diff changeset
   374
with formalized time bounds.
Chengsong
parents: 456
diff changeset
   375
To our best knowledge, there is no lexing libraries using Brzozowski derivatives
Chengsong
parents: 456
diff changeset
   376
that have a provable time guarantee, 
Chengsong
parents: 456
diff changeset
   377
and claims about running time are usually speculative and backed by thin empirical
Chengsong
parents: 456
diff changeset
   378
evidence.
Chengsong
parents: 456
diff changeset
   379
%TODO: give references
Chengsong
parents: 456
diff changeset
   380
For example, Sulzmann and Lu had proposed an algorithm  in which they
Chengsong
parents: 456
diff changeset
   381
claim a linear running time.
Chengsong
parents: 456
diff changeset
   382
But that was falsified by our experiments and the running time 
Chengsong
parents: 456
diff changeset
   383
is actually $\Omega(2^n)$ in the worst case.
Chengsong
parents: 456
diff changeset
   384
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
Chengsong
parents: 456
diff changeset
   385
%TODO: give references
Chengsong
parents: 456
diff changeset
   386
lexer, which calculates POSIX matches and is based on derivatives.
Chengsong
parents: 456
diff changeset
   387
They formalized the correctness of the lexer, but not the complexity.
Chengsong
parents: 456
diff changeset
   388
In the performance evaluation section, they simply analyzed the run time
Chengsong
parents: 456
diff changeset
   389
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
Chengsong
parents: 456
diff changeset
   390
and concluded that the algorithm is quadratic in terms of input length.
Chengsong
parents: 456
diff changeset
   391
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
Chengsong
parents: 456
diff changeset
   392
the time it took to lex only 40 $a$'s was 5 minutes.
Chengsong
parents: 456
diff changeset
   393
We therefore believe our results of a proof of performance on general
Chengsong
parents: 456
diff changeset
   394
inputs rather than specific examples a novel contribution.\\
Chengsong
parents: 456
diff changeset
   395
Chengsong
parents: 456
diff changeset
   396
 \section{Preliminaries about Lexing Using Brzozowski derivatives}
456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   397
 In the last fifteen or so years, Brzozowski's derivatives of regular
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   398
expressions have sparked quite a bit of interest in the functional
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   399
programming and theorem prover communities.  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   400
The beauty of
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   401
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   402
expressible in any functional language, and easily definable and
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   403
reasoned about in theorem provers---the definitions just consist of
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   404
inductive datatypes and simple recursive functions. 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   405
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   406
Suppose we have an alphabet $\Sigma$, the strings  whose characters
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   407
are from $\Sigma$
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   408
can be expressed as $\Sigma^*$.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   409
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   410
We use patterns to define a set of strings concisely. Regular expressions
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   411
are one of such patterns systems:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   412
The basic regular expressions  are defined inductively
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   413
 by the following grammar:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   414
\[			r ::=   \ZERO \mid  \ONE
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   415
			 \mid  c  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   416
			 \mid  r_1 \cdot r_2
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   417
			 \mid  r_1 + r_2   
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   418
			 \mid r^*         
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   419
\]
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   420
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   421
The language or set of strings defined by regular expressions are defined as
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   422
%TODO: FILL in the other defs
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   423
\begin{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   424
\begin{tabular}{lcl}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   425
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   426
$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   427
\end{tabular}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   428
\end{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   429
Which are also called the "language interpretation".
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   430
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   431
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   432
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   433
The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   434
where the operation transforms the regex to a new one containing
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   435
strings without the head character $c$.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   436
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   437
Formally, we define first such a transformation on any string set, which
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   438
we call semantic derivative:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   439
\begin{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   440
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   441
\end{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   442
Mathematically, it can be expressed as the 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   443
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   444
If the $\textit{StringSet}$ happen to have some structure, for example,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   445
if it is regular, then we have that it
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   446
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   447
The the derivative of regular expression, denoted as
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   448
$r \backslash c$, is a function that takes parameters
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   449
$r$ and $c$, and returns another regular expression $r'$,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   450
which is computed by the following recursive function:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   451
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   452
\begin{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   453
\begin{tabular}{lcl}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   454
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   455
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   456
		$d \backslash c$     & $\dn$ & 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   457
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   458
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   459
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   460
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   461
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   462
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   463
\end{tabular}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   464
\end{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   465
\noindent
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   466
\noindent
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   467
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   468
The $\nullable$ function tests whether the empty string $""$ 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   469
is in the language of $r$:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   470
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   471
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   472
\begin{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   473
		\begin{tabular}{lcl}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   474
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   475
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   476
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   477
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   478
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   479
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   480
		\end{tabular}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   481
\end{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   482
\noindent
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   483
The empty set does not contain any string and
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   484
therefore not the empty string, the empty string 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   485
regular expression contains the empty string
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   486
by definition, the character regular expression
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   487
is the singleton that contains character only,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   488
and therefore does not contain the empty string,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   489
the alternative regular expression(or "or" expression)
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   490
might have one of its children regular expressions
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   491
being nullable and any one of its children being nullable
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   492
would suffice. The sequence regular expression
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   493
would require both children to have the empty string
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   494
to compose an empty string and the Kleene star
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   495
operation naturally introduced the empty string.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   496
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   497
We can give the meaning of regular expressions derivatives
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   498
by language interpretation:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   499
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   500
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
   501
 
465
Chengsong
parents: 456
diff changeset
   502
  
Chengsong
parents: 456
diff changeset
   503
\begin{center}
Chengsong
parents: 456
diff changeset
   504
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
   505
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
Chengsong
parents: 456
diff changeset
   506
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
Chengsong
parents: 456
diff changeset
   507
		$d \backslash c$     & $\dn$ & 
Chengsong
parents: 456
diff changeset
   508
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
Chengsong
parents: 456
diff changeset
   509
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
Chengsong
parents: 456
diff changeset
   510
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
Chengsong
parents: 456
diff changeset
   511
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
Chengsong
parents: 456
diff changeset
   512
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
Chengsong
parents: 456
diff changeset
   513
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
Chengsong
parents: 456
diff changeset
   514
\end{tabular}
Chengsong
parents: 456
diff changeset
   515
\end{center}
Chengsong
parents: 456
diff changeset
   516
\noindent
Chengsong
parents: 456
diff changeset
   517
\noindent
Chengsong
parents: 456
diff changeset
   518
The function derivative, written $\backslash c$, 
Chengsong
parents: 456
diff changeset
   519
defines how a regular expression evolves into
Chengsong
parents: 456
diff changeset
   520
a new regular expression after all the string it contains
Chengsong
parents: 456
diff changeset
   521
is chopped off a certain head character $c$.
Chengsong
parents: 456
diff changeset
   522
The most involved cases are the sequence 
Chengsong
parents: 456
diff changeset
   523
and star case.
Chengsong
parents: 456
diff changeset
   524
The sequence case says that if the first regular expression
Chengsong
parents: 456
diff changeset
   525
contains an empty string then second component of the sequence
Chengsong
parents: 456
diff changeset
   526
might be chosen as the target regular expression to be chopped
Chengsong
parents: 456
diff changeset
   527
off its head character.
Chengsong
parents: 456
diff changeset
   528
The star regular expression unwraps the iteration of
Chengsong
parents: 456
diff changeset
   529
regular expression and attack the star regular expression
Chengsong
parents: 456
diff changeset
   530
to its back again to make sure there are 0 or more iterations
Chengsong
parents: 456
diff changeset
   531
following this unfolded iteration.
Chengsong
parents: 456
diff changeset
   532
Chengsong
parents: 456
diff changeset
   533
Chengsong
parents: 456
diff changeset
   534
The main property of the derivative operation
Chengsong
parents: 456
diff changeset
   535
that enables us to reason about the correctness of
Chengsong
parents: 456
diff changeset
   536
an algorithm using derivatives is 
Chengsong
parents: 456
diff changeset
   537
Chengsong
parents: 456
diff changeset
   538
\begin{center}
Chengsong
parents: 456
diff changeset
   539
$c\!::\!s \in L(r)$ holds
Chengsong
parents: 456
diff changeset
   540
if and only if $s \in L(r\backslash c)$.
Chengsong
parents: 456
diff changeset
   541
\end{center}
Chengsong
parents: 456
diff changeset
   542
Chengsong
parents: 456
diff changeset
   543
\noindent
Chengsong
parents: 456
diff changeset
   544
We can generalise the derivative operation shown above for single characters
Chengsong
parents: 456
diff changeset
   545
to strings as follows:
Chengsong
parents: 456
diff changeset
   546
Chengsong
parents: 456
diff changeset
   547
\begin{center}
Chengsong
parents: 456
diff changeset
   548
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
   549
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
Chengsong
parents: 456
diff changeset
   550
$r \backslash [\,] $ & $\dn$ & $r$
Chengsong
parents: 456
diff changeset
   551
\end{tabular}
Chengsong
parents: 456
diff changeset
   552
\end{center}
Chengsong
parents: 456
diff changeset
   553
Chengsong
parents: 456
diff changeset
   554
\noindent
Chengsong
parents: 456
diff changeset
   555
and then define Brzozowski's  regular-expression matching algorithm as:
Chengsong
parents: 456
diff changeset
   556
Chengsong
parents: 456
diff changeset
   557
\[
Chengsong
parents: 456
diff changeset
   558
match\;s\;r \;\dn\; nullable(r\backslash s)
Chengsong
parents: 456
diff changeset
   559
\]
Chengsong
parents: 456
diff changeset
   560
Chengsong
parents: 456
diff changeset
   561
\noindent
Chengsong
parents: 456
diff changeset
   562
Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
Chengsong
parents: 456
diff changeset
   563
this algorithm presented graphically is as follows:
Chengsong
parents: 456
diff changeset
   564
Chengsong
parents: 456
diff changeset
   565
\begin{equation}\label{graph:*}
Chengsong
parents: 456
diff changeset
   566
\begin{tikzcd}
Chengsong
parents: 456
diff changeset
   567
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
Chengsong
parents: 456
diff changeset
   568
\end{tikzcd}
Chengsong
parents: 456
diff changeset
   569
\end{equation}
Chengsong
parents: 456
diff changeset
   570
Chengsong
parents: 456
diff changeset
   571
\noindent
Chengsong
parents: 456
diff changeset
   572
where we start with  a regular expression  $r_0$, build successive
Chengsong
parents: 456
diff changeset
   573
derivatives until we exhaust the string and then use \textit{nullable}
Chengsong
parents: 456
diff changeset
   574
to test whether the result can match the empty string. It can  be
Chengsong
parents: 456
diff changeset
   575
relatively  easily shown that this matcher is correct  (that is given
Chengsong
parents: 456
diff changeset
   576
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
Chengsong
parents: 456
diff changeset
   577
Chengsong
parents: 456
diff changeset
   578
Beautiful and simple definition.
Chengsong
parents: 456
diff changeset
   579
Chengsong
parents: 456
diff changeset
   580
If we implement the above algorithm naively, however,
Chengsong
parents: 456
diff changeset
   581
the algorithm can be excruciatingly slow. For example, when starting with the regular
Chengsong
parents: 456
diff changeset
   582
expression $(a + aa)^*$ and building 12 successive derivatives
Chengsong
parents: 456
diff changeset
   583
w.r.t.~the character $a$, one obtains a derivative regular expression
Chengsong
parents: 456
diff changeset
   584
with more than 8000 nodes (when viewed as a tree). Operations like
Chengsong
parents: 456
diff changeset
   585
$\backslash$ and $\nullable$ need to traverse such trees and
Chengsong
parents: 456
diff changeset
   586
consequently the bigger the size of the derivative the slower the
Chengsong
parents: 456
diff changeset
   587
algorithm. 
Chengsong
parents: 456
diff changeset
   588
Chengsong
parents: 456
diff changeset
   589
Brzozowski was quick in finding that during this process a lot useless
Chengsong
parents: 456
diff changeset
   590
$\ONE$s and $\ZERO$s are generated and therefore not optimal.
Chengsong
parents: 456
diff changeset
   591
He also introduced some "similarity rules" such
Chengsong
parents: 456
diff changeset
   592
as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
Chengsong
parents: 456
diff changeset
   593
different but language-equivalent sub-regexes to further decrease the size
Chengsong
parents: 456
diff changeset
   594
of the intermediate regexes. 
Chengsong
parents: 456
diff changeset
   595
Chengsong
parents: 456
diff changeset
   596
More simplifications are possible, such as deleting duplicates
Chengsong
parents: 456
diff changeset
   597
and opening up nested alternatives to trigger even more simplifications.
Chengsong
parents: 456
diff changeset
   598
And suppose we apply simplification after each derivative step, and compose
Chengsong
parents: 456
diff changeset
   599
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
Chengsong
parents: 456
diff changeset
   600
\textit{simp}(a \backslash c)$. Then we can build
Chengsong
parents: 456
diff changeset
   601
a matcher without having  cumbersome regular expressions.
Chengsong
parents: 456
diff changeset
   602
Chengsong
parents: 456
diff changeset
   603
Chengsong
parents: 456
diff changeset
   604
If we want the size of derivatives in the algorithm to
Chengsong
parents: 456
diff changeset
   605
stay even lower, we would need more aggressive simplifications.
Chengsong
parents: 456
diff changeset
   606
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
Chengsong
parents: 456
diff changeset
   607
deleting duplicates whenever possible. For example, the parentheses in
Chengsong
parents: 456
diff changeset
   608
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
Chengsong
parents: 456
diff changeset
   609
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
Chengsong
parents: 456
diff changeset
   610
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
Chengsong
parents: 456
diff changeset
   611
$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
Chengsong
parents: 456
diff changeset
   612
to achieve a very tight size bound, namely,
Chengsong
parents: 456
diff changeset
   613
 the same size bound as that of the \emph{partial derivatives}. 
Chengsong
parents: 456
diff changeset
   614
Chengsong
parents: 456
diff changeset
   615
Building derivatives and then simplify them.
Chengsong
parents: 456
diff changeset
   616
So far so good. But what if we want to 
Chengsong
parents: 456
diff changeset
   617
do lexing instead of just a YES/NO answer?
Chengsong
parents: 456
diff changeset
   618
This requires us to go back again to the world 
Chengsong
parents: 456
diff changeset
   619
without simplification first for a moment.
Chengsong
parents: 456
diff changeset
   620
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
Chengsong
parents: 456
diff changeset
   621
elegant(arguably as beautiful as the original
Chengsong
parents: 456
diff changeset
   622
derivatives definition) solution for this.
Chengsong
parents: 456
diff changeset
   623
Chengsong
parents: 456
diff changeset
   624
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
Chengsong
parents: 456
diff changeset
   625
Chengsong
parents: 456
diff changeset
   626
Chengsong
parents: 456
diff changeset
   627
They first defined the datatypes for storing the 
Chengsong
parents: 456
diff changeset
   628
lexing information called a \emph{value} or
Chengsong
parents: 456
diff changeset
   629
sometimes also \emph{lexical value}.  These values and regular
Chengsong
parents: 456
diff changeset
   630
expressions correspond to each other as illustrated in the following
Chengsong
parents: 456
diff changeset
   631
table:
Chengsong
parents: 456
diff changeset
   632
Chengsong
parents: 456
diff changeset
   633
\begin{center}
Chengsong
parents: 456
diff changeset
   634
	\begin{tabular}{c@{\hspace{20mm}}c}
Chengsong
parents: 456
diff changeset
   635
		\begin{tabular}{@{}rrl@{}}
Chengsong
parents: 456
diff changeset
   636
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
Chengsong
parents: 456
diff changeset
   637
			$r$ & $::=$  & $\ZERO$\\
Chengsong
parents: 456
diff changeset
   638
			& $\mid$ & $\ONE$   \\
Chengsong
parents: 456
diff changeset
   639
			& $\mid$ & $c$          \\
Chengsong
parents: 456
diff changeset
   640
			& $\mid$ & $r_1 \cdot r_2$\\
Chengsong
parents: 456
diff changeset
   641
			& $\mid$ & $r_1 + r_2$   \\
Chengsong
parents: 456
diff changeset
   642
			\\
Chengsong
parents: 456
diff changeset
   643
			& $\mid$ & $r^*$         \\
Chengsong
parents: 456
diff changeset
   644
		\end{tabular}
Chengsong
parents: 456
diff changeset
   645
		&
Chengsong
parents: 456
diff changeset
   646
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
Chengsong
parents: 456
diff changeset
   647
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
Chengsong
parents: 456
diff changeset
   648
			$v$ & $::=$  & \\
Chengsong
parents: 456
diff changeset
   649
			&        & $\Empty$   \\
Chengsong
parents: 456
diff changeset
   650
			& $\mid$ & $\Char(c)$          \\
Chengsong
parents: 456
diff changeset
   651
			& $\mid$ & $\Seq\,v_1\, v_2$\\
Chengsong
parents: 456
diff changeset
   652
			& $\mid$ & $\Left(v)$   \\
Chengsong
parents: 456
diff changeset
   653
			& $\mid$ & $\Right(v)$  \\
Chengsong
parents: 456
diff changeset
   654
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
Chengsong
parents: 456
diff changeset
   655
		\end{tabular}
Chengsong
parents: 456
diff changeset
   656
	\end{tabular}
Chengsong
parents: 456
diff changeset
   657
\end{center}
Chengsong
parents: 456
diff changeset
   658
Chengsong
parents: 456
diff changeset
   659
\noindent
Chengsong
parents: 456
diff changeset
   660
One regular expression can have multiple lexical values. For example
Chengsong
parents: 456
diff changeset
   661
for the regular expression $(a+b)^*$, it has a infinite list of
Chengsong
parents: 456
diff changeset
   662
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
Chengsong
parents: 456
diff changeset
   663
$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
Chengsong
parents: 456
diff changeset
   664
$\ldots$, and vice versa.
Chengsong
parents: 456
diff changeset
   665
Even for the regular expression matching a certain string, there could 
Chengsong
parents: 456
diff changeset
   666
still be more than one value corresponding to it.
Chengsong
parents: 456
diff changeset
   667
Take the example where $r= (a^*\cdot a^*)^*$ and the string 
Chengsong
parents: 456
diff changeset
   668
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
Chengsong
parents: 456
diff changeset
   669
The number of different ways of matching 
Chengsong
parents: 456
diff changeset
   670
without allowing any value under a star to be flattened
Chengsong
parents: 456
diff changeset
   671
to an empty string can be given by the following formula:
Chengsong
parents: 456
diff changeset
   672
\begin{center}
Chengsong
parents: 456
diff changeset
   673
	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
Chengsong
parents: 456
diff changeset
   674
\end{center}
Chengsong
parents: 456
diff changeset
   675
and a closed form formula can be calculated to be
Chengsong
parents: 456
diff changeset
   676
\begin{equation}
Chengsong
parents: 456
diff changeset
   677
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
Chengsong
parents: 456
diff changeset
   678
\end{equation}
Chengsong
parents: 456
diff changeset
   679
which is clearly in exponential order.
Chengsong
parents: 456
diff changeset
   680
A lexer aimed at getting all the possible values has an exponential
Chengsong
parents: 456
diff changeset
   681
worst case runtime. Therefore it is impractical to try to generate
Chengsong
parents: 456
diff changeset
   682
all possible matches in a run. In practice, we are usually 
Chengsong
parents: 456
diff changeset
   683
interested about POSIX values, which by intuition always
Chengsong
parents: 456
diff changeset
   684
match the leftmost regular expression when there is a choice
Chengsong
parents: 456
diff changeset
   685
and always match a sub part as much as possible before proceeding
Chengsong
parents: 456
diff changeset
   686
to the next token. For example, the above example has the POSIX value
Chengsong
parents: 456
diff changeset
   687
$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
Chengsong
parents: 456
diff changeset
   688
The output of an algorithm we want would be a POSIX matching
Chengsong
parents: 456
diff changeset
   689
encoded as a value.
Chengsong
parents: 456
diff changeset
   690
The contribution of Sulzmann and Lu is an extension of Brzozowski's
Chengsong
parents: 456
diff changeset
   691
algorithm by a second phase (the first phase being building successive
Chengsong
parents: 456
diff changeset
   692
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
Chengsong
parents: 456
diff changeset
   693
is generated in case the regular expression matches  the string. 
Chengsong
parents: 456
diff changeset
   694
Pictorially, the Sulzmann and Lu algorithm is as follows:
Chengsong
parents: 456
diff changeset
   695
Chengsong
parents: 456
diff changeset
   696
\begin{ceqn}
Chengsong
parents: 456
diff changeset
   697
\begin{equation}\label{graph:2}
Chengsong
parents: 456
diff changeset
   698
\begin{tikzcd}
Chengsong
parents: 456
diff changeset
   699
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
Chengsong
parents: 456
diff changeset
   700
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
Chengsong
parents: 456
diff changeset
   701
\end{tikzcd}
Chengsong
parents: 456
diff changeset
   702
\end{equation}
Chengsong
parents: 456
diff changeset
   703
\end{ceqn}
Chengsong
parents: 456
diff changeset
   704
Chengsong
parents: 456
diff changeset
   705
Chengsong
parents: 456
diff changeset
   706
\noindent
Chengsong
parents: 456
diff changeset
   707
For convenience, we shall employ the following notations: the regular
Chengsong
parents: 456
diff changeset
   708
expression we start with is $r_0$, and the given string $s$ is composed
Chengsong
parents: 456
diff changeset
   709
of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
Chengsong
parents: 456
diff changeset
   710
left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
Chengsong
parents: 456
diff changeset
   711
to the characters $c_0$, $c_1$  until we exhaust the string and obtain
Chengsong
parents: 456
diff changeset
   712
the derivative $r_n$. We test whether this derivative is
Chengsong
parents: 456
diff changeset
   713
$\textit{nullable}$ or not. If not, we know the string does not match
Chengsong
parents: 456
diff changeset
   714
$r$ and no value needs to be generated. If yes, we start building the
Chengsong
parents: 456
diff changeset
   715
values incrementally by \emph{injecting} back the characters into the
Chengsong
parents: 456
diff changeset
   716
earlier values $v_n, \ldots, v_0$. This is the second phase of the
Chengsong
parents: 456
diff changeset
   717
algorithm from the right to left. For the first value $v_n$, we call the
Chengsong
parents: 456
diff changeset
   718
function $\textit{mkeps}$, which builds a POSIX lexical value
Chengsong
parents: 456
diff changeset
   719
for how the empty string has been matched by the (nullable) regular
Chengsong
parents: 456
diff changeset
   720
expression $r_n$. This function is defined as
Chengsong
parents: 456
diff changeset
   721
Chengsong
parents: 456
diff changeset
   722
	\begin{center}
Chengsong
parents: 456
diff changeset
   723
		\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
   724
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
Chengsong
parents: 456
diff changeset
   725
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
Chengsong
parents: 456
diff changeset
   726
			& \textit{if} $\nullable(r_{1})$\\ 
Chengsong
parents: 456
diff changeset
   727
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
Chengsong
parents: 456
diff changeset
   728
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
Chengsong
parents: 456
diff changeset
   729
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
Chengsong
parents: 456
diff changeset
   730
			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
Chengsong
parents: 456
diff changeset
   731
		\end{tabular}
Chengsong
parents: 456
diff changeset
   732
	\end{center}
Chengsong
parents: 456
diff changeset
   733
Chengsong
parents: 456
diff changeset
   734
Chengsong
parents: 456
diff changeset
   735
\noindent 
Chengsong
parents: 456
diff changeset
   736
After the $\mkeps$-call, we inject back the characters one by one in order to build
Chengsong
parents: 456
diff changeset
   737
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
Chengsong
parents: 456
diff changeset
   738
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
Chengsong
parents: 456
diff changeset
   739
After injecting back $n$ characters, we get the lexical value for how $r_0$
Chengsong
parents: 456
diff changeset
   740
matches $s$. The POSIX value is maintained throught out the process.
Chengsong
parents: 456
diff changeset
   741
For this Sulzmann and Lu defined a function that reverses
Chengsong
parents: 456
diff changeset
   742
the ``chopping off'' of characters during the derivative phase. The
Chengsong
parents: 456
diff changeset
   743
corresponding function is called \emph{injection}, written
Chengsong
parents: 456
diff changeset
   744
$\textit{inj}$; it takes three arguments: the first one is a regular
Chengsong
parents: 456
diff changeset
   745
expression ${r_{i-1}}$, before the character is chopped off, the second
Chengsong
parents: 456
diff changeset
   746
is a character ${c_{i-1}}$, the character we want to inject and the
Chengsong
parents: 456
diff changeset
   747
third argument is the value ${v_i}$, into which one wants to inject the
Chengsong
parents: 456
diff changeset
   748
character (it corresponds to the regular expression after the character
Chengsong
parents: 456
diff changeset
   749
has been chopped off). The result of this function is a new value. The
Chengsong
parents: 456
diff changeset
   750
definition of $\textit{inj}$ is as follows: 
Chengsong
parents: 456
diff changeset
   751
Chengsong
parents: 456
diff changeset
   752
\begin{center}
Chengsong
parents: 456
diff changeset
   753
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
Chengsong
parents: 456
diff changeset
   754
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
Chengsong
parents: 456
diff changeset
   755
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
Chengsong
parents: 456
diff changeset
   756
  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
Chengsong
parents: 456
diff changeset
   757
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
Chengsong
parents: 456
diff changeset
   758
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
Chengsong
parents: 456
diff changeset
   759
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
Chengsong
parents: 456
diff changeset
   760
  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
Chengsong
parents: 456
diff changeset
   761
\end{tabular}
Chengsong
parents: 456
diff changeset
   762
\end{center}
Chengsong
parents: 456
diff changeset
   763
Chengsong
parents: 456
diff changeset
   764
\noindent This definition is by recursion on the ``shape'' of regular
Chengsong
parents: 456
diff changeset
   765
expressions and values. 
Chengsong
parents: 456
diff changeset
   766
The clauses basically do one thing--identifying the ``holes'' on 
Chengsong
parents: 456
diff changeset
   767
value to inject the character back into.
Chengsong
parents: 456
diff changeset
   768
For instance, in the last clause for injecting back to a value
Chengsong
parents: 456
diff changeset
   769
that would turn into a new star value that corresponds to a star,
Chengsong
parents: 456
diff changeset
   770
we know it must be a sequence value. And we know that the first 
Chengsong
parents: 456
diff changeset
   771
value of that sequence corresponds to the child regex of the star
Chengsong
parents: 456
diff changeset
   772
with the first character being chopped off--an iteration of the star
Chengsong
parents: 456
diff changeset
   773
that had just been unfolded. This value is followed by the already
Chengsong
parents: 456
diff changeset
   774
matched star iterations we collected before. So we inject the character 
Chengsong
parents: 456
diff changeset
   775
back to the first value and form a new value with this new iteration
Chengsong
parents: 456
diff changeset
   776
being added to the previous list of iterations, all under the $Stars$
Chengsong
parents: 456
diff changeset
   777
top level.
Chengsong
parents: 456
diff changeset
   778
Chengsong
parents: 456
diff changeset
   779
We have mentioned before that derivatives without simplification 
Chengsong
parents: 456
diff changeset
   780
can get clumsy, and this is true for values as well--they reflect
Chengsong
parents: 456
diff changeset
   781
the regular expressions size by definition.
Chengsong
parents: 456
diff changeset
   782
Chengsong
parents: 456
diff changeset
   783
One can introduce simplification on the regex and values, but have to
Chengsong
parents: 456
diff changeset
   784
be careful in not breaking the correctness as the injection 
Chengsong
parents: 456
diff changeset
   785
function heavily relies on the structure of the regexes and values
Chengsong
parents: 456
diff changeset
   786
being correct and match each other.
Chengsong
parents: 456
diff changeset
   787
It can be achieved by recording some extra rectification functions
Chengsong
parents: 456
diff changeset
   788
during the derivatives step, and applying these rectifications in 
Chengsong
parents: 456
diff changeset
   789
each run during the injection phase.
Chengsong
parents: 456
diff changeset
   790
And we can prove that the POSIX value of how
Chengsong
parents: 456
diff changeset
   791
regular expressions match strings will not be affected---although is much harder
Chengsong
parents: 456
diff changeset
   792
to establish. Some initial results in this regard have been
Chengsong
parents: 456
diff changeset
   793
obtained in \cite{AusafDyckhoffUrban2016}. 
Chengsong
parents: 456
diff changeset
   794
Chengsong
parents: 456
diff changeset
   795
%Brzozowski, after giving the derivatives and simplification,
Chengsong
parents: 456
diff changeset
   796
%did not explore lexing with simplification or he may well be 
Chengsong
parents: 456
diff changeset
   797
%stuck on an efficient simplificaiton with a proof.
Chengsong
parents: 456
diff changeset
   798
%He went on to explore the use of derivatives together with 
Chengsong
parents: 456
diff changeset
   799
%automaton, and did not try lexing using derivatives.
Chengsong
parents: 456
diff changeset
   800
Chengsong
parents: 456
diff changeset
   801
We want to get rid of complex and fragile rectification of values.
Chengsong
parents: 456
diff changeset
   802
Can we not create those intermediate values $v_1,\ldots v_n$,
Chengsong
parents: 456
diff changeset
   803
and get the lexing information that should be already there while
Chengsong
parents: 456
diff changeset
   804
doing derivatives in one pass, without a second phase of injection?
Chengsong
parents: 456
diff changeset
   805
In the meantime, can we make sure that simplifications
Chengsong
parents: 456
diff changeset
   806
are easily handled without breaking the correctness of the algorithm?
Chengsong
parents: 456
diff changeset
   807
Chengsong
parents: 456
diff changeset
   808
Sulzmann and Lu solved this problem by
Chengsong
parents: 456
diff changeset
   809
introducing additional informtaion to the 
Chengsong
parents: 456
diff changeset
   810
regular expressions called \emph{bitcodes}.
Chengsong
parents: 456
diff changeset
   811
Chengsong
parents: 456
diff changeset
   812
\subsection*{Bit-coded Algorithm}
Chengsong
parents: 456
diff changeset
   813
Bits and bitcodes (lists of bits) are defined as:
Chengsong
parents: 456
diff changeset
   814
Chengsong
parents: 456
diff changeset
   815
\begin{center}
Chengsong
parents: 456
diff changeset
   816
		$b ::=   1 \mid  0 \qquad
Chengsong
parents: 456
diff changeset
   817
bs ::= [] \mid b::bs    
Chengsong
parents: 456
diff changeset
   818
$
Chengsong
parents: 456
diff changeset
   819
\end{center}
Chengsong
parents: 456
diff changeset
   820
Chengsong
parents: 456
diff changeset
   821
\noindent
Chengsong
parents: 456
diff changeset
   822
The $1$ and $0$ are not in bold in order to avoid 
Chengsong
parents: 456
diff changeset
   823
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
Chengsong
parents: 456
diff changeset
   824
bit-lists) can be used to encode values (or potentially incomplete values) in a
Chengsong
parents: 456
diff changeset
   825
compact form. This can be straightforwardly seen in the following
Chengsong
parents: 456
diff changeset
   826
coding function from values to bitcodes: 
Chengsong
parents: 456
diff changeset
   827
Chengsong
parents: 456
diff changeset
   828
\begin{center}
Chengsong
parents: 456
diff changeset
   829
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
   830
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
Chengsong
parents: 456
diff changeset
   831
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
Chengsong
parents: 456
diff changeset
   832
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
Chengsong
parents: 456
diff changeset
   833
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
Chengsong
parents: 456
diff changeset
   834
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
Chengsong
parents: 456
diff changeset
   835
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
Chengsong
parents: 456
diff changeset
   836
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
Chengsong
parents: 456
diff changeset
   837
                                                 code(\Stars\,vs)$
Chengsong
parents: 456
diff changeset
   838
\end{tabular}    
Chengsong
parents: 456
diff changeset
   839
\end{center} 
Chengsong
parents: 456
diff changeset
   840
Chengsong
parents: 456
diff changeset
   841
\noindent
Chengsong
parents: 456
diff changeset
   842
Here $\textit{code}$ encodes a value into a bitcodes by converting
Chengsong
parents: 456
diff changeset
   843
$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
Chengsong
parents: 456
diff changeset
   844
star iteration by $1$. The border where a local star terminates
Chengsong
parents: 456
diff changeset
   845
is marked by $0$. This coding is lossy, as it throws away the information about
Chengsong
parents: 456
diff changeset
   846
characters, and also does not encode the ``boundary'' between two
Chengsong
parents: 456
diff changeset
   847
sequence values. Moreover, with only the bitcode we cannot even tell
Chengsong
parents: 456
diff changeset
   848
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
Chengsong
parents: 456
diff changeset
   849
reason for choosing this compact way of storing information is that the
Chengsong
parents: 456
diff changeset
   850
relatively small size of bits can be easily manipulated and ``moved
Chengsong
parents: 456
diff changeset
   851
around'' in a regular expression. In order to recover values, we will 
Chengsong
parents: 456
diff changeset
   852
need the corresponding regular expression as an extra information. This
Chengsong
parents: 456
diff changeset
   853
means the decoding function is defined as:
Chengsong
parents: 456
diff changeset
   854
Chengsong
parents: 456
diff changeset
   855
Chengsong
parents: 456
diff changeset
   856
%\begin{definition}[Bitdecoding of Values]\mbox{}
Chengsong
parents: 456
diff changeset
   857
\begin{center}
Chengsong
parents: 456
diff changeset
   858
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
Chengsong
parents: 456
diff changeset
   859
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
Chengsong
parents: 456
diff changeset
   860
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
Chengsong
parents: 456
diff changeset
   861
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   862
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
Chengsong
parents: 456
diff changeset
   863
       (\Left\,v, bs_1)$\\
Chengsong
parents: 456
diff changeset
   864
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   865
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
Chengsong
parents: 456
diff changeset
   866
       (\Right\,v, bs_1)$\\                           
Chengsong
parents: 456
diff changeset
   867
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   868
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
Chengsong
parents: 456
diff changeset
   869
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
Chengsong
parents: 456
diff changeset
   870
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
Chengsong
parents: 456
diff changeset
   871
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
Chengsong
parents: 456
diff changeset
   872
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
Chengsong
parents: 456
diff changeset
   873
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
Chengsong
parents: 456
diff changeset
   874
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
Chengsong
parents: 456
diff changeset
   875
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
Chengsong
parents: 456
diff changeset
   876
  
Chengsong
parents: 456
diff changeset
   877
  $\textit{decode}\,bs\,r$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   878
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
Chengsong
parents: 456
diff changeset
   879
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
Chengsong
parents: 456
diff changeset
   880
       \textit{else}\;\textit{None}$                       
Chengsong
parents: 456
diff changeset
   881
\end{tabular}    
Chengsong
parents: 456
diff changeset
   882
\end{center}    
Chengsong
parents: 456
diff changeset
   883
%\end{definition}
Chengsong
parents: 456
diff changeset
   884
Chengsong
parents: 456
diff changeset
   885
Sulzmann and Lu's integrated the bitcodes into regular expressions to
Chengsong
parents: 456
diff changeset
   886
create annotated regular expressions \cite{Sulzmann2014}.
Chengsong
parents: 456
diff changeset
   887
\emph{Annotated regular expressions} are defined by the following
Chengsong
parents: 456
diff changeset
   888
grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
Chengsong
parents: 456
diff changeset
   889
Chengsong
parents: 456
diff changeset
   890
\begin{center}
Chengsong
parents: 456
diff changeset
   891
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
   892
  $\textit{a}$ & $::=$  & $\ZERO$\\
Chengsong
parents: 456
diff changeset
   893
                  & $\mid$ & $_{bs}\ONE$\\
Chengsong
parents: 456
diff changeset
   894
                  & $\mid$ & $_{bs}{\bf c}$\\
Chengsong
parents: 456
diff changeset
   895
                  & $\mid$ & $_{bs}\sum\,as$\\
Chengsong
parents: 456
diff changeset
   896
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
Chengsong
parents: 456
diff changeset
   897
                  & $\mid$ & $_{bs}a^*$
Chengsong
parents: 456
diff changeset
   898
\end{tabular}    
Chengsong
parents: 456
diff changeset
   899
\end{center}  
Chengsong
parents: 456
diff changeset
   900
%(in \textit{ALTS})
Chengsong
parents: 456
diff changeset
   901
Chengsong
parents: 456
diff changeset
   902
\noindent
Chengsong
parents: 456
diff changeset
   903
where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
Chengsong
parents: 456
diff changeset
   904
expressions and $as$ for a list of annotated regular expressions.
Chengsong
parents: 456
diff changeset
   905
The alternative constructor($\sum$) has been generalized to 
Chengsong
parents: 456
diff changeset
   906
accept a list of annotated regular expressions rather than just 2.
Chengsong
parents: 456
diff changeset
   907
We will show that these bitcodes encode information about
Chengsong
parents: 456
diff changeset
   908
the (POSIX) value that should be generated by the Sulzmann and Lu
Chengsong
parents: 456
diff changeset
   909
algorithm.
Chengsong
parents: 456
diff changeset
   910
Chengsong
parents: 456
diff changeset
   911
Chengsong
parents: 456
diff changeset
   912
To do lexing using annotated regular expressions, we shall first
Chengsong
parents: 456
diff changeset
   913
transform the usual (un-annotated) regular expressions into annotated
Chengsong
parents: 456
diff changeset
   914
regular expressions. This operation is called \emph{internalisation} and
Chengsong
parents: 456
diff changeset
   915
defined as follows:
Chengsong
parents: 456
diff changeset
   916
Chengsong
parents: 456
diff changeset
   917
%\begin{definition}
Chengsong
parents: 456
diff changeset
   918
\begin{center}
Chengsong
parents: 456
diff changeset
   919
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
   920
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
Chengsong
parents: 456
diff changeset
   921
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
Chengsong
parents: 456
diff changeset
   922
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
Chengsong
parents: 456
diff changeset
   923
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   924
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
Chengsong
parents: 456
diff changeset
   925
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
Chengsong
parents: 456
diff changeset
   926
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   927
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
Chengsong
parents: 456
diff changeset
   928
  $(r^*)^\uparrow$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   929
         $_{[]}(r^\uparrow)^*$\\
Chengsong
parents: 456
diff changeset
   930
\end{tabular}    
Chengsong
parents: 456
diff changeset
   931
\end{center}    
Chengsong
parents: 456
diff changeset
   932
%\end{definition}
Chengsong
parents: 456
diff changeset
   933
Chengsong
parents: 456
diff changeset
   934
\noindent
Chengsong
parents: 456
diff changeset
   935
We use up arrows here to indicate that the basic un-annotated regular
Chengsong
parents: 456
diff changeset
   936
expressions are ``lifted up'' into something slightly more complex. In the
Chengsong
parents: 456
diff changeset
   937
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
Chengsong
parents: 456
diff changeset
   938
attach bits to the front of an annotated regular expression. Its
Chengsong
parents: 456
diff changeset
   939
definition is as follows:
Chengsong
parents: 456
diff changeset
   940
Chengsong
parents: 456
diff changeset
   941
\begin{center}
Chengsong
parents: 456
diff changeset
   942
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
   943
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
Chengsong
parents: 456
diff changeset
   944
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   945
     $_{bs @ bs'}\ONE$\\
Chengsong
parents: 456
diff changeset
   946
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   947
     $_{bs@bs'}{\bf c}$\\
Chengsong
parents: 456
diff changeset
   948
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   949
     $_{bs@bs'}\sum\textit{as}$\\
Chengsong
parents: 456
diff changeset
   950
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   951
     $_{bs@bs'}a_1 \cdot a_2$\\
Chengsong
parents: 456
diff changeset
   952
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   953
     $_{bs @ bs'}a^*$
Chengsong
parents: 456
diff changeset
   954
\end{tabular}    
Chengsong
parents: 456
diff changeset
   955
\end{center}  
Chengsong
parents: 456
diff changeset
   956
Chengsong
parents: 456
diff changeset
   957
\noindent
Chengsong
parents: 456
diff changeset
   958
After internalising the regular expression, we perform successive
Chengsong
parents: 456
diff changeset
   959
derivative operations on the annotated regular expressions. This
Chengsong
parents: 456
diff changeset
   960
derivative operation is the same as what we had previously for the
Chengsong
parents: 456
diff changeset
   961
basic regular expressions, except that we beed to take care of
Chengsong
parents: 456
diff changeset
   962
the bitcodes:
Chengsong
parents: 456
diff changeset
   963
Chengsong
parents: 456
diff changeset
   964
Chengsong
parents: 456
diff changeset
   965
\iffalse
Chengsong
parents: 456
diff changeset
   966
 %\begin{definition}{bder}
Chengsong
parents: 456
diff changeset
   967
\begin{center}
Chengsong
parents: 456
diff changeset
   968
  \begin{tabular}{@{}lcl@{}}
Chengsong
parents: 456
diff changeset
   969
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
Chengsong
parents: 456
diff changeset
   970
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
Chengsong
parents: 456
diff changeset
   971
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   972
        $\textit{if}\;c=d\; \;\textit{then}\;
Chengsong
parents: 456
diff changeset
   973
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
Chengsong
parents: 456
diff changeset
   974
  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   975
  $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
Chengsong
parents: 456
diff changeset
   976
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   977
     $\textit{if}\;\textit{bnullable}\,a_1$\\
Chengsong
parents: 456
diff changeset
   978
					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
Chengsong
parents: 456
diff changeset
   979
					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
Chengsong
parents: 456
diff changeset
   980
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
Chengsong
parents: 456
diff changeset
   981
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   982
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
Chengsong
parents: 456
diff changeset
   983
       (\textit{STAR}\,[]\,r)$
Chengsong
parents: 456
diff changeset
   984
\end{tabular}    
Chengsong
parents: 456
diff changeset
   985
\end{center}    
Chengsong
parents: 456
diff changeset
   986
%\end{definition}
Chengsong
parents: 456
diff changeset
   987
Chengsong
parents: 456
diff changeset
   988
\begin{center}
Chengsong
parents: 456
diff changeset
   989
  \begin{tabular}{@{}lcl@{}}
Chengsong
parents: 456
diff changeset
   990
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
Chengsong
parents: 456
diff changeset
   991
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
Chengsong
parents: 456
diff changeset
   992
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   993
        $\textit{if}\;c=d\; \;\textit{then}\;
Chengsong
parents: 456
diff changeset
   994
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
Chengsong
parents: 456
diff changeset
   995
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   996
  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
Chengsong
parents: 456
diff changeset
   997
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
   998
     $\textit{if}\;\textit{bnullable}\,a_1$\\
Chengsong
parents: 456
diff changeset
   999
					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
Chengsong
parents: 456
diff changeset
  1000
					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
Chengsong
parents: 456
diff changeset
  1001
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
Chengsong
parents: 456
diff changeset
  1002
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1003
      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
Chengsong
parents: 456
diff changeset
  1004
       (_{bs}\textit{STAR}\,[]\,r)$
Chengsong
parents: 456
diff changeset
  1005
\end{tabular}    
Chengsong
parents: 456
diff changeset
  1006
\end{center}    
Chengsong
parents: 456
diff changeset
  1007
%\end{definition}
Chengsong
parents: 456
diff changeset
  1008
\fi
Chengsong
parents: 456
diff changeset
  1009
Chengsong
parents: 456
diff changeset
  1010
\begin{center}
Chengsong
parents: 456
diff changeset
  1011
  \begin{tabular}{@{}lcl@{}}
Chengsong
parents: 456
diff changeset
  1012
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
Chengsong
parents: 456
diff changeset
  1013
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
Chengsong
parents: 456
diff changeset
  1014
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1015
        $\textit{if}\;c=d\; \;\textit{then}\;
Chengsong
parents: 456
diff changeset
  1016
         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
Chengsong
parents: 456
diff changeset
  1017
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1018
  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
Chengsong
parents: 456
diff changeset
  1019
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1020
     $\textit{if}\;\textit{bnullable}\,a_1$\\
Chengsong
parents: 456
diff changeset
  1021
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
Chengsong
parents: 456
diff changeset
  1022
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
Chengsong
parents: 456
diff changeset
  1023
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
Chengsong
parents: 456
diff changeset
  1024
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1025
      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
Chengsong
parents: 456
diff changeset
  1026
       (_{[]}r^*))$
Chengsong
parents: 456
diff changeset
  1027
\end{tabular}    
Chengsong
parents: 456
diff changeset
  1028
\end{center}    
Chengsong
parents: 456
diff changeset
  1029
Chengsong
parents: 456
diff changeset
  1030
%\end{definition}
Chengsong
parents: 456
diff changeset
  1031
\noindent
Chengsong
parents: 456
diff changeset
  1032
For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
Chengsong
parents: 456
diff changeset
  1033
we need to unfold it into a sequence,
Chengsong
parents: 456
diff changeset
  1034
and attach an additional bit $0$ to the front of $r \backslash c$
Chengsong
parents: 456
diff changeset
  1035
to indicate that there is one more star iteration. Also the sequence clause
Chengsong
parents: 456
diff changeset
  1036
is more subtle---when $a_1$ is $\textit{bnullable}$ (here
Chengsong
parents: 456
diff changeset
  1037
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
Chengsong
parents: 456
diff changeset
  1038
that it is for annotated regular expressions, therefore we omit the
Chengsong
parents: 456
diff changeset
  1039
definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
Chengsong
parents: 456
diff changeset
  1040
$a_1$ matches the string prior to character $c$ (more on this later),
Chengsong
parents: 456
diff changeset
  1041
then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
Chengsong
parents: 456
diff changeset
  1042
\backslash c)$ will collapse the regular expression $a_1$(as it has
Chengsong
parents: 456
diff changeset
  1043
already been fully matched) and store the parsing information at the
Chengsong
parents: 456
diff changeset
  1044
head of the regular expression $a_2 \backslash c$ by fusing to it. The
Chengsong
parents: 456
diff changeset
  1045
bitsequence $\textit{bs}$, which was initially attached to the
Chengsong
parents: 456
diff changeset
  1046
first element of the sequence $a_1 \cdot a_2$, has
Chengsong
parents: 456
diff changeset
  1047
now been elevated to the top-level of $\sum$, as this information will be
Chengsong
parents: 456
diff changeset
  1048
needed whichever way the sequence is matched---no matter whether $c$ belongs
Chengsong
parents: 456
diff changeset
  1049
to $a_1$ or $ a_2$. After building these derivatives and maintaining all
Chengsong
parents: 456
diff changeset
  1050
the lexing information, we complete the lexing by collecting the
Chengsong
parents: 456
diff changeset
  1051
bitcodes using a generalised version of the $\textit{mkeps}$ function
Chengsong
parents: 456
diff changeset
  1052
for annotated regular expressions, called $\textit{bmkeps}$:
Chengsong
parents: 456
diff changeset
  1053
Chengsong
parents: 456
diff changeset
  1054
Chengsong
parents: 456
diff changeset
  1055
%\begin{definition}[\textit{bmkeps}]\mbox{}
Chengsong
parents: 456
diff changeset
  1056
\begin{center}
Chengsong
parents: 456
diff changeset
  1057
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
  1058
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
Chengsong
parents: 456
diff changeset
  1059
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1060
     $\textit{if}\;\textit{bnullable}\,a$\\
Chengsong
parents: 456
diff changeset
  1061
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
Chengsong
parents: 456
diff changeset
  1062
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
Chengsong
parents: 456
diff changeset
  1063
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1064
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
Chengsong
parents: 456
diff changeset
  1065
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1066
     $bs \,@\, [0]$
Chengsong
parents: 456
diff changeset
  1067
\end{tabular}    
Chengsong
parents: 456
diff changeset
  1068
\end{center}    
Chengsong
parents: 456
diff changeset
  1069
%\end{definition}
Chengsong
parents: 456
diff changeset
  1070
Chengsong
parents: 456
diff changeset
  1071
\noindent
Chengsong
parents: 456
diff changeset
  1072
This function completes the value information by travelling along the
Chengsong
parents: 456
diff changeset
  1073
path of the regular expression that corresponds to a POSIX value and
Chengsong
parents: 456
diff changeset
  1074
collecting all the bitcodes, and using $S$ to indicate the end of star
Chengsong
parents: 456
diff changeset
  1075
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
Chengsong
parents: 456
diff changeset
  1076
decode them, we get the value we expect. The corresponding lexing
Chengsong
parents: 456
diff changeset
  1077
algorithm looks as follows:
Chengsong
parents: 456
diff changeset
  1078
Chengsong
parents: 456
diff changeset
  1079
\begin{center}
Chengsong
parents: 456
diff changeset
  1080
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
  1081
  $\textit{blexer}\;r\,s$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1082
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
Chengsong
parents: 456
diff changeset
  1083
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
Chengsong
parents: 456
diff changeset
  1084
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
Chengsong
parents: 456
diff changeset
  1085
  & & $\;\;\textit{else}\;\textit{None}$
Chengsong
parents: 456
diff changeset
  1086
\end{tabular}
Chengsong
parents: 456
diff changeset
  1087
\end{center}
Chengsong
parents: 456
diff changeset
  1088
Chengsong
parents: 456
diff changeset
  1089
\noindent
Chengsong
parents: 456
diff changeset
  1090
In this definition $\_\backslash s$ is the  generalisation  of the derivative
Chengsong
parents: 456
diff changeset
  1091
operation from characters to strings (just like the derivatives for un-annotated
Chengsong
parents: 456
diff changeset
  1092
regular expressions).
Chengsong
parents: 456
diff changeset
  1093
Chengsong
parents: 456
diff changeset
  1094
Remember tha one of the important reasons we introduced bitcodes
Chengsong
parents: 456
diff changeset
  1095
is that they can make simplification more structured and therefore guaranteeing
Chengsong
parents: 456
diff changeset
  1096
the correctness.
Chengsong
parents: 456
diff changeset
  1097
Chengsong
parents: 456
diff changeset
  1098
\subsection*{Our Simplification Rules}
Chengsong
parents: 456
diff changeset
  1099
Chengsong
parents: 456
diff changeset
  1100
In this section we introduce aggressive (in terms of size) simplification rules
Chengsong
parents: 456
diff changeset
  1101
on annotated regular expressions
Chengsong
parents: 456
diff changeset
  1102
in order to keep derivatives small. Such simplifications are promising
Chengsong
parents: 456
diff changeset
  1103
as we have
Chengsong
parents: 456
diff changeset
  1104
generated test data that show
Chengsong
parents: 456
diff changeset
  1105
that a good tight bound can be achieved. Obviously we could only
Chengsong
parents: 456
diff changeset
  1106
partially cover  the search space as there are infinitely many regular
Chengsong
parents: 456
diff changeset
  1107
expressions and strings. 
Chengsong
parents: 456
diff changeset
  1108
Chengsong
parents: 456
diff changeset
  1109
One modification we introduced is to allow a list of annotated regular
Chengsong
parents: 456
diff changeset
  1110
expressions in the $\sum$ constructor. This allows us to not just
Chengsong
parents: 456
diff changeset
  1111
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
Chengsong
parents: 456
diff changeset
  1112
also unnecessary ``copies'' of regular expressions (very similar to
Chengsong
parents: 456
diff changeset
  1113
simplifying $r + r$ to just $r$, but in a more general setting). Another
Chengsong
parents: 456
diff changeset
  1114
modification is that we use simplification rules inspired by Antimirov's
Chengsong
parents: 456
diff changeset
  1115
work on partial derivatives. They maintain the idea that only the first
Chengsong
parents: 456
diff changeset
  1116
``copy'' of a regular expression in an alternative contributes to the
Chengsong
parents: 456
diff changeset
  1117
calculation of a POSIX value. All subsequent copies can be pruned away from
Chengsong
parents: 456
diff changeset
  1118
the regular expression. A recursive definition of our  simplification function 
Chengsong
parents: 456
diff changeset
  1119
that looks somewhat similar to our Scala code is given below:
Chengsong
parents: 456
diff changeset
  1120
%\comment{Use $\ZERO$, $\ONE$ and so on. 
Chengsong
parents: 456
diff changeset
  1121
%Is it $ALTS$ or $ALTS$?}\\
Chengsong
parents: 456
diff changeset
  1122
Chengsong
parents: 456
diff changeset
  1123
\begin{center}
Chengsong
parents: 456
diff changeset
  1124
  \begin{tabular}{@{}lcl@{}}
Chengsong
parents: 456
diff changeset
  1125
   
Chengsong
parents: 456
diff changeset
  1126
  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
Chengsong
parents: 456
diff changeset
  1127
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
Chengsong
parents: 456
diff changeset
  1128
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
Chengsong
parents: 456
diff changeset
  1129
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
Chengsong
parents: 456
diff changeset
  1130
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
Chengsong
parents: 456
diff changeset
  1131
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
Chengsong
parents: 456
diff changeset
  1132
Chengsong
parents: 456
diff changeset
  1133
  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
Chengsong
parents: 456
diff changeset
  1134
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
Chengsong
parents: 456
diff changeset
  1135
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
Chengsong
parents: 456
diff changeset
  1136
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
Chengsong
parents: 456
diff changeset
  1137
Chengsong
parents: 456
diff changeset
  1138
   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
Chengsong
parents: 456
diff changeset
  1139
\end{tabular}    
Chengsong
parents: 456
diff changeset
  1140
\end{center}    
Chengsong
parents: 456
diff changeset
  1141
Chengsong
parents: 456
diff changeset
  1142
\noindent
Chengsong
parents: 456
diff changeset
  1143
The simplification does a pattern matching on the regular expression.
Chengsong
parents: 456
diff changeset
  1144
When it detected that the regular expression is an alternative or
Chengsong
parents: 456
diff changeset
  1145
sequence, it will try to simplify its children regular expressions
Chengsong
parents: 456
diff changeset
  1146
recursively and then see if one of the children turn into $\ZERO$ or
Chengsong
parents: 456
diff changeset
  1147
$\ONE$, which might trigger further simplification at the current level.
Chengsong
parents: 456
diff changeset
  1148
The most involved part is the $\sum$ clause, where we use two
Chengsong
parents: 456
diff changeset
  1149
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
Chengsong
parents: 456
diff changeset
  1150
alternatives and reduce as many duplicates as possible. Function
Chengsong
parents: 456
diff changeset
  1151
$\textit{distinct}$  keeps the first occurring copy only and remove all later ones
Chengsong
parents: 456
diff changeset
  1152
when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Chengsong
parents: 456
diff changeset
  1153
Its recursive definition is given below:
Chengsong
parents: 456
diff changeset
  1154
Chengsong
parents: 456
diff changeset
  1155
 \begin{center}
Chengsong
parents: 456
diff changeset
  1156
  \begin{tabular}{@{}lcl@{}}
Chengsong
parents: 456
diff changeset
  1157
  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
Chengsong
parents: 456
diff changeset
  1158
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
Chengsong
parents: 456
diff changeset
  1159
  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
Chengsong
parents: 456
diff changeset
  1160
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
Chengsong
parents: 456
diff changeset
  1161
\end{tabular}    
Chengsong
parents: 456
diff changeset
  1162
\end{center}  
Chengsong
parents: 456
diff changeset
  1163
Chengsong
parents: 456
diff changeset
  1164
\noindent
Chengsong
parents: 456
diff changeset
  1165
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
Chengsong
parents: 456
diff changeset
  1166
function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
Chengsong
parents: 456
diff changeset
  1167
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
Chengsong
parents: 456
diff changeset
  1168
Chengsong
parents: 456
diff changeset
  1169
Having defined the $\simp$ function,
Chengsong
parents: 456
diff changeset
  1170
we can use the previous notation of  natural
Chengsong
parents: 456
diff changeset
  1171
extension from derivative w.r.t.~character to derivative
Chengsong
parents: 456
diff changeset
  1172
w.r.t.~string:%\comment{simp in  the [] case?}
Chengsong
parents: 456
diff changeset
  1173
Chengsong
parents: 456
diff changeset
  1174
\begin{center}
Chengsong
parents: 456
diff changeset
  1175
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
  1176
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
Chengsong
parents: 456
diff changeset
  1177
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
Chengsong
parents: 456
diff changeset
  1178
\end{tabular}
Chengsong
parents: 456
diff changeset
  1179
\end{center}
Chengsong
parents: 456
diff changeset
  1180
Chengsong
parents: 456
diff changeset
  1181
\noindent
Chengsong
parents: 456
diff changeset
  1182
to obtain an optimised version of the algorithm:
Chengsong
parents: 456
diff changeset
  1183
Chengsong
parents: 456
diff changeset
  1184
 \begin{center}
Chengsong
parents: 456
diff changeset
  1185
\begin{tabular}{lcl}
Chengsong
parents: 456
diff changeset
  1186
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
Chengsong
parents: 456
diff changeset
  1187
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
Chengsong
parents: 456
diff changeset
  1188
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
Chengsong
parents: 456
diff changeset
  1189
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
Chengsong
parents: 456
diff changeset
  1190
  & & $\;\;\textit{else}\;\textit{None}$
Chengsong
parents: 456
diff changeset
  1191
\end{tabular}
Chengsong
parents: 456
diff changeset
  1192
\end{center}
Chengsong
parents: 456
diff changeset
  1193
Chengsong
parents: 456
diff changeset
  1194
\noindent
Chengsong
parents: 456
diff changeset
  1195
This algorithm keeps the regular expression size small, for example,
Chengsong
parents: 456
diff changeset
  1196
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
Chengsong
parents: 456
diff changeset
  1197
will be reduced to just 6 and stays constant, no matter how long the
Chengsong
parents: 456
diff changeset
  1198
input string is.
Chengsong
parents: 456
diff changeset
  1199
Chengsong
parents: 456
diff changeset
  1200
456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1201
 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1202
Derivatives give a simple solution
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1203
to the problem of matching a string $s$ with a regular
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1204
expression $r$: if the derivative of $r$ w.r.t.\ (in
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1205
succession) all the characters of the string matches the empty string,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1206
then $r$ matches $s$ (and {\em vice versa}).  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1207
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1208
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1209
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1210
However, there are two difficulties with derivative-based matchers:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1211
First, Brzozowski's original matcher only generates a yes/no answer
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1212
for whether a regular expression matches a string or not.  This is too
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1213
little information in the context of lexing where separate tokens must
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1214
be identified and also classified (for example as keywords
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1215
or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1216
difficulty by cleverly extending Brzozowski's matching
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1217
algorithm. Their extended version generates additional information on
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1218
\emph{how} a regular expression matches a string following the POSIX
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1219
rules for regular expression matching. They achieve this by adding a
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1220
second ``phase'' to Brzozowski's algorithm involving an injection
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1221
function.  In our own earlier work we provided the formal
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1222
specification of what POSIX matching means and proved in Isabelle/HOL
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1223
the correctness
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1224
of Sulzmann and Lu's extended algorithm accordingly
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1225
\cite{AusafDyckhoffUrban2016}.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1226
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1227
The second difficulty is that Brzozowski's derivatives can 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1228
grow to arbitrarily big sizes. For example if we start with the
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1229
regular expression $(a+aa)^*$ and take
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1230
successive derivatives according to the character $a$, we end up with
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1231
a sequence of ever-growing derivatives like 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1232
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1233
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1234
\begin{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1235
\begin{tabular}{rll}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1236
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1237
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1238
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1239
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1240
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1241
\end{tabular}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1242
\end{center}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1243
 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1244
\noindent where after around 35 steps we run out of memory on a
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1245
typical computer (we shall define shortly the precise details of our
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1246
regular expressions and the derivative operation).  Clearly, the
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1247
notation involving $\ZERO$s and $\ONE$s already suggests
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1248
simplification rules that can be applied to regular regular
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1249
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1250
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1251
r$. While such simple-minded simplifications have been proved in our
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1252
earlier work to preserve the correctness of Sulzmann and Lu's
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1253
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1254
\emph{not} help with limiting the growth of the derivatives shown
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1255
above: the growth is slowed, but the derivatives can still grow rather
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1256
quickly beyond any finite bound.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1257
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1258
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1259
Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1260
\cite{Sulzmann2014} where they introduce bitcoded
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1261
regular expressions. In this version, POSIX values are
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1262
represented as bitsequences and such sequences are incrementally generated
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1263
when derivatives are calculated. The compact representation
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1264
of bitsequences and regular expressions allows them to define a more
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1265
``aggressive'' simplification method that keeps the size of the
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1266
derivatives finite no matter what the length of the string is.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1267
They make some informal claims about the correctness and linear behaviour
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1268
of this version, but do not provide any supporting proof arguments, not
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1269
even ``pencil-and-paper'' arguments. They write about their bitcoded
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1270
\emph{incremental parsing method} (that is the algorithm to be formalised
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1271
in this paper):
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1272
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1273
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1274
\begin{quote}\it
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1275
  ``Correctness Claim: We further claim that the incremental parsing
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1276
  method [..] in combination with the simplification steps [..]
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1277
  yields POSIX parse trees. We have tested this claim
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1278
  extensively [..] but yet
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1279
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1280
\end{quote}  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1281
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1282
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1283
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1284
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1285
\section{Backgound}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1286
%Regular expression matching and lexing has been 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1287
% widely-used and well-implemented
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1288
%in software industry. 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1289
%TODO: expand the above into a full paragraph
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1290
%TODO: look up snort rules to use here--give readers idea of what regexes look like
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1291
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1292
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1293
Theoretical results say that regular expression matching
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1294
should be linear with respect to the input.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1295
Under a certain class of regular expressions and inputs though,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1296
practical implementations  suffer from non-linear or even 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1297
exponential running time,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1298
allowing a ReDoS (regular expression denial-of-service ) attack.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1299
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1300
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1301
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1302
465
Chengsong
parents: 456
diff changeset
  1303
\section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking}
456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1304
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1305
The reason behind is that regex libraries in popular language engines
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1306
 often want to support richer constructs
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1307
than  the most basic regular expressions, and lexing rather than matching
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1308
is needed for sub-match extraction.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1309
465
Chengsong
parents: 456
diff changeset
  1310
There is also static analysis work on regular expression that
Chengsong
parents: 456
diff changeset
  1311
have potential expoential behavious. Rathnayake and Thielecke 
Chengsong
parents: 456
diff changeset
  1312
\parencite{Rathnayake2014StaticAF} proposed an algorithm
Chengsong
parents: 456
diff changeset
  1313
that detects regular expressions triggering exponential
Chengsong
parents: 456
diff changeset
  1314
behavious on backtracking matchers.
Chengsong
parents: 456
diff changeset
  1315
People also developed static analysis methods for
Chengsong
parents: 456
diff changeset
  1316
generating non-linear polynomial worst-time estimates
Chengsong
parents: 456
diff changeset
  1317
for regexes, attack string that exploit the worst-time 
Chengsong
parents: 456
diff changeset
  1318
scenario, and "attack automata" that generates
Chengsong
parents: 456
diff changeset
  1319
attack strings.
Chengsong
parents: 456
diff changeset
  1320
For a comprehensive analysis, please refer to Weideman's thesis 
Chengsong
parents: 456
diff changeset
  1321
\parencite{Weideman2017Static}.
456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1322
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1323
\subsection{DFA Approach}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1324
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1325
Exponential states.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1326
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1327
\subsection{NFA Approach}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1328
Backtracking.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1329
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1330
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1331
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1332
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1333
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1334
\section{Our Approach}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1335
In the last fifteen or so years, Brzozowski's derivatives of regular
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1336
expressions have sparked quite a bit of interest in the functional
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1337
programming and theorem prover communities.  The beauty of
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1338
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1339
expressible in any functional language, and easily definable and
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1340
reasoned about in theorem provers---the definitions just consist of
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1341
inductive datatypes and simple recursive functions.  Derivatives of a
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1342
regular expression, written $r \backslash c$, give a simple solution
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1343
to the problem of matching a string $s$ with a regular
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1344
expression $r$: if the derivative of $r$ w.r.t.\ (in
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1345
succession) all the characters of the string matches the empty string,
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1346
then $r$ matches $s$ (and {\em vice versa}).  
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1347
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1348
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1349
This work aims to address the above vulnerability by the combination
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1350
of Brzozowski's derivatives and interactive theorem proving. We give an 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1351
improved version of  Sulzmann and Lu's bit-coded algorithm using 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1352
derivatives, which come with a formal guarantee in terms of correctness and 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1353
running time as an Isabelle/HOL proof.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1354
Then we improve the algorithm with an even stronger version of 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1355
simplification, and prove a time bound linear to input and
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1356
cubic to regular expression size using a technique by
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1357
Antimirov.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1358
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1359
\subsection{Existing Work}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1360
We are aware
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1361
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1362
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1363
of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1364
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1365
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1366
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1367
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1368
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1369
\section{What this Template Includes}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1370
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1371
\subsection{Folders}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1372
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1373
This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1374
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1375
\keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1376
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1377
\keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. Each chapter should go in its own separate \file{.tex} file and they can be split as:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1378
\begin{itemize}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1379
\item Chapter 1: Introduction to the thesis topic
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1380
\item Chapter 2: Background information and theory
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1381
\item Chapter 3: (Laboratory) experimental setup
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1382
\item Chapter 4: Details of experiment 1
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1383
\item Chapter 5: Details of experiment 2
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1384
\item Chapter 6: Discussion of the experimental results
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1385
\item Chapter 7: Conclusion and future directions
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1386
\end{itemize}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1387
This chapter layout is specialised for the experimental sciences, your discipline may be different.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1388
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1389
\keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1390
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1391
\subsection{Files}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1392
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1393
Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1394
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1395
\keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1396
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1397
\keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis. 
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1398
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1399
\keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1400
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1401
\keyword{main.tex} -- this is an important file. This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis!
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1402
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1403
Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1404
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1405
\keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1406
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1407
\keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1408
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1409
\keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1410
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1411
\keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1412
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1413
\keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1414
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1415
\keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1416
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1417
\keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1418
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1419
So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1420
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1421
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1422
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1423
\section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1424
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1425
You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1426
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1427
Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1428
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1429
Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb|\href{url}{name}| and only leave the name.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1430
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1431
When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper!
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1432
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1433
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1434
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1435
\section{The \code{main.tex} File Explained}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1436
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1437
The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1438
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1439
Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1440
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1441
Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1442
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1443
Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1444
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1445
Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor).
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1446
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1447
The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1448
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1449
The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1450
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1451
The next page contains a one line dedication. Who will you dedicate your thesis to?
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1452
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1453
Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1454
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1455
After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1456
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1457
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1458
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1459
\section{Thesis Features and Conventions}\label{ThesisConventions}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1460
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1461
To get the best out of this template, there are a few conventions that you may want to follow.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1462
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1463
One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1464
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1465
\subsection{Printing Format}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1466
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1467
This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1468
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1469
The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1470
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1471
The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1472
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1473
\subsection{Using US Letter Paper}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1474
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1475
The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1476
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1477
Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1478
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1479
\subsection{References}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1480
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1481
The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. Many reference managers allow you to simply drag the reference into the document as you type.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1482
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1483
Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1484
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1485
The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations).
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1486
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1487
\subsubsection{A Note on bibtex}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1488
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1489
The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1490
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1491
\subsection{Tables}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1492
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1493
Tables are an important way of displaying your results, below is an example table which was generated with this code:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1494
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1495
{\small
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1496
\begin{verbatim}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1497
\begin{table}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1498
\caption{The effects of treatments X and Y on the four groups studied.}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1499
\label{tab:treatments}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1500
\centering
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1501
\begin{tabular}{l l l}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1502
\toprule
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1503
\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1504
\midrule
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1505
1 & 0.2 & 0.8\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1506
2 & 0.17 & 0.7\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1507
3 & 0.24 & 0.75\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1508
4 & 0.68 & 0.3\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1509
\bottomrule\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1510
\end{tabular}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1511
\end{table}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1512
\end{verbatim}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1513
}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1514
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1515
\begin{table}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1516
\caption{The effects of treatments X and Y on the four groups studied.}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1517
\label{tab:treatments}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1518
\centering
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1519
\begin{tabular}{l l l}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1520
\toprule
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1521
\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1522
\midrule
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1523
1 & 0.2 & 0.8\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1524
2 & 0.17 & 0.7\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1525
3 & 0.24 & 0.75\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1526
4 & 0.68 & 0.3\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1527
\bottomrule\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1528
\end{tabular}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1529
\end{table}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1530
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1531
You can reference tables with \verb|\ref{<label>}| where the label is defined within the table environment. See \file{Chapter1.tex} for an example of the label and citation (e.g. Table~\ref{tab:treatments}).
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1532
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1533
\subsection{Figures}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1534
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1535
There will hopefully be many figures in your thesis (that should be placed in the \emph{Figures} folder). The way to insert figures into your thesis is to use a code template like this:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1536
\begin{verbatim}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1537
\begin{figure}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1538
\centering
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1539
\includegraphics{Figures/Electron}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1540
\decoRule
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1541
\caption[An Electron]{An electron (artist's impression).}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1542
\label{fig:Electron}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1543
\end{figure}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1544
\end{verbatim}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1545
Also look in the source file. Putting this code into the source file produces the picture of the electron that you can see in the figure below.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1546
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1547
\begin{figure}[th]
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1548
\centering
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1549
\includegraphics{Figures/Electron}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1550
\decoRule
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1551
\caption[An Electron]{An electron (artist's impression).}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1552
\label{fig:Electron}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1553
\end{figure}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1554
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1555
Sometimes figures don't always appear where you write them in the source. The placement depends on how much space there is on the page for the figure. Sometimes there is not enough room to fit a figure directly where it should go (in relation to the text) and so \LaTeX{} puts it at the top of the next page. Positioning figures is the job of \LaTeX{} and so you should only worry about making them look good!
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1556
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1557
Figures usually should have captions just in case you need to refer to them (such as in Figure~\ref{fig:Electron}). The \verb|\caption| command contains two parts, the first part, inside the square brackets is the title that will appear in the \emph{List of Figures}, and so should be short. The second part in the curly brackets should contain the longer and more descriptive caption text.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1558
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1559
The \verb|\decoRule| command is optional and simply puts an aesthetic horizontal line below the image. If you do this for one image, do it for all of them.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1560
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1561
\LaTeX{} is capable of using images in pdf, jpg and png format.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1562
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1563
\subsection{Typesetting mathematics}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1564
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1565
If your thesis is going to contain heavy mathematical content, be sure that \LaTeX{} will make it look beautiful, even though it won't be able to solve the equations for you.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1566
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1567
The \enquote{Not So Short Introduction to \LaTeX} (available on \href{http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf}{CTAN}) should tell you everything you need to know for most cases of typesetting mathematics. If you need more information, a much more thorough mathematical guide is available from the AMS called, \enquote{A Short Math Guide to \LaTeX} and can be downloaded from:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1568
\url{ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1569
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1570
There are many different \LaTeX{} symbols to remember, luckily you can find the most common symbols in \href{http://ctan.org/pkg/comprehensive}{The Comprehensive \LaTeX~Symbol List}.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1571
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1572
You can write an equation, which is automatically given an equation number by \LaTeX{} like this:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1573
\begin{verbatim}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1574
\begin{equation}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1575
E = mc^{2}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1576
\label{eqn:Einstein}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1577
\end{equation}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1578
\end{verbatim}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1579
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1580
This will produce Einstein's famous energy-matter equivalence equation:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1581
\begin{equation}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1582
E = mc^{2}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1583
\label{eqn:Einstein}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1584
\end{equation}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1585
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1586
All equations you write (which are not in the middle of paragraph text) are automatically given equation numbers by \LaTeX{}. If you don't want a particular equation numbered, use the unnumbered form:
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1587
\begin{verbatim}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1588
\[ a^{2}=4 \]
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1589
\end{verbatim}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1590
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1591
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1592
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1593
\section{Sectioning and Subsectioning}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1594
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1595
You should break your thesis up into nice, bite-sized sections and subsections. \LaTeX{} automatically builds a table of Contents by looking at all the \verb|\chapter{}|, \verb|\section{}|  and \verb|\subsection{}| commands you write in the source.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1596
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1597
The Table of Contents should only list the sections to three (3) levels. A \verb|chapter{}| is level zero (0). A \verb|\section{}| is level one (1) and so a \verb|\subsection{}| is level two (2). In your thesis it is likely that you will even use a \verb|subsubsection{}|, which is level three (3). The depth to which the Table of Contents is formatted is set within \file{MastersDoctoralThesis.cls}. If you need this changed, you can do it in \file{main.tex}.
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1598
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1599
%----------------------------------------------------------------------------------------
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1600
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1601
\section{In Closing}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1602
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1603
You have reached the end of this mini-guide. You can now rename or overwrite this pdf file and begin writing your own \file{Chapter1.tex} and the rest of your thesis. The easy work of setting up the structure and framework has been taken care of for you. It's now your job to fill it out!
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1604
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1605
Good luck and have lots of fun!
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1606
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1607
\begin{flushright}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1608
Guide written by ---\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1609
Sunil Patel: \href{http://www.sunilpatel.co.uk}{www.sunilpatel.co.uk}\\
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1610
Vel: \href{http://www.LaTeXTemplates.com}{LaTeXTemplates.com}
26a5e640cdd7 realPhdThesis
Chengsong
parents:
diff changeset
  1611
\end{flushright}