ChengsongTanPhdThesis/Chapters/Chapter1.tex
author Chengsong
Fri, 25 Mar 2022 21:49:53 +0000
changeset 468 a0f27e21b42c
child 469 e5dd8cc0aa82
permissions -rwxr-xr-x
all texrelated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     1
% Chapter 1
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     2
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     3
\chapter{Introduction} % Main chapter title
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     4
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     5
\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1} 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     6
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     7
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     8
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     9
% Define some commands to keep the formatting separated from the content 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    10
\newcommand{\keyword}[1]{\textbf{#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    11
\newcommand{\tabhead}[1]{\textbf{#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    12
\newcommand{\code}[1]{\texttt{#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    13
\newcommand{\file}[1]{\texttt{\bfseries#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    14
\newcommand{\option}[1]{\texttt{\itshape#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    15
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    16
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    17
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    18
\newcommand{\ZERO}{\mbox{\bf 0}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    19
\newcommand{\ONE}{\mbox{\bf 1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    20
\def\lexer{\mathit{lexer}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    21
\def\mkeps{\mathit{mkeps}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    22
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    23
\def\DFA{\textit{DFA}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    24
\def\bmkeps{\textit{bmkeps}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    25
\def\retrieve{\textit{retrieve}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    26
\def\blexer{\textit{blexer}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    27
\def\flex{\textit{flex}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    28
\def\inj{\mathit{inj}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    29
\def\Empty{\mathit{Empty}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    30
\def\Left{\mathit{Left}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    31
\def\Right{\mathit{Right}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    32
\def\Stars{\mathit{Stars}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    33
\def\Char{\mathit{Char}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    34
\def\Seq{\mathit{Seq}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    35
\def\Der{\mathit{Der}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    36
\def\nullable{\mathit{nullable}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    37
\def\Z{\mathit{Z}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    38
\def\S{\mathit{S}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    39
\def\rup{r^\uparrow}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    40
\def\simp{\mathit{simp}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    41
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    42
\def\map{\mathit{map}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    43
\def\distinct{\mathit{distinct}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    44
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    45
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    46
%This part is about regular expressions, Brzozowski derivatives,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    47
%and a bit-coded lexing algorithm with proven correctness and time bounds.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    48
Regular expressions are widely used in modern day programming tasks.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    49
Be it IDE with syntax hightlighting and auto completion, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    50
command line tools like $\mathit{grep}$ that facilitates easy 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    51
processing of text by search and replace, network intrusion
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    52
detection systems that rejects suspicious traffic, or compiler
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    53
front ends--there is always an important phase of the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    54
task that involves matching a regular 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    55
exression with a string.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    56
Given its usefulness and ubiquity, one would imagine that
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    57
modern regular expression matching implementations
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    58
are mature and fully-studied.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    59
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    60
If you go to a popular programming language's
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    61
regex engine,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    62
you can supply it with regex and strings of your own,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    63
and get matching/lexing  information such as how a 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    64
sub-part of the regex matches a sub-part of the string.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    65
These lexing libraries are on average quite fast.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    66
%TODO: get source for SNORT/BRO's regex matching engine/speed
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    67
For example, the regex engines some network intrusion detection
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    68
systems use are able to process
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    69
 megabytes or even gigabytes of network traffic per second.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    70
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    71
Why do we need to have our version, if the algorithms are
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    72
blindingly fast already? Well it turns out it is not always the case.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    73
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    74
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    75
Take $(a^*)^*\,b$ and ask whether
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    76
strings of the form $aa..a$ match this regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    77
expression. Obviously this is not the case---the expected $b$ in the last
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    78
position is missing. One would expect that modern regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    79
matching engines can find this out very quickly. Alas, if one tries
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    80
this example in JavaScript, Python or Java 8 with strings like 28
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    81
$a$'s, one discovers that this decision takes around 30 seconds and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    82
takes considerably longer when adding a few more $a$'s, as the graphs
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    83
below show:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    84
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    85
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    86
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    87
\begin{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    88
\begin{axis}[
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    89
    xlabel={$n$},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    90
    x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    91
    ylabel={time in secs},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    92
    enlargelimits=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    93
    xtick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    94
    xmax=33,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    95
    ymax=35,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    96
    ytick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    97
    scaled ticks=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    98
    axis lines=left,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    99
    width=5cm,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   100
    height=4cm, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   101
    legend entries={JavaScript},  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   102
    legend pos=north west,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   103
    legend cell align=left]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   104
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   105
\end{axis}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   106
\end{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   107
  &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   108
\begin{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   109
\begin{axis}[
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   110
    xlabel={$n$},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   111
    x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   112
    %ylabel={time in secs},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   113
    enlargelimits=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   114
    xtick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   115
    xmax=33,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   116
    ymax=35,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   117
    ytick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   118
    scaled ticks=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   119
    axis lines=left,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   120
    width=5cm,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   121
    height=4cm, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   122
    legend entries={Python},  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   123
    legend pos=north west,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   124
    legend cell align=left]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   125
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   126
\end{axis}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   127
\end{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   128
  &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   129
\begin{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   130
\begin{axis}[
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   131
    xlabel={$n$},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   132
    x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   133
    %ylabel={time in secs},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   134
    enlargelimits=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   135
    xtick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   136
    xmax=33,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   137
    ymax=35,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   138
    ytick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   139
    scaled ticks=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   140
    axis lines=left,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   141
    width=5cm,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   142
    height=4cm, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   143
    legend entries={Java 8},  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   144
    legend pos=north west,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   145
    legend cell align=left]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   146
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   147
\end{axis}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   148
\end{tikzpicture}\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   149
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   150
           of the form $\underbrace{aa..a}_{n}$.}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   151
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   152
\end{center}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   153
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   154
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   155
This is clearly exponential behaviour, and 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   156
is triggered by some relatively simple regex patterns.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   157
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   158
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   159
This superlinear blowup in matching algorithms sometimes cause
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   160
considerable grief in real life: for example on 20 July 2016 one evil
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   161
regular expression brought the webpage
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   162
\href{http://stackexchange.com}{Stack Exchange} to its
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   163
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   164
In this instance, a regular expression intended to just trim white
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   165
spaces from the beginning and the end of a line actually consumed
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   166
massive amounts of CPU-resources---causing web servers to grind to a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   167
halt. This happened when a post with 20,000 white spaces was submitted,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   168
but importantly the white spaces were neither at the beginning nor at
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   169
the end. As a result, the regular expression matching engine needed to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   170
backtrack over many choices. In this example, the time needed to process
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   171
the string was $O(n^2)$ with respect to the string length. This
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   172
quadratic overhead was enough for the homepage of Stack Exchange to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   173
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   174
attack and therefore stopped the servers from responding to any
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   175
requests. This made the whole site become unavailable. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   176
A more 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   177
recent example is a global outage of all Cloudflare servers on 2 July
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   178
2019. A poorly written regular expression exhibited exponential
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   179
behaviour and exhausted CPUs that serve HTTP traffic. Although the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   180
outage had several causes, at the heart was a regular expression that
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   181
was used to monitor network
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   182
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   183
%TODO: data points for some new versions of languages
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   184
These problems with regular expressions 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   185
are not isolated events that happen
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   186
very occasionally, but actually quite widespread.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   187
They occur so often that they get a 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   188
name--Regular-Expression-Denial-Of-Service (ReDoS)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   189
attack.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   190
Davis et al. \parencite{Davis18} detected more
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   191
than 1000 super-linear (SL) regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   192
in Node.js, Python core libraries, and npm and pypi. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   193
They therefore concluded that evil regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   194
are problems more than "a parlour trick", but one that
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   195
requires
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   196
more research attention.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   197
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   198
 \section{Why are current algorithm for regexes slow?}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   199
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   200
Theoretical results say that regular expression matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   201
should be linear with respect to the input. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   202
You could construct
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   203
an NFA via Thompson construction, and simulate  running it.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   204
This would be linear.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   205
Or you could determinize the NFA into a DFA, and minimize that
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   206
DFA. Once you have the DFA, the running time is also linear, requiring just 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   207
one scanning pass of the input.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   208
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   209
But modern  regex libraries in popular language engines
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   210
 often want to support richer constructs
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   211
such as bounded repetitions and back-references
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   212
rather than  the most basic regular expressions.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   213
%put in illustrations
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   214
%a{1}{3}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   215
These make a DFA construction impossible because
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   216
of an exponential states explosion.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   217
 They also want to support lexing rather than just matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   218
 for tasks that involves text processing.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   219
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   220
 Existing regex libraries either pose restrictions on the user input, or does 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   221
 not give linear running time guarantee.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   222
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   223
 For example, the Rust regex engine claims to be linear, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   224
 but does not support lookarounds and back-references.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   225
 The GoLang regex library does not support over 1000 repetitions.  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   226
 Java and Python both support back-references, but shows
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   227
catastrophic backtracking behaviours on inputs without back-references(
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   228
when the language is still regular).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   229
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   230
 %TODO: verify the fact Rust does not allow 1000+ reps
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   231
 %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   232
 Another thing about the these libraries is that there
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   233
 is no correctness guarantee.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   234
 In some cases they either fails to generate a lexing result when there is a match,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   235
 or gives the wrong way of matching.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   236
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   237
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   238
It turns out that regex libraries not only suffer from 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   239
exponential backtracking problems, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   240
but also undesired (or even buggy) outputs.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   241
%TODO: comment from who
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   242
xxx commented that most regex libraries are not
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   243
correctly implementing the POSIX (maximum-munch)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   244
rule of regular expression matching.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   245
A concrete example would be 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   246
the regex
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   247
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   248
(((((a*a*)b*)b){20})*)c
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   249
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   250
and the string
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   251
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   252
baabaabababaabaaaaaaaaababaa
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   253
aababababaaaabaaabaaaaaabaab
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   254
aabababaababaaaaaaaaababaaaa
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   255
babababaaaaaaaaaaaaac
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   256
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   257
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   258
This seemingly complex regex simply says "some $a$'s
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   259
followed by some $b$'s then followed by 1 single $b$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   260
and this iterates 20 times, finally followed by a $c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   261
And a POSIX match would involve the entire string,"eating up"
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   262
all the $b$'s in it.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   263
%TODO: give a coloured example of how this matches POSIXly
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   264
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   265
This regex would trigger catastrophic backtracking in 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   266
languages like Python and Java,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   267
whereas it gives a correct but uninformative (non-POSIX)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   268
match in languages like Go or .NET--The match with only 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   269
character $c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   270
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   271
Backtracking or depth-first search might give us
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   272
exponential running time, and quite a few tools 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   273
avoid that by determinising the $\mathit{NFA}$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   274
into a $\mathit{DFA}$ and minimizes it.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   275
For example, $\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   276
in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   277
lexers.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   278
However, they do not scale well with bounded repetitions.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   279
Bounded repetitions, usually written in the form
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   280
$r^{\{c\}}$ (where $c$ is a constant natural number),
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   281
denotes a regular expression accepting strings
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   282
that can be divided into $c$ substrings, and each 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   283
substring is in $r$. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   284
%TODO: draw example NFA.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   285
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   286
an $\mathit{NFA}$ describing it would look like:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   287
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   288
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   289
   \node[state,initial] (q_0)   {$q_0$}; 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   290
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   291
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   292
   \node[state,accepting](q_3) [right=of q_2] {$q_3$};
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   293
    \path[->] 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   294
    (q_0) edge  node {a} (q_1)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   295
    	  edge [loop below] node {a,b} ()
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   296
    (q_1) edge  node  {a,b} (q_2)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   297
          edge [loop above] node {0} ()
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   298
    (q_2) edge  node  {a,b} (q_3);
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   299
\end{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   300
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   301
The red states are "counter states" which counts down 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   302
the number of characters needed in addition to the current
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   303
string to make a successful match.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   304
For example, state $q_1$ indicates a match that has
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   305
gone past the $(a|b)$ part of $(a|b)^*a(a|b)^{\{2\}}$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   306
and just consumed the "delimiter" $a$ in the middle, and 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   307
need to match 2 more iterations of $a|b$ to complete.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   308
State $q_2$ on the other hand, can be viewed as a state
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   309
after $q_1$ has consumed 1 character, and just waits
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   310
for 1 more character to complete.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   311
Depending on the actual characters appearing in the 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   312
input string, the states $q_1$ and $q_2$ may or may
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   313
not be active, independent from each other.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   314
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   315
contain at least 4 non-equivalent states that cannot be merged, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   316
because subset states indicating which of $q_1$ and $q_2$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   317
are active are at least four: $\phi$, $\{q_1\}$, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   318
$\{q_2\}$, $\{q_1, q_2\}$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   319
Generalizing this to regular expressions with larger
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   320
bounded repetitions number, we have $r^*ar^{\{n\}}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   321
in general would require at least $2^n$ states
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   322
in a $\mathit{DFA}$. This is to represent all different 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   323
configurations of "countdown" states.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   324
For those regexes, tools such as $\mathit{JFLEX}$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   325
would generate gigantic $\mathit{DFA}$'s or even
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   326
give out memory errors.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   327
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   328
For this reason, regex libraries that support 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   329
bounded repetitions often choose to use the $\mathit{NFA}$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   330
approach.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   331
One can simulate the $\mathit{NFA}$ running in two ways:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   332
one by keeping track of all active states after consuming 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   333
a character, and update that set of states iteratively.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   334
This is a breadth-first-search of the $\mathit{NFA}$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   335
for a path terminating
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   336
at an accepting state.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   337
Languages like $\mathit{GO}$ and $\mathit{RUST}$ use this
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   338
type of $\mathit{NFA}$ simulation, and guarantees a linear runtime
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   339
in terms of input string length.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   340
The other way to use $\mathit{NFA}$ for matching is to take 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   341
a single state in a path each time, and backtrack if that path
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   342
fails. This is a depth-first-search that could end up
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   343
with exponential run time.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   344
The reason behind backtracking algorithms in languages like
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   345
Java and Python is that they support back-references.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   346
\subsection{Back References in Regex--Non-Regular part}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   347
If we label sub-expressions by parenthesizing them and give 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   348
them a number by the order their opening parenthesis appear,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   349
$\underset{1}{(}\ldots\underset{2}{(}\ldots\underset{3}{(}\ldots\underset{4}{(}\ldots)\ldots)\ldots)\ldots)$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   350
We can use the following syntax to denote that we want a string just matched by a 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   351
sub-expression to appear at a certain location again exactly:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   352
$(.*)\backslash 1$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   353
would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   354
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   355
Back-reference is a construct in the "regex" standard
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   356
that programmers found quite useful, but not exactly 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   357
regular any more.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   358
In fact, that allows the regex construct to express 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   359
languages that cannot be contained in context-free
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   360
languages
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   361
For example, the back reference $(a^*)\backslash 1 \backslash 1$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   362
expresses the language $\{a^na^na^n\mid n \in N\}$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   363
which cannot be expressed by context-free grammars.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   364
To express such a language one would need context-sensitive
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   365
grammars, and matching/parsing for such grammars is NP-hard in 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   366
general.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   367
%TODO:cite reference for NP-hardness of CSG matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   368
For such constructs, the most intuitive way of matching is
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   369
using backtracking algorithms, and there are no known algorithms 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   370
that does not backtrack.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   371
%TODO:read a bit more about back reference algorithms
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   372
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   373
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   374
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   375
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   376
\section{Our Solution--Brzozowski Derivatives}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   377
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   378
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   379
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   380
 Is it possible to have a regex lexing algorithm with proven correctness and 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   381
 time complexity, which allows easy extensions to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   382
  constructs like 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   383
 bounded repetitions, negation,  lookarounds, and even back-references? 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   384
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   385
 We propose Brzozowski's derivatives as a solution to this problem.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   386
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   387
The main contribution of this thesis is a proven correct lexing algorithm
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   388
with formalized time bounds.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   389
To our best knowledge, there is no lexing libraries using Brzozowski derivatives
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   390
that have a provable time guarantee, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   391
and claims about running time are usually speculative and backed by thin empirical
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   392
evidence.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   393
%TODO: give references
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   394
For example, Sulzmann and Lu had proposed an algorithm  in which they
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   395
claim a linear running time.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   396
But that was falsified by our experiments and the running time 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   397
is actually $\Omega(2^n)$ in the worst case.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   398
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   399
%TODO: give references
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   400
lexer, which calculates POSIX matches and is based on derivatives.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   401
They formalized the correctness of the lexer, but not the complexity.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   402
In the performance evaluation section, they simply analyzed the run time
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   403
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   404
and concluded that the algorithm is quadratic in terms of input length.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   405
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   406
the time it took to lex only 40 $a$'s was 5 minutes.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   407
We therefore believe our results of a proof of performance on general
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   408
inputs rather than specific examples a novel contribution.\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   409
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   410
 \section{Preliminaries about Lexing Using Brzozowski derivatives}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   411
 In the last fifteen or so years, Brzozowski's derivatives of regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   412
expressions have sparked quite a bit of interest in the functional
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   413
programming and theorem prover communities.  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   414
The beauty of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   415
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   416
expressible in any functional language, and easily definable and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   417
reasoned about in theorem provers---the definitions just consist of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   418
inductive datatypes and simple recursive functions. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   419
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   420
Suppose we have an alphabet $\Sigma$, the strings  whose characters
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   421
are from $\Sigma$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   422
can be expressed as $\Sigma^*$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   423
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   424
We use patterns to define a set of strings concisely. Regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   425
are one of such patterns systems:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   426
The basic regular expressions  are defined inductively
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   427
 by the following grammar:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   428
\[			r ::=   \ZERO \mid  \ONE
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   429
			 \mid  c  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   430
			 \mid  r_1 \cdot r_2
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   431
			 \mid  r_1 + r_2   
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   432
			 \mid r^*         
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   433
\]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   434
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   435
The language or set of strings defined by regular expressions are defined as
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   436
%TODO: FILL in the other defs
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   437
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   438
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   439
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   440
$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   441
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   442
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   443
Which are also called the "language interpretation".
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   444
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   445
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   446
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   447
The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   448
where the operation transforms the regex to a new one containing
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   449
strings without the head character $c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   450
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   451
Formally, we define first such a transformation on any string set, which
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   452
we call semantic derivative:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   453
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   454
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   455
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   456
Mathematically, it can be expressed as the 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   457
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   458
If the $\textit{StringSet}$ happen to have some structure, for example,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   459
if it is regular, then we have that it
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   460
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   461
The the derivative of regular expression, denoted as
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   462
$r \backslash c$, is a function that takes parameters
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   463
$r$ and $c$, and returns another regular expression $r'$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   464
which is computed by the following recursive function:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   465
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   466
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   467
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   468
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   469
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   470
		$d \backslash c$     & $\dn$ & 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   471
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   472
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   473
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   474
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   475
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   476
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   477
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   478
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   479
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   480
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   481
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   482
The $\nullable$ function tests whether the empty string $""$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   483
is in the language of $r$:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   484
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   485
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   486
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   487
		\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   488
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   489
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   490
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   491
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   492
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   493
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   494
		\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   495
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   496
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   497
The empty set does not contain any string and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   498
therefore not the empty string, the empty string 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   499
regular expression contains the empty string
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   500
by definition, the character regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   501
is the singleton that contains character only,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   502
and therefore does not contain the empty string,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   503
the alternative regular expression(or "or" expression)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   504
might have one of its children regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   505
being nullable and any one of its children being nullable
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   506
would suffice. The sequence regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   507
would require both children to have the empty string
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   508
to compose an empty string and the Kleene star
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   509
operation naturally introduced the empty string.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   510
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   511
We can give the meaning of regular expressions derivatives
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   512
by language interpretation:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   513
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   514
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   515
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   516
  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   517
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   518
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   519
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   520
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   521
		$d \backslash c$     & $\dn$ & 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   522
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   523
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   524
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   525
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   526
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   527
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   528
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   529
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   530
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   531
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   532
The function derivative, written $\backslash c$, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   533
defines how a regular expression evolves into
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   534
a new regular expression after all the string it contains
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   535
is chopped off a certain head character $c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   536
The most involved cases are the sequence 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   537
and star case.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   538
The sequence case says that if the first regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   539
contains an empty string then second component of the sequence
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   540
might be chosen as the target regular expression to be chopped
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   541
off its head character.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   542
The star regular expression unwraps the iteration of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   543
regular expression and attack the star regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   544
to its back again to make sure there are 0 or more iterations
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   545
following this unfolded iteration.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   546
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   547
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   548
The main property of the derivative operation
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   549
that enables us to reason about the correctness of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   550
an algorithm using derivatives is 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   551
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   552
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   553
$c\!::\!s \in L(r)$ holds
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   554
if and only if $s \in L(r\backslash c)$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   555
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   556
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   557
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   558
We can generalise the derivative operation shown above for single characters
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   559
to strings as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   560
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   561
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   562
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   563
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   564
$r \backslash [\,] $ & $\dn$ & $r$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   565
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   566
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   567
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   568
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   569
and then define Brzozowski's  regular-expression matching algorithm as:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   570
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   571
\[
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   572
match\;s\;r \;\dn\; nullable(r\backslash s)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   573
\]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   574
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   575
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   576
Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   577
this algorithm presented graphically is as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   578
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   579
\begin{equation}\label{graph:*}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   580
\begin{tikzcd}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   581
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   582
\end{tikzcd}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   583
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   584
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   585
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   586
where we start with  a regular expression  $r_0$, build successive
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   587
derivatives until we exhaust the string and then use \textit{nullable}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   588
to test whether the result can match the empty string. It can  be
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   589
relatively  easily shown that this matcher is correct  (that is given
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   590
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   591
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   592
Beautiful and simple definition.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   593
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   594
If we implement the above algorithm naively, however,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   595
the algorithm can be excruciatingly slow. For example, when starting with the regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   596
expression $(a + aa)^*$ and building 12 successive derivatives
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   597
w.r.t.~the character $a$, one obtains a derivative regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   598
with more than 8000 nodes (when viewed as a tree). Operations like
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   599
$\backslash$ and $\nullable$ need to traverse such trees and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   600
consequently the bigger the size of the derivative the slower the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   601
algorithm. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   602
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   603
Brzozowski was quick in finding that during this process a lot useless
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   604
$\ONE$s and $\ZERO$s are generated and therefore not optimal.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   605
He also introduced some "similarity rules" such
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   606
as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   607
different but language-equivalent sub-regexes to further decrease the size
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   608
of the intermediate regexes. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   609
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   610
More simplifications are possible, such as deleting duplicates
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   611
and opening up nested alternatives to trigger even more simplifications.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   612
And suppose we apply simplification after each derivative step, and compose
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   613
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   614
\textit{simp}(a \backslash c)$. Then we can build
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   615
a matcher without having  cumbersome regular expressions.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   616
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   617
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   618
If we want the size of derivatives in the algorithm to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   619
stay even lower, we would need more aggressive simplifications.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   620
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   621
deleting duplicates whenever possible. For example, the parentheses in
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   622
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   623
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   624
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   625
$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   626
to achieve a very tight size bound, namely,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   627
 the same size bound as that of the \emph{partial derivatives}. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   628
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   629
Building derivatives and then simplify them.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   630
So far so good. But what if we want to 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   631
do lexing instead of just a YES/NO answer?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   632
This requires us to go back again to the world 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   633
without simplification first for a moment.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   634
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   635
elegant(arguably as beautiful as the original
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   636
derivatives definition) solution for this.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   637
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   638
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   639
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   640
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   641
They first defined the datatypes for storing the 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   642
lexing information called a \emph{value} or
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   643
sometimes also \emph{lexical value}.  These values and regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   644
expressions correspond to each other as illustrated in the following
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   645
table:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   646
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   647
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   648
	\begin{tabular}{c@{\hspace{20mm}}c}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   649
		\begin{tabular}{@{}rrl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   650
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   651
			$r$ & $::=$  & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   652
			& $\mid$ & $\ONE$   \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   653
			& $\mid$ & $c$          \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   654
			& $\mid$ & $r_1 \cdot r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   655
			& $\mid$ & $r_1 + r_2$   \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   656
			\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   657
			& $\mid$ & $r^*$         \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   658
		\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   659
		&
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   660
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   661
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   662
			$v$ & $::=$  & \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   663
			&        & $\Empty$   \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   664
			& $\mid$ & $\Char(c)$          \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   665
			& $\mid$ & $\Seq\,v_1\, v_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   666
			& $\mid$ & $\Left(v)$   \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   667
			& $\mid$ & $\Right(v)$  \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   668
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   669
		\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   670
	\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   671
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   672
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   673
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   674
One regular expression can have multiple lexical values. For example
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   675
for the regular expression $(a+b)^*$, it has a infinite list of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   676
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   677
$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   678
$\ldots$, and vice versa.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   679
Even for the regular expression matching a certain string, there could 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   680
still be more than one value corresponding to it.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   681
Take the example where $r= (a^*\cdot a^*)^*$ and the string 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   682
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   683
The number of different ways of matching 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   684
without allowing any value under a star to be flattened
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   685
to an empty string can be given by the following formula:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   686
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   687
	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   688
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   689
and a closed form formula can be calculated to be
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   690
\begin{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   691
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   692
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   693
which is clearly in exponential order.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   694
A lexer aimed at getting all the possible values has an exponential
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   695
worst case runtime. Therefore it is impractical to try to generate
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   696
all possible matches in a run. In practice, we are usually 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   697
interested about POSIX values, which by intuition always
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   698
match the leftmost regular expression when there is a choice
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   699
and always match a sub part as much as possible before proceeding
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   700
to the next token. For example, the above example has the POSIX value
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   701
$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   702
The output of an algorithm we want would be a POSIX matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   703
encoded as a value.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   704
The contribution of Sulzmann and Lu is an extension of Brzozowski's
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   705
algorithm by a second phase (the first phase being building successive
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   706
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   707
is generated in case the regular expression matches  the string. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   708
Pictorially, the Sulzmann and Lu algorithm is as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   709
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   710
\begin{ceqn}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   711
\begin{equation}\label{graph:2}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   712
\begin{tikzcd}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   713
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   714
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   715
\end{tikzcd}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   716
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   717
\end{ceqn}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   718
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   719
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   720
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   721
For convenience, we shall employ the following notations: the regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   722
expression we start with is $r_0$, and the given string $s$ is composed
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   723
of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   724
left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   725
to the characters $c_0$, $c_1$  until we exhaust the string and obtain
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   726
the derivative $r_n$. We test whether this derivative is
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   727
$\textit{nullable}$ or not. If not, we know the string does not match
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   728
$r$ and no value needs to be generated. If yes, we start building the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   729
values incrementally by \emph{injecting} back the characters into the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   730
earlier values $v_n, \ldots, v_0$. This is the second phase of the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   731
algorithm from the right to left. For the first value $v_n$, we call the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   732
function $\textit{mkeps}$, which builds a POSIX lexical value
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   733
for how the empty string has been matched by the (nullable) regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   734
expression $r_n$. This function is defined as
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   735
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   736
	\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   737
		\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   738
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   739
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   740
			& \textit{if} $\nullable(r_{1})$\\ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   741
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   742
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   743
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   744
			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   745
		\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   746
	\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   747
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   748
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   749
\noindent 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   750
After the $\mkeps$-call, we inject back the characters one by one in order to build
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   751
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   752
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   753
After injecting back $n$ characters, we get the lexical value for how $r_0$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   754
matches $s$. The POSIX value is maintained throught out the process.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   755
For this Sulzmann and Lu defined a function that reverses
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   756
the ``chopping off'' of characters during the derivative phase. The
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   757
corresponding function is called \emph{injection}, written
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   758
$\textit{inj}$; it takes three arguments: the first one is a regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   759
expression ${r_{i-1}}$, before the character is chopped off, the second
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   760
is a character ${c_{i-1}}$, the character we want to inject and the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   761
third argument is the value ${v_i}$, into which one wants to inject the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   762
character (it corresponds to the regular expression after the character
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   763
has been chopped off). The result of this function is a new value. The
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   764
definition of $\textit{inj}$ is as follows: 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   765
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   766
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   767
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   768
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   769
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   770
  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   771
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   772
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   773
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   774
  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   775
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   776
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   777
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   778
\noindent This definition is by recursion on the ``shape'' of regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   779
expressions and values. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   780
The clauses basically do one thing--identifying the ``holes'' on 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   781
value to inject the character back into.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   782
For instance, in the last clause for injecting back to a value
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   783
that would turn into a new star value that corresponds to a star,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   784
we know it must be a sequence value. And we know that the first 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   785
value of that sequence corresponds to the child regex of the star
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   786
with the first character being chopped off--an iteration of the star
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   787
that had just been unfolded. This value is followed by the already
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   788
matched star iterations we collected before. So we inject the character 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   789
back to the first value and form a new value with this new iteration
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   790
being added to the previous list of iterations, all under the $Stars$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   791
top level.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   792
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   793
We have mentioned before that derivatives without simplification 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   794
can get clumsy, and this is true for values as well--they reflect
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   795
the regular expressions size by definition.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   796
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   797
One can introduce simplification on the regex and values, but have to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   798
be careful in not breaking the correctness as the injection 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   799
function heavily relies on the structure of the regexes and values
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   800
being correct and match each other.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   801
It can be achieved by recording some extra rectification functions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   802
during the derivatives step, and applying these rectifications in 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   803
each run during the injection phase.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   804
And we can prove that the POSIX value of how
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   805
regular expressions match strings will not be affected---although is much harder
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   806
to establish. Some initial results in this regard have been
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   807
obtained in \cite{AusafDyckhoffUrban2016}. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   808
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   809
%Brzozowski, after giving the derivatives and simplification,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   810
%did not explore lexing with simplification or he may well be 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   811
%stuck on an efficient simplificaiton with a proof.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   812
%He went on to explore the use of derivatives together with 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   813
%automaton, and did not try lexing using derivatives.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   814
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   815
We want to get rid of complex and fragile rectification of values.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   816
Can we not create those intermediate values $v_1,\ldots v_n$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   817
and get the lexing information that should be already there while
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   818
doing derivatives in one pass, without a second phase of injection?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   819
In the meantime, can we make sure that simplifications
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   820
are easily handled without breaking the correctness of the algorithm?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   821
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   822
Sulzmann and Lu solved this problem by
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   823
introducing additional informtaion to the 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   824
regular expressions called \emph{bitcodes}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   825
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   826
\subsection*{Bit-coded Algorithm}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   827
Bits and bitcodes (lists of bits) are defined as:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   828
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   829
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   830
		$b ::=   1 \mid  0 \qquad
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   831
bs ::= [] \mid b::bs    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   832
$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   833
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   834
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   835
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   836
The $1$ and $0$ are not in bold in order to avoid 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   837
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   838
bit-lists) can be used to encode values (or potentially incomplete values) in a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   839
compact form. This can be straightforwardly seen in the following
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   840
coding function from values to bitcodes: 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   841
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   842
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   843
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   844
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   845
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   846
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   847
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   848
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   849
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   850
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   851
                                                 code(\Stars\,vs)$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   852
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   853
\end{center} 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   854
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   855
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   856
Here $\textit{code}$ encodes a value into a bitcodes by converting
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   857
$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   858
star iteration by $1$. The border where a local star terminates
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   859
is marked by $0$. This coding is lossy, as it throws away the information about
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   860
characters, and also does not encode the ``boundary'' between two
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   861
sequence values. Moreover, with only the bitcode we cannot even tell
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   862
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   863
reason for choosing this compact way of storing information is that the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   864
relatively small size of bits can be easily manipulated and ``moved
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   865
around'' in a regular expression. In order to recover values, we will 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   866
need the corresponding regular expression as an extra information. This
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   867
means the decoding function is defined as:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   868
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   869
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   870
%\begin{definition}[Bitdecoding of Values]\mbox{}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   871
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   872
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   873
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   874
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   875
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   876
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   877
       (\Left\,v, bs_1)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   878
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   879
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   880
       (\Right\,v, bs_1)$\\                           
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   881
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   882
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   883
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   884
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   885
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   886
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   887
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   888
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   889
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   890
  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   891
  $\textit{decode}\,bs\,r$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   892
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   893
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   894
       \textit{else}\;\textit{None}$                       
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   895
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   896
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   897
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   898
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   899
Sulzmann and Lu's integrated the bitcodes into regular expressions to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   900
create annotated regular expressions \cite{Sulzmann2014}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   901
\emph{Annotated regular expressions} are defined by the following
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   902
grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   903
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   904
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   905
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   906
  $\textit{a}$ & $::=$  & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   907
                  & $\mid$ & $_{bs}\ONE$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   908
                  & $\mid$ & $_{bs}{\bf c}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   909
                  & $\mid$ & $_{bs}\sum\,as$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   910
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   911
                  & $\mid$ & $_{bs}a^*$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   912
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   913
\end{center}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   914
%(in \textit{ALTS})
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   915
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   916
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   917
where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   918
expressions and $as$ for a list of annotated regular expressions.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   919
The alternative constructor($\sum$) has been generalized to 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   920
accept a list of annotated regular expressions rather than just 2.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   921
We will show that these bitcodes encode information about
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   922
the (POSIX) value that should be generated by the Sulzmann and Lu
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   923
algorithm.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   924
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   925
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   926
To do lexing using annotated regular expressions, we shall first
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   927
transform the usual (un-annotated) regular expressions into annotated
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   928
regular expressions. This operation is called \emph{internalisation} and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   929
defined as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   930
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   931
%\begin{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   932
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   933
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   934
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   935
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   936
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   937
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   938
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   939
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   940
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   941
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   942
  $(r^*)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   943
         $_{[]}(r^\uparrow)^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   944
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   945
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   946
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   947
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   948
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   949
We use up arrows here to indicate that the basic un-annotated regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   950
expressions are ``lifted up'' into something slightly more complex. In the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   951
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   952
attach bits to the front of an annotated regular expression. Its
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   953
definition is as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   954
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   955
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   956
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   957
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   958
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   959
     $_{bs @ bs'}\ONE$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   960
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   961
     $_{bs@bs'}{\bf c}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   962
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   963
     $_{bs@bs'}\sum\textit{as}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   964
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   965
     $_{bs@bs'}a_1 \cdot a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   966
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   967
     $_{bs @ bs'}a^*$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   968
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   969
\end{center}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   970
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   971
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   972
After internalising the regular expression, we perform successive
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   973
derivative operations on the annotated regular expressions. This
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   974
derivative operation is the same as what we had previously for the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   975
basic regular expressions, except that we beed to take care of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   976
the bitcodes:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   977
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   978
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   979
\iffalse
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   980
 %\begin{definition}{bder}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   981
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   982
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   983
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   984
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   985
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   986
        $\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   987
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   988
  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   989
  $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   990
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   991
     $\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   992
					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   993
					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   994
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   995
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   996
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   997
       (\textit{STAR}\,[]\,r)$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   998
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   999
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1000
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1001
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1002
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1003
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1004
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1005
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1006
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1007
        $\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1008
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1009
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1010
  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1011
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1012
     $\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1013
					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1014
					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1015
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1016
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1017
      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1018
       (_{bs}\textit{STAR}\,[]\,r)$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1019
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1020
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1021
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1022
\fi
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1023
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1024
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1025
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1026
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1027
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1028
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1029
        $\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1030
         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1031
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1032
  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1033
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1034
     $\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1035
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1036
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1037
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1038
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1039
      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1040
       (_{[]}r^*))$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1041
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1042
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1043
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1044
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1045
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1046
For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1047
we need to unfold it into a sequence,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1048
and attach an additional bit $0$ to the front of $r \backslash c$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1049
to indicate that there is one more star iteration. Also the sequence clause
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1050
is more subtle---when $a_1$ is $\textit{bnullable}$ (here
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1051
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1052
that it is for annotated regular expressions, therefore we omit the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1053
definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1054
$a_1$ matches the string prior to character $c$ (more on this later),
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1055
then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1056
\backslash c)$ will collapse the regular expression $a_1$(as it has
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1057
already been fully matched) and store the parsing information at the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1058
head of the regular expression $a_2 \backslash c$ by fusing to it. The
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1059
bitsequence $\textit{bs}$, which was initially attached to the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1060
first element of the sequence $a_1 \cdot a_2$, has
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1061
now been elevated to the top-level of $\sum$, as this information will be
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1062
needed whichever way the sequence is matched---no matter whether $c$ belongs
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1063
to $a_1$ or $ a_2$. After building these derivatives and maintaining all
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1064
the lexing information, we complete the lexing by collecting the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1065
bitcodes using a generalised version of the $\textit{mkeps}$ function
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1066
for annotated regular expressions, called $\textit{bmkeps}$:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1067
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1068
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1069
%\begin{definition}[\textit{bmkeps}]\mbox{}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1070
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1071
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1072
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1073
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1074
     $\textit{if}\;\textit{bnullable}\,a$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1075
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1076
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1077
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1078
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1079
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1080
     $bs \,@\, [0]$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1081
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1082
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1083
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1084
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1085
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1086
This function completes the value information by travelling along the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1087
path of the regular expression that corresponds to a POSIX value and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1088
collecting all the bitcodes, and using $S$ to indicate the end of star
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1089
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1090
decode them, we get the value we expect. The corresponding lexing
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1091
algorithm looks as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1092
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1093
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1094
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1095
  $\textit{blexer}\;r\,s$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1096
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1097
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1098
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1099
  & & $\;\;\textit{else}\;\textit{None}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1100
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1101
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1102
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1103
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1104
In this definition $\_\backslash s$ is the  generalisation  of the derivative
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1105
operation from characters to strings (just like the derivatives for un-annotated
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1106
regular expressions).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1107
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1108
Remember tha one of the important reasons we introduced bitcodes
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1109
is that they can make simplification more structured and therefore guaranteeing
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1110
the correctness.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1111
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1112
\subsection*{Our Simplification Rules}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1113
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1114
In this section we introduce aggressive (in terms of size) simplification rules
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1115
on annotated regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1116
in order to keep derivatives small. Such simplifications are promising
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1117
as we have
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1118
generated test data that show
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1119
that a good tight bound can be achieved. Obviously we could only
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1120
partially cover  the search space as there are infinitely many regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1121
expressions and strings. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1122
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1123
One modification we introduced is to allow a list of annotated regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1124
expressions in the $\sum$ constructor. This allows us to not just
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1125
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1126
also unnecessary ``copies'' of regular expressions (very similar to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1127
simplifying $r + r$ to just $r$, but in a more general setting). Another
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1128
modification is that we use simplification rules inspired by Antimirov's
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1129
work on partial derivatives. They maintain the idea that only the first
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1130
``copy'' of a regular expression in an alternative contributes to the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1131
calculation of a POSIX value. All subsequent copies can be pruned away from
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1132
the regular expression. A recursive definition of our  simplification function 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1133
that looks somewhat similar to our Scala code is given below:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1134
%\comment{Use $\ZERO$, $\ONE$ and so on. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1135
%Is it $ALTS$ or $ALTS$?}\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1136
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1137
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1138
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1139
   
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1140
  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1141
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1142
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1143
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1144
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1145
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1146
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1147
  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1148
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1149
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1150
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1151
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1152
   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1153
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1154
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1155
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1156
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1157
The simplification does a pattern matching on the regular expression.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1158
When it detected that the regular expression is an alternative or
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1159
sequence, it will try to simplify its children regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1160
recursively and then see if one of the children turn into $\ZERO$ or
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1161
$\ONE$, which might trigger further simplification at the current level.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1162
The most involved part is the $\sum$ clause, where we use two
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1163
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1164
alternatives and reduce as many duplicates as possible. Function
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1165
$\textit{distinct}$  keeps the first occurring copy only and remove all later ones
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1166
when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1167
Its recursive definition is given below:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1168
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1169
 \begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1170
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1171
  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1172
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1173
  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1174
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1175
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1176
\end{center}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1177
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1178
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1179
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1180
function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1181
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1182
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1183
Having defined the $\simp$ function,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1184
we can use the previous notation of  natural
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1185
extension from derivative w.r.t.~character to derivative
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1186
w.r.t.~string:%\comment{simp in  the [] case?}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1187
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1188
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1189
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1190
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1191
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1192
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1193
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1194
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1195
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1196
to obtain an optimised version of the algorithm:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1197
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1198
 \begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1199
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1200
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1201
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1202
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1203
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1204
  & & $\;\;\textit{else}\;\textit{None}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1205
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1206
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1207
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1208
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1209
This algorithm keeps the regular expression size small, for example,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1210
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1211
will be reduced to just 6 and stays constant, no matter how long the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1212
input string is.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1213
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1214
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1215
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1216
Derivatives give a simple solution
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1217
to the problem of matching a string $s$ with a regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1218
expression $r$: if the derivative of $r$ w.r.t.\ (in
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1219
succession) all the characters of the string matches the empty string,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1220
then $r$ matches $s$ (and {\em vice versa}).  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1221
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1222
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1223
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1224
However, there are two difficulties with derivative-based matchers:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1225
First, Brzozowski's original matcher only generates a yes/no answer
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1226
for whether a regular expression matches a string or not.  This is too
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1227
little information in the context of lexing where separate tokens must
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1228
be identified and also classified (for example as keywords
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1229
or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1230
difficulty by cleverly extending Brzozowski's matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1231
algorithm. Their extended version generates additional information on
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1232
\emph{how} a regular expression matches a string following the POSIX
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1233
rules for regular expression matching. They achieve this by adding a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1234
second ``phase'' to Brzozowski's algorithm involving an injection
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1235
function.  In our own earlier work we provided the formal
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1236
specification of what POSIX matching means and proved in Isabelle/HOL
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1237
the correctness
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1238
of Sulzmann and Lu's extended algorithm accordingly
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1239
\cite{AusafDyckhoffUrban2016}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1240
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1241
The second difficulty is that Brzozowski's derivatives can 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1242
grow to arbitrarily big sizes. For example if we start with the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1243
regular expression $(a+aa)^*$ and take
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1244
successive derivatives according to the character $a$, we end up with
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1245
a sequence of ever-growing derivatives like 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1246
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1247
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1248
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1249
\begin{tabular}{rll}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1250
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1251
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1252
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1253
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1254
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1255
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1256
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1257
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1258
\noindent where after around 35 steps we run out of memory on a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1259
typical computer (we shall define shortly the precise details of our
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1260
regular expressions and the derivative operation).  Clearly, the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1261
notation involving $\ZERO$s and $\ONE$s already suggests
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1262
simplification rules that can be applied to regular regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1263
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1264
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1265
r$. While such simple-minded simplifications have been proved in our
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1266
earlier work to preserve the correctness of Sulzmann and Lu's
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1267
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1268
\emph{not} help with limiting the growth of the derivatives shown
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1269
above: the growth is slowed, but the derivatives can still grow rather
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1270
quickly beyond any finite bound.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1271
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1272
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1273
Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1274
\cite{Sulzmann2014} where they introduce bitcoded
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1275
regular expressions. In this version, POSIX values are
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1276
represented as bitsequences and such sequences are incrementally generated
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1277
when derivatives are calculated. The compact representation
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1278
of bitsequences and regular expressions allows them to define a more
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1279
``aggressive'' simplification method that keeps the size of the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1280
derivatives finite no matter what the length of the string is.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1281
They make some informal claims about the correctness and linear behaviour
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1282
of this version, but do not provide any supporting proof arguments, not
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1283
even ``pencil-and-paper'' arguments. They write about their bitcoded
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1284
\emph{incremental parsing method} (that is the algorithm to be formalised
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1285
in this paper):
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1286
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1287
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1288
\begin{quote}\it
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1289
  ``Correctness Claim: We further claim that the incremental parsing
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1290
  method [..] in combination with the simplification steps [..]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1291
  yields POSIX parse trees. We have tested this claim
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1292
  extensively [..] but yet
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1293
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1294
\end{quote}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1295
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1296
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1297
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1298
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1299
\section{Backgound}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1300
%Regular expression matching and lexing has been 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1301
% widely-used and well-implemented
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1302
%in software industry. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1303
%TODO: expand the above into a full paragraph
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1304
%TODO: look up snort rules to use here--give readers idea of what regexes look like
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1305
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1306
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1307
Theoretical results say that regular expression matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1308
should be linear with respect to the input.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1309
Under a certain class of regular expressions and inputs though,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1310
practical implementations  suffer from non-linear or even 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1311
exponential running time,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1312
allowing a ReDoS (regular expression denial-of-service ) attack.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1313
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1314
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1315
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1316
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1317
\section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1318
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1319
The reason behind is that regex libraries in popular language engines
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1320
 often want to support richer constructs
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1321
than  the most basic regular expressions, and lexing rather than matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1322
is needed for sub-match extraction.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1323
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1324
There is also static analysis work on regular expression that
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1325
have potential expoential behavious. Rathnayake and Thielecke 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1326
\parencite{Rathnayake2014StaticAF} proposed an algorithm
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1327
that detects regular expressions triggering exponential
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1328
behavious on backtracking matchers.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1329
People also developed static analysis methods for
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1330
generating non-linear polynomial worst-time estimates
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1331
for regexes, attack string that exploit the worst-time 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1332
scenario, and "attack automata" that generates
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1333
attack strings.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1334
For a comprehensive analysis, please refer to Weideman's thesis 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1335
\parencite{Weideman2017Static}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1336
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1337
\subsection{DFA Approach}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1338
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1339
Exponential states.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1340
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1341
\subsection{NFA Approach}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1342
Backtracking.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1343
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1344
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1345
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1346
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1347
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1348
\section{Our Approach}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1349
In the last fifteen or so years, Brzozowski's derivatives of regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1350
expressions have sparked quite a bit of interest in the functional
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1351
programming and theorem prover communities.  The beauty of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1352
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1353
expressible in any functional language, and easily definable and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1354
reasoned about in theorem provers---the definitions just consist of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1355
inductive datatypes and simple recursive functions.  Derivatives of a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1356
regular expression, written $r \backslash c$, give a simple solution
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1357
to the problem of matching a string $s$ with a regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1358
expression $r$: if the derivative of $r$ w.r.t.\ (in
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1359
succession) all the characters of the string matches the empty string,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1360
then $r$ matches $s$ (and {\em vice versa}).  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1361
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1362
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1363
This work aims to address the above vulnerability by the combination
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1364
of Brzozowski's derivatives and interactive theorem proving. We give an 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1365
improved version of  Sulzmann and Lu's bit-coded algorithm using 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1366
derivatives, which come with a formal guarantee in terms of correctness and 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1367
running time as an Isabelle/HOL proof.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1368
Then we improve the algorithm with an even stronger version of 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1369
simplification, and prove a time bound linear to input and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1370
cubic to regular expression size using a technique by
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1371
Antimirov.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1372
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1373
\subsection{Existing Work}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1374
We are aware
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1375
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1376
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1377
of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1378
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1379
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1380
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1381
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1382
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1383
\section{What this Template Includes}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1384
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1385
\subsection{Folders}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1386
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1387
This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1388
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1389
\keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1390
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1391
\keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. Each chapter should go in its own separate \file{.tex} file and they can be split as:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1392
\begin{itemize}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1393
\item Chapter 1: Introduction to the thesis topic
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1394
\item Chapter 2: Background information and theory
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1395
\item Chapter 3: (Laboratory) experimental setup
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1396
\item Chapter 4: Details of experiment 1
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1397
\item Chapter 5: Details of experiment 2
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1398
\item Chapter 6: Discussion of the experimental results
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1399
\item Chapter 7: Conclusion and future directions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1400
\end{itemize}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1401
This chapter layout is specialised for the experimental sciences, your discipline may be different.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1402
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1403
\keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1404
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1405
\subsection{Files}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1406
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1407
Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1408
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1409
\keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1410
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1411
\keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1412
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1413
\keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1414
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1415
\keyword{main.tex} -- this is an important file. This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1416
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1417
Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1418
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1419
\keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1420
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1421
\keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1422
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1423
\keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1424
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1425
\keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1426
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1427
\keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1428
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1429
\keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1430
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1431
\keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1432
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1433
So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1434
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1435
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1436
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1437
\section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1438
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1439
You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1440
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1441
Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1442
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1443
Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb|\href{url}{name}| and only leave the name.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1444
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1445
When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1446
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1447
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1448
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1449
\section{The \code{main.tex} File Explained}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1450
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1451
The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1452
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1453
Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1454
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1455
Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1456
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1457
Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1458
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1459
Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1460
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1461
The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1462
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1463
The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1464
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1465
The next page contains a one line dedication. Who will you dedicate your thesis to?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1466
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1467
Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1469
After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1470
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1471
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1472
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1473
\section{Thesis Features and Conventions}\label{ThesisConventions}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1474
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1475
To get the best out of this template, there are a few conventions that you may want to follow.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1476
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1477
One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1478
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1479
\subsection{Printing Format}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1480
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1481
This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1482
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1483
The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1484
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1485
The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1486
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1487
\subsection{Using US Letter Paper}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1488
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1489
The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1490
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1491
Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1492
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1493
\subsection{References}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1494
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1495
The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. Many reference managers allow you to simply drag the reference into the document as you type.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1496
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1497
Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1498
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1499
The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1500
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1501
\subsubsection{A Note on bibtex}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1502
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1503
The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1504
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1505
\subsection{Tables}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1506
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1507
Tables are an important way of displaying your results, below is an example table which was generated with this code:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1508
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1509
{\small
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1510
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1511
\begin{table}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1512
\caption{The effects of treatments X and Y on the four groups studied.}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1513
\label{tab:treatments}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1514
\centering
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1515
\begin{tabular}{l l l}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1516
\toprule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1517
\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1518
\midrule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1519
1 & 0.2 & 0.8\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1520
2 & 0.17 & 0.7\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1521
3 & 0.24 & 0.75\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1522
4 & 0.68 & 0.3\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1523
\bottomrule\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1524
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1525
\end{table}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1526
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1527
}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1528
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1529
\begin{table}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1530
\caption{The effects of treatments X and Y on the four groups studied.}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1531
\label{tab:treatments}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1532
\centering
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1533
\begin{tabular}{l l l}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1534
\toprule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1535
\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1536
\midrule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1537
1 & 0.2 & 0.8\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1538
2 & 0.17 & 0.7\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1539
3 & 0.24 & 0.75\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1540
4 & 0.68 & 0.3\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1541
\bottomrule\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1542
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1543
\end{table}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1544
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1545
You can reference tables with \verb|\ref{<label>}| where the label is defined within the table environment. See \file{Chapter1.tex} for an example of the label and citation (e.g. Table~\ref{tab:treatments}).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1546
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1547
\subsection{Figures}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1548
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1549
There will hopefully be many figures in your thesis (that should be placed in the \emph{Figures} folder). The way to insert figures into your thesis is to use a code template like this:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1550
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1551
\begin{figure}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1552
\centering
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1553
\includegraphics{Figures/Electron}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1554
\decoRule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1555
\caption[An Electron]{An electron (artist's impression).}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1556
\label{fig:Electron}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1557
\end{figure}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1558
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1559
Also look in the source file. Putting this code into the source file produces the picture of the electron that you can see in the figure below.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1560
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1561
\begin{figure}[th]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1562
\centering
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1563
\includegraphics{Figures/Electron}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1564
\decoRule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1565
\caption[An Electron]{An electron (artist's impression).}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1566
\label{fig:Electron}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1567
\end{figure}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1568
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1569
Sometimes figures don't always appear where you write them in the source. The placement depends on how much space there is on the page for the figure. Sometimes there is not enough room to fit a figure directly where it should go (in relation to the text) and so \LaTeX{} puts it at the top of the next page. Positioning figures is the job of \LaTeX{} and so you should only worry about making them look good!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1570
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1571
Figures usually should have captions just in case you need to refer to them (such as in Figure~\ref{fig:Electron}). The \verb|\caption| command contains two parts, the first part, inside the square brackets is the title that will appear in the \emph{List of Figures}, and so should be short. The second part in the curly brackets should contain the longer and more descriptive caption text.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1572
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1573
The \verb|\decoRule| command is optional and simply puts an aesthetic horizontal line below the image. If you do this for one image, do it for all of them.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1574
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1575
\LaTeX{} is capable of using images in pdf, jpg and png format.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1576
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1577
\subsection{Typesetting mathematics}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1578
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1579
If your thesis is going to contain heavy mathematical content, be sure that \LaTeX{} will make it look beautiful, even though it won't be able to solve the equations for you.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1580
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1581
The \enquote{Not So Short Introduction to \LaTeX} (available on \href{http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf}{CTAN}) should tell you everything you need to know for most cases of typesetting mathematics. If you need more information, a much more thorough mathematical guide is available from the AMS called, \enquote{A Short Math Guide to \LaTeX} and can be downloaded from:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1582
\url{ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1583
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1584
There are many different \LaTeX{} symbols to remember, luckily you can find the most common symbols in \href{http://ctan.org/pkg/comprehensive}{The Comprehensive \LaTeX~Symbol List}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1585
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1586
You can write an equation, which is automatically given an equation number by \LaTeX{} like this:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1587
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1588
\begin{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1589
E = mc^{2}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1590
\label{eqn:Einstein}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1591
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1592
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1593
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1594
This will produce Einstein's famous energy-matter equivalence equation:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1595
\begin{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1596
E = mc^{2}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1597
\label{eqn:Einstein}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1598
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1599
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1600
All equations you write (which are not in the middle of paragraph text) are automatically given equation numbers by \LaTeX{}. If you don't want a particular equation numbered, use the unnumbered form:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1601
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1602
\[ a^{2}=4 \]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1603
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1604
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1605
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1606
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1607
\section{Sectioning and Subsectioning}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1608
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1609
You should break your thesis up into nice, bite-sized sections and subsections. \LaTeX{} automatically builds a table of Contents by looking at all the \verb|\chapter{}|, \verb|\section{}|  and \verb|\subsection{}| commands you write in the source.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1610
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1611
The Table of Contents should only list the sections to three (3) levels. A \verb|chapter{}| is level zero (0). A \verb|\section{}| is level one (1) and so a \verb|\subsection{}| is level two (2). In your thesis it is likely that you will even use a \verb|subsubsection{}|, which is level three (3). The depth to which the Table of Contents is formatted is set within \file{MastersDoctoralThesis.cls}. If you need this changed, you can do it in \file{main.tex}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1612
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1613
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1614
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1615
\section{In Closing}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1616
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1617
You have reached the end of this mini-guide. You can now rename or overwrite this pdf file and begin writing your own \file{Chapter1.tex} and the rest of your thesis. The easy work of setting up the structure and framework has been taken care of for you. It's now your job to fill it out!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1618
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1619
Good luck and have lots of fun!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1620
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1621
\begin{flushright}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1622
Guide written by ---\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1623
Sunil Patel: \href{http://www.sunilpatel.co.uk}{www.sunilpatel.co.uk}\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1624
Vel: \href{http://www.LaTeXTemplates.com}{LaTeXTemplates.com}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1625
\end{flushright}