ChengsongTanPhdThesis/Chapters/Chapter1.tex
author Chengsong
Mon, 28 Mar 2022 00:59:42 +0100
changeset 471 23818853a710
parent 469 e5dd8cc0aa82
child 472 6953d2786e7c
permissions -rwxr-xr-x
thesis
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     1
% Chapter 1
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     2
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     3
\chapter{Introduction} % Main chapter title
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     4
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     5
\label{Chapter1} % For referencing the chapter elsewhere, use \ref{Chapter1} 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     6
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     7
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     8
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
     9
% Define some commands to keep the formatting separated from the content 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    10
\newcommand{\keyword}[1]{\textbf{#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    11
\newcommand{\tabhead}[1]{\textbf{#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    12
\newcommand{\code}[1]{\texttt{#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    13
\newcommand{\file}[1]{\texttt{\bfseries#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    14
\newcommand{\option}[1]{\texttt{\itshape#1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    15
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    16
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    17
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    18
\newcommand{\ZERO}{\mbox{\bf 0}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    19
\newcommand{\ONE}{\mbox{\bf 1}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    20
\def\lexer{\mathit{lexer}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    21
\def\mkeps{\mathit{mkeps}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    22
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    23
\def\DFA{\textit{DFA}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    24
\def\bmkeps{\textit{bmkeps}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    25
\def\retrieve{\textit{retrieve}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    26
\def\blexer{\textit{blexer}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    27
\def\flex{\textit{flex}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    28
\def\inj{\mathit{inj}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    29
\def\Empty{\mathit{Empty}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    30
\def\Left{\mathit{Left}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    31
\def\Right{\mathit{Right}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    32
\def\Stars{\mathit{Stars}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    33
\def\Char{\mathit{Char}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    34
\def\Seq{\mathit{Seq}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    35
\def\Der{\mathit{Der}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    36
\def\nullable{\mathit{nullable}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    37
\def\Z{\mathit{Z}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    38
\def\S{\mathit{S}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    39
\def\rup{r^\uparrow}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    40
\def\simp{\mathit{simp}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    41
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    42
\def\map{\mathit{map}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    43
\def\distinct{\mathit{distinct}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    44
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    45
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    46
%This part is about regular expressions, Brzozowski derivatives,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    47
%and a bit-coded lexing algorithm with proven correctness and time bounds.
471
Chengsong
parents: 469
diff changeset
    48
Regular expressions are widely used in computer science: 
Chengsong
parents: 469
diff changeset
    49
be it in IDEs with syntax hightlighting and auto completion, 
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    50
command line tools like $\mathit{grep}$ that facilitates easy 
471
Chengsong
parents: 469
diff changeset
    51
text processing , network intrusion
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    52
detection systems that rejects suspicious traffic, or compiler
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    53
front ends--there is always an important phase of the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    54
task that involves matching a regular 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    55
exression with a string.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    56
Given its usefulness and ubiquity, one would imagine that
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    57
modern regular expression matching implementations
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    58
are mature and fully-studied.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    59
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    60
If you go to a popular programming language's
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    61
regex engine,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    62
you can supply it with regex and strings of your own,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    63
and get matching/lexing  information such as how a 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    64
sub-part of the regex matches a sub-part of the string.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    65
These lexing libraries are on average quite fast.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    66
%TODO: get source for SNORT/BRO's regex matching engine/speed
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    67
For example, the regex engines some network intrusion detection
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    68
systems use are able to process
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    69
 megabytes or even gigabytes of network traffic per second.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    70
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    71
Why do we need to have our version, if the algorithms are
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    72
blindingly fast already? Well it turns out it is not always the case.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    73
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    74
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    75
Take $(a^*)^*\,b$ and ask whether
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    76
strings of the form $aa..a$ match this regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    77
expression. Obviously this is not the case---the expected $b$ in the last
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    78
position is missing. One would expect that modern regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    79
matching engines can find this out very quickly. Alas, if one tries
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    80
this example in JavaScript, Python or Java 8 with strings like 28
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    81
$a$'s, one discovers that this decision takes around 30 seconds and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    82
takes considerably longer when adding a few more $a$'s, as the graphs
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    83
below show:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    84
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    85
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    86
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    87
\begin{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    88
\begin{axis}[
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    89
    xlabel={$n$},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    90
    x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    91
    ylabel={time in secs},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    92
    enlargelimits=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    93
    xtick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    94
    xmax=33,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    95
    ymax=35,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    96
    ytick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    97
    scaled ticks=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    98
    axis lines=left,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
    99
    width=5cm,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   100
    height=4cm, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   101
    legend entries={JavaScript},  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   102
    legend pos=north west,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   103
    legend cell align=left]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   104
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   105
\end{axis}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   106
\end{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   107
  &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   108
\begin{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   109
\begin{axis}[
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   110
    xlabel={$n$},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   111
    x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   112
    %ylabel={time in secs},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   113
    enlargelimits=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   114
    xtick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   115
    xmax=33,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   116
    ymax=35,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   117
    ytick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   118
    scaled ticks=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   119
    axis lines=left,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   120
    width=5cm,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   121
    height=4cm, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   122
    legend entries={Python},  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   123
    legend pos=north west,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   124
    legend cell align=left]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   125
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   126
\end{axis}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   127
\end{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   128
  &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   129
\begin{tikzpicture}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   130
\begin{axis}[
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   131
    xlabel={$n$},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   132
    x label style={at={(1.05,-0.05)}},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   133
    %ylabel={time in secs},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   134
    enlargelimits=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   135
    xtick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   136
    xmax=33,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   137
    ymax=35,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   138
    ytick={0,5,...,30},
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   139
    scaled ticks=false,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   140
    axis lines=left,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   141
    width=5cm,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   142
    height=4cm, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   143
    legend entries={Java 8},  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   144
    legend pos=north west,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   145
    legend cell align=left]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   146
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   147
\end{axis}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   148
\end{tikzpicture}\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   149
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   150
           of the form $\underbrace{aa..a}_{n}$.}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   151
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   152
\end{center}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   153
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   154
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   155
This is clearly exponential behaviour, and 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   156
is triggered by some relatively simple regex patterns.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   157
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   158
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   159
This superlinear blowup in matching algorithms sometimes cause
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   160
considerable grief in real life: for example on 20 July 2016 one evil
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   161
regular expression brought the webpage
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   162
\href{http://stackexchange.com}{Stack Exchange} to its
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   163
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   164
In this instance, a regular expression intended to just trim white
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   165
spaces from the beginning and the end of a line actually consumed
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   166
massive amounts of CPU-resources---causing web servers to grind to a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   167
halt. This happened when a post with 20,000 white spaces was submitted,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   168
but importantly the white spaces were neither at the beginning nor at
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   169
the end. As a result, the regular expression matching engine needed to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   170
backtrack over many choices. In this example, the time needed to process
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   171
the string was $O(n^2)$ with respect to the string length. This
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   172
quadratic overhead was enough for the homepage of Stack Exchange to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   173
respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   174
attack and therefore stopped the servers from responding to any
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   175
requests. This made the whole site become unavailable. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   176
A more 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   177
recent example is a global outage of all Cloudflare servers on 2 July
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   178
2019. A poorly written regular expression exhibited exponential
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   179
behaviour and exhausted CPUs that serve HTTP traffic. Although the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   180
outage had several causes, at the heart was a regular expression that
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   181
was used to monitor network
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   182
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   183
%TODO: data points for some new versions of languages
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   184
These problems with regular expressions 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   185
are not isolated events that happen
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   186
very occasionally, but actually quite widespread.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   187
They occur so often that they get a 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   188
name--Regular-Expression-Denial-Of-Service (ReDoS)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   189
attack.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   190
Davis et al. \parencite{Davis18} detected more
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   191
than 1000 super-linear (SL) regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   192
in Node.js, Python core libraries, and npm and pypi. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   193
They therefore concluded that evil regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   194
are problems more than "a parlour trick", but one that
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   195
requires
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   196
more research attention.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   197
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   198
 \section{Why are current algorithm for regexes slow?}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   199
471
Chengsong
parents: 469
diff changeset
   200
%find literature/find out for yourself that REGEX->DFA on basic regexes
Chengsong
parents: 469
diff changeset
   201
%does not blow up the size
Chengsong
parents: 469
diff changeset
   202
Shouldn't regular expression matching be linear?
Chengsong
parents: 469
diff changeset
   203
How can one explain the super-linear behaviour of the 
Chengsong
parents: 469
diff changeset
   204
regex matching engines we have?
Chengsong
parents: 469
diff changeset
   205
The time cost of regex matching algorithms in general
Chengsong
parents: 469
diff changeset
   206
involve two phases: 
Chengsong
parents: 469
diff changeset
   207
the construction phase, in which the algorithm builds some  
Chengsong
parents: 469
diff changeset
   208
suitable data structure from the input regex $r$, we denote
Chengsong
parents: 469
diff changeset
   209
the time cost by $P_1(r)$.
Chengsong
parents: 469
diff changeset
   210
The lexing
Chengsong
parents: 469
diff changeset
   211
phase, when the input string $s$ is read and the data structure
Chengsong
parents: 469
diff changeset
   212
representing that regex $r$ is being operated on. We represent the time
Chengsong
parents: 469
diff changeset
   213
it takes by $P_2(r, s)$.\\
Chengsong
parents: 469
diff changeset
   214
In the case of a $\mathit{DFA}$,
Chengsong
parents: 469
diff changeset
   215
we have $P_2(r, s) = O( |s| )$,
Chengsong
parents: 469
diff changeset
   216
because we take at most $|s|$ steps, 
Chengsong
parents: 469
diff changeset
   217
and each step takes
Chengsong
parents: 469
diff changeset
   218
at most one transition--
Chengsong
parents: 469
diff changeset
   219
a deterministic-finite-automata
Chengsong
parents: 469
diff changeset
   220
by definition has at most one state active and at most one
Chengsong
parents: 469
diff changeset
   221
transition upon receiving an input symbol.
Chengsong
parents: 469
diff changeset
   222
But unfortunately in the  worst case
Chengsong
parents: 469
diff changeset
   223
$P_1(r) = O(exp^{|r|})$. An example will be given later. \\
Chengsong
parents: 469
diff changeset
   224
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
Chengsong
parents: 469
diff changeset
   225
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
Chengsong
parents: 469
diff changeset
   226
The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
Chengsong
parents: 469
diff changeset
   227
On the other hand, if backtracking is used, the worst-case time bound bloats
Chengsong
parents: 469
diff changeset
   228
to $|r| * 2^|s|$ .
Chengsong
parents: 469
diff changeset
   229
%on the input
Chengsong
parents: 469
diff changeset
   230
%And when calculating the time complexity of the matching algorithm,
Chengsong
parents: 469
diff changeset
   231
%we are assuming that each input reading step requires constant time.
Chengsong
parents: 469
diff changeset
   232
%which translates to that the number of 
Chengsong
parents: 469
diff changeset
   233
%states active and transitions taken each time is bounded by a
Chengsong
parents: 469
diff changeset
   234
%constant $C$.
Chengsong
parents: 469
diff changeset
   235
%But modern  regex libraries in popular language engines
Chengsong
parents: 469
diff changeset
   236
% often want to support much richer constructs than just
Chengsong
parents: 469
diff changeset
   237
% sequences and Kleene stars,
Chengsong
parents: 469
diff changeset
   238
%such as negation, intersection, 
Chengsong
parents: 469
diff changeset
   239
%bounded repetitions and back-references.
Chengsong
parents: 469
diff changeset
   240
%And de-sugaring these "extended" regular expressions 
Chengsong
parents: 469
diff changeset
   241
%into basic ones might bloat the size exponentially.
Chengsong
parents: 469
diff changeset
   242
%TODO: more reference for exponential size blowup on desugaring. 
Chengsong
parents: 469
diff changeset
   243
\subsection{Tools that uses $\mathit{DFA}$s}
Chengsong
parents: 469
diff changeset
   244
%TODO:more tools that use DFAs?
Chengsong
parents: 469
diff changeset
   245
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
Chengsong
parents: 469
diff changeset
   246
in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based
Chengsong
parents: 469
diff changeset
   247
lexers. The user provides a set of regular expressions
Chengsong
parents: 469
diff changeset
   248
and configurations to such lexer generators, and then 
Chengsong
parents: 469
diff changeset
   249
gets an output program encoding a minimized $\mathit{DFA}$
Chengsong
parents: 469
diff changeset
   250
that can be compiled and run. 
Chengsong
parents: 469
diff changeset
   251
The good things about $\mathit{DFA}$s is that once
Chengsong
parents: 469
diff changeset
   252
generated, they are fast and stable, unlike
Chengsong
parents: 469
diff changeset
   253
backtracking algorithms. 
Chengsong
parents: 469
diff changeset
   254
However, they do not scale well with bounded repetitions.\\
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   255
471
Chengsong
parents: 469
diff changeset
   256
Bounded repetitions, usually written in the form
Chengsong
parents: 469
diff changeset
   257
$r^{\{c\}}$ (where $c$ is a constant natural number),
Chengsong
parents: 469
diff changeset
   258
denotes a regular expression accepting strings
Chengsong
parents: 469
diff changeset
   259
that can be divided into $c$ substrings, where each 
Chengsong
parents: 469
diff changeset
   260
substring is in $r$. 
Chengsong
parents: 469
diff changeset
   261
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
Chengsong
parents: 469
diff changeset
   262
an $\mathit{NFA}$ describing it would look like:
Chengsong
parents: 469
diff changeset
   263
\begin{center}
Chengsong
parents: 469
diff changeset
   264
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
Chengsong
parents: 469
diff changeset
   265
   \node[state,initial] (q_0)   {$q_0$}; 
Chengsong
parents: 469
diff changeset
   266
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
Chengsong
parents: 469
diff changeset
   267
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
Chengsong
parents: 469
diff changeset
   268
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
Chengsong
parents: 469
diff changeset
   269
    \path[->] 
Chengsong
parents: 469
diff changeset
   270
    (q_0) edge  node {a} (q_1)
Chengsong
parents: 469
diff changeset
   271
    	  edge [loop below] node {a,b} ()
Chengsong
parents: 469
diff changeset
   272
    (q_1) edge  node  {a,b} (q_2)
Chengsong
parents: 469
diff changeset
   273
    (q_2) edge  node  {a,b} (q_3);
Chengsong
parents: 469
diff changeset
   274
\end{tikzpicture}
Chengsong
parents: 469
diff changeset
   275
\end{center}
Chengsong
parents: 469
diff changeset
   276
The red states are "countdown states" which counts down 
Chengsong
parents: 469
diff changeset
   277
the number of characters needed in addition to the current
Chengsong
parents: 469
diff changeset
   278
string to make a successful match.
Chengsong
parents: 469
diff changeset
   279
For example, state $q_1$ indicates a match that has
Chengsong
parents: 469
diff changeset
   280
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
Chengsong
parents: 469
diff changeset
   281
and just consumed the "delimiter" $a$ in the middle, and 
Chengsong
parents: 469
diff changeset
   282
need to match 2 more iterations of $(a|b)$ to complete.
Chengsong
parents: 469
diff changeset
   283
State $q_2$ on the other hand, can be viewed as a state
Chengsong
parents: 469
diff changeset
   284
after $q_1$ has consumed 1 character, and just waits
Chengsong
parents: 469
diff changeset
   285
for 1 more character to complete.
Chengsong
parents: 469
diff changeset
   286
$q_3$ is the last state, requiring 0 more character and is accepting.
Chengsong
parents: 469
diff changeset
   287
Depending on the suffix of the
Chengsong
parents: 469
diff changeset
   288
input string up to the current read location,
Chengsong
parents: 469
diff changeset
   289
the states $q_1$ and $q_2$, $q_3$
Chengsong
parents: 469
diff changeset
   290
may or may
Chengsong
parents: 469
diff changeset
   291
not be active, independent from each other.
Chengsong
parents: 469
diff changeset
   292
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
Chengsong
parents: 469
diff changeset
   293
contain at least $2^3$ non-equivalent states that cannot be merged, 
Chengsong
parents: 469
diff changeset
   294
because the subset construction during determinisation will generate
Chengsong
parents: 469
diff changeset
   295
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Chengsong
parents: 469
diff changeset
   296
Generalizing this to regular expressions with larger
Chengsong
parents: 469
diff changeset
   297
bounded repetitions number, we have that
Chengsong
parents: 469
diff changeset
   298
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
Chengsong
parents: 469
diff changeset
   299
would require at least $2^{n+1}$ states, if $r$ contains
Chengsong
parents: 469
diff changeset
   300
more than 1 string.
Chengsong
parents: 469
diff changeset
   301
This is to represent all different 
Chengsong
parents: 469
diff changeset
   302
scenarios which "countdown" states are active.
Chengsong
parents: 469
diff changeset
   303
For those regexes, tools such as $\mathit{JFLEX}$ 
Chengsong
parents: 469
diff changeset
   304
would generate gigantic $\mathit{DFA}$'s or
Chengsong
parents: 469
diff changeset
   305
out of memory errors.
Chengsong
parents: 469
diff changeset
   306
For this reason, regex libraries that support 
Chengsong
parents: 469
diff changeset
   307
bounded repetitions often choose to use the $\mathit{NFA}$ 
Chengsong
parents: 469
diff changeset
   308
approach.
Chengsong
parents: 469
diff changeset
   309
\subsection{The $\mathit{NFA}$ approach to regex matching}
Chengsong
parents: 469
diff changeset
   310
One can simulate the $\mathit{NFA}$ running in two ways:
Chengsong
parents: 469
diff changeset
   311
one by keeping track of all active states after consuming 
Chengsong
parents: 469
diff changeset
   312
a character, and update that set of states iteratively.
Chengsong
parents: 469
diff changeset
   313
This can be viewed as a breadth-first-search of the $\mathit{NFA}$
Chengsong
parents: 469
diff changeset
   314
for a path terminating
Chengsong
parents: 469
diff changeset
   315
at an accepting state.
Chengsong
parents: 469
diff changeset
   316
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
Chengsong
parents: 469
diff changeset
   317
type of $\mathit{NFA}$ simulation, and guarantees a linear runtime
Chengsong
parents: 469
diff changeset
   318
in terms of input string length.
Chengsong
parents: 469
diff changeset
   319
%TODO:try out these lexers
Chengsong
parents: 469
diff changeset
   320
The other way to use $\mathit{NFA}$ for matching is choosing  
Chengsong
parents: 469
diff changeset
   321
a single transition each time, keeping all the other options in 
Chengsong
parents: 469
diff changeset
   322
a queue or stack, and backtracking if that choice eventually 
Chengsong
parents: 469
diff changeset
   323
fails. This method, often called a  "depth-first-search", 
Chengsong
parents: 469
diff changeset
   324
is efficient in a lot of cases, but could end up
Chengsong
parents: 469
diff changeset
   325
with exponential run time.\\
Chengsong
parents: 469
diff changeset
   326
%TODO:COMPARE java python lexer speed with Rust and Go
Chengsong
parents: 469
diff changeset
   327
The reason behind backtracking algorithms in languages like
Chengsong
parents: 469
diff changeset
   328
Java and Python is that they support back-references.
Chengsong
parents: 469
diff changeset
   329
\subsection{Back References in Regex--Non-Regular part}
Chengsong
parents: 469
diff changeset
   330
If we have a regular expression like this (the sequence
Chengsong
parents: 469
diff changeset
   331
operator is omitted for brevity):
Chengsong
parents: 469
diff changeset
   332
\begin{center}
Chengsong
parents: 469
diff changeset
   333
	$r_1(r_2(r_3r_4))$
Chengsong
parents: 469
diff changeset
   334
\end{center}
Chengsong
parents: 469
diff changeset
   335
We could label sub-expressions of interest 
Chengsong
parents: 469
diff changeset
   336
by parenthesizing them and giving 
Chengsong
parents: 469
diff changeset
   337
them a number by the order in which their opening parentheses appear.
Chengsong
parents: 469
diff changeset
   338
One possible way of parenthesizing and labelling is given below:
Chengsong
parents: 469
diff changeset
   339
\begin{center}
Chengsong
parents: 469
diff changeset
   340
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
Chengsong
parents: 469
diff changeset
   341
\end{center}
Chengsong
parents: 469
diff changeset
   342
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled
Chengsong
parents: 469
diff changeset
   343
by 1 to 4. $1$ would refer to the entire expression 
Chengsong
parents: 469
diff changeset
   344
$(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc.
Chengsong
parents: 469
diff changeset
   345
These sub-expressions are called "capturing groups".
Chengsong
parents: 469
diff changeset
   346
We can use the following syntax to denote that we want a string just matched by a 
Chengsong
parents: 469
diff changeset
   347
sub-expression (capturing group) to appear at a certain location again, 
Chengsong
parents: 469
diff changeset
   348
exactly as it was:
Chengsong
parents: 469
diff changeset
   349
\begin{center}
Chengsong
parents: 469
diff changeset
   350
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
Chengsong
parents: 469
diff changeset
   351
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
Chengsong
parents: 469
diff changeset
   352
\end{center}
Chengsong
parents: 469
diff changeset
   353
The backslash and number $i$ are used to denote such 
Chengsong
parents: 469
diff changeset
   354
so-called "back-references".
Chengsong
parents: 469
diff changeset
   355
Let $e$ be an expression made of regular expressions 
Chengsong
parents: 469
diff changeset
   356
and back-references. $e$ contains the expression $e_i$
Chengsong
parents: 469
diff changeset
   357
as its $i$-th capturing group.
Chengsong
parents: 469
diff changeset
   358
The semantics of back-reference can be recursively
Chengsong
parents: 469
diff changeset
   359
written as:
Chengsong
parents: 469
diff changeset
   360
\begin{center}
Chengsong
parents: 469
diff changeset
   361
	\begin{tabular}{c}
Chengsong
parents: 469
diff changeset
   362
		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
Chengsong
parents: 469
diff changeset
   363
		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
Chengsong
parents: 469
diff changeset
   364
	\end{tabular}
Chengsong
parents: 469
diff changeset
   365
\end{center}
Chengsong
parents: 469
diff changeset
   366
The concrete example
Chengsong
parents: 469
diff changeset
   367
$((a|b|c|\ldots|z)^*)\backslash 1$
Chengsong
parents: 469
diff changeset
   368
would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\
Chengsong
parents: 469
diff changeset
   369
Back-reference is a construct in the "regex" standard
Chengsong
parents: 469
diff changeset
   370
that programmers found quite useful, but not exactly 
Chengsong
parents: 469
diff changeset
   371
regular any more.
Chengsong
parents: 469
diff changeset
   372
In fact, that allows the regex construct to express 
Chengsong
parents: 469
diff changeset
   373
languages that cannot be contained in context-free
Chengsong
parents: 469
diff changeset
   374
languages either.
Chengsong
parents: 469
diff changeset
   375
For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
Chengsong
parents: 469
diff changeset
   376
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
Chengsong
parents: 469
diff changeset
   377
which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
Chengsong
parents: 469
diff changeset
   378
Such a language is contained in the context-sensitive hierarchy
Chengsong
parents: 469
diff changeset
   379
of formal languages. 
Chengsong
parents: 469
diff changeset
   380
Solving the back-reference expressions matching problem
Chengsong
parents: 469
diff changeset
   381
is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
Chengsong
parents: 469
diff changeset
   382
efficient solution is not known to exist.
Chengsong
parents: 469
diff changeset
   383
%TODO:read a bit more about back reference algorithms
Chengsong
parents: 469
diff changeset
   384
It seems that languages like Java and Python made the trade-off
Chengsong
parents: 469
diff changeset
   385
to support back-references at the expense of having to backtrack,
Chengsong
parents: 469
diff changeset
   386
even in the case of regexes not involving back-references.\\
Chengsong
parents: 469
diff changeset
   387
Summing these up, we can categorise existing 
Chengsong
parents: 469
diff changeset
   388
practical regex libraries into the ones  with  linear
Chengsong
parents: 469
diff changeset
   389
time guarantees like Go and Rust, which impose restrictions
Chengsong
parents: 469
diff changeset
   390
on the user input (not allowing back-references, 
Chengsong
parents: 469
diff changeset
   391
bounded repetitions canno exceed 1000 etc.), and ones  
Chengsong
parents: 469
diff changeset
   392
 that allows the programmer much freedom, but grinds to a halt
Chengsong
parents: 469
diff changeset
   393
 in some non-negligible portion of cases.
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   394
 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
471
Chengsong
parents: 469
diff changeset
   395
% For example, the Rust regex engine claims to be linear, 
Chengsong
parents: 469
diff changeset
   396
% but does not support lookarounds and back-references.
Chengsong
parents: 469
diff changeset
   397
% The GoLang regex library does not support over 1000 repetitions.  
Chengsong
parents: 469
diff changeset
   398
% Java and Python both support back-references, but shows
Chengsong
parents: 469
diff changeset
   399
%catastrophic backtracking behaviours on inputs without back-references(
Chengsong
parents: 469
diff changeset
   400
%when the language is still regular).
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   401
 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   402
 %TODO: verify the fact Rust does not allow 1000+ reps
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   403
 %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?
471
Chengsong
parents: 469
diff changeset
   404
\section{Buggy Regex Engines} 
Chengsong
parents: 469
diff changeset
   405
Chengsong
parents: 469
diff changeset
   406
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   407
 Another thing about the these libraries is that there
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   408
 is no correctness guarantee.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   409
 In some cases they either fails to generate a lexing result when there is a match,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   410
 or gives the wrong way of matching.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   411
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   412
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   413
It turns out that regex libraries not only suffer from 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   414
exponential backtracking problems, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   415
but also undesired (or even buggy) outputs.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   416
%TODO: comment from who
471
Chengsong
parents: 469
diff changeset
   417
Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   418
correctly implementing the POSIX (maximum-munch)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   419
rule of regular expression matching.
471
Chengsong
parents: 469
diff changeset
   420
This experience is echoed by the writer's
Chengsong
parents: 469
diff changeset
   421
tryout of a few online regex testers:
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   422
A concrete example would be 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   423
the regex
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   424
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   425
(((((a*a*)b*)b){20})*)c
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   426
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   427
and the string
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   428
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   429
baabaabababaabaaaaaaaaababaa
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   430
aababababaaaabaaabaaaaaabaab
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   431
aabababaababaaaaaaaaababaaaa
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   432
babababaaaaaaaaaaaaac
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   433
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   434
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   435
This seemingly complex regex simply says "some $a$'s
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   436
followed by some $b$'s then followed by 1 single $b$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   437
and this iterates 20 times, finally followed by a $c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   438
And a POSIX match would involve the entire string,"eating up"
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   439
all the $b$'s in it.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   440
%TODO: give a coloured example of how this matches POSIXly
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   441
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   442
This regex would trigger catastrophic backtracking in 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   443
languages like Python and Java,
471
Chengsong
parents: 469
diff changeset
   444
whereas it gives a non-POSIX  and uninformative 
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   445
match in languages like Go or .NET--The match with only 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   446
character $c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   447
471
Chengsong
parents: 469
diff changeset
   448
As Grathwohl\parencite{grathwohl2014crash} commented,
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   449
\begin{center}
471
Chengsong
parents: 469
diff changeset
   450
	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   451
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   452
471
Chengsong
parents: 469
diff changeset
   453
\section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking}
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   454
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   455
471
Chengsong
parents: 469
diff changeset
   456
There is also static analysis work on regular expression that
Chengsong
parents: 469
diff changeset
   457
have potential expoential behavious. Rathnayake and Thielecke 
Chengsong
parents: 469
diff changeset
   458
\parencite{Rathnayake2014StaticAF} proposed an algorithm
Chengsong
parents: 469
diff changeset
   459
that detects regular expressions triggering exponential
Chengsong
parents: 469
diff changeset
   460
behavious on backtracking matchers.
Chengsong
parents: 469
diff changeset
   461
People also developed static analysis methods for
Chengsong
parents: 469
diff changeset
   462
generating non-linear polynomial worst-time estimates
Chengsong
parents: 469
diff changeset
   463
for regexes, attack string that exploit the worst-time 
Chengsong
parents: 469
diff changeset
   464
scenario, and "attack automata" that generates
Chengsong
parents: 469
diff changeset
   465
attack strings \parencite{Weideman2017Static}.
Chengsong
parents: 469
diff changeset
   466
There are also tools to "debug" regular expressions
Chengsong
parents: 469
diff changeset
   467
that allows people to see why a match failed or was especially slow
Chengsong
parents: 469
diff changeset
   468
by showing the steps a back-tracking regex engine took\parencite{regexploit2021}.
Chengsong
parents: 469
diff changeset
   469
%TODO:also the regex101 debugger
Chengsong
parents: 469
diff changeset
   470
\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   471
 Is it possible to have a regex lexing algorithm with proven correctness and 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   472
 time complexity, which allows easy extensions to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   473
  constructs like 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   474
 bounded repetitions, negation,  lookarounds, and even back-references? 
471
Chengsong
parents: 469
diff changeset
   475
Building on top of Sulzmann and Lu's attempt to formalize the 
Chengsong
parents: 469
diff changeset
   476
notion of POSIX lexing rules \parencite{Sulzmann2014}, 
Chengsong
parents: 469
diff changeset
   477
Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
Chengsong
parents: 469
diff changeset
   478
POSIX matching as a ternary relation recursively defined in a
Chengsong
parents: 469
diff changeset
   479
natural deduction style.
Chengsong
parents: 469
diff changeset
   480
With the formally-specified rules for what a POSIX matching is,
Chengsong
parents: 469
diff changeset
   481
they designed a regex matching algorithm based on Brzozowski derivatives, and 
Chengsong
parents: 469
diff changeset
   482
proved in Isabelle/HOL that the algorithm gives correct results.
Chengsong
parents: 469
diff changeset
   483
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   484
 
471
Chengsong
parents: 469
diff changeset
   485
Chengsong
parents: 469
diff changeset
   486
Chengsong
parents: 469
diff changeset
   487
Chengsong
parents: 469
diff changeset
   488
%----------------------------------------------------------------------------------------
Chengsong
parents: 469
diff changeset
   489
Chengsong
parents: 469
diff changeset
   490
\section{Our Approach}
Chengsong
parents: 469
diff changeset
   491
In the last fifteen or so years, Brzozowski's derivatives of regular
Chengsong
parents: 469
diff changeset
   492
expressions have sparked quite a bit of interest in the functional
Chengsong
parents: 469
diff changeset
   493
programming and theorem prover communities.  The beauty of
Chengsong
parents: 469
diff changeset
   494
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
Chengsong
parents: 469
diff changeset
   495
expressible in any functional language, and easily definable and
Chengsong
parents: 469
diff changeset
   496
reasoned about in theorem provers---the definitions just consist of
Chengsong
parents: 469
diff changeset
   497
inductive datatypes and simple recursive functions.  Derivatives of a
Chengsong
parents: 469
diff changeset
   498
regular expression, written $r \backslash c$, give a simple solution
Chengsong
parents: 469
diff changeset
   499
to the problem of matching a string $s$ with a regular
Chengsong
parents: 469
diff changeset
   500
expression $r$: if the derivative of $r$ w.r.t.\ (in
Chengsong
parents: 469
diff changeset
   501
succession) all the characters of the string matches the empty string,
Chengsong
parents: 469
diff changeset
   502
then $r$ matches $s$ (and {\em vice versa}).  
Chengsong
parents: 469
diff changeset
   503
Chengsong
parents: 469
diff changeset
   504
Chengsong
parents: 469
diff changeset
   505
This work aims to address the above vulnerability by the combination
Chengsong
parents: 469
diff changeset
   506
of Brzozowski's derivatives and interactive theorem proving. We give an 
Chengsong
parents: 469
diff changeset
   507
improved version of  Sulzmann and Lu's bit-coded algorithm using 
Chengsong
parents: 469
diff changeset
   508
derivatives, which come with a formal guarantee in terms of correctness and 
Chengsong
parents: 469
diff changeset
   509
running time as an Isabelle/HOL proof.
Chengsong
parents: 469
diff changeset
   510
Then we improve the algorithm with an even stronger version of 
Chengsong
parents: 469
diff changeset
   511
simplification, and prove a time bound linear to input and
Chengsong
parents: 469
diff changeset
   512
cubic to regular expression size using a technique by
Chengsong
parents: 469
diff changeset
   513
Antimirov.
Chengsong
parents: 469
diff changeset
   514
Chengsong
parents: 469
diff changeset
   515
\subsection{Existing Work}
Chengsong
parents: 469
diff changeset
   516
We are aware
Chengsong
parents: 469
diff changeset
   517
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Chengsong
parents: 469
diff changeset
   518
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
Chengsong
parents: 469
diff changeset
   519
of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
Chengsong
parents: 469
diff changeset
   520
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Chengsong
parents: 469
diff changeset
   521
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
Chengsong
parents: 469
diff changeset
   522
 
Chengsong
parents: 469
diff changeset
   523
 %We propose Brzozowski's derivatives as a solution to this problem.
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   524
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   525
The main contribution of this thesis is a proven correct lexing algorithm
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   526
with formalized time bounds.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   527
To our best knowledge, there is no lexing libraries using Brzozowski derivatives
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   528
that have a provable time guarantee, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   529
and claims about running time are usually speculative and backed by thin empirical
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   530
evidence.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   531
%TODO: give references
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   532
For example, Sulzmann and Lu had proposed an algorithm  in which they
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   533
claim a linear running time.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   534
But that was falsified by our experiments and the running time 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   535
is actually $\Omega(2^n)$ in the worst case.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   536
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   537
%TODO: give references
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   538
lexer, which calculates POSIX matches and is based on derivatives.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   539
They formalized the correctness of the lexer, but not the complexity.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   540
In the performance evaluation section, they simply analyzed the run time
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   541
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   542
and concluded that the algorithm is quadratic in terms of input length.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   543
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   544
the time it took to lex only 40 $a$'s was 5 minutes.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   545
We therefore believe our results of a proof of performance on general
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   546
inputs rather than specific examples a novel contribution.\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   547
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   548
 \section{Preliminaries about Lexing Using Brzozowski derivatives}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   549
 In the last fifteen or so years, Brzozowski's derivatives of regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   550
expressions have sparked quite a bit of interest in the functional
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   551
programming and theorem prover communities.  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   552
The beauty of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   553
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   554
expressible in any functional language, and easily definable and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   555
reasoned about in theorem provers---the definitions just consist of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   556
inductive datatypes and simple recursive functions. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   557
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   558
Suppose we have an alphabet $\Sigma$, the strings  whose characters
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   559
are from $\Sigma$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   560
can be expressed as $\Sigma^*$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   561
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   562
We use patterns to define a set of strings concisely. Regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   563
are one of such patterns systems:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   564
The basic regular expressions  are defined inductively
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   565
 by the following grammar:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   566
\[			r ::=   \ZERO \mid  \ONE
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   567
			 \mid  c  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   568
			 \mid  r_1 \cdot r_2
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   569
			 \mid  r_1 + r_2   
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   570
			 \mid r^*         
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   571
\]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   572
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   573
The language or set of strings defined by regular expressions are defined as
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   574
%TODO: FILL in the other defs
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   575
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   576
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   577
$L \; r_1 + r_2$ & $\dn$ & $ L \; r_1 \cup L \; r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   578
$L \; r_1 \cdot r_2$ & $\dn$ & $ L \; r_1 \cap L \; r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   579
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   580
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   581
Which are also called the "language interpretation".
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   582
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   583
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   584
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   585
The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   586
where the operation transforms the regex to a new one containing
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   587
strings without the head character $c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   588
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   589
Formally, we define first such a transformation on any string set, which
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   590
we call semantic derivative:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   591
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   592
$\Der \; c\; \textit{StringSet} = \{s \mid c :: s \in StringSet\}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   593
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   594
Mathematically, it can be expressed as the 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   595
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   596
If the $\textit{StringSet}$ happen to have some structure, for example,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   597
if it is regular, then we have that it
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   598
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   599
The the derivative of regular expression, denoted as
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   600
$r \backslash c$, is a function that takes parameters
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   601
$r$ and $c$, and returns another regular expression $r'$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   602
which is computed by the following recursive function:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   603
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   604
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   605
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   606
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   607
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   608
		$d \backslash c$     & $\dn$ & 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   609
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   610
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   611
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   612
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   613
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   614
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   615
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   616
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   617
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   618
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   619
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   620
The $\nullable$ function tests whether the empty string $""$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   621
is in the language of $r$:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   622
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   623
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   624
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   625
		\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   626
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   627
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   628
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   629
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   630
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   631
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   632
		\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   633
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   634
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   635
The empty set does not contain any string and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   636
therefore not the empty string, the empty string 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   637
regular expression contains the empty string
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   638
by definition, the character regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   639
is the singleton that contains character only,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   640
and therefore does not contain the empty string,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   641
the alternative regular expression(or "or" expression)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   642
might have one of its children regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   643
being nullable and any one of its children being nullable
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   644
would suffice. The sequence regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   645
would require both children to have the empty string
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   646
to compose an empty string and the Kleene star
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   647
operation naturally introduced the empty string.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   648
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   649
We can give the meaning of regular expressions derivatives
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   650
by language interpretation:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   651
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   652
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   653
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   654
  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   655
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   656
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   657
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   658
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   659
		$d \backslash c$     & $\dn$ & 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   660
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   661
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   662
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   663
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   664
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   665
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   666
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   667
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   668
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   669
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   670
The function derivative, written $\backslash c$, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   671
defines how a regular expression evolves into
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   672
a new regular expression after all the string it contains
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   673
is chopped off a certain head character $c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   674
The most involved cases are the sequence 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   675
and star case.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   676
The sequence case says that if the first regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   677
contains an empty string then second component of the sequence
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   678
might be chosen as the target regular expression to be chopped
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   679
off its head character.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   680
The star regular expression unwraps the iteration of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   681
regular expression and attack the star regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   682
to its back again to make sure there are 0 or more iterations
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   683
following this unfolded iteration.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   684
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   685
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   686
The main property of the derivative operation
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   687
that enables us to reason about the correctness of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   688
an algorithm using derivatives is 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   689
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   690
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   691
$c\!::\!s \in L(r)$ holds
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   692
if and only if $s \in L(r\backslash c)$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   693
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   694
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   695
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   696
We can generalise the derivative operation shown above for single characters
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   697
to strings as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   698
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   699
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   700
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   701
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   702
$r \backslash [\,] $ & $\dn$ & $r$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   703
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   704
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   705
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   706
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   707
and then define Brzozowski's  regular-expression matching algorithm as:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   708
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   709
\[
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   710
match\;s\;r \;\dn\; nullable(r\backslash s)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   711
\]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   712
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   713
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   714
Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   715
this algorithm presented graphically is as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   716
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   717
\begin{equation}\label{graph:*}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   718
\begin{tikzcd}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   719
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   720
\end{tikzcd}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   721
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   722
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   723
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   724
where we start with  a regular expression  $r_0$, build successive
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   725
derivatives until we exhaust the string and then use \textit{nullable}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   726
to test whether the result can match the empty string. It can  be
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   727
relatively  easily shown that this matcher is correct  (that is given
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   728
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   729
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   730
Beautiful and simple definition.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   731
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   732
If we implement the above algorithm naively, however,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   733
the algorithm can be excruciatingly slow. For example, when starting with the regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   734
expression $(a + aa)^*$ and building 12 successive derivatives
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   735
w.r.t.~the character $a$, one obtains a derivative regular expression
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   736
with more than 8000 nodes (when viewed as a tree). Operations like
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   737
$\backslash$ and $\nullable$ need to traverse such trees and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   738
consequently the bigger the size of the derivative the slower the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   739
algorithm. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   740
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   741
Brzozowski was quick in finding that during this process a lot useless
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   742
$\ONE$s and $\ZERO$s are generated and therefore not optimal.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   743
He also introduced some "similarity rules" such
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   744
as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   745
different but language-equivalent sub-regexes to further decrease the size
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   746
of the intermediate regexes. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   747
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   748
More simplifications are possible, such as deleting duplicates
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   749
and opening up nested alternatives to trigger even more simplifications.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   750
And suppose we apply simplification after each derivative step, and compose
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   751
these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   752
\textit{simp}(a \backslash c)$. Then we can build
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   753
a matcher without having  cumbersome regular expressions.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   754
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   755
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   756
If we want the size of derivatives in the algorithm to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   757
stay even lower, we would need more aggressive simplifications.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   758
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   759
deleting duplicates whenever possible. For example, the parentheses in
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   760
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   761
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   762
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   763
$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   764
to achieve a very tight size bound, namely,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   765
 the same size bound as that of the \emph{partial derivatives}. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   766
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   767
Building derivatives and then simplify them.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   768
So far so good. But what if we want to 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   769
do lexing instead of just a YES/NO answer?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   770
This requires us to go back again to the world 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   771
without simplification first for a moment.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   772
Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   773
elegant(arguably as beautiful as the original
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   774
derivatives definition) solution for this.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   775
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   776
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   777
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   778
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   779
They first defined the datatypes for storing the 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   780
lexing information called a \emph{value} or
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   781
sometimes also \emph{lexical value}.  These values and regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   782
expressions correspond to each other as illustrated in the following
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   783
table:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   784
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   785
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   786
	\begin{tabular}{c@{\hspace{20mm}}c}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   787
		\begin{tabular}{@{}rrl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   788
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   789
			$r$ & $::=$  & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   790
			& $\mid$ & $\ONE$   \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   791
			& $\mid$ & $c$          \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   792
			& $\mid$ & $r_1 \cdot r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   793
			& $\mid$ & $r_1 + r_2$   \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   794
			\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   795
			& $\mid$ & $r^*$         \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   796
		\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   797
		&
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   798
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   799
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   800
			$v$ & $::=$  & \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   801
			&        & $\Empty$   \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   802
			& $\mid$ & $\Char(c)$          \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   803
			& $\mid$ & $\Seq\,v_1\, v_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   804
			& $\mid$ & $\Left(v)$   \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   805
			& $\mid$ & $\Right(v)$  \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   806
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   807
		\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   808
	\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   809
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   810
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   811
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   812
One regular expression can have multiple lexical values. For example
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   813
for the regular expression $(a+b)^*$, it has a infinite list of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   814
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   815
$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   816
$\ldots$, and vice versa.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   817
Even for the regular expression matching a certain string, there could 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   818
still be more than one value corresponding to it.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   819
Take the example where $r= (a^*\cdot a^*)^*$ and the string 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   820
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   821
The number of different ways of matching 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   822
without allowing any value under a star to be flattened
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   823
to an empty string can be given by the following formula:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   824
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   825
	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   826
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   827
and a closed form formula can be calculated to be
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   828
\begin{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   829
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   830
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   831
which is clearly in exponential order.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   832
A lexer aimed at getting all the possible values has an exponential
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   833
worst case runtime. Therefore it is impractical to try to generate
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   834
all possible matches in a run. In practice, we are usually 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   835
interested about POSIX values, which by intuition always
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   836
match the leftmost regular expression when there is a choice
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   837
and always match a sub part as much as possible before proceeding
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   838
to the next token. For example, the above example has the POSIX value
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   839
$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   840
The output of an algorithm we want would be a POSIX matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   841
encoded as a value.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   842
The contribution of Sulzmann and Lu is an extension of Brzozowski's
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   843
algorithm by a second phase (the first phase being building successive
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   844
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   845
is generated in case the regular expression matches  the string. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   846
Pictorially, the Sulzmann and Lu algorithm is as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   847
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   848
\begin{ceqn}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   849
\begin{equation}\label{graph:2}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   850
\begin{tikzcd}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   851
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   852
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   853
\end{tikzcd}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   854
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   855
\end{ceqn}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   856
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   857
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   858
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   859
For convenience, we shall employ the following notations: the regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   860
expression we start with is $r_0$, and the given string $s$ is composed
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   861
of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   862
left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   863
to the characters $c_0$, $c_1$  until we exhaust the string and obtain
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   864
the derivative $r_n$. We test whether this derivative is
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   865
$\textit{nullable}$ or not. If not, we know the string does not match
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   866
$r$ and no value needs to be generated. If yes, we start building the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   867
values incrementally by \emph{injecting} back the characters into the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   868
earlier values $v_n, \ldots, v_0$. This is the second phase of the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   869
algorithm from the right to left. For the first value $v_n$, we call the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   870
function $\textit{mkeps}$, which builds a POSIX lexical value
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   871
for how the empty string has been matched by the (nullable) regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   872
expression $r_n$. This function is defined as
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   873
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   874
	\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   875
		\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   876
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   877
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   878
			& \textit{if} $\nullable(r_{1})$\\ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   879
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   880
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   881
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   882
			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   883
		\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   884
	\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   885
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   886
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   887
\noindent 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   888
After the $\mkeps$-call, we inject back the characters one by one in order to build
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   889
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   890
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   891
After injecting back $n$ characters, we get the lexical value for how $r_0$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   892
matches $s$. The POSIX value is maintained throught out the process.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   893
For this Sulzmann and Lu defined a function that reverses
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   894
the ``chopping off'' of characters during the derivative phase. The
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   895
corresponding function is called \emph{injection}, written
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   896
$\textit{inj}$; it takes three arguments: the first one is a regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   897
expression ${r_{i-1}}$, before the character is chopped off, the second
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   898
is a character ${c_{i-1}}$, the character we want to inject and the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   899
third argument is the value ${v_i}$, into which one wants to inject the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   900
character (it corresponds to the regular expression after the character
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   901
has been chopped off). The result of this function is a new value. The
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   902
definition of $\textit{inj}$ is as follows: 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   903
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   904
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   905
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   906
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   907
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   908
  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   909
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   910
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   911
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   912
  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   913
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   914
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   915
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   916
\noindent This definition is by recursion on the ``shape'' of regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   917
expressions and values. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   918
The clauses basically do one thing--identifying the ``holes'' on 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   919
value to inject the character back into.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   920
For instance, in the last clause for injecting back to a value
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   921
that would turn into a new star value that corresponds to a star,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   922
we know it must be a sequence value. And we know that the first 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   923
value of that sequence corresponds to the child regex of the star
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   924
with the first character being chopped off--an iteration of the star
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   925
that had just been unfolded. This value is followed by the already
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   926
matched star iterations we collected before. So we inject the character 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   927
back to the first value and form a new value with this new iteration
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   928
being added to the previous list of iterations, all under the $Stars$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   929
top level.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   930
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   931
We have mentioned before that derivatives without simplification 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   932
can get clumsy, and this is true for values as well--they reflect
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   933
the regular expressions size by definition.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   934
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   935
One can introduce simplification on the regex and values, but have to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   936
be careful in not breaking the correctness as the injection 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   937
function heavily relies on the structure of the regexes and values
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   938
being correct and match each other.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   939
It can be achieved by recording some extra rectification functions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   940
during the derivatives step, and applying these rectifications in 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   941
each run during the injection phase.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   942
And we can prove that the POSIX value of how
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   943
regular expressions match strings will not be affected---although is much harder
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   944
to establish. Some initial results in this regard have been
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   945
obtained in \cite{AusafDyckhoffUrban2016}. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   946
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   947
%Brzozowski, after giving the derivatives and simplification,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   948
%did not explore lexing with simplification or he may well be 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   949
%stuck on an efficient simplificaiton with a proof.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   950
%He went on to explore the use of derivatives together with 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   951
%automaton, and did not try lexing using derivatives.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   952
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   953
We want to get rid of complex and fragile rectification of values.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   954
Can we not create those intermediate values $v_1,\ldots v_n$,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   955
and get the lexing information that should be already there while
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   956
doing derivatives in one pass, without a second phase of injection?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   957
In the meantime, can we make sure that simplifications
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   958
are easily handled without breaking the correctness of the algorithm?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   959
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   960
Sulzmann and Lu solved this problem by
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   961
introducing additional informtaion to the 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   962
regular expressions called \emph{bitcodes}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   963
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   964
\subsection*{Bit-coded Algorithm}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   965
Bits and bitcodes (lists of bits) are defined as:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   966
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   967
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   968
		$b ::=   1 \mid  0 \qquad
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   969
bs ::= [] \mid b::bs    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   970
$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   971
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   972
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   973
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   974
The $1$ and $0$ are not in bold in order to avoid 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   975
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   976
bit-lists) can be used to encode values (or potentially incomplete values) in a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   977
compact form. This can be straightforwardly seen in the following
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   978
coding function from values to bitcodes: 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   979
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   980
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   981
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   982
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   983
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   984
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   985
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   986
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   987
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   988
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   989
                                                 code(\Stars\,vs)$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   990
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   991
\end{center} 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   992
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   993
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   994
Here $\textit{code}$ encodes a value into a bitcodes by converting
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   995
$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   996
star iteration by $1$. The border where a local star terminates
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   997
is marked by $0$. This coding is lossy, as it throws away the information about
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   998
characters, and also does not encode the ``boundary'' between two
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
   999
sequence values. Moreover, with only the bitcode we cannot even tell
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1000
whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1001
reason for choosing this compact way of storing information is that the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1002
relatively small size of bits can be easily manipulated and ``moved
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1003
around'' in a regular expression. In order to recover values, we will 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1004
need the corresponding regular expression as an extra information. This
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1005
means the decoding function is defined as:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1006
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1007
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1008
%\begin{definition}[Bitdecoding of Values]\mbox{}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1009
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1010
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1011
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1012
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1013
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1014
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1015
       (\Left\,v, bs_1)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1016
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1017
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1018
       (\Right\,v, bs_1)$\\                           
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1019
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1020
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1021
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1022
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1023
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1024
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1025
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1026
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1027
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1028
  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1029
  $\textit{decode}\,bs\,r$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1030
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1031
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1032
       \textit{else}\;\textit{None}$                       
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1033
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1034
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1035
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1036
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1037
Sulzmann and Lu's integrated the bitcodes into regular expressions to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1038
create annotated regular expressions \cite{Sulzmann2014}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1039
\emph{Annotated regular expressions} are defined by the following
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1040
grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1041
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1042
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1043
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1044
  $\textit{a}$ & $::=$  & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1045
                  & $\mid$ & $_{bs}\ONE$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1046
                  & $\mid$ & $_{bs}{\bf c}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1047
                  & $\mid$ & $_{bs}\sum\,as$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1048
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1049
                  & $\mid$ & $_{bs}a^*$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1050
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1051
\end{center}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1052
%(in \textit{ALTS})
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1053
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1054
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1055
where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1056
expressions and $as$ for a list of annotated regular expressions.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1057
The alternative constructor($\sum$) has been generalized to 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1058
accept a list of annotated regular expressions rather than just 2.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1059
We will show that these bitcodes encode information about
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1060
the (POSIX) value that should be generated by the Sulzmann and Lu
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1061
algorithm.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1062
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1063
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1064
To do lexing using annotated regular expressions, we shall first
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1065
transform the usual (un-annotated) regular expressions into annotated
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1066
regular expressions. This operation is called \emph{internalisation} and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1067
defined as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1068
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1069
%\begin{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1070
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1071
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1072
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1073
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1074
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1075
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1076
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1077
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1078
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1079
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1080
  $(r^*)^\uparrow$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1081
         $_{[]}(r^\uparrow)^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1082
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1083
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1084
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1085
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1086
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1087
We use up arrows here to indicate that the basic un-annotated regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1088
expressions are ``lifted up'' into something slightly more complex. In the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1089
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1090
attach bits to the front of an annotated regular expression. Its
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1091
definition is as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1092
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1093
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1094
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1095
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1096
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1097
     $_{bs @ bs'}\ONE$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1098
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1099
     $_{bs@bs'}{\bf c}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1100
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1101
     $_{bs@bs'}\sum\textit{as}$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1102
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1103
     $_{bs@bs'}a_1 \cdot a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1104
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1105
     $_{bs @ bs'}a^*$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1106
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1107
\end{center}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1108
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1109
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1110
After internalising the regular expression, we perform successive
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1111
derivative operations on the annotated regular expressions. This
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1112
derivative operation is the same as what we had previously for the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1113
basic regular expressions, except that we beed to take care of
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1114
the bitcodes:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1115
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1116
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1117
\iffalse
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1118
 %\begin{definition}{bder}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1119
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1120
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1121
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1122
  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1123
  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1124
        $\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1125
         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1126
  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1127
  $\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1128
  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1129
     $\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1130
					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1131
					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1132
  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1133
  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1134
      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1135
       (\textit{STAR}\,[]\,r)$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1136
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1137
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1138
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1139
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1140
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1141
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1142
  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1143
  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1144
  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1145
        $\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1146
         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1147
  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1148
  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1149
  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1150
     $\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1151
					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1152
					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1153
  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1154
  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1155
      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1156
       (_{bs}\textit{STAR}\,[]\,r)$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1157
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1158
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1159
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1160
\fi
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1161
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1162
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1163
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1164
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1165
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1166
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1167
        $\textit{if}\;c=d\; \;\textit{then}\;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1168
         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1169
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1170
  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1171
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1172
     $\textit{if}\;\textit{bnullable}\,a_1$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1173
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1174
					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1175
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1176
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1177
      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1178
       (_{[]}r^*))$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1179
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1180
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1181
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1182
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1183
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1184
For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1185
we need to unfold it into a sequence,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1186
and attach an additional bit $0$ to the front of $r \backslash c$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1187
to indicate that there is one more star iteration. Also the sequence clause
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1188
is more subtle---when $a_1$ is $\textit{bnullable}$ (here
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1189
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1190
that it is for annotated regular expressions, therefore we omit the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1191
definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1192
$a_1$ matches the string prior to character $c$ (more on this later),
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1193
then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1194
\backslash c)$ will collapse the regular expression $a_1$(as it has
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1195
already been fully matched) and store the parsing information at the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1196
head of the regular expression $a_2 \backslash c$ by fusing to it. The
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1197
bitsequence $\textit{bs}$, which was initially attached to the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1198
first element of the sequence $a_1 \cdot a_2$, has
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1199
now been elevated to the top-level of $\sum$, as this information will be
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1200
needed whichever way the sequence is matched---no matter whether $c$ belongs
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1201
to $a_1$ or $ a_2$. After building these derivatives and maintaining all
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1202
the lexing information, we complete the lexing by collecting the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1203
bitcodes using a generalised version of the $\textit{mkeps}$ function
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1204
for annotated regular expressions, called $\textit{bmkeps}$:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1205
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1206
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1207
%\begin{definition}[\textit{bmkeps}]\mbox{}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1208
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1209
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1210
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1211
  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1212
     $\textit{if}\;\textit{bnullable}\,a$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1213
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1214
  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1215
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1216
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1217
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1218
     $bs \,@\, [0]$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1219
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1220
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1221
%\end{definition}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1222
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1223
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1224
This function completes the value information by travelling along the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1225
path of the regular expression that corresponds to a POSIX value and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1226
collecting all the bitcodes, and using $S$ to indicate the end of star
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1227
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1228
decode them, we get the value we expect. The corresponding lexing
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1229
algorithm looks as follows:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1230
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1231
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1232
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1233
  $\textit{blexer}\;r\,s$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1234
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1235
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1236
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1237
  & & $\;\;\textit{else}\;\textit{None}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1238
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1239
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1240
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1241
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1242
In this definition $\_\backslash s$ is the  generalisation  of the derivative
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1243
operation from characters to strings (just like the derivatives for un-annotated
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1244
regular expressions).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1245
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1246
Remember tha one of the important reasons we introduced bitcodes
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1247
is that they can make simplification more structured and therefore guaranteeing
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1248
the correctness.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1249
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1250
\subsection*{Our Simplification Rules}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1251
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1252
In this section we introduce aggressive (in terms of size) simplification rules
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1253
on annotated regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1254
in order to keep derivatives small. Such simplifications are promising
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1255
as we have
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1256
generated test data that show
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1257
that a good tight bound can be achieved. Obviously we could only
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1258
partially cover  the search space as there are infinitely many regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1259
expressions and strings. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1260
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1261
One modification we introduced is to allow a list of annotated regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1262
expressions in the $\sum$ constructor. This allows us to not just
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1263
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1264
also unnecessary ``copies'' of regular expressions (very similar to
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1265
simplifying $r + r$ to just $r$, but in a more general setting). Another
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1266
modification is that we use simplification rules inspired by Antimirov's
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1267
work on partial derivatives. They maintain the idea that only the first
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1268
``copy'' of a regular expression in an alternative contributes to the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1269
calculation of a POSIX value. All subsequent copies can be pruned away from
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1270
the regular expression. A recursive definition of our  simplification function 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1271
that looks somewhat similar to our Scala code is given below:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1272
%\comment{Use $\ZERO$, $\ONE$ and so on. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1273
%Is it $ALTS$ or $ALTS$?}\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1274
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1275
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1276
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1277
   
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1278
  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1279
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1280
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1281
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1282
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1283
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1284
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1285
  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1286
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1287
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1288
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1289
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1290
   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1291
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1292
\end{center}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1293
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1294
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1295
The simplification does a pattern matching on the regular expression.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1296
When it detected that the regular expression is an alternative or
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1297
sequence, it will try to simplify its children regular expressions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1298
recursively and then see if one of the children turn into $\ZERO$ or
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1299
$\ONE$, which might trigger further simplification at the current level.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1300
The most involved part is the $\sum$ clause, where we use two
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1301
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1302
alternatives and reduce as many duplicates as possible. Function
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1303
$\textit{distinct}$  keeps the first occurring copy only and remove all later ones
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1304
when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1305
Its recursive definition is given below:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1306
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1307
 \begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1308
  \begin{tabular}{@{}lcl@{}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1309
  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1310
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1311
  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1312
    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1313
\end{tabular}    
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1314
\end{center}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1315
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1316
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1317
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1318
function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1319
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1320
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1321
Having defined the $\simp$ function,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1322
we can use the previous notation of  natural
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1323
extension from derivative w.r.t.~character to derivative
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1324
w.r.t.~string:%\comment{simp in  the [] case?}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1325
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1326
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1327
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1328
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1329
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1330
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1331
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1332
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1333
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1334
to obtain an optimised version of the algorithm:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1335
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1336
 \begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1337
\begin{tabular}{lcl}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1338
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1339
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1340
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1341
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1342
  & & $\;\;\textit{else}\;\textit{None}$
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1343
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1344
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1345
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1346
\noindent
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1347
This algorithm keeps the regular expression size small, for example,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1348
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1349
will be reduced to just 6 and stays constant, no matter how long the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1350
input string is.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1351
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1352
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1353
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1354
Derivatives give a simple solution
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1355
to the problem of matching a string $s$ with a regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1356
expression $r$: if the derivative of $r$ w.r.t.\ (in
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1357
succession) all the characters of the string matches the empty string,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1358
then $r$ matches $s$ (and {\em vice versa}).  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1359
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1360
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1361
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1362
However, there are two difficulties with derivative-based matchers:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1363
First, Brzozowski's original matcher only generates a yes/no answer
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1364
for whether a regular expression matches a string or not.  This is too
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1365
little information in the context of lexing where separate tokens must
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1366
be identified and also classified (for example as keywords
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1367
or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1368
difficulty by cleverly extending Brzozowski's matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1369
algorithm. Their extended version generates additional information on
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1370
\emph{how} a regular expression matches a string following the POSIX
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1371
rules for regular expression matching. They achieve this by adding a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1372
second ``phase'' to Brzozowski's algorithm involving an injection
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1373
function.  In our own earlier work we provided the formal
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1374
specification of what POSIX matching means and proved in Isabelle/HOL
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1375
the correctness
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1376
of Sulzmann and Lu's extended algorithm accordingly
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1377
\cite{AusafDyckhoffUrban2016}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1378
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1379
The second difficulty is that Brzozowski's derivatives can 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1380
grow to arbitrarily big sizes. For example if we start with the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1381
regular expression $(a+aa)^*$ and take
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1382
successive derivatives according to the character $a$, we end up with
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1383
a sequence of ever-growing derivatives like 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1384
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1385
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1386
\begin{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1387
\begin{tabular}{rll}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1388
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1389
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1390
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1391
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1392
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1393
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1394
\end{center}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1395
 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1396
\noindent where after around 35 steps we run out of memory on a
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1397
typical computer (we shall define shortly the precise details of our
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1398
regular expressions and the derivative operation).  Clearly, the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1399
notation involving $\ZERO$s and $\ONE$s already suggests
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1400
simplification rules that can be applied to regular regular
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1401
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1402
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1403
r$. While such simple-minded simplifications have been proved in our
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1404
earlier work to preserve the correctness of Sulzmann and Lu's
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1405
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1406
\emph{not} help with limiting the growth of the derivatives shown
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1407
above: the growth is slowed, but the derivatives can still grow rather
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1408
quickly beyond any finite bound.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1409
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1410
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1411
Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1412
\cite{Sulzmann2014} where they introduce bitcoded
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1413
regular expressions. In this version, POSIX values are
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1414
represented as bitsequences and such sequences are incrementally generated
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1415
when derivatives are calculated. The compact representation
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1416
of bitsequences and regular expressions allows them to define a more
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1417
``aggressive'' simplification method that keeps the size of the
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1418
derivatives finite no matter what the length of the string is.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1419
They make some informal claims about the correctness and linear behaviour
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1420
of this version, but do not provide any supporting proof arguments, not
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1421
even ``pencil-and-paper'' arguments. They write about their bitcoded
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1422
\emph{incremental parsing method} (that is the algorithm to be formalised
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1423
in this paper):
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1424
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1425
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1426
\begin{quote}\it
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1427
  ``Correctness Claim: We further claim that the incremental parsing
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1428
  method [..] in combination with the simplification steps [..]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1429
  yields POSIX parse trees. We have tested this claim
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1430
  extensively [..] but yet
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1431
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1432
\end{quote}  
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1433
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1434
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1435
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1436
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1437
\section{Backgound}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1438
%Regular expression matching and lexing has been 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1439
% widely-used and well-implemented
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1440
%in software industry. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1441
%TODO: expand the above into a full paragraph
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1442
%TODO: look up snort rules to use here--give readers idea of what regexes look like
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1443
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1444
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1445
Theoretical results say that regular expression matching
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1446
should be linear with respect to the input.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1447
Under a certain class of regular expressions and inputs though,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1448
practical implementations  suffer from non-linear or even 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1449
exponential running time,
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1450
allowing a ReDoS (regular expression denial-of-service ) attack.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1451
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1452
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1453
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1454
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1455
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1456
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1457
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1458
\section{What this Template Includes}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1459
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1460
\subsection{Folders}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1461
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1462
This template comes as a single zip file that expands out to several files and folders. The folder names are mostly self-explanatory:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1463
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1464
\keyword{Appendices} -- this is the folder where you put the appendices. Each appendix should go into its own separate \file{.tex} file. An example and template are included in the directory.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1465
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1466
\keyword{Chapters} -- this is the folder where you put the thesis chapters. A thesis usually has about six chapters, though there is no hard rule on this. Each chapter should go in its own separate \file{.tex} file and they can be split as:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1467
\begin{itemize}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1468
\item Chapter 1: Introduction to the thesis topic
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1469
\item Chapter 2: Background information and theory
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1470
\item Chapter 3: (Laboratory) experimental setup
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1471
\item Chapter 4: Details of experiment 1
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1472
\item Chapter 5: Details of experiment 2
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1473
\item Chapter 6: Discussion of the experimental results
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1474
\item Chapter 7: Conclusion and future directions
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1475
\end{itemize}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1476
This chapter layout is specialised for the experimental sciences, your discipline may be different.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1477
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1478
\keyword{Figures} -- this folder contains all figures for the thesis. These are the final images that will go into the thesis document.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1479
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1480
\subsection{Files}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1481
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1482
Included are also several files, most of them are plain text and you can see their contents in a text editor. After initial compilation, you will see that more auxiliary files are created by \LaTeX{} or BibTeX and which you don't need to delete or worry about:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1483
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1484
\keyword{example.bib} -- this is an important file that contains all the bibliographic information and references that you will be citing in the thesis for use with BibTeX. You can write it manually, but there are reference manager programs available that will create and manage it for you. Bibliographies in \LaTeX{} are a large subject and you may need to read about BibTeX before starting with this. Many modern reference managers will allow you to export your references in BibTeX format which greatly eases the amount of work you have to do.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1485
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1486
\keyword{MastersDoctoralThesis.cls} -- this is an important file. It is the class file that tells \LaTeX{} how to format the thesis. 
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1487
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1488
\keyword{main.pdf} -- this is your beautifully typeset thesis (in the PDF file format) created by \LaTeX{}. It is supplied in the PDF with the template and after you compile the template you should get an identical version.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1489
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1490
\keyword{main.tex} -- this is an important file. This is the file that you tell \LaTeX{} to compile to produce your thesis as a PDF file. It contains the framework and constructs that tell \LaTeX{} how to layout the thesis. It is heavily commented so you can read exactly what each line of code does and why it is there. After you put your own information into the \emph{THESIS INFORMATION} block -- you have now started your thesis!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1491
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1492
Files that are \emph{not} included, but are created by \LaTeX{} as auxiliary files include:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1493
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1494
\keyword{main.aux} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1495
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1496
\keyword{main.bbl} -- this is an auxiliary file generated by BibTeX, if it is deleted, BibTeX simply regenerates it when you run the \file{main.aux} file. Whereas the \file{.bib} file contains all the references you have, this \file{.bbl} file contains the references you have actually cited in the thesis and is used to build the bibliography section of the thesis.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1497
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1498
\keyword{main.blg} -- this is an auxiliary file generated by BibTeX, if it is deleted BibTeX simply regenerates it when you run the main \file{.aux} file.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1499
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1500
\keyword{main.lof} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Figures} section.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1501
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1502
\keyword{main.log} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It contains messages from \LaTeX{}, if you receive errors and warnings from \LaTeX{}, they will be in this \file{.log} file.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1503
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1504
\keyword{main.lot} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file. It tells \LaTeX{} how to build the \emph{List of Tables} section.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1505
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1506
\keyword{main.out} -- this is an auxiliary file generated by \LaTeX{}, if it is deleted \LaTeX{} simply regenerates it when you run the main \file{.tex} file.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1507
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1508
So from this long list, only the files with the \file{.bib}, \file{.cls} and \file{.tex} extensions are the most important ones. The other auxiliary files can be ignored or deleted as \LaTeX{} and BibTeX will regenerate them.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1509
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1510
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1511
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1512
\section{Filling in Your Information in the \file{main.tex} File}\label{FillingFile}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1513
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1514
You will need to personalise the thesis template and make it your own by filling in your own information. This is done by editing the \file{main.tex} file in a text editor or your favourite LaTeX environment.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1515
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1516
Open the file and scroll down to the third large block titled \emph{THESIS INFORMATION} where you can see the entries for \emph{University Name}, \emph{Department Name}, etc \ldots
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1517
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1518
Fill out the information about yourself, your group and institution. You can also insert web links, if you do, make sure you use the full URL, including the \code{http://} for this. If you don't want these to be linked, simply remove the \verb|\href{url}{name}| and only leave the name.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1519
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1520
When you have done this, save the file and recompile \code{main.tex}. All the information you filled in should now be in the PDF, complete with web links. You can now begin your thesis proper!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1521
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1522
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1523
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1524
\section{The \code{main.tex} File Explained}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1525
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1526
The \file{main.tex} file contains the structure of the thesis. There are plenty of written comments that explain what pages, sections and formatting the \LaTeX{} code is creating. Each major document element is divided into commented blocks with titles in all capitals to make it obvious what the following bit of code is doing. Initially there seems to be a lot of \LaTeX{} code, but this is all formatting, and it has all been taken care of so you don't have to do it.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1527
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1528
Begin by checking that your information on the title page is correct. For the thesis declaration, your institution may insist on something different than the text given. If this is the case, just replace what you see with what is required in the \emph{DECLARATION PAGE} block.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1529
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1530
Then comes a page which contains a funny quote. You can put your own, or quote your favourite scientist, author, person, and so on. Make sure to put the name of the person who you took the quote from.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1531
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1532
Following this is the abstract page which summarises your work in a condensed way and can almost be used as a standalone document to describe what you have done. The text you write will cause the heading to move up so don't worry about running out of space.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1533
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1534
Next come the acknowledgements. On this page, write about all the people who you wish to thank (not forgetting parents, partners and your advisor/supervisor).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1535
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1536
The contents pages, list of figures and tables are all taken care of for you and do not need to be manually created or edited. The next set of pages are more likely to be optional and can be deleted since they are for a more technical thesis: insert a list of abbreviations you have used in the thesis, then a list of the physical constants and numbers you refer to and finally, a list of mathematical symbols used in any formulae. Making the effort to fill these tables means the reader has a one-stop place to refer to instead of searching the internet and references to try and find out what you meant by certain abbreviations or symbols.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1537
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1538
The list of symbols is split into the Roman and Greek alphabets. Whereas the abbreviations and symbols ought to be listed in alphabetical order (and this is \emph{not} done automatically for you) the list of physical constants should be grouped into similar themes.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1539
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1540
The next page contains a one line dedication. Who will you dedicate your thesis to?
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1541
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1542
Finally, there is the block where the chapters are included. Uncomment the lines (delete the \code{\%} character) as you write the chapters. Each chapter should be written in its own file and put into the \emph{Chapters} folder and named \file{Chapter1}, \file{Chapter2}, etc\ldots Similarly for the appendices, uncomment the lines as you need them. Each appendix should go into its own file and placed in the \emph{Appendices} folder.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1543
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1544
After the preamble, chapters and appendices finally comes the bibliography. The bibliography style (called \option{authoryear}) is used for the bibliography and is a fully featured style that will even include links to where the referenced paper can be found online. Do not underestimate how grateful your reader will be to find that a reference to a paper is just a click away. Of course, this relies on you putting the URL information into the BibTeX file in the first place.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1545
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1546
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1547
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1548
\section{Thesis Features and Conventions}\label{ThesisConventions}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1549
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1550
To get the best out of this template, there are a few conventions that you may want to follow.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1551
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1552
One of the most important (and most difficult) things to keep track of in such a long document as a thesis is consistency. Using certain conventions and ways of doing things (such as using a Todo list) makes the job easier. Of course, all of these are optional and you can adopt your own method.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1553
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1554
\subsection{Printing Format}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1555
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1556
This thesis template is designed for double sided printing (i.e. content on the front and back of pages) as most theses are printed and bound this way. Switching to one sided printing is as simple as uncommenting the \option{oneside} option of the \code{documentclass} command at the top of the \file{main.tex} file. You may then wish to adjust the margins to suit specifications from your institution.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1557
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1558
The headers for the pages contain the page number on the outer side (so it is easy to flick through to the page you want) and the chapter name on the inner side.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1559
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1560
The text is set to 11 point by default with single line spacing, again, you can tune the text size and spacing should you want or need to using the options at the very start of \file{main.tex}. The spacing can be changed similarly by replacing the \option{singlespacing} with \option{onehalfspacing} or \option{doublespacing}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1561
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1562
\subsection{Using US Letter Paper}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1563
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1564
The paper size used in the template is A4, which is the standard size in Europe. If you are using this thesis template elsewhere and particularly in the United States, then you may have to change the A4 paper size to the US Letter size. This can be done in the margins settings section in \file{main.tex}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1565
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1566
Due to the differences in the paper size, the resulting margins may be different to what you like or require (as it is common for institutions to dictate certain margin sizes). If this is the case, then the margin sizes can be tweaked by modifying the values in the same block as where you set the paper size. Now your document should be set up for US Letter paper size with suitable margins.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1567
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1568
\subsection{References}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1569
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1570
The \code{biblatex} package is used to format the bibliography and inserts references such as this one \parencite{Reference1}. The options used in the \file{main.tex} file mean that the in-text citations of references are formatted with the author(s) listed with the date of the publication. Multiple references are separated by semicolons (e.g. \parencite{Reference2, Reference1}) and references with more than three authors only show the first author with \emph{et al.} indicating there are more authors (e.g. \parencite{Reference3}). This is done automatically for you. To see how you use references, have a look at the \file{Chapter1.tex} source file. Many reference managers allow you to simply drag the reference into the document as you type.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1571
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1572
Scientific references should come \emph{before} the punctuation mark if there is one (such as a comma or period). The same goes for footnotes\footnote{Such as this footnote, here down at the bottom of the page.}. You can change this but the most important thing is to keep the convention consistent throughout the thesis. Footnotes themselves should be full, descriptive sentences (beginning with a capital letter and ending with a full stop). The APA6 states: \enquote{Footnote numbers should be superscripted, [...], following any punctuation mark except a dash.} The Chicago manual of style states: \enquote{A note number should be placed at the end of a sentence or clause. The number follows any punctuation mark except the dash, which it precedes. It follows a closing parenthesis.}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1573
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1574
The bibliography is typeset with references listed in alphabetical order by the first author's last name. This is similar to the APA referencing style. To see how \LaTeX{} typesets the bibliography, have a look at the very end of this document (or just click on the reference number links in in-text citations).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1575
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1576
\subsubsection{A Note on bibtex}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1577
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1578
The bibtex backend used in the template by default does not correctly handle unicode character encoding (i.e. "international" characters). You may see a warning about this in the compilation log and, if your references contain unicode characters, they may not show up correctly or at all. The solution to this is to use the biber backend instead of the outdated bibtex backend. This is done by finding this in \file{main.tex}: \option{backend=bibtex} and changing it to \option{backend=biber}. You will then need to delete all auxiliary BibTeX files and navigate to the template directory in your terminal (command prompt). Once there, simply type \code{biber main} and biber will compile your bibliography. You can then compile \file{main.tex} as normal and your bibliography will be updated. An alternative is to set up your LaTeX editor to compile with biber instead of bibtex, see \href{http://tex.stackexchange.com/questions/154751/biblatex-with-biber-configuring-my-editor-to-avoid-undefined-citations/}{here} for how to do this for various editors.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1579
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1580
\subsection{Tables}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1581
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1582
Tables are an important way of displaying your results, below is an example table which was generated with this code:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1583
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1584
{\small
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1585
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1586
\begin{table}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1587
\caption{The effects of treatments X and Y on the four groups studied.}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1588
\label{tab:treatments}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1589
\centering
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1590
\begin{tabular}{l l l}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1591
\toprule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1592
\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1593
\midrule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1594
1 & 0.2 & 0.8\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1595
2 & 0.17 & 0.7\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1596
3 & 0.24 & 0.75\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1597
4 & 0.68 & 0.3\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1598
\bottomrule\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1599
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1600
\end{table}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1601
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1602
}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1603
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1604
\begin{table}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1605
\caption{The effects of treatments X and Y on the four groups studied.}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1606
\label{tab:treatments}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1607
\centering
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1608
\begin{tabular}{l l l}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1609
\toprule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1610
\tabhead{Groups} & \tabhead{Treatment X} & \tabhead{Treatment Y} \\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1611
\midrule
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1612
1 & 0.2 & 0.8\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1613
2 & 0.17 & 0.7\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1614
3 & 0.24 & 0.75\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1615
4 & 0.68 & 0.3\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1616
\bottomrule\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1617
\end{tabular}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1618
\end{table}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1619
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1620
You can reference tables with \verb|\ref{<label>}| where the label is defined within the table environment. See \file{Chapter1.tex} for an example of the label and citation (e.g. Table~\ref{tab:treatments}).
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1621
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1622
\subsection{Figures}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1623
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1624
There will hopefully be many figures in your thesis (that should be placed in the \emph{Figures} folder). The way to insert figures into your thesis is to use a code template like this:
469
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1625
%\begin{verbatim}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1626
%\begin{figure}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1627
%\centering
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1628
%\includegraphics{Figures/Electron}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1629
%\decoRule
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1630
%\caption[An Electron]{An electron (artist's impression).}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1631
%\label{fig:Electron}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1632
%\end{figure}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1633
%\end{verbatim}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1634
%Also look in the source file. Putting this code into the source file produces the picture of the electron that you can see in the figure below.
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1635
%
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1636
%\begin{figure}[th]
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1637
%\centering
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1638
%\includegraphics{Figures/Electron}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1639
%\decoRule
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1640
%\caption[An Electron]{An electron (artist's impression).}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1641
%\label{fig:Electron}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1642
%\end{figure}
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1643
469
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1644
%Sometimes figures don't always appear where you write them in the source. The placement depends on how much space there is on the page for the figure. Sometimes there is not enough room to fit a figure directly where it should go (in relation to the text) and so \LaTeX{} puts it at the top of the next page. Positioning figures is the job of \LaTeX{} and so you should only worry about making them look good!
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1645
%
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1646
%Figures usually should have captions just in case you need to refer to them (such as in Figure~\ref{fig:Electron}). The \verb|\caption| command contains two parts, the first part, inside the square brackets is the title that will appear in the \emph{List of Figures}, and so should be short. The second part in the curly brackets should contain the longer and more descriptive caption text.
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1647
%
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1648
%The \verb|\decoRule| command is optional and simply puts an aesthetic horizontal line below the image. If you do this for one image, do it for all of them.
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1649
%
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1650
%\LaTeX{} is capable of using images in pdf, jpg and png format.
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1651
%
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1652
%\subsection{Typesetting mathematics}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1653
%
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1654
%If your thesis is going to contain heavy mathematical content, be sure that \LaTeX{} will make it look beautiful, even though it won't be able to solve the equations for you.
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1655
%
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1656
%The \enquote{Not So Short Introduction to \LaTeX} (available on \href{http://www.ctan.org/tex-archive/info/lshort/english/lshort.pdf}{CTAN}) should tell you everything you need to know for most cases of typesetting mathematics. If you need more information, a much more thorough mathematical guide is available from the AMS called, \enquote{A Short Math Guide to \LaTeX} and can be downloaded from:
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1657
%\url{ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf}
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1658
%
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1659
%There are many different \LaTeX{} symbols to remember, luckily you can find the most common symbols in \href{http://ctan.org/pkg/comprehensive}{The Comprehensive \LaTeX~Symbol List}.
e5dd8cc0aa82 all electron pics removed
Chengsong
parents: 468
diff changeset
  1660
%
468
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1661
You can write an equation, which is automatically given an equation number by \LaTeX{} like this:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1662
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1663
\begin{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1664
E = mc^{2}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1665
\label{eqn:Einstein}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1666
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1667
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1668
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1669
This will produce Einstein's famous energy-matter equivalence equation:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1670
\begin{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1671
E = mc^{2}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1672
\label{eqn:Einstein}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1673
\end{equation}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1674
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1675
All equations you write (which are not in the middle of paragraph text) are automatically given equation numbers by \LaTeX{}. If you don't want a particular equation numbered, use the unnumbered form:
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1676
\begin{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1677
\[ a^{2}=4 \]
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1678
\end{verbatim}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1679
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1680
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1681
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1682
\section{Sectioning and Subsectioning}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1683
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1684
You should break your thesis up into nice, bite-sized sections and subsections. \LaTeX{} automatically builds a table of Contents by looking at all the \verb|\chapter{}|, \verb|\section{}|  and \verb|\subsection{}| commands you write in the source.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1685
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1686
The Table of Contents should only list the sections to three (3) levels. A \verb|chapter{}| is level zero (0). A \verb|\section{}| is level one (1) and so a \verb|\subsection{}| is level two (2). In your thesis it is likely that you will even use a \verb|subsubsection{}|, which is level three (3). The depth to which the Table of Contents is formatted is set within \file{MastersDoctoralThesis.cls}. If you need this changed, you can do it in \file{main.tex}.
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1687
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1688
%----------------------------------------------------------------------------------------
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1689
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1690
\section{In Closing}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1691
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1692
You have reached the end of this mini-guide. You can now rename or overwrite this pdf file and begin writing your own \file{Chapter1.tex} and the rest of your thesis. The easy work of setting up the structure and framework has been taken care of for you. It's now your job to fill it out!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1693
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1694
Good luck and have lots of fun!
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1695
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1696
\begin{flushright}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1697
Guide written by ---\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1698
Sunil Patel: \href{http://www.sunilpatel.co.uk}{www.sunilpatel.co.uk}\\
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1699
Vel: \href{http://www.LaTeXTemplates.com}{LaTeXTemplates.com}
a0f27e21b42c all texrelated
Chengsong
parents:
diff changeset
  1700
\end{flushright}