\documentclass[a4paper,UKenglish]{lipics}
\usepackage{graphicx}
\usepackage{pgfplots}
\usepackage{tikz-cd}
% \documentclass{article}
%\usepackage[utf8]{inputenc}
%\usepackage[english]{babel}
%\usepackage{listings}
% \usepackage{amsthm}
% \usepackage{hyperref}
% \usepackage[margin=0.5in]{geometry}
%\usepackage{pmboxdraw}

\title{POSIX Regular Expression Matching and Lexing}
\author{Chengsong Tan}
\affil{King's College London\\
London, UK\\
\texttt{chengsong.tan@kcl.ac.uk}}
\authorrunning{Chengsong Tan}
\Copyright{Chengsong Tan}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\mathit{mkeps}}
\def\inj{\mathit{inj}}
\def\Empty{\mathit{Empty}}
\def\Left{\mathit{Left}}
\def\Right{\mathit{Right}}
\def\Stars{\mathit{Stars}}
\def\Char{\mathit{Char}}
\def\Seq{\mathit{Seq}}
\def\Der{\mathit{Der}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}

%\theoremstyle{theorem}
%\newtheorem{theorem}{Theorem}
%\theoremstyle{lemma}
%\newtheorem{lemma}{Lemma}
%\newcommand{\lemmaautorefname}{Lemma}
%\theoremstyle{definition}
%\newtheorem{definition}{Definition}


\begin{document}

\maketitle

\begin{abstract}
  In 1964, Brzozowski introduced a beautifully simple algorithm for
  regular expression matching based on the notion of derivatives of
  regular expressions. In 2014, Sulzmann and Lu extended this
  algorithm so that it not only gives a YES/NO answer for whether a
  regular expression matches a string, but in the successful case also
  computes \emph{how} the string is matched. This is important for
  applications such as lexing (tokenising a string). The problem is to
  make the algorithm by Sulzmann and Lu fast on all inputs without
  breaking its correctness.
\end{abstract}

\section{Introduction}

This PhD-project is about regular expression matching and
lexing. Given the maturity of this topic, the reader might wonder:
Surely, regular expressions must have already been studied to death?
What could possibly be \emph{not} known in this area? And surely all
implemented algorithms for regular expression matching are blindingly
fast?

Unfortunately these preconceptions are not supported by evidence: take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa\ldots a$ match this regular
expression. Obviously they do not match---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings of, say, 28
$a$'s, one discovers that this decision takes around 30 seconds and
considerably longer when adding a few more $a$'s, as the graphs
below show:

+ − 86
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by the
DFA should take time linear in the length of the string.

Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this ``exponential behaviour''. Unfortunately, such regular
expressions are not just a few ``outliers'': they are in fact
frequent enough that a separate name has been coined for
them---\emph{evil regular expressions}. In empirical work, Davis et al.\
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}.

This exponential blowup sometimes causes real pain in real life: for
example on 20 July 2016 one evil regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources and because of this the web servers
ground to a halt. This happened when a post with 20,000 white spaces
was submitted, but importantly the white spaces were neither at the
beginning nor at the end of the line. As a result, the regular
expression matching engine needed to backtrack over many choices.

The underlying problem is that many ``real life'' regular expression
matching engines do not use DFAs for matching. This is because they
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are quite a
few research questions still unanswered and fast algorithms still need
to be developed.

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
expression matches a string, but, if it matches, also in \emph{how} it
matches the string. Consider for example a regular expression
$r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
and so on; and a regular expression $r_{id}$ for recognising
identifiers (say, a single character followed by characters or
numbers). One can then form the compound regular expression
$(r_{key} + r_{id})^*$ and use it to tokenise strings. But then how
should the string \textit{iffoo} be tokenised? It could be tokenised
as a keyword followed by an identifier, or the entire string as a
single identifier. Similarly, how should the string \textit{if} be
tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
``fire''---so is it an identifier or a keyword? While in applications
there is a well-known strategy for deciding these questions, called
POSIX matching, precise definitions of what POSIX matching actually
means have been formalised only relatively recently
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
POSIX matching means matching the longest initial substring; in the
case of a tie, the initial submatch is chosen according to some
priorities attached to the regular expressions (e.g.~keywords have a
higher priority than identifiers). This sounds rather simple, but
according to Grathwohl et al.\ \cite[Page 36]{CrashCourse2014} this is
not the case. They wrote:


\begin{quote}
\it{}``The POSIX strategy is more complicated than the greedy because of
the dependence on information about the length of matched strings in the
various subexpressions.''
\end{quote}

\noindent
This is also supported by evidence collected by Kuklewicz
\cite{Kuklewicz} who noticed that a number of POSIX regular expression
matchers calculate incorrect results.

Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
regular expression matching according to the POSIX strategy
\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
Brzozowski from 1964 where he introduced the notion of derivatives of
regular expressions \cite{Brzozowski1964}. We shall briefly explain
the algorithms next.


\section{The Algorithms by Brzozowski, and Sulzmann and Lu}

Suppose regular expressions are given by the following grammar:

\begin{center}
\begin{tabular}{@{}rrl@{}}
  \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
  $r$ & $::=$ & $\ZERO$\\
      & $\mid$ & $\ONE$ \\
      & $\mid$ & $c$ \\
      & $\mid$ & $r_1 \cdot r_2$\\
      & $\mid$ & $r_1 + r_2$ \\
      & $\mid$ & $r^*$ \\
\end{tabular}
\end{center}


\noindent
The intended meaning of the regular expressions is as usual, and is
captured by the classic notion of the \emph{language} of a regular
expression, written $L(\_)$: $\ZERO$ cannot match any string, $\ONE$
can match the empty string, the character regular expression $c$ can
match the character $c$, and so on. The brilliant contribution by
Brzozowski is the notion of \emph{derivatives} of regular
expressions. The idea behind this notion is as follows: suppose a
regular expression $r$ can match a string of the form $c\!::\!s$
(that is, a list of characters starting with $c$). What does the
regular expression look like that can match just $s$? Brzozowski gave
a neat answer to this question. He defined the following operation on
regular expressions, written $r\backslash c$ (the derivative of $r$
w.r.t.~the character $c$):


\begin{center}
\begin{tabular}{lcl}
  $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
  $\ONE \backslash c$  & $\dn$ & $\ZERO$\\
  $d \backslash c$     & $\dn$ &
    $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
  $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
  $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, \epsilon \in L(r_1)$\\
  & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
  & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
  $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}

\noindent
The $\mathit{if}$ condition in the definition of
$(r_1 \cdot r_2) \backslash c$ involves a membership test,
$\epsilon \overset{?}{\in} L(r_1)$. Such a test is easily implemented
by the following simple recursive function $\nullable(\_)$:

\begin{center}
\begin{tabular}{lcl}
  $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
  $\nullable(\ONE)$  & $\dn$ & $\mathit{true}$ \\
  $\nullable(c)$     & $\dn$ & $\mathit{false}$ \\
  $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
  $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
  $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
\end{center}

%Assuming the classic notion of a
%\emph{language} of a regular expression, written $L(\_)$, t
The main property of the derivative operation is that

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
So if we want to find out whether a string $s$ matches a regular
expression $r$, we build the derivatives of $r$ w.r.t.\ (in
succession) all the characters of the string $s$. Finally, we test
whether the resulting regular expression can match the empty
string. If it can, then $r$ matches $s$; otherwise it does not. For
this we extend the derivative operation from single characters to
strings as follows:

\begin{center}
\begin{tabular}{lcl}
  $r \backslash (c\!::\!s)$ & $\dn$ & $(r \backslash c) \backslash s$ \\
  $r \backslash \epsilon$   & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
We thus obtain a simple and elegant regular expression matching
algorithm:

\begin{definition}[matcher]
\[
\textit{match}\;s\;r \;\dn\; \nullable(r\backslash s)
\]
\end{definition}

For us the main advantage is that derivatives can be
straightforwardly implemented in any functional programming language,
and are easily definable and reasoned about in theorem provers---the
definitions just consist of inductive datatypes and simple recursive
functions. Moreover, the notion of derivatives can be easily
generalised to cover extended regular expression constructors such as
the not-regular expression, written $\neg\,r$, or bounded repetitions
(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
straightforwardly realised within the classic automata approach.
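
To illustrate how direct this implementation is, here is a minimal
sketch of the matcher in Scala (our own illustration; the datatype
encoding and the names are our choices and are not fixed by the
papers cited above):

\begin{verbatim}
// regular expressions as an inductive datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence
case class STAR(r: Rexp) extends Rexp            // star

// can the regular expression match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a string, and the matcher itself
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}
def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))

// for example: matches(SEQ(STAR(CHAR('a')), CHAR('b')), "aaab") is true
\end{verbatim}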


One limitation, however, of Brzozowski's algorithm is that it only
produces a YES/NO answer for whether a string is being matched by a
regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
algorithm to allow generation of an actual matching, called a
\emph{value}.

\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
  \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
  $r$ & $::=$ & $\ZERO$\\
      & $\mid$ & $\ONE$ \\
      & $\mid$ & $c$ \\
      & $\mid$ & $r_1 \cdot r_2$\\
      & $\mid$ & $r_1 + r_2$ \\
      & $\mid$ & $r^*$ \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
  \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
  $v$ & $::=$  & $\Empty$ \\
      & $\mid$ & $\Char(c)$ \\
      & $\mid$ & $\Seq\,v_1\, v_2$\\
      & $\mid$ & $\Left(v)$ \\
      & $\mid$ & $\Right(v)$ \\
      & $\mid$ & $\Stars\,[v_1,\ldots,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent
Here the regular expressions and the values of corresponding shape are
placed side by side to highlight the relation between them. Values are
a way of expressing parse trees, that is the tree structure that
records how a sub-regular-expression matches a substring. For example,
$\Seq\,v_1\,v_2$ tells us how the string $|v_1| \,@\, |v_2|$ matches
the regular expression $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and
$r_2$ matches $|v_2|$. Exactly how these two strings are matched is
recorded in the substructures $v_1$ and $v_2$. The notation $|v|$
stands for \emph{flattening} a value, that is extracting the
characters in the value $v$ to form a string. For example,
$|\Seq\,(\Char\,c)\,(\Char\,d)|$ is the string $cd$.
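
For reference, the flattening function is given by the following
standard clauses (cf.~\cite{AusafDyckhoffUrban2016}):

\begin{center}
\begin{tabular}{lcl}
  $|\Empty|$ & $\dn$ & $[]$\\
  $|\Char\,c|$ & $\dn$ & $[c]$\\
  $|\Left\,v|$ & $\dn$ & $|v|$\\
  $|\Right\,v|$ & $\dn$ & $|v|$\\
  $|\Seq\,v_1\,v_2|$ & $\dn$ & $|v_1| \,@\, |v_2|$\\
  $|\Stars\,[v_1,\ldots,v_n]|$ & $\dn$ & $|v_1| \,@\, \ldots \,@\, |v_n|$\\
\end{tabular}
\end{center}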

To give a concrete example of how values work, consider the string
$xy$ and the regular expression $(x + (y + xy))^*$. We can view this
regular expression as a tree and if the string $xy$ is matched by two
Star ``iterations'', then the $x$ is matched by the left-most
alternative in this tree and the $y$ by the left alternative nested
inside the right alternative. This suggests to record this matching as

\begin{center}
$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
\end{center}

\noindent
where $\Stars$ records how many iterations were used; and $\Left$,
respectively $\Right$, which alternative is used. The value for
matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
would look as follows

\begin{center}
$\Stars\,[\Seq\,(\Char\,x)\,(\Char\,y)]$
\end{center}

\noindent
where $\Stars$ has only a single-element list for the single iteration
and $\Seq$ indicates that $xy$ is matched by a sequence regular
expression.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives). In this second phase, for every successful match the
corresponding POSIX value is computed. The whole process looks like
this:

\begin{center}
\begin{tikzcd}
r_0 \arrow[r, "c_0"] \arrow[d] & r_1 \arrow[r, "c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{center}

\noindent
We shall briefly explain this process. For the convenience of the
explanation we fix the following notation: the regular expression $r$
used for matching is also called $r_0$ and the string $s$ is composed
of $n$ characters $c_0 c_1 \ldots c_{n-1}$. First, we build the
successive derivatives $r_1$, $r_2$, \ldots\ using the characters
$c_0$, $c_1$, \ldots\ until we arrive at the final derivative $r_n$.
We then test whether $r_n$ is $\nullable$ or not. If not, we know
immediately that the string does not match the regular expression. If
yes, we start building the parse tree incrementally. We first call
$\mkeps$ (short for ``make epsilon''), which constructs the parse tree
$v_n$ for how the final derivative $r_n$ matches the empty string.
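
The clauses of $\mkeps$ are standard and can be found in
\cite{Sulzmann2014,AusafDyckhoffUrban2016}; note how in the
alternative case the $\Left$-value is preferred whenever possible, and
how a star matching the empty string is assigned zero iterations:

\begin{center}
\begin{tabular}{lcl}
  $\mkeps(\ONE)$ & $\dn$ & $\Empty$\\
  $\mkeps(r_1 + r_2)$ & $\dn$ &
    $\mathit{if}\;\nullable(r_1)\;\mathit{then}\;\Left(\mkeps(r_1))\;
     \mathit{else}\;\Right(\mkeps(r_2))$\\
  $\mkeps(r_1 \cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps(r_1))\,(\mkeps(r_2))$\\
  $\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
\end{tabular}
\end{center}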

After this, we inject back the characters one by one, in the reverse
order to which they were chopped off, and thereby build from $v_{i+1}$
the parse tree $v_i$ for how the regular expression $r_i$ matches the
string $s_i$ (that is, $s$ with the first $i$ characters chopped off).
After $n$ such injections, we obtain the parse tree for how $r_0$
matches $s$, exactly as we wanted.
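
As a small worked example (our own, using only the definitions above),
consider matching the string $ab$ against the regular expression
$a \cdot b$. The two derivative steps give
\[
r_1 \,=\, (a \cdot b)\backslash a \,=\, \ONE \cdot b
\qquad\qquad
r_2 \,=\, r_1\backslash b \,=\, \ZERO \cdot b \,+\, \ONE
\]
Since $r_2$ is nullable, $\mkeps$ returns $v_2 = \Right(\Empty)$ (only
the right alternative of $r_2$ can match the empty string). Injecting
$b$ and then $a$ back yields
\[
v_1 \,=\, \Seq\,\Empty\,(\Char\,b)
\qquad\qquad
v_0 \,=\, \Seq\,(\Char\,a)\,(\Char\,b)
\]
and indeed $|v_0| = ab$.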

The correctness of this second phase can be established by a routine
inductive proof. We omit the details of the injection function, which
can be found in Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather,
we shall focus next on the process of simplification of regular
expressions, which is needed in order to obtain \emph{fast} versions
of Brzozowski's, and Sulzmann and Lu's algorithms. This is where the
PhD-project hopes to advance the state-of-the-art.


\section{Simplification of Regular Expressions}

The main drawback of building successive derivatives according to
Brzozowski's definition is that they can grow very quickly in size.
This is mainly due to the fact that the derivative operation often
generates ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a
result, if implemented naively, both algorithms by Brzozowski and by
Sulzmann and Lu are excruciatingly slow. For example when starting
with the regular expression $(a + aa)^*$ and building 12 successive
derivatives w.r.t.~the character $a$, one obtains a derivative regular
expression with more than 8000 nodes (when viewed as a tree).
Operations like derivative and $\nullable$ need to traverse such trees
and consequently the bigger the size of the derivative the slower the
algorithm. Fortunately, one can simplify regular expressions after
each derivative step. Various simplifications of regular expressions
are possible, such as the simplifications of $\ZERO + r$,
$r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
$r$. These simplifications do not affect the answer for whether a
regular expression matches a string or not, but fortunately also do
not affect the POSIX strategy of how regular expressions match
strings---although the latter is much harder to establish. Some
initial results in this regard have been obtained in
\cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
is a very tight bound for the size. Such a tight bound is suggested by
work of Antimirov who proved that (partial) derivatives can be bounded
by the number of characters contained in the initial regular
expression \cite{Antimirov95}. We believe that a similar bound can be
obtained for the derivatives in Sulzmann and Lu's algorithm, and we
have generated test data supporting this belief. Let us give some
details about this next.

We first followed Sulzmann and Lu's idea of introducing
\emph{annotated regular expressions}~\cite{Sulzmann2014}. They are
defined by the following grammar:

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\textit{ZERO}$\\
               & $\mid$ & $\textit{ONE}\;\;bs$\\
               & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
               & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
               & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
               & $\mid$ & $\textit{STAR}\;\;bs\,a$
\end{tabular}
\end{center}

\noindent
where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a
list of annotated regular expressions. These bitsequences encode
information about the (POSIX) value that should be generated by the
Sulzmann and Lu algorithm. Bitcodes are essentially incomplete values.
This can be straightforwardly seen in the following transformation:

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[\S]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\Z :: code(v) \;@\;
   code(\Stars\,vs)$
\end{tabular}
\end{center}

\noindent
where $\Z$ and $\S$ are arbitrary names for the bits in the
bitsequences. The function $\textit{code}$ encodes a value into a
bitsequence by converting $\Left$ into $\Z$, $\Right$ into $\S$, the
start of each non-empty star iteration into $\Z$, and the end of a
star into $\S$. This encoding is lossy: it throws away the character
information and it does not record the boundary between the two
operands of a sequence. Moreover, from the bitcode alone we cannot
even tell whether the $\Z$s and $\S$s stem from $\Left$/$\Right$ or
from $\Stars$. The reason for choosing this compact way of storing
information is that bits can be easily moved around during the lexing
process. In order to recover a value from a bitcode, we need the
regular expression as extra information and decode the bits back into
a value.
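A sketch of this $\textit{decode}$ function, reconstructed from
\cite{Sulzmann2014} in our notation, is as follows; the auxiliary
function $\textit{decode}'$ returns, besides the value, the
not-yet-consumed rest of the bitsequence:

\begin{center}
\begin{tabular}{lcl}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(\Z\!::\!bs)\,(r_1 + r_2)$ & $\dn$ &
    $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;
     \textit{in}\;(\Left\,v, bs_1)$\\
  $\textit{decode}'\,(\S\!::\!bs)\,(r_1 + r_2)$ & $\dn$ &
    $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;
     \textit{in}\;(\Right\,v, bs_1)$\\
  $\textit{decode}'\,bs\,(r_1 \cdot r_2)$ & $\dn$ &
    $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1$\\
  & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2\;
      \textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ &
    $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r$\\
  & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,(r^*)\;
      \textit{in}\;(\Stars\,(v\!::\!vs), bs_2)$\\
  $\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$
  \medskip\\
  $\textit{decode}\,bs\,r$ & $\dn$ &
    $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r$\\
  & & $\textit{in}\;\textit{if}\;bs_1 = []\;\textit{then}\;v\;
      \textit{else}\;\textit{error}$
\end{tabular}
\end{center}
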
To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions, a step we call \textit{internalise}.
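
A sketch of $\textit{internalise}$, following \cite{Sulzmann2014} in
our notation, is given below; the auxiliary function
$\textit{fuse}\,bs\,a$ attaches the bitsequence $bs$ to the front of
the topmost bitsequence of $a$:

\begin{center}
\begin{tabular}{lcl}
  $\textit{internalise}(\ZERO)$ & $\dn$ & $\textit{ZERO}$\\
  $\textit{internalise}(\ONE)$ & $\dn$ & $\textit{ONE}\,[]$\\
  $\textit{internalise}(c)$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
  $\textit{internalise}(r_1 + r_2)$ & $\dn$ &
    $\textit{ALTS}\,[]\,[\textit{fuse}\,[\Z]\,(\textit{internalise}\,r_1),\;
     \textit{fuse}\,[\S]\,(\textit{internalise}\,r_2)]$\\
  $\textit{internalise}(r_1 \cdot r_2)$ & $\dn$ &
    $\textit{SEQ}\,[]\,(\textit{internalise}\,r_1)\,(\textit{internalise}\,r_2)$\\
  $\textit{internalise}(r^*)$ & $\dn$ &
    $\textit{STAR}\,[]\,(\textit{internalise}\,r)$
\end{tabular}
\end{center}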

Then we build successive derivatives of the annotated regular
expression. This derivative operation is the same as the one we
previously had for the simple regular expressions, except that it
takes special care of the bits in order to store the parse tree
information.
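
A sketch of this bitcoded derivative, written $a\backslash c$ below,
again follows \cite{Sulzmann2014} in our notation; here
$\textit{bnullable}$ is $\nullable$ lifted to annotated regular
expressions and $\textit{bmkeps}$ is the bit-collecting version of
$\mkeps$ shown further below:

\begin{center}
\begin{tabular}{lcl}
  $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\
  $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
    $\textit{if}\;c = d\;\textit{then}\;\textit{ONE}\;bs\;
     \textit{else}\;\textit{ZERO}$\\
  $(\textit{ALTS}\;bs\,as)\backslash c$ & $\dn$ &
    $\textit{ALTS}\;bs\,(\textit{map}\,(\_\backslash c)\,as)$\\
  $(\textit{SEQ}\;bs\,a_1\,a_2)\backslash c$ & $\dn$ &
    $\textit{if}\;\textit{bnullable}\,a_1$\\
  & & $\textit{then}\;\textit{ALTS}\,bs\,[\textit{SEQ}\,[]\,(a_1\backslash c)\,a_2,\;
       \textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\backslash c)]$\\
  & & $\textit{else}\;\textit{SEQ}\,bs\,(a_1\backslash c)\,a_2$\\
  $(\textit{STAR}\;bs\,a)\backslash c$ & $\dn$ &
    $\textit{SEQ}\;bs\,(\textit{fuse}\,[\Z]\,(a\backslash c))\,(\textit{STAR}\,[]\,a)$
\end{tabular}
\end{center}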

This way, we do not have to use an injection function and a second
phase; instead we only need to collect the bits while running a
bit-collecting version of $\mkeps$ (called $\textit{bmkeps}$ below) on
the final derivative, and then decode the resulting bitsequence using
the original regular expression.
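
A sketch of $\textit{bmkeps}$, once more reconstructed from
\cite{Sulzmann2014} in our notation:

\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(\textit{ALTS}\;bs\,(a\!::\!as))$ & $\dn$ &
    $\textit{if}\;\textit{bnullable}\,a\;
     \textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & & $\textit{else}\;bs\,@\,\textit{bmkeps}\,(\textit{ALTS}\;[]\,as)$\\
  $\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
    $bs\,@\,\textit{bmkeps}\,a_1\,@\,\textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ & $bs\,@\,[\S]$
\end{tabular}
\end{center}

\noindent
The whole bitcoded lexing process can then be summarised as the
composition (provided the final derivative is $\textit{bnullable}$):

\begin{center}
$\textit{blexer}\;r\;s \;\dn\;
 \textit{decode}\,(\textit{bmkeps}\,((\textit{internalise}\,r)\backslash s))\,r$
\end{center}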

The main point of the bitsequences and annotated regular expressions
is that we can apply rather aggressive (in terms of size)
simplification rules in order to keep derivatives small.

We have developed such ``aggressive'' simplification rules and
generated test data that show that the expected bound can be
achieved. Obviously we could only partially cover the search space as
there are infinitely many regular expressions and strings. One
modification we introduced is to allow a list of annotated regular
expressions in the \textit{ALTS} constructor. This allows us to not
just delete unnecessary $\ZERO$s and $\ONE$s from regular expressions,
but also unnecessary ``copies'' of regular expressions (very similar
to simplifying $r + r$ to just $r$, but in a more general
setting). Another modification is that we use simplification rules
inspired by Antimirov's work on partial derivatives. They maintain the
idea that only the first ``copy'' of a regular expression in an
alternative contributes to the calculation of a POSIX value. All
subsequent copies can be pruned from the regular expression.
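
To convey the flavour of these rules, below is a sketch of such a
simplification function in Scala (our own illustration, not the final
set of rules; it reuses the \texttt{Rexp} datatype from the earlier
sketch and assumes annotated regular expressions encoded as case
classes carrying bitsequences). Nested alternatives are flattened,
$\ZERO$s are removed, $\ONE$s in sequences are contracted by fusing
their bits, and duplicates are removed up to the erasure of bits,
keeping only the first copy:

\begin{verbatim}
type Bits = List[Boolean]   // assumed encoding: false for Z, true for S

abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// attach bits to the front of the topmost bitsequence
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ++ bs1)
  case ACHAR(bs1, c) => ACHAR(bs ++ bs1, c)
  case AALTS(bs1, as) => AALTS(bs ++ bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ++ bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ++ bs1, a1)
}

// flatten nested alternatives and drop ZEROs
def flts(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flts(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flts(rest)
  case a :: rest => a :: flts(rest)
}

// erase the bits, giving the plain Rexp from the earlier sketch
def erase(a: ARexp): Rexp = a match {
  case AZERO => ZERO
  case AONE(_) => ONE
  case ACHAR(_, c) => CHAR(c)
  case AALTS(_, Nil) => ZERO
  case AALTS(_, List(a1)) => erase(a1)
  case AALTS(bs, a1 :: as) => ALT(erase(a1), erase(AALTS(bs, as)))
  case ASEQ(_, a1, a2) => SEQ(erase(a1), erase(a2))
  case ASTAR(_, a1) => STAR(erase(a1))
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO                        // 0 . r  =>  0
    case (_, AZERO) => AZERO                        // r . 0  =>  0
    case (AONE(bs1), a2s) => fuse(bs ++ bs1, a2s)   // 1 . r  =>  r (keep bits)
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flts(as.map(simp)).distinctBy(erase) match {
    case Nil => AZERO               // empty alternative  =>  0
    case List(a1) => fuse(bs, a1)   // singleton alternative
    case as1 => AALTS(bs, as1)
  }
  case a1 => a1   // ZERO, ONE, characters and stars are left alone
}
\end{verbatim}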

We are currently engaged in proving that our simplification rules
actually do not affect the POSIX value that should be generated by the
algorithm according to the specification of a POSIX value, and
furthermore that the derivatives stay small for all inputs. For this
proof we use the theorem prover Isabelle. Once completed, this result
will advance the state-of-the-art: Sulzmann and Lu wrote in their
paper \cite{Sulzmann2014} about the bitcoded ``incremental parsing
method'' (that is the matching algorithm outlined in this section):

\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method in Figure~5 in combination with the simplification steps in
Figure 6 yields POSIX parse trees. We have tested this claim
extensively by using the method in Figure~3 as a reference but yet
have to work out all proof details.''
\end{quote}

\noindent
We would settle this correctness claim and furthermore obtain a much
tighter bound on the sizes of derivatives. The result is that our
algorithm should be correct and faster on all inputs. The original
blow-up, as observed in JavaScript, Python and Java, would be excluded
from happening in our algorithm.

\section{Conclusion}

In this PhD-project we are interested in fast algorithms for regular
expression matching. While this seems to be a ``settled'' area, in
fact interesting research questions are popping up as soon as one steps
outside the classic automata theory (for example in terms of what kind
of regular expressions are supported). The reason why it is
interesting for us to look at the derivative approach introduced by
Brzozowski for regular expression matching, and then much further
developed by Sulzmann and Lu, is that derivatives can elegantly deal
with some of the regular expressions that are of interest in ``real
life''. This includes the not-regular expression, written $\neg\,r$
(that is, all strings that are not recognised by $r$), but also bounded
regular expressions such as $r^{\{n\}}$ and $r^{\{n..m\}}$. There is
also hope that the derivatives can provide another angle for how to
deal more efficiently with back-references, which are one of the
reasons why regular expression engines in JavaScript, Python and Java
choose not to implement the classic automata approach of transforming
regular expressions into NFAs and then DFAs---because we simply do not
know how such back-references can be represented by DFAs.


\bibliographystyle{plain}
\bibliography{root}

\end{document}