% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Introduction} % For referencing the chapter elsewhere, use \ref{Chapter1}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

%boxes
\newcommand*{\mybox}[1]{\framebox{\strut #1}}

%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
\newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimps} #2}
\def\derssimp{\textit{ders}\_\textit{simp}}
\def\rders{\textit{rders}}
\newcommand{\bders}[2]{#1 \backslash #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
\def\bsimps{\textit{bsimp}}
\newcommand{\rsimp}[1]{\textit{rsimp}\; #1}
\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
\newcommand{\rdistinct}[2]{\textit{rdistinct} \;\; #1 \;\; #2}
\def\rdistincts{\textit{rdistinct}}
\def\rDistinct{\textit{rdistinct}}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}

\def\SEQ{\textit{SEQ}}
\def\SEQs{\textit{SEQs}}
\def\case{\textit{case}}
\def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
\def\rsimpalts{\textit{rsimp}_{ALTS}}
\def\good{\textit{good}}
\def\btrue{\textit{true}}
\def\bfalse{\textit{false}}
\def\bnullable{\textit{bnullable}}
\def\bnullables{\textit{bnullables}}
\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}
\newcommand{\rder}[2]{#2 \backslash_r #1}

\def\rerases{\textit{rerase}}

\def\nonnested{\textit{nonnested}}
\def\AZERO{\textit{AZERO}}
\def\sizeNregex{\textit{sizeNregex}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\simpsulz{\textit{simp}_{Sulz}}

\def\scfrewrites{\stackrel{*}{\rightsquigarrow_{scf}}}
\def\frewrite{\rightsquigarrow_f}
\def\hrewrite{\rightsquigarrow_h}
\def\grewrite{\rightsquigarrow_g}
\def\frewrites{\stackrel{*}{\rightsquigarrow_f}}
\def\hrewrites{\stackrel{*}{\rightsquigarrow_h}}
\def\grewrites{\stackrel{*}{\rightsquigarrow_g}}
\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\der{\textit{der}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\bmkepss{\textit{bmkepss}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\textit{inj}}
\def\Empty{\textit{Empty}}
\def\Left{\textit{Left}}
\def\Right{\textit{Right}}
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
\def\lf{\textit{lf}}
\def\PD{\textit{PD}}
\def\suffix{\textit{Suffix}}
\def\distinctBy{\textit{distinctBy}}
\def\starupdate{\textit{starUpdate}}
\def\starupdates{\textit{starUpdates}}

\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
\def\blexerStrong{\textit{blexerStrong}}
\def\bsimpStrong{\textit{bsimpStrong}}
\def\bdersStrongs{\textit{bdersStrong}}
\newcommand{\bdersStrong}[2]{#1 \backslash_{bsimpStrongs} #2}

\def\map{\textit{map}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable} \; #1 }
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)_{\downarrow_r}}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}

\def\rflts{\textit{rflts}}
\def\rrewrite{\textit{rrewrite}}
\def\bsimpalts{\textit{bsimp}_{ALTS}}
\def\bsimpaseq{\textit{bsimp}_{ASEQ}}
\def\rsimlalts{\textit{rsimp}_{ALTs}}
\def\rsimpseq{\textit{rsimp}_{SEQ}}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
%\def\not{\textit{not}}

\def\RZERO{\mathbf{0}_r }
\def\RONE{\mathbf{1}_r}
\newcommand\RCHAR[1]{\mathbf{#1}_r}
\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\sum #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{Suffix} \;#1\;#2}
\lstdefinestyle{myScalastyle}{
frame=tb,
language=scala,
aboveskip=3mm,
belowskip=3mm,
showstringspaces=false,
columns=flexible,
basicstyle={\small\ttfamily},
numbers=none,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
frame=single,
breaklines=true,
breakatwhitespace=true,
tabsize=3,
}


%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like

Regular expressions are widely used in computer science:
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text-processing; network intrusion
detection systems that reject suspicious traffic; or compiler
front ends--the majority of the solutions to these tasks
involve lexing with regular
expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
Indeed, given a regular expression and a string, the regex engine
of a popular programming language will in most cases
produce the matching information within a very short time.
Those matchers can be blindingly fast--some
network intrusion detection systems
use regex engines that are able to process
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
However, those matchers can exhibit a surprising security vulnerability
under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed

Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s,
the decision takes an unreasonably long time to finish (graph \ref{fig:aStarStarb}).
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns.
Java 9 and newer
versions improve this behaviour, but are still slow compared
with the approach we are going to use.




This superlinear blowup in regular expression engines
has caused grief in real life so often
that it has been given a name: ``catastrophic backtracking''.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.

\begin{figure}[p]
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Python},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java 8},
legend pos=north west,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Dart},
legend pos=north west,
legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Swift},
legend pos=north west,
legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
& \\
\multicolumn{3}{c}{Graphs}
\end{tabular}
\caption{Graphs showing the runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
The reason for their superlinear behaviour is that they perform a depth-first search:
if the string does not match, the engine goes on to explore all other
possible ways of matching exhaustively.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually widespread.
They occur so often that they have been given a
name: Regular-Expression Denial-Of-Service (ReDoS)
attacks.
\citeauthor{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi.
They therefore concluded that evil regular expressions
are problems ``more than a parlour trick'', but ones that
require
more research attention.

This work aims to address this issue
with the help of formal proofs.
We offer a lexing algorithm based
on Brzozowski derivatives with certified correctness (in
Isabelle/HOL)
and a finiteness property.
Such properties guarantee the absence of
catastrophic backtracking in most cases.
We will give more details in the next sections
on (i) why the slow cases in graph \ref{fig:aStarStarb}
can occur
and (ii) why we choose our
approach (Brzozowski derivatives and formal proofs).

\section{Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
Theoretical results in automata theory say
that basic regular expression matching should be linear
w.r.t.\ the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching,
a step which could itself take exponential time \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
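
To make this definition concrete, here is a minimal sketch of how such
basic regular expressions can be represented as an inductive datatype in
Scala (the constructor names are our illustrative choices and are not
fixed by the definition above):
\begin{lstlisting}[style=myScalastyle]
// basic regular expressions as an inductive datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence r1 . r2
case class STAR(r: Rexp) extends Rexp            // Kleene star r*
\end{lstlisting}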
Modern regular expression matchers used by programmers,
however,
support richer constructs, such as bounded repetitions
and back-references.
To differentiate, people use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal languages theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions.


%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a lot of constructs
beyond the basic ones
that make it more convenient for
programmers to write regular expressions.
Depending on the type of construct,
the task of matching and lexing with them
increases in complexity to different degrees.
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause trouble for regex libraries.

\noindent
For example the
non-binary alternative involving three or more choices:
\[
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
the range operator $-$ used to express the alternative
of all characters between its operands in a concise way:
\[
	[0\mbox{-}9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
\]
and the
wildcard character $.$ used to refer to any single character:
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]

\subsection{Bounded Repetitions}
Some of those constructs do make the expressions much
more compact.
For example, the bounded regular expressions
(where $n$ and $m$ are constant natural numbers)
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
defined as
\begin{center}
\begin{tabular}{lcl}
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
\end{tabular}
\end{center}
are exponentially smaller compared with
their unfolded form: for example $r^{\{n\}}$
as opposed to
\[
	\underbrace{r\ldots r}_\text{n copies of r}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.


The problem here is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
either become excruciatingly slow
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{which are lexer generators
in C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given a bounded regular expression as above,
a small $n$ (a few dozen) would result in a
determinised automaton
with millions of states.}) under large counters.
A classic example is the regular expression $(a+b)^*\,a\,(a+b)^{\{n\}}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
\node[state, red] (q_1) [right=of q_0] {$q_1$};
\node[state, red] (q_2) [right=of q_1] {$q_2$};
\node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
\path[->]
(q_0) edge node {a} (q_1)
edge [loop below] node {a,b} ()
(q_1) edge node {a,b} (q_2)
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
which requires at least $2^3$ states
for its subset construction.\footnote{The
red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
$q_3$ is the last state, requiring 0 more characters, and is accepting.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.}


One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher \cite{Verbatim}.
This is relevant work and we will compare it later on
with the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++ \cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.


Bounded repetitions are very important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
more than half of the
XSDs they found have bounded regular expressions in them.
Often the counters are quite large, the largest being up to ten million.
An example XSD they gave
was:
%\begin{verbatim}
%<sequence minOccurs="0" maxOccurs="65535">
% <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
% <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
%</sequence>
%\end{verbatim}
This can be seen as the expression
$(ab^{\{2\ldots 12\}})^{\{0 \ldots 65535\}}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).


It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
on bounded regular expressions:
for example, in the regular expression matching library in the Go
language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that these libraries simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large,
the memory requirement could become
infeasible, and the pattern needs to be rejected straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

\node (q0) [state, initial] {$0$};
\node (q1) [state, right = of q0] {$1$};
\node (q2) [state, right = of q1] {$2$};
\node (qdots) [right = of q2] {$\ldots$};
\node (qn) [state, right = of qdots] {$n$};
\node (qn1) [state, right = of qn] {$n+1$};
\node (qn2) [state, right = of qn1] {$n+2$};
\node (qn3) [state, accepting, right = of qn2] {$n+3$};

\path [-stealth, thick]
(q0) edge [loop above] node {a} ()
(q0) edge node {a} (q1)
(q1) edge node {.} (q2)
(q2) edge node {.} (qdots)
(qdots) edge node {.} (qn)
(qn) edge node {.} (qn1)
(qn1) edge node {b} (qn2)
(qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
% \node[state,initial] (q_0) {$0$};
% \node[state, ] (q_1) [right=of q_0] {$1$};
% \node[state, ] (q_2) [right=of q_1] {$2$};
% \node[state,
% \node[state, accepting, ](q_3) [right=of q_2] {$3$};
% \path[->]
% (q_0) edge node {a} (q_1)
% edge [loop below] node {a,b} ()
% (q_1) edge node {a,b} (q_2)
% (q_2) edge node {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
that NFA simulation can consume large
amounts of memory: $.^*a.^{\{n\}}bc$ matching
strings of the form $aaa\ldots aaaabc$.
When traversing in a breadth-first manner,
all states from 0 up to $n+1$ will become active.}
\end{figure}
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantee a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite effective,
with the ability to process
gigabytes of string input per second
even with large counters \cite{Becchi08}.
But formally reasoning about these automata can be challenging
and unintuitive.
Therefore, correctness and runtime claims made about these solutions need to be
taken with a grain of salt.

In the work reported in \cite{CSL2022} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
The results
extend straightforwardly to
repetitions with an interval such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
for this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only
inductive datatypes and recursive functions,
making them handy to deal with in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.



\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called a ``depth-first search'',
is efficient in many cases, but can end up
with exponential run time.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply
%constructs the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.



Given a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
	$r_1r_2r_3r_4$
\end{center}
one could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers.
%These sub-expressions are called "capturing groups".
To do so, we use the syntax $\backslash i$
to denote that we want the sub-string
of the input just matched by the $i$-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
of back-references would be
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which would match
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$, etc.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of back-references would be
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
which expresses four-character palindromes
like $abba$, $x??x$, etc.

Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) make use of them.
A common use of back-references
is expressing well-formed HTML files,
where back-references are handy for expressing
a pair of opening and closing tags like
\begin{center}
$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
could be
\begin{center}
$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, back-references take the syntax and expressive power
of regexes beyond the regular language hierarchy.
In fact, they allow the regex construct to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference expression $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Solving the matching problem for back-reference expressions
is known to be NP-complete \parencite{alfred2014algorithms}.
A non-backtracking,
efficient solution is not known to exist.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm which backtracks.
What is unexpected is that even in the cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in figure~\ref{fig:aStarStarb}.

\subsection{Summary of the Catastrophic Backtracking Problem}
Summing these up, we can categorise existing
practical regex libraries into two kinds:
(i) the ones with linear
time guarantees like Go and Rust; the cost with them is that
they impose restrictions
on the user input (not allowing back-references,
bounded repetitions cannot exceed a counter limit, etc.);
(ii) those
that allow large bounded regular expressions and back-references
at the expense of using a backtracking algorithm.
The latter could grind to a halt
on some very simple cases, opening up a vulnerability to
ReDoS attacks.


We would like to have regex engines that can
deal with the regular part (e.g.\
bounded repetitions) of regexes more
efficiently.
Also we want to make sure that they do it correctly.
It turns out that such an aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but show
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps



%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.

\section{Error-prone POSIX Implementations}
When there are multiple ways of matching a string
with a regular expression, a matcher needs to
disambiguate.
The standard for which particular match to pick
is called the disambiguation strategy.
The more intuitive strategy is called POSIX,
which always chooses the longest initial match.
An alternative strategy is greedy matching,
which always ends a sub-match as early as possible.
The POSIX standard is widely adopted in many operating systems.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
In some cases, they either fail to generate a lexing
result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example is the regex given by \cite{fowler2003}
\begin{center}
	$(aba + ab + a)^*$ and the string $ababa$.
\end{center}
The correct $\POSIX$ match for the above would be
the entire string $ababa$,
split into two Kleene star iterations, $[ab] [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}
with different language engines would yield
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
As Grathwohl \parencite{grathwohl2014crash} wrote,
\begin{quote}
	``The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
The implementation complexity of POSIX rules also comes from
the specification not being very clear.
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as
\begin{quote}
	``
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there;\\
		\item
earlier subpatterns have leftmost-longest priority over later subpatterns;\\
		\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns;\\
		\item
REs have right associative concatenation which can be changed with parenthesis;\\
		\item
parenthesized subexpressions return the match from their last usage;\\
		\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions;\\
		\item
if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions;\\
		\item
if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string.''\\
	\end{itemize}
\end{quote}
The text above
is trying to capture something very precise,
and is crying out for formalising.
Ausaf et al.\ \cite{AusafDyckhoffUrban2016}
were the first to fill the gap,
not just by describing such a formalised POSIX
specification in Isabelle/HOL, but also by proving
that their specification coincides with the
POSIX specification given by Okui and Suzuki \cite{Okui10},
which is a completely
different characterisation.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
based on that specification.

In the next section we will very briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, which this thesis builds on.
We give a taste of what they
are like and why they are suitable for regular expression
matching and lexing.

\section{Our Solution--Formal Specification of POSIX Matching
and Brzozowski Derivatives}
Now we start with the central topic of the thesis: Brzozowski derivatives.
Brzozowski \cite{Brzozowski1964} first introduced the
concept of the \emph{derivative} in the 1960s.
The derivative of a regular expression $r$
with respect to a character $c$ is written as $r \backslash c$.\footnote{
	Despite having the same name, regular expression
	derivatives bear little similarity with the mathematical definition
	of derivatives on functions.
}
It tells us what $r$ would transform into
if we chop off the first character $c$
from all strings in the language of $r$ ($L \; r$).
To give a flavour of Brzozowski derivatives, we present
two straightforward clauses from the definition:
\begin{center}
	\begin{tabular}{lcl}
		$d \backslash c$     & $\dn$ &
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
		$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
	\end{tabular}
\end{center}
\noindent
The first clause says that for the regular expression
denoting a singleton set consisting of a single-character string $\{ d \}$,
we check the derivative character $c$ against $d$,
returning a set containing only the empty string $\{ [] \}$
if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
The second clause states that to obtain the regular expression
representing all strings of $r_1 + r_2$ with their head character $c$
chopped off, one simply needs to take the derivatives
of $r_1$ and $r_2$ recursively and then put them together.

Thanks to the definition, derivatives have the nice property
that $s \in L \; (r\backslash c)$ if and only if
$c::s \in L \; r$.
%This property can be used on regular expressions
%matching and lexing--to test whether a string $s$ is in $L \; r$,
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
Derivatives give a simple solution
to the problem of matching and lexing a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
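
As an illustration, this matcher can be written down in a few lines of
Scala over the datatype sketched earlier; this is only a sketch of the
textbook algorithm, not the exact code formalised in this thesis:
\begin{lstlisting}[style=myScalastyle]
// nullable(r) tests whether r matches the empty string
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// der(c, r) computes the Brzozowski derivative r\c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// matching: take derivatives w.r.t. all characters, then test nullability
def matches(r: Rexp, s: List[Char]): Boolean = s match {
  case Nil => nullable(r)
  case c :: cs => matches(der(c, r), cs)
}
\end{lstlisting}
For example, \texttt{matches(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb".toList)}
evaluates to \texttt{true}.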

This makes formal reasoning about these properties such
as correctness and complexity smooth and intuitive.
In fact, there have already been several mechanised proofs about them,
for example the one by Owens and Slind \cite{Owens2008} in HOL4,
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.

In addition, one can extend the clauses to bounded repetitions
``for free'':
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
		r^{\{n-1\}}$\\
	\end{tabular}
\end{center}
\noindent
And experimental results suggest that, unlike DFA-based solutions,
derivatives can support
bounded regular expressions with large counters
quite well.
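
In the Scala sketch from earlier, this extension amounts to one extra
constructor and two extra clauses; note that treating $r^{\{0\}}$ like
$\ONE$ is our assumption here, since the clause above only covers $n > 0$:
\begin{lstlisting}[style=myScalastyle]
// bounded repetition r{n} as one extra constructor
case class NTIMES(r: Rexp, n: Int) extends Rexp

// extra clause for nullable:
//   case NTIMES(r1, n) => if (n == 0) true else nullable(r1)
// extra clause for der, mirroring (r{n})\c = (r\c) . r{n-1}:
//   case NTIMES(r1, n) =>
//     if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
\end{lstlisting}
The point is that no automaton needs to be unfolded: the counter is
simply carried around and decremented inside the data structure.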

There have also been
extensions to other constructs.
For example, Owens et al.\ include the derivatives
for \emph{NOT} regular expressions, which are
able to concisely express C-style comments of the form
$/* \ldots */$.
Another extension of derivatives is to
regular expressions with look-aheads, done
by Miyazaki and Minamide
\cite{Takayuki2019}.
%We therefore use Brzozowski derivatives on regular expressions
%lexing



Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
in generating a formally verified algorithm for lexing--the clauses and property
can be easily expressed in a functional programming language
or converted to theorem prover
code, with great extensibility.
Perhaps this is the reason why derivatives have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years (
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012}
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published.


However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
Second, derivative-based matchers need to be more efficient.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow.
For example, Sulzmann and Lu
claim a linear running time of their proposed algorithm,
but that was falsified by our experiments. The running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$
is made for the Verbatim \cite{Verbatim}
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not the complexity.
In the performance evaluation section, they simply analyzed the run time
of matching $a$ with the string
\begin{center}
	$\underbrace{a \ldots a}_{\text{n a's}}$
\end{center}
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.


\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function that builds up the matching information.
In an earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

The version of the algorithm proven correct
suffers from the
second difficulty though, as the internal derivatives can
grow to arbitrarily big sizes.
For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer (we shall define in the next chapter
the precise details of our
regular expressions and the derivative operation). Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in our
earlier work to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
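
To make these rules concrete, here is a minimal sketch of such
simple-minded simplification over the Scala datatype from before (the
function name \texttt{simp} is our choice; this is an illustration, not
the more aggressive simplification formalised later in this thesis):
\begin{lstlisting}[style=myScalastyle]
// simple-minded simplification: 0.r => 0, 1.r => r, 0+r => r, r+r => r
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (ONE, s2) => s2
    case (s1, s2)  => SEQ(s1, s2)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case r => r
}
\end{lstlisting}
Interleaving \texttt{simp} with the derivative steps, that is using
\texttt{simp(der(c, r))} at each step, slows down the growth in the
$(a+aa)^*$ example above, but, as just stated, does not bound it.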

Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bit-coded
regular expressions. In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation)



\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
Ausaf and Urban were able to back this correctness claim with
a formal proof.

However a faster formally verified
lexing program with the optimisations
of Sulzmann and Lu's second algorithm
is still missing.
As Ausaf and Urban stated,
\begin{quote}\it
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of correctness in the presence of those simplifications.


%----------------------------------------------------------------------------------------
\section{Contribution}

In this thesis,
we propose a solution to catastrophic
backtracking and error-prone matchers: a formally verified
regular expression lexing algorithm
that is both fast
and correct, obtained by extending Ausaf et al.'s work.
The end result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
an improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives with simplifications,
accompanied by
a proven correctness theorem according to the POSIX specification
given by Ausaf et al.\ \cite{AusafDyckhoffUrban2016},
\item
a complexity-related property for that algorithm saying that the
internal data structure will
remain finite,
\item
and an extension to
the bounded repetition construct with the correctness and finiteness properties
maintained.
\end{itemize}

With a formal finiteness bound in place,
we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
Further improvements to the algorithm with an even stronger version of
simplification are made.
Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound
linear in the input and
cubic in the regular expression size using a technique by
Antimirov \cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all super-linear behaviours.
We are working out the
details.


To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed only by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity-related
guarantees (for example that the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.



\section{Structure of the thesis}
In chapter~\ref{Inj} we will introduce the concepts
and notations we
use for describing the lexing algorithm by Sulzmann and Lu,
and then give the lexing algorithm.
We will give its bit-coded variant in chapter~\ref{Bitcoded1}.
Then we illustrate in chapter~\ref{Bitcoded2}
how the algorithm without bitcodes falls short for such aggressive
simplifications and therefore introduce our version of the
bit-coded algorithm and
its correctness proof.
In chapter~\ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is a finite bound on the size of any
regex's derivatives.
In chapter~\ref{Cubic} we discuss stronger simplifications that improve the finite bound
of chapter~\ref{Finite} to a polynomial one, and demonstrate how one can extend the
algorithm to include constructs such as bounded repetitions and negations.





%----------------------------------------------------------------------------------------


%----------------------------------------------------------------------------------------

%----------------------------------------------------------------------------------------

%----------------------------------------------------------------------------------------
