% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Introduction} % For referencing the chapter elsewhere, use \ref{Chapter1}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

%boxes
\newcommand*{\mybox}[1]{\framebox{\strut #1}}

%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
\newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimps} #2}
\def\derssimp{\textit{ders}\_\textit{simp}}
\def\rders{\textit{rders}}
\newcommand{\bders}[2]{#1 \backslash #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
\def\bsimps{\textit{bsimp}}
\newcommand{\rsimp}[1]{\textit{rsimp}\; #1}
\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
\newcommand{\rdistinct}[2]{\textit{rdistinct} \;\; #1 \;\; #2}
\def\rdistincts{\textit{rdistinct}}
\def\rDistinct{\textit{rdistinct}}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}
\def\cbn{\textit{createdByNtimes}}
\def\hpa{\textit{highestPowerAux}}
\def\hpower{\textit{highestPower}}
\def\ntset{\textit{ntset}}
\def\optermsimp{\textit{optermsimp}}
\def\optermOsimp{\textit{optermOsimp}}
\def\optermosimp{\textit{optermosimp}}
\def\opterm{\textit{opterm}}
\def\nString{\textit{nonemptyString}}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
\def\SEQ{\textit{SEQ}}
\def\SEQs{\textit{SEQs}}
\def\case{\textit{case}}
\def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
\def\rsimpalts{\textit{rsimp}_{ALTS}}
\def\good{\textit{good}}
\def\btrue{\textit{true}}
\def\bfalse{\textit{false}}
\def\bnullable{\textit{bnullable}}
\def\bnullables{\textit{bnullables}}
\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}
\newcommand{\rder}[2]{#2 \backslash_r #1}

\def\rerases{\textit{rerase}}

\def\nonnested{\textit{nonnested}}
\def\AZERO{\textit{AZERO}}
\def\sizeNregex{\textit{sizeNregex}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\simpsulz{\textit{simp}_{Sulz}}

\def\scfrewrites{\stackrel{*}{\rightsquigarrow_{scf}}}
\def\frewrite{\rightsquigarrow_f}
\def\hrewrite{\rightsquigarrow_h}
\def\grewrite{\rightsquigarrow_g}
\def\frewrites{\stackrel{*}{\rightsquigarrow_f}}
\def\hrewrites{\stackrel{*}{\rightsquigarrow_h}}
\def\grewrites{\stackrel{*}{\rightsquigarrow_g}}
\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\der{\textit{der}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\bmkepss{\textit{bmkepss}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\textit{inj}}
\def\Empty{\textit{Empty}}
\def\Left{\textit{Left}}
\def\Right{\textit{Right}}
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
\def\lf{\textit{lf}}
\def\PD{\textit{PD}}
\def\suffix{\textit{Suffix}}
\def\distinctBy{\textit{distinctBy}}
\def\starupdate{\textit{starUpdate}}
\def\starupdates{\textit{starUpdates}}
\def\nupdate{\textit{nupdate}}
\def\nupdates{\textit{nupdates}}

\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
\def\blexerStrong{\textit{blexerStrong}}
\def\bsimpStrong{\textit{bsimpStrong}}
\def\bdersStrongs{\textit{bdersStrong}}
\newcommand{\bdersStrong}[2]{#1 \backslash_{bsimpStrongs} #2}

\def\map{\textit{map}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable} \; #1 }
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)_{\downarrow_r}}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}
\def\rflts{\textit{rflts}}
\def\rrewrite{\textit{rrewrite}}
\def\bsimpalts{\textit{bsimp}_{ALTS}}
\def\bsimpaseq{\textit{bsimp}_{ASEQ}}
\def\rsimlalts{\textit{rsimp}_{ALTs}}
\def\rsimpseq{\textit{rsimp}_{SEQ}}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
%\def\not{\textit{not}}

\def\RZERO{\mathbf{0}_r }
\def\RONE{\mathbf{1}_r}
\newcommand\RCHAR[1]{\mathbf{#1}_r}
\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\sum #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{Suffix} \;#1\;#2}

\lstdefinestyle{myScalastyle}{
frame=tb,
language=scala,
aboveskip=3mm,
belowskip=3mm,
showstringspaces=false,
columns=flexible,
basicstyle={\small\ttfamily},
numbers=none,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
frame=single,
breaklines=true,
breakatwhitespace=true,
tabsize=3,
}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like

Regular expressions are widely used in computer science:
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text-processing \cite{grep}; network intrusion
detection systems that inspect suspicious traffic; or compiler
front ends.
Given their usefulness and ubiquity, one would assume that
modern regular expression matching implementations
are mature and fully studied.
Indeed, given a popular programming language's regular expression engine,
one can supply it with regular expressions and strings and
in most cases obtain
the matching information in a very short time.
Those engines can be blindingly fast---some
network intrusion detection systems
use regular expression engines that are able to process
hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
However, those engines can sometimes exhibit a surprising security vulnerability
under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed

Consider for example the regular expression $(a^*)^*\,b$ and
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length of around 30 $a$'s,
the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
This is clearly exponential behaviour, and as can be seen
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions improve this behaviour somewhat, but are still slow compared
with the approach we are going to use in this thesis.

This superlinear blowup in regular expression engines
has caused grief in ``real life'' in the past where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site become unavailable.
\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Python},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java 8},
legend pos=north west,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Dart},
legend pos=north west,
legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Swift},
legend pos=north west,
legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=true,
%xtick={0,5000,...,40000},
%xmax=40000,
%ymax=35,
restrict x to domain*=0:40000,
restrict y to domain*=0:35,
%ytick={0,5,...,30},
%scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java9+},
legend pos=north west,
legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
The reason for their superlinear behaviour is that they do a depth-first-search
using NFAs.
If the string does not match, the regular expression matching
engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited catastrophic backtracking
and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very rarely,
%but actually widespread.
%They occur so often that they have a
but they occur often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
are real problems rather than just ``a parlour trick''.

This work aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness
and a finiteness property for the size of derivatives
(all carried out in Isabelle/HOL).
Such properties %guarantee the absence of
are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in figure \ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our
approach based on Brzozowski derivatives and formal proofs.
\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have clearly been studied for many, many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
with respect to the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
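
As a preview of the representations used later, such a grammar maps directly
onto a small inductive datatype. The following Scala sketch is only
illustrative (the constructor names are our own choice here, not fixed by
the grammar above); it also includes the two usual constants for the empty
language and the empty-string language:

\begin{lstlisting}[style=myScalastyle]
// A sketch of basic regular expressions as an inductive datatype;
// constructor names are illustrative only.
abstract class Rexp
case object ZERO extends Rexp                     // matches no string
case object ONE extends Rexp                      // matches the empty string
case class CHAR(c: Char) extends Rexp             // c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp   // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // r1 . r2
case class STAR(r: Rexp) extends Rexp             // r*
\end{lstlisting}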
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.

%Most modern regex libraries
%the so-called PCRE standard (Peral Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
Depending on the types of constructs
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
Similarly, the range operator
%used to express the alternative
%of all characters between its operands,
is just a concise way
of expressing an alternative of consecutive characters:
\[
[0-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
\]
The
wildcard character '$.$' is used to refer to any single character,
\[
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
except the newline.

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make the regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the star regular expression, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
\begin{tabular}{lcl}
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
\end{tabular}
\end{center}
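For instance, with these definitions the bounded repetition
$a^{\{2\ldots 3\}}$ matches exactly two strings:
\[
L \; (a^{\{2\ldots 3\}}) \;=\; \bigcup_{2 \leq i \leq 3}. (L \; a)^i \;=\; \{aa,\; aaa\}
\]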
The attraction of bounded repetitions is that they can be
used to avoid a size blow up: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression:
\[
\underbrace{r\ldots r}_\text{n copies of r}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.

The problem with matching
such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
either become excruciatingly slow
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
in C and JAVA that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above bounded repetition $r^{\{n\}}$,
even a small $n$ (say 20) would result in a program representing a
DFA
with millions of states.}) for large counters.
A classic example of this phenomenon is the regular expression $(a+b)^* a (a+b)^{\{n\}}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
\node[state, red] (q_1) [right=of q_0] {$q_1$};
\node[state, red] (q_2) [right=of q_1] {$q_2$};
\node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
\path[->]
(q_0) edge node {a} (q_1)
edge [loop below] node {a,b} ()
(q_1) edge node {a,b} (q_2)
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
and when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The
red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
The state $q_3$ is the last (accepting) state, requiring 0
more characters.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all different
scenarios in which ``countdown'' states are active.}

Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million.
A smaller sample XSD they gave
is:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the regex
$(ab^{\{2\ldots 12\}})^{\{0 \ldots 65535\}}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
for bounded regular expressions:
For example, in the regular expression matching library in the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that they simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states can
be equal to the counter number.
When the counters are large,
the memory requirement can become
infeasible, and a regex engine
like the one in Go will reject such a pattern straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

\node (q0) [state, initial] {$0$};
\node (q1) [state, right = of q0] {$1$};
%\node (q2) [state, right = of q1] {$2$};
\node (qdots) [right = of q1] {$\ldots$};
\node (qn) [state, right = of qdots] {$n$};
\node (qn1) [state, right = of qn] {$n+1$};
\node (qn2) [state, right = of qn1] {$n+2$};
\node (qn3) [state, accepting, right = of qn2] {$n+3$};

\path [-stealth, thick]
(q0) edge [loop above] node {a} ()
(q0) edge node {a} (q1)
%(q1) edge node {.} (q2)
(q1) edge node {.} (qdots)
(qdots) edge node {.} (qn)
(qn) edge node {.} (qn1)
(qn1) edge node {b} (qn2)
(qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
% \node[state,initial] (q_0) {$0$};
% \node[state, ] (q_1) [right=of q_0] {$1$};
% \node[state, ] (q_2) [right=of q_1] {$2$};
% \node[state,
% \node[state, accepting, ](q_3) [right=of q_2] {$3$};
% \path[->]
% (q_0) edge node {a} (q_1)
% edge [loop below] node {a,b} ()
% (q_1) edge node {a,b} (q_2)
% (q_2) edge node {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
that NFA simulation can consume large
amounts of memory: $.^*a.^{\{n\}}bc$ matching
strings of the form $aaa\ldots aaaabc$.
When traversing in a breadth-first manner,
all states from 0 till $n+1$ will become active.}
\end{figure}
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabytes of string input per second
even with large counters \cite{Becchi08}.
These practical solutions do not come with
formal guarantees, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
%But formal reasoning about these automata especially in Isabelle
%can be challenging
%and un-intuitive.
%Therefore, we take correctness and runtime claims made about these solutions
%with a grain of salt.

In the work reported in \cite{FoSSaCS2023} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover, these extensions are still made up of only small
inductive datatypes and recursive functions,
making it handy to deal with them in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.


\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is choosing
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called a ``depth-first search'',
is efficient in many cases, but can end up
with exponential run time.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply
%construct the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.
Given a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
$r_1r_2r_3r_4$
\end{center}
one could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers.
%These sub-expressions are called "capturing groups".
To do so, we use the syntax $\backslash i$
to denote that we want the sub-string
of the input just matched by the $i$-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{$i$-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
of back-references is
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of back-references is
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
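To illustrate, both examples can be tried out directly in Python's \texttt{re} engine, one of the backtracking engines mentioned above (a small sketch; the particular test strings are just illustrative choices):

```python
import re

# (.*)\1 matches strings made of two identical halves
assert re.fullmatch(r'(.*)\1', 'foofoo') is not None
assert re.fullmatch(r'(.*)\1', 'foobar') is None

# (.)(.)\2\1 matches four-character palindromes
assert re.fullmatch(r'(.)(.)\2\1', 'abba') is not None
assert re.fullmatch(r'(.)(.)\2\1', 'x??x') is not None
assert re.fullmatch(r'(.)(.)\2\1', 'abcd') is None
```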

Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley~\cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to express well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes
goes beyond regular languages
once back-references are included.
In fact, back-references allow regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Moreover, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
resort to a depth-first search algorithm which backtracks.
What is unexpected is that even in the cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in figure~\ref{fig:aStarStarb}.

Summing these up, we can categorise existing
practical regex libraries into two kinds:
(i) the ones with linear
time guarantees like Go and Rust. The downside with them is that
they impose restrictions
on the regular expressions (not allowing back-references,
bounded repetitions cannot exceed an ad hoc limit, etc.);
and (ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
They can potentially ``grind to a halt''
on some very simple cases, resulting in
ReDoS attacks if exposed to the internet.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.

%We would like to have regex engines that can
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.

\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is widely adopted in many regular expression matchers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
$(aba + ab + a)^*$ and the string $ababa$.
\end{center}
The correct POSIX match for the above
is the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}\footnote{
regex101 is an online regular expression matcher which
provides an API for trying out regular expression
engines of multiple popular programming languages like
Java, Python, Go, etc.}
with different engines always yields
the matches $[aba]$ at $[0, 3)$
and $[a]$ at $[4, 5)$.
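This non-POSIX behaviour is easy to reproduce: Python's backtracking engine commits to the first alternative that succeeds and never revisits the choice once the star can no longer continue (a small sketch, with alternation written in Python's regex syntax):

```python
import re

# (aba|ab|a)* against "ababa": the backtracking engine matches "aba"
# in the first star iteration and then stops, whereas the POSIX
# longest-match answer is the whole string "ababa".
m = re.match(r'(aba|ab|a)*', 'ababa')
assert m.group(0) == 'aba'
```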
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries are not
correctly implementing the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote,
\begin{quote}
``The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
We think the implementation complexity of POSIX rules also comes from
the specification not being very precise.
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
REs have right associative concatenation which can be changed with parenthesis\\
\item
parenthesized subexpressions return the match from their last usage\\
\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions\\
\item
if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
\item
if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
\end{itemize}
\end{quote}
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ausaf et al.~\cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also prove
that their specification coincides with the
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu does not work for the correctness proof.

In the next section we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main part of this thesis builds.
%We give a taste of what they
%are like and why they are suitable for regular expression
%matching and lexing.
\section{Formal Specification of POSIX Matching
and Brzozowski Derivatives}
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Brzozowski \cite{Brzozowski1964} first introduced the
concept of a \emph{derivative} of a regular expression in 1964.
The derivative of a regular expression $r$
with respect to a character $c$ is written as $r \backslash c$.
This operation tells us what $r$ transforms into
if we ``chop'' off the first character $c$
from all strings in the language of $r$ (defined
later as $L \; r$).
%To give a flavour of Brzozowski derivatives, we present
%two straightforward clauses from it:
%\begin{center}
%	\begin{tabular}{lcl}
%		$d \backslash c$     & $\dn$ &
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
%	\end{tabular}
%\end{center}
%\noindent
%The first clause says that for the regular expression
%denoting a singleton set consisting of a single-character string $\{ d \}$,
%we check the derivative character $c$ against $d$,
%returning a set containing only the empty string $\{ [] \}$
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
%The second clause states that to obtain the regular expression
%representing all strings' head character $c$ being chopped off
%from $r_1 + r_2$, one simply needs to recursively take derivative
%of $r_1$ and $r_2$ and then put them together.
Derivatives have the property
that $s \in L \; (r\backslash c)$ if and only if
$c::s \in L \; r$ where $::$ stands for list prepending.
%This property can be used on regular expressions
%matching and lexing--to test whether a string $s$ is in $L \; r$,
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
With this property, derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
There have been several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.
In addition, one can extend derivatives to bounded repetitions
relatively straightforwardly. For example, the derivative of
$r^{\{n\}}$ (for $n > 0$) can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$ & $\dn$ & $(r \backslash c) \cdot
		r^{\{n-1\}}$\\
	\end{tabular}
\end{center}
\noindent
Experimental results suggest that unlike DFA-based solutions
for bounded regular expressions,
derivatives can cope with
large counters
quite well.
There have also been
extensions to other constructs.
For example, Owens et al.\ include the derivatives
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
$/* \ldots */$ (see \cite{Owens2008}).
Another extension for derivatives is
regular expressions with look-aheads, done
by Miyazaki and Minamide
\cite{Takayuki2019}.
%We therefore use Brzozowski derivatives on regular expressions
%lexing


Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
in generating a formally verified algorithm for lexing---the clauses and property
can be easily expressed in a functional programming language
or converted to theorem prover
code, with great extensibility.
Perhaps this is the reason why they have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years (
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012}
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published by Brzozowski.


However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
Second, derivative-based matchers need to be more efficient.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow.
For example, Sulzmann and Lu
claim a linear running time for their proposed algorithm,
but that was falsified by our experiments. The running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$
is made for the Verbatim \cite{Verbatim}
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not their complexity result.
In the performance evaluation section, they analyzed the run time
of matching $a$ with the string
\begin{center}
$\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$.
\end{center}
\noindent
They concluded that the algorithm is quadratic in terms of
the length of the input string.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.


\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
problem with the yes/no answer
by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function.
In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

The version of the algorithm proven correct
suffers heavily from a
second difficulty, where the internal derivatives can
grow to arbitrarily large sizes.
For example, if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer. Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in
the work by Ausaf et al.\ to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.

Therefore we want to look in this thesis at a second
algorithm by Sulzmann and Lu where they
overcame this ``growth problem'' \cite{Sulzmann2014}.
In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):

\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method [..] in combination with the simplification steps [..]
yields POSIX parse trees. We have tested this claim
extensively [..] but yet
have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
Ausaf and Urban made some initial progress towards the
full correctness proof but still had to leave out the optimisation
Sulzmann and Lu proposed.
Ausaf wrote \cite{Ausaf}:
\begin{quote}\it
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of the correctness of those simplifications.

One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work for us and we will compare it later with
the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.

%----------------------------------------------------------------------------------------
\section{Contribution}

In this thesis,
we propose a solution to catastrophic
backtracking and error-prone matchers: a formally verified
regular expression lexing algorithm
that is both fast
and correct by extending Ausaf et al.'s work.
The end result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
an improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives with simplifications,
accompanied by
a proven correctness theorem according to the POSIX specification
given by Ausaf et al.~\cite{AusafDyckhoffUrban2016},
\item
a complexity-related property for that algorithm saying that the
internal data structure will
remain finite,
\item
and an extension to
the bounded repetitions construct with the correctness and finiteness property
maintained.
\end{itemize}
\noindent
With a formal finiteness bound in place,
we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
Further improvements to the algorithm with an even stronger version of
simplification can be made. We conjecture that the resulting size of derivatives
can be bounded by a cubic bound w.r.t.\ the size of the regular expression.



\section{Structure of the thesis}
In chapter \ref{Inj} we will introduce the concepts
and notations we
use for describing regular expressions and derivatives,
and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including
its correctness proof).
We will give their second lexing algorithm with bitcodes in chapter \ref{Bitcoded1}
together with the correctness proof by Ausaf and Urban.
Then we illustrate in chapter \ref{Bitcoded2}
how Sulzmann and Lu's
simplifications fail to simplify. We therefore introduce our version of the
algorithm with simplification and
its correctness proof.
In chapter \ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is a finite bound on the size of any
regular expression's derivatives.
We also show how one can extend the
algorithm to include bounded repetitions.
In chapter \ref{Cubic} we discuss stronger simplification rules which
improve the finite bound to a cubic bound.%and the NOT regular expression.




%----------------------------------------------------------------------------------------


%----------------------------------------------------------------------------------------

%----------------------------------------------------------------------------------------

%----------------------------------------------------------------------------------------