% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Introduction} % For referencing the chapter elsewhere, use \ref{Introduction}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

%boxes
\newcommand*{\mybox}[1]{\framebox{\strut #1}}

%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
\newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimps} #2}
\def\derssimp{\textit{ders}\_\textit{simp}}
\def\rders{\textit{rders}}
\newcommand{\bders}[2]{#1 \backslash #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
\def\bsimps{\textit{bsimp}}
\newcommand{\rsimp}[1]{\textit{rsimp}\; #1}
\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
\newcommand{\rdistinct}[2]{\textit{rdistinct} \;\; #1 \;\; #2}
\def\rdistincts{\textit{rdistinct}}
\def\rDistinct{\textit{rdistinct}}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}
\def\cbn{\textit{createdByNtimes}}
\def\hpa{\textit{highestPowerAux}}
\def\hpower{\textit{highestPower}}
\def\ntset{\textit{ntset}}
\def\optermsimp{\textit{optermsimp}}
\def\optermOsimp{\textit{optermOsimp}}
\def\optermosimp{\textit{optermosimp}}
\def\opterm{\textit{opterm}}
\def\nString{\textit{nonemptyString}}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
\def\SEQ{\textit{SEQ}}
\def\SEQs{\textit{SEQs}}
\def\case{\textit{case}}
\def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
\def\rsimpalts{\textit{rsimp}_{ALTS}}
\def\good{\textit{good}}
\def\btrue{\textit{true}}
\def\bfalse{\textit{false}}
\def\bnullable{\textit{bnullable}}
\def\bnullables{\textit{bnullables}}
\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}
\newcommand{\rder}[2]{#2 \backslash_r #1}

\def\rerases{\textit{rerase}}
\def\nonnested{\textit{nonnested}}
\def\AZERO{\textit{AZERO}}
\def\sizeNregex{\textit{sizeNregex}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\simpsulz{\textit{simp}_{Sulz}}

\def\scfrewrites{\stackrel{*}{\rightsquigarrow_{scf}}}
\def\frewrite{\rightsquigarrow_f}
\def\hrewrite{\rightsquigarrow_h}
\def\grewrite{\rightsquigarrow_g}
\def\frewrites{\stackrel{*}{\rightsquigarrow_f}}
\def\hrewrites{\stackrel{*}{\rightsquigarrow_h}}
\def\grewrites{\stackrel{*}{\rightsquigarrow_g}}
\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\der{\textit{der}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\bmkepss{\textit{bmkepss}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\textit{inj}}
\def\Empty{\textit{Empty}}
\def\Left{\textit{Left}}
\def\Right{\textit{Right}}
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
\def\lf{\textit{lf}}
\def\PD{\textit{PD}}
\def\suffix{\textit{Suffix}}
\def\distinctBy{\textit{distinctBy}}
\def\starupdate{\textit{starUpdate}}
\def\starupdates{\textit{starUpdates}}
\def\nupdate{\textit{nupdate}}
\def\nupdates{\textit{nupdates}}

\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
\def\blexerStrong{\textit{blexerStrong}}
\def\bsimpStrong{\textit{bsimpStrong}}
\def\bdersStrongs{\textit{bdersStrong}}
\newcommand{\bdersStrong}[2]{#1 \backslash_{bsimpStrongs} #2}

\def\map{\textit{map}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable} \; #1 }
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)_{\downarrow_r}}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}

\def\rflts{\textit{rflts}}
\def\rrewrite{\textit{rrewrite}}
\def\bsimpalts{\textit{bsimp}_{ALTS}}
\def\bsimpaseq{\textit{bsimp}_{ASEQ}}
\def\rsimlalts{\textit{rsimp}_{ALTs}}
\def\rsimpseq{\textit{rsimp}_{SEQ}}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
%\def\not{\textit{not}}

\def\RZERO{\mathbf{0}_r }
\def\RONE{\mathbf{1}_r}
\newcommand\RCHAR[1]{\mathbf{#1}_r}
\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\sum #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{Suffix} \;#1\;#2}
\lstdefinestyle{myScalastyle}{
  language=scala,
  aboveskip=3mm,
  belowskip=3mm,
  showstringspaces=false,
  columns=flexible,
  basicstyle={\small\ttfamily},
  numbers=none,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{blue},
  commentstyle=\color{dkgreen},
  stringstyle=\color{mauve},
  frame=single,
  breaklines=true,
  breakatwhitespace=true,
  tabsize=3,
}

%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like
Regular expressions are widely used in computer science:
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text-processing \cite{grep}; network intrusion
detection systems that inspect suspicious traffic; or compiler
front ends.
Given their usefulness and ubiquity, one would assume that
modern regular expression matching implementations
are mature and fully studied.
Indeed, if one supplies a popular programming language's regular
expression engine with a regular expression and a string,
in most cases one
obtains the matching information in a very short time.
Those engines can be blindingly fast---some
network intrusion detection systems
use regular expression engines that are able to process
hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
However, those engines can sometimes exhibit a surprising security vulnerability
under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
Consider for example $(a^*)^*\,b$ and
strings of the form $aa..a$: these strings cannot be matched by this regular
expression, since the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length of around 30 $a$'s,
the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
This is clearly exponential behaviour, and as can be seen
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions improve this behaviour somewhat, but are still slow compared
with the approach we are going to use in this thesis.
This superlinear blowup in regular expression engines
has caused grief in ``real life'' in the past, where it has been
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.
\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Python},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java 8},
legend pos=north west,
legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Dart},
legend pos=north west,
legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
xmax=33,
ymax=35,
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Swift},
legend pos=north west,
legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
x label style={at={(1.05,-0.05)}},
%ylabel={time in secs},
enlargelimits=true,
%xtick={0,5000,...,40000},
%xmax=40000,
%ymax=35,
restrict x to domain*=0:40000,
restrict y to domain*=0:35,
%ytick={0,5,...,30},
%scaled ticks=false,
axis lines=left,
width=5cm,
height=4cm,
legend entries={Java9+},
legend pos=north west,
legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
The reason for their superlinear behaviour is that they do a depth-first-search
using NFAs.
If the string does not match, the regular expression matching
engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited catastrophic backtracking
and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very rarely,
%but actually widespread.
%They occur so often that they have a
but they occur often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and in pypi.
They therefore concluded that evil regular expressions
are real problems rather than just ``a parlour trick''.

This work aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness (in
Isabelle/HOL)
and a finiteness property for the size of derivatives.
Such properties %guarantee the absence of
are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in figure \ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our
approach based on Brzozowski derivatives and formal proofs.

\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
w.r.t.\ the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
r ::= c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
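For completeness, the languages of these basic constructs can be spelled out in the standard textbook way, using the same $L$ and $\dn$ notation as for the bounded repetitions later in this chapter:
\begin{center}
\begin{tabular}{lcl}
$L \; c$ & $\dn$ & $\{c\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $L \; r_1 \cup L \; r_2$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $\{s_1 s_2 \mid s_1 \in L \; r_1,\; s_2 \in L \; r_2\}$\\
$L \; (r^*)$ & $\dn$ & $\bigcup_{0 \leq i}. (L \; r)^i$
\end{tabular}
\end{center}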
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal languages theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.

%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
Depending on the types of constructs
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are just syntactic sugar, that is,
short-hand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
Similarly, the range operator, which expresses the alternative
of all characters between its operands, is just a concise notation:
\[
[0-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all digits)}
\]
The
wildcard character $.$ refers to any single character except the newline:
\[
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make the regular expressions much
more compact.
There are
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the star regular expressions, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
\begin{tabular}{lcl}
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
\end{tabular}
\end{center}
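As a small worked instance of these clauses, the bounded repetition $a^{\{2\ldots 3\}}$ matches exactly two or three $a$'s:
\[
L \; a^{\{2\ldots 3\}} = \bigcup_{2 \leq i \leq 3}. (L \; a)^i = \{aa, aaa\}
\]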
The attraction of bounded repetitions is that they can be
used to avoid a blow up: for example $r^{\{n\}}$
is a shorthand for
\[
\underbrace{r\ldots r}_\text{n copies of r}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.

The problem with matching
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
either become excruciatingly slow
(for example Verbatim++~\cite{Verbatimpp}) or get
out-of-memory errors (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{which are lexer generators
in C and JAVA that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to them, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above countdown regular expression,
a small $n$ (a few dozen) would result in a
determinised automaton
with millions of states.}) for large counters.
A classic example of this phenomenon is the regular expression $(a+b)^*\,a\,(a+b)^{\{n\}}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
\node[state, red] (q_1) [right=of q_0] {$q_1$};
\node[state, red] (q_2) [right=of q_1] {$q_2$};
\node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
\path[->]
(q_0) edge node {a} (q_1)
edge [loop below] node {a,b} ()
(q_1) edge node {a,b} (q_2)
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The
red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$ on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
$q_3$ is the last state, requiring 0 more characters and is accepting.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independent from each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition counters, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.}
Bounded repetitions are very important because they
tend to occur often in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
approximately up to ten million.
An example XSD they gave
is:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the expression
$(a b^{\{2\ldots 12\}})^{\{0 \ldots 65535\}}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
on bounded regular expressions:
For example, in the regular expression matching library in the Go
language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
As Becchi and Crawley~\cite{Becchi08} have pointed out,
the reason for these restrictions
is that they simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large,
the memory requirement could become
infeasible, and a regex engine
like Go will reject this pattern straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

\node (q0) [state, initial] {$0$};
\node (q1) [state, right = of q0] {$1$};
%\node (q2) [state, right = of q1] {$2$};
\node (qdots) [right = of q1] {$\ldots$};
\node (qn) [state, right = of qdots] {$n$};
\node (qn1) [state, right = of qn] {$n+1$};
\node (qn2) [state, right = of qn1] {$n+2$};
\node (qn3) [state, accepting, right = of qn2] {$n+3$};

\path [-stealth, thick]
(q0) edge [loop above] node {a} ()
(q0) edge node {a} (q1)
%(q1) edge node {.} (q2)
(q1) edge node {.} (qdots)
(qdots) edge node {.} (qn)
(qn) edge node {.} (qn1)
(qn1) edge node {b} (qn2)
(qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
% \node[state,initial] (q_0) {$0$};
% \node[state, ] (q_1) [right=of q_0] {$1$};
% \node[state, ] (q_2) [right=of q_1] {$2$};
% \node[state,
% \node[state, accepting, ](q_3) [right=of q_2] {$3$};
% \path[->]
% (q_0) edge node {a} (q_1)
% edge [loop below] node {a,b} ()
% (q_1) edge node {a,b} (q_2)
% (q_2) edge node {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
that NFA simulation can consume large
amounts of memory: $.^*a.^{\{n\}}bc$ matching
strings of the form $aaa\ldots aaaabc$.
When traversing in a breadth-first manner,
all states from 0 to $n+1$ will become active.}
\end{figure}
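The breadth-first simulation behind this example can be made concrete. The following sketch (ours, not code from this development; the function name \texttt{active\_states} is hypothetical) tracks the set of active states of the NFA for $.^*a.^{\{n\}}bc$ while scanning a string, with states numbered $0$ to $n+3$ as in the figure:

```python
def active_states(n, s):
    """Breadth-first simulation of the NFA for  .* a .{n} b c
    (states 0 .. n+3 as in the figure).  Returns the set of states
    active after reading s, and the peak number of active states."""
    active, peak = {0}, 1
    for ch in s:
        nxt = set()
        for q in active:
            if q == 0:
                nxt.add(0)                  # the .* self-loop
                if ch == 'a':
                    nxt.add(1)              # guess: this 'a' is the delimiter
            elif 1 <= q <= n:               # the n wildcard steps
                nxt.add(q + 1)
            elif q == n + 1 and ch == 'b':
                nxt.add(n + 2)
            elif q == n + 2 and ch == 'c':
                nxt.add(n + 3)              # n+3 is the accepting state
        active = nxt
        peak = max(peak, len(active))
    return active, peak
```

Running this with $n = 3$ on a string such as \texttt{aaaaaabc} reaches the accepting state $n+3$, and the peak number of simultaneously active states is $n+2$; this growth, linear in the counter, is exactly the memory cost Becchi and Crawley point out.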
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabytes of string input per second
even with large counters \cite{Becchi08}.
But formal reasoning about these automata, especially in Isabelle,
can be challenging
and unintuitive.
Therefore, we take correctness and runtime claims made about these solutions
with a grain of salt.
In the work reported in \cite{FoSSaCS2023} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
The results
extend straightforwardly to
repetitions with an interval such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only
inductive datatypes and recursive functions,
making it convenient to deal with them in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.

\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is choosing
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called a ``depth-first search'',
is efficient in a lot of cases, but can end up
with exponential runtime.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply
%construct the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.
Given a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
$r_1r_2r_3r_4$
\end{center}
one could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers.
%These sub-expressions are called "capturing groups".
To do so, we use the syntax $\backslash i$
to denote that we want the sub-string
of the input just matched by the $i$-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{$i$-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
of back-references is
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of back-references is
\begin{center}
$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
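The behaviour of these patterns can be checked directly in Python, whose \texttt{re} engine supports back-references. The following is only an illustrative sketch, not part of the formalisation:

```python
import re

# (.*)\1 : the back-reference \1 forces the second half to repeat
# the first half exactly, so only "square" strings match.
halves = re.compile(r"(.*)\1")
assert halves.fullmatch("foofoo") is not None
assert halves.fullmatch("foobar") is None

# (.*)(.*) : without the back-reference, the split between the two
# groups is unconstrained, so any string matches.
free = re.compile(r"(.*)(.*)")
assert free.fullmatch("foobar") is not None

# (.)(.)\2\1 : four-character palindromes such as "abba".
pal = re.compile(r"(.)(.)\2\1")
assert pal.fullmatch("abba") is not None
assert pal.fullmatch("abab") is None
```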

Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley~\cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to express well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes
goes beyond regular languages
once back-references are included.
In fact, back-references allow regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Also, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm which backtracks.
What is unexpected is that even in cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in Figure~\ref{fig:aStarStarb}.

Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) those with linear
time guarantees like Go and Rust. The downside is that
they impose restrictions
on the regular expressions (not allowing back-references,
bounded repetitions cannot exceed an ad hoc limit, etc.).
And (ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
They can potentially ``grind to a halt''
on some very simple cases, resulting
in ReDoS attacks if exposed to the internet.

The problems with both approaches motivate us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.

%We would like to have regex engines that can
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.

\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is adopted by many regular expression matchers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
$(aba + ab + a)^*$ and the string $ababa$.
\end{center}
The correct POSIX match for the above
is the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}\footnote{
regex101 is an online regular expression matcher which
provides an API for trying out the regular expression
engines of multiple popular programming languages like
Java, Python, Go, etc.}
with different engines
always yields the matches $[aba]$ at $[0, 3)$
and $[a]$ at $[4, 5)$.
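The non-POSIX behaviour is easy to reproduce. As an illustration (our own sketch, with Python's backtracking \texttt{re} engine standing in for the engines hosted on regex101):

```python
import re

# A greedy backtracking engine tries the alternatives of (aba|ab|a)
# left-to-right in each iteration: the first iteration grabs "aba",
# after which no alternative fits the remaining "ba", so the star
# stops with the overall match "aba" at positions [0, 3),
# not the POSIX-longest match "ababa".
m = re.match(r"(aba|ab|a)*", "ababa")
assert m.group(0) == "aba"
assert m.span() == (0, 3)
```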
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries do not
correctly implement the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote,
\begin{quote}
``The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
We think the implementation complexity of the POSIX rules also comes from
the specification not being very precise.
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
\begin{itemize}
\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there
earlier subpatterns have leftmost-longest priority over later subpatterns\\
\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
\item
REs have right associative concatenation which can be changed with parenthesis\\
\item
parenthesized subexpressions return the match from their last usage\\
\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions\\
\item
if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions\\
\item
if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string\\
\end{itemize}
\end{quote}
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ausaf et al.\ \cite{AusafDyckhoffUrban2016}
were the first to
give a reasonably simple formalised POSIX
specification in Isabelle/HOL, and also proved
that their specification coincides with the
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu does not work for the correctness proof.

In the next section we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main contribution of this thesis builds.
%We give a taste of what they
%are like and why they are suitable for regular expression
%matching and lexing.
\section{Formal Specification of POSIX Matching
and Brzozowski Derivatives}
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Brzozowski \cite{Brzozowski1964} first introduced the
concept of a \emph{derivative} of a regular expression in 1964.
The derivative of a regular expression $r$
with respect to a character $c$ is written as $r \backslash c$.
This operation tells us what $r$ transforms into
if we ``chop'' off the first character $c$
from all strings in the language of $r$ (defined
later as $L \; r$).
%To give a flavour of Brzozowski derivatives, we present
%two straightforward clauses from it:
%\begin{center}
%	\begin{tabular}{lcl}
%		$d \backslash c$     & $\dn$ &
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
%	\end{tabular}
%\end{center}
%\noindent
%The first clause says that for the regular expression
%denoting a singleton set consisting of a single-character string $\{ d \}$,
%we check the derivative character $c$ against $d$,
%returning a set containing only the empty string $\{ [] \}$
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
%The second clause states that to obtain the regular expression
%representing all strings' head character $c$ being chopped off
%from $r_1 + r_2$, one simply needs to recursively take derivative
%of $r_1$ and $r_2$ and then put them together.
Derivatives have the property
that $s \in L \; (r\backslash c)$ if and only if
$c::s \in L \; r$, where $::$ stands for list prepending.
%This property can be used on regular expressions
%matching and lexing--to test whether a string $s$ is in $L \; r$,
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
With this property, derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
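To make the matcher concrete, here is a minimal executable sketch in Python (our own rendering with regular expressions as tagged tuples; it is not the Isabelle formalisation):

```python
from functools import reduce

# Regular expressions as tagged tuples:
# ("zero",) ("one",) ("chr", c) ("alt", r1, r2) ("seq", r1, r2) ("star", r)

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    if tag == "zero":
        return False
    if tag in ("one", "star"):
        return True
    if tag == "chr":
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "chr":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "star":
        return ("seq", der(c, r[1]), r)

def matches(r, s):
    """r matches s iff the derivative iterated over s is nullable."""
    return nullable(reduce(lambda acc, c: der(c, acc), s, r))

# (a + aa)* matches any string of a's, including the empty string
r = ("star", ("alt", ("chr", "a"), ("seq", ("chr", "a"), ("chr", "a"))))
assert matches(r, "") and matches(r, "aaaa")
assert not matches(r, "aab")
```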
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
There have been several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.

In addition, one can extend derivatives to bounded repetitions
relatively straightforwardly. For example, the derivative of
$r^{\{n\}}$ can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
		r^{\{n-1\}}$\\
	\end{tabular}
\end{center}
\noindent
Experimental results suggest that unlike DFA-based solutions
for bounded regular expressions,
derivatives can cope with
large counters
quite well.
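The clause above can be turned into code with only one extra case in a derivative matcher; a self-contained Python sketch (our own tuple encoding, not the formalised code):

```python
from functools import reduce

# Regular expressions as tagged tuples, with a bounded repetition r^{n}:
# ("zero",) ("one",) ("chr", c) ("alt", r1, r2) ("seq", r1, r2) ("ntimes", r, n)

def nullable(r):
    tag = r[0]
    if tag == "zero":
        return False
    if tag == "one":
        return True
    if tag == "chr":
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    if tag == "ntimes":
        return r[2] == 0 or nullable(r[1])

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "chr":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "ntimes":
        # r^{0} matches only the empty string, so its derivative is 0;
        # otherwise r^{n} \ c = (r \ c) . r^{n-1}, as in the clause above.
        if r[2] == 0:
            return ("zero",)
        return ("seq", der(c, r[1]), ("ntimes", r[1], r[2] - 1))

def matches(r, s):
    return nullable(reduce(lambda acc, c: der(c, acc), s, r))

# a^{3} matches exactly "aaa"
a3 = ("ntimes", ("chr", "a"), 3)
assert matches(a3, "aaa")
assert not matches(a3, "aa") and not matches(a3, "aaaa")
```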

There have also been
extensions to other constructs.
For example, Owens et al.\ include the derivative
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
$/* \ldots */$ (see \cite{Owens2008}).
Another extension for derivatives is
regular expressions with look-aheads, done
by Miyazaki and Minamide
\cite{Takayuki2019}.
%We therefore use Brzozowski derivatives on regular expressions
%lexing

Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
in generating a formally verified algorithm for lexing: the clauses and property
can be easily expressed in a functional programming language
or converted to theorem prover
code, with great extensibility.
Perhaps this is the reason why they have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years (
\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012}
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published by Brzozowski.


However, there are two difficulties with derivative-based matchers:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
Second, derivative-based matchers are often not efficient enough.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow.
For example, Sulzmann and Lu
claim a linear running time for their proposed algorithm,
but that was falsified by our experiments. The running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$
is made for the Verbatim \cite{Verbatim}
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not their complexity result.
In the performance evaluation section, they analyzed the run time
of matching $a$ with the string
\begin{center}
	$\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$.
\end{center}
\noindent
They concluded that the algorithm is quadratic in terms of
the length of the input string.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.


\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
problem with the yes/no answer
by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function.
In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.
+ − 1138
609
+ − 1139
The version of the algorithm proven correct
622
+ − 1140
suffers heavily from a
+ − 1141
second difficulty, where the internal derivatives can
609
+ − 1142
grow to arbitrarily big sizes.
+ − 1143
For example if we start with the
532
+ − 1144
regular expression $(a+aa)^*$ and take
+ − 1145
successive derivatives according to the character $a$, we end up with
+ − 1146
a sequence of ever-growing derivatives like
+ − 1147
+ − 1148
\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
+ − 1149
\begin{center}
+ − 1150
\begin{tabular}{rll}
+ − 1151
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+ − 1152
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+ − 1153
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
+ − 1154
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+ − 1155
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
+ − 1156
\end{tabular}
+ − 1157
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer. Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in
the work by Ausaf et al.\ to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
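This growth, and the limited effect of the simple-minded rules, can be observed with a small experiment. The following Python sketch (our own code, not Sulzmann and Lu's, measuring the number of nodes in the derivative) applies the four rules above bottom-up after each derivative step:

```python
from functools import reduce

# ("zero",) ("one",) ("chr", c) ("alt", r1, r2) ("seq", r1, r2) ("star", r)

def nullable(r):
    tag = r[0]
    if tag == "zero":
        return False
    if tag in ("one", "star"):
        return True
    if tag == "chr":
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])

def der(c, r):
    tag = r[0]
    if tag in ("zero", "one"):
        return ("zero",)
    if tag == "chr":
        return ("one",) if r[1] == c else ("zero",)
    if tag == "alt":
        return ("alt", der(c, r[1]), der(c, r[2]))
    if tag == "seq":
        left = ("seq", der(c, r[1]), r[2])
        return ("alt", left, der(c, r[2])) if nullable(r[1]) else left
    if tag == "star":
        return ("seq", der(c, r[1]), r)

def size(r):
    """Number of nodes in the regular expression tree."""
    return 1 + sum(size(x) for x in r[1:] if isinstance(x, tuple))

def simp(r):
    """The simple-minded rules: 0.r => 0, 1.r => r, 0+r => r, r+r => r."""
    tag = r[0]
    if tag == "alt":
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ("zero",):
            return r2
        if r2 == ("zero",) or r1 == r2:
            return r1
        return ("alt", r1, r2)
    if tag == "seq":
        r1, r2 = simp(r[1]), simp(r[2])
        if ("zero",) in (r1, r2):
            return ("zero",)
        if r1 == ("one",):
            return r2
        return ("seq", r1, r2)
    return r

r0 = ("star", ("alt", ("chr", "a"), ("seq", ("chr", "a"), ("chr", "a"))))
naive = [size(reduce(lambda r, c: der(c, r), "a" * n, r0)) for n in range(9)]
simped = [size(reduce(lambda r, c: simp(der(c, r)), "a" * n, r0)) for n in range(9)]
# Sizes grow without bound in both cases; simplification only slows the growth.
assert naive[8] > naive[4] > naive[2]
assert simped[8] > simped[2]
assert simped[8] < naive[8]
```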

Therefore we want to look in this thesis at a second
algorithm by Sulzmann and Lu in which they
overcame this ``growth problem'' \cite{Sulzmann2014}.
In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):

\begin{quote}\it
	``Correctness Claim: We further claim that the incremental parsing
	method [..] in combination with the simplification steps [..]
	yields POSIX parse trees. We have tested this claim
	extensively [..] but yet
	have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
Ausaf and Urban made some initial progress towards the
full correctness proof but still had to leave out the optimisation
Sulzmann and Lu proposed.
Ausaf wrote \cite{Ausaf}:
\begin{quote}\it
	``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of the correctness of those simplifications.


One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work for us and we will compare it later with
the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.

%----------------------------------------------------------------------------------------
\section{Contribution}

In this thesis,
we propose a solution to catastrophic
backtracking and error-prone matchers: a formally verified
regular expression lexing algorithm
that is both fast
and correct, extending Ausaf et al.'s work.
The end result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
an improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives with simplifications,
accompanied by
a proven correctness theorem according to the POSIX specification
given by Ausaf et al.\ \cite{AusafDyckhoffUrban2016},
\item
a complexity-related property for that algorithm stating that the
internal data structure will
remain finite,
\item
and an extension to
the bounded repetition construct with the correctness and finiteness properties
maintained.
\end{itemize}
\noindent
With a formal finiteness bound in place,
we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
	\url{https://github.com/hellotommmy/posix}
\end{center}
Further improvements to the algorithm with an even stronger version of
simplification can be made. We conjecture that the resulting size of derivatives
can be bounded cubically w.r.t.\ the size of the regular expression.




\section{Structure of the thesis}
In chapter \ref{Inj} we will introduce the concepts
and notations we
use for describing regular expressions and derivatives,
and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including
its correctness proof).
We will give their second lexing algorithm with bitcodes in chapter \ref{Bitcoded1}
together with the correctness proof by Ausaf and Urban.
Then we illustrate in chapter \ref{Bitcoded2}
how Sulzmann and Lu's
simplifications fail to simplify. We therefore introduce our version of the
algorithm with simplification and
its correctness proof.
In chapter \ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is a finite bound on the size of any
regular expression's derivatives.
We also show how one can extend the
algorithm to include bounded repetitions.
In chapter \ref{Cubic} we discuss stronger simplification rules which
improve the finite bound to a cubic bound.%and the NOT regular expression.

%----------------------------------------------------------------------------------------