% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Introduction} % For referencing the chapter elsewhere, use \ref{Introduction}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

% boxes
\newcommand*{\mybox}[1]{\framebox{\strut #1}}

%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
\newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimps} #2}
\def\derssimp{\textit{ders}\_\textit{simp}}
\def\rders{\textit{rders}}
\newcommand{\bders}[2]{#1 \backslash #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
\def\bsimps{\textit{bsimp}}
\newcommand{\rsimp}[1]{\textit{rsimp}\; #1}
\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
\newcommand{\rdistinct}[2]{\textit{rdistinct} \;\; #1 \;\; #2}
\def\rdistincts{\textit{rdistinct}}
\def\rDistinct{\textit{rdistinct}}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}
\def\cbn{\textit{createdByNtimes}}
\def\hpa{\textit{highestPowerAux}}
\def\hpower{\textit{highestPower}}
\def\ntset{\textit{ntset}}
\def\optermsimp{\textit{optermsimp}}
\def\optermOsimp{\textit{optermOsimp}}
\def\optermosimp{\textit{optermosimp}}
\def\opterm{\textit{opterm}}
\def\nString{\textit{nonemptyString}}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
\def\SEQ{\textit{SEQ}}
\def\SEQs{\textit{SEQs}}
\def\case{\textit{case}}
\def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
\def\rsimpalts{\textit{rsimp}_{ALTS}}
\def\good{\textit{good}}
\def\btrue{\textit{true}}
\def\bfalse{\textit{false}}
\def\bnullable{\textit{bnullable}}
\def\bnullables{\textit{bnullables}}
\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}
\newcommand{\rder}[2]{#2 \backslash_r #1}

\def\rerases{\textit{rerase}}
\def\nonnested{\textit{nonnested}}
\def\AZERO{\textit{AZERO}}
\def\sizeNregex{\textit{sizeNregex}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\simpsulz{\textit{simp}_{Sulz}}

\def\scfrewrites{\stackrel{*}{\rightsquigarrow_{scf}}}
\def\frewrite{\rightsquigarrow_f}
\def\hrewrite{\rightsquigarrow_h}
\def\grewrite{\rightsquigarrow_g}
\def\frewrites{\stackrel{*}{\rightsquigarrow_f}}
\def\hrewrites{\stackrel{*}{\rightsquigarrow_h}}
\def\grewrites{\stackrel{*}{\rightsquigarrow_g}}
\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\der{\textit{der}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\bmkepss{\textit{bmkepss}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\textit{inj}}
\def\Empty{\textit{Empty}}
\def\Left{\textit{Left}}
\def\Right{\textit{Right}}
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
\def\lf{\textit{lf}}
\def\PD{\textit{PD}}
\def\suffix{\textit{Suffix}}
\def\distinctBy{\textit{distinctBy}}
\def\starupdate{\textit{starUpdate}}
\def\starupdates{\textit{starUpdates}}
\def\nupdate{\textit{nupdate}}
\def\nupdates{\textit{nupdates}}

\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
\def\blexerStrong{\textit{blexerStrong}}
\def\bsimpStrong{\textit{bsimpStrong}}
\def\bdersStrongs{\textit{bdersStrong}}
\newcommand{\bdersStrong}[2]{#1 \backslash_{bsimpStrongs} #2}

\def\map{\textit{map}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable} \; #1 }
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)_{\downarrow_r}}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}
\def\rflts{\textit{rflts}}
\def\rrewrite{\textit{rrewrite}}
\def\bsimpalts{\textit{bsimp}_{ALTS}}
\def\bsimpaseq{\textit{bsimp}_{ASEQ}}
\def\rsimlalts{\textit{rsimp}_{ALTs}}
\def\rsimpseq{\textit{rsimp}_{SEQ}}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
%\def\not{\textit{not}}

\def\RZERO{\mathbf{0}_r }
\def\RONE{\mathbf{1}_r}
\newcommand\RCHAR[1]{\mathbf{#1}_r}
\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\sum #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{Suffix} \;#1\;#2}

\lstdefinestyle{myScalastyle}{
  language=scala,
  aboveskip=3mm,
  belowskip=3mm,
  showstringspaces=false,
  columns=flexible,
  basicstyle={\small\ttfamily},
  numbers=none,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{blue},
  commentstyle=\color{dkgreen},
  stringstyle=\color{mauve},
  frame=single,
  breaklines=true,
  breakatwhitespace=true,
  tabsize=3,
}

%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like
Regular expressions are widely used in computer science:
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text-processing; network intrusion
detection systems that inspect suspicious traffic; or compiler
front ends.
Given their usefulness and ubiquity, one would assume that
modern regular expression matching implementations
are mature and fully studied.
Indeed, supplying a popular programming language's regex engine
with regular expressions and strings will in most cases
yield the matching information in a very short time.
These matchers can be blindingly fast---some
network intrusion detection systems
use regex engines that are able to process
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
However, these matchers can exhibit a surprising security vulnerability
under a certain class of inputs.
%However, this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed

Consider $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ can be matched by this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length of around 30 $a$'s,
the decision takes an absurd amount of time to finish (see the graphs in figure~\ref{fig:aStarStarb}).
This is clearly exponential behaviour, and
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions improve this behaviour somewhat, but it is still slow compared
with the approach we are going to use in this thesis.
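This exponential behaviour is easy to reproduce. Below is a minimal Python sketch; Python's built-in \texttt{re} engine is one of the backtracking matchers mentioned above, and the string lengths are deliberately kept small so that the runs finish quickly:

```python
import re
import time

# A backtracking engine tries all ways of splitting the a's between
# the two nested stars before concluding that the required b is missing.
evil = re.compile(r'(a*)*b')

for n in [14, 16, 18, 20]:
    s = 'a' * n                      # no trailing b, so matching must fail
    start = time.perf_counter()
    assert evil.match(s) is None
    print(n, time.perf_counter() - start)  # runtime grows rapidly with n
```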


This superlinear blowup in regular expression engines
has repeatedly caused grief in ``real life'', where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.

\begin{figure}[p]
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Dart},
    legend pos=north west,
    legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Swift},
    legend pos=north west,
    legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5000,...,40000},
    %xmax=40000,
    %ymax=35,
    restrict x to domain*=0:40000,
    restrict y to domain*=0:35,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java9+},
    legend pos=north west,
    legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\caption{Graphs showing the runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
The reason for their superlinear behaviour is that they do a depth-first search
using NFAs.
If the string does not match, the regular expression matching
engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited catastrophic backtracking
and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually widespread.
They occur so often that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attack.
\citeauthor{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and PyPI.
They therefore concluded that evil regular expressions
are real problems rather than ``a parlour trick''.

This work aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness (in
Isabelle/HOL)
and a finiteness property.
Such properties %guarantee the absence of
are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in figure~\ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our
approach based on Brzozowski derivatives and formal proofs.


\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
with respect to the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs, while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.


%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
Depending on the types of constructs,
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are just syntactic sugar, that is,
shorthand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
	(a | b | c) \stackrel{means}{=} ((a + b) + c)
\]
Similarly, the range operator is just a concise way of
writing an alternative of all characters between its operands:
\[
	[0\mbox{-}9] \stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all digits)}
\]
The wildcard character $.$ is used to refer to any single character
except the newline:
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
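Such desugarings can be checked mechanically. A small Python sketch, using Python's \texttt{re} syntax for the sugared forms (\texttt{fullmatch} requires the whole string to match):

```python
import re

# [0-9] is just sugar for a ten-way alternative of the digit characters.
sugared  = re.compile(r'[0-9]')
expanded = re.compile(r'0|1|2|3|4|5|6|7|8|9')

# Both accept exactly the same single characters.
for c in map(chr, range(32, 127)):        # all printable ASCII characters
    assert bool(sugared.fullmatch(c)) == bool(expanded.fullmatch(c))

# A three-way alternative means the same as its nested binary form.
assert all(bool(re.fullmatch(r'a|b|c', c)) == bool(re.fullmatch(r'(a|b)|c', c))
           for c in 'abcd')
```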

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make regular expressions much
more compact.
There are
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the star regular expression, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
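The power operation on sets used in these definitions can be spelled out directly. A minimal Python sketch (the names \texttt{conc} and \texttt{power} are mine, introduced only for this illustration):

```python
from functools import reduce

def conc(A, B):
    # language concatenation: { a ++ b | a in A, b in B }
    return {a + b for a in A for b in B}

def power(A, n):
    # A^n with A^0 = {""}, matching the definitions above
    return reduce(conc, [A] * n, {''})

L_r = {'a', 'bb'}                                   # a small example language L(r)

L_exact = power(L_r, 3)                             # L(r^{3})
L_upto  = set().union(*(power(L_r, i) for i in range(0, 3)))   # L(r^{..2})

assert 'abba' in L_exact          # a . bb . a
assert '' in L_upto and 'abb' in L_upto
assert 'a' not in L_exact         # too short for exactly three iterations
```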
The attraction of bounded repetitions is that they can be
used to avoid a blow-up: for example $r^{\{n\}}$
is a shorthand for
\[
	\underbrace{r\ldots r}_{\text{$n$ copies of $r$}}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.


The problem with matching
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
either become excruciatingly slow
(for example Verbatim++~\cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{which are lexer generators
in C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given a regular expression containing bounded repetitions,
a small $n$ (a few dozen) would result in a
determinised automaton
with millions of states.}) for large counters.
A classic example of this phenomenon is the regular expression $(a+b)^* a (a+b)^{\{n\}}$,
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like this:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial]  (q_0)  {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
          edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
When turned into a DFA by the subset construction, it
requires at least $2^3$ states.\footnote{The
red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match two more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed one character, and just waits
for one more character to complete.
State $q_3$ is the last state, requiring no more characters, and is accepting.
Depending on the part of the
input string read up to the current location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalising this to regular expressions with larger
bounded repetition counts, we have that
regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
would require at least $2^{n+1}$ states if $r$ itself contains
more than one string.
This is to represent all the different
scenarios in which ``countdown'' states are active.}


Bounded repetitions are very important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib,
in the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
up to approximately ten million.
An example XSD they gave
is:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the expression
$(ab^{\{2\ldots 12\}})^{\{0 \ldots 65535\}}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
conforming to the floating point number format).
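In engines that support bounded repetitions natively, such nested counters are unproblematic to write down. A sketch in Python's \texttt{re} syntax, with the sub-expressions $a$ and $b$ shrunk to single characters (so the pattern is only an illustration of the counter syntax, not of the actual XSD):

```python
import re

# (a b^{2..12})^{0..65535}, counters taken from minOccurs/maxOccurs above
pattern = re.compile(r'(?:ab{2,12}){0,65535}')

assert pattern.fullmatch('')                 # zero repetitions are allowed
assert pattern.fullmatch('abb' + 'abbb')     # two repetitions
assert pattern.fullmatch('a' + 'b' * 12)     # inner counter's upper bound
assert not pattern.fullmatch('ab')           # one b is below the lower bound
```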
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
on bounded regular expressions:
for example, in the regular expression matching library of the Go
language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
can be above 1000; and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
As Becchi and Crawley~\cite{Becchi08} have pointed out,
the reason for these restrictions
is that such libraries simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states can
be as large as the counter value.
When the counters are large,
the memory requirement can become
infeasible, and a regex engine
like Go's will reject such patterns straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

    \node (q0) [state, initial] {$0$};
    \node (q1) [state, right = of q0] {$1$};
    %\node (q2) [state, right = of q1] {$2$};
    \node (qdots) [right = of q1] {$\ldots$};
    \node (qn) [state, right = of qdots] {$n$};
    \node (qn1) [state, right = of qn] {$n+1$};
    \node (qn2) [state, right = of qn1] {$n+2$};
    \node (qn3) [state, accepting, right = of qn2] {$n+3$};

    \path [-stealth, thick]
    (q0) edge [loop above] node {a} ()
    (q0) edge node {a} (q1)
    %(q1) edge node {.} (q2)
    (q1) edge node {.} (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
%   \node[state,initial]  (q_0)  {$0$};
%   \node[state, ] (q_1) [right=of q_0] {$1$};
%   \node[state, ] (q_2) [right=of q_1] {$2$};
%   \node[state,
%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
%    \path[->]
%    (q_0) edge  node {a} (q_1)
%          edge [loop below] node {a,b} ()
%    (q_1) edge  node  {a,b} (q_2)
%    (q_2) edge  node  {a,b} (q_3);
%\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
showing that NFA simulation can consume large
amounts of memory: $.^*a.^{\{n\}}bc$ matching
strings of the form $aaa\ldots aaaabc$.
When traversing in a breadth-first manner,
all states from $0$ to $n+1$ will become active.}
\end{figure}
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabytes of input strings per second
even with large counters \cite{Becchi08}.
But formal reasoning about these automata, especially in Isabelle,
can be challenging
and unintuitive.
Therefore, we take correctness and runtime claims made about these solutions
with a grain of salt.

In the work reported in \cite{CSL2022} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
The results
extend straightforwardly to
repetitions with an interval such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only
inductive datatypes and recursive functions,
making them convenient to reason about in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
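To give a flavour of this extension, below is a minimal derivative-based matcher with a constructor for $r^{\{n\}}$, sketched in Python rather than in the Isabelle/HOL of the actual formalisation; the clause for \textit{NTimes} follows the scheme $(r^{\{n\}})\backslash c = (r\backslash c)\cdot r^{\{n-1\}}$:

```python
from dataclasses import dataclass

# Regular expressions as an inductive datatype, extended with
# bounded repetition r^{n} (NTimes).
@dataclass
class Zero: pass                       # matches nothing
@dataclass
class One: pass                        # matches only the empty string
@dataclass
class Char: c: str
@dataclass
class Alt: r1: object; r2: object
@dataclass
class Seq: r1: object; r2: object
@dataclass
class Star: r: object
@dataclass
class NTimes: r: object; n: int        # r^{n}

def nullable(r):
    if isinstance(r, (One, Star)): return True
    if isinstance(r, Alt): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq): return nullable(r.r1) and nullable(r.r2)
    if isinstance(r, NTimes): return r.n == 0 or nullable(r.r)
    return False                       # Zero and Char

def der(c, r):
    if isinstance(r, (Zero, One)): return Zero()
    if isinstance(r, Char): return One() if r.c == c else Zero()
    if isinstance(r, Alt): return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        rest = Seq(der(c, r.r1), r.r2)
        return Alt(rest, der(c, r.r2)) if nullable(r.r1) else rest
    if isinstance(r, Star): return Seq(der(c, r.r), Star(r.r))
    # NTimes: r^{0} can only match "", otherwise peel off one copy
    return Zero() if r.n == 0 else Seq(der(c, r.r), NTimes(r.r, r.n - 1))

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

a3 = NTimes(Char('a'), 3)              # a^{3}
assert matches(a3, 'aaa')
assert not matches(a3, 'aa') and not matches(a3, 'aaaa')
```

Without simplification the intermediate derivatives grow quickly; the simplification functions and the finiteness bound discussed later keep this growth in check.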




\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called a ``depth-first search'',
is efficient in many cases, but can end up
with exponential runtime.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply
%construct the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.

Given a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
	$r_1r_2r_3r_4$
\end{center}
one could label sub-expressions of interest
by parenthesizing them and giving
them a number by the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers.
%These sub-expressions are called "capturing groups".
To do so, we use the syntax $\backslash i$
to denote that we want the sub-string
of the input just matched by the $i$-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{$i$-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
of back-references is
\begin{center}
$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of back-references is
\begin{center}
$(.)(.)\backslash 2\backslash 1$,
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
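To see these two examples in action, here is a small illustrative check (not part of our formalisation) using Python's \texttt{re} module, whose $\backslash 1$ and $\backslash 2$ correspond to the back-references above:

```python
import re

# Back-references in Python's re module: \1 and \2 refer back to the
# text captured by groups 1 and 2.
double = re.fullmatch(r'(.*)\1', 'foofoo')      # two identical halves
palin  = re.fullmatch(r'(.)(.)\2\1', 'abba')    # four-character palindrome
print(double.group(1))  # the repeated half
print(palin.group(1), palin.group(2))
```

By contrast, \texttt{re.fullmatch(r'(.*)\textbackslash 1', 'foobar')} fails, since no split of $\mathit{foobar}$ yields two identical halves.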

Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to match well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
	$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes
goes beyond the regular language hierarchy
once back-references are included.
In fact, they allow regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Also, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm which backtracks.
What is unexpected is that even in cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in figure~\ref{fig:aStarStarb}.

Summing these up, we can categorise existing
practical regex libraries into two kinds:
(i) the ones with linear
time guarantees like Go and Rust. The downside with them is that
they impose restrictions
on the regular expressions (not allowing back-references,
bounded repetitions cannot exceed an ad hoc limit, etc.);
(ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
They can potentially ``grind to a halt''
on some very simple cases, resulting in
ReDoS attacks.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.

%We would like to have regex engines that can
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.

\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
This strategy is adopted by many regular expression matchers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit-test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$.
\end{center}
The correct POSIX match for the above
is the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}\footnote{
	regex101 is an online regular expression matcher which
	provides an API for trying out the regular expression
	engines of multiple popular programming languages like
Java, Python, Go, etc.}
with different engines always yields
the match $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
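This behaviour can be reproduced directly; for instance Python's backtracking \texttt{re} engine (an illustration only, not part of our development) picks the leftmost alternative first and never revisits that choice unless forced:

```python
import re

# Python's re follows PCRE-style leftmost-first (greedy) disambiguation
# rather than POSIX leftmost-longest: for (aba|ab|a)* on "ababa" the
# first star iteration picks "aba"; the leftover "ba" matches no
# alternative, so the star stops after one iteration.
m = re.match(r'(aba|ab|a)*', 'ababa')
print(m.span(), m.group(0))  # a POSIX matcher would match all of "ababa"
```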
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries are not
correctly implementing the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}
	``The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
We think the implementation complexity of POSIX rules also comes from
the specification not being very precise.
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as follows (section 1, last paragraph):
\begin{quote}
	\begin{itemize}
		\item
regular expressions (REs) take the leftmost starting match, and the longest match starting there\\
		\item
earlier subpatterns have leftmost-longest priority over later subpatterns\\
		\item
higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
		\item
REs have right associative concatenation which can be changed with parenthesis\\
		\item
parenthesized subexpressions return the match from their last usage\\
		\item
text of component subexpressions must be contained in the text of the
higher-level subexpressions\\
		\item
if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions\\
		\item
if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string\\
	\end{itemize}
\end{quote}
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ausaf et al.\ \cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also to prove
that their specification coincides with the
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu is not suitable for the correctness proof.

In the next section we will briefly
introduce Brzozowski derivatives and Sulzmann
and Lu's algorithm, on which the main contribution of this thesis builds.
%We give a taste of what they
%are like and why they are suitable for regular expression
%matching and lexing.
\section{Formal Specification of POSIX Matching
and Brzozowski Derivatives}
%Now we start with the central topic of the thesis: Brzozowski derivatives.
Brzozowski \cite{Brzozowski1964} first introduced the
concept of a \emph{derivative} of a regular expression in 1964.
The derivative of a regular expression $r$
with respect to a character $c$ is written as $r \backslash c$.
This operation tells us what $r$ transforms into
if we ``chop'' off the first character $c$
from all strings in the language of $r$ (defined
later as $L \; r$).
%To give a flavour of Brzozowski derivatives, we present
%two straightforward clauses from it:
%\begin{center}
%	\begin{tabular}{lcl}
%		$d \backslash c$     & $\dn$ &
%		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
%$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
%	\end{tabular}
%\end{center}
%\noindent
%The first clause says that for the regular expression
%denoting a singleton set consisting of a single-character string $\{ d \}$,
%we check the derivative character $c$ against $d$,
%returning a set containing only the empty string $\{ [] \}$
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
%The second clause states that to obtain the regular expression
%representing all strings' head character $c$ being chopped off
%from $r_1 + r_2$, one simply needs to recursively take derivative
%of $r_1$ and $r_2$ and then put them together.
Derivatives have the property
that $s \in L \; (r\backslash c)$ if and only if
$c::s \in L \; r$, where $::$ stands for list prepending.
%This property can be used on regular expressions
%matching and lexing--to test whether a string $s$ is in $L \; r$,
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
With this, derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
There have been several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.

In addition, one can extend derivatives to bounded repetitions
relatively straightforwardly. For example, the derivative of
$r^{\{n\}}$ (for $n > 0$) can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot
		r^{\{n-1\}}$\\
	\end{tabular}
\end{center}
\noindent
Experimental results suggest that unlike DFA-based solutions
for bounded regular expressions,
derivatives can cope with
large counters
quite well.

There have also been
extensions to other constructs.
For example, Owens et al.\ include the derivative
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
$/* \ldots */$ (see \cite{Owens2008}).
Another extension of derivatives is to
regular expressions with look-aheads, done
by Miyazaki and Minamide
\cite{Takayuki2019}.
%We therefore use Brzozowski derivatives on regular expressions
%lexing


Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
in generating a formally verified algorithm for lexing: the clauses and property
can be easily expressed in a functional programming language
or converted to theorem prover
code, with great extensibility.
Perhaps this is the reason why they have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years
(\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012},
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published by Brzozowski.

However, there are two difficulties with derivative-based matchers.
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
Second, derivative-based matchers need to be more efficient.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow.
For example, Sulzmann and Lu
claim a linear running time of their proposed algorithm,
but that was falsified by our experiments. The running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$
is made for the Verbatim \cite{Verbatim}
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalised the correctness of the lexer, but not their complexity result.
In the performance evaluation section, they analysed the run time
of matching $a$ with the string
\begin{center}
	$\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$
\end{center}
\noindent
and concluded that the algorithm is quadratic in terms of
the length of the input string.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.

\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
problem with the yes/no answer
by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function.
In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm
\cite{AusafDyckhoffUrban2016}.

The version of the algorithm proven correct
suffers heavily from a
second difficulty, where the internal derivatives can
grow to arbitrarily big sizes.
For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer. Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in
the work by Ausaf et al.\ to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.

Therefore we want to look in this thesis at a second
algorithm by Sulzmann and Lu where they
overcame this ``growth problem'' \cite{Sulzmann2014}.
In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):

\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
\noindent
Ausaf and Urban made some initial progress towards the
full correctness proof but still had to leave out the optimisation
Sulzmann and Lu proposed.
Ausaf wrote \cite{Ausaf}:
\begin{quote}\it
	``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}
\noindent
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of the correctness of those simplifications.

One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher \cite{Verbatim}.
This is relevant work for us and we will compare it later with
the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++ \cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.


%----------------------------------------------------------------------------------------
\section{Contribution}

In this thesis,
we propose a solution to catastrophic
backtracking and error-prone matchers: a formally verified
regular expression lexing algorithm
that is both fast
and correct, obtained by extending Ausaf et al.'s work.
The end result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
an improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives with simplifications,
accompanied by
a proven correctness theorem according to the POSIX specification
given by Ausaf et al.\ \cite{AusafDyckhoffUrban2016},
\item
a complexity-related property for that algorithm stating that the
internal data structure will
remain finite,
\item
and an extension to
the bounded repetitions construct with the correctness and finiteness properties
maintained.
\end{itemize}
\noindent
With a formal finiteness bound in place,
we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
Further improvements to the algorithm with an even stronger version of
simplification can be made.

\section{Structure of the thesis}
In chapter \ref{Inj} we will introduce the concepts
and notations we
use for describing regular expressions and derivatives,
and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including
its correctness proof).
We will give their second lexing algorithm with bitcodes in chapter \ref{Bitcoded1}
together with the correctness proof by Ausaf and Urban.
Then we illustrate in chapter \ref{Bitcoded2}
how Sulzmann and Lu's
simplifications fail to simplify. We therefore introduce our version of the
algorithm with simplification and
its correctness proof.
In chapter \ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is a finite bound on the size of any
regular expression's derivatives.
In chapter \ref{Cubic} we discuss stronger simplification rules which
improve the finite bound to a polynomial bound, and also show how one can extend the
algorithm to include bounded repetitions. %and the NOT regular expression.


%----------------------------------------------------------------------------------------