% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Introduction} % For referencing the chapter elsewhere, use \ref{Introduction}
%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

%boxes
\newcommand*{\mybox}[1]{\framebox{\strut #1}}

%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
\newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimp} #2}
\def\derssimp{\textit{ders}\_\textit{simp}}
\def\rders{\textit{rders}}
\newcommand{\bders}[2]{#1 \backslash #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
\newcommand{\rsimp}[1]{\textit{rsimp}\; #1}
\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
\newcommand{\rdistinct}[2]{\textit{rdistinct} \;\; #1 \;\; #2}
\def\rDistinct{\textit{rdistinct}}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}

\def\case{\textit{case}}
\def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
\def\rsimpalts{\textit{rsimp}_{ALTS}}
\def\good{\textit{good}}
\def\btrue{\textit{true}}
\def\bfalse{\textit{false}}
\def\bnullable{\textit{bnullable}}
\def\bnullables{\textit{bnullables}}
\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}
\newcommand{\rder}[2]{#2 \backslash_r #1}

\def\rerases{\textit{rerase}}

\def\nonnested{\textit{nonnested}}
\def\AZERO{\textit{AZERO}}
\def\sizeNregex{\textit{sizeNregex}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\simpsulz{\textit{simp}_{Sulz}}

\def\scfrewrites{\stackrel{*}{\rightsquigarrow_{scf}}}
\def\frewrite{\rightsquigarrow_f}
\def\hrewrite{\rightsquigarrow_h}
\def\grewrite{\rightsquigarrow_g}
\def\frewrites{\stackrel{*}{\rightsquigarrow_f}}
\def\hrewrites{\stackrel{*}{\rightsquigarrow_h}}
\def\grewrites{\stackrel{*}{\rightsquigarrow_g}}
\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\der{\textit{der}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\bmkepss{\textit{bmkepss}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\textit{inj}}
\def\Empty{\textit{Empty}}
\def\Left{\textit{Left}}
\def\Right{\textit{Right}}
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
\def\lf{\textit{lf}}
\def\PD{\textit{PD}}
\def\suffix{\textit{Suffix}}
\def\distinctBy{\textit{distinctBy}}
\def\starupdate{\textit{starUpdate}}
\def\starupdates{\textit{starUpdates}}

\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
\def\blexerStrong{\textit{blexerStrong}}
\def\bsimpStrong{\textit{bsimpStrong}}
%\def\bdersStrong{\textit{bdersStrong}}
\newcommand{\bdersStrong}[2]{#1 \backslash_{bsimpStrongs} #2}

\def\map{\textit{map}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable} \; #1 }
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)_{\downarrow_r}}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}

\def\rflts{\textit{rflts}}
\def\rrewrite{\textit{rrewrite}}
\def\bsimpalts{\textit{bsimp}_{ALTS}}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
\def\not{\textit{not}}

\def\RZERO{\mathbf{0}_r }
\def\RONE{\mathbf{1}_r}
\newcommand\RCHAR[1]{\mathbf{#1}_r}
\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\sum #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{Suffix} \;#1\;#2}
\lstdefinestyle{myScalastyle}{
  language=scala,
  aboveskip=3mm,
  belowskip=3mm,
  showstringspaces=false,
  columns=flexible,
  basicstyle={\small\ttfamily},
  numbers=none,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{blue},
  commentstyle=\color{dkgreen},
  stringstyle=\color{mauve},
  frame=single,
  breaklines=true,
  breakatwhitespace=true,
  tabsize=3,
}


%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\caption{Runtime of matching $(a^*)^*\,b$ against strings of the form
$\underbrace{aa..a}_{n}$ in JavaScript, Python and Java~8.}
\label{fig:aStarStarb}
\end{figure}


Regular expressions are widely used in computer science:
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text-processing; network intrusion
detection systems that reject suspicious traffic; or compiler
front ends. The majority of the solutions to these tasks
involve lexing with regular expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
Indeed, supplying a popular programming language's regex engine
with regular expressions and strings, one can
obtain rich matching information in a very short time.
Some network intrusion detection systems
use regex engines that are able to process
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
Unfortunately, this is not the case for \textbf{all} inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed


Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s, one discovers that
this decision takes an unreasonably long time given the
simplicity of the problem.
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns, as the graphs
in Figure~\ref{fig:aStarStarb} show.
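
This behaviour is easy to reproduce on the JVM. The following short
Scala sketch (our own illustration; it assumes nothing beyond the
standard library) times Java's built-in backtracking engine on exactly
this example; the measured runtime roughly doubles with each
additional $a$:
\begin{lstlisting}[style=myScalastyle]
import java.util.regex.Pattern

object CatastrophicBacktracking {
  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a*)*b")
    for (n <- 20 to 30) {
      val input = "a" * n      // n a's and no final b: no match exists
      val start = System.nanoTime()
      evil.matcher(input).matches()
      val secs  = (System.nanoTime() - start) / 1e9
      println(f"n = $n%2d  time = $secs%.2f s")
    }
  }
}
\end{lstlisting}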

This superlinear blowup in regular expression engines
has repeatedly caused grief in real life.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.
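
The quadratic behaviour of such trimming regexes is easy to
demonstrate. Below is a small Scala sketch (ours; the pattern is an
illustrative trailing-whitespace regex, not the exact expression used
by Stack Exchange). On a line of $n$ spaces followed by a non-space,
the backtracking engine restarts the match attempt at every position,
giving $O(n^2)$ time:
\begin{lstlisting}[style=myScalastyle]
import java.util.regex.Pattern

object QuadraticTrim {
  def main(args: Array[String]): Unit = {
    val trim = Pattern.compile("\\s+$")  // trim trailing whitespace
    for (n <- List(10000, 20000, 40000)) {
      val line  = " " * n + "x"          // n spaces, then a non-space
      val start = System.nanoTime()
      trim.matcher(line).find()
      val secs  = (System.nanoTime() - start) / 1e9
      println(f"n = $n%6d  time = $secs%.2f s")
    }
  }
}
\end{lstlisting}
Doubling $n$ roughly quadruples the measured time.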

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
only occasionally; they are actually widespread.
They occur so often that they have been given a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
\citeauthor{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi.
They therefore concluded that evil regular expressions
are a problem ``more than a parlour trick'', one that
requires more research attention.


But the problems are not limited to slowness on certain
cases.
Another problem with these libraries is that they
offer no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be
the regex
\begin{verbatim}
(aba|ab|a)*
\end{verbatim}
and the string
\begin{verbatim}
ababa
\end{verbatim}
The correct $\POSIX$ match for the above is
the entire string $ababa$,
split into two Kleene star iterations, $ab$ and $aba$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}
with different language engines yields
the same two fragmented matches: $aba$ at $[0, 3)$
and $a$ at $[4, 5)$.

Kuklewicz \parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.

As Grathwohl \parencite{grathwohl2014crash} commented,
\begin{center}
	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{center}


To summarise the above: regular expressions are important.
They are popular, and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are two problems with current practical implementations.
The first is that the running time might blow up.
The second is that they might give wrong results on certain
very simple cases.
In the next part of the chapter, we will look into the reasons why
certain regex engines run horribly slowly on ``catastrophic''
cases, and propose a solution that addresses both of these problems,
based on the work of Brzozowski and of Sulzmann and Lu.


\section{Why are current regex engines slow?}

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in
each of them:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.

\subsection{Different Phases of a Matching/Lexing Algorithm}


Most lexing algorithms can be roughly divided into
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of this phase by $P_1(r)$.
The second phase is the lexing phase, in which the input string
$s$ is read and the data structure
representing the regex $r$ is operated on.
We denote the time
this phase takes by $P_2(r, s)$.\\

For a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition:
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately, in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.


For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into $\underbrace{r \cdots r}_{n \text{ copies of } r}$.
$P_2(r, s)$ is bounded by $O(|r|\cdot|s|)$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $O(|r| \cdot 2^{|s|})$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}


The good thing about $\mathit{DFA}$s is that, once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.

\subsubsection{Problems with Bounded Repetitions}
A bounded repetition, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denotes a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring is in $r$.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
The red states are ``countdown states'', which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match two more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed one character, and just waits
for one more character to complete.
$q_3$ is the last state; it requires no more characters and is accepting.
Depending on the part of the
input string that has been read so far,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalising this to regular expressions with a larger
bounded repetition number, we have that
regexes of the shape $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
require at least $2^{n+1}$ states if $r$ contains
more than one string.
These states are needed to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools that use $\DFA$s will run
out of memory.
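
The blowup is easy to observe computationally. The following Scala
sketch (our own illustration, not part of the formal development) runs
the subset construction on the countdown $\mathit{NFA}$ for
$(a|b)^*a(a|b)^{\{n\}}$, encoded with state $0$ looping on $a,b$ and
moving to $1$ on $a$, and states $1$ to $n$ counting down; it counts
the reachable $\mathit{DFA}$ states, which grow roughly as $2^{n+1}$:
\begin{lstlisting}[style=myScalastyle]
object CountdownBlowup {
  // one NFA step for the countdown automaton with states 0..n+1
  def step(states: Set[Int], c: Char, n: Int): Set[Int] =
    states.flatMap { q =>
      if (q == 0) { if (c == 'a') Set(0, 1) else Set(0) }
      else if (q <= n) Set(q + 1)
      else Set[Int]()                  // n+1 is the accepting state
    }

  // subset construction: count the reachable sets of NFA states
  def dfaStates(n: Int): Int = {
    var seen     = Set(Set(0))
    var frontier = List(Set(0))
    while (frontier.nonEmpty) {
      val next = for (s <- frontier; c <- List('a', 'b'))
                 yield step(s, c, n)
      frontier = next.filterNot(seen).distinct
      seen     = seen ++ frontier
    }
    seen.size
  }

  def main(args: Array[String]): Unit =
    for (n <- 1 to 10)
      println(s"n = $n  reachable DFA states = ${dfaStates(n)}")
}
\end{lstlisting}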

\subsubsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above countdown regular expression,
already a small number $n$ would result in a determinised automaton
with millions of states.

For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.




\subsection{Why $\mathit{NFA}$s can be slow in the second phase}
When one constructs an $\NFA$ out of a regular expression
there is often very little to be done in the first phase: one simply
constructs the $\NFA$ states based on the structure of the input regular expression.

In the lexing phase, one can simulate the $\mathit{NFA}$ in two ways.
The first is to keep track of all active states after consuming
a character, and to update that set of states iteratively.
This can be viewed as a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation, and it guarantees a linear runtime
in terms of the input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called ``depth-first search'',
is efficient in a lot of cases, but can end up
with exponential run time.
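
The breadth-first simulation is only a few lines of code. The
following sketch (a simplification we give for illustration, with
character-labelled transitions and no $\epsilon$-edges) advances a
whole set of states per input character, so each of the $|s|$ steps
touches at most as many states as the automaton has:
\begin{lstlisting}[style=myScalastyle]
object NfaSimulation {
  // an NFA given by start states, transitions and an accepting test
  case class Nfa(starts: Set[Int],
                 delta: (Int, Char) => Set[Int],
                 accepts: Int => Boolean)

  // breadth-first: fold over the input, keeping all active states
  def matches(nfa: Nfa, s: String): Boolean =
    s.foldLeft(nfa.starts) { (active, c) =>
      active.flatMap(q => nfa.delta(q, c))
    }.exists(nfa.accepts)

  def main(args: Array[String]): Unit = {
    // the countdown NFA for (a|b)*a(a|b){2} from the previous section
    val delta = (q: Int, c: Char) =>
      if (q == 0) { if (c == 'a') Set(0, 1) else Set(0) }
      else if (q <= 2) Set(q + 1)
      else Set[Int]()
    val nfa = Nfa(Set(0), delta, _ == 3)
    println(matches(nfa, "abaab"))   // true
    println(matches(nfa, "abba"))    // false
  }
}
\end{lstlisting}
No matter how many paths through the $\mathit{NFA}$ exist, this
simulation does a bounded amount of work per input character.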
%TODO:COMPARE java python lexer speed with Rust and Go
The reason that backtracking is used in languages like
Java and Python is that they need to support back-references.

\subsubsection{Back References}
If we have a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
	$r_1(r_2(r_3r_4))$
\end{center}
we could label sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions $r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4: $1$ refers to the entire expression
$(r_1(r_2(r_3)(r_4)))$, $2$ refers to $r_2(r_3)(r_4)$, etc.
These sub-expressions are called ``capturing groups''.
We can use the following syntax to denote that we want a string just matched by a
sub-expression (capturing group) to appear at a certain location again,
exactly as it was:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
The backslash and number $i$ are used to denote such
so-called ``back-references''.
Let $e$ be an expression made of regular expressions
and back-references, and let $e$ contain the expression $e_i$
as its $i$-th capturing group.
The semantics of back-references can be recursively
written as:
\begin{center}
	\begin{tabular}{c}
		$L(e \cdot \backslash i) = \{s @ s_i \mid s \in L(e),\; s_i \in L(e_i),$\\
		$s_i \;\text{is the string matched by the $i$-th capturing group of $e$ on $s$}\}$
	\end{tabular}
\end{center}
As a concrete example, the regex
$((a|b|c|\ldots|z)^*)\backslash 1$
matches strings like $\mathit{bobo}$, $\mathit{weewee}$, etc.\\
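In mainstream engine syntax this is written with a backslash-escaped
group number. For instance, the following sketch (ours) shows the JVM
engine accepting exactly such ``squared'' strings:
\begin{lstlisting}[style=myScalastyle]
import java.util.regex.Pattern

object BackrefDemo {
  def main(args: Array[String]): Unit = {
    // ((a|b|...|z)*)\1 : some lowercase word repeated twice
    val squared = Pattern.compile("([a-z]*)\\1")
    println(squared.matcher("bobo").matches())   // true, group 1 = "bo"
    println(squared.matcher("weewee").matches()) // true, group 1 = "wee"
    println(squared.matcher("bob").matches())    // false
  }
}
\end{lstlisting}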
+ − 599
Back-reference is a construct in the "regex" standard
+ − 600
that programmers found useful, but not exactly
+ − 601
regular any more.
+ − 602
In fact, that allows the regex construct to express
+ − 603
languages that cannot be contained in context-free
+ − 604
languages either.
+ − 605
For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
+ − 606
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
+ − 607
which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
+ − 608
Such a language is contained in the context-sensitive hierarchy
+ − 609
of formal languages.
+ − 610
Solving the back-reference expressions matching problem
+ − 611
is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
+ − 612
efficient solution is not known to exist.
+ − 613
%TODO:read a bit more about back reference algorithms

It seems that languages like Java and Python made the trade-off
to support back-references at the expense of having to backtrack,
even in the case of regexes not involving back-references.\\
Summing these up, we can categorise existing
practical regex libraries into ones with linear
time guarantees, like those of Go and Rust, which impose restrictions
on the user input (not allowing back-references,
bounded repetitions cannot exceed 1000, etc.), and ones
that allow the programmer much freedom, but grind to a halt
on some non-negligible portion of cases.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
So we have practical implementations
of regular expression matching/lexing which are fast
but do not come with any guarantees that they will not grind to a halt
or give wrong answers.
Our goal is to have a regex lexing algorithm that comes with
\begin{itemize}
\item
proven correctness
\item
proven non-catastrophic properties
\item
easy extensions to
constructs like
 bounded repetitions, negation, lookarounds, and even back-references.
 \end{itemize}

\section{Our Solution: Formal Specification of POSIX Matching and Brzozowski Derivatives}
We propose Brzozowski derivatives of regular expressions as
a solution to these problems.
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.

\subsection{Motivation}

Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
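
This slogan translates almost verbatim into code. Below is a minimal
sketch in Scala (the datatype and function names are ours for this
illustration; the thesis fixes its own notation later):
\begin{lstlisting}[style=myScalastyle]
// basic regular expressions
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only ""
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

object Matcher {
  // can r match the empty string?
  def nullable(r: Rexp): Boolean = r match {
    case ZERO        => false
    case ONE         => true
    case CHAR(_)     => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_)     => true
  }

  // the derivative of r w.r.t. the character c
  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO        => ZERO
    case ONE         => ZERO
    case CHAR(d)     => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
  }

  // r matches s iff the iterated derivative is nullable
  def matches(r: Rexp, s: String): Boolean =
    nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
}
\end{lstlisting}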

The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions.
Moreover, an algorithm based on them by
Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
to include extended regular expressions and
simplification of the internal data structures,
eliminating the exponential behaviours.

However, two difficulties with derivative-based matchers exist:
\subsubsection{Problems with Current Brzozowski Matchers}
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not.  This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function.  In our own earlier work, we provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.
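
To give a flavour of this second phase: once all derivatives for a
matching string have been taken, a \emph{value} recording how the
regex matched is first built for the empty string by a function
usually called $\mkeps$, and then transformed back through the
derivatives by the injection function. A sketch of the value datatype
and of $\mkeps$, building on the $\textit{Rexp}$ datatype above
(constructor names are ours; $\textit{LeftV}$/$\textit{RightV}$
merely avoid clashing with Scala's built-in $\textit{Either}$), is:
\begin{lstlisting}[style=myScalastyle]
// values record how a regular expression matched a string
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class LeftV(v: Val) extends Val
case class RightV(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

object Lexing {
  import Matcher._    // nullable, from the sketch above

  // how a regex matches the empty string; only ever called on
  // nullable regexes, hence the partial pattern match
  def mkeps(r: Rexp): Val = r match {
    case ONE         => Empty
    case ALT(r1, r2) =>
      if (nullable(r1)) LeftV(mkeps(r1)) else RightV(mkeps(r2))
    case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
    case STAR(_)     => Stars(Nil)
  }
}
\end{lstlisting}
The injection function then adds back, one character at a time, the
characters consumed by the derivatives; it is part of the algorithm
described in Chapter~\ref{Inj}.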

The second difficulty is that Brzozowski's derivatives can
grow to arbitrarily big sizes. For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer (we shall define shortly the precise details of our
regular expressions and the derivative operation).  Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in our
earlier work to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
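
These rules are easy to bolt onto the derivative matcher sketched
earlier: apply them bottom-up after each derivative step. A minimal
version (our illustration; not yet the simplification studied in this
thesis) looks as follows:
\begin{lstlisting}[style=myScalastyle]
object SimpleSimp {
  import Matcher._  // der and nullable; Rexp is defined at top level

  // 0 r => 0, r 0 => 0, 1 r => r, r 1 => r, 0 + r => r, r + r => r
  def simp(r: Rexp): Rexp = r match {
    case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
      case (ZERO, _)  => ZERO
      case (_, ZERO)  => ZERO
      case (ONE, s2)  => s2
      case (s1, ONE)  => s1
      case (s1, s2)   => SEQ(s1, s2)
    }
    case ALT(r1, r2) => (simp(r1), simp(r2)) match {
      case (ZERO, s2) => s2
      case (s1, ZERO) => s1
      case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
    }
    case _ => r
  }

  // interleave simplification with the derivative steps
  def matchesSimp(r: Rexp, s: String): Boolean =
    nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
}
\end{lstlisting}
With these rules the derivatives of $(a+aa)^*$ grow much more slowly,
but, as stated above, they still grow beyond any finite bound.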


Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bit-coded
regular expressions. In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
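
The rough shape of these bit-coded regular expressions can be sketched
as a datatype in which every node carries a sequence of bits recording
the matching decisions made so far (a sketch with names of our
choosing; the precise definitions appear later in this thesis):
\begin{lstlisting}[style=myScalastyle]
object Bitcoded {
  type Bits = List[Boolean]

  // annotated regular expressions: like Rexp, but every
  // constructor carries the bits accumulated for it
  abstract class ARexp
  case object AZERO extends ARexp
  case class AONE(bs: Bits) extends ARexp
  case class ACHAR(bs: Bits, c: Char) extends ARexp
  case class AALTS(bs: Bits, rs: List[ARexp]) extends ARexp
  case class ASEQ(bs: Bits, r1: ARexp, r2: ARexp) extends ARexp
  case class ASTAR(bs: Bits, r: ARexp) extends ARexp
}
\end{lstlisting}
After the whole string has been consumed, the accumulated bits of a
nullable derivative can be decoded into a POSIX value.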
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):



\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}

Ausaf and Urban were able to back this correctness claim with
a formal proof.

But as they stated,
\begin{quote}\it
The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.
\end{quote}

This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
and gives a formal proof of correctness in the presence of those simplifications.


%----------------------------------------------------------------------------------------
\section{Contribution}



This work addresses the vulnerability of super-linear and
buggy regex implementations by the combination
of Brzozowski's derivatives and interactive theorem proving.
We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee in terms of correctness and
running time as an Isabelle/HOL proof.
Further improvements to the algorithm are made with an even stronger version of
simplification.
We have not yet formalised a proof for this stronger version, but believe that
it leads to a time bound linear in the input length and
cubic in the regular expression size, using a technique by
Antimirov \cite{Antimirov}.


The main contributions of this thesis are
\begin{itemize}
\item
a proven-correct lexing algorithm
\item
with formalised finite bounds on the sizes of its internal data structures.
\end{itemize}

To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu had proposed an algorithm in which they
claim a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not its complexity.
In the performance evaluation section, they simply analysed the run time
of matching $a$ with the string $\underbrace{a \ldots a}_{n\; a\text{'s}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.



\subsection{Related Work}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.


When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form
or they try to avoid the possibly problematic patterns completely,
for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} and \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.




\section{Structure of the thesis}
In Chapter~\ref{Inj} we will introduce the concepts
and notations we
use for describing the lexing algorithm by Sulzmann and Lu,
and then give the lexing algorithm.
We will give its variant in Chapter~\ref{Bitcoded1}.
Then we illustrate in Chapter~\ref{Bitcoded2}
how the algorithm without bitcodes falls short for such aggressive
simplifications and therefore introduce our version of the
bit-coded algorithm and
its correctness proof.
In Chapter~\ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is, a finite bound on the size of any
regex's derivatives.
In Chapter~\ref{Cubic} we discuss stronger simplifications to improve the finite bound
in Chapter~\ref{Finite} to a polynomial one, and demonstrate how one can extend the
algorithm to include constructs such as bounded repetitions and negations.




%----------------------------------------------------------------------------------------


%----------------------------------------------------------------------------------------

%----------------------------------------------------------------------------------------

%----------------------------------------------------------------------------------------