% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Introduction} % For referencing the chapter elsewhere, use \ref{Introduction}
%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

%boxes
\newcommand*{\mybox}[1]{\framebox{\strut #1}}

%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
\newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimp} #2}
\def\derssimp{\textit{ders}\_\textit{simp}}
\def\rders{\textit{rders}}
\newcommand{\bders}[2]{#1 \backslash #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
\newcommand{\rsimp}[1]{\textit{rsimp}\; #1}
\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
\newcommand{\rdistinct}[2]{\textit{rdistinct} \;\; #1 \;\; #2}
\def\rDistinct{\textit{rdistinct}}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}

\def\case{\textit{case}}
\def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
\def\rsimpalts{\textit{rsimp}_{ALTS}}
\def\good{\textit{good}}
\def\btrue{\textit{true}}
\def\bfalse{\textit{false}}
\def\bnullable{\textit{bnullable}}
\def\bnullables{\textit{bnullables}}
\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}
\newcommand{\rder}[2]{#2 \backslash_r #1}
\def\nonnested{\textit{nonnested}}
\def\AZERO{\textit{AZERO}}
\def\sizeNregex{\textit{sizeNregex}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\scfrewrites{\stackrel{*}{\rightsquigarrow_{scf}}}
\def\frewrite{\rightsquigarrow_f}
\def\hrewrite{\rightsquigarrow_h}
\def\grewrite{\rightsquigarrow_g}
\def\frewrites{\stackrel{*}{\rightsquigarrow_f}}
\def\hrewrites{\stackrel{*}{\rightsquigarrow_h}}
\def\grewrites{\stackrel{*}{\rightsquigarrow_g}}
\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\der{\textit{der}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\bmkepss{\textit{bmkepss}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\textit{inj}}
\def\Empty{\textit{Empty}}
\def\Left{\textit{Left}}
\def\Right{\textit{Right}}
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
\def\lf{\textit{lf}}
\def\PD{\textit{PD}}
\def\suffix{\textit{Suffix}}
\def\distinctBy{\textit{distinctBy}}
\def\starupdate{\textit{starUpdate}}
\def\starupdates{\textit{starUpdates}}

\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
\def\map{\textit{map}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable} \; #1 }
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)_{\downarrow_r}}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}

\def\rflts{\textit{rflts}}
\def\rrewrite{\textit{rrewrite}}
\def\bsimpalts{\textit{bsimp}_{ALTS}}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
\def\not{\textit{not}}

\def\RZERO{\mathbf{0}_r}
\def\RONE{\mathbf{1}_r}
\newcommand\RCHAR[1]{\mathbf{#1}_r}
\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\sum #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{Suffix} \;#1\;#2}
\pgfplotsset{
  myplotstyle/.style={
    legend style={draw=none, font=\small},
    legend cell align=left,
    legend pos=north east,
    ylabel style={align=center, font=\bfseries\boldmath},
    xlabel style={align=center, font=\bfseries\boldmath},
    x tick label style={font=\bfseries\boldmath},
    y tick label style={font=\bfseries\boldmath},
    scaled ticks=true,
    every axis plot/.append style={thick},
  },
}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\caption{Runtime of JavaScript, Python and Java~8 when matching $(a^*)^*\,b$ against strings of $n$ $a$'s.}\label{fig:aStarStarb}
\end{figure}
Regular expressions are widely used in computer science:
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text-processing; network intrusion
detection systems that reject suspicious traffic; or compiler
front ends. The majority of the solutions to these tasks
involve lexing with regular
expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
Indeed, supplying the regex engine of a popular programming language
with a regular expression and a string, one can usually
obtain rich matching information in a very short time.
Some network intrusion detection systems
use regex engines that are able to process
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
Unfortunately, this is not the case for \emph{all} inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed

Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s, one discovers that
this decision takes an unreasonably long time given the simplicity of the problem.
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns, as the graphs
in Figure~\ref{fig:aStarStarb} show.
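
The blowup is easy to reproduce. The following Python script is a
minimal sketch that times the failing match attempts with CPython's
backtracking \texttt{re} module for growing $n$; on a typical machine
the runtime roughly doubles with each additional $a$:
\begin{verbatim}
import re
import time

# (a*)*b cannot match a string consisting only of a's: the
# required final b is missing, so a backtracking engine ends
# up exploring exponentially many ways of splitting the a's
# between the inner and the outer star.
pattern = re.compile(r'(a*)*b')

for n in range(20, 31):
    start = time.time()
    pattern.match('a' * n)        # returns None, eventually
    print(n, time.time() - start)
\end{verbatim}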

\ChristianComment{Superlinear: I just leave out the explanation,
which I find, once used, would distract the flow. Plus if I just say exponential
here, the 2016 StackExchange event was not exponential, but just quadratic, so it would be
inaccurate.}

This superlinear blowup in regular expression engines
has repeatedly caused grief in real life.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
only occasionally, but are actually widespread.
They occur so often that they have been given a
name: Regular-expression Denial of Service (ReDoS)
attacks.
\citeauthor{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi.
They therefore concluded that evil regular expressions
are a problem ``more than a parlour trick'', but one that
requires
more research attention.

But the problems are not limited to slowness on certain
inputs.
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be
the regex
\begin{verbatim}
(aba|ab|a)*
\end{verbatim}
and the string
\begin{verbatim}
ababa
\end{verbatim}
The correct $\POSIX$ match for the above is
the entire string $ababa$,
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}
with different language engines yields
two fragmented matches instead: $[aba]$ at $[0, 3)$
and $[a]$ at $[4, 5)$.
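
The first of these fragments can be reproduced with a few lines of
Python, a sketch relying on CPython's greedy (rather than POSIX)
backtracking engine:
\begin{verbatim}
import re

# A POSIX (longest-match) engine would match the entire
# string "ababa" with (aba|ab|a)*; CPython's greedy engine
# commits to "aba" in the first iteration, gets stuck on
# the remaining "ba", and stops with only the prefix.
m = re.match(r'(aba|ab|a)*', 'ababa')
print(m.span())   # prints (0, 3), not the POSIX (0, 5)
\end{verbatim}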

Kuklewicz \parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.

As Grathwohl \parencite{grathwohl2014crash} commented,
\begin{center}
	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
\end{center}


To summarise the above: regular expressions are important.
They are popular, and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first is that the running time might blow up.
The second is that they might be error-prone, even on certain
very simple cases.
In the next part of the chapter, we will look into reasons why
certain regex engines run horribly slowly on the ``catastrophic''
cases, and propose a solution that addresses both of these problems,
based on the work of Brzozowski, and of Sulzmann and Lu.


\section{Why are current regex engines slow?}

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in
each of them:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.

\subsection{Different Phases of a Matching/Lexing Algorithm}


Most lexing algorithms can be roughly divided into
two phases.
The first phase is the ``construction'' phase,
in which the algorithm builds some
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of this phase by $P_1(r)$.
The second phase is the lexing phase, in which the input string
$s$ is read and the data structure
representing the regex $r$ is operated on.
We denote the time
it takes by $P_2(r, s)$.\\

For $\mathit{DFA}$s,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition: a deterministic finite automaton
by definition has at most one active state and takes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.
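
To see why the lexing phase is linear, consider the following
table-driven matcher sketch in Python (the automaton is a
hand-written, hypothetical example recognising $(a|b)^*a$, i.e.\
strings over $a,b$ ending in $a$): once the transition table exists,
matching is a single pass over the string.
\begin{verbatim}
# One table lookup per input character: O(|s|) overall.
delta = {(0, 'a'): 1, (0, 'b'): 0,
         (1, 'a'): 1, (1, 'b'): 0}
accepting = {1}

def dfa_match(s):
    state = 0
    for c in s:                  # exactly |s| steps
        state = delta[(state, c)]
    return state in accepting

print(dfa_match('abba'))         # True
print(dfa_match('abab'))         # False
\end{verbatim}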

For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{$n$ copies of $r$}}$.
$P_2(r, s)$ is bounded by $O(|r|\cdot|s|)$ if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $O(|r| \cdot 2^{|s|})$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}


The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.

\subsubsection{Problems with Bounded Repetitions}
A bounded repetition, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denotes a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring matches $r$.
For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
          edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
The red states are ``countdown states'', which count down
the number of characters still needed
to complete a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
has just consumed the ``delimiter'' $a$ in the middle, and
needs to match two more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed one character, and it waits
for one more character to complete.
$q_3$ is the last state, requiring no more characters, and is accepting.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active, independently of each other.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
would require at least $2^{n+1}$ states if $r$ contains
more than one string.
This is needed to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools that use $\DFA$s will run
out of memory.
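
This blowup can be checked directly with a small subset-construction
experiment; the following sketch hard-codes the $\mathit{NFA}$ for
$(a|b)^*a(a|b)^{\{n\}}$ in the style of the picture above and counts
the reachable $\mathit{DFA}$ states:
\begin{verbatim}
def nfa_for(n):
    # state 0 loops on a,b and guesses the delimiter a;
    # states 1..n+1 form the countdown chain.
    trans = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, n + 1):
        trans[(i, 'a')] = {i + 1}
        trans[(i, 'b')] = {i + 1}
    return trans

def dfa_size(n):
    # breadth-first subset construction
    trans = nfa_for(n)
    seen, todo = {frozenset({0})}, [frozenset({0})]
    while todo:
        S = todo.pop()
        for c in 'ab':
            T = frozenset(q for p in S
                            for q in trans.get((p, c), ()))
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return len(seen)

for n in range(1, 11):
    print(n, dfa_size(n))    # 4, 8, 16, ...: 2^{n+1} states
\end{verbatim}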

\subsubsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above countdown regular expression,
already a small number $n$ would result in a determinised automaton
with millions of states.

For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.
\subsection{Why $\mathit{NFA}$s can be slow in the second phase}
When one constructs an $\NFA$ out of a regular expression,
there is often very little to be done in the first phase: one simply
constructs the $\NFA$ states based on the structure of the input regular expression.

In the lexing phase, one can simulate the $\mathit{NFA}$ in two ways.
The first is to keep track of all active states after consuming
a character, and to update that set of states iteratively.
This can be viewed as a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like Go and Rust use this
type of $\mathit{NFA}$ simulation and guarantee a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails. This method, often called a ``depth-first search'',
is efficient in a lot of cases, but can end up
with exponential run time.\\
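
The set-of-states style of simulation can be sketched in a few lines
of Python (the dictionary encoding of the $\NFA$ below is a
hypothetical example, not the representation used by Go or Rust); the
work per input character is bounded by the fixed number of states,
which gives the linear runtime in $|s|$:
\begin{verbatim}
# Epsilon-free NFA for (a|b)*a(a|b): state 0 loops on a and b
# and guesses the "delimiter" a; state 2 is accepting.
trans = {(0, 'a'): {0, 1}, (0, 'b'): {0},
         (1, 'a'): {2},    (1, 'b'): {2}}
accepting = {2}

def nfa_match(s):
    active = {0}
    for c in s:   # advance the whole set of active states
        active = {q2 for q in active
                     for q2 in trans.get((q, c), set())}
    return bool(active & accepting)

print(nfa_match('aab'))   # True
print(nfa_match('ab'))    # True
print(nfa_match('aba'))   # False
\end{verbatim}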
%TODO:COMPARE java python lexer speed with Rust and Go
The reason behind the use of backtracking algorithms in languages like
Java and Python is that they support back-references.
\subsubsection{Back References}
If we have a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
	$r_1(r_2(r_3r_4))$
\end{center}
we could label the sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
Here $r_1r_2r_3r_4$, $r_2r_3r_4$, $r_3$ and $r_4$ are labelled
by 1 to 4, so $1$ refers to the entire expression
$(r_1(r_2(r_3)(r_4)))$, $2$ refers to $r_2(r_3)(r_4)$, and so on.
These sub-expressions are called ``capturing groups''.
We can use the following syntax to denote that we want a string just matched by a
sub-expression (capturing group) to appear at a certain location again,
exactly as it was:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
The backslash and number $i$ are used to denote such
so-called ``back-references''.
Let $e$ be an expression made of regular expressions
and back-references, which contains the expression $e_i$
as its $i$-th capturing group.
The semantics of back-references can be recursively
written as:
\begin{center}
	\begin{tabular}{c}
		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e),\; s_i \in L(e_i),$\\
		$s_i\; \text{is the match of the $i$-th capturing group of $e$ on $s$}\}$
	\end{tabular}
\end{center}
As a concrete example, the regex
$((a|b|c|\ldots|z)^*)\backslash 1$
matches strings like $\mathit{bobo}$, $\mathit{weewee}$, etc.\\
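
This construct is directly supported by practical engines; a quick
sketch with CPython's \texttt{re} module:
\begin{verbatim}
import re

# \1 refers back to the string captured by group 1, so the
# pattern matches exactly the "squares" ww of lowercase words w.
pattern = re.compile(r'([a-z]*)\1')

print(bool(pattern.fullmatch('bobo')))     # True  (w = "bo")
print(bool(pattern.fullmatch('weewee')))   # True  (w = "wee")
print(bool(pattern.fullmatch('bobbo')))    # False
\end{verbatim}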
Back-references are a construct in the ``regex'' standard
that programmers find useful, but they are not exactly
regular any more.
In fact, they allow regexes to express
languages that cannot be captured by context-free
grammars either.
For example, the back-reference expression $(a^*)b\backslash 1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Solving the matching problem for back-reference expressions
is NP-complete \parencite{alfred2014algorithms}, and a non-backtracking,
efficient solution is not known to exist.
%TODO:read a bit more about back reference algorithms

It seems that languages like Java and Python made the trade-off
to support back-references at the expense of having to backtrack,
even in the case of regexes not involving back-references.\\
Summing these up, we can categorise existing
practical regex libraries into ones with linear
time guarantees, like Go and Rust, which impose restrictions
on the user input (not allowing back-references,
bounded repetitions cannot exceed 1000, etc.), and ones
that allow the programmer much freedom, but grind to a halt
on some non-negligible portion of cases.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
\ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?}

So we have practical implementations
of regular expression matching/lexing which are fast
but do not come with any guarantee that they will not grind to a halt
or give wrong answers.
Our goal is to have a regex lexing algorithm that comes with
\begin{itemize}
\item
proven correctness,
\item
proven non-catastrophic properties, and
\item
easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references.
\end{itemize}

\section{Our Solution: a Formal Specification of POSIX Matching and Brzozowski Derivatives}
We propose Brzozowski derivatives of regular expressions as
a solution to these problems.
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.

\subsection{Motivation}

Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
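
This matcher fits in a few lines of functional code. The following
Python sketch follows the textbook definitions of \textit{nullable}
and \textit{der} (the tuple encoding is ours, purely for
illustration):
\begin{verbatim}
# Regular expressions encoded as tuples:
# ('zero',), ('one',), ('char', c),
# ('alt', r1, r2), ('seq', r1, r2), ('star', r)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    return tag in ('one', 'star')

def der(c, r):
    """Brzozowski derivative of r with respect to character c."""
    tag = r[0]
    if tag in ('zero', 'one'):
        return ('zero',)
    if tag == 'char':
        return ('one',) if r[1] == c else ('zero',)
    if tag == 'alt':
        return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        d = ('seq', der(c, r[1]), r[2])
        return ('alt', d, der(c, r[2])) if nullable(r[1]) else d
    if tag == 'star':
        return ('seq', der(c, r[1]), r)

def matches(r, s):
    for c in s:            # derivative w.r.t. each character
        r = der(c, r)
    return nullable(r)     # match iff the final regex is nullable

# (a + aa)* matched against "aaa"
r = ('star', ('alt', ('char', 'a'),
              ('seq', ('char', 'a'), ('char', 'a'))))
print(matches(r, 'aaa'))   # True
\end{verbatim}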

The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions.
And an algorithm based on them by
Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
to include extended regular expressions and
simplification of internal data structures,
eliminating the exponential behaviours.

However, two difficulties with derivative-based matchers exist:
\subsubsection{Problems with Current Brzozowski Matchers}
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not.  This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function.  In our own earlier work, we provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

The second difficulty is that Brzozowski's derivatives can
grow to arbitrarily big sizes. For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we run out of memory on a
typical computer (we shall define shortly the precise details of our
regular expressions and the derivative operation).  Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in our
earlier work to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
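
These rules translate directly into a bottom-up simplification
function; a sketch, reusing the tuple encoding from the derivative
example above (the symmetric variants $r\,\ZERO \Rightarrow \ZERO$
and $r + \ZERO \Rightarrow r$ are included as well):
\begin{verbatim}
def simp(r):
    tag = r[0]
    if tag == 'seq':
        r1, r2 = simp(r[1]), simp(r[2])
        if ('zero',) in (r1, r2):
            return ('zero',)          # 0 . r => 0,  r . 0 => 0
        if r1 == ('one',):
            return r2                 # 1 . r => r
        return ('seq', r1, r2)
    if tag == 'alt':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ('zero',): return r2 # 0 + r => r
        if r2 == ('zero',): return r1 # r + 0 => r
        if r1 == r2:        return r1 # r + r => r
        return ('alt', r1, r2)
    return r

def matches_simp(r, s):
    for c in s:
        r = simp(der(c, r))           # simplify after each step
    return nullable(r)
\end{verbatim}
On the $(a+aa)^*$ example these rules remove the $\ZERO$s, but they
cannot remove duplicate alternatives that are not syntactically
adjacent in the nested sums, which is why the derivatives still grow.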


Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
\cite{Sulzmann2014} where they introduce bit-coded
regular expressions. In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):

\begin{quote}\it
  ``Correctness Claim: We further claim that the incremental parsing
  method [..] in combination with the simplification steps [..]
  yields POSIX parse trees. We have tested this claim
  extensively [..] but yet
  have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}

Ausaf and Urban were able to back this correctness claim with
a formal proof.

But as they stated,
\begin{quote}\it
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}

This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
and gives a formal proof of correctness with those simplifications.


%----------------------------------------------------------------------------------------
\section{Contribution}



This work addresses the vulnerability of super-linear and
buggy regex implementations by combining
Brzozowski's derivatives and interactive theorem proving.
We give an
improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives, which comes with a formal guarantee in terms of correctness and
running time, in the form of an Isabelle/HOL proof.
Further improvements to the algorithm are made with an even stronger
version of simplification.
We have not yet come up with a formal proof for this version, but believe
that such a proof, giving a time bound linear in the input length and
cubic in the regular expression size, can be obtained using a technique by
Antimirov \cite{Antimirov}.


The main contributions of this thesis are
\begin{itemize}
\item
a proven correct lexing algorithm, and
\item
formalized finite bounds on the sizes of its internal data structures.
\end{itemize}

To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have a provable time guarantee,
and claims about running time are usually speculative and backed by thin empirical
evidence.
%TODO: give references
For example, Sulzmann and Lu had proposed an algorithm for which they
claimed a linear running time.
But that claim was falsified by our experiments: the running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
%TODO: give references
lexer, which calculates POSIX matches and is based on derivatives.
They formalized the correctness of the lexer, but not the complexity.
In the performance evaluation section, they simply analyzed the run time
of matching $a$ with the string $\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.



\subsection{Related Work}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.


When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form,
or they try to avoid the possibly problematic patterns completely,
for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} and \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.



\section{Structure of the thesis}
In Chapter~\ref{Inj} we will introduce the concepts
and notations we
use for describing the lexing algorithm by Sulzmann and Lu,
and then give the lexing algorithm itself.
We will give its variant in Chapter~\ref{Bitcoded1}.
Then we illustrate in Chapter~\ref{Bitcoded2}
how the algorithm without bitcodes falls short of such aggressive
simplifications, and therefore introduce our version of the
bit-coded algorithm and
its correctness proof.
In Chapter~\ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is a finite bound on the size of any
regex's derivatives.
In Chapter~\ref{Cubic} we discuss stronger simplifications to improve the finite bound
in Chapter~\ref{Finite} to a polynomial one, and demonstrate how one can extend the
algorithm to include constructs such as bounded repetitions and negations.




%----------------------------------------------------------------------------------------