% Chapter 1

\chapter{Introduction} % Main chapter title

\label{Introduction} % For referencing the chapter elsewhere, use \ref{Introduction}

%----------------------------------------------------------------------------------------

% Define some commands to keep the formatting separated from the content
\newcommand{\keyword}[1]{\textbf{#1}}
\newcommand{\tabhead}[1]{\textbf{#1}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\file}[1]{\texttt{\bfseries#1}}
\newcommand{\option}[1]{\texttt{\itshape#1}}

% boxes
\newcommand*{\mybox}[1]{\framebox{\strut #1}}

%\newcommand{\sflataux}[1]{\textit{sflat}\_\textit{aux} \, #1}
\newcommand\sflat[1]{\llparenthesis #1 \rrparenthesis }
\newcommand{\ASEQ}[3]{\textit{ASEQ}_{#1} \, #2 \, #3}
\newcommand{\bderssimp}[2]{#1 \backslash_{bsimps} #2}
\newcommand{\rderssimp}[2]{#1 \backslash_{rsimps} #2}
\def\derssimp{\textit{ders}\_\textit{simp}}
\def\rders{\textit{rders}}
\newcommand{\bders}[2]{#1 \backslash #2}
\newcommand{\bsimp}[1]{\textit{bsimp}(#1)}
\def\bsimps{\textit{bsimp}}
\newcommand{\rsimp}[1]{\textit{rsimp}\; #1}
\newcommand{\sflataux}[1]{\llparenthesis #1 \rrparenthesis'}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\newcommand{\denote}{\stackrel{\mbox{\scriptsize denote}}{=}}%
\newcommand{\ZERO}{\mbox{\bf 0}}
\newcommand{\ONE}{\mbox{\bf 1}}
\newcommand{\AALTS}[2]{\oplus {\scriptstyle #1}\, #2}
\newcommand{\rdistinct}[2]{\textit{rdistinct} \;\; #1 \;\; #2}
\def\rdistincts{\textit{rdistinct}}
\def\rDistinct{\textit{rdistinct}}
\newcommand\hflat[1]{\llparenthesis #1 \rrparenthesis_*}
\newcommand\hflataux[1]{\llparenthesis #1 \rrparenthesis_*'}
\newcommand\createdByStar[1]{\textit{createdByStar}(#1)}
\def\cbn{\textit{createdByNtimes}}
\def\hpa{\textit{highestPowerAux}}
\def\hpower{\textit{highestPower}}
\def\ntset{\textit{ntset}}
\def\optermsimp{\textit{optermsimp}}
\def\optermOsimp{\textit{optermOsimp}}
\def\optermosimp{\textit{optermosimp}}
\def\opterm{\textit{opterm}}
\def\nString{\textit{nonemptyString}}

\newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}

\def\SEQ{\textit{SEQ}}
\def\SEQs{\textit{SEQs}}
\def\case{\textit{case}}
\def\sequal{\stackrel{\mbox{\scriptsize rsimp}}{=}}
\def\rsimpalts{\textit{rsimp}_{ALTS}}
\def\good{\textit{good}}
\def\btrue{\textit{true}}
\def\bfalse{\textit{false}}
\def\bnullable{\textit{bnullable}}
\def\bnullables{\textit{bnullables}}
\def\Some{\textit{Some}}
\def\None{\textit{None}}
\def\code{\textit{code}}
\def\decode{\textit{decode}}
\def\internalise{\textit{internalise}}
\def\lexer{\mathit{lexer}}
\def\mkeps{\textit{mkeps}}
\newcommand{\rder}[2]{#2 \backslash_r #1}

\def\rerases{\textit{rerase}}

\def\nonnested{\textit{nonnested}}
\def\AZERO{\textit{AZERO}}
\def\sizeNregex{\textit{sizeNregex}}
\def\AONE{\textit{AONE}}
\def\ACHAR{\textit{ACHAR}}

\def\simpsulz{\textit{simp}_{Sulz}}

\def\scfrewrites{\stackrel{*}{\rightsquigarrow_{scf}}}
\def\frewrite{\rightsquigarrow_f}
\def\hrewrite{\rightsquigarrow_h}
\def\grewrite{\rightsquigarrow_g}
\def\frewrites{\stackrel{*}{\rightsquigarrow_f}}
\def\hrewrites{\stackrel{*}{\rightsquigarrow_h}}
\def\grewrites{\stackrel{*}{\rightsquigarrow_g}}
\def\fuse{\textit{fuse}}
\def\bder{\textit{bder}}
\def\der{\textit{der}}
\def\POSIX{\textit{POSIX}}
\def\ALTS{\textit{ALTS}}
\def\ASTAR{\textit{ASTAR}}
\def\DFA{\textit{DFA}}
\def\NFA{\textit{NFA}}
\def\bmkeps{\textit{bmkeps}}
\def\bmkepss{\textit{bmkepss}}
\def\retrieve{\textit{retrieve}}
\def\blexer{\textit{blexer}}
\def\flex{\textit{flex}}
\def\inj{\textit{inj}}
\def\Empty{\textit{Empty}}
\def\Left{\textit{Left}}
\def\Right{\textit{Right}}
\def\Stars{\textit{Stars}}
\def\Char{\textit{Char}}
\def\Seq{\textit{Seq}}
\def\Der{\textit{Der}}
\def\Ders{\textit{Ders}}
\def\nullable{\mathit{nullable}}
\def\Z{\mathit{Z}}
\def\S{\mathit{S}}
\def\rup{r^\uparrow}
%\def\bderssimp{\mathit{bders}\_\mathit{simp}}
\def\distinctWith{\textit{distinctWith}}
\def\lf{\textit{lf}}
\def\PD{\textit{PD}}
\def\suffix{\textit{Suffix}}
\def\distinctBy{\textit{distinctBy}}
\def\starupdate{\textit{starUpdate}}
\def\starupdates{\textit{starUpdates}}
\def\nupdate{\textit{nupdate}}
\def\nupdates{\textit{nupdates}}

\def\size{\mathit{size}}
\def\rexp{\mathbf{rexp}}
\def\simp{\mathit{simp}}
\def\simpALTs{\mathit{simp}\_\mathit{ALTs}}
\def\map{\mathit{map}}
\def\distinct{\mathit{distinct}}
\def\blexersimp{\mathit{blexer}\_\mathit{simp}}
\def\blexerStrong{\textit{blexerStrong}}
\def\bsimpStrong{\textit{bsimpStrong}}
\def\bdersStrongs{\textit{bdersStrong}}
\newcommand{\bdersStrong}[2]{#1 \backslash_{bsimpStrongs} #2}

\def\map{\textit{map}}
\def\rrexp{\textit{rrexp}}
\newcommand\rnullable[1]{\textit{rnullable} \; #1 }
\newcommand\rsize[1]{\llbracket #1 \rrbracket_r}
\newcommand\asize[1]{\llbracket #1 \rrbracket}
\newcommand\rerase[1]{ (#1)_{\downarrow_r}}

\newcommand\ChristianComment[1]{\textcolor{blue}{#1}\\}

\def\rflts{\textit{rflts}}
\def\rrewrite{\textit{rrewrite}}
\def\bsimpalts{\textit{bsimp}_{ALTS}}
\def\bsimpaseq{\textit{bsimp}_{ASEQ}}
\def\rsimlalts{\textit{rsimp}_{ALTs}}
\def\rsimpseq{\textit{rsimp}_{SEQ}}

\def\erase{\textit{erase}}
\def\STAR{\textit{STAR}}
\def\flts{\textit{flts}}

\def\zeroable{\textit{zeroable}}
\def\nub{\textit{nub}}
\def\filter{\textit{filter}}
%\def\not{\textit{not}}

\def\RZERO{\mathbf{0}_r }
\def\RONE{\mathbf{1}_r}
\newcommand\RCHAR[1]{\mathbf{#1}_r}
\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\sum #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{Suffix} \;#1\;#2}

\lstdefinestyle{myScalastyle}{
  language=scala,
  aboveskip=3mm,
  belowskip=3mm,
  showstringspaces=false,
  columns=flexible,
  basicstyle={\small\ttfamily},
  numbers=none,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{blue},
  commentstyle=\color{dkgreen},
  stringstyle=\color{mauve},
  frame=single,
  breaklines=true,
  breakatwhitespace=true,
  tabsize=3,
}
%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like
Regular expressions, since their inception in the 1940s,
have been subject to extensive study and implementation.
Their primary application lies in lexing,
a process whereby an unstructured input string is disassembled,
guided by regular expressions, into a tree of tokens.
This is often used to match strings that comprise numerous fields,
where certain fields may recur or be omitted.
For example, the regular expression for recognising email addresses is
\begin{center}
\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
\end{center}
Using this, regular expression matchers and lexers are able to extract,
for example, the domain names by the use of \verb|[a-zA-Z0-9.-]+|.
Consequently, they are an indispensable component in text processing tools
of software systems such as compilers, IDEs, and firewalls.
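To make the extraction part concrete, the following small illustration uses
ordinary Scala/JVM regular expression support (it is not code from the
lexers developed in this thesis) to pull the domain out of an email address
with a capturing group:
\begin{lstlisting}[style=myScalastyle]
// Illustration only: the email regex from above, with the domain part
// in a capturing group, used via Scala's Regex pattern matching.
val Email = """^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})$""".r

"someone@example.com" match {
  case Email(domain) => println(domain)   // prints: example.com
  case _             => println("not an email address")
}
\end{lstlisting}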
The study of regular expressions is ongoing due to an
issue known as catastrophic backtracking.
This phenomenon refers to scenarios in which the regular expression
matching or lexing engine exhibits a disproportionately long
runtime despite the simplicity of the input and expression.

The origin of catastrophic backtracking is rooted in the
ambiguities of lexing.
In the process of matching a multi-character string with
a regular expression that encompasses several sub-expressions,
different positions can be designated to mark
the boundaries of the substrings corresponding to the sub-expressions.
For instance, in matching the string $aaa$ with the
regular expression $(a+aa)^*$, the divide between
the initial match and the subsequent iteration could either be
set between the first and second characters ($a | aa$)
or between the second and third characters ($aa | a$).
As both the length of the input string and the structural complexity
of the regular expression increase, the number of potential delimiter
combinations can grow exponentially, leading to a corresponding increase
in complexity for algorithms that do not optimally navigate these
possibilities.

Catastrophic backtracking facilitates a category of computationally
inexpensive attacks known as Regular Expression Denial of Service (ReDoS)
attacks. Here, an attacker merely dispatches a small attack string to a
server, provoking high-complexity behaviours in the server's regular
expression engine. Such attacks, whether intentional or accidental, have
led to the failure of real-world systems (additional details can be found
in the practical overview section in chapter \ref{Introduction}).

Various disambiguation strategies are employed to select sub-matches,
notably including the Greedy and POSIX strategies. POSIX, the strategy most
widely adopted in practice, invariably opts for the longest possible
sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a
descriptive definition of the POSIX rule; we reproduce it in full in the
section on error-prone POSIX implementations below.
However, Kuklewicz also noted that lexers are rarely correct according to
this standard.
It would therefore be valuable to be able to formally define and capture
the idea of POSIX rules and matching/lexing algorithms, and to prove
the correctness of such algorithms against the POSIX semantics.

Formal proofs are machine-checked programs,
such as the ones written in Isabelle/HOL. They are a powerful means
for computer scientists to be certain about the correctness of their
algorithms and correctness properties.
This certainty is achieved by automatically checking that the end goal is
derived solely from a list of axioms, definitions
and proof rules that are known to be correct.
The problems of regular expression lexing and catastrophic backtracking
are suitable targets for such
methods, as their implementations and complexity analyses tend to be
error-prone.

There have been many interesting steps taken in the direction of formal
proofs about regular expression lexing and matching.
%Regular expressions are widely used in computer science:
%be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
%command-line tools like $\mathit{grep}$ that facilitate easy
%text-processing \cite{grep}; network intrusion
%detection systems that inspect suspicious traffic; or compiler
%front ends.
%Given their usefulness and ubiquity, one would assume that
%modern regular expression matching implementations
%are mature and fully studied.
%Indeed, in many popular programming languages' regular expression engines,
%one can supply a regular expression and a string, and in most cases get
%get the matching information in a very short time.
%Those engines can sometimes be blindingly fast--some
%network intrusion detection systems
%use regular expression engines that are able to process
%hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
%However, those engines can sometimes also exhibit a surprising security vulnerability
%under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
Consider for example the regular expression $(a^*)^*\,b$ and
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length of around 30 $a$'s,
the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
The algorithms clearly show exponential behaviour, and as can be seen
it is triggered by some relatively simple regular expressions.
Java 9 and newer
versions improve this behaviour somewhat, but are still slow compared
with the approach we are going to use in this thesis.
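This behaviour is easy to reproduce. The following micro-benchmark sketch
(our own illustration, not one of the scripts used to produce figure
\ref{fig:aStarStarb}; exact timings will vary by machine) runs
\texttt{java.util.regex} on the JVM against ever longer strings of $a$'s:
\begin{lstlisting}[style=myScalastyle]
// Sketch: time java.util.regex matching (a*)*b against a^n.
// The backtracking engine retries exponentially many splits of the a's.
import java.util.regex.Pattern

object EvilRegex {
  def main(args: Array[String]): Unit = {
    val p = Pattern.compile("(a*)*b")
    for (n <- 5 to 30 by 5) {
      val s = "a" * n
      val t0 = System.nanoTime()
      p.matcher(s).matches()          // always false: the final b is missing
      val secs = (System.nanoTime() - t0) / 1e9
      println(f"n = $n%2d: $secs%.3f secs")
    }
  }
}
\end{lstlisting}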
This superlinear blowup in regular expression engines
has caused grief in ``real life'' where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.
\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Dart},
    legend pos=north west,
    legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Swift},
    legend pos=north west,
    legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    %xtick={0,5000,...,40000},
    %xmax=40000,
    %ymax=35,
    restrict x to domain*=0:40000,
    restrict y to domain*=0:35,
    %ytick={0,5,...,30},
    %scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java9+},
    legend pos=north west,
    legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
The reason for their superlinear behaviour is that they do a depth-first-search
using NFAs.
If the string does not match, the regular expression matching
engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited catastrophic backtracking
and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very rarely,
but occur often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.\ \cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
are a real problem rather than just ``a parlour trick''.
The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness
and a finiteness property for the size of derivatives
(all proved in Isabelle/HOL).
Such properties are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in figure \ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our
approach based on Brzozowski derivatives and formal proofs.

\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have clearly been studied for many, many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
w.r.t.\ the length of the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal languages theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.

%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
Depending on the types of constructs,
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
Similarly, the range operator
%used to express the alternative
%of all characters between its operands,
is just a concise way
of expressing an alternative of consecutive characters:
\[
	[0-9] \stackrel{means}{=} (0 | 1 | \ldots | 9 )
\]
The wildcard character '$.$' refers to any single character
except the newline:
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make the regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
As with the star regular expression, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
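For instance, under these definitions the bounded repetition
$a^{\{2\ldots 3\}}$ matches exactly the strings $aa$ and $aaa$:
\[
	L \; a^{\{2\ldots 3\}} \;=\; (L \; a)^2 \cup (L \; a)^3 \;=\; \{aa,\, aaa\}.
\]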
The attraction of bounded repetitions is that they can be
used to avoid a size blow-up: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression
\[
	\underbrace{r\ldots r}_\text{n copies of r}.
\]
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.

The problem with matching
such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
either become excruciatingly slow
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
for C and Java that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given a countdown regular expression such as the one discussed
below, a small $n$ (say 20) would result in a program representing a
DFA
with millions of states.}) for large counters.
A classic example of this phenomenon is the regular expression $(a+b)^* a (a+b)^{\{n\}}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
and when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The
red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
The state $q_3$ is the last (accepting) state, requiring 0
more characters.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition counts, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all different
scenarios in which ``countdown'' states are active.}
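The $2^{n+1}$ lower bound can also be seen via a standard
Myhill--Nerode-style distinguishability argument (a sketch we add here for
intuition): the language of $(a+b)^* a (a+b)^{\{n\}}$ consists of the
strings whose $(n{+}1)$-th character from the end is an $a$. Any two
distinct strings $w_1, w_2 \in \{a,b\}^{n+1}$ that differ at the $j$-th
position from the right (counting from $0$) are distinguished by the
suffix $b^{\,n-j}$, because appending it places the differing character
exactly $n{+}1$ positions from the end, so exactly one of
\[
	w_1 \, b^{\,n-j} \quad\text{and}\quad w_2 \, b^{\,n-j}
\]
is in the language. Hence all $2^{n+1}$ strings of length $n+1$ must end
up in different DFA states.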
Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.\ \cite{xml2015},
more than half of the
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million.
A smaller sample XSD they gave
is:
\lstset{
	basicstyle=\fontsize{8.5}{9}\ttfamily,
	language=XML,
	morekeywords={encoding,
		xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
}
\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{lstlisting}
This can be seen as the regex
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
for bounded regular expressions:
for example, in the regular expression matching library in the Go
language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that they simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large,
the memory requirement could become
infeasible, and a regex engine
like Go's will reject such patterns straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

    \node (q0) [state, initial] {$0$};
    \node (q1) [state, right = of q0] {$1$};
    %\node (q2) [state, right = of q1] {$2$};
    \node (qdots) [right = of q1] {$\ldots$};
    \node (qn) [state, right = of qdots] {$n$};
    \node (qn1) [state, right = of qn] {$n+1$};
    \node (qn2) [state, right = of qn1] {$n+2$};
    \node (qn3) [state, accepting, right = of qn2] {$n+3$};

\path [-stealth, thick]
	(q0) edge [loop above] node {a} ()
    (q0) edge node {a} (q1)
    %(q1) edge node {.} (q2)
    (q1) edge node {.} (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
	that NFA simulation can consume large
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
	strings of the form $aaa\ldots aaaabc$.
	When traversing in a breadth-first manner,
	all states from 0 to $n+1$ will become active.}
\end{figure}
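The following sketch in Scala (our own illustration of the breadth-first
simulation just described, with the states numbered $0$ to $n+3$ as in the
figure above) shows why the memory needed is proportional to the counter
$n$:
\begin{lstlisting}[style=myScalastyle]
// Breadth-first NFA simulation for .*a.{n}bc: after each character we
// keep the set of all active states, never backtracking.
def step(n: Int, states: Set[Int], ch: Char): Set[Int] =
  states.flatMap {
    case 0               => if (ch == 'a') Set(0, 1) else Set(0)
    case q if q <= n     => Set(q + 1)                  // the .{n} countdown
    case q if q == n + 1 => if (ch == 'b') Set(n + 2) else Set[Int]()
    case q if q == n + 2 => if (ch == 'c') Set(n + 3) else Set[Int]()
    case _               => Set[Int]()                  // past the accepting state
  }

def simulate(n: Int, s: String): Boolean =
  s.foldLeft(Set(0))((sts, c) => step(n, sts, c)).contains(n + 3)

// On input aaa...a the active set grows to {0, 1, ..., n+1}: linear-time
// matching, but memory proportional to the counter n.
\end{lstlisting}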
%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
%in terms of input string length.
%TODO:try out these lexers
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabits of input per second
even with large counters \cite{Becchi08}.
But these practical solutions do not come with
formal guarantees, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
%But formal reasoning about these automata especially in Isabelle
%can be challenging
%and un-intuitive.
%Therefore, we take correctness and runtime claims made about these solutions
%with a grain of salt.

In the work reported in \cite{FoSSaCS2023} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only small
inductive datatypes and recursive functions,
making it handy to deal with them in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.

\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is choosing
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails.
This method, often called a ``depth-first search'',
is efficient in many cases, but could end up
with exponential run time.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply
%construct the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.

Consider the following regular expression where the sequence
operator is omitted for brevity:
\begin{center}
	$r_1r_2r_3r_4$
\end{center}
In this example,
one could label the sub-expressions of interest
by parenthesizing them and giving
them a number in the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is:
\begin{center}
	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
The sub-expressions
$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
by 1 to 4, and can be ``referred back'' to by their respective numbers.
%These sub-expressions are called "capturing groups".
To do so, one uses the syntax $\backslash i$
to denote that we want the sub-string
of the input matched by the $i$-th
sub-expression to appear again,
exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
\end{center}
Once the sub-string $s_i$ for the sub-expression $r_i$
has been fixed, there is no variability in what the back-reference
label $\backslash i$ can match---it is tied to $s_i$.
The overall string will look like $\ldots s_i \ldots s_i \ldots$
%The backslash and number $i$ are the
%so-called "back-references".
%Let $e$ be an expression made of regular expressions
%and back-references. $e$ contains the expression $e_i$
%as its $i$-th capturing group.
%The semantics of back-reference can be recursively
%written as:
%\begin{center}
%	\begin{tabular}{c}
%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
%	\end{tabular}
%\end{center}
A concrete example
of back-references is
\begin{center}
	$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of back-references is
\begin{center}
	$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.
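Back-references of this kind are directly supported by JVM regexes, so the
first example can be tried out as follows (an ordinary
\texttt{java.util.regex} illustration, not code from this thesis):
\begin{lstlisting}[style=myScalastyle]
// (.*)\1 matches exactly the strings made of two identical halves.
import java.util.regex.Pattern

val p = Pattern.compile("(.*)\\1")
println(p.matcher("foofoo").matches())   // true
println(p.matcher("foobar").matches())   // false
\end{lstlisting}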
Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
The most common use of back-references
is to express well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
	$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes
goes beyond regular languages
once back-references are included.
In fact, back-references allow regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by
context-free grammars \cite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Moreover, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
resort to a depth-first search algorithm including backtracking.
What is unexpected is that even in the cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in figure \ref{fig:aStarStarb}.

Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) those with linear
time guarantees like Go and Rust. The downside is that
they impose restrictions
on the regular expressions (back-references are not allowed,
bounded repetitions cannot exceed an ad hoc limit, etc.);
and (ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
They can potentially ``grind to a halt''
on some very simple cases, resulting
in ReDoS attacks if exposed to the internet.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.

%We would like to have regex engines that can
%deal with the regular part (e.g.
%bounded repetitions) of regexes more
%efficiently.
%Also we want to make sure that they do it correctly.
%It turns out that such aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
% The GoLang regex library does not support over 1000 repetitions.
% Java and Python both support back-references, but shows
%catastrophic backtracking behaviours on inputs without back-references(
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps

%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.

\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is widely adopted in many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^*$ and the string $ababa$.
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But feeding this example to the different engines
listed at regex101 \parencite{regex101}\footnote{
	regex101 is an online regular expression matcher which
	provides an API for trying out the regular expression
	engines of multiple popular programming languages like
	Java, Python, Go, etc.}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries are not
correctly implementing the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
	``The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
People have recognised that the implementation complexity of the POSIX rules also comes from
the specification not being very precise.
The developers of the regexp package of Go
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
``The POSIX rule is computationally prohibitive
and not even well-defined.''
\end{quote}
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
	\begin{itemize}
		\item
		regular expressions (REs) take the leftmost starting match, and the longest match starting there
		earlier subpatterns have leftmost-longest priority over later subpatterns\\
		\item
		higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
		\item
		REs have right associative concatenation which can be changed with parenthesis\\
		\item
		parenthesized subexpressions return the match from their last usage\\
		\item
		text of component subexpressions must be contained in the text of the
		higher-level subexpressions\\
		\item
		if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
		\item
		if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
	\end{itemize}
\end{quote}
%The text above
%is trying to capture something very precise,
%and is crying out for formalising.
Ribeiro and Du Bois \cite{RibeiroAgda2017} have
formalised the notion of bit-coded regular expressions
and proved their relations with simple regular expressions in
the dependently-typed proof assistant Agda.
They also proved the soundness and completeness of a matching algorithm
based on the bit-coded regular expressions.
Ausaf et al.\ \cite{AusafDyckhoffUrban2016}
were the first to
give a quite simple formalised POSIX
specification in Isabelle/HOL, and also prove
that their specification coincides with an earlier (unformalised)
POSIX specification given by Okui and Suzuki \cite{Okui10}.
They then formally proved the correctness of
a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
with regard to that specification.
They also found that the informal POSIX
specification by Sulzmann and Lu needs to be substantially
revised in order for the correctness proof to go through.
Their original specification and proof were unfixable
according to Ausaf et al.

In the next section, we will briefly
609
+ − 1151
introduce Brzozowski derivatives and Sulzmann
622
+ − 1152
and Lu's algorithm, which the main point of this thesis builds on.
+ − 1153
%We give a taste of what they
+ − 1154
%are like and why they are suitable for regular expression
+ − 1155
%matching and lexing.
+ − 1156
\section{Formal Specification of POSIX Matching
609
+ − 1157
and Brzozowski Derivatives}
622
+ − 1158
%Now we start with the central topic of the thesis: Brzozowski derivatives.
609
+ − 1159
Brzozowski \cite{Brzozowski1964} first introduced the
622
+ − 1160
concept of a \emph{derivative} of regular expression in 1964.
609
+ − 1161
The derivative of a regular expression $r$
622
+ − 1162
with respect to a character $c$, is written as $r \backslash c$.
+ − 1163
This operation tells us what $r$ transforms into
638
+ − 1164
if we ``chop'' off a particular character $c$
622
+ − 1165
from all strings in the language of $r$ (defined
+ − 1166
later as $L \; r$).
%To give a flavour of Brzozowski derivatives, we present
%two straightforward clauses from it:
%\begin{center}
% \begin{tabular}{lcl}
% $d \backslash c$ & $\dn$ &
% $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
%$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
% \end{tabular}
%\end{center}
%\noindent
%The first clause says that for the regular expression
%denoting a singleton set consisting of a single-character string $\{ d \}$,
%we check the derivative character $c$ against $d$,
%returning a set containing only the empty string $\{ [] \}$
%if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
%The second clause states that to obtain the regular expression
%representing all strings' head character $c$ being chopped off
%from $r_1 + r_2$, one simply needs to recursively take derivative
%of $r_1$ and $r_2$ and then put them together.
Derivatives have the property
that $s \in L \; (r\backslash c)$ if and only if
$c::s \in L \; r$, where $::$ stands for list prepending.
%This property can be used on regular expressions
%matching and lexing--to test whether a string $s$ is in $L \; r$,
%one simply takes derivatives of $r$ successively with
%respect to the characters (in the correct order) in $s$,
%and then test whether the empty string is in the last regular expression.
With this property, derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).
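To give a concrete impression, below is a minimal sketch of such a
matcher in Scala (our own illustrative code, not the formal
definitions, which follow in chapter \ref{Inj}):
\begin{verbatim}
// regular expressions
abstract class Rexp
case object ZERO extends Rexp                  // matches nothing
case object ONE extends Rexp                   // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// does r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// match s by taking successive derivatives, then testing nullability
def matcher(r: Rexp, s: List[Char]): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}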
%This makes formally reasoning about these properties such
%as correctness and complexity smooth and intuitive.
There are several mechanised proofs of this property in various theorem
provers,
for example one by Owens and Slind \cite{Owens2008} in HOL4,
another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
yet another in Coq by Coquand and Siles \cite{Coquand2012}.

In addition, one can extend derivatives to bounded repetitions
relatively straightforwardly. For example, the derivative of
the bounded repetition $r^{\{n\}}$ can be defined as:
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash c$ & $\dn$ & $(r \backslash c) \cdot
		r^{\{n-1\}} \quad (\text{when } n > 0)$\\
	\end{tabular}
\end{center}
\noindent
Experimental results suggest that unlike DFA-based solutions
for bounded regular expressions,
derivatives can cope with
large counters
quite well.
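As a sketch of how this clause fits into the matcher, the earlier
Scala code can be extended as follows (\texttt{NTIMES} is our own
illustrative name for the bounded-repetition constructor):
\begin{verbatim}
// bounded repetition r{n} added to the earlier sketch
case class NTIMES(r: Rexp, n: Int) extends Rexp

// r{n} is nullable if n = 0 or r itself is nullable
def nullableN(r: Rexp): Boolean = r match {
  case NTIMES(r1, n) => n == 0 || nullableN(r1)
  case ALT(r1, r2)   => nullableN(r1) || nullableN(r2)
  case SEQ(r1, r2)   => nullableN(r1) && nullableN(r2)
  case ONE | STAR(_) => true
  case _             => false              // ZERO and CHAR
}

// the NTIMES clause peels off one copy of r and decrements the
// counter; all other clauses are as before
def derN(c: Char, r: Rexp): Rexp = r match {
  case NTIMES(r1, n) =>
    if (n == 0) ZERO else SEQ(derN(c, r1), NTIMES(r1, n - 1))
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(derN(c, r1), derN(c, r2))
  case SEQ(r1, r2) =>
    if (nullableN(r1)) ALT(SEQ(derN(c, r1), r2), derN(c, r2))
    else SEQ(derN(c, r1), r2)
  case STAR(r1)    => SEQ(derN(c, r1), STAR(r1))
  case _           => ZERO                 // ZERO and ONE
}
\end{verbatim}
The point is that the counter $n$ stays a number and is merely
decremented, rather than being expanded into $n$ explicit copies of
$r$ as in a DFA construction.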

There have also been
extensions of derivatives to other regex constructs.
For example, Owens et al.\ include derivatives
for the \emph{NOT} regular expression, which is
able to concisely express C-style comments of the form
$/* \ldots */$ (see \cite{Owens2008}).
Another extension of derivatives covers
regular expressions with look-aheads, given
by Miyazaki and Minamide
\cite{Takayuki2019}.
%We therefore use Brzozowski derivatives on regular expressions
%lexing



Given the above definitions and properties of
Brzozowski derivatives, one quickly realises their potential
for generating a formally verified lexing algorithm: the clauses and the property
can be expressed in a functional programming language
or converted to theorem prover
code with great ease.
Perhaps this is the reason why derivatives have sparked quite a bit of interest
in the functional programming and theorem prover communities in the last
fifteen or so years
(\cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
\cite{Chen12} and \cite{Coquand2012},
to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
after they were first published by Brzozowski.


However, there are two difficulties with derivative-based matchers.
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers).
Second, derivative-based matchers tend to be inefficient because
the sizes of derivatives can grow substantially during matching.
Elegant and beautiful
as many implementations are,
they can be excruciatingly slow.
For example, Sulzmann and Lu
claim a linear running time for their proposed algorithm,
but that claim was falsified by our experiments. The running time
is actually $\Omega(2^n)$ in the worst case.
A similar claim about a theoretical runtime of $O(n^2)$
is made for the Verbatim \cite{Verbatim}
lexer, which calculates POSIX matches and is based on derivatives.
They formalised the correctness of the lexer, but not its complexity result.
In the performance evaluation section, they analysed the run time
of matching $a$ with the string
\begin{center}
	$\underbrace{a \ldots a}_{\text{$n$ $a$'s}}$.
\end{center}
\noindent
They concluded that the algorithm is quadratic in terms of
the length of the input string.
When we tried out their extracted OCaml code with the example $(a+aa)^*$,
the time it took to match a string of 40 $a$'s was approximately 5 minutes.


\subsection{Sulzmann and Lu's Algorithm}
Sulzmann and Lu~\cite{Sulzmann2014} overcame the first
problem with the yes/no answer
by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function. This injection function in a sense undoes the ``damage''
of the derivatives chopping off characters.
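The overall two-phase structure can be sketched in Scala as follows
(again only an illustrative sketch with our own value datatype; the
precise definitions of \texttt{mkeps} and \texttt{inj}, together with
their intuition, are given in chapter \ref{Inj}):
\begin{verbatim}
// values record how a regular expression matched a string
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// mkeps: the POSIX value for how a nullable r matches []
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

// inj: undo one derivative step, re-inserting the character c
// (a partial function: only the cases that can arise are listed)
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// phase 1 builds derivatives; phase 2 walks back with mkeps and inj
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil => if (nullable(r)) mkeps(r) else sys.error("not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}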
In earlier work, Ausaf et al.\ provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of this extended algorithm
\cite{AusafDyckhoffUrban2016}.

The version of the algorithm proven correct
suffers, however, heavily from the
second difficulty: derivatives can
grow to arbitrarily large sizes.
For example if we start with the
regular expression $(a+aa)^*$ and take
successive derivatives according to the character $a$, we end up with
a sequence of ever-growing derivatives like

\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
\begin{center}
\begin{tabular}{rll}
$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
\end{tabular}
\end{center}

\noindent where after around 35 steps we usually run out of memory on a
typical computer. Clearly, the
notation involving $\ZERO$s and $\ONE$s already suggests
simplification rules that can be applied to regular
expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
\Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
r$. While such simple-minded simplifications have been proved in
the work by Ausaf et al.\ to preserve the correctness of Sulzmann and Lu's
algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
\emph{not} help with limiting the growth of the derivatives shown
above: the growth is slowed, but the derivatives can still grow rather
quickly beyond any finite bound.
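To illustrate, these rewrite rules (plus their symmetric variants) can
be implemented along the following lines in Scala (a sketch;
\texttt{simp} and \texttt{matcherSimp} are our own names):
\begin{verbatim}
// bottom-up application of the simplification rules above
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO               // 0 . r  =>  0
    case (_, ZERO)  => ZERO               // r . 0  =>  0
    case (ONE, s2)  => s2                 // 1 . r  =>  r
    case (s1, ONE)  => s1                 // r . 1  =>  r
    case (s1, s2)   => SEQ(s1, s2)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                 // 0 + r  =>  r
    case (s1, ZERO) => s1                 // r + 0  =>  r
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)  // r + r => r
  }
  case _ => r
}

// interleaving simp with the derivative steps slows the growth,
// but, as explained above, does not bound it
def matcherSimp(r: Rexp, s: List[Char]): Boolean =
  nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
\end{verbatim}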

In this thesis we therefore look at a second
algorithm by Sulzmann and Lu, with which they
overcame this ``growth problem'' \cite{Sulzmann2014}.
In this version, POSIX values are
represented as bit sequences and such sequences are incrementally generated
when derivatives are calculated. The compact representation
of bit sequences and regular expressions allows them to define a more
``aggressive'' simplification method that keeps the size of the
derivatives finite no matter what the length of the string is.
They make some informal claims about the correctness and linear behaviour
of this version, but do not provide any supporting proof arguments, not
even ``pencil-and-paper'' arguments. They write about their bit-coded
\emph{incremental parsing method} (that is the algorithm to be formalised
in this dissertation):

\begin{quote}\it
``Correctness Claim: We further claim that the incremental parsing
method [..] in combination with the simplification steps [..]
yields POSIX parse trees. We have tested this claim
extensively [..] but yet
have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
\end{quote}
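To give a first impression of this version, the annotated regular
expressions it operates on can be sketched in Scala as follows (a
sketch only; the formal definitions appear in chapter \ref{Bitcoded1}):
\begin{verbatim}
// bits recording which alternatives were taken and how often a
// star was iterated; a POSIX value can be decoded from them
abstract class Bit
case object Z extends Bit
case object S extends Bit

// annotated regular expressions: each node carries a bit sequence
// that is extended incrementally as derivatives are calculated
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
\end{verbatim}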
Ausaf and Urban made some initial progress towards the
full correctness proof but still had to leave out the optimisation
Sulzmann and Lu proposed.
Ausaf wrote \cite{Ausaf}:
\begin{quote}\it
``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
\end{quote}
This thesis implements the aggressive simplifications envisioned
by Ausaf and Urban,
together with a formal proof of the correctness of those simplifications.


One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work for us and we will compare it later with
the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
We will also study this work in a later section.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.


%----------------------------------------------------------------------------------------
\section{Contribution}

In this thesis,
we propose a solution to catastrophic
backtracking and error-prone matchers: a formally verified
regular expression lexing algorithm
that is both fast
and correct.
In particular, the main problem we solved on top of previous work was
coming up with a formally verified algorithm called $\blexersimp$ based
on Brzozowski derivatives. It calculates a POSIX
lexical value from a string and a regular expression. This algorithm was originally
proposed by Sulzmann and Lu \cite{Sulzmann2014}, but we made the key observation that its $\textit{nub}$
function does not really simplify intermediate results where it needs to, and we improved the
algorithm accordingly.
We have proven that the internal data structure of $\blexersimp$ does not grow beyond a bound $N_r$
that depends only on the input regular expression $r$, thanks to the aggressive simplifications of $\blexersimp$:
\begin{theorem}
$|\blexersimp \; r \; s | \leq N_r$
\end{theorem}
The simplifications applied in each step of $\blexersimp$
keep the derivatives small, but present a
challenge in
establishing a correctness theorem of the following form:
\begin{theorem}
If the POSIX lexical value of matching regular expression $r$ with string $s$ is $v$,
then $\blexersimp\; r \; s = \Some \; v$. Otherwise
$\blexersimp \; r \; s = \None$.
\end{theorem}

The result is %a regular expression lexing algorithm that comes with
\begin{itemize}
\item
an improved version of Sulzmann and Lu's bit-coded algorithm using
derivatives with simplifications,
accompanied by
a proven correctness theorem according to the POSIX specification
given by Ausaf et al.\ \cite{AusafDyckhoffUrban2016},
\item
a complexity-related property for that algorithm saying that the
internal data structure will
remain below a finite bound,
\item
and an extension to
bounded repetition constructs with the correctness and finiteness properties
maintained.
\end{itemize}
\noindent
With a formal finiteness bound in place,
we can greatly reduce the attack surface of servers against ReDoS attacks.
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
Further improvements to the algorithm with an even stronger version of
simplification can be made. We conjecture that the resulting size of derivatives
can be bounded by a cubic function of the size of the regular expression.
We will give relevant code in Scala,
but do not give a formal proof of this in Isabelle/HOL.
This is still future work.



\section{Structure of the thesis}
In chapter \ref{Inj} we will introduce the concepts
and notations we
use for describing regular expressions and derivatives,
and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including
its correctness proof).
We will give their second lexing algorithm with bitcodes in chapter \ref{Bitcoded1},
together with the correctness proof by Ausaf and Urban.
Then we illustrate in chapter \ref{Bitcoded2}
how Sulzmann and Lu's
simplifications fail to simplify correctly. We therefore introduce our own version of the
algorithm with correct simplifications and
their correctness proof.
In chapter \ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is a finite bound on the size of any
regular expression's derivatives.
We also show how one can extend the
algorithm to include bounded repetitions.
In chapter \ref{Cubic} we discuss stronger simplification rules which
improve the finite bound to a cubic bound. %and the NOT regular expression.
Chapter \ref{RelatedWork} introduces work related to this thesis.
Chapter \ref{Future} concludes and mentions avenues of future research.




%----------------------------------------------------------------------------------------


%----------------------------------------------------------------------------------------

%----------------------------------------------------------------------------------------

%----------------------------------------------------------------------------------------