\chapter{Technical Overview}
\label{Overview}

We start with a technical overview of the
catastrophic backtracking problem,
motivating rigorous approaches to regular expression matching and lexing.
In doing so we also briefly
introduce common terminology such as
bounded repetitions and back-references.

\section{Motivating Examples}
Consider for example the regular expression $(a^*)^*\,b$ and
strings of the form $aa..a$. These strings cannot be matched by this regular
expression: obviously the expected $b$ in the last
position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of length of around 30 $a$'s,
the decision takes a large amount of time to finish.
This is out of all proportion
to the simplicity of the input (see graphs in Figure~\ref{fig:aStarStarb}).
The algorithms clearly show exponential behaviour, which, as can be seen,
is triggered by some relatively simple regular expressions.
Java 9 and newer
versions mitigate this behaviour by several orders of magnitude,
but are still slow compared
with the approach we are going to use in this thesis.
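
To see this behaviour for oneself, one can time the JVM's built-in
\texttt{java.util.regex} engine on exactly this example. The following
Scala snippet is our own illustration (not part of the verified
development described later); on a Java~8 runtime the times grow
rapidly with $n$, mirroring the Java~8 curve in Figure~\ref{fig:aStarStarb}:
\begin{verbatim}
import java.util.regex.Pattern

object EvilRegexDemo {
  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a*)*b")
    for (n <- 20 to 30) {
      val input = "a" * n
      val start = System.nanoTime()
      // matches() is always false here: the input contains no 'b'
      val result = evil.matcher(input).matches()
      val secs = (System.nanoTime() - start) / 1e9
      println(f"n = $n%2d  result = $result  time = $secs%.3f s")
    }
  }
}
\end{verbatim}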

This superlinear blowup in regular expression engines
has caused grief in ``real life'' where it has been
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
A less serious example is a bug in the Atom editor:
a user found that after writing the following code
\begin{verbatim}
vVar.Type().Name() == "" && vVar.Kind() == reflect.Ptr
&& vVar.Type().Elem().Name() == "" && vVar.Type().Elem().Kind() ==
reflect.Slice
\end{verbatim}
it took the editor half an hour to finish computing.
The issue was due to the regex engine of Atom and
was subsequently fixed by Galbraith \cite{fixedAtom}.
But evil regular expressions can be more than a nuisance in a text editor:
on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
the string was
$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.

\begin{figure}[p]
\begin{center}
\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={JavaScript},
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Python},
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java 8},
    legend pos=north west,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Dart},
    legend pos=north west,
    legend cell align=left]
\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Swift},
    legend pos=north west,
    legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    restrict x to domain*=0:40000,
    restrict y to domain*=0:35,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Java9+},
    legend pos=north west,
    legend cell align=left]
\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=true,
    restrict x to domain*=0:60000,
    restrict y to domain*=0:35,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend entries={Scala},
    legend pos=north west,
    legend cell align=left]
\addplot[magenta,mark=*, mark options={fill=white}] table {re-blexersimp.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=true,
    restrict x to domain*=0:60000,
    restrict y to domain*=0:45,
    axis lines=left,
    width=5cm,
    height=4cm,
    legend style={cells={align=left}},
    legend entries={Isabelle \\ Extracted},
    legend pos=north west,
    legend cell align=left]
\addplot[magenta,mark=*, mark options={fill=white}] table {re-fromIsabelle.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs}
\end{tabular}
\end{center}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
The reason for their fast growth
is that they do a depth-first search
using NFAs.
If the string does not match, the regular expression matching
engine starts to explore all possibilities.
The last two graphs are for our implementation in Scala, one manual
and one extracted from the verified lexer in Isabelle by $\textit{codegen}$.
Our lexer
performs better in this case, and
is formally verified.
Despite being almost identical, the codegen-generated lexer
is slower than the manually written version since the code synthesised by
$\textit{codegen}$ does not use native integer or string
types of the target language.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}

A more recent example is a global outage of all Cloudflare servers on
2 July 2019.
The traffic Cloudflare services each day is more than that of
Twitter, Amazon, Instagram, Apple, Bing and Wikipedia combined, yet
it became completely unavailable for half an hour because
a poorly-written regular expression, roughly of the form $.^*.^*=.^*$,
exhausted the CPU resources that serve HTTP traffic.
Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very rarely,
but they occur often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
attacks.
Davis et al.~\cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
are a real problem rather than just ``a parlour trick''.

The work in this thesis aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness
and a finiteness property for the size of derivatives
(all carried out in Isabelle/HOL).
Such properties
are an important step in preventing
catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in Figure~\ref{fig:aStarStarb}
can occur in traditional regular expression engines
and (ii) why we choose our
approach based on Brzozowski derivatives and formal proofs.


\section{Preliminaries}
Regular expressions and regular expression matchers
have been studied for many years.
Theoretical results in automata theory state
that basic regular expression matching should be linear
with respect to the length of the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
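For implementation purposes, this grammar corresponds to a small
inductive datatype. The following sketch is in Scala, the language we
use for our prototypes later on; the constructors \texttt{ZERO} and
\texttt{ONE} (for the empty language and the empty-string language) are
standard additions not shown in the grammar above:
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp                     // matches no string
case object ONE extends Rexp                      // matches only ""
case class CHAR(c: Char) extends Rexp             // c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp   // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // r1 . r2
case class STAR(r: Rexp) extends Rexp             // r*
\end{verbatim}
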
Modern regular expression matchers used by programmers,
however,
support much richer constructs, such as bounded repetitions,
negations,
and back-references.
To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs, while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention
throughout this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at basic regular expressions
and bounded repetitions.

Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
Depending on the types of constructs,
the task of matching and lexing with them
will have different levels of complexity.
Some of those constructs are syntactic sugar, that is,
shorthand notations
that save the programmers a few keystrokes.
These will not cause problems for regex libraries.
For example the
non-binary alternative involving three or more choices just means:
\[
	(a | b | c) \stackrel{means}{=} ((a + b) + c)
\]
Similarly, the range operator
is just a concise way
of expressing an alternative of consecutive characters:
\[
	[0\mbox{-}9] \stackrel{means}{=} (0 | 1 | \ldots | 9 )
\]
The
wildcard character '$.$' is used to refer to any single character,
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
except the newline.
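
Desugaring these conveniences is straightforward. A small sketch on top
of the \texttt{Rexp} datatype from above (again our own illustration):
\begin{verbatim}
// a|b|c  ~>  (a+b)+c
def alts(rs: List[Rexp]): Rexp = rs.reduceLeft(ALT)

// [0-9]  ~>  0+1+...+9
def range(lo: Char, hi: Char): Rexp =
  alts((lo to hi).map(CHAR).toList)
\end{verbatim}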

\subsection{Bounded Repetitions}
More interesting are bounded repetitions, which can
make the regular expressions much
more compact.
Normally there are four kinds of bounded repetitions:
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
(where $n$ and $m$ are constant natural numbers).
Like the star regular expression, the set of strings or language
a bounded regular expression can match
is defined using the power operation on sets:
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
For example, $L \; a^{\{2\ldots 3\}} = \{aa, aaa\}$.
The attraction of bounded repetitions is that they can be
used to avoid a size blow up: for example $r^{\{n\}}$
is a shorthand for
the much longer regular expression:
\[
	\underbrace{r\ldots r}_\text{n copies of r}.
\]


The problem with matching
such bounded repetitions
is that tools based on the classic notion of
automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
in such situations
either become very slow
(for example Verbatim++ \cite{Verbatimpp}) or run
out of memory (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
in C and JAVA that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above countdown regular expression,
a small $n$ (say 20) would result in a program representing a
DFA
with millions of states.}) for large counters.
A classic example of this phenomenon is the regular expression $(a+b)^* a (a+b)^{\{n\}}$
for which the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
the corresponding $\mathit{NFA}$ looks like:
\vspace{6mm}
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0) {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
   \node[state, red] (q_2) [right=of q_1] {$q_2$};
   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
    \path[->]
    (q_0) edge node {a} (q_1)
          edge [loop below] node {a,b} ()
    (q_1) edge node {a,b} (q_2)
    (q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
and when turned into a DFA by the subset construction
requires at least $2^3$ states.\footnote{The
red states are ``countdown states'' which count down
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and
needs to match 2 more iterations of $(a|b)$ to complete.
State $q_2$, on the other hand, can be viewed as a state
after $q_1$ has consumed 1 character, and just waits
for 1 more character to complete.
The state $q_3$ is the last (accepting) state, requiring 0
more characters.
Depending on the suffix of the
input string up to the current read location,
the states $q_1$, $q_2$ and $q_3$
may or may
not be active.
A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
contain at least $2^3$ non-equivalent states that cannot be merged,
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition counts, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all different
scenarios in which ``countdown'' states are active.}
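
This blow-up can be observed concretely. The following Scala sketch (our
own illustration) simulates the NFA for $(a+b)^*a(a+b)^{\{n\}}$ with a
bit-vector of the active states $q_1,\ldots,q_{n+1}$ ($q_0$ is always
active because of its self-loop) and counts how many distinct state sets
the subset construction reaches:
\begin{verbatim}
def dfaStates(n: Int): Int = {
  val mask = (1 << (n + 1)) - 1
  // bit i is set iff an 'a' was read i+1 characters ago,
  // i.e. iff NFA state q_(i+1) is active
  def step(bits: Int, c: Char): Int =
    ((bits << 1) | (if (c == 'a') 1 else 0)) & mask
  val seen = scala.collection.mutable.Set(0)
  val todo = scala.collection.mutable.Queue(0)
  while (todo.nonEmpty) {
    val b = todo.dequeue()
    for (c <- "ab") {
      val next = step(b, c)
      if (seen.add(next)) todo.enqueue(next)
    }
  }
  seen.size
}
// dfaStates(2) == 8 = 2^3;  dfaStates(14) == 32768 = 2^15
\end{verbatim}
Every bit-pattern is reachable (each suffix of $a$'s and $b$'s produces a
different pattern), so none of the $2^{n+1}$ state sets can be merged.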


Bounded repetitions are important because they
tend to occur frequently in practical use,
for example in the regex library RegExLib, in
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
The network security community curates a list
of malicious patterns written as regexes,
which is used by Snort's detection engine
to match against network traffic for any hostile
activities such as buffer overflow attacks.},
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the
XSDs they found on the Maven.org central repository
have bounded regular expressions in them.
Often the counters are quite large, with the largest being
close to ten million.
A smaller sample XSD they gave
is:
\lstset{
	basicstyle=\fontsize{8.5}{9}\ttfamily,
	language=XML,
	morekeywords={encoding,
		xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
}
\begin{lstlisting}
<sequence minOccurs="0" maxOccurs="65535">
	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{lstlisting}
This can be seen as the regex
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
on bounded regular expressions:
for example, in the regular expression matching library in the Go
language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
can be above 1000; and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big.\footnote{Try it out here: \url{https://rustexp.lpil.uk}}
As Becchi and Crawley \cite{Becchi08} have pointed out,
the reason for these restrictions
is that they simulate a non-deterministic finite
automaton (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large,
the memory requirement could become
infeasible, and a regex engine
like the one in Go will reject such patterns straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]

    \node (q0) [state, initial] {$0$};
    \node (q1) [state, right = of q0] {$1$};
    \node (qdots) [right = of q1] {$\ldots$};
    \node (qn) [state, right = of qdots] {$n$};
    \node (qn1) [state, right = of qn] {$n+1$};
    \node (qn2) [state, right = of qn1] {$n+2$};
    \node (qn3) [state, accepting, right = of qn2] {$n+3$};

\path [-stealth, thick]
    (q0) edge [loop above] node {a} ()
    (q0) edge node {a} (q1)
    (q1) edge node {.} (qdots)
    (qdots) edge node {.} (qn)
    (qn) edge node {.} (qn1)
    (qn1) edge node {b} (qn2)
    (qn2) edge node {$c$} (qn3);
\end{tikzpicture}
\end{center}
\caption{The example given by Becchi and Crawley
	showing that NFA simulation can consume large
	amounts of memory: $.^*a.^{\{n\}}bc$ matching
	strings of the form $aaa\ldots aaaabc$.
	When traversing in a breadth-first manner,
	all states from 0 up to $n+1$ will become active.}\label{fig:inj}
\end{figure}
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
These solutions can be quite efficient,
with the ability to process
gigabits of input strings per second
even with large counters \cite{Becchi08}.
But these practical solutions do not come with
formal guarantees, and as pointed out by
Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.


In the work reported in \cite{ITP2023} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
Our results
extend straightforwardly to
repetitions with intervals such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives (more on this later)
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only small
inductive datatypes and recursive functions,
making it handy to deal with them in theorem provers.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
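
To give a flavour of this, the sketch below extends the \texttt{Rexp}
datatype from before with a constructor for $r^{\{n\}}$ and shows the
corresponding clauses of the (unsimplified) Brzozowski derivative. This
is only our illustration of the idea, not the verified Isabelle code:
\begin{verbatim}
case class NTIMES(r: Rexp, n: Int) extends Rexp   // r{n}

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
  case NTIMES(r1, n) => n == 0 || nullable(r1)
}

// der(c, r) matches s iff r matches c::s
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  case NTIMES(r1, n) =>   // decrement the counter, do not unfold r{n}
    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}
For instance, \texttt{matches(NTIMES(CHAR('a'), 3), "aaa")} yields
\texttt{true}: the counter is simply decremented at each step, so the
regular expression is never expanded into $n$ copies.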


\subsection{Back-References}
The other way to simulate an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
fails.
This method, often called ``depth-first search'',
is efficient in many cases, but can end up
with exponential runtime.
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.

639 operator is omitted for brevity: |
|
640 \begin{center} |
|
641 $r_1r_2r_3r_4$ |
|
642 \end{center} |
|
643 In this example, |
|
644 one could label sub-expressions of interest |
|
645 by parenthesizing them and giving |
|
646 them a number in the order in which their opening parentheses appear. |
|
647 One possible way of parenthesizing and labelling is: |
|
648 \begin{center} |
|
649 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
|
650 \end{center} |
|
651 The sub-expressions |
|
652 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
|
653 by 1 to 4, and can be ``referred back'' by their respective numbers. |
|
654 %These sub-expressions are called "capturing groups". |
|
655 To do so, one uses the syntax $\backslash i$ |
|
656 to denote that we want the sub-string |
|
657 of the input matched by the i-th |
|
658 sub-expression to appear again, |
|
659 exactly the same as it first appeared: |
|
660 \begin{center} |
|
661 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
|
662 \underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$ |
|
663 \end{center} |
|
664 Once the sub-string $s_i$ for the sub-expression $r_i$ |
|
665 has been fixed, there is no variability on what the back-reference |
|
666 label $\backslash i$ can be---it is tied to $s_i$. |
|
667 The overall string will look like $\ldots s_i \ldots s_i \ldots $ |
|
668 %The backslash and number $i$ are the |
|
669 %so-called "back-references". |
|
670 %Let $e$ be an expression made of regular expressions |
|
671 %and back-references. $e$ contains the expression $e_i$ |
|
672 %as its $i$-th capturing group. |
|
673 %The semantics of back-reference can be recursively |
|
674 %written as: |
|
675 %\begin{center} |
|
676 % \begin{tabular}{c} |
|
677 % $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
|
678 % $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
|
679 % \end{tabular} |
|
680 %\end{center} |
|
A concrete example
of a back-reference is
\begin{center}
	$(.^*)\backslash 1$,
\end{center}
which matches
strings that can be split into two identical halves,
for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
Note that this is different from
repeating the sub-expression verbatim like
\begin{center}
	$(.^*)(.^*)$,
\end{center}
which does not impose any restrictions on what strings the second
sub-expression $.^*$
might match.
Another example of a back-reference is
\begin{center}
	$(.)(.)\backslash 2\backslash 1$
\end{center}
which matches four-character palindromes
like $abba$, $x??x$ and so on.

Back-references are a regex construct
that programmers find quite useful.
According to Becchi and Crawley \cite{Becchi08},
6\% of Snort rules (up until 2008) use them.
A common use of back-references
is to match well-formed HTML files,
where back-references are convenient for matching
opening and closing tags like
\begin{center}
	$\langle html \rangle \ldots \langle / html \rangle$
\end{center}
A regex describing such a format
is
\begin{center}
	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes
goes beyond regular languages
once back-references are included.
In fact, regexes with back-references can express
languages that are not contained in the context-free
languages either.
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
which cannot be expressed by
context-free grammars \cite{campeanu2003formal}.
Such a language is contained in the context-sensitive level
of the formal language hierarchy.
Moreover, solving the matching problem involving back-references
is known to be NP-complete \parencite{alfred2014algorithms}.
Regex libraries supporting back-references such as
PCRE \cite{pcre} therefore have to
revert to a depth-first search algorithm with backtracking.
What is unexpected is that even in the cases
not involving back-references, there is still
a (non-negligible) chance they might backtrack super-linearly,
as shown in the graphs in Figure~\ref{fig:aStarStarb}.

Summing up, we can categorise existing
practical regex libraries into two kinds:
(i) those with linear
time guarantees like Go and Rust; the downside is that
they impose restrictions
on the regular expressions (not allowing back-references,
bounded repetitions cannot exceed an ad hoc limit, etc.);
and (ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
The latter can potentially ``grind to a halt''
on some very simple cases, resulting
in ReDoS attacks if exposed to the internet.

The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs,
as we shall explain next.

\section{Error-prone POSIX Implementations}
Very often there are multiple ways of matching a string
with a regular expression.
In such cases the regular expression matcher needs to
disambiguate.
The more widely used strategy is called POSIX,
which roughly speaking always chooses the longest initial match.
The POSIX strategy is widely adopted in many regular expression matchers
because it is a natural strategy for lexers.
However, many implementations (including the C libraries
used by Linux and OS X distributions) contain bugs
or do not meet the specification they claim to adhere to.
Kuklewicz maintains a unit test repository which lists some
problems with existing regular expression engines \cite{KuklewiczHaskell}.
In some cases, they either fail to generate a
result when there exists a match,
or give results that are inconsistent with the POSIX standard.
A concrete example is the regex
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$.
\end{center}
The correct POSIX match for the above
involves the entire string $ababa$,
split into two Kleene star iterations, namely $[ab], [aba]$ at positions
$[0, 2), [2, 5)$
respectively.
But feeding this example to the different engines
listed at regex101\footnote{
regex101 is an online regular expression matcher which
provides an API for trying out regular expression
engines of multiple popular programming languages like
Java, Python, Go, etc.} \parencite{regex101}
yields
only two incomplete matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
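
This non-POSIX behaviour can be reproduced on the JVM, whose
\texttt{java.util.regex} engine uses greedy backtracking rather than
the POSIX longest-match rule. A small sketch (our own illustration):
\begin{verbatim}
import java.util.regex.Pattern

object PosixDemo {
  def main(args: Array[String]): Unit = {
    val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")
    while (m.find()) {
      if (m.end() > m.start())   // skip the zero-width matches
        println(s"[${m.start()}, ${m.end()}): ${m.group()}")
    }
    // prints [0, 3): aba  and  [4, 5): a --
    // not the POSIX answer, which is the whole string [0, 5)
  }
}
\end{verbatim}
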
Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
commented that most regex libraries do not
correctly implement the central POSIX
rule, called the maximum munch rule.
Grathwohl \parencite{grathwohl2014crash} wrote:
\begin{quote}\it
	``The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.''
\end{quote}
People have recognised that the implementation complexity of POSIX rules also comes from
the specification not being very precise.
The developers of the regexp package of Go
\footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
commented that
\begin{quote}\it
	``The POSIX rule is computationally prohibitive
	and not even well-defined.''
\end{quote}
There are many informal summaries of this disambiguation
strategy, which are often quite long and delicate.
For example Kuklewicz \cite{KuklewiczHaskell}
described the POSIX rule as (section 1, last paragraph):
\begin{quote}
\begin{itemize}
	\item
	regular expressions (REs) take the leftmost starting match, and the longest match starting there;
	\item
	earlier subpatterns have leftmost-longest priority over later subpatterns;
	\item
	higher-level subpatterns have leftmost-longest priority over their component subpatterns;
	\item
	REs have right associative concatenation which can be changed with parenthesis;
	\item
	parenthesized subexpressions return the match from their last usage;
	\item
	text of component subexpressions must be contained in the text of the
	higher-level subexpressions;
	\item
	if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions;
	\item
	if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string.
\end{itemize}
\end{quote}