megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
However, those matchers can exhibit a surprising security vulnerability
under a certain class of inputs.

Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s,
the decision takes an unreasonably long time to finish
(see the graphs in Figure~\ref{fig:aStarStarb}).
This is clearly exponential behaviour,
triggered by some relatively simple regex patterns.
Java 9 and newer
versions improve this behaviour, but are still slow compared
with the approach we are going to use.

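The slowdown is easy to reproduce. The following Python sketch (our own illustration; any backtracking engine behaves similarly) times Python's \texttt{re} module on $(a^*)^*\,b$ against pure-$a$ strings; each additional $a$ roughly doubles the running time:

```python
import re
import time

# The "evil" pattern (a*)*b from the text; Python's re module uses a
# backtracking engine, so a failing match explores exponentially many
# ways of splitting the a's between the two stars.
evil = re.compile(r'(a*)*b')

for n in [10, 15, 20]:
    s = 'a' * n          # no final b, so the match must fail
    start = time.perf_counter()
    assert evil.match(s) is None
    print(f"n={n:2d}: {time.perf_counter() - start:.4f}s")
```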

This superlinear blowup in regular expression engines
has caused grief in real life so repeatedly that it
has been given a name: ``catastrophic backtracking''.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.

\begin{figure}[p]
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
The reason for their superlinear behaviour is that they do a depth-first search.
If the string does not match, the engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
They therefore concluded that evil regular expressions
are a problem ``more than a parlour trick'', but one that
requires
more research attention.

This work aims to address this issue
with the help of formal proofs.
We offer a lexing algorithm based
on Brzozowski derivatives with certified correctness (in
Isabelle/HOL)
and a certified finiteness property.
These properties guarantee the absence of
catastrophic backtracking in most cases.
We will explain in the next sections why we chose this
approach (Brzozowski derivatives and formal proofs).
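To give a first idea of the approach, below is a small Python sketch of a derivative-based matcher (our own illustration; the names and simplification rules here are ours, and the verified Isabelle/HOL development is of course far more thorough). Even the evil $(a^*)^*\,b$ from above is decided instantly, because derivatives never backtrack:

```python
from dataclasses import dataclass

# Regular expressions as an inductive datatype (a sketch only).
class Re: pass

@dataclass(frozen=True)
class Zero(Re): pass                  # matches no string

@dataclass(frozen=True)
class One(Re): pass                   # matches only the empty string

@dataclass(frozen=True)
class Chr(Re): c: str                 # matches a single character

@dataclass(frozen=True)
class Alt(Re): r1: Re; r2: Re         # alternative r1 + r2

@dataclass(frozen=True)
class Seq(Re): r1: Re; r2: Re         # sequence r1 . r2

@dataclass(frozen=True)
class Star(Re): r: Re                 # r*

# Smart constructors performing the usual simplifications; these keep
# derivatives small (this is where a finiteness property comes in).
def alt(r1, r2):
    if isinstance(r1, Zero): return r2
    if isinstance(r2, Zero): return r1
    if r1 == r2: return r1
    return Alt(r1, r2)

def seq(r1, r2):
    if isinstance(r1, Zero) or isinstance(r2, Zero): return Zero()
    if isinstance(r1, One): return r2
    if isinstance(r2, One): return r1
    return Seq(r1, r2)

def nullable(r):
    if isinstance(r, (Zero, Chr)): return False
    if isinstance(r, (One, Star)): return True
    if isinstance(r, Alt): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq): return nullable(r.r1) and nullable(r.r2)

def der(c, r):
    if isinstance(r, (Zero, One)): return Zero()
    if isinstance(r, Chr): return One() if r.c == c else Zero()
    if isinstance(r, Alt): return alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        if nullable(r.r1):
            return alt(seq(der(c, r.r1), r.r2), der(c, r.r2))
        return seq(der(c, r.r1), r.r2)
    if isinstance(r, Star): return seq(der(c, r.r), r)

def matches(r, s):
    for c in s:                       # one derivative step per character
        r = der(c, r)
    return nullable(r)

# The evil expression (a*)*b is decided instantly, even for long inputs.
evil = Seq(Star(Star(Chr('a'))), Chr('b'))
print(matches(evil, 'a' * 1000))        # False: the final b is missing
print(matches(evil, 'a' * 1000 + 'b'))  # True
```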
\section{The Problem with Bounded Repetitions}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work and we will compare it later on
with the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, but automata instead.
For Verbatim++ the problem is dealing with bounded regular expressions of the form
$r^{n}$ where $n$ is a constant specifying that $r$ must repeat
exactly $n$ times. The Verbatim++ lexer becomes excruciatingly slow
on the bounded repetitions construct.

In the work reported in \cite{CSL2022} and here, we add better support
for them.
The other repetition constructs include
$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$, which specify
intervals for how many times $r$ should match:
$r^{\ldots m}$ means repeating
at most $m$ times, $r^{n\ldots}$ means repeating at least $n$ times, and
$r^{n\ldots m}$ means repeating between $n$ and $m$ times.
The results presented in this thesis extend straightforwardly to them
too.
Their formal definitions will be given later.

Bounded repetitions are important because they
tend to occur often in practical use, for example in RegExLib,
Snort, as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
bounded regular expressions occur frequently in the latter and can have
counters of up to ten million. An example XSD with a large counter they gave
was:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
conforming to the floating point number format).
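For illustration, these counters can be written down directly in most regex dialects. The sketch below is our own, with the literal characters \texttt{a} and \texttt{b} standing in for the two element regular expressions of the XSD, using Python's syntax for bounded repetitions:

```python
import re

# (a b{2,12}){0,65535}, with plain characters standing in for the two
# element regular expressions of the XSD (an illustration only).
xsd_re = re.compile(r'(?:ab{2,12}){0,65535}\Z')

print(bool(xsd_re.match('')))          # True: zero repetitions allowed
print(bool(xsd_re.match('abbabbbb')))  # True: two repetitions
print(bool(xsd_re.match('ab')))        # False: only one b, too few
```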
The problem here is that tools based on the classic notion of
automata need to expand $r^{n}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
A classic example is the regular expression $(a+b)^* a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states (more on this
later).
Therefore regular expression matching libraries that rely on the classic
notion of DFAs often impose ad hoc limits
for bounded regular expressions:
for example, in the regular expression matching library of the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000; and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big. These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
straightforwardly extended to deal with bounded regular expressions
and moreover the resulting code still consists of only simple
recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
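The $2^{n+1}$ state blowup can be observed concretely. The following Python sketch (our own, with a hand-coded NFA for $(a+b)^*\,a\,(a+b)^{n}$) performs the subset construction and counts the reachable DFA states:

```python
# Subset construction on a hand-coded NFA for (a+b)* a (a+b)^n, counting
# the reachable DFA states (our own sketch illustrating the blowup).
def dfa_states(n: int) -> int:
    # NFA states 0..n+1: state 0 loops on a and b, reading an 'a' also
    # moves 0 -> 1; states 1..n advance on any letter; n+1 is accepting.
    def step(subset, c):
        out = set()
        for q in subset:
            if q == 0:
                out.add(0)
                if c == 'a':
                    out.add(1)
            elif q <= n:
                out.add(q + 1)
        return frozenset(out)

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:                       # explore all reachable subsets
        s = todo.pop()
        for c in 'ab':
            t = step(s, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

for n in range(1, 6):
    print(n, dfa_states(n))   # the count grows as 2^(n+1)
```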

\section{The Terminology Regex, and why Backtracking?}
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in
each of these phases:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.

\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regular expressions are not limited to slowness on certain
cases.
Another problem with these libraries is that they
offer no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
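One such inconsistency is easy to demonstrate. POSIX requires the leftmost-longest match, but backtracking engines such as Python's \texttt{re} commit to the first alternative that succeeds; the sketch below is our own illustration:

```python
import re

# POSIX demands the longest overall match; Python's re instead takes
# the first alternative that works ("leftmost-greedy" semantics).
m = re.match(r'a|ab', 'ab')
print(m.group())   # prints 'a'; a POSIX-compliant matcher gives 'ab'
```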