301 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
301 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
302 \item |
302 \item |
303 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\ |
303 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\ |
304 \end{itemize} |
304 \end{itemize} |
305 \end{quote} |
305 \end{quote} |
306 However, the author noted that lexers are rarely correct according to this standard. |
306 However, the author noted that various lexers that claim to be POSIX |
307 Being able to formally define and capture the idea of POSIX rules and matching/lexing algorithms and prove |
307 are rarely correct according to this standard. |
308 the correctness of such algorithms against the POSIX semantics definitions |
308 There are numerous occasions where programmers realised the subtlety and |
309 would be valuable. |
309 difficulty to implement correctly, one such quote from Go's regexp library author |
310 |
310 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}} |
311 Formal proofs are machine checked programs |
311 \begin{quote}\it |
312 such as the ones written in Isabelle/HOL, is a powerful means |
312 `` |
313 for computer scientists to be certain about correctness of their algorithms and correctness properties. |
313 The POSIX rule is computationally prohibitive |
314 This is done by automatically checking that the end goal is derived solely from a list of axioms, definitions |
314 and not even well-defined. |
315 and proof rules that are known to be correct. |
315 `` |
316 The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such |
316 \end{quote} |
317 methods as their implementation and complexity analysis tend to be error-prone. |
317 Being able to formally define and capture the idea of |
318 |
318 POSIX rules and prove |
319 There have been many interesting steps taken in the direction of formal proofs and regular expressions |
319 the correctness of regular expression matching/lexing |
320 lexing and matching. |
320 algorithms against the POSIX semantics definitions |
321 |
321 is valuable. |
322 |
322 |
|
323 |
|
324 Formal proofs are |
|
325 machine checked programs |
|
326 %such as the ones written in Isabelle/HOL, is a powerful means |
|
327 for computer scientists to be certain |
|
328 about the correctness of their algorithms. |
|
329 This is done by |
|
330 recursively checking that every fact in a proof script |
|
331 is either an axiom or a fact that is derived from |
|
332 known axioms or verified facts. |
|
333 %The problem of regular expressions lexing and catastrophic backtracking are a suitable problem for such |
|
334 %methods as their implementation and complexity analysis tend to be error-prone. |
|
335 Formal proofs provides an unprecendented level of asssurance |
|
336 that an algorithm will perform as expected under all inputs. |
|
337 The software systems that help people interactively build and check |
|
338 such proofs are called theorem-provers or proof assitants. |
|
339 Many theorem-provers have been developed, such as Mizar, |
|
340 Isabelle/HOL, HOL-Light, HOL4, |
|
341 Coq, Agda, Idris, Lean and so on. |
|
342 Isabelle/HOL is a theorem prover with a simple type theory |
|
343 and powerful automated proof generators like sledgehammer. |
|
344 We chose to use Isabelle/HOL for its powerful automation |
|
345 and ease and simplicity in expressing regular expressions and |
|
346 regular languages. |
|
347 %Some of those employ |
|
348 %dependent types like Mizar, Coq, Agda, Lean and Idris. |
|
349 %Some support a constructivism approach, such as Coq. |
|
350 |
|
351 |
|
352 Formal proofs on regular expression matching and lexing |
|
353 complements the efforts in |
|
354 industry which tend to focus on overall speed |
|
355 with techniques like parallelization (FPGA paper), tackling |
|
356 the problem of catastrophic backtracking |
|
357 in an ad-hoc manner (cloudflare and stackexchange article). |
|
358 |
|
359 There have been many interesting steps in the theorem-proving community |
|
360 about formalising regular expressions and lexing. |
|
361 One flavour is to build from the regular expression an automaton, and determine |
|
362 acceptance in terms of the resulting |
|
363 state after executing the input string on that automaton. |
|
364 Automata formalisations are in general harder and more cumbersome to deal |
|
365 with for theorem provers than working directly on regular expressions. |
|
366 One such example is by Nipkow \cite{Nipkow1998}. |
|
367 %They |
|
368 %made everything recursive (for example the next state function), |
|
369 As a core idea, they |
|
370 used a list of booleans to name each state so that |
|
371 after composing sub-automata together, renaming the states to maintain |
|
372 the distinctness of each state is recursive and simple. |
|
373 The result was the obvious lemmas incorporating |
|
374 ``a painful amount of detail'' in their formalisation. |
|
375 Sometimes the automata are represented as graphs. |
|
376 But graphs are not inductive datatypes. |
|
377 Having to set the induction principle on the number of nodes |
|
378 in a graph makes formal reasoning non-intuitive and convoluted, |
|
379 resulting in large formalisations \cite{Lammich2012}. |
|
380 When combining two graphs, one also needs to make sure that the nodes in |
|
381 both graphs are distinct, which almost always involve |
|
382 renaming of the nodes. |
|
383 A theorem-prover which provides dependent types such as Coq |
|
384 can alleviate the issue of representing graph nodes |
|
385 \cite{Doczkal2013}. There the representation of nodes is made |
|
386 easier by the use of $\textit{FinType}$. |
|
387 Another representation for automata are matrices. |
|
388 But the induction for them is not as straightforward either. |
|
389 There are some more clever representations, for example one by Paulson |
|
390 using hereditarily finite sets \cite{Paulson2015}. |
|
391 There the problem with combining graphs can be solved better. |
|
392 %but we believe that such clever tricks are not very obvious for |
|
393 %the John-average-Isabelle-user. |
|
394 |
|
395 The approach that operates directly on regular expressions circumvents the problem of |
|
396 conversion between a regular expression and an automaton, thereby avoiding representation |
|
397 problems altogether, despite that regular expressions may be seen as a variant of a |
|
398 non-deterministic finite automaton (ref Laurikari tagged NFA paper). |
|
399 To matching a string, a sequence of algebraic transformations called |
|
400 (Brzozowski) $\textit{derivatives}$ (ref Brzozowski) is carried out on that regular expression. |
|
401 Each derivative takes a character and a regular expression, |
|
402 and returns a new regular expression whose language is closely related to |
|
403 the original regular expression's language: |
|
404 strings prefixed with that input character will have their head removed |
|
405 and strings not prefixed |
|
406 with that character will be eliminated. |
|
407 After taking derivatives with respect to all the characters the string is |
|
408 exhausted. Then an algorithm checks whether the empty string is in that final |
|
409 regular expression's language. |
|
410 If so, a match exists and the string is in the language of the input regular expression. |
|
411 |
|
412 Again this process can be seen as the simulation of an NFA running on a string, |
|
413 but the recursive nature of the datatypes and functions involved makes |
|
414 derivatives a perfect object of study for theorem provers. |
|
415 That is why there has been numerous formalisations of regular expressions |
|
416 and Brzozowski derivatives in the functional programming and |
|
417 theorem proving community (a large list of refs to derivatives formalisation publications). |
|
418 Ribeiro and Du Bois \cite{RibeiroAgda2017} have |
|
419 formalised the notion of bit-coded regular expressions |
|
420 and proved their relations with simple regular expressions in |
|
421 the dependently-typed proof assistant Agda. |
|
422 They also proved the soundness and completeness of a matching algorithm |
|
423 based on the bit-coded regular expressions. Their algorithm is a decision procedure |
|
424 that gives a Yes/No answer, which does not produce |
|
425 lexical values. |
|
426 %X also formalised derivatives and regular expressions, producing "parse trees". |
|
427 %(Some person who's a big name in formal methods) |
|
428 |
|
429 |
|
430 The variant of the problem we are looking at centers around |
|
431 an algorithm (which we call $\blexer$) developed by Sulzmann and Lu \ref{Sulzmann2014}. |
|
432 The reason we chose to look at $\blexer$ and its simplifications |
|
433 is because it allows a lexical tree to be generated |
|
434 by some elegant and subtle procedure based on Brzozowski derivatives. |
|
435 The procedures are made of recursive functions and inductive datatypes just like derivatives, |
|
436 allowing intuitive and concise formal reasoning with theorem provers. |
|
437 Most importantly, $\blexer$ opens up a path to an optimized version |
|
438 of $\blexersimp$ possibilities to improve |
|
439 performance with simplifications that aggressively change the structure of regular expressions. |
|
440 While most derivative-based methods |
|
441 rely on structures to be maintained to allow induction to |
|
442 go through. |
|
443 For example, Egolf et al. \ref{Verbatim} have developed a verified lexer |
|
444 with derivatives, but as soon as they started introducing |
|
445 optimizations such as memoization, they reverted to constructing |
|
446 DFAs first. |
|
447 Edelmann \ref{Edelmann2020} explored similar optimizations in his |
|
448 work on verified LL(1) parsing, with additional enhancements with data structures like |
|
449 zippers. |
|
450 |
|
451 %Sulzmann and Lu have worked out an algorithm |
|
452 %that is especially suited for verification |
|
453 %which utilized the fact |
|
454 %that whenever ambiguity occurs one may duplicate a sub-expression and use |
|
455 %different copies to describe different matching choices. |
|
456 The idea behind the correctness of $\blexer$ is simple: during a derivative, |
|
457 multiple matches might be possible, where an alternative with multiple children |
|
458 each corresponding to a |
|
459 different match is created. In the end of |
|
460 a lexing process one always picks up the leftmost alternative, which is guarnateed |
|
461 to be a POSIX value. |
|
462 This is done by consistently keeping sub-regular expressions in an alternative |
|
463 with longer submatches |
|
464 to the left of other copies ( |
|
465 Remember that POSIX values are roughly the values with the longest inital |
|
466 submatch). |
|
467 The idea behind the optimized version of $\blexer$, which we call $\blexersimp$, |
|
468 is that since we only take the leftmost copy, then all re-occurring copies can be |
|
469 eliminated without losing the POSIX property, and this can be done with |
|
470 children of alternatives at different levels by merging them together. |
|
471 Proving $\blexersimp$ requires a different |
|
472 proof strategy compared to that by Ausaf \cite{FahadThesis}. |
|
473 We invent a rewriting relation as an |
|
474 inductive predicate which captures |
|
475 a strong enough invariance that ensures correctness, |
|
476 which commutes with the derivative operation. |
|
477 This predicate allows a simple |
|
478 induction on the input string to go through. |
|
479 |
|
480 %This idea has been repeatedly used in different variants of lexing |
|
481 %algorithms in their paper, one of which involves bit-codes. The bit-coded |
|
482 %derivative-based algorithm even allows relatively aggressive |
|
483 %%simplification rules which cause |
|
484 %structural changes that render the induction used in the correctness |
|
485 %proofs unusable. |
|
486 %More details will be given in \ref{Bitcoded2} including the |
|
487 %new correctness proof which involves a new inductive predicate which allows |
|
488 %rule induction to go through. |
|
489 |
|
490 |
|
491 |
|
492 |
|
493 |
|
494 |
|
495 |
|
496 %first character is removed |
|
497 %state of the automaton after matching that character |
|
498 %where nodes are represented as |
|
499 %a sub-expression (for example tagged NFA |
|
500 %Working on regular expressions |
|
501 %Because of these problems with automata, we prefer regular expressions |
|
502 %and derivatives rather than an automata (or graph-based) approach which explicitly converts between |
|
503 %the regular expression and a different data structure. |
|
504 % |
|
505 % |
|
506 %The key idea |
323 |
507 |
324 (ends) |
508 (ends) |
325 |
509 |
326 %Regular expressions are widely used in computer science: |
510 %Regular expressions are widely used in computer science: |
327 %be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; |
511 %be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; |
341 %hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
525 %hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
342 %However, those engines can sometimes also exhibit a surprising security vulnerability |
526 %However, those engines can sometimes also exhibit a surprising security vulnerability |
343 %under a certain class of inputs. |
527 %under a certain class of inputs. |
344 %However, , this is not the case for $\mathbf{all}$ inputs. |
528 %However, , this is not the case for $\mathbf{all}$ inputs. |
345 %TODO: get source for SNORT/BRO's regex matching engine/speed |
529 %TODO: get source for SNORT/BRO's regex matching engine/speed |
346 |
|
347 |
|
348 Consider for example the regular expression $(a^*)^*\,b$ and |
|
349 strings of the form $aa..a$. These strings cannot be matched by this regular |
|
350 expression: Obviously the expected $b$ in the last |
|
351 position is missing. One would assume that modern regular expression |
|
352 matching engines can find this out very quickly. Surprisingly, if one tries |
|
353 this example in JavaScript, Python or Java 8, even with small strings, |
|
354 say of lenght of around 30 $a$'s, |
|
355 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}). |
|
356 The algorithms clearly show exponential behaviour, and as can be seen |
|
357 is triggered by some relatively simple regular expressions. |
|
358 Java 9 and newer |
|
359 versions improve this behaviour somewhat, but are still slow compared |
|
360 with the approach we are going to use in this thesis. |
|
361 |
|
362 |
|
363 |
|
364 This superlinear blowup in regular expression engines |
|
365 has caused grief in ``real life'' where it is |
|
366 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
|
367 For example, on 20 July 2016 one evil |
|
368 regular expression brought the webpage |
|
369 \href{http://stackexchange.com}{Stack Exchange} to its |
|
370 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
|
371 In this instance, a regular expression intended to just trim white |
|
372 spaces from the beginning and the end of a line actually consumed |
|
373 massive amounts of CPU resources---causing the web servers to grind to a |
|
374 halt. In this example, the time needed to process |
|
375 the string was |
|
376 $O(n^2)$ with respect to the string length $n$. This |
|
377 quadratic overhead was enough for the homepage of Stack Exchange to |
|
378 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
|
379 attack and therefore stopped the servers from responding to any |
|
380 requests. This made the whole site become unavailable. |
|
381 |
|
382 \begin{figure}[p] |
|
383 \begin{center} |
|
384 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}} |
|
385 \begin{tikzpicture} |
|
386 \begin{axis}[ |
|
387 xlabel={$n$}, |
|
388 x label style={at={(1.05,-0.05)}}, |
|
389 ylabel={time in secs}, |
|
390 enlargelimits=false, |
|
391 xtick={0,5,...,30}, |
|
392 xmax=33, |
|
393 ymax=35, |
|
394 ytick={0,5,...,30}, |
|
395 scaled ticks=false, |
|
396 axis lines=left, |
|
397 width=5cm, |
|
398 height=4cm, |
|
399 legend entries={JavaScript}, |
|
400 legend pos=north west, |
|
401 legend cell align=left] |
|
402 \addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
|
403 \end{axis} |
|
404 \end{tikzpicture} |
|
405 & |
|
406 \begin{tikzpicture} |
|
407 \begin{axis}[ |
|
408 xlabel={$n$}, |
|
409 x label style={at={(1.05,-0.05)}}, |
|
410 %ylabel={time in secs}, |
|
411 enlargelimits=false, |
|
412 xtick={0,5,...,30}, |
|
413 xmax=33, |
|
414 ymax=35, |
|
415 ytick={0,5,...,30}, |
|
416 scaled ticks=false, |
|
417 axis lines=left, |
|
418 width=5cm, |
|
419 height=4cm, |
|
420 legend entries={Python}, |
|
421 legend pos=north west, |
|
422 legend cell align=left] |
|
423 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
424 \end{axis} |
|
425 \end{tikzpicture}\\ |
|
426 \begin{tikzpicture} |
|
427 \begin{axis}[ |
|
428 xlabel={$n$}, |
|
429 x label style={at={(1.05,-0.05)}}, |
|
430 ylabel={time in secs}, |
|
431 enlargelimits=false, |
|
432 xtick={0,5,...,30}, |
|
433 xmax=33, |
|
434 ymax=35, |
|
435 ytick={0,5,...,30}, |
|
436 scaled ticks=false, |
|
437 axis lines=left, |
|
438 width=5cm, |
|
439 height=4cm, |
|
440 legend entries={Java 8}, |
|
441 legend pos=north west, |
|
442 legend cell align=left] |
|
443 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
444 \end{axis} |
|
445 \end{tikzpicture} |
|
446 & |
|
447 \begin{tikzpicture} |
|
448 \begin{axis}[ |
|
449 xlabel={$n$}, |
|
450 x label style={at={(1.05,-0.05)}}, |
|
451 %ylabel={time in secs}, |
|
452 enlargelimits=false, |
|
453 xtick={0,5,...,30}, |
|
454 xmax=33, |
|
455 ymax=35, |
|
456 ytick={0,5,...,30}, |
|
457 scaled ticks=false, |
|
458 axis lines=left, |
|
459 width=5cm, |
|
460 height=4cm, |
|
461 legend entries={Dart}, |
|
462 legend pos=north west, |
|
463 legend cell align=left] |
|
464 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; |
|
465 \end{axis} |
|
466 \end{tikzpicture}\\ |
|
467 \begin{tikzpicture} |
|
468 \begin{axis}[ |
|
469 xlabel={$n$}, |
|
470 x label style={at={(1.05,-0.05)}}, |
|
471 ylabel={time in secs}, |
|
472 enlargelimits=false, |
|
473 xtick={0,5,...,30}, |
|
474 xmax=33, |
|
475 ymax=35, |
|
476 ytick={0,5,...,30}, |
|
477 scaled ticks=false, |
|
478 axis lines=left, |
|
479 width=5cm, |
|
480 height=4cm, |
|
481 legend entries={Swift}, |
|
482 legend pos=north west, |
|
483 legend cell align=left] |
|
484 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; |
|
485 \end{axis} |
|
486 \end{tikzpicture} |
|
487 & |
|
488 \begin{tikzpicture} |
|
489 \begin{axis}[ |
|
490 xlabel={$n$}, |
|
491 x label style={at={(1.05,-0.05)}}, |
|
492 %ylabel={time in secs}, |
|
493 enlargelimits=true, |
|
494 %xtick={0,5000,...,40000}, |
|
495 %xmax=40000, |
|
496 %ymax=35, |
|
497 restrict x to domain*=0:40000, |
|
498 restrict y to domain*=0:35, |
|
499 %ytick={0,5,...,30}, |
|
500 %scaled ticks=false, |
|
501 axis lines=left, |
|
502 width=5cm, |
|
503 height=4cm, |
|
504 legend entries={Java9+}, |
|
505 legend pos=north west, |
|
506 legend cell align=left] |
|
507 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
|
508 \end{axis} |
|
509 \end{tikzpicture}\\ |
|
510 \multicolumn{2}{c}{Graphs} |
|
511 \end{tabular} |
|
512 \end{center} |
|
513 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
|
514 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
|
515 The reason for their superlinear behaviour is that they do a depth-first-search |
|
516 using NFAs. |
|
517 If the string does not match, the regular expression matching |
|
518 engine starts to explore all possibilities. |
|
519 }\label{fig:aStarStarb} |
|
520 \end{figure}\afterpage{\clearpage} |
|
521 |
|
522 A more recent example is a global outage of all Cloudflare servers on 2 July |
|
523 2019. A poorly written regular expression exhibited catastrophic backtracking |
|
524 and exhausted CPUs that serve HTTP traffic. Although the outage |
|
525 had several causes, at the heart was a regular expression that |
|
526 was used to monitor network |
|
527 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
|
528 These problems with regular expressions |
|
529 are not isolated events that happen |
|
530 very rarely, |
|
531 %but actually widespread. |
|
532 %They occur so often that they have a |
|
533 but they occur actually often enough that they have a |
|
534 name: Regular-Expression-Denial-Of-Service (ReDoS) |
|
535 attacks. |
|
536 Davis et al. \cite{Davis18} detected more |
|
537 than 1000 evil regular expressions |
|
538 in Node.js, Python core libraries, npm and pypi. |
|
539 They therefore concluded that evil regular expressions |
|
540 are a real problem rather than just "a parlour trick". |
|
541 |
|
542 The work in this thesis aims to address this issue |
|
543 with the help of formal proofs. |
|
544 We describe a lexing algorithm based |
|
545 on Brzozowski derivatives with verified correctness |
|
546 and a finiteness property for the size of derivatives |
|
547 (which are all done in Isabelle/HOL). |
|
548 Such properties %guarantee the absence of |
|
549 are an important step in preventing |
|
550 catastrophic backtracking once and for all. |
|
551 We will give more details in the next sections |
|
552 on (i) why the slow cases in graph \ref{fig:aStarStarb} |
|
553 can occur in traditional regular expression engines |
|
554 and (ii) why we choose our |
|
555 approach based on Brzozowski derivatives and formal proofs. |
|
556 |
|
557 |
|
558 \section{Preliminaries}%Regex, and the Problems with Regex Matchers} |
|
559 Regular expressions and regular expression matchers |
|
560 have clearly been studied for many, many years. |
|
561 Theoretical results in automata theory state |
|
562 that basic regular expression matching should be linear |
|
563 w.r.t the input. |
|
564 This assumes that the regular expression |
|
565 $r$ was pre-processed and turned into a |
|
566 deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}. |
|
567 By basic we mean textbook definitions such as the one |
|
568 below, involving only regular expressions for characters, alternatives, |
|
569 sequences, and Kleene stars: |
|
570 \[ |
|
571 r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
|
572 \] |
|
573 Modern regular expression matchers used by programmers, |
|
574 however, |
|
575 support much richer constructs, such as bounded repetitions, |
|
576 negations, |
|
577 and back-references. |
|
578 To differentiate, we use the word \emph{regex} to refer |
|
579 to those expressions with richer constructs while reserving the |
|
580 term \emph{regular expression} |
|
581 for the more traditional meaning in formal languages theory. |
|
582 We follow this convention |
|
583 in this thesis. |
|
584 In the future, we aim to support all the popular features of regexes, |
|
585 but for this work we mainly look at basic regular expressions |
|
586 and bounded repetitions. |
|
587 |
|
588 |
|
589 |
|
590 %Most modern regex libraries |
|
591 %the so-called PCRE standard (Peral Compatible Regular Expressions) |
|
592 %has the back-references |
|
593 Regexes come with a number of constructs |
|
594 that make it more convenient for |
|
595 programmers to write regular expressions. |
|
596 Depending on the types of constructs |
|
597 the task of matching and lexing with them |
|
598 will have different levels of complexity. |
|
599 Some of those constructs are syntactic sugars that are |
|
600 simply short hand notations |
|
601 that save the programmers a few keystrokes. |
|
602 These will not cause problems for regex libraries. |
|
603 For example the |
|
604 non-binary alternative involving three or more choices just means: |
|
605 \[ |
|
606 (a | b | c) \stackrel{means}{=} ((a + b)+ c) |
|
607 \] |
|
608 Similarly, the range operator |
|
609 %used to express the alternative |
|
610 %of all characters between its operands, |
|
611 is just a concise way |
|
612 of expressing an alternative of consecutive characters: |
|
613 \[ |
|
614 [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) |
|
615 \] |
|
616 for an alternative. The |
|
617 wildcard character '$.$' is used to refer to any single character, |
|
618 \[ |
|
619 . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] |
|
620 \] |
|
621 except the newline. |
|
622 |
|
623 \subsection{Bounded Repetitions} |
|
624 More interesting are bounded repetitions, which can |
|
625 make the regular expressions much |
|
626 more compact. |
|
627 Normally there are four kinds of bounded repetitions: |
|
628 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$ |
|
629 (where $n$ and $m$ are constant natural numbers). |
|
630 Like the star regular expressions, the set of strings or language |
|
631 a bounded regular expression can match |
|
632 is defined using the power operation on sets: |
|
633 \begin{center} |
|
634 \begin{tabular}{lcl} |
|
635 $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\ |
|
636 $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\ |
|
637 $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ |
|
638 $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ |
|
639 \end{tabular} |
|
640 \end{center} |
|
641 The attraction of bounded repetitions is that they can be |
|
642 used to avoid a size blow up: for example $r^{\{n\}}$ |
|
643 is a shorthand for |
|
644 the much longer regular expression: |
|
645 \[ |
|
646 \underbrace{r\ldots r}_\text{n copies of r}. |
|
647 \] |
|
648 %Therefore, a naive algorithm that simply unfolds |
|
649 %them into their desugared forms |
|
650 %will suffer from at least an exponential runtime increase. |
|
651 |
|
652 |
|
653 The problem with matching |
|
654 such bounded repetitions |
|
655 is that tools based on the classic notion of |
|
656 automata need to expand $r^{\{n\}}$ into $n$ connected |
|
657 copies of the automaton for $r$. This leads to very inefficient matching |
|
658 algorithms or algorithms that consume large amounts of memory. |
|
659 Implementations using $\DFA$s will |
|
660 in such situations |
|
661 either become excruciatingly slow |
|
662 (for example Verbatim++ \cite{Verbatimpp}) or run |
|
663 out of memory (for example $\mathit{LEX}$ and |
|
664 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators |
|
665 in C and JAVA that generate $\mathit{DFA}$-based |
|
666 lexers. The user provides a set of regular expressions |
|
667 and configurations, and then |
|
668 gets an output program encoding a minimized $\mathit{DFA}$ |
|
669 that can be compiled and run. |
|
670 When given the above countdown regular expression, |
|
671 a small $n$ (say 20) would result in a program representing a |
|
672 DFA |
|
673 with millions of states.}) for large counters. |
|
674 A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$ |
|
675 where the minimal DFA requires at least $2^{n+1}$ states. |
|
676 For example, when $n$ is equal to 2, |
|
677 the corresponding $\mathit{NFA}$ looks like: |
|
678 \vspace{6mm} |
|
679 \begin{center} |
|
680 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
681 \node[state,initial] (q_0) {$q_0$}; |
|
682 \node[state, red] (q_1) [right=of q_0] {$q_1$}; |
|
683 \node[state, red] (q_2) [right=of q_1] {$q_2$}; |
|
684 \node[state, accepting, red](q_3) [right=of q_2] {$q_3$}; |
|
685 \path[->] |
|
686 (q_0) edge node {a} (q_1) |
|
687 edge [loop below] node {a,b} () |
|
688 (q_1) edge node {a,b} (q_2) |
|
689 (q_2) edge node {a,b} (q_3); |
|
690 \end{tikzpicture} |
|
691 \end{center} |
|
692 and when turned into a DFA by the subset construction |
|
693 requires at least $2^3$ states.\footnote{The |
|
694 red states are "countdown states" which count down |
|
695 the number of characters needed in addition to the current |
|
696 string to make a successful match. |
|
697 For example, state $q_1$ indicates a match that has |
|
698 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
|
699 and just consumed the "delimiter" $a$ in the middle, and |
|
700 needs to match 2 more iterations of $(a|b)$ to complete. |
|
701 State $q_2$ on the other hand, can be viewed as a state |
|
702 after $q_1$ has consumed 1 character, and just waits |
|
703 for 1 more character to complete. |
|
704 The state $q_3$ is the last (accepting) state, requiring 0 |
|
705 more characters. |
|
706 Depending on the suffix of the |
|
707 input string up to the current read location, |
|
708 the states $q_1$ and $q_2$, $q_3$ |
|
709 may or may |
|
710 not be active. |
|
711 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
|
712 contain at least $2^3$ non-equivalent states that cannot be merged, |
|
713 because the subset construction during determinisation will generate |
|
714 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. |
|
715 Generalizing this to regular expressions with larger |
|
716 bounded repetitions number, we have that |
|
717 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s |
|
718 would require at least $2^{n+1}$ states, if $r$ itself contains |
|
719 more than 1 string. |
|
720 This is to represent all different |
|
721 scenarios in which "countdown" states are active.} |
|
722 |
|
723 |
|
724 Bounded repetitions are important because they |
|
725 tend to occur frequently in practical use, |
|
726 for example in the regex library RegExLib, in |
|
727 the rules library of Snort \cite{Snort1999}\footnote{ |
|
728 Snort is a network intrusion detection (NID) tool |
|
729 for monitoring network traffic. |
|
730 The network security community curates a list |
|
731 of malicious patterns written as regexes, |
|
732 which is used by Snort's detection engine |
|
733 to match against network traffic for any hostile |
|
734 activities such as buffer overflow attacks.}, |
|
735 as well as in XML Schema definitions (XSDs). |
|
736 According to Bj\"{o}rklund et al \cite{xml2015}, |
|
737 more than half of the |
|
738 XSDs they found on the Maven.org central repository |
|
739 have bounded regular expressions in them. |
|
740 Often the counters are quite large, with the largest being |
|
741 close to ten million. |
|
742 A smaller sample XSD they gave |
|
743 is: |
|
744 \lstset{ |
|
745 basicstyle=\fontsize{8.5}{9}\ttfamily, |
|
746 language=XML, |
|
747 morekeywords={encoding, |
|
748 xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute} |
|
749 } |
|
750 \begin{lstlisting} |
|
751 <sequence minOccurs="0" maxOccurs="65535"> |
|
752 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
|
753 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
|
754 </sequence> |
|
755 \end{lstlisting} |
|
756 This can be seen as the regex |
|
757 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
|
758 regular expressions |
|
759 satisfying certain constraints (such as |
|
760 satisfying the floating point number format). |
|
761 It is therefore quite unsatisfying that |
|
762 some regular expressions matching libraries |
|
763 impose adhoc limits |
|
764 for bounded regular expressions: |
|
765 For example, in the regular expression matching library in the Go |
|
766 language the regular expression $a^{1001}$ is not permitted, because no counter |
|
767 can be above 1000, and in the built-in Rust regular expression library |
|
768 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
|
769 for being too big. |
|
770 As Becchi and Crawley \cite{Becchi08} have pointed out, |
|
771 the reason for these restrictions |
|
772 is that they simulate a non-deterministic finite |
|
773 automata (NFA) with a breadth-first search. |
|
774 This way the number of active states could |
|
775 be equal to the counter number. |
|
776 When the counters are large, |
|
777 the memory requirement could become |
|
778 infeasible, and a regex engine |
|
779 like in Go will reject this pattern straight away. |
|
780 \begin{figure}[H] |
|
781 \begin{center} |
|
782 \begin{tikzpicture} [node distance = 2cm, on grid, auto] |
|
783 |
|
784 \node (q0) [state, initial] {$0$}; |
|
785 \node (q1) [state, right = of q0] {$1$}; |
|
786 %\node (q2) [state, right = of q1] {$2$}; |
|
787 \node (qdots) [right = of q1] {$\ldots$}; |
|
788 \node (qn) [state, right = of qdots] {$n$}; |
|
789 \node (qn1) [state, right = of qn] {$n+1$}; |
|
790 \node (qn2) [state, right = of qn1] {$n+2$}; |
|
791 \node (qn3) [state, accepting, right = of qn2] {$n+3$}; |
|
792 |
|
793 \path [-stealth, thick] |
|
794 (q0) edge [loop above] node {a} () |
|
795 (q0) edge node {a} (q1) |
|
796 %(q1) edge node {.} (q2) |
|
797 (q1) edge node {.} (qdots) |
|
798 (qdots) edge node {.} (qn) |
|
799 (qn) edge node {.} (qn1) |
|
800 (qn1) edge node {b} (qn2) |
|
801 (qn2) edge node {$c$} (qn3); |
|
802 \end{tikzpicture} |
|
803 %\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
804 % \node[state,initial] (q_0) {$0$}; |
|
805 % \node[state, ] (q_1) [right=of q_0] {$1$}; |
|
806 % \node[state, ] (q_2) [right=of q_1] {$2$}; |
|
807 % \node[state, |
|
808 % \node[state, accepting, ](q_3) [right=of q_2] {$3$}; |
|
809 % \path[->] |
|
810 % (q_0) edge node {a} (q_1) |
|
811 % edge [loop below] node {a,b} () |
|
812 % (q_1) edge node {a,b} (q_2) |
|
813 % (q_2) edge node {a,b} (q_3); |
|
814 %\end{tikzpicture} |
|
815 \end{center} |
|
816 \caption{The example given by Becchi and Crawley |
|
817 that NFA simulation can consume large |
|
818 amounts of memory: $.^*a.^{\{n\}}bc$ matching |
|
819 strings of the form $aaa\ldots aaaabc$. |
|
820 When traversing in a breadth-first manner, |
|
821 all states from 0 till $n+1$ will become active.} |
|
822 \end{figure} |
|
823 %Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this |
|
824 %type of $\mathit{NFA}$ simulation and guarantees a linear runtime |
|
825 %in terms of input string length. |
|
826 %TODO:try out these lexers |
|
827 These problems can of course be solved in matching algorithms where |
|
828 automata go beyond the classic notion and for instance include explicit |
|
829 counters \cite{Turo_ov__2020}. |
|
830 These solutions can be quite efficient, |
|
831 with the ability to process |
|
832 gigabits of strings input per second |
|
833 even with large counters \cite{Becchi08}. |
|
834 These practical solutions do not come with |
|
835 formal guarantees, and as pointed out by |
|
836 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone. |
|
837 %But formal reasoning about these automata especially in Isabelle |
|
838 %can be challenging |
|
839 %and un-intuitive. |
|
840 %Therefore, we take correctness and runtime claims made about these solutions |
|
841 %with a grain of salt. |
|
842 |
|
843 In the work reported in \cite{FoSSaCS2023} and here, |
|
844 we add better support using derivatives |
|
845 for bounded regular expression $r^{\{n\}}$. |
|
846 Our results |
|
847 extend straightforwardly to |
|
848 repetitions with intervals such as |
|
849 $r^{\{n\ldots m\}}$. |
|
850 The merit of Brzozowski derivatives (more on this later) |
|
851 on this problem is that |
|
852 it can be naturally extended to support bounded repetitions. |
|
853 Moreover these extensions are still made up of only small |
|
854 inductive datatypes and recursive functions, |
|
855 making it handy to deal with them in theorem provers. |
|
856 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be |
|
857 %straightforwardly extended to deal with bounded regular expressions |
|
858 %and moreover the resulting code still consists of only simple |
|
859 %recursive functions and inductive datatypes. |
|
860 Finally, bounded regular expressions do not destroy our finite |
|
861 boundedness property, which we shall prove later on. |
|
862 |
|
863 |
|
864 |
|
865 |
|
866 |
|
867 \subsection{Back-References} |
|
868 The other way to simulate an $\mathit{NFA}$ for matching is choosing |
|
869 a single transition each time, keeping all the other options in |
|
870 a queue or stack, and backtracking if that choice eventually |
|
871 fails. |
|
872 This method, often called a "depth-first-search", |
|
873 is efficient in many cases, but could end up |
|
874 with exponential run time. |
|
875 The backtracking method is employed in regex libraries |
|
876 that support \emph{back-references}, for example |
|
877 in Java and Python. |
|
878 %\section{Back-references and The Terminology Regex} |
|
879 |
|
880 %When one constructs an $\NFA$ out of a regular expression |
|
881 %there is often very little to be done in the first phase, one simply |
|
882 %construct the $\NFA$ states based on the structure of the input regular expression. |
|
883 |
|
884 %In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways: |
|
885 %one by keeping track of all active states after consuming |
|
886 %a character, and update that set of states iteratively. |
|
887 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$ |
|
888 %for a path terminating |
|
889 %at an accepting state. |
|
890 |
|
891 |
|
892 |
|
893 Consider the following regular expression where the sequence |
|
894 operator is omitted for brevity: |
|
895 \begin{center} |
|
896 $r_1r_2r_3r_4$ |
|
897 \end{center} |
|
898 In this example, |
|
899 one could label sub-expressions of interest |
|
900 by parenthesizing them and giving |
|
901 them a number in the order in which their opening parentheses appear. |
|
902 One possible way of parenthesizing and labelling is: |
|
903 \begin{center} |
|
904 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
|
905 \end{center} |
|
906 The sub-expressions |
|
907 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
|
908 by 1 to 4, and can be ``referred back'' by their respective numbers. |
|
909 %These sub-expressions are called "capturing groups". |
|
910 To do so, one uses the syntax $\backslash i$ |
|
911 to denote that we want the sub-string |
|
912 of the input matched by the i-th |
|
913 sub-expression to appear again, |
|
914 exactly the same as it first appeared: |
|
915 \begin{center} |
|
916 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
|
917 \underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$ |
|
918 \end{center} |
|
919 Once the sub-string $s_i$ for the sub-expression $r_i$ |
|
920 has been fixed, there is no variability on what the back-reference |
|
921 label $\backslash i$ can be---it is tied to $s_i$. |
|
922 The overall string will look like $\ldots s_i \ldots s_i \ldots $ |
|
923 %The backslash and number $i$ are the |
|
924 %so-called "back-references". |
|
925 %Let $e$ be an expression made of regular expressions |
|
926 %and back-references. $e$ contains the expression $e_i$ |
|
927 %as its $i$-th capturing group. |
|
928 %The semantics of back-reference can be recursively |
|
929 %written as: |
|
930 %\begin{center} |
|
931 % \begin{tabular}{c} |
|
932 % $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
|
933 % $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
|
934 % \end{tabular} |
|
935 %\end{center} |
|
936 A concrete example |
|
937 for back-references is |
|
938 \begin{center} |
|
939 $(.^*)\backslash 1$, |
|
940 \end{center} |
|
941 which matches |
|
942 strings that can be split into two identical halves, |
|
943 for example $\mathit{foofoo}$, $\mathit{ww}$ and so on. |
|
944 Note that this is different from |
|
945 repeating the sub-expression verbatim like |
|
946 \begin{center} |
|
947 $(.^*)(.^*)$, |
|
948 \end{center} |
|
949 which does not impose any restrictions on what strings the second |
|
950 sub-expression $.^*$ |
|
951 might match. |
|
952 Another example for back-references is |
|
953 \begin{center} |
|
954 $(.)(.)\backslash 2\backslash 1$ |
|
955 \end{center} |
|
956 which matches four-character palindromes |
|
957 like $abba$, $x??x$ and so on. |
|
958 |
|
959 Back-references are a regex construct |
|
960 that programmers find quite useful. |
|
961 According to Becchi and Crawley \cite{Becchi08}, |
|
962 6\% of Snort rules (up until 2008) use them. |
|
963 The most common use of back-references |
|
964 is to express well-formed html files, |
|
965 where back-references are convenient for matching |
|
966 opening and closing tags like |
|
967 \begin{center} |
|
968 $\langle html \rangle \ldots \langle / html \rangle$ |
|
969 \end{center} |
|
970 A regex describing such a format |
|
971 is |
|
972 \begin{center} |
|
973 $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ |
|
974 \end{center} |
|
975 Despite being useful, the expressive power of regexes |
|
976 go beyond regular languages |
|
977 once back-references are included. |
|
978 In fact, they allow the regex construct to express |
|
979 languages that cannot be contained in context-free |
|
980 languages either. |
|
981 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
|
982 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
|
983 which cannot be expressed by |
|
984 context-free grammars \cite{campeanu2003formal}. |
|
985 Such a language is contained in the context-sensitive hierarchy |
|
986 of formal languages. |
|
987 Also solving the matching problem involving back-references |
|
988 is known to be NP-complete \parencite{alfred2014algorithms}. |
|
989 Regex libraries supporting back-references such as |
|
990 PCRE \cite{pcre} therefore have to |
|
991 revert to a depth-first search algorithm including backtracking. |
|
992 What is unexpected is that even in the cases |
|
993 not involving back-references, there is still |
|
994 a (non-negligible) chance they might backtrack super-linearly, |
|
995 as shown in the graphs in figure \ref{fig:aStarStarb}. |
|
996 |
|
997 Summing up, we can categorise existing |
|
998 practical regex libraries into two kinds: |
|
999 (i) The ones with linear |
|
1000 time guarantees like Go and Rust. The downside with them is that |
|
1001 they impose restrictions |
|
1002 on the regular expressions (not allowing back-references, |
|
1003 bounded repetitions cannot exceed an ad hoc limit etc.). |
|
1004 And (ii) those |
|
1005 that allow large bounded regular expressions and back-references |
|
1006 at the expense of using backtracking algorithms. |
|
1007 They can potentially ``grind to a halt'' |
|
1008 on some very simple cases, resulting |
|
1009 ReDoS attacks if exposed to the internet. |
|
1010 |
|
1011 The problems with both approaches are the motivation for us |
|
1012 to look again at the regular expression matching problem. |
|
1013 Another motivation is that regular expression matching algorithms |
|
1014 that follow the POSIX standard often contain errors and bugs |
|
1015 as we shall explain next. |
|
1016 |
|
1017 %We would like to have regex engines that can |
|
1018 %deal with the regular part (e.g. |
|
1019 %bounded repetitions) of regexes more |
|
1020 %efficiently. |
|
1021 %Also we want to make sure that they do it correctly. |
|
1022 %It turns out that such aim is not so easy to achieve. |
|
1023 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions |
|
1024 % For example, the Rust regex engine claims to be linear, |
|
1025 % but does not support lookarounds and back-references. |
|
1026 % The GoLang regex library does not support over 1000 repetitions. |
|
1027 % Java and Python both support back-references, but shows |
|
1028 %catastrophic backtracking behaviours on inputs without back-references( |
|
1029 %when the language is still regular). |
|
1030 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac |
|
1031 %TODO: verify the fact Rust does not allow 1000+ reps |
|
1032 |
|
1033 |
|
1034 |
|
1035 |
|
1036 %The time cost of regex matching algorithms in general |
|
1037 %involve two different phases, and different things can go differently wrong on |
|
1038 %these phases. |
|
1039 %$\DFA$s usually have problems in the first (construction) phase |
|
1040 %, whereas $\NFA$s usually run into trouble |
|
1041 %on the second phase. |
|
1042 |
|
1043 |
|
1044 \section{Error-prone POSIX Implementations} |
|
1045 Very often there are multiple ways of matching a string |
|
1046 with a regular expression. |
|
1047 In such cases the regular expressions matcher needs to |
|
1048 disambiguate. |
|
1049 The more widely used strategy is called POSIX, |
|
1050 which roughly speaking always chooses the longest initial match. |
|
1051 The POSIX strategy is widely adopted in many regular expression matchers |
|
1052 because it is a natural strategy for lexers. |
|
1053 However, many implementations (including the C libraries |
|
1054 used by Linux and OS X distributions) contain bugs |
|
1055 or do not meet the specification they claim to adhere to. |
|
1056 Kuklewicz maintains a unit test repository which lists some |
|
1057 problems with existing regular expression engines \cite{KuklewiczHaskell}. |
|
1058 In some cases, they either fail to generate a |
|
1059 result when there exists a match, |
|
1060 or give results that are inconsistent with the POSIX standard. |
|
1061 A concrete example is the regex: |
|
1062 \begin{center} |
|
1063 $(aba + ab + a)^* \text{and the string} \; ababa$ |
|
1064 \end{center} |
|
1065 The correct POSIX match for the above |
|
1066 involves the entire string $ababa$, |
|
1067 split into two Kleene star iterations, namely $[ab], [aba]$ at positions |
|
1068 $[0, 2), [2, 5)$ |
|
1069 respectively. |
|
1070 But feeding this example to the different engines |
|
1071 listed at regex101 \footnote{ |
|
1072 regex101 is an online regular expression matcher which |
|
1073 provides API for trying out regular expression |
|
1074 engines of multiple popular programming languages like |
|
1075 Java, Python, Go, etc.} \parencite{regex101}. |
|
1076 yields |
|
1077 only two incomplete matches: $[aba]$ at $[0, 3)$ |
|
1078 and $a$ at $[4, 5)$. |
|
1079 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
|
1080 commented that most regex libraries are not |
|
1081 correctly implementing the central POSIX |
|
1082 rule, called the maximum munch rule. |
|
1083 Grathwohl \parencite{grathwohl2014crash} wrote: |
|
1084 \begin{quote}\it |
|
1085 ``The POSIX strategy is more complicated than the |
|
1086 greedy because of the dependence on information about |
|
1087 the length of matched strings in the various subexpressions.'' |
|
1088 \end{quote} |
|
1089 %\noindent |
|
1090 People have recognised that the implementation complexity of POSIX rules also come from |
|
1091 the specification being not very precise. |
|
1092 The developers of the regexp package of Go |
|
1093 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}} |
|
1094 commented that |
|
1095 \begin{quote}\it |
|
1096 `` |
|
1097 The POSIX rule is computationally prohibitive |
|
1098 and not even well-defined. |
|
1099 `` |
|
1100 \end{quote} |
|
1101 There are many informal summaries of this disambiguation |
|
1102 strategy, which are often quite long and delicate. |
|
1103 For example Kuklewicz \cite{KuklewiczHaskell} |
|
1104 described the POSIX rule as (section 1, last paragraph): |
|
1105 \begin{quote} |
|
1106 \begin{itemize} |
|
1107 \item |
|
1108 regular expressions (REs) take the leftmost starting match, and the longest match starting there |
|
1109 earlier subpatterns have leftmost-longest priority over later subpatterns\\ |
|
1110 \item |
|
1111 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\ |
|
1112 \item |
|
1113 REs have right associative concatenation which can be changed with parenthesis\\ |
|
1114 \item |
|
1115 parenthesized subexpressions return the match from their last usage\\ |
|
1116 \item |
|
1117 text of component subexpressions must be contained in the text of the |
|
1118 higher-level subexpressions\\ |
|
1119 \item |
|
1120 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
|
1121 \item |
|
1122 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\ |
|
1123 \end{itemize} |
|
1124 \end{quote} |
|
1125 %The text above |
|
1126 %is trying to capture something very precise, |
|
1127 %and is crying out for formalising. |
|
1128 Ribeiro and Du Bois \cite{RibeiroAgda2017} have |
|
1129 formalised the notion of bit-coded regular expressions |
|
1130 and proved their relations with simple regular expressions in |
|
1131 the dependently-typed proof assistant Agda. |
|
1132 They also proved the soundness and completeness of a matching algorithm |
|
1133 based on the bit-coded regular expressions. |
|
1134 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
|
1135 are the first to |
|
1136 give a quite simple formalised POSIX |
|
1137 specification in Isabelle/HOL, and also prove |
|
1138 that their specification coincides with an earlier (unformalised) |
|
1139 POSIX specification given by Okui and Suzuki \cite{Okui10}. |
|
1140 They then formally proved the correctness of |
|
1141 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
|
1142 with regards to that specification. |
|
1143 They also found that the informal POSIX |
|
1144 specification by Sulzmann and Lu needs to be substantially |
|
1145 revised in order for the correctness proof to go through. |
|
1146 Their original specification and proof were unfixable |
|
1147 according to Ausaf et al. |
|
1148 |
|
1149 |
|
1150 In the next section, we will briefly |
|
1151 introduce Brzozowski derivatives and Sulzmann |
|
1152 and Lu's algorithm, which the main point of this thesis builds on. |
|
1153 %We give a taste of what they |
|
1154 %are like and why they are suitable for regular expression |
|
1155 %matching and lexing. |
|
1156 \section{Formal Specification of POSIX Matching |
|
1157 and Brzozowski Derivatives} |
|
1158 %Now we start with the central topic of the thesis: Brzozowski derivatives. |
|
1159 Brzozowski \cite{Brzozowski1964} first introduced the |
|
1160 concept of a \emph{derivative} of regular expression in 1964. |
|
1161 The derivative of a regular expression $r$ |
|
1162 with respect to a character $c$, is written as $r \backslash c$. |
|
1163 This operation tells us what $r$ transforms into |
|
1164 if we ``chop'' off a particular character $c$ |
|
1165 from all strings in the language of $r$ (defined |
|
1166 later as $L \; r$). |
|
1167 %To give a flavour of Brzozowski derivatives, we present |
|
1168 %two straightforward clauses from it: |
|
1169 %\begin{center} |
|
1170 % \begin{tabular}{lcl} |
|
1171 % $d \backslash c$ & $\dn$ & |
|
1172 % $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
1173 %$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
1174 % \end{tabular} |
|
1175 %\end{center} |
|
1176 %\noindent |
|
1177 %The first clause says that for the regular expression |
|
1178 %denoting a singleton set consisting of a single-character string $\{ d \}$, |
|
1179 %we check the derivative character $c$ against $d$, |
|
1180 %returning a set containing only the empty string $\{ [] \}$ |
|
1181 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise. |
|
1182 %The second clause states that to obtain the regular expression |
|
1183 %representing all strings' head character $c$ being chopped off |
|
1184 %from $r_1 + r_2$, one simply needs to recursively take derivative |
|
1185 %of $r_1$ and $r_2$ and then put them together. |
|
1186 Derivatives have the property |
|
1187 that $s \in L \; (r\backslash c)$ if and only if |
|
1188 $c::s \in L \; r$ where $::$ stands for list prepending. |
|
1189 %This property can be used on regular expressions |
|
1190 %matching and lexing--to test whether a string $s$ is in $L \; r$, |
|
1191 %one simply takes derivatives of $r$ successively with |
|
1192 %respect to the characters (in the correct order) in $s$, |
|
1193 %and then test whether the empty string is in the last regular expression. |
|
1194 With this property, derivatives can give a simple solution |
|
1195 to the problem of matching a string $s$ with a regular |
|
1196 expression $r$: if the derivative of $r$ w.r.t.\ (in |
|
1197 succession) all the characters of the string matches the empty string, |
|
1198 then $r$ matches $s$ (and {\em vice versa}). |
|
1199 %This makes formally reasoning about these properties such |
|
1200 %as correctness and complexity smooth and intuitive. |
|
1201 There are several mechanised proofs of this property in various theorem |
|
1202 provers, |
|
1203 for example one by Owens and Slind \cite{Owens2008} in HOL4, |
|
1204 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and |
|
1205 yet another in Coq by Coquand and Siles \cite{Coquand2012}. |
|
1206 |
|
1207 In addition, one can extend derivatives to bounded repetitions |
|
1208 relatively straightforwardly. For example, the derivative for |
|
1209 this can be defined as: |
|
1210 \begin{center} |
|
1211 \begin{tabular}{lcl} |
|
1212 $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot |
|
1213 r^{\{n-1\}} (\text{when} n > 0)$\\ |
|
1214 \end{tabular} |
|
1215 \end{center} |
|
1216 \noindent |
|
1217 Experimental results suggest that unlike DFA-based solutions |
|
1218 for bounded regular expressions, |
|
1219 derivatives can cope |
|
1220 large counters |
|
1221 quite well. |
|
1222 |
|
1223 There have also been |
|
1224 extensions of derivatives to other regex constructs. |
|
1225 For example, Owens et al include the derivatives |
|
1226 for the \emph{NOT} regular expression, which is |
|
1227 able to concisely express C-style comments of the form |
|
1228 $/* \ldots */$ (see \cite{Owens2008}). |
|
1229 Another extension for derivatives is |
|
1230 regular expressions with look-aheads, done |
|
1231 by Miyazaki and Minamide |
|
1232 \cite{Takayuki2019}. |
|
1233 %We therefore use Brzozowski derivatives on regular expressions |
|
1234 %lexing |
|
1235 |
|
1236 |
|
1237 |
|
1238 Given the above definitions and properties of |
|
1239 Brzozowski derivatives, one quickly realises their potential |
|
1240 in generating a formally verified algorithm for lexing: the clauses and property |
|
1241 can be easily expressed in a functional programming language |
|
1242 or converted to theorem prover |
|
1243 code, with great ease. |
|
1244 Perhaps this is the reason why derivatives have sparked quite a bit of interest |
|
1245 in the functional programming and theorem prover communities in the last |
|
1246 fifteen or so years ( |
|
1247 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18}, |
|
1248 \cite{Chen12} and \cite{Coquand2012} |
|
1249 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008} |
|
1250 after they were first published by Brzozowski. |
|
1251 |
|
1252 |
|
1253 However, there are two difficulties with derivative-based matchers: |
|
1254 First, Brzozowski's original matcher only generates a yes/no answer |
|
1255 for whether a regular expression matches a string or not. This is too |
|
1256 little information in the context of lexing where separate tokens must |
|
1257 be identified and also classified (for example as keywords |
|
1258 or identifiers). |
|
1259 Second, derivative-based matchers need to be more efficient in terms |
|
1260 of the sizes of derivatives. |
|
1261 Elegant and beautiful |
|
1262 as many implementations are, |
|
1263 they can be excruciatingly slow. |
|
1264 For example, Sulzmann and Lu |
|
1265 claim a linear running time of their proposed algorithm, |
|
1266 but that was falsified by our experiments. The running time |
|
1267 is actually $\Omega(2^n)$ in the worst case. |
|
1268 A similar claim about a theoretical runtime of $O(n^2)$ |
|
1269 is made for the Verbatim \cite{Verbatim} |
|
1270 %TODO: give references |
|
1271 lexer, which calculates POSIX matches and is based on derivatives. |
|
1272 They formalized the correctness of the lexer, but not their complexity result. |
|
1273 In the performance evaluation section, they analyzed the run time |
|
1274 of matching $a$ with the string |
|
1275 \begin{center} |
|
1276 $\underbrace{a \ldots a}_{\text{n a's}}$. |
|
1277 \end{center} |
|
1278 \noindent |
|
1279 They concluded that the algorithm is quadratic in terms of |
|
1280 the length of the input string. |
|
1281 When we tried out their extracted OCaml code with the example $(a+aa)^*$, |
|
1282 the time it took to match a string of 40 $a$'s was approximately 5 minutes. |
|
1283 |
|
1284 |
|
1285 \subsection{Sulzmann and Lu's Algorithm} |
|
1286 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first |
|
1287 problem with the yes/no answer |
|
1288 by cleverly extending Brzozowski's matching |
|
1289 algorithm. Their extended version generates additional information on |
|
1290 \emph{how} a regular expression matches a string following the POSIX |
|
1291 rules for regular expression matching. They achieve this by adding a |
|
1292 second ``phase'' to Brzozowski's algorithm involving an injection |
|
1293 function. This injection function in a sense undoes the ``damage'' |
|
1294 of the derivatives chopping off characters. |
|
1295 In earlier work, Ausaf et al provided the formal |
|
1296 specification of what POSIX matching means and proved in Isabelle/HOL |
|
1297 the correctness |
|
1298 of this extended algorithm accordingly |
|
1299 \cite{AusafDyckhoffUrban2016}. |
|
1300 |
|
1301 The version of the algorithm proven correct |
|
1302 suffers however heavily from a |
|
1303 second difficulty, where derivatives can |
|
1304 grow to arbitrarily big sizes. |
|
1305 For example if we start with the |
|
1306 regular expression $(a+aa)^*$ and take |
|
1307 successive derivatives according to the character $a$, we end up with |
|
1308 a sequence of ever-growing derivatives like |
|
1309 |
|
1310 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} |
|
1311 \begin{center} |
|
1312 \begin{tabular}{rll} |
|
1313 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
1314 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
1315 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ |
|
1316 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
1317 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
|
1318 \end{tabular} |
|
1319 \end{center} |
|
1320 |
|
1321 \noindent where after around 35 steps we usually run out of memory on a |
|
1322 typical computer. Clearly, the |
|
1323 notation involving $\ZERO$s and $\ONE$s already suggests |
|
1324 simplification rules that can be applied to regular regular |
|
1325 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
|
1326 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
|
1327 r$. While such simple-minded simplifications have been proved in |
|
1328 the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's |
|
1329 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
|
1330 \emph{not} help with limiting the growth of the derivatives shown |
|
1331 above: the growth is slowed, but the derivatives can still grow rather |
|
1332 quickly beyond any finite bound. |
|
1333 |
|
1334 Therefore we want to look in this thesis at a second |
|
1335 algorithm by Sulzmann and Lu where they |
|
1336 overcame this ``growth problem'' \cite{Sulzmann2014}. |
|
1337 In this version, POSIX values are |
|
1338 represented as bit sequences and such sequences are incrementally generated |
|
1339 when derivatives are calculated. The compact representation |
|
1340 of bit sequences and regular expressions allows them to define a more |
|
1341 ``aggressive'' simplification method that keeps the size of the |
|
1342 derivatives finite no matter what the length of the string is. |
|
1343 They make some informal claims about the correctness and linear behaviour |
|
1344 of this version, but do not provide any supporting proof arguments, not |
|
1345 even ``pencil-and-paper'' arguments. They write about their bit-coded |
|
1346 \emph{incremental parsing method} (that is the algorithm to be formalised |
|
1347 in this dissertation) |
|
1348 |
|
1349 |
|
1350 |
|
1351 \begin{quote}\it |
|
1352 ``Correctness Claim: We further claim that the incremental parsing |
|
1353 method [..] in combination with the simplification steps [..] |
|
1354 yields POSIX parse trees. We have tested this claim |
|
1355 extensively [..] but yet |
|
1356 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} |
|
1357 \end{quote} |
|
1358 Ausaf and Urban made some initial progress towards the |
|
1359 full correctness proof but still had to leave out the optimisation |
|
1360 Sulzmann and Lu proposed. |
|
1361 Ausaf wrote \cite{Ausaf}, |
|
1362 \begin{quote}\it |
|
1363 ``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.'' |
|
1364 \end{quote} |
|
1365 This thesis implements the aggressive simplifications envisioned |
|
1366 by Ausaf and Urban, |
|
1367 together with a formal proof of the correctness of those simplifications. |
|
1368 |
|
1369 |
|
1370 One of the most recent work in the context of lexing |
|
1371 %with this issue |
|
1372 is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}. |
|
1373 This is relevant work for us and we will compare it later with |
|
1374 our derivative-based matcher we are going to present. |
|
1375 There is also some newer work called |
|
1376 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, |
|
1377 but deterministic finite automaton instead. |
|
1378 We will also study this work in a later section. |
|
1379 %An example that gives problem to automaton approaches would be |
|
1380 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$. |
|
1381 %It requires at least $2^{n+1}$ states to represent |
|
1382 %as a DFA. |
|
1383 |
530 |
1384 |
531 |
1385 %---------------------------------------------------------------------------------------- |
532 %---------------------------------------------------------------------------------------- |
1386 \section{Contribution} |
533 \section{Contribution} |
1387 {\color{red} \rule{\linewidth}{0.5mm}} |
534 {\color{red} \rule{\linewidth}{0.5mm}} |