with the help of formal proofs.
We offer a lexing algorithm based
on Brzozowski derivatives with certified correctness (in
Isabelle/HOL)
and a finiteness property.
Such properties guarantee the absence of
catastrophic backtracking in most cases.
We will give more details in the next sections
on (i) why the slow cases in graph~\ref{fig:aStarStarb}
can occur
and (ii) why we choose our
approach (Brzozowski derivatives and formal proofs).

\section{Terminology, and the Problem with Bounded Repetitions}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
Theoretical results in automata theory say
that basic regular expression matching should be linear
with respect to the input, provided that the regular expression
$r$ has been pre-processed and turned into a
deterministic finite automaton (DFA).
By basic we mean textbook definitions such as the one
below, involving only characters, alternatives,
sequences, and Kleene stars:
\[
r ::= \ZERO \mid \ONE \mid c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*
\]
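To make this definition concrete, below is a small Scala sketch (our own
illustration with made-up names, not the verified Isabelle/HOL development
mentioned earlier) of the basic regular expressions as an inductive
datatype, together with a matcher based on the Brzozowski derivatives
that this thesis builds on:
\begin{verbatim}
// basic regular expressions as an inductive datatype
sealed trait Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

object Matcher {
  // does r accept the empty string?
  def nullable(r: Rexp): Boolean = r match {
    case ZERO        => false
    case ONE         => true
    case CHAR(_)     => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_)     => true
  }

  // Brzozowski derivative of r with respect to the character c
  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO        => ZERO
    case ONE         => ZERO
    case CHAR(d)     => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
  }

  // r matches s iff the derivative w.r.t. all of s is nullable
  def matches(r: Rexp, s: String): Boolean =
    nullable(s.foldLeft(r)((acc, c) => der(c, acc)))
}
\end{verbatim}
For instance, matching the expression $(a + b)^*$ against the string
$abba$ with \texttt{matches} yields \texttt{true}. Note that without the
simplifications introduced later, the intermediate derivatives can grow
quickly; taming that growth is precisely where the finiteness property of
this thesis comes in.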
Modern regular expression matchers used by programmers,
however,
support richer constructs such as bounded repetitions
and back-references.
The syntax and expressive power of those
matching engines
make ``regular expressions'' quite different from
their original meaning in formal language
theory.
To differentiate, people tend to use the word \emph{regex} to refer
to those expressions with richer constructs, and regular expressions
for the more traditional meaning.
For example, the PCRE standard (Perl Compatible Regular Expressions)
is such a regex syntax standard.
We follow this convention in this thesis.
We aim to support all the popular features of regexes in the future,
but for this work we mainly look at regular expressions.

\subsection{A Little Introduction to Regexes: Bounded Repetitions
and Back-references}
Regexes come with a lot of constructs
that make it more convenient for
programmers to write regular expressions.
Some of those constructs are merely syntactic sugar,
that is, shorthand notations
that save the programmers a few keystrokes.
Examples are the
non-binary alternative involving three or more choices:
\[
r = (a \,|\, b \,|\, c \,|\, \ldots \,|\, z)^*,
\]
the range operator $-$, which stands for the alternative
of all characters between its operands:
\[
r = [0-9a-zA-Z] \; \text{(all alpha-numeric characters)},
\]
and the
wildcard character $.$, which stands for any character, as in
\[
. = [0-9a-zA-Z+-()*&\ldots].
\]
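Such sugar can be translated away. Below is a small Scala sketch of one
possible desugaring into the basic syntax (the helper names
\texttt{alts}, \texttt{range} and \texttt{wildcard} are our own invention;
the datatype from the earlier sketch is repeated so the snippet is
self-contained):
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

object Sugar {
  // non-binary alternative: (r1 | r2 | ... | rn) as nested binary ALTs
  def alts(rs: List[Rexp]): Rexp = rs.reduceRight(ALT.apply)

  // range operator: [lo-hi] as the alternative of all characters between
  def range(lo: Char, hi: Char): Rexp =
    alts((lo to hi).map(CHAR.apply).toList)

  // wildcard: . as the alternative over a (finite) alphabet
  def wildcard(alphabet: List[Char]): Rexp =
    alts(alphabet.map(CHAR.apply))
}
\end{verbatim}
For instance, \texttt{range('0', '9')} builds the alternative
$0 + 1 + \cdots + 9$, and a three-way alternative $(a \,|\, b \,|\, c)$
becomes \texttt{ALT(CHAR('a'), ALT(CHAR('b'), CHAR('c')))}.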
Some of those constructs do make the expressions much
more compact, but matching time can greatly increase as a result.
For example, $r^{n}$ is exponentially more concise compared with
the expression $\underbrace{r \cdots r}_{\text{$n$ copies of $r$}}$,
and therefore a naive algorithm that simply unfolds
$r^{n}$ into $\underbrace{r \cdots r}_{\text{$n$ copies of $r$}}$
will suffer an exponential increase in runtime.
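To see the gap in size concretely (a back-of-the-envelope calculation of
ours, counting syntax-tree nodes plus the digits of the counter):
\[
\text{size}(r^{\{1000\}}) \;\approx\; \text{size}(r) + 4,
\qquad
\text{size}(\underbrace{r \cdots r}_{\text{$1000$ copies of $r$}})
\;=\; 1000 \cdot \text{size}(r) + 999,
\]
where the $4$ accounts for the digits of the counter and the $999$ for the
binary sequence operators joining the copies; in general a counter needs
only $O(\log n)$ symbols to express an $n$-fold repetition.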
Some constructs can even raise the expressive
power to the non-regular realm, for example
the back-references.
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
This is relevant work and we will compare it later on
with our derivative-based matcher we are going to present.
There is also some newer work called
and moreover the resulting code still consists of only simple
recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.

\section{Back-references and The Terminology Regex}

A bounded repetition, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denotes a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring is in $r$.
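In other words, the language of such an expression is
\[
L(r^{\{c\}}) \;=\;
\{\, s_1 \, s_2 \ldots s_c \;\mid\; s_i \in L(r) \text{ for all } 1 \leq i \leq c \,\}.
\]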
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools that use $\DFA$s will get
out-of-memory errors.
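The reason for the blow-up is the classic distinguishing-prefixes
argument, which we recall here: for the regular expression
\[
(a + b)^* \cdot a \cdot (a + b)^{\{n\}}
\]
a $\DFA$ has to remember, for each of the last $n{+}1$ characters it has
read, whether that character was an $a$, because any two strings differing
somewhere in that window can be extended so that exactly one of the
extensions is accepted. Hence any $\DFA$ for this expression has at least
$2^{n}$ states, no matter how cleverly it is constructed.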

The time cost of regex matching algorithms in general
involves two different phases, and different things can go
wrong in each of these phases:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.

\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regular expressions are not limited to slowness on certain
cases.
Another thing about these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be the regex and string
\begin{center}
$(aba + ab + a)^*$ \quad and \quad $ababa$.
\end{center}
The correct $\POSIX$ match for the above would be
the entire string $ababa$,
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101\parencite{regex101}
with different language engines would yield
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
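This behaviour can be reproduced outside regex101 as well. The following
small Scala sketch (our own illustration) drives Java's built-in
backtracking engine \texttt{java.util.regex}, tokenising $ababa$ by
repeatedly taking the leftmost match of the alternation:
\begin{verbatim}
import java.util.regex.Pattern

// Lexing "ababa" with aba|ab|a the way a Perl-style engine does:
// repeatedly take the leftmost match. The result is the fragmented
// [aba], [a] instead of the POSIX split [ab][aba] of the whole string.
object FragmentedMatches {
  def main(args: Array[String]): Unit = {
    val m = Pattern.compile("aba|ab|a").matcher("ababa")
    while (m.find())
      println(s"matched '${m.group()}' at [${m.start()}, ${m.end()})")
    // prints:  matched 'aba' at [0, 3)
    //          matched 'a' at [4, 5)
  }
}
\end{verbatim}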

Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.

As Grathwohl\parencite{grathwohl2014crash} wrote,
\begin{quote}
The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.
\end{quote}
%\noindent
To summarise the above, regular expressions are important.
They are popular, and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first problem is that the running time might blow up.
The second problem is that they might be error-prone on certain
very simple cases.
In the next part of the chapter, we will look into the reasons why
certain regex engines run horribly slowly on the ``catastrophic''
cases, and propose a solution that addresses both of these problems,
based on the work of Brzozowski and of Sulzmann and Lu.


\subsection{Different Phases of a Matching/Lexing Algorithm}

Most lexing algorithms can be roughly divided into
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost for such a phase by $P_1(r)$.
The second phase is the lexing phase, when the input string
$s$ is read and the data structure
representing that regex $r$ is being operated on.
We represent the time
it takes by $P_2(r, s)$.
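The overall cost of matching or lexing a string $s$ with a regex $r$ is
then simply the sum of the two phases:
\[
T(r, s) \;=\; P_1(r) + P_2(r, s).
\]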
For $\mathit{DFA}$s,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition:
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into
\[
\underbrace{r \cdots r}_{\text{$n$ copies of $r$}}.
\]
The cost $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $|r| \cdot 2^{|s|}$.
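To experience this exponential blow-up first-hand, one can time a
backtracking engine on the $(a^*)^*\,b$ example from
graph~\ref{fig:aStarStarb} against strings of $a$'s. Below is a small
Scala sketch of ours using Java's \texttt{java.util.regex}; expect the
time to roughly double with each additional $a$:
\begin{verbatim}
import java.util.regex.Pattern

// (a*)*b cannot match a string of n a's, but a backtracking engine
// tries exponentially many ways of splitting the a's between the two
// stars before giving up.
object BacktrackingBlowup {
  def main(args: Array[String]): Unit = {
    val p = Pattern.compile("(a*)*b")
    for (n <- 20 to 30) {
      val t0 = System.nanoTime()
      val matched = p.matcher("a" * n).matches()
      val secs = (System.nanoTime() - t0) / 1e9
      println(f"n = $n%2d  matched = $matched%-5s  time = $secs%6.3f s")
    }
  }
}
\end{verbatim}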
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}

The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions,
as we have seen above.

\subsubsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based