can occur
and (ii) why we choose our
approach (Brzozowski derivatives and formal proofs).


\section{Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
Theoretical results in automata theory say
that basic regular expression matching should be linear
w.r.t.\ the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching,
which could incur an exponential blow-up in size~\cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only characters, alternatives,
sequences, and Kleene stars:
\[
r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
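Since the whole approach of this thesis is based on Brzozowski derivatives, it may help to see what a matcher for this basic syntax looks like in executable form. The following Python sketch is our own minimal illustration (not the verified code developed later in the thesis):

```python
# Minimal sketch of a Brzozowski-derivative matcher for the basic
# regular expressions ZERO, ONE, c, r1 + r2, r1 . r2 and r*.

ZERO = ('zero',)                 # matches no string
ONE  = ('one',)                  # matches only the empty string

def CHAR(c):     return ('char', c)
def ALT(r1, r2): return ('alt', r1, r2)
def SEQ(r1, r2): return ('seq', r1, r2)
def STAR(r):     return ('star', r)

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ('zero', 'one'):
        return ZERO
    if tag == 'char':
        return ONE if r[1] == c else ZERO
    if tag == 'alt':
        return ALT(der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        left = SEQ(der(c, r[1]), r[2])
        return ALT(left, der(c, r[2])) if nullable(r[1]) else left
    if tag == 'star':
        return SEQ(der(c, r[1]), r)

def matches(r, s):
    """r matches s iff the iterated derivative is nullable."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

Matching proceeds by repeatedly differentiating the regular expression with respect to the input characters and checking nullability at the end; no automaton is ever built, although without simplification the derivatives can grow quickly.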
Modern regular expression matchers used by programmers,
however,
support richer constructs such as bounded repetitions
and back-references.
To differentiate, people use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
but for this work we mainly look at regular expressions.


%Most modern regex libraries
%the so-called PCRE standard (Perl Compatible Regular Expressions)
%has the back-references
Regexes come with a lot of constructs
beyond the basic ones
that make it more convenient for
programmers to write regular expressions.
Depending on the type of these constructs,
the task of matching and lexing with them
becomes more or less complex.
Some of those constructs are syntactic sugar:
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause trouble for regex libraries.

\noindent
For example, the
non-binary alternative involving three or more choices:
\[
(a | b | c) \stackrel{means}{=} ((a + b) + c)
\]
the range operator $-$ used to express the alternative
of all characters between its operands in a concise way:
\[
[0-9] \stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all digits)}
\]
and the
wildcard character $.$ used to refer to any single character:
\[
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]

\noindent
\subsection{Bounded Repetitions}
Some of those constructs do make the expressions much
more compact.
For example, the bounded regular expressions
(where $n$ and $m$ are constant natural numbers)
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
defined as
\begin{center}
\begin{tabular}{lcl}
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
\end{tabular}
\end{center}
are exponentially smaller compared with
their unfolded form: for example $r^{\{n\}}$
as opposed to
\[
\underbrace{r\ldots r}_\text{n copies of r}.
\]
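These language definitions can be checked directly on small finite examples; the sketch below (our own illustration, with languages represented as finite sets of strings) computes $(L\,r)^n$ and the bounded unions from the table above:

```python
# Checking L r^{n} = (L r)^n and L r^{..m} on a tiny example,
# with languages represented as finite sets of strings.

def lang_pow(L, n):
    """(L)^n: all concatenations of n strings drawn from L."""
    result = {""}
    for _ in range(n):
        result = {u + v for u in result for v in L}
    return result

def lang_upto(L, m):
    """Union of (L)^i for 0 <= i <= m, i.e. the language of r^{..m}."""
    result = set()
    for i in range(m + 1):
        result |= lang_pow(L, i)
    return result

L_r = {"a", "b"}                  # L r for r = a + b
print(sorted(lang_pow(L_r, 2)))   # ['aa', 'ab', 'ba', 'bb']
print(sorted(lang_upto(L_r, 1)))  # ['', 'a', 'b']
```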
%Therefore, a naive algorithm that simply unfolds
%them into their desugared forms
%will suffer from at least an exponential runtime increase.
The problem here is that tools based on the classic notion of
automata need to expand $r^{n}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
either become excruciatingly slow
(for example Verbatim++\cite{Verbatimpp}) or get
out-of-memory errors (for example $\mathit{LEX}$ and
$\mathit{JFLEX}$\footnote{which are lexer generators
in C and JAVA that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to them, and then
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run.
When given the above ``countdown'' regular expression $r^{n}$,
a small $n$ (a few dozen) would result in a
determinised automaton
with millions of states.}) under large counters.
A classic example is the regular expression $(a+b)^* a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
   \node[state,initial] (q_0)   {$q_0$};
   \node[state, red] (q_1) [right=of q_0] {$q_1$};
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all the different
scenarios in which ``countdown'' states are active.}
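The $2^{n+1}$ blow-up can be observed experimentally. The following Python sketch (our own code) encodes the NFA for $(a+b)^*\,a\,(a+b)^{\{n\}}$ directly and counts the states produced by the textbook subset construction:

```python
# Counting the DFA states obtained by the subset construction from the
# NFA of (a+b)* a (a+b)^{n}.  NFA states: 0 is the looping start state,
# state i (1 <= i <= n+1) is reached i steps after the distinguished
# 'a', and state n+1 is accepting.

def dfa_states(n):
    def step(subset, c):
        nxt = {0}                       # 0 --a,b--> 0
        if c == 'a':
            nxt.add(1)                  # 0 --a--> 1
        for q in subset:
            if 1 <= q <= n:             # i --a,b--> i+1 (the countdown)
                nxt.add(q + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                         # explore all reachable subsets
        s = todo.pop()
        for c in 'ab':
            t = step(s, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

for n in range(1, 6):
    print(n, dfa_states(n))
```

For $n = 1, \ldots, 5$ this yields $4, 8, 16, 32, 64$ reachable subsets, i.e. exactly $2^{n+1}$ states.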

One of the most recent works in the context of lexing
%with this issue
is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
This is relevant work and we will compare it later on
with the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++\cite{Verbatimpp}, which does not use derivatives,
but deterministic finite automata instead.
%An example that gives problem to automaton approaches would be
%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
%It requires at least $2^{n+1}$ states to represent
%as a DFA.


Bounded repetitions are very important because they
tend to occur a lot in practical use.
For example they appear in the regex library RegExLib,
in the rules library of Snort, as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
more than half of the
XSDs they found have bounded regular expressions in them.
Often the counters are quite large, the largest up to ten million.
An example XSD they gave
was:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).

It is therefore quite unsatisfying that
some regular expression matching libraries
impose ad hoc limits
for bounded regular expressions:
for example, in the regular expression matching library of the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big. These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.

In the work reported in \cite{CSL2022} and here,
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
The results
extend straightforwardly to
repetitions with an interval such as
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only
inductive datatypes and recursive functions,
making them handy to deal with in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
%recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
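To make this concrete, the standard derivative clauses for $r^{\{n\}}$ can be sketched in a few lines (our own self-contained Python illustration, not the verified Isabelle development): the counter is decremented lazily, so the expression is never unfolded into $n$ copies up front.

```python
# Derivative clauses extended with bounded repetitions r^{n}:
# der c (r^{n}) = (der c r) . r^{n-1}, with r^{0} nullable.

def nullable(r):
    tag = r[0]
    if tag == 'zero':   return False
    if tag == 'one':    return True
    if tag == 'char':   return False
    if tag == 'alt':    return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':    return nullable(r[1]) and nullable(r[2])
    if tag == 'ntimes': return r[2] == 0 or nullable(r[1])

def der(c, r):
    tag = r[0]
    if tag in ('zero', 'one'):
        return ('zero',)
    if tag == 'char':
        return ('one',) if r[1] == c else ('zero',)
    if tag == 'alt':
        return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        left = ('seq', der(c, r[1]), r[2])
        return ('alt', left, der(c, r[2])) if nullable(r[1]) else left
    if tag == 'ntimes':
        if r[2] == 0:
            return ('zero',)
        # decrement the counter only when a character is consumed
        return ('seq', der(c, r[1]), ('ntimes', r[1], r[2] - 1))

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

r = ('ntimes', ('char', 'a'), 100)     # a^{100}, never unfolded
print(matches(r, 'a' * 100))           # True
print(matches(r, 'a' * 99))            # False
```

Note that the whole extension is again just a recursive function over an inductive datatype, which is what makes it pleasant to reason about in a theorem prover.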


\subsection{Back-References}
%Some constructs can even raise the expressive
%power to the non-regular realm, for example
%the back-references.
%bounded repetitions, as we have discussed in the
%previous section.
%This super-linear behaviour of the
%regex matching engines we have?
The syntax and expressive power of regexes
can go quickly beyond the regular language hierarchy
with back-references.
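A classic witness is the copy language $\{ww \mid w \in \Sigma^+\}$, which is not regular (not even context-free), yet is matched by a single back-reference, as this small Python illustration shows:

```python
import re

# The pattern (.+)\1 uses a back-reference to require that the input
# consists of some non-empty word w followed by an identical copy of w.
# The language { ww | w a non-empty word } is not regular.
copy_lang = re.compile(r'(.+)\1')

print(bool(copy_lang.fullmatch('abcabc')))   # True  (w = abc)
print(bool(copy_lang.fullmatch('abcab')))    # False
```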
%\section{Back-references and The Terminology Regex}

%When one constructs an $\NFA$ out of a regular expression
%there is often very little to be done in the first phase, one simply
%construct the $\NFA$ states based on the structure of the input regular expression.

%In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
%one by keeping track of all active states after consuming
%a character, and update that set of states iteratively.
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation and guarantee a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use $\mathit{NFA}$ for matching is choosing
%when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps



%The time cost of regex matching algorithms in general
%involve two different phases, and different things can go differently wrong on
%these phases.
%$\DFA$s usually have problems in the first (construction) phase
%, whereas $\NFA$s usually run into trouble
%on the second phase.


\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regexes are not limited to slowness on certain
cases.
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be the regex
\begin{center}
$(aba + ab + a)^*$ and the string $ababa$
\end{center}
The correct $\POSIX$ match for the above would be
the entire string $ababa$,
split into two Kleene star iterations, namely $[ab]$, $[aba]$, at positions
$[0, 2), [2, 5)$
respectively.
But trying this out in regex101\parencite{regex101}
with different language engines would yield
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
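This fragmented behaviour can be reproduced with, for example, Python's backtracking re engine (one of the engines offered by regex101):

```python
import re

# POSIX longest-match lexing would split the whole of "ababa" into the
# two iterations [ab][aba].  A backtracking engine scanning for
# non-overlapping leftmost matches instead returns two fragments,
# leaving the 'b' at position 3 unmatched.
fragments = re.findall(r'aba|ab|a', 'ababa')
print(fragments)            # ['aba', 'a']
```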

Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.

As Grathwohl\parencite{grathwohl2014crash} wrote,
\begin{quote}
The POSIX strategy is more complicated than the
greedy because of the dependence on information about
the length of matched strings in the various subexpressions.
\end{quote}
%\noindent
To summarise the above, regular expressions are important.
They are popular and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first problem is that the running time might blow up.
The second problem is that they might be error-prone on certain
very simple cases.
In the next part of the chapter, we will look into reasons why
certain regex engines are running horribly slow on the ``catastrophic''
cases and propose a solution that addresses both of these problems
based on Brzozowski and Sulzmann and Lu's work.


\subsection{Different Phases of a Matching/Lexing Algorithm}


Most lexing algorithms can be roughly divided into
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of this phase by $P_1(r)$.
The second phase is the lexing phase, when the input string
$s$ is read and the data structure
representing the regex $r$ is operated on.
We represent the time
it takes by $P_2(r, s)$.\\
For a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition, since
a deterministic finite automaton
by definition has at most one state active and at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into
\[
\underbrace{r \cdots r}_{\text{n copies of r}}.
\]
The cost $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $|r| \cdot 2^{|s|}$.
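The exponential behaviour of backtracking can be observed on a toy matcher. The sketch below (our own instrumented Python code, not any production engine) counts the recursive calls made when matching $(a + aa)^*$ against strings $a^n b$, which can never match:

```python
# A naive backtracking matcher over the basic syntax, instrumented to
# count recursive calls.  On (a + aa)* against the non-matching strings
# a^n b, the count grows exponentially in n, because every way of
# splitting a^n into single and double a's is explored.

def match(r, s, i, k, count):
    count[0] += 1
    tag = r[0]
    if tag == 'char':
        return i < len(s) and s[i] == r[1] and k(i + 1)
    if tag == 'alt':
        return match(r[1], s, i, k, count) or match(r[2], s, i, k, count)
    if tag == 'seq':
        return match(r[1], s, i,
                     lambda j: match(r[2], s, j, k, count), count)
    if tag == 'star':
        # try one more (non-empty) iteration, otherwise stop
        return match(r[1], s, i,
                     lambda j: j > i and match(r, s, j, k, count),
                     count) or k(i)

def calls(n):
    r = ('star', ('alt', ('char', 'a'),
                  ('seq', ('char', 'a'), ('char', 'a'))))
    count = [0]
    ok = match(r, 'a' * n + 'b', 0, lambda i: i == n + 1, count)
    assert not ok                  # the trailing b can never be matched
    return count[0]

print(calls(10), calls(20))        # the second count is vastly larger
```

The call count grows roughly like the number of compositions of $n$ into ones and twos, i.e. exponentially in $n$, which is exactly the backtracking worst case described above.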
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}


The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.

\subsubsection{Problems with Bounded Repetitions}


So we have practical implementations
of regular expression matching/lexing which are fast
but do not come with any guarantees that they will not grind to a halt
or give wrong answers.
Our goal is to have a regex lexing algorithm that comes with