582 more than half of the |
582 more than half of the |
583 XSDs they found have bounded regular expressions in them. |
583 XSDs they found have bounded regular expressions in them. |
584 Often the counters are quite large, the largest up to ten million. |
584 Often the counters are quite large, the largest up to ten million. |
585 An example XSD they gave |
585 An example XSD they gave |
586 was: |
586 was: |
587 \begin{verbatim} |
587 %\begin{verbatim} |
588 <sequence minOccurs="0" maxOccurs="65535"> |
588 %<sequence minOccurs="0" maxOccurs="65535"> |
589 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
589 % <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
590 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
590 % <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
591 </sequence> |
591 %</sequence> |
592 \end{verbatim} |
592 %\end{verbatim} |
593 This can be seen as the expression |
593 This can be seen as the expression |
594 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
594 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
595 regular expressions |
595 regular expressions |
596 satisfying certain constraints (such as |
596 satisfying certain constraints (such as |
597 satisfying the floating point number format). |
597 satisfying the floating point number format). |
602 for bounded regular expressions: |
602 for bounded regular expressions: |
603 For example, in the regular expression matching library in the Go |
603 For example, in the regular expression matching library in the Go |
604 language the regular expression $a^{1001}$ is not permitted, because no counter |
604 language the regular expression $a^{1001}$ is not permitted, because no counter |
605 can be above 1000, and in the built-in Rust regular expression library |
605 can be above 1000, and in the built-in Rust regular expression library |
606 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
606 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
607 for being too big. These problems can of course be solved in matching algorithms where |
607 for being too big. |
|
608 As Becchi and Crawley\cite{Becchi08} have pointed out, |
|
609 the reason for these restrictions |
|
610 are that they simulate a non-deterministic finite |
|
611 automata (NFA) with a breadth-first search. |
|
612 This way the number of active states could |
|
613 be equal to the counter number. |
|
614 When the counters are large, |
|
615 the memory requirement could become |
|
616 infeasible, and the pattern needs to be rejected straight away. |
|
617 \begin{figure}[H] |
|
618 \begin{center} |
|
619 \begin{tikzpicture} [node distance = 2cm, on grid, auto] |
|
620 |
|
621 \node (q0) [state, initial] {$0$}; |
|
622 \node (q1) [state, right = of q0] {$1$}; |
|
623 \node (q2) [state, right = of q1] {$2$}; |
|
624 \node (qdots) [right = of q2] {$\ldots$}; |
|
625 \node (qn) [state, right = of qdots] {$n$}; |
|
626 \node (qn1) [state, right = of qn] {$n+1$}; |
|
627 \node (qn2) [state, right = of qn1] {$n+2$}; |
|
628 \node (qn3) [state, accepting, right = of qn2] {$n+3$}; |
|
629 |
|
630 \path [-stealth, thick] |
|
631 (q0) edge [loop above] node {a} () |
|
632 (q0) edge node {a} (q1) |
|
633 (q1) edge node {.} (q2) |
|
634 (q2) edge node {.} (qdots) |
|
635 (qdots) edge node {.} (qn) |
|
636 (qn) edge node {.} (qn1) |
|
637 (qn1) edge node {b} (qn2) |
|
638 (qn2) edge node {$c$} (qn3); |
|
639 \end{tikzpicture} |
|
640 %\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
641 % \node[state,initial] (q_0) {$0$}; |
|
642 % \node[state, ] (q_1) [right=of q_0] {$1$}; |
|
643 % \node[state, ] (q_2) [right=of q_1] {$2$}; |
|
644 % \node[state, |
|
645 % \node[state, accepting, ](q_3) [right=of q_2] {$3$}; |
|
646 % \path[->] |
|
647 % (q_0) edge node {a} (q_1) |
|
648 % edge [loop below] node {a,b} () |
|
649 % (q_1) edge node {a,b} (q_2) |
|
650 % (q_2) edge node {a,b} (q_3); |
|
651 %\end{tikzpicture} |
|
652 \end{center} |
|
653 \caption{The example given by Becchi and Crawley |
|
654 that NFA simulation can consume large |
|
655 amounts of memory: $.^*a.^{\{n\}}bc$ matching |
|
656 strings of the form $aaa\ldots aaaabc$. |
|
657 When traversing in a breadth-first manner, |
|
658 all states from 0 till $n+1$ will become active.} |
|
659 \end{figure} |
|
660 %Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this |
|
661 %type of $\mathit{NFA}$ simulation and guarantees a linear runtime |
|
662 %in terms of input string length. |
|
663 %TODO:try out these lexers |
|
664 These problems can of course be solved in matching algorithms where |
608 automata go beyond the classic notion and for instance include explicit |
665 automata go beyond the classic notion and for instance include explicit |
609 counters \cite{Turo_ov__2020}. |
666 counters \cite{Turo_ov__2020}. |
|
667 These solutions can be quite effective, |
|
668 with the ability to process |
|
669 gigabytes of string input per second |
|
670 even with large counters \cite{Becchi08}. |
|
671 But formally reasoning about these automata can be challenging |
|
672 and un-intuitive. |
|
673 Therefore, correctness and runtime claims made about these solutions need to be |
|
674 taken with a grain of salt. |
610 |
675 |
611 In the work reported in \cite{CSL2022} and here, |
676 In the work reported in \cite{CSL2022} and here, |
612 we add better support using derivatives |
677 we add better support using derivatives |
613 for bounded regular expressions $r^{\{n\}}$. |
678 for bounded regular expressions $r^{\{n\}}$. |
614 The results |
679 The results |
627 %recursive functions and inductive datatypes. |
692 %recursive functions and inductive datatypes. |
628 Finally, bounded regular expressions do not destroy our finite |
693 Finally, bounded regular expressions do not destroy our finite |
629 boundedness property, which we shall prove later on. |
694 boundedness property, which we shall prove later on. |
630 |
695 |
631 |
696 |
|
697 |
|
698 |
|
699 |
632 \subsection{Back-References} |
700 \subsection{Back-References} |
633 %Some constructs can even raise the expressive |
701 The other way to simulate an $\mathit{NFA}$ for matching is choosing |
634 %power to the non-regular realm, for example |
702 a single transition each time, keeping all the other options in |
635 %the back-references. |
703 a queue or stack, and backtracking if that choice eventually |
636 %bounded repetitions, as we have discussed in the |
704 fails. This method, often called a "depth-first-search", |
637 %previous section. |
705 is efficient in a lot of cases, but could end up |
638 %This super-linear behaviour of the |
706 with exponential run time. |
639 %regex matching engines we have? |
707 The backtracking method is employed in regex libraries |
|
708 that support \emph{back-references}, for example |
|
709 in Java and Python. |
640 The syntax and expressive power of regexes |
710 The syntax and expressive power of regexes |
641 can go quickly beyond the regular language hierarchy |
711 go beyond the regular language hierarchy |
642 with back-references. |
712 with back-references. |
643 %\section{Back-references and The Terminology Regex} |
713 %\section{Back-references and The Terminology Regex} |
644 |
714 |
645 %When one constructs an $\NFA$ out of a regular expression |
715 %When one constructs an $\NFA$ out of a regular expression |
646 %there is often very little to be done in the first phase, one simply |
716 %there is often very little to be done in the first phase, one simply |
650 %one by keeping track of all active states after consuming |
720 %one by keeping track of all active states after consuming |
651 %a character, and update that set of states iteratively. |
721 %a character, and update that set of states iteratively. |
652 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$ |
722 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$ |
653 %for a path terminating |
723 %for a path terminating |
654 %at an accepting state. |
724 %at an accepting state. |
655 Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this |
725 |
656 type of $\mathit{NFA}$ simulation and guarantees a linear runtime |
726 |
657 in terms of input string length. |
727 |
658 %TODO:try out these lexers |
728 Given a regular expression like this (the sequence |
659 The other way to use $\mathit{NFA}$ for matching is choosing |
|
660 a single transition each time, keeping all the other options in |
|
661 a queue or stack, and backtracking if that choice eventually |
|
662 fails. This method, often called a "depth-first-search", |
|
663 is efficient in a lot of cases, but could end up |
|
664 with exponential run time.\\ |
|
665 %TODO:COMPARE java python lexer speed with Rust and Go |
|
666 The reason behind backtracking algorithms in languages like |
|
667 Java and Python is that they support back-references. |
|
668 \subsubsection{Back References} |
|
669 If we have a regular expression like this (the sequence |
|
670 operator is omitted for brevity): |
729 operator is omitted for brevity): |
671 \begin{center} |
730 \begin{center} |
672 $r_1(r_2(r_3r_4))$ |
731 $r_1r_2r_3r_4$ |
673 \end{center} |
732 \end{center} |
674 We could label sub-expressions of interest |
733 one could label sub-expressions of interest |
675 by parenthesizing them and giving |
734 by parenthesizing them and giving |
676 them a number by the order in which their opening parentheses appear. |
735 them a number by the order in which their opening parentheses appear. |
677 One possible way of parenthesizing and labelling is given below: |
736 One possible way of parenthesizing and labelling is given below: |
678 \begin{center} |
737 \begin{center} |
679 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
738 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
680 \end{center} |
739 \end{center} |
681 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled |
740 The sub-expressions |
682 by 1 to 4. $1$ would refer to the entire expression |
741 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
683 $(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc. |
742 by 1 to 4, and can be ``referred back'' by their respective numbers. |
684 These sub-expressions are called "capturing groups". |
743 %These sub-expressions are called "capturing groups". |
685 We can use the following syntax to denote that we want a string just matched by a |
744 To do so, we use the syntax $\backslash i$ |
686 sub-expression (capturing group) to appear at a certain location again, |
745 to denote that we want the sub-string |
687 exactly as it was: |
746 of the input just matched by the i-th |
|
747 sub-expression to appear again, |
|
748 exactly the same as it first appeared: |
688 \begin{center} |
749 \begin{center} |
689 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
750 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
690 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$ |
751 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$ |
691 \end{center} |
752 \end{center} |
692 The backslash and number $i$ are used to denote such |
753 %The backslash and number $i$ are the |
693 so-called "back-references". |
754 %so-called "back-references". |
694 Let $e$ be an expression made of regular expressions |
755 %Let $e$ be an expression made of regular expressions |
695 and back-references. $e$ contains the expression $e_i$ |
756 %and back-references. $e$ contains the expression $e_i$ |
696 as its $i$-th capturing group. |
757 %as its $i$-th capturing group. |
697 The semantics of back-reference can be recursively |
758 %The semantics of back-reference can be recursively |
698 written as: |
759 %written as: |
|
760 %\begin{center} |
|
761 % \begin{tabular}{c} |
|
762 % $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
|
763 % $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
|
764 % \end{tabular} |
|
765 %\end{center} |
|
766 A concrete example |
|
767 for back-references would be |
699 \begin{center} |
768 \begin{center} |
700 \begin{tabular}{c} |
769 $((a|b|c|\ldots|z)^*)\backslash 1$, |
701 $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
|
702 $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
|
703 \end{tabular} |
|
704 \end{center} |
770 \end{center} |
705 The concrete example |
771 which would match |
706 $((a|b|c|\ldots|z)^*)\backslash 1$ |
772 strings that can be split into two identical halves, |
707 would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\ |
773 for example $\mathit{bobo}$, $\mathit{weewee}$ and etc. |
708 Back-reference is a construct in the "regex" standard |
774 Back-reference is a regex construct |
709 that programmers found useful, but not exactly |
775 that programmers found useful, but not exactly |
710 regular any more. |
776 regular any more. |
711 In fact, that allows the regex construct to express |
777 In fact, that allows the regex construct to express |
712 languages that cannot be contained in context-free |
778 languages that cannot be contained in context-free |
713 languages either. |
779 languages either. |