778 % $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
778 % $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
779 % $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
779 % $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
780 % \end{tabular} |
780 % \end{tabular} |
781 %\end{center} |
781 %\end{center} |
782 A concrete example |
782 A concrete example |
783 for back-references would be |
783 for back-references is |
784 \begin{center} |
784 \begin{center} |
785 $(.^*)\backslash 1$, |
785 $(.^*)\backslash 1$, |
786 \end{center} |
786 \end{center} |
787 which would match |
787 which matches |
788 strings that can be split into two identical halves, |
788 strings that can be split into two identical halves, |
789 for example $\mathit{foofoo}$, $\mathit{ww}$ and etc. |
789 for example $\mathit{foofoo}$, $\mathit{ww}$ and so on. |
790 Note that this is different from |
790 Note that this is different from |
791 repeating the sub-expression verbatim like |
791 repeating the sub-expression verbatim like |
792 \begin{center} |
792 \begin{center} |
793 $(.^*)(.^*)$, |
793 $(.^*)(.^*)$, |
794 \end{center} |
794 \end{center} |
795 which does not impose any restrictions on what strings the second |
795 which does not impose any restrictions on what strings the second |
796 sub-expression $.^*$ |
796 sub-expression $.^*$ |
797 might match. |
797 might match. |
798 Another example of back-references would be |
798 Another example of back-references is |
799 \begin{center} |
799 \begin{center} |
800 $(.)(.)\backslash 2\backslash 1$ |
800 $(.)(.)\backslash 2\backslash 1$ |
801 \end{center} |
801 \end{center} |
802 which expresses four-character palindromes |
802 which matches four-character palindromes |
803 like $abba$, $x??x$ etc. |
803 like $abba$, $x??x$ and so on. |
804 |
804 |
805 Back-references is a regex construct |
805 Back-references is a regex construct |
806 that programmers found quite useful. |
806 that programmers find quite useful. |
807 According to Becchi and Crawley\cite{Becchi08}, |
807 According to Becchi and Crawley\cite{Becchi08}, |
808 6\% of Snort rules (up until 2008) include the use of them. |
808 6\% of Snort rules (up until 2008) use them. |
809 The most common use of back-references |
809 The most common use of back-references |
810 would be expressing well-formed html files, |
810 is to express well-formed html files, |
811 where back-references would be handy in expressing |
811 where back-references are convenient for matching |
812 a pair of opening and closing tags like |
812 opening and closing tags like |
813 \begin{center} |
813 \begin{center} |
814 $\langle html \rangle \ldots \langle / html \rangle$ |
814 $\langle html \rangle \ldots \langle / html \rangle$ |
815 \end{center} |
815 \end{center} |
816 A regex describing such a format |
816 A regex describing such a format |
817 could be |
817 is |
818 \begin{center} |
818 \begin{center} |
819 $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ |
819 $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ |
820 \end{center} |
820 \end{center} |
821 Despite being useful, the syntax and expressive power of regexes |
821 Despite being useful, the expressive power of regexes |
822 go beyond the regular language hierarchy |
822 go beyond the regular language hierarchy |
823 with back-references. |
823 once back-references are included. |
824 In fact, they allow the regex construct to express |
824 In fact, they allow the regex construct to express |
825 languages that cannot be contained in context-free |
825 languages that cannot be contained in context-free |
826 languages either. |
826 languages either. |
827 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
827 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
828 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
828 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
829 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. |
829 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. |
830 Such a language is contained in the context-sensitive hierarchy |
830 Such a language is contained in the context-sensitive hierarchy |
831 of formal languages. |
831 of formal languages. |
832 Solving the back-reference expressions matching problem |
832 Also solving the matching problem involving back-references |
833 is known to be NP-complete \parencite{alfred2014algorithms}. |
833 is known to be NP-complete \parencite{alfred2014algorithms}. |
834 A non-bactracking, |
|
835 efficient solution is not known to exist. |
|
836 Regex libraries supporting back-references such as |
834 Regex libraries supporting back-references such as |
837 PCRE \cite{pcre} therefore have to |
835 PCRE \cite{pcre} therefore have to |
838 revert to a depth-first search algorithm which backtracks. |
836 revert to a depth-first search algorithm which backtracks. |
839 What is unexpected is that even in the cases |
837 What is unexpected is that even in the cases |
840 not involving back-references, there is still |
838 not involving back-references, there is still |
841 a (non-negligible) chance they might backtrack super-linearly, |
839 a (non-negligible) chance they might backtrack super-linearly, |
842 as shown in the graphs in \ref{fig:aStarStarb}. |
840 as shown in the graphs in figure\ref{fig:aStarStarb}. |
843 |
841 |
844 \subsection{Summary of the Catastrophic Backtracking Problem} |
842 \subsection{Summary of the Catastrophic Backtracking Problem} |
845 Summing these up, we can categorise existing |
843 Summing these up, we can categorise existing |
846 practical regex libraries into two kinds: |
844 practical regex libraries into two kinds: |
847 (i)The ones with linear |
845 (i)The ones with linear |