707 These problems can of course be solved in matching algorithms where |
707 These problems can of course be solved in matching algorithms where |
708 automata go beyond the classic notion and for instance include explicit |
708 automata go beyond the classic notion and for instance include explicit |
709 counters \cite{Turo_ov__2020}. |
709 counters \cite{Turo_ov__2020}. |
710 These solutions can be quite efficient, |
710 These solutions can be quite efficient, |
711 with the ability to process |
711 with the ability to process |
712 gigabytes of strings input per second |
712 gigabits of strings input per second |
713 even with large counters \cite{Becchi08}. |
713 even with large counters \cite{Becchi08}. |
714 These practical solutions do not come with |
714 These practical solutions do not come with |
715 formal guarantees, and as pointed out by |
715 formal guarantees, and as pointed out by |
716 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone. |
716 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone. |
717 %But formal reasoning about these automata especially in Isabelle |
717 %But formal reasoning about these automata especially in Isabelle |
746 |
746 |
747 \subsection{Back-References} |
747 \subsection{Back-References} |
748 The other way to simulate an $\mathit{NFA}$ for matching is choosing |
748 The other way to simulate an $\mathit{NFA}$ for matching is choosing |
749 a single transition each time, keeping all the other options in |
749 a single transition each time, keeping all the other options in |
750 a queue or stack, and backtracking if that choice eventually |
750 a queue or stack, and backtracking if that choice eventually |
751 fails. This method, often called a "depth-first-search", |
751 fails. |
752 is efficient in a lot of cases, but could end up |
752 This method, often called a "depth-first-search", |
|
753 is efficient in many cases, but could end up |
753 with exponential run time. |
754 with exponential run time. |
754 The backtracking method is employed in regex libraries |
755 The backtracking method is employed in regex libraries |
755 that support \emph{back-references}, for example |
756 that support \emph{back-references}, for example |
756 in Java and Python. |
757 in Java and Python. |
757 %\section{Back-references and The Terminology Regex} |
758 %\section{Back-references and The Terminology Regex} |
767 %for a path terminating |
768 %for a path terminating |
768 %at an accepting state. |
769 %at an accepting state. |
769 |
770 |
770 |
771 |
771 |
772 |
772 Given a regular expression like this (the sequence |
773 Consider the following regular expression where the sequence |
773 operator is omitted for brevity): |
774 operator is omitted for brevity: |
774 \begin{center} |
775 \begin{center} |
775 $r_1r_2r_3r_4$ |
776 $r_1r_2r_3r_4$ |
776 \end{center} |
777 \end{center} |
|
778 In this example, |
777 one could label sub-expressions of interest |
779 one could label sub-expressions of interest |
778 by parenthesizing them and giving |
780 by parenthesizing them and giving |
779 them a number by the order in which their opening parentheses appear. |
781 them a number in the order in which their opening parentheses appear. |
780 One possible way of parenthesizing and labelling is given below: |
782 One possible way of parenthesizing and labelling is: |
781 \begin{center} |
783 \begin{center} |
782 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
784 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
783 \end{center} |
785 \end{center} |
784 The sub-expressions |
786 The sub-expressions |
785 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
787 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
786 by 1 to 4, and can be ``referred back'' by their respective numbers. |
788 by 1 to 4, and can be ``referred back'' by their respective numbers. |
787 %These sub-expressions are called "capturing groups". |
789 %These sub-expressions are called "capturing groups". |
788 To do so, we use the syntax $\backslash i$ |
790 To do so, we use the syntax $\backslash i$ |
789 to denote that we want the sub-string |
791 to denote that we want the sub-string |
790 of the input just matched by the i-th |
792 of the input matched by the i-th |
791 sub-expression to appear again, |
793 sub-expression to appear again, |
792 exactly the same as it first appeared: |
794 exactly the same as it first appeared: |
793 \begin{center} |
795 \begin{center} |
794 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
796 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
795 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$ |
797 \underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$ |
796 \end{center} |
798 \end{center} |
|
799 Once the sub-string $s_i$ for the sub-expression $r_i$ |
|
800 has been fixed, there is no variability on what the back-reference |
|
801 label $\backslash i$ can be---it is tied to $s_i$. |
|
802 The overall string will look like $\ldots s_i \ldots s_i \ldots $ |
797 %The backslash and number $i$ are the |
803 %The backslash and number $i$ are the |
798 %so-called "back-references". |
804 %so-called "back-references". |
799 %Let $e$ be an expression made of regular expressions |
805 %Let $e$ be an expression made of regular expressions |
800 %and back-references. $e$ contains the expression $e_i$ |
806 %and back-references. $e$ contains the expression $e_i$ |
801 %as its $i$-th capturing group. |
807 %as its $i$-th capturing group. |
821 $(.^*)(.^*)$, |
827 $(.^*)(.^*)$, |
822 \end{center} |
828 \end{center} |
823 which does not impose any restrictions on what strings the second |
829 which does not impose any restrictions on what strings the second |
824 sub-expression $.^*$ |
830 sub-expression $.^*$ |
825 might match. |
831 might match. |
826 Another example of back-references is |
832 Another example for back-references is |
827 \begin{center} |
833 \begin{center} |
828 $(.)(.)\backslash 2\backslash 1$ |
834 $(.)(.)\backslash 2\backslash 1$ |
829 \end{center} |
835 \end{center} |
830 which matches four-character palindromes |
836 which matches four-character palindromes |
831 like $abba$, $x??x$ and so on. |
837 like $abba$, $x??x$ and so on. |
832 |
838 |
833 Back-references is a regex construct |
839 Back-references is a regex construct |
834 that programmers find quite useful. |
840 that programmers find quite useful. |
835 According to Becchi and Crawley~\cite{Becchi08}, |
841 According to Becchi and Crawley \cite{Becchi08}, |
836 6\% of Snort rules (up until 2008) use them. |
842 6\% of Snort rules (up until 2008) use them. |
837 The most common use of back-references |
843 The most common use of back-references |
838 is to express well-formed html files, |
844 is to express well-formed html files, |
839 where back-references are convenient for matching |
845 where back-references are convenient for matching |
840 opening and closing tags like |
846 opening and closing tags like |
852 In fact, they allow the regex construct to express |
858 In fact, they allow the regex construct to express |
853 languages that cannot be contained in context-free |
859 languages that cannot be contained in context-free |
854 languages either. |
860 languages either. |
855 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
861 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
856 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
862 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
857 which cannot be expressed by context-free grammars \parencite{campeanu2003formal}. |
863 which cannot be expressed by |
|
864 context-free grammars \cite{campeanu2003formal}. |
858 Such a language is contained in the context-sensitive hierarchy |
865 Such a language is contained in the context-sensitive hierarchy |
859 of formal languages. |
866 of formal languages. |
860 Also solving the matching problem involving back-references |
867 Also solving the matching problem involving back-references |
861 is known to be NP-complete \parencite{alfred2014algorithms}. |
868 is known to be NP-complete \parencite{alfred2014algorithms}. |
862 Regex libraries supporting back-references such as |
869 Regex libraries supporting back-references such as |
863 PCRE \cite{pcre} therefore have to |
870 PCRE \cite{pcre} therefore have to |
864 revert to a depth-first search algorithm which backtracks. |
871 revert to a depth-first search algorithm including backtracking. |
865 What is unexpected is that even in the cases |
872 What is unexpected is that even in the cases |
866 not involving back-references, there is still |
873 not involving back-references, there is still |
867 a (non-negligible) chance they might backtrack super-linearly, |
874 a (non-negligible) chance they might backtrack super-linearly, |
868 as shown in the graphs in figure\ref{fig:aStarStarb}. |
875 as shown in the graphs in figure \ref{fig:aStarStarb}. |
869 |
876 |
870 Summing these up, we can categorise existing |
877 Summing up, we can categorise existing |
871 practical regex libraries into two kinds: |
878 practical regex libraries into two kinds: |
872 (i) The ones with linear |
879 (i) The ones with linear |
873 time guarantees like Go and Rust. The downside with them is that |
880 time guarantees like Go and Rust. The downside with them is that |
874 they impose restrictions |
881 they impose restrictions |
875 on the regular expressions (not allowing back-references, |
882 on the regular expressions (not allowing back-references, |
930 In some cases, they either fail to generate a |
937 In some cases, they either fail to generate a |
931 result when there exists a match, |
938 result when there exists a match, |
932 or give results that are inconsistent with the POSIX standard. |
939 or give results that are inconsistent with the POSIX standard. |
933 A concrete example is the regex: |
940 A concrete example is the regex: |
934 \begin{center} |
941 \begin{center} |
935 $(aba + ab + a)^* \text{and the string} ababa$ |
942 $(aba + ab + a)^* \text{and the string} \; ababa$ |
936 \end{center} |
943 \end{center} |
937 The correct POSIX match for the above |
944 The correct POSIX match for the above |
938 is the entire string $ababa$, |
945 involves the entire string $ababa$, |
939 split into two Kleene star iterations, namely $[ab], [aba]$ at positions |
946 split into two Kleene star iterations, namely $[ab], [aba]$ at positions |
940 $[0, 2), [2, 5)$ |
947 $[0, 2), [2, 5)$ |
941 respectively. |
948 respectively. |
942 But trying this out in regex101 \parencite{regex101} \footnote{ |
949 But feeding this example to the different engines |
|
950 in regex101 \parencite{regex101} \footnote{ |
943 regex101 is an online regular expression matcher which |
951 regex101 is an online regular expression matcher which |
944 provides API for trying out regular expression |
952 provides API for trying out regular expression |
945 engines of multiple popular programming languages like |
953 engines of multiple popular programming languages like |
946 Java, Python, Go, etc.} |
954 Java, Python, Go, etc.} |
947 with different engines yields |
955 yields |
948 always matches: $[aba]$ at $[0, 3)$ |
956 only two incomplete matches: $[aba]$ at $[0, 3)$ |
949 and $a$ at $[4, 5)$. |
957 and $a$ at $[4, 5)$. |
950 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
958 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
951 commented that most regex libraries are not |
959 commented that most regex libraries are not |
952 correctly implementing the central POSIX |
960 correctly implementing the central POSIX |
953 rule, called the maximum munch rule. |
961 rule, called the maximum munch rule. |
958 the length of matched strings in the various subexpressions.'' |
966 the length of matched strings in the various subexpressions.'' |
959 \end{quote} |
967 \end{quote} |
960 %\noindent |
968 %\noindent |
961 We think the implementation complexity of POSIX rules also come from |
969 We think the implementation complexity of POSIX rules also come from |
962 the specification being not very precise. |
970 the specification being not very precise. |
|
971 The developers of the regexp package of GO |
|
972 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}} |
|
973 commented that |
|
974 \begin{quote}\it |
|
975 `` |
|
976 The POSIX rule is computationally prohibitive |
|
977 and not even well-defined. |
|
978 `` |
|
979 \end{quote} |
963 There are many informal summaries of this disambiguation |
980 There are many informal summaries of this disambiguation |
964 strategy, which are often quite long and delicate. |
981 strategy, which are often quite long and delicate. |
965 For example Kuklewicz \cite{KuklewiczHaskell} |
982 For example Kuklewicz \cite{KuklewiczHaskell} |
966 described the POSIX rule as (section 1, last paragraph): |
983 described the POSIX rule as (section 1, last paragraph): |
967 \begin{quote} |
984 \begin{quote} |
995 POSIX specification given by Okui and Suzuki \cite{Okui10}. |
1012 POSIX specification given by Okui and Suzuki \cite{Okui10}. |
996 They then formally proved the correctness of |
1013 They then formally proved the correctness of |
997 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
1014 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
998 with regards to that specification. |
1015 with regards to that specification. |
999 They also found that the informal POSIX |
1016 They also found that the informal POSIX |
1000 specification by Sulzmann and Lu does not work for the correctness proof. |
1017 specification by Sulzmann and Lu needs to be revised for the correctness proof. |
1001 |
1018 |
1002 In the next section we will briefly |
1019 In the next section, we will briefly |
1003 introduce Brzozowski derivatives and Sulzmann |
1020 introduce Brzozowski derivatives and Sulzmann |
1004 and Lu's algorithm, which the main point of this thesis builds on. |
1021 and Lu's algorithm, which the main point of this thesis builds on. |
1005 %We give a taste of what they |
1022 %We give a taste of what they |
1006 %are like and why they are suitable for regular expression |
1023 %are like and why they are suitable for regular expression |
1007 %matching and lexing. |
1024 %matching and lexing. |
1025 %$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
1042 %$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
1026 % \end{tabular} |
1043 % \end{tabular} |
1027 %\end{center} |
1044 %\end{center} |
1028 %\noindent |
1045 %\noindent |
1029 %The first clause says that for the regular expression |
1046 %The first clause says that for the regular expression |
1030 %denoting a singleton set consisting of a sinlge-character string $\{ d \}$, |
1047 %denoting a singleton set consisting of a single-character string $\{ d \}$, |
1031 %we check the derivative character $c$ against $d$, |
1048 %we check the derivative character $c$ against $d$, |
1032 %returning a set containing only the empty string $\{ [] \}$ |
1049 %returning a set containing only the empty string $\{ [] \}$ |
1033 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise. |
1050 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise. |
1034 %The second clause states that to obtain the regular expression |
1051 %The second clause states that to obtain the regular expression |
1035 %representing all strings' head character $c$ being chopped off |
1052 %representing all strings' head character $c$ being chopped off |
1041 %This property can be used on regular expressions |
1058 %This property can be used on regular expressions |
1042 %matching and lexing--to test whether a string $s$ is in $L \; r$, |
1059 %matching and lexing--to test whether a string $s$ is in $L \; r$, |
1043 %one simply takes derivatives of $r$ successively with |
1060 %one simply takes derivatives of $r$ successively with |
1044 %respect to the characters (in the correct order) in $s$, |
1061 %respect to the characters (in the correct order) in $s$, |
1045 %and then test whether the empty string is in the last regular expression. |
1062 %and then test whether the empty string is in the last regular expression. |
1046 With this derivatives give a simple solution |
1063 With this property, derivatives give a simple solution |
1047 to the problem of matching a string $s$ with a regular |
1064 to the problem of matching a string $s$ with a regular |
1048 expression $r$: if the derivative of $r$ w.r.t.\ (in |
1065 expression $r$: if the derivative of $r$ w.r.t.\ (in |
1049 succession) all the characters of the string matches the empty string, |
1066 succession) all the characters of the string matches the empty string, |
1050 then $r$ matches $s$ (and {\em vice versa}). |
1067 then $r$ matches $s$ (and {\em vice versa}). |
1051 %This makes formally reasoning about these properties such |
1068 %This makes formally reasoning about these properties such |