764 % \end{tabular} |
764 % \end{tabular} |
765 %\end{center} |
765 %\end{center} |
766 A concrete example |
766 A concrete example |
767 for back-references would be |
767 for back-references would be |
768 \begin{center} |
768 \begin{center} |
769 $((a|b|c|\ldots|z)^*)\backslash 1$, |
769 $(.^*)\backslash 1$, |
770 \end{center} |
770 \end{center} |
771 which would match |
771 which would match |
772 strings that can be split into two identical halves, |
772 strings that can be split into two identical halves, |
773 for example $\mathit{bobo}$, $\mathit{weewee}$ and etc. |
773 for example $\mathit{foofoo}$, $\mathit{ww}$ and etc. |
774 Back-reference is a regex construct |
774 Note that this is different from |
775 that programmers found useful, but not exactly |
775 repeating the sub-expression verbatim like |
776 regular any more. |
776 \begin{center} |
777 In fact, that allows the regex construct to express |
777 $(.^*)(.^*)$, |
|
778 \end{center} |
|
779 which does not impose any restrictions on what strings the second |
|
780 sub-expression $.^*$ |
|
781 might match. |
|
782 Another example of back-references would be |
|
783 \begin{center} |
|
784 $(.)(.)\backslash 2\backslash 1$ |
|
785 \end{center} |
|
786 which expresses four-character palindromes |
|
787 like $abba$, $x??x$ etc. |
|
788 |
|
789 Back-references is a regex construct |
|
790 that programmers found quite useful. |
|
791 According to Becchi and Crawley\cite{Becchi08}, |
|
792 6\% of Snort rules (up until 2008) include the use of them. |
|
793 The most common use of back-references |
|
794 would be expressing well-formed html files, |
|
795 where back-references would be handy in expressing |
|
796 a pair of opening and closing tags like |
|
797 \begin{center} |
|
798 $\langle html \rangle \ldots \langle / html \rangle$ |
|
799 \end{center} |
|
800 A regex describing such a format |
|
801 could be |
|
802 \begin{center} |
|
803 $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ |
|
804 \end{center} |
|
805 Despite being useful, the syntax and expressive power of regexes |
|
806 go beyond the regular language hierarchy |
|
807 with back-references. |
|
808 In fact, they allow the regex construct to express |
778 languages that cannot be contained in context-free |
809 languages that cannot be contained in context-free |
779 languages either. |
810 languages either. |
780 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$ |
811 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$ |
781 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
812 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
782 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. |
813 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. |
783 Such a language is contained in the context-sensitive hierarchy |
814 Such a language is contained in the context-sensitive hierarchy |
784 of formal languages. |
815 of formal languages. |
785 Solving the back-reference expressions matching problem |
816 Solving the back-reference expressions matching problem |
786 is NP-complete\parencite{alfred2014algorithms} and a non-bactracking, |
817 is NP-complete\parencite{alfred2014algorithms}. |
|
818 A non-bactracking, |
787 efficient solution is not known to exist. |
819 efficient solution is not known to exist. |
788 %TODO:read a bit more about back reference algorithms |
820 Regex libraries supporting back-references such as PCRE therefore have to |
789 |
821 revert to a depth-first search algorithm that backtracks. |
790 It seems that languages like Java and Python made the trade-off |
822 The unreasonable part with them, is that even in the case of |
791 to support back-references at the expense of having to backtrack, |
823 regexes not involving back-references, there is still |
792 even in the case of regexes not involving back-references.\\ |
824 a (non-negligible) chance they might backtrack super-linearly. |
|
825 |
|
826 \subsection{Summary of the Catastrophic Backtracking Problem} |
793 Summing these up, we can categorise existing |
827 Summing these up, we can categorise existing |
794 practical regex libraries into the ones with linear |
828 practical regex libraries into two kinds: |
795 time guarantees like Go and Rust, which impose restrictions |
829 (i)The ones with linear |
|
830 time guarantees like Go and Rust. The cost with them is that |
|
831 they impose restrictions |
796 on the user input (not allowing back-references, |
832 on the user input (not allowing back-references, |
797 bounded repetitions cannot exceed 1000 etc.), and ones |
833 bounded repetitions cannot exceed a counter limit etc.). |
798 that allows the programmer much freedom, but grinds to a halt |
834 (ii) Those |
799 in some non-negligible portion of cases. |
835 that allow large bounded regular expressions and back-references |
|
836 at the expense of using a backtracking algorithm. |
|
837 They could grind to a halt |
|
838 on some very simple cases, posing a vulnerability of |
|
839 a ReDoS attack. |
|
840 |
|
841 |
|
842 We would like to have regex engines that can |
|
843 deal with the regular part (e.g. |
|
844 bounded repetitions) of regexes more |
|
845 efficiently. |
|
846 Also we want to make sure that they do it correctly. |
|
847 It turns out that such aim is not so easy to achieve. |
800 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions |
848 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions |
801 % For example, the Rust regex engine claims to be linear, |
849 % For example, the Rust regex engine claims to be linear, |
802 % but does not support lookarounds and back-references. |
850 % but does not support lookarounds and back-references. |
803 % The GoLang regex library does not support over 1000 repetitions. |
851 % The GoLang regex library does not support over 1000 repetitions. |
804 % Java and Python both support back-references, but shows |
852 % Java and Python both support back-references, but shows |
817 %, whereas $\NFA$s usually run into trouble |
865 %, whereas $\NFA$s usually run into trouble |
818 %on the second phase. |
866 %on the second phase. |
819 |
867 |
820 |
868 |
821 \section{Error-prone POSIX Implementations} |
869 \section{Error-prone POSIX Implementations} |
822 The problems with practical implementations |
870 When there are multiple ways of matching a string |
823 of reare not limited to slowness on certain |
871 with a regular expression, a matcher needs to |
824 cases. |
872 disambiguate. |
825 Another thing about these libraries is that there |
873 The standard for which particular match to pick |
826 is no correctness guarantee. |
874 is called the disambiguation strategy. |
827 In some cases, they either fail to generate a lexing result when there exists a match, |
875 The more intuitive strategy is called POSIX, |
|
876 which always chooses the longest initial match. |
|
877 An alternative strategy would be greedy matches, |
|
878 which always ends a sub-match as early as possible. |
|
879 The POSIX standard is widely adopted in many operating systems. |
|
880 However, many implementations (including the C libraries |
|
881 used by Linux and OS X distributions) contain bugs |
|
882 or do not meet the specification they claim to adhere to. |
|
883 In some cases, they either fail to generate a lexing |
|
884 result when there exists a match, |
828 or give results that are inconsistent with the $\POSIX$ standard. |
885 or give results that are inconsistent with the $\POSIX$ standard. |
829 A concrete example would be the regex |
886 A concrete example would be the regex given by \cite{fowler2003} |
830 \begin{center} |
887 \begin{center} |
831 $(aba + ab + a)* \text{and the string} ababa$ |
888 $(aba + ab + a)^* \text{and the string} ababa$ |
832 \end{center} |
889 \end{center} |
833 The correct $\POSIX$ match for the above would be |
890 The correct $\POSIX$ match for the above would be |
834 with the entire string $ababa$, |
891 with the entire string $ababa$, |
835 split into two Kleene star iterations, $[ab] [aba]$ at positions |
892 split into two Kleene star iterations, $[ab] [aba]$ at positions |
836 $[0, 2), [2, 5)$ |
893 $[0, 2), [2, 5)$ |
837 respectively. |
894 respectively. |
838 But trying this out in regex101\parencite{regex101} |
895 But trying this out in regex101\parencite{regex101} |
839 with different language engines would yield |
896 with different language engines would yield |
840 the same two fragmented matches: $[aba]$ at $[0, 3)$ |
897 the same two fragmented matches: $[aba]$ at $[0, 3)$ |
841 and $a$ at $[4, 5)$. |
898 and $a$ at $[4, 5)$. |
842 |
899 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
843 Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not |
900 commented that most regex libraries are not |
844 correctly implementing the POSIX (maximum-munch) |
901 correctly implementing the POSIX (maximum-munch) |
845 rule of regular expression matching. |
902 rule of regular expression matching. |
846 |
|
847 As Grathwohl\parencite{grathwohl2014crash} wrote, |
903 As Grathwohl\parencite{grathwohl2014crash} wrote, |
848 \begin{quote} |
904 \begin{quote} |
849 The POSIX strategy is more complicated than the |
905 ``The POSIX strategy is more complicated than the |
850 greedy because of the dependence on information about |
906 greedy because of the dependence on information about |
851 the length of matched strings in the various subexpressions. |
907 the length of matched strings in the various subexpressions.'' |
852 \end{quote} |
908 \end{quote} |
853 %\noindent |
909 %\noindent |
854 To summarise the above, regular expressions are important. |
910 The implementation complexity of POSIX rules also come from |
855 They are popular and programming languages' library functions |
911 the specification being not very clear. |
856 for them are very fast on non-catastrophic cases. |
912 There are many informal summaries of this disambiguation |
857 But there are problems with current practical implementations. |
913 strategy, which are often quite long and delicate. |
858 First thing is that the running time might blow up. |
914 For example Kuklewicz described the POSIX rule as |
859 The second problem is that they might be error-prone on certain |
915 \begin{quote} |
860 very simple cases. |
916 `` |
861 In the next part of the chapter, we will look into reasons why |
917 \begin{itemize} |
862 certain regex engines are running horribly slow on the "catastrophic" |
918 \item |
863 cases and propose a solution that addresses both of these problems |
919 regular expressions (REs) take the leftmost starting match, and the longest match starting there |
864 based on Brzozowski and Sulzmann and Lu's work. |
920 earlier subpatterns have leftmost-longest priority over later subpatterns\\ |
865 |
921 \item |
866 |
922 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\ |
867 \subsection{Different Phases of a Matching/Lexing Algorithm} |
923 \item |
868 |
924 REs have right associative concatenation which can be changed with parenthesis\\ |
869 |
925 \item |
870 Most lexing algorithms can be roughly divided into |
926 parenthesized subexpressions return the match from their last usage\\ |
871 two phases during its run. |
927 \item |
872 The first phase is the "construction" phase, |
928 text of component subexpressions must be contained in the text of the |
873 in which the algorithm builds some |
929 higher-level subexpressions\\ |
874 suitable data structure from the input regex $r$, so that |
930 \item |
875 it can be easily operated on later. |
931 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
876 We denote |
932 \item |
877 the time cost for such a phase by $P_1(r)$. |
933 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\'' |
878 The second phase is the lexing phase, when the input string |
934 \end{itemize} |
879 $s$ is read and the data structure |
935 \end{quote} |
880 representing that regex $r$ is being operated on. |
936 The text above |
881 We represent the time |
937 is trying to capture something very precise, |
882 it takes by $P_2(r, s)$.\\ |
938 and is crying out for formalising. |
883 For $\mathit{DFA}$, |
939 |
884 we have $P_2(r, s) = O( |s| )$, |
940 |
885 because we take at most $|s|$ steps, |
941 %\subsection{Different Phases of a Matching/Lexing Algorithm} |
886 and each step takes |
942 % |
887 at most one transition-- |
943 % |
888 a deterministic-finite-automata |
944 %Most lexing algorithms can be roughly divided into |
889 by definition has at most one state active and at most one |
945 %two phases during its run. |
890 transition upon receiving an input symbol. |
946 %The first phase is the "construction" phase, |
891 But unfortunately in the worst case |
947 %in which the algorithm builds some |
892 $P_1(r) = O(exp^{|r|})$. An example will be given later. |
948 %suitable data structure from the input regex $r$, so that |
893 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold |
949 %it can be easily operated on later. |
894 expressions like $r^n$ into |
950 %We denote |
895 \[ |
951 %the time cost for such a phase by $P_1(r)$. |
896 \underbrace{r \cdots r}_{\text{n copies of r}}. |
952 %The second phase is the lexing phase, when the input string |
897 \] |
953 %$s$ is read and the data structure |
898 The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack. |
954 %representing that regex $r$ is being operated on. |
899 On the other hand, if backtracking is used, the worst-case time bound bloats |
955 %We represent the time |
900 to $|r| * 2^{|s|}$. |
956 %it takes by $P_2(r, s)$.\\ |
901 %on the input |
957 %For $\mathit{DFA}$, |
902 %And when calculating the time complexity of the matching algorithm, |
958 %we have $P_2(r, s) = O( |s| )$, |
903 %we are assuming that each input reading step requires constant time. |
959 %because we take at most $|s|$ steps, |
904 %which translates to that the number of |
960 %and each step takes |
905 %states active and transitions taken each time is bounded by a |
961 %at most one transition-- |
906 %constant $C$. |
962 %a deterministic-finite-automata |
907 %But modern regex libraries in popular language engines |
963 %by definition has at most one state active and at most one |
908 % often want to support much richer constructs than just |
964 %transition upon receiving an input symbol. |
909 % sequences and Kleene stars, |
965 %But unfortunately in the worst case |
910 %such as negation, intersection, |
966 %$P_1(r) = O(exp^{|r|})$. An example will be given later. |
911 %bounded repetitions and back-references. |
967 %For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold |
912 %And de-sugaring these "extended" regular expressions |
968 %expressions like $r^n$ into |
913 %into basic ones might bloat the size exponentially. |
969 %\[ |
914 %TODO: more reference for exponential size blowup on desugaring. |
970 % \underbrace{r \cdots r}_{\text{n copies of r}}. |
915 |
971 %\] |
916 \subsection{Why $\mathit{DFA}s$ can be slow in the first phase} |
972 %The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack. |
917 |
973 %On the other hand, if backtracking is used, the worst-case time bound bloats |
918 |
974 %to $|r| * 2^{|s|}$. |
919 The good things about $\mathit{DFA}$s is that once |
975 %%on the input |
920 generated, they are fast and stable, unlike |
976 %%And when calculating the time complexity of the matching algorithm, |
921 backtracking algorithms. |
977 %%we are assuming that each input reading step requires constant time. |
922 However, they do not scale well with bounded repetitions. |
978 %%which translates to that the number of |
923 |
979 %%states active and transitions taken each time is bounded by a |
924 \subsubsection{Problems with Bounded Repetitions} |
980 %%constant $C$. |
925 |
981 %%But modern regex libraries in popular language engines |
926 |
982 %% often want to support much richer constructs than just |
927 |
983 %% sequences and Kleene stars, |
928 |
984 %%such as negation, intersection, |
929 |
985 %%bounded repetitions and back-references. |
930 |
986 %%And de-sugaring these "extended" regular expressions |
931 |
987 %%into basic ones might bloat the size exponentially. |
932 |
988 %%TODO: more reference for exponential size blowup on desugaring. |
933 |
989 % |
934 |
990 %\subsection{Why $\mathit{DFA}s$ can be slow in the first phase} |
935 |
991 % |
936 |
992 % |
937 |
993 %The good things about $\mathit{DFA}$s is that once |
938 |
994 %generated, they are fast and stable, unlike |
939 So we have practical implementations |
995 %backtracking algorithms. |
940 on regular expression matching/lexing which are fast |
996 %However, they do not scale well with bounded repetitions. |
941 but do not come with any guarantees that it will not grind to a halt |
997 % |
942 or give wrong answers. |
998 %\subsubsection{Problems with Bounded Repetitions} |
943 Our goal is to have a regex lexing algorithm that comes with |
999 % |
|
1000 % |
|
1001 |
|
1002 To summarise, we need regex libraries that are both fast |
|
1003 and correct. |
|
1004 And that correctness needs to be built on a precise |
|
1005 model of what POSIX disambiguation is. |
|
1006 We propose a solution that addresses both problems |
|
1007 based on Brzozowski, Sulzmann and Lu and Ausaf and Urban's work. |
|
1008 The end result is a regular expression lexing algorithm that comes with |
944 \begin{itemize} |
1009 \begin{itemize} |
945 \item |
1010 \item |
946 proven correctness |
1011 a proven correctness theorem according to POSIX specification |
|
1012 given by Ausaf and Urban \cite{AusafDyckhoffUrban2016}, |
947 \item |
1013 \item |
948 proven non-catastrophic properties |
1014 a proven property saying that the algorithm's internal data structure will |
|
1015 remain finite, |
949 \item |
1016 \item |
950 easy extensions to |
1017 and extension to |
951 constructs like |
1018 the bounded repetitions construct with the correctness and finiteness property |
952 bounded repetitions, negation, lookarounds, and even back-references. |
1019 maintained. |
953 \end{itemize} |
1020 \end{itemize} |
954 |
1021 |
955 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives} |
1022 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives} |
956 We propose Brzozowski derivatives on regular expressions as |
1023 We propose Brzozowski derivatives on regular expressions as |
957 a solution to this. |
1024 a solution to this. |