848 What is unexpected is that even in the cases |
848 What is unexpected is that even in the cases |
849 not involving back-references, there is still |
849 not involving back-references, there is still |
850 a (non-negligible) chance they might backtrack super-linearly, |
850 a (non-negligible) chance they might backtrack super-linearly, |
851 as shown in the graphs in figure\ref{fig:aStarStarb}. |
851 as shown in the graphs in figure\ref{fig:aStarStarb}. |
852 |
852 |
853 \subsection{Summary of the Catastrophic Backtracking Problem} |
|
854 Summing these up, we can categorise existing |
853 Summing these up, we can categorise existing |
855 practical regex libraries into two kinds: |
854 practical regex libraries into two kinds: |
856 (i)The ones with linear |
855 (i) The ones with linear |
857 time guarantees like Go and Rust. The cost with them is that |
856 time guarantees like Go and Rust. The downside with them is that |
858 they impose restrictions |
857 they impose restrictions |
859 on the user input (not allowing back-references, |
858 on the regular expressions (not allowing back-references, |
860 bounded repetitions cannot exceed a counter limit etc.). |
859 bounded repetitions cannot exceed an ad hoc limit etc.). |
861 (ii) Those |
860 (ii) Those |
862 that allow large bounded regular expressions and back-references |
861 that allow large bounded regular expressions and back-references |
863 at the expense of using a backtracking algorithm. |
862 at the expense of using backtracking algorithms. |
864 They could grind to a halt |
863 They can potentially ``grind to a halt'' |
865 on some very simple cases, posing a vulnerability of |
864 on some very simple cases, resulting |
866 a ReDoS attack. |
865 ReDoS attacks. |
867 |
866 |
868 |
867 The proble with both approaches is the motivation for us |
869 We would like to have regex engines that can |
868 to look again at the regular expression matching problem. |
870 deal with the regular part (e.g. |
869 Another motivation is that regular expression matching algorithms |
871 bounded repetitions) of regexes more |
870 that follow the POSIX standard often contain errors and bugs |
872 efficiently. |
871 as we shall explain next. |
873 Also we want to make sure that they do it correctly. |
872 |
874 It turns out that such aim is not so easy to achieve. |
873 %We would like to have regex engines that can |
|
874 %deal with the regular part (e.g. |
|
875 %bounded repetitions) of regexes more |
|
876 %efficiently. |
|
877 %Also we want to make sure that they do it correctly. |
|
878 %It turns out that such aim is not so easy to achieve. |
875 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions |
879 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions |
876 % For example, the Rust regex engine claims to be linear, |
880 % For example, the Rust regex engine claims to be linear, |
877 % but does not support lookarounds and back-references. |
881 % but does not support lookarounds and back-references. |
878 % The GoLang regex library does not support over 1000 repetitions. |
882 % The GoLang regex library does not support over 1000 repetitions. |
879 % Java and Python both support back-references, but shows |
883 % Java and Python both support back-references, but shows |
892 %, whereas $\NFA$s usually run into trouble |
896 %, whereas $\NFA$s usually run into trouble |
893 %on the second phase. |
897 %on the second phase. |
894 |
898 |
895 |
899 |
896 \section{Error-prone POSIX Implementations} |
900 \section{Error-prone POSIX Implementations} |
897 When there are multiple ways of matching a string |
901 Very often there are multiple ways of matching a string |
898 with a regular expression, a matcher needs to |
902 with a regular expression. |
|
903 In such cases the regular expressions matcher needs to |
899 disambiguate. |
904 disambiguate. |
900 The standard for which particular match to pick |
905 The more widely used strategy is called POSIX, |
901 is called the disambiguation strategy. |
906 which roughly speaking always chooses the longest initial match. |
902 The more intuitive strategy is called POSIX, |
907 The POSIX strategy is widely adopted in many regular expression matchers. |
903 which always chooses the longest initial match. |
|
904 An alternative strategy would be greedy matches, |
|
905 which always ends a sub-match as early as possible. |
|
906 The POSIX standard is widely adopted in many operating systems. |
|
907 However, many implementations (including the C libraries |
908 However, many implementations (including the C libraries |
908 used by Linux and OS X distributions) contain bugs |
909 used by Linux and OS X distributions) contain bugs |
909 or do not meet the specification they claim to adhere to. |
910 or do not meet the specification they claim to adhere to. |
910 In some cases, they either fail to generate a lexing |
911 Kuklewicz maintains a unit test repository which lists some |
|
912 problems with existing regular expression engines \ref{KuklewiczHaskell}. |
|
913 In some cases, they either fail to generate a |
911 result when there exists a match, |
914 result when there exists a match, |
912 or give results that are inconsistent with the $\POSIX$ standard. |
915 or give results that are inconsistent with the POSIX standard. |
913 A concrete example would be the regex given by \cite{fowler2003} |
916 A concrete example is the regex: |
914 \begin{center} |
917 \begin{center} |
915 $(aba + ab + a)^* \text{and the string} ababa$ |
918 $(aba + ab + a)^* \text{and the string} ababa$ |
916 \end{center} |
919 \end{center} |
917 The correct $\POSIX$ match for the above would be |
920 The correct POSIX match for the above |
918 with the entire string $ababa$, |
921 is the entire string $ababa$, |
919 split into two Kleene star iterations, $[ab] [aba]$ at positions |
922 split into two Kleene star iterations, namely $[ab], [aba]$ at positions |
920 $[0, 2), [2, 5)$ |
923 $[0, 2), [2, 5)$ |
921 respectively. |
924 respectively. |
922 But trying this out in regex101\parencite{regex101} |
925 But trying this out in regex101 \parencite{regex101} \footnote{ |
923 with different language engines would yield |
926 regex101 is an online regular expression matcher which |
924 the same two fragmented matches: $[aba]$ at $[0, 3)$ |
927 provides API for trying out regular expression |
|
928 engines of multiple popular programming languages like |
|
929 Java, Python, Go, etc.} |
|
930 with different engines yields |
|
931 always matches: $[aba]$ at $[0, 3)$ |
925 and $a$ at $[4, 5)$. |
932 and $a$ at $[4, 5)$. |
926 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
933 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
927 commented that most regex libraries are not |
934 commented that most regex libraries are not |
928 correctly implementing the POSIX (maximum-munch) |
935 correctly implementing the central POSIX |
929 rule of regular expression matching. |
936 rule, called the maximum munch rule. |
930 As Grathwohl\parencite{grathwohl2014crash} wrote, |
937 Grathwohl \parencite{grathwohl2014crash} wrote, |
931 \begin{quote} |
938 \begin{quote} |
932 ``The POSIX strategy is more complicated than the |
939 ``The POSIX strategy is more complicated than the |
933 greedy because of the dependence on information about |
940 greedy because of the dependence on information about |
934 the length of matched strings in the various subexpressions.'' |
941 the length of matched strings in the various subexpressions.'' |
935 \end{quote} |
942 \end{quote} |
936 %\noindent |
943 %\noindent |
937 The implementation complexity of POSIX rules also come from |
944 We think the implementation complexity of POSIX rules also come from |
938 the specification being not very clear. |
945 the specification being not very precise. |
939 There are many informal summaries of this disambiguation |
946 There are many informal summaries of this disambiguation |
940 strategy, which are often quite long and delicate. |
947 strategy, which are often quite long and delicate. |
941 For example Kuklewicz \cite{KuklewiczHaskell} |
948 For example Kuklewicz \cite{KuklewiczHaskell} |
942 described the POSIX rule as |
949 described the POSIX rule as (section 1, last paragraph): |
943 \begin{quote} |
950 \begin{quote} |
944 `` |
|
945 \begin{itemize} |
951 \begin{itemize} |
946 \item |
952 \item |
947 regular expressions (REs) take the leftmost starting match, and the longest match starting there |
953 regular expressions (REs) take the leftmost starting match, and the longest match starting there |
948 earlier subpatterns have leftmost-longest priority over later subpatterns\\ |
954 earlier subpatterns have leftmost-longest priority over later subpatterns\\ |
949 \item |
955 \item |
956 text of component subexpressions must be contained in the text of the |
962 text of component subexpressions must be contained in the text of the |
957 higher-level subexpressions\\ |
963 higher-level subexpressions\\ |
958 \item |
964 \item |
959 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
965 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\ |
960 \item |
966 \item |
961 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\'' |
967 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\ |
962 \end{itemize} |
968 \end{itemize} |
963 \end{quote} |
969 \end{quote} |
964 The text above |
970 %The text above |
965 is trying to capture something very precise, |
971 %is trying to capture something very precise, |
966 and is crying out for formalising. |
972 %and is crying out for formalising. |
967 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
973 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
968 are the first to fill the gap |
974 are the first to |
969 by not just describing such a formalised POSIX |
975 give a quite simple formalised POSIX |
970 specification in Isabelle/HOL, but also proving |
976 specification in Isabelle/HOL, and also prove |
971 that their specification coincides with the |
977 that their specification coincides with the |
972 POSIX specification given by Okui and Suzuki \cite{Okui10} |
978 POSIX specification given by Okui and Suzuki \cite{Okui10}. |
973 which is a completely |
|
974 different characterisation. |
|
975 They then formally proved the correctness of |
979 They then formally proved the correctness of |
976 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
980 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
977 based on that specification. |
981 with regards to that specification. |
978 |
982 They also found that the informal POSIX |
979 In the next section we will very briefly |
983 specification by Sulzmann and Lu does not work for the correctness proof. |
|
984 |
|
985 In the next section we will briefly |
980 introduce Brzozowski derivatives and Sulzmann |
986 introduce Brzozowski derivatives and Sulzmann |
981 and Lu's algorithm, which this thesis builds on. |
987 and Lu's algorithm, which the main point of this thesis builds on. |
982 We give a taste of what they |
988 %We give a taste of what they |
983 are like and why they are suitable for regular expression |
989 %are like and why they are suitable for regular expression |
984 matching and lexing. |
990 %matching and lexing. |
985 |
991 \section{Formal Specification of POSIX Matching |
986 \section{Our Solution--Formal Specification of POSIX Matching |
|
987 and Brzozowski Derivatives} |
992 and Brzozowski Derivatives} |
988 Now we start with the central topic of the thesis: Brzozowski derivatives. |
993 %Now we start with the central topic of the thesis: Brzozowski derivatives. |
989 Brzozowski \cite{Brzozowski1964} first introduced the |
994 Brzozowski \cite{Brzozowski1964} first introduced the |
990 concept of the \emph{derivative} in the 1960s. |
995 concept of a \emph{derivative} of regular expression in 1964. |
991 The derivative of a regular expression $r$ |
996 The derivative of a regular expression $r$ |
992 with respect to a character $c$, is written as $r \backslash c$.\footnote{ |
997 with respect to a character $c$, is written as $r \backslash c$. |
993 Despite having the same name, regular expression |
998 This operation tells us what $r$ transforms into |
994 derivatives bear little similarity with the mathematical definition |
999 if we ``chop'' off the first character $c$ |
995 of derivatives on functions. |
1000 from all strings in the language of $r$ (defined |
996 } |
1001 later as $L \; r$). |
997 It tells us what $r$ would transform into |
1002 %To give a flavour of Brzozowski derivatives, we present |
998 if we chop off the first character $c$ |
1003 %two straightforward clauses from it: |
999 from all strings in the language of $r$ ($L \; r$). |
1004 %\begin{center} |
1000 To give a flavour of Brzozowski derivatives, we present |
1005 % \begin{tabular}{lcl} |
1001 two straightforward clauses from it: |
1006 % $d \backslash c$ & $\dn$ & |
1002 \begin{center} |
1007 % $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
1003 \begin{tabular}{lcl} |
1008 %$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
1004 $d \backslash c$ & $\dn$ & |
1009 % \end{tabular} |
1005 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
1010 %\end{center} |
1006 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
1011 %\noindent |
1007 \end{tabular} |
1012 %The first clause says that for the regular expression |
1008 \end{center} |
1013 %denoting a singleton set consisting of a sinlge-character string $\{ d \}$, |
1009 \noindent |
1014 %we check the derivative character $c$ against $d$, |
1010 The first clause says that for the regular expression |
1015 %returning a set containing only the empty string $\{ [] \}$ |
1011 denoting a singleton set consisting of a sinlge-character string $\{ d \}$, |
1016 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise. |
1012 we check the derivative character $c$ against $d$, |
1017 %The second clause states that to obtain the regular expression |
1013 returning a set containing only the empty string $\{ [] \}$ |
1018 %representing all strings' head character $c$ being chopped off |
1014 if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise. |
1019 %from $r_1 + r_2$, one simply needs to recursively take derivative |
1015 The second clause states that to obtain the regular expression |
1020 %of $r_1$ and $r_2$ and then put them together. |
1016 representing all strings' head character $c$ being chopped off |
1021 Derivatives have the property |
1017 from $r_1 + r_2$, one simply needs to recursively take derivative |
|
1018 of $r_1$ and $r_2$ and then put them together. |
|
1019 |
|
1020 Thanks to the definition, derivatives have the nice property |
|
1021 that $s \in L \; (r\backslash c)$ if and only if |
1022 that $s \in L \; (r\backslash c)$ if and only if |
1022 $c::s \in L \; r$. |
1023 $c::s \in L \; r$ where $::$ stands for list prepending. |
1023 %This property can be used on regular expressions |
1024 %This property can be used on regular expressions |
1024 %matching and lexing--to test whether a string $s$ is in $L \; r$, |
1025 %matching and lexing--to test whether a string $s$ is in $L \; r$, |
1025 %one simply takes derivatives of $r$ successively with |
1026 %one simply takes derivatives of $r$ successively with |
1026 %respect to the characters (in the correct order) in $s$, |
1027 %respect to the characters (in the correct order) in $s$, |
1027 %and then test whether the empty string is in the last regular expression. |
1028 %and then test whether the empty string is in the last regular expression. |
1028 Derivatives give a simple solution |
1029 With this derivatives give a simple solution |
1029 to the problem of matching and lexing a string $s$ with a regular |
1030 to the problem of matching a string $s$ with a regular |
1030 expression $r$: if the derivative of $r$ w.r.t.\ (in |
1031 expression $r$: if the derivative of $r$ w.r.t.\ (in |
1031 succession) all the characters of the string matches the empty string, |
1032 succession) all the characters of the string matches the empty string, |
1032 then $r$ matches $s$ (and {\em vice versa}). |
1033 then $r$ matches $s$ (and {\em vice versa}). |
1033 |
1034 %This makes formally reasoning about these properties such |
1034 This makes formally reasoning about these properties such |
1035 %as correctness and complexity smooth and intuitive. |
1035 as correctness and complexity smooth and intuitive. |
1036 There had been several mechanised proofs about this property in various theorem |
1036 In fact, there has already been several mechanised proofs about them, |
1037 provers, |
1037 for example the one by Owens and Slind \cite{Owens2008} in HOL4, |
1038 for example one by Owens and Slind \cite{Owens2008} in HOL4, |
1038 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and |
1039 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and |
1039 yet another in Coq by Coquand and Siles \cite{Coquand2012}. |
1040 yet another in Coq by Coquand and Siles \cite{Coquand2012}. |
1040 |
1041 |
1041 In addition, one can extend the clauses to bounded repetitions |
1042 In addition, one can extend derivatives to bounded repetitions |
1042 ``for free'': |
1043 relatively straightforwardly. For example, the derivative for |
|
1044 this can be defined as: |
1043 \begin{center} |
1045 \begin{center} |
1044 \begin{tabular}{lcl} |
1046 \begin{tabular}{lcl} |
1045 $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot |
1047 $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot |
1046 r^{\{n-1\}}$\\ |
1048 r^{\{n-1\}}$\\ |
1047 \end{tabular} |
1049 \end{tabular} |
1048 \end{center} |
1050 \end{center} |
1049 \noindent |
1051 \noindent |
1050 And experimental results suggest that unlike DFA-based solutions, |
1052 Experimental results suggest that unlike DFA-based solutions |
1051 this derivatives can support |
1053 for bounded regular expressions, |
1052 bounded regular expressions with large counters |
1054 derivatives can cope |
|
1055 large counters |
1053 quite well. |
1056 quite well. |
1054 |
1057 |
1055 There has also been |
1058 There has also been |
1056 extensions to other constructs. |
1059 extensions to other constructs. |
1057 For example, Owens et al include the derivatives |
1060 For example, Owens et al include the derivatives |
1058 for \emph{NOT} regular expressions, which is |
1061 for the \emph{NOT} regular expression, which is |
1059 able to concisely express C-style comments of the form |
1062 able to concisely express C-style comments of the form |
1060 $/* \ldots */$. |
1063 $/* \ldots */$ (see \cite{Owens2008}). |
1061 Another extension for derivatives would be |
1064 Another extension for derivatives is |
1062 regular expressions with look-aheads, done by |
1065 regular expressions with look-aheads, done |
1063 by Miyazaki and Minamide |
1066 by Miyazaki and Minamide |
1064 \cite{Takayuki2019}. |
1067 \cite{Takayuki2019}. |
1065 %We therefore use Brzozowski derivatives on regular expressions |
1068 %We therefore use Brzozowski derivatives on regular expressions |
1066 %lexing |
1069 %lexing |
1067 |
1070 |
1098 is actually $\Omega(2^n)$ in the worst case. |
1101 is actually $\Omega(2^n)$ in the worst case. |
1099 A similar claim about a theoretical runtime of $O(n^2)$ |
1102 A similar claim about a theoretical runtime of $O(n^2)$ |
1100 is made for the Verbatim \cite{Verbatim} |
1103 is made for the Verbatim \cite{Verbatim} |
1101 %TODO: give references |
1104 %TODO: give references |
1102 lexer, which calculates POSIX matches and is based on derivatives. |
1105 lexer, which calculates POSIX matches and is based on derivatives. |
1103 They formalized the correctness of the lexer, but not the complexity. |
1106 They formalized the correctness of the lexer, but not their complexity result. |
1104 In the performance evaluation section, they simply analyzed the run time |
1107 In the performance evaluation section, they analyzed the run time |
1105 of matching $a$ with the string |
1108 of matching $a$ with the string |
1106 \begin{center} |
1109 \begin{center} |
1107 $\underbrace{a \ldots a}_{\text{n a's}}$ |
1110 $\underbrace{a \ldots a}_{\text{n a's}}$. |
1108 \end{center} |
1111 \end{center} |
1109 and concluded that the algorithm is quadratic in terms of input length. |
1112 \noindent |
|
1113 They concluded that the algorithm is quadratic in terms of |
|
1114 the length of the input string. |
1110 When we tried out their extracted OCaml code with our example $(a+aa)^*$, |
1115 When we tried out their extracted OCaml code with our example $(a+aa)^*$, |
1111 the time it took to lex only 40 $a$'s was 5 minutes. |
1116 the time it took to match a string of 40 $a$'s was approximately 5 minutes. |
1112 |
1117 |
1113 |
1118 |
1114 \subsection{Sulzmann and Lu's Algorithm} |
1119 \subsection{Sulzmann and Lu's Algorithm} |
1115 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first |
1120 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first |
1116 difficulty by cleverly extending Brzozowski's matching |
1121 problem with the yes/no answer |
|
1122 by cleverly extending Brzozowski's matching |
1117 algorithm. Their extended version generates additional information on |
1123 algorithm. Their extended version generates additional information on |
1118 \emph{how} a regular expression matches a string following the POSIX |
1124 \emph{how} a regular expression matches a string following the POSIX |
1119 rules for regular expression matching. They achieve this by adding a |
1125 rules for regular expression matching. They achieve this by adding a |
1120 second ``phase'' to Brzozowski's algorithm involving an injection |
1126 second ``phase'' to Brzozowski's algorithm involving an injection |
1121 function simplification of internal data structures |
1127 function. |
1122 eliminating the exponential behaviours. |
1128 In earlier work, Ausaf et al provided the formal |
1123 In an earlier work, Ausaf et al provided the formal |
|
1124 specification of what POSIX matching means and proved in Isabelle/HOL |
1129 specification of what POSIX matching means and proved in Isabelle/HOL |
1125 the correctness |
1130 the correctness |
1126 of Sulzmann and Lu's extended algorithm accordingly |
1131 of this extended algorithm accordingly |
1127 \cite{AusafDyckhoffUrban2016}. |
1132 \cite{AusafDyckhoffUrban2016}. |
1128 |
1133 |
1129 The version of the algorithm proven correct |
1134 The version of the algorithm proven correct |
1130 suffers from the |
1135 suffers heavily from a |
1131 second difficulty though, where the internal derivatives can |
1136 second difficulty, where the internal derivatives can |
1132 grow to arbitrarily big sizes. |
1137 grow to arbitrarily big sizes. |
1133 For example if we start with the |
1138 For example if we start with the |
1134 regular expression $(a+aa)^*$ and take |
1139 regular expression $(a+aa)^*$ and take |
1135 successive derivatives according to the character $a$, we end up with |
1140 successive derivatives according to the character $a$, we end up with |
1136 a sequence of ever-growing derivatives like |
1141 a sequence of ever-growing derivatives like |
1145 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
1150 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
1146 \end{tabular} |
1151 \end{tabular} |
1147 \end{center} |
1152 \end{center} |
1148 |
1153 |
1149 \noindent where after around 35 steps we run out of memory on a |
1154 \noindent where after around 35 steps we run out of memory on a |
1150 typical computer (we shall define in the next chapter |
1155 typical computer. Clearly, the |
1151 the precise details of our |
|
1152 regular expressions and the derivative operation). Clearly, the |
|
1153 notation involving $\ZERO$s and $\ONE$s already suggests |
1156 notation involving $\ZERO$s and $\ONE$s already suggests |
1154 simplification rules that can be applied to regular regular |
1157 simplification rules that can be applied to regular regular |
1155 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
1158 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
1156 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
1159 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
1157 r$. While such simple-minded simplifications have been proved in our |
1160 r$. While such simple-minded simplifications have been proved in |
1158 earlier work to preserve the correctness of Sulzmann and Lu's |
1161 the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's |
1159 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
1162 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do |
1160 \emph{not} help with limiting the growth of the derivatives shown |
1163 \emph{not} help with limiting the growth of the derivatives shown |
1161 above: the growth is slowed, but the derivatives can still grow rather |
1164 above: the growth is slowed, but the derivatives can still grow rather |
1162 quickly beyond any finite bound. |
1165 quickly beyond any finite bound. |
1163 |
1166 |
1164 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm |
1167 Therefore we want to look in this thesis at a second |
1165 \cite{Sulzmann2014} where they introduce bit-coded |
1168 algorithm by Sulzmann and Lu where they |
1166 regular expressions. In this version, POSIX values are |
1169 overcame this ``growth problem'' \cite{Sulzmann2014}. |
|
1170 In this version, POSIX values are |
1167 represented as bit sequences and such sequences are incrementally generated |
1171 represented as bit sequences and such sequences are incrementally generated |
1168 when derivatives are calculated. The compact representation |
1172 when derivatives are calculated. The compact representation |
1169 of bit sequences and regular expressions allows them to define a more |
1173 of bit sequences and regular expressions allows them to define a more |
1170 ``aggressive'' simplification method that keeps the size of the |
1174 ``aggressive'' simplification method that keeps the size of the |
1171 derivatives finite no matter what the length of the string is. |
1175 derivatives finite no matter what the length of the string is. |
1237 remain finite, |
1237 remain finite, |
1238 \item |
1238 \item |
1239 and extension to |
1239 and extension to |
1240 the bounded repetitions construct with the correctness and finiteness property |
1240 the bounded repetitions construct with the correctness and finiteness property |
1241 maintained. |
1241 maintained. |
1242 \end{itemize} |
1242 \end{itemize} |
1243 |
1243 \noindent |
1244 |
|
1245 With a formal finiteness bound in place, |
1244 With a formal finiteness bound in place, |
1246 we can greatly reduce the attack surface of servers in terms of ReDoS attacks. |
1245 we can greatly reduce the attack surface of servers in terms of ReDoS attacks. |
1247 Further improvements to the algorithm with an even stronger version of |
1246 Further improvements to the algorithm with an even stronger version of |
1248 simplification is made. |
1247 simplification can be made. |
1249 Thanks to our theorem-prover-friendly approach, |
1248 |
1250 we believe that |
1249 |
1251 this finiteness bound can be improved to a bound |
1250 |
1252 linear to input and |
1251 |
1253 cubic to the regular expression size using a technique by |
1252 |
1254 Antimirov\cite{Antimirov95}. |
1253 \section{Structure of the thesis} |
1255 Once formalised, this would be a guarantee for the absence of all super-linear behavious. |
1254 In chapter \ref{Inj} we will introduce the concepts |
1256 We are working out the |
1255 and notations we |
1257 details. |
1256 use for describing regular expressions and derivatives, |
1258 |
1257 and the first version of their lexing algorithm without bitcodes (including |
|
1258 its correctness proof). |
|
1259 We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1} |
|
1260 together with the correctness proof by Ausaf and Urban. |
|
1261 Then we illustrate in chapter \ref{Bitcoded2} |
|
1262 how Sulzmann and Lu's |
|
1263 simplifications fail to simplify. We therefore introduce our version of the |
|
1264 algorithm with simplification and |
|
1265 its correctness proof . |
|
1266 In chapter \ref{Finite} we give the second guarantee |
|
1267 of our bitcoded algorithm, that is a finite bound on the size of any |
|
1268 regular expression's derivatives. |
|
1269 In chapter \ref{Cubic} we discuss stronger simplification rules which |
|
1270 improve the finite bound to a polynomial bound, and also show how one can extend the |
|
1271 algorithm to include bounded repetitions. %and the NOT regular expression. |
1259 |
1272 |
1260 To our best knowledge, no lexing libraries using Brzozowski derivatives |
|
1261 have similar complexity-related bounds, |
|
1262 and claims about running time are usually speculative and backed by empirical |
|
1263 evidence on a few test cases. |
|
1264 If a matching or lexing algorithm |
|
1265 does not come with certain basic complexity related |
|
1266 guarantees (for examaple the internal data structure size |
|
1267 does not grow indefinitely), |
|
1268 then they cannot claim with confidence having solved the problem |
|
1269 of catastrophic backtracking. |
|
1270 |
|
1271 |
|
1272 |
|
1273 |
|
1274 |
|
1275 \section{Structure of the thesis} |
|
1276 In chapter 2 \ref{Inj} we will introduce the concepts |
|
1277 and notations we |
|
1278 use for describing the lexing algorithm by Sulzmann and Lu, |
|
1279 and then give the lexing algorithm. |
|
1280 We will give its variant in \ref{Bitcoded1}. |
|
1281 Then we illustrate in \ref{Bitcoded2} |
|
1282 how the algorithm without bitcodes falls short for such aggressive |
|
1283 simplifications and therefore introduce our version of the |
|
1284 bit-coded algorithm and |
|
1285 its correctness proof . |
|
1286 In \ref{Finite} we give the second guarantee |
|
1287 of our bitcoded algorithm, that is a finite bound on the size of any |
|
1288 regex's derivatives. |
|
1289 In \ref{Cubic} we discuss stronger simplifications to improve the finite bound |
|
1290 in \ref{Finite} to a polynomial one, and demonstrate how one can extend the |
|
1291 algorithm to include constructs such as bounded repetitions and negations. |
|
1292 |
|
1293 |
1273 |
1294 |
1274 |
1295 |
1275 |
1296 |
1276 |
1297 %---------------------------------------------------------------------------------------- |
1277 %---------------------------------------------------------------------------------------- |