215 detection systems that inspect suspicious traffic; or compiler |
215 detection systems that inspect suspicious traffic; or compiler |
216 front ends. |
216 front ends. |
217 Given their usefulness and ubiquity, one would assume that |
217 Given their usefulness and ubiquity, one would assume that |
218 modern regular expression matching implementations |
218 modern regular expression matching implementations |
219 are mature and fully studied. |
219 are mature and fully studied. |
220 Indeed, in a popular programming language's regular expression engine, |
220 Indeed, in many popular programming languages' regular expression engines, |
221 supplying it with regular expressions and strings, |
221 one can supply a regular expression and a string, and in most cases get |
222 in most cases one can |
|
223 get the matching information in a very short time. |
222 get the matching information in a very short time. |
224 Those engines can be blindingly fast--some |
223 Those engines can sometimes be blindingly fast--some |
225 network intrusion detection systems |
224 network intrusion detection systems |
226 use regular expression engines that are able to process |
225 use regular expression engines that are able to process |
227 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
226 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
228 However, those engines can sometimes exhibit a surprising security vulnerability |
227 However, those engines can sometimes also exhibit a surprising security vulnerability |
229 under a certain class of inputs. |
228 under a certain class of inputs. |
230 %However, , this is not the case for $\mathbf{all}$ inputs. |
229 %However, , this is not the case for $\mathbf{all}$ inputs. |
231 %TODO: get source for SNORT/BRO's regex matching engine/speed |
230 %TODO: get source for SNORT/BRO's regex matching engine/speed |
232 |
231 |
233 |
232 |
234 Consider for example $(a^*)^*\,b$ and |
233 Consider for example the regular expression $(a^*)^*\,b$ and |
235 strings of the form $aa..a$, these strings cannot be matched by this regular |
234 strings of the form $aa..a$. These strings cannot be matched by this regular |
236 expression: Obviously the expected $b$ in the last |
235 expression: Obviously the expected $b$ in the last |
237 position is missing. One would assume that modern regular expression |
236 position is missing. One would assume that modern regular expression |
238 matching engines can find this out very quickly. Surprisingly, if one tries |
237 matching engines can find this out very quickly. Surprisingly, if one tries |
239 this example in JavaScript, Python or Java 8, even with small strings, |
238 this example in JavaScript, Python or Java 8, even with small strings, |
240 say of lenght of around 30 $a$'s, |
239 say of lenght of around 30 $a$'s, |
241 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}). |
240 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}). |
242 This is clearly exponential behaviour, and as can be seen |
241 The algorithms clearly show exponential behaviour, and as can be seen |
243 is triggered by some relatively simple regular expressions. |
242 is triggered by some relatively simple regular expressions. |
244 Java 9 and newer |
243 Java 9 and newer |
245 versions improve this behaviour somewhat, but are still slow compared |
244 versions improve this behaviour somewhat, but are still slow compared |
246 with the approach we are going to use in this thesis. |
245 with the approach we are going to use in this thesis. |
247 |
246 |
248 |
247 |
249 |
248 |
250 This superlinear blowup in regular expression engines |
249 This superlinear blowup in regular expression engines |
251 has caused grief in ``real life'' in the past where it is |
250 has caused grief in ``real life'' where it is |
252 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
251 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
253 For example, on 20 July 2016 one evil |
252 For example, on 20 July 2016 one evil |
254 regular expression brought the webpage |
253 regular expression brought the webpage |
255 \href{http://stackexchange.com}{Stack Exchange} to its |
254 \href{http://stackexchange.com}{Stack Exchange} to its |
256 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
255 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
421 attacks. |
420 attacks. |
422 Davis et al. \cite{Davis18} detected more |
421 Davis et al. \cite{Davis18} detected more |
423 than 1000 evil regular expressions |
422 than 1000 evil regular expressions |
424 in Node.js, Python core libraries, npm and pypi. |
423 in Node.js, Python core libraries, npm and pypi. |
425 They therefore concluded that evil regular expressions |
424 They therefore concluded that evil regular expressions |
426 are real problems rather than just "a parlour trick". |
425 are a real problem rather than just "a parlour trick". |
427 |
426 |
428 This work aims to address this issue |
427 The work in this thesis aims to address this issue |
429 with the help of formal proofs. |
428 with the help of formal proofs. |
430 We describe a lexing algorithm based |
429 We describe a lexing algorithm based |
431 on Brzozowski derivatives with verified correctness |
430 on Brzozowski derivatives with verified correctness |
432 and a finiteness property for the size of derivatives |
431 and a finiteness property for the size of derivatives |
433 (which are all done in Isabelle/HOL). |
432 (which are all done in Isabelle/HOL). |
625 have bounded regular expressions in them. |
624 have bounded regular expressions in them. |
626 Often the counters are quite large, with the largest being |
625 Often the counters are quite large, with the largest being |
627 close to ten million. |
626 close to ten million. |
628 A smaller sample XSD they gave |
627 A smaller sample XSD they gave |
629 is: |
628 is: |
630 \begin{verbatim} |
629 \lstset{ |
|
630 basicstyle=\fontsize{8.5}{9}\ttfamily, |
|
631 language=XML, |
|
632 morekeywords={encoding, |
|
633 xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute} |
|
634 } |
|
635 \begin{lstlisting} |
631 <sequence minOccurs="0" maxOccurs="65535"> |
636 <sequence minOccurs="0" maxOccurs="65535"> |
632 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
637 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
633 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
638 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
634 </sequence> |
639 </sequence> |
635 \end{verbatim} |
640 \end{lstlisting} |
636 This can be seen as the regex |
641 This can be seen as the regex |
637 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
642 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
638 regular expressions |
643 regular expressions |
639 satisfying certain constraints (such as |
644 satisfying certain constraints (such as |
640 satisfying the floating point number format). |
645 satisfying the floating point number format). |
785 \end{center} |
790 \end{center} |
786 The sub-expressions |
791 The sub-expressions |
787 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
792 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled |
788 by 1 to 4, and can be ``referred back'' by their respective numbers. |
793 by 1 to 4, and can be ``referred back'' by their respective numbers. |
789 %These sub-expressions are called "capturing groups". |
794 %These sub-expressions are called "capturing groups". |
790 To do so, we use the syntax $\backslash i$ |
795 To do so, one uses the syntax $\backslash i$ |
791 to denote that we want the sub-string |
796 to denote that we want the sub-string |
792 of the input matched by the i-th |
797 of the input matched by the i-th |
793 sub-expression to appear again, |
798 sub-expression to appear again, |
794 exactly the same as it first appeared: |
799 exactly the same as it first appeared: |
795 \begin{center} |
800 \begin{center} |
834 $(.)(.)\backslash 2\backslash 1$ |
839 $(.)(.)\backslash 2\backslash 1$ |
835 \end{center} |
840 \end{center} |
836 which matches four-character palindromes |
841 which matches four-character palindromes |
837 like $abba$, $x??x$ and so on. |
842 like $abba$, $x??x$ and so on. |
838 |
843 |
839 Back-references is a regex construct |
844 Back-references are a regex construct |
840 that programmers find quite useful. |
845 that programmers find quite useful. |
841 According to Becchi and Crawley \cite{Becchi08}, |
846 According to Becchi and Crawley \cite{Becchi08}, |
842 6\% of Snort rules (up until 2008) use them. |
847 6\% of Snort rules (up until 2008) use them. |
843 The most common use of back-references |
848 The most common use of back-references |
844 is to express well-formed html files, |
849 is to express well-formed html files, |
926 with a regular expression. |
931 with a regular expression. |
927 In such cases the regular expressions matcher needs to |
932 In such cases the regular expressions matcher needs to |
928 disambiguate. |
933 disambiguate. |
929 The more widely used strategy is called POSIX, |
934 The more widely used strategy is called POSIX, |
930 which roughly speaking always chooses the longest initial match. |
935 which roughly speaking always chooses the longest initial match. |
931 The POSIX strategy is widely adopted in many regular expression matchers. |
936 The POSIX strategy is widely adopted in many regular expression matchers |
|
937 because it is a natural strategy for lexers. |
932 However, many implementations (including the C libraries |
938 However, many implementations (including the C libraries |
933 used by Linux and OS X distributions) contain bugs |
939 used by Linux and OS X distributions) contain bugs |
934 or do not meet the specification they claim to adhere to. |
940 or do not meet the specification they claim to adhere to. |
935 Kuklewicz maintains a unit test repository which lists some |
941 Kuklewicz maintains a unit test repository which lists some |
936 problems with existing regular expression engines \cite{KuklewiczHaskell}. |
942 problems with existing regular expression engines \cite{KuklewiczHaskell}. |
945 involves the entire string $ababa$, |
951 involves the entire string $ababa$, |
946 split into two Kleene star iterations, namely $[ab], [aba]$ at positions |
952 split into two Kleene star iterations, namely $[ab], [aba]$ at positions |
947 $[0, 2), [2, 5)$ |
953 $[0, 2), [2, 5)$ |
948 respectively. |
954 respectively. |
949 But feeding this example to the different engines |
955 But feeding this example to the different engines |
950 in regex101 \parencite{regex101} \footnote{ |
956 listed at regex101 \footnote{ |
951 regex101 is an online regular expression matcher which |
957 regex101 is an online regular expression matcher which |
952 provides API for trying out regular expression |
958 provides API for trying out regular expression |
953 engines of multiple popular programming languages like |
959 engines of multiple popular programming languages like |
954 Java, Python, Go, etc.} |
960 Java, Python, Go, etc.} \parencite{regex101}. |
955 yields |
961 yields |
956 only two incomplete matches: $[aba]$ at $[0, 3)$ |
962 only two incomplete matches: $[aba]$ at $[0, 3)$ |
957 and $a$ at $[4, 5)$. |
963 and $a$ at $[4, 5)$. |
958 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
964 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} |
959 commented that most regex libraries are not |
965 commented that most regex libraries are not |
960 correctly implementing the central POSIX |
966 correctly implementing the central POSIX |
961 rule, called the maximum munch rule. |
967 rule, called the maximum munch rule. |
962 Grathwohl \parencite{grathwohl2014crash} wrote, |
968 Grathwohl \parencite{grathwohl2014crash} wrote: |
963 \begin{quote} |
969 \begin{quote}\it |
964 ``The POSIX strategy is more complicated than the |
970 ``The POSIX strategy is more complicated than the |
965 greedy because of the dependence on information about |
971 greedy because of the dependence on information about |
966 the length of matched strings in the various subexpressions.'' |
972 the length of matched strings in the various subexpressions.'' |
967 \end{quote} |
973 \end{quote} |
968 %\noindent |
974 %\noindent |
969 We think the implementation complexity of POSIX rules also come from |
975 We think the implementation complexity of POSIX rules also come from |
970 the specification being not very precise. |
976 the specification being not very precise. |
971 The developers of the regexp package of GO |
977 The developers of the regexp package of Go |
972 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}} |
978 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}} |
973 commented that |
979 commented that |
974 \begin{quote}\it |
980 \begin{quote}\it |
975 `` |
981 `` |
976 The POSIX rule is computationally prohibitive |
982 The POSIX rule is computationally prohibitive |
1006 %and is crying out for formalising. |
1012 %and is crying out for formalising. |
1007 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
1013 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
1008 are the first to |
1014 are the first to |
1009 give a quite simple formalised POSIX |
1015 give a quite simple formalised POSIX |
1010 specification in Isabelle/HOL, and also prove |
1016 specification in Isabelle/HOL, and also prove |
1011 that their specification coincides with the |
1017 that their specification coincides with an earlier (unformalised) |
1012 POSIX specification given by Okui and Suzuki \cite{Okui10}. |
1018 POSIX specification given by Okui and Suzuki \cite{Okui10}. |
1013 They then formally proved the correctness of |
1019 They then formally proved the correctness of |
1014 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
1020 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
1015 with regards to that specification. |
1021 with regards to that specification. |
1016 They also found that the informal POSIX |
1022 They also found that the informal POSIX |
1017 specification by Sulzmann and Lu needs to be revised for the correctness proof. |
1023 specification by Sulzmann and Lu needs to be substantially |
|
1024 revised in order for the correctness proof to go through. |
|
1025 Their original specification and proof were unfixable |
|
1026 according to Ausaf et al. |
1018 |
1027 |
1019 In the next section, we will briefly |
1028 In the next section, we will briefly |
1020 introduce Brzozowski derivatives and Sulzmann |
1029 introduce Brzozowski derivatives and Sulzmann |
1021 and Lu's algorithm, which the main point of this thesis builds on. |
1030 and Lu's algorithm, which the main point of this thesis builds on. |
1022 %We give a taste of what they |
1031 %We give a taste of what they |
1028 Brzozowski \cite{Brzozowski1964} first introduced the |
1037 Brzozowski \cite{Brzozowski1964} first introduced the |
1029 concept of a \emph{derivative} of regular expression in 1964. |
1038 concept of a \emph{derivative} of regular expression in 1964. |
1030 The derivative of a regular expression $r$ |
1039 The derivative of a regular expression $r$ |
1031 with respect to a character $c$, is written as $r \backslash c$. |
1040 with respect to a character $c$, is written as $r \backslash c$. |
1032 This operation tells us what $r$ transforms into |
1041 This operation tells us what $r$ transforms into |
1033 if we ``chop'' off the first character $c$ |
1042 if we ``chop'' off a particular character $c$ |
1034 from all strings in the language of $r$ (defined |
1043 from all strings in the language of $r$ (defined |
1035 later as $L \; r$). |
1044 later as $L \; r$). |
1036 %To give a flavour of Brzozowski derivatives, we present |
1045 %To give a flavour of Brzozowski derivatives, we present |
1037 %two straightforward clauses from it: |
1046 %two straightforward clauses from it: |
1038 %\begin{center} |
1047 %\begin{center} |
1058 %This property can be used on regular expressions |
1067 %This property can be used on regular expressions |
1059 %matching and lexing--to test whether a string $s$ is in $L \; r$, |
1068 %matching and lexing--to test whether a string $s$ is in $L \; r$, |
1060 %one simply takes derivatives of $r$ successively with |
1069 %one simply takes derivatives of $r$ successively with |
1061 %respect to the characters (in the correct order) in $s$, |
1070 %respect to the characters (in the correct order) in $s$, |
1062 %and then test whether the empty string is in the last regular expression. |
1071 %and then test whether the empty string is in the last regular expression. |
1063 With this property, derivatives give a simple solution |
1072 With this property, derivatives can give a simple solution |
1064 to the problem of matching a string $s$ with a regular |
1073 to the problem of matching a string $s$ with a regular |
1065 expression $r$: if the derivative of $r$ w.r.t.\ (in |
1074 expression $r$: if the derivative of $r$ w.r.t.\ (in |
1066 succession) all the characters of the string matches the empty string, |
1075 succession) all the characters of the string matches the empty string, |
1067 then $r$ matches $s$ (and {\em vice versa}). |
1076 then $r$ matches $s$ (and {\em vice versa}). |
1068 %This makes formally reasoning about these properties such |
1077 %This makes formally reasoning about these properties such |
1069 %as correctness and complexity smooth and intuitive. |
1078 %as correctness and complexity smooth and intuitive. |
1070 There had been several mechanised proofs about this property in various theorem |
1079 There are several mechanised proofs of this property in various theorem |
1071 provers, |
1080 provers, |
1072 for example one by Owens and Slind \cite{Owens2008} in HOL4, |
1081 for example one by Owens and Slind \cite{Owens2008} in HOL4, |
1073 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and |
1082 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and |
1074 yet another in Coq by Coquand and Siles \cite{Coquand2012}. |
1083 yet another in Coq by Coquand and Siles \cite{Coquand2012}. |
1075 |
1084 |
1077 relatively straightforwardly. For example, the derivative for |
1086 relatively straightforwardly. For example, the derivative for |
1078 this can be defined as: |
1087 this can be defined as: |
1079 \begin{center} |
1088 \begin{center} |
1080 \begin{tabular}{lcl} |
1089 \begin{tabular}{lcl} |
1081 $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot |
1090 $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot |
1082 r^{\{n-1\}}$\\ |
1091 r^{\{n-1\}} (\text{when} n > 0)$\\ |
1083 \end{tabular} |
1092 \end{tabular} |
1084 \end{center} |
1093 \end{center} |
1085 \noindent |
1094 \noindent |
1086 Experimental results suggest that unlike DFA-based solutions |
1095 Experimental results suggest that unlike DFA-based solutions |
1087 for bounded regular expressions, |
1096 for bounded regular expressions, |
1088 derivatives can cope |
1097 derivatives can cope |
1089 large counters |
1098 large counters |
1090 quite well. |
1099 quite well. |
1091 |
1100 |
1092 There has also been |
1101 There have also been |
1093 extensions to other constructs. |
1102 extensions of derivatives to other regex constructs. |
1094 For example, Owens et al include the derivatives |
1103 For example, Owens et al include the derivatives |
1095 for the \emph{NOT} regular expression, which is |
1104 for the \emph{NOT} regular expression, which is |
1096 able to concisely express C-style comments of the form |
1105 able to concisely express C-style comments of the form |
1097 $/* \ldots */$ (see \cite{Owens2008}). |
1106 $/* \ldots */$ (see \cite{Owens2008}). |
1098 Another extension for derivatives is |
1107 Another extension for derivatives is |
1104 |
1113 |
1105 |
1114 |
1106 |
1115 |
1107 Given the above definitions and properties of |
1116 Given the above definitions and properties of |
1108 Brzozowski derivatives, one quickly realises their potential |
1117 Brzozowski derivatives, one quickly realises their potential |
1109 in generating a formally verified algorithm for lexing--the clauses and property |
1118 in generating a formally verified algorithm for lexing: the clauses and property |
1110 can be easily expressed in a functional programming language |
1119 can be easily expressed in a functional programming language |
1111 or converted to theorem prover |
1120 or converted to theorem prover |
1112 code, with great extensibility. |
1121 code, with great ease. |
1113 Perhaps this is the reason why it has sparked quite a bit of interest |
1122 Perhaps this is the reason why derivatives have sparked quite a bit of interest |
1114 in the functional programming and theorem prover communities in the last |
1123 in the functional programming and theorem prover communities in the last |
1115 fifteen or so years ( |
1124 fifteen or so years ( |
1116 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18}, |
1125 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18}, |
1117 \cite{Chen12} and \cite{Coquand2012} |
1126 \cite{Chen12} and \cite{Coquand2012} |
1118 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008} |
1127 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008} |
1123 First, Brzozowski's original matcher only generates a yes/no answer |
1132 First, Brzozowski's original matcher only generates a yes/no answer |
1124 for whether a regular expression matches a string or not. This is too |
1133 for whether a regular expression matches a string or not. This is too |
1125 little information in the context of lexing where separate tokens must |
1134 little information in the context of lexing where separate tokens must |
1126 be identified and also classified (for example as keywords |
1135 be identified and also classified (for example as keywords |
1127 or identifiers). |
1136 or identifiers). |
1128 Second, derivative-based matchers need to be more efficient. |
1137 Second, derivative-based matchers need to be more efficient in terms |
|
1138 of the sizes of derivatives. |
1129 Elegant and beautiful |
1139 Elegant and beautiful |
1130 as many implementations are, |
1140 as many implementations are, |
1131 they can be excruciatingly slow. |
1141 they can be excruciatingly slow. |
1132 For example, Sulzmann and Lu |
1142 For example, Sulzmann and Lu |
1133 claim a linear running time of their proposed algorithm, |
1143 claim a linear running time of their proposed algorithm, |
1156 by cleverly extending Brzozowski's matching |
1166 by cleverly extending Brzozowski's matching |
1157 algorithm. Their extended version generates additional information on |
1167 algorithm. Their extended version generates additional information on |
1158 \emph{how} a regular expression matches a string following the POSIX |
1168 \emph{how} a regular expression matches a string following the POSIX |
1159 rules for regular expression matching. They achieve this by adding a |
1169 rules for regular expression matching. They achieve this by adding a |
1160 second ``phase'' to Brzozowski's algorithm involving an injection |
1170 second ``phase'' to Brzozowski's algorithm involving an injection |
1161 function. |
1171 function. This injection function in a sense undoes the ``damage'' |
|
1172 of the derivatives chopping off characters. |
1162 In earlier work, Ausaf et al provided the formal |
1173 In earlier work, Ausaf et al provided the formal |
1163 specification of what POSIX matching means and proved in Isabelle/HOL |
1174 specification of what POSIX matching means and proved in Isabelle/HOL |
1164 the correctness |
1175 the correctness |
1165 of this extended algorithm accordingly |
1176 of this extended algorithm accordingly |
1166 \cite{AusafDyckhoffUrban2016}. |
1177 \cite{AusafDyckhoffUrban2016}. |
1167 |
1178 |
1168 The version of the algorithm proven correct |
1179 The version of the algorithm proven correct |
1169 suffers heavily from a |
1180 suffers however heavily from a |
1170 second difficulty, where the internal derivatives can |
1181 second difficulty, where derivatives can |
1171 grow to arbitrarily big sizes. |
1182 grow to arbitrarily big sizes. |
1172 For example if we start with the |
1183 For example if we start with the |
1173 regular expression $(a+aa)^*$ and take |
1184 regular expression $(a+aa)^*$ and take |
1174 successive derivatives according to the character $a$, we end up with |
1185 successive derivatives according to the character $a$, we end up with |
1175 a sequence of ever-growing derivatives like |
1186 a sequence of ever-growing derivatives like |
1183 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
1194 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
1184 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
1195 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
1185 \end{tabular} |
1196 \end{tabular} |
1186 \end{center} |
1197 \end{center} |
1187 |
1198 |
1188 \noindent where after around 35 steps we run out of memory on a |
1199 \noindent where after around 35 steps we usually run out of memory on a |
1189 typical computer. Clearly, the |
1200 typical computer. Clearly, the |
1190 notation involving $\ZERO$s and $\ONE$s already suggests |
1201 notation involving $\ZERO$s and $\ONE$s already suggests |
1191 simplification rules that can be applied to regular regular |
1202 simplification rules that can be applied to regular regular |
1192 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
1203 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r |
1193 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
1204 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow |
1240 This is relevant work for us and we will compare it later with |
1251 This is relevant work for us and we will compare it later with |
1241 our derivative-based matcher we are going to present. |
1252 our derivative-based matcher we are going to present. |
1242 There is also some newer work called |
1253 There is also some newer work called |
1243 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, |
1254 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, |
1244 but deterministic finite automaton instead. |
1255 but deterministic finite automaton instead. |
|
1256 We will also study this work in a later section. |
1245 %An example that gives problem to automaton approaches would be |
1257 %An example that gives problem to automaton approaches would be |
1246 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$. |
1258 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$. |
1247 %It requires at least $2^{n+1}$ states to represent |
1259 %It requires at least $2^{n+1}$ states to represent |
1248 %as a DFA. |
1260 %as a DFA. |
1249 |
1261 |
1254 In this thesis, |
1266 In this thesis, |
1255 we propose a solution to catastrophic |
1267 we propose a solution to catastrophic |
1256 backtracking and error-prone matchers: a formally verified |
1268 backtracking and error-prone matchers: a formally verified |
1257 regular expression lexing algorithm |
1269 regular expression lexing algorithm |
1258 that is both fast |
1270 that is both fast |
1259 and correct by extending Ausaf et al.'s work. |
1271 and correct. |
1260 The end result is %a regular expression lexing algorithm that comes with |
1272 The result is %a regular expression lexing algorithm that comes with |
1261 \begin{itemize} |
1273 \begin{itemize} |
1262 \item |
1274 \item |
1263 an improved version of Sulzmann and Lu's bit-coded algorithm using |
1275 an improved version of Sulzmann and Lu's bit-coded algorithm using |
1264 derivatives with simplifications, |
1276 derivatives with simplifications, |
1265 accompanied by |
1277 accompanied by |
1266 a proven correctness theorem according to POSIX specification |
1278 a proven correctness theorem according to POSIX specification |
1267 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, |
1279 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, |
1268 \item |
1280 \item |
1269 a complexity-related property for that algorithm saying that the |
1281 a complexity-related property for that algorithm saying that the |
1270 internal data structure will |
1282 internal data structure will |
1271 remain finite, |
1283 remain below a finite bound, |
1272 \item |
1284 \item |
1273 and extension to |
1285 and an extension to |
1274 the bounded repetitions construct with the correctness and finiteness property |
1286 the bounded repetition constructs with the correctness and finiteness property |
1275 maintained. |
1287 maintained. |
1276 \end{itemize} |
1288 \end{itemize} |
1277 \noindent |
1289 \noindent |
1278 With a formal finiteness bound in place, |
1290 With a formal finiteness bound in place, |
1279 we can greatly reduce the attack surface of servers in terms of ReDoS attacks. |
1291 we can greatly reduce the attack surface of servers in terms of ReDoS attacks. |
1283 \url{https://github.com/hellotommmy/posix} |
1295 \url{https://github.com/hellotommmy/posix} |
1284 \end{center} |
1296 \end{center} |
1285 Further improvements to the algorithm with an even stronger version of |
1297 Further improvements to the algorithm with an even stronger version of |
1286 simplification can be made. We conjecture that the resulting size of derivatives |
1298 simplification can be made. We conjecture that the resulting size of derivatives |
1287 can be bounded by a cubic bound w.r.t. the size of the regular expression. |
1299 can be bounded by a cubic bound w.r.t. the size of the regular expression. |
1288 |
1300 We will give relevant code in Scala, |
1289 |
1301 but do not give a formal proof for that in Isabelle/HOL. |
1290 |
1302 This is still future work. |
1291 |
|
1292 |
1303 |
1293 |
1304 |
1294 \section{Structure of the thesis} |
1305 \section{Structure of the thesis} |
1295 In chapter \ref{Inj} we will introduce the concepts |
1306 In chapter \ref{Inj} we will introduce the concepts |
1296 and notations we |
1307 and notations we |
1297 use for describing regular expressions and derivatives, |
1308 use for describing regular expressions and derivatives, |
1298 and the first version of their lexing algorithm without bitcodes (including |
1309 and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including |
1299 its correctness proof). |
1310 its correctness proof). |
1300 We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1} |
1311 We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1} |
1301 together with the correctness proof by Ausaf and Urban. |
1312 together with the correctness proof by Ausaf and Urban. |
1302 Then we illustrate in chapter \ref{Bitcoded2} |
1313 Then we illustrate in chapter \ref{Bitcoded2} |
1303 how Sulzmann and Lu's |
1314 how Sulzmann and Lu's |
1304 simplifications fail to simplify. We therefore introduce our version of the |
1315 simplifications fail to simplify correctly. We therefore introduce our own version of the |
1305 algorithm with simplification and |
1316 algorithm with correct simplifications and |
1306 its correctness proof . |
1317 their correctness proof. |
1307 In chapter \ref{Finite} we give the second guarantee |
1318 In chapter \ref{Finite} we give the second guarantee |
1308 of our bitcoded algorithm, that is a finite bound on the size of any |
1319 of our bitcoded algorithm, that is a finite bound on the size of any |
1309 regular expression's derivatives. |
1320 regular expression's derivatives. |
1310 We also show how one can extend the |
1321 We also show how one can extend the |
1311 algorithm to include bounded repetitions. |
1322 algorithm to include bounded repetitions. |
1312 In chapter \ref{Cubic} we discuss stronger simplification rules which |
1323 In chapter \ref{Cubic} we discuss stronger simplification rules which |
1313 improve the finite bound to a cubic bound.%and the NOT regular expression. |
1324 improve the finite bound to a cubic bound. %and the NOT regular expression. |
1314 Chapter \ref{RelatedWork} introduces relevant work for this thesis. |
1325 Chapter \ref{RelatedWork} introduces relevant work for this thesis. |
1315 Chapter \ref{Future} concludes with avenues of future research. |
1326 Chapter \ref{Future} concludes and mentions avenues of future research. |
1316 |
1327 |
1317 |
1328 |
1318 |
1329 |
1319 |
1330 |
1320 |
1331 |