942 \end{quote} |
943 \end{quote} |
943 The text above |
944 The text above |
944 is trying to capture something very precise, |
945 is trying to capture something very precise, |
945 and is crying out for formalising. |
946 and is crying out for formalising. |
946 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
947 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
947 are the first to describe such a formalised POSIX |
948 are the first to fill the gap |
948 specification in Isabelle/HOL. |
949 by not just describing such a formalised POSIX |
|
950 specification in Isabelle/HOL, but also proving |
|
951 that their specification coincides with the |
|
952 POSIX specification given by Okui and Suzuki \cite{Okui10} |
|
953 which is a completely |
|
954 different characterisation. |
949 They then formally proved the correctness of |
955 They then formally proved the correctness of |
950 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
956 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
951 based on that specification. |
957 based on that specification. |
952 |
958 |
953 In this thesis, |
|
954 we propose a solution to catastrophic |
|
955 backtracking and error-prone matchers--a formally verified |
|
956 regular expression lexing algorithm |
|
957 that is both fast |
|
958 and correct by extending Ausaf et al.'s work. |
|
959 The end result is a regular expression lexing algorithm that comes with |
|
960 \begin{itemize} |
|
961 \item |
|
962 a proven correctness theorem according to POSIX specification |
|
963 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, |
|
964 \item |
|
965 a proven complexity-related property saying that the |
|
966 algorithm's internal data structure will |
|
967 remain finite, |
|
968 \item |
|
969 and extension to |
|
970 the bounded repetitions construct with the correctness and finiteness property |
|
971 maintained. |
|
972 \end{itemize} |
|
973 In the next section we will very briefly |
959 In the next section we will very briefly |
974 introduce Brzozowski derivatives. |
960 introduce Brzozowski derivatives and Sulzmann |
975 We will give a taste to what they |
961 and Lu's algorithm, which this thesis builds on. |
|
962 We give a taste of what they |
976 are like and why they are suitable for regular expression |
963 are like and why they are suitable for regular expression |
977 matching and lexing. |
964 matching and lexing. |
978 |
965 |
979 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives} |
966 \section{Our Solution--Formal Specification of POSIX Matching |
980 We propose Brzozowski derivatives on regular expressions as |
967 and Brzozowski Derivatives} |
981 a solution to this. |
968 Now we start with the central topic of the thesis: Brzozowski derivatives. |
982 In the last fifteen or so years, Brzozowski's derivatives of regular |
969 Brzozowski \cite{Brzozowski1964} first introduced the |
983 expressions have sparked quite a bit of interest in the functional |
970 concept of the \emph{derivative} in the 1960s. |
984 programming and theorem prover communities. |
971 The derivative of a regular expression $r$ |
985 |
972 with respect to a character $c$, is written as $r \backslash c$.\footnote{ |
986 \subsection{Motivation} |
973 Despite having the same name, regular expression |
987 |
974 derivatives bear little similarity with the mathematical definition |
|
975 of derivatives on functions. |
|
976 } |
|
977 It tells us what $r$ would transform into |
|
978 if we chop off the first character $c$ |
|
979 from all strings in the language of $r$ ($L \; r$). |
|
980 To give a flavour of Brzozowski derivatives, we present |
|
981 two straightforward clauses from it: |
|
982 \begin{center} |
|
983 \begin{tabular}{lcl} |
|
984 $d \backslash c$ & $\dn$ & |
|
985 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
986 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
987 \end{tabular} |
|
988 \end{center} |
|
989 \noindent |
|
990 The first clause says that for the regular expression |
|
991 denoting a singleton set consisting of a sinlge-character string $\{ d \}$, |
|
992 we check the derivative character $c$ against $d$, |
|
993 returning a set containing only the empty string $\{ [] \}$ |
|
994 if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise. |
|
995 The second clause states that to obtain the regular expression |
|
996 representing all strings' head character $c$ being chopped off |
|
997 from $r_1 + r_2$, one simply needs to recursively take derivative |
|
998 of $r_1$ and $r_2$ and then put them together. |
|
999 |
|
1000 Thanks to the definition, derivatives have the nice property |
|
1001 that $s \in L \; (r\backslash c)$ if and only if |
|
1002 $c::s \in L \; r$. |
|
1003 %This property can be used on regular expressions |
|
1004 %matching and lexing--to test whether a string $s$ is in $L \; r$, |
|
1005 %one simply takes derivatives of $r$ successively with |
|
1006 %respect to the characters (in the correct order) in $s$, |
|
1007 %and then test whether the empty string is in the last regular expression. |
988 Derivatives give a simple solution |
1008 Derivatives give a simple solution |
989 to the problem of matching a string $s$ with a regular |
1009 to the problem of matching and lexing a string $s$ with a regular |
990 expression $r$: if the derivative of $r$ w.r.t.\ (in |
1010 expression $r$: if the derivative of $r$ w.r.t.\ (in |
991 succession) all the characters of the string matches the empty string, |
1011 succession) all the characters of the string matches the empty string, |
992 then $r$ matches $s$ (and {\em vice versa}). |
1012 then $r$ matches $s$ (and {\em vice versa}). |
993 |
1013 |
994 The beauty of |
1014 This makes formally reasoning about these properties such |
995 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly |
1015 as correctness and complexity smooth and intuitive. |
996 expressible in any functional language, and easily definable and |
1016 In fact, there has already been several mechanised proofs about them, |
997 reasoned about in theorem provers---the definitions just consist of |
1017 for example the one by Owens and Slind \cite{Owens2008} in HOL4, |
998 inductive datatypes and simple recursive functions. |
1018 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and |
999 And an algorithms based on it by |
1019 yet another in Coq by Coquand and Siles \cite{Coquand2012}. |
1000 Suzmann and Lu \parencite{Sulzmann2014} allows easy extension |
1020 |
1001 to include extended regular expressions and |
1021 In addition, one can extend the clauses to bounded repetitions |
1002 simplification of internal data structures |
1022 ``for free'': |
1003 eliminating the exponential behaviours. |
1023 \begin{center} |
1004 |
1024 \begin{tabular}{lcl} |
1005 However, two difficulties with derivative-based matchers exist: |
1025 $r^{\{n\}} \backslash c$ & $\dn$ & $r \backslash c \cdot |
1006 \subsubsection{Problems with Current Brzozowski Matchers} |
1026 r^{\{n-1\}}$\\ |
|
1027 \end{tabular} |
|
1028 \end{center} |
|
1029 \noindent |
|
1030 And experimental results suggest that unlike DFA-based solutions, |
|
1031 this derivatives can support |
|
1032 bounded regular expressions with large counters |
|
1033 quite well. |
|
1034 |
|
1035 There has also been |
|
1036 extensions to other constructs. |
|
1037 For example, Owens et al include the derivatives |
|
1038 for \emph{NOT} regular expressions, which is |
|
1039 able to concisely express C-style comments of the form |
|
1040 $/* \ldots */$. |
|
1041 Another extension for derivatives would be |
|
1042 regular expressions with look-aheads, done by |
|
1043 by Miyazaki and Minamide |
|
1044 \cite{Takayuki2019}. |
|
1045 %We therefore use Brzozowski derivatives on regular expressions |
|
1046 %lexing |
|
1047 |
|
1048 |
|
1049 |
|
1050 Given the above definitions and properties of |
|
1051 Brzozowski derivatives, one quickly realises their potential |
|
1052 in generating a formally verified algorithm for lexing--the clauses and property |
|
1053 can be easily expressed in a functional programming language |
|
1054 or converted to theorem prover |
|
1055 code, with great extensibility. |
|
1056 Perhaps this is the reason why it has sparked quite a bit of interest |
|
1057 in the functional programming and theorem prover communities in the last |
|
1058 fifteen or so years ( |
|
1059 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18}, |
|
1060 \cite{Chen12} and \cite{Coquand2012} |
|
1061 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008} |
|
1062 after they were first published. |
|
1063 |
|
1064 |
|
1065 However, there are two difficulties with derivative-based matchers: |
1007 First, Brzozowski's original matcher only generates a yes/no answer |
1066 First, Brzozowski's original matcher only generates a yes/no answer |
1008 for whether a regular expression matches a string or not. This is too |
1067 for whether a regular expression matches a string or not. This is too |
1009 little information in the context of lexing where separate tokens must |
1068 little information in the context of lexing where separate tokens must |
1010 be identified and also classified (for example as keywords |
1069 be identified and also classified (for example as keywords |
1011 or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this |
1070 or identifiers). |
|
1071 Second, derivative-based matchers need to be more efficient. |
|
1072 Elegant and beautiful |
|
1073 as many implementations are, |
|
1074 they can be excruciatingly slow. |
|
1075 For example, Sulzmann and Lu |
|
1076 claim a linear running time of their proposed algorithm, |
|
1077 but that was falsified by our experiments. The running time |
|
1078 is actually $\Omega(2^n)$ in the worst case. |
|
1079 A similar claim about a theoretical runtime of $O(n^2)$ |
|
1080 is made for the Verbatim \cite{Verbatim} |
|
1081 %TODO: give references |
|
1082 lexer, which calculates POSIX matches and is based on derivatives. |
|
1083 They formalized the correctness of the lexer, but not the complexity. |
|
1084 In the performance evaluation section, they simply analyzed the run time |
|
1085 of matching $a$ with the string |
|
1086 \begin{center} |
|
1087 $\underbrace{a \ldots a}_{\text{n a's}}$ |
|
1088 \end{center} |
|
1089 and concluded that the algorithm is quadratic in terms of input length. |
|
1090 When we tried out their extracted OCaml code with our example $(a+aa)^*$, |
|
1091 the time it took to lex only 40 $a$'s was 5 minutes. |
|
1092 |
|
1093 |
|
1094 \subsection{Sulzmann and Lu's Algorithm} |
|
1095 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first |
1012 difficulty by cleverly extending Brzozowski's matching |
1096 difficulty by cleverly extending Brzozowski's matching |
1013 algorithm. Their extended version generates additional information on |
1097 algorithm. Their extended version generates additional information on |
1014 \emph{how} a regular expression matches a string following the POSIX |
1098 \emph{how} a regular expression matches a string following the POSIX |
1015 rules for regular expression matching. They achieve this by adding a |
1099 rules for regular expression matching. They achieve this by adding a |
1016 second ``phase'' to Brzozowski's algorithm involving an injection |
1100 second ``phase'' to Brzozowski's algorithm involving an injection |
1017 function. In our own earlier work, we provided the formal |
1101 function simplification of internal data structures |
|
1102 eliminating the exponential behaviours. |
|
1103 In an earlier work, Ausaf et al provided the formal |
1018 specification of what POSIX matching means and proved in Isabelle/HOL |
1104 specification of what POSIX matching means and proved in Isabelle/HOL |
1019 the correctness |
1105 the correctness |
1020 of Sulzmann and Lu's extended algorithm accordingly |
1106 of Sulzmann and Lu's extended algorithm accordingly |
1021 \cite{AusafDyckhoffUrban2016}. |
1107 \cite{AusafDyckhoffUrban2016}. |
1022 |
1108 |
1023 The second difficulty is that Brzozowski's derivatives can |
1109 The version of the algorithm proven correct |
1024 grow to arbitrarily big sizes. For example if we start with the |
1110 suffers from the |
|
1111 second difficulty though, where the internal derivatives can |
|
1112 grow to arbitrarily big sizes. |
|
1113 For example if we start with the |
1025 regular expression $(a+aa)^*$ and take |
1114 regular expression $(a+aa)^*$ and take |
1026 successive derivatives according to the character $a$, we end up with |
1115 successive derivatives according to the character $a$, we end up with |
1027 a sequence of ever-growing derivatives like |
1116 a sequence of ever-growing derivatives like |
1028 |
1117 |
1029 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} |
1118 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} |
1073 method [..] in combination with the simplification steps [..] |
1162 method [..] in combination with the simplification steps [..] |
1074 yields POSIX parse trees. We have tested this claim |
1163 yields POSIX parse trees. We have tested this claim |
1075 extensively [..] but yet |
1164 extensively [..] but yet |
1076 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} |
1165 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} |
1077 \end{quote} |
1166 \end{quote} |
1078 |
|
1079 Ausaf and Urban were able to back this correctness claim with |
1167 Ausaf and Urban were able to back this correctness claim with |
1080 a formal proof. |
1168 a formal proof. |
1081 |
1169 |
1082 But as they stated, |
1170 However a faster formally verified |
|
1171 lexing program with the optimisations |
|
1172 mentioned by Sulzmann and Lu's second algorithm |
|
1173 is still missing. |
|
1174 As they stated, |
1083 \begin{quote}\it |
1175 \begin{quote}\it |
1084 The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here. |
1176 ``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.'' |
1085 \end{quote} |
1177 \end{quote} |
1086 |
|
1087 This thesis implements the aggressive simplifications envisioned |
1178 This thesis implements the aggressive simplifications envisioned |
1088 by Ausaf and Urban, |
1179 by Ausaf and Urban, |
1089 and gives a formal proof of the correctness with those simplifications. |
1180 together with a formal proof of the correctness with those simplifications. |
1090 |
1181 |
1091 |
1182 |
1092 %---------------------------------------------------------------------------------------- |
1183 %---------------------------------------------------------------------------------------- |
1093 \section{Contribution} |
1184 \section{Contribution} |
1094 |
1185 |
1095 |
1186 In this thesis, |
1096 |
1187 we propose a solution to catastrophic |
1097 This work addresses the vulnerability of super-linear and |
1188 backtracking and error-prone matchers: a formally verified |
1098 buggy regex implementations by the combination |
1189 regular expression lexing algorithm |
1099 of Brzozowski's derivatives and interactive theorem proving. |
1190 that is both fast |
1100 We give an |
1191 and correct by extending Ausaf et al.'s work. |
1101 improved version of Sulzmann and Lu's bit-coded algorithm using |
1192 The end result is %a regular expression lexing algorithm that comes with |
1102 derivatives, which come with a formal guarantee in terms of correctness and |
1193 \begin{itemize} |
1103 running time as an Isabelle/HOL proof. |
1194 \item |
|
1195 an improved version of Sulzmann and Lu's bit-coded algorithm using |
|
1196 derivatives with simplifications, |
|
1197 accompanied by |
|
1198 a proven correctness theorem according to POSIX specification |
|
1199 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, |
|
1200 \item |
|
1201 a complexity-related property for that algorithm saying that the |
|
1202 internal data structure will |
|
1203 remain finite, |
|
1204 \item |
|
1205 and extension to |
|
1206 the bounded repetitions construct with the correctness and finiteness property |
|
1207 maintained. |
|
1208 \end{itemize} |
|
1209 |
|
1210 |
|
1211 With a formal finiteness bound in place, |
|
1212 we can greatly reduce the attack surface of servers in terms of ReDoS attacks. |
1104 Further improvements to the algorithm with an even stronger version of |
1213 Further improvements to the algorithm with an even stronger version of |
1105 simplification is made. |
1214 simplification is made. |
1106 We have not yet come up with one, but believe that it leads to a |
1215 Thanks to our theorem-prover-friendly approach, |
1107 formalised proof with a time bound linear to input and |
1216 we believe that |
1108 cubic to regular expression size using a technique by |
1217 this finiteness bound can be improved to a bound |
1109 Antimirov\cite{Antimirov}. |
1218 linear to input and |
|
1219 cubic to the regular expression size using a technique by |
|
1220 Antimirov\cite{Antimirov95}. |
|
1221 Once formalised, this would be a guarantee for the absence of all super-linear behavious. |
|
1222 We are working out the |
|
1223 details. |
1110 |
1224 |
1111 |
1225 |
1112 The main contribution of this thesis is |
|
1113 \begin{itemize} |
|
1114 \item |
|
1115 a proven correct lexing algorithm |
|
1116 \item |
|
1117 with formalized finite bounds on internal data structures' sizes. |
|
1118 \end{itemize} |
|
1119 |
|
1120 To our best knowledge, no lexing libraries using Brzozowski derivatives |
1226 To our best knowledge, no lexing libraries using Brzozowski derivatives |
1121 have a provable time guarantee, |
1227 have similar complexity-related bounds, |
1122 and claims about running time are usually speculative and backed by thin empirical |
1228 and claims about running time are usually speculative and backed by empirical |
1123 evidence. |
1229 evidence on a few test cases. |
1124 %TODO: give references |
1230 If a matching or lexing algorithm |
1125 For example, Sulzmann and Lu had proposed an algorithm in which they |
1231 does not come with certain basic complexity related |
1126 claim a linear running time. |
1232 guarantees (for examaple the internal data structure size |
1127 But that was falsified by our experiments and the running time |
1233 does not grow indefinitely), |
1128 is actually $\Omega(2^n)$ in the worst case. |
1234 then they cannot claim with confidence having solved the problem |
1129 A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim |
1235 of catastrophic backtracking. |
1130 %TODO: give references |
1236 |
1131 lexer, which calculates POSIX matches and is based on derivatives. |
|
1132 They formalized the correctness of the lexer, but not the complexity. |
|
1133 In the performance evaluation section, they simply analyzed the run time |
|
1134 of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$ |
|
1135 and concluded that the algorithm is quadratic in terms of input length. |
|
1136 When we tried out their extracted OCaml code with our example $(a+aa)^*$, |
|
1137 the time it took to lex only 40 $a$'s was 5 minutes. |
|
1138 |
|
1139 |
|
1140 |
|
1141 \subsection{Related Work} |
|
1142 We are aware |
|
1143 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by |
|
1144 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part |
|
1145 of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one |
|
1146 in Coq is given by Coquand and Siles \parencite{Coquand2012}. |
|
1147 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. |
|
1148 |
|
1149 |
|
1150 When a regular expression does not behave as intended, |
|
1151 people usually try to rewrite the regex to some equivalent form |
|
1152 or they try to avoid the possibly problematic patterns completely, |
|
1153 for which many false positives exist\parencite{Davis18}. |
|
1154 Animated tools to "debug" regular expressions such as |
|
1155 \parencite{regexploit2021} \parencite{regex101} are also popular. |
|
1156 We are also aware of static analysis work on regular expressions that |
|
1157 aims to detect potentially expoential regex patterns. Rathnayake and Thielecke |
|
1158 \parencite{Rathnayake2014StaticAF} proposed an algorithm |
|
1159 that detects regular expressions triggering exponential |
|
1160 behavious on backtracking matchers. |
|
1161 Weideman \parencite{Weideman2017Static} came up with |
|
1162 non-linear polynomial worst-time estimates |
|
1163 for regexes, attack string that exploit the worst-time |
|
1164 scenario, and "attack automata" that generates |
|
1165 attack strings. |
|
1166 |
1237 |
1167 |
1238 |
1168 |
1239 |
1169 |
1240 |
1170 \section{Structure of the thesis} |
1241 \section{Structure of the thesis} |