ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 609 61139fdddae0
parent 608 37b6fd310a16
child 612 8c234a1bc7e0
   817 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   817 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   818 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   818 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   819 Such a language is contained in the context-sensitive hierarchy
   819 Such a language is contained in the context-sensitive hierarchy
   820 of formal languages. 
   820 of formal languages. 
   821 Solving the matching problem for back-reference expressions
   821 Solving the matching problem for back-reference expressions
   822 is NP-complete\parencite{alfred2014algorithms}.
   822 is known to be NP-complete \parencite{alfred2014algorithms}.
   823 A non-backtracking,
   823 A non-backtracking,
   824 efficient solution is not known to exist.
   824 efficient solution is not known to exist.
   825 Regex libraries supporting back-references such as 
   825 Regex libraries supporting back-references such as 
   826 PCRE \cite{pcre} therefore have to
   826 PCRE \cite{pcre} therefore have to
   827 revert to a depth-first search algorithm that backtracks.
   827 revert to a depth-first search algorithm which backtracks.
   828 The unreasonable part with them, is that even in the case of 
   828 What is unexpected is that even in the cases 
   829 regexes not involving back-references, there is still
   829 not involving back-references, there is still
   830 a (non-negligible) chance they might backtrack super-linearly.
   830 a (non-negligible) chance they might backtrack super-linearly,
       
   831 as shown in the graphs in Figure~\ref{fig:aStarStarb}.
   831 
   832 
   832 \subsection{Summary of the Catastrophic Backtracking Problem}
   833 \subsection{Summary of the Catastrophic Backtracking Problem}
   833 Summing these up, we can categorise existing 
   834 Summing these up, we can categorise existing 
   834 practical regex libraries into two kinds:
   835 practical regex libraries into two kinds:
   835 (i) the ones with linear
   836 (i) the ones with linear
   942 \end{quote}
   943 \end{quote}
   943 The text above 
   944 The text above 
   944 is trying to capture something very precise,
   945 is trying to capture something very precise,
   945 and is crying out for formalising.
   946 and is crying out for formalising.
   946 Ausaf et al. \cite{AusafDyckhoffUrban2016}
   947 Ausaf et al. \cite{AusafDyckhoffUrban2016}
   947 are the first to describe such a formalised POSIX
   948 are the first to fill the gap
   948 specification in Isabelle/HOL.
   949 by not just describing such a formalised POSIX
       
   950 specification in Isabelle/HOL, but also proving
       
   951 that their specification coincides with the 
       
   952 POSIX specification given by Okui and Suzuki \cite{Okui10}, 
       
   953 which is a completely
       
   954 different characterisation.
   949 They then formally proved the correctness of
   955 They then formally proved the correctness of
   950 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
   956 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
   951 based on that specification.
   957 based on that specification.
   952 
   958 
   953 In this thesis,
       
   954 we propose a solution to catastrophic
       
   955 backtracking and error-prone matchers--a formally verified
       
   956 regular expression lexing algorithm
       
   957 that is both fast
       
   958 and correct by extending Ausaf et al.'s work.
       
   959 The end result is a regular expression lexing algorithm that comes with 
       
   960 \begin{itemize}
       
   961 \item
       
   962 a proven correctness theorem according to POSIX specification 
       
   963 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, 
       
   964 \item 
       
   965 a proven complexity-related property saying that the 
       
   966 algorithm's internal data structure will
       
   967 remain finite,
       
   968 \item
       
   969 and extension to
       
   970 the bounded repetitions construct with the correctness and finiteness property
       
   971 maintained.
       
   972  \end{itemize}
       
   973 In the next section we will very briefly
   959 In the next section we will very briefly
   974 introduce Brzozowski derivatives.
   960 introduce Brzozowski derivatives and Sulzmann
   975 We will give a taste to what they 
   961 and Lu's algorithm, which this thesis builds on.
       
   962 We give a taste of what they 
   976 are like and why they are suitable for regular expression
   963 are like and why they are suitable for regular expression
   977 matching and lexing.
   964 matching and lexing.
   978  
   965  
   979 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
   966 \section{Our Solution--Formal Specification of POSIX Matching 
   980 We propose Brzozowski derivatives on regular expressions as
   967 and Brzozowski Derivatives}
   981   a solution to this.
   968 Now we start with the central topic of the thesis: Brzozowski derivatives.
   982 In the last fifteen or so years, Brzozowski's derivatives of regular
   969 Brzozowski \cite{Brzozowski1964} first introduced the 
   983 expressions have sparked quite a bit of interest in the functional
   970 concept of the \emph{derivative} in the 1960s.
   984 programming and theorem prover communities.   
   971 The derivative of a regular expression $r$
   985 
   972 with respect to a character $c$ is written as $r \backslash c$.\footnote{
   986 \subsection{Motivation}
   973 	Despite having the same name, regular expression
   987   
   974 	derivatives bear little similarity to the mathematical definition
       
   975 	of derivatives on functions.
       
   976 }
       
   977 It tells us what $r$ would transform into
       
   978 if we chop off the first character $c$ 
       
   979 from all strings in the language of $r$ ($L \; r$).
       
   980 To give a flavour of Brzozowski derivatives, we present
       
   981 two straightforward clauses from its definition:
       
   982 \begin{center}
       
   983 	\begin{tabular}{lcl}
       
   984 		$d \backslash c$     & $\dn$ & 
       
   985 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
       
   986 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
       
   987 	\end{tabular}
       
   988 \end{center}
       
   989 \noindent
       
   990 The first clause says that for the regular expression
       
   991 denoting a singleton set consisting of a single-character string $\{ d \}$,
       
   992 we check the derivative character $c$ against $d$,
       
   993 returning a set containing only the empty string $\{ [] \}$
       
   994 if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
       
   995 The second clause states that to obtain the regular expression
        
   996 representing all strings in the language of $r_1 + r_2$ with the head
        
   997 character $c$ chopped off, one simply takes the derivatives of $r_1$ and $r_2$
        
   998 recursively and puts them back together with $+$.
       
   999 
       
   1000 Thanks to this definition, derivatives have the nice property
       
  1001 that $s \in L \; (r\backslash c)$ if and only if 
       
  1002 $c::s \in L \; r$.
       
  1003 %This property can be used on regular expressions
       
  1004 %matching and lexing--to test whether a string $s$ is in $L \; r$,
       
  1005 %one simply takes derivatives of $r$ successively with
       
  1006 %respect to the characters (in the correct order) in $s$,
       
  1007 %and then test whether the empty string is in the last regular expression.
   988 Derivatives give a simple solution
  1008 Derivatives give a simple solution
   989 to the problem of matching a string $s$ with a regular
  1009 to the problem of matching and lexing a string $s$ with a regular
   990 expression $r$: if the derivative of $r$ w.r.t.\ (in
  1010 expression $r$: if the derivative of $r$ w.r.t.\ (in
   991 succession) all the characters of the string matches the empty string,
  1011 succession) all the characters of the string matches the empty string,
   992 then $r$ matches $s$ (and {\em vice versa}).  
  1012 then $r$ matches $s$ (and {\em vice versa}).  
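To give a concrete impression of how directly this matcher can be written down, here is a small sketch in Scala (the datatype encoding and the function names are our own choice for illustration; the clauses for sequences and stars are Brzozowski's standard ones, not shown in the two clauses above):
\begin{verbatim}
// a sketch of Brzozowski's matcher; the encoding is ours for illustration
abstract class Rexp
case object ZERO extends Rexp                     // empty language
case object ONE extends Rexp                      // empty string
case class CHAR(d: Char) extends Rexp             // single character d
case class ALT(r1: Rexp, r2: Rexp) extends Rexp   // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // r1 followed by r2
case class STAR(r: Rexp) extends Rexp             // r*

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// the derivative r \ c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// r matches s iff the iterated derivative can match the empty string
def matcher(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}
\noindent
The definitions are plain structural recursion over an inductive datatype, which is why they port so easily to theorem prover code.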
   993 
  1013 
   994 The beauty of
   1014 This makes formal reasoning about properties such
   995 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
  1015 as correctness and complexity smooth and intuitive.
   996 expressible in any functional language, and easily definable and
   1016 In fact, there have already been several mechanised proofs about them,
   997 reasoned about in theorem provers---the definitions just consist of
  1017 for example the one by Owens and Slind \cite{Owens2008} in HOL4,
   998 inductive datatypes and simple recursive functions. 
  1018 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
   999 And an algorithm based on it by 
  1019 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
   1000 Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
  1020 
  1001 to include  extended regular expressions and 
  1021 In addition, one can extend the clauses to bounded repetitions
  1002  simplification of internal data structures 
  1022 ``for free'':
  1003  eliminating the exponential behaviours.
  1023 \begin{center}
  1004 
  1024 	\begin{tabular}{lcl}
  1005 However, two difficulties with derivative-based matchers exist:
  1025 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
  1006 \subsubsection{Problems with Current Brzozowski Matchers}
  1026 		r^{\{n-1\}}$\\
       
  1027 	\end{tabular}
       
  1028 \end{center}
       
  1029 \noindent
       
   1030 Experimental results suggest that, unlike DFA-based solutions,
        
   1031 this derivative-based approach can support 
        
   1032 bounded regular expressions with large counters
        
   1033 quite well.
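To illustrate the ``for free'' nature of this extension, the clause above can be grafted onto the sketch given earlier with one additional constructor and two additional cases (the constructor name \texttt{NTIMES} is our own choice):
\begin{verbatim}
// extension of the earlier sketch with bounded repetition r{n}
case class NTIMES(r: Rexp, n: Int) extends Rexp

// cases to be added to nullable and der of the earlier sketch:
//
//   case NTIMES(r1, n) => n == 0 || nullable(r1)
//
//   case NTIMES(r1, n) =>
//     if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
\end{verbatim}
\noindent
Crucially, the counter $n$ is kept as a number rather than being expanded into $n$ copies of $r$, which is what allows large counters to be handled.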
       
  1034 
       
   1035 There have also been 
       
  1036 extensions to other constructs.
       
   1037 For example, Owens et al.\ include derivatives
        
   1038 for \emph{NOT} regular expressions, which can
        
   1039 concisely express C-style comments of the form
        
   1040 $/* \ldots */$.
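This complement construct likewise needs only one extra constructor and two extra cases; a sketch in the same style as before, following the standard treatment of complement by Owens et al.\ (the constructor name is again ours):
\begin{verbatim}
// complement (NOT) regular expressions, in the style of the earlier sketch
case class NOT(r: Rexp) extends Rexp   // all strings not matched by r

// cases to be added to nullable and der respectively:
//
//   case NOT(r1) => !nullable(r1)
//
//   case NOT(r1) => NOT(der(c, r1))
\end{verbatim}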
       
  1041 Another extension for derivatives would be
       
   1042 regular expressions with look-aheads, done
       
  1043 by Miyazaki and Minamide
       
  1044 \cite{Takayuki2019}.
       
  1045 %We therefore use Brzozowski derivatives on regular expressions 
       
  1046 %lexing 
       
  1047 
       
  1048 
       
  1049 
       
  1050 Given the above definitions and properties of
       
  1051 Brzozowski derivatives, one quickly realises their potential
       
  1052 in generating a formally verified algorithm for lexing--the clauses and property
       
  1053 can be easily expressed in a functional programming language 
       
  1054 or converted to theorem prover
       
  1055 code, with great extensibility.
       
   1056 Perhaps this is the reason why they have sparked quite a bit of interest
       
  1057 in the functional programming and theorem prover communities in the last
       
  1058 fifteen or so years (
       
  1059 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
       
  1060 \cite{Chen12} and \cite{Coquand2012}
       
  1061 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
       
  1062 after they were first published.
       
  1063 
       
  1064 
       
  1065 However, there are two difficulties with derivative-based matchers:
  1007 First, Brzozowski's original matcher only generates a yes/no answer
  1066 First, Brzozowski's original matcher only generates a yes/no answer
  1008 for whether a regular expression matches a string or not.  This is too
  1067 for whether a regular expression matches a string or not.  This is too
  1009 little information in the context of lexing where separate tokens must
  1068 little information in the context of lexing where separate tokens must
  1010 be identified and also classified (for example as keywords
  1069 be identified and also classified (for example as keywords
  1011 or identifiers).  Sulzmann and Lu~\cite{Sulzmann2014} overcome this
  1070 or identifiers). 
       
   1071 Second, derivative-based matchers are often not efficient enough.
       
  1072 Elegant and beautiful
       
  1073 as many implementations are,
       
  1074 they can be excruciatingly slow. 
       
  1075 For example, Sulzmann and Lu
       
   1076 claim a linear running time for their proposed algorithm,
       
  1077 but that was falsified by our experiments. The running time 
       
  1078 is actually $\Omega(2^n)$ in the worst case.
       
  1079 A similar claim about a theoretical runtime of $O(n^2)$ 
       
  1080 is made for the Verbatim \cite{Verbatim}
       
  1081 %TODO: give references
       
  1082 lexer, which calculates POSIX matches and is based on derivatives.
       
   1083 They formalised the correctness of the lexer, but not the complexity.
        
   1084 In the performance evaluation section, they simply analysed the run time
       
  1085 of matching $a$ with the string 
       
  1086 \begin{center}
       
  1087 	$\underbrace{a \ldots a}_{\text{n a's}}$
       
  1088 \end{center}
       
  1089 and concluded that the algorithm is quadratic in terms of input length.
       
  1090 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
       
  1091 the time it took to lex only 40 $a$'s was 5 minutes.
       
  1092 
       
  1093 
       
  1094 \subsection{Sulzmann and Lu's Algorithm}
       
  1095 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
  1012 difficulty by cleverly extending Brzozowski's matching
  1096 difficulty by cleverly extending Brzozowski's matching
  1013 algorithm. Their extended version generates additional information on
  1097 algorithm. Their extended version generates additional information on
  1014 \emph{how} a regular expression matches a string following the POSIX
  1098 \emph{how} a regular expression matches a string following the POSIX
  1015 rules for regular expression matching. They achieve this by adding a
  1099 rules for regular expression matching. They achieve this by adding a
  1016 second ``phase'' to Brzozowski's algorithm involving an injection
  1100 second ``phase'' to Brzozowski's algorithm involving an injection
  1017 function.  In our own earlier work, we provided the formal
   1101 function. Sulzmann and Lu also propose simplifications of the internal 
        
   1102 data structures, aimed at eliminating the exponential behaviours.
        
   1103 In earlier work, Ausaf et al.\ provided the formal
  1018 specification of what POSIX matching means and proved in Isabelle/HOL
  1104 specification of what POSIX matching means and proved in Isabelle/HOL
  1019 the correctness
  1105 the correctness
  1020 of Sulzmann and Lu's extended algorithm accordingly
  1106 of Sulzmann and Lu's extended algorithm accordingly
  1021 \cite{AusafDyckhoffUrban2016}.
  1107 \cite{AusafDyckhoffUrban2016}.
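The overall shape of this extended algorithm can be sketched as follows (a schematic outline only, with the encoding in Scala being ours: \texttt{Val} stands for the values that record \emph{how} a regular expression matched a string, and \texttt{mkeps} and \texttt{inj} are the two functions of the second phase, whose definitions are omitted here):
\begin{verbatim}
// schematic outline of Sulzmann and Lu's two-phase lexer:
// derivatives in the first phase, injections in the second
abstract class Val   // values recording how r matched s (details omitted)

def mkeps(r: Rexp): Val = ???                  // value for a nullable r and []
def inj(r: Rexp, c: Char, v: Val): Val = ???   // put the character c back into v

def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil     => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs).map(v => inj(r, c, v))
}
\end{verbatim}
\noindent
The derivatives are computed while moving forward through the string; the injections then rebuild the value on the way back, starting from the last character.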
  1022 
  1108 
  1023 The second difficulty is that Brzozowski's derivatives can 
   1109 The version of the algorithm proven correct, however, 
   1024 grow to arbitrarily big sizes. For example if we start with the
   1110 suffers from the
        
   1111 second difficulty: the internal derivatives can
        
   1112 grow to arbitrarily large sizes. 
       
   1113 For example, if we start with the
  1025 regular expression $(a+aa)^*$ and take
  1114 regular expression $(a+aa)^*$ and take
  1026 successive derivatives according to the character $a$, we end up with
  1115 successive derivatives according to the character $a$, we end up with
  1027 a sequence of ever-growing derivatives like 
  1116 a sequence of ever-growing derivatives like 
  1028 
  1117 
  1029 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
  1118 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
  1036 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
  1125 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
  1037 \end{tabular}
  1126 \end{tabular}
  1038 \end{center}
  1127 \end{center}
  1039  
  1128  
  1040 \noindent where after around 35 steps we run out of memory on a
  1129 \noindent where after around 35 steps we run out of memory on a
  1041 typical computer (we shall define shortly the precise details of our
  1130 typical computer (we shall define in the next chapter 
       
  1131 the precise details of our
  1042 regular expressions and the derivative operation).  Clearly, the
  1132 regular expressions and the derivative operation).  Clearly, the
  1043 notation involving $\ZERO$s and $\ONE$s already suggests
  1133 notation involving $\ZERO$s and $\ONE$s already suggests
   1044 simplification rules that can be applied to regular
   1134 simplification rules that can be applied to regular
  1045 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
  1135 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
  1046 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
  1136 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
  1048 earlier work to preserve the correctness of Sulzmann and Lu's
  1138 earlier work to preserve the correctness of Sulzmann and Lu's
  1049 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
  1139 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
  1050 \emph{not} help with limiting the growth of the derivatives shown
  1140 \emph{not} help with limiting the growth of the derivatives shown
  1051 above: the growth is slowed, but the derivatives can still grow rather
  1141 above: the growth is slowed, but the derivatives can still grow rather
  1052 quickly beyond any finite bound.
  1142 quickly beyond any finite bound.
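For illustration, these ``simple-minded'' rewrite rules amount to a simplification phase like the following, interleaved with the derivative steps (a sketch in the notation of the earlier code; only the four rules listed above are implemented and the function names are ours):
\begin{verbatim}
// the simplification rules mentioned above, applied bottom-up
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO                                // 0 . r  =>  0
    case (ONE, s2)  => s2                                  // 1 . r  =>  r
    case (s1, s2)   => SEQ(s1, s2)
  }
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                                  // 0 + r  =>  r
    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)   // r + r  =>  r
  }
  case _ => r
}

// matcher that simplifies after each derivative step
def matcherSimp(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
\end{verbatim}
\noindent
Note that $r + r \Rightarrow r$ only removes duplicates that appear as syntactically identical alternatives of the same $+$; duplicates buried under different contexts survive, which is one way to see why the growth is merely slowed rather than bounded.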
  1053 
       
  1054 
  1143 
  1055 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
  1144 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
  1056 \cite{Sulzmann2014} where they introduce bit-coded
  1145 \cite{Sulzmann2014} where they introduce bit-coded
  1057 regular expressions. In this version, POSIX values are
  1146 regular expressions. In this version, POSIX values are
  1058 represented as bit sequences and such sequences are incrementally generated
  1147 represented as bit sequences and such sequences are incrementally generated
  1073   method [..] in combination with the simplification steps [..]
  1162   method [..] in combination with the simplification steps [..]
  1074   yields POSIX parse trees. We have tested this claim
  1163   yields POSIX parse trees. We have tested this claim
  1075   extensively [..] but yet
  1164   extensively [..] but yet
  1076   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
  1165   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
  1077 \end{quote}  
  1166 \end{quote}  
  1078 
       
  1079 Ausaf and Urban were able to back this correctness claim with
  1167 Ausaf and Urban were able to back this correctness claim with
  1080 a formal proof.
  1168 a formal proof.
  1081 
  1169 
  1082 But as they stated,
   1170 However, a faster formally verified 
        
   1171 lexing program with the optimisations
        
   1172 of Sulzmann and Lu's second algorithm
        
   1173 is still missing.
        
   1174 As Ausaf and Urban stated,
  1083   \begin{quote}\it
  1175   \begin{quote}\it
  1084 The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.
  1176 ``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
  1085 \end{quote}  
  1177 \end{quote}  
  1086 
       
  1087 This thesis implements the aggressive simplifications envisioned
  1178 This thesis implements the aggressive simplifications envisioned
  1088 by Ausaf and Urban,
  1179 by Ausaf and Urban,
  1089 and gives a formal proof of the correctness with those simplifications.
   1180 together with a formal proof of correctness in the presence of those simplifications.
  1090 
  1181 
  1091 
  1182 
  1092 %----------------------------------------------------------------------------------------
  1183 %----------------------------------------------------------------------------------------
  1093 \section{Contribution}
  1184 \section{Contribution}
  1094 
  1185 
  1095 
  1186 In this thesis,
  1096 
  1187 we propose a solution to catastrophic
  1097 This work addresses the vulnerability of super-linear and
  1188 backtracking and error-prone matchers: a formally verified
  1098 buggy regex implementations by the combination
  1189 regular expression lexing algorithm
  1099 of Brzozowski's derivatives and interactive theorem proving. 
  1190 that is both fast
  1100 We give an 
   1191 and correct, obtained by extending Ausaf et al.'s work.
  1101 improved version of  Sulzmann and Lu's bit-coded algorithm using 
   1192 The end result consists of %a regular expression lexing algorithm that comes with 
  1102 derivatives, which come with a formal guarantee in terms of correctness and 
  1193 \begin{itemize}
  1103 running time as an Isabelle/HOL proof.
  1194 \item
       
  1195 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
       
  1196 derivatives with simplifications, 
       
  1197 accompanied by
       
   1198 a proven correctness theorem with respect to the POSIX specification 
       
  1199 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, 
       
  1200 \item 
       
  1201 a complexity-related property for that algorithm saying that the 
       
  1202 internal data structure will
       
  1203 remain finite,
       
  1204 \item
       
   1205 and an extension to
        
   1206 the bounded repetitions construct with the correctness and finiteness properties
       
  1207 maintained.
       
  1208  \end{itemize}
       
  1209 
       
  1210 
       
  1211 With a formal finiteness bound in place,
       
   1212 we can greatly reduce the attack surface that servers expose to ReDoS attacks.
   1104 Further improvements to the algorithm, with an even stronger version of 
   1213 Further improvements to the algorithm, with an even stronger version of 
   1105 simplification, are also made.
   1214 simplification, are also made.
  1106 We have not yet come up with one, but believe that it leads to a 
  1215 Thanks to our theorem-prover-friendly approach,
  1107 formalised proof with a time bound linear to input and
  1216 we believe that 
  1108 cubic to regular expression size using a technique by
  1217 this finiteness bound can be improved to a bound
  1109 Antimirov\cite{Antimirov}.
   1218 linear in the input size and
        
   1219 cubic in the regular expression size, using a technique by
        
   1220 Antimirov~\cite{Antimirov95}.
       
   1221 Once formalised, this would guarantee the absence of all super-linear behaviour.
       
  1222 We are working out the
       
  1223 details.
  1110 
  1224 
  1111  
  1225  
  1112 The main contribution of this thesis is 
       
  1113 \begin{itemize}
       
  1114 \item
       
  1115 a proven correct lexing algorithm
       
  1116 \item
       
  1117 with formalized finite bounds on internal data structures' sizes.
       
  1118 \end{itemize}
       
  1119 
       
   1120 To the best of our knowledge, no lexing libraries using Brzozowski derivatives
   1226 To the best of our knowledge, no lexing libraries using Brzozowski derivatives
  1121 have a provable time guarantee, 
  1227 have similar complexity-related bounds, 
  1122 and claims about running time are usually speculative and backed by thin empirical
  1228 and claims about running time are usually speculative and backed by empirical
  1123 evidence.
  1229 evidence on a few test cases.
  1124 %TODO: give references
  1230 If a matching or lexing algorithm
  1125 For example, Sulzmann and Lu had proposed an algorithm  in which they
   1231 does not come with certain basic complexity-related 
  1126 claim a linear running time.
   1232 guarantees (for example that the internal data structure size
  1127 But that was falsified by our experiments and the running time 
  1233 does not grow indefinitely), 
  1128 is actually $\Omega(2^n)$ in the worst case.
   1234 then one cannot claim with confidence to have solved the problem
  1129 A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
  1235 of catastrophic backtracking.
  1130 %TODO: give references
  1236 
  1131 lexer, which calculates POSIX matches and is based on derivatives.
       
  1132 They formalized the correctness of the lexer, but not the complexity.
       
  1133 In the performance evaluation section, they simply analyzed the run time
       
  1134 of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
       
  1135 and concluded that the algorithm is quadratic in terms of input length.
       
  1136 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
       
  1137 the time it took to lex only 40 $a$'s was 5 minutes.
       
  1138 
       
  1139 
       
  1140 
       
  1141 \subsection{Related Work}
       
  1142 We are aware
       
  1143 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
       
  1144 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
       
  1145 of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
       
  1146 in Coq is given by Coquand and Siles \parencite{Coquand2012}.
       
  1147 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
       
  1148  
       
  1149  
       
  1150  When a regular expression does not behave as intended,
       
  1151 people usually try to rewrite the regex to some equivalent form
       
  1152 or they try to avoid the possibly problematic patterns completely,
       
  1153 for which many false positives exist\parencite{Davis18}.
       
  1154 Animated tools to "debug" regular expressions such as
       
  1155  \parencite{regexploit2021} \parencite{regex101} are also popular.
       
  1156 We are also aware of static analysis work on regular expressions that
       
   1157 aims to detect potentially exponential regex patterns. Rathnayake and Thielecke 
       
  1158 \parencite{Rathnayake2014StaticAF} proposed an algorithm
       
  1159 that detects regular expressions triggering exponential
       
   1160 behaviours on backtracking matchers.
       
  1161 Weideman \parencite{Weideman2017Static} came up with 
       
  1162 non-linear polynomial worst-time estimates
       
   1163 for regexes, attack strings that exploit the worst-time 
        
   1164 scenario, and ``attack automata'' that generate
       
  1165 attack strings.
       
  1166 
  1237 
  1167 
  1238 
  1168 
  1239 
  1169 
  1240 
  1170 \section{Structure of the thesis}
  1241 \section{Structure of the thesis}