ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 638 dd9dde2d902b
parent 637 e3752aac8ec2
child 646 56057198e4f5
equal deleted inserted replaced
637:e3752aac8ec2 638:dd9dde2d902b
   215 detection systems that inspect suspicious traffic; or compiler
   215 detection systems that inspect suspicious traffic; or compiler
   216 front ends.
   216 front ends.
   217 Given their usefulness and ubiquity, one would assume that
   217 Given their usefulness and ubiquity, one would assume that
   218 modern regular expression matching implementations
   218 modern regular expression matching implementations
   219 are mature and fully studied.
   219 are mature and fully studied.
   220 Indeed, in a popular programming language's regular expression engine, 
   220 Indeed, in many popular programming languages' regular expression engines, 
   221 supplying it with regular expressions and strings,
   221 one can supply a regular expression and a string, and in most cases get
   222 in most cases one can
       
   223 get the matching information in a very short time.
   222 get the matching information in a very short time.
   224 Those engines can be blindingly fast--some 
   223 Those engines can sometimes be blindingly fast--some 
   225 network intrusion detection systems
   224 network intrusion detection systems
   226 use regular expression engines that are able to process 
   225 use regular expression engines that are able to process 
   227 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   226 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   228 However, those engines can sometimes exhibit a surprising security vulnerability
   227 However, those engines can sometimes also exhibit a surprising security vulnerability
   229 under a certain class of inputs.
   228 under a certain class of inputs.
   230 %However, , this is not the case for $\mathbf{all}$ inputs.
   229 %However, , this is not the case for $\mathbf{all}$ inputs.
   231 %TODO: get source for SNORT/BRO's regex matching engine/speed
   230 %TODO: get source for SNORT/BRO's regex matching engine/speed
   232 
   231 
   233 
   232 
   234 Consider for example $(a^*)^*\,b$ and 
   233 Consider for example the regular expression $(a^*)^*\,b$ and 
   235 strings of the form $aa..a$, these strings cannot be matched by this regular
   234 strings of the form $aa..a$. These strings cannot be matched by this regular
   236 expression: Obviously the expected $b$ in the last
   235 expression: Obviously the expected $b$ in the last
   237 position is missing. One would assume that modern regular expression
   236 position is missing. One would assume that modern regular expression
   238 matching engines can find this out very quickly. Surprisingly, if one tries
   237 matching engines can find this out very quickly. Surprisingly, if one tries
   239 this example in JavaScript, Python or Java 8, even with small strings, 
   238 this example in JavaScript, Python or Java 8, even with small strings, 
   240 say of lenght of around 30 $a$'s,
   239 say of lenght of around 30 $a$'s,
   241 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
   240 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
   242 This is clearly exponential behaviour, and as can be seen
   241 The algorithms clearly show exponential behaviour, and as can be seen
   243 is triggered by some relatively simple regular expressions.
   242 is triggered by some relatively simple regular expressions.
   244 Java 9 and newer
   243 Java 9 and newer
   245 versions improve this behaviour somewhat, but are still slow compared 
   244 versions improve this behaviour somewhat, but are still slow compared 
   246 with the approach we are going to use in this thesis.
   245 with the approach we are going to use in this thesis.
   247 
   246 
   248 
   247 
   249 
   248 
   250 This superlinear blowup in regular expression engines
   249 This superlinear blowup in regular expression engines
   251 has caused grief in ``real life'' in the past where it is 
   250 has caused grief in ``real life'' where it is 
   252 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
   251 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
   253 For example, on 20 July 2016 one evil
   252 For example, on 20 July 2016 one evil
   254 regular expression brought the webpage
   253 regular expression brought the webpage
   255 \href{http://stackexchange.com}{Stack Exchange} to its
   254 \href{http://stackexchange.com}{Stack Exchange} to its
   256 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
   255 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
   421 attacks.
   420 attacks.
   422 Davis et al. \cite{Davis18} detected more
   421 Davis et al. \cite{Davis18} detected more
   423 than 1000 evil regular expressions
   422 than 1000 evil regular expressions
   424 in Node.js, Python core libraries, npm and pypi. 
   423 in Node.js, Python core libraries, npm and pypi. 
   425 They therefore concluded that evil regular expressions
   424 They therefore concluded that evil regular expressions
   426 are real problems rather than just "a parlour trick".
   425 are a real problem rather than just "a parlour trick".
   427 
   426 
   428 This work aims to address this issue
   427 The work in this thesis aims to address this issue
   429 with the help of formal proofs.
   428 with the help of formal proofs.
   430 We describe a lexing algorithm based
   429 We describe a lexing algorithm based
   431 on Brzozowski derivatives with verified correctness 
   430 on Brzozowski derivatives with verified correctness 
   432 and a finiteness property for the size of derivatives
   431 and a finiteness property for the size of derivatives
   433 (which are all done in Isabelle/HOL).
   432 (which are all done in Isabelle/HOL).
   625 have bounded regular expressions in them.
   624 have bounded regular expressions in them.
   626 Often the counters are quite large, with the largest being
   625 Often the counters are quite large, with the largest being
   627 close to ten million. 
   626 close to ten million. 
   628 A smaller sample XSD they gave
   627 A smaller sample XSD they gave
   629 is:
   628 is:
   630 \begin{verbatim}
   629 \lstset{
       
   630 	basicstyle=\fontsize{8.5}{9}\ttfamily,
       
   631   language=XML,
       
   632   morekeywords={encoding,
       
   633     xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
       
   634 }
       
   635 \begin{lstlisting}
   631 <sequence minOccurs="0" maxOccurs="65535">
   636 <sequence minOccurs="0" maxOccurs="65535">
   632  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
   637 	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
   633  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
   638  	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
   634 </sequence>
   639 </sequence>
   635 \end{verbatim}
   640 \end{lstlisting}
   636 This can be seen as the regex
   641 This can be seen as the regex
   637 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
   642 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
   638 regular expressions 
   643 regular expressions 
   639 satisfying certain constraints (such as 
   644 satisfying certain constraints (such as 
   640 satisfying the floating point number format).
   645 satisfying the floating point number format).
   785 \end{center}
   790 \end{center}
   786 The sub-expressions
   791 The sub-expressions
   787 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
   792 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
   788 by 1 to 4, and can be ``referred back'' by their respective numbers. 
   793 by 1 to 4, and can be ``referred back'' by their respective numbers. 
   789 %These sub-expressions are called "capturing groups".
   794 %These sub-expressions are called "capturing groups".
   790 To do so, we use the syntax $\backslash i$ 
   795 To do so, one uses the syntax $\backslash i$ 
   791 to denote that we want the sub-string 
   796 to denote that we want the sub-string 
   792 of the input matched by the i-th
   797 of the input matched by the i-th
   793 sub-expression to appear again, 
   798 sub-expression to appear again, 
   794 exactly the same as it first appeared: 
   799 exactly the same as it first appeared: 
   795 \begin{center}
   800 \begin{center}
   834 $(.)(.)\backslash 2\backslash 1$
   839 $(.)(.)\backslash 2\backslash 1$
   835 \end{center}
   840 \end{center}
   836 which matches four-character palindromes
   841 which matches four-character palindromes
   837 like $abba$, $x??x$ and so on.
   842 like $abba$, $x??x$ and so on.
   838 
   843 
   839 Back-references is a regex construct 
   844 Back-references are a regex construct 
   840 that programmers find quite useful.
   845 that programmers find quite useful.
   841 According to Becchi and Crawley \cite{Becchi08},
   846 According to Becchi and Crawley \cite{Becchi08},
   842 6\% of Snort rules (up until 2008) use them.
   847 6\% of Snort rules (up until 2008) use them.
   843 The most common use of back-references
   848 The most common use of back-references
   844 is to express well-formed html files,
   849 is to express well-formed html files,
   926 with a regular expression.
   931 with a regular expression.
   927 In such cases the regular expressions matcher needs to
   932 In such cases the regular expressions matcher needs to
   928 disambiguate.
   933 disambiguate.
   929 The more widely used strategy is called POSIX,
   934 The more widely used strategy is called POSIX,
   930 which roughly speaking always chooses the longest initial match.
   935 which roughly speaking always chooses the longest initial match.
   931 The POSIX strategy is widely adopted in many regular expression matchers.
   936 The POSIX strategy is widely adopted in many regular expression matchers
       
   937 because it is a natural strategy for lexers.
   932 However, many implementations (including the C libraries
   938 However, many implementations (including the C libraries
   933 used by Linux and OS X distributions) contain bugs
   939 used by Linux and OS X distributions) contain bugs
   934 or do not meet the specification they claim to adhere to.
   940 or do not meet the specification they claim to adhere to.
   935 Kuklewicz maintains a unit test repository which lists some
   941 Kuklewicz maintains a unit test repository which lists some
   936 problems with existing regular expression engines \cite{KuklewiczHaskell}.
   942 problems with existing regular expression engines \cite{KuklewiczHaskell}.
   945 involves the entire string $ababa$, 
   951 involves the entire string $ababa$, 
   946 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
   952 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
   947 $[0, 2), [2, 5)$
   953 $[0, 2), [2, 5)$
   948 respectively.
   954 respectively.
   949 But feeding this example to the different engines
   955 But feeding this example to the different engines
   950 in regex101 \parencite{regex101} \footnote{
   956 listed at regex101 \footnote{
   951 	regex101 is an online regular expression matcher which
   957 	regex101 is an online regular expression matcher which
   952 	provides API for trying out regular expression
   958 	provides API for trying out regular expression
   953 	engines of multiple popular programming languages like
   959 	engines of multiple popular programming languages like
   954 Java, Python, Go, etc.}
   960 Java, Python, Go, etc.} \parencite{regex101}. 
   955 yields
   961 yields
   956 only two incomplete matches: $[aba]$ at $[0, 3)$
   962 only two incomplete matches: $[aba]$ at $[0, 3)$
   957 and $a$ at $[4, 5)$.
   963 and $a$ at $[4, 5)$.
   958 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
   964 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
   959 commented that most regex libraries are not
   965 commented that most regex libraries are not
   960 correctly implementing the central POSIX
   966 correctly implementing the central POSIX
   961 rule, called the maximum munch rule.
   967 rule, called the maximum munch rule.
   962 Grathwohl \parencite{grathwohl2014crash} wrote,
   968 Grathwohl \parencite{grathwohl2014crash} wrote:
   963 \begin{quote}
   969 \begin{quote}\it
   964 	``The POSIX strategy is more complicated than the 
   970 	``The POSIX strategy is more complicated than the 
   965 	greedy because of the dependence on information about 
   971 	greedy because of the dependence on information about 
   966 	the length of matched strings in the various subexpressions.''
   972 	the length of matched strings in the various subexpressions.''
   967 \end{quote}
   973 \end{quote}
   968 %\noindent
   974 %\noindent
   969 We think the implementation complexity of POSIX rules also come from
   975 We think the implementation complexity of POSIX rules also come from
   970 the specification being not very precise.
   976 the specification being not very precise.
   971 The developers of the regexp package of GO 
   977 The developers of the regexp package of Go 
   972 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
   978 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
   973 commented that
   979 commented that
   974 \begin{quote}\it
   980 \begin{quote}\it
   975 ``
   981 ``
   976 The POSIX rule is computationally prohibitive
   982 The POSIX rule is computationally prohibitive
  1006 %and is crying out for formalising.
  1012 %and is crying out for formalising.
  1007 Ausaf et al. \cite{AusafDyckhoffUrban2016}
  1013 Ausaf et al. \cite{AusafDyckhoffUrban2016}
  1008 are the first to
  1014 are the first to
  1009 give a quite simple formalised POSIX
  1015 give a quite simple formalised POSIX
  1010 specification in Isabelle/HOL, and also prove 
  1016 specification in Isabelle/HOL, and also prove 
  1011 that their specification coincides with the 
  1017 that their specification coincides with an earlier (unformalised) 
  1012 POSIX specification given by Okui and Suzuki \cite{Okui10}.
  1018 POSIX specification given by Okui and Suzuki \cite{Okui10}.
  1013 They then formally proved the correctness of
  1019 They then formally proved the correctness of
  1014 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
  1020 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
  1015 with regards to that specification.
  1021 with regards to that specification.
  1016 They also found that the informal POSIX
  1022 They also found that the informal POSIX
  1017 specification by Sulzmann and Lu needs to be revised for the correctness proof.
  1023 specification by Sulzmann and Lu needs to be substantially 
       
  1024 revised in order for the correctness proof to go through.
       
  1025 Their original specification and proof were unfixable
       
  1026 according to Ausaf et al.
  1018 
  1027 
  1019 In the next section, we will briefly
  1028 In the next section, we will briefly
  1020 introduce Brzozowski derivatives and Sulzmann
  1029 introduce Brzozowski derivatives and Sulzmann
  1021 and Lu's algorithm, which the main point of this thesis builds on.
  1030 and Lu's algorithm, which the main point of this thesis builds on.
  1022 %We give a taste of what they 
  1031 %We give a taste of what they 
  1028 Brzozowski \cite{Brzozowski1964} first introduced the 
  1037 Brzozowski \cite{Brzozowski1964} first introduced the 
  1029 concept of a \emph{derivative} of regular expression in 1964.
  1038 concept of a \emph{derivative} of regular expression in 1964.
  1030 The derivative of a regular expression $r$
  1039 The derivative of a regular expression $r$
  1031 with respect to a character $c$, is written as $r \backslash c$.
  1040 with respect to a character $c$, is written as $r \backslash c$.
  1032 This operation tells us what $r$ transforms into
  1041 This operation tells us what $r$ transforms into
  1033 if we ``chop'' off the first character $c$ 
  1042 if we ``chop'' off a particular character $c$ 
  1034 from all strings in the language of $r$ (defined
  1043 from all strings in the language of $r$ (defined
  1035 later as $L \; r$).
  1044 later as $L \; r$).
  1036 %To give a flavour of Brzozowski derivatives, we present
  1045 %To give a flavour of Brzozowski derivatives, we present
  1037 %two straightforward clauses from it:
  1046 %two straightforward clauses from it:
  1038 %\begin{center}
  1047 %\begin{center}
  1058 %This property can be used on regular expressions
  1067 %This property can be used on regular expressions
  1059 %matching and lexing--to test whether a string $s$ is in $L \; r$,
  1068 %matching and lexing--to test whether a string $s$ is in $L \; r$,
  1060 %one simply takes derivatives of $r$ successively with
  1069 %one simply takes derivatives of $r$ successively with
  1061 %respect to the characters (in the correct order) in $s$,
  1070 %respect to the characters (in the correct order) in $s$,
  1062 %and then test whether the empty string is in the last regular expression.
  1071 %and then test whether the empty string is in the last regular expression.
  1063 With this property, derivatives give a simple solution
  1072 With this property, derivatives can give a simple solution
  1064 to the problem of matching a string $s$ with a regular
  1073 to the problem of matching a string $s$ with a regular
  1065 expression $r$: if the derivative of $r$ w.r.t.\ (in
  1074 expression $r$: if the derivative of $r$ w.r.t.\ (in
  1066 succession) all the characters of the string matches the empty string,
  1075 succession) all the characters of the string matches the empty string,
  1067 then $r$ matches $s$ (and {\em vice versa}).  
  1076 then $r$ matches $s$ (and {\em vice versa}).  
  1068 %This makes formally reasoning about these properties such
  1077 %This makes formally reasoning about these properties such
  1069 %as correctness and complexity smooth and intuitive.
  1078 %as correctness and complexity smooth and intuitive.
  1070 There had been several mechanised proofs about this property in various theorem
  1079 There are several mechanised proofs of this property in various theorem
  1071 provers,
  1080 provers,
  1072 for example one by Owens and Slind \cite{Owens2008} in HOL4,
  1081 for example one by Owens and Slind \cite{Owens2008} in HOL4,
  1073 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
  1082 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
  1074 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
  1083 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
  1075 
  1084 
  1077 relatively straightforwardly. For example, the derivative for 
  1086 relatively straightforwardly. For example, the derivative for 
  1078 this can be defined as:
  1087 this can be defined as:
  1079 \begin{center}
  1088 \begin{center}
  1080 	\begin{tabular}{lcl}
  1089 	\begin{tabular}{lcl}
  1081 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
  1090 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
  1082 		r^{\{n-1\}}$\\
  1091 		r^{\{n-1\}} (\text{when} n > 0)$\\
  1083 	\end{tabular}
  1092 	\end{tabular}
  1084 \end{center}
  1093 \end{center}
  1085 \noindent
  1094 \noindent
  1086 Experimental results suggest that  unlike DFA-based solutions
  1095 Experimental results suggest that  unlike DFA-based solutions
  1087 for bounded regular expressions,
  1096 for bounded regular expressions,
  1088 derivatives can cope
  1097 derivatives can cope
  1089 large counters
  1098 large counters
  1090 quite well.
  1099 quite well.
  1091 
  1100 
  1092 There has also been 
  1101 There have also been 
  1093 extensions to other constructs.
  1102 extensions of derivatives to other regex constructs.
  1094 For example, Owens et al include the derivatives
  1103 For example, Owens et al include the derivatives
  1095 for the \emph{NOT} regular expression, which is
  1104 for the \emph{NOT} regular expression, which is
  1096 able to concisely express C-style comments of the form
  1105 able to concisely express C-style comments of the form
  1097 $/* \ldots */$ (see \cite{Owens2008}).
  1106 $/* \ldots */$ (see \cite{Owens2008}).
  1098 Another extension for derivatives is
  1107 Another extension for derivatives is
  1104 
  1113 
  1105 
  1114 
  1106 
  1115 
  1107 Given the above definitions and properties of
  1116 Given the above definitions and properties of
  1108 Brzozowski derivatives, one quickly realises their potential
  1117 Brzozowski derivatives, one quickly realises their potential
  1109 in generating a formally verified algorithm for lexing--the clauses and property
  1118 in generating a formally verified algorithm for lexing: the clauses and property
  1110 can be easily expressed in a functional programming language 
  1119 can be easily expressed in a functional programming language 
  1111 or converted to theorem prover
  1120 or converted to theorem prover
  1112 code, with great extensibility.
  1121 code, with great ease.
  1113 Perhaps this is the reason why it has sparked quite a bit of interest
  1122 Perhaps this is the reason why derivatives have sparked quite a bit of interest
  1114 in the functional programming and theorem prover communities in the last
  1123 in the functional programming and theorem prover communities in the last
  1115 fifteen or so years (
  1124 fifteen or so years (
  1116 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
  1125 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
  1117 \cite{Chen12} and \cite{Coquand2012}
  1126 \cite{Chen12} and \cite{Coquand2012}
  1118 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
  1127 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
  1123 First, Brzozowski's original matcher only generates a yes/no answer
  1132 First, Brzozowski's original matcher only generates a yes/no answer
  1124 for whether a regular expression matches a string or not.  This is too
  1133 for whether a regular expression matches a string or not.  This is too
  1125 little information in the context of lexing where separate tokens must
  1134 little information in the context of lexing where separate tokens must
  1126 be identified and also classified (for example as keywords
  1135 be identified and also classified (for example as keywords
  1127 or identifiers). 
  1136 or identifiers). 
  1128 Second, derivative-based matchers need to be more efficient.
  1137 Second, derivative-based matchers need to be more efficient in terms
       
  1138 of the sizes of derivatives.
  1129 Elegant and beautiful
  1139 Elegant and beautiful
  1130 as many implementations are,
  1140 as many implementations are,
  1131 they can be excruciatingly slow. 
  1141 they can be excruciatingly slow. 
  1132 For example, Sulzmann and Lu
  1142 For example, Sulzmann and Lu
  1133 claim a linear running time of their proposed algorithm,
  1143 claim a linear running time of their proposed algorithm,
  1144 	$\underbrace{a \ldots a}_{\text{n a's}}$.
  1154 	$\underbrace{a \ldots a}_{\text{n a's}}$.
  1145 \end{center}
  1155 \end{center}
  1146 \noindent
  1156 \noindent
  1147 They concluded that the algorithm is quadratic in terms of 
  1157 They concluded that the algorithm is quadratic in terms of 
  1148 the length of the input string.
  1158 the length of the input string.
  1149 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
  1159 When we tried out their extracted OCaml code with the example $(a+aa)^*$,
  1150 the time it took to match a string of 40 $a$'s was approximately 5 minutes.
  1160 the time it took to match a string of 40 $a$'s was approximately 5 minutes.
  1151 
  1161 
  1152 
  1162 
  1153 \subsection{Sulzmann and Lu's Algorithm}
  1163 \subsection{Sulzmann and Lu's Algorithm}
  1154 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
  1164 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
  1156 by cleverly extending Brzozowski's matching
  1166 by cleverly extending Brzozowski's matching
  1157 algorithm. Their extended version generates additional information on
  1167 algorithm. Their extended version generates additional information on
  1158 \emph{how} a regular expression matches a string following the POSIX
  1168 \emph{how} a regular expression matches a string following the POSIX
  1159 rules for regular expression matching. They achieve this by adding a
  1169 rules for regular expression matching. They achieve this by adding a
  1160 second ``phase'' to Brzozowski's algorithm involving an injection
  1170 second ``phase'' to Brzozowski's algorithm involving an injection
  1161 function.
  1171 function. This injection function in a sense undoes the ``damage''
       
  1172 of the derivatives chopping off characters.
  1162 In earlier work, Ausaf et al provided the formal
  1173 In earlier work, Ausaf et al provided the formal
  1163 specification of what POSIX matching means and proved in Isabelle/HOL
  1174 specification of what POSIX matching means and proved in Isabelle/HOL
  1164 the correctness
  1175 the correctness
  1165 of this extended algorithm accordingly
  1176 of this extended algorithm accordingly
  1166 \cite{AusafDyckhoffUrban2016}.
  1177 \cite{AusafDyckhoffUrban2016}.
  1167 
  1178 
  1168 The version of the algorithm proven correct 
  1179 The version of the algorithm proven correct
  1169 suffers heavily from a 
  1180 suffers however heavily from a 
  1170 second difficulty, where the internal derivatives can
  1181 second difficulty, where derivatives can
  1171 grow to arbitrarily big sizes. 
  1182 grow to arbitrarily big sizes. 
  1172 For example if we start with the
  1183 For example if we start with the
  1173 regular expression $(a+aa)^*$ and take
  1184 regular expression $(a+aa)^*$ and take
  1174 successive derivatives according to the character $a$, we end up with
  1185 successive derivatives according to the character $a$, we end up with
  1175 a sequence of ever-growing derivatives like 
  1186 a sequence of ever-growing derivatives like 
  1183 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
  1194 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
  1184 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
  1195 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
  1185 \end{tabular}
  1196 \end{tabular}
  1186 \end{center}
  1197 \end{center}
  1187  
  1198  
  1188 \noindent where after around 35 steps we run out of memory on a
  1199 \noindent where after around 35 steps we usually run out of memory on a
  1189 typical computer.  Clearly, the
  1200 typical computer.  Clearly, the
  1190 notation involving $\ZERO$s and $\ONE$s already suggests
  1201 notation involving $\ZERO$s and $\ONE$s already suggests
  1191 simplification rules that can be applied to regular regular
  1202 simplification rules that can be applied to regular regular
  1192 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
  1203 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
  1193 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
  1204 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
  1240 This is relevant work for us and we will compare it later with 
  1251 This is relevant work for us and we will compare it later with 
  1241 our derivative-based matcher we are going to present.
  1252 our derivative-based matcher we are going to present.
  1242 There is also some newer work called
  1253 There is also some newer work called
  1243 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
  1254 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
  1244 but deterministic finite automaton instead.
  1255 but deterministic finite automaton instead.
       
  1256 We will also study this work in a later section.
  1245 %An example that gives problem to automaton approaches would be
  1257 %An example that gives problem to automaton approaches would be
  1246 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
  1258 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
  1247 %It requires at least $2^{n+1}$ states to represent
  1259 %It requires at least $2^{n+1}$ states to represent
  1248 %as a DFA.
  1260 %as a DFA.
  1249 
  1261 
  1254 In this thesis,
  1266 In this thesis,
  1255 we propose a solution to catastrophic
  1267 we propose a solution to catastrophic
  1256 backtracking and error-prone matchers: a formally verified
  1268 backtracking and error-prone matchers: a formally verified
  1257 regular expression lexing algorithm
  1269 regular expression lexing algorithm
  1258 that is both fast
  1270 that is both fast
  1259 and correct by extending Ausaf et al.'s work.
  1271 and correct. 
  1260 The end result is %a regular expression lexing algorithm that comes with 
  1272 The result is %a regular expression lexing algorithm that comes with 
  1261 \begin{itemize}
  1273 \begin{itemize}
  1262 \item
  1274 \item
  1263 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
  1275 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
  1264 derivatives with simplifications, 
  1276 derivatives with simplifications, 
  1265 accompanied by
  1277 accompanied by
  1266 a proven correctness theorem according to POSIX specification 
  1278 a proven correctness theorem according to POSIX specification 
  1267 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, 
  1279 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, 
  1268 \item 
  1280 \item 
  1269 a complexity-related property for that algorithm saying that the 
  1281 a complexity-related property for that algorithm saying that the 
  1270 internal data structure will
  1282 internal data structure will
  1271 remain finite,
  1283 remain below a finite bound,
  1272 \item
  1284 \item
  1273 and extension to
  1285 and an extension to
  1274 the bounded repetitions construct with the correctness and finiteness property
  1286 the bounded repetition constructs with the correctness and finiteness property
  1275 maintained.
  1287 maintained.
  1276 \end{itemize}
  1288 \end{itemize}
  1277 \noindent
  1289 \noindent
  1278 With a formal finiteness bound in place,
  1290 With a formal finiteness bound in place,
  1279 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
  1291 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
  1283 \url{https://github.com/hellotommmy/posix}
  1295 \url{https://github.com/hellotommmy/posix}
  1284 \end{center}
  1296 \end{center}
  1285 Further improvements to the algorithm with an even stronger version of 
  1297 Further improvements to the algorithm with an even stronger version of 
  1286 simplification can be made. We conjecture that the resulting size of derivatives
  1298 simplification can be made. We conjecture that the resulting size of derivatives
  1287 can be bounded by a cubic bound w.r.t. the size of the regular expression.
  1299 can be bounded by a cubic bound w.r.t. the size of the regular expression.
  1288 
  1300 We will give relevant code in Scala, 
  1289 
  1301 but do not give a formal proof for that in Isabelle/HOL.
  1290 
  1302 This is still future work.
  1291 
       
  1292 
  1303 
  1293 
  1304 
  1294 \section{Structure of the thesis}
  1305 \section{Structure of the thesis}
  1295 In chapter \ref{Inj} we will introduce the concepts
  1306 In chapter \ref{Inj} we will introduce the concepts
  1296 and notations we 
  1307 and notations we 
  1297 use for describing regular expressions and derivatives,
  1308 use for describing regular expressions and derivatives,
  1298 and the first version of their lexing algorithm without bitcodes (including 
  1309 and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including 
  1299 its correctness proof).
  1310 its correctness proof).
  1300 We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
  1311 We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
  1301 together with the correctness proof by Ausaf and Urban.
  1312 together with the correctness proof by Ausaf and Urban.
  1302 Then we illustrate in chapter \ref{Bitcoded2}
  1313 Then we illustrate in chapter \ref{Bitcoded2}
  1303 how Sulzmann and Lu's
  1314 how Sulzmann and Lu's
  1304 simplifications fail to simplify. We therefore introduce our version of the
  1315 simplifications fail to simplify correctly. We therefore introduce our own version of the
  1305 algorithm with simplification and 
  1316 algorithm with correct simplifications and 
  1306 its correctness proof .  
  1317 their correctness proof.  
  1307 In chapter \ref{Finite} we give the second guarantee
  1318 In chapter \ref{Finite} we give the second guarantee
  1308 of our bitcoded algorithm, that is a finite bound on the size of any 
  1319 of our bitcoded algorithm, that is a finite bound on the size of any 
  1309 regular expression's derivatives. 
  1320 regular expression's derivatives. 
  1310 We also show how one can extend the
  1321 We also show how one can extend the
  1311 algorithm to include bounded repetitions. 
  1322 algorithm to include bounded repetitions. 
  1312 In chapter \ref{Cubic} we discuss stronger simplification rules which 
  1323 In chapter \ref{Cubic} we discuss stronger simplification rules which 
  1313 improve the finite bound to a cubic bound.%and the NOT regular expression.
  1324 improve the finite bound to a cubic bound. %and the NOT regular expression.
  1314 Chapter \ref{RelatedWork} introduces relevant work for this thesis.
  1325 Chapter \ref{RelatedWork} introduces relevant work for this thesis.
  1315 Chapter \ref{Future} concludes with avenues of future research.
  1326 Chapter \ref{Future} concludes and mentions avenues of future research.
  1316  
  1327  
  1317 
  1328 
  1318 
  1329 
  1319 
  1330 
  1320 
  1331