ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 622 4b1149fb5aec
parent 620 ae6010c14e49
child 628 7af4e2420a8c
equal deleted inserted replaced
621:17c7611fb0a9 622:4b1149fb5aec
   848 What is unexpected is that even in the cases 
   848 What is unexpected is that even in the cases 
   849 not involving back-references, there is still
   849 not involving back-references, there is still
   850 a (non-negligible) chance they might backtrack super-linearly,
   850 a (non-negligible) chance they might backtrack super-linearly,
   851 as shown in the graphs in figure\ref{fig:aStarStarb}.
   851 as shown in the graphs in figure\ref{fig:aStarStarb}.
   852 
   852 
   853 \subsection{Summary of the Catastrophic Backtracking Problem}
       
   854 Summing these up, we can categorise existing 
   853 Summing these up, we can categorise existing 
   855 practical regex libraries into two kinds:
   854 practical regex libraries into two kinds:
   856 (i)The ones  with  linear
   855 (i) The ones  with  linear
   857 time guarantees like Go and Rust. The cost with them is that
   856 time guarantees like Go and Rust. The downside with them is that
   858 they impose restrictions
   857 they impose restrictions
   859 on the user input (not allowing back-references, 
   858 on the regular expressions (not allowing back-references, 
   860 bounded repetitions cannot exceed a counter limit etc.).
   859 bounded repetitions cannot exceed an ad hoc limit etc.).
   861 (ii) Those 
   860 (ii) Those 
   862 that allow large bounded regular expressions and back-references
   861 that allow large bounded regular expressions and back-references
   863 at the expense of using a backtracking algorithm.
   862 at the expense of using backtracking algorithms.
   864 They could grind to a halt
   863 They can potentially ``grind to a halt''
   865 on some very simple cases, posing a vulnerability of
   864 on some very simple cases, resulting 
   866 a ReDoS attack.
   865 ReDoS attacks.
   867 
   866 
   868  
   867 The proble with both approaches is the motivation for us 
   869 We would like to have regex engines that can 
   868 to look again at the regular expression matching problem. 
   870 deal with the regular part (e.g.
   869 Another motivation is that regular expression matching algorithms
   871 bounded repetitions) of regexes more
   870 that follow the POSIX standard often contain errors and bugs 
   872 efficiently.
   871 as we shall explain next.
   873 Also we want to make sure that they do it correctly.
   872 
   874 It turns out that such aim is not so easy to achieve.
   873 %We would like to have regex engines that can 
       
   874 %deal with the regular part (e.g.
       
   875 %bounded repetitions) of regexes more
       
   876 %efficiently.
       
   877 %Also we want to make sure that they do it correctly.
       
   878 %It turns out that such aim is not so easy to achieve.
   875  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
   879  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
   876 % For example, the Rust regex engine claims to be linear, 
   880 % For example, the Rust regex engine claims to be linear, 
   877 % but does not support lookarounds and back-references.
   881 % but does not support lookarounds and back-references.
   878 % The GoLang regex library does not support over 1000 repetitions.  
   882 % The GoLang regex library does not support over 1000 repetitions.  
   879 % Java and Python both support back-references, but shows
   883 % Java and Python both support back-references, but shows
   892 %, whereas $\NFA$s usually run into trouble
   896 %, whereas $\NFA$s usually run into trouble
   893 %on the second phase.
   897 %on the second phase.
   894 
   898 
   895 
   899 
   896 \section{Error-prone POSIX Implementations}
   900 \section{Error-prone POSIX Implementations}
   897 When there are multiple ways of matching a string
   901 Very often there are multiple ways of matching a string
   898 with a regular expression, a matcher needs to
   902 with a regular expression.
       
   903 In such cases the regular expressions matcher needs to
   899 disambiguate.
   904 disambiguate.
   900 The standard for which particular match to pick
   905 The more widely used strategy is called POSIX,
   901 is called the disambiguation strategy.
   906 which roughly speaking always chooses the longest initial match.
   902 The more intuitive strategy is called POSIX,
   907 The POSIX strategy is widely adopted in many regular expression matchers.
   903 which always chooses the longest initial match.
       
   904 An alternative strategy would be greedy matches,
       
   905 which always ends a sub-match as early as possible.
       
   906 The POSIX standard is widely adopted in many operating systems.
       
   907 However, many implementations (including the C libraries
   908 However, many implementations (including the C libraries
   908 used by Linux and OS X distributions) contain bugs
   909 used by Linux and OS X distributions) contain bugs
   909 or do not meet the specification they claim to adhere to.
   910 or do not meet the specification they claim to adhere to.
   910 In some cases, they either fail to generate a lexing 
   911 Kuklewicz maintains a unit test repository which lists some
       
   912 problems with existing regular expression engines \ref{KuklewiczHaskell}.
       
   913 In some cases, they either fail to generate a
   911 result when there exists a match,
   914 result when there exists a match,
   912 or give results that are inconsistent with the $\POSIX$ standard.
   915 or give results that are inconsistent with the POSIX standard.
   913 A concrete example would be the regex given by \cite{fowler2003}
   916 A concrete example is the regex:
   914 \begin{center}
   917 \begin{center}
   915 	$(aba + ab + a)^* \text{and the string} ababa$
   918 	$(aba + ab + a)^* \text{and the string} ababa$
   916 \end{center}
   919 \end{center}
   917 The correct $\POSIX$ match for the above would be 
   920 The correct POSIX match for the above
   918 with the entire string $ababa$, 
   921 is the entire string $ababa$, 
   919 split into two Kleene star iterations, $[ab] [aba]$ at positions
   922 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
   920 $[0, 2), [2, 5)$
   923 $[0, 2), [2, 5)$
   921 respectively.
   924 respectively.
   922 But trying this out in regex101\parencite{regex101}
   925 But trying this out in regex101 \parencite{regex101} \footnote{
   923 with different language engines would yield 
   926 	regex101 is an online regular expression matcher which
   924 the same two fragmented matches: $[aba]$ at $[0, 3)$
   927 	provides API for trying out regular expression
       
   928 	engines of multiple popular programming languages like
       
   929 Java, Python, Go, etc.}
       
   930 with different engines yields
       
   931 always matches: $[aba]$ at $[0, 3)$
   925 and $a$ at $[4, 5)$.
   932 and $a$ at $[4, 5)$.
   926 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
   933 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
   927 commented that most regex libraries are not
   934 commented that most regex libraries are not
   928 correctly implementing the POSIX (maximum-munch)
   935 correctly implementing the central POSIX
   929 rule of regular expression matching.
   936 rule, called the maximum munch rule.
   930 As Grathwohl\parencite{grathwohl2014crash} wrote,
   937 Grathwohl \parencite{grathwohl2014crash} wrote,
   931 \begin{quote}
   938 \begin{quote}
   932 	``The POSIX strategy is more complicated than the 
   939 	``The POSIX strategy is more complicated than the 
   933 	greedy because of the dependence on information about 
   940 	greedy because of the dependence on information about 
   934 	the length of matched strings in the various subexpressions.''
   941 	the length of matched strings in the various subexpressions.''
   935 \end{quote}
   942 \end{quote}
   936 %\noindent
   943 %\noindent
   937 The implementation complexity of POSIX rules also come from
   944 We think the implementation complexity of POSIX rules also come from
   938 the specification being not very clear.
   945 the specification being not very precise.
   939 There are many informal summaries of this disambiguation
   946 There are many informal summaries of this disambiguation
   940 strategy, which are often quite long and delicate.
   947 strategy, which are often quite long and delicate.
   941 For example Kuklewicz \cite{KuklewiczHaskell} 
   948 For example Kuklewicz \cite{KuklewiczHaskell} 
   942 described the POSIX rule as
   949 described the POSIX rule as (section 1, last paragraph):
   943 \begin{quote}
   950 \begin{quote}
   944 	``
       
   945 	\begin{itemize}
   951 	\begin{itemize}
   946 		\item
   952 		\item
   947 regular expressions (REs) take the leftmost starting match, and the longest match starting there
   953 regular expressions (REs) take the leftmost starting match, and the longest match starting there
   948 earlier subpatterns have leftmost-longest priority over later subpatterns\\
   954 earlier subpatterns have leftmost-longest priority over later subpatterns\\
   949 \item
   955 \item
   956 text of component subexpressions must be contained in the text of the 
   962 text of component subexpressions must be contained in the text of the 
   957 higher-level subexpressions\\
   963 higher-level subexpressions\\
   958 \item
   964 \item
   959 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
   965 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
   960 \item
   966 \item
   961 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\''
   967 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
   962 \end{itemize}
   968 \end{itemize}
   963 \end{quote}
   969 \end{quote}
   964 The text above 
   970 %The text above 
   965 is trying to capture something very precise,
   971 %is trying to capture something very precise,
   966 and is crying out for formalising.
   972 %and is crying out for formalising.
   967 Ausaf et al. \cite{AusafDyckhoffUrban2016}
   973 Ausaf et al. \cite{AusafDyckhoffUrban2016}
   968 are the first to fill the gap
   974 are the first to
   969 by not just describing such a formalised POSIX
   975 give a quite simple formalised POSIX
   970 specification in Isabelle/HOL, but also proving
   976 specification in Isabelle/HOL, and also prove 
   971 that their specification coincides with the 
   977 that their specification coincides with the 
   972 POSIX specification given by Okui and Suzuki \cite{Okui10} 
   978 POSIX specification given by Okui and Suzuki \cite{Okui10}.
   973 which is a completely
       
   974 different characterisation.
       
   975 They then formally proved the correctness of
   979 They then formally proved the correctness of
   976 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
   980 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
   977 based on that specification.
   981 with regards to that specification.
   978 
   982 They also found that the informal POSIX
   979 In the next section we will very briefly
   983 specification by Sulzmann and Lu does not work for the correctness proof.
       
   984 
       
   985 In the next section we will briefly
   980 introduce Brzozowski derivatives and Sulzmann
   986 introduce Brzozowski derivatives and Sulzmann
   981 and Lu's algorithm, which this thesis builds on.
   987 and Lu's algorithm, which the main point of this thesis builds on.
   982 We give a taste of what they 
   988 %We give a taste of what they 
   983 are like and why they are suitable for regular expression
   989 %are like and why they are suitable for regular expression
   984 matching and lexing.
   990 %matching and lexing.
   985  
   991 \section{Formal Specification of POSIX Matching 
   986 \section{Our Solution--Formal Specification of POSIX Matching 
       
   987 and Brzozowski Derivatives}
   992 and Brzozowski Derivatives}
   988 Now we start with the central topic of the thesis: Brzozowski derivatives.
   993 %Now we start with the central topic of the thesis: Brzozowski derivatives.
   989 Brzozowski \cite{Brzozowski1964} first introduced the 
   994 Brzozowski \cite{Brzozowski1964} first introduced the 
   990 concept of the \emph{derivative} in the 1960s.
   995 concept of a \emph{derivative} of regular expression in 1964.
   991 The derivative of a regular expression $r$
   996 The derivative of a regular expression $r$
   992 with respect to a character $c$, is written as $r \backslash c$.\footnote{
   997 with respect to a character $c$, is written as $r \backslash c$.
   993 	Despite having the same name, regular expression
   998 This operation tells us what $r$ transforms into
   994 	derivatives bear little similarity with the mathematical definition
   999 if we ``chop'' off the first character $c$ 
   995 	of derivatives on functions.
  1000 from all strings in the language of $r$ (defined
   996 }
  1001 later as $L \; r$).
   997 It tells us what $r$ would transform into
  1002 %To give a flavour of Brzozowski derivatives, we present
   998 if we chop off the first character $c$ 
  1003 %two straightforward clauses from it:
   999 from all strings in the language of $r$ ($L \; r$).
  1004 %\begin{center}
  1000 To give a flavour of Brzozowski derivatives, we present
  1005 %	\begin{tabular}{lcl}
  1001 two straightforward clauses from it:
  1006 %		$d \backslash c$     & $\dn$ & 
  1002 \begin{center}
  1007 %		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
  1003 	\begin{tabular}{lcl}
  1008 %$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
  1004 		$d \backslash c$     & $\dn$ & 
  1009 %	\end{tabular}
  1005 		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
  1010 %\end{center}
  1006 $(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
  1011 %\noindent
  1007 	\end{tabular}
  1012 %The first clause says that for the regular expression
  1008 \end{center}
  1013 %denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
  1009 \noindent
  1014 %we check the derivative character $c$ against $d$,
  1010 The first clause says that for the regular expression
  1015 %returning a set containing only the empty string $\{ [] \}$
  1011 denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
  1016 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
  1012 we check the derivative character $c$ against $d$,
  1017 %The second clause states that to obtain the regular expression
  1013 returning a set containing only the empty string $\{ [] \}$
  1018 %representing all strings' head character $c$ being chopped off
  1014 if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
  1019 %from $r_1 + r_2$, one simply needs to recursively take derivative
  1015 The second clause states that to obtain the regular expression
  1020 %of $r_1$ and $r_2$ and then put them together.
  1016 representing all strings' head character $c$ being chopped off
  1021 Derivatives have the property
  1017 from $r_1 + r_2$, one simply needs to recursively take derivative
       
  1018 of $r_1$ and $r_2$ and then put them together.
       
  1019 
       
  1020 Thanks to the definition, derivatives have the nice property
       
  1021 that $s \in L \; (r\backslash c)$ if and only if 
  1022 that $s \in L \; (r\backslash c)$ if and only if 
  1022 $c::s \in L \; r$.
  1023 $c::s \in L \; r$ where $::$ stands for list prepending.
  1023 %This property can be used on regular expressions
  1024 %This property can be used on regular expressions
  1024 %matching and lexing--to test whether a string $s$ is in $L \; r$,
  1025 %matching and lexing--to test whether a string $s$ is in $L \; r$,
  1025 %one simply takes derivatives of $r$ successively with
  1026 %one simply takes derivatives of $r$ successively with
  1026 %respect to the characters (in the correct order) in $s$,
  1027 %respect to the characters (in the correct order) in $s$,
  1027 %and then test whether the empty string is in the last regular expression.
  1028 %and then test whether the empty string is in the last regular expression.
  1028 Derivatives give a simple solution
  1029 With this derivatives give a simple solution
  1029 to the problem of matching and lexing a string $s$ with a regular
  1030 to the problem of matching a string $s$ with a regular
  1030 expression $r$: if the derivative of $r$ w.r.t.\ (in
  1031 expression $r$: if the derivative of $r$ w.r.t.\ (in
  1031 succession) all the characters of the string matches the empty string,
  1032 succession) all the characters of the string matches the empty string,
  1032 then $r$ matches $s$ (and {\em vice versa}).  
  1033 then $r$ matches $s$ (and {\em vice versa}).  
  1033 
  1034 %This makes formally reasoning about these properties such
  1034 This makes formally reasoning about these properties such
  1035 %as correctness and complexity smooth and intuitive.
  1035 as correctness and complexity smooth and intuitive.
  1036 There had been several mechanised proofs about this property in various theorem
  1036 In fact, there has already been several mechanised proofs about them,
  1037 provers,
  1037 for example the one by Owens and Slind \cite{Owens2008} in HOL4,
  1038 for example one by Owens and Slind \cite{Owens2008} in HOL4,
  1038 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
  1039 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
  1039 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
  1040 yet another in Coq by Coquand and Siles \cite{Coquand2012}.
  1040 
  1041 
  1041 In addition, one can extend the clauses to bounded repetitions
  1042 In addition, one can extend derivatives to bounded repetitions
  1042 ``for free'':
  1043 relatively straightforwardly. For example, the derivative for 
       
  1044 this can be defined as:
  1043 \begin{center}
  1045 \begin{center}
  1044 	\begin{tabular}{lcl}
  1046 	\begin{tabular}{lcl}
  1045 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
  1047 		$r^{\{n\}} \backslash c$     & $\dn$ & $r \backslash c \cdot
  1046 		r^{\{n-1\}}$\\
  1048 		r^{\{n-1\}}$\\
  1047 	\end{tabular}
  1049 	\end{tabular}
  1048 \end{center}
  1050 \end{center}
  1049 \noindent
  1051 \noindent
  1050 And experimental results suggest that  unlike DFA-based solutions,
  1052 Experimental results suggest that  unlike DFA-based solutions
  1051 this derivatives can support 
  1053 for bounded regular expressions,
  1052 bounded regular expressions with large counters
  1054 derivatives can cope
       
  1055 large counters
  1053 quite well.
  1056 quite well.
  1054 
  1057 
  1055 There has also been 
  1058 There has also been 
  1056 extensions to other constructs.
  1059 extensions to other constructs.
  1057 For example, Owens et al include the derivatives
  1060 For example, Owens et al include the derivatives
  1058 for \emph{NOT} regular expressions, which is
  1061 for the \emph{NOT} regular expression, which is
  1059 able to concisely express C-style comments of the form
  1062 able to concisely express C-style comments of the form
  1060 $/* \ldots */$.
  1063 $/* \ldots */$ (see \cite{Owens2008}).
  1061 Another extension for derivatives would be
  1064 Another extension for derivatives is
  1062 regular expressions with look-aheads, done by
  1065 regular expressions with look-aheads, done
  1063 by Miyazaki and Minamide
  1066 by Miyazaki and Minamide
  1064 \cite{Takayuki2019}.
  1067 \cite{Takayuki2019}.
  1065 %We therefore use Brzozowski derivatives on regular expressions 
  1068 %We therefore use Brzozowski derivatives on regular expressions 
  1066 %lexing 
  1069 %lexing 
  1067 
  1070 
  1077 in the functional programming and theorem prover communities in the last
  1080 in the functional programming and theorem prover communities in the last
  1078 fifteen or so years (
  1081 fifteen or so years (
  1079 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
  1082 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
  1080 \cite{Chen12} and \cite{Coquand2012}
  1083 \cite{Chen12} and \cite{Coquand2012}
  1081 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
  1084 to name a few), despite being buried in the ``sands of time'' \cite{Owens2008}
  1082 after they were first published.
  1085 after they were first published by Brzozowski.
  1083 
  1086 
  1084 
  1087 
  1085 However, there are two difficulties with derivative-based matchers:
  1088 However, there are two difficulties with derivative-based matchers:
  1086 First, Brzozowski's original matcher only generates a yes/no answer
  1089 First, Brzozowski's original matcher only generates a yes/no answer
  1087 for whether a regular expression matches a string or not.  This is too
  1090 for whether a regular expression matches a string or not.  This is too
  1098 is actually $\Omega(2^n)$ in the worst case.
  1101 is actually $\Omega(2^n)$ in the worst case.
  1099 A similar claim about a theoretical runtime of $O(n^2)$ 
  1102 A similar claim about a theoretical runtime of $O(n^2)$ 
  1100 is made for the Verbatim \cite{Verbatim}
  1103 is made for the Verbatim \cite{Verbatim}
  1101 %TODO: give references
  1104 %TODO: give references
  1102 lexer, which calculates POSIX matches and is based on derivatives.
  1105 lexer, which calculates POSIX matches and is based on derivatives.
  1103 They formalized the correctness of the lexer, but not the complexity.
  1106 They formalized the correctness of the lexer, but not their complexity result.
  1104 In the performance evaluation section, they simply analyzed the run time
  1107 In the performance evaluation section, they analyzed the run time
  1105 of matching $a$ with the string 
  1108 of matching $a$ with the string 
  1106 \begin{center}
  1109 \begin{center}
  1107 	$\underbrace{a \ldots a}_{\text{n a's}}$
  1110 	$\underbrace{a \ldots a}_{\text{n a's}}$.
  1108 \end{center}
  1111 \end{center}
  1109 and concluded that the algorithm is quadratic in terms of input length.
  1112 \noindent
       
  1113 They concluded that the algorithm is quadratic in terms of 
       
  1114 the length of the input string.
  1110 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
  1115 When we tried out their extracted OCaml code with our example $(a+aa)^*$,
  1111 the time it took to lex only 40 $a$'s was 5 minutes.
  1116 the time it took to match a string of 40 $a$'s was approximately 5 minutes.
  1112 
  1117 
  1113 
  1118 
  1114 \subsection{Sulzmann and Lu's Algorithm}
  1119 \subsection{Sulzmann and Lu's Algorithm}
  1115 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
  1120 Sulzmann and Lu~\cite{Sulzmann2014} overcame the first 
  1116 difficulty by cleverly extending Brzozowski's matching
  1121 problem with the yes/no answer 
       
  1122 by cleverly extending Brzozowski's matching
  1117 algorithm. Their extended version generates additional information on
  1123 algorithm. Their extended version generates additional information on
  1118 \emph{how} a regular expression matches a string following the POSIX
  1124 \emph{how} a regular expression matches a string following the POSIX
  1119 rules for regular expression matching. They achieve this by adding a
  1125 rules for regular expression matching. They achieve this by adding a
  1120 second ``phase'' to Brzozowski's algorithm involving an injection
  1126 second ``phase'' to Brzozowski's algorithm involving an injection
  1121 function simplification of internal data structures 
  1127 function.
  1122 eliminating the exponential behaviours.
  1128 In earlier work, Ausaf et al provided the formal
  1123 In an earlier work, Ausaf et al provided the formal
       
  1124 specification of what POSIX matching means and proved in Isabelle/HOL
  1129 specification of what POSIX matching means and proved in Isabelle/HOL
  1125 the correctness
  1130 the correctness
  1126 of Sulzmann and Lu's extended algorithm accordingly
  1131 of this extended algorithm accordingly
  1127 \cite{AusafDyckhoffUrban2016}.
  1132 \cite{AusafDyckhoffUrban2016}.
  1128 
  1133 
  1129 The version of the algorithm proven correct 
  1134 The version of the algorithm proven correct 
  1130 suffers from the
  1135 suffers heavily from a 
  1131 second difficulty though, where the internal derivatives can
  1136 second difficulty, where the internal derivatives can
  1132 grow to arbitrarily big sizes. 
  1137 grow to arbitrarily big sizes. 
  1133 For example if we start with the
  1138 For example if we start with the
  1134 regular expression $(a+aa)^*$ and take
  1139 regular expression $(a+aa)^*$ and take
  1135 successive derivatives according to the character $a$, we end up with
  1140 successive derivatives according to the character $a$, we end up with
  1136 a sequence of ever-growing derivatives like 
  1141 a sequence of ever-growing derivatives like 
  1145 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
  1150 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
  1146 \end{tabular}
  1151 \end{tabular}
  1147 \end{center}
  1152 \end{center}
  1148  
  1153  
  1149 \noindent where after around 35 steps we run out of memory on a
  1154 \noindent where after around 35 steps we run out of memory on a
  1150 typical computer (we shall define in the next chapter 
  1155 typical computer.  Clearly, the
  1151 the precise details of our
       
  1152 regular expressions and the derivative operation).  Clearly, the
       
  1153 notation involving $\ZERO$s and $\ONE$s already suggests
  1156 notation involving $\ZERO$s and $\ONE$s already suggests
  1154 simplification rules that can be applied to regular regular
  1157 simplification rules that can be applied to regular regular
  1155 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
  1158 expressions, for example $\ZERO{}\,r \Rightarrow \ZERO$, $\ONE{}\,r
  1156 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
  1159 \Rightarrow r$, $\ZERO{} + r \Rightarrow r$ and $r + r \Rightarrow
  1157 r$. While such simple-minded simplifications have been proved in our
  1160 r$. While such simple-minded simplifications have been proved in 
  1158 earlier work to preserve the correctness of Sulzmann and Lu's
  1161 the work by Ausaf et al. to preserve the correctness of Sulzmann and Lu's
  1159 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
  1162 algorithm \cite{AusafDyckhoffUrban2016}, they unfortunately do
  1160 \emph{not} help with limiting the growth of the derivatives shown
  1163 \emph{not} help with limiting the growth of the derivatives shown
  1161 above: the growth is slowed, but the derivatives can still grow rather
  1164 above: the growth is slowed, but the derivatives can still grow rather
  1162 quickly beyond any finite bound.
  1165 quickly beyond any finite bound.
  1163 
  1166 
  1164 Sulzmann and Lu overcome this ``growth problem'' in a second algorithm
  1167 Therefore we want to look in this thesis at a second
  1165 \cite{Sulzmann2014} where they introduce bit-coded
  1168 algorithm by Sulzmann and Lu where they
  1166 regular expressions. In this version, POSIX values are
  1169 overcame this ``growth problem'' \cite{Sulzmann2014}.
       
  1170 In this version, POSIX values are 
  1167 represented as bit sequences and such sequences are incrementally generated
  1171 represented as bit sequences and such sequences are incrementally generated
  1168 when derivatives are calculated. The compact representation
  1172 when derivatives are calculated. The compact representation
  1169 of bit sequences and regular expressions allows them to define a more
  1173 of bit sequences and regular expressions allows them to define a more
  1170 ``aggressive'' simplification method that keeps the size of the
  1174 ``aggressive'' simplification method that keeps the size of the
  1171 derivatives finite no matter what the length of the string is.
  1175 derivatives finite no matter what the length of the string is.
  1182   method [..] in combination with the simplification steps [..]
  1186   method [..] in combination with the simplification steps [..]
  1183   yields POSIX parse trees. We have tested this claim
  1187   yields POSIX parse trees. We have tested this claim
  1184   extensively [..] but yet
  1188   extensively [..] but yet
  1185   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
  1189   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
  1186 \end{quote}  
  1190 \end{quote}  
  1187 Ausaf and Urban were able to back this correctness claim with
  1191 Ausaf and Urban made some initial progress towards the 
  1188 a formal proof.
  1192 full correctness proof but still had to leave out the optimisation
  1189 
  1193 Sulzmann and Lu proposed.
  1190 However a faster formally verified 
  1194 Ausaf  wrote \cite{Ausaf},
  1191 lexing program with the optimisations
       
  1192 mentioned by Sulzmann and Lu's second algorithm
       
  1193 is still missing.
       
  1194 As they stated,
       
  1195   \begin{quote}\it
  1195   \begin{quote}\it
  1196 ``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
  1196 ``The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.''
  1197 \end{quote}  
  1197 \end{quote}  
  1198 This thesis implements the aggressive simplifications envisioned
  1198 This thesis implements the aggressive simplifications envisioned
  1199 by Ausaf and Urban,
  1199 by Ausaf and Urban,
  1200 together with a formal proof of the correctness with those simplifications.
  1200 together with a formal proof of the correctness of those simplifications.
  1201 
  1201 
  1202 
  1202 
  1203 One of the most recent work in the context of lexing
  1203 One of the most recent work in the context of lexing
  1204 %with this issue
  1204 %with this issue
  1205 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
  1205 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
  1237 remain finite,
  1237 remain finite,
  1238 \item
  1238 \item
  1239 and extension to
  1239 and extension to
  1240 the bounded repetitions construct with the correctness and finiteness property
  1240 the bounded repetitions construct with the correctness and finiteness property
  1241 maintained.
  1241 maintained.
  1242  \end{itemize}
  1242 \end{itemize}
  1243 
  1243 \noindent
  1244 
       
  1245 With a formal finiteness bound in place,
  1244 With a formal finiteness bound in place,
  1246 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
  1245 we can greatly reduce the attack surface of servers in terms of ReDoS attacks.
  1247 Further improvements to the algorithm with an even stronger version of 
  1246 Further improvements to the algorithm with an even stronger version of 
  1248 simplification is made.
  1247 simplification can be made.
  1249 Thanks to our theorem-prover-friendly approach,
  1248 
  1250 we believe that 
  1249 
  1251 this finiteness bound can be improved to a bound
  1250 
  1252 linear to input and
  1251 
  1253 cubic to the regular expression size using a technique by
  1252 
  1254 Antimirov\cite{Antimirov95}.
  1253 \section{Structure of the thesis}
  1255 Once formalised, this would be a guarantee for the absence of all super-linear behavious.
  1254 In chapter \ref{Inj} we will introduce the concepts
  1256 We are working out the
  1255 and notations we 
  1257 details.
  1256 use for describing regular expressions and derivatives,
  1258 
  1257 and the first version of their lexing algorithm without bitcodes (including 
       
  1258 its correctness proof).
       
  1259 We will give their second lexing algorithm with bitcodes in \ref{Bitcoded1}
       
  1260 together with the correctness proof by Ausaf and Urban.
       
  1261 Then we illustrate in chapter \ref{Bitcoded2}
       
  1262 how Sulzmann and Lu's
       
  1263 simplifications fail to simplify. We therefore introduce our version of the
       
  1264 algorithm with simplification and 
       
  1265 its correctness proof .  
       
  1266 In chapter \ref{Finite} we give the second guarantee
       
  1267 of our bitcoded algorithm, that is a finite bound on the size of any 
       
  1268 regular expression's derivatives.
       
  1269 In chapter \ref{Cubic} we discuss stronger simplification rules which 
       
  1270 improve the finite bound to a polynomial bound, and also show how one can extend the
       
  1271 algorithm to include bounded repetitions. %and the NOT regular expression.
  1259  
  1272  
  1260 To our best knowledge, no lexing libraries using Brzozowski derivatives
       
  1261 have similar complexity-related bounds, 
       
  1262 and claims about running time are usually speculative and backed by empirical
       
  1263 evidence on a few test cases.
       
  1264 If a matching or lexing algorithm
       
  1265 does not come with certain basic complexity related 
       
  1266 guarantees (for examaple the internal data structure size
       
  1267 does not grow indefinitely), 
       
  1268 then they cannot claim with confidence having solved the problem
       
  1269 of catastrophic backtracking.
       
  1270 
       
  1271 
       
  1272 
       
  1273 
       
  1274 
       
  1275 \section{Structure of the thesis}
       
  1276 In chapter 2 \ref{Inj} we will introduce the concepts
       
  1277 and notations we 
       
  1278 use for describing the lexing algorithm by Sulzmann and Lu,
       
  1279 and then give the lexing algorithm.
       
  1280 We will give its variant in \ref{Bitcoded1}.
       
  1281 Then we illustrate in \ref{Bitcoded2}
       
  1282 how the algorithm without bitcodes falls short for such aggressive 
       
  1283 simplifications and therefore introduce our version of the
       
  1284  bit-coded algorithm and 
       
  1285 its correctness proof .  
       
  1286 In \ref{Finite} we give the second guarantee
       
  1287 of our bitcoded algorithm, that is a finite bound on the size of any 
       
  1288 regex's derivatives.
       
  1289 In \ref{Cubic} we discuss stronger simplifications to improve the finite bound
       
  1290 in \ref{Finite} to a polynomial one, and demonstrate how one can extend the
       
  1291 algorithm to include constructs such as bounded repetitions and negations.
       
  1292  
       
  1293 
  1273 
  1294 
  1274 
  1295 
  1275 
  1296 
  1276 
  1297 %----------------------------------------------------------------------------------------
  1277 %----------------------------------------------------------------------------------------