% Chapter Template

\chapter{Conclusion and Future Work} % Main chapter title

\label{Future}

In this thesis, in order to address the problem of ReDoS attacks
once and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
lexing algorithm with aggressive simplifications \cite{Sulzmann2014}.
We formalised our proofs in the Isabelle/HOL theorem prover.
We have fixed some inefficiencies and a bug in their original simplification
algorithm, and established the correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al.\ in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because the simplification function alters the structure of the simplified
regular expressions.
Despite having to try out several workarounds and being
stuck for months looking for proofs,
we were delighted to have discovered a simple yet
effective proof method:
modelling individual
simplification steps as small-step rewriting rules and proving
equivalence between terms linked by these rewrite rules.
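To give a flavour of this method (shown here only schematically, with plain
regular expressions and an ad-hoc rewrite arrow $\leadsto$ rather than the
exact annotated rules of chapter \ref{Bitcoded2}), the rewrite relation
contains rules such as removing $\mathbf{0}$s and duplicates inside
alternatives, and the main lemmas have roughly the following shape:
\begin{itemize}
	\item $r \leadsto^* \textit{simp}\;r$ (every simplification is a sequence of small rewrite steps);
	\item $r_1 \leadsto r_2$ implies $r_1\backslash c \leadsto^* r_2\backslash c$ (rewriting is compatible with taking derivatives);
	\item $r_1 \leadsto^* r_2$ implies that $r_1$ and $r_2$ are nullable at the same time and, when they are, yield the same lexing value.
\end{itemize}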

In addition, we have proved a formal size bound on the
regular expressions. The technique was to establish
``closed forms'', informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
for the derivatives of compound regular expressions,
and to use those closed forms to control the size.
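As an illustration of what such a closed form looks like, consider the
sequence case: stated only up to the similarity induced by the simplification
rules (and ignoring bitcodes), the derivative of a sequence with respect to a
string $s$ satisfies
\[
(r_1 \cdot r_2)\backslash s \;\sim\;
  (r_1\backslash s)\cdot r_2 \;+\;
  \sum\,\{\, r_2\backslash s'' \;\mid\; s = s'\,@\,s''\,,\;\; s'' \neq [\,]\,,\;\;
  \textit{nullable}\,(r_1\backslash s')\,\}.
\]
The size of the right-hand side is governed by the sizes of the derivatives of
$r_1$ and $r_2$ alone, which is what makes an inductive size bound possible.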

The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
This finiteness bound can be improved to a bound that is
linear in the input and
cubic in the regular expression size, using a technique by
Antimirov \cite{Antimirov95}.
Once formalised, this would guarantee the absence of all super-linear behaviour.
We have yet to work out the
details.
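The key fact from \cite{Antimirov95} that we intend to exploit is that the set
of Antimirov's \emph{partial} derivatives of a regular expression, taken over
all strings, is finite and small: writing $\textit{pders}\,(s, r)$ for the set
of partial derivatives of $r$ with respect to a string $s$, one has
\[
\Big|\bigcup_{s}\, \textit{pders}\,(s, r)\Big| \;\leq\; \textit{awidth}\,(r) + 1,
\]
where $\textit{awidth}\,(r)$ counts the occurrences of alphabet characters in
$r$, and the language of every derivative $r\backslash s$ is the union of the
languages of its partial derivatives. The hope is that simplifications which
are aware of this structure keep every derivative within a polynomially-sized
collection of such terms.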
       
Our formalisation is approximately 7500 lines of code. Slightly more than half
of this concerns the finiteness bound obtained in chapter \ref{Finite}. This
slight ``bloat'' is because we had to repeat many definitions for the rrexp
datatype. However, we think we would not have been able to formalise this
quite intricate size reasoning directly with annotated regular expressions
(rather than with rrexps), because we would have had to carry around the
bit-sequences (which are of course necessary in the algorithm) even though
they do not contribute to the size bound of the calculated derivatives.
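To make the duplication concrete, here is a small Scala rendering of the two
datatypes involved; this is only a sketch for illustration (the formalisation
itself is in Isabelle/HOL, and the constructor and function names below are
merely suggestive). Annotated regular expressions carry a bit-sequence in
every node, whereas rrexps erase them, so the size reasoning never has to
mention bits.
\begin{verbatim}
// Bit-annotated regular expressions, as used by the lexer itself.
sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Boolean]) extends ARexp
case class ACHAR(bs: List[Boolean], c: Char) extends ARexp
case class AALTS(bs: List[Boolean], rs: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Boolean], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Boolean], r: ARexp) extends ARexp

// The same shape with all bit-sequences erased; the size bound is
// stated and reasoned about on this datatype only.
sealed trait RRexp
case object RZERO extends RRexp
case object RONE extends RRexp
case class RCHAR(c: Char) extends RRexp
case class RALTS(rs: List[RRexp]) extends RRexp
case class RSEQ(r1: RRexp, r2: RRexp) extends RRexp
case class RSTAR(r: RRexp) extends RRexp

// Erasure: forget the annotations but keep the structure that
// determines the size of a regular expression.
def erase(a: ARexp): RRexp = a match {
  case AZERO           => RZERO
  case AONE(_)         => RONE
  case ACHAR(_, c)     => RCHAR(c)
  case AALTS(_, rs)    => RALTS(rs.map(erase))
  case ASEQ(_, r1, r2) => RSEQ(erase(r1), erase(r2))
  case ASTAR(_, r)     => RSTAR(erase(r))
}
\end{verbatim}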

To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed up only by empirical
evidence on some test cases.
If a matching or lexing algorithm
does not come with complexity-related
guarantees (for example that the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.

We believe our proof method is not specific to this
algorithm, and intend to extend this approach
to prove the correctness of the faster version
of the lexer proposed in chapter \ref{Cubic}. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95},
we plan to reduce the size bound to one that is polynomial
with respect to the size of the regular expressions.
%We have tried out a variety of other optimisations 
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.

We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea with rewriting relations
was the hardest part for us, as it required a deep understanding of what information
is preserved during the simplification process.
For this, the ideas given by Sulzmann and Lu and by Ausaf et al. were of no use.
%Having the first formalisation gives us much more confidence----it took
%us much less time to obtain the closed forms and size bound, although they
%are more involved than the correctness proof.
Of course this has already been demonstrated many times:
the usefulness of formal approaches cannot be overstated.
They not only allowed us to find bugs in Sulzmann and Lu's simplification functions,
but also helped us set up realistic expectations about the performance
of algorithms. We believed in the beginning that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that the size $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ a small constant factor.
We were then quite surprised that this did not go through despite a lot of effort.
This led us to discover the example in chapter \ref{Finite} where this constant
factor can in fact be exponentially large.
We therefore learned the necessity to introduce the stronger simplifications
in chapter \ref{Cubic}.
Without formal proofs we would not have found this out so soon,
if at all.


%In the future, we plan to incorporate many extensions
%to this work.

\section{Future Work}
The most obvious next step is to formally establish the cubic bound and
the correctness of $\blexerStrong$
from chapter \ref{Cubic}.
A cubic bound ($O(\llbracket r\rrbracket^3\cdot |s|)$) with respect to
the regular expression size would get us one step closer to fulfilling
the linear complexity claim made by Sulzmann and Lu.

With such a size bound (linear in the length of the input) established theoretically,
the next challenge would be
to generate code that is competitive with respect to
matchers based on DFAs or NFAs. For that, a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just like Darragh \cite{Darragh2020} did in their parsing algorithm, we could represent
\[
	x\cdot r + y \cdot r + \ldots
\]
as a single structure in which the common suffix $r$ is stored only once and
shared among multiple focuses.
Moreover, one could check whether subexpressions denote the same
language using
an efficient and verified equivalence checker like \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
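To illustrate the sharing idea, the following is a minimal sketch of one
possible encoding (it is not Darragh's data structure, nor necessarily the one
we will adopt; all names are illustrative): a ``multi-focus'' node keeps the
common suffix once and pairs it with the list of distinct prefixes.
\begin{verbatim}
// A toy regular-expression datatype, just enough for the illustration.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// One shared suffix, many focused prefixes:
// SEQ(x, r) + SEQ(y, r) + ... becomes MultiFocus(List(x, y, ...), r),
// so r is stored (and simplified) only once.
case class MultiFocus(focuses: List[Rexp], shared: Rexp)

// Flattening recovers the ordinary alternative-of-sequences reading.
def flatten(z: MultiFocus): Rexp =
  ALTS(z.focuses.map(f => SEQ(f, z.shared)))
\end{verbatim}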

Traditional automata-based approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer only deals with a single regular expression at a time.
Extending this to multiple regular expressions might open up more
possibilities for simplification.

As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can also accelerate the
matching speed. It would be interesting to study the memory
bandwidth of our derivative-based matching algorithm and improve it accordingly.

Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore the performance enhancements offered by memoization in our algorithm
in a correctness-preserving way.
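As a minimal sketch of this idea (shown on a plain matcher rather than on the
bit-coded lexer, and with purely illustrative names), one can memoise the
derivative operation itself, so that structurally identical sub-derivatives
are computed only once during a run:
\begin{verbatim}
import scala.collection.mutable

sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// Cache shared across a whole run: structurally equal derivative
// queries are answered from the table instead of being recomputed.
val cache = mutable.HashMap.empty[(Rexp, Char), Rexp]

def der(c: Char, r: Rexp): Rexp = cache.getOrElseUpdate((r, c), r match {
  case ZERO | ONE  => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
})

def ders(s: List[Char], r: Rexp): Rexp =
  s.foldLeft(r)((acc, c) => der(c, acc))

def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}
The interesting (and delicate) part is to prove that adding such a cache, or
memoising the bit-coded derivatives, does not change the POSIX value the lexer
produces.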

The monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
algorithm also seems quite relevant for such an enterprise.
We aim to learn from their refinement framework,
which generates high-performance code with proofs
that can be broken down into small steps.



%We establish the correctness