% Chapter Template

\chapter{Conclusion and Future Work} % Main chapter title

\label{Future}

In this thesis, in order to eliminate ReDoS attacks
once and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
lexing algorithm with aggressive simplifications \cite{Sulzmann2014}.
We formalised our proofs in the Isabelle/HOL theorem prover.
We have fixed some inefficiencies and a bug in their original simplification
algorithm, and established the correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al. in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because the simplification function changes the structure of the simplified
regular expressions.
Despite having to try out several workarounds and being
stuck for months looking for proofs,
we were delighted to have discovered a simple yet
effective proof method: we model individual
simplification steps as small-step rewrite rules and prove
equivalence between terms linked by these rewrite rules.
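For instance, two typical simplification steps, shown here schematically on
plain regular expressions (our formalisation states them on bit-coded ones),
are the removal of $\mathbf{0}$ alternatives and the deduplication of
alternatives:
\[
r + \mathbf{0} \;\rightsquigarrow\; r
\qquad\qquad
r + r \;\rightsquigarrow\; r
\]
The idea is that each rewrite step preserves the result computed by the
lexer, so equivalence lifts to entire simplification runs.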

In addition, we have proved a formal size bound on the
regular expressions generated during lexing. The technique was to establish
a ``closed form'', informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
for the derivatives of compound regular expressions,
and to use those closed forms to control the size.
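As an illustration, the closed form for sequences has roughly the following
shape (stated here only up to language equivalence $\sim$, eliding the
simplification function that is wrapped around both sides):
\[
(r_1 \cdot r_2)\backslash s \;\sim\;
(r_1\backslash s)\cdot r_2 \;+\;
\sum \,\{\, r_2\backslash s_2 \;\mid\; s = s_1\,@\,s_2,\; s_1 \in L(r_1) \,\}
\]
The sum on the right has at most $|s|+1$ summands, each a derivative of the
smaller expression $r_2$, which, after simplification, is what allows the
size of repeated derivatives to be controlled.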
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
We conjecture that
this finiteness bound can be improved to a bound that is
linear in the size of the input and
cubic in the size of the regular expression, using a technique by
Antimirov \cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all super-linear behaviour.
We have yet to work out the
details.
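The fact from Antimirov's work that we hope to exploit is that the set of
partial derivatives of a regular expression $r$, taken with respect to
\emph{all} strings, is finite and small:
\[
\Big|\bigcup_{s \in \Sigma^*} \textit{pder}\; s \; r\Big|
\;\leq\; \llbracket r \rrbracket_{\Sigma} + 1
\]
where $\llbracket r \rrbracket_{\Sigma}$ (our notation here) counts the
occurrences of alphabet symbols in $r$. Keeping derivatives as sets of
partial derivatives should therefore tame their growth.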

Our formalisation is approximately 7500 lines of code. Slightly more than
half of this concerns the finiteness bound obtained in chapter \ref{Finite}.
This slight ``bloat'' is because we had to repeat many definitions for the
rrexp datatype. However, we think we would not have been able to formalise
the quite intricate reasoning directly with annotated regular expressions,
because we would have had to carry around the bit-sequences (that are of
course necessary in the algorithm) even though they do not contribute to
the size bound of the calculated derivatives.

To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed up only by
empirical evidence on some test cases.
If a matching or lexing algorithm
does not come with complexity-related
guarantees (for example that the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.

We believe our proof method is not specific to this
algorithm, and intend to extend this approach
to prove the correctness of the faster version
of the lexer proposed in chapter \ref{Cubic}. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95},
we plan to reduce the size bound to be just polynomial
with respect to the size of the regular expressions.
%We have tried out a variety of other optimisations
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.

We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea with rewriting relations
was the hardest part for us, as it required a deep understanding of what
information is preserved during the simplification process.
There the ideas given by Sulzmann and Lu and by Ausaf et al. were of no use.
%Having the first formalisation gives us much more confidence----it took
%us much less time to obtain the closed forms and size bound, although they
%are more involved than the correctness proof.
As has been demonstrated many times before,
the usefulness of formal approaches cannot be overstated:
they not only allowed us to find bugs in Sulzmann and Lu's simplification
functions, but also helped us set up realistic expectations about the
performance of algorithms. We believed in the beginning that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that the size $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ a small constant factor.
We were then quite surprised that this did not go through despite a lot of effort.
This led us to discover the example in chapter \ref{Finite} where
$\llbracket r\rrbracket^c$ can in fact be exponentially large.
We therefore learned the necessity of introducing the stronger simplifications
of chapter \ref{Cubic}.
Without formal proofs we would not have found this out so soon,
if at all.

%In the future, we plan to incorporate many extensions
%to this work.

\section{Future Work}
The most obvious next step is to formally verify the cubic bound and
the correctness of $\blexerStrong$
from chapter \ref{Cubic}.
A cubic bound ($O(\llbracket r\rrbracket^3\cdot |s|)$) with respect to the
regular expression size will get us one step closer to fulfilling
the linear complexity claim made by Sulzmann and Lu.

With a linear size bound established theoretically, the next challenge would be
to generate code that is competitive with respect to
matchers based on DFAs or NFAs. For that a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just like Darragh \cite{Darragh2020} did in their parsing algorithm, we could
represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x + y + \ldots) \cdot r,
\]
where the common suffix $r$ is stored only once (a rough sketch of this
representation is given below).
Another possibility is to check whether two regular expressions denote the same
language using
an efficient and verified equivalence checker like \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
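
To make the sharing idea concrete, below is a minimal, hypothetical sketch in
Scala of how alternatives with a common suffix could be represented (the
datatype and names are ours for illustration; this is not the verified lexer):
\begin{verbatim}
// Ordinary regular expressions.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// x.r + y.r + ... stored as the focuses x, y, ... plus ONE shared
// copy of the suffix r, instead of one copy of r per alternative.
case class SharedSuffix(focuses: List[Rexp], suffix: Rexp) {
  // Expand back into an ordinary regular expression when needed.
  def toRexp: Rexp = ALTS(focuses.map(f => SEQ(f, suffix)))
}
\end{verbatim}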

Traditional automata approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer only deals with a single regular expression at a time.
Extending this to multiple regular expressions might open up more
possibilities for simplifications.

As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can also accelerate the
matching speed. It would be interesting to study the memory
bandwidth of our derivative-based matching algorithm and improve it accordingly.

Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore the performance enhancements brought by memoization in our
algorithm in a correctness-preserving way.
The monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
algorithm seems also quite relevant for such an enterprise.
We aim to learn from their refinement framework,
which generates high-performance code with proofs
that can be broken down into small steps.



%We establish the correctness