% Chapter Template

\chapter{Future Work and Conclusion} % Main chapter title

\label{Future}

In this thesis, in order to eliminate ReDoS attacks
once and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
lexing algorithm with simplifications \cite{Sulzmann2014}
in Isabelle/HOL.
We have fixed inefficiencies in the original simplification
algorithm, and established its correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al. in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because the simplification function alters the structure of
regular expressions.
Although we had to try out several workarounds and were
stuck for a long time searching for proofs,
we eventually discovered a simple yet
effective proof method:
we model individual
simplification steps as small-step rewriting rules and prove
the equivalence of terms linked by these rules.
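As a schematic illustration (ignoring the bit-codes and the exact
list representation used in chapter \ref{Bitcoded2}), one such
small-step rule removes a duplicate summand from an alternative:
\[
(r_1 + r_2) + r_1 \;\longrightarrow\; r_1 + r_2,
\]
and the correctness argument shows that any two terms related by a
sequence of such steps yield the same lexing result.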

In addition, we have proved a formal bound on the size of the
regular expressions arising during lexing. The technique is to establish
``closed forms'', informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
for the derivatives of compound regular expressions,
and to use those closed forms to control the size.
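Informally, and glossing over the exact side-conditions proved in
chapter \ref{Finite}, the closed form for sequences expresses the
derivative of $r_1 \cdot r_2$ with respect to a whole string $s$,
up to simplification, as an alternative of the form
\[
(r_1\backslash s)\cdot r_2 \;+\; (r_2\backslash s_1) \;+\; \ldots \;+\; (r_2\backslash s_n),
\]
where $s_1,\ldots,s_n$ are the suffixes of $s$ whose corresponding
prefixes can be matched by $r_1$; bounding the number and the size of
these summands then bounds the size of the whole derivative.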
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
\noindent
Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound that is
linear in the input and
cubic in the regular expression size, using a technique by
Antimirov~\cite{Antimirov95}.
Once formalised, this would guarantee the absence of all super-linear behaviour.
We are working out the
details.


To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed only by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity-related
guarantees (for example, that the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.

We believe our proof method is not specific to this
algorithm, and we intend to extend this approach
to prove the correctness of a faster version
of the lexer. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95},
we plan to reduce the size bound to one that is polynomial
in the size of the regular expression.
%We have tried out a variety of other optimisations
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.

We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea based on rewriting relations
was the hardest part for us, as it required a deep understanding of what information
is preserved during the simplification process.
Having the first formalisation in place gave us much more confidence: it took
us much less time to obtain the closed forms and the size bound, although they
are more involved than the correctness proof.
As has been demonstrated time and again,
the usefulness of formal approaches cannot be overstated:
they not only allowed us to find bugs in Sulzmann and Lu's simplification functions,
but also helped us set realistic expectations about the performance
of algorithms. In the beginning we believed that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ a small constant factor.
We were quite surprised that this did not go through despite a lot of effort.
This led us to discover the example in chapter \ref{Finite} where
the factor $\llbracket r\rrbracket^c$ can in fact be exponentially large.
We therefore learned that it is necessary to introduce the stronger simplifications
of chapter \ref{Cubic}.
Without formal proofs we would not have found this out so soon.


In the future, we plan to incorporate many extensions
to this work.

\section{Future Work}
The most obvious next step is to formalise the cubic bound and
the correctness of $\blexerStrong$
from chapter \ref{Cubic}.
A bound that is cubic in the size of the regular expression and linear in the
input ($O(\llbracket r\rrbracket^3\cdot |s|)$) would fulfil
the linear complexity claim made by Sulzmann and Lu.

With such a bound established theoretically, the next challenge would be
to generate code that is competitive with commercial
matchers. For that, a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just like Darragh \cite{Darragh2020} did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$.
Some initial results have been obtained, but
significant changes are needed to make sure
that the lexer output conforms to the POSIX standard.
We aim to make use of Okui and Suzuki's labelling
system \cite{Okui10} to ensure that regular expressions represented as zippers
always maintain the POSIX ordering.
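To make this shared representation concrete, below is a minimal
Scala sketch (our own illustration with made-up names, not the
implementation in the repository): several focuses $x_1,\ldots,x_n$
are stored once together with their common context $r$, instead of
keeping a separate copy of $r$ in every summand.
\begin{verbatim}
// Illustrative sketch only: a regular expression datatype plus a
// "shared-suffix" node keeping several focuses x1, ..., xn together
// with one common context r, representing (x1 + ... + xn) . r
// without copying r for every alternative.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHR(c: Char) extends Rexp
case class ALT(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// The shared representation and its expansion to an ordinary regex.
case class Shared(focuses: List[Rexp], context: Rexp) {
  def toRexp: Rexp = SEQ(ALT(focuses), context)
}

object SharedDemo extends App {
  val r = STAR(CHR('a'))
  // naive representation: three separate copies of the context r
  val naive  = ALT(List(SEQ(CHR('x'), r), SEQ(CHR('y'), r), SEQ(CHR('z'), r)))
  // shared representation: the context r is stored only once
  val shared = Shared(List(CHR('x'), CHR('y'), CHR('z')), r)
  println(naive)
  println(shared.toRexp)
}
\end{verbatim}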
%Formalising correctness of
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.

To further optimise the algorithm,
we plan to add a deduplication function that decides
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker such as the one by Krauss and Nipkow \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
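The pruning step we have in mind is sketched below (again only an
illustration in Scala; the parameter \texttt{equiv} stands for a
verified equivalence checker such as \cite{Krauss2011}):
\begin{verbatim}
object DedupSketch extends App {
  // Keep one representative per equivalence class, given a test `equiv`.
  def dedup[A](xs: List[A], equiv: (A, A) => Boolean): List[A] =
    xs.foldLeft(List.empty[A]) { (acc, x) =>
      if (acc.exists(equiv(_, x))) acc else acc :+ x
    }

  // Toy demo: strings are considered "equivalent" when they have the
  // same length, standing in for "denote the same language".
  val pruned = dedup(List("ab", "cd", "xyz"),
                     (a: String, b: String) => a.length == b.length)
  println(pruned)   // List(ab, xyz)
}
\end{verbatim}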

Traditional automaton approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer deals with only a single regular expression at a time.
Handling multiple regular expressions together might open up more
possibilities for simplification.

As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can accelerate the
matching speed. It would be interesting to measure the memory
bandwidth of our derivative-based matcher and improve it accordingly.

Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore such performance enhancements through memoization in our algorithm
in a correctness-preserving way.
The monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
algorithm seems quite interesting.
We aim to learn from their refinement framework,
which generates high-performance code with proofs
that can be broken down into small, uncluttered steps.




%We establish the correctness
%claim made by them about the
%, which were only proven with pencil and paper.


%\section{Work on Back-References}
%We introduced regular expressions with back-references
%in chapter \ref{Introduction}.
%We adopt the common practice of calling them rewbrs (Regular Expressions
%With Back References) in this section for brevity.
%It has been shown by Aho \cite{Aho1990}
%that the k-vertex cover problem can be reduced
%to the problem of rewbrs matching, and is therefore NP-complete.
%Given the depth of the problem, the progress made at the full generality
%of arbitrary rewbrs matching has been slow, with
%theoretical work on them being
%fairly recent.
%
%Campaneu et al. studied regexes with back-references
%in the context of formal languages theory in
%their 2003 work \cite{campeanu2003formal}.
%They devised a pumping lemma for Perl-like regexes,
%proving that the langugages denoted by them
%is properly contained in context-sensitive
%languages.
%More interesting questions such as
%whether the language denoted by Perl-like regexes
%can express palindromes ($\{w \mid w = w^R\}$)
%are discussed in \cite{campeanu2009patterns}
%and \cite{CAMPEANU2009Intersect}.
%Freydenberger \cite{Frey2013} investigated the
%expressive power of back-references. He showed several
%undecidability and decriptional complexity results
%of back-references, concluding that they add
%great power to certain programming tasks but are difficult to harness.
%An interesting question would then be
%whether we can add restrictions to them,
%so that they become expressive enough for practical use such
%as html files, but not too powerful.
%Freydenberger and Schmid \cite{FREYDENBERGER20191}
%introduced the notion of deterministic
%regular expressions with back-references to achieve
%a better balance between expressiveness and tractability.
%
%
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
%investigated the time complexity of different variants
%of back-references.
%We are not aware of any work that uses derivatives on back-references.
%
%See \cite{BERGLUND2022} for a survey
%of these works and comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%can labels be repeated etc.).
%
%
%\subsection{Matchers and Lexers with Mechanised Proofs}
%We are aware
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
%of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
%
%\subsection{Static Analysis of Evil Regex Patterns}
%When a regular expression does not behave as intended,
%people usually try to rewrite the regex to some equivalent form
%or they try to avoid the possibly problematic patterns completely,
%for which many false positives exist\parencite{Davis18}.
%Animated tools to "debug" regular expressions such as
% \parencite{regexploit2021} \parencite{regex101} are also popular.
%We are also aware of static analysis work on regular expressions that
%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke
%\parencite{Rathnayake2014StaticAF} proposed an algorithm
%that detects regular expressions triggering exponential
%behavious on backtracking matchers.
%Weideman \parencite{Weideman2017Static} came up with
%non-linear polynomial worst-time estimates
%for regexes, attack string that exploit the worst-time
%scenario, and "attack automata" that generates
%attack strings.
%
%
%
%
%
%
%Thanks to our theorem-prover-friendly approach,
%we believe that
%this finiteness bound can be improved to a bound
%linear to input and
%cubic to the regular expression size using a technique by
%Antimirov\cite{Antimirov95}.
%Once formalised, this would be a guarantee for the absence of all super-linear behavious.
%We are working out the
%details.
%
%
%To our best knowledge, no lexing libraries using Brzozowski derivatives
%have similar complexity-related bounds,
%and claims about running time are usually speculative and backed by empirical
%evidence on a few test cases.
%If a matching or lexing algorithm
%does not come with certain basic complexity related
%guarantees (for examaple the internal data structure size
%does not grow indefinitely),
%then they cannot claim with confidence having solved the problem
%of catastrophic backtracking.
%
%
%\section{Optimisations}
%Perhaps the biggest problem that prevents derivative-based lexing
%from being more widely adopted
%is that they are still not blindingly fast in practice, unable to
%reach throughputs like gigabytes per second, which the commercial
%regular expression matchers such as Snort and Bro are able to achieve.
%For our algorithm to be more attractive for practical use, we
%need more correctness-preserving optimisations.
%%So far our focus has been mainly on the bit-coded algorithm,
%%but the injection-based lexing algorithm
%%could also be sped up in various ways:
%%
%One such optimisation is defining string derivatives
%directly, without recursively decomposing them into
%character-by-character derivatives. For example, instead of
%replacing $(r_1 + r_2)\backslash (c::cs)$ by
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
%
%
%\subsection{Derivatives and Zippers}
%Zipper is a data structure designed to focus on
%and navigate between local parts of a tree.
%The idea is that often operations on a large tree only deal with
%local regions each time.
%It was first formally described by Huet \cite{HuetZipper}.
%Typical applications of zippers involve text editor buffers
%and proof system databases.
%In our setting, the idea is to compactify the representation
%of derivatives with zippers, thereby making our algorithm faster.
%We draw our inspirations from several works on parsing, derivatives
%and zippers.
%
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
%\cite{Zippy2020}.
%They adopted zippers to improve the speed, and argued that the runtime
%complexity of their algorithm was linear with respect to the input string.
%
%The idea of using Brzozowski derivatives on general context-free
%parsing was first implemented
%by Might et al. \ref{Might2011}.
%They used memoization and fixpoint construction to eliminate infinite recursion,
%which is a major problem for using derivatives on context-free grammars.
%The initial version was quite slow----exponential on the size of the input.
%Adams et al. \ref{Adams2016} improved the speed and argued that their version
%was cubic with respect to the input.
%Darragh and Adams \cite{10.1145/3408990} further improved the performance
%by using zippers in an innovative way--their zippers had multiple focuses
%instead of just one in a traditional zipper to handle ambiguity.
%Their algorithm was not formalised, though.
%
%We aim to integrate the zipper data structure into our lexer.
%The idea is very simple: using a zipper with multiple focuses
%just like Darragh did, we could represent
%\[
% x\cdot r + y \cdot r + \ldots
%\]
%as
%\[
% (x+y + \ldots) \cdot r.
%\]
%This would greatly reduce the amount of space needed
%when we have many terms like $x\cdot r$.
%
%into the current lexer.
%This allows the lexer to not traverse the entire
%derivative tree generated
%in every intermediate step, but only a small segment
%that is currently active.
%
%Some initial results have been obtained, but more care needs to be taken to make sure
%that the lexer output conforms to the POSIX standard. Formalising correctness of
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.
%
%To further optimise the algorithm,
%I plan to add a ``deduplicate'' function that tells
%whether two regular expressions denote the same
%language using
%an efficient and verified equivalence checker.
%In this way, the internal data structures can be pruned more aggressively,
%leading to better simplifications and ultimately faster algorithms.
%
%
%Some initial results
%We first give a brief introduction to what zippers are,
%and other works
%that apply zippers to derivatives
%When dealing with large trees, it would be a waste to
%traverse the entire tree if
%the operation only
%involves a small fraction of it.
%The idea is to put the focus on that subtree, turning other parts
%of the tree into a context
%
%
%One observation about our derivative-based lexing algorithm is that
%the derivative operation sometimes traverses the entire regular expression
%unnecessarily: