% Chapter Template

\chapter{Future Work and Conclusion} % Main chapter title

\label{Future}

In this thesis, in order to address ReDoS attacks
once and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
lexing algorithm with simplifications \cite{Sulzmann2014}
in Isabelle/HOL.
We have fixed inefficiencies in the original simplification
algorithm, and established its correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al. in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because the simplification function alters the structure of
regular expressions.
Despite having to try out several workarounds and being
stuck for months, or even years, looking for proofs,
we were delighted to discover a simple yet
effective proof method:
modelling individual
simplification steps as small-step rewriting rules and proving
equivalence between terms linked by these rules.
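
To give a flavour of these rules: one of them, stated here schematically
in ad-hoc notation rather than in the exact form of chapter \ref{Bitcoded2},
removes $\mathbf{0}$ (the regular expression matching nothing) from a list of
alternatives,
\[
\sum\,(rs_1 \,@\, [\mathbf{0}] \,@\, rs_2) \;\leadsto\; \sum\,(rs_1 \,@\, rs_2),
\]
and the central invariant is that whenever $r \leadsto r'$, both sides
decode to the same POSIX value, so simplification never changes the
lexing result.
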
In addition, we have proved a formal size bound on the
regular expressions arising during lexing. The technique was to establish
``closed forms'', informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
for the derivatives of compound regular expressions,
and to use those closed forms to control the size.
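For instance, for sequences the closed form is roughly of the shape
\[
(r_1 \cdot r_2)\backslash s \;\simeq\;
(r_1\backslash s)\cdot r_2 \;+\; \sum_{s'} (r_2\backslash s'),
\]
where $s'$ ranges over certain suffixes of $s$ and $\simeq$ denotes
equality modulo simplification (the precise statement is given in
chapter \ref{Finite}); the size of a derivative can then be bounded by
bounding each summand separately.
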
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
\noindent
Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound
linear in the input size and
cubic in the regular expression size, using a technique by
Antimirov~\cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all super-linear behaviour.
We are working out the
details.

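The key fact we hope to exploit is Antimirov's result that the set of
partial derivatives of a regular expression $r$ under \emph{all} strings is
finite and small,
\[
\Big|\bigcup_{s}\mathit{pders}\;s\;r\Big| \;\leq\; |r|_\Sigma + 1,
\]
where $|r|_\Sigma$ counts the occurrences of alphabet symbols in $r$
(we write $\mathit{pders}$ here for the function computing partial derivatives).
This keeps the number of distinct subterms, and hence the size of the
internal data structures, under control.
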
To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity-related
guarantees (for example that the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.

We believe our proof method is not specific to this
algorithm, and we intend to extend this approach
to prove the correctness of a faster version
of the lexer. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95},
we plan to reduce the size bound so that it becomes polynomial
with respect to the size of the regular expression.
%We have tried out a variety of other optimisations
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.

We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea of using rewriting relations
was the hardest part for us, as it requires a deep understanding of what information
is preserved during the simplification process.
Having the first formalisation gave us much more confidence: it took
us much less time to obtain the closed forms and the size bound, although they
are more involved than the correctness proof.
As has been demonstrated numerous times,
the usefulness of formal approaches cannot be overstated:
they not only allowed us to find bugs in Sulzmann and Lu's simplification functions,
but also helped us set realistic expectations about the performance
of algorithms. We believed in the beginning that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ essentially a constant factor for any fixed $r$.
We were quite surprised that this did not go through despite a lot of effort.
This led us to discover the example in chapter \ref{Finite} where
the factor depending on $r$ can in fact be exponentially large,
so no fixed $c$ suffices.
We therefore learned the necessity of introducing the stronger simplifications
in chapter \ref{Cubic}.
Without formal proofs we could not have found this out so soon.

In the future, we plan to incorporate many extensions
to this work.

\section{Future Work}
The most obvious next step is to formalise the cubic bound and
the correctness of $\blexerStrong$
from chapter \ref{Cubic}.
A cubic bound ($O(\llbracket r\rrbracket^3\cdot |s|)$) with respect to
the regular expression size would fulfil
the linear complexity claim (in the length of the input string) made by Sulzmann and Lu.

With such a size bound established theoretically, the next challenge would be
to generate code that is competitive with commercial
matchers. For that, a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just like Darragh \cite{Darragh2020} did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$.
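
A minimal sketch of this sharing idea in Scala is given below (the
datatype is hypothetical and much simplified compared to our actual
annotated regular expressions):

\begin{verbatim}
// simplified regular expressions (no bitcodes, no zipper context)
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// x*r + y*r + ... stored as the heads x, y, ... plus one shared tail r
case class SharedSeq(heads: List[Rexp], tail: Rexp)

// expanding the shared form back to an ordinary regular expression
def expand(s: SharedSeq): Rexp = SEQ(ALTS(s.heads), s.tail)
\end{verbatim}

\noindent
The point is that the tail $r$ is stored once rather than once per
alternative, mirroring the multi-focus zippers above.
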
Some initial results have been obtained, but
significant changes are needed to make sure
that the lexer output conforms to the POSIX standard.
We aim to make use of Okui and Suzuki's labelling
system \cite{Okui10} to ensure that regular expressions represented as zippers
always maintain the POSIX orderings.
%Formalising correctness of
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.

To further optimise the algorithm,
we plan to add a deduplication function that decides
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker like the one by Krauss and Nipkow \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
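
The function itself is straightforward once an equivalence test is
available; a sketch (generic in the type of regular expressions, with
\texttt{equiv} standing in for the verified checker) could look like:

\begin{verbatim}
// keep only the first representative of each equivalence class;
// `equiv` is assumed to be a (verified) language-equivalence test
def dedup[R](rs: List[R], equiv: (R, R) => Boolean): List[R] =
  rs.foldLeft(List.empty[R]) { (kept, r) =>
    if (kept.exists(equiv(_, r))) kept else kept :+ r
  }
\end{verbatim}

\noindent
The interesting part is not the list traversal but making \texttt{equiv}
both efficient and formally verified.
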
Traditional automaton approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer only deals with a single regular expression at a time.
Lexing with multiple regular expressions at once might open up more
possibilities for simplification.

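In our setting this would amount to deriving all rules simultaneously; a
hypothetical sketch (reusing the simplified \texttt{Rexp} from above) is:

\begin{verbatim}
// tag each lexing rule with an identifier and combine them into one
// alternation, so a single derivative pass processes all rules at once
case class Rule(id: String, r: Rexp)

def combine(rules: List[Rule]): Rexp =
  ALTS(rules.map(_.r))
\end{verbatim}

\noindent
The bookkeeping needed to map a match back to the rule that produced it
is omitted here; in the bit-coded setting the annotations could carry the
rule identifier.
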
As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can accelerate the
matching speed. It would be interesting to measure the memory
bandwidth of our derivative-based matching and improve it accordingly.

Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore the performance enhancements offered by memoization in our algorithm
in a correctness-preserving way.
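
As a first step, the derivative function itself could be memoised; a
generic sketch (the concrete key type and derivative function are
placeholders, not our actual definitions) is:

\begin{verbatim}
import scala.collection.mutable

// wrap any function with a cache; correctness is preserved as long
// as the wrapped function is pure
class Memo[A, B](f: A => B) extends (A => B) {
  private val cache = mutable.Map.empty[A, B]
  def apply(a: A): B = cache.getOrElseUpdate(a, f(a))
}

// e.g. val derMemo = new Memo[(Char, Rexp), Rexp]((der _).tupled)
\end{verbatim}

\noindent
The formal challenge is to prove that such caching does not change the
lexer's output, which is why we emphasise correctness preservation.
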
The monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
algorithm seems quite interesting.
We aim to learn from their refinement framework,
which generates high-performance code with proofs
that can be broken down into uncluttered small steps.

%We establish the correctness
%claim made by them about the
%, which were only proven with pencil and paper.

%\section{Work on Back-References}
%We introduced regular expressions with back-references
%in chapter \ref{Introduction}.
%We adopt the common practice of calling them rewbrs (Regular Expressions
%With Back References) in this section for brevity.
%It has been shown by Aho \cite{Aho1990}
%that the k-vertex cover problem can be reduced
%to the problem of rewbrs matching, and is therefore NP-complete.
%Given the depth of the problem, the progress made at the full generality
%of arbitrary rewbrs matching has been slow, with
%theoretical work on them being
%fairly recent.
%
%Campaneu et al. studied regexes with back-references
%in the context of formal languages theory in
%their 2003 work \cite{campeanu2003formal}.
%They devised a pumping lemma for Perl-like regexes,
%proving that the languages denoted by them
%are properly contained in the context-sensitive
%languages.
%More interesting questions such as
%whether the language denoted by Perl-like regexes
%can express palindromes ($\{w \mid w = w^R\}$)
%are discussed in \cite{campeanu2009patterns}
%and \cite{CAMPEANU2009Intersect}.
%Freydenberger \cite{Frey2013} investigated the
%expressive power of back-references. He showed several
%undecidability and descriptional complexity results
%for back-references, concluding that they add
%great power to certain programming tasks but are difficult to harness.
%An interesting question would then be
%whether we can add restrictions to them,
%so that they become expressive enough for practical use such
%as html files, but not too powerful.
%Freydenberger and Schmid \cite{FREYDENBERGER20191}
%introduced the notion of deterministic
%regular expressions with back-references to achieve
%a better balance between expressiveness and tractability.
%
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
%investigated the time complexity of different variants
%of back-references.
%We are not aware of any work that uses derivatives on back-references.
%
%See \cite{BERGLUND2022} for a survey
%of these works and a comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%whether labels can be repeated etc.).
%
%\subsection{Matchers and Lexers with Mechanised Proofs}
%We are aware
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
%of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
%
%\subsection{Static Analysis of Evil Regex Patterns}
%When a regular expression does not behave as intended,
%people usually try to rewrite the regex to some equivalent form
%or they try to avoid the possibly problematic patterns completely,
%for which many false positives exist \parencite{Davis18}.
%Animated tools to ``debug'' regular expressions such as
%\parencite{regexploit2021} \parencite{regex101} are also popular.
%We are also aware of static analysis work on regular expressions that
%aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
%\parencite{Rathnayake2014StaticAF} proposed an algorithm
%that detects regular expressions triggering exponential
%behaviour on backtracking matchers.
%Weideman \parencite{Weideman2017Static} came up with
%non-linear polynomial worst-time estimates
%for regexes, attack strings that exploit the worst-time
%scenario, and ``attack automata'' that generate
%attack strings.
%
%
%\section{Optimisations}
%Perhaps the biggest problem that prevents derivative-based lexing
%from being more widely adopted
%is that it is still not blindingly fast in practice, unable to
%reach throughputs like gigabytes per second, which commercial
%regular expression matchers such as Snort and Bro are able to achieve.
%For our algorithm to be more attractive for practical use, we
%need more correctness-preserving optimisations.
%%So far our focus has been mainly on the bit-coded algorithm,
%%but the injection-based lexing algorithm
%%could also be sped up in various ways:
%%
%One such optimisation is defining string derivatives
%directly, without recursively decomposing them into
%character-by-character derivatives. For example, instead of
%replacing $(r_1 + r_2)\backslash (c::cs)$ by
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
%
%\subsection{Derivatives and Zippers}
%A zipper is a data structure designed to focus on
%and navigate between local parts of a tree.
%The idea is that often operations on a large tree only deal with
%local regions each time.
%It was first formally described by Huet \cite{HuetZipper}.
%Typical applications of zippers involve text editor buffers
%and proof system databases.
%In our setting, the idea is to compactify the representation
%of derivatives with zippers, thereby making our algorithm faster.
%We draw our inspiration from several works on parsing, derivatives
%and zippers.
%
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
%\cite{Zippy2020}.
%They adopted zippers to improve the speed, and argued that the runtime
%complexity of their algorithm was linear with respect to the input string.
%
%The idea of using Brzozowski derivatives for general context-free
%parsing was first implemented
%by Might et al. \cite{Might2011}.
%They used memoization and a fixpoint construction to eliminate infinite recursion,
%which is a major problem for using derivatives on context-free grammars.
%The initial version was quite slow: exponential in the size of the input.
%Adams et al. \cite{Adams2016} improved the speed and argued that their version
%was cubic with respect to the input.
%Darragh and Adams \cite{10.1145/3408990} further improved the performance
%by using zippers in an innovative way: their zippers had multiple focuses
%instead of just one as in a traditional zipper, in order to handle ambiguity.
%Their algorithm was not formalised, though.
%
%This allows the lexer to not traverse the entire
%derivative tree generated
%in every intermediate step, but only a small segment
%that is currently active.
%
%
%We first give a brief introduction to what zippers are,
%and other works
%that apply zippers to derivatives.
%When dealing with large trees, it would be a waste to
%traverse the entire tree if
%the operation only
%involves a small fraction of it.
%The idea is to put the focus on that subtree, turning other parts
%of the tree into a context.
%
%One observation about our derivative-based lexing algorithm is that
%the derivative operation sometimes traverses the entire regular expression
%unnecessarily: