--- a/ChengsongTanPhdThesis/Chapters/Future.tex Fri Dec 30 17:37:51 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Future.tex Fri Dec 30 23:41:44 2022 +0000
@@ -1,37 +1,37 @@
% Chapter Template
-\chapter{Future Work and Conclusion} % Main chapter title
+\chapter{Conclusion and Future Work} % Main chapter title
\label{Future}
In this thesis, in order to solve the ReDoS attacks
once and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
-lexing algorithm with simplifications \cite{Sulzmann2014}
-in Isabelle/HOL.
-We have fixed inefficiencies in the original simplification
+lexing algorithm with aggressive simplifications \cite{Sulzmann2014}.
+We formalised our proofs in the Isabelle/HOL theorem prover.
+We have fixed some inefficiencies and a bug in their original simplification
algorithm, and established the correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al. in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
-because the simplification function messes with the structure of
+because the simplification function messes with the structure of simplified
regular expressions.
Despite having to try out several workarounds and being
-stuck for months or even years looking for proofs,
+stuck for months looking for proofs,
we were delighted to have discovered the simple yet
effective proof method
by modelling individual
simplification steps as small-step rewriting rules and proving
-equivalence between terms linked by these rules.
+equivalence between terms linked by these rewrite rules.
In addition, we have proved a formal size bound on the
regular expressions. The technique was by establishing
a ``closed form'' informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014}
-for compound regular expressions' derivatives
+for compound derivatives
and using those closed forms to control the size.
The Isabelle/HOL code for our formalisation can be
found at
@@ -46,29 +46,31 @@
cubic to the regular expression size using a technique by
Antimirov\cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all super-linear behavious.
-We are working out the
+We are yet to work out the
details.
+Our formalisation is approximately 7500 lines of code. Slightly more than half of this concerns the finiteness bound obtained in Chapter 5. This slight "bloat" is because we had to repeat many definitions for the rrexp datatype. However, we think we would not have been able to formalise the quite intricate reasoning involving rrexps with annotated regular expressions because we would have to carry around the bit-sequences (that are of course necessary in the algorithm) but which do not contribute to the size bound of the calculated derivatives.
+
To our best knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
-and claims about running time are usually speculative and backed by empirical
-evidence on a few test cases.
+and claims about running time are usually speculative and backed up only by empirical
+evidence on some test cases.
If a matching or lexing algorithm
-does not come with certain basic complexity related
+does not come with complexity related
guarantees (for examaple the internal data structure size
does not grow indefinitely),
-then one cannot claim with confidence having solved the problem
+then one cannot claim with confidence of having solved the problem
of catastrophic backtracking.
We believe our proof method is not specific to this
algorithm, and intend to extend this approach
-to prove the correctness of a faster version
-of the lexer. The closed
+to prove the correctness of the faster version
+of the lexer proposed in chapter \cite{Cubic}. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95}
-we plan to reduce the size bound to become polynomial
-with respect to the size of the regular expression.
+we plan to reduce the size bound to be just polynomial
+with respect to the size of the regular expressions.
%We have tried out a variety of other optimisations
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
@@ -77,12 +79,13 @@
We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea with rewriting relations
-was the hardest for us, as it requires a deep understanding of what information
+was the hardest for us, as it required a deep understanding of what information
is preserved during the simplification process.
-Having the first formalisation gives us much more confidence----it took
-us much less time to obtain the closed forms and size bound, although they
-are more involved than the correctness proof.
-As having been tested numerous times,
+There the ideas given by Sulzmann and Lu and Ausaf et al. were of no use.
+%Having the first formalisation gives us much more confidence----it took
+%us much less time to obtain the closed forms and size bound, although they
+%are more involved than the correctness proof.
+Of course this has already been shown many times,
the usefulness of formal approaches cannot be overstated:
they not only allow us to find bugs in Sulzmann and Lu's simplification functions,
but also helped us set up realistic expectations about performance
@@ -90,35 +93,36 @@
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
-We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$
+We then attempted to prove that the size $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
-where $c$ is a small constant number, making
+where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ a small constant factor.
-We were quite surprised that this did not go through despite a lot of efforts.
+We were then quite surprised that this did not go through despite a lot of effort.
This led us to discover the example where $\llbracket r\rrbracket^c$
can in fact be exponentially large in chapter \ref{Finite}.
We therefore learned the necessity to introduce the stronger simplifications
in chapter \ref{Cubic}.
-Without formal proofs we could not have found out this so soon.
+Without formal proofs we would not have found out this so soon,
+if at all.
-In the future, we plan to incorporate many extensions
-to this work.
+%In the future, we plan to incorporate many extensions
+%to this work.
\section{Future Work}
The most obvious next-step is to implement the cubic bound and
correctness of $\blexerStrong$
-in chapter \ref{Cubic} formally.
+in chapter \ref{Cubic}.
A cubic bound ($O(\llbracket r\rrbracket^c\cdot |s|)$) with respect to
-regular expression size will fulfill
+regular expression size will get us one step closer to fulfilling
the linear complexity claim made by Sulzmann and Lu.
With a linear size bound theoretically, the next challenge would be
-to generate code that are competitive with respect to commercial
-matchers. For that a lot of optimisations are needed.
+to generate code that is competitive with respect to
+matchers based on DFAs or NFAs. For that a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses
-just like Darragh \cite{Darragh2020} did, we could represent
+just like Darragh \cite{Darragh2020} did in their parsing algorithm, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
@@ -147,29 +151,29 @@
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
-Traditional automaton approaches can be sped up by compiling
+Traditional automata approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer only deals with a single regular expression each time.
-Putting multiple regular expressions might open up more
+Extending this to multiple regular expressions might open up more
possibilities of simplifications.
As already mentioned in chapter \ref{RelatedWork},
-reducing the number of memory accesses can accelerate the
-matching speed. It would be interesting to see the memory
-bandwidth of our derivative-based matching and improve accordingly.
+reducing the number of memory accesses can also accelerate the
+matching speed. It would be interesting to study the memory
+bandwidth of our derivative-based matching algorithm and improve accordingly.
Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore the performance enhancements by memoization in our algorithm
in a correctness-preserving way.
-The Monadic data refinement technique that Lammich and Tuerk used
-in \cite{Lammich2012} to optimise the Hopcroft automaton minimisation
-algorithm seem quite interesting.
+The monadic data refinement technique that Lammich and Tuerk used
+in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
+algorithm seems also quite relevant for such an enterprise.
We aim to learn from their refinement framework
which generates high performance code with proofs
-that can be broken down into uncluttered small steps.
+that can be broken down into small steps.