% Chapter Template

\chapter{Conclusion and Future Work} % Main chapter title

\label{Future}

In this thesis, in order to address the problem of ReDoS attacks
once and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
lexing algorithm with aggressive simplifications \cite{Sulzmann2014}.
We formalised our proofs in the Isabelle/HOL theorem prover.
We have fixed some inefficiencies and a bug in their original simplification
algorithm, and established the correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al.\ in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because the simplification function alters the structure of the simplified
regular expressions.
Despite having to try out several workarounds and being
stuck for months looking for proofs,
we were delighted to have discovered a simple yet
effective proof method:
modelling individual
simplification steps as small-step rewriting rules and proving
equivalence between terms linked by these rewrite rules.
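To give a flavour of this method (shown here only schematically, with plain
regular expressions and an ad-hoc rewrite arrow $\leadsto$ rather than the
exact annotated rules of chapter \ref{Bitcoded2}), the rewrite relation
contains rules such as removing $\mathbf{0}$s and duplicates inside
alternatives, and the main lemmas have roughly the following shape:
\begin{itemize}
	\item $r \leadsto^* \textit{simp}\;r$ (every simplification is a sequence of small rewrite steps);
	\item $r_1 \leadsto r_2$ implies $r_1\backslash c \leadsto^* r_2\backslash c$ (rewriting is compatible with taking derivatives);
	\item $r_1 \leadsto^* r_2$ implies that $r_1$ and $r_2$ are nullable at the same time and, when they are, yield the same lexing value.
\end{itemize}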

In addition, we have proved a formal size bound on the
regular expressions. The technique was to establish
``closed forms'', informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
for the derivatives of compound regular expressions,
and to use those closed forms to control the size.
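As an illustration of what such a closed form looks like, consider the
sequence case: stated only up to the similarity induced by the simplification
rules (and ignoring bitcodes), the derivative of a sequence with respect to a
string $s$ satisfies
\[
(r_1 \cdot r_2)\backslash s \;\sim\;
  (r_1\backslash s)\cdot r_2 \;+\;
  \sum\,\{\, r_2\backslash s'' \;\mid\; s = s'\,@\,s''\,,\;\; s'' \neq [\,]\,,\;\;
  \textit{nullable}\,(r_1\backslash s')\,\}.
\]
The size of the right-hand side is governed by the sizes of the derivatives of
$r_1$ and $r_2$ alone, which is what makes an inductive size bound possible.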

The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
This finiteness bound can be improved to a bound that is
linear in the input and
cubic in the regular expression size, using a technique by
Antimirov \cite{Antimirov95}.
Once formalised, this would guarantee the absence of all super-linear behaviour.
We have yet to work out the
details.
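The key fact from \cite{Antimirov95} that we intend to exploit is that the set
of Antimirov's \emph{partial} derivatives of a regular expression, taken over
all strings, is finite and small: writing $\textit{pders}\,(s, r)$ for the set
of partial derivatives of $r$ with respect to a string $s$, one has
\[
\Big|\bigcup_{s}\, \textit{pders}\,(s, r)\Big| \;\leq\; \textit{awidth}\,(r) + 1,
\]
where $\textit{awidth}\,(r)$ counts the occurrences of alphabet characters in
$r$, and the language of every derivative $r\backslash s$ is the union of the
languages of its partial derivatives. The hope is that simplifications which
are aware of this structure keep every derivative within a polynomially-sized
collection of such terms.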
       
Our formalisation is approximately 7500 lines of code. Slightly more than half
of this concerns the finiteness bound obtained in chapter \ref{Finite}. This
slight ``bloat'' is because we had to repeat many definitions for the rrexp
datatype. However, we think we would not have been able to formalise this
quite intricate size reasoning directly with annotated regular expressions
(rather than with rrexps), because we would have had to carry around the
bit-sequences (which are of course necessary in the algorithm) even though
they do not contribute to the size bound of the calculated derivatives.
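To make the duplication concrete, here is a small Scala rendering of the two
datatypes involved; this is only a sketch for illustration (the formalisation
itself is in Isabelle/HOL, and the constructor and function names below are
merely suggestive). Annotated regular expressions carry a bit-sequence in
every node, whereas rrexps erase them, so the size reasoning never has to
mention bits.
\begin{verbatim}
// Bit-annotated regular expressions, as used by the lexer itself.
sealed trait ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Boolean]) extends ARexp
case class ACHAR(bs: List[Boolean], c: Char) extends ARexp
case class AALTS(bs: List[Boolean], rs: List[ARexp]) extends ARexp
case class ASEQ(bs: List[Boolean], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Boolean], r: ARexp) extends ARexp

// The same shape with all bit-sequences erased; the size bound is
// stated and reasoned about on this datatype only.
sealed trait RRexp
case object RZERO extends RRexp
case object RONE extends RRexp
case class RCHAR(c: Char) extends RRexp
case class RALTS(rs: List[RRexp]) extends RRexp
case class RSEQ(r1: RRexp, r2: RRexp) extends RRexp
case class RSTAR(r: RRexp) extends RRexp

// Erasure: forget the annotations but keep the structure that
// determines the size of a regular expression.
def erase(a: ARexp): RRexp = a match {
  case AZERO           => RZERO
  case AONE(_)         => RONE
  case ACHAR(_, c)     => RCHAR(c)
  case AALTS(_, rs)    => RALTS(rs.map(erase))
  case ASEQ(_, r1, r2) => RSEQ(erase(r1), erase(r2))
  case ASTAR(_, r)     => RSTAR(erase(r))
}
\end{verbatim}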

To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed up only by empirical
evidence on some test cases.
If a matching or lexing algorithm
does not come with complexity-related
guarantees (for example that the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.

We believe our proof method is not specific to this
algorithm, and intend to extend this approach
to prove the correctness of the faster version
of the lexer proposed in chapter \ref{Cubic}. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95},
we plan to reduce the size bound to one that is polynomial
with respect to the size of the regular expressions.
%We have tried out a variety of other optimisations 
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.

We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea with rewriting relations
was the hardest part for us, as it required a deep understanding of what information
is preserved during the simplification process.
For this, the ideas given by Sulzmann and Lu and by Ausaf et al. were of no use.
%Having the first formalisation gives us much more confidence----it took
%us much less time to obtain the closed forms and size bound, although they
%are more involved than the correctness proof.
Of course this has already been demonstrated many times:
the usefulness of formal approaches cannot be overstated.
They not only allowed us to find bugs in Sulzmann and Lu's simplification functions,
but also helped us set up realistic expectations about the performance
of algorithms. We believed in the beginning that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that the size $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ a small constant factor.
We were then quite surprised that this did not go through despite a lot of effort.
This led us to discover the example in chapter \ref{Finite} where this constant
factor can in fact be exponentially large.
We therefore learned the necessity to introduce the stronger simplifications
in chapter \ref{Cubic}.
Without formal proofs we would not have found this out so soon,
if at all.


%In the future, we plan to incorporate many extensions
%to this work.

\section{Future Work}
The most obvious next step is to formally establish the cubic bound and
the correctness of $\blexerStrong$
from chapter \ref{Cubic}.
A cubic bound ($O(\llbracket r\rrbracket^3\cdot |s|)$) with respect to
the regular expression size would get us one step closer to fulfilling
the linear complexity claim made by Sulzmann and Lu.

With such a size bound (linear in the length of the input) established theoretically,
the next challenge would be
to generate code that is competitive with respect to
matchers based on DFAs or NFAs. For that, a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just like Darragh \cite{Darragh2020} did in their parsing algorithm, we could represent
\[
	x\cdot r + y \cdot r + \ldots
\]
as a single structure in which the common suffix $r$ is stored only once and
shared among multiple focuses.
Moreover, one could check whether subexpressions denote the same
language using
an efficient and verified equivalence checker like \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
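To illustrate the sharing idea, the following is a minimal sketch of one
possible encoding (it is not Darragh's data structure, nor necessarily the one
we will adopt; all names are illustrative): a ``multi-focus'' node keeps the
common suffix once and pairs it with the list of distinct prefixes.
\begin{verbatim}
// A toy regular-expression datatype, just enough for the illustration.
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// One shared suffix, many focused prefixes:
// SEQ(x, r) + SEQ(y, r) + ... becomes MultiFocus(List(x, y, ...), r),
// so r is stored (and simplified) only once.
case class MultiFocus(focuses: List[Rexp], shared: Rexp)

// Flattening recovers the ordinary alternative-of-sequences reading.
def flatten(z: MultiFocus): Rexp =
  ALTS(z.focuses.map(f => SEQ(f, z.shared)))
\end{verbatim}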

Traditional automata-based approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer only deals with a single regular expression at a time.
Extending this to multiple regular expressions might open up more
possibilities for simplification.

As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can also accelerate the
matching speed. It would be interesting to study the memory
bandwidth of our derivative-based matching algorithm and improve it accordingly.

Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore the performance enhancements offered by memoization in our algorithm
in a correctness-preserving way.
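As a minimal sketch of this idea (shown on a plain matcher rather than on the
bit-coded lexer, and with purely illustrative names), one can memoise the
derivative operation itself, so that structurally identical sub-derivatives
are computed only once during a run:
\begin{verbatim}
import scala.collection.mutable

sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// Cache shared across a whole run: structurally equal derivative
// queries are answered from the table instead of being recomputed.
val cache = mutable.HashMap.empty[(Rexp, Char), Rexp]

def der(c: Char, r: Rexp): Rexp = cache.getOrElseUpdate((r, c), r match {
  case ZERO | ONE  => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
})

def ders(s: List[Char], r: Rexp): Rexp =
  s.foldLeft(r)((acc, c) => der(c, acc))

def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}
The interesting (and delicate) part is to prove that adding such a cache, or
memoising the bit-coded derivatives, does not change the POSIX value the lexer
produces.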

The monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
algorithm also seems quite relevant for such an enterprise.
We aim to learn from their refinement framework,
which generates high-performance code with proofs
that can be broken down into small steps.



%We establish the correctness