ChengsongTanPhdThesis/Chapters/Future.tex
changeset 637 e3752aac8ec2
child 640 bd1354127574
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Future.tex	Sat Dec 24 11:55:04 2022 +0000
@@ -0,0 +1,387 @@
+% Chapter Template
+
+\chapter{Future Work and Conclusion} % Main chapter title
+
+\label{Future} 
+
+In this thesis, in order to solve the ReDoS attacks
+once and for all, we have set out to formalise the correctness
+proof of Sulzmann and Lu's
+lexing algorithm with simplifications \cite{Sulzmann2014}
+in Isabelle/HOL. 
+We have fixed inefficiencies in  the original simplification
+algorithm, and established the correctness by proving
+that our algorithm outputs the same result as the original bit-coded
+lexer without simplifications (whose correctness was established in 
+previous work by Ausaf et al. in
+\cite{Ausaf}  and \cite{AusafDyckhoffUrban2016}).
+The proof technique used in \cite{Ausaf} does not work in our case
+because the simplification function messes with the structure of
+regular expressions.
+Despite having to try out several workarounds and being
+stuck for months or even years looking for proofs, 
+we were delighted to have discovered the simple yet
+effective proof method
+by modelling individual
+simplification steps as small-step rewriting rules and proving
+equivalence between terms linked by these rules.
+
+In addition, we have proved a formal size bound on the 
+regular expressions. The technique was by establishing
+a ``closed form'' informally
+described by
+Murugesan and Shanmuga Sundaram \cite{Murugesan2014} 
+for compound regular expressions' derivatives
+and using those closed forms to control the size.
+The Isabelle/HOL code for our formalisation can be 
+found at
+\begin{center}
+\url{https://github.com/hellotommmy/posix}
+\end{center}
+\noindent
+Thanks to our theorem-prover-friendly approach,
+we believe that 
+this finiteness bound can be improved to a bound
+linear to input and
+cubic to the regular expression size using a technique by
+Antimirov\cite{Antimirov95}.
+Once formalised, this would be a guarantee for the absence of all super-linear behavious.
+We are working out the
+details.
+
+
+To our best knowledge, no lexing libraries using Brzozowski derivatives
+have similar complexity-related bounds, 
+and claims about running time are usually speculative and backed by empirical
+evidence on a few test cases.
+If a matching or lexing algorithm
+does not come with certain basic complexity related 
+guarantees (for examaple the internal data structure size
+does not grow indefinitely), 
+then one cannot claim with confidence having solved the problem
+of catastrophic backtracking.
+
+We believe our proof method is not specific to this
+algorithm, and intend to extend this approach
+to prove the correctness of a faster version
+of the lexer. The closed
+forms can then be re-used as well. Together
+with the idea from Antimirov \cite{Antimirov95}
+we plan to reduce the size bound to become polynomial
+with respect to the size of the regular expression.
+%We have tried out a variety of other optimisations 
+%to make lexing/parsing with derivatives faster,
+%for example in \cite{Edelmann2021}, \cite{Darragh2020}
+%and \cite{Verbatimpp}.
+
+We have learned quite a few lessons in this process.
+As simple as the end result may seem,
+coming up with the initial proof idea with rewriting relations
+was the hardest for us, as it requires a deep understanding of what information
+is preserved during the simplification process.
+Having the first formalisation gives us much more confidence----it took
+us much less time to obtain the closed forms and size bound, although they
+are more involved than the correctness proof.
+As having been tested numerous times, 
+the usefulness of formal approaches cannot be overstated:
+they not only allow us to find bugs in Sulzmann and Lu's simplification functions,
+but also helped us set up realistic expectations about performance
+of algorithms. We believed in the beginning that
+the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
+was already able to achieve the linear bound claimed by
+Sulzmann and Lu.
+We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$ 
+is $O(\llbracket r\rrbracket^c\cdot |s|)$,
+where $c$ is a small constant number, making
+$\llbracket r\rrbracket^c$ a small constant factor.
+We were quite surprised that this did not go through despite a lot of efforts.
+This led us to discover the example where $\llbracket r\rrbracket^c$
+can in fact be exponentially large in chapter \ref{Finite}.
+We therefore learned the necessity to introduce the stronger simplifications
+in chapter \ref{Cubic}.
+Without formal proofs we could not have found out this so soon.
+
+
+In the future, we plan to incorporate many extensions
+to this work.
+
+\section{Future Work}
+The most obvious next-step is to implement the cubic bound and
+correctness of $\blexerStrong$
+in chapter \ref{Cubic} formally.
+A cubic bound ($O(\llbracket r\rrbracket^c\cdot |s|)$) with respect to 
+regular expression size will fulfill
+the linear complexity claim made by Sulzmann and Lu.
+
+With a linear size bound theoretically, the next challenge would be
+to generate code that are competitive with respect to commercial
+matchers. For that a lot of optimisations are needed.
+We aim to  integrate the zipper data structure into our lexer.
+The idea is very simple: using a zipper with multiple focuses
+just like Darragh \cite{Darragh2020} did, we could represent
+\[
+	x\cdot r + y \cdot r + \ldots
+\]
+as 
+\[
+	(x+y + \ldots) \cdot r.
+\]
+This would greatly reduce the amount of space needed
+when we have many terms like $x\cdot r$.
+Some initial results have been obtained, but 
+significant changes are needed to make sure 
+that the lexer output conforms to the POSIX standard. 
+We aim to make use of Okui and Suzuki's labelling
+system \cite{Okui10} to ensure regular expressions represented as zippers
+always maintain the POSIX orderings.
+%Formalising correctness of 
+%such a lexer using zippers will probably require using an imperative
+%framework with separation logic as it involves
+%mutable data structures, which is also interesting to look into.  
+
+To further optimise the algorithm, 
+we plan to add a deduplicate function that tells 
+whether two regular expressions denote the same 
+language using
+an efficient and verified equivalence checker like \cite{Krauss2011}.
+In this way, the internal data structures can be pruned more aggressively, 
+leading to better simplifications and ultimately faster algorithms.
+
+Traditional automaton approaches can be sped up by compiling
+multiple rules into the same automaton. This has been done
+in \cite{Kumar2006} and \cite{Becchi08}, for example.
+Currently our lexer only deals with a single regular expression each time.
+Putting multiple regular expressions might open up more
+possibilities of simplifications.
+
+As already mentioned in chapter \ref{RelatedWork},
+reducing the number of memory accesses can accelerate the 
+matching speed. It would be interesting to see the memory
+bandwidth of our derivative-based matching and improve accordingly.
+
+Memoization has been used frequently in lexing and parsing to achieve
+better complexity results, for example in \cite{Reps1998},
+\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
+We plan to explore the performance enhancements by memoization in our algorithm
+in a correctness-preserving way.
+The Monadic data refinement technique that Lammich and Tuerk used
+in \cite{Lammich2012} to optimise the Hopcroft automaton minimisation
+algorithm seem quite interesting.
+We aim to learn from their refinement framework
+which generates high performance code with proofs
+that can be broken down into uncluttered small steps.
+
+
+
+
+%We establish the correctness 
+%claim made by them about the 
+%, which were only proven with pencil and paper.
+
+
+%\section{Work on Back-References}
+%We introduced regular expressions with back-references
+%in chapter \ref{Introduction}.
+%We adopt the common practice of calling them rewbrs (Regular Expressions
+%With Back References) in this section for brevity.
+%It has been shown by Aho \cite{Aho1990}
+%that the k-vertex cover problem can be reduced
+%to the problem of rewbrs matching, and is therefore NP-complete.
+%Given the depth of the problem, the progress made at the full generality
+%of arbitrary rewbrs matching has been slow, with
+%theoretical work on them being 
+%fairly recent. 
+%
+%Campaneu et al. studied regexes with back-references
+%in the context of formal languages theory in 
+%their 2003 work \cite{campeanu2003formal}.
+%They devised a pumping lemma for Perl-like regexes, 
+%proving that the langugages denoted by them
+%is  properly  contained in context-sensitive
+%languages.
+%More interesting questions such as 
+%whether the language denoted by Perl-like regexes
+%can express palindromes ($\{w \mid w = w^R\}$)
+%are discussed in \cite{campeanu2009patterns} 
+%and \cite{CAMPEANU2009Intersect}.
+%Freydenberger \cite{Frey2013} investigated the 
+%expressive power of back-references. He showed several 
+%undecidability and decriptional complexity results 
+%of back-references, concluding that they add
+%great power to certain programming tasks but are difficult to harness.
+%An interesting question would then be
+%whether we can add restrictions to them, 
+%so that they become expressive enough for practical use such
+%as html files, but not too powerful.
+%Freydenberger and Schmid \cite{FREYDENBERGER20191}
+%introduced the notion of deterministic
+%regular expressions with back-references to achieve
+%a better balance between expressiveness and tractability.
+%
+%
+%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
+%investigated the time complexity of different variants
+%of back-references.
+%We are not aware of any work that uses derivatives on back-references.
+%
+%See \cite{BERGLUND2022} for a survey
+%of these works and comparison between different
+%flavours  of back-references syntax (e.g. whether references can be circular, 
+%can labels be repeated etc.).
+%
+%
+%\subsection{Matchers and Lexers with Mechanised Proofs}
+%We are aware
+%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
+%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
+%of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
+%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
+%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
+% 
+%\subsection{Static Analysis of Evil Regex Patterns} 
+%When a regular expression does not behave as intended,
+%people usually try to rewrite the regex to some equivalent form
+%or they try to avoid the possibly problematic patterns completely,
+%for which many false positives exist\parencite{Davis18}.
+%Animated tools to "debug" regular expressions such as
+% \parencite{regexploit2021} \parencite{regex101} are also popular.
+%We are also aware of static analysis work on regular expressions that
+%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
+%\parencite{Rathnayake2014StaticAF} proposed an algorithm
+%that detects regular expressions triggering exponential
+%behavious on backtracking matchers.
+%Weideman \parencite{Weideman2017Static} came up with 
+%non-linear polynomial worst-time estimates
+%for regexes, attack string that exploit the worst-time 
+%scenario, and "attack automata" that generates
+%attack strings.
+%
+%
+%
+%
+%
+%
+%Thanks to our theorem-prover-friendly approach,
+%we believe that 
+%this finiteness bound can be improved to a bound
+%linear to input and
+%cubic to the regular expression size using a technique by
+%Antimirov\cite{Antimirov95}.
+%Once formalised, this would be a guarantee for the absence of all super-linear behavious.
+%We are working out the
+%details.
+%
+% 
+%To our best knowledge, no lexing libraries using Brzozowski derivatives
+%have similar complexity-related bounds, 
+%and claims about running time are usually speculative and backed by empirical
+%evidence on a few test cases.
+%If a matching or lexing algorithm
+%does not come with certain basic complexity related 
+%guarantees (for examaple the internal data structure size
+%does not grow indefinitely), 
+%then they cannot claim with confidence having solved the problem
+%of catastrophic backtracking.
+%
+%
+%\section{Optimisations}
+%Perhaps the biggest problem that prevents derivative-based lexing 
+%from being more widely adopted
+%is that they are still not blindingly fast in practice, unable to 
+%reach throughputs like gigabytes per second, which the commercial
+%regular expression matchers such as Snort and Bro are able to achieve.
+%For our algorithm to be more attractive for practical use, we
+%need more correctness-preserving optimisations.
+%%So far our focus has been mainly on the bit-coded algorithm,
+%%but the injection-based lexing algorithm 
+%%could also be sped up in various ways:
+%%
+%One such optimisation is defining string derivatives
+%directly, without recursively decomposing them into 
+%character-by-character derivatives. For example, instead of
+%replacing $(r_1 + r_2)\backslash (c::cs)$ by
+%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
+%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
+%
+%
+%\subsection{Derivatives and Zippers}
+%Zipper is a data structure designed to focus on 
+%and navigate between local parts of a tree.
+%The idea is that often operations on a large tree only deal with
+%local regions each time.
+%It was first formally described by Huet \cite{HuetZipper}.
+%Typical applications of zippers involve text editor buffers
+%and proof system databases.
+%In our setting, the idea is to compactify the representation
+%of derivatives with zippers, thereby making our algorithm faster.
+%We draw our inspirations from several works on parsing, derivatives
+%and zippers.
+%
+%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
+%\cite{Zippy2020}.
+%They adopted zippers to improve the speed, and argued that the runtime
+%complexity of their algorithm was linear with respect to the input string.
+%
+%The idea of using Brzozowski derivatives on general context-free 
+%parsing was first implemented
+%by Might et al. \ref{Might2011}. 
+%They used memoization and fixpoint construction to eliminate infinite recursion,
+%which is a major problem for using derivatives on context-free grammars.
+%The initial version was quite slow----exponential on the size of the input.
+%Adams et al. \ref{Adams2016} improved the speed and argued that their version
+%was cubic with respect to the input.
+%Darragh and Adams \cite{10.1145/3408990} further improved the performance
+%by using zippers in an innovative way--their zippers had multiple focuses
+%instead of just one in a traditional zipper to handle ambiguity.
+%Their algorithm was not formalised, though.
+%
+%We aim to  integrate the zipper data structure into our lexer.
+%The idea is very simple: using a zipper with multiple focuses
+%just like Darragh did, we could represent
+%\[
+%	x\cdot r + y \cdot r + \ldots
+%\]
+%as 
+%\[
+%	(x+y + \ldots) \cdot r.
+%\]
+%This would greatly reduce the amount of space needed
+%when we have many terms like $x\cdot r$.
+%
+%into the current lexer. 
+%This allows the lexer to not traverse the entire 
+%derivative tree generated
+%in every intermediate step, but only a small segment 
+%that is currently active. 
+%
+%Some initial results have been obtained, but more care needs to be taken to make sure 
+%that the lexer output conforms to the POSIX standard. Formalising correctness of 
+%such a lexer using zippers will probably require using an imperative
+%framework with separation logic as it involves
+%mutable data structures, which is also interesting to look into.  
+%
+%To further optimise the algorithm, 
+%I plan to add a ``deduplicate'' function that tells 
+%whether two regular expressions denote the same 
+%language using
+%an efficient and verified equivalence checker.
+%In this way, the internal data structures can be pruned more aggressively, 
+%leading to better simplifications and ultimately faster algorithms.
+%
+%
+%Some initial results 
+%We first give a brief introduction to what zippers are,
+%and other works
+%that apply zippers to derivatives
+%When dealing with large trees, it would be a waste to 
+%traverse the entire tree if
+%the operation only
+%involves a small fraction of it.
+%The idea is to put the focus on that subtree, turning other parts
+%of the tree into a context
+%
+%
+%One observation about our derivative-based lexing algorithm is that
+%the derivative operation sometimes traverses the entire regular expression
+%unnecessarily:
+