diff -r 0bcb4a7cb40c -r e3752aac8ec2 ChengsongTanPhdThesis/Chapters/Future.tex
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Future.tex	Sat Dec 24 11:55:04 2022 +0000
@@ -0,0 +1,387 @@
+% Chapter Template
+
+\chapter{Future Work and Conclusion} % Main chapter title
+
+\label{Future}
+
+In this thesis, in order to solve the problem of ReDoS attacks
+once and for all, we have set out to formalise in Isabelle/HOL
+the correctness proof of Sulzmann and Lu's
+lexing algorithm with simplifications \cite{Sulzmann2014}.
+We have fixed inefficiencies in the original simplification
+algorithm and established its correctness by proving
+that our algorithm outputs the same result as the original bit-coded
+lexer without simplifications (whose correctness was established in
+previous work by Ausaf et al. in
+\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
+The proof technique used in \cite{Ausaf} does not work in our case
+because the simplification function radically changes the structure of
+regular expressions.
+Despite having to try out several workarounds and being
+stuck for months or even years looking for proofs,
+we eventually discovered a simple yet
+effective proof method: modelling individual
+simplification steps as small-step rewrite rules and proving
+equivalence between terms linked by these rules.
+
+In addition, we have proved a formal bound on the size of the
+regular expressions that arise during lexing. The technique was to establish
+a ``closed form'', informally
+described by
+Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
+for the derivatives of compound regular expressions,
+and to use these closed forms to control the size.
+The Isabelle/HOL code for our formalisation can be
+found at
+\begin{center}
+\url{https://github.com/hellotommmy/posix}
+\end{center}
+\noindent
+Thanks to our theorem-prover-friendly approach,
+we believe that
+this finiteness bound can be improved to a bound
+that is linear in the length of the input and
+cubic in the size of the regular expression, using a technique by
+Antimirov \cite{Antimirov95}.
+Once formalised, this would be a guarantee for the absence of all super-linear behaviour.
+We are working out the
+details.
+
+
+To the best of our knowledge, no lexing library using Brzozowski derivatives
+comes with similar complexity-related bounds;
+claims about running time are usually speculative and backed only by empirical
+evidence on a few test cases.
+If a matching or lexing algorithm
+does not come with certain basic complexity-related
+guarantees (for example that the internal data structures
+do not grow indefinitely in size),
+then one cannot claim with confidence to have solved the problem
+of catastrophic backtracking.
+
+We believe our proof method is not specific to this
+algorithm, and we intend to extend this approach
+to prove the correctness of a faster version
+of the lexer. The closed
+forms can then be re-used as well. Together
+with the idea from Antimirov \cite{Antimirov95},
+we plan to reduce the size bound to one that is polynomial
+in the size of the regular expression.
+%We have tried out a variety of other optimisations
+%to make lexing/parsing with derivatives faster,
+%for example in \cite{Edelmann2021}, \cite{Darragh2020}
+%and \cite{Verbatimpp}.
+
+We have learned quite a few lessons in this process.
+As simple as the end result may seem,
+coming up with the initial proof idea of using rewrite relations
+was the hardest part for us, as it required a deep understanding of what information
+is preserved during the simplification process.
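+To give a flavour of these rules, a few representative simplification
+steps are of roughly the following shape (stated here on plain regular
+expressions; the actual rules in chapter \ref{Bitcoded2} operate on
+bit-coded regular expressions and additionally specify how the attached
+bit sequences are fused and transferred):
+\begin{center}
+$\ZERO \cdot r \rightsquigarrow \ZERO \qquad
+ \ONE \cdot r \rightsquigarrow r \qquad
+ r + \ZERO \rightsquigarrow r \qquad
+ r + r \rightsquigarrow r$
+\end{center}
+\noindent
+Each step is deliberately small, which is what makes it feasible to relate
+terms linked by these rules back to the unsimplified lexer.
+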
+Having the first formalisation in place gave us much more confidence: it took
+us much less time to obtain the closed forms and the size bound, even though they
+are more involved than the correctness proof.
+As has been demonstrated numerous times,
+the usefulness of formal approaches cannot be overstated:
+they not only allowed us to find bugs in Sulzmann and Lu's simplification functions,
+but also helped us set up realistic expectations about the performance
+of algorithms. In the beginning we believed that
+the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
+was already able to achieve the linear bound claimed by
+Sulzmann and Lu.
+We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$
+is $O(\llbracket r\rrbracket^c\cdot |s|)$,
+where $c$ is a small constant, making
+$\llbracket r\rrbracket^c$ a small constant factor.
+We were quite surprised that this did not go through despite a lot of effort.
+This led us to discover the example in chapter \ref{Finite} where this
+factor can in fact be exponentially large.
+We therefore learned about the necessity of introducing the stronger simplifications
+of chapter \ref{Cubic}.
+Without formal proofs we would not have discovered this so soon.
+
+
+In the future, we plan to extend this work
+in several directions.
+
+\section{Future Work}
+The most obvious next step is to formally establish the cubic bound and
+the correctness of $\blexerStrong$
+from chapter \ref{Cubic}.
+A cubic bound with respect to the
+regular expression size, that is $O(\llbracket r\rrbracket^3\cdot |s|)$, would fulfil
+the linear complexity claim made by Sulzmann and Lu.
+
+Once such a bound is established theoretically, the next challenge would be
+to generate code that is competitive with
+commercial matchers. For that a lot of optimisations are needed.
+We aim to integrate the zipper data structure into our lexer.
+The idea is very simple: using a zipper with multiple focuses,
+just like Darragh \cite{Darragh2020} did, we could represent
+\[
+  x\cdot r + y \cdot r + \ldots
+\]
+as
+\[
+  (x+y + \ldots) \cdot r.
+\]
+This would greatly reduce the amount of space needed
+when we have many terms like $x\cdot r$.
+Some initial results have been obtained, but
+significant changes are needed to make sure
+that the lexer output conforms to the POSIX standard.
+We aim to make use of Okui and Suzuki's labelling
+system \cite{Okui10} to ensure that regular expressions represented as zippers
+always maintain the POSIX orderings.
+%Formalising correctness of
+%such a lexer using zippers will probably require using an imperative
+%framework with separation logic as it involves
+%mutable data structures, which is also interesting to look into.
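+
+To illustrate the intended space saving, the following rough sketch (in
+Scala, over plain regular expressions without bit-codes, and ignoring the
+POSIX ordering issues discussed above; the names are only illustrative and
+not part of our verified development) factors out a shared right factor
+from the summands of an alternative:
+\begin{verbatim}
+// Toy sketch: collapse  x*r + y*r + ...  into  (x + y + ...)*r,
+// mirroring the sharing that a zipper with multiple focuses provides.
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class ALTS(rs: List[Rexp]) extends Rexp
+
+def shareRight(rs: List[Rexp]): List[Rexp] = {
+  // separate the summands that are sequences from all other summands
+  val (seqs, others) = rs.partition {
+    case SEQ(_, _) => true
+    case _         => false
+  }
+  // group the sequences by their right component and share it
+  val grouped = seqs.collect { case SEQ(l, r) => (l, r) }.groupBy(_._2)
+  val shared = grouped.toList.map {
+    case (r, List((l, _))) => SEQ(l, r)                  // nothing to share
+    case (r, ls)           => SEQ(ALTS(ls.map(_._1)), r) // (x + y + ...)*r
+  }
+  shared ::: others
+}
+\end{verbatim}
+\noindent
+With a zipper with multiple focuses this sharing would happen implicitly:
+the common right factor $r$ is stored only once and the focuses
+$x, y, \ldots$ all point to it. A production version would additionally
+have to preserve the POSIX ordering of the summands, which the sketch
+above deliberately ignores.
+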
+To further optimise the algorithm,
+we plan to add a deduplication function that decides
+whether two regular expressions denote the same
+language, using
+an efficient and verified equivalence checker such as the one by Krauss and Nipkow \cite{Krauss2011}.
+In this way, the internal data structures can be pruned more aggressively,
+leading to better simplifications and ultimately faster algorithms.
+
+Traditional automata-based approaches can be sped up by compiling
+multiple rules into the same automaton. This has been done
+in \cite{Kumar2006} and \cite{Becchi08}, for example.
+Currently our lexer deals with only a single regular expression at a time.
+Handling multiple regular expressions at once might open up more
+possibilities for simplification.
+
+As already mentioned in chapter \ref{RelatedWork},
+reducing the number of memory accesses can accelerate the
+matching speed. It would be interesting to measure the memory
+bandwidth used by our derivative-based matcher and to improve it accordingly.
+
+Memoization has been used frequently in lexing and parsing to achieve
+better complexity results, for example in \cite{Reps1998},
+\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
+We plan to explore performance enhancements through memoization in our algorithm
+in a correctness-preserving way.
+The monadic data refinement technique that Lammich and Tuerk used
+in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
+algorithm seems quite interesting.
+We aim to learn from their refinement framework,
+which generates high-performance code together with proofs
+that can be broken down into uncluttered small steps.
+
+
+
+
+%We establish the correctness
+%claim made by them about the
+%, which were only proven with pencil and paper.
+
+
+%\section{Work on Back-References}
+%We introduced regular expressions with back-references
+%in chapter \ref{Introduction}.
+%We adopt the common practice of calling them rewbrs (Regular Expressions
+%With Back References) in this section for brevity.
+%It has been shown by Aho \cite{Aho1990}
+%that the k-vertex cover problem can be reduced
+%to the problem of rewbrs matching, and is therefore NP-complete.
+%Given the depth of the problem, the progress made at the full generality
+%of arbitrary rewbrs matching has been slow, with
+%theoretical work on them being
+%fairly recent.
+%
+%Campaneu et al. studied regexes with back-references
+%in the context of formal languages theory in
+%their 2003 work \cite{campeanu2003formal}.
+%They devised a pumping lemma for Perl-like regexes,
+%proving that the languages denoted by them
+%are properly contained in context-sensitive
+%languages.
+%More interesting questions such as
+%whether the language denoted by Perl-like regexes
+%can express palindromes ($\{w \mid w = w^R\}$)
+%are discussed in \cite{campeanu2009patterns}
+%and \cite{CAMPEANU2009Intersect}.
+%Freydenberger \cite{Frey2013} investigated the
+%expressive power of back-references. He showed several
+%undecidability and descriptional complexity results
+%of back-references, concluding that they add
+%great power to certain programming tasks but are difficult to harness.
+%An interesting question would then be
+%whether we can add restrictions to them,
+%so that they become expressive enough for practical use such
+%as html files, but not too powerful.
+%Freydenberger and Schmid \cite{FREYDENBERGER20191}
+%introduced the notion of deterministic
+%regular expressions with back-references to achieve
+%a better balance between expressiveness and tractability.
+%
+%
+%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
+%investigated the time complexity of different variants
+%of back-references.
+%We are not aware of any work that uses derivatives on back-references.
+%
+%See \cite{BERGLUND2022} for a survey
+%of these works and comparison between different
+%flavours of back-references syntax (e.g. whether references can be circular,
+%can labels be repeated etc.).
+%
+%
+%\subsection{Matchers and Lexers with Mechanised Proofs}
+%We are aware
+%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
+%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
+%of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
+%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
+%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. +% +%\subsection{Static Analysis of Evil Regex Patterns} +%When a regular expression does not behave as intended, +%people usually try to rewrite the regex to some equivalent form +%or they try to avoid the possibly problematic patterns completely, +%for which many false positives exist\parencite{Davis18}. +%Animated tools to "debug" regular expressions such as +% \parencite{regexploit2021} \parencite{regex101} are also popular. +%We are also aware of static analysis work on regular expressions that +%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke +%\parencite{Rathnayake2014StaticAF} proposed an algorithm +%that detects regular expressions triggering exponential +%behavious on backtracking matchers. +%Weideman \parencite{Weideman2017Static} came up with +%non-linear polynomial worst-time estimates +%for regexes, attack string that exploit the worst-time +%scenario, and "attack automata" that generates +%attack strings. +% +% +% +% +% +% +%Thanks to our theorem-prover-friendly approach, +%we believe that +%this finiteness bound can be improved to a bound +%linear to input and +%cubic to the regular expression size using a technique by +%Antimirov\cite{Antimirov95}. +%Once formalised, this would be a guarantee for the absence of all super-linear behavious. +%We are working out the +%details. +% +% +%To our best knowledge, no lexing libraries using Brzozowski derivatives +%have similar complexity-related bounds, +%and claims about running time are usually speculative and backed by empirical +%evidence on a few test cases. +%If a matching or lexing algorithm +%does not come with certain basic complexity related +%guarantees (for examaple the internal data structure size +%does not grow indefinitely), +%then they cannot claim with confidence having solved the problem +%of catastrophic backtracking. +% +% +%\section{Optimisations} +%Perhaps the biggest problem that prevents derivative-based lexing +%from being more widely adopted +%is that they are still not blindingly fast in practice, unable to +%reach throughputs like gigabytes per second, which the commercial +%regular expression matchers such as Snort and Bro are able to achieve. +%For our algorithm to be more attractive for practical use, we +%need more correctness-preserving optimisations. +%%So far our focus has been mainly on the bit-coded algorithm, +%%but the injection-based lexing algorithm +%%could also be sped up in various ways: +%% +%One such optimisation is defining string derivatives +%directly, without recursively decomposing them into +%character-by-character derivatives. For example, instead of +%replacing $(r_1 + r_2)\backslash (c::cs)$ by +%$((r_1 + r_2)\backslash c)\backslash cs$, we rather +%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$. +% +% +%\subsection{Derivatives and Zippers} +%Zipper is a data structure designed to focus on +%and navigate between local parts of a tree. +%The idea is that often operations on a large tree only deal with +%local regions each time. +%It was first formally described by Huet \cite{HuetZipper}. +%Typical applications of zippers involve text editor buffers +%and proof system databases. +%In our setting, the idea is to compactify the representation +%of derivatives with zippers, thereby making our algorithm faster. +%We draw our inspirations from several works on parsing, derivatives +%and zippers. +% +%Edelmann et al. 
developed a formalised parser for LL(1) grammars using derivatives +%\cite{Zippy2020}. +%They adopted zippers to improve the speed, and argued that the runtime +%complexity of their algorithm was linear with respect to the input string. +% +%The idea of using Brzozowski derivatives on general context-free +%parsing was first implemented +%by Might et al. \ref{Might2011}. +%They used memoization and fixpoint construction to eliminate infinite recursion, +%which is a major problem for using derivatives on context-free grammars. +%The initial version was quite slow----exponential on the size of the input. +%Adams et al. \ref{Adams2016} improved the speed and argued that their version +%was cubic with respect to the input. +%Darragh and Adams \cite{10.1145/3408990} further improved the performance +%by using zippers in an innovative way--their zippers had multiple focuses +%instead of just one in a traditional zipper to handle ambiguity. +%Their algorithm was not formalised, though. +% +%We aim to integrate the zipper data structure into our lexer. +%The idea is very simple: using a zipper with multiple focuses +%just like Darragh did, we could represent +%\[ +% x\cdot r + y \cdot r + \ldots +%\] +%as +%\[ +% (x+y + \ldots) \cdot r. +%\] +%This would greatly reduce the amount of space needed +%when we have many terms like $x\cdot r$. +% +%into the current lexer. +%This allows the lexer to not traverse the entire +%derivative tree generated +%in every intermediate step, but only a small segment +%that is currently active. +% +%Some initial results have been obtained, but more care needs to be taken to make sure +%that the lexer output conforms to the POSIX standard. Formalising correctness of +%such a lexer using zippers will probably require using an imperative +%framework with separation logic as it involves +%mutable data structures, which is also interesting to look into. +% +%To further optimise the algorithm, +%I plan to add a ``deduplicate'' function that tells +%whether two regular expressions denote the same +%language using +%an efficient and verified equivalence checker. +%In this way, the internal data structures can be pruned more aggressively, +%leading to better simplifications and ultimately faster algorithms. +% +% +%Some initial results +%We first give a brief introduction to what zippers are, +%and other works +%that apply zippers to derivatives +%When dealing with large trees, it would be a waste to +%traverse the entire tree if +%the operation only +%involves a small fraction of it. +%The idea is to put the focus on that subtree, turning other parts +%of the tree into a context +% +% +%One observation about our derivative-based lexing algorithm is that +%the derivative operation sometimes traverses the entire regular expression +%unnecessarily: +