% Chapter Template
\chapter{Conclusion and Future Work} % Main chapter title
\label{Future}
In this thesis, with the aim of addressing ReDoS attacks once
and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
lexing algorithm with aggressive simplifications \cite{Sulzmann2014}.
We formalised our proofs in the Isabelle/HOL theorem prover.
We have fixed some inefficiencies and a bug in their original simplification
algorithm, and established its correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al. in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because our simplification function alters the structure of the
regular expressions during lexing.
After trying out several workarounds and being
stuck for months searching for a proof,
we were delighted to discover a simple yet
effective proof method:
modelling individual
simplification steps as small-step rewrite rules and proving
that terms linked by these rewrite rules are equivalent.
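To give a flavour of this method (in schematic, unannotated notation, writing $\mathbf{0}$ for the regular expression matching no string; the actual rules operate on annotated regular expressions), a typical simplification step and one of its congruence rules can be pictured as
\[
r + \mathbf{0} \;\rightsquigarrow\; r
\qquad\qquad
\frac{r_1 \;\rightsquigarrow\; r_2}{r_1 \cdot r_3 \;\rightsquigarrow\; r_2 \cdot r_3}
\]
The central lemma then states that terms related by the reflexive transitive closure of $\rightsquigarrow$ yield the same lexing result, from which the agreement between the simplified and the unsimplified bit-coded lexer follows.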
In addition, we have proved a formal bound on the size of the
derivatives computed during lexing. The technique was to establish
``closed forms'', informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014},
for derivatives of compound regular expressions,
and to use these closed forms to control the size.
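For example, stated informally and writing $\sim$ for language equivalence (the formal versions are phrased in terms of the simplified derivatives), the closed form for sequence regular expressions expresses the derivative with respect to a whole string $s$ directly, rather than character by character:
\[
(r_1 \cdot r_2)\backslash s \;\sim\;
(r_1\backslash s)\cdot r_2 \;+\; \sum\,\{\, r_2\backslash s'' \;\mid\; s = s'\,@\,s'' \,\mbox{ and }\, s' \in L(r_1) \,\}
\]
Since the right-hand side is an alternative whose members have a known shape, its size after simplification can be bounded; similar closed forms for alternatives and stars allow the overall bound to be pieced together.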
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
\noindent
Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound that is
linear in the size of the input and
cubic in the size of the regular expression, using a technique by
Antimirov~\cite{Antimirov95}.
Once formalised, this would guarantee the absence of all super-linear behaviour.
We have yet to work out the
details.
Our formalisation is approximately 7500 lines of code. Slightly more than half of this
concerns the finiteness bound obtained in chapter \ref{Finite}. This slight ``bloat''
is caused by having to repeat many definitions for the rrexp datatype.
However, we think we would not have been able to formalise the quite intricate
reasoning directly on annotated regular expressions, because we would have had to
carry around the bit-sequences (which are of course necessary in the algorithm)
even though they do not contribute to the size bound of the calculated derivatives.
To the best of our knowledge, no lexing library using Brzozowski derivatives
comes with similar complexity-related bounds;
claims about running time are usually speculative and backed up only by empirical
evidence on some test cases.
If a matching or lexing algorithm
does not come with complexity-related
guarantees (for example that the internal data structures
do not grow indefinitely),
then one cannot claim with confidence to have solved the problem
of catastrophic backtracking.
We believe our proof method is not specific to this
algorithm, and we intend to extend the approach
to prove the correctness of the faster version
of the lexer proposed in chapter \ref{Cubic}. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95},
we plan to reduce the size bound to one that is polynomial
with respect to the size of the regular expression.
%We have tried out a variety of other optimisations
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.
We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea based on rewrite relations
was the hardest part for us, as it required a deep understanding of what information
is preserved during the simplification process.
Here the ideas of Sulzmann and Lu and of Ausaf et al. were of no help.
%Having the first formalisation gives us much more confidence----it took
%us much less time to obtain the closed forms and size bound, although they
%are more involved than the correctness proof.
Although this has already been shown many times,
the usefulness of formal approaches cannot be overstated:
they not only allowed us to find bugs in Sulzmann and Lu's simplification functions,
but also helped us set realistic expectations about the performance
of algorithms. We believed in the beginning that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that the size $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant, making
$\llbracket r\rrbracket^c$ a small constant factor.
We were quite surprised that this did not go through despite a lot of effort.
This led us to discover the example in chapter \ref{Finite} where this
factor can in fact be exponentially large in the size of the regular expression.
We therefore learned that the stronger simplifications
introduced in chapter \ref{Cubic} are necessary.
Without formal proofs we would not have found this out so soon,
if at all.
%In the future, we plan to incorporate many extensions
%to this work.
\section{Future Work}
The most obvious next-step is to implement the cubic bound and
correctness of $\blexerStrong$
in chapter \ref{Cubic}.
A cubic bound ($O(\llbracket r\rrbracket^c\cdot |s|)$) with respect to
regular expression size will get us one step closer to fulfilling
the linear complexity claim made by Sulzmann and Lu.
With a linear size bound theoretically, the next challenge would be
to generate code that is competitive with respect to
matchers based on DFAs or NFAs. For that a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses,
just as Darragh \cite{Darragh2020} did in their parsing algorithm, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$.
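Below is a minimal sketch of this shared-suffix representation in Scala; it is purely illustrative: the datatype and the function \texttt{shareSuffix} are hypothetical, and bit-annotations as well as the POSIX ordering are ignored.
\begin{verbatim}
// Illustrative only: unannotated regular expressions, and a function that
// merges terms SEQ(x, r), SEQ(y, r), ... sharing the same right component r
// into a single SEQ(ALTS(List(x, y, ...)), r).
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

def shareSuffix(rs: List[Rexp]): List[Rexp] = {
  val (seqs, others) = rs.partition(_.isInstanceOf[SEQ])
  val grouped = seqs.collect { case SEQ(l, r) => (r, l) }.groupBy(_._1)
  val merged = grouped.toList.map {
    case (r, List((_, l))) => SEQ(l, r)                // a single term: keep it
    case (r, pairs)        => SEQ(ALTS(pairs.map(_._2)), r)
  }
  merged ::: others
}
\end{verbatim}
Applied to a list representing $x\cdot r + y \cdot r$, this returns a single term representing $(x + y)\cdot r$, as in the equation above.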
Some initial results have been obtained, but
significant changes are needed to make sure
that the lexer output conforms to the POSIX standard.
We aim to make use of Okui and Suzuki's labelling
system \cite{Okui10} to ensure that regular expressions represented as zippers
always maintain the POSIX ordering.
%Formalising correctness of
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.
To further optimise the algorithm,
we plan to add a deduplication function that detects
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker such as the one by Krauss and Nipkow \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
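A sketch of how such pruning could look is given below (in Scala, with hypothetical names; \texttt{equiv} stands for a decision procedure for language equivalence, for instance one obtained from the verified checker in \cite{Krauss2011}).
\begin{verbatim}
// Keep only the first (leftmost) representative of each group of
// language-equivalent elements; `equiv` is assumed to decide whether
// two regular expressions denote the same language.
def dedup[A](rs: List[A], equiv: (A, A) => Boolean): List[A] =
  rs.foldLeft(List.empty[A]) { (kept, r) =>
    if (kept.exists(equiv(_, r))) kept else kept :+ r
  }
\end{verbatim}
Keeping the leftmost representative matters because in a POSIX lexer the order of alternatives determines which value is preferred; a verified \texttt{equiv} guarantees that pruning does not change the language, although preserving the POSIX value would require an additional argument.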
Traditional automata approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer only deals with a single regular expression at a time.
Extending it to multiple regular expressions might open up more
possibilities for simplification.
As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can also accelerate
matching speed. It would be interesting to study the memory
bandwidth of our derivative-based matching algorithm and improve it accordingly.
Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore performance enhancements through memoization for our algorithm
in a correctness-preserving way.
The monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
algorithm also seems quite relevant for such an enterprise.
We aim to learn from their refinement framework,
which generates high-performance code together with proofs
that can be broken down into small steps.
Extending Sulzmann and Lu's algorithm to handle
more PCRE constructs such as lookaheads, capture groups
and back-references, together with proofs of basic properties such as correctness,
also seems useful, even though such extensions cannot in general be made efficient.
Creating a correct and easy-to-prove version of $\blexersimp$ for them
seems an appealing next step for both practice and theory.
%We establish the correctness
%claim made by them about the
%, which were only proven with pencil and paper.
%\section{Work on Back-References}
%We introduced regular expressions with back-references
%in chapter \ref{Introduction}.
%We adopt the common practice of calling them rewbrs (Regular Expressions
%With Back References) in this section for brevity.
%It has been shown by Aho \cite{Aho1990}
%that the k-vertex cover problem can be reduced
%to the problem of rewbrs matching, and is therefore NP-complete.
%Given the depth of the problem, the progress made at the full generality
%of arbitrary rewbrs matching has been slow, with
%theoretical work on them being
%fairly recent.
%
%Campaneu et al. studied regexes with back-references
%in the context of formal languages theory in
%their 2003 work \cite{campeanu2003formal}.
%They devised a pumping lemma for Perl-like regexes,
%proving that the langugages denoted by them
%is properly contained in context-sensitive
%languages.
%More interesting questions such as
%whether the language denoted by Perl-like regexes
%can express palindromes ($\{w \mid w = w^R\}$)
%are discussed in \cite{campeanu2009patterns}
%and \cite{CAMPEANU2009Intersect}.
%Freydenberger \cite{Frey2013} investigated the
%expressive power of back-references. He showed several
%undecidability and decriptional complexity results
%of back-references, concluding that they add
%great power to certain programming tasks but are difficult to harness.
%An interesting question would then be
%whether we can add restrictions to them,
%so that they become expressive enough for practical use such
%as html files, but not too powerful.
%Freydenberger and Schmid \cite{FREYDENBERGER20191}
%introduced the notion of deterministic
%regular expressions with back-references to achieve
%a better balance between expressiveness and tractability.
%
%
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
%investigated the time complexity of different variants
%of back-references.
%We are not aware of any work that uses derivatives on back-references.
%
%See \cite{BERGLUND2022} for a survey
%of these works and comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%can labels be repeated etc.).
%
%
%\subsection{Matchers and Lexers with Mechanised Proofs}
%We are aware
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
%of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
%
%\subsection{Static Analysis of Evil Regex Patterns}
%When a regular expression does not behave as intended,
%people usually try to rewrite the regex to some equivalent form
%or they try to avoid the possibly problematic patterns completely,
%for which many false positives exist\parencite{Davis18}.
%Animated tools to "debug" regular expressions such as
% \parencite{regexploit2021} \parencite{regex101} are also popular.
%We are also aware of static analysis work on regular expressions that
%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke
%\parencite{Rathnayake2014StaticAF} proposed an algorithm
%that detects regular expressions triggering exponential
%behavious on backtracking matchers.
%Weideman \parencite{Weideman2017Static} came up with
%non-linear polynomial worst-time estimates
%for regexes, attack string that exploit the worst-time
%scenario, and "attack automata" that generates
%attack strings.
%
%
%
%
%
%
%
%
%\section{Optimisations}
%Perhaps the biggest problem that prevents derivative-based lexing
%from being more widely adopted
%is that they are still not blindingly fast in practice, unable to
%reach throughputs like gigabytes per second, which the commercial
%regular expression matchers such as Snort and Bro are able to achieve.
%For our algorithm to be more attractive for practical use, we
%need more correctness-preserving optimisations.
%%So far our focus has been mainly on the bit-coded algorithm,
%%but the injection-based lexing algorithm
%%could also be sped up in various ways:
%%
%One such optimisation is defining string derivatives
%directly, without recursively decomposing them into
%character-by-character derivatives. For example, instead of
%replacing $(r_1 + r_2)\backslash (c::cs)$ by
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
%
%
%\subsection{Derivatives and Zippers}
%Zipper is a data structure designed to focus on
%and navigate between local parts of a tree.
%The idea is that often operations on a large tree only deal with
%local regions each time.
%It was first formally described by Huet \cite{HuetZipper}.
%Typical applications of zippers involve text editor buffers
%and proof system databases.
%In our setting, the idea is to compactify the representation
%of derivatives with zippers, thereby making our algorithm faster.
%We draw our inspirations from several works on parsing, derivatives
%and zippers.
%
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
%\cite{Zippy2020}.
%They adopted zippers to improve the speed, and argued that the runtime
%complexity of their algorithm was linear with respect to the input string.
%
%The idea of using Brzozowski derivatives on general context-free
%parsing was first implemented
%by Might et al. \ref{Might2011}.
%They used memoization and fixpoint construction to eliminate infinite recursion,
%which is a major problem for using derivatives on context-free grammars.
%The initial version was quite slow----exponential on the size of the input.
%Adams et al. \ref{Adams2016} improved the speed and argued that their version
%was cubic with respect to the input.
%Darragh and Adams \cite{10.1145/3408990} further improved the performance
%by using zippers in an innovative way--their zippers had multiple focuses
%instead of just one in a traditional zipper to handle ambiguity.
%Their algorithm was not formalised, though.
%
%
%into the current lexer.
%This allows the lexer to not traverse the entire
%derivative tree generated
%in every intermediate step, but only a small segment
%that is currently active.
%
%Some initial results have been obtained, but more care needs to be taken to make sure
%that the lexer output conforms to the POSIX standard. Formalising correctness of
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.
%
%To further optimise the algorithm,
%I plan to add a ``deduplicate'' function that tells
%whether two regular expressions denote the same
%language using
%an efficient and verified equivalence checker.
%In this way, the internal data structures can be pruned more aggressively,
%leading to better simplifications and ultimately faster algorithms.
%
%
%Some initial results
%We first give a brief introduction to what zippers are,
%and other works
%that apply zippers to derivatives
%When dealing with large trees, it would be a waste to
%traverse the entire tree if
%the operation only
%involves a small fraction of it.
%The idea is to put the focus on that subtree, turning other parts
%of the tree into a context
%
%
%One observation about our derivative-based lexing algorithm is that
%the derivative operation sometimes traverses the entire regular expression
%unnecessarily: