lexing: ChengsongTanPhdThesis/Chapters/Future.tex@e3752aac8ec2

% Chapter Template

\chapter{Future Work and Conclusion} % Main chapter title

\label{Future}

In this thesis, in order to solve the ReDoS attacks
once and for all, we have set out to formalise the correctness
proof of Sulzmann and Lu's
lexing algorithm with simplifications \cite{Sulzmann2014}
in Isabelle/HOL.
We have fixed inefficiencies in the original simplification
algorithm, and established the correctness by proving
that our algorithm outputs the same result as the original bit-coded
lexer without simplifications (whose correctness was established in
previous work by Ausaf et al. in
\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
The proof technique used in \cite{Ausaf} does not work in our case
because the simplification function messes with the structure of
regular expressions.
Despite having to try out several workarounds and being
stuck for months or even years looking for proofs,
we were delighted to have discovered the simple yet
effective proof method
by modelling individual
simplification steps as small-step rewriting rules and proving
equivalence between terms linked by these rules.

In addition, we have proved a formal size bound on the
regular expressions. The technique was by establishing
a ``closed form'' informally
described by
Murugesan and Shanmuga Sundaram \cite{Murugesan2014}
for compound regular expressions' derivatives
and using those closed forms to control the size.
The Isabelle/HOL code for our formalisation can be
found at
\begin{center}
\url{https://github.com/hellotommmy/posix}
\end{center}
\noindent
Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound
linear to input and
cubic to the regular expression size using a technique by
Antimirov\cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all super-linear behavious.
We are working out the
details.

To our best knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity related
guarantees (for examaple the internal data structure size
does not grow indefinitely),
then one cannot claim with confidence having solved the problem
of catastrophic backtracking.

We believe our proof method is not specific to this
algorithm, and intend to extend this approach
to prove the correctness of a faster version
of the lexer. The closed
forms can then be re-used as well. Together
with the idea from Antimirov \cite{Antimirov95}
we plan to reduce the size bound to become polynomial
with respect to the size of the regular expression.
%We have tried out a variety of other optimisations
%to make lexing/parsing with derivatives faster,
%for example in \cite{Edelmann2021}, \cite{Darragh2020}
%and \cite{Verbatimpp}.

We have learned quite a few lessons in this process.
As simple as the end result may seem,
coming up with the initial proof idea with rewriting relations
was the hardest for us, as it requires a deep understanding of what information
is preserved during the simplification process.
Having the first formalisation gives us much more confidence----it took
us much less time to obtain the closed forms and size bound, although they
are more involved than the correctness proof.
As having been tested numerous times,
the usefulness of formal approaches cannot be overstated:
they not only allow us to find bugs in Sulzmann and Lu's simplification functions,
but also helped us set up realistic expectations about performance
of algorithms. We believed in the beginning that
the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
was already able to achieve the linear bound claimed by
Sulzmann and Lu.
We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$
is $O(\llbracket r\rrbracket^c\cdot |s|)$,
where $c$ is a small constant number, making
$\llbracket r\rrbracket^c$ a small constant factor.
We were quite surprised that this did not go through despite a lot of efforts.
This led us to discover the example where $\llbracket r\rrbracket^c$
can in fact be exponentially large in chapter \ref{Finite}.
We therefore learned the necessity to introduce the stronger simplifications
in chapter \ref{Cubic}.
Without formal proofs we could not have found out this so soon.

In the future, we plan to incorporate many extensions
to this work.

\section{Future Work}
The most obvious next-step is to implement the cubic bound and
correctness of $\blexerStrong$
in chapter \ref{Cubic} formally.
A cubic bound ($O(\llbracket r\rrbracket^c\cdot |s|)$) with respect to
regular expression size will fulfill
the linear complexity claim made by Sulzmann and Lu.

With a linear size bound theoretically, the next challenge would be
to generate code that are competitive with respect to commercial
matchers. For that a lot of optimisations are needed.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses
just like Darragh \cite{Darragh2020} did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$.
Some initial results have been obtained, but
significant changes are needed to make sure
that the lexer output conforms to the POSIX standard.
We aim to make use of Okui and Suzuki's labelling
system \cite{Okui10} to ensure regular expressions represented as zippers
always maintain the POSIX orderings.
%Formalising correctness of
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.

To further optimise the algorithm,
we plan to add a deduplicate function that tells
whether two regular expressions denote the same
language using
an efficient and verified equivalence checker like \cite{Krauss2011}.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.

Traditional automaton approaches can be sped up by compiling
multiple rules into the same automaton. This has been done
in \cite{Kumar2006} and \cite{Becchi08}, for example.
Currently our lexer only deals with a single regular expression each time.
Putting multiple regular expressions might open up more
possibilities of simplifications.

As already mentioned in chapter \ref{RelatedWork},
reducing the number of memory accesses can accelerate the
matching speed. It would be interesting to see the memory
bandwidth of our derivative-based matching and improve accordingly.

Memoization has been used frequently in lexing and parsing to achieve
better complexity results, for example in \cite{Reps1998},
\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
We plan to explore the performance enhancements by memoization in our algorithm
in a correctness-preserving way.
The Monadic data refinement technique that Lammich and Tuerk used
in \cite{Lammich2012} to optimise the Hopcroft automaton minimisation
algorithm seem quite interesting.
We aim to learn from their refinement framework
which generates high performance code with proofs
that can be broken down into uncluttered small steps.

%We establish the correctness
%claim made by them about the
%, which were only proven with pencil and paper.

%\section{Work on Back-References}
%We introduced regular expressions with back-references
%in chapter \ref{Introduction}.
%We adopt the common practice of calling them rewbrs (Regular Expressions
%With Back References) in this section for brevity.
%It has been shown by Aho \cite{Aho1990}
%that the k-vertex cover problem can be reduced
%to the problem of rewbrs matching, and is therefore NP-complete.
%Given the depth of the problem, the progress made at the full generality
%of arbitrary rewbrs matching has been slow, with
%theoretical work on them being
%fairly recent.
%
%Campaneu et al. studied regexes with back-references
%in the context of formal languages theory in
%their 2003 work \cite{campeanu2003formal}.
%They devised a pumping lemma for Perl-like regexes,
%proving that the langugages denoted by them
%is properly contained in context-sensitive
%languages.
%More interesting questions such as
%whether the language denoted by Perl-like regexes
%can express palindromes ($\{w \mid w = w^R\}$)
%are discussed in \cite{campeanu2009patterns}
%and \cite{CAMPEANU2009Intersect}.
%Freydenberger \cite{Frey2013} investigated the
%expressive power of back-references. He showed several
%undecidability and decriptional complexity results
%of back-references, concluding that they add
%great power to certain programming tasks but are difficult to harness.
%An interesting question would then be
%whether we can add restrictions to them,
%so that they become expressive enough for practical use such
%as html files, but not too powerful.
%Freydenberger and Schmid \cite{FREYDENBERGER20191}
%introduced the notion of deterministic
%regular expressions with back-references to achieve
%a better balance between expressiveness and tractability.
%
%
%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
%investigated the time complexity of different variants
%of back-references.
%We are not aware of any work that uses derivatives on back-references.
%
%See \cite{BERGLUND2022} for a survey
%of these works and comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%can labels be repeated etc.).
%
%
%\subsection{Matchers and Lexers with Mechanised Proofs}
%We are aware
%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
%of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
%
%\subsection{Static Analysis of Evil Regex Patterns}
%When a regular expression does not behave as intended,
%people usually try to rewrite the regex to some equivalent form
%or they try to avoid the possibly problematic patterns completely,
%for which many false positives exist\parencite{Davis18}.
%Animated tools to "debug" regular expressions such as
% \parencite{regexploit2021} \parencite{regex101} are also popular.
%We are also aware of static analysis work on regular expressions that
%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke
%\parencite{Rathnayake2014StaticAF} proposed an algorithm
%that detects regular expressions triggering exponential
%behavious on backtracking matchers.
%Weideman \parencite{Weideman2017Static} came up with
%non-linear polynomial worst-time estimates
%for regexes, attack string that exploit the worst-time
%scenario, and "attack automata" that generates
%attack strings.
%
%
%
%
%
%
%Thanks to our theorem-prover-friendly approach,
%we believe that
%this finiteness bound can be improved to a bound
%linear to input and
%cubic to the regular expression size using a technique by
%Antimirov\cite{Antimirov95}.
%Once formalised, this would be a guarantee for the absence of all super-linear behavious.
%We are working out the
%details.
%
%
%To our best knowledge, no lexing libraries using Brzozowski derivatives
%have similar complexity-related bounds,
%and claims about running time are usually speculative and backed by empirical
%evidence on a few test cases.
%If a matching or lexing algorithm
%does not come with certain basic complexity related
%guarantees (for examaple the internal data structure size
%does not grow indefinitely),
%then they cannot claim with confidence having solved the problem
%of catastrophic backtracking.
%
%
%\section{Optimisations}
%Perhaps the biggest problem that prevents derivative-based lexing
%from being more widely adopted
%is that they are still not blindingly fast in practice, unable to
%reach throughputs like gigabytes per second, which the commercial
%regular expression matchers such as Snort and Bro are able to achieve.
%For our algorithm to be more attractive for practical use, we
%need more correctness-preserving optimisations.
%%So far our focus has been mainly on the bit-coded algorithm,
%%but the injection-based lexing algorithm
%%could also be sped up in various ways:
%%
%One such optimisation is defining string derivatives
%directly, without recursively decomposing them into
%character-by-character derivatives. For example, instead of
%replacing $(r_1 + r_2)\backslash (c::cs)$ by
%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
%
%
%\subsection{Derivatives and Zippers}
%Zipper is a data structure designed to focus on
%and navigate between local parts of a tree.
%The idea is that often operations on a large tree only deal with
%local regions each time.
%It was first formally described by Huet \cite{HuetZipper}.
%Typical applications of zippers involve text editor buffers
%and proof system databases.
%In our setting, the idea is to compactify the representation
%of derivatives with zippers, thereby making our algorithm faster.
%We draw our inspirations from several works on parsing, derivatives
%and zippers.
%
%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
%\cite{Zippy2020}.
%They adopted zippers to improve the speed, and argued that the runtime
%complexity of their algorithm was linear with respect to the input string.
%
%The idea of using Brzozowski derivatives on general context-free
%parsing was first implemented
%by Might et al. \ref{Might2011}.
%They used memoization and fixpoint construction to eliminate infinite recursion,
%which is a major problem for using derivatives on context-free grammars.
%The initial version was quite slow----exponential on the size of the input.
%Adams et al. \ref{Adams2016} improved the speed and argued that their version
%was cubic with respect to the input.
%Darragh and Adams \cite{10.1145/3408990} further improved the performance
%by using zippers in an innovative way--their zippers had multiple focuses
%instead of just one in a traditional zipper to handle ambiguity.
%Their algorithm was not formalised, though.
%
%We aim to integrate the zipper data structure into our lexer.
%The idea is very simple: using a zipper with multiple focuses
%just like Darragh did, we could represent
%\[
% x\cdot r + y \cdot r + \ldots
%\]
%as
%\[
% (x+y + \ldots) \cdot r.
%\]
%This would greatly reduce the amount of space needed
%when we have many terms like $x\cdot r$.
%
%into the current lexer.
%This allows the lexer to not traverse the entire
%derivative tree generated
%in every intermediate step, but only a small segment
%that is currently active.
%
%Some initial results have been obtained, but more care needs to be taken to make sure
%that the lexer output conforms to the POSIX standard. Formalising correctness of
%such a lexer using zippers will probably require using an imperative
%framework with separation logic as it involves
%mutable data structures, which is also interesting to look into.
%
%To further optimise the algorithm,
%I plan to add a ``deduplicate'' function that tells
%whether two regular expressions denote the same
%language using
%an efficient and verified equivalence checker.
%In this way, the internal data structures can be pruned more aggressively,
%leading to better simplifications and ultimately faster algorithms.
%
%
%Some initial results
%We first give a brief introduction to what zippers are,
%and other works
%that apply zippers to derivatives
%When dealing with large trees, it would be a waste to
%traverse the entire tree if
%the operation only
%involves a small fraction of it.
%The idea is to put the focus on that subtree, turning other parts
%of the tree into a context
%
%
%One observation about our derivative-based lexing algorithm is that
%the derivative operation sometimes traverses the entire regular expression
%unnecessarily:

author	Chengsong
	Sat, 24 Dec 2022 11:55:04 +0000
changeset 637	e3752aac8ec2
child 640	bd1354127574
permissions	-rw-r--r--