% Chapter Template

\chapter{Related Work} % Main chapter title

\label{RelatedWork}

%In this chapter, we introduce
%work relevant to this thesis.

\section{Regular Expressions, Derivatives and POSIX Lexing}

%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}

Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
Since then they have become a fundamental concept in formal languages and
automata theory \cite{Sakarovitch2009}.
Brzozowski defined derivatives of regular expressions in his PhD thesis in
1964 \cite{Brzozowski1964}, in which he proved that the number of derivatives
of a regular expression is finite modulo the ACI axioms.
It is worth pointing out that this result does not directly translate to our
finiteness proof, and our proof does not depend on it.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents derivative terms in a way that allows efficient
de-duplication, and we do not make use of an equivalence checker that exploits
ACI-equivalent terms.

Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
They first introduced the elegant and simple idea of injection-based lexing
and bit-coded lexing.
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven incorporated
these ideas into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, based on the ideas by
Frisch and Cardelli \cite{Frisch2004}, were later found to have unfillable
gaps by Ausaf et al. \cite{AusafDyckhoffUrban2016}, who came up with an
alternative proof inspired by Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to shuffling regular
expressions \cite{Sulzmann2015}, which offer a much more succinct way of
describing certain regular languages.

Regular expressions and lexers have been a popular topic in the
theorem-proving and functional-programming communities.
In the next subsection we give a list of lexers and matchers that come with a
machine-checked correctness proof.

\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware of a mechanised correctness proof of Brzozowski's
derivative-based matcher in HOL4 by Owens and Slind~\parencite{Owens2008}.
Another one in Isabelle/HOL is part of the work by Krauss and Nipkow
\parencite{Krauss2011}, another one in Coq is given by Coquand and Siles
\parencite{Coquand2012}, and Ribeiro and Du Bois gave one in Agda
\parencite{RibeiroAgda2017}.
The most recent works on verified lexers, to the best of our knowledge, are
the Verbatim \cite{Verbatim} and Verbatim++ \cite{Verbatimpp} lexers by
Egolf et al.
Verbatim is based on derivatives but does not simplify them; it can therefore
be very slow on evil regular expressions.
The Verbatim++ lexer adds many correctness-preserving optimisations to the
Verbatim lexer and is therefore quite fast on many inputs.
The problem is that Egolf et al. chose to use traditional DFAs to speed up
lexing, and for DFAs bounded repetitions are a real bottleneck.

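One reason is the well-known state blow-up that counters can cause in DFAs.
Writing $r^{\{n\}}$ for the $n$-fold bounded repetition of $r$ (the notation
may differ slightly from the one used elsewhere in this thesis), consider the
family of languages over $\{a,b\}$ whose $n$-th last character is an $a$:
\[
  (a + b)^* \cdot a \cdot (a + b)^{\{n-1\}}
\]
A standard Myhill--Nerode argument shows that any DFA for this language needs
at least $2^{n}$ states, since it has to remember the last $n$ characters it
has read, whereas the regular expression itself stays small and grows only in
the counter $n$.
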
This thesis builds on the formal specification of POSIX rules and the formal
proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
The work on bounded repetitions presented here is a continuation of the work
by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with in theorem provers \cite{Nipkow1998}.
One way to represent automata is as graphs, but graphs are not an inductive
datatype.
Having to set up induction principles over the number of nodes in a graph
makes formal reasoning unintuitive and convoluted, resulting in large
formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes of both
graphs are distinct; if they are not, the nodes have to be renamed.
Using Coq, which provides dependent types, can potentially make things
slightly easier \cite{Doczkal2013}.
Another representation of automata uses matrices, but induction over matrices
is not straightforward either.
Both approaches have been used in the past and resulted in huge
formalisations.
There are also more clever representations, for example the one by Paulson
based on hereditarily finite sets \cite{Paulson2015}, where the problem of
combining graphs is handled more gracefully.
%but we believe that such clever tricks are not very obvious for
%the John-average-Isabelle-user.
Because of these problems with automata, we prefer regular expressions and
derivatives, which can be implemented in theorem provers and functional
programming languages with ease.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} developed a notion of
\emph{non-problematic values}, which is a slight variation of the values
defined by the inhabitation relation in Figure~\ref{fig:inhab}.
They then defined an ordering between values and showed that the maximal
element of those values corresponds to the output of their GREEDY lexing
algorithm.

Okui and Suzuki \cite{Okui10} allow iterations of a star value to flatten to
the empty string, relaxing the star rule of the inhabitation relation in
Figure~\ref{fig:inhab}.
They refer to the more restrictive version used in this thesis (which was
defined by Ausaf et al. \cite{AusafDyckhoffUrban2016}) as \emph{canonical
values}.
The interesting link between the work by Ausaf et al. and that of Okui and
Suzuki is that, despite their distinct formalisations of POSIX values, they
define the same notion.
See \cite{AusafDyckhoffUrban2016} for details on how the alternative
definition by Okui and Suzuki relates to the formalisation described in this
thesis.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves have come up with a POSIX definition
\cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect to
regular expressions and tried to establish, by a pencil-and-paper proof, that
their algorithm outputs the minimal element.
But because the ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

\subsection{Static Analysis of Evil Regex Patterns}
When faced with catastrophic backtracking, programmers sometimes have to
rewrite their regexes in an ad hoc fashion.
Knowing which patterns should be avoided can therefore be helpful.
Tools for this purpose often over-approximate the set of evil regular
expression patterns and can produce many false positives \cite{Davis18}.
Animated tools for ``debugging'' regular expressions, such as
\parencite{regexploit2021} and \parencite{regex101}, are also popular.

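As a concrete illustration of the kind of pattern such tools try to flag, the
following small Scala program (written for illustration only; the names are
ours) runs the nested repetition \texttt{(a+)+b} against a string consisting
solely of $a$'s using Java's backtracking regex engine.
Because the input contains no \texttt{b}, the engine has to try exponentially
many ways of splitting the $a$'s between the inner and the outer repetition
before it can report failure.

\begin{verbatim}
import java.util.regex.Pattern

// Classic catastrophic backtracking: the nested repetition (a+)+
// followed by a character that never occurs in the input.
object CatastrophicBacktracking {
  def main(args: Array[String]): Unit = {
    val evil  = Pattern.compile("(a+)+b")
    val input = "a" * 25          // increasing this makes the time explode
    val start = System.nanoTime()
    val res   = evil.matcher(input).matches() // false, but found very slowly
    println(s"matched: $res after ${(System.nanoTime() - start) / 1e9}s")
  }
}
\end{verbatim}

A non-backtracking matcher, for instance one based on derivatives, does not
exhibit this particular failure mode, which is why detecting such patterns
statically is mainly of interest for backtracking engines.
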
We are also aware of static analysis work on regular expressions that aims to
detect potentially exponential regex patterns.
Rathnayake and Thielecke \parencite{Rathnayake2014StaticAF} proposed an
algorithm that detects regular expressions triggering exponential behaviour
with backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with non-linear polynomial
worst-case estimates for regexes, together with ``attack strings'' that
exploit the worst-case scenario, and introduced an ``attack automaton'' that
generates attack strings.

\section{Optimisations}
Perhaps the biggest problem preventing derivative-based lexing from being more
widely adopted is that it tends not to be very fast in practice, unable to
reach throughputs of gigabytes per second, which is the kind of application we
had in mind when we started looking at this topic.
Network inspection tools such as Snort \cite{Snort1999} and Bro \cite{Bro} are
capable of inspecting payloads at line rates (which can be up to a few
gigabits per second) against thousands of regex rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we need more
correctness-preserving optimisations.

FPGA-based implementations such as \cite{Sidhu2001} have the advantage of
being reconfigurable and parallel, but suffer from lower clock frequencies and
limited scalability.
Traditional automata-based approaches that use DFAs instead of NFAs benefit
from the fact that only a single transition is needed for each input character
\cite{Becchi08}; the lower memory bandwidth leads to faster performance.
However, they suffer from exponential memory requirements in the presence of
bounded repetitions.
Compression techniques, such as those in \cite{Kumar2006} and
\cite{Becchi2007}, are used to reduce the memory footprint.
Variations of pure NFAs or DFAs, such as counting-set automata
\cite{Turonova2020}, have been proposed to better deal with bounded
repetitions, but they are usually not used in formalised proofs.

%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
Another direction of optimisation for derivative-based approaches is to define
string derivatives directly, without recursively decomposing them into
character-by-character derivatives.
For example, instead of replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}),
we rather calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$
directly.
This has the potential to speed up matching because the input is processed at
a larger granularity.
One interesting point to explore is whether this can be done for $\inj$ as
well, so that we can generate a lexical value rather than merely obtain a
matcher.
It is also not yet clear how this idea can be extended to sequences and stars.

\subsection{Derivatives and Zippers}
Zippers are a data structure designed to focus on and navigate between local
parts of a tree.
The idea is that operations on large trees often only deal with a local region
at a time; it would therefore be wasteful to traverse the entire tree if the
operation only involves a small fraction of it.
Zippers were first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text editor buffers and proof system
databases.
In our setting, the idea is to compactify the representation of derivatives
with zippers, thereby making our algorithm faster.

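To give an idea of the data structure, the following is a minimal Huet-style
zipper over binary trees in Scala.
It is only meant to illustrate the general mechanism of a focus plus a path of
contexts; it is not the zipper representation of derivatives used in this
thesis or in the works discussed below.

\begin{verbatim}
// A binary tree and a Huet-style zipper over it.
sealed trait Tree
case class Leaf(n: Int)           extends Tree
case class Node(l: Tree, r: Tree) extends Tree

// A context records the path from the focus back to the root,
// keeping the siblings that were not visited.
sealed trait Context
case object Top                                 extends Context
case class InLeft(parent: Context, right: Tree) extends Context
case class InRight(left: Tree, parent: Context) extends Context

case class Zipper(focus: Tree, ctx: Context) {
  def downLeft: Zipper = focus match {
    case Node(l, r) => Zipper(l, InLeft(ctx, r))
    case _          => this
  }
  def downRight: Zipper = focus match {
    case Node(l, r) => Zipper(r, InRight(l, ctx))
    case _          => this
  }
  def up: Zipper = ctx match {
    case InLeft(p, r)  => Zipper(Node(focus, r), p)
    case InRight(l, p) => Zipper(Node(l, focus), p)
    case Top           => this
  }
  // Local edits only rebuild the focused part; the contexts share
  // the untouched subtrees.
  def replace(t: Tree): Zipper = copy(focus = t)
}
\end{verbatim}

Replacing the focused subtree takes constant time, independent of the size of
the whole tree, which is one of the properties exploited when derivatives are
represented with zippers.
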
Below we describe several works on parsing, derivatives and zippers.
Edelmann et al. developed a formalised parser for LL(1) grammars using
derivatives \cite{Zippy2020}.
They adopted zippers to improve the speed and argued that the runtime
complexity of their algorithm is linear with respect to the input string,
although they did not provide a formal proof of this claim.
The idea of using Brzozowski derivatives for general context-free parsing was
first implemented by Might et al. \cite{Might2011}.
They used memoization and fixpoint constructions to eliminate infinite
recursion, which is a major problem when using derivatives with context-free
grammars.
Their initial version was quite slow, taking time exponential in the size of
the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
is cubic with respect to the input.
Darragh and Adams \cite{Darragh2020} further improved the performance by using
zippers in an innovative way: their zippers have multiple focuses, instead of
the single focus of a traditional zipper, in order to handle ambiguity.
Their algorithm was, however, not formalised.

\subsection{Back-References}
We introduced regular expressions with back-references in
chapter~\ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions With
Back References) for brevity.
It has been shown by Aho \cite{Aho1990} that the k-vertex-cover problem can be
reduced to the problem of matching with rewbrs, and that matching with rewbrs
is therefore NP-complete.
Given the difficulty of the problem, progress on matching arbitrary rewbrs in
full generality has been slow, with theoretical work on them being fairly
recent.

Campeanu et al. studied rewbrs in the context of formal language theory in
\cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes, proving that the languages
denoted by them are properly contained in the context-sensitive languages.
More intricate questions, such as whether the languages denoted by Perl-like
regexes can express palindromes ($\{w \mid w = w^R\}$), are discussed in
\cite{campeanu2009patterns} and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the expressive power of
back-references.
He showed several undecidability and descriptional complexity results for
back-references, concluding that they add great power to certain programming
tasks but are difficult to comprehend.
An interesting question is then whether one can restrict them so that they
remain expressive enough for practical use, such as matching HTML files, but
do not become too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191} introduced the notion of
deterministic regular expressions with back-references to achieve a better
balance between expressiveness and tractability.
Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants of back-references.
We are not aware of any work that uses derivatives with back-references.
%See \cite{BERGLUND2022} for a survey
%of these works and a comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%whether labels can be repeated, etc.).

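A small example conveys the extra expressive power that makes rewbrs so much
harder: with a single back-reference one can match the copy language
$\{ww \mid w \in \{a,b\}^*\}$, which is not even context-free.
The following Scala snippet (illustrative only) checks this with Java's regex
engine.

\begin{verbatim}
import java.util.regex.Pattern

// The rewbr ([ab]*)\1 matches exactly the strings of the form ww,
// a language that lies outside the context-free languages.
object BackrefDemo {
  def main(args: Array[String]): Unit = {
    val copy = Pattern.compile("([ab]*)\\1")
    println(copy.matcher("abab").matches())  // true:  "ab" followed by "ab"
    println(copy.matcher("abba").matches())  // false: not of the form ww
  }
}
\end{verbatim}

Since this language is not regular, no ordinary regular expression, and hence
no derivative in Brzozowski's sense, can capture it, which hints at why a
derivative-based treatment of back-references is not straightforward.
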
The recently published work on a formally verified lexer by
Moseley et al. \cite{Margus2023PLDI} makes a big step forward in practical
performance: their lexer can compete with unverified regex engines such as
those of Java and Python, and it supports lookaheads.
To reach their speed on larger datasets such as \cite{TwainRegex}, one would
have to incorporate many fine-tunings; one of the first modifications needed
would be to drop whole-string lexing and only extract the submatches needed by
the user.