% Chapter Template
\chapter{Related Work} % Main chapter title
\label{RelatedWork}
In this chapter, we review
the work relevant to this thesis.
\section{Work on Back-References}
We introduced regular expressions with back-references
in Chapter~\ref{Introduction}.
For brevity, we adopt the common practice of calling them rewbrs (regular expressions
with back-references) in this section.
Aho \cite{Aho1990} showed
that the $k$-vertex-cover problem can be reduced
to the rewbr matching problem, which is therefore NP-hard;
Aho in fact showed the matching problem to be NP-complete.
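
For readers less familiar with the syntax, the following small Scala sketch
(using the JVM's built-in backtracking regex engine rather than anything developed
in this thesis; the object name is ours) shows what a back-reference does:
the group \verb|(\w+)| captures a word and \verb|\1| requires exactly the same
word to occur again, a constraint that no back-reference-free regular expression
can encode.

\begin{verbatim}
// A minimal sketch (not part of the thesis implementation): back-references
// as provided by the JVM's backtracking regex engine.  The pattern
// (\w+) \1 matches a word followed by a repetition of exactly that word,
// a language that no back-reference-free regular expression can describe.
object RewbrExample {
  def main(args: Array[String]): Unit = {
    println("foo foo".matches("""(\w+) \1"""))  // true
    println("foo bar".matches("""(\w+) \1"""))  // false
  }
}
\end{verbatim}
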
Given the difficulty of the problem,
theoretical work on rewbrs is
fairly recent.
Campeanu et al.\ studied regexes with back-references
in the context of formal language theory in
their 2003 work \cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
proving that the languages denoted by them
are properly contained in the context-sensitive
languages.
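As a concrete illustration of this proper containment (written in an ad-hoc
notation, not the one used in \cite{campeanu2003formal}), the rewbr
$((a|b)^*)\backslash 1$, where $\backslash 1$ refers back to the string captured
by the bracketed group, denotes the copy language
\[
\{\, w w \mid w \in \{a,b\}^* \,\},
\]
a standard example of a language that is not context-free but is context-sensitive.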
Further questions, such as
whether Perl-like regexes can express
the language of palindromes ($\{w \mid w = w^R\}$),
are discussed in \cite{campeanu2009patterns}
and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the
expressive power of back-references. He showed several
undecidability and descriptional complexity results
for back-references, concluding that they add
great power for certain programming tasks but are difficult to harness.
An interesting question is therefore
whether one can restrict back-references
so that they remain expressive enough for practical tasks,
such as matching HTML documents, without becoming too powerful to handle.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.
Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives on back-references.
See \cite{BERGLUND2022} for a survey
of these works and a comparison of different
flavours of back-reference syntax (for example, whether references can be circular
and whether labels can be repeated).
\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another proof in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}, a further one
in Coq is given by Coquand and Siles \parencite{Coquand2012},
and Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
\subsection{Static Analysis of Evil Regex Patterns}
When a regular expression does not behave as intended,
people usually try to rewrite the regex into some equivalent form,
or they try to avoid the possibly problematic patterns completely,
an approach for which many false positives have been reported~\parencite{Davis18}.
Animated tools to ``debug'' regular expressions, such as
\parencite{regexploit2021} and \parencite{regex101}, are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate such
attack strings.
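
To make the kind of pattern these analyses look for concrete, the following
Scala sketch (not taken from any of the cited tools; the object name is ours)
runs a classic `evil' regex on the JVM's backtracking engine: the nested
repetition in \verb|(a+)+b|, combined with a suffix that can never match,
forces the matcher to explore exponentially many ways of splitting the input
before it can report failure.

\begin{verbatim}
// A minimal sketch of catastrophic backtracking (illustrative only).
// The nested repetition (a+)+ followed by the impossible suffix 'b'
// makes the JVM's backtracking matcher try exponentially many ways of
// splitting a string of a's before it can report that no match exists.
object EvilRegexDemo {
  def main(args: Array[String]): Unit = {
    val evil  = "(a+)+b"
    val input = "a" * 25                       // contains no 'b'
    val start = System.nanoTime()
    println(input.matches(evil))               // false, but noticeably slow
    println(s"took ${(System.nanoTime() - start) / 1e9} seconds")
  }
}
\end{verbatim}
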
Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound that is
linear in the size of the input and
cubic in the size of the regular expression, using a technique by
Antimirov~\cite{Antimirov95}.
Once formalised, this would guarantee the absence of all super-linear behaviour.
We are working out the details.
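
To give an idea of why Antimirov's technique helps, recall that a partial
derivative maps a regular expression to a \emph{set} of regular expressions,
and that the set of all iterated partial derivatives of $r$ is finite, with a
size linear in the size of $r$. The following Scala fragment is a rough,
unverified sketch of the standard partial-derivative function; the datatype and
function names are ours and do not refer to the formalisation in this thesis.

\begin{verbatim}
// A rough, unverified sketch of Antimirov's partial derivatives.
// The datatype and names are illustrative only.
sealed trait Rexp
case object ZERO                   extends Rexp
case object ONE                    extends Rexp
case class CHAR(c: Char)           extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp)           extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// pder returns a set of regular expressions; the union of all iterated
// partial derivatives of r stays finite (linear in the size of r),
// which is the key to the hoped-for cubic bound.
def pder(c: Char, r: Rexp): Set[Rexp] = r match {
  case ZERO | ONE  => Set()
  case CHAR(d)     => if (c == d) Set(ONE) else Set()
  case ALT(r1, r2) => pder(c, r1) ++ pder(c, r2)
  case SEQ(r1, r2) =>
    val fromLeft: Set[Rexp] = pder(c, r1).map(SEQ(_, r2))
    if (nullable(r1)) fromLeft ++ pder(c, r2) else fromLeft
  case STAR(r1)    => pder(c, r1).map(SEQ(_, STAR(r1)))
}
\end{verbatim}
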
To the best of our knowledge, no lexing library using Brzozowski derivatives
comes with similar complexity-related bounds;
claims about running time are usually speculative and backed only by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity-related
guarantees (for example, that the internal data structures
do not grow indefinitely),
then it cannot claim with confidence to have solved the problem
of catastrophic backtracking.
\section{Derivatives and Zippers}
A zipper is a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on a large tree often only deal with
a local region at a time.
Zippers were first formally described by Huet \cite{HuetZipper}.
Typical applications of zippers include text editor buffers
and proof system databases.
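
To illustrate (a sketch in Scala rather than Huet's original OCaml; all names
are ours), a binary-tree zipper pairs the focused subtree with a context that
records the path back to the root, so that a local edit only rebuilds the part
of the tree that is actually touched:

\begin{verbatim}
// A minimal sketch of a binary-tree zipper in the style of Huet.
sealed trait Tree[+A]
case object Leaf extends Tree[Nothing]
case class Node[A](left: Tree[A], value: A, right: Tree[A]) extends Tree[A]

// the context records how to rebuild the rest of the tree around the focus
sealed trait Ctx[+A]
case object Top extends Ctx[Nothing]
case class InLeft[A](up: Ctx[A], value: A, right: Tree[A]) extends Ctx[A]
case class InRight[A](left: Tree[A], value: A, up: Ctx[A]) extends Ctx[A]

case class Zipper[A](focus: Tree[A], ctx: Ctx[A]) {
  // descend into the left child, remembering the rest as context
  def downLeft: Option[Zipper[A]] = focus match {
    case Node(l, v, r) => Some(Zipper(l, InLeft(ctx, v, r)))
    case Leaf          => None
  }
  // move back towards the root, plugging the focus into its context
  def up: Option[Zipper[A]] = ctx match {
    case InLeft(c, v, r)  => Some(Zipper(Node(focus, v, r), c))
    case InRight(l, v, c) => Some(Zipper(Node(l, v, focus), c))
    case Top              => None
  }
  // replace just the focused subtree; the context is untouched
  def replace(t: Tree[A]): Zipper[A] = copy(focus = t)
}
\end{verbatim}
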
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
We draw inspiration from several works on parsing, derivatives
and zippers.
Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm is linear in the length of the input string.
The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al.~\cite{Might2011}.
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem when using derivatives on context-free grammars.
The initial version was quite slow, taking time exponential in the size of the input.
Adams et al.~\cite{Adams2016} improved the speed and argued that their version
is cubic in the size of the input.
Darragh and Adams \cite{10.1145/3408990} further improved the performance
by using zippers in an innovative way: to handle ambiguity, their zippers
have multiple focuses instead of the single focus of a traditional zipper.
Their algorithm was, however, not formalised.
We aim to integrate the zipper data structure into our lexer.
The idea is very simple: using a zipper with multiple focuses
just like Darragh did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x+y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when there are many terms of the form $x\cdot r$.
This allows the lexer to avoid traversing the entire
derivative tree generated
at each intermediate step, visiting instead only the small segment
that is currently active.
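
The following Scala fragment is a rough sketch of this sharing idea (all names
are ours and hypothetical, not those of the final lexer): the sequence terms
$x\cdot r$, $y\cdot r$, \ldots{} that share the right-hand side $r$ are stored
as a set of focused prefixes next to a single shared suffix, in the spirit of a
zipper with several focuses.

\begin{verbatim}
// A rough sketch of the sharing idea (hypothetical names, not the
// final lexer).  Instead of the alternatives SEQ(x, r), SEQ(y, r), ...
// we keep the suffix r once, next to a set of focused prefixes.
sealed trait Rexp
case class CHAR(c: Char)           extends Rexp
case class ALTS(rs: List[Rexp])    extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// several focuses x, y, ... sharing the single context  _ . r
case class SharedSeq(focuses: Set[Rexp], suffix: Rexp) {
  // expand back into the ordinary representation x.r + y.r + ...
  def toRexp: Rexp = ALTS(focuses.toList.map(SEQ(_, suffix)))
}
\end{verbatim}
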
Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising correctness of
such a lexer using zippers will probably require an imperative
framework with separation logic, as it involves
mutable data structures; this is also an interesting direction to explore.
To further optimise the algorithm,
we plan to add a ``deduplicate'' function that decides
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
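
As a sketch of what such an equivalence check could look like (a simplified,
unverified version in Scala; it assumes the usual \texttt{nullable} and
\texttt{der} functions and a simplification function \texttt{simp} on the
\texttt{Rexp} datatype, and is not the verified checker itself), two regular
expressions are equivalent precisely when no pair of iterated derivatives
disagrees on nullability:

\begin{verbatim}
// A simplified, unverified sketch of a derivative-based equivalence
// check: r1 and r2 denote the same language iff no reachable pair of
// (simplified) derivatives disagrees on nullability.  It assumes the
// usual functions nullable, der and simp on the Rexp datatype.
def equiv(r1: Rexp, r2: Rexp, alphabet: Set[Char]): Boolean = {
  var seen = Set.empty[(Rexp, Rexp)]
  var todo = List((simp(r1), simp(r2)))
  while (todo.nonEmpty) {
    val (a, b) = todo.head
    todo = todo.tail
    if (!seen((a, b))) {
      if (nullable(a) != nullable(b)) return false   // languages differ
      seen = seen + ((a, b))
      for (c <- alphabet)                            // explore successor pairs
        todo = (simp(der(c, a)), simp(der(c, b))) :: todo
    }
  }
  true                                               // bisimulation established
}
\end{verbatim}
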
When dealing with a large tree, it would be wasteful to
traverse the entire tree if
an operation only
involves a small part of it.
The idea behind zippers is to place the focus on that part, turning the rest
of the tree into a context that can be plugged back in when needed.
One observation about our derivative-based lexing algorithm is that
the derivative operation sometimes traverses the entire regular expression
unnecessarily: