lexing: ChengsongTanPhdThesis/Chapters/RelatedWork.tex@8ffa28fce271


% Chapter Template

\chapter{Related Work} % Main chapter title

\label{RelatedWork} 

In this chapter, we introduce
the work relevant to this thesis.


\section{Work on Back-References}
We introduced back-references
in chapter \ref{Introduction}.
This is a quite deep problem,
with theoretical work on them being
fairly recent.

Campaneu et al. studied regexes with back-references
in the context of formal languages theory in 
their 2003 work \cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes, 
proving that the langugages denoted by them
is  properly  contained in context-sensitive
languages.
More interesting questions such as 
whether the language denoted by Perl-like regexes
can express palindromes ($\{w \mid w = w^R\}$)
are discussed in \cite{campeanu2009patterns} 
and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the 
expressive power of back-references. He showed several 
undecidability and decriptional complexity results 
of back-references, concluding that they add
great power to certain programming tasks but are difficult to harness.
An interesting question would then be
whether we can add restrictions to them, 
so that they become expressive enough, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.


Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.

See \cite{BERGLUND2022} for a survey
of these works and comparison between different
flavours  of back-references syntax (e.g. whether references can be circular, 
can labels be repeated etc.).


\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
 
\subsection{Static Analysis of Evil Regex Patterns} 
 When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form
or they try to avoid the possibly problematic patterns completely,
for which many false positives exist\parencite{Davis18}.
Animated tools to "debug" regular expressions such as
 \parencite{regexploit2021} \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behavious on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with 
non-linear polynomial worst-time estimates
for regexes, attack string that exploit the worst-time 
scenario, and "attack automata" that generates
attack strings.






Thanks to our theorem-prover-friendly approach,
we believe that 
this finiteness bound can be improved to a bound
linear to input and
cubic to the regular expression size using a technique by
Antimirov\cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all super-linear behavious.
We are working out the
details.

 
To our best knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds, 
and claims about running time are usually speculative and backed by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity related 
guarantees (for examaple the internal data structure size
does not grow indefinitely), 
then they cannot claim with confidence having solved the problem
of catastrophic backtracking.
author	Chengsong
	Sat, 12 Nov 2022 21:34:40 +0000
changeset 624	8ffa28fce271
parent 622	4b1149fb5aec
child 625	b797c9a709d9
permissions	-rw-r--r--