ChengsongTanPhdThesis/Chapters/RelatedWork.tex
author Chengsong
Fri, 26 May 2023 08:09:30 +0100
changeset 646 56057198e4f5
parent 640 bd1354127574
child 668 3831621d7b14
permissions -rw-r--r--
intro
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
608
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
     1
% Chapter Template
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
     2
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
     3
\chapter{Related Work} % Main chapter title
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
     4
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
     5
\label{RelatedWork} 
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
     6
630
d50a309a0645 with Christian
Chengsong
parents: 628
diff changeset
     7
%In this chapter, we introduce
d50a309a0645 with Christian
Chengsong
parents: 628
diff changeset
     8
%work relevant to this thesis.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
     9
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    10
\section{Regular Expressions, Derivatives and POSIX Lexing}
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    11
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    12
%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    13
Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    14
Since then they have become a fundamental concept in 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    15
formal languages and automata theory \cite{Sakarovitch2009}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    16
Brzozowski defined derivatives on regular expressions in his PhD thesis
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    17
in 1964 \cite{Brzozowski1964}, in which he proved the finiteness of the numbers
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    18
of regular expression derivatives modulo the ACI-axioms.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    19
It is worth pointing out that this result does not directly
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    20
translate to our
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    21
finiteness proof, 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    22
and our proof does not depend on it.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    23
The key observation is that our version of the Sulzmann and Lu's algorithm
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    24
\cite{Sulzmann2014} represents 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    25
derivative terms in a way that allows efficient de-duplication,
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    26
and we do not make use of an equivalence checker that exploits the ACI-equivalent
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    27
terms.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    28
628
7af4e2420a8c ready to submit~~
Chengsong
parents: 627
diff changeset
    29
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    30
Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    31
They first introduced the elegant and simple idea of injection-based lexing
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    32
and bit-coded lexing.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    33
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    34
incorporated these ideas
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    35
into a tool called \emph{dreml}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    36
The pencil-and-paper proofs in \cite{Sulzmann2014} based on the ideas
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    37
by Frisch and Cardelli \cite{Frisch2004} were later
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    38
found to have unfillable gaps by Ausaf et al. \cite{AusafDyckhoffUrban2016},
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    39
who came up with an alternative proof inspired by 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    40
Vansummeren \cite{Vansummeren2006}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    41
Sulzmann and Thiemann extended the Brzozowski derivatives to 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    42
shuffling regular expressions \cite{Sulzmann2015},
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    43
which are a very succinct way of describing regular expressions.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    44
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    45
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    46
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    47
Regular expressions and lexers have been a popular topic among
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    48
the theorem proving and functional programming community.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    49
In the next section we give a list of lexers
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    50
and matchers that come with a machine-checked correctness proof.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    51
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    52
\subsection{Matchers and Lexers with Mechanised Proofs}
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    53
We are aware
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    54
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    55
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    56
of the work by Krauss and Nipkow \parencite{Krauss2011}.  Another one
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    57
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    58
Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    59
The most recent works on verified lexers to our best knowledge are the Verbatim \cite{Verbatim}
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    60
and Verbatim++ \cite{Verbatimpp} lexers by Egolf et al.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    61
Verbatim is based on derivatives but does not simplify them. 
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    62
Therefore it can be very slow on evil regular expressions.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    63
The Verbatim++ lexer adds many correctness-preserving 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    64
optimisations to the Verbatim lexer,
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    65
and is therefore quite fast on many inputs.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    66
The problem is that Egolf et al. chose to use traditional DFAs to speed up lexing,
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    67
but then dealing with bounded repetitions is a real bottleneck.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    68
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    69
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    70
This thesis builds on the formal specifications of POSIX
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    71
rules and formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    72
The bounded repetitions presented here is a continuation of the work by Ausaf \cite{Ausaf}.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    73
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    74
Automata formalisations are in general harder and more cumbersome to deal
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    75
with for theorem provers \cite{Nipkow1998}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    76
To represent them, one way is to use graphs, but graphs are not inductive datatypes.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    77
Having to set the inductive principle on the number of nodes
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
    78
in a graph makes formal reasoning non-intuitive and convoluted,
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
    79
resulting in large formalisations \cite{Lammich2012}. 
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    80
When combining two graphs, one also needs to make sure that the nodes in
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    81
both graphs are distinct. 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    82
If they are not distinct, then renaming of the nodes is needed.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    83
Using Coq which provides dependent types can potentially make things slightly easier
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    84
\cite{Doczkal2013}.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    85
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    86
Another representation for automata are matrices. 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    87
But the induction for them is not as straightforward either.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    88
Both approaches have been used in the past and resulted in huge formalisations.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    89
There are some more clever representations, for example one by Paulson 
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
    90
using hereditarily finite sets \cite{Paulson2015}. 
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
    91
There the problem with combining graphs can be solved better. 
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
    92
%but we believe that such clever tricks are not very obvious for 
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
    93
%the John-average-Isabelle-user.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    94
Because of these problems with automata, we prefer regular expressions
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    95
and derivatives which can be implemented in theorem provers and functional programming languages
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
    96
with ease.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    97
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    98
\subsection{Different Definitions of POSIX Rules}
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
    99
There are different ways to formalise values and POSIX matching.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   100
Cardelli and Frisch \cite{Frisch2004} have developed a notion of
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   101
\emph{non-problematic values} which is a slight variation
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   102
of the values defined in the inhabitation relation in \ref{fig:inhab}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   103
They then defined an ordering between values, and showed that
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   104
the maximal element of those values correspond to the output
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   105
of their GREEDY lexing algorithm.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   106
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   107
Okui and Suzuki \cite{Okui10} allow iterations of values to 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   108
flatten to the empty
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   109
string for the star inhabitation rule in
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   110
\ref{fig:inhab}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   111
They refer to the more restrictive version as used 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   112
in this thesis (which was defined by Ausaf et al.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   113
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   114
The very interesting link between the work by Ausaf et al.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   115
and Okui and Suzuki is that they have distinct formalisations
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   116
of POSIX values, and yet they define the same notion.  See
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   117
\cite{AusafDyckhoffUrban2016} for details of the
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   118
alternative definitions given by Okui and Suzuki and the formalisation
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   119
described in this thesis.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   120
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   121
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   122
Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   123
In their paper they defined an ordering between values with respect
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   124
to regular expressions, and tried to establish that their
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   125
algorithm outputs the minimal element by a pencil-and-paper proof.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   126
But having the ordering relation taking regular expression as parameters
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   127
causes the transitivity of their ordering to not go through.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   128
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   129
\subsection{Static Analysis of Evil Regex Patterns} 
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   130
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   131
632
Chengsong
parents: 631
diff changeset
   132
When faced with catastrophic backtracking, 
Chengsong
parents: 631
diff changeset
   133
sometimes programmers 
Chengsong
parents: 631
diff changeset
   134
have to rewrite their regexes in an ad hoc fashion.
Chengsong
parents: 631
diff changeset
   135
Knowing which patterns should be avoided
Chengsong
parents: 631
diff changeset
   136
can be helpful.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   137
Tools for this
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   138
often over-approximate evil regular expression patterns,
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   139
and there can be many false positives \cite{Davis18}.
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   140
Animated tools to "debug" regular expressions such as
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   141
 \parencite{regexploit2021} \parencite{regex101} are also popular.
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   142
We are also aware of static analysis work on regular expressions that
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   143
aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   144
\parencite{Rathnayake2014StaticAF} proposed an algorithm
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   145
that detects regular expressions triggering exponential
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   146
behavious with backtracking matchers.
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   147
Weideman \parencite{Weideman2017Static} came up with 
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   148
non-linear polynomial worst-time estimates
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   149
for regexes and ``attack string'' that exploit the worst-time 
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   150
scenario, and introduced "attack automaton" that generates
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   151
attack strings.
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   152
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   153
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   154
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   155
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   156
\section{Optimisations}
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   157
Perhaps the biggest problem that prevents derivative-based lexing 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   158
from being more widely adopted
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   159
is that they tend to be not very fast in practice, unable to 
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   160
reach throughputs like gigabytes per second, which is the
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   161
application we have in mind when we started looking at the topic.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   162
Commercial
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   163
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   164
are capable of inspecting payloads
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   165
at line rates (which can be up to a few gigabits per second) against 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   166
thousands of regex rules \cite{communityRules}.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   167
For our algorithm to be more attractive for practical use, we
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   168
need more correctness-preserving optimisations.
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   169
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   170
FPGA-based implementations such as \cite{Sidhu2001} 
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   171
have the advantages of being
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   172
reconfigurable and parallel, but suffer from lower clock frequency
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   173
and scalability.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   174
Traditional automaton approaches that use DFAs instead of NFAs
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   175
benefit from the fact that only a single transition is needed 
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   176
for each input character \cite{Becchi08}. Lower memory bandwidth leads
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   177
to faster performance.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   178
However, they suffer from exponential memory requirements on bounded repetitions.
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   179
Compression techniques are used, such as those in \cite{Kumar2006} and
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   180
\cite{Becchi2007}.
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   181
Variations of pure NFAs or DFAs like counting-set automata \cite{Turonova2020}
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   182
have been
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   183
proposed to better deal with bounded repetitions.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   184
But they are usually not used in formalised proofs.
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   185
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   186
%So far our focus has been mainly on the bit-coded algorithm,
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   187
%but the injection-based lexing algorithm 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   188
%could also be sped up in various ways:
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   189
%
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   190
Another direction of optimisation for derivative-based approaches
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   191
is defining string derivatives
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   192
directly, without recursively decomposing them into 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   193
character-by-character derivatives. For example, instead of
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   194
replacing $(r_1 + r_2)\backslash (c::cs)$ by
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   195
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   196
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   197
This has the potential to speed up matching because input is 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   198
processed at a larger granularity.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   199
One interesting point is to explore whether this can be done
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   200
to $\inj$ as well, so that we can generate a lexical value 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   201
rather than simply get a matcher.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   202
It is also not yet clear how this idea can be extended to sequences and stars.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   203
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   204
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   205
\subsection{Derivatives and Zippers}
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   206
Zippers are a data structure designed to focus on 
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   207
and navigate between local parts of a tree.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   208
The idea is that often operations on large trees only deal with
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   209
local regions each time.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   210
Therefore it would be a waste to 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   211
traverse the entire tree if
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   212
the operation only
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   213
involves a small fraction of it.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   214
Zippers were first formally described by Huet \cite{Huet1997}.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   215
Typical applications of zippers involve text editor buffers
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   216
and proof system databases.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   217
In our setting, the idea is to compactify the representation
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   218
of derivatives with zippers, thereby making our algorithm faster.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   219
Below we describe several works on parsing, derivatives
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   220
and zippers.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   221
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   222
Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   223
\cite{Zippy2020}.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   224
They adopted zippers to improve the speed, and argued that the runtime
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   225
complexity of their algorithm was linear with respect to the input string.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   226
They did not provide a formal proof for this.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   227
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   228
The idea of using Brzozowski derivatives on general context-free 
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   229
parsing was first implemented
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   230
by Might et al. \cite{Might2011}. 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   231
They used memoization and fixpoint constructions to eliminate infinite recursion,
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   232
which is a major problem for using derivatives with context-free grammars.
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   233
Their initial version was quite slow----exponential on the size of the input.
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   234
Adams et al. \cite{Adams2016} improved the speed and argued that their version
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   235
was cubic with respect to the input.
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   236
Darragh and Adams \cite{Darragh2020} further improved the performance
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   237
by using zippers in an innovative way--their zippers had multiple focuses
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   238
instead of just one in a traditional zipper to handle ambiguity.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   239
Their algorithm was not formalised though.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   240
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   241
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   242
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   243
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   244
\subsection{Back-References}
625
b797c9a709d9 section reorganising, related work
Chengsong
parents: 624
diff changeset
   245
We introduced regular expressions with back-references
608
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
   246
in chapter \ref{Introduction}.
625
b797c9a709d9 section reorganising, related work
Chengsong
parents: 624
diff changeset
   247
We adopt the common practice of calling them rewbrs (Regular Expressions
627
94db2636a296 finished!
Chengsong
parents: 626
diff changeset
   248
With Back References) for brevity.
625
b797c9a709d9 section reorganising, related work
Chengsong
parents: 624
diff changeset
   249
It has been shown by Aho \cite{Aho1990}
b797c9a709d9 section reorganising, related work
Chengsong
parents: 624
diff changeset
   250
that the k-vertex cover problem can be reduced
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   251
to the problem of rewbrs matching, and therefore matching with rewbrs is NP-complete.
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   252
Given the depth of the problem, the progress made at the full generality
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   253
of arbitrary rewbrs matching has been slow with
626
1c8525061545 finished!
Chengsong
parents: 625
diff changeset
   254
theoretical work on them being 
625
b797c9a709d9 section reorganising, related work
Chengsong
parents: 624
diff changeset
   255
fairly recent. 
608
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
   256
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   257
Campaneu et al. studied rewbrs 
624
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   258
in the context of formal languages theory in 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   259
\cite{campeanu2003formal}.
624
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   260
They devised a pumping lemma for Perl-like regexes, 
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   261
proving that the langugages denoted by them
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   262
is  properly  contained in context-sensitive
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   263
languages.
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   264
More interesting questions such as 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   265
whether the languages denoted by Perl-like regexes
624
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   266
can express palindromes ($\{w \mid w = w^R\}$)
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   267
are discussed in \cite{campeanu2009patterns} 
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   268
and \cite{CAMPEANU2009Intersect}.
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   269
Freydenberger \cite{Frey2013} investigated the 
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   270
expressive power of back-references. He showed several 
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   271
undecidability and decriptional complexity results 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   272
for back-references, concluding that they add
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   273
great power to certain programming tasks but are difficult to comprehend.
624
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   274
An interesting question would then be
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   275
whether we can add restrictions to them, 
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   276
so that they become expressive enough for practical use, such
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   277
as matching html files, but not too powerful.
624
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   278
Freydenberger and Schmid \cite{FREYDENBERGER20191}
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   279
introduced the notion of deterministic
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   280
regular expressions with back-references to achieve
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   281
a better balance between expressiveness and tractability.
608
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
   282
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
   283
624
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   284
Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   285
investigated the time complexity of different variants
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   286
of back-references.
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   287
We are not aware of any work that uses derivatives with back-references.
624
8ffa28fce271 all comments incorporated!!+related work
Chengsong
parents: 622
diff changeset
   288
640
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   289
%See \cite{BERGLUND2022} for a survey
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   290
%of these works and comparison between different
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   291
%flavours  of back-references syntax (e.g. whether references can be circular, 
bd1354127574 more proofreading done, last version before submission
Chengsong
parents: 632
diff changeset
   292
%can labels be repeated etc.).
608
37b6fd310a16 added related work chap
Chengsong
parents:
diff changeset
   293
609
61139fdddae0 chap1 totally done
Chengsong
parents: 608
diff changeset
   294
61139fdddae0 chap1 totally done
Chengsong
parents: 608
diff changeset
   295
 
622
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 609
diff changeset
   296
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 609
diff changeset
   297
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 609
diff changeset
   298
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 609
diff changeset
   299
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 609
diff changeset
   300
4b1149fb5aec incorporated more comments, bib
Chengsong
parents: 609
diff changeset
   301
 
625
b797c9a709d9 section reorganising, related work
Chengsong
parents: 624
diff changeset
   302
b797c9a709d9 section reorganising, related work
Chengsong
parents: 624
diff changeset
   303