% Chapter Template

\chapter{Related Work} % Main chapter title

\label{RelatedWork}

%In this chapter, we introduce
%work relevant to this thesis.

\section{Regular Expressions, Derivatives and POSIX Lexing}

%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
Since then they have become a fundamental concept in
formal languages and automata theory \cite{Sakarovitch2009}.
Brzozowski defined derivatives on regular expressions in his PhD thesis
in 1964 \cite{Brzozowski1964}, in which he proved that the number of
derivatives of a regular expression is finite modulo the ACI-axioms.
It is worth pointing out that this result does not directly
translate to our finiteness proof,
and our proof does not depend on it.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents
derivative terms in a way that allows efficient de-duplication,
and we do not make use of an equivalence checker that exploits
ACI-equivalences between terms.
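
To make the discussion of derivatives concrete, the following sketch (our own
Scala rendering for illustration, not the verified Isabelle/HOL development)
shows Brzozowski's nullability test and derivative operation, and hints at
where ACI-equivalent duplicates arise when derivatives are iterated without
simplification.
\begin{verbatim}
// Minimal sketch of Brzozowski derivatives (illustration only).
sealed trait Rexp
case object ZERO extends Rexp                   // matches no string
case object ONE extends Rexp                    // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// nullable(r) holds iff r can match the empty string
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// der(c, r): the derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// Iterating der over a string gives a matcher.
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
\end{verbatim}
Without simplification, the nested ALT-terms produced by the alternative and
sequence cases are precisely the terms that are equal only modulo
associativity, commutativity and idempotence; representing them so that they
can be de-duplicated efficiently is what our version of the algorithm aims at.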

Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
They first introduced the elegant and simple idea of injection-based lexing
and bit-coded lexing.
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas
into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, based on the ideas
by Frisch and Cardelli \cite{Frisch2004}, were later
found to have unfillable gaps by Ausaf et al.~\cite{AusafDyckhoffUrban2016},
who came up with an alternative proof inspired by
Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to
shuffling regular expressions \cite{Sulzmann2015},
which are a very succinct way of describing certain regular languages.

Regular expressions and lexers have been a popular topic in
the theorem proving and functional programming communities.
In the next section we give a list of lexers
and matchers that come with a machine-checked correctness proof.

\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. Another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent work on verified lexers, to the best of our knowledge, consists
of the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatimpp} lexers by Egolf et al.
Verbatim is based on derivatives but does not simplify them.
Therefore it can be very slow on evil regular expressions.
The Verbatim++ lexer adds many correctness-preserving
optimisations to the Verbatim lexer,
and is therefore quite fast on many inputs.
The problem is that Egolf et al.\ chose to use traditional DFAs to speed up lexing,
but with DFAs, dealing with bounded repetitions becomes a real bottleneck.

This thesis builds on the formal specifications of POSIX
rules and formal proofs by Ausaf et al.~\cite{AusafDyckhoffUrban2016}.
The treatment of bounded repetitions presented here is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with for theorem provers \cite{Nipkow1998}.
One way to represent them is as graphs, but graphs are not inductive datatypes.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted,
resulting in large formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct.
If they are not distinct, then renaming of the nodes is needed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.

Another representation of automata uses matrices.
But induction over them is not straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets \cite{Paulson2015}.
There the problem of combining graphs can be handled better.
%but we believe that such clever tricks are not very obvious for
%the John-average-Isabelle-user.
Because of these problems with automata, we prefer regular expressions
and derivatives, which can be implemented in theorem provers and functional
programming languages with ease.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} developed a notion of
\emph{non-problematic values} which is a slight variation
of the values defined by the inhabitation relation in Figure~\ref{fig:inhab}.
They then defined an ordering between values, and showed that
the maximal element of those values corresponds to the output
of their GREEDY lexing algorithm.

Okui and Suzuki \cite{Okui10} allow iterations of values to
flatten to the empty
string in the star inhabitation rule in
Figure~\ref{fig:inhab}.
They refer to the more restrictive version used
in this thesis (defined by Ausaf et al.~\cite{AusafDyckhoffUrban2016})
as \emph{canonical values}.
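
For comparison, the following Scala sketch (our own notation, only loosely
following Ausaf et al.; it is not a verbatim copy of any of the cited
formalisations) shows a value datatype together with a predicate singling out
canonical values, i.e.\ values in which no star iteration flattens to the
empty string.
\begin{verbatim}
// Sketch of lexical values (names are ours; LeftV/RightV play the
// role of the Left/Right constructors in the text).
sealed trait Val
case object Empty extends Val                   // witnesses ONE
case class Chr(c: Char) extends Val             // witnesses a character
case class Sequ(v1: Val, v2: Val) extends Val   // witnesses a sequence
case class LeftV(v: Val) extends Val            // alternative, left branch
case class RightV(v: Val) extends Val           // alternative, right branch
case class Stars(vs: List[Val]) extends Val     // one value per star iteration

// The string a value stands for ("flattening").
def flatten(v: Val): String = v match {
  case Empty        => ""
  case Chr(c)       => c.toString
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case LeftV(v1)    => flatten(v1)
  case RightV(v1)   => flatten(v1)
  case Stars(vs)    => vs.map(flatten).mkString
}

// Canonical values: no star iteration contributes the empty string.
def canonical(v: Val): Boolean = v match {
  case Stars(vs)    => vs.forall(u => canonical(u) && flatten(u).nonEmpty)
  case Sequ(v1, v2) => canonical(v1) && canonical(v2)
  case LeftV(v1)    => canonical(v1)
  case RightV(v1)   => canonical(v1)
  case _            => true
}
\end{verbatim}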
A very interesting link between the work by Ausaf et al.\ and that of
Okui and Suzuki is that they have distinct formalisations
of POSIX values, and yet they define the same notion. See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the formalisation
described in this thesis.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish that their
algorithm outputs the minimal element by a pencil-and-paper proof.
But because the ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

\subsection{Static Analysis of Evil Regex Patterns}

When faced with catastrophic backtracking,
programmers sometimes
have to rewrite their regexes in an ad hoc fashion.
Knowing which patterns should be avoided
can be helpful.
Tools for this purpose
often over-approximate evil regular expression patterns,
and there can be many false positives \cite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} and \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour with backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-case time estimates
for regexes and ``attack strings'' that exploit the worst-case
scenario, and introduced an ``attack automaton'' that generates
such attack strings.
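
To illustrate the kind of pattern these analyses try to flag, the following
sketch (our own illustration, relying only on the standard
\texttt{java.util.regex} backtracking engine available from Scala) exhibits
catastrophic backtracking on the classic evil pattern \texttt{(a+)+b}.
\begin{verbatim}
// Illustration only: a classic "evil" pattern for backtracking engines.
// (a+)+b forces the matcher to try exponentially many ways of splitting
// a run of a's between the inner and outer plus before failing.
import java.util.regex.Pattern

val evil = Pattern.compile("(a+)+b")

def timeMs(n: Int): Long = {
  val input = "a" * n              // n a's and no final b: matching must fail
  val start = System.nanoTime()
  evil.matcher(input).matches()    // exponential in n on a backtracking engine
  (System.nanoTime() - start) / 1000000
}

@main def redosDemo(): Unit =
  // running times grow exponentially; n around 30 already takes seconds,
  // whereas a derivative- or automaton-based matcher stays linear here
  for (n <- 16 to 28 by 4) println(s"n = $n: ${timeMs(n)} ms")
\end{verbatim}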

\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing
from being more widely adopted
is that such lexers tend to be not very fast in practice, unable to
reach throughputs of gigabytes per second, which is the kind of
application we had in mind when we started looking at this topic.
Commercial
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of regex rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.

FPGA-based implementations such as \cite{Sidhu2001}
have the advantages of being
reconfigurable and parallel, but suffer from lower clock frequencies
and limited scalability.
Traditional automaton approaches that use DFAs instead of NFAs
benefit from the fact that only a single transition is needed
for each input character \cite{Becchi08}. The lower memory bandwidth
requirement leads to faster performance.
However, they suffer from exponential memory requirements on bounded repetitions.
Compression techniques are used, such as those in \cite{Kumar2006} and
\cite{Becchi2007}.
Variations of pure NFAs or DFAs, such as counting-set automata \cite{Turonova2020},
have been
proposed to better deal with bounded repetitions.
But they are usually not used in formalised proofs.
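
To see why bounded repetitions are problematic for approaches that expand them
away, the following sketch (our own illustration, reusing the \texttt{Rexp}
constructors from the derivative sketch earlier in this chapter) unfolds
$r\{n\}$ into $n$ sequential copies; for the classic pattern
$(a|b)^*\,a\,(a|b)\{n\}$ the expanded expression grows with $n$, while a
minimal DFA for it needs on the order of $2^n$ states.
\begin{verbatim}
// Sketch: naive expansion of a bounded repetition r{n} into n copies
// (reuses the Rexp datatype from the derivative sketch above).
def ntimes(r: Rexp, n: Int): Rexp =
  if (n <= 0) ONE
  else (1 until n).foldLeft(r)((acc, _) => SEQ(acc, r))

// Size of a regular expression, counting constructors.
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2)          => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
  case STAR(r1)             => 1 + size(r1)
}

// (a|b)*a(a|b){n}: the counter n is written with O(log n) digits in the
// pattern, but the expanded regex grows linearly in n and its minimal
// DFA needs about 2^n states.
def evilCounting(n: Int): Rexp = {
  val ab = ALT(CHAR('a'), CHAR('b'))
  SEQ(STAR(ab), SEQ(CHAR('a'), ntimes(ab, n)))
}
\end{verbatim}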

%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
Another direction of optimisation for derivative-based approaches
is defining string derivatives
directly, without recursively decomposing them into
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in the definition in Table~\ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because input is
processed at a larger granularity.
One interesting direction to explore is whether this can be done
for $\inj$ as well, so that we can generate a lexical value
rather than just a matcher.
It is also not yet clear how this idea can be extended to sequences and stars.
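
A possible shape of such a string derivative is sketched below (our own
sketch, reusing \texttt{Rexp} and \texttt{der} from the derivative sketch
earlier in this chapter); only the alternative case pushes the whole string
inside, while sequences and stars still fall back to character-by-character
derivatives, mirroring the open question just mentioned.
\begin{verbatim}
// Sketch: a string derivative that distributes whole strings over
// alternatives instead of peeling off one character at a time.
def sders(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, _)         => r
  case (_, ALT(r1, r2)) => ALT(sders(s, r1), sders(s, r2))
  case (c :: cs, _)     => sders(cs, der(c, r))   // fall back to der
}
\end{verbatim}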

\subsection{Derivatives and Zippers}
Zippers are a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on large trees often only deal with
a local region at a time.
Therefore it would be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
Zippers were first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text editor buffers
and proof system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
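
For readers unfamiliar with zippers, the following is a minimal Huet-style
zipper over binary trees in Scala (our own illustration; the zippers used by
Edelmann et al.\ and by Darragh and Adams over grammars and derivatives are
considerably more involved).
\begin{verbatim}
// A zipper is a focused subtree plus the path of contexts back to the
// root.  Moving the focus and editing locally are O(1); only the spine
// is rebuilt when we move back up.
sealed trait Tree
case class Leaf(label: String) extends Tree
case class Node(left: Tree, right: Tree) extends Tree

sealed trait Ctx
case object Top extends Ctx
case class InLeft(parent: Ctx, right: Tree) extends Ctx  // went left, keep right sibling
case class InRight(left: Tree, parent: Ctx) extends Ctx  // went right, keep left sibling

case class Zipper(focus: Tree, ctx: Ctx) {
  def goLeft: Zipper = focus match {
    case Node(l, r) => Zipper(l, InLeft(ctx, r))
    case _          => this
  }
  def goRight: Zipper = focus match {
    case Node(l, r) => Zipper(r, InRight(l, ctx))
    case _          => this
  }
  def goUp: Zipper = ctx match {
    case InLeft(p, r)  => Zipper(Node(focus, r), p)
    case InRight(l, p) => Zipper(Node(l, focus), p)
    case Top           => this
  }
  def replace(t: Tree): Zipper = copy(focus = t)  // local edit, rest untouched
}
\end{verbatim}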
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
632
diff
changeset
|
219 |
Below we describe several works on parsing, derivatives |
626 | 220 |
and zippers. |
221 |
||
222 |
Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives |
|
223 |
\cite{Zippy2020}. |
|
224 |
They adopted zippers to improve the speed, and argued that the runtime |
|
225 |
complexity of their algorithm was linear with respect to the input string. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
632
diff
changeset
|
226 |
They did not provide a formal proof for this. |

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al.~\cite{Might2011}.
They used memoization and fixpoint constructions to eliminate infinite recursion,
which is a major problem for using derivatives with context-free grammars.
Their initial version was quite slow, with a runtime exponential in the size of the input.
Adams et al.~\cite{Adams2016} improved the speed and argued that their version
was cubic in the size of the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers have multiple focuses,
instead of the single focus of a traditional zipper, in order to handle ambiguity.
Their algorithm was, however, not formalised.

\subsection{Back-References}
We introduced regular expressions with back-references
in Chapter~\ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) for brevity.
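
As a concrete example (ours, using the standard \texttt{java.util.regex}
back-reference syntax available from Scala), the rewbr
\texttt{((a|b)+)\textbackslash 1} matches exactly the ``copy'' language
$\{ww \mid w \in \{a,b\}^+\}$, which no back-reference-free regular expression
can describe.
\begin{verbatim}
// Illustration: a rewbr for the copy language { ww | w in {a,b}+ }.
import java.util.regex.Pattern

val copy = Pattern.compile("((a|b)+)\\1")  // group 1, then the same string again

@main def rewbrDemo(): Unit = {
  println(copy.matcher("abab").matches())   // true:  w = "ab"
  println(copy.matcher("abaab").matches())  // false: odd length, no such w
}
\end{verbatim}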
It has been shown by Aho \cite{Aho1990}
that the $k$-vertex cover problem can be reduced
to the problem of rewbr matching, and therefore matching with rewbrs is NP-complete.
Given the hardness of the problem, progress on matching arbitrary rewbrs
in full generality has been slow, with
theoretical work on them being
fairly recent.

Campeanu et al.\ studied rewbrs
in the context of formal language theory in
\cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
proving that the languages denoted by them
are properly contained in the context-sensitive
languages.
More interesting questions, such as
whether the languages denoted by Perl-like regexes
can express the language of palindromes ($\{w \mid w = w^R\}$),
are discussed in \cite{campeanu2009patterns}
and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the
expressive power of back-references. He showed several
undecidability and descriptional complexity results
for back-references, concluding that they add
great power to certain programming tasks but are difficult to comprehend.
An interesting question would then be
whether we can add restrictions to them,
so that they become expressive enough for practical use, such
as matching HTML files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.

Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives with back-references.

%See \cite{BERGLUND2022} for a survey
%of these works and comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%can labels be repeated etc.).