608
|
1 |
% Chapter Template
|
|
2 |
|
|
3 |
\chapter{Related Work} % Main chapter title
|
|
4 |
|
|
5 |
\label{RelatedWork}
|
|
6 |
|
|
7 |
In this chapter, we introduce
|
626
|
8 |
work relevant to this thesis.
|
|
9 |
|
|
10 |
\section{Regular Expressions, Derivatives and POSIX Lexing}
|
|
11 |
|
|
12 |
%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
|
|
13 |
Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
|
|
14 |
Since then they have become a fundamental concept in
|
|
15 |
formal languages and automata theory \cite{Sakarovitch2009}.
|
|
16 |
Brzozowski defined derivatives on regular expressions in his PhD thesis
|
|
17 |
in 1964 \cite{Brzozowski1964}, in which he proved the finiteness of the numbers
|
|
18 |
of regular expression derivatives modulo the ACI-axioms.
|
|
19 |
It is worth pointing out that this result does not directly
|
|
20 |
translate to our
|
|
21 |
finiteness proof,
|
|
22 |
and our proof does not depend on it to be valid.
|
|
23 |
The key observation is that our version of the Sulzmann and Lu's algorithm
|
|
24 |
\cite{Sulzmann2014} represents
|
|
25 |
derivative terms in a way that allows efficient de-duplication,
|
|
26 |
and we do not make use of an equivalence checker that exploits the ACI-equivalent
|
|
27 |
terms.
|
|
28 |
|
|
29 |
Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
|
|
30 |
They first introduced the elegant and simple idea of injection-based lexing
|
|
31 |
and bit-coded lexing.
|
|
32 |
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
|
|
33 |
incorporated these ideas
|
|
34 |
into a tool called \emph{dreml}.
|
|
35 |
The pencil-and-paper proofs in \cite{Sulzmann2014} based on the ideas
|
|
36 |
by Frisch and Cardelli \cite{Frisch2004} were later
|
|
37 |
found to have unfillable gaps by Ausaf \cite{Ausaf},
|
|
38 |
who came up with an alternative proof inspired by
|
|
39 |
Vansummeren \cite{Vansummeren2006}.
|
|
40 |
Sulzmann and Thiemann extended the Brzozowski derivatives to
|
|
41 |
shuffling regular expressions \cite{Sulzmann2015},
|
|
42 |
which are a very succinct way of describing basic regular expressions.
|
|
43 |
|
|
44 |
|
|
45 |
|
|
46 |
Regular expressions and lexers have been a popular topic among
|
|
47 |
the theorem proving and functional programming community.
|
|
48 |
In the next section we give a list of lexers
|
|
49 |
and matchers that come with a machine-checked correctness proof.
|
|
50 |
|
|
51 |
\subsection{Matchers and Lexers with Mechanised Proofs}
|
|
52 |
We are aware
|
|
53 |
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
|
|
54 |
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
|
|
55 |
of the work by Krauss and Nipkow \parencite{Krauss2011}. Another one
|
|
56 |
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
|
|
57 |
Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
|
|
58 |
The most recent works to our best knowledge are the Verbatim \cite{Verbatim}
|
|
59 |
and Verbatim++ \cite{Verbatim} lexers.
|
|
60 |
The Verbatim++ lexer adds many correctness-preserving
|
|
61 |
optimisations to the Verbatim lexer,
|
|
62 |
and is therefore quite fast on many inputs.
|
|
63 |
The problem is that they chose to use DFA to speed up things,
|
|
64 |
for which dealing with bounded repetitions is a bottleneck.
|
|
65 |
|
|
66 |
|
|
67 |
This thesis builds on the formal specifications of POSIX
|
|
68 |
rules and formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
|
|
69 |
The bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.
|
|
70 |
|
|
71 |
Automata formalisations are in general harder and more cumbersome to deal
|
|
72 |
with for theorem provers \cite{Nipkow1998}.
|
|
73 |
To represent them, one way is to use graphs, but graphs are not inductive datatypes.
|
|
74 |
Having to set the inductive principle on the number of nodes
|
|
75 |
in a graph makes formal reasoning non-intuitive and convoluted.
|
|
76 |
When combining two graphs, one also needs to make sure that the nodes in
|
|
77 |
both graphs are distinct.
|
|
78 |
If they are not distinct, then renaming of the nodes is needed.
|
|
79 |
Using Coq which provides dependent types can potentially make things slightly easier
|
|
80 |
\cite{Doczkal2013}
|
|
81 |
Another representation for automata are matrices.
|
|
82 |
But the induction for them is not as straightforward either.
|
|
83 |
Both approaches have been used in the past and resulted in huge formalisations.
|
|
84 |
There are some more clever representations, for example one by Paulson
|
|
85 |
using hereditarily finite sets.
|
|
86 |
There the problem with combining graphs can be solved better,
|
|
87 |
but we believe that such clever tricks are not very obvious for
|
|
88 |
the John-average-Isabelle-user.
|
|
89 |
|
|
90 |
\subsection{Different Definitions of POSIX Rules}
|
|
91 |
There are different ways to formalise values and POSIX matching.
|
|
92 |
Cardelli and Frisch \cite{Frisch2004} have developed a notion of
|
|
93 |
\emph{non-problematic values} which is a slight variation
|
|
94 |
of the values defined in the inhabitation relation in \ref{fig:inhab}.
|
|
95 |
They then defined an ordering between values, and showed that
|
|
96 |
the maximal element of those values correspond to the output
|
|
97 |
of their GREEDY lexing algorithm.
|
|
98 |
|
|
99 |
Okui and Suzuki \cite{Okui10} allow iterations of values to
|
|
100 |
flatten to the empty
|
|
101 |
string for the star inhabitation rule in
|
|
102 |
\ref{fig:inhab}.
|
|
103 |
They refer to the more restrictive version as used
|
|
104 |
in this thesis (which was defined by Ausaf et al.
|
|
105 |
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
|
|
106 |
The very interesting link between the work by Ausaf et al.
|
|
107 |
and Okui and Suzuki is that they have distinct formalisations
|
|
108 |
of POSIX values, and yet they are provably equivalent! See
|
|
109 |
\cite{AusafDyckhoffUrban2016} for details of the
|
|
110 |
alternative definitions given by Okui and Suzuki and the formalisation.
|
|
111 |
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.
|
|
112 |
|
|
113 |
Sulzmann and Lu themselves have come up with POSIX definitions \cite{Sulzmann2014}.
|
|
114 |
In their paper they defined an ordering between values with respect
|
|
115 |
to regular expressions, and tried to establish that their
|
|
116 |
algorithm outputs the minimal element by a pencil-and-paper proof.
|
|
117 |
But having the ordering relation taking regular expression as parameters
|
|
118 |
causes the transitivity of their ordering to not go through.
|
|
119 |
|
|
120 |
|
|
121 |
\section{Optimisations}
|
|
122 |
Perhaps the biggest problem that prevents derivative-based lexing
|
|
123 |
from being more widely adopted
|
|
124 |
is that they tend to be not blindingly fast in practice, unable to
|
|
125 |
reach throughputs like gigabytes per second, which is the
|
|
126 |
application we had in mind when we initially started looking at the topic.
|
|
127 |
Commercial
|
|
128 |
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
|
|
129 |
are capable of inspecting payloads
|
|
130 |
at line rates (which can be up to a few gigabits per second) against
|
|
131 |
thousands of rules \cite{communityRules}.
|
|
132 |
For our algorithm to be more attractive for practical use, we
|
|
133 |
need more correctness-preserving optimisations.
|
|
134 |
%So far our focus has been mainly on the bit-coded algorithm,
|
|
135 |
%but the injection-based lexing algorithm
|
|
136 |
%could also be sped up in various ways:
|
|
137 |
%
|
|
138 |
One such optimisation is defining string derivatives
|
|
139 |
directly, without recursively decomposing them into
|
|
140 |
character-by-character derivatives. For example, instead of
|
|
141 |
replacing $(r_1 + r_2)\backslash (c::cs)$ by
|
|
142 |
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
|
|
143 |
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
|
|
144 |
This has the potential to speed up matching because input is
|
|
145 |
processed in larger granularity.
|
|
146 |
One interesting thing is to explore whether this can be done
|
|
147 |
to $\inj$ as well, so that we can generate a lexical value
|
|
148 |
rather than simply get a matcher.
|
|
149 |
|
|
150 |
|
|
151 |
\subsection{Derivatives and Zippers}
|
|
152 |
Zipper is a data structure designed to focus on
|
|
153 |
and navigate between local parts of a tree.
|
|
154 |
The idea is that often operations on a large tree only deal with
|
|
155 |
local regions each time.
|
|
156 |
Therefore it would be a waste to
|
|
157 |
traverse the entire tree if
|
|
158 |
the operation only
|
|
159 |
involves a small fraction of it.
|
|
160 |
It was first formally described by Huet \cite{HuetZipper}.
|
|
161 |
Typical applications of zippers involve text editor buffers
|
|
162 |
and proof system databases.
|
|
163 |
In our setting, the idea is to compactify the representation
|
|
164 |
of derivatives with zippers, thereby making our algorithm faster.
|
|
165 |
We introduce several works on parsing, derivatives
|
|
166 |
and zippers.
|
|
167 |
|
|
168 |
Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
|
|
169 |
\cite{Zippy2020}.
|
|
170 |
They adopted zippers to improve the speed, and argued that the runtime
|
|
171 |
complexity of their algorithm was linear with respect to the input string.
|
|
172 |
|
|
173 |
The idea of using Brzozowski derivatives on general context-free
|
|
174 |
parsing was first implemented
|
|
175 |
by Might et al. \ref{Might2011}.
|
|
176 |
They used memoization and fixpoint construction to eliminate infinite recursion,
|
|
177 |
which is a major problem for using derivatives on context-free grammars.
|
|
178 |
The initial version was quite slow----exponential on the size of the input.
|
|
179 |
Adams et al. \ref{Adams2016} improved the speed and argued that their version
|
|
180 |
was cubic with respect to the input.
|
|
181 |
Darragh and Adams \cite{Darragh2020} further improved the performance
|
|
182 |
by using zippers in an innovative way--their zippers had multiple focuses
|
|
183 |
instead of just one in a traditional zipper to handle ambiguity.
|
|
184 |
Their algorithm was not formalised, though.
|
|
185 |
|
|
186 |
|
|
187 |
Some initial results have been obtained, but more care needs to be taken to make sure
|
|
188 |
that the lexer output conforms to the POSIX standard. Formalising correctness of
|
|
189 |
such a lexer using zippers will probably require using an imperative
|
|
190 |
framework with separation logic as it involves
|
|
191 |
mutable data structures, which is also interesting to look into.
|
|
192 |
|
|
193 |
To further optimise the algorithm,
|
|
194 |
I plan to add a ``deduplicate'' function that tells
|
|
195 |
whether two regular expressions denote the same
|
|
196 |
language using
|
|
197 |
an efficient and verified equivalence checker.
|
|
198 |
In this way, the internal data structures can be pruned more aggressively,
|
|
199 |
leading to better simplifications and ultimately faster algorithms.
|
|
200 |
|
|
201 |
|
|
202 |
|
|
203 |
The idea is to put the focus on that subtree, turning other parts
|
|
204 |
of the tree into a context
|
|
205 |
|
|
206 |
|
|
207 |
One observation about our derivative-based lexing algorithm is that
|
|
208 |
the derivative operation sometimes traverses the entire regular expression
|
|
209 |
unnecessarily:
|
|
210 |
|
|
211 |
|
|
212 |
|
608
|
213 |
|
|
214 |
|
|
215 |
\section{Work on Back-References}
|
625
|
216 |
We introduced regular expressions with back-references
|
608
|
217 |
in chapter \ref{Introduction}.
|
625
|
218 |
We adopt the common practice of calling them rewbrs (Regular Expressions
|
|
219 |
With Back References) in this section for brevity.
|
|
220 |
It has been shown by Aho \cite{Aho1990}
|
|
221 |
that the k-vertex cover problem can be reduced
|
626
|
222 |
to the problem of rewbrs matching, and is therefore NP-complete.
|
|
223 |
Given the depth of the problem, the progress made at the full generality
|
|
224 |
of arbitrary rewbrs matching has been slow, with
|
|
225 |
theoretical work on them being
|
625
|
226 |
fairly recent.
|
608
|
227 |
|
624
|
228 |
Campaneu et al. studied regexes with back-references
|
|
229 |
in the context of formal languages theory in
|
|
230 |
their 2003 work \cite{campeanu2003formal}.
|
|
231 |
They devised a pumping lemma for Perl-like regexes,
|
|
232 |
proving that the langugages denoted by them
|
|
233 |
is properly contained in context-sensitive
|
|
234 |
languages.
|
|
235 |
More interesting questions such as
|
|
236 |
whether the language denoted by Perl-like regexes
|
|
237 |
can express palindromes ($\{w \mid w = w^R\}$)
|
|
238 |
are discussed in \cite{campeanu2009patterns}
|
|
239 |
and \cite{CAMPEANU2009Intersect}.
|
|
240 |
Freydenberger \cite{Frey2013} investigated the
|
|
241 |
expressive power of back-references. He showed several
|
|
242 |
undecidability and decriptional complexity results
|
|
243 |
of back-references, concluding that they add
|
|
244 |
great power to certain programming tasks but are difficult to harness.
|
|
245 |
An interesting question would then be
|
|
246 |
whether we can add restrictions to them,
|
625
|
247 |
so that they become expressive enough for practical use such
|
|
248 |
as html files, but not too powerful.
|
624
|
249 |
Freydenberger and Schmid \cite{FREYDENBERGER20191}
|
|
250 |
introduced the notion of deterministic
|
|
251 |
regular expressions with back-references to achieve
|
|
252 |
a better balance between expressiveness and tractability.
|
608
|
253 |
|
|
254 |
|
624
|
255 |
Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
|
|
256 |
investigated the time complexity of different variants
|
|
257 |
of back-references.
|
625
|
258 |
We are not aware of any work that uses derivatives on back-references.
|
624
|
259 |
|
|
260 |
See \cite{BERGLUND2022} for a survey
|
608
|
261 |
of these works and comparison between different
|
624
|
262 |
flavours of back-references syntax (e.g. whether references can be circular,
|
|
263 |
can labels be repeated etc.).
|
608
|
264 |
|
609
|
265 |
|
|
266 |
|
|
267 |
\subsection{Static Analysis of Evil Regex Patterns}
|
625
|
268 |
When a regular expression does not behave as intended,
|
609
|
269 |
people usually try to rewrite the regex to some equivalent form
|
|
270 |
or they try to avoid the possibly problematic patterns completely,
|
|
271 |
for which many false positives exist\parencite{Davis18}.
|
|
272 |
Animated tools to "debug" regular expressions such as
|
|
273 |
\parencite{regexploit2021} \parencite{regex101} are also popular.
|
|
274 |
We are also aware of static analysis work on regular expressions that
|
|
275 |
aims to detect potentially expoential regex patterns. Rathnayake and Thielecke
|
|
276 |
\parencite{Rathnayake2014StaticAF} proposed an algorithm
|
|
277 |
that detects regular expressions triggering exponential
|
|
278 |
behavious on backtracking matchers.
|
|
279 |
Weideman \parencite{Weideman2017Static} came up with
|
|
280 |
non-linear polynomial worst-time estimates
|
|
281 |
for regexes, attack string that exploit the worst-time
|
|
282 |
scenario, and "attack automata" that generates
|
|
283 |
attack strings.
|
622
|
284 |
|
|
285 |
|
|
286 |
|
|
287 |
|
|
288 |
|
|
289 |
|
|
290 |
Thanks to our theorem-prover-friendly approach,
|
|
291 |
we believe that
|
|
292 |
this finiteness bound can be improved to a bound
|
|
293 |
linear to input and
|
|
294 |
cubic to the regular expression size using a technique by
|
|
295 |
Antimirov\cite{Antimirov95}.
|
|
296 |
Once formalised, this would be a guarantee for the absence of all super-linear behavious.
|
|
297 |
We are working out the
|
|
298 |
details.
|
|
299 |
|
|
300 |
|
|
301 |
To our best knowledge, no lexing libraries using Brzozowski derivatives
|
|
302 |
have similar complexity-related bounds,
|
|
303 |
and claims about running time are usually speculative and backed by empirical
|
|
304 |
evidence on a few test cases.
|
|
305 |
If a matching or lexing algorithm
|
|
306 |
does not come with certain basic complexity related
|
|
307 |
guarantees (for examaple the internal data structure size
|
|
308 |
does not grow indefinitely),
|
|
309 |
then they cannot claim with confidence having solved the problem
|
|
310 |
of catastrophic backtracking.
|
625
|
311 |
|
|
312 |
|