\chapter{Related Work} % Main chapter title

\label{RelatedWork}

In this chapter, we introduce
the work relevant to this thesis.

\section{Regular Expressions, Derivatives and POSIX Lexing}

%\subsection{Formalisations of Automata, Regular Expressions, and Matching Algorithms}
Regular expressions were introduced by Kleene in the 1950s \cite{Kleene1956}.
Since then they have become a fundamental concept in
formal languages and automata theory \cite{Sakarovitch2009}.
Brzozowski defined derivatives of regular expressions in his PhD thesis
in 1964 \cite{Brzozowski1964}, in which he proved that the number
of derivatives of a regular expression is finite modulo the ACI-axioms
(associativity, commutativity and idempotence of alternatives).
It is worth pointing out that this result does not directly
translate to our
finiteness proof,
and our proof does not depend on it.
The key observation is that our version of Sulzmann and Lu's algorithm
\cite{Sulzmann2014} represents
derivative terms in a way that allows efficient de-duplication,
so we do not need an equivalence checker that identifies ACI-equivalent
terms.
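To illustrate what we mean by efficient de-duplication, here is a minimal
sketch in Scala; the datatype and the function below are ours for
illustration and are not the definitions from our formalisation. Because
alternatives are kept as a flat list, duplicates arising from the
ACI-axioms can be removed in one linear pass, without any pairwise
equivalence checking:

\begin{verbatim}
// Regular expressions with a flat, n-ary alternative constructor
// (an illustrative sketch, not our formalised definitions)
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp   // n-ary alternatives
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// Remove duplicate alternatives, keeping the first occurrence of
// each term; `distinct` preserves the order of first occurrences,
// which is what a POSIX lexer needs.
def dedup(r: Rexp): Rexp = r match {
  case ALTS(rs)    => ALTS(rs.map(dedup).distinct)
  case SEQ(r1, r2) => SEQ(dedup(r1), dedup(r2))
  case STAR(r1)    => STAR(dedup(r1))
  case _           => r
}
\end{verbatim}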

Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}.
They first introduced the elegant and simple idea of injection-based lexing
and bit-coded lexing.
In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas
into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014}, based on the ideas
by Frisch and Cardelli \cite{Frisch2004}, were later
found to have unfillable gaps by Ausaf \cite{Ausaf},
who came up with an alternative proof inspired by
Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to
shuffling regular expressions \cite{Sulzmann2015},
which are a very succinct way of describing basic regular expressions.

Regular expressions and lexers have been a popular topic in
the theorem proving and functional programming communities.
In the following we give a list of lexers
and matchers that come with a machine-checked correctness proof.

\subsection{Matchers and Lexers with Mechanised Proofs}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one, in Isabelle/HOL, is part
of the work by Krauss and Nipkow \parencite{Krauss2011}; another,
in Coq, is given by Coquand and Siles \parencite{Coquand2012};
and Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent works, to the best of our knowledge, are the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatim} lexers.
The Verbatim++ lexer adds many correctness-preserving
optimisations to the Verbatim lexer,
and is therefore quite fast on many inputs.
The drawback is that they chose to use DFAs to speed things up,
for which dealing with bounded repetitions is a bottleneck.

This thesis builds on the formal specification of the POSIX
rules and the formal proofs by Ausaf et al. \cite{AusafDyckhoffUrban2016}.
The treatment of bounded repetitions is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with in theorem provers \cite{Nipkow1998}.
One way to represent automata is as graphs, but graphs are not inductive datatypes.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct;
if they are not, the nodes need to be renamed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.
Another representation of automata is as matrices,
but induction over them is not straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets.
There the problem of combining graphs can be handled better,
but we believe that such clever tricks are not very obvious to
the average Isabelle user.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} developed a notion of
\emph{non-problematic values}, which is a slight variation
of the values defined by the inhabitation relation in Figure~\ref{fig:inhab}.
They then defined an ordering between values and showed that
the maximal element of those values corresponds to the output
of their GREEDY lexing algorithm.

Okui and Suzuki \cite{Okui10} allow iterations of values to
flatten to the empty
string in the star inhabitation rule of
Figure~\ref{fig:inhab}.
They refer to the more restrictive version used
in this thesis (which was defined by Ausaf et al.
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
The very interesting link between the work by Ausaf et al.
and that of Okui and Suzuki is that their formalisations
of POSIX values are distinct, and yet provably equivalent. See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the formalisation.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves also came up with a POSIX definition \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish by a pencil-and-paper
proof that their algorithm outputs the minimal element.
But having the ordering relation take regular expressions as parameters
causes the transitivity of their ordering to not go through.


\section{Optimisations}
Perhaps the biggest problem preventing derivative-based lexing
from being more widely adopted
is that it tends not to be blindingly fast in practice, unable to
reach throughputs of gigabytes per second, which was the kind of
application we had in mind when we initially started looking at this topic.
Commercial
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.
%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
One such optimisation is defining string derivatives
directly, without recursively decomposing them into
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition~\ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because the input is
processed at a larger granularity.
It would also be interesting to explore whether this can be done
for $\inj$ as well, so that we can generate a lexical value
rather than simply obtain a matcher.
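As an illustration of the string-derivative idea, the following is a
minimal sketch in Scala, assuming a standard regular expression datatype
and the usual character derivative; the definitions are illustrative and
not from our formalisation. Alternatives receive the whole remaining
string in one step:

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// the usual character-by-character derivative
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO | ONE  => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// string derivative: alternatives are given the whole remaining
// string at once; all other cases fall back to the
// character-by-character definition
def ders(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, _)         => r
  case (_, ALT(r1, r2)) => ALT(ders(s, r1), ders(s, r2))
  case (c :: cs, _)     => ders(cs, der(c, r))
}
\end{verbatim}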
\subsection{Derivatives and Zippers}
A zipper is a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on a large tree often only deal with
local regions at a time,
and it would therefore be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
Instead, the focus is put on the relevant subtree,
turning the other parts of the tree into a context.
Zippers were first formally described by Huet \cite{HuetZipper}.
Typical applications of zippers involve text editor buffers
and proof system databases.
One observation about our derivative-based lexing algorithm is that
the derivative operation sometimes traverses the entire regular expression
unnecessarily.
In our setting, the idea is therefore to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
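As a concrete picture, the following is a minimal Huet-style zipper over
regular expressions in Scala; the names (\texttt{Ctx}, \texttt{plug}, and
so on) are ours for illustration and not from our formalisation:

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// A context records the path from the root to the focused subterm.
abstract class Ctx
case object Top extends Ctx
case class AltL(right: Rexp, up: Ctx) extends Ctx // in left ALT branch
case class AltR(left: Rexp, up: Ctx) extends Ctx  // in right ALT branch
case class SeqL(right: Rexp, up: Ctx) extends Ctx
case class SeqR(left: Rexp, up: Ctx) extends Ctx

// A zipper is a focused subterm together with its context.
case class Zipper(focus: Rexp, ctx: Ctx)

// Editing happens at the focus; the full term is rebuilt only when
// needed, by walking the context upwards.
def plug(z: Zipper): Rexp = z.ctx match {
  case Top         => z.focus
  case AltL(r, up) => plug(Zipper(ALT(z.focus, r), up))
  case AltR(l, up) => plug(Zipper(ALT(l, z.focus), up))
  case SeqL(r, up) => plug(Zipper(SEQ(z.focus, r), up))
  case SeqR(l, up) => plug(Zipper(SEQ(l, z.focus), up))
}
\end{verbatim}

For example, descending into the left branch of \texttt{ALT(r1, r2)}
yields \texttt{Zipper(r1, AltL(r2, Top))}; a derivative step that only
affects \texttt{r1} can then be performed without touching \texttt{r2}.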

We now survey several works on parsing, derivatives
and zippers.

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm is linear in the length of the input string.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and a fixpoint construction to eliminate infinite recursion,
which is a major problem when using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic in the size of the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers had multiple focuses,
instead of the single one of a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.

We aim to integrate such a zipper data structure into our lexer.
The idea is simple: using a zipper with multiple focuses,
just as Darragh and Adams did, we could represent
\[
x\cdot r + y \cdot r + \ldots
\]
as
\[
(x + y + \ldots) \cdot r.
\]
This would greatly reduce the amount of space needed
when we have many terms like $x\cdot r$,
and would allow the lexer to not traverse the entire
derivative tree generated
in every intermediate step, but only the small segment
that is currently active.

Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising the correctness of
such a lexer using zippers will probably require an imperative
framework with separation logic, as it involves
mutable data structures; this is also interesting to look into.

To further optimise the algorithm,
we plan to add a ``deduplicate'' function that determines
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
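A minimal sketch of such a checker in Scala is shown below, using the
classical bisimulation construction over derivatives; it assumes the
\texttt{Rexp} datatype with \texttt{nullable} and \texttt{der} from the
earlier sketch and is not part of our formalisation. Note that for the
check to terminate, derivatives have to be normalised (for example with
the de-duplication function sketched before), since the set of
derivatives is finite only modulo the ACI-axioms:

\begin{verbatim}
// Two regular expressions are equivalent iff no pair of their
// iterated derivatives disagrees on nullability (a bisimulation).
// Assumes Rexp, nullable and der as in the earlier sketch; for
// termination, der must normalise its results modulo ACI.
def equiv(r1: Rexp, r2: Rexp, alphabet: Set[Char]): Boolean = {
  def bisim(todo: List[(Rexp, Rexp)],
            seen: Set[(Rexp, Rexp)]): Boolean =
    todo match {
      case Nil => true
      case (a, b) :: rest if seen((a, b)) => bisim(rest, seen)
      case (a, b) :: rest =>
        if (nullable(a) != nullable(b)) false
        else {
          val next = alphabet.toList.map(c => (der(c, a), der(c, b)))
          bisim(next ++ rest, seen + ((a, b)))
        }
    }
  bisim(List((r1, r2)), Set())
}
\end{verbatim}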
\section{Work on Back-References}
We introduced regular expressions with back-references
in Chapter~\ref{Introduction}.
In this section we adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the rewbr matching problem, which is therefore NP-complete.
Given the depth of the problem, progress on matching
arbitrary rewbrs in full generality has been slow, with
theoretical work on them being
fairly recent.

C\^ampeanu et al. studied regexes with back-references
in the context of formal language theory in
their 2003 work \cite{campeanu2003formal}.

does not grow indefinitely),
then they cannot claim with confidence having solved the problem
of catastrophic backtracking.

|