In a follow-up work \cite{Sulzmann2014b}, Sulzmann and Steenhoven
incorporated these ideas
into a tool called \emph{dreml}.
The pencil-and-paper proofs in \cite{Sulzmann2014} based on the ideas
by Frisch and Cardelli \cite{Frisch2004} were later
found to have unfillable gaps by Ausaf et al.~\cite{AusafDyckhoffUrban2016},
who came up with an alternative proof inspired by
Vansummeren \cite{Vansummeren2006}.
Sulzmann and Thiemann extended Brzozowski derivatives to
shuffling regular expressions \cite{Sulzmann2015},
which are a very succinct way of describing regular expressions.


Regular expressions and lexers have been a popular topic among
the theorem proving and functional programming communities.
For example, there is a mechanised correctness proof
of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. Another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois gave one in Agda \parencite{RibeiroAgda2017}.
The most recent works on verified lexers, to the best of our knowledge, are the Verbatim \cite{Verbatim}
and Verbatim++ \cite{Verbatimpp} lexers by Egolf et al.
Verbatim is based on derivatives but does not simplify them.
Therefore it can be very slow on evil regular expressions.
The Verbatim++ lexer adds many correctness-preserving
optimisations to the Verbatim lexer,
and is therefore quite fast on many inputs.
The problem is that Egolf et al. chose to use traditional DFAs to speed up lexing,
but then dealing with bounded repetitions is a real bottleneck.


This thesis builds on the formal specifications of POSIX
rules and formal proofs by Ausaf et al.~\cite{AusafDyckhoffUrban2016}.
The work on bounded repetitions presented here is a continuation of the work by Ausaf \cite{Ausaf}.

Automata formalisations are in general harder and more cumbersome to deal
with for theorem provers \cite{Nipkow1998}.
One way to represent automata is to use graphs, but graphs are not inductive datatypes.
Having to set up the induction principle on the number of nodes
makes the formal reasoning unwieldy,
resulting in large formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct.
If they are not distinct, then a renaming of the nodes is needed.
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.

Another representation for automata is matrices,
but the induction for them is not as straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets \cite{Paulson2015}.
There the problem of combining graphs can be solved better.
%but we believe that such clever tricks are not very obvious for
%the John-average-Isabelle-user.
Because of these problems with automata, we prefer regular expressions
and derivatives, which can be implemented in theorem provers and functional programming languages
with ease.
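To illustrate this point, below is a minimal sketch of a
Brzozowski-style matcher in Haskell. The datatype and the names
(\texttt{Re}, \texttt{nullable}, \texttt{deriv}, \texttt{matches})
are ours, chosen purely for illustration, and are not taken from any
of the cited formalisations.
\begin{verbatim}
-- a sketch of a Brzozowski-style matcher; names are illustrative only
data Re = Zero | One | Ch Char
        | Alt Re Re | Seq Re Re | Star Re

-- can the regular expression match the empty string?
nullable :: Re -> Bool
nullable Zero        = False
nullable One         = True
nullable (Ch _)      = False
nullable (Alt r1 r2) = nullable r1 || nullable r2
nullable (Seq r1 r2) = nullable r1 && nullable r2
nullable (Star _)    = True

-- derivative of a regular expression with respect to one character
deriv :: Re -> Char -> Re
deriv Zero        _ = Zero
deriv One         _ = Zero
deriv (Ch d)      c = if c == d then One else Zero
deriv (Alt r1 r2) c = Alt (deriv r1 c) (deriv r2 c)
deriv (Seq r1 r2) c
  | nullable r1     = Alt (Seq (deriv r1 c) r2) (deriv r2 c)
  | otherwise       = Seq (deriv r1 c) r2
deriv (Star r)    c = Seq (deriv r c) (Star r)

-- match a whole string by folding the derivative over its characters
matches :: Re -> String -> Bool
matches r s = nullable (foldl deriv r s)
\end{verbatim}
The whole matcher consists of a handful of structurally recursive
equations, which is exactly the kind of definition that theorem
provers handle well.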


\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} have developed a notion of
\emph{non-problematic values}, which is a slight variation
of the values used in this thesis.
They refer to the more restrictive version as used
in this thesis (which was defined by Ausaf et al.
\cite{AusafDyckhoffUrban2016}) as \emph{canonical values}.
The very interesting link between the work by Ausaf et al.
and that of Okui and Suzuki is that they have distinct formalisations
of POSIX values, and yet they define the same notion. See
\cite{AusafDyckhoffUrban2016} for details of the
alternative definitions given by Okui and Suzuki and the formalisation
described in this thesis.
%TODO: EXPLICITLY STATE the OKUI SUZUKI POSIX DEF.

Sulzmann and Lu themselves have come up with a POSIX definition \cite{Sulzmann2014}.
In their paper they defined an ordering between values with respect
to regular expressions, and tried to establish that their
algorithm outputs the minimal element by a pencil-and-paper proof.
But because their ordering relation takes regular expressions as parameters,
the transitivity of their ordering does not go through.

\subsection{Static Analysis of Evil Regex Patterns}
When faced with catastrophic backtracking,
programmers sometimes
have to rewrite their regexes in an ad hoc fashion.
Knowing which patterns should be avoided
can be helpful.
Tools that detect such evil patterns
often over-approximate,
and many false positives exist \cite{Davis18}.
Animated tools to ``debug'' regular expressions such as
\parencite{regexploit2021} \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour with backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-time estimates
for regexes, ``attack strings'' that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.
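To give a concrete (and well-known) example of such a pattern and its
attack string: on an input of the form $a^n$, a backtracking matcher
for $(a^*)^* \cdot b$ has to try exponentially many ways of splitting
the $a$s between the two stars before it can conclude that the string
does not match.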

\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing
from being more widely adopted
is that derivative-based lexers tend not to be very fast in practice, unable to
reach throughputs like gigabytes per second, which is the
kind of performance we had in mind when we initially started looking at the topic.
Commercial
regular expression matchers such as Snort \cite{Snort1999} and Bro \cite{Bro}
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of regex rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.

FPGA-based implementations such as \cite{Sidhu2001}
have the advantage of being
reconfigurable and parallel, but suffer from lower clock frequencies
and limited scalability.
Traditional automaton approaches that use DFAs instead of NFAs
benefit from the fact that only a single transition is needed
for each input character \cite{Becchi08}; the lower memory bandwidth
requirement leads to faster performance.
However, they suffer from exponential memory requirements on bounded repetitions.
Compression techniques are used, such as those in \cite{Kumar2006} and
\cite{Becchi2007}.
Variations of pure NFAs or DFAs like counting-set automata \cite{Turonova2020}
have been
proposed to better deal with bounded repetitions,
but they are usually not used in formalised proofs.
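A small example illustrates the memory blow-up: any DFA for
$(a + b)^* \cdot a \cdot (a + b)^{\{n\}}$ needs at least $2^n$ states,
because it must remember which of the last $n{+}1$ characters read
were $a$s.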

%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
One possible optimisation is to take derivatives with respect to whole strings
rather than character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
This has the potential to speed up matching because the input is
processed at a larger granularity.
One interesting question is whether this can be done
for $\inj$ as well, so that we can generate a lexical value
rather than simply obtain a matcher.
It is also not yet clear how this idea can be extended to sequences and stars.
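As a rough sketch of this idea, restricted to the alternative case and
reusing the illustrative \texttt{Re} datatype and \texttt{deriv}
function from the matcher sketch earlier in this chapter:
\begin{verbatim}
-- a sketch only: string derivatives are pushed into alternatives,
-- with character-by-character derivatives as the fallback
sderiv :: Re -> String -> Re
sderiv r []          = r
sderiv (Alt r1 r2) s = Alt (sderiv r1 s) (sderiv r2 s)
sderiv r (c:cs)      = sderiv (deriv r c) cs
\end{verbatim}
A matcher is then obtained as before by testing
\texttt{nullable (sderiv r s)}.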

\subsection{Derivatives and Zippers}
Zippers are a data structure designed to focus on
and navigate between local parts of a tree.
The idea is that operations on large trees often deal with
only small local regions at a time;
it would therefore be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
Zippers were first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text editor buffers
and proof system databases.
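For illustration, here is a minimal Huet-style zipper for binary
trees; the datatype and function names are ours and serve only to
show the idea of a focused subtree paired with its surrounding
context:
\begin{verbatim}
data Tree a = Leaf a | Node (Tree a) (Tree a)

-- the context records the path from the focused subtree to the root
data Ctx a = Top                   -- the focus is the whole tree
           | InL (Ctx a) (Tree a)  -- went left; right sibling stored
           | InR (Tree a) (Ctx a)  -- went right; left sibling stored

type Zipper a = (Tree a, Ctx a)

-- move the focus to the left child (no-op on a leaf)
left :: Zipper a -> Zipper a
left (Node l r, c) = (l, InL c r)
left z             = z

-- rebuild one layer of the tree and move the focus back up
up :: Zipper a -> Zipper a
up (t, InL c r) = (Node t r, c)
up (t, InR l c) = (Node l t, c)
up z            = z
\end{verbatim}
Moving the focus and editing near it take constant time per step,
since only the immediate context has to be rebuilt.
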
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
Below we describe several works on parsing, derivatives
and zippers.

Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
\cite{Zippy2020}.
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm was linear with respect to the input string,
though they did not provide a formal proof of this claim.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al. \cite{Might2011}.
They used memoization and fixpoint constructions to eliminate infinite recursion,
which is a major problem for using derivatives with context-free grammars.
Their initial version was quite slow: exponential in the size of the input.
Adams et al. \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers had multiple focuses,
instead of the single one of a traditional zipper, in order to handle ambiguity.
Their algorithm was, however, not formalised.

\subsection{Back-References}
We mentioned the back-reference construct
in chapter \ref{Introduction}.
We adopt the common practice of calling regular expressions
with back-references rewbrs (Regular Expressions
With Back References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex cover problem can be reduced
to the problem of matching with rewbrs; such matching is therefore NP-complete.
Given the difficulty of the problem, progress on matching
arbitrary rewbrs in full generality has been slow, with
theoretical work on them being
fairly recent.

Campeanu et al. studied rewbrs
in the context of formal language theory in
\cite{campeanu2003formal}.
They devised a pumping lemma for Perl-like regexes,
proving that the languages denoted by them
are properly contained in the context-sensitive
languages.
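A classic example of this extra power is the rewbr
$((a + b)^*)\backslash 1$, where $\backslash 1$ refers back to
whatever the bracketed subexpression matched: matched against whole
strings, it denotes the copy language
$\{ww \mid w \in \{a,b\}^*\}$, which is context-sensitive but not
context-free.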
More interesting questions, such as
whether the languages denoted by Perl-like regexes
can express palindromes ($\{w \mid w = w^R\}$),
are discussed in \cite{campeanu2009patterns}
and \cite{CAMPEANU2009Intersect}.
Freydenberger \cite{Frey2013} investigated the
expressive power of back-references. He showed several
undecidability and descriptional complexity results
for back-references, concluding that they add
great power to certain programming tasks but are difficult to comprehend.
An interesting question would then be
whether we can add restrictions to them,
so that they become expressive enough for practical use, such
as matching HTML files, but not too powerful.
Freydenberger and Schmid \cite{FREYDENBERGER20191}
introduced the notion of deterministic
regular expressions with back-references to achieve
a better balance between expressiveness and tractability.

Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
investigated the time complexity of different variants
of back-references.
We are not aware of any work that uses derivatives with back-references.

%See \cite{BERGLUND2022} for a survey
%of these works and comparison between different
%flavours of back-references syntax (e.g. whether references can be circular,
%can labels be repeated etc.).