
Automata formalisations are in general harder and more cumbersome to deal
with for theorem provers \cite{Nipkow1998}.
One way to represent automata is as graphs, but graphs are not inductive datatypes.
Having to base the induction principle on the number of nodes
in a graph makes formal reasoning non-intuitive and convoluted,
resulting in large formalisations \cite{Lammich2012}.
When combining two graphs, one also needs to make sure that the nodes in
both graphs are distinct.
If they are not distinct, then renaming of the nodes is needed.
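To illustrate the renaming issue, here is a toy sketch (not taken from any of the cited formalisations) of automata represented as graphs over integer node identifiers; before the union of two graphs can be formed, the identifiers of one graph have to be shifted out of the way:
\begin{verbatim}
// Toy sketch: graphs over integer node ids with labelled edges.
case class Graph(nodes: Set[Int], edges: Set[(Int, Char, Int)])

// Combine two graphs; since the id sets might overlap, rename (shift)
// the ids of the second graph above the maximum id of the first.
def disjointUnion(g1: Graph, g2: Graph): Graph = {
  val offset = if (g1.nodes.isEmpty) 0 else g1.nodes.max + 1
  val g2r = Graph(
    g2.nodes.map(_ + offset),
    g2.edges.map { case (s, c, t) => (s + offset, c, t + offset) })
  Graph(g1.nodes ++ g2r.nodes, g1.edges ++ g2r.edges)
}
\end{verbatim}
In a formalisation, every lemma about the combined automaton then has to be transported across this renaming, which is part of what makes graph-based proofs cumbersome.
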
Using Coq, which provides dependent types, can potentially make things slightly easier
\cite{Doczkal2013}.
Matrices are another representation for automata,
but the induction for them is not as straightforward either.
Both approaches have been used in the past and resulted in huge formalisations.
There are some more clever representations, for example one by Paulson
using hereditarily finite sets \cite{Paulson2015}.
There the problem with combining graphs can be solved better.
%but we believe that such clever tricks are not very obvious for
%the John-average-Isabelle-user.

\subsection{Different Definitions of POSIX Rules}
There are different ways to formalise values and POSIX matching.
Frisch and Cardelli \cite{Frisch2004} have developed a notion of
\emph{non-problematic values} which is a slight variation
algorithm outputs the minimal element by a pencil-and-paper proof.
But having the ordering relation take regular expressions as parameters
means that the transitivity of their ordering does not go through.

\subsection{Static Analysis of Evil Regex Patterns}
When a regular expression does not behave as intended,
people usually try to rewrite the regex into some equivalent form,
or they try to avoid the possibly problematic patterns completely,
for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions, such as
\parencite{regexploit2021} and \parencite{regex101}, are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-case time estimates
for regexes, attack strings that exploit the worst-case
scenario, and ``attack automata'' that generate
attack strings.
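A standard example of such an evil pattern (an illustration, not taken from the works above) is
\[
  (a^{*})^{*}\,b
\]
matched against the input $a^{n}$: an automaton-based matcher fails in linear time, whereas a backtracking matcher tries all the exponentially many ways of splitting the $a$'s between the inner and the outer star before it can report failure.
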
\section{Optimisations}
Perhaps the biggest problem that prevents derivative-based lexing
from being more widely adopted
is that it tends not to be blindingly fast in practice, unable to
reach throughputs of gigabytes per second, which is the
are capable of inspecting payloads
at line rates (which can be up to a few gigabits per second) against
thousands of rules \cite{communityRules}.
For our algorithm to be more attractive for practical use, we
need more correctness-preserving optimisations.
|
FPGA-based implementations such as \cite{Sidhu2001}
have the advantages of being
reconfigurable and parallel, but suffer from lower clock frequencies
and limited scalability.
Traditional automata approaches that use a DFA instead of an NFA
benefit from the fact that only a single transition is needed
for each input character \cite{Becchi08}; the lower memory bandwidth leads
to faster performance.
However, they suffer from exponential blow-ups on bounded repetitions.
Compression techniques are used, such as those in \cite{Kumar2006} and
\cite{Becchi2007}.
Variations of pure NFAs or DFAs, such as counting-set automata \cite{Turonova2020},
have been
proposed to better deal with bounded repetitions.
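As a standard example of this blow-up (not specific to any of the implementations above), the regular expression $(a|b)^{*}\,a\,(a|b)^{\{n\}}$ forces any DFA to have at least $2^{n}$ states, because the automaton has to remember which of the last $n{+}1$ input characters were $a$'s; a counting-set automaton instead tracks the bounded repetition with a set of counters, avoiding the exponential state blow-up.
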
%So far our focus has been mainly on the bit-coded algorithm,
%but the injection-based lexing algorithm
%could also be sped up in various ways:
%
Another direction of optimisation for derivative-based approaches
is defining string derivatives
directly, without recursively decomposing them into
character-by-character derivatives. For example, instead of
replacing $(r_1 + r_2)\backslash (c::cs)$ by
$((r_1 + r_2)\backslash c)\backslash cs$ (as in definition \ref{table:der}), we rather
calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
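A minimal sketch of this idea in Scala (an illustration only, not the formalised definition used in this thesis): the string derivative distributes over alternatives as a whole and only falls back to character-by-character derivatives for the other constructors.
\begin{verbatim}
// A sketch only: a plain regular-expression datatype with Brzozowski
// character derivatives, plus a string derivative that distributes
// over alternatives directly.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// standard Brzozowski character derivative
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// string derivative: alternatives take the whole string at once,
// using (r1 + r2)\s = r1\s + r2\s
def ders(s: List[Char], r: Rexp): Rexp = (s, r) match {
  case (Nil, r1) => r1
  case (_, ALT(r1, r2)) => ALT(ders(s, r1), ders(s, r2))
  case (c :: cs, r1) => ders(cs, der(c, r1))
}
\end{verbatim}
The interesting (and harder) cases are sequences and stars, for which direct string-derivative rules would avoid materialising every intermediate character derivative.
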
local regions each time.
Therefore it would be a waste to
traverse the entire tree if
the operation only
involves a small fraction of it.
The zipper was first formally described by Huet \cite{Huet1997}.
Typical applications of zippers involve text editor buffers
and proof system databases.
In our setting, the idea is to compactify the representation
of derivatives with zippers, thereby making our algorithm faster.
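To recall the underlying idea, here is a minimal sketch of a Huet-style zipper on plain binary trees (an illustration only, not the regular-expression zippers used in the works discussed below): the focus together with a path of contexts allows local edits without traversing or rebuilding the rest of the tree.
\begin{verbatim}
// A sketch only: a zipper over binary trees.
sealed trait Tree
case class Leaf(n: Int) extends Tree
case class Node(left: Tree, right: Tree) extends Tree

// The context records the path from the focus back to the root,
// keeping the siblings that were passed on the way down.
sealed trait Ctx
case object Top extends Ctx
case class WentLeft(parent: Ctx, right: Tree) extends Ctx
case class WentRight(left: Tree, parent: Ctx) extends Ctx

case class Loc(focus: Tree, ctx: Ctx) {
  def downLeft: Loc = focus match {
    case Node(l, r) => Loc(l, WentLeft(ctx, r))
    case _          => this
  }
  def downRight: Loc = focus match {
    case Node(l, r) => Loc(r, WentRight(l, ctx))
    case _          => this
  }
  def up: Loc = ctx match {
    case Top             => this
    case WentLeft(p, r)  => Loc(Node(focus, r), p)
    case WentRight(l, p) => Loc(Node(l, focus), p)
  }
  // a local modification touches only the focus, not the whole tree
  def modify(f: Tree => Tree): Loc = copy(focus = f(focus))
}
\end{verbatim}
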
We introduce several works on parsing, derivatives
They adopted zippers to improve the speed, and argued that the runtime
complexity of their algorithm was linear with respect to the input string.

The idea of using Brzozowski derivatives for general context-free
parsing was first implemented
by Might et al.\ \cite{Might2011}.
They used memoization and fixpoint construction to eliminate infinite recursion,
which is a major problem when using derivatives on context-free grammars.
The initial version was quite slow: exponential in the size of the input.
Adams et al.\ \cite{Adams2016} improved the speed and argued that their version
was cubic with respect to the input.
Darragh and Adams \cite{Darragh2020} further improved the performance
by using zippers in an innovative way: their zippers had multiple focuses,
instead of just one as in a traditional zipper, in order to handle ambiguity.
Their algorithm was not formalised, though.

Some initial results have been obtained, but more care needs to be taken to make sure
that the lexer output conforms to the POSIX standard. Formalising the correctness of
such a lexer using zippers will probably require an imperative
framework with separation logic, as it involves
mutable data structures, which is also interesting to look into.

To further optimise the algorithm,
we plan to add a ``deduplicate'' function that tells
whether two regular expressions denote the same
language, using
an efficient and verified equivalence checker.
In this way, the internal data structures can be pruned more aggressively,
leading to better simplifications and ultimately faster algorithms.
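A minimal sketch of what such a pruning step could look like (the equivalence checker \texttt{equiv} is hypothetical here and would have to come from a separate, verified decision procedure):
\begin{verbatim}
// A sketch only: keep one representative per language among a list of
// alternatives, given a (hypothetical) verified equivalence checker.
// Applied to the children of an alternative node during simplification,
// this is stronger than purely syntactic duplicate removal.
def dedup[A](alts: List[A], equiv: (A, A) => Boolean): List[A] =
  alts.foldLeft(List.empty[A]) { (acc, r) =>
    if (acc.exists(equiv(_, r))) acc else acc :+ r
  }
\end{verbatim}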
|
\subsection{Back-References}
We introduced regular expressions with back-references
in chapter \ref{Introduction}.
We adopt the common practice of calling them rewbrs (Regular Expressions
With Back References) for brevity.
It has been shown by Aho \cite{Aho1990}
that the k-vertex-cover problem can be reduced
to the problem of rewbr matching, which is therefore NP-complete.
Given the depth of the problem, the progress made on matching arbitrary rewbrs
in full generality has been slow, with
flavours of back-reference syntax (e.g.\ whether references can be circular,
whether labels can be repeated, etc.).
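For an illustration of the extra expressive power behind this hardness (a standard example, not taken from the cited works): over the alphabet $\{a,b\}$, the rewbr $((a|b)^{*})\backslash 1$, where $\backslash 1$ refers back to the string matched by the bracketed group, matches exactly the copy language
\[
  \{\, w\,w \mid w \in \{a,b\}^{*} \,\},
\]
which is not even context-free.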

Thanks to our theorem-prover-friendly approach,
we believe that
this finiteness bound can be improved to a bound that is
linear in the size of the input and
cubic in the size of the regular expression, using a technique by
Antimirov \cite{Antimirov95}.
Once formalised, this would be a guarantee for the absence of all super-linear behaviour.
We are working out the
details.
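For reference, the relevant fact, as we read \cite{Antimirov95}, is that the set of partial derivatives of a regular expression is finite and at most linear in its size: writing $\textit{pders}\;s\;r$ for the set of partial derivatives of $r$ with respect to a string $s$ and $\textit{awidth}(r)$ for the number of character occurrences in $r$, we have
\[
  \Big|\, \bigcup_{s \in \Sigma^{*}} \textit{pders}\;s\;r \,\Big| \;\leq\; \textit{awidth}(r) + 1.
\]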

To the best of our knowledge, no lexing libraries using Brzozowski derivatives
have similar complexity-related bounds,
and claims about running time are usually speculative and backed by empirical
evidence on a few test cases.
If a matching or lexing algorithm
does not come with certain basic complexity-related
guarantees (for example, that the internal data structures
do not grow indefinitely),
then it cannot claim with confidence to have solved the problem
of catastrophic backtracking.