195 requires |
195 requires |
196 more research attention. |
196 more research attention. |
197 |
197 |
198 \section{Why are current algorithm for regexes slow?} |
198 \section{Why are current algorithm for regexes slow?} |
199 |
199 |
200 Theoretical results say that regular expression matching |
200 %find literature/find out for yourself that REGEX->DFA on basic regexes |
201 should be linear with respect to the input. |
201 %does not blow up the size |
202 You could construct |
202 Shouldn't regular expression matching be linear? |
203 an NFA via Thompson construction, and simulate running it. |
203 How can one explain the super-linear behaviour of the |
204 This would be linear. |
204 regex matching engines we have? |
205 Or you could determinize the NFA into a DFA, and minimize that |
205 The time cost of regex matching algorithms in general |
206 DFA. Once you have the DFA, the running time is also linear, requiring just |
206 involve two phases: |
207 one scanning pass of the input. |
207 the construction phase, in which the algorithm builds some |
208 |
208 suitable data structure from the input regex $r$, we denote |
209 But modern regex libraries in popular language engines |
209 the time cost by $P_1(r)$. |
210 often want to support richer constructs |
210 The lexing |
211 such as bounded repetitions and back-references |
211 phase, when the input string $s$ is read and the data structure |
212 rather than the most basic regular expressions. |
212 representing that regex $r$ is being operated on. We represent the time |
213 %put in illustrations |
213 it takes by $P_2(r, s)$.\\ |
214 %a{1}{3} |
214 In the case of a $\mathit{DFA}$, |
215 These make a DFA construction impossible because |
215 we have $P_2(r, s) = O( |s| )$, |
216 of an exponential states explosion. |
216 because we take at most $|s|$ steps, |
217 They also want to support lexing rather than just matching |
217 and each step takes |
218 for tasks that involves text processing. |
218 at most one transition-- |
219 |
219 a deterministic-finite-automata |
220 Existing regex libraries either pose restrictions on the user input, or does |
220 by definition has at most one state active and at most one |
221 not give linear running time guarantee. |
221 transition upon receiving an input symbol. |
|
222 But unfortunately in the worst case |
|
223 $P_1(r) = O(exp^{|r|})$. An example will be given later. \\ |
|
224 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold |
|
225 expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$. |
|
226 The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack. |
|
227 On the other hand, if backtracking is used, the worst-case time bound bloats |
|
228 to $|r| * 2^|s|$ . |
|
229 %on the input |
|
230 %And when calculating the time complexity of the matching algorithm, |
|
231 %we are assuming that each input reading step requires constant time. |
|
232 %which translates to that the number of |
|
233 %states active and transitions taken each time is bounded by a |
|
234 %constant $C$. |
|
235 %But modern regex libraries in popular language engines |
|
236 % often want to support much richer constructs than just |
|
237 % sequences and Kleene stars, |
|
238 %such as negation, intersection, |
|
239 %bounded repetitions and back-references. |
|
240 %And de-sugaring these "extended" regular expressions |
|
241 %into basic ones might bloat the size exponentially. |
|
242 %TODO: more reference for exponential size blowup on desugaring. |
|
243 \subsection{Tools that uses $\mathit{DFA}$s} |
|
244 %TODO:more tools that use DFAs? |
|
245 $\mathit{LEX}$ and $\mathit{JFLEX}$ are tools |
|
246 in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based |
|
247 lexers. The user provides a set of regular expressions |
|
248 and configurations to such lexer generators, and then |
|
249 gets an output program encoding a minimized $\mathit{DFA}$ |
|
250 that can be compiled and run. |
|
251 The good things about $\mathit{DFA}$s is that once |
|
252 generated, they are fast and stable, unlike |
|
253 backtracking algorithms. |
|
254 However, they do not scale well with bounded repetitions.\\ |
|
255 |
|
256 Bounded repetitions, usually written in the form |
|
257 $r^{\{c\}}$ (where $c$ is a constant natural number), |
|
258 denotes a regular expression accepting strings |
|
259 that can be divided into $c$ substrings, where each |
|
260 substring is in $r$. |
|
261 For the regular expression $(a|b)^*a(a|b)^{\{2\}}$, |
|
262 an $\mathit{NFA}$ describing it would look like: |
|
263 \begin{center} |
|
264 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
|
265 \node[state,initial] (q_0) {$q_0$}; |
|
266 \node[state, red] (q_1) [right=of q_0] {$q_1$}; |
|
267 \node[state, red] (q_2) [right=of q_1] {$q_2$}; |
|
268 \node[state, accepting, red](q_3) [right=of q_2] {$q_3$}; |
|
269 \path[->] |
|
270 (q_0) edge node {a} (q_1) |
|
271 edge [loop below] node {a,b} () |
|
272 (q_1) edge node {a,b} (q_2) |
|
273 (q_2) edge node {a,b} (q_3); |
|
274 \end{tikzpicture} |
|
275 \end{center} |
|
276 The red states are "countdown states" which counts down |
|
277 the number of characters needed in addition to the current |
|
278 string to make a successful match. |
|
279 For example, state $q_1$ indicates a match that has |
|
280 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
|
281 and just consumed the "delimiter" $a$ in the middle, and |
|
282 need to match 2 more iterations of $(a|b)$ to complete. |
|
283 State $q_2$ on the other hand, can be viewed as a state |
|
284 after $q_1$ has consumed 1 character, and just waits |
|
285 for 1 more character to complete. |
|
286 $q_3$ is the last state, requiring 0 more character and is accepting. |
|
287 Depending on the suffix of the |
|
288 input string up to the current read location, |
|
289 the states $q_1$ and $q_2$, $q_3$ |
|
290 may or may |
|
291 not be active, independent from each other. |
|
292 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
|
293 contain at least $2^3$ non-equivalent states that cannot be merged, |
|
294 because the subset construction during determinisation will generate |
|
295 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. |
|
296 Generalizing this to regular expressions with larger |
|
297 bounded repetitions number, we have that |
|
298 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s |
|
299 would require at least $2^{n+1}$ states, if $r$ contains |
|
300 more than 1 string. |
|
301 This is to represent all different |
|
302 scenarios which "countdown" states are active. |
|
303 For those regexes, tools such as $\mathit{JFLEX}$ |
|
304 would generate gigantic $\mathit{DFA}$'s or |
|
305 out of memory errors. |
|
306 For this reason, regex libraries that support |
|
307 bounded repetitions often choose to use the $\mathit{NFA}$ |
|
308 approach. |
|
309 \subsection{The $\mathit{NFA}$ approach to regex matching} |
|
310 One can simulate the $\mathit{NFA}$ running in two ways: |
|
311 one by keeping track of all active states after consuming |
|
312 a character, and update that set of states iteratively. |
|
313 This can be viewed as a breadth-first-search of the $\mathit{NFA}$ |
|
314 for a path terminating |
|
315 at an accepting state. |
|
316 Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this |
|
317 type of $\mathit{NFA}$ simulation, and guarantees a linear runtime |
|
318 in terms of input string length. |
|
319 %TODO:try out these lexers |
|
320 The other way to use $\mathit{NFA}$ for matching is choosing |
|
321 a single transition each time, keeping all the other options in |
|
322 a queue or stack, and backtracking if that choice eventually |
|
323 fails. This method, often called a "depth-first-search", |
|
324 is efficient in a lot of cases, but could end up |
|
325 with exponential run time.\\ |
|
326 %TODO:COMPARE java python lexer speed with Rust and Go |
|
327 The reason behind backtracking algorithms in languages like |
|
328 Java and Python is that they support back-references. |
|
329 \subsection{Back References in Regex--Non-Regular part} |
|
330 If we have a regular expression like this (the sequence |
|
331 operator is omitted for brevity): |
|
332 \begin{center} |
|
333 $r_1(r_2(r_3r_4))$ |
|
334 \end{center} |
|
335 We could label sub-expressions of interest |
|
336 by parenthesizing them and giving |
|
337 them a number by the order in which their opening parentheses appear. |
|
338 One possible way of parenthesizing and labelling is given below: |
|
339 \begin{center} |
|
340 $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ |
|
341 \end{center} |
|
342 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled |
|
343 by 1 to 4. $1$ would refer to the entire expression |
|
344 $(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc. |
|
345 These sub-expressions are called "capturing groups". |
|
346 We can use the following syntax to denote that we want a string just matched by a |
|
347 sub-expression (capturing group) to appear at a certain location again, |
|
348 exactly as it was: |
|
349 \begin{center} |
|
350 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots |
|
351 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$ |
|
352 \end{center} |
|
353 The backslash and number $i$ are used to denote such |
|
354 so-called "back-references". |
|
355 Let $e$ be an expression made of regular expressions |
|
356 and back-references. $e$ contains the expression $e_i$ |
|
357 as its $i$-th capturing group. |
|
358 The semantics of back-reference can be recursively |
|
359 written as: |
|
360 \begin{center} |
|
361 \begin{tabular}{c} |
|
362 $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ |
|
363 $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ |
|
364 \end{tabular} |
|
365 \end{center} |
|
366 The concrete example |
|
367 $((a|b|c|\ldots|z)^*)\backslash 1$ |
|
368 would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\ |
|
369 Back-reference is a construct in the "regex" standard |
|
370 that programmers found quite useful, but not exactly |
|
371 regular any more. |
|
372 In fact, that allows the regex construct to express |
|
373 languages that cannot be contained in context-free |
|
374 languages either. |
|
375 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$ |
|
376 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
|
377 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. |
|
378 Such a language is contained in the context-sensitive hierarchy |
|
379 of formal languages. |
|
380 Solving the back-reference expressions matching problem |
|
381 is NP-complete\parencite{alfred2014algorithms} and a non-bactracking, |
|
382 efficient solution is not known to exist. |
|
383 %TODO:read a bit more about back reference algorithms |
|
384 It seems that languages like Java and Python made the trade-off |
|
385 to support back-references at the expense of having to backtrack, |
|
386 even in the case of regexes not involving back-references.\\ |
|
387 Summing these up, we can categorise existing |
|
388 practical regex libraries into the ones with linear |
|
389 time guarantees like Go and Rust, which impose restrictions |
|
390 on the user input (not allowing back-references, |
|
391 bounded repetitions canno exceed 1000 etc.), and ones |
|
392 that allows the programmer much freedom, but grinds to a halt |
|
393 in some non-negligible portion of cases. |
222 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions |
394 %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions |
223 For example, the Rust regex engine claims to be linear, |
395 % For example, the Rust regex engine claims to be linear, |
224 but does not support lookarounds and back-references. |
396 % but does not support lookarounds and back-references. |
225 The GoLang regex library does not support over 1000 repetitions. |
397 % The GoLang regex library does not support over 1000 repetitions. |
226 Java and Python both support back-references, but shows |
398 % Java and Python both support back-references, but shows |
227 catastrophic backtracking behaviours on inputs without back-references( |
399 %catastrophic backtracking behaviours on inputs without back-references( |
228 when the language is still regular). |
400 %when the language is still regular). |
229 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac |
401 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac |
230 %TODO: verify the fact Rust does not allow 1000+ reps |
402 %TODO: verify the fact Rust does not allow 1000+ reps |
231 %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs? |
403 %TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs? |
|
404 \section{Buggy Regex Engines} |
|
405 |
|
406 |
232 Another thing about the these libraries is that there |
407 Another thing about the these libraries is that there |
233 is no correctness guarantee. |
408 is no correctness guarantee. |
234 In some cases they either fails to generate a lexing result when there is a match, |
409 In some cases they either fails to generate a lexing result when there is a match, |
235 or gives the wrong way of matching. |
410 or gives the wrong way of matching. |
236 |
411 |
237 |
412 |
238 It turns out that regex libraries not only suffer from |
413 It turns out that regex libraries not only suffer from |
239 exponential backtracking problems, |
414 exponential backtracking problems, |
240 but also undesired (or even buggy) outputs. |
415 but also undesired (or even buggy) outputs. |
241 %TODO: comment from who |
416 %TODO: comment from who |
242 xxx commented that most regex libraries are not |
417 Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not |
243 correctly implementing the POSIX (maximum-munch) |
418 correctly implementing the POSIX (maximum-munch) |
244 rule of regular expression matching. |
419 rule of regular expression matching. |
|
420 This experience is echoed by the writer's |
|
421 tryout of a few online regex testers: |
245 A concrete example would be |
422 A concrete example would be |
246 the regex |
423 the regex |
247 \begin{verbatim} |
424 \begin{verbatim} |
248 (((((a*a*)b*)b){20})*)c |
425 (((((a*a*)b*)b){20})*)c |
249 \end{verbatim} |
426 \end{verbatim} |
262 all the $b$'s in it. |
439 all the $b$'s in it. |
263 %TODO: give a coloured example of how this matches POSIXly |
440 %TODO: give a coloured example of how this matches POSIXly |
264 |
441 |
265 This regex would trigger catastrophic backtracking in |
442 This regex would trigger catastrophic backtracking in |
266 languages like Python and Java, |
443 languages like Python and Java, |
267 whereas it gives a correct but uninformative (non-POSIX) |
444 whereas it gives a non-POSIX and uninformative |
268 match in languages like Go or .NET--The match with only |
445 match in languages like Go or .NET--The match with only |
269 character $c$. |
446 character $c$. |
270 |
447 |
271 Backtracking or depth-first search might give us |
448 As Grathwohl\parencite{grathwohl2014crash} commented, |
272 exponential running time, and quite a few tools |
449 \begin{center} |
273 avoid that by determinising the $\mathit{NFA}$ |
450 ``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.'' |
274 into a $\mathit{DFA}$ and minimizes it. |
451 \end{center} |
275 For example, $\mathit{LEX}$ and $\mathit{JFLEX}$ are tools |
452 |
276 in $C$ and $\mathit{JAVA}$ that generates $\mathit{DFA}$-based |
453 \section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking} |
277 lexers. |
454 |
278 However, they do not scale well with bounded repetitions. |
455 |
279 Bounded repetitions, usually written in the form |
456 There is also static analysis work on regular expression that |
280 $r^{\{c\}}$ (where $c$ is a constant natural number), |
457 have potential expoential behavious. Rathnayake and Thielecke |
281 denotes a regular expression accepting strings |
458 \parencite{Rathnayake2014StaticAF} proposed an algorithm |
282 that can be divided into $c$ substrings, and each |
459 that detects regular expressions triggering exponential |
283 substring is in $r$. |
460 behavious on backtracking matchers. |
284 %TODO: draw example NFA. |
461 People also developed static analysis methods for |
285 For the regular expression $(a|b)^*a(a|b)^{\{2\}}$, |
462 generating non-linear polynomial worst-time estimates |
286 an $\mathit{NFA}$ describing it would look like: |
463 for regexes, attack string that exploit the worst-time |
287 \begin{center} |
464 scenario, and "attack automata" that generates |
288 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
465 attack strings \parencite{Weideman2017Static}. |
289 \node[state,initial] (q_0) {$q_0$}; |
466 There are also tools to "debug" regular expressions |
290 \node[state, red] (q_1) [right=of q_0] {$q_1$}; |
467 that allows people to see why a match failed or was especially slow |
291 \node[state, red] (q_2) [right=of q_1] {$q_2$}; |
468 by showing the steps a back-tracking regex engine took\parencite{regexploit2021}. |
292 \node[state,accepting](q_3) [right=of q_2] {$q_3$}; |
469 %TODO:also the regex101 debugger |
293 \path[->] |
470 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives} |
294 (q_0) edge node {a} (q_1) |
|
295 edge [loop below] node {a,b} () |
|
296 (q_1) edge node {a,b} (q_2) |
|
297 edge [loop above] node {0} () |
|
298 (q_2) edge node {a,b} (q_3); |
|
299 \end{tikzpicture} |
|
300 \end{center} |
|
301 The red states are "counter states" which counts down |
|
302 the number of characters needed in addition to the current |
|
303 string to make a successful match. |
|
304 For example, state $q_1$ indicates a match that has |
|
305 gone past the $(a|b)$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
|
306 and just consumed the "delimiter" $a$ in the middle, and |
|
307 need to match 2 more iterations of $a|b$ to complete. |
|
308 State $q_2$ on the other hand, can be viewed as a state |
|
309 after $q_1$ has consumed 1 character, and just waits |
|
310 for 1 more character to complete. |
|
311 Depending on the actual characters appearing in the |
|
312 input string, the states $q_1$ and $q_2$ may or may |
|
313 not be active, independent from each other. |
|
314 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
|
315 contain at least 4 non-equivalent states that cannot be merged, |
|
316 because subset states indicating which of $q_1$ and $q_2$ |
|
317 are active are at least four: $\phi$, $\{q_1\}$, |
|
318 $\{q_2\}$, $\{q_1, q_2\}$. |
|
319 Generalizing this to regular expressions with larger |
|
320 bounded repetitions number, we have $r^*ar^{\{n\}}$ |
|
321 in general would require at least $2^n$ states |
|
322 in a $\mathit{DFA}$. This is to represent all different |
|
323 configurations of "countdown" states. |
|
324 For those regexes, tools such as $\mathit{JFLEX}$ |
|
325 would generate gigantic $\mathit{DFA}$'s or even |
|
326 give out memory errors. |
|
327 |
|
328 For this reason, regex libraries that support |
|
329 bounded repetitions often choose to use the $\mathit{NFA}$ |
|
330 approach. |
|
331 One can simulate the $\mathit{NFA}$ running in two ways: |
|
332 one by keeping track of all active states after consuming |
|
333 a character, and update that set of states iteratively. |
|
334 This is a breadth-first-search of the $\mathit{NFA}$. |
|
335 for a path terminating |
|
336 at an accepting state. |
|
337 Languages like $\mathit{GO}$ and $\mathit{RUST}$ use this |
|
338 type of $\mathit{NFA}$ simulation, and guarantees a linear runtime |
|
339 in terms of input string length. |
|
340 The other way to use $\mathit{NFA}$ for matching is to take |
|
341 a single state in a path each time, and backtrack if that path |
|
342 fails. This is a depth-first-search that could end up |
|
343 with exponential run time. |
|
344 The reason behind backtracking algorithms in languages like |
|
345 Java and Python is that they support back-references. |
|
346 \subsection{Back References in Regex--Non-Regular part} |
|
347 If we label sub-expressions by parenthesizing them and give |
|
348 them a number by the order their opening parenthesis appear, |
|
349 $\underset{1}{(}\ldots\underset{2}{(}\ldots\underset{3}{(}\ldots\underset{4}{(}\ldots)\ldots)\ldots)\ldots)$ |
|
350 We can use the following syntax to denote that we want a string just matched by a |
|
351 sub-expression to appear at a certain location again exactly: |
|
352 $(.*)\backslash 1$ |
|
353 would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc. |
|
354 |
|
355 Back-reference is a construct in the "regex" standard |
|
356 that programmers found quite useful, but not exactly |
|
357 regular any more. |
|
358 In fact, that allows the regex construct to express |
|
359 languages that cannot be contained in context-free |
|
360 languages |
|
361 For example, the back reference $(a^*)\backslash 1 \backslash 1$ |
|
362 expresses the language $\{a^na^na^n\mid n \in N\}$, |
|
363 which cannot be expressed by context-free grammars. |
|
364 To express such a language one would need context-sensitive |
|
365 grammars, and matching/parsing for such grammars is NP-hard in |
|
366 general. |
|
367 %TODO:cite reference for NP-hardness of CSG matching |
|
368 For such constructs, the most intuitive way of matching is |
|
369 using backtracking algorithms, and there are no known algorithms |
|
370 that does not backtrack. |
|
371 %TODO:read a bit more about back reference algorithms |
|
372 |
|
373 |
|
374 |
|
375 |
|
376 \section{Our Solution--Brzozowski Derivatives} |
|
377 |
|
378 |
|
379 |
|
380 Is it possible to have a regex lexing algorithm with proven correctness and |
471 Is it possible to have a regex lexing algorithm with proven correctness and |
381 time complexity, which allows easy extensions to |
472 time complexity, which allows easy extensions to |
382 constructs like |
473 constructs like |
383 bounded repetitions, negation, lookarounds, and even back-references? |
474 bounded repetitions, negation, lookarounds, and even back-references? |
|
475 Building on top of Sulzmann and Lu's attempt to formalize the |
|
476 notion of POSIX lexing rules \parencite{Sulzmann2014}, |
|
477 Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled |
|
478 POSIX matching as a ternary relation recursively defined in a |
|
479 natural deduction style. |
|
480 With the formally-specified rules for what a POSIX matching is, |
|
481 they designed a regex matching algorithm based on Brzozowski derivatives, and |
|
482 proved in Isabelle/HOL that the algorithm gives correct results. |
|
483 |
384 |
484 |
385 We propose Brzozowski's derivatives as a solution to this problem. |
485 |
|
486 |
|
487 |
|
488 %---------------------------------------------------------------------------------------- |
|
489 |
|
490 \section{Our Approach} |
|
491 In the last fifteen or so years, Brzozowski's derivatives of regular |
|
492 expressions have sparked quite a bit of interest in the functional |
|
493 programming and theorem prover communities. The beauty of |
|
494 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly |
|
495 expressible in any functional language, and easily definable and |
|
496 reasoned about in theorem provers---the definitions just consist of |
|
497 inductive datatypes and simple recursive functions. Derivatives of a |
|
498 regular expression, written $r \backslash c$, give a simple solution |
|
499 to the problem of matching a string $s$ with a regular |
|
500 expression $r$: if the derivative of $r$ w.r.t.\ (in |
|
501 succession) all the characters of the string matches the empty string, |
|
502 then $r$ matches $s$ (and {\em vice versa}). |
|
503 |
|
504 |
|
505 This work aims to address the above vulnerability by the combination |
|
506 of Brzozowski's derivatives and interactive theorem proving. We give an |
|
507 improved version of Sulzmann and Lu's bit-coded algorithm using |
|
508 derivatives, which come with a formal guarantee in terms of correctness and |
|
509 running time as an Isabelle/HOL proof. |
|
510 Then we improve the algorithm with an even stronger version of |
|
511 simplification, and prove a time bound linear to input and |
|
512 cubic to regular expression size using a technique by |
|
513 Antimirov. |
|
514 |
|
515 \subsection{Existing Work} |
|
516 We are aware |
|
517 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by |
|
518 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part |
|
519 of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one |
|
520 in Coq is given by Coquand and Siles \parencite{Coquand2012}. |
|
521 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. |
|
522 |
|
523 %We propose Brzozowski's derivatives as a solution to this problem. |
386 |
524 |
387 The main contribution of this thesis is a proven correct lexing algorithm |
525 The main contribution of this thesis is a proven correct lexing algorithm |
388 with formalized time bounds. |
526 with formalized time bounds. |
389 To our best knowledge, there is no lexing libraries using Brzozowski derivatives |
527 To our best knowledge, there is no lexing libraries using Brzozowski derivatives |
390 that have a provable time guarantee, |
528 that have a provable time guarantee, |
1312 allowing a ReDoS (regular expression denial-of-service ) attack. |
1450 allowing a ReDoS (regular expression denial-of-service ) attack. |
1313 |
1451 |
1314 |
1452 |
1315 %---------------------------------------------------------------------------------------- |
1453 %---------------------------------------------------------------------------------------- |
1316 |
1454 |
1317 \section{Engineering and Academic Approaches to Deal with Catastrophic Backtracking} |
|
1318 |
|
1319 The reason behind is that regex libraries in popular language engines |
|
1320 often want to support richer constructs |
|
1321 than the most basic regular expressions, and lexing rather than matching |
|
1322 is needed for sub-match extraction. |
|
1323 |
|
1324 There is also static analysis work on regular expression that |
|
1325 have potential expoential behavious. Rathnayake and Thielecke |
|
1326 \parencite{Rathnayake2014StaticAF} proposed an algorithm |
|
1327 that detects regular expressions triggering exponential |
|
1328 behavious on backtracking matchers. |
|
1329 People also developed static analysis methods for |
|
1330 generating non-linear polynomial worst-time estimates |
|
1331 for regexes, attack string that exploit the worst-time |
|
1332 scenario, and "attack automata" that generates |
|
1333 attack strings. |
|
1334 For a comprehensive analysis, please refer to Weideman's thesis |
|
1335 \parencite{Weideman2017Static}. |
|
1336 |
|
1337 \subsection{DFA Approach} |
|
1338 |
|
1339 Exponential states. |
|
1340 |
|
1341 \subsection{NFA Approach} |
|
1342 Backtracking. |
|
1343 |
|
1344 |
|
1345 |
|
1346 %---------------------------------------------------------------------------------------- |
|
1347 |
|
1348 \section{Our Approach} |
|
1349 In the last fifteen or so years, Brzozowski's derivatives of regular |
|
1350 expressions have sparked quite a bit of interest in the functional |
|
1351 programming and theorem prover communities. The beauty of |
|
1352 Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly |
|
1353 expressible in any functional language, and easily definable and |
|
1354 reasoned about in theorem provers---the definitions just consist of |
|
1355 inductive datatypes and simple recursive functions. Derivatives of a |
|
1356 regular expression, written $r \backslash c$, give a simple solution |
|
1357 to the problem of matching a string $s$ with a regular |
|
1358 expression $r$: if the derivative of $r$ w.r.t.\ (in |
|
1359 succession) all the characters of the string matches the empty string, |
|
1360 then $r$ matches $s$ (and {\em vice versa}). |
|
1361 |
|
1362 |
|
1363 This work aims to address the above vulnerability by the combination |
|
1364 of Brzozowski's derivatives and interactive theorem proving. We give an |
|
1365 improved version of Sulzmann and Lu's bit-coded algorithm using |
|
1366 derivatives, which come with a formal guarantee in terms of correctness and |
|
1367 running time as an Isabelle/HOL proof. |
|
1368 Then we improve the algorithm with an even stronger version of |
|
1369 simplification, and prove a time bound linear to input and |
|
1370 cubic to regular expression size using a technique by |
|
1371 Antimirov. |
|
1372 |
|
1373 \subsection{Existing Work} |
|
1374 We are aware |
|
1375 of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by |
|
1376 Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part |
|
1377 of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one |
|
1378 in Coq is given by Coquand and Siles \parencite{Coquand2012}. |
|
1379 Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}. |
|
1380 |
1455 |
1381 %---------------------------------------------------------------------------------------- |
1456 %---------------------------------------------------------------------------------------- |
1382 |
1457 |
1383 \section{What this Template Includes} |
1458 \section{What this Template Includes} |
1384 |
1459 |