\newcommand\RSEQ[2]{#1 \cdot #2}
\newcommand\RALTS[1]{\oplus #1}
\newcommand\RSTAR[1]{#1^*}
\newcommand\vsuf[2]{\textit{vsuf} \;#1\;#2}

\pgfplotsset{
  myplotstyle/.style={
    legend style={draw=none, font=\small},
    legend cell align=left,
    legend pos=north east,
    ylabel style={align=center, font=\bfseries\boldmath},
    xlabel style={align=center, font=\bfseries\boldmath},
    x tick label style={font=\bfseries\boldmath},
    y tick label style={font=\bfseries\boldmath},
    scaled ticks=true,
    every axis plot/.append style={thick},
  },
}

%----------------------------------------------------------------------------------------
%This part is about regular expressions, Brzozowski derivatives,
%and a bit-coded lexing algorithm with proven correctness and time bounds.

%TODO: look up snort rules to use here--give readers idea of what regexes look like

Regular expressions are widely used in computer science:
be it in text editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text processing; network intrusion
detection systems that reject suspicious traffic; or compiler
front ends---the majority of the solutions to these tasks
involve lexing with regular
expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
Indeed, supplying a popular programming language's regex engine
with regular expressions and strings, one can
get rich matching information in a very short time.
Some network intrusion detection systems
use regex engines that are able to process
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
Unfortunately, this is not the case for \emph{all} inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed


Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s, one discovers that
this decision takes an unreasonably long time given the simplicity of the problem.
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns, as the graphs
in Figure~\ref{fig:aStarStarb} show.

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
% (plot body elided)
\end{tikzpicture}
\end{tabular}
\caption{Runtime of various regex engines when matching $(a^*)^*\,b$ against strings of $a$'s.}
\label{fig:aStarStarb}
\end{figure}
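This blowup is easy to reproduce. The following Python sketch (the choice of string lengths is ours, purely for illustration) times Python's built-in backtracking engine on exactly this example:

```python
import re
import time

# Matching (a*)*b against a string of n a's (and no b) must fail,
# but a backtracking engine tries exponentially many ways of splitting
# the a's between the inner and outer star before giving up.
for n in [5, 10, 15, 18]:
    s = 'a' * n
    t = time.time()
    assert re.fullmatch(r'(a*)*b', s) is None   # no b, so no match
    print(f'n={n}: {time.time() - t:.4f}s')
```

Already at these modest lengths the measured time roughly doubles with every extra $a$; pushing $n$ towards 30 makes the call take minutes.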
This superlinear blowup in regular expression engines
has repeatedly caused grief in real life.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. This happened when a post with 20,000 white spaces was submitted,
but importantly the white spaces were neither at the beginning nor at
the end. As a result, the regular expression matching engine needed to
backtrack over many choices. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.
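The quadratic behaviour of such a trimming regex can be reproduced with a few lines of Python; the pattern and input below are our own simplification of the incident, not the actual Stack Exchange regex:

```python
import re
import time

# A long run of white space in the *middle* of the string: \s+$ fails
# at each of the n candidate starting positions only after scanning all
# the way to the final non-space character, giving O(n^2) work overall.
for n in [1000, 2000, 4000]:
    s = ' ' * n + 'x'
    t = time.time()
    assert re.search(r'\s+$', s) is None   # no trailing white space
    print(f'n={n}: {time.time() - t:.4f}s')
```

Doubling the number of white spaces roughly quadruples the running time, which matches the $O(n^2)$ analysis above.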

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
%TODO: data points for some new versions of languages
These problems with regular expressions
are not isolated events that happen
very occasionally, but are actually widespread.
They occur so often that they have been given a
name---Regular expression Denial of Service (ReDoS)
attack.
\textcite{Davis18} detected more
than 1000 super-linear (SL) regular expressions
in Node.js, Python core libraries, and npm and pypi.
They therefore concluded that evil regular expressions
are problems ``more than a parlour trick'', but one that
requires
more research attention.
But the problems are not limited to slowness on certain
cases.
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be
the regex
\begin{verbatim}
(aba|ab|a)*
\end{verbatim}
and the string
\begin{verbatim}
ababa
\end{verbatim}
The correct $\POSIX$ match for the above would be
the entire string $ababa$,
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101 \parencite{regex101}
with different language engines yields
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
|
315 Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not |
|
316 correctly implementing the POSIX (maximum-munch) |
|
317 rule of regular expression matching. |
|
318 |
|
319 As Grathwohl\parencite{grathwohl2014crash} commented, |
|
320 \begin{center} |
|
321 ``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.'' |
|
322 \end{center} |
|
323 |
|
324 |
|
325 To summarise the above, regular expressions are important. |
|
326 They are popular and programming languages' library functions |
|
327 for them are very fast on non-catastrophic cases. |
|
328 But there are problems with current practical implementations. |
|
329 First thing is that the running time might blow up. |
|
330 The second problem is that they might be error-prone on certain |
|
331 cases that are very easily spotted by a human. |
|
332 In the next part of the chapter, we will look into reasons why |
|
333 certain regex engines are running horribly slow on the "catastrophic" |
|
334 cases and propose a solution that addresses both of these problems |
|
335 based on Brzozowski and Sulzmann and Lu's work. |
|
336 |
|
337 |
|
338 \section{Why are current regex engines slow?} |

%find literature/find out for yourself that REGEX->DFA on basic regexes
%does not blow up the size
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in
each phase:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.

\subsection{Different Phases of a Matching/Lexing Algorithm}

Most lexing algorithms can be roughly divided into
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost for such a phase by $P_1(r)$.
The second phase is the lexing phase, when the input string
$s$ is read and the data structure
representing that regex $r$ is being operated on.
We represent the time
it takes by $P_2(r, s)$.\\

For a $\mathit{DFA}$,
we have $P_2(r, s) = O( |s| )$,
because we take at most $|s|$ steps,
and each step takes
at most one transition---
a deterministic finite automaton
by definition has at most one state active and at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.


For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $|r| \cdot 2^{|s|}$.
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
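The $|r|\cdot|s|$ bound without backtracking comes from advancing all active $\NFA$ states at once on each character. A tiny Python sketch of this set-based simulation, on a hand-built $\NFA$ for $(a^*)^*\,b$ (the automaton encoding and names here are ours, for illustration only):

```python
# Set-based NFA simulation: keep the set of active states and advance
# all of them on each input character. The hand-built NFA recognises
# (a*)*b (equivalently a*b): state 0 loops on 'a' and moves to the
# accepting state 1 on 'b'.
def nfa_match(s: str) -> bool:
    delta = {(0, 'a'): {0}, (0, 'b'): {1}}   # transition relation
    active = {0}
    for c in s:
        active = set().union(*(delta.get((q, c), set()) for q in active))
        if not active:
            return False                     # no state alive: early fail
    return 1 in active                       # did we reach the accepting state?

assert nfa_match('aaab')
assert not nfa_match('a' * 50)               # linear time, unlike backtracking
```

Each character costs at most one transition step per active state, so the whole run is bounded by the number of states times the string length; no exponential backtracking can occur.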
regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
would require at least $2^{n+1}$ states, if $r$ contains
more than 1 string.
This is to represent all different
scenarios in which ``countdown'' states are active.
For those regexes, tools that use $\DFA$s will get
out-of-memory errors.

\subsubsection{Tools that use $\mathit{DFA}$s}
%TODO:more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{Java}$ that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to such lexer generators, and then
gets an output program encoding a minimised $\mathit{DFA}$
that can be compiled and run.
When given the above countdown regular expression,
even a small number $n$ would result in a determinised automaton
with millions of states.

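The $2^{n+1}$ lower bound can be checked computationally. The sketch below (our own construction, not taken from any lexer generator) instantiates $r$ as $a|b$, builds the obvious $\NFA$ for $(a|b)^*\,a\,(a|b)^{\{n\}}$, and counts the states produced by the subset construction:

```python
def subset_construction_size(n: int) -> int:
    """DFA state count for (a|b)* a (a|b){n} via subset construction.

    NFA states: 0 (initial, loops on a and b), then a countdown chain
    1..n+1 entered on 'a', with n+1 the accepting state.
    """
    def step(states, c):
        nxt = {0}                       # state 0 loops on every character
        if c == 'a':
            nxt.add(1)                  # the 'a' in the middle of the regex
        for q in states:
            if 1 <= q <= n:
                nxt.add(q + 1)          # countdown advances on any character
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:                         # explore all reachable subsets
        s = todo.pop()
        for c in 'ab':
            t = step(s, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

# The determinised automaton needs 2^(n+1) states: membership of the
# countdown states is decided by the 'a'-positions among the last n+1
# characters, and all combinations are reachable.
for n in range(1, 8):
    assert subset_construction_size(n) == 2 ** (n + 1)
```

Already for $n = 20$ this would mean over two million states, which is why such ``countdown'' regexes blow up $\DFA$-based tools.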
For this reason, regex libraries that support
bounded repetitions often choose to use the $\mathit{NFA}$
approach.


\subsection{Why $\mathit{NFA}$s can be slow in the second phase}
When one constructs an $\NFA$ out of a regular expression
there is often very little to be done in the first phase; one simply
constructs the $\NFA$ states based on the structure of the input regular expression.

In the lexing phase, one can simulate the $\mathit{NFA}$ running in two ways:
one is by keeping track of all active states after consuming
a character, and updating that set of states iteratively.
This can be viewed as a breadth-first search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation, which guarantees a linear runtime
in terms of input string length.
%TODO:try out these lexers
The other way to use an $\mathit{NFA}$ for matching is to choose
a single transition each time, keeping all the other options in
a queue or stack, and backtracking if that choice eventually
% Java and Python both support back-references, but show
%catastrophic backtracking behaviours on inputs without back-references
%(when the language is still regular).
%TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
%TODO: verify the fact Rust does not allow 1000+ reps
%TODO: Java 17 updated graphs? Is it ok to still use Java 8 graphs?


So we have practical implementations
of regular expression matching/lexing which are fast,
but do not come with any guarantees that they will not grind to a halt
or give wrong answers.
Our goal is to have a regex lexing algorithm that comes with
\begin{itemize}
\item
proven correctness,
\item
proven non-catastrophic properties, and
\item
easy extensions to
constructs like
bounded repetitions, negation, lookarounds, and even back-references.
\end{itemize}

\section{Our Solution---Formal Specification of POSIX and Brzozowski Derivatives}
We propose Brzozowski derivatives on regular expressions as
a solution to this.
In the last fifteen or so years, Brzozowski's derivatives of regular
expressions have sparked quite a bit of interest in the functional
programming and theorem prover communities.

\subsection{Motivation}

Derivatives give a simple solution
to the problem of matching a string $s$ with a regular
expression $r$: if the derivative of $r$ w.r.t.\ (in
succession) all the characters of the string matches the empty string,
then $r$ matches $s$ (and {\em vice versa}).

The beauty of
Brzozowski's derivatives \parencite{Brzozowski1964} is that they are neatly
expressible in any functional language, and easily definable and
reasoned about in theorem provers---the definitions just consist of
inductive datatypes and simple recursive functions.
An algorithm based on them by
Sulzmann and Lu \parencite{Sulzmann2014} allows easy extension
to include extended regular expressions and
simplification of internal data structures,
eliminating the exponential behaviours.

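The matcher just described is short enough to write down in full. Below is a minimal Python transliteration of the standard $\textit{nullable}$ and derivative definitions; it gives yes/no answers only, without the simplification or the POSIX value construction discussed in this thesis:

```python
from dataclasses import dataclass

# Regular expressions as an inductive datatype.
class Rexp: pass

@dataclass(frozen=True)
class ZERO(Rexp): pass          # matches no string

@dataclass(frozen=True)
class ONE(Rexp): pass           # matches only the empty string

@dataclass(frozen=True)
class CHAR(Rexp):
    c: str

@dataclass(frozen=True)
class ALT(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class SEQ(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class STAR(Rexp):
    r: Rexp

def nullable(r: Rexp) -> bool:
    """Can r match the empty string?"""
    if isinstance(r, (ZERO, CHAR)): return False
    if isinstance(r, (ONE, STAR)): return True
    if isinstance(r, ALT): return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, SEQ): return nullable(r.r1) and nullable(r.r2)
    raise ValueError(r)

def der(c: str, r: Rexp) -> Rexp:
    """The derivative of r with respect to the character c."""
    if isinstance(r, (ZERO, ONE)): return ZERO()
    if isinstance(r, CHAR): return ONE() if r.c == c else ZERO()
    if isinstance(r, ALT): return ALT(der(c, r.r1), der(c, r.r2))
    if isinstance(r, SEQ):
        if nullable(r.r1):
            return ALT(SEQ(der(c, r.r1), r.r2), der(c, r.r2))
        return SEQ(der(c, r.r1), r.r2)
    if isinstance(r, STAR): return SEQ(der(c, r.r), STAR(r.r))
    raise ValueError(r)

def matches(r: Rexp, s: str) -> bool:
    """r matches s iff the iterated derivative is nullable."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

For instance, with $r = (a^*)^*\,b$ encoded as `SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))`, `matches` correctly accepts `'aaab'` and rejects `'aaa'`. Note that without simplification the derivatives grow quickly, which is exactly the problem the bit-coded algorithm of this thesis addresses.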
\subsubsection{Problems with Current Brzozowski Matchers}
However, two difficulties with derivative-based matchers exist:
First, Brzozowski's original matcher only generates a yes/no answer
for whether a regular expression matches a string or not. This is too
little information in the context of lexing where separate tokens must
be identified and also classified (for example as keywords
or identifiers). Sulzmann and Lu~\cite{Sulzmann2014} overcome this
difficulty by cleverly extending Brzozowski's matching
algorithm. Their extended version generates additional information on
\emph{how} a regular expression matches a string following the POSIX
rules for regular expression matching. They achieve this by adding a
second ``phase'' to Brzozowski's algorithm involving an injection
function. In our own earlier work, we provided the formal
specification of what POSIX matching means and proved in Isabelle/HOL
the correctness
of Sulzmann and Lu's extended algorithm accordingly
\cite{AusafDyckhoffUrban2016}.

of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
and concluded that the algorithm is quadratic in terms of input length.
When we tried out their extracted OCaml code with our example $(a+aa)^*$,
the time it took to lex only 40 $a$'s was 5 minutes.

We believe our result of a performance proof on general
inputs, rather than on specific examples, is a novel contribution.\\


\subsection{Related Work}
We are aware
of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
of the work by Krauss and Nipkow \parencite{Krauss2011}. And another one
in Coq is given by Coquand and Siles \parencite{Coquand2012}.
Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.

%We propose Brzozowski's derivatives as a solution to this problem.
% about Lexing Using Brzozowski derivatives
When a regular expression does not behave as intended,
people usually try to rewrite the regex to some equivalent form,
or they try to avoid the possibly problematic patterns completely,
for which many false positives exist \parencite{Davis18}.
Animated tools to ``debug'' regular expressions such as
regexploit \parencite{regexploit2021} and regex101 \parencite{regex101} are also popular.
We are also aware of static analysis work on regular expressions that
aims to detect potentially exponential regex patterns. Rathnayake and Thielecke
\parencite{Rathnayake2014StaticAF} proposed an algorithm
that detects regular expressions triggering exponential
behaviour on backtracking matchers.
Weideman \parencite{Weideman2017Static} came up with
non-linear polynomial worst-time estimates
for regexes, attack strings that exploit the worst-time
scenario, and ``attack automata'' that generate
attack strings.


\section{Structure of the thesis}
In Chapter~\ref{Inj} we will introduce the concepts
and notations we
use for describing the lexing algorithm by Sulzmann and Lu,
and then give the lexing algorithm.
We will give its variant in Chapter~\ref{Bitcoded1}.
Then we illustrate in Chapter~\ref{Bitcoded2}
how the algorithm without bitcodes falls short for aggressive
simplifications and therefore introduce our version of the
bit-coded algorithm and
its correctness proof.
In Chapter~\ref{Finite} we give the second guarantee
of our bitcoded algorithm, that is a finite bound on the size of any
regex's derivatives.
In Chapter~\ref{Cubic} we discuss stronger simplifications to improve the finite bound
in Chapter~\ref{Finite} to a polynomial one, and demonstrate how one can extend the
algorithm to include constructs such as bounded repetitions and negations.