190 %This part is about regular expressions, Brzozowski derivatives, |
190 %This part is about regular expressions, Brzozowski derivatives, |
191 %and a bit-coded lexing algorithm with proven correctness and time bounds. |
191 %and a bit-coded lexing algorithm with proven correctness and time bounds. |
192 |
192 |
193 %TODO: look up snort rules to use here--give readers idea of what regexes look like |
193 %TODO: look up snort rules to use here--give readers idea of what regexes look like |
194 |
194 |
195 \begin{figure} |
195 |
196 \centering |
196 |
|
197 |
|
198 |
|
199 |
|
200 Regular expressions are widely used in computer science: |
|
201 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; |
|
202 command-line tools like $\mathit{grep}$ that facilitate easy |
|
203 text-processing; network intrusion |
|
204 detection systems that reject suspicious traffic; or compiler |
|
205 front ends--the majority of the solutions to these tasks |
|
206 involve lexing with regular |
|
207 expressions. |
|
208 Given its usefulness and ubiquity, one would imagine that |
|
209 modern regular expression matching implementations |
|
210 are mature and fully studied. |
|
211 Indeed, in a popular programming language' regex engine, |
|
212 supplying it with regular expressions and strings, one can |
|
213 get rich matching information in a very short time. |
|
214 Some network intrusion detection systems |
|
215 use regex engines that are able to process |
|
216 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
|
217 Unfortunately, this is not the case for $\mathbf{all}$ inputs. |
|
218 %TODO: get source for SNORT/BRO's regex matching engine/speed |
|
219 |
|
220 \begin{figure}[p] |
197 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
221 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
198 \begin{tikzpicture} |
222 \begin{tikzpicture} |
199 \begin{axis}[ |
223 \begin{axis}[ |
200 xlabel={$n$}, |
224 xlabel={$n$}, |
201 x label style={at={(1.05,-0.05)}}, |
225 x label style={at={(1.05,-0.05)}}, |
255 legend pos=north west, |
279 legend pos=north west, |
256 legend cell align=left] |
280 legend cell align=left] |
257 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
281 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
258 \end{axis} |
282 \end{axis} |
259 \end{tikzpicture}\\ |
283 \end{tikzpicture}\\ |
260 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings |
284 \begin{tikzpicture} |
261 of the form $\underbrace{aa..a}_{n}$.} |
285 \begin{axis}[ |
|
286 xlabel={$n$}, |
|
287 x label style={at={(1.05,-0.05)}}, |
|
288 ylabel={time in secs}, |
|
289 enlargelimits=false, |
|
290 xtick={0,5,...,30}, |
|
291 xmax=33, |
|
292 ymax=35, |
|
293 ytick={0,5,...,30}, |
|
294 scaled ticks=false, |
|
295 axis lines=left, |
|
296 width=5cm, |
|
297 height=4cm, |
|
298 legend entries={Dart}, |
|
299 legend pos=north west, |
|
300 legend cell align=left] |
|
301 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; |
|
302 \end{axis} |
|
303 \end{tikzpicture} |
|
304 & |
|
305 \begin{tikzpicture} |
|
306 \begin{axis}[ |
|
307 xlabel={$n$}, |
|
308 x label style={at={(1.05,-0.05)}}, |
|
309 %ylabel={time in secs}, |
|
310 enlargelimits=false, |
|
311 xtick={0,5,...,30}, |
|
312 xmax=33, |
|
313 ymax=35, |
|
314 ytick={0,5,...,30}, |
|
315 scaled ticks=false, |
|
316 axis lines=left, |
|
317 width=5cm, |
|
318 height=4cm, |
|
319 legend entries={Swift}, |
|
320 legend pos=north west, |
|
321 legend cell align=left] |
|
322 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; |
|
323 \end{axis} |
|
324 \end{tikzpicture} |
|
325 & \\ |
|
326 \multicolumn{3}{c}{Graphs} |
262 \end{tabular} |
327 \end{tabular} |
263 \caption{aStarStarb} \label{fig:aStarStarb} |
328 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
264 \end{figure} |
329 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
265 |
330 The reason for their superlinear behaviour is that they do a depth-first-search. |
266 |
331 If the string does not match, the engine starts to explore all possibilities. |
267 |
332 }\label{fig:aStarStarb} |
268 |
333 \end{figure}\afterpage{\clearpage} |
269 |
|
270 Regular expressions are widely used in computer science: |
|
271 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; |
|
272 command-line tools like $\mathit{grep}$ that facilitate easy |
|
273 text-processing; network intrusion |
|
274 detection systems that reject suspicious traffic; or compiler |
|
275 front ends--the majority of the solutions to these tasks |
|
276 involve lexing with regular |
|
277 expressions. |
|
278 Given its usefulness and ubiquity, one would imagine that |
|
279 modern regular expression matching implementations |
|
280 are mature and fully studied. |
|
281 Indeed, in a popular programming language' regex engine, |
|
282 supplying it with regular expressions and strings, one can |
|
283 get rich matching information in a very short time. |
|
284 Some network intrusion detection systems |
|
285 use regex engines that are able to process |
|
286 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
|
287 Unfortunately, this is not the case for $\mathbf{all}$ inputs. |
|
288 %TODO: get source for SNORT/BRO's regex matching engine/speed |
|
289 |
|
290 |
334 |
291 Take $(a^*)^*\,b$ and ask whether |
335 Take $(a^*)^*\,b$ and ask whether |
292 strings of the form $aa..a$ match this regular |
336 strings of the form $aa..a$ match this regular |
293 expression. Obviously this is not the case---the expected $b$ in the last |
337 expression. Obviously this is not the case---the expected $b$ in the last |
294 position is missing. One would expect that modern regular expression |
338 position is missing. One would expect that modern regular expression |
297 length, say around 30 $a$'s, one discovers that |
341 length, say around 30 $a$'s, one discovers that |
298 this decision takes crazy time to finish given the simplicity of the problem. |
342 this decision takes crazy time to finish given the simplicity of the problem. |
299 This is clearly exponential behaviour, and |
343 This is clearly exponential behaviour, and |
300 is triggered by some relatively simple regex patterns, as the graphs |
344 is triggered by some relatively simple regex patterns, as the graphs |
301 in \ref{fig:aStarStarb} show. |
345 in \ref{fig:aStarStarb} show. |
302 |
346 Java 9 and newer |
303 |
347 versions improves this behaviour, but is still slow compared |
304 |
348 with the approach we are going to use. |
305 |
349 |
306 \ChristianComment{Superlinear I just leave out the explanation |
350 |
307 which I find once used would distract the flow. Plus if i just say exponential |
351 |
308 here the 2016 event in StackExchange was not exponential, but just quardratic so would be |
|
309 in accurate} |
|
310 |
352 |
311 This superlinear blowup in regular expression engines |
353 This superlinear blowup in regular expression engines |
312 had repeatedly caused grief in real life. |
354 had repeatedly caused grief in real life. |
313 For example, on 20 July 2016 one evil |
355 For example, on 20 July 2016 one evil |
314 regular expression brought the webpage |
356 regular expression brought the webpage |
328 2019. A poorly written regular expression exhibited exponential |
370 2019. A poorly written regular expression exhibited exponential |
329 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage |
371 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage |
330 had several causes, at the heart was a regular expression that |
372 had several causes, at the heart was a regular expression that |
331 was used to monitor network |
373 was used to monitor network |
332 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
374 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
333 %TODO: data points for some new versions of languages |
|
334 These problems with regular expressions |
375 These problems with regular expressions |
335 are not isolated events that happen |
376 are not isolated events that happen |
336 very occasionally, but actually widespread. |
377 very occasionally, but actually widespread. |
337 They occur so often that they get a |
378 They occur so often that they get a |
338 name--Regular-Expression-Denial-Of-Service (ReDoS) |
379 name--Regular-Expression-Denial-Of-Service (ReDoS) |
343 They therefore concluded that evil regular expressions |
384 They therefore concluded that evil regular expressions |
344 are problems "more than a parlour trick", but one that |
385 are problems "more than a parlour trick", but one that |
345 requires |
386 requires |
346 more research attention. |
387 more research attention. |
347 |
388 |
348 |
389 Regular expressions and regular expression matchers |
349 But the problems are not limited to slowness on certain |
390 have of course been studied for many, many years. |
|
391 One of the most recent work in the context of lexing |
|
392 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}. |
|
393 This is relevant work and we will compare later on |
|
394 our derivative-based matcher we are going to present. |
|
395 There is also some newer work called |
|
396 Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead. |
|
397 For that the problem is the bounded repetitions ($r^{n}$, |
|
398 $r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$). |
|
399 They often occur in practical use, in for example the Snort XML definitions: |
|
400 |
|
401 |
|
402 The problems are not limited to slowness on certain |
350 cases. |
403 cases. |
351 Another thing about these libraries is that there |
404 Another thing about these libraries is that there |
352 is no correctness guarantee. |
405 is no correctness guarantee. |
353 In some cases, they either fail to generate a lexing result when there exists a match, |
406 In some cases, they either fail to generate a lexing result when there exists a match, |
354 or give results that are inconsistent with the $\POSIX$ standard. |
407 or give results that are inconsistent with the $\POSIX$ standard. |
355 A concrete example would be |
408 A concrete example would be the regex |
356 the regex |
409 \begin{center} |
357 \begin{verbatim} |
410 $(aba + ab + a)* \text{and the string} ababa$ |
358 (aba|ab|a)* |
411 \end{center} |
359 \end{verbatim} |
|
360 and the string |
|
361 \begin{verbatim} |
|
362 ababa |
|
363 \end{verbatim} |
|
364 The correct $\POSIX$ match for the above would be |
412 The correct $\POSIX$ match for the above would be |
365 with the entire string $ababa$, |
413 with the entire string $ababa$, |
366 split into two Kleene star iterations, $[ab] [aba]$ at positions |
414 split into two Kleene star iterations, $[ab] [aba]$ at positions |
367 $[0, 2), [2, 5)$ |
415 $[0, 2), [2, 5)$ |
368 respectively. |
416 respectively. |
373 |
421 |
374 Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not |
422 Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not |
375 correctly implementing the POSIX (maximum-munch) |
423 correctly implementing the POSIX (maximum-munch) |
376 rule of regular expression matching. |
424 rule of regular expression matching. |
377 |
425 |
378 As Grathwohl\parencite{grathwohl2014crash} commented, |
426 As Grathwohl\parencite{grathwohl2014crash} wrote, |
379 \begin{center} |
427 \begin{quote} |
380 ``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.'' |
428 The POSIX strategy is more complicated than the |
381 \end{center} |
429 greedy because of the dependence on information about |
382 |
430 the length of matched strings in the various subexpressions. |
383 |
431 \end{quote} |
|
432 %\noindent |
384 To summarise the above, regular expressions are important. |
433 To summarise the above, regular expressions are important. |
385 They are popular and programming languages' library functions |
434 They are popular and programming languages' library functions |
386 for them are very fast on non-catastrophic cases. |
435 for them are very fast on non-catastrophic cases. |
387 But there are problems with current practical implementations. |
436 But there are problems with current practical implementations. |
388 First thing is that the running time might blow up. |
437 First thing is that the running time might blow up. |
422 The second phase is the lexing phase, when the input string |
471 The second phase is the lexing phase, when the input string |
423 $s$ is read and the data structure |
472 $s$ is read and the data structure |
424 representing that regex $r$ is being operated on. |
473 representing that regex $r$ is being operated on. |
425 We represent the time |
474 We represent the time |
426 it takes by $P_2(r, s)$.\\ |
475 it takes by $P_2(r, s)$.\\ |
427 |
|
428 For $\mathit{DFA}$, |
476 For $\mathit{DFA}$, |
429 we have $P_2(r, s) = O( |s| )$, |
477 we have $P_2(r, s) = O( |s| )$, |
430 because we take at most $|s|$ steps, |
478 because we take at most $|s|$ steps, |
431 and each step takes |
479 and each step takes |
432 at most one transition-- |
480 at most one transition-- |
433 a deterministic-finite-automata |
481 a deterministic-finite-automata |
434 by definition has at most one state active and at most one |
482 by definition has at most one state active and at most one |
435 transition upon receiving an input symbol. |
483 transition upon receiving an input symbol. |
436 But unfortunately in the worst case |
484 But unfortunately in the worst case |
437 $P_1(r) = O(exp^{|r|})$. An example will be given later. |
485 $P_1(r) = O(exp^{|r|})$. An example will be given later. |
438 |
|
439 |
|
440 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold |
486 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold |
441 expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$. |
487 expressions like $r^n$ into |
|
488 \[ |
|
489 \underbrace{r \cdots r}_{\text{n copies of r}}. |
|
490 \] |
442 The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack. |
491 The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack. |
443 On the other hand, if backtracking is used, the worst-case time bound bloats |
492 On the other hand, if backtracking is used, the worst-case time bound bloats |
444 to $|r| * 2^|s|$. |
493 to $|r| * 2^{|s|}$. |
445 %on the input |
494 %on the input |
446 %And when calculating the time complexity of the matching algorithm, |
495 %And when calculating the time complexity of the matching algorithm, |
447 %we are assuming that each input reading step requires constant time. |
496 %we are assuming that each input reading step requires constant time. |
448 %which translates to that the number of |
497 %which translates to that the number of |
449 %states active and transitions taken each time is bounded by a |
498 %states active and transitions taken each time is bounded by a |
636 % Java and Python both support back-references, but shows |
685 % Java and Python both support back-references, but shows |
637 %catastrophic backtracking behaviours on inputs without back-references( |
686 %catastrophic backtracking behaviours on inputs without back-references( |
638 %when the language is still regular). |
687 %when the language is still regular). |
639 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac |
688 %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac |
640 %TODO: verify the fact Rust does not allow 1000+ reps |
689 %TODO: verify the fact Rust does not allow 1000+ reps |
641 \ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?} |
|
642 |
690 |
643 |
691 |
644 So we have practical implementations |
692 So we have practical implementations |
645 on regular expression matching/lexing which are fast |
693 on regular expression matching/lexing which are fast |
646 but do not come with any guarantees that it will not grind to a halt |
694 but do not come with any guarantees that it will not grind to a halt |