219 are mature and fully studied. |
219 are mature and fully studied. |
220 Indeed, in a popular programming language's regex engine, |
220 Indeed, in a popular programming language's regex engine, |
221 supplying it with regular expressions and strings, |
221 supplying it with regular expressions and strings, |
222 in most cases one can |
222 in most cases one can |
223 get the matching information in a very short time. |
223 get the matching information in a very short time. |
224 Those matchers can be blindingly fast--some |
224 Those engines can be blindingly fast--some |
225 network intrusion detection systems |
225 network intrusion detection systems |
226 use regex engines that are able to process |
226 use regex engines that are able to process |
227 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
227 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
228 However, those matchers can exhibit a surprising security vulnerability |
228 However, those engines can exhibit a surprising security vulnerability |
229 under a certain class of inputs. |
229 under a certain class of inputs. |
230 %However, , this is not the case for $\mathbf{all}$ inputs. |
230 %However, , this is not the case for $\mathbf{all}$ inputs. |
231 %TODO: get source for SNORT/BRO's regex matching engine/speed |
231 %TODO: get source for SNORT/BRO's regex matching engine/speed |
232 |
232 |
233 |
233 |
234 Consider $(a^*)^*\,b$ and ask whether |
234 Consider for example $(a^*)^*\,b$ and |
235 strings of the form $aa..a$ can be matched by this regular |
235 strings of the form $aa..a$, these strings cannot be matched by this regular |
236 expression. Obviously this is not the case---the expected $b$ in the last |
236 expression: Obviously the expected $b$ in the last |
237 position is missing. One would expect that modern regular expression |
237 position is missing. One would assume that modern regular expression |
238 matching engines can find this out very quickly. Surprisingly, if one tries |
238 matching engines can find this out very quickly. Surprisingly, if one tries |
239 this example in JavaScript, Python or Java 8, even with small strings, |
239 this example in JavaScript, Python or Java 8, even with small strings, |
240 say of lenght of around 30 $a$'s, |
240 say of lenght of around 30 $a$'s, |
241 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}). |
241 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}). |
242 This is clearly exponential behaviour, and |
242 This is clearly exponential behaviour, and |
715 The merit of Brzozowski derivatives (more on this later) |
715 The merit of Brzozowski derivatives (more on this later) |
716 on this problem is that |
716 on this problem is that |
717 it can be naturally extended to support bounded repetitions. |
717 it can be naturally extended to support bounded repetitions. |
718 Moreover these extensions are still made up of only |
718 Moreover these extensions are still made up of only |
719 inductive datatypes and recursive functions, |
719 inductive datatypes and recursive functions, |
720 making it handy to deal with using theorem provers. |
720 making it handy to deal with them in theorem provers. |
721 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be |
721 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be |
722 %straightforwardly extended to deal with bounded regular expressions |
722 %straightforwardly extended to deal with bounded regular expressions |
723 %and moreover the resulting code still consists of only simple |
723 %and moreover the resulting code still consists of only simple |
724 %recursive functions and inductive datatypes. |
724 %recursive functions and inductive datatypes. |
725 Finally, bounded regular expressions do not destroy our finite |
725 Finally, bounded regular expressions do not destroy our finite |
830 is |
830 is |
831 \begin{center} |
831 \begin{center} |
832 $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ |
832 $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ |
833 \end{center} |
833 \end{center} |
834 Despite being useful, the expressive power of regexes |
834 Despite being useful, the expressive power of regexes |
835 go beyond the regular language hierarchy |
835 go beyond regular languages |
836 once back-references are included. |
836 once back-references are included. |
837 In fact, they allow the regex construct to express |
837 In fact, they allow the regex construct to express |
838 languages that cannot be contained in context-free |
838 languages that cannot be contained in context-free |
839 languages either. |
839 languages either. |
840 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
840 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
841 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
841 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
842 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. |
842 which cannot be expressed by context-free grammars \parencite{campeanu2003formal}. |
843 Such a language is contained in the context-sensitive hierarchy |
843 Such a language is contained in the context-sensitive hierarchy |
844 of formal languages. |
844 of formal languages. |
845 Also solving the matching problem involving back-references |
845 Also solving the matching problem involving back-references |
846 is known to be NP-complete \parencite{alfred2014algorithms}. |
846 is known to be NP-complete \parencite{alfred2014algorithms}. |
847 Regex libraries supporting back-references such as |
847 Regex libraries supporting back-references such as |
857 (i) The ones with linear |
857 (i) The ones with linear |
858 time guarantees like Go and Rust. The downside with them is that |
858 time guarantees like Go and Rust. The downside with them is that |
859 they impose restrictions |
859 they impose restrictions |
860 on the regular expressions (not allowing back-references, |
860 on the regular expressions (not allowing back-references, |
861 bounded repetitions cannot exceed an ad hoc limit etc.). |
861 bounded repetitions cannot exceed an ad hoc limit etc.). |
862 (ii) Those |
862 And (ii) those |
863 that allow large bounded regular expressions and back-references |
863 that allow large bounded regular expressions and back-references |
864 at the expense of using backtracking algorithms. |
864 at the expense of using backtracking algorithms. |
865 They can potentially ``grind to a halt'' |
865 They can potentially ``grind to a halt'' |
866 on some very simple cases, resulting |
866 on some very simple cases, resulting |
867 ReDoS attacks. |
867 ReDoS attacks if exposed to the internet. |
868 |
868 |
869 The proble with both approaches is the motivation for us |
869 The problems with both approaches are the motivation for us |
870 to look again at the regular expression matching problem. |
870 to look again at the regular expression matching problem. |
871 Another motivation is that regular expression matching algorithms |
871 Another motivation is that regular expression matching algorithms |
872 that follow the POSIX standard often contain errors and bugs |
872 that follow the POSIX standard often contain errors and bugs |
873 as we shall explain next. |
873 as we shall explain next. |
874 |
874 |