ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 633 f1edd5ee1a12
parent 631 bdb3ffdd39f8
child 634 b079aaee5e10
equal deleted inserted replaced
632:99eceee5c4ac 633:f1edd5ee1a12
   219 are mature and fully studied.
   219 are mature and fully studied.
   220 Indeed, in a popular programming language's regex engine, 
   220 Indeed, in a popular programming language's regex engine, 
   221 supplying it with regular expressions and strings,
   221 supplying it with regular expressions and strings,
   222 in most cases one can
   222 in most cases one can
   223 get the matching information in a very short time.
   223 get the matching information in a very short time.
   224 Those matchers can be blindingly fast--some 
   224 Those engines can be blindingly fast--some 
   225 network intrusion detection systems
   225 network intrusion detection systems
   226 use regex engines that are able to process 
   226 use regex engines that are able to process 
   227 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   227 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   228 However, those matchers can exhibit a surprising security vulnerability
   228 However, those engines can exhibit a surprising security vulnerability
   229 under a certain class of inputs.
   229 under a certain class of inputs.
   230 %However, , this is not the case for $\mathbf{all}$ inputs.
   230 %However, , this is not the case for $\mathbf{all}$ inputs.
   231 %TODO: get source for SNORT/BRO's regex matching engine/speed
   231 %TODO: get source for SNORT/BRO's regex matching engine/speed
   232 
   232 
   233 
   233 
   234 Consider $(a^*)^*\,b$ and ask whether
   234 Consider for example $(a^*)^*\,b$ and 
   235 strings of the form $aa..a$ can be matched by this regular
   235 strings of the form $aa..a$, these strings cannot be matched by this regular
   236 expression. Obviously this is not the case---the expected $b$ in the last
   236 expression: Obviously the expected $b$ in the last
   237 position is missing. One would expect that modern regular expression
   237 position is missing. One would assume that modern regular expression
   238 matching engines can find this out very quickly. Surprisingly, if one tries
   238 matching engines can find this out very quickly. Surprisingly, if one tries
   239 this example in JavaScript, Python or Java 8, even with small strings, 
   239 this example in JavaScript, Python or Java 8, even with small strings, 
   240 say of lenght of around 30 $a$'s,
   240 say of lenght of around 30 $a$'s,
   241 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}).
   241 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}).
   242 This is clearly exponential behaviour, and 
   242 This is clearly exponential behaviour, and 
   715 The merit of Brzozowski derivatives (more on this later)
   715 The merit of Brzozowski derivatives (more on this later)
   716 on this problem is that
   716 on this problem is that
   717 it can be naturally extended to support bounded repetitions.
   717 it can be naturally extended to support bounded repetitions.
   718 Moreover these extensions are still made up of only
   718 Moreover these extensions are still made up of only
   719 inductive datatypes and recursive functions,
   719 inductive datatypes and recursive functions,
   720 making it handy to deal with using theorem provers.
   720 making it handy to deal with them in theorem provers.
   721 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
   721 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
   722 %straightforwardly extended to deal with bounded regular expressions
   722 %straightforwardly extended to deal with bounded regular expressions
   723 %and moreover the resulting code still consists of only simple
   723 %and moreover the resulting code still consists of only simple
   724 %recursive functions and inductive datatypes.
   724 %recursive functions and inductive datatypes.
   725 Finally, bounded regular expressions do not destroy our finite
   725 Finally, bounded regular expressions do not destroy our finite
   830 is
   830 is
   831 \begin{center}
   831 \begin{center}
   832 	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
   832 	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
   833 \end{center}
   833 \end{center}
   834 Despite being useful, the expressive power of regexes 
   834 Despite being useful, the expressive power of regexes 
   835 go beyond the regular language hierarchy
   835 go beyond regular languages 
   836 once back-references are included.
   836 once back-references are included.
   837 In fact, they allow the regex construct to express 
   837 In fact, they allow the regex construct to express 
   838 languages that cannot be contained in context-free
   838 languages that cannot be contained in context-free
   839 languages either.
   839 languages either.
   840 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   840 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   841 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   841 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   842 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   842 which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
   843 Such a language is contained in the context-sensitive hierarchy
   843 Such a language is contained in the context-sensitive hierarchy
   844 of formal languages. 
   844 of formal languages. 
   845 Also solving the matching problem involving back-references
   845 Also solving the matching problem involving back-references
   846 is known to be NP-complete \parencite{alfred2014algorithms}.
   846 is known to be NP-complete \parencite{alfred2014algorithms}.
   847 Regex libraries supporting back-references such as 
   847 Regex libraries supporting back-references such as 
   857 (i) The ones  with  linear
   857 (i) The ones  with  linear
   858 time guarantees like Go and Rust. The downside with them is that
   858 time guarantees like Go and Rust. The downside with them is that
   859 they impose restrictions
   859 they impose restrictions
   860 on the regular expressions (not allowing back-references, 
   860 on the regular expressions (not allowing back-references, 
   861 bounded repetitions cannot exceed an ad hoc limit etc.).
   861 bounded repetitions cannot exceed an ad hoc limit etc.).
   862 (ii) Those 
   862 And (ii) those 
   863 that allow large bounded regular expressions and back-references
   863 that allow large bounded regular expressions and back-references
   864 at the expense of using backtracking algorithms.
   864 at the expense of using backtracking algorithms.
   865 They can potentially ``grind to a halt''
   865 They can potentially ``grind to a halt''
   866 on some very simple cases, resulting 
   866 on some very simple cases, resulting 
   867 ReDoS attacks.
   867 ReDoS attacks if exposed to the internet.
   868 
   868 
   869 The proble with both approaches is the motivation for us 
   869 The problems with both approaches are the motivation for us 
   870 to look again at the regular expression matching problem. 
   870 to look again at the regular expression matching problem. 
   871 Another motivation is that regular expression matching algorithms
   871 Another motivation is that regular expression matching algorithms
   872 that follow the POSIX standard often contain errors and bugs 
   872 that follow the POSIX standard often contain errors and bugs 
   873 as we shall explain next.
   873 as we shall explain next.
   874 
   874