ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 634 b079aaee5e10
parent 633 f1edd5ee1a12
child 635 7ce2389dff4b
equal deleted inserted replaced
633:f1edd5ee1a12 634:b079aaee5e10
   209 
   209 
   210 
   210 
   211 Regular expressions are widely used in computer science: 
   211 Regular expressions are widely used in computer science: 
   212 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
   212 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
   213 command-line tools like $\mathit{grep}$ that facilitate easy 
   213 command-line tools like $\mathit{grep}$ that facilitate easy 
   214 text-processing; network intrusion
   214 text-processing \cite{grep}; network intrusion
   215 detection systems that inspect suspicious traffic; or compiler
   215 detection systems that inspect suspicious traffic; or compiler
   216 front ends.
   216 front ends.
   217 Given their usefulness and ubiquity, one would assume that
   217 Given their usefulness and ubiquity, one would assume that
   218 modern regular expression matching implementations
   218 modern regular expression matching implementations
   219 are mature and fully studied.
   219 are mature and fully studied.
   220 Indeed, in a popular programming language's regex engine, 
   220 Indeed, in a popular programming language's regular expression engine, 
   221 supplying it with regular expressions and strings,
   221 supplying it with regular expressions and strings,
   222 in most cases one can
   222 in most cases one can
   223 get the matching information in a very short time.
   223 get the matching information in a very short time.
   224 Those engines can be blindingly fast--some 
   224 Those engines can be blindingly fast--some 
   225 network intrusion detection systems
   225 network intrusion detection systems
   226 use regex engines that are able to process 
   226 use regular expression engines that are able to process 
   227 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   227 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   228 However, those engines can exhibit a surprising security vulnerability
   228 However, those engines can sometimes exhibit a surprising security vulnerability
   229 under a certain class of inputs.
   229 under a certain class of inputs.
   230 %However, , this is not the case for $\mathbf{all}$ inputs.
   230 %However, , this is not the case for $\mathbf{all}$ inputs.
   231 %TODO: get source for SNORT/BRO's regex matching engine/speed
   231 %TODO: get source for SNORT/BRO's regex matching engine/speed
   232 
   232 
   233 
   233 
   236 expression: Obviously the expected $b$ in the last
   236 expression: Obviously the expected $b$ in the last
   237 position is missing. One would assume that modern regular expression
   237 position is missing. One would assume that modern regular expression
   238 matching engines can find this out very quickly. Surprisingly, if one tries
   238 matching engines can find this out very quickly. Surprisingly, if one tries
   239 this example in JavaScript, Python or Java 8, even with small strings, 
   239 this example in JavaScript, Python or Java 8, even with small strings, 
   240 say of lenght of around 30 $a$'s,
   240 say of lenght of around 30 $a$'s,
   241 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}).
   241 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
   242 This is clearly exponential behaviour, and 
   242 This is clearly exponential behaviour, and as can be seen
   243 is triggered by some relatively simple regular expressions.
   243 is triggered by some relatively simple regular expressions.
   244 Java 9 and newer
   244 Java 9 and newer
   245 versions improve this behaviour somewhat, but is still slow compared 
   245 versions improve this behaviour somewhat, but are still slow compared 
   246 with the approach we are going to use in this thesis.
   246 with the approach we are going to use in this thesis.
   247 
   247 
   248 
   248 
   249 
   249 
   250 This superlinear blowup in regular expression engines
   250 This superlinear blowup in regular expression engines
   251 had repeatedly caused grief in ``real life'' where it is 
   251 has caused grief in ``real life'' in the past where it is 
   252 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
   252 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
   253 For example, on 20 July 2016 one evil
   253 For example, on 20 July 2016 one evil
   254 regular expression brought the webpage
   254 regular expression brought the webpage
   255 \href{http://stackexchange.com}{Stack Exchange} to its
   255 \href{http://stackexchange.com}{Stack Exchange} to its
   256 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
   256 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
   257 In this instance, a regular expression intended to just trim white
   257 In this instance, a regular expression intended to just trim white
   258 spaces from the beginning and the end of a line actually consumed
   258 spaces from the beginning and the end of a line actually consumed
   259 massive amounts of CPU resources---causing web servers to grind to a
   259 massive amounts of CPU resources---causing the web servers to grind to a
   260 halt. In this example, the time needed to process
   260 halt. In this example, the time needed to process
   261 the string was $O(n^2)$ with respect to the string length. This
   261 the string was 
       
   262 $O(n^2)$ with respect to the string length $n$. This
   262 quadratic overhead was enough for the homepage of Stack Exchange to
   263 quadratic overhead was enough for the homepage of Stack Exchange to
   263 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   264 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   264 attack and therefore stopped the servers from responding to any
   265 attack and therefore stopped the servers from responding to any
   265 requests. This made the whole site become unavailable. 
   266 requests. This made the whole site become unavailable. 
   266 
   267 
   410 had several causes, at the heart was a regular expression that
   411 had several causes, at the heart was a regular expression that
   411 was used to monitor network
   412 was used to monitor network
   412 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
   413 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
   413 These problems with regular expressions 
   414 These problems with regular expressions 
   414 are not isolated events that happen
   415 are not isolated events that happen
   415 very occasionally, but actually widespread.
   416 very rarely, 
   416 They occur so often that they have a 
   417 %but actually widespread.
       
   418 %They occur so often that they have a 
       
   419 but they occur actually often enough that they have a
   417 name: Regular-Expression-Denial-Of-Service (ReDoS)
   420 name: Regular-Expression-Denial-Of-Service (ReDoS)
   418 attack.
   421 attacks.
   419 \citeauthor{Davis18} detected more
   422 Davis et al. \cite{Davis18} detected more
   420 than 1000 evil regular expressions
   423 than 1000 evil regular expressions
   421 in Node.js, Python core libraries, npm and in pypi. 
   424 in Node.js, Python core libraries, npm and in pypi. 
   422 They therefore concluded that evil regular expressions
   425 They therefore concluded that evil regular expressions
   423 are real problems rather than "a parlour trick".
   426 are real problems rather than just "a parlour trick".
   424 
   427 
   425 This work aims to address this issue
   428 This work aims to address this issue
   426 with the help of formal proofs.
   429 with the help of formal proofs.
   427 We describe a lexing algorithm based
   430 We describe a lexing algorithm based
   428 on Brzozowski derivatives with verified correctness (in 
   431 on Brzozowski derivatives with verified correctness (in 
   429 Isabelle/HOL)
   432 Isabelle/HOL)
   430 and a finiteness property.
   433 and a finiteness property for the size of derivatives.
   431 Such properties %guarantee the absence of 
   434 Such properties %guarantee the absence of 
   432 are an important step in preventing
   435 are an important step in preventing
   433 catastrophic backtracking once and for all.
   436 catastrophic backtracking once and for all.
   434 We will give more details in the next sections
   437 We will give more details in the next sections
   435 on (i) why the slow cases in graph \ref{fig:aStarStarb}
   438 on (i) why the slow cases in graph \ref{fig:aStarStarb}