209 |
209 |
210 |
210 |
211 Regular expressions are widely used in computer science: |
211 Regular expressions are widely used in computer science: |
212 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; |
212 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; |
213 command-line tools like $\mathit{grep}$ that facilitate easy |
213 command-line tools like $\mathit{grep}$ that facilitate easy |
214 text-processing; network intrusion |
214 text-processing \cite{grep}; network intrusion |
215 detection systems that inspect suspicious traffic; or compiler |
215 detection systems that inspect suspicious traffic; or compiler |
216 front ends. |
216 front ends. |
217 Given their usefulness and ubiquity, one would assume that |
217 Given their usefulness and ubiquity, one would assume that |
218 modern regular expression matching implementations |
218 modern regular expression matching implementations |
219 are mature and fully studied. |
219 are mature and fully studied. |
220 Indeed, in a popular programming language's regex engine, |
220 Indeed, in a popular programming language's regular expression engine, |
221 supplying it with regular expressions and strings, |
221 supplying it with regular expressions and strings, |
222 in most cases one can |
222 in most cases one can |
223 get the matching information in a very short time. |
223 get the matching information in a very short time. |
224 Those engines can be blindingly fast--some |
224 Those engines can be blindingly fast--some |
225 network intrusion detection systems |
225 network intrusion detection systems |
226 use regex engines that are able to process |
226 use regular expression engines that are able to process |
227 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
227 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
228 However, those engines can exhibit a surprising security vulnerability |
228 However, those engines can sometimes exhibit a surprising security vulnerability |
229 under a certain class of inputs. |
229 under a certain class of inputs. |
230 %However, , this is not the case for $\mathbf{all}$ inputs. |
230 %However, , this is not the case for $\mathbf{all}$ inputs. |
231 %TODO: get source for SNORT/BRO's regex matching engine/speed |
231 %TODO: get source for SNORT/BRO's regex matching engine/speed |
232 |
232 |
233 |
233 |
236 expression: Obviously the expected $b$ in the last |
236 expression: Obviously the expected $b$ in the last |
237 position is missing. One would assume that modern regular expression |
237 position is missing. One would assume that modern regular expression |
238 matching engines can find this out very quickly. Surprisingly, if one tries |
238 matching engines can find this out very quickly. Surprisingly, if one tries |
239 this example in JavaScript, Python or Java 8, even with small strings, |
239 this example in JavaScript, Python or Java 8, even with small strings, |
240 say of lenght of around 30 $a$'s, |
240 say of lenght of around 30 $a$'s, |
241 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}). |
241 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}). |
242 This is clearly exponential behaviour, and |
242 This is clearly exponential behaviour, and as can be seen |
243 is triggered by some relatively simple regular expressions. |
243 is triggered by some relatively simple regular expressions. |
244 Java 9 and newer |
244 Java 9 and newer |
245 versions improve this behaviour somewhat, but is still slow compared |
245 versions improve this behaviour somewhat, but are still slow compared |
246 with the approach we are going to use in this thesis. |
246 with the approach we are going to use in this thesis. |
247 |
247 |
248 |
248 |
249 |
249 |
250 This superlinear blowup in regular expression engines |
250 This superlinear blowup in regular expression engines |
251 had repeatedly caused grief in ``real life'' where it is |
251 has caused grief in ``real life'' in the past where it is |
252 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
252 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
253 For example, on 20 July 2016 one evil |
253 For example, on 20 July 2016 one evil |
254 regular expression brought the webpage |
254 regular expression brought the webpage |
255 \href{http://stackexchange.com}{Stack Exchange} to its |
255 \href{http://stackexchange.com}{Stack Exchange} to its |
256 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
256 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
257 In this instance, a regular expression intended to just trim white |
257 In this instance, a regular expression intended to just trim white |
258 spaces from the beginning and the end of a line actually consumed |
258 spaces from the beginning and the end of a line actually consumed |
259 massive amounts of CPU resources---causing web servers to grind to a |
259 massive amounts of CPU resources---causing the web servers to grind to a |
260 halt. In this example, the time needed to process |
260 halt. In this example, the time needed to process |
261 the string was $O(n^2)$ with respect to the string length. This |
261 the string was |
|
262 $O(n^2)$ with respect to the string length $n$. This |
262 quadratic overhead was enough for the homepage of Stack Exchange to |
263 quadratic overhead was enough for the homepage of Stack Exchange to |
263 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
264 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
264 attack and therefore stopped the servers from responding to any |
265 attack and therefore stopped the servers from responding to any |
265 requests. This made the whole site become unavailable. |
266 requests. This made the whole site become unavailable. |
266 |
267 |
410 had several causes, at the heart was a regular expression that |
411 had several causes, at the heart was a regular expression that |
411 was used to monitor network |
412 was used to monitor network |
412 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
413 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
413 These problems with regular expressions |
414 These problems with regular expressions |
414 are not isolated events that happen |
415 are not isolated events that happen |
415 very occasionally, but actually widespread. |
416 very rarely, |
416 They occur so often that they have a |
417 %but actually widespread. |
|
418 %They occur so often that they have a |
|
419 but they occur actually often enough that they have a |
417 name: Regular-Expression-Denial-Of-Service (ReDoS) |
420 name: Regular-Expression-Denial-Of-Service (ReDoS) |
418 attack. |
421 attacks. |
419 \citeauthor{Davis18} detected more |
422 Davis et al. \cite{Davis18} detected more |
420 than 1000 evil regular expressions |
423 than 1000 evil regular expressions |
421 in Node.js, Python core libraries, npm and in pypi. |
424 in Node.js, Python core libraries, npm and in pypi. |
422 They therefore concluded that evil regular expressions |
425 They therefore concluded that evil regular expressions |
423 are real problems rather than "a parlour trick". |
426 are real problems rather than just "a parlour trick". |
424 |
427 |
425 This work aims to address this issue |
428 This work aims to address this issue |
426 with the help of formal proofs. |
429 with the help of formal proofs. |
427 We describe a lexing algorithm based |
430 We describe a lexing algorithm based |
428 on Brzozowski derivatives with verified correctness (in |
431 on Brzozowski derivatives with verified correctness (in |
429 Isabelle/HOL) |
432 Isabelle/HOL) |
430 and a finiteness property. |
433 and a finiteness property for the size of derivatives. |
431 Such properties %guarantee the absence of |
434 Such properties %guarantee the absence of |
432 are an important step in preventing |
435 are an important step in preventing |
433 catastrophic backtracking once and for all. |
436 catastrophic backtracking once and for all. |
434 We will give more details in the next sections |
437 We will give more details in the next sections |
435 on (i) why the slow cases in graph \ref{fig:aStarStarb} |
438 on (i) why the slow cases in graph \ref{fig:aStarStarb} |