|    209  |    209  | 
|    210  |    210  | 
|    211 Regular expressions are widely used in computer science:  |    211 Regular expressions are widely used in computer science:  | 
|    212 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; |    212 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; | 
|    213 command-line tools like $\mathit{grep}$ that facilitate easy  |    213 command-line tools like $\mathit{grep}$ that facilitate easy  | 
|    214 text-processing; network intrusion |    214 text-processing \cite{grep}; network intrusion | 
|    215 detection systems that inspect suspicious traffic; or compiler |    215 detection systems that inspect suspicious traffic; or compiler | 
|    216 front ends. |    216 front ends. | 
|    217 Given their usefulness and ubiquity, one would assume that |    217 Given their usefulness and ubiquity, one would assume that | 
|    218 modern regular expression matching implementations |    218 modern regular expression matching implementations | 
|    219 are mature and fully studied. |    219 are mature and fully studied. | 
|    220 Indeed, in a popular programming language's regex engine,  |    220 Indeed, in a popular programming language's regular expression engine,  | 
|    221 supplying it with regular expressions and strings, |    221 supplying it with regular expressions and strings, | 
|    222 in most cases one can |    222 in most cases one can | 
|    223 get the matching information in a very short time. |    223 get the matching information in a very short time. | 
|    224 Those engines can be blindingly fast--some  |    224 Those engines can be blindingly fast--some  | 
|    225 network intrusion detection systems |    225 network intrusion detection systems | 
|    226 use regex engines that are able to process  |    226 use regular expression engines that are able to process  | 
|    227 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |    227 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. | 
|    228 However, those engines can exhibit a surprising security vulnerability |    228 However, those engines can sometimes exhibit a surprising security vulnerability | 
|    229 under a certain class of inputs. |    229 under a certain class of inputs. | 
|    230 %However, , this is not the case for $\mathbf{all}$ inputs. |    230 %However, , this is not the case for $\mathbf{all}$ inputs. | 
|    231 %TODO: get source for SNORT/BRO's regex matching engine/speed |    231 %TODO: get source for SNORT/BRO's regex matching engine/speed | 
|    232  |    232  | 
|    233  |    233  | 
|    236 expression: Obviously the expected $b$ in the last |    236 expression: Obviously the expected $b$ in the last | 
|    237 position is missing. One would assume that modern regular expression |    237 position is missing. One would assume that modern regular expression | 
|    238 matching engines can find this out very quickly. Surprisingly, if one tries |    238 matching engines can find this out very quickly. Surprisingly, if one tries | 
|    239 this example in JavaScript, Python or Java 8, even with small strings,  |    239 this example in JavaScript, Python or Java 8, even with small strings,  | 
|    240 say of lenght of around 30 $a$'s, |    240 say of lenght of around 30 $a$'s, | 
|    241 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}). |    241 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}). | 
|    242 This is clearly exponential behaviour, and  |    242 This is clearly exponential behaviour, and as can be seen | 
|    243 is triggered by some relatively simple regular expressions. |    243 is triggered by some relatively simple regular expressions. | 
|    244 Java 9 and newer |    244 Java 9 and newer | 
|    245 versions improve this behaviour somewhat, but is still slow compared  |    245 versions improve this behaviour somewhat, but are still slow compared  | 
|    246 with the approach we are going to use in this thesis. |    246 with the approach we are going to use in this thesis. | 
|    247  |    247  | 
|    248  |    248  | 
|    249  |    249  | 
|    250 This superlinear blowup in regular expression engines |    250 This superlinear blowup in regular expression engines | 
|    251 had repeatedly caused grief in ``real life'' where it is  |    251 has caused grief in ``real life'' in the past where it is  | 
|    252 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |    252 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. | 
|    253 For example, on 20 July 2016 one evil |    253 For example, on 20 July 2016 one evil | 
|    254 regular expression brought the webpage |    254 regular expression brought the webpage | 
|    255 \href{http://stackexchange.com}{Stack Exchange} to its |    255 \href{http://stackexchange.com}{Stack Exchange} to its | 
|    256 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |    256 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} | 
|    257 In this instance, a regular expression intended to just trim white |    257 In this instance, a regular expression intended to just trim white | 
|    258 spaces from the beginning and the end of a line actually consumed |    258 spaces from the beginning and the end of a line actually consumed | 
|    259 massive amounts of CPU resources---causing web servers to grind to a |    259 massive amounts of CPU resources---causing the web servers to grind to a | 
|    260 halt. In this example, the time needed to process |    260 halt. In this example, the time needed to process | 
|    261 the string was $O(n^2)$ with respect to the string length. This |    261 the string was  | 
|         |    262 $O(n^2)$ with respect to the string length $n$. This | 
|    262 quadratic overhead was enough for the homepage of Stack Exchange to |    263 quadratic overhead was enough for the homepage of Stack Exchange to | 
|    263 respond so slowly that the load balancer assumed a $\mathit{DoS}$  |    264 respond so slowly that the load balancer assumed a $\mathit{DoS}$  | 
|    264 attack and therefore stopped the servers from responding to any |    265 attack and therefore stopped the servers from responding to any | 
|    265 requests. This made the whole site become unavailable.  |    266 requests. This made the whole site become unavailable.  | 
|    266  |    267  | 
|    410 had several causes, at the heart was a regular expression that |    411 had several causes, at the heart was a regular expression that | 
|    411 was used to monitor network |    412 was used to monitor network | 
|    412 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |    413 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} | 
|    413 These problems with regular expressions  |    414 These problems with regular expressions  | 
|    414 are not isolated events that happen |    415 are not isolated events that happen | 
|    415 very occasionally, but actually widespread. |    416 very rarely,  | 
|    416 They occur so often that they have a  |    417 %but actually widespread. | 
|         |    418 %They occur so often that they have a  | 
|         |    419 but they occur actually often enough that they have a | 
|    417 name: Regular-Expression-Denial-Of-Service (ReDoS) |    420 name: Regular-Expression-Denial-Of-Service (ReDoS) | 
|    418 attack. |    421 attacks. | 
|    419 \citeauthor{Davis18} detected more |    422 Davis et al. \cite{Davis18} detected more | 
|    420 than 1000 evil regular expressions |    423 than 1000 evil regular expressions | 
|    421 in Node.js, Python core libraries, npm and in pypi.  |    424 in Node.js, Python core libraries, npm and in pypi.  | 
|    422 They therefore concluded that evil regular expressions |    425 They therefore concluded that evil regular expressions | 
|    423 are real problems rather than "a parlour trick". |    426 are real problems rather than just "a parlour trick". | 
|    424  |    427  | 
|    425 This work aims to address this issue |    428 This work aims to address this issue | 
|    426 with the help of formal proofs. |    429 with the help of formal proofs. | 
|    427 We describe a lexing algorithm based |    430 We describe a lexing algorithm based | 
|    428 on Brzozowski derivatives with verified correctness (in  |    431 on Brzozowski derivatives with verified correctness (in  | 
|    429 Isabelle/HOL) |    432 Isabelle/HOL) | 
|    430 and a finiteness property. |    433 and a finiteness property for the size of derivatives. | 
|    431 Such properties %guarantee the absence of  |    434 Such properties %guarantee the absence of  | 
|    432 are an important step in preventing |    435 are an important step in preventing | 
|    433 catastrophic backtracking once and for all. |    436 catastrophic backtracking once and for all. | 
|    434 We will give more details in the next sections |    437 We will give more details in the next sections | 
|    435 on (i) why the slow cases in graph \ref{fig:aStarStarb} |    438 on (i) why the slow cases in graph \ref{fig:aStarStarb} |