# HG changeset patch # User Chengsong # Date 1671387472 0 # Node ID b079aaee5e10d529659e60f4695045115e345785 # Parent f1edd5ee1a122d461112f617f5efa2c201b28b13 more diff -r f1edd5ee1a12 -r b079aaee5e10 ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Dec 18 16:53:53 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Dec 18 18:17:52 2022 +0000 @@ -211,21 +211,21 @@ Regular expressions are widely used in computer science: be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion; command-line tools like $\mathit{grep}$ that facilitate easy -text-processing; network intrusion +text-processing \cite{grep}; network intrusion detection systems that inspect suspicious traffic; or compiler front ends. Given their usefulness and ubiquity, one would assume that modern regular expression matching implementations are mature and fully studied. -Indeed, in a popular programming language's regex engine, +Indeed, in a popular programming language's regular expression engine, supplying it with regular expressions and strings, in most cases one can get the matching information in a very short time. Those engines can be blindingly fast--some network intrusion detection systems -use regex engines that are able to process -megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. -However, those engines can exhibit a surprising security vulnerability +use regular expression engines that are able to process +hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. +However, those engines can sometimes exhibit a surprising security vulnerability under a certain class of inputs. %However, , this is not the case for $\mathbf{all}$ inputs. %TODO: get source for SNORT/BRO's regex matching engine/speed @@ -238,17 +238,17 @@ matching engines can find this out very quickly. Surprisingly, if one tries this example in JavaScript, Python or Java 8, even with small strings, say of lenght of around 30 $a$'s, -the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}). -This is clearly exponential behaviour, and +the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}). +This is clearly exponential behaviour, and as can be seen is triggered by some relatively simple regular expressions. Java 9 and newer -versions improve this behaviour somewhat, but is still slow compared +versions improve this behaviour somewhat, but are still slow compared with the approach we are going to use in this thesis. This superlinear blowup in regular expression engines -had repeatedly caused grief in ``real life'' where it is +has caused grief in ``real life'' in the past where it is given the name ``catastrophic backtracking'' or ``evil'' regular expressions. For example, on 20 July 2016 one evil regular expression brought the webpage @@ -256,9 +256,10 @@ knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} In this instance, a regular expression intended to just trim white spaces from the beginning and the end of a line actually consumed -massive amounts of CPU resources---causing web servers to grind to a +massive amounts of CPU resources---causing the web servers to grind to a halt. In this example, the time needed to process -the string was $O(n^2)$ with respect to the string length. This +the string was +$O(n^2)$ with respect to the string length $n$. This quadratic overhead was enough for the homepage of Stack Exchange to respond so slowly that the load balancer assumed a $\mathit{DoS}$ attack and therefore stopped the servers from responding to any @@ -412,22 +413,24 @@ traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} These problems with regular expressions are not isolated events that happen -very occasionally, but actually widespread. -They occur so often that they have a +very rarely, +%but actually widespread. +%They occur so often that they have a +but they occur actually often enough that they have a name: Regular-Expression-Denial-Of-Service (ReDoS) -attack. -\citeauthor{Davis18} detected more +attacks. +Davis et al. \cite{Davis18} detected more than 1000 evil regular expressions in Node.js, Python core libraries, npm and in pypi. They therefore concluded that evil regular expressions -are real problems rather than "a parlour trick". +are real problems rather than just "a parlour trick". This work aims to address this issue with the help of formal proofs. We describe a lexing algorithm based on Brzozowski derivatives with verified correctness (in Isabelle/HOL) -and a finiteness property. +and a finiteness property for the size of derivatives. Such properties %guarantee the absence of are an important step in preventing catastrophic backtracking once and for all. diff -r f1edd5ee1a12 -r b079aaee5e10 ChengsongTanPhdThesis/example.bib --- a/ChengsongTanPhdThesis/example.bib Sun Dec 18 16:53:53 2022 +0000 +++ b/ChengsongTanPhdThesis/example.bib Sun Dec 18 18:17:52 2022 +0000 @@ -1016,3 +1016,9 @@ pages = {246--256}, title = {{T}he {I}mpact of {R}egular {E}xpression {D}enial of {S}ervice ({ReDoS}) in {P}ractice: {A}n {E}mpirical {S}tudy at the {E}cosystem {S}cale}, year = {2018}} + + +@misc{grep, + title = {GNU grep}, + url = {https://www.gnu.org/software/grep/manual/grep.html}, +}