--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Dec 18 16:53:53 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Dec 18 18:17:52 2022 +0000
@@ -211,21 +211,21 @@
Regular expressions are widely used in computer science:
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
-text-processing; network intrusion
+text-processing \cite{grep}; network intrusion
detection systems that inspect suspicious traffic; or compiler
front ends.
Given their usefulness and ubiquity, one would assume that
modern regular expression matching implementations
are mature and fully studied.
-Indeed, in a popular programming language's regex engine,
+Indeed, in a popular programming language's regular expression engine,
supplying it with regular expressions and strings,
in most cases one can
get the matching information in a very short time.
Those engines can be blindingly fast--some
network intrusion detection systems
-use regex engines that are able to process
-megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can exhibit a surprising security vulnerability
+use regular expression engines that are able to process
+hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
+However, those engines can sometimes exhibit a surprising security vulnerability
under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
@@ -238,17 +238,17 @@
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of lenght of around 30 $a$'s,
-the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}).
-This is clearly exponential behaviour, and
+the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
+This is clearly exponential behaviour, and as can be seen
is triggered by some relatively simple regular expressions.
Java 9 and newer
-versions improve this behaviour somewhat, but is still slow compared
+versions improve this behaviour somewhat, but are still slow compared
with the approach we are going to use in this thesis.
This superlinear blowup in regular expression engines
-had repeatedly caused grief in ``real life'' where it is
+has caused grief in ``real life'' in the past where it is
given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
@@ -256,9 +256,10 @@
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU resources---causing web servers to grind to a
+massive amounts of CPU resources---causing the web servers to grind to a
halt. In this example, the time needed to process
-the string was $O(n^2)$ with respect to the string length. This
+the string was
+$O(n^2)$ with respect to the string length $n$. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
@@ -412,22 +413,24 @@
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
-very occasionally, but actually widespread.
-They occur so often that they have a
+very rarely,
+%but actually widespread.
+%They occur so often that they have a
+but they occur actually often enough that they have a
name: Regular-Expression-Denial-Of-Service (ReDoS)
-attack.
-\citeauthor{Davis18} detected more
+attacks.
+Davis et al. \cite{Davis18} detected more
than 1000 evil regular expressions
in Node.js, Python core libraries, npm and in pypi.
They therefore concluded that evil regular expressions
-are real problems rather than "a parlour trick".
+are real problems rather than just "a parlour trick".
This work aims to address this issue
with the help of formal proofs.
We describe a lexing algorithm based
on Brzozowski derivatives with verified correctness (in
Isabelle/HOL)
-and a finiteness property.
+and a finiteness property for the size of derivatives.
Such properties %guarantee the absence of
are an important step in preventing
catastrophic backtracking once and for all.