--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sat Dec 17 22:39:55 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Dec 18 16:53:53 2022 +0000
@@ -221,20 +221,20 @@
supplying it with regular expressions and strings,
in most cases one can
get the matching information in a very short time.
-Those matchers can be blindingly fast--some
+Those engines can be blindingly fast--some
network intrusion detection systems
use regex engines that are able to process
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those matchers can exhibit a surprising security vulnerability
+However, those engines can exhibit a surprising security vulnerability
under a certain class of inputs.
%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
-Consider $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ can be matched by this regular
-expression. Obviously this is not the case---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
+Consider for example $(a^*)^*\,b$ and
+strings of the form $aa..a$, these strings cannot be matched by this regular
+expression: Obviously the expected $b$ in the last
+position is missing. One would assume that modern regular expression
matching engines can find this out very quickly. Surprisingly, if one tries
this example in JavaScript, Python or Java 8, even with small strings,
say of lenght of around 30 $a$'s,
@@ -717,7 +717,7 @@
it can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only
inductive datatypes and recursive functions,
-making it handy to deal with using theorem provers.
+making it handy to deal with them in theorem provers.
%The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
%straightforwardly extended to deal with bounded regular expressions
%and moreover the resulting code still consists of only simple
@@ -832,14 +832,14 @@
$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
\end{center}
Despite being useful, the expressive power of regexes
-go beyond the regular language hierarchy
+go beyond regular languages
once back-references are included.
In fact, they allow the regex construct to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
+which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
Such a language is contained in the context-sensitive hierarchy
of formal languages.
Also solving the matching problem involving back-references
@@ -859,14 +859,14 @@
they impose restrictions
on the regular expressions (not allowing back-references,
bounded repetitions cannot exceed an ad hoc limit etc.).
-(ii) Those
+And (ii) those
that allow large bounded regular expressions and back-references
at the expense of using backtracking algorithms.
They can potentially ``grind to a halt''
on some very simple cases, resulting
-ReDoS attacks.
+ReDoS attacks if exposed to the internet.
-The proble with both approaches is the motivation for us
+The problems with both approaches are the motivation for us
to look again at the regular expression matching problem.
Another motivation is that regular expression matching algorithms
that follow the POSIX standard often contain errors and bugs