more, including link
authorChengsong
Sun, 18 Dec 2022 16:53:53 +0000
changeset 633 f1edd5ee1a12
parent 632 99eceee5c4ac
child 634 b079aaee5e10
more, including link
ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Dec 17 22:39:55 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sun Dec 18 16:53:53 2022 +0000
@@ -221,20 +221,20 @@
 supplying it with regular expressions and strings,
 in most cases one can
 get the matching information in a very short time.
-Those matchers can be blindingly fast--some 
+Those engines can be blindingly fast--some 
 network intrusion detection systems
 use regex engines that are able to process 
 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those matchers can exhibit a surprising security vulnerability
+However, those engines can exhibit a surprising security vulnerability
 under a certain class of inputs.
 %However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
 
-Consider $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ can be matched by this regular
-expression. Obviously this is not the case---the expected $b$ in the last
-position is missing. One would expect that modern regular expression
+Consider for example $(a^*)^*\,b$ and 
+strings of the form $aa..a$, these strings cannot be matched by this regular
+expression: Obviously the expected $b$ in the last
+position is missing. One would assume that modern regular expression
 matching engines can find this out very quickly. Surprisingly, if one tries
 this example in JavaScript, Python or Java 8, even with small strings, 
 say of lenght of around 30 $a$'s,
@@ -717,7 +717,7 @@
 it can be naturally extended to support bounded repetitions.
 Moreover these extensions are still made up of only
 inductive datatypes and recursive functions,
-making it handy to deal with using theorem provers.
+making it handy to deal with them in theorem provers.
 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
 %straightforwardly extended to deal with bounded regular expressions
 %and moreover the resulting code still consists of only simple
@@ -832,14 +832,14 @@
 	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
 \end{center}
 Despite being useful, the expressive power of regexes 
-go beyond the regular language hierarchy
+go beyond regular languages 
 once back-references are included.
 In fact, they allow the regex construct to express 
 languages that cannot be contained in context-free
 languages either.
 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
-which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
+which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
 of formal languages. 
 Also solving the matching problem involving back-references
@@ -859,14 +859,14 @@
 they impose restrictions
 on the regular expressions (not allowing back-references, 
 bounded repetitions cannot exceed an ad hoc limit etc.).
-(ii) Those 
+And (ii) those 
 that allow large bounded regular expressions and back-references
 at the expense of using backtracking algorithms.
 They can potentially ``grind to a halt''
 on some very simple cases, resulting 
-ReDoS attacks.
+ReDoS attacks if exposed to the internet.
 
-The proble with both approaches is the motivation for us 
+The problems with both approaches are the motivation for us 
 to look again at the regular expression matching problem. 
 Another motivation is that regular expression matching algorithms
 that follow the POSIX standard often contain errors and bugs