ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 602 46db6ae66448
parent 601 ce4e5151a836
child 603 370fe1dde7c7
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Sep 22 00:31:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Sep 23 00:44:22 2022 +0100
@@ -208,13 +208,17 @@
 Given its usefulness and ubiquity, one would imagine that
 modern regular expression matching implementations
 are mature and fully studied.
-Indeed, in a popular programming language' regex engine, 
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
+Indeed, in a popular programming language's regex engine, 
+supplying it with regular expressions and strings,
+in most cases one can
+get the matching information in a very short time.
+Those matchers can be blindingly fast--some 
+network intrusion detection systems
 use regex engines that are able to process 
 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
+However, those matchers can exhibit a surprising security vulnerability
+under a certain class of inputs.
+%However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
 \begin{figure}[p]
@@ -386,6 +390,9 @@
 requires
 more research attention.
 
+
+\ChristianComment{I am not totally sure where this sentence should be
+put, seems a little out-standing here.}
 Regular expressions and regular expression matchers 
 have of course been studied for many, many years.
 One of the most recent work in the context of lexing
@@ -394,10 +401,28 @@
 our derivative-based matcher we are going to present.
 There is also some newer work called
 Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead.
-For that the problem is the bounded repetitions ($r^{n}$,
-$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
-They often occur in practical use, in for example the Snort XML definitions:
-
+For that the problem is dealing with the bounded regular expressions of the form
+$r^{n}$ where $n$ is a constant specifying that $r$ must repeat
+exactly $n$ times.
+The other repetition constructs include
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating
+at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times.
+Their formal definitions will be given later.
+Bounded repetitions are important because they
+tend to occur often in practical use\cite{xml2015}, for example in RegExLib,
+Snort, as well as in XML Schema definitions (XSDs).
+One XSD that seems to be related to the MPEG-7 standard involves
+the below regular expression:
+\begin{verbatim}
+<sequence minOccurs="0" maxOccurs="65535">
+    <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+    <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{verbatim}
+This is just a fancy way of writing the regular expression 
+$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
+regular expressions 
+satisfy certain constraints such as floating point number format.
 
 The problems are not limited to slowness on certain 
 cases.