--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Sep 22 00:31:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Sep 23 00:44:22 2022 +0100
@@ -208,13 +208,17 @@
Given its usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
-Indeed, in a popular programming language' regex engine,
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
+Indeed, when a popular programming language's regex engine
+is supplied with regular expressions and strings,
+in most cases one can
+get the matching information in a very short time.
+Those matchers can be blindingly fast: some
+network intrusion detection systems
use regex engines that are able to process
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
+However, those matchers can exhibit a surprising security vulnerability
+under a certain class of inputs.
+%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
\begin{figure}[p]
@@ -386,6 +390,9 @@
requires
more research attention.
+
+\ChristianComment{I am not totally sure where this sentence should be
+put, seems a little out-standing here.}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
One of the most recent work in the context of lexing
@@ -394,10 +401,28 @@
our derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead.
-For that the problem is the bounded repetitions ($r^{n}$,
-$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
-They often occur in practical use, in for example the Snort XML definitions:
-
+For that, the problem is dealing with bounded repetitions of the form
+$r^{n}$, where $n$ is a constant specifying that $r$ must repeat
+exactly $n$ times.
+The other repetition constructs include
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$, which respectively mean repeating
+at most $m$ times, repeating at least $n$ times, and repeating between $n$ and $m$ times.
+Their formal definitions will be given later.
+Bounded repetitions are important because they
+occur frequently in practice~\cite{xml2015}, for example in RegExLib,
+Snort, as well as in XML Schema definitions (XSDs).
+One XSD that seems to be related to the MPEG-7 standard involves
+the following regular expression:
+\begin{verbatim}
+<sequence minOccurs="0" maxOccurs="65535">
+ <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{verbatim}
+This is just a fancy way of writing the regular expression
+$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
+regular expressions
+satisfying certain constraints such as the floating-point number format.
The problems are not limited to slowness on certain
cases.