--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Sep 22 00:31:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Sep 23 00:44:22 2022 +0100
@@ -208,13 +208,17 @@
Given its usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
-Indeed, in a popular programming language' regex engine,
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
+Indeed, in a popular programming language's regex engine,
+supplying it with regular expressions and strings,
+in most cases one can
+get the matching information in a very short time.
+Those matchers can be blindingly fast--some
+network intrusion detection systems
use regex engines that are able to process
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
+However, those matchers can exhibit a surprising security vulnerability
+under a certain class of inputs.
+%However, , this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
\begin{figure}[p]
@@ -386,6 +390,9 @@
requires
more research attention.
+
+\ChristianComment{I am not totally sure where this sentence should be
+put, seems a little out-standing here.}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
One of the most recent work in the context of lexing
@@ -394,10 +401,28 @@
our derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead.
-For that the problem is the bounded repetitions ($r^{n}$,
-$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
-They often occur in practical use, in for example the Snort XML definitions:
-
+For that the problem is dealing with the bounded regular expressions of the form
+$r^{n}$ where $n$ is a constant specifying that $r$ must repeat
+exactly $n$ times.
+The other repetition constructs include
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating
+at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times.
+Their formal definitions will be given later.
+Bounded repetitions are important because they
+tend to occur often in practical use\cite{xml2015}, for example in RegExLib,
+Snort, as well as in XML Schema definitions (XSDs).
+One XSD that seems to be related to the MPEG-7 standard involves
+the below regular expression:
+\begin{verbatim}
+<sequence minOccurs="0" maxOccurs="65535">
+ <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{verbatim}
+This is just a fancy way of writing the regular expression
+$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
+regular expressions
+satisfy certain constraints such as floating point number format.
The problems are not limited to slowness on certain
cases.
--- a/ChengsongTanPhdThesis/example.bib Thu Sep 22 00:31:09 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib Fri Sep 23 00:44:22 2022 +0100
@@ -186,6 +186,25 @@
title = {A Crash-Course in Regular Expression Parsing and Regular Expressions as Types},
year = {2014}}
+@inproceedings{xml2015,
+author = {Bj\"{o}rklund, Henrik and Martens, Wim and Timm, Thomas},
+title = {Efficient Incremental Evaluation of Succinct Regular Expressions},
+year = {2015},
+isbn = {9781450337946},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/2806416.2806434},
+doi = {10.1145/2806416.2806434},
+abstract = {Regular expressions are omnipresent in database applications. They form the structural core of schema languages for XML, they are a fundamental ingredient for navigational queries in graph databases, and are being considered in languages for upcoming technologies such as schema- and transformation languages for tabular data on the Web. In this paper we study the usage and effectiveness of the counting operator (or: limited repetition) in regular expressions. The counting operator is a popular extension which is part of the POSIX standard and therefore also present in regular expressions in grep, Java, Python, Perl, and Ruby. In a database context, expressions with counting appear in XML Schema and languages for querying graphs such as SPARQL 1.1 and Cypher.We first present a practical study that suggests that counters are extensively used in practice. We then investigate evaluation methods for such expressions and develop a new algorithm for efficient incremental evaluation. Finally, we conduct an extensive benchmark study that shows that exploiting counting operators can lead to speed-ups of several orders of magnitude in a wide range of settings: normal and incremental evaluation on synthetic and real expressions.},
+booktitle = {Proceedings of the 24th ACM International on Conference on Information and Knowledge Management},
+pages = {1541–1550},
+numpages = {10},
+keywords = {regular expressions, schema, regular path queries, xml},
+location = {Melbourne, Australia},
+series = {CIKM '15}
+}
+
+
@misc{SE16,
author = {StackStatus},
date-added = {2019-06-26 11:28:41 +0000},