lexing: changeset 602:46db6ae66448

--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Sep 22 00:31:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Sep 23 00:44:22 2022 +0100
@@ -208,13 +208,17 @@
 Given its usefulness and ubiquity, one would imagine that
 modern regular expression matching implementations
 are mature and fully studied.
-Indeed, in a popular programming language' regex engine, 
-supplying it with regular expressions and strings, one can
-get rich matching information in a very short time.
-Some network intrusion detection systems
+Indeed, in a popular programming language's regex engine, 
+supplying it with regular expressions and strings,
+in most cases one can
+get the matching information in a very short time.
+Those matchers can be blindingly fast--some 
+network intrusion detection systems
 use regex engines that are able to process 
 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-Unfortunately, this is not the case for $\mathbf{all}$ inputs.
+However, those matchers can exhibit a surprising security vulnerability
+under a certain class of inputs.
+%However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
 \begin{figure}[p]
@@ -386,6 +390,9 @@
 requires
 more research attention.
 
+
+\ChristianComment{I am not totally sure where this sentence should be
+put, seems a little out-standing here.}
 Regular expressions and regular expression matchers 
 have of course been studied for many, many years.
 One of the most recent work in the context of lexing
@@ -394,10 +401,28 @@
 our derivative-based matcher we are going to present.
 There is also some newer work called
 Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead.
-For that the problem is the bounded repetitions ($r^{n}$,
-$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
-They often occur in practical use, in for example the Snort XML definitions:
-
+For that the problem is dealing with the bounded regular expressions of the form
+$r^{n}$ where $n$ is a constant specifying that $r$ must repeat
+exactly $n$ times.
+The other repetition constructs include
+$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating
+at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times.
+Their formal definitions will be given later.
+Bounded repetitions are important because they
+tend to occur often in practical use\cite{xml2015}, for example in RegExLib,
+Snort, as well as in XML Schema definitions (XSDs).
+One XSD that seems to be related to the MPEG-7 standard involves
+the below regular expression:
+\begin{verbatim}
+<sequence minOccurs="0" maxOccurs="65535">
+    <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+    <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{verbatim}
+This is just a fancy way of writing the regular expression 
+$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
+regular expressions 
+satisfy certain constraints such as floating point number format.
 
 The problems are not limited to slowness on certain 
 cases.

--- a/ChengsongTanPhdThesis/example.bib	Thu Sep 22 00:31:09 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib	Fri Sep 23 00:44:22 2022 +0100
@@ -186,6 +186,25 @@
 	title = {A Crash-Course in Regular Expression Parsing and Regular Expressions as Types},
 	year = {2014}}
 
+@inproceedings{xml2015,
+author = {Bj\"{o}rklund, Henrik and Martens, Wim and Timm, Thomas},
+title = {Efficient Incremental Evaluation of Succinct Regular Expressions},
+year = {2015},
+isbn = {9781450337946},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/2806416.2806434},
+doi = {10.1145/2806416.2806434},
+abstract = {Regular expressions are omnipresent in database applications. They form the structural core of schema languages for XML, they are a fundamental ingredient for navigational queries in graph databases, and are being considered in languages for upcoming technologies such as schema- and transformation languages for tabular data on the Web. In this paper we study the usage and effectiveness of the counting operator (or: limited repetition) in regular expressions. The counting operator is a popular extension which is part of the POSIX standard and therefore also present in regular expressions in grep, Java, Python, Perl, and Ruby. In a database context, expressions with counting appear in XML Schema and languages for querying graphs such as SPARQL 1.1 and Cypher.We first present a practical study that suggests that counters are extensively used in practice. We then investigate evaluation methods for such expressions and develop a new algorithm for efficient incremental evaluation. Finally, we conduct an extensive benchmark study that shows that exploiting counting operators can lead to speed-ups of several orders of magnitude in a wide range of settings: normal and incremental evaluation on synthetic and real expressions.},
+booktitle = {Proceedings of the 24th ACM International on Conference on Information and Knowledge Management},
+pages = {1541–1550},
+numpages = {10},
+keywords = {regular expressions, schema, regular path queries, xml},
+location = {Melbourne, Australia},
+series = {CIKM '15}
+}
+
+
 @misc{SE16,
 	author = {StackStatus},
 	date-added = {2019-06-26 11:28:41 +0000},

author	Chengsong
	Fri, 23 Sep 2022 00:44:22 +0100
changeset 602	46db6ae66448
parent 601	ce4e5151a836
child 603	370fe1dde7c7

ChengsongTanPhdThesis/Chapters/Introduction.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/example.bib		file \| annotate \| diff \| comparison \| revisions