# HG changeset patch # User Chengsong # Date 1663890262 -3600 # Node ID 46db6ae66448ddc96444480fd05ab53dec8d1490 # Parent ce4e5151a8368e6e34b6342737b84724e88cce60 chap1 diff -r ce4e5151a836 -r 46db6ae66448 ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Sep 22 00:31:09 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Sep 23 00:44:22 2022 +0100 @@ -208,13 +208,17 @@ Given its usefulness and ubiquity, one would imagine that modern regular expression matching implementations are mature and fully studied. -Indeed, in a popular programming language' regex engine, -supplying it with regular expressions and strings, one can -get rich matching information in a very short time. -Some network intrusion detection systems +Indeed, in a popular programming language's regex engine, +supplying it with regular expressions and strings, +in most cases one can +get the matching information in a very short time. +Those matchers can be blindingly fast--some +network intrusion detection systems use regex engines that are able to process megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. -Unfortunately, this is not the case for $\mathbf{all}$ inputs. +However, those matchers can exhibit a surprising security vulnerability +under a certain class of inputs. +%However, , this is not the case for $\mathbf{all}$ inputs. %TODO: get source for SNORT/BRO's regex matching engine/speed \begin{figure}[p] @@ -386,6 +390,9 @@ requires more research attention. + +\ChristianComment{I am not totally sure where this sentence should be +put, seems a little out-standing here.} Regular expressions and regular expression matchers have of course been studied for many, many years. One of the most recent work in the context of lexing @@ -394,10 +401,28 @@ our derivative-based matcher we are going to present. There is also some newer work called Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead. -For that the problem is the bounded repetitions ($r^{n}$, -$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$). -They often occur in practical use, in for example the Snort XML definitions: - +For that the problem is dealing with the bounded regular expressions of the form +$r^{n}$ where $n$ is a constant specifying that $r$ must repeat +exactly $n$ times. +The other repetition constructs include +$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating +at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times. +Their formal definitions will be given later. +Bounded repetitions are important because they +tend to occur often in practical use\cite{xml2015}, for example in RegExLib, +Snort, as well as in XML Schema definitions (XSDs). +One XSD that seems to be related to the MPEG-7 standard involves +the below regular expression: +\begin{verbatim} + + + + +\end{verbatim} +This is just a fancy way of writing the regular expression +$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves +regular expressions +satisfy certain constraints such as floating point number format. The problems are not limited to slowness on certain cases. diff -r ce4e5151a836 -r 46db6ae66448 ChengsongTanPhdThesis/example.bib --- a/ChengsongTanPhdThesis/example.bib Thu Sep 22 00:31:09 2022 +0100 +++ b/ChengsongTanPhdThesis/example.bib Fri Sep 23 00:44:22 2022 +0100 @@ -186,6 +186,25 @@ title = {A Crash-Course in Regular Expression Parsing and Regular Expressions as Types}, year = {2014}} +@inproceedings{xml2015, +author = {Bj\"{o}rklund, Henrik and Martens, Wim and Timm, Thomas}, +title = {Efficient Incremental Evaluation of Succinct Regular Expressions}, +year = {2015}, +isbn = {9781450337946}, +publisher = {Association for Computing Machinery}, +address = {New York, NY, USA}, +url = {https://doi.org/10.1145/2806416.2806434}, +doi = {10.1145/2806416.2806434}, +abstract = {Regular expressions are omnipresent in database applications. They form the structural core of schema languages for XML, they are a fundamental ingredient for navigational queries in graph databases, and are being considered in languages for upcoming technologies such as schema- and transformation languages for tabular data on the Web. In this paper we study the usage and effectiveness of the counting operator (or: limited repetition) in regular expressions. The counting operator is a popular extension which is part of the POSIX standard and therefore also present in regular expressions in grep, Java, Python, Perl, and Ruby. In a database context, expressions with counting appear in XML Schema and languages for querying graphs such as SPARQL 1.1 and Cypher.We first present a practical study that suggests that counters are extensively used in practice. We then investigate evaluation methods for such expressions and develop a new algorithm for efficient incremental evaluation. Finally, we conduct an extensive benchmark study that shows that exploiting counting operators can lead to speed-ups of several orders of magnitude in a wide range of settings: normal and incremental evaluation on synthetic and real expressions.}, +booktitle = {Proceedings of the 24th ACM International on Conference on Information and Knowledge Management}, +pages = {1541–1550}, +numpages = {10}, +keywords = {regular expressions, schema, regular path queries, xml}, +location = {Melbourne, Australia}, +series = {CIKM '15} +} + + @misc{SE16, author = {StackStatus}, date-added = {2019-06-26 11:28:41 +0000},