206 involve lexing with regular |
206 involve lexing with regular |
207 expressions. |
207 expressions. |
208 Given its usefulness and ubiquity, one would imagine that |
208 Given its usefulness and ubiquity, one would imagine that |
209 modern regular expression matching implementations |
209 modern regular expression matching implementations |
210 are mature and fully studied. |
210 are mature and fully studied. |
211 Indeed, in a popular programming language' regex engine, |
211 Indeed, in a popular programming language's regex engine, |
212 supplying it with regular expressions and strings, one can |
212 supplying it with regular expressions and strings, |
213 get rich matching information in a very short time. |
213 in most cases one can |
214 Some network intrusion detection systems |
214 get the matching information in a very short time. |
|
215 Those matchers can be blindingly fast--some |
|
216 network intrusion detection systems |
215 use regex engines that are able to process |
217 use regex engines that are able to process |
216 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
218 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}. |
217 Unfortunately, this is not the case for $\mathbf{all}$ inputs. |
219 However, those matchers can exhibit a surprising security vulnerability |
|
220 under a certain class of inputs. |
|
221 %However, , this is not the case for $\mathbf{all}$ inputs. |
218 %TODO: get source for SNORT/BRO's regex matching engine/speed |
222 %TODO: get source for SNORT/BRO's regex matching engine/speed |
219 |
223 |
220 \begin{figure}[p] |
224 \begin{figure}[p] |
221 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
225 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
222 \begin{tikzpicture} |
226 \begin{tikzpicture} |
384 They therefore concluded that evil regular expressions |
388 They therefore concluded that evil regular expressions |
385 are problems "more than a parlour trick", but one that |
389 are problems "more than a parlour trick", but one that |
386 requires |
390 requires |
387 more research attention. |
391 more research attention. |
388 |
392 |
|
393 |
|
394 \ChristianComment{I am not totally sure where this sentence should be |
|
395 put, seems a little out-standing here.} |
389 Regular expressions and regular expression matchers |
396 Regular expressions and regular expression matchers |
390 have of course been studied for many, many years. |
397 have of course been studied for many, many years. |
391 One of the most recent work in the context of lexing |
398 One of the most recent work in the context of lexing |
392 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}. |
399 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}. |
393 This is relevant work and we will compare later on |
400 This is relevant work and we will compare later on |
394 our derivative-based matcher we are going to present. |
401 our derivative-based matcher we are going to present. |
395 There is also some newer work called |
402 There is also some newer work called |
396 Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead. |
403 Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead. |
397 For that the problem is the bounded repetitions ($r^{n}$, |
404 For that the problem is dealing with the bounded regular expressions of the form |
398 $r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$). |
405 $r^{n}$ where $n$ is a constant specifying that $r$ must repeat |
399 They often occur in practical use, in for example the Snort XML definitions: |
406 exactly $n$ times. |
400 |
407 The other repetition constructs include |
|
408 $r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating |
|
409 at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times. |
|
410 Their formal definitions will be given later. |
|
411 Bounded repetitions are important because they |
|
412 tend to occur often in practical use\cite{xml2015}, for example in RegExLib, |
|
413 Snort, as well as in XML Schema definitions (XSDs). |
|
414 One XSD that seems to be related to the MPEG-7 standard involves |
|
415 the below regular expression: |
|
416 \begin{verbatim} |
|
417 <sequence minOccurs="0" maxOccurs="65535"> |
|
418 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
|
419 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
|
420 </sequence> |
|
421 \end{verbatim} |
|
422 This is just a fancy way of writing the regular expression |
|
423 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
|
424 regular expressions |
|
425 satisfy certain constraints such as floating point number format. |
401 |
426 |
402 The problems are not limited to slowness on certain |
427 The problems are not limited to slowness on certain |
403 cases. |
428 cases. |
404 Another thing about these libraries is that there |
429 Another thing about these libraries is that there |
405 is no correctness guarantee. |
430 is no correctness guarantee. |