ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 602 46db6ae66448
parent 601 ce4e5151a836
child 603 370fe1dde7c7
equal deleted inserted replaced
601:ce4e5151a836 602:46db6ae66448
   206 involve lexing with regular 
   206 involve lexing with regular 
   207 expressions.
   207 expressions.
   208 Given its usefulness and ubiquity, one would imagine that
   208 Given its usefulness and ubiquity, one would imagine that
   209 modern regular expression matching implementations
   209 modern regular expression matching implementations
   210 are mature and fully studied.
   210 are mature and fully studied.
   211 Indeed, in a popular programming language' regex engine, 
   211 Indeed, in a popular programming language's regex engine, 
   212 supplying it with regular expressions and strings, one can
   212 supplying it with regular expressions and strings,
   213 get rich matching information in a very short time.
   213 in most cases one can
   214 Some network intrusion detection systems
   214 get the matching information in a very short time.
       
   215 Those matchers can be blindingly fast--some 
       
   216 network intrusion detection systems
   215 use regex engines that are able to process 
   217 use regex engines that are able to process 
   216 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   218 megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
   217 Unfortunately, this is not the case for $\mathbf{all}$ inputs.
   219 However, those matchers can exhibit a surprising security vulnerability
       
   220 under a certain class of inputs.
       
   221 %However, , this is not the case for $\mathbf{all}$ inputs.
   218 %TODO: get source for SNORT/BRO's regex matching engine/speed
   222 %TODO: get source for SNORT/BRO's regex matching engine/speed
   219 
   223 
   220 \begin{figure}[p]
   224 \begin{figure}[p]
   221 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   225 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   222 \begin{tikzpicture}
   226 \begin{tikzpicture}
   384 They therefore concluded that evil regular expressions
   388 They therefore concluded that evil regular expressions
   385 are problems "more than a parlour trick", but one that
   389 are problems "more than a parlour trick", but one that
   386 requires
   390 requires
   387 more research attention.
   391 more research attention.
   388 
   392 
       
   393 
       
   394 \ChristianComment{I am not totally sure where this sentence should be
       
   395 put, seems a little out-standing here.}
   389 Regular expressions and regular expression matchers 
   396 Regular expressions and regular expression matchers 
   390 have of course been studied for many, many years.
   397 have of course been studied for many, many years.
   391 One of the most recent work in the context of lexing
   398 One of the most recent work in the context of lexing
   392 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
   399 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
   393 This is relevant work and we will compare later on
   400 This is relevant work and we will compare later on
   394 our derivative-based matcher we are going to present.
   401 our derivative-based matcher we are going to present.
   395 There is also some newer work called
   402 There is also some newer work called
   396 Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead.
   403 Verbatim++\cite{Verbatimpp}, this does not use derivatives, but automaton instead.
   397 For that the problem is the bounded repetitions ($r^{n}$,
   404 For that the problem is dealing with the bounded regular expressions of the form
   398 $r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
   405 $r^{n}$ where $n$ is a constant specifying that $r$ must repeat
   399 They often occur in practical use, in for example the Snort XML definitions:
   406 exactly $n$ times.
   400 
   407 The other repetition constructs include
       
   408 $r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$ which respectively mean repeating
       
   409 at most $m$ times, repeating at least $n$ times and repeating between $n$ and $m$ times.
       
   410 Their formal definitions will be given later.
       
   411 Bounded repetitions are important because they
       
   412 tend to occur often in practical use\cite{xml2015}, for example in RegExLib,
       
   413 Snort, as well as in XML Schema definitions (XSDs).
       
   414 One XSD that seems to be related to the MPEG-7 standard involves
       
   415 the below regular expression:
       
   416 \begin{verbatim}
       
   417 <sequence minOccurs="0" maxOccurs="65535">
       
   418     <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
       
   419     <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
       
   420 </sequence>
       
   421 \end{verbatim}
       
   422 This is just a fancy way of writing the regular expression 
       
   423 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
       
   424 regular expressions 
       
   425 satisfy certain constraints such as floating point number format.
   401 
   426 
   402 The problems are not limited to slowness on certain 
   427 The problems are not limited to slowness on certain 
   403 cases. 
   428 cases. 
   404 Another thing about these libraries is that there
   429 Another thing about these libraries is that there
   405 is no correctness guarantee.
   430 is no correctness guarantee.