ChengsongTanPhdThesis/Chapters/Introduction.tex
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
However, those matchers can exhibit a surprising security vulnerability
under a certain class of inputs.
%However, this is not the case for $\mathbf{all}$ inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed

Take $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8, even with strings of a small
length, say around 30 $a$'s,
the decision takes an unreasonably long time to finish (see the graph in
Figure~\ref{fig:aStarStarb}).
This is clearly exponential behaviour, and
is triggered by some relatively simple regex patterns.
Java 9 and newer
versions improve this behaviour, but are still slow compared
with the approach we are going to use.

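This behaviour is easy to reproduce. The following short Python snippet,
a minimal sketch written purely for illustration using only the
standard-library \texttt{re} and \texttt{time} modules, times the failing
match of $(a^*)^*\,b$ against strings of $a$'s; the runtime roughly
doubles with every additional $a$:

\begin{verbatim}
# Minimal sketch: timing (a*)*b against 'a'*n with Python's
# backtracking re engine; expect roughly exponential growth.
import re
import time

pattern = re.compile(r'(a*)*b')

for n in range(15, 26):
    start = time.perf_counter()
    # the input contains no 'b', so the match must fail, but only
    # after the engine has explored a huge number of ways to split
    # the a's between the two stars
    assert pattern.match('a' * n) is None
    print(n, round(time.perf_counter() - start, 3), 'seconds')
\end{verbatim}
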
This superlinear blowup in regular expression engines
has repeatedly caused grief in real life, so much so that
it has been given a name: ``catastrophic backtracking''.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} (Last accessed in 2019)}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources---causing web servers to grind to a
halt. In this example, the time needed to process
the string was $O(n^2)$ with respect to the string length. This
quadratic overhead was enough for the homepage of Stack Exchange to
respond so slowly that the load balancer assumed a $\mathit{DoS}$
attack and therefore stopped the servers from responding to any
requests. This made the whole site unavailable.

\begin{figure}[p]
% [three-column tabular of tikzpicture timing plots elided in this excerpt]
\caption{% [beginning of caption elided in this excerpt]
   The reason for their superlinear behaviour is that they do a depth-first search.
   If the string does not match, the engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}

A more recent example is a global outage of all Cloudflare servers on 2 July
2019. A poorly written regular expression exhibited exponential
behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
% [lines elided in this excerpt]
They therefore concluded that evil regular expressions
are a problem ``more than a parlour trick'', but one that
requires
more research attention.

This work aims to address this issue
with the help of formal proofs.
We offer a lexing algorithm based
on Brzozowski derivatives with certified correctness and a certified
finiteness property (both proved in Isabelle/HOL).
These properties guarantee the absence of
catastrophic backtracking in most cases.
We will explain in more detail why we chose our
approach (Brzozowski derivatives and formal proofs)
in the next sections.


\section{The Problem with Bounded Repetitions}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work and we will compare it later on
with the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives, but automata instead.
For that, the problem is dealing with bounded regular expressions of the form
$r^{n}$ where $n$ is a constant specifying that $r$ must repeat
exactly $n$ times. The Verbatim++ lexer becomes excruciatingly slow
on the bounded repetitions construct.

In the work reported in \cite{CSL2022} and here, we add better support
for them.
The other repetition constructs include
$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$, which specify
intervals for how many times $r$ should match.
$r^{\ldots m}$ means repeating
at most $m$ times, $r^{n\ldots}$ means repeating at least $n$ times and
$r^{n\ldots m}$ means repeating between $n$ and $m$ times.
The results presented in this thesis extend straightforwardly to them
too.
Their formal definitions will be given later.
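As a preview, using our notation and anticipating those formal
definitions, the intended languages are as follows, where $L(r)^k$
denotes the $k$-fold concatenation of the language $L(r)$ with itself:
\[
L(r^{\ldots m}) \,=\, \bigcup_{0 \le k \le m} L(r)^k
\qquad
L(r^{n\ldots}) \,=\, \bigcup_{n \le k} L(r)^k
\qquad
L(r^{n\ldots m}) \,=\, \bigcup_{n \le k \le m} L(r)^k
\]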

Bounded repetitions are important because they
tend to occur often in practical use, for example in RegExLib,
Snort, as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
bounded regular expressions occur frequently in the latter and can have
counters up to ten million. An example XSD with a large counter they gave
was:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
    <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
    <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
the floating point number format).
The problem here is that tools based on the classic notion of
automata need to expand $r^{n}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
A classic example is the regular expression $(a+b)^*\,a\,(a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states (more on this
later).
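
To make this state blowup concrete, the following Python sketch (again
written purely for illustration) determinises the obvious NFA for
$(a+b)^*\,a\,(a+b)^{n}$ via the subset construction and counts the
reachable states; for this particular language the count coincides with
the size of the minimal DFA, namely $2^{n+1}$:

\begin{verbatim}
# Sketch: subset construction for the NFA of (a|b)* a (a|b)^n.
# NFA states 0..n+1: state 0 loops on a and b, and additionally
# moves to 1 on an a; states 1..n advance on any symbol; state
# n+1 is accepting.
def dfa_size(n):
    def step(states, c):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)
                if c == 'a':
                    nxt.add(1)
            elif s <= n:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                       # explore all reachable subsets
        st = todo.pop()
        for c in 'ab':
            t = step(st, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

for n in range(1, 11):
    print(n, dfa_size(n))             # prints n and 2^(n+1)
\end{verbatim}
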
Therefore regular expression matching libraries that rely on the classic
notion of DFAs often impose ad hoc limits
for bounded regular expressions:
for example, in the regular expression matching library in the Go
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big. These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
straightforwardly extended to deal with bounded regular expressions
and moreover the resulting code still consists of only simple
recursive functions and inductive datatypes.
Finally, bounded repetitions do not destroy our finite
boundedness property, which we shall prove later on.

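To illustrate how straightforward this extension is, the following
minimal Python sketch (purely illustrative; it is emphatically not the
verified Isabelle/HOL development of this thesis) adds a constructor
\texttt{NTimes} for $r^{\{n\}}$ to a textbook implementation of
Brzozowski derivatives. Note that the derivative simply decrements the
counter instead of expanding $r^{n}$ into $n$ copies:

\begin{verbatim}
# Sketch: Brzozowski derivatives with bounded repetition r{n}.
# Requires Python 3.10+ (structural pattern matching).
from dataclasses import dataclass

class Rexp: pass

@dataclass
class Zero(Rexp): pass               # matches no string
@dataclass
class One(Rexp): pass                # matches only the empty string
@dataclass
class Chr(Rexp): c: str              # matches the character c
@dataclass
class Alt(Rexp): r1: Rexp; r2: Rexp  # alternative  r1 + r2
@dataclass
class Seq(Rexp): r1: Rexp; r2: Rexp  # sequence     r1 . r2
@dataclass
class Star(Rexp): r: Rexp            # Kleene star  r*
@dataclass
class NTimes(Rexp): r: Rexp; n: int  # bounded repetition r{n}

def nullable(r):            # does r match the empty string?
    match r:
        case One() | Star(_):   return True
        case Zero() | Chr(_):   return False
        case Alt(r1, r2):       return nullable(r1) or nullable(r2)
        case Seq(r1, r2):       return nullable(r1) and nullable(r2)
        case NTimes(r1, n):     return n == 0 or nullable(r1)

def der(c, r):              # derivative of r with respect to c
    match r:
        case Zero() | One():    return Zero()
        case Chr(d):            return One() if d == c else Zero()
        case Alt(r1, r2):       return Alt(der(c, r1), der(c, r2))
        case Seq(r1, r2) if nullable(r1):
            return Alt(Seq(der(c, r1), r2), der(c, r2))
        case Seq(r1, r2):       return Seq(der(c, r1), r2)
        case Star(r1):          return Seq(der(c, r1), Star(r1))
        case NTimes(_, 0):      return Zero()
        case NTimes(r1, n):     # decrement the counter, no expansion
            return Seq(der(c, r1), NTimes(r1, n - 1))

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

print(matches(NTimes(Chr('a'), 3), "aaa"))  # True
print(matches(NTimes(Chr('a'), 3), "aa"))   # False
\end{verbatim}

Simplification, which is essential for keeping the derivatives small,
is omitted from this sketch.
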

\section{The Terminology Regex, and why Backtracking?}
Shouldn't regular expression matching be linear?
How can one explain the super-linear behaviour of the
regex matching engines we have?
The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in
these phases.
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second (matching) phase.


\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regular expressions are not limited to slowness on certain
cases.
Another thing about these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
% [lines elided in this excerpt]
certain regex engines are running horribly slow on the ``catastrophic''
cases and propose a solution that addresses both of these problems
based on Brzozowski and Sulzmann and Lu's work.

\subsection{Different Phases of a Matching/Lexing Algorithm}


Most lexing algorithms can be roughly divided into