ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 612 8c234a1bc7e0
parent 609 61139fdddae0
child 618 233cf2b97d1a
equal deleted inserted replaced
611:bc1df466150a 612:8c234a1bc7e0
   199 
   199 
   200 Regular expressions are widely used in computer science: 
   200 Regular expressions are widely used in computer science: 
   201 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
   201 be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
   202 command-line tools like $\mathit{grep}$ that facilitate easy 
   202 command-line tools like $\mathit{grep}$ that facilitate easy 
   203 text-processing; network intrusion
   203 text-processing; network intrusion
   204 detection systems that reject suspicious traffic; or compiler
   204 detection systems that inspect suspicious traffic; or compiler
   205 front ends--the majority of the solutions to these tasks 
   205 front ends.
   206 involve lexing with regular 
   206 Given their usefulness and ubiquity, one would assume that
   207 expressions.
       
   208 Given its usefulness and ubiquity, one would imagine that
       
   209 modern regular expression matching implementations
   207 modern regular expression matching implementations
   210 are mature and fully studied.
   208 are mature and fully studied.
   211 Indeed, in a popular programming language's regex engine, 
   209 Indeed, in a popular programming language's regex engine, 
   212 supplying it with regular expressions and strings,
   210 supplying it with regular expressions and strings,
   213 in most cases one can
   211 in most cases one can
   220 under a certain class of inputs.
   218 under a certain class of inputs.
   221 %However, , this is not the case for $\mathbf{all}$ inputs.
   219 %However, , this is not the case for $\mathbf{all}$ inputs.
   222 %TODO: get source for SNORT/BRO's regex matching engine/speed
   220 %TODO: get source for SNORT/BRO's regex matching engine/speed
   223 
   221 
   224 
   222 
   225 Take $(a^*)^*\,b$ and ask whether
   223 Consider $(a^*)^*\,b$ and ask whether
   226 strings of the form $aa..a$ match this regular
   224 strings of the form $aa..a$ can be matched by this regular
   227 expression. Obviously this is not the case---the expected $b$ in the last
   225 expression. Obviously this is not the case---the expected $b$ in the last
   228 position is missing. One would expect that modern regular expression
   226 position is missing. One would expect that modern regular expression
   229 matching engines can find this out very quickly. Alas, if one tries
   227 matching engines can find this out very quickly. Surprisingly, if one tries
   230 this example in JavaScript, Python or Java 8, even with strings of a small
   228 this example in JavaScript, Python or Java 8, even with small strings, 
   231 length, say around 30 $a$'s,
   229 say of lenght of around 30 $a$'s,
   232 the decision takes crazy time to finish (graph \ref{fig:aStarStarb}).
   230 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}).
   233 This is clearly exponential behaviour, and 
   231 This is clearly exponential behaviour, and 
   234 is triggered by some relatively simple regex patterns.
   232 is triggered by some relatively simple regular expressions.
   235 Java 9 and newer
   233 Java 9 and newer
   236 versions improves this behaviour, but is still slow compared 
   234 versions improve this behaviour somewhat, but is still slow compared 
   237 with the approach we are going to use.
   235 with the approach we are going to use in this thesis.
   238 
       
   239 
   236 
   240 
   237 
   241 
   238 
   242 This superlinear blowup in regular expression engines
   239 This superlinear blowup in regular expression engines
   243 had repeatedly caused grief in real life that they
   240 had repeatedly caused grief in ``real life'' where it is 
   244 get a name for them--``catastrophic backtracking''.
   241 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
   245 For example, on 20 July 2016 one evil
   242 For example, on 20 July 2016 one evil
   246 regular expression brought the webpage
   243 regular expression brought the webpage
   247 \href{http://stackexchange.com}{Stack Exchange} to its
   244 \href{http://stackexchange.com}{Stack Exchange} to its
   248 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
   245 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)}
   249 In this instance, a regular expression intended to just trim white
   246 In this instance, a regular expression intended to just trim white
   255 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   252 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   256 attack and therefore stopped the servers from responding to any
   253 attack and therefore stopped the servers from responding to any
   257 requests. This made the whole site become unavailable. 
   254 requests. This made the whole site become unavailable. 
   258 
   255 
   259 \begin{figure}[p]
   256 \begin{figure}[p]
   260 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   257 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
   261 \begin{tikzpicture}
   258 \begin{tikzpicture}
   262 \begin{axis}[
   259 \begin{axis}[
   263     xlabel={$n$},
   260     xlabel={$n$},
   264     x label style={at={(1.05,-0.05)}},
   261     x label style={at={(1.05,-0.05)}},
   265     ylabel={time in secs},
   262     ylabel={time in secs},
   296     legend entries={Python},  
   293     legend entries={Python},  
   297     legend pos=north west,
   294     legend pos=north west,
   298     legend cell align=left]
   295     legend cell align=left]
   299 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
   296 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
   300 \end{axis}
   297 \end{axis}
   301 \end{tikzpicture}
   298 \end{tikzpicture}\\ 
   302   &
       
   303 \begin{tikzpicture}
       
   304 \begin{axis}[
       
   305     xlabel={$n$},
       
   306     x label style={at={(1.05,-0.05)}},
       
   307     %ylabel={time in secs},
       
   308     enlargelimits=false,
       
   309     xtick={0,5,...,30},
       
   310     xmax=33,
       
   311     ymax=35,
       
   312     ytick={0,5,...,30},
       
   313     scaled ticks=false,
       
   314     axis lines=left,
       
   315     width=5cm,
       
   316     height=4cm, 
       
   317     legend entries={Java 8},  
       
   318     legend pos=north west,
       
   319     legend cell align=left]
       
   320 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   321 \end{axis}
       
   322 \end{tikzpicture}\\
       
   323 \begin{tikzpicture}
   299 \begin{tikzpicture}
   324 \begin{axis}[
   300 \begin{axis}[
   325     xlabel={$n$},
   301     xlabel={$n$},
   326     x label style={at={(1.05,-0.05)}},
   302     x label style={at={(1.05,-0.05)}},
   327     ylabel={time in secs},
   303     ylabel={time in secs},
   332     ytick={0,5,...,30},
   308     ytick={0,5,...,30},
   333     scaled ticks=false,
   309     scaled ticks=false,
   334     axis lines=left,
   310     axis lines=left,
   335     width=5cm,
   311     width=5cm,
   336     height=4cm, 
   312     height=4cm, 
   337     legend entries={Dart},  
   313     legend entries={Java 8},  
   338     legend pos=north west,
   314     legend pos=north west,
   339     legend cell align=left]
   315     legend cell align=left]
   340 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
   316 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
   341 \end{axis}
   317 \end{axis}
   342 \end{tikzpicture}
   318 \end{tikzpicture}
   343   &
   319   &
   344 \begin{tikzpicture}
   320 \begin{tikzpicture}
   345 \begin{axis}[
   321 \begin{axis}[
   353     ytick={0,5,...,30},
   329     ytick={0,5,...,30},
   354     scaled ticks=false,
   330     scaled ticks=false,
   355     axis lines=left,
   331     axis lines=left,
   356     width=5cm,
   332     width=5cm,
   357     height=4cm, 
   333     height=4cm, 
       
   334     legend entries={Dart},  
       
   335     legend pos=north west,
       
   336     legend cell align=left]
       
   337 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
       
   338 \end{axis}
       
   339 \end{tikzpicture}\\ 
       
   340 \begin{tikzpicture}
       
   341 \begin{axis}[
       
   342     xlabel={$n$},
       
   343     x label style={at={(1.05,-0.05)}},
       
   344     ylabel={time in secs},
       
   345     enlargelimits=false,
       
   346     xtick={0,5,...,30},
       
   347     xmax=33,
       
   348     ymax=35,
       
   349     ytick={0,5,...,30},
       
   350     scaled ticks=false,
       
   351     axis lines=left,
       
   352     width=5cm,
       
   353     height=4cm, 
   358     legend entries={Swift},  
   354     legend entries={Swift},  
   359     legend pos=north west,
   355     legend pos=north west,
   360     legend cell align=left]
   356     legend cell align=left]
   361 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
   357 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
   362 \end{axis}
   358 \end{axis}
   363 \end{tikzpicture}
   359 \end{tikzpicture}
   364   & \\
   360   & 
   365 \multicolumn{3}{c}{Graphs}
   361 \begin{tikzpicture}
       
   362 \begin{axis}[
       
   363     xlabel={$n$},
       
   364     x label style={at={(1.05,-0.05)}},
       
   365     %ylabel={time in secs},
       
   366     enlargelimits=true,
       
   367     %xtick={0,5000,...,40000},
       
   368     %xmax=40000,
       
   369     %ymax=35,
       
   370     restrict x to domain*=0:40000,
       
   371     restrict y to domain*=0:35,
       
   372     %ytick={0,5,...,30},
       
   373     %scaled ticks=false,
       
   374     axis lines=left,
       
   375     width=5cm,
       
   376     height=4cm, 
       
   377     legend entries={Java9+},  
       
   378     legend pos=north west,
       
   379     legend cell align=left]
       
   380 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
       
   381 \end{axis}
       
   382 \end{tikzpicture}\\ 
       
   383 \multicolumn{2}{c}{Graphs}
   366 \end{tabular}    
   384 \end{tabular}    
   367 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
   385 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
   368            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   386            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   369    The reason for their superlinear behaviour is that they do a depth-first-search.
   387    The reason for their superlinear behaviour is that they do a depth-first-search
   370    If the string does not match, the engine starts to explore all possibilities. 
   388    using NFAs.
       
   389    If the string does not match, the regular expression matching
       
   390    engine starts to explore all possibilities. 
   371 }\label{fig:aStarStarb}
   391 }\label{fig:aStarStarb}
   372 \end{figure}\afterpage{\clearpage}
   392 \end{figure}\afterpage{\clearpage}
   373 
   393 
   374 A more recent example is a global outage of all Cloudflare servers on 2 July
   394 A more recent example is a global outage of all Cloudflare servers on 2 July
   375 2019. A poorly written regular expression exhibited exponential
   395 2019. A poorly written regular expression exhibited catastrophic backtracking
   376 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
   396 and exhausted CPUs that serve HTTP traffic. Although the outage
   377 had several causes, at the heart was a regular expression that
   397 had several causes, at the heart was a regular expression that
   378 was used to monitor network
   398 was used to monitor network
   379 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
   399 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
   380 These problems with regular expressions 
   400 These problems with regular expressions 
   381 are not isolated events that happen
   401 are not isolated events that happen
   382 very occasionally, but actually widespread.
   402 very occasionally, but actually widespread.
   383 They occur so often that they get a 
   403 They occur so often that they have a 
   384 name--Regular-Expression-Denial-Of-Service (ReDoS)
   404 name: Regular-Expression-Denial-Of-Service (ReDoS)
   385 attack.
   405 attack.
   386 \citeauthor{Davis18} detected more
   406 \citeauthor{Davis18} detected more
   387 than 1000 super-linear (SL) regular expressions
   407 than 1000 evil regular expressions
   388 in Node.js, Python core libraries, and npm and pypi. 
   408 in Node.js, Python core libraries, npm and in pypi. 
   389 They therefore concluded that evil regular expressions
   409 They therefore concluded that evil regular expressions
   390 are problems "more than a parlour trick", but one that
   410 are real problems rather than "a parlour trick".
   391 requires
       
   392 more research attention.
       
   393 
   411 
   394 This work aims to address this issue
   412 This work aims to address this issue
   395 with the help of formal proofs.
   413 with the help of formal proofs.
   396 We offer a lexing algorithm based
   414 We describe a lexing algorithm based
   397 on Brzozowski derivatives with certified correctness (in 
   415 on Brzozowski derivatives with verified correctness (in 
   398 Isabelle/HOL)
   416 Isabelle/HOL)
   399 and finiteness property.
   417 and a finiteness property.
   400 Such properties guarantee the absence of 
   418 Such properties %guarantee the absence of 
   401 catastrophic backtracking in most cases.
   419 are an important step in preventing
       
   420 catastrophic backtracking once and for all.
   402 We will give more details in the next sections
   421 We will give more details in the next sections
   403 on (i) why the slow cases in graph \ref{fig:aStarStarb}
   422 on (i) why the slow cases in graph \ref{fig:aStarStarb}
   404 can occur
   423 can occur in traditional regular expression engines
   405 and (ii) why we choose our 
   424 and (ii) why we choose our 
   406 approach (Brzozowski derivatives and formal proofs).
   425 approach based on Brzozowski derivatives and formal proofs.
   407 
   426 
   408 
   427 
   409 \section{Regex, and the Problems with Regex Matchers}
   428 \section{Preliminaries}%Regex, and the Problems with Regex Matchers}
   410 Regular expressions and regular expression matchers 
   429 Regular expressions and regular expression matchers 
   411 have of course been studied for many, many years.
   430 have of course been studied for many, many years.
   412 Theoretical results in automata theory say
   431 Theoretical results in automata theory state 
   413 that basic regular expression matching should be linear
   432 that basic regular expression matching should be linear
   414 w.r.t the input.
   433 w.r.t the input.
   415 This assumes that the regular expression
   434 This assumes that the regular expression
   416 $r$ was pre-processed and turned into a
   435 $r$ was pre-processed and turned into a
   417 deterministic finite automata (DFA) before matching,
   436 deterministic finite automaton (DFA) before matching\cite{Sakarovitch2009}.
   418 which could be exponential\cite{Sakarovitch2009}.
       
   419 By basic we mean textbook definitions such as the one
   437 By basic we mean textbook definitions such as the one
   420 below, involving only characters, alternatives,
   438 below, involving only regular expressions for characters, alternatives,
   421 sequences, and Kleene stars:
   439 sequences, and Kleene stars:
   422 \[
   440 \[
   423 	r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
   441 	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
   424 \]
   442 \]
   425 Modern regular expression matchers used by programmers,
   443 Modern regular expression matchers used by programmers,
   426 however,
   444 however,
   427 support richer constructs such as bounded repetitions
   445 support much richer constructs, such as bounded repetitions
   428 and back-references.
   446 and back-references.
   429 To differentiate, people use the word \emph{regex} to refer
   447 To differentiate, we use the word \emph{regex} to refer
   430 to those expressions with richer constructs while reserving the
   448 to those expressions with richer constructs while reserving the
   431 term \emph{regular expression}
   449 term \emph{regular expression}
   432 for the more traditional meaning in formal languages theory.
   450 for the more traditional meaning in formal languages theory.
   433 We follow this convention 
   451 We follow this convention 
   434 in this thesis.
   452 in this thesis.
   435 In the future, we aim to support all the popular features of regexes, 
   453 In the future, we aim to support all the popular features of regexes, 
   436 but for this work we mainly look at regular expressions.
   454 but for this work we mainly look at basic regular expressions
       
   455 and bounded repetitions.
   437 
   456 
   438 
   457 
   439 
   458 
   440 %Most modern regex libraries
   459 %Most modern regex libraries
   441 %the so-called PCRE standard (Peral Compatible Regular Expressions)
   460 %the so-called PCRE standard (Peral Compatible Regular Expressions)
   442 %has the back-references
   461 %has the back-references
   443 Regexes come with a lot of constructs
   462 Regexes come with a number of constructs
   444 beyond the basic ones
       
   445 that make it more convenient for 
   463 that make it more convenient for 
   446 programmers to write regular expressions.
   464 programmers to write regular expressions.
   447 Depending on the types of these constructs
   465 Depending on the types of constructs
   448 the task of matching and lexing with them
   466 the task of matching and lexing with them
   449 will have different levels of complexity increase.
   467 will have different levels of complexity.
   450 Some of those constructs are syntactic sugars that are
   468 Some of those constructs are just syntactic sugars that are
   451 simply short hand notations
   469 simply short hand notations
   452 that save the programmers a few keystrokes.
   470 that save the programmers a few keystrokes.
   453 These will not cause trouble for regex libraries.
   471 These will not cause problems for regex libraries.
   454 
       
   455 \noindent
       
   456 For example the
   472 For example the
   457 non-binary alternative involving three or more choices:
   473 non-binary alternative involving three or more choices just means:
   458 \[
   474 \[
   459 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
   475 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
   460 \]
   476 \]
   461 the range operator $-$ used to express the alternative
   477 Similarly, the range operator used to express the alternative
   462 of all characters between its operands in a concise way:
   478 of all characters between its operands is just a concise way:
   463 \[
   479 \[
   464 	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
   480 	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
   465 \]
   481 \]
   466 and the
   482 for an alternative. The
   467 wildcard character $.$ used to refer to any single character:
   483 wildcard character $.$ is used to refer to any single character,
   468 \[
   484 \[
   469 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
   485 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
   470 \]
   486 \]
   471 
   487 except the newline.
   472 \noindent
   488 
   473 \subsection{Bounded Repetitions}
   489 \subsection{Bounded Repetitions}
   474 Some of those constructs do make the expressions much
   490 More interesting are bounded repetitions, which can 
       
   491 make the regular expressions much
   475 more compact.
   492 more compact.
   476 For example, the bounded regular expressions
   493 There are 
   477 (where $n$ and $m$ are constant natural numbers)
   494 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
   478 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
   495 (where $n$ and $m$ are constant natural numbers).
   479 defined as 
   496 Like the star regular expressions, the set of strings or language
       
   497 a bounded regular expression can match
       
   498 is defined using the power operation on sets:
   480 \begin{center}
   499 \begin{center}
   481 	\begin{tabular}{lcl}
   500 	\begin{tabular}{lcl}
   482 		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
   501 		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
   483 		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
   502 		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
   484 		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
   503 		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
   485 		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
   504 		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
   486 	\end{tabular}
   505 	\end{tabular}
   487 \end{center}
   506 \end{center}
   488 are exponentially smaller compared with
   507 The attraction of bounded repetitions is that they can be
   489 their unfolded form: for example $r^{\{n\}}$
   508 used to avoid a blow up: for example $r^{\{n\}}$
   490 as opposed to
   509 is a shorthand for
   491 \[
   510 \[
   492 	\underbrace{r\ldots r}_\text{n copies of r}.
   511 	\underbrace{r\ldots r}_\text{n copies of r}.
   493 \]
   512 \]
   494 %Therefore, a naive algorithm that simply unfolds
   513 %Therefore, a naive algorithm that simply unfolds
   495 %them into their desugared forms
   514 %them into their desugared forms
   496 %will suffer from at least an exponential runtime increase.
   515 %will suffer from at least an exponential runtime increase.
   497 
   516 
   498 The problem here is that tools based on the classic notion of
   517 
   499 automata need to expand $r^{n}$ into $n$ connected 
   518 The problem with matching 
       
   519 is that tools based on the classic notion of
       
   520 automata need to expand $r^{\{n\}}$ into $n$ connected 
   500 copies of the automaton for $r$. This leads to very inefficient matching
   521 copies of the automaton for $r$. This leads to very inefficient matching
   501 algorithms  or algorithms that consume large amounts of memory.
   522 algorithms  or algorithms that consume large amounts of memory.
   502 Implementations using $\DFA$s will
   523 Implementations using $\DFA$s will
   503 either become excruciatingly slow 
   524 either become excruciatingly slow 
   504 (for example Verbatim++\cite{Verbatimpp}) or get
   525 (for example Verbatim++\cite{Verbatimpp}) or get
   510 gets an output program encoding a minimized $\mathit{DFA}$
   531 gets an output program encoding a minimized $\mathit{DFA}$
   511 that can be compiled and run. 
   532 that can be compiled and run. 
   512 When given the above countdown regular expression,
   533 When given the above countdown regular expression,
   513 a small $n$ (a few dozen) would result in a 
   534 a small $n$ (a few dozen) would result in a 
   514 determinised automata
   535 determinised automata
   515 with millions of states.}) under large counters.
   536 with millions of states.}) for large counters.
   516 A classic example is the regular expression $(a+b)^*  a (a+b)^{n}$
   537 A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
   517 where the minimal DFA requires at least $2^{n+1}$ states.
   538 where the minimal DFA requires at least $2^{n+1}$ states.
   518 For example, when $n$ is equal to 2,
   539 For example, when $n$ is equal to 2,
   519 an $\mathit{NFA}$ describing it would look like:
   540 The corresponding $\mathit{NFA}$ looks like:
   520 \begin{center}
   541 \begin{center}
   521 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
   542 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
   522    \node[state,initial] (q_0)   {$q_0$}; 
   543    \node[state,initial] (q_0)   {$q_0$}; 
   523    \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
   544    \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
   524    \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
   545    \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
   528     	  edge [loop below] node {a,b} ()
   549     	  edge [loop below] node {a,b} ()
   529     (q_1) edge  node  {a,b} (q_2)
   550     (q_1) edge  node  {a,b} (q_2)
   530     (q_2) edge  node  {a,b} (q_3);
   551     (q_2) edge  node  {a,b} (q_3);
   531 \end{tikzpicture}
   552 \end{tikzpicture}
   532 \end{center}
   553 \end{center}
   533 which requires at least $2^3$ states
   554 when turned into a DFA by the subset construction
   534 for its subset construction.\footnote{The 
   555 requires at least $2^3$ states.\footnote{The 
   535 red states are "countdown states" which counts down 
   556 red states are "countdown states" which counts down 
   536 the number of characters needed in addition to the current
   557 the number of characters needed in addition to the current
   537 string to make a successful match.
   558 string to make a successful match.
   538 For example, state $q_1$ indicates a match that has
   559 For example, state $q_1$ indicates a match that has
   539 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
   560 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
   558 would require at least $2^{n+1}$ states, if $r$ itself contains
   579 would require at least $2^{n+1}$ states, if $r$ itself contains
   559 more than 1 string.
   580 more than 1 string.
   560 This is to represent all different 
   581 This is to represent all different 
   561 scenarios which "countdown" states are active.}
   582 scenarios which "countdown" states are active.}
   562 
   583 
   563 One of the most recent work in the context of lexing
       
   564 %with this issue
       
   565 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
       
   566 This is relevant work and we will compare later on
       
   567 our derivative-based matcher we are going to present.
       
   568 There is also some newer work called
       
   569 Verbatim++\cite{Verbatimpp}, which does not use derivatives, 
       
   570 but deterministic finite automaton instead.
       
   571 %An example that gives problem to automaton approaches would be
       
   572 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
       
   573 %It requires at least $2^{n+1}$ states to represent
       
   574 %as a DFA.
       
   575 
       
   576 
   584 
   577 Bounded repetitions are very important because they
   585 Bounded repetitions are very important because they
   578 tend to occur a lot in practical use.
   586 tend to occur a lot in practical use,
   579 For example in the regex library RegExLib,
   587 for example in the regex library RegExLib,
   580 the rules library of Snort \cite{Snort1999}\footnote{
   588 the rules library of Snort \cite{Snort1999}\footnote{
   581 Snort is a network intrusion detection (NID) tool
   589 Snort is a network intrusion detection (NID) tool
   582 for monitoring network traffic.
   590 for monitoring network traffic.
   583 The network security community curates a list
   591 The network security community curates a list
   584 of malicious patterns written as regexes,
   592 of malicious patterns written as regexes,
   586 to match against network traffic for any hostile
   594 to match against network traffic for any hostile
   587 activities such as buffer overflow attacks.}, 
   595 activities such as buffer overflow attacks.}, 
   588 as well as in XML Schema definitions (XSDs).
   596 as well as in XML Schema definitions (XSDs).
   589 According to Bj\"{o}rklund et al \cite{xml2015},
   597 According to Bj\"{o}rklund et al \cite{xml2015},
   590 more than half of the 
   598 more than half of the 
   591 XSDs they found have bounded regular expressions in them.
   599 XSDs they found on the Maven.org central repository
   592 Often the counters are quite large, the largest up to ten million. 
   600 have bounded regular expressions in them.
       
   601 Often the counters are quite large, with the largest being
       
   602 approximately up to ten million. 
   593 An example XSD they gave
   603 An example XSD they gave
   594 was:
   604 is:
   595 %\begin{verbatim}
   605 \begin{verbatim}
   596 %<sequence minOccurs="0" maxOccurs="65535">
   606 <sequence minOccurs="0" maxOccurs="65535">
   597 %  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
   607  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
   598 %  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
   608  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
   599 %</sequence>
   609 </sequence>
   600 %\end{verbatim}
   610 \end{verbatim}
   601 This can be seen as the expression 
   611 This can be seen as the expression 
   602 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
   612 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
   603 regular expressions 
   613 regular expressions 
   604 satisfying certain constraints (such as 
   614 satisfying certain constraints (such as 
   605 satisfying the floating point number format).
   615 satisfying the floating point number format).
   606 
       
   607 It is therefore quite unsatisfying that 
   616 It is therefore quite unsatisfying that 
   608 some regular expressions matching libraries
   617 some regular expressions matching libraries
   609 impose adhoc limits
   618 impose adhoc limits
   610 for bounded regular expressions:
   619 for bounded regular expressions:
   611 For example, in the regular expression matching library in the Go
   620 For example, in the regular expression matching library in the Go
   613 can be above 1000, and in the built-in Rust regular expression library
   622 can be above 1000, and in the built-in Rust regular expression library
   614 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
   623 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
   615 for being too big. 
   624 for being too big. 
   616 As Becchi and Crawley\cite{Becchi08}  have pointed out,
   625 As Becchi and Crawley\cite{Becchi08}  have pointed out,
   617 the reason for these restrictions
   626 the reason for these restrictions
   618 are that they simulate a non-deterministic finite
   627 is that they simulate a non-deterministic finite
   619 automata (NFA) with a breadth-first search.
   628 automata (NFA) with a breadth-first search.
   620 This way the number of active states could
   629 This way the number of active states could
   621 be equal to the counter number.
   630 be equal to the counter number.
   622 When the counters are large, 
   631 When the counters are large, 
   623 the memory requirement could become
   632 the memory requirement could become
   624 infeasible, and the pattern needs to be rejected straight away.
   633 infeasible, and a regex engine
       
   634 like Go will reject this pattern straight away.
   625 \begin{figure}[H]
   635 \begin{figure}[H]
   626 \begin{center}
   636 \begin{center}
   627 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
   637 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
   628  
   638  
   629     	\node (q0) [state, initial] {$0$};
   639     	\node (q0) [state, initial] {$0$};
   630 	\node (q1) [state, right = of q0] {$1$};
   640 	\node (q1) [state, right = of q0] {$1$};
   631 	\node (q2) [state, right = of q1] {$2$};
   641 	%\node (q2) [state, right = of q1] {$2$};
   632 	\node (qdots) [right = of q2] {$\ldots$};
   642 	\node (qdots) [right = of q1] {$\ldots$};
   633 	\node (qn) [state, right = of qdots] {$n$};
   643 	\node (qn) [state, right = of qdots] {$n$};
   634 	\node (qn1) [state, right = of qn] {$n+1$};
   644 	\node (qn1) [state, right = of qn] {$n+1$};
   635 	\node (qn2) [state, right = of qn1] {$n+2$};
   645 	\node (qn2) [state, right = of qn1] {$n+2$};
   636 	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
   646 	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
   637  
   647  
   638 \path [-stealth, thick]
   648 \path [-stealth, thick]
   639 	(q0) edge [loop above] node {a} ()
   649 	(q0) edge [loop above] node {a} ()
   640     (q0) edge node {a}   (q1) 
   650     (q0) edge node {a}   (q1) 
   641     (q1) edge node {.}   (q2)
   651     %(q1) edge node {.}   (q2)
   642     (q2) edge node {.}   (qdots)
   652     (q1) edge node {.}   (qdots)
   643     (qdots) edge node {.} (qn)
   653     (qdots) edge node {.} (qn)
   644     (qn) edge node {.} (qn1)
   654     (qn) edge node {.} (qn1)
   645     (qn1) edge node {b} (qn2)
   655     (qn1) edge node {b} (qn2)
   646     (qn2) edge node {$c$} (qn3);
   656     (qn2) edge node {$c$} (qn3);
   647 \end{tikzpicture}
   657 \end{tikzpicture}
   670 %in terms of input string length.
   680 %in terms of input string length.
   671 %TODO:try out these lexers
   681 %TODO:try out these lexers
   672 These problems can of course be solved in matching algorithms where 
   682 These problems can of course be solved in matching algorithms where 
   673 automata go beyond the classic notion and for instance include explicit
   683 automata go beyond the classic notion and for instance include explicit
   674 counters \cite{Turo_ov__2020}.
   684 counters \cite{Turo_ov__2020}.
   675 These solutions can be quite effective,
   685 These solutions can be quite efficient,
   676 with the ability to process
   686 with the ability to process
   677 gigabytes of string input per second
   687 gigabytes of strings input per second
   678 even with large counters \cite{Becchi08}.
   688 even with large counters \cite{Becchi08}.
   679 But formally reasoning about these automata can be challenging
   689 But formal reasoning about these automata especially in Isabelle 
   680 and un-intuitive.
   690 can be challenging
   681 Therefore, correctness and runtime claims made about these solutions need to be
   691 and un-intuitive. 
   682 taken with a grain of salt.
   692 Therefore, we take correctness and runtime claims made about these solutions
       
   693 with a grain of salt.
   683 
   694 
   684 In the work reported in \cite{CSL2022} and here, 
   695 In the work reported in \cite{CSL2022} and here, 
   685 we add better support using derivatives
   696 we add better support using derivatives
   686 for bounded regular expressions $r^{\{n\}}$.
   697 for bounded regular expressions $r^{\{n\}}$.
   687 The results
   698 The results
  1178 This thesis implements the aggressive simplifications envisioned
  1189 This thesis implements the aggressive simplifications envisioned
  1179 by Ausaf and Urban,
  1190 by Ausaf and Urban,
  1180 together with a formal proof of the correctness with those simplifications.
  1191 together with a formal proof of the correctness with those simplifications.
  1181 
  1192 
  1182 
  1193 
       
  1194 One of the most recent work in the context of lexing
       
  1195 %with this issue
       
  1196 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
       
  1197 This is relevant work for us and we will compare it later with 
       
  1198 our derivative-based matcher we are going to present.
       
  1199 There is also some newer work called
       
  1200 Verbatim++\cite{Verbatimpp}, which does not use derivatives, 
       
  1201 but deterministic finite automaton instead.
       
  1202 %An example that gives problem to automaton approaches would be
       
  1203 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
       
  1204 %It requires at least $2^{n+1}$ states to represent
       
  1205 %as a DFA.
       
  1206 
       
  1207 
  1183 %----------------------------------------------------------------------------------------
  1208 %----------------------------------------------------------------------------------------
  1184 \section{Contribution}
  1209 \section{Contribution}
  1185 
  1210 
  1186 In this thesis,
  1211 In this thesis,
  1187 we propose a solution to catastrophic
  1212 we propose a solution to catastrophic