ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 635 7ce2389dff4b
parent 634 b079aaee5e10
child 636 0bcb4a7cb40c
equal deleted inserted replaced
634:b079aaee5e10 635:7ce2389dff4b
   419 but they occur actually often enough that they have a
   419 but they occur actually often enough that they have a
   420 name: Regular-Expression-Denial-Of-Service (ReDoS)
   420 name: Regular-Expression-Denial-Of-Service (ReDoS)
   421 attacks.
   421 attacks.
   422 Davis et al. \cite{Davis18} detected more
   422 Davis et al. \cite{Davis18} detected more
   423 than 1000 evil regular expressions
   423 than 1000 evil regular expressions
   424 in Node.js, Python core libraries, npm and in pypi. 
   424 in Node.js, Python core libraries, npm and pypi. 
   425 They therefore concluded that evil regular expressions
   425 They therefore concluded that evil regular expressions
   426 are real problems rather than just "a parlour trick".
   426 are real problems rather than just "a parlour trick".
   427 
   427 
   428 This work aims to address this issue
   428 This work aims to address this issue
   429 with the help of formal proofs.
   429 with the help of formal proofs.
   430 We describe a lexing algorithm based
   430 We describe a lexing algorithm based
   431 on Brzozowski derivatives with verified correctness (in 
   431 on Brzozowski derivatives with verified correctness 
   432 Isabelle/HOL)
   432 and a finiteness property for the size of derivatives
   433 and a finiteness property for the size of derivatives.
   433 (which are all done in Isabelle/HOL).
   434 Such properties %guarantee the absence of 
   434 Such properties %guarantee the absence of 
   435 are an important step in preventing
   435 are an important step in preventing
   436 catastrophic backtracking once and for all.
   436 catastrophic backtracking once and for all.
   437 We will give more details in the next sections
   437 We will give more details in the next sections
   438 on (i) why the slow cases in graph \ref{fig:aStarStarb}
   438 on (i) why the slow cases in graph \ref{fig:aStarStarb}
   441 approach based on Brzozowski derivatives and formal proofs.
   441 approach based on Brzozowski derivatives and formal proofs.
   442 
   442 
   443 
   443 
   444 \section{Preliminaries}%Regex, and the Problems with Regex Matchers}
   444 \section{Preliminaries}%Regex, and the Problems with Regex Matchers}
   445 Regular expressions and regular expression matchers 
   445 Regular expressions and regular expression matchers 
   446 have of course been studied for many, many years.
   446 have clearly been studied for many, many years.
   447 Theoretical results in automata theory state 
   447 Theoretical results in automata theory state 
   448 that basic regular expression matching should be linear
   448 that basic regular expression matching should be linear
   449 w.r.t the input.
   449 w.r.t the input.
   450 This assumes that the regular expression
   450 This assumes that the regular expression
   451 $r$ was pre-processed and turned into a
   451 $r$ was pre-processed and turned into a
   452 deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
   452 deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
   453 By basic we mean textbook definitions such as the one
   453 By basic we mean textbook definitions such as the one
   454 below, involving only regular expressions for characters, alternatives,
   454 below, involving only regular expressions for characters, alternatives,
   455 sequences, and Kleene stars:
   455 sequences, and Kleene stars:
   456 \[
   456 \[
   457 	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
   457 	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
   458 \]
   458 \]
   459 Modern regular expression matchers used by programmers,
   459 Modern regular expression matchers used by programmers,
   460 however,
   460 however,
   461 support much richer constructs, such as bounded repetitions
   461 support much richer constructs, such as bounded repetitions,
       
   462 negations,
   462 and back-references.
   463 and back-references.
   463 To differentiate, we use the word \emph{regex} to refer
   464 To differentiate, we use the word \emph{regex} to refer
   464 to those expressions with richer constructs while reserving the
   465 to those expressions with richer constructs while reserving the
   465 term \emph{regular expression}
   466 term \emph{regular expression}
   466 for the more traditional meaning in formal languages theory.
   467 for the more traditional meaning in formal languages theory.
   479 that make it more convenient for 
   480 that make it more convenient for 
   480 programmers to write regular expressions.
   481 programmers to write regular expressions.
   481 Depending on the types of constructs
   482 Depending on the types of constructs
   482 the task of matching and lexing with them
   483 the task of matching and lexing with them
   483 will have different levels of complexity.
   484 will have different levels of complexity.
   484 Some of those constructs are just syntactic sugars that are
   485 Some of those constructs are syntactic sugars that are
   485 simply short hand notations
   486 simply short hand notations
   486 that save the programmers a few keystrokes.
   487 that save the programmers a few keystrokes.
   487 These will not cause problems for regex libraries.
   488 These will not cause problems for regex libraries.
   488 For example the
   489 For example the
   489 non-binary alternative involving three or more choices just means:
   490 non-binary alternative involving three or more choices just means:
   490 \[
   491 \[
   491 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
   492 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
   492 \]
   493 \]
   493 Similarly, the range operator used to express the alternative
   494 Similarly, the range operator
   494 of all characters between its operands is just a concise way:
   495 %used to express the alternative
       
   496 %of all characters between its operands, 
       
   497 is just a concise way
       
   498 of expressing an alternative of consecutive characters:
   495 \[
   499 \[
   496 	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
   500 	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )  
   497 \]
   501 \]
   498 for an alternative. The
   502 for an alternative. The
   499 wildcard character $.$ is used to refer to any single character,
   503 wildcard character '$.$' is used to refer to any single character,
   500 \[
   504 \[
   501 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
   505 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
   502 \]
   506 \]
   503 except the newline.
   507 except the newline.
   504 
   508 
   505 \subsection{Bounded Repetitions}
   509 \subsection{Bounded Repetitions}
   506 More interesting are bounded repetitions, which can 
   510 More interesting are bounded repetitions, which can 
   507 make the regular expressions much
   511 make the regular expressions much
   508 more compact.
   512 more compact.
   509 There are 
   513 Normally there are four kinds of bounded repetitions:
   510 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
   514 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
   511 (where $n$ and $m$ are constant natural numbers).
   515 (where $n$ and $m$ are constant natural numbers).
   512 Like the star regular expressions, the set of strings or language
   516 Like the star regular expressions, the set of strings or language
   513 a bounded regular expression can match
   517 a bounded regular expression can match
   514 is defined using the power operation on sets:
   518 is defined using the power operation on sets:
   519 		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
   523 		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
   520 		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
   524 		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
   521 	\end{tabular}
   525 	\end{tabular}
   522 \end{center}
   526 \end{center}
   523 The attraction of bounded repetitions is that they can be
   527 The attraction of bounded repetitions is that they can be
   524 used to avoid a blow up: for example $r^{\{n\}}$
   528 used to avoid a size blow up: for example $r^{\{n\}}$
   525 is a shorthand for
   529 is a shorthand for
       
   530 the much longer regular expression:
   526 \[
   531 \[
   527 	\underbrace{r\ldots r}_\text{n copies of r}.
   532 	\underbrace{r\ldots r}_\text{n copies of r}.
   528 \]
   533 \]
   529 %Therefore, a naive algorithm that simply unfolds
   534 %Therefore, a naive algorithm that simply unfolds
   530 %them into their desugared forms
   535 %them into their desugared forms
   531 %will suffer from at least an exponential runtime increase.
   536 %will suffer from at least an exponential runtime increase.
   532 
   537 
   533 
   538 
   534 The problem with matching 
   539 The problem with matching 
       
   540 such bounded repetitions
   535 is that tools based on the classic notion of
   541 is that tools based on the classic notion of
   536 automata need to expand $r^{\{n\}}$ into $n$ connected 
   542 automata need to expand $r^{\{n\}}$ into $n$ connected 
   537 copies of the automaton for $r$. This leads to very inefficient matching
   543 copies of the automaton for $r$. This leads to very inefficient matching
   538 algorithms  or algorithms that consume large amounts of memory.
   544 algorithms  or algorithms that consume large amounts of memory.
   539 Implementations using $\DFA$s will
   545 Implementations using $\DFA$s will
       
   546 in such situations
   540 either become excruciatingly slow 
   547 either become excruciatingly slow 
   541 (for example Verbatim++~\cite{Verbatimpp}) or get
   548 (for example Verbatim++ \cite{Verbatimpp}) or run
   542 out of memory errors (for example $\mathit{LEX}$ and 
   549 out of memory (for example $\mathit{LEX}$ and 
   543 $\mathit{JFLEX}$\footnote{which are lexer generators
   550 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
   544 in C and JAVA that generate $\mathit{DFA}$-based
   551 in C and JAVA that generate $\mathit{DFA}$-based
   545 lexers. The user provides a set of regular expressions
   552 lexers. The user provides a set of regular expressions
   546 and configurations to them, and then 
   553 and configurations, and then 
   547 gets an output program encoding a minimized $\mathit{DFA}$
   554 gets an output program encoding a minimized $\mathit{DFA}$
   548 that can be compiled and run. 
   555 that can be compiled and run. 
   549 When given the above countdown regular expression,
   556 When given the above countdown regular expression,
   550 a small $n$ (a few dozen) would result in a 
   557 a small $n$ (say 20) would result in a program representing a
   551 determinised automata
   558 DFA
   552 with millions of states.}) for large counters.
   559 with millions of states.}) for large counters.
   553 A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
   560 A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
   554 where the minimal DFA requires at least $2^{n+1}$ states.
   561 where the minimal DFA requires at least $2^{n+1}$ states.
   555 For example, when $n$ is equal to 2,
   562 For example, when $n$ is equal to 2,
   556 The corresponding $\mathit{NFA}$ looks like:
   563 the corresponding $\mathit{NFA}$ looks like:
       
   564 \vspace{6mm}
   557 \begin{center}
   565 \begin{center}
   558 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
   566 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
   559    \node[state,initial] (q_0)   {$q_0$}; 
   567    \node[state,initial] (q_0)   {$q_0$}; 
   560    \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
   568    \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
   561    \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
   569    \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
   565     	  edge [loop below] node {a,b} ()
   573     	  edge [loop below] node {a,b} ()
   566     (q_1) edge  node  {a,b} (q_2)
   574     (q_1) edge  node  {a,b} (q_2)
   567     (q_2) edge  node  {a,b} (q_3);
   575     (q_2) edge  node  {a,b} (q_3);
   568 \end{tikzpicture}
   576 \end{tikzpicture}
   569 \end{center}
   577 \end{center}
   570 when turned into a DFA by the subset construction
   578 and when turned into a DFA by the subset construction
   571 requires at least $2^3$ states.\footnote{The 
   579 requires at least $2^3$ states.\footnote{The 
   572 red states are "countdown states" which counts down 
   580 red states are "countdown states" which count down 
   573 the number of characters needed in addition to the current
   581 the number of characters needed in addition to the current
   574 string to make a successful match.
   582 string to make a successful match.
   575 For example, state $q_1$ indicates a match that has
   583 For example, state $q_1$ indicates a match that has
   576 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
   584 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
   577 and just consumed the "delimiter" $a$ in the middle, and 
   585 and just consumed the "delimiter" $a$ in the middle, and 
   578 need to match 2 more iterations of $(a|b)$ to complete.
   586 needs to match 2 more iterations of $(a|b)$ to complete.
   579 State $q_2$ on the other hand, can be viewed as a state
   587 State $q_2$ on the other hand, can be viewed as a state
   580 after $q_1$ has consumed 1 character, and just waits
   588 after $q_1$ has consumed 1 character, and just waits
   581 for 1 more character to complete.
   589 for 1 more character to complete.
   582 $q_3$ is the last state, requiring 0 more character and is accepting.
   590 The state $q_3$ is the last (accepting) state, requiring 0 
       
   591 more characters.
   583 Depending on the suffix of the
   592 Depending on the suffix of the
   584 input string up to the current read location,
   593 input string up to the current read location,
   585 the states $q_1$ and $q_2$, $q_3$
   594 the states $q_1$ and $q_2$, $q_3$
   586 may or may
   595 may or may
   587 not be active, independent from each other.
   596 not be active.
   588 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
   597 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
   589 contain at least $2^3$ non-equivalent states that cannot be merged, 
   598 contain at least $2^3$ non-equivalent states that cannot be merged, 
   590 because the subset construction during determinisation will generate
   599 because the subset construction during determinisation will generate
   591 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
   600 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
   592 Generalizing this to regular expressions with larger
   601 Generalizing this to regular expressions with larger
   593 bounded repetitions number, we have that
   602 bounded repetitions number, we have that
   594 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
   603 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
   595 would require at least $2^{n+1}$ states, if $r$ itself contains
   604 would require at least $2^{n+1}$ states, if $r$ itself contains
   596 more than 1 string.
   605 more than 1 string.
   597 This is to represent all different 
   606 This is to represent all different 
   598 scenarios which "countdown" states are active.}
   607 scenarios in which "countdown" states are active.}
   599 
   608 
   600 
   609 
   601 Bounded repetitions are very important because they
   610 Bounded repetitions are important because they
   602 tend to occur a lot in practical use,
   611 tend to occur frequently in practical use,
   603 for example in the regex library RegExLib,
   612 for example in the regex library RegExLib, in
   604 the rules library of Snort \cite{Snort1999}\footnote{
   613 the rules library of Snort \cite{Snort1999}\footnote{
   605 Snort is a network intrusion detection (NID) tool
   614 Snort is a network intrusion detection (NID) tool
   606 for monitoring network traffic.
   615 for monitoring network traffic.
   607 The network security community curates a list
   616 The network security community curates a list
   608 of malicious patterns written as regexes,
   617 of malicious patterns written as regexes,
   613 According to Bj\"{o}rklund et al \cite{xml2015},
   622 According to Bj\"{o}rklund et al \cite{xml2015},
   614 more than half of the 
   623 more than half of the 
   615 XSDs they found on the Maven.org central repository
   624 XSDs they found on the Maven.org central repository
   616 have bounded regular expressions in them.
   625 have bounded regular expressions in them.
   617 Often the counters are quite large, with the largest being
   626 Often the counters are quite large, with the largest being
   618 approximately up to ten million. 
   627 close to ten million. 
   619 An example XSD they gave
   628 A smaller sample XSD they gave
   620 is:
   629 is:
   621 \begin{verbatim}
   630 \begin{verbatim}
   622 <sequence minOccurs="0" maxOccurs="65535">
   631 <sequence minOccurs="0" maxOccurs="65535">
   623  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
   632  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
   624  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
   633  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
   625 </sequence>
   634 </sequence>
   626 \end{verbatim}
   635 \end{verbatim}
   627 This can be seen as the expression 
   636 This can be seen as the regex
   628 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
   637 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
   629 regular expressions 
   638 regular expressions 
   630 satisfying certain constraints (such as 
   639 satisfying certain constraints (such as 
   631 satisfying the floating point number format).
   640 satisfying the floating point number format).
   632 It is therefore quite unsatisfying that 
   641 It is therefore quite unsatisfying that 
   636 For example, in the regular expression matching library in the Go
   645 For example, in the regular expression matching library in the Go
   637 language the regular expression $a^{1001}$ is not permitted, because no counter
   646 language the regular expression $a^{1001}$ is not permitted, because no counter
   638 can be above 1000, and in the built-in Rust regular expression library
   647 can be above 1000, and in the built-in Rust regular expression library
   639 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
   648 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
   640 for being too big. 
   649 for being too big. 
   641 As Becchi and Crawley~\cite{Becchi08}  have pointed out,
   650 As Becchi and Crawley \cite{Becchi08}  have pointed out,
   642 the reason for these restrictions
   651 the reason for these restrictions
   643 is that they simulate a non-deterministic finite
   652 is that they simulate a non-deterministic finite
   644 automata (NFA) with a breadth-first search.
   653 automata (NFA) with a breadth-first search.
   645 This way the number of active states could
   654 This way the number of active states could
   646 be equal to the counter number.
   655 be equal to the counter number.
   647 When the counters are large, 
   656 When the counters are large, 
   648 the memory requirement could become
   657 the memory requirement could become
   649 infeasible, and a regex engine
   658 infeasible, and a regex engine
   650 like Go will reject this pattern straight away.
   659 like in Go will reject this pattern straight away.
   651 \begin{figure}[H]
   660 \begin{figure}[H]
   652 \begin{center}
   661 \begin{center}
   653 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
   662 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
   654  
   663  
   655     	\node (q0) [state, initial] {$0$};
   664     	\node (q0) [state, initial] {$0$};
   700 counters \cite{Turo_ov__2020}.
   709 counters \cite{Turo_ov__2020}.
   701 These solutions can be quite efficient,
   710 These solutions can be quite efficient,
   702 with the ability to process
   711 with the ability to process
   703 gigabytes of strings input per second
   712 gigabytes of strings input per second
   704 even with large counters \cite{Becchi08}.
   713 even with large counters \cite{Becchi08}.
   705 But formal reasoning about these automata especially in Isabelle 
   714 These practical solutions do not come with
   706 can be challenging
   715 formal guarantees, and as pointed out by
   707 and un-intuitive. 
   716 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
   708 Therefore, we take correctness and runtime claims made about these solutions
   717 %But formal reasoning about these automata especially in Isabelle 
   709 with a grain of salt.
   718 %can be challenging
       
   719 %and un-intuitive. 
       
   720 %Therefore, we take correctness and runtime claims made about these solutions
       
   721 %with a grain of salt.
   710 
   722 
   711 In the work reported in \cite{FoSSaCS2023} and here, 
   723 In the work reported in \cite{FoSSaCS2023} and here, 
   712 we add better support using derivatives
   724 we add better support using derivatives
   713 for bounded regular expressions $r^{\{n\}}$.
   725 for bounded regular expression $r^{\{n\}}$.
   714 The results
   726 Our results
   715 extend straightforwardly to
   727 extend straightforwardly to
   716 repetitions with an interval such as 
   728 repetitions with intervals such as 
   717 $r^{\{n\ldots m\}}$.
   729 $r^{\{n\ldots m\}}$.
   718 The merit of Brzozowski derivatives (more on this later)
   730 The merit of Brzozowski derivatives (more on this later)
   719 on this problem is that
   731 on this problem is that
   720 it can be naturally extended to support bounded repetitions.
   732 it can be naturally extended to support bounded repetitions.
   721 Moreover these extensions are still made up of only
   733 Moreover these extensions are still made up of only small
   722 inductive datatypes and recursive functions,
   734 inductive datatypes and recursive functions,
   723 making it handy to deal with them in theorem provers.
   735 making it handy to deal with them in theorem provers.
   724 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
   736 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
   725 %straightforwardly extended to deal with bounded regular expressions
   737 %straightforwardly extended to deal with bounded regular expressions
   726 %and moreover the resulting code still consists of only simple
   738 %and moreover the resulting code still consists of only simple