ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 630 d50a309a0645
parent 628 7af4e2420a8c
child 631 bdb3ffdd39f8
equal deleted inserted replaced
629:96e009a446d5 630:d50a309a0645
   263 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   263 respond so slowly that the load balancer assumed a $\mathit{DoS}$ 
   264 attack and therefore stopped the servers from responding to any
   264 attack and therefore stopped the servers from responding to any
   265 requests. This made the whole site become unavailable. 
   265 requests. This made the whole site become unavailable. 
   266 
   266 
   267 \begin{figure}[p]
   267 \begin{figure}[p]
       
   268 \begin{center}
   268 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
   269 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
   269 \begin{tikzpicture}
   270 \begin{tikzpicture}
   270 \begin{axis}[
   271 \begin{axis}[
   271     xlabel={$n$},
   272     xlabel={$n$},
   272     x label style={at={(1.05,-0.05)}},
   273     x label style={at={(1.05,-0.05)}},
   391 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
   392 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
   392 \end{axis}
   393 \end{axis}
   393 \end{tikzpicture}\\ 
   394 \end{tikzpicture}\\ 
   394 \multicolumn{2}{c}{Graphs}
   395 \multicolumn{2}{c}{Graphs}
   395 \end{tabular}    
   396 \end{tabular}    
       
   397 \end{center}
   396 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
   398 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
   397            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   399            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   398    The reason for their superlinear behaviour is that they do a depth-first-search
   400    The reason for their superlinear behaviour is that they do a depth-first-search
   399    using NFAs.
   401    using NFAs.
   400    If the string does not match, the regular expression matching
   402    If the string does not match, the regular expression matching
   442 Theoretical results in automata theory state 
   444 Theoretical results in automata theory state 
   443 that basic regular expression matching should be linear
   445 that basic regular expression matching should be linear
   444 w.r.t the input.
   446 w.r.t the input.
   445 This assumes that the regular expression
   447 This assumes that the regular expression
   446 $r$ was pre-processed and turned into a
   448 $r$ was pre-processed and turned into a
   447 deterministic finite automaton (DFA) before matching\cite{Sakarovitch2009}.
   449 deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
   448 By basic we mean textbook definitions such as the one
   450 By basic we mean textbook definitions such as the one
   449 below, involving only regular expressions for characters, alternatives,
   451 below, involving only regular expressions for characters, alternatives,
   450 sequences, and Kleene stars:
   452 sequences, and Kleene stars:
   451 \[
   453 \[
   452 	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
   454 	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
   531 automata need to expand $r^{\{n\}}$ into $n$ connected 
   533 automata need to expand $r^{\{n\}}$ into $n$ connected 
   532 copies of the automaton for $r$. This leads to very inefficient matching
   534 copies of the automaton for $r$. This leads to very inefficient matching
   533 algorithms  or algorithms that consume large amounts of memory.
   535 algorithms  or algorithms that consume large amounts of memory.
   534 Implementations using $\DFA$s will
   536 Implementations using $\DFA$s will
   535 either become excruciatingly slow 
   537 either become excruciatingly slow 
   536 (for example Verbatim++\cite{Verbatimpp}) or get
   538 (for example Verbatim++~\cite{Verbatimpp}) or get
   537 out of memory errors (for example $\mathit{LEX}$ and 
   539 out of memory errors (for example $\mathit{LEX}$ and 
   538 $\mathit{JFLEX}$\footnote{which are lexer generators
   540 $\mathit{JFLEX}$\footnote{which are lexer generators
   539 in C and JAVA that generate $\mathit{DFA}$-based
   541 in C and JAVA that generate $\mathit{DFA}$-based
   540 lexers. The user provides a set of regular expressions
   542 lexers. The user provides a set of regular expressions
   541 and configurations to them, and then 
   543 and configurations to them, and then 
   631 For example, in the regular expression matching library in the Go
   633 For example, in the regular expression matching library in the Go
   632 language the regular expression $a^{1001}$ is not permitted, because no counter
   634 language the regular expression $a^{1001}$ is not permitted, because no counter
   633 can be above 1000, and in the built-in Rust regular expression library
   635 can be above 1000, and in the built-in Rust regular expression library
   634 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
   636 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
   635 for being too big. 
   637 for being too big. 
   636 As Becchi and Crawley\cite{Becchi08}  have pointed out,
   638 As Becchi and Crawley~\cite{Becchi08}  have pointed out,
   637 the reason for these restrictions
   639 the reason for these restrictions
   638 is that they simulate a non-deterministic finite
   640 is that they simulate a non-deterministic finite
   639 automata (NFA) with a breadth-first search.
   641 automata (NFA) with a breadth-first search.
   640 This way the number of active states could
   642 This way the number of active states could
   641 be equal to the counter number.
   643 be equal to the counter number.
   813 which matches four-character palindromes
   815 which matches four-character palindromes
   814 like $abba$, $x??x$ and so on.
   816 like $abba$, $x??x$ and so on.
   815 
   817 
   816 Back-references is a regex construct 
   818 Back-references is a regex construct 
   817 that programmers find quite useful.
   819 that programmers find quite useful.
   818 According to Becchi and Crawley\cite{Becchi08},
   820 According to Becchi and Crawley~\cite{Becchi08},
   819 6\% of Snort rules (up until 2008) use them.
   821 6\% of Snort rules (up until 2008) use them.
   820 The most common use of back-references
   822 The most common use of back-references
   821 is to express well-formed html files,
   823 is to express well-formed html files,
   822 where back-references are convenient for matching
   824 where back-references are convenient for matching
   823 opening and closing tags like 
   825 opening and closing tags like 
  1200 together with a formal proof of the correctness of those simplifications.
  1202 together with a formal proof of the correctness of those simplifications.
  1201 
  1203 
  1202 
  1204 
  1203 One of the most recent work in the context of lexing
  1205 One of the most recent work in the context of lexing
  1204 %with this issue
  1206 %with this issue
  1205 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
  1207 is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
  1206 This is relevant work for us and we will compare it later with 
  1208 This is relevant work for us and we will compare it later with 
  1207 our derivative-based matcher we are going to present.
  1209 our derivative-based matcher we are going to present.
  1208 There is also some newer work called
  1210 There is also some newer work called
  1209 Verbatim++\cite{Verbatimpp}, which does not use derivatives, 
  1211 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
  1210 but deterministic finite automaton instead.
  1212 but deterministic finite automaton instead.
  1211 %An example that gives problem to automaton approaches would be
  1213 %An example that gives problem to automaton approaches would be
  1212 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
  1214 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
  1213 %It requires at least $2^{n+1}$ states to represent
  1215 %It requires at least $2^{n+1}$ states to represent
  1214 %as a DFA.
  1216 %as a DFA.