equal
deleted
inserted
replaced
263 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
263 respond so slowly that the load balancer assumed a $\mathit{DoS}$ |
264 attack and therefore stopped the servers from responding to any |
264 attack and therefore stopped the servers from responding to any |
265 requests. This made the whole site become unavailable. |
265 requests. This made the whole site become unavailable. |
266 |
266 |
267 \begin{figure}[p] |
267 \begin{figure}[p] |
|
268 \begin{center} |
268 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}} |
269 \begin{tabular}{@{}c@{\hspace{0mm}}c@{}} |
269 \begin{tikzpicture} |
270 \begin{tikzpicture} |
270 \begin{axis}[ |
271 \begin{axis}[ |
271 xlabel={$n$}, |
272 xlabel={$n$}, |
272 x label style={at={(1.05,-0.05)}}, |
273 x label style={at={(1.05,-0.05)}}, |
391 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
392 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
392 \end{axis} |
393 \end{axis} |
393 \end{tikzpicture}\\ |
394 \end{tikzpicture}\\ |
394 \multicolumn{2}{c}{Graphs} |
395 \multicolumn{2}{c}{Graphs} |
395 \end{tabular} |
396 \end{tabular} |
|
397 \end{center} |
396 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
398 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
397 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
399 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
398 The reason for their superlinear behaviour is that they do a depth-first-search |
400 The reason for their superlinear behaviour is that they do a depth-first-search |
399 using NFAs. |
401 using NFAs. |
400 If the string does not match, the regular expression matching |
402 If the string does not match, the regular expression matching |
442 Theoretical results in automata theory state |
444 Theoretical results in automata theory state |
443 that basic regular expression matching should be linear |
445 that basic regular expression matching should be linear |
444 w.r.t the input. |
446 w.r.t the input. |
445 This assumes that the regular expression |
447 This assumes that the regular expression |
446 $r$ was pre-processed and turned into a |
448 $r$ was pre-processed and turned into a |
447 deterministic finite automaton (DFA) before matching\cite{Sakarovitch2009}. |
449 deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}. |
448 By basic we mean textbook definitions such as the one |
450 By basic we mean textbook definitions such as the one |
449 below, involving only regular expressions for characters, alternatives, |
451 below, involving only regular expressions for characters, alternatives, |
450 sequences, and Kleene stars: |
452 sequences, and Kleene stars: |
451 \[ |
453 \[ |
452 r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
454 r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
531 automata need to expand $r^{\{n\}}$ into $n$ connected |
533 automata need to expand $r^{\{n\}}$ into $n$ connected |
532 copies of the automaton for $r$. This leads to very inefficient matching |
534 copies of the automaton for $r$. This leads to very inefficient matching |
533 algorithms or algorithms that consume large amounts of memory. |
535 algorithms or algorithms that consume large amounts of memory. |
534 Implementations using $\DFA$s will |
536 Implementations using $\DFA$s will |
535 either become excruciatingly slow |
537 either become excruciatingly slow |
536 (for example Verbatim++\cite{Verbatimpp}) or get |
538 (for example Verbatim++~\cite{Verbatimpp}) or get |
537 out of memory errors (for example $\mathit{LEX}$ and |
539 out of memory errors (for example $\mathit{LEX}$ and |
538 $\mathit{JFLEX}$\footnote{which are lexer generators |
540 $\mathit{JFLEX}$\footnote{which are lexer generators |
539 in C and JAVA that generate $\mathit{DFA}$-based |
541 in C and JAVA that generate $\mathit{DFA}$-based |
540 lexers. The user provides a set of regular expressions |
542 lexers. The user provides a set of regular expressions |
541 and configurations to them, and then |
543 and configurations to them, and then |
631 For example, in the regular expression matching library in the Go |
633 For example, in the regular expression matching library in the Go |
632 language the regular expression $a^{1001}$ is not permitted, because no counter |
634 language the regular expression $a^{1001}$ is not permitted, because no counter |
633 can be above 1000, and in the built-in Rust regular expression library |
635 can be above 1000, and in the built-in Rust regular expression library |
634 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
636 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
635 for being too big. |
637 for being too big. |
636 As Becchi and Crawley\cite{Becchi08} have pointed out, |
638 As Becchi and Crawley~\cite{Becchi08} have pointed out, |
637 the reason for these restrictions |
639 the reason for these restrictions |
638 is that they simulate a non-deterministic finite |
640 is that they simulate a non-deterministic finite |
639 automata (NFA) with a breadth-first search. |
641 automata (NFA) with a breadth-first search. |
640 This way the number of active states could |
642 This way the number of active states could |
641 be equal to the counter number. |
643 be equal to the counter number. |
813 which matches four-character palindromes |
815 which matches four-character palindromes |
814 like $abba$, $x??x$ and so on. |
816 like $abba$, $x??x$ and so on. |
815 |
817 |
816 Back-references is a regex construct |
818 Back-references is a regex construct |
817 that programmers find quite useful. |
819 that programmers find quite useful. |
818 According to Becchi and Crawley\cite{Becchi08}, |
820 According to Becchi and Crawley~\cite{Becchi08}, |
819 6\% of Snort rules (up until 2008) use them. |
821 6\% of Snort rules (up until 2008) use them. |
820 The most common use of back-references |
822 The most common use of back-references |
821 is to express well-formed html files, |
823 is to express well-formed html files, |
822 where back-references are convenient for matching |
824 where back-references are convenient for matching |
823 opening and closing tags like |
825 opening and closing tags like |
1200 together with a formal proof of the correctness of those simplifications. |
1202 together with a formal proof of the correctness of those simplifications. |
1201 |
1203 |
1202 |
1204 |
1203 One of the most recent work in the context of lexing |
1205 One of the most recent work in the context of lexing |
1204 %with this issue |
1206 %with this issue |
1205 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}. |
1207 is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}. |
1206 This is relevant work for us and we will compare it later with |
1208 This is relevant work for us and we will compare it later with |
1207 our derivative-based matcher we are going to present. |
1209 our derivative-based matcher we are going to present. |
1208 There is also some newer work called |
1210 There is also some newer work called |
1209 Verbatim++\cite{Verbatimpp}, which does not use derivatives, |
1211 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, |
1210 but deterministic finite automaton instead. |
1212 but deterministic finite automaton instead. |
1211 %An example that gives problem to automaton approaches would be |
1213 %An example that gives problem to automaton approaches would be |
1212 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$. |
1214 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$. |
1213 %It requires at least $2^{n+1}$ states to represent |
1215 %It requires at least $2^{n+1}$ states to represent |
1214 %as a DFA. |
1216 %as a DFA. |