ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 618 233cf2b97d1a
parent 612 8c234a1bc7e0
child 620 ae6010c14e49
equal deleted inserted replaced
614:d5e9bcb384ec 618:233cf2b97d1a
   778 %		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
   778 %		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
   779 %		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
   779 %		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
   780 %	\end{tabular}
   780 %	\end{tabular}
   781 %\end{center}
   781 %\end{center}
   782 A concrete example
   782 A concrete example
   783 for back-references would be
   783 for back-references is
   784 \begin{center}
   784 \begin{center}
   785 $(.^*)\backslash 1$,
   785 $(.^*)\backslash 1$,
   786 \end{center}
   786 \end{center}
   787 which would match 
   787 which matches
   788 strings that can be split into two identical halves,
   788 strings that can be split into two identical halves,
   789 for example $\mathit{foofoo}$, $\mathit{ww}$ and etc.
   789 for example $\mathit{foofoo}$, $\mathit{ww}$ and so on.
   790 Note that this is different from 
   790 Note that this is different from 
   791 repeating the  sub-expression verbatim like
   791 repeating the  sub-expression verbatim like
   792 \begin{center}
   792 \begin{center}
   793 	$(.^*)(.^*)$,
   793 	$(.^*)(.^*)$,
   794 \end{center}
   794 \end{center}
   795 which does not impose any restrictions on what strings the second 
   795 which does not impose any restrictions on what strings the second 
   796 sub-expression $.^*$
   796 sub-expression $.^*$
   797 might match.
   797 might match.
   798 Another example of back-references would be
   798 Another example of back-references is
   799 \begin{center}
   799 \begin{center}
   800 $(.)(.)\backslash 2\backslash 1$
   800 $(.)(.)\backslash 2\backslash 1$
   801 \end{center}
   801 \end{center}
   802 which expresses four-character palindromes
   802 which matches four-character palindromes
   803 like $abba$, $x??x$ etc.
   803 like $abba$, $x??x$ and so on.
   804 
   804 
   805 Back-references is a regex construct 
   805 Back-references is a regex construct 
   806 that programmers found quite useful.
   806 that programmers find quite useful.
   807 According to Becchi and Crawley\cite{Becchi08},
   807 According to Becchi and Crawley\cite{Becchi08},
   808 6\% of Snort rules (up until 2008) include the use of them.
   808 6\% of Snort rules (up until 2008) use them.
   809 The most common use of back-references
   809 The most common use of back-references
   810 would be expressing well-formed html files,
   810 is to express well-formed html files,
   811 where back-references would be handy in expressing
   811 where back-references are convenient for matching
   812 a pair of opening and closing tags like 
   812 opening and closing tags like 
   813 \begin{center}
   813 \begin{center}
   814 	$\langle html \rangle \ldots \langle / html \rangle$
   814 	$\langle html \rangle \ldots \langle / html \rangle$
   815 \end{center}
   815 \end{center}
   816 A regex describing such a format
   816 A regex describing such a format
   817 could be
   817 is
   818 \begin{center}
   818 \begin{center}
   819 	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
   819 	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
   820 \end{center}
   820 \end{center}
   821 Despite being useful, the syntax and expressive power of regexes 
   821 Despite being useful, the expressive power of regexes 
   822 go beyond the regular language hierarchy
   822 go beyond the regular language hierarchy
   823 with back-references.
   823 once back-references are included.
   824 In fact, they allow the regex construct to express 
   824 In fact, they allow the regex construct to express 
   825 languages that cannot be contained in context-free
   825 languages that cannot be contained in context-free
   826 languages either.
   826 languages either.
   827 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   827 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   828 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   828 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   829 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   829 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   830 Such a language is contained in the context-sensitive hierarchy
   830 Such a language is contained in the context-sensitive hierarchy
   831 of formal languages. 
   831 of formal languages. 
   832 Solving the back-reference expressions matching problem
   832 Also solving the matching problem involving back-references
   833 is known to be NP-complete \parencite{alfred2014algorithms}.
   833 is known to be NP-complete \parencite{alfred2014algorithms}.
   834 A non-bactracking,
       
   835 efficient solution is not known to exist.
       
   836 Regex libraries supporting back-references such as 
   834 Regex libraries supporting back-references such as 
   837 PCRE \cite{pcre} therefore have to
   835 PCRE \cite{pcre} therefore have to
   838 revert to a depth-first search algorithm which backtracks.
   836 revert to a depth-first search algorithm which backtracks.
   839 What is unexpected is that even in the cases 
   837 What is unexpected is that even in the cases 
   840 not involving back-references, there is still
   838 not involving back-references, there is still
   841 a (non-negligible) chance they might backtrack super-linearly,
   839 a (non-negligible) chance they might backtrack super-linearly,
   842 as shown in the graphs in \ref{fig:aStarStarb}.
   840 as shown in the graphs in figure\ref{fig:aStarStarb}.
   843 
   841 
   844 \subsection{Summary of the Catastrophic Backtracking Problem}
   842 \subsection{Summary of the Catastrophic Backtracking Problem}
   845 Summing these up, we can categorise existing 
   843 Summing these up, we can categorise existing 
   846 practical regex libraries into two kinds:
   844 practical regex libraries into two kinds:
   847 (i)The ones  with  linear
   845 (i)The ones  with  linear