diff -r d5e9bcb384ec -r 233cf2b97d1a ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Wed Oct 12 14:08:06 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Nov 06 23:05:47 2022 +0000 @@ -780,13 +780,13 @@ % \end{tabular} %\end{center} A concrete example -for back-references would be +for back-references is \begin{center} $(.^*)\backslash 1$, \end{center} -which would match +which matches strings that can be split into two identical halves, -for example $\mathit{foofoo}$, $\mathit{ww}$ and etc. +for example $\mathit{foofoo}$, $\mathit{ww}$ and so on. Note that this is different from repeating the sub-expression verbatim like \begin{center} @@ -795,32 +795,32 @@ which does not impose any restrictions on what strings the second sub-expression $.^*$ might match. -Another example of back-references would be +Another example of back-references is \begin{center} $(.)(.)\backslash 2\backslash 1$ \end{center} -which expresses four-character palindromes -like $abba$, $x??x$ etc. +which matches four-character palindromes +like $abba$, $x??x$ and so on. Back-references is a regex construct -that programmers found quite useful. +that programmers find quite useful. According to Becchi and Crawley\cite{Becchi08}, -6\% of Snort rules (up until 2008) include the use of them. +6\% of Snort rules (up until 2008) use them. The most common use of back-references -would be expressing well-formed html files, -where back-references would be handy in expressing -a pair of opening and closing tags like +is to express well-formed html files, +where back-references are convenient for matching +opening and closing tags like \begin{center} $\langle html \rangle \ldots \langle / html \rangle$ \end{center} A regex describing such a format -could be +is \begin{center} $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$ \end{center} -Despite being useful, the syntax and expressive power of regexes +Despite being useful, the expressive power of regexes go beyond the regular language hierarchy -with back-references. +once back-references are included. In fact, they allow the regex construct to express languages that cannot be contained in context-free languages either. @@ -829,17 +829,15 @@ which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. Such a language is contained in the context-sensitive hierarchy of formal languages. -Solving the back-reference expressions matching problem +Also solving the matching problem involving back-references is known to be NP-complete \parencite{alfred2014algorithms}. -A non-bactracking, -efficient solution is not known to exist. Regex libraries supporting back-references such as PCRE \cite{pcre} therefore have to revert to a depth-first search algorithm which backtracks. What is unexpected is that even in the cases not involving back-references, there is still a (non-negligible) chance they might backtrack super-linearly, -as shown in the graphs in \ref{fig:aStarStarb}. +as shown in the graphs in figure\ref{fig:aStarStarb}. \subsection{Summary of the Catastrophic Backtracking Problem} Summing these up, we can categorise existing