ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 636 0bcb4a7cb40c
parent 635 7ce2389dff4b
child 637 e3752aac8ec2
equal deleted inserted replaced
635:7ce2389dff4b 636:0bcb4a7cb40c
   707 These problems can of course be solved in matching algorithms where 
   707 These problems can of course be solved in matching algorithms where 
   708 automata go beyond the classic notion and for instance include explicit
   708 automata go beyond the classic notion and for instance include explicit
   709 counters \cite{Turo_ov__2020}.
   709 counters \cite{Turo_ov__2020}.
   710 These solutions can be quite efficient,
   710 These solutions can be quite efficient,
   711 with the ability to process
   711 with the ability to process
   712 gigabytes of strings input per second
   712 gigabits of strings input per second
   713 even with large counters \cite{Becchi08}.
   713 even with large counters \cite{Becchi08}.
   714 These practical solutions do not come with
   714 These practical solutions do not come with
   715 formal guarantees, and as pointed out by
   715 formal guarantees, and as pointed out by
   716 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
   716 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
   717 %But formal reasoning about these automata especially in Isabelle 
   717 %But formal reasoning about these automata especially in Isabelle 
   746 
   746 
   747 \subsection{Back-References}
   747 \subsection{Back-References}
   748 The other way to simulate an $\mathit{NFA}$ for matching is choosing  
   748 The other way to simulate an $\mathit{NFA}$ for matching is choosing  
   749 a single transition each time, keeping all the other options in 
   749 a single transition each time, keeping all the other options in 
   750 a queue or stack, and backtracking if that choice eventually 
   750 a queue or stack, and backtracking if that choice eventually 
   751 fails. This method, often called a  "depth-first-search", 
   751 fails. 
   752 is efficient in a lot of cases, but could end up
   752 This method, often called a  "depth-first-search", 
       
   753 is efficient in many cases, but could end up
   753 with exponential run time.
   754 with exponential run time.
   754 The backtracking method is employed in regex libraries
   755 The backtracking method is employed in regex libraries
   755 that support \emph{back-references}, for example
   756 that support \emph{back-references}, for example
   756 in Java and Python.
   757 in Java and Python.
   757 %\section{Back-references and The Terminology Regex}
   758 %\section{Back-references and The Terminology Regex}
   767 %for a path terminating
   768 %for a path terminating
   768 %at an accepting state.
   769 %at an accepting state.
   769 
   770 
   770 
   771 
   771 
   772 
   772 Given a regular expression like this (the sequence
   773 Consider the following regular expression where the sequence
   773 operator is omitted for brevity):
   774 operator is omitted for brevity:
   774 \begin{center}
   775 \begin{center}
   775 	$r_1r_2r_3r_4$
   776 	$r_1r_2r_3r_4$
   776 \end{center}
   777 \end{center}
       
   778 In this example,
   777 one could label sub-expressions of interest 
   779 one could label sub-expressions of interest 
   778 by parenthesizing them and giving 
   780 by parenthesizing them and giving 
   779 them a number by the order in which their opening parentheses appear.
   781 them a number in the order in which their opening parentheses appear.
   780 One possible way of parenthesizing and labelling is given below:
   782 One possible way of parenthesizing and labelling is: 
   781 \begin{center}
   783 \begin{center}
   782 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
   784 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
   783 \end{center}
   785 \end{center}
   784 The sub-expressions
   786 The sub-expressions
   785 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
   787 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
   786 by 1 to 4, and can be ``referred back'' by their respective numbers. 
   788 by 1 to 4, and can be ``referred back'' by their respective numbers. 
   787 %These sub-expressions are called "capturing groups".
   789 %These sub-expressions are called "capturing groups".
   788 To do so, we use the syntax $\backslash i$ 
   790 To do so, we use the syntax $\backslash i$ 
   789 to denote that we want the sub-string 
   791 to denote that we want the sub-string 
   790 of the input just matched by the i-th
   792 of the input matched by the i-th
   791 sub-expression to appear again, 
   793 sub-expression to appear again, 
   792 exactly the same as it first appeared: 
   794 exactly the same as it first appeared: 
   793 \begin{center}
   795 \begin{center}
   794 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
   796 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
   795 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
   797 \underset{s_i \text{ which just matched} \;r_i}{\backslash i} \ldots$
   796 \end{center}
   798 \end{center}
       
   799 Once the sub-string $s_i$ for the sub-expression $r_i$
       
   800 has been fixed, there is no variability on what the back-reference
       
   801 label $\backslash i$ can be---it is tied to $s_i$.
       
   802 The overall string will look like $\ldots s_i \ldots s_i \ldots $
   797 %The backslash and number $i$ are the
   803 %The backslash and number $i$ are the
   798 %so-called "back-references".
   804 %so-called "back-references".
   799 %Let $e$ be an expression made of regular expressions 
   805 %Let $e$ be an expression made of regular expressions 
   800 %and back-references. $e$ contains the expression $e_i$
   806 %and back-references. $e$ contains the expression $e_i$
   801 %as its $i$-th capturing group.
   807 %as its $i$-th capturing group.
   821 	$(.^*)(.^*)$,
   827 	$(.^*)(.^*)$,
   822 \end{center}
   828 \end{center}
   823 which does not impose any restrictions on what strings the second 
   829 which does not impose any restrictions on what strings the second 
   824 sub-expression $.^*$
   830 sub-expression $.^*$
   825 might match.
   831 might match.
   826 Another example of back-references is
   832 Another example for back-references is
   827 \begin{center}
   833 \begin{center}
   828 $(.)(.)\backslash 2\backslash 1$
   834 $(.)(.)\backslash 2\backslash 1$
   829 \end{center}
   835 \end{center}
   830 which matches four-character palindromes
   836 which matches four-character palindromes
   831 like $abba$, $x??x$ and so on.
   837 like $abba$, $x??x$ and so on.
   832 
   838 
   833 Back-references is a regex construct 
   839 Back-references is a regex construct 
   834 that programmers find quite useful.
   840 that programmers find quite useful.
   835 According to Becchi and Crawley~\cite{Becchi08},
   841 According to Becchi and Crawley \cite{Becchi08},
   836 6\% of Snort rules (up until 2008) use them.
   842 6\% of Snort rules (up until 2008) use them.
   837 The most common use of back-references
   843 The most common use of back-references
   838 is to express well-formed html files,
   844 is to express well-formed html files,
   839 where back-references are convenient for matching
   845 where back-references are convenient for matching
   840 opening and closing tags like 
   846 opening and closing tags like 
   852 In fact, they allow the regex construct to express 
   858 In fact, they allow the regex construct to express 
   853 languages that cannot be contained in context-free
   859 languages that cannot be contained in context-free
   854 languages either.
   860 languages either.
   855 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   861 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   856 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   862 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   857 which cannot be expressed by context-free grammars \parencite{campeanu2003formal}.
   863 which cannot be expressed by 
       
   864 context-free grammars \cite{campeanu2003formal}.
   858 Such a language is contained in the context-sensitive hierarchy
   865 Such a language is contained in the context-sensitive hierarchy
   859 of formal languages. 
   866 of formal languages. 
   860 Also solving the matching problem involving back-references
   867 Also solving the matching problem involving back-references
   861 is known to be NP-complete \parencite{alfred2014algorithms}.
   868 is known to be NP-complete \parencite{alfred2014algorithms}.
   862 Regex libraries supporting back-references such as 
   869 Regex libraries supporting back-references such as 
   863 PCRE \cite{pcre} therefore have to
   870 PCRE \cite{pcre} therefore have to
   864 revert to a depth-first search algorithm which backtracks.
   871 revert to a depth-first search algorithm including backtracking.
   865 What is unexpected is that even in the cases 
   872 What is unexpected is that even in the cases 
   866 not involving back-references, there is still
   873 not involving back-references, there is still
   867 a (non-negligible) chance they might backtrack super-linearly,
   874 a (non-negligible) chance they might backtrack super-linearly,
   868 as shown in the graphs in figure\ref{fig:aStarStarb}.
   875 as shown in the graphs in figure \ref{fig:aStarStarb}.
   869 
   876 
   870 Summing these up, we can categorise existing 
   877 Summing up, we can categorise existing 
   871 practical regex libraries into two kinds:
   878 practical regex libraries into two kinds:
   872 (i) The ones  with  linear
   879 (i) The ones  with  linear
   873 time guarantees like Go and Rust. The downside with them is that
   880 time guarantees like Go and Rust. The downside with them is that
   874 they impose restrictions
   881 they impose restrictions
   875 on the regular expressions (not allowing back-references, 
   882 on the regular expressions (not allowing back-references, 
   930 In some cases, they either fail to generate a
   937 In some cases, they either fail to generate a
   931 result when there exists a match,
   938 result when there exists a match,
   932 or give results that are inconsistent with the POSIX standard.
   939 or give results that are inconsistent with the POSIX standard.
   933 A concrete example is the regex:
   940 A concrete example is the regex:
   934 \begin{center}
   941 \begin{center}
   935 	$(aba + ab + a)^* \text{and the string} ababa$
   942 	$(aba + ab + a)^* \text{and the string} \; ababa$
   936 \end{center}
   943 \end{center}
   937 The correct POSIX match for the above
   944 The correct POSIX match for the above
   938 is the entire string $ababa$, 
   945 involves the entire string $ababa$, 
   939 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
   946 split into two Kleene star iterations, namely $[ab], [aba]$ at positions
   940 $[0, 2), [2, 5)$
   947 $[0, 2), [2, 5)$
   941 respectively.
   948 respectively.
   942 But trying this out in regex101 \parencite{regex101} \footnote{
   949 But feeding this example to the different engines
       
   950 in regex101 \parencite{regex101} \footnote{
   943 	regex101 is an online regular expression matcher which
   951 	regex101 is an online regular expression matcher which
   944 	provides API for trying out regular expression
   952 	provides API for trying out regular expression
   945 	engines of multiple popular programming languages like
   953 	engines of multiple popular programming languages like
   946 Java, Python, Go, etc.}
   954 Java, Python, Go, etc.}
   947 with different engines yields
   955 yields
   948 always matches: $[aba]$ at $[0, 3)$
   956 only two incomplete matches: $[aba]$ at $[0, 3)$
   949 and $a$ at $[4, 5)$.
   957 and $a$ at $[4, 5)$.
   950 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
   958 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
   951 commented that most regex libraries are not
   959 commented that most regex libraries are not
   952 correctly implementing the central POSIX
   960 correctly implementing the central POSIX
   953 rule, called the maximum munch rule.
   961 rule, called the maximum munch rule.
   958 	the length of matched strings in the various subexpressions.''
   966 	the length of matched strings in the various subexpressions.''
   959 \end{quote}
   967 \end{quote}
   960 %\noindent
   968 %\noindent
   961 We think the implementation complexity of POSIX rules also come from
   969 We think the implementation complexity of POSIX rules also come from
   962 the specification being not very precise.
   970 the specification being not very precise.
       
   971 The developers of the regexp package of GO 
       
   972 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
       
   973 commented that
       
   974 \begin{quote}\it
       
   975 ``
       
   976 The POSIX rule is computationally prohibitive
       
   977 and not even well-defined.
       
   978 ``
       
   979 \end{quote}
   963 There are many informal summaries of this disambiguation
   980 There are many informal summaries of this disambiguation
   964 strategy, which are often quite long and delicate.
   981 strategy, which are often quite long and delicate.
   965 For example Kuklewicz \cite{KuklewiczHaskell} 
   982 For example Kuklewicz \cite{KuklewiczHaskell} 
   966 described the POSIX rule as (section 1, last paragraph):
   983 described the POSIX rule as (section 1, last paragraph):
   967 \begin{quote}
   984 \begin{quote}
   995 POSIX specification given by Okui and Suzuki \cite{Okui10}.
  1012 POSIX specification given by Okui and Suzuki \cite{Okui10}.
   996 They then formally proved the correctness of
  1013 They then formally proved the correctness of
   997 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
  1014 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
   998 with regards to that specification.
  1015 with regards to that specification.
   999 They also found that the informal POSIX
  1016 They also found that the informal POSIX
  1000 specification by Sulzmann and Lu does not work for the correctness proof.
  1017 specification by Sulzmann and Lu needs to be revised for the correctness proof.
  1001 
  1018 
  1002 In the next section we will briefly
  1019 In the next section, we will briefly
  1003 introduce Brzozowski derivatives and Sulzmann
  1020 introduce Brzozowski derivatives and Sulzmann
  1004 and Lu's algorithm, which the main point of this thesis builds on.
  1021 and Lu's algorithm, which the main point of this thesis builds on.
  1005 %We give a taste of what they 
  1022 %We give a taste of what they 
  1006 %are like and why they are suitable for regular expression
  1023 %are like and why they are suitable for regular expression
  1007 %matching and lexing.
  1024 %matching and lexing.
  1025 %$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
  1042 %$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
  1026 %	\end{tabular}
  1043 %	\end{tabular}
  1027 %\end{center}
  1044 %\end{center}
  1028 %\noindent
  1045 %\noindent
  1029 %The first clause says that for the regular expression
  1046 %The first clause says that for the regular expression
  1030 %denoting a singleton set consisting of a sinlge-character string $\{ d \}$,
  1047 %denoting a singleton set consisting of a single-character string $\{ d \}$,
  1031 %we check the derivative character $c$ against $d$,
  1048 %we check the derivative character $c$ against $d$,
  1032 %returning a set containing only the empty string $\{ [] \}$
  1049 %returning a set containing only the empty string $\{ [] \}$
  1033 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
  1050 %if $c$ and $d$ are equal, and the empty set $\varnothing$ otherwise.
  1034 %The second clause states that to obtain the regular expression
  1051 %The second clause states that to obtain the regular expression
  1035 %representing all strings' head character $c$ being chopped off
  1052 %representing all strings' head character $c$ being chopped off
  1041 %This property can be used on regular expressions
  1058 %This property can be used on regular expressions
  1042 %matching and lexing--to test whether a string $s$ is in $L \; r$,
  1059 %matching and lexing--to test whether a string $s$ is in $L \; r$,
  1043 %one simply takes derivatives of $r$ successively with
  1060 %one simply takes derivatives of $r$ successively with
  1044 %respect to the characters (in the correct order) in $s$,
  1061 %respect to the characters (in the correct order) in $s$,
  1045 %and then test whether the empty string is in the last regular expression.
  1062 %and then test whether the empty string is in the last regular expression.
  1046 With this derivatives give a simple solution
  1063 With this property, derivatives give a simple solution
  1047 to the problem of matching a string $s$ with a regular
  1064 to the problem of matching a string $s$ with a regular
  1048 expression $r$: if the derivative of $r$ w.r.t.\ (in
  1065 expression $r$: if the derivative of $r$ w.r.t.\ (in
  1049 succession) all the characters of the string matches the empty string,
  1066 succession) all the characters of the string matches the empty string,
  1050 then $r$ matches $s$ (and {\em vice versa}).  
  1067 then $r$ matches $s$ (and {\em vice versa}).  
  1051 %This makes formally reasoning about these properties such
  1068 %This makes formally reasoning about these properties such