ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 607 e6fc9b72c0e3
parent 606 99b530103464
child 608 37b6fd310a16
equal deleted inserted replaced
606:99b530103464 607:e6fc9b72c0e3
   575 
   575 
   576 
   576 
   577 Bounded repetitions are very important because they
   577 Bounded repetitions are very important because they
   578 tend to occur a lot in practical use.
   578 tend to occur a lot in practical use.
   579 For example in the regex library RegExLib,
   579 For example in the regex library RegExLib,
   580 the rules library of Snort, as well as in XML Schema definitions (XSDs).
   580 the rules library of Snort \cite{Snort1999}\footnote{
       
   581 Snort is a network intrusion detection (NID) tool
       
   582 for monitoring network traffic.}, 
       
   583 as well as in XML Schema definitions (XSDs).
   581 According to Bj\"{o}rklund et al \cite{xml2015},
   584 According to Bj\"{o}rklund et al \cite{xml2015},
   582 more than half of the 
   585 more than half of the 
   583 XSDs they found have bounded regular expressions in them.
   586 XSDs they found have bounded regular expressions in them.
   584 Often the counters are quite large, the largest up to ten million. 
   587 Often the counters are quite large, the largest up to ten million. 
   585 An example XSD they gave
   588 An example XSD they gave
   705 is efficient in a lot of cases, but could end up
   708 is efficient in a lot of cases, but could end up
   706 with exponential run time.
   709 with exponential run time.
   707 The backtracking method is employed in regex libraries
   710 The backtracking method is employed in regex libraries
   708 that support \emph{back-references}, for example
   711 that support \emph{back-references}, for example
   709 in Java and Python.
   712 in Java and Python.
   710 The syntax and expressive power of regexes 
       
   711 go beyond the regular language hierarchy
       
   712 with back-references.
       
   713 %\section{Back-references and The Terminology Regex}
   713 %\section{Back-references and The Terminology Regex}
   714 
   714 
   715 %When one constructs an $\NFA$ out of a regular expression
   715 %When one constructs an $\NFA$ out of a regular expression
   716 %there is often very little to be done in the first phase, one simply 
   716 %there is often very little to be done in the first phase, one simply 
   717 %construct the $\NFA$ states based on the structure of the input regular expression.
   717 %construct the $\NFA$ states based on the structure of the input regular expression.
   764 %	\end{tabular}
   764 %	\end{tabular}
   765 %\end{center}
   765 %\end{center}
   766 A concrete example
   766 A concrete example
   767 for back-references would be
   767 for back-references would be
   768 \begin{center}
   768 \begin{center}
   769 $((a|b|c|\ldots|z)^*)\backslash 1$,
   769 $(.^*)\backslash 1$,
   770 \end{center}
   770 \end{center}
   771 which would match 
   771 which would match 
   772 strings that can be split into two identical halves,
   772 strings that can be split into two identical halves,
   773 for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
   773 for example $\mathit{foofoo}$, $\mathit{ww}$ and etc.
   774 Back-reference is a regex construct 
   774 Note that this is different from 
   775 that programmers found useful, but not exactly 
   775 repeating the  sub-expression verbatim like
   776 regular any more.
   776 \begin{center}
   777 In fact, that allows the regex construct to express 
   777 	$(.^*)(.^*)$,
       
   778 \end{center}
       
   779 which does not impose any restrictions on what strings the second 
       
   780 sub-expression $.^*$
       
   781 might match.
       
   782 Another example of back-references would be
       
   783 \begin{center}
       
   784 $(.)(.)\backslash 2\backslash 1$
       
   785 \end{center}
       
   786 which expresses four-character palindromes
       
   787 like $abba$, $x??x$ etc.
       
   788 
       
   789 Back-references is a regex construct 
       
   790 that programmers found quite useful.
       
   791 According to Becchi and Crawley\cite{Becchi08},
       
   792 6\% of Snort rules (up until 2008) include the use of them.
       
   793 The most common use of back-references
       
   794 would be expressing well-formed html files,
       
   795 where back-references would be handy in expressing
       
   796 a pair of opening and closing tags like 
       
   797 \begin{center}
       
   798 	$\langle html \rangle \ldots \langle / html \rangle$
       
   799 \end{center}
       
   800 A regex describing such a format
       
   801 could be
       
   802 \begin{center}
       
   803 	$\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
       
   804 \end{center}
       
   805 Despite being useful, the syntax and expressive power of regexes 
       
   806 go beyond the regular language hierarchy
       
   807 with back-references.
       
   808 In fact, they allow the regex construct to express 
   778 languages that cannot be contained in context-free
   809 languages that cannot be contained in context-free
   779 languages either.
   810 languages either.
   780 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
   811 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
   781 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   812 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   782 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   813 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   783 Such a language is contained in the context-sensitive hierarchy
   814 Such a language is contained in the context-sensitive hierarchy
   784 of formal languages. 
   815 of formal languages. 
   785 Solving the back-reference expressions matching problem
   816 Solving the back-reference expressions matching problem
   786 is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
   817 is NP-complete\parencite{alfred2014algorithms}.
       
   818 A non-bactracking,
   787 efficient solution is not known to exist.
   819 efficient solution is not known to exist.
   788 %TODO:read a bit more about back reference algorithms
   820 Regex libraries supporting back-references such as PCRE therefore have to
   789 
   821 revert to a depth-first search algorithm that backtracks.
   790 It seems that languages like Java and Python made the trade-off
   822 The unreasonable part with them, is that even in the case of 
   791 to support back-references at the expense of having to backtrack,
   823 regexes not involving back-references, there is still
   792 even in the case of regexes not involving back-references.\\
   824 a (non-negligible) chance they might backtrack super-linearly.
       
   825 
       
   826 \subsection{Summary of the Catastrophic Backtracking Problem}
   793 Summing these up, we can categorise existing 
   827 Summing these up, we can categorise existing 
   794 practical regex libraries into the ones  with  linear
   828 practical regex libraries into two kinds:
   795 time guarantees like Go and Rust, which impose restrictions
   829 (i)The ones  with  linear
       
   830 time guarantees like Go and Rust. The cost with them is that
       
   831 they impose restrictions
   796 on the user input (not allowing back-references, 
   832 on the user input (not allowing back-references, 
   797 bounded repetitions cannot exceed 1000 etc.), and ones  
   833 bounded repetitions cannot exceed a counter limit etc.).
   798  that allows the programmer much freedom, but grinds to a halt
   834 (ii) Those 
   799  in some non-negligible portion of cases.
   835 that allow large bounded regular expressions and back-references
       
   836 at the expense of using a backtracking algorithm.
       
   837 They could grind to a halt
       
   838 on some very simple cases, posing a vulnerability of
       
   839 a ReDoS attack.
       
   840 
       
   841  
       
   842 We would like to have regex engines that can 
       
   843 deal with the regular part (e.g.
       
   844 bounded repetitions) of regexes more
       
   845 efficiently.
       
   846 Also we want to make sure that they do it correctly.
       
   847 It turns out that such aim is not so easy to achieve.
   800  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
   848  %TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions 
   801 % For example, the Rust regex engine claims to be linear, 
   849 % For example, the Rust regex engine claims to be linear, 
   802 % but does not support lookarounds and back-references.
   850 % but does not support lookarounds and back-references.
   803 % The GoLang regex library does not support over 1000 repetitions.  
   851 % The GoLang regex library does not support over 1000 repetitions.  
   804 % Java and Python both support back-references, but shows
   852 % Java and Python both support back-references, but shows
   817 %, whereas $\NFA$s usually run into trouble
   865 %, whereas $\NFA$s usually run into trouble
   818 %on the second phase.
   866 %on the second phase.
   819 
   867 
   820 
   868 
   821 \section{Error-prone POSIX Implementations}
   869 \section{Error-prone POSIX Implementations}
   822 The problems with practical implementations
   870 When there are multiple ways of matching a string
   823 of reare not limited to slowness on certain 
   871 with a regular expression, a matcher needs to
   824 cases. 
   872 disambiguate.
   825 Another thing about these libraries is that there
   873 The standard for which particular match to pick
   826 is no correctness guarantee.
   874 is called the disambiguation strategy.
   827 In some cases, they either fail to generate a lexing result when there exists a match,
   875 The more intuitive strategy is called POSIX,
       
   876 which always chooses the longest initial match.
       
   877 An alternative strategy would be greedy matches,
       
   878 which always ends a sub-match as early as possible.
       
   879 The POSIX standard is widely adopted in many operating systems.
       
   880 However, many implementations (including the C libraries
       
   881 used by Linux and OS X distributions) contain bugs
       
   882 or do not meet the specification they claim to adhere to.
       
   883 In some cases, they either fail to generate a lexing 
       
   884 result when there exists a match,
   828 or give results that are inconsistent with the $\POSIX$ standard.
   885 or give results that are inconsistent with the $\POSIX$ standard.
   829 A concrete example would be the regex
   886 A concrete example would be the regex given by \cite{fowler2003}
   830 \begin{center}
   887 \begin{center}
   831 	$(aba + ab + a)* \text{and the string} ababa$
   888 	$(aba + ab + a)^* \text{and the string} ababa$
   832 \end{center}
   889 \end{center}
   833 The correct $\POSIX$ match for the above would be 
   890 The correct $\POSIX$ match for the above would be 
   834 with the entire string $ababa$, 
   891 with the entire string $ababa$, 
   835 split into two Kleene star iterations, $[ab] [aba]$ at positions
   892 split into two Kleene star iterations, $[ab] [aba]$ at positions
   836 $[0, 2), [2, 5)$
   893 $[0, 2), [2, 5)$
   837 respectively.
   894 respectively.
   838 But trying this out in regex101\parencite{regex101}
   895 But trying this out in regex101\parencite{regex101}
   839 with different language engines would yield 
   896 with different language engines would yield 
   840 the same two fragmented matches: $[aba]$ at $[0, 3)$
   897 the same two fragmented matches: $[aba]$ at $[0, 3)$
   841 and $a$ at $[4, 5)$.
   898 and $a$ at $[4, 5)$.
   842 
   899 Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell} 
   843 Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
   900 commented that most regex libraries are not
   844 correctly implementing the POSIX (maximum-munch)
   901 correctly implementing the POSIX (maximum-munch)
   845 rule of regular expression matching.
   902 rule of regular expression matching.
   846 
       
   847 As Grathwohl\parencite{grathwohl2014crash} wrote,
   903 As Grathwohl\parencite{grathwohl2014crash} wrote,
   848 \begin{quote}
   904 \begin{quote}
   849 	The POSIX strategy is more complicated than the 
   905 	``The POSIX strategy is more complicated than the 
   850 	greedy because of the dependence on information about 
   906 	greedy because of the dependence on information about 
   851 	the length of matched strings in the various subexpressions.
   907 	the length of matched strings in the various subexpressions.''
   852 \end{quote}
   908 \end{quote}
   853 %\noindent
   909 %\noindent
   854 To summarise the above, regular expressions are important.
   910 The implementation complexity of POSIX rules also come from
   855 They are popular and programming languages' library functions
   911 the specification being not very clear.
   856 for them are very fast on non-catastrophic cases.
   912 There are many informal summaries of this disambiguation
   857 But there are problems with current practical implementations.
   913 strategy, which are often quite long and delicate.
   858 First thing is that the running time might blow up.
   914 For example Kuklewicz described the POSIX rule as
   859 The second problem is that they might be error-prone on certain
   915 \begin{quote}
   860 very simple cases.
   916 	``
   861 In the next part of the chapter, we will look into reasons why 
   917 	\begin{itemize}
   862 certain regex engines are running horribly slow on the "catastrophic"
   918 		\item
   863 cases and propose a solution that addresses both of these problems
   919 regular expressions (REs) take the leftmost starting match, and the longest match starting there
   864 based on Brzozowski and Sulzmann and Lu's work.
   920 earlier subpatterns have leftmost-longest priority over later subpatterns\\
   865 
   921 \item
   866 
   922 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
   867 \subsection{Different Phases of a Matching/Lexing Algorithm}
   923 \item
   868 
   924 REs have right associative concatenation which can be changed with parenthesis\\
   869 
   925 \item
   870 Most lexing algorithms can be roughly divided into 
   926 parenthesized subexpressions return the match from their last usage\\
   871 two phases during its run.
   927 \item
   872 The first phase is the "construction" phase,
   928 text of component subexpressions must be contained in the text of the 
   873 in which the algorithm builds some  
   929 higher-level subexpressions\\
   874 suitable data structure from the input regex $r$, so that
   930 \item
   875 it can be easily operated on later.
   931 if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
   876 We denote
   932 \item
   877 the time cost for such a phase by $P_1(r)$.
   933 if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\''
   878 The second phase is the lexing phase, when the input string 
   934 \end{itemize}
   879 $s$ is read and the data structure
   935 \end{quote}
   880 representing that regex $r$ is being operated on. 
   936 The text above 
   881 We represent the time
   937 is trying to capture something very precise,
   882 it takes by $P_2(r, s)$.\\
   938 and is crying out for formalising.
   883 For $\mathit{DFA}$,
   939 
   884 we have $P_2(r, s) = O( |s| )$,
   940 
   885 because we take at most $|s|$ steps, 
   941 %\subsection{Different Phases of a Matching/Lexing Algorithm}
   886 and each step takes
   942 %
   887 at most one transition--
   943 %
   888 a deterministic-finite-automata
   944 %Most lexing algorithms can be roughly divided into 
   889 by definition has at most one state active and at most one
   945 %two phases during its run.
   890 transition upon receiving an input symbol.
   946 %The first phase is the "construction" phase,
   891 But unfortunately in the  worst case
   947 %in which the algorithm builds some  
   892 $P_1(r) = O(exp^{|r|})$. An example will be given later. 
   948 %suitable data structure from the input regex $r$, so that
   893 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
   949 %it can be easily operated on later.
   894 expressions like $r^n$ into 
   950 %We denote
   895 \[
   951 %the time cost for such a phase by $P_1(r)$.
   896 	\underbrace{r \cdots r}_{\text{n copies of r}}.
   952 %The second phase is the lexing phase, when the input string 
   897 \]
   953 %$s$ is read and the data structure
   898 The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
   954 %representing that regex $r$ is being operated on. 
   899 On the other hand, if backtracking is used, the worst-case time bound bloats
   955 %We represent the time
   900 to $|r| * 2^{|s|}$.
   956 %it takes by $P_2(r, s)$.\\
   901 %on the input
   957 %For $\mathit{DFA}$,
   902 %And when calculating the time complexity of the matching algorithm,
   958 %we have $P_2(r, s) = O( |s| )$,
   903 %we are assuming that each input reading step requires constant time.
   959 %because we take at most $|s|$ steps, 
   904 %which translates to that the number of 
   960 %and each step takes
   905 %states active and transitions taken each time is bounded by a
   961 %at most one transition--
   906 %constant $C$.
   962 %a deterministic-finite-automata
   907 %But modern  regex libraries in popular language engines
   963 %by definition has at most one state active and at most one
   908 % often want to support much richer constructs than just
   964 %transition upon receiving an input symbol.
   909 % sequences and Kleene stars,
   965 %But unfortunately in the  worst case
   910 %such as negation, intersection, 
   966 %$P_1(r) = O(exp^{|r|})$. An example will be given later. 
   911 %bounded repetitions and back-references.
   967 %For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
   912 %And de-sugaring these "extended" regular expressions 
   968 %expressions like $r^n$ into 
   913 %into basic ones might bloat the size exponentially.
   969 %\[
   914 %TODO: more reference for exponential size blowup on desugaring. 
   970 %	\underbrace{r \cdots r}_{\text{n copies of r}}.
   915 
   971 %\]
   916 \subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
   972 %The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
   917 
   973 %On the other hand, if backtracking is used, the worst-case time bound bloats
   918 
   974 %to $|r| * 2^{|s|}$.
   919 The good things about $\mathit{DFA}$s is that once
   975 %%on the input
   920 generated, they are fast and stable, unlike
   976 %%And when calculating the time complexity of the matching algorithm,
   921 backtracking algorithms. 
   977 %%we are assuming that each input reading step requires constant time.
   922 However, they do not scale well with bounded repetitions.
   978 %%which translates to that the number of 
   923 
   979 %%states active and transitions taken each time is bounded by a
   924 \subsubsection{Problems with Bounded Repetitions}
   980 %%constant $C$.
   925 
   981 %%But modern  regex libraries in popular language engines
   926 
   982 %% often want to support much richer constructs than just
   927 
   983 %% sequences and Kleene stars,
   928 
   984 %%such as negation, intersection, 
   929 
   985 %%bounded repetitions and back-references.
   930 
   986 %%And de-sugaring these "extended" regular expressions 
   931 
   987 %%into basic ones might bloat the size exponentially.
   932 
   988 %%TODO: more reference for exponential size blowup on desugaring. 
   933 
   989 %
   934 
   990 %\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
   935 
   991 %
   936 
   992 %
   937 
   993 %The good things about $\mathit{DFA}$s is that once
   938 
   994 %generated, they are fast and stable, unlike
   939 So we have practical implementations 
   995 %backtracking algorithms. 
   940 on regular expression matching/lexing which are fast
   996 %However, they do not scale well with bounded repetitions.
   941 but do not come with any guarantees that it will not grind to a halt
   997 %
   942 or give wrong answers.
   998 %\subsubsection{Problems with Bounded Repetitions}
   943 Our goal is to have a regex lexing algorithm that comes with 
   999 %
       
  1000 %
       
  1001 
       
  1002 To summarise, we need regex libraries that are both fast
       
  1003 and correct.
       
  1004 And that correctness needs to be built on a precise
       
  1005 model of what POSIX disambiguation is.
       
  1006 We propose a solution that addresses both problems
       
  1007 based on Brzozowski, Sulzmann and Lu and Ausaf and Urban's work.
       
  1008 The end result is a regular expression lexing algorithm that comes with 
   944 \begin{itemize}
  1009 \begin{itemize}
   945 \item
  1010 \item
   946 proven correctness 
  1011 a proven correctness theorem according to POSIX specification 
       
  1012 given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
   947 \item 
  1013 \item 
   948 proven non-catastrophic properties
  1014 a proven property saying that the algorithm's internal data structure will
       
  1015 remain finite,
   949 \item
  1016 \item
   950 easy extensions to
  1017 and extension to
   951 constructs like 
  1018 the bounded repetitions construct with the correctness and finiteness property
   952  bounded repetitions, negation,  lookarounds, and even back-references.
  1019 maintained.
   953  \end{itemize}
  1020  \end{itemize}
   954  
  1021  
   955 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
  1022 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
   956 We propose Brzozowski derivatives on regular expressions as
  1023 We propose Brzozowski derivatives on regular expressions as
   957   a solution to this.
  1024   a solution to this.