ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 608 37b6fd310a16
parent 607 e6fc9b72c0e3
child 609 61139fdddae0
equal deleted inserted replaced
607:e6fc9b72c0e3 608:37b6fd310a16
   577 Bounded repetitions are very important because they
   577 Bounded repetitions are very important because they
   578 tend to occur a lot in practical use.
   578 tend to occur a lot in practical use.
   579 For example in the regex library RegExLib,
   579 For example in the regex library RegExLib,
   580 the rules library of Snort \cite{Snort1999}\footnote{
   580 the rules library of Snort \cite{Snort1999}\footnote{
   581 Snort is a network intrusion detection (NID) tool
   581 Snort is a network intrusion detection (NID) tool
   582 for monitoring network traffic.}, 
   582 for monitoring network traffic.
       
   583 The network security community curates a list
       
   584 of malicious patterns written as regexes,
       
   585 which is used by Snort's detection engine
       
   586 to match against network traffic for any hostile
       
   587 activities such as buffer overflow attacks.}, 
   583 as well as in XML Schema definitions (XSDs).
   588 as well as in XML Schema definitions (XSDs).
   584 According to Bj\"{o}rklund et al \cite{xml2015},
   589 According to Bj\"{o}rklund et al \cite{xml2015},
   585 more than half of the 
   590 more than half of the 
   586 XSDs they found have bounded regular expressions in them.
   591 XSDs they found have bounded regular expressions in them.
   587 Often the counters are quite large, the largest up to ten million. 
   592 Often the counters are quite large, the largest up to ten million. 
   681 for bounded regular expressions $r^{\{n\}}$.
   686 for bounded regular expressions $r^{\{n\}}$.
   682 The results
   687 The results
   683 extend straightforwardly to
   688 extend straightforwardly to
   684 repetitions with an interval such as 
   689 repetitions with an interval such as 
   685 $r^{\{n\ldots m\}}$.
   690 $r^{\{n\ldots m\}}$.
   686 The merit of Brzozowski derivatives
   691 The merit of Brzozowski derivatives (more on this later)
   687 on this problem is that
   692 on this problem is that
   688 it can be naturally extended to support bounded repetitions.
   693 it can be naturally extended to support bounded repetitions.
   689 Moreover these extensions are still made up of only
   694 Moreover these extensions are still made up of only
   690 inductive datatypes and recursive functions,
   695 inductive datatypes and recursive functions,
   691 making it handy to deal with using theorem provers.
   696 making it handy to deal with using theorem provers.
   806 go beyond the regular language hierarchy
   811 go beyond the regular language hierarchy
   807 with back-references.
   812 with back-references.
   808 In fact, they allow the regex construct to express 
   813 In fact, they allow the regex construct to express 
   809 languages that cannot be contained in context-free
   814 languages that cannot be contained in context-free
   810 languages either.
   815 languages either.
   811 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
   816 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
   812 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   817 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
   813 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   818 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
   814 Such a language is contained in the context-sensitive hierarchy
   819 Such a language is contained in the context-sensitive hierarchy
   815 of formal languages. 
   820 of formal languages. 
   816 Solving the back-reference expressions matching problem
   821 Solving the back-reference expressions matching problem
   817 is NP-complete\parencite{alfred2014algorithms}.
   822 is NP-complete\parencite{alfred2014algorithms}.
   818 A non-bactracking,
   823 A non-bactracking,
   819 efficient solution is not known to exist.
   824 efficient solution is not known to exist.
   820 Regex libraries supporting back-references such as PCRE therefore have to
   825 Regex libraries supporting back-references such as 
       
   826 PCRE \cite{pcre} therefore have to
   821 revert to a depth-first search algorithm that backtracks.
   827 revert to a depth-first search algorithm that backtracks.
   822 The unreasonable part with them, is that even in the case of 
   828 The unreasonable part with them, is that even in the case of 
   823 regexes not involving back-references, there is still
   829 regexes not involving back-references, there is still
   824 a (non-negligible) chance they might backtrack super-linearly.
   830 a (non-negligible) chance they might backtrack super-linearly.
   825 
   831 
   909 %\noindent
   915 %\noindent
   910 The implementation complexity of POSIX rules also come from
   916 The implementation complexity of POSIX rules also come from
   911 the specification being not very clear.
   917 the specification being not very clear.
   912 There are many informal summaries of this disambiguation
   918 There are many informal summaries of this disambiguation
   913 strategy, which are often quite long and delicate.
   919 strategy, which are often quite long and delicate.
   914 For example Kuklewicz described the POSIX rule as
   920 For example Kuklewicz \cite{KuklewiczHaskell} 
       
   921 described the POSIX rule as
   915 \begin{quote}
   922 \begin{quote}
   916 	``
   923 	``
   917 	\begin{itemize}
   924 	\begin{itemize}
   918 		\item
   925 		\item
   919 regular expressions (REs) take the leftmost starting match, and the longest match starting there
   926 regular expressions (REs) take the leftmost starting match, and the longest match starting there
   934 \end{itemize}
   941 \end{itemize}
   935 \end{quote}
   942 \end{quote}
   936 The text above 
   943 The text above 
   937 is trying to capture something very precise,
   944 is trying to capture something very precise,
   938 and is crying out for formalising.
   945 and is crying out for formalising.
   939 
   946 Ausaf et al. \cite{AusafDyckhoffUrban2016}
   940 
   947 are the first to describe such a formalised POSIX
   941 %\subsection{Different Phases of a Matching/Lexing Algorithm}
   948 specification in Isabelle/HOL.
   942 %
   949 They then formally proved the correctness of
   943 %
   950 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
   944 %Most lexing algorithms can be roughly divided into 
   951 based on that specification.
   945 %two phases during its run.
   952 
   946 %The first phase is the "construction" phase,
   953 In this thesis,
   947 %in which the algorithm builds some  
   954 we propose a solution to catastrophic
   948 %suitable data structure from the input regex $r$, so that
   955 backtracking and error-prone matchers--a formally verified
   949 %it can be easily operated on later.
   956 regular expression lexing algorithm
   950 %We denote
   957 that is both fast
   951 %the time cost for such a phase by $P_1(r)$.
   958 and correct by extending Ausaf et al.'s work.
   952 %The second phase is the lexing phase, when the input string 
       
   953 %$s$ is read and the data structure
       
   954 %representing that regex $r$ is being operated on. 
       
   955 %We represent the time
       
   956 %it takes by $P_2(r, s)$.\\
       
   957 %For $\mathit{DFA}$,
       
   958 %we have $P_2(r, s) = O( |s| )$,
       
   959 %because we take at most $|s|$ steps, 
       
   960 %and each step takes
       
   961 %at most one transition--
       
   962 %a deterministic-finite-automata
       
   963 %by definition has at most one state active and at most one
       
   964 %transition upon receiving an input symbol.
       
   965 %But unfortunately in the  worst case
       
   966 %$P_1(r) = O(exp^{|r|})$. An example will be given later. 
       
   967 %For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
       
   968 %expressions like $r^n$ into 
       
   969 %\[
       
   970 %	\underbrace{r \cdots r}_{\text{n copies of r}}.
       
   971 %\]
       
   972 %The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
       
   973 %On the other hand, if backtracking is used, the worst-case time bound bloats
       
   974 %to $|r| * 2^{|s|}$.
       
   975 %%on the input
       
   976 %%And when calculating the time complexity of the matching algorithm,
       
   977 %%we are assuming that each input reading step requires constant time.
       
   978 %%which translates to that the number of 
       
   979 %%states active and transitions taken each time is bounded by a
       
   980 %%constant $C$.
       
   981 %%But modern  regex libraries in popular language engines
       
   982 %% often want to support much richer constructs than just
       
   983 %% sequences and Kleene stars,
       
   984 %%such as negation, intersection, 
       
   985 %%bounded repetitions and back-references.
       
   986 %%And de-sugaring these "extended" regular expressions 
       
   987 %%into basic ones might bloat the size exponentially.
       
   988 %%TODO: more reference for exponential size blowup on desugaring. 
       
   989 %
       
   990 %\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
       
   991 %
       
   992 %
       
   993 %The good things about $\mathit{DFA}$s is that once
       
   994 %generated, they are fast and stable, unlike
       
   995 %backtracking algorithms. 
       
   996 %However, they do not scale well with bounded repetitions.
       
   997 %
       
   998 %\subsubsection{Problems with Bounded Repetitions}
       
   999 %
       
  1000 %
       
  1001 
       
  1002 To summarise, we need regex libraries that are both fast
       
  1003 and correct.
       
  1004 And that correctness needs to be built on a precise
       
  1005 model of what POSIX disambiguation is.
       
  1006 We propose a solution that addresses both problems
       
  1007 based on Brzozowski, Sulzmann and Lu and Ausaf and Urban's work.
       
  1008 The end result is a regular expression lexing algorithm that comes with 
   959 The end result is a regular expression lexing algorithm that comes with 
  1009 \begin{itemize}
   960 \begin{itemize}
  1010 \item
   961 \item
  1011 a proven correctness theorem according to POSIX specification 
   962 a proven correctness theorem according to POSIX specification 
  1012 given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
   963 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, 
  1013 \item 
   964 \item 
  1014 a proven property saying that the algorithm's internal data structure will
   965 a proven complexity-related property saying that the 
       
   966 algorithm's internal data structure will
  1015 remain finite,
   967 remain finite,
  1016 \item
   968 \item
  1017 and extension to
   969 and extension to
  1018 the bounded repetitions construct with the correctness and finiteness property
   970 the bounded repetitions construct with the correctness and finiteness property
  1019 maintained.
   971 maintained.
  1020  \end{itemize}
   972  \end{itemize}
       
   973 In the next section we will very briefly
       
   974 introduce Brzozowski derivatives.
       
   975 We will give a taste to what they 
       
   976 are like and why they are suitable for regular expression
       
   977 matching and lexing.
  1021  
   978  
  1022 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
   979 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
  1023 We propose Brzozowski derivatives on regular expressions as
   980 We propose Brzozowski derivatives on regular expressions as
  1024   a solution to this.
   981   a solution to this.
  1025 In the last fifteen or so years, Brzozowski's derivatives of regular
   982 In the last fifteen or so years, Brzozowski's derivatives of regular