577 Bounded repetitions are very important because they |
577 Bounded repetitions are very important because they |
578 tend to occur a lot in practical use. |
578 tend to occur a lot in practical use. |
579 For example in the regex library RegExLib, |
579 For example in the regex library RegExLib, |
580 the rules library of Snort \cite{Snort1999}\footnote{ |
580 the rules library of Snort \cite{Snort1999}\footnote{ |
581 Snort is a network intrusion detection (NID) tool |
581 Snort is a network intrusion detection (NID) tool |
582 for monitoring network traffic.}, |
582 for monitoring network traffic. |
|
583 The network security community curates a list |
|
584 of malicious patterns written as regexes, |
|
585 which is used by Snort's detection engine |
|
586 to match against network traffic for any hostile |
|
587 activities such as buffer overflow attacks.}, |
583 as well as in XML Schema definitions (XSDs). |
588 as well as in XML Schema definitions (XSDs). |
584 According to Bj\"{o}rklund et al \cite{xml2015}, |
589 According to Bj\"{o}rklund et al \cite{xml2015}, |
585 more than half of the |
590 more than half of the |
586 XSDs they found have bounded regular expressions in them. |
591 XSDs they found have bounded regular expressions in them. |
587 Often the counters are quite large, the largest up to ten million. |
592 Often the counters are quite large, the largest up to ten million. |
806 go beyond the regular language hierarchy |
811 go beyond the regular language hierarchy |
807 with back-references. |
812 with back-references. |
808 In fact, they allow the regex construct to express |
813 In fact, they allow the regex construct to express |
809 languages that cannot be contained in context-free |
814 languages that cannot be contained in context-free |
810 languages either. |
815 languages either. |
811 For example, the back-reference $((a^*)b\backslash1 b \backslash 1$ |
816 For example, the back-reference $(a^*)b\backslash1 b \backslash 1$ |
812 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
817 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$, |
813 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. |
818 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}. |
814 Such a language is contained in the context-sensitive hierarchy |
819 Such a language is contained in the context-sensitive hierarchy |
815 of formal languages. |
820 of formal languages. |
816 Solving the back-reference expressions matching problem |
821 Solving the back-reference expressions matching problem |
817 is NP-complete\parencite{alfred2014algorithms}. |
822 is NP-complete\parencite{alfred2014algorithms}. |
818 A non-bactracking, |
823 A non-bactracking, |
819 efficient solution is not known to exist. |
824 efficient solution is not known to exist. |
820 Regex libraries supporting back-references such as PCRE therefore have to |
825 Regex libraries supporting back-references such as |
|
826 PCRE \cite{pcre} therefore have to |
821 revert to a depth-first search algorithm that backtracks. |
827 revert to a depth-first search algorithm that backtracks. |
822 The unreasonable part with them, is that even in the case of |
828 The unreasonable part with them, is that even in the case of |
823 regexes not involving back-references, there is still |
829 regexes not involving back-references, there is still |
824 a (non-negligible) chance they might backtrack super-linearly. |
830 a (non-negligible) chance they might backtrack super-linearly. |
825 |
831 |
934 \end{itemize} |
941 \end{itemize} |
935 \end{quote} |
942 \end{quote} |
936 The text above |
943 The text above |
937 is trying to capture something very precise, |
944 is trying to capture something very precise, |
938 and is crying out for formalising. |
945 and is crying out for formalising. |
939 |
946 Ausaf et al. \cite{AusafDyckhoffUrban2016} |
940 |
947 are the first to describe such a formalised POSIX |
941 %\subsection{Different Phases of a Matching/Lexing Algorithm} |
948 specification in Isabelle/HOL. |
942 % |
949 They then formally proved the correctness of |
943 % |
950 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014} |
944 %Most lexing algorithms can be roughly divided into |
951 based on that specification. |
945 %two phases during its run. |
952 |
946 %The first phase is the "construction" phase, |
953 In this thesis, |
947 %in which the algorithm builds some |
954 we propose a solution to catastrophic |
948 %suitable data structure from the input regex $r$, so that |
955 backtracking and error-prone matchers--a formally verified |
949 %it can be easily operated on later. |
956 regular expression lexing algorithm |
950 %We denote |
957 that is both fast |
951 %the time cost for such a phase by $P_1(r)$. |
958 and correct by extending Ausaf et al.'s work. |
952 %The second phase is the lexing phase, when the input string |
|
953 %$s$ is read and the data structure |
|
954 %representing that regex $r$ is being operated on. |
|
955 %We represent the time |
|
956 %it takes by $P_2(r, s)$.\\ |
|
957 %For $\mathit{DFA}$, |
|
958 %we have $P_2(r, s) = O( |s| )$, |
|
959 %because we take at most $|s|$ steps, |
|
960 %and each step takes |
|
961 %at most one transition-- |
|
962 %a deterministic-finite-automata |
|
963 %by definition has at most one state active and at most one |
|
964 %transition upon receiving an input symbol. |
|
965 %But unfortunately in the worst case |
|
966 %$P_1(r) = O(exp^{|r|})$. An example will be given later. |
|
967 %For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold |
|
968 %expressions like $r^n$ into |
|
969 %\[ |
|
970 % \underbrace{r \cdots r}_{\text{n copies of r}}. |
|
971 %\] |
|
972 %The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack. |
|
973 %On the other hand, if backtracking is used, the worst-case time bound bloats |
|
974 %to $|r| * 2^{|s|}$. |
|
975 %%on the input |
|
976 %%And when calculating the time complexity of the matching algorithm, |
|
977 %%we are assuming that each input reading step requires constant time. |
|
978 %%which translates to that the number of |
|
979 %%states active and transitions taken each time is bounded by a |
|
980 %%constant $C$. |
|
981 %%But modern regex libraries in popular language engines |
|
982 %% often want to support much richer constructs than just |
|
983 %% sequences and Kleene stars, |
|
984 %%such as negation, intersection, |
|
985 %%bounded repetitions and back-references. |
|
986 %%And de-sugaring these "extended" regular expressions |
|
987 %%into basic ones might bloat the size exponentially. |
|
988 %%TODO: more reference for exponential size blowup on desugaring. |
|
989 % |
|
990 %\subsection{Why $\mathit{DFA}s$ can be slow in the first phase} |
|
991 % |
|
992 % |
|
993 %The good things about $\mathit{DFA}$s is that once |
|
994 %generated, they are fast and stable, unlike |
|
995 %backtracking algorithms. |
|
996 %However, they do not scale well with bounded repetitions. |
|
997 % |
|
998 %\subsubsection{Problems with Bounded Repetitions} |
|
999 % |
|
1000 % |
|
1001 |
|
1002 To summarise, we need regex libraries that are both fast |
|
1003 and correct. |
|
1004 And that correctness needs to be built on a precise |
|
1005 model of what POSIX disambiguation is. |
|
1006 We propose a solution that addresses both problems |
|
1007 based on Brzozowski, Sulzmann and Lu and Ausaf and Urban's work. |
|
1008 The end result is a regular expression lexing algorithm that comes with |
959 The end result is a regular expression lexing algorithm that comes with |
1009 \begin{itemize} |
960 \begin{itemize} |
1010 \item |
961 \item |
1011 a proven correctness theorem according to POSIX specification |
962 a proven correctness theorem according to POSIX specification |
1012 given by Ausaf and Urban \cite{AusafDyckhoffUrban2016}, |
963 given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, |
1013 \item |
964 \item |
1014 a proven property saying that the algorithm's internal data structure will |
965 a proven complexity-related property saying that the |
|
966 algorithm's internal data structure will |
1015 remain finite, |
967 remain finite, |
1016 \item |
968 \item |
1017 and extension to |
969 and extension to |
1018 the bounded repetitions construct with the correctness and finiteness property |
970 the bounded repetitions construct with the correctness and finiteness property |
1019 maintained. |
971 maintained. |
1020 \end{itemize} |
972 \end{itemize} |
|
973 In the next section we will very briefly |
|
974 introduce Brzozowski derivatives. |
|
975 We will give a taste to what they |
|
976 are like and why they are suitable for regular expression |
|
977 matching and lexing. |
1021 |
978 |
1022 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives} |
979 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives} |
1023 We propose Brzozowski derivatives on regular expressions as |
980 We propose Brzozowski derivatives on regular expressions as |
1024 a solution to this. |
981 a solution to this. |
1025 In the last fifteen or so years, Brzozowski's derivatives of regular |
982 In the last fifteen or so years, Brzozowski's derivatives of regular |