# HG changeset patch # User Chengsong # Date 1671575574 0 # Node ID 7ce2389dff4b2dfae52b88e2797435cb9ae928d2 # Parent b079aaee5e10d529659e60f4695045115e345785 more diff -r b079aaee5e10 -r 7ce2389dff4b ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Dec 18 18:17:52 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Dec 20 22:32:54 2022 +0000 @@ -421,16 +421,16 @@ attacks. Davis et al. \cite{Davis18} detected more than 1000 evil regular expressions -in Node.js, Python core libraries, npm and in pypi. +in Node.js, Python core libraries, npm and pypi. They therefore concluded that evil regular expressions are real problems rather than just "a parlour trick". This work aims to address this issue with the help of formal proofs. We describe a lexing algorithm based -on Brzozowski derivatives with verified correctness (in -Isabelle/HOL) -and a finiteness property for the size of derivatives. +on Brzozowski derivatives with verified correctness +and a finiteness property for the size of derivatives +(which are all done in Isabelle/HOL). Such properties %guarantee the absence of are an important step in preventing catastrophic backtracking once and for all. @@ -443,13 +443,13 @@ \section{Preliminaries}%Regex, and the Problems with Regex Matchers} Regular expressions and regular expression matchers -have of course been studied for many, many years. +have clearly been studied for many, many years. Theoretical results in automata theory state that basic regular expression matching should be linear w.r.t the input. This assumes that the regular expression $r$ was pre-processed and turned into a -deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}. +deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}. By basic we mean textbook definitions such as the one below, involving only regular expressions for characters, alternatives, sequences, and Kleene stars: @@ -458,7 +458,8 @@ \] Modern regular expression matchers used by programmers, however, -support much richer constructs, such as bounded repetitions +support much richer constructs, such as bounded repetitions, +negations, and back-references. To differentiate, we use the word \emph{regex} to refer to those expressions with richer constructs while reserving the @@ -481,7 +482,7 @@ Depending on the types of constructs the task of matching and lexing with them will have different levels of complexity. -Some of those constructs are just syntactic sugars that are +Some of those constructs are syntactic sugars that are simply short hand notations that save the programmers a few keystrokes. These will not cause problems for regex libraries. @@ -490,13 +491,16 @@ \[ (a | b | c) \stackrel{means}{=} ((a + b)+ c) \] -Similarly, the range operator used to express the alternative -of all characters between its operands is just a concise way: +Similarly, the range operator +%used to express the alternative +%of all characters between its operands, +is just a concise way +of expressing an alternative of consecutive characters: \[ - [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)} + [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \] for an alternative. The -wildcard character $.$ is used to refer to any single character, +wildcard character '$.$' is used to refer to any single character, \[ . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] \] @@ -506,7 +510,7 @@ More interesting are bounded repetitions, which can make the regular expressions much more compact. -There are +Normally there are four kinds of bounded repetitions: $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$ (where $n$ and $m$ are constant natural numbers). Like the star regular expressions, the set of strings or language @@ -521,8 +525,9 @@ \end{tabular} \end{center} The attraction of bounded repetitions is that they can be -used to avoid a blow up: for example $r^{\{n\}}$ +used to avoid a size blow up: for example $r^{\{n\}}$ is a shorthand for +the much longer regular expression: \[ \underbrace{r\ldots r}_\text{n copies of r}. \] @@ -532,28 +537,31 @@ The problem with matching +such bounded repetitions is that tools based on the classic notion of automata need to expand $r^{\{n\}}$ into $n$ connected copies of the automaton for $r$. This leads to very inefficient matching algorithms or algorithms that consume large amounts of memory. Implementations using $\DFA$s will +in such situations either become excruciatingly slow -(for example Verbatim++~\cite{Verbatimpp}) or get -out of memory errors (for example $\mathit{LEX}$ and -$\mathit{JFLEX}$\footnote{which are lexer generators +(for example Verbatim++ \cite{Verbatimpp}) or run +out of memory (for example $\mathit{LEX}$ and +$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators in C and JAVA that generate $\mathit{DFA}$-based lexers. The user provides a set of regular expressions -and configurations to them, and then +and configurations, and then gets an output program encoding a minimized $\mathit{DFA}$ that can be compiled and run. When given the above countdown regular expression, -a small $n$ (a few dozen) would result in a -determinised automata +a small $n$ (say 20) would result in a program representing a +DFA with millions of states.}) for large counters. A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$ where the minimal DFA requires at least $2^{n+1}$ states. For example, when $n$ is equal to 2, -The corresponding $\mathit{NFA}$ looks like: +the corresponding $\mathit{NFA}$ looks like: +\vspace{6mm} \begin{center} \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] \node[state,initial] (q_0) {$q_0$}; @@ -567,24 +575,25 @@ (q_2) edge node {a,b} (q_3); \end{tikzpicture} \end{center} -when turned into a DFA by the subset construction +and when turned into a DFA by the subset construction requires at least $2^3$ states.\footnote{The -red states are "countdown states" which counts down +red states are "countdown states" which count down the number of characters needed in addition to the current string to make a successful match. For example, state $q_1$ indicates a match that has gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, and just consumed the "delimiter" $a$ in the middle, and -need to match 2 more iterations of $(a|b)$ to complete. +needs to match 2 more iterations of $(a|b)$ to complete. State $q_2$ on the other hand, can be viewed as a state after $q_1$ has consumed 1 character, and just waits for 1 more character to complete. -$q_3$ is the last state, requiring 0 more character and is accepting. +The state $q_3$ is the last (accepting) state, requiring 0 +more characters. Depending on the suffix of the input string up to the current read location, the states $q_1$ and $q_2$, $q_3$ may or may -not be active, independent from each other. +not be active. A $\mathit{DFA}$ for such an $\mathit{NFA}$ would contain at least $2^3$ non-equivalent states that cannot be merged, because the subset construction during determinisation will generate @@ -595,12 +604,12 @@ would require at least $2^{n+1}$ states, if $r$ itself contains more than 1 string. This is to represent all different -scenarios which "countdown" states are active.} +scenarios in which "countdown" states are active.} -Bounded repetitions are very important because they -tend to occur a lot in practical use, -for example in the regex library RegExLib, +Bounded repetitions are important because they +tend to occur frequently in practical use, +for example in the regex library RegExLib, in the rules library of Snort \cite{Snort1999}\footnote{ Snort is a network intrusion detection (NID) tool for monitoring network traffic. @@ -615,8 +624,8 @@ XSDs they found on the Maven.org central repository have bounded regular expressions in them. Often the counters are quite large, with the largest being -approximately up to ten million. -An example XSD they gave +close to ten million. +A smaller sample XSD they gave is: \begin{verbatim} @@ -624,7 +633,7 @@ \end{verbatim} -This can be seen as the expression +This can be seen as the regex $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves regular expressions satisfying certain constraints (such as @@ -638,7 +647,7 @@ can be above 1000, and in the built-in Rust regular expression library expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message for being too big. -As Becchi and Crawley~\cite{Becchi08} have pointed out, +As Becchi and Crawley \cite{Becchi08} have pointed out, the reason for these restrictions is that they simulate a non-deterministic finite automata (NFA) with a breadth-first search. @@ -647,7 +656,7 @@ When the counters are large, the memory requirement could become infeasible, and a regex engine -like Go will reject this pattern straight away. +like in Go will reject this pattern straight away. \begin{figure}[H] \begin{center} \begin{tikzpicture} [node distance = 2cm, on grid, auto] @@ -702,23 +711,26 @@ with the ability to process gigabytes of strings input per second even with large counters \cite{Becchi08}. -But formal reasoning about these automata especially in Isabelle -can be challenging -and un-intuitive. -Therefore, we take correctness and runtime claims made about these solutions -with a grain of salt. +These practical solutions do not come with +formal guarantees, and as pointed out by +Kuklewicz \cite{KuklewiczHaskell}, can be error-prone. +%But formal reasoning about these automata especially in Isabelle +%can be challenging +%and un-intuitive. +%Therefore, we take correctness and runtime claims made about these solutions +%with a grain of salt. In the work reported in \cite{FoSSaCS2023} and here, we add better support using derivatives -for bounded regular expressions $r^{\{n\}}$. -The results +for bounded regular expression $r^{\{n\}}$. +Our results extend straightforwardly to -repetitions with an interval such as +repetitions with intervals such as $r^{\{n\ldots m\}}$. The merit of Brzozowski derivatives (more on this later) on this problem is that it can be naturally extended to support bounded repetitions. -Moreover these extensions are still made up of only +Moreover these extensions are still made up of only small inductive datatypes and recursive functions, making it handy to deal with them in theorem provers. %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be