lexing: comparison ChengsongTanPhdThesis/Chapters/Introduction.tex


-:b079aaee5e10
+:7ce2389dff4b
 but they occur actually often enough that they have a
 name: Regular-Expression-Denial-Of-Service (ReDoS)
 attacks.
 Davis et al. \cite{Davis18} detected more
 than 1000 evil regular expressions
-in Node.js, Python core libraries, npm and in pypi.
+in Node.js, Python core libraries, npm and pypi.
 They therefore concluded that evil regular expressions
 are real problems rather than just ``a parlour trick''.
 This work aims to address this issue
 with the help of formal proofs.
 We describe a lexing algorithm based
-on Brzozowski derivatives with verified correctness (in
+on Brzozowski derivatives with verified correctness
-Isabelle/HOL)
+and a finiteness property for the size of derivatives
-and a finiteness property for the size of derivatives.
+(which are all done in Isabelle/HOL).
 Such properties %guarantee the absence of
 are an important step in preventing
 catastrophic backtracking once and for all.
 We will give more details in the next sections
 on (i) why the slow cases in graph \ref{fig:aStarStarb}
 approach based on Brzozowski derivatives and formal proofs.
 \section{Preliminaries}%Regex, and the Problems with Regex Matchers}
 Regular expressions and regular expression matchers
-have of course been studied for many, many years.
+have clearly been studied for many, many years.
 Theoretical results in automata theory state
 that basic regular expression matching should be linear
 w.r.t.\ the input.
 This assumes that the regular expression
 $r$ was pre-processed and turned into a
-deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
+deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
 By basic we mean textbook definitions such as the one
 below, involving only regular expressions for characters, alternatives,
 sequences, and Kleene stars:
 \[
 	r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
 \]
 Modern regular expression matchers used by programmers,
 however,
-support much richer constructs, such as bounded repetitions
+support much richer constructs, such as bounded repetitions,
+negations,
 and back-references.
 To differentiate, we use the word \emph{regex} to refer
 to those expressions with richer constructs while reserving the
 term \emph{regular expression}
 for the more traditional meaning in formal language theory.
 that make it more convenient for
 programmers to write regular expressions.
 Depending on the types of constructs
 the task of matching and lexing with them
 will have different levels of complexity.
-Some of those constructs are just syntactic sugars that are
+Some of those constructs are syntactic sugar, that is,
 simply shorthand notations
 that save the programmers a few keystrokes.
 These will not cause problems for regex libraries.
 For example the
 non-binary alternative involving three or more choices just means:
 \[
 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
 \]
-Similarly, the range operator used to express the alternative
+Similarly, the range operator
-of all characters between its operands is just a concise way:
+%used to express the alternative
+%of all characters between its operands,
+is just a concise way
+of expressing an alternative of consecutive characters:
 \[
-	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
+	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )
 \]
 for an alternative. The
-wildcard character $.$ is used to refer to any single character,
+wildcard character '$.$' is used to refer to any single character,
 \[
 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
 \]
 except the newline.
 \subsection{Bounded Repetitions}
 More interesting are bounded repetitions, which can
 make the regular expressions much
 more compact.
-There are
+Normally there are four kinds of bounded repetitions:
 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
 (where $n$ and $m$ are constant natural numbers).
 Like the star regular expressions, the set of strings or language
 a bounded regular expression can match
 is defined using the power operation on sets:
 		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
 		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
 	\end{tabular}
 \end{center}
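The power-based definition above can be illustrated with a small executable sketch. This is only an illustration over finite sets of strings, not part of the formalisation; the function names `power` and `bounded` are our own hypothetical choices.

```python
def power(lang, i):
    """Return lang^i: all concatenations of i strings drawn from lang."""
    result = {""}  # lang^0 contains only the empty string
    for _ in range(i):
        result = {u + v for u in result for v in lang}
    return result


def bounded(lang, n, m):
    """Return the union of lang^i for n <= i <= m, i.e. L(r^{n..m})."""
    out = set()
    for i in range(n, m + 1):
        out |= power(lang, i)
    return out
```

For instance, `bounded({"a", "b"}, 1, 2)` collects all strings over `a` and `b` of length one or two.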
 The attraction of bounded repetitions is that they can be
-used to avoid a blow up: for example $r^{\{n\}}$
+used to avoid a size blow up: for example $r^{\{n\}}$
 is a shorthand for
+the much longer regular expression:
 \[
 	\underbrace{r\ldots r}_\text{n copies of r}.
 \]
 %Therefore, a naive algorithm that simply unfolds
 %them into their desugared forms
 %will suffer from at least an exponential runtime increase.
 The problem with matching
+such bounded repetitions
 is that tools based on the classic notion of
 automata need to expand $r^{\{n\}}$ into $n$ connected
 copies of the automaton for $r$. This leads to very inefficient matching
 algorithms  or algorithms that consume large amounts of memory.
 Implementations using $\DFA$s will
+in such situations
 either become excruciatingly slow
-(for example Verbatim++~\cite{Verbatimpp}) or get
+(for example Verbatim++ \cite{Verbatimpp}) or run
-out of memory errors (for example $\mathit{LEX}$ and
+out of memory (for example $\mathit{LEX}$ and
-$\mathit{JFLEX}$\footnote{which are lexer generators
+$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
 in C and JAVA that generate $\mathit{DFA}$-based
 lexers. The user provides a set of regular expressions
-and configurations to them, and then
+and configurations, and then
 gets an output program encoding a minimized $\mathit{DFA}$
 that can be compiled and run.
 When given the above countdown regular expression,
-a small $n$ (a few dozen) would result in a
+a small $n$ (say 20) would result in a program representing a
-determinised automata
+DFA
 with millions of states.}) for large counters.
 A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
 where the minimal DFA requires at least $2^{n+1}$ states.
 For example, when $n$ is equal to 2,
-The corresponding $\mathit{NFA}$ looks like:
+the corresponding $\mathit{NFA}$ looks like:
+\vspace{6mm}
 \begin{center}
 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
 \node[state,initial] (q_0)   {$q_0$};
 \node[state, red] (q_1) [right=of q_0] {$q_1$};
 \node[state, red] (q_2) [right=of q_1] {$q_2$};
 	  edge [loop below] node {a,b} ()
 (q_1) edge  node  {a,b} (q_2)
 (q_2) edge  node  {a,b} (q_3);
 \end{tikzpicture}
 \end{center}
-when turned into a DFA by the subset construction
+and when turned into a DFA by the subset construction
 requires at least $2^3$ states.\footnote{The
-red states are "countdown states" which counts down
+red states are "countdown states" which count down
 the number of characters needed in addition to the current
 string to make a successful match.
 For example, state $q_1$ indicates a match that has
 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
 and just consumed the "delimiter" $a$ in the middle, and
-need to match 2 more iterations of $(a|b)$ to complete.
+needs to match 2 more iterations of $(a|b)$ to complete.
 State $q_2$, on the other hand, can be viewed as a state
 after $q_1$ has consumed 1 character, and just waits
 for 1 more character to complete.
-$q_3$ is the last state, requiring 0 more character and is accepting.
+The state $q_3$ is the last (accepting) state, requiring 0
+more characters.
 Depending on the suffix of the
 input string up to the current read location,
 the states $q_1$, $q_2$ and $q_3$
 may or may
-not be active, independent from each other.
+not be active.
 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
 contain at least $2^3$ non-equivalent states that cannot be merged,
 because the subset construction during determinisation will generate
 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
 Generalizing this to regular expressions with larger
 repetition bounds, we have that
 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
 would require at least $2^{n+1}$ states, if the language of $r$ itself contains
 more than one string.
 This is to represent all different
-scenarios which "countdown" states are active.}
+scenarios in which "countdown" states are active.}
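The state count claimed above can be checked experimentally for small $n$. The following is a minimal sketch, assuming the NFA shape shown in the picture (state 0 loops on a and b and moves to state 1 on a; states 1 to n advance on either character; state n+1 is accepting). It counts the subsets reachable in the subset construction; the function name `count_dfa_states` is hypothetical.

```python
def count_dfa_states(n):
    """Count reachable subset-construction states for (a|b)* a (a|b){n}."""
    last = n + 1  # index of the accepting NFA state

    def step(states, ch):
        nxt = set()
        for q in states:
            if q == 0:
                nxt.add(0)          # loop on a and b
                if ch == "a":
                    nxt.add(1)      # guess that this 'a' is the delimiter
            elif q < last:
                nxt.add(q + 1)      # countdown states advance on a and b
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        cur = todo.pop()
        for ch in "ab":
            nxt = step(cur, ch)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)
```

Under these assumptions `count_dfa_states(2)` yields the $2^3$ subsets mentioned above, and the count doubles with each increment of $n$.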
-Bounded repetitions are very important because they
+Bounded repetitions are important because they
-tend to occur a lot in practical use,
+tend to occur frequently in practical use,
-for example in the regex library RegExLib,
+for example in the regex library RegExLib, in
 the rules library of Snort \cite{Snort1999}\footnote{
 Snort is a network intrusion detection (NID) tool
 for monitoring network traffic.
 The network security community curates a list
 of malicious patterns written as regexes,
 According to Bj\"{o}rklund et al.\ \cite{xml2015},
 more than half of the
 XSDs they found on the Maven.org central repository
 have bounded regular expressions in them.
 Often the counters are quite large, with the largest being
-approximately up to ten million.
+close to ten million.
-An example XSD they gave
+A smaller sample XSD they gave
 is:
 \begin{verbatim}
 <sequence minOccurs="0" maxOccurs="65535">
 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
 </sequence>
 \end{verbatim}
-This can be seen as the expression
+This can be seen as the regex
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions
 satisfying certain constraints (such as
 satisfying the floating point number format).
 It is therefore quite unsatisfying that
 For example, in the regular expression matching library in the Go
 language the regular expression $a^{1001}$ is not permitted, because no counter
 can be above 1000, and in the built-in Rust regular expression library
 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
 for being too big.
-As Becchi and Crawley~\cite{Becchi08}  have pointed out,
+As Becchi and Crowley \cite{Becchi08} have pointed out,
 the reason for these restrictions
 is that they simulate a non-deterministic finite
 automaton (NFA) with a breadth-first search.
 This way the number of active states could
 be equal to the counter number.
 When the counters are large,
 the memory requirement could become
 infeasible, and a regex engine
-like Go will reject this pattern straight away.
+like the one in Go will reject this pattern straight away.
 \begin{figure}[H]
 \begin{center}
 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
 	\node (q0) [state, initial] {$0$};
 counters \cite{Turo_ov__2020}.
 These solutions can be quite efficient,
 with the ability to process
 gigabytes of string input per second
 even with large counters \cite{Becchi08}.
-But formal reasoning about these automata especially in Isabelle
+These practical solutions do not come with
-can be challenging
+formal guarantees, and as pointed out by
-and un-intuitive.
+Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
-Therefore, we take correctness and runtime claims made about these solutions
+%But formal reasoning about these automata especially in Isabelle
-with a grain of salt.
+%can be challenging
+%and un-intuitive.
+%Therefore, we take correctness and runtime claims made about these solutions
+%with a grain of salt.
 In the work reported in \cite{FoSSaCS2023} and here,
 we add better support using derivatives
-for bounded regular expressions $r^{\{n\}}$.
+for the bounded regular expression $r^{\{n\}}$.
-The results
+Our results
 extend straightforwardly to
-repetitions with an interval such as
+repetitions with intervals such as
 $r^{\{n\ldots m\}}$.
 The merit of Brzozowski derivatives (more on this later)
 for this problem is that
 they can be naturally extended to support bounded repetitions.
-Moreover these extensions are still made up of only
+Moreover, these extensions are still made up of only small
 inductive datatypes and recursive functions,
 making it handy to deal with them in theorem provers.
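To give a flavour of how little is needed for such an extension, here is a minimal, unverified sketch of a derivative-based matcher with an extra constructor for $r^{\{n\}}$. It is not the Isabelle/HOL formalisation; the class names (`NTimes` etc.) and the functions `nullable`, `der` and `matches` are hypothetical names chosen for illustration, and no simplification is performed, so derivatives may grow on longer inputs.

```python
from dataclasses import dataclass


class Rexp:
    pass

@dataclass(frozen=True)
class Zero(Rexp):          # matches nothing
    pass

@dataclass(frozen=True)
class One(Rexp):           # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr(Rexp):
    c: str

@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp

@dataclass(frozen=True)
class NTimes(Rexp):        # bounded repetition r^{n}
    r: Rexp
    n: int


def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, (Zero, Chr)):
        return False
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    if isinstance(r, NTimes):
        return r.n == 0 or nullable(r.r)
    raise TypeError(r)


def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    if isinstance(r, NTimes):
        # The counter is decremented rather than unfolded into n copies.
        if r.n == 0:
            return Zero()
        return Seq(der(c, r.r), NTimes(r.r, r.n - 1))
    raise TypeError(r)


def matches(r, s):
    """Take derivatives character by character, then test nullability."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

In this sketch the only additions for bounded repetitions are one constructor and one clause each in `nullable` and `der`.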
 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
 %straightforwardly extended to deal with bounded regular expressions
 %and moreover the resulting code still consists of only simple

changeset 635	7ce2389dff4b
parent 634	b079aaee5e10
child 636	0bcb4a7cb40c