lexing: changeset 635:7ce2389dff4b

--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sun Dec 18 18:17:52 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Dec 20 22:32:54 2022 +0000
@@ -421,16 +421,16 @@
 attacks.
 Davis et al. \cite{Davis18} detected more
 than 1000 evil regular expressions
-in Node.js, Python core libraries, npm and in pypi. 
+in Node.js, Python core libraries, npm and pypi. 
 They therefore concluded that evil regular expressions
 are real problems rather than just "a parlour trick".
 
 This work aims to address this issue
 with the help of formal proofs.
 We describe a lexing algorithm based
-on Brzozowski derivatives with verified correctness (in 
-Isabelle/HOL)
-and a finiteness property for the size of derivatives.
+on Brzozowski derivatives with verified correctness 
+and a finiteness property for the size of derivatives
+(which are all done in Isabelle/HOL).
 Such properties %guarantee the absence of 
 are an important step in preventing
 catastrophic backtracking once and for all.
@@ -443,13 +443,13 @@
 
 \section{Preliminaries}%Regex, and the Problems with Regex Matchers}
 Regular expressions and regular expression matchers 
-have of course been studied for many, many years.
+have clearly been studied for many, many years.
 Theoretical results in automata theory state 
 that basic regular expression matching should be linear
 w.r.t the input.
 This assumes that the regular expression
 $r$ was pre-processed and turned into a
-deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
+deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}.
 By basic we mean textbook definitions such as the one
 below, involving only regular expressions for characters, alternatives,
 sequences, and Kleene stars:
@@ -458,7 +458,8 @@
 \]
 Modern regular expression matchers used by programmers,
 however,
-support much richer constructs, such as bounded repetitions
+support much richer constructs, such as bounded repetitions,
+negations,
 and back-references.
 To differentiate, we use the word \emph{regex} to refer
 to those expressions with richer constructs while reserving the
@@ -481,7 +482,7 @@
 Depending on the types of constructs
 the task of matching and lexing with them
 will have different levels of complexity.
-Some of those constructs are just syntactic sugars that are
+Some of those constructs are syntactic sugars that are
 simply short hand notations
 that save the programmers a few keystrokes.
 These will not cause problems for regex libraries.
@@ -490,13 +491,16 @@
 \[
 	(a | b | c) \stackrel{means}{=} ((a + b)+ c)
 \]
-Similarly, the range operator used to express the alternative
-of all characters between its operands is just a concise way:
+Similarly, the range operator
+%used to express the alternative
+%of all characters between its operands, 
+is just a concise way
+of expressing an alternative of consecutive characters:
 \[
-	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
+	[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 )  
 \]
 for an alternative. The
-wildcard character $.$ is used to refer to any single character,
+wildcard character '$.$' is used to refer to any single character,
 \[
 	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
 \]
@@ -506,7 +510,7 @@
 More interesting are bounded repetitions, which can 
 make the regular expressions much
 more compact.
-There are 
+Normally there are four kinds of bounded repetitions:
 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
 (where $n$ and $m$ are constant natural numbers).
 Like the star regular expressions, the set of strings or language
@@ -521,8 +525,9 @@
 	\end{tabular}
 \end{center}
 The attraction of bounded repetitions is that they can be
-used to avoid a blow up: for example $r^{\{n\}}$
+used to avoid a size blow up: for example $r^{\{n\}}$
 is a shorthand for
+the much longer regular expression:
 \[
 	\underbrace{r\ldots r}_\text{n copies of r}.
 \]
@@ -532,28 +537,31 @@
 
 
 The problem with matching 
+such bounded repetitions
 is that tools based on the classic notion of
 automata need to expand $r^{\{n\}}$ into $n$ connected 
 copies of the automaton for $r$. This leads to very inefficient matching
 algorithms  or algorithms that consume large amounts of memory.
 Implementations using $\DFA$s will
+in such situations
 either become excruciatingly slow 
-(for example Verbatim++~\cite{Verbatimpp}) or get
-out of memory errors (for example $\mathit{LEX}$ and 
-$\mathit{JFLEX}$\footnote{which are lexer generators
+(for example Verbatim++ \cite{Verbatimpp}) or run
+out of memory (for example $\mathit{LEX}$ and 
+$\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators
 in C and JAVA that generate $\mathit{DFA}$-based
 lexers. The user provides a set of regular expressions
-and configurations to them, and then 
+and configurations, and then 
 gets an output program encoding a minimized $\mathit{DFA}$
 that can be compiled and run. 
 When given the above countdown regular expression,
-a small $n$ (a few dozen) would result in a 
-determinised automata
+a small $n$ (say 20) would result in a program representing a
+DFA
 with millions of states.}) for large counters.
 A classic example for this phenomenon is the regular expression $(a+b)^*  a (a+b)^{n}$
 where the minimal DFA requires at least $2^{n+1}$ states.
 For example, when $n$ is equal to 2,
-The corresponding $\mathit{NFA}$ looks like:
+the corresponding $\mathit{NFA}$ looks like:
+\vspace{6mm}
 \begin{center}
 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
    \node[state,initial] (q_0)   {$q_0$}; 
@@ -567,24 +575,25 @@
     (q_2) edge  node  {a,b} (q_3);
 \end{tikzpicture}
 \end{center}
-when turned into a DFA by the subset construction
+and when turned into a DFA by the subset construction
 requires at least $2^3$ states.\footnote{The 
-red states are "countdown states" which counts down 
+red states are "countdown states" which count down 
 the number of characters needed in addition to the current
 string to make a successful match.
 For example, state $q_1$ indicates a match that has
 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
 and just consumed the "delimiter" $a$ in the middle, and 
-need to match 2 more iterations of $(a|b)$ to complete.
+needs to match 2 more iterations of $(a|b)$ to complete.
 State $q_2$ on the other hand, can be viewed as a state
 after $q_1$ has consumed 1 character, and just waits
 for 1 more character to complete.
-$q_3$ is the last state, requiring 0 more character and is accepting.
+The state $q_3$ is the last (accepting) state, requiring 0 
+more characters.
 Depending on the suffix of the
 input string up to the current read location,
 the states $q_1$ and $q_2$, $q_3$
 may or may
-not be active, independent from each other.
+not be active.
 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
 contain at least $2^3$ non-equivalent states that cannot be merged, 
 because the subset construction during determinisation will generate
@@ -595,12 +604,12 @@
 would require at least $2^{n+1}$ states, if $r$ itself contains
 more than 1 string.
 This is to represent all different 
-scenarios which "countdown" states are active.}
+scenarios in which "countdown" states are active.}
 
 
-Bounded repetitions are very important because they
-tend to occur a lot in practical use,
-for example in the regex library RegExLib,
+Bounded repetitions are important because they
+tend to occur frequently in practical use,
+for example in the regex library RegExLib, in
 the rules library of Snort \cite{Snort1999}\footnote{
 Snort is a network intrusion detection (NID) tool
 for monitoring network traffic.
@@ -615,8 +624,8 @@
 XSDs they found on the Maven.org central repository
 have bounded regular expressions in them.
 Often the counters are quite large, with the largest being
-approximately up to ten million. 
-An example XSD they gave
+close to ten million. 
+A smaller sample XSD they gave
 is:
 \begin{verbatim}
 <sequence minOccurs="0" maxOccurs="65535">
@@ -624,7 +633,7 @@
  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
 </sequence>
 \end{verbatim}
-This can be seen as the expression 
+This can be seen as the regex
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions 
 satisfying certain constraints (such as 
@@ -638,7 +647,7 @@
 can be above 1000, and in the built-in Rust regular expression library
 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
 for being too big. 
-As Becchi and Crawley~\cite{Becchi08}  have pointed out,
+As Becchi and Crawley \cite{Becchi08}  have pointed out,
 the reason for these restrictions
 is that they simulate a non-deterministic finite
 automata (NFA) with a breadth-first search.
@@ -647,7 +656,7 @@
 When the counters are large, 
 the memory requirement could become
 infeasible, and a regex engine
-like Go will reject this pattern straight away.
+like in Go will reject this pattern straight away.
 \begin{figure}[H]
 \begin{center}
 \begin{tikzpicture} [node distance = 2cm, on grid, auto]
@@ -702,23 +711,26 @@
 with the ability to process
 gigabytes of strings input per second
 even with large counters \cite{Becchi08}.
-But formal reasoning about these automata especially in Isabelle 
-can be challenging
-and un-intuitive. 
-Therefore, we take correctness and runtime claims made about these solutions
-with a grain of salt.
+These practical solutions do not come with
+formal guarantees, and as pointed out by
+Kuklewicz \cite{KuklewiczHaskell}, can be error-prone.
+%But formal reasoning about these automata especially in Isabelle 
+%can be challenging
+%and un-intuitive. 
+%Therefore, we take correctness and runtime claims made about these solutions
+%with a grain of salt.
 
 In the work reported in \cite{FoSSaCS2023} and here, 
 we add better support using derivatives
-for bounded regular expressions $r^{\{n\}}$.
-The results
+for bounded regular expression $r^{\{n\}}$.
+Our results
 extend straightforwardly to
-repetitions with an interval such as 
+repetitions with intervals such as 
 $r^{\{n\ldots m\}}$.
 The merit of Brzozowski derivatives (more on this later)
 on this problem is that
 it can be naturally extended to support bounded repetitions.
-Moreover these extensions are still made up of only
+Moreover these extensions are still made up of only small
 inductive datatypes and recursive functions,
 making it handy to deal with them in theorem provers.
 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be
author	Chengsong
	Tue, 20 Dec 2022 22:32:54 +0000
changeset 635	7ce2389dff4b
parent 634	b079aaee5e10
child 636	0bcb4a7cb40c