comments till chap4
author Chengsong
Fri, 30 Dec 2022 01:52:32 +0000
changeset 638 dd9dde2d902b
parent 637 e3752aac8ec2
child 639 80cc6dc4c98b
comments till chap4
ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
ChengsongTanPhdThesis/Chapters/Cubic.tex
ChengsongTanPhdThesis/Chapters/Finite.tex
ChengsongTanPhdThesis/Chapters/Inj.tex
ChengsongTanPhdThesis/Chapters/Introduction.tex
ChengsongTanPhdThesis/main.tex
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Fri Dec 30 01:52:32 2022 +0000
@@ -30,14 +30,14 @@
 The first problem with this algorithm is that
 for the function $\inj$ to work properly
 we cannot destroy the structure of a regular expression,
-and therefore canno simplify aggressively.
+and therefore cannot simplify too aggressively.
 For example,
 \[
 	r + (r + a) \rightarrow r + a
 \]
-cannot be done because that would require
+cannot be applied because that would require
 breaking up the inner alternative.
-The $\lexer$ function therefore only enables
+The $\lexer$ function together with $\textit{simp}$ therefore only enables
 same-level de-duplications like
 \[
 	r + r \rightarrow r.
@@ -74,7 +74,7 @@
 
 
 \end{tikzpicture}
-\caption{First Derivative Taken}
+\caption{First derivative taken}
 \end{figure}
 
 
@@ -110,7 +110,7 @@
 	edge [] node {} (r2);
 
 \end{tikzpicture}
-\caption{Second Derivative Taken}
+\caption{Second derivative taken}
 \end{figure}
 \noindent
 As the number of derivative steps increases,
@@ -158,7 +158,7 @@
 \path   (5.03, 4.9)
 	edge [bend left, dashed] node {} (stack);
 \end{tikzpicture}
-\caption{More Derivatives Taken}
+\caption{More derivatives taken}
 \end{figure}
 
 
@@ -222,7 +222,7 @@
 \path (rn)
 	edge [dashed, bend left] node {} (stack);
 \end{tikzpicture}
-\caption{Before Injection Phase Starts}
+\caption{Before injection phase starts}
 \end{figure}
 
 
@@ -324,16 +324,16 @@
 into bitcodes.
 The bitcodes are then carried with the regular
 expression, and augmented or moved around
-as the lexing goes on.
+as the lexing proceeds.
 It becomes unnecessary
 to remember all the 
 intermediate expressions, but only the most recent one
 with this bit-carrying regular expression.
 Annotated regular expressions 
 are defined as the following 
-Isabelle datatype \footnote{ We use subscript notation to indicate
+Isabelle datatype:\footnote{We use subscript notation to indicate
 	that the bitcodes are auxiliary information that do not
-interfere with the structure of the regular expressions }:
+interfere with the structure of the regular expressions.}
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{a}$ & $::=$  & $\ZERO$\\
@@ -347,7 +347,7 @@
 \noindent
 where $bs$ stands for bit-codes, $a$  for $\mathbf{a}$nnotated regular
 expressions and $as$ for lists of annotated regular expressions.
-The alternative constructor, written, $\sum$, has been generalised to 
+The alternative constructor, written $\sum$, has been generalised to 
 take a list of annotated regular expressions rather than just two.
 Why is it generalised? This is because when we analyse nested 
 alternatives, there can be more than two elements at the same level
@@ -369,7 +369,7 @@
 $(_{Z}a+_{S}b) \backslash a$ gives us
 $_{Z}\ONE + \ZERO$, where the $Z$ bitcode will
 later be used to decode that a left branch was
-selected at the time when the part $a+b$ was anallysed by
+selected at the time when the part $a+b$ was analysed by
 the derivative operation.
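+
+As a reference point for the code sketches that follow, the plain and the
+annotated regular expressions can be rendered in Scala roughly as follows
+(this is only a sketch; the authoritative definitions are the Isabelle/HOL
+datatypes given above):
+\begin{lstlisting}[language=Scala]
+// Plain and bit-annotated regular expressions as Scala 3 enums.
+enum Rexp:
+  case ZERO, ONE
+  case CHAR(c: Char)
+  case ALT(r1: Rexp, r2: Rexp)
+  case SEQ(r1: Rexp, r2: Rexp)
+  case STAR(r: Rexp)
+
+enum Bit:
+  case Z, S
+type Bits = List[Bit]
+
+enum ARexp:
+  case AZERO                            // no bits needed for 0
+  case AONE(bs: Bits)
+  case ACHAR(bs: Bits, c: Char)
+  case AALTS(bs: Bits, as: List[ARexp]) // generalised to a list
+  case ASEQ(bs: Bits, a1: ARexp, a2: ARexp)
+  case ASTAR(bs: Bits, a: ARexp)
+
+import Rexp.*, ARexp.*, Bit.*  // constructors in scope for later sketches
+\end{lstlisting}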
 \subsection{A Bird's Eye View of the Bit-coded Lexer}
 Before we give the details of the functions and definitions 
@@ -453,7 +453,9 @@
 \caption{A bird's eye view of $\blexer$. The $_{bsi}a_{i}$s stand
 	for the annotated regular expressions in each derivative step.
 	The character used in each derivative step is written as $c_i$.
-The relative size increase reflect the size increase in each derivative step.}
+The size of the derivatives typically increases with each derivative step;
+we draw increasingly larger circles to indicate this.
+The process starts with an internalise phase and concludes with a decoding phase.}
 \end{figure}
 \noindent
 The plain regular expressions
@@ -476,9 +478,9 @@
 \begin{itemize}
 	\item
 
-		An absence of the second injection phase.
+		the absence of the second injection phase.
 	\item
-		One does not need to store each pair of the 
+		one does not need to store each pair of the 
 		intermediate regular expressions $_{bs_i}a_i$ and  $c_{i+1}$. 
 		The previous annotated regular expression $_{bs_i}a_i$'s information is passed
 		on without loss to its successor $_{bs_{i+1}}a_{i+1}$,
@@ -486,7 +488,7 @@
 		which will be used during the decode phase, where we use $r$ as
 	a source of information.}
 	\item
-		Finally, simplification works much better--one can maintain correctness
+		simplification works much more smoothly--one can maintain correctness
 		while applying quite aggressive simplifications.
 \end{itemize}
 %In the next section we will 
@@ -555,15 +557,14 @@
 	\end{tabular}
 \end{center}
 \noindent
-We also abbreviate the $\erase\; a$ operation
-as $a_\downarrow$, for conciseness.
+We abbreviate the $\erase\; a$ operation
+as $(a)_\downarrow$ for conciseness.
 
 Determining whether an annotated regular expression
-contains the empty string in its lauguage requires
+contains the empty string in its language requires
 a dedicated function $\bnullable$.
 In our formalisation this function is defined by simply calling $\erase$
 before $\nullable$.
-$\bnullable$ simply calls $\erase$ before $\nullable$.
 \begin{definition}
 		$\bnullable \; a \dn  \nullable \; (a_\downarrow)$
 \end{definition}
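+
+As a Scala sketch (reusing the datatypes sketched earlier and assuming the
+plain $\nullable$ function from the previous chapter), $\erase$ and
+$\bnullable$ might look as follows; note how the list-alternatives are
+folded back into binary alternatives:
+\begin{lstlisting}[language=Scala]
+// erase drops all bitcodes; a list of alternatives becomes a
+// nest of binary alternatives again.
+def erase(a: ARexp): Rexp = a match
+  case AZERO                => ZERO
+  case AONE(_)              => ONE
+  case ACHAR(_, c)          => CHAR(c)
+  case AALTS(_, Nil)        => ZERO
+  case AALTS(_, List(a1))   => erase(a1)
+  case AALTS(bs, a1 :: as)  => ALT(erase(a1), erase(AALTS(bs, as)))
+  case ASEQ(_, a1, a2)      => SEQ(erase(a1), erase(a2))
+  case ASTAR(_, a1)         => STAR(erase(a1))
+
+def bnullable(a: ARexp): Boolean = nullable(erase(a))
+\end{lstlisting}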
@@ -597,7 +598,7 @@
 In the case of alternative regular expressions, 
 it looks for the leftmost
 $\nullable$ branch
-to visit and ignores other siblings.
+to visit and ignores the other siblings.
 %Whenever there is some bitcodes attached to a
 %node, it returns the bitcodes concatenated with whatever
 %child recursive calls return.
@@ -606,7 +607,7 @@
 list it returns.
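+
+Following this description, a Scala sketch of $\bmkeps$ could look as
+follows (again reusing the earlier datatype sketch; the function is only
+ever called on $\bnullable$ expressions, hence the partial match):
+\begin{lstlisting}[language=Scala]
+// Collect this node's bits, then descend: into the leftmost
+// bnullable branch for alternatives, into both parts for
+// sequences; a star contributes its bits plus a final S that
+// marks the end of the iterations.
+def bmkeps(a: ARexp): Bits = a match
+  case AONE(bs)         => bs
+  case AALTS(bs, as)    => bs ++ bmkeps(as.find(bnullable).get)
+  case ASEQ(bs, a1, a2) => bs ++ bmkeps(a1) ++ bmkeps(a2)
+  case ASTAR(bs, _)     => bs ++ List(S)
+\end{lstlisting}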
 
 The bitcodes extracted by $\bmkeps$ need to be 
-$\decode$d (with the guidance of a plain regular expression):
+$\decode$d (with the guidance of a ``plain'' regular expression):
 %\begin{definition}[Bitdecoding of Values]\mbox{}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
@@ -653,7 +654,7 @@
 characters, and does not encode the ``boundary'' between two
 sequence values. Moreover, with only the bitcode we cannot even tell
 whether the $S$s and $Z$s are for $\Left/\Right$ or $\Stars$,
-but this swill not be necessary in our proof.
+but this will not be necessary in our correctness proof.
 \begin{center}
 \begin{tabular}{lcl}
   $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
@@ -681,7 +682,7 @@
 we obtain the property.
 \end{proof}
 \noindent
-Now we define the central part of Sulzmann and Lu's lexing algorithm,
+Now we define the central part of Sulzmann and Lu's second lexing algorithm,
 the $\bder$ function (which stands for \emph{b}itcoded-derivative):
 \begin{center}
   \begin{tabular}{@{}lcl@{}}
@@ -707,7 +708,8 @@
 where the bitcodes are augmented and carried around 
 when a derivative is taken.
 We give the intuition behind some of the more involved cases in 
-$\bder$. \\
+$\bder$.
+
 For example,
 in the \emph{star} case,
 a derivative of $_{bs}a^*$ means 
@@ -802,7 +804,7 @@
 and attach the collected bit-codes to the front of $a_2$
 before throwing away $a_1$. 
 In the injection-based lexer, $r_1$ is immediately thrown away in 
-in the $\textit{if}$ branch,
+the $\textit{if}$ branch,
 the information $r_1$ stores is therefore lost:
 \begin{center}
 	\begin{tabular}{lcl}
@@ -854,7 +856,7 @@
 If yes, then extract the bitcodes from the nullable expression,
 and decode the bitcodes into a lexical value.
 If there does not exist a match between $r$ and $s$, the lexer
-outputs $\None$ indicating a mismatch.\\
+outputs $\None$ indicating a mismatch.
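+
+Putting the pieces together, the following Scala sketch captures our
+reading of $\bder$ and $\blexer$. It is not a verbatim transcription of
+the formal definitions: \textit{fuse} (which attaches a list of bits to
+the front of an annotated regular expression), \textit{internalise},
+$\decode$ and a datatype \texttt{Val} of lexical values are assumed.
+\begin{lstlisting}[language=Scala]
+def bder(c: Char, a: ARexp): ARexp = a match
+  case AZERO | AONE(_) => AZERO
+  case ACHAR(bs, d)    => if c == d then AONE(bs) else AZERO
+  case AALTS(bs, as)   => AALTS(bs, as.map(bder(c, _)))
+  case ASEQ(bs, a1, a2) =>
+    // if a1 is nullable the derivative may also move into a2;
+    // a1's lexing information is saved via bmkeps before a1 is
+    // thrown away
+    if bnullable(a1) then
+      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
+                     fuse(bmkeps(a1), bder(c, a2))))
+    else ASEQ(bs, bder(c, a1), a2)
+  case ASTAR(bs, a1)   =>
+    // one iteration is unrolled; the Z records that a new
+    // iteration has started
+    ASEQ(bs, fuse(List(Z), bder(c, a1)), ASTAR(Nil, a1))
+
+def blexer(r: Rexp, s: List[Char]): Option[Val] =
+  val last = s.foldLeft(internalise(r))((a, c) => bder(c, a))
+  if bnullable(last) then Some(decode(bmkeps(last), r)) else None
+\end{lstlisting}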
 Ausaf and Urban formally proved the correctness of the $\blexer$, namely
 \begin{property}
 $\blexer \;r \; s = \lexer \; r \; s$.
@@ -866,7 +868,8 @@
 \subsection{An Example $\blexer$ Run}
 Before introducing the proof we shall first walk 
 through a concrete example of our $\blexer$ calculating the right 
-lexical information through bit-coded regular expressions.\\
+lexical information through bit-coded regular expressions.
+
 Consider the regular expression $(aa)^* \cdot (b+c)$ matching the string $aab$.
 We give again the bird's eye view of this particular example 
 in each stage of the algorithm:
@@ -927,13 +930,14 @@
 
 
 	\end{tikzpicture}
-	\caption{$\blexer \;\;\;\; (aa)^*(b+c) \;\;\;\; aab$}
+	\caption{$\blexer$ with the regular expression $(aa)^*(b+c)$ and the string $aab$}
 \end{figure}
 \noindent
 The one dashed arrow indicates that $_Z(\ONE \cdot (aa)^*)$
 turned into $ZS$ after the derivative w.r.t.\ $b$
-is taken, which calls $\bmkeps$ on the nuallable $_Z\ONE\cdot (aa)^*$
-before processing $_Zb+_Sc$.\\
+is taken, which calls $\bmkeps$ on the nullable $_Z\ONE\cdot (aa)^*$
+before processing $_Zb+_Sc$.
+
 The annotated regular expressions
 would look overwhelming if we explicitly indicate all the
 locations where bitcodes are attached.
@@ -944,7 +948,8 @@
 internalise.
 Therefore for readability we omit bitcodes if they are empty. 
 This applies to all annotated 
-regular expressions in this thesis.\\
+regular expressions in this thesis.
+
 %and assume we have just read the first character $a$:
 %\begin{center}
 %\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
@@ -1030,7 +1035,7 @@
 %         \nodepart{two} EOF};
 %\end{tikzpicture}
 %\end{center}
-\noindent
+
 In the next section we introduce the correctness proof
 found by Ausaf and Urban
 of the bitcoded lexer.
@@ -1067,7 +1072,7 @@
 \subsubsection{$\textit{Retrieve}$}
 The first function we shall introduce is $\retrieve$.
 Sulzmann and Lu gave its definition, and
-Ausaf and Urban found
+Ausaf and Urban found it
 useful in their mechanised proofs.
 Our bit-coded lexer ``retrieve''s the bitcodes using $\bmkeps$,
 after all the derivatives have been taken:
@@ -1125,7 +1130,8 @@
 For example, in the leftmost pair the 
 lexical information is condensed in 
 $v_0$, whereas for the rightmost pair,
-the lexing result hides is in the bitcodes of $a_n$.\\
+the lexing result is hidden in the bitcodes of $a_n$.
+
 The function $\bmkeps$ is able to extract from $a_n$ the result
 by looking for nullable parts of the regular expression.
 However for regular expressions $a_i$ in general,
@@ -1245,7 +1251,7 @@
 results $a_i$ and $a_{i+1}$ in $\blexer$.
 For plain regular expressions something similar is required.
 
-\subsection{$\flex$}
+\subsection{The Function $\flex$}
 Ausaf and Urban defined an auxiliary function called $\flex$ for $\lexer$,
 defined as
 \begin{center}
@@ -1255,7 +1261,7 @@
 \end{tabular}
 \end{center}
 which accumulates the characters that need to be injected back, 
-and does the injection in a stack-like manner (last taken derivative first injected).
+and does the injection in a stack-like manner (the last derivative taken is injected first).
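+
+A Scala sketch of $\flex$, as we read the definition above (assuming
+Scala versions of the derivative and $\inj$ from the previous chapter,
+and a datatype \texttt{Val} of lexical values):
+\begin{lstlisting}[language=Scala]
+// the pending injections are accumulated in f and applied in
+// reverse order of the derivatives once the string is exhausted
+def flex(r: Rexp, f: Val => Val, s: List[Char], v: Val): Val = s match
+  case Nil     => f(v)
+  case c :: cs => flex(der(c, r), u => f(inj(r, c, u)), cs, v)
+\end{lstlisting}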
 The function
 $\flex$ can calculate what $\lexer$ calculates, given the input regular expression
 $r$, the identity function $id$, the input string $s$ and the value
@@ -1291,7 +1297,7 @@
 %----------------------------------------------------------------------------------------
 \section{Correctness of the  Bit-coded Algorithm (Without Simplification)}
 We can first show that 
-$\flex$ and $\blexer$ calculates the same thing.
+$\flex$ and $\blexer$ calculate the same value.
 \begin{lemma}\label{flex_retrieve}
 	If $\vdash v: (r\backslash s)$, then\\
 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
--- a/ChengsongTanPhdThesis/Chapters/Cubic.tex	Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Cubic.tex	Fri Dec 30 01:52:32 2022 +0000
@@ -301,7 +301,8 @@
 
 
 \end{lstlisting}
-\caption{The helper functions of $\textit{prune}$ XXX.}
+\caption{The helper functions of $\textit{prune}$: 
+  $\textit{atMostEmpty}$, $\textit{isOne}$ and $\textit{removeSeqTail}$}
 \end{figure}
 \noindent
 Suppose we feed 
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Dec 30 01:52:32 2022 +0000
@@ -2075,8 +2075,10 @@
 \begin{proof}
 	By induction on $a$. $rs$ is set to take arbitrary values.
 \end{proof}
-It is also not surprising that $\textit{rsimp}$ will cover
-the effect of $\hflataux{\_}$:
+It is also not surprising that 
+two regular expressions differing only in the
+nesting of parentheses (for example $(r_1 + r_2) + r_3$ and
+$r_1 + (r_2 + r_3)$) are equivalent w.r.t. $\textit{rsimp}$:
 \begin{lemma}\label{cbsHfauRsimpeq1}
 	$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$
 \end{lemma}
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Fri Dec 30 01:52:32 2022 +0000
@@ -13,7 +13,7 @@
 This is essentially a description in ``English''
 of the functions and datatypes of our formalisation in Isabelle/HOL.
 We also define what $\POSIX$ lexing means, 
-followed by a lexing algorithm by Sulzmanna and Lu \parencite{Sulzmann2014} 
+followed by the first lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014} 
 that produces the output conforming
 to the $\POSIX$ standard\footnote{In what follows 
 we choose to use the Isabelle-style notation
@@ -109,7 +109,7 @@
 to test membership of a string in 
 a set of strings. 
 For example, to test whether the string
-$bar$ is contained in the set $\{foo, bar, brak\}$, one takes derivative of the set with
+$bar$ is contained in the set $\{foo, bar, brak\}$, one can take the derivative of the set with
 respect to the string $bar$:
 \begin{center}
 \begin{tabular}{lll}
@@ -122,8 +122,9 @@
 \end{center}
 \noindent
 and in the end, test whether the set
-has the empty string.\footnote{We use the infix notation $A\backslash c$
-	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
+contains the empty string.\footnote{We use the infix notation $A\backslash c$
+	instead of $\Der \; c \; A$ for brevity, as it will always be
+	clear from the context that we are operating
 on languages rather than regular expressions.}
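+
+The language derivative used in this example is a one-line function in,
+for instance, Scala (a sketch, with languages as \texttt{Set[String]}):
+\begin{lstlisting}[language=Scala]
+// Der c A = { s | c::s is in A }: keep the strings that start
+// with c and chop the c off.
+def Der(c: Char, A: Set[String]): Set[String] =
+  for s <- A if s.nonEmpty && s.head == c yield s.tail
+
+// Der('b', Set("foo", "bar", "brak")) == Set("ar", "rak")
+\end{lstlisting}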
 
 In general, if we have a language $S$,
@@ -204,7 +205,7 @@
 expression that matches only the empty string.\footnote{
 Some authors
 also use $\phi$ and $\epsilon$ for $\ZERO$ and $\ONE$
-but we prefer our notation.} 
+but we prefer this notation.} 
 The sequence regular expression is written $r_1\cdot r_2$
 and sometimes we omit the dot if it is clear which
 regular expression is meant; the alternative
@@ -227,7 +228,7 @@
 %Now with language derivatives of a language and regular expressions and
 %their language interpretations in place, we are ready to define derivatives on regular expressions.
 With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
-We do so by first introducing what properties it should satisfy.
+We do so by first introducing what properties they should satisfy.
 
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
 %Recall, the language derivative acts on a set of strings
@@ -291,14 +292,13 @@
 he calls the derivative of a regular expression $r$
 with respect to a character $c$, written
 $r \backslash c$. This infix operator
-takes an original regular expression $r$ as input
-and a character as a right operand and
-outputs a result, which is a new regular expression.
+takes a regular expression $r$ as input
+and a character as a right operand.
 The derivative operation on regular expressions
 is defined such that the language of the derivative result 
 coincides with the derivative of the language of the original 
 regular expression
-derivative with respect to the same character:
+with respect to the same character, namely
 \begin{property}
 
 \[
@@ -335,11 +335,11 @@
 \vspace{ 3mm }
 
 The derivatives on regular expression can again be 
-generalised to a string.
+generalised to strings.
 One could compute $r_{start} \backslash s$  and test membership of $s$
 in $L \; r_{start}$ by checking
 whether the empty string is in the language of 
-$r_{end}$ ($r_{start}\backslash s$).\\
+$r_{end}$ (that is, $r_{start}\backslash s$).\\
 
 \vspace{2mm}
 \Longstack{
@@ -378,18 +378,18 @@
 }
 
 
-
+We have the property that
 \begin{property}
 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
 \end{property}
 \noindent
 Next, we give the recursive definition of derivative on
 regular expressions so that it satisfies the properties above.
-The derivative function, written $r\backslash c$, 
-takes a regular expression $r$ and character $c$, and
-returns a new regular expression representing
-the original regular expression's language $L \; r$
-being taken the language derivative with respect to $c$.
+%The derivative function, written $r\backslash c$, 
+%takes a regular expression $r$ and character $c$, and
+%returns a new regular expression representing
+%the original regular expression's language $L \; r$
+%being taken the language derivative with respect to $c$.
 \begin{table}
 	\begin{center}
 \begin{tabular}{lcl}
@@ -418,7 +418,7 @@
 	\begin{tabular}{lcl}
 		$(r_1 \cdot r_2 ) \backslash c$ & $\dn$ & 
 		$\textit{if}\;\,([] \in L(r_1))\; 
-		\textit{then} \; r_1 \backslash c \cdot r_2 + r_2 \backslash c$ \\
+		\textit{then} \; (r_1 \backslash c) \cdot r_2 + r_2 \backslash c$ \\
 		& & $\textit{else} \; (r_1 \backslash c) \cdot r_2$
 	\end{tabular}
 \end{center}
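+
+For reference, the clauses above can be transcribed into Scala along the
+following lines (a sketch, assuming a \texttt{Rexp} datatype mirroring the
+grammar above and a function \texttt{nullable} that tests $[] \in L(r)$):
+\begin{lstlisting}[language=Scala]
+def der(c: Char, r: Rexp): Rexp = r match
+  case ZERO        => ZERO
+  case ONE         => ZERO
+  case CHAR(d)     => if c == d then ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if nullable(r1) then ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
+
+// the induced matcher: take derivatives w.r.t. every character,
+// then test nullability
+def matches(r: Rexp, s: List[Char]): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => der(c, r1)))
+\end{lstlisting}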
@@ -483,7 +483,7 @@
 is always nullable because it naturally
 contains the empty string.  
 \noindent
-We have the following correspondence between 
+We have the following two correspondences between 
 derivatives on regular expressions and
 derivatives on a set of strings:
 \begin{lemma}\label{derDer}
@@ -499,8 +499,8 @@
 	By induction on $r$.
 \end{proof}
 \noindent
-which is the main property of derivatives
-that enables us to reason about the correctness of
+which are the main properties of derivatives
+that enable us later to reason about the correctness of
 derivative-based matching.
 We can generalise the derivative operation shown above for single characters
 to strings as follows:
@@ -570,14 +570,14 @@
 \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
 Around two dozen characters
-would already ``explode'' the matcher with the regular expression 
+are already enough to make this algorithm ``explode'' with the regular expression 
 $(a^*)^*b$.
 To improve this situation, we need to introduce simplification
 rules for the intermediate results,
-such as $r + r \rightarrow r$,
+such as $r + r \rightarrow r$ or $\ONE \cdot r \rightarrow r$,
 and make sure those rules do not change the 
 language of the regular expression.
-One simpled-minded simplification function
+One simple-minded simplification function
 that achieves these requirements 
 is given below (see Ausaf et al. \cite{AusafDyckhoffUrban2016}):
 \begin{center}
@@ -733,7 +733,7 @@
 one of the values to be generated. $\POSIX$ is one of the
 disambiguation strategies that is widely adopted.
 
-Ausaf et al.\parencite{AusafDyckhoffUrban2016} 
+Ausaf et al. \cite{AusafDyckhoffUrban2016} 
 formalised the property 
 as a ternary relation.
 The $\POSIX$ value $v$ for a regular expression
@@ -843,7 +843,7 @@
 then a match with a keyword (if)
 followed by
 an identifier (foo) would be incorrect.
-POSIX lexing would generate what we want.
+POSIX lexing generates what is expected here, namely the longest initial match.
 
 \noindent
 We know that a POSIX 
@@ -915,7 +915,7 @@
 after the initial phase of successive derivatives.
 This second phase generates a POSIX value 
 if the regular expression matches the string.
-Two functions are involved: $\inj$ and $\mkeps$.
+The algorithm uses two functions called $\inj$ and $\mkeps$.
 The function $\mkeps$ constructs a POSIX value from the last
 derivative $r_n$:
 \begin{ceqn}
@@ -958,7 +958,7 @@
 The result of $\mkeps$ on a $\nullable$ $r$ 
 is a POSIX value for $r$ and the empty string:
 \begin{lemma}\label{mePosix}
-$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; v)$
+$\nullable\; r \implies (r, []) \rightarrow (\mkeps\; r)$
 \end{lemma}
 \begin{proof}
 	By induction on $r$.
@@ -1322,7 +1322,7 @@
 		   \Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [])
 		  ] 
 \]
-And $\derssimp \; aa \; (a^*a^*)^*$ would be
+And $\derssimp \; aa \; (a^*a^*)^*$ is
 \[
 	((a^*a^* + a^*)+a^*)\cdot(a^*a^*)^* + 
 	(a^*a^* + a^*)\cdot(a^*a^*)^*.
@@ -1352,7 +1352,7 @@
 At any moment, the  subterms in a regular expression
 that will potentially result in a POSIX value are only
 a minority among the many other terms,
-and one can remove ones that are not possible to 
+and one can remove the ones that are not possible to 
 be POSIX.
 In the above example,
 \begin{equation}\label{eqn:growth2}
@@ -1379,15 +1379,15 @@
 \[
 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
 \]
-is too hopeless to contribute a POSIX lexical value,
-and is therefore thrown away.
+does not contribute a POSIX lexical value,
+and can therefore be thrown away.
 
 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
 have come up with some simplification steps; however, those steps
-are not yet sufficiently strong, so they achieve the above effects.
+are not yet sufficiently strong to achieve the above effects.
 And even with these relatively mild simplifications, the proof
 is already quite a bit more complicated than that of theorem \ref{lexerCorrectness}.
-One would prove something like this: 
+One would need to prove something like this: 
 \[
 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
 	\textit{then}\;\; (r, c::s) \rightarrow 
@@ -1396,7 +1396,7 @@
 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
 not only has to return a simplified regular expression,
 but also what specific simplifications 
-has been done as a function on values
+have been done as a function on values
 showing how one can transform the value
 underlying the simplified regular expression
 to the unsimplified one.
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Dec 30 01:52:32 2022 +0000
@@ -217,29 +217,28 @@
 Given their usefulness and ubiquity, one would assume that
 modern regular expression matching implementations
 are mature and fully studied.
-Indeed, in a popular programming language's regular expression engine, 
-supplying it with regular expressions and strings,
-in most cases one can
+Indeed, in many popular programming languages' regular expression engines, 
+one can supply a regular expression and a string, and in most cases
 get the matching information in a very short time.
-Those engines can be blindingly fast--some 
+Those engines can sometimes be blindingly fast--some 
 network intrusion detection systems
 use regular expression engines that are able to process 
 hundreds of megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
-However, those engines can sometimes exhibit a surprising security vulnerability
+However, those engines can sometimes also exhibit a surprising security vulnerability
 under a certain class of inputs.
 %However, , this is not the case for $\mathbf{all}$ inputs.
 %TODO: get source for SNORT/BRO's regex matching engine/speed
 
 
-Consider for example $(a^*)^*\,b$ and 
-strings of the form $aa..a$, these strings cannot be matched by this regular
+Consider for example the regular expression $(a^*)^*\,b$ and 
+strings of the form $aa..a$. These strings cannot be matched by this regular
 expression: Obviously the expected $b$ in the last
 position is missing. One would assume that modern regular expression
 matching engines can find this out very quickly. Surprisingly, if one tries
 this example in JavaScript, Python or Java 8, even with small strings, 
 say of a length of around 30 $a$'s,
 the decision takes an absurd amount of time to finish (see graphs in figure \ref{fig:aStarStarb}).
-This is clearly exponential behaviour, and as can be seen
+These engines clearly show exponential behaviour which, as can be seen,
 is triggered by some relatively simple regular expressions.
 Java 9 and newer
 versions improve this behaviour somewhat, but are still slow compared 
@@ -248,7 +247,7 @@
 
 
 This superlinear blowup in regular expression engines
-has caused grief in ``real life'' in the past where it is 
+has caused grief in ``real life'' where it is 
 given the name ``catastrophic backtracking'' or ``evil'' regular expressions.
 For example, on 20 July 2016 one evil
 regular expression brought the webpage
@@ -423,9 +422,9 @@
 than 1000 evil regular expressions
 in Node.js, Python core libraries, npm and pypi. 
 They therefore concluded that evil regular expressions
-are real problems rather than just "a parlour trick".
+are a real problem rather than just "a parlour trick".
 
-This work aims to address this issue
+The work in this thesis aims to address this issue
 with the help of formal proofs.
 We describe a lexing algorithm based
 on Brzozowski derivatives with verified correctness 
@@ -627,12 +626,18 @@
 close to ten million. 
 A smaller sample XSD they gave
 is:
-\begin{verbatim}
+\lstset{
+	basicstyle=\fontsize{8.5}{9}\ttfamily,
+  language=XML,
+  morekeywords={encoding,
+    xs:schema,xs:element,xs:complexType,xs:sequence,xs:attribute}
+}
+\begin{lstlisting}
 <sequence minOccurs="0" maxOccurs="65535">
- <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
- <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+	<element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+	<element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
 </sequence>
-\end{verbatim}
+\end{lstlisting}
 This can be seen as the regex
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions 
@@ -787,7 +792,7 @@
 $r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
 by 1 to 4, and can be ``referred back'' by their respective numbers. 
 %These sub-expressions are called "capturing groups".
-To do so, we use the syntax $\backslash i$ 
+To do so, one uses the syntax $\backslash i$ 
 to denote that we want the sub-string 
 of the input matched by the i-th
 sub-expression to appear again, 
@@ -836,7 +841,7 @@
 which matches four-character palindromes
 like $abba$, $x??x$ and so on.
 
-Back-references is a regex construct 
+Back-references are a regex construct 
 that programmers find quite useful.
 According to Becchi and Crawley \cite{Becchi08},
 6\% of Snort rules (up until 2008) use them.
@@ -928,7 +933,8 @@
 disambiguate.
 The more widely used strategy is called POSIX,
 which roughly speaking always chooses the longest initial match.
-The POSIX strategy is widely adopted in many regular expression matchers.
+The POSIX strategy is widely adopted in many regular expression matchers
+because it is a natural strategy for lexers.
 However, many implementations (including the C libraries
 used by Linux and OS X distributions) contain bugs
 or do not meet the specification they claim to adhere to.
@@ -947,11 +953,11 @@
 $[0, 2), [2, 5)$
 respectively.
 But feeding this example to the different engines
-in regex101 \parencite{regex101} \footnote{
+listed at regex101 \footnote{
 	regex101 is an online regular expression matcher which
 	provides API for trying out regular expression
 	engines of multiple popular programming languages like
-Java, Python, Go, etc.}
+Java, Python, Go, etc.} \parencite{regex101}
 yields
 only two incomplete matches: $[aba]$ at $[0, 3)$
 and $a$ at $[4, 5)$.
@@ -959,8 +965,8 @@
 commented that most regex libraries are not
 correctly implementing the central POSIX
 rule, called the maximum munch rule.
-Grathwohl \parencite{grathwohl2014crash} wrote,
-\begin{quote}
+Grathwohl \parencite{grathwohl2014crash} wrote:
+\begin{quote}\it
 	``The POSIX strategy is more complicated than the 
 	greedy because of the dependence on information about 
 	the length of matched strings in the various subexpressions.''
@@ -968,7 +974,7 @@
 %\noindent
 We think the implementation complexity of POSIX rules also comes from
 the specification not being very precise.
-The developers of the regexp package of GO 
+The developers of the regexp package of Go 
 \footnote{\url{https://pkg.go.dev/regexp\#pkg-overview}}
 commented that
 \begin{quote}\it
@@ -1008,13 +1014,16 @@
 are the first to
 give a quite simple formalised POSIX
 specification in Isabelle/HOL, and also prove 
-that their specification coincides with the 
+that their specification coincides with an earlier (unformalised) 
 POSIX specification given by Okui and Suzuki \cite{Okui10}.
 They then formally proved the correctness of
 a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
 with regards to that specification.
 They also found that the informal POSIX
-specification by Sulzmann and Lu needs to be revised for the correctness proof.
+specification by Sulzmann and Lu needs to be substantially 
+revised in order for the correctness proof to go through.
+Their original specification and proof were unfixable
+according to Ausaf et al.
 
 In the next section, we will briefly
 introduce Brzozowski derivatives and Sulzmann
@@ -1030,7 +1039,7 @@
 The derivative of a regular expression $r$
 with respect to a character $c$, is written as $r \backslash c$.
 This operation tells us what $r$ transforms into
-if we ``chop'' off the first character $c$ 
+if we ``chop'' off a particular character $c$ 
 from all strings in the language of $r$ (defined
 later as $L \; r$).
 %To give a flavour of Brzozowski derivatives, we present
@@ -1060,14 +1069,14 @@
 %one simply takes derivatives of $r$ successively with
 %respect to the characters (in the correct order) in $s$,
 %and then test whether the empty string is in the last regular expression.
-With this property, derivatives give a simple solution
+With this property, derivatives can give a simple solution
 to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
 succession) all the characters of the string matches the empty string,
 then $r$ matches $s$ (and {\em vice versa}).  
 %This makes formally reasoning about these properties such
 %as correctness and complexity smooth and intuitive.
-There had been several mechanised proofs about this property in various theorem
+There are several mechanised proofs of this property in various theorem
 provers,
 for example one by Owens and Slind \cite{Owens2008} in HOL4,
 another one by Krauss and Nipkow \cite{Nipkow98} in Isabelle/HOL, and
@@ -1079,7 +1088,7 @@
 \begin{center}
 	\begin{tabular}{lcl}
 		$r^{\{n\}} \backslash c$     & $\dn$ & $(r \backslash c) \cdot
-		r^{\{n-1\}}$\\
+		r^{\{n-1\}} \quad (\text{when } n > 0)$\\
 	\end{tabular}
 \end{center}
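+For example, with this clause one derivative step of $a^{\{3\}}$ gives
+\[
+	a^{\{3\}} \backslash a \;=\; (a \backslash a) \cdot a^{\{2\}}
+	\;=\; \ONE \cdot a^{\{2\}},
+\]
+so each step merely decrements the counter instead of unfolding
+the repetition $n$ times.
+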
 \noindent
@@ -1089,8 +1098,8 @@
 large counters
 quite well.
 
-There has also been 
-extensions to other constructs.
+There have also been 
+extensions of derivatives to other regex constructs.
 For example, Owens et al include the derivatives
 for the \emph{NOT} regular expression, which is
 able to concisely express C-style comments of the form
@@ -1106,11 +1115,11 @@
 
 Given the above definitions and properties of
 Brzozowski derivatives, one quickly realises their potential
-in generating a formally verified algorithm for lexing--the clauses and property
+in generating a formally verified algorithm for lexing: the clauses and property
 can be easily expressed in a functional programming language 
 or converted to theorem prover
-code, with great extensibility.
-Perhaps this is the reason why it has sparked quite a bit of interest
+code, with great ease.
+Perhaps this is the reason why derivatives have sparked quite a bit of interest
 in the functional programming and theorem prover communities in the last
 fifteen or so years (
 \cite{Almeidaetal10}, \cite{Berglund14}, \cite{Berglund18},
@@ -1125,7 +1134,8 @@
 little information in the context of lexing where separate tokens must
 be identified and also classified (for example as keywords
 or identifiers). 
-Second, derivative-based matchers need to be more efficient.
+Second, derivative-based matchers need to be more efficient in terms
+of the sizes of derivatives.
 Elegant and beautiful
 as many implementations are,
 they can be excruciatingly slow. 
@@ -1146,7 +1156,7 @@
 \noindent
 They concluded that the algorithm is quadratic in terms of 
 the length of the input string.
-When we tried out their extracted OCaml code with our example $(a+aa)^*$,
+When we tried out their extracted OCaml code with the example $(a+aa)^*$,
 the time it took to match a string of 40 $a$'s was approximately 5 minutes.
 
 
@@ -1158,16 +1168,17 @@
 \emph{how} a regular expression matches a string following the POSIX
 rules for regular expression matching. They achieve this by adding a
 second ``phase'' to Brzozowski's algorithm involving an injection
-function.
+function. This injection function in a sense undoes the ``damage''
+of the derivatives chopping off characters.
 In earlier work, Ausaf et al provided the formal
 specification of what POSIX matching means and proved in Isabelle/HOL
 the correctness
 of this extended algorithm accordingly
 \cite{AusafDyckhoffUrban2016}.
 
-The version of the algorithm proven correct 
-suffers heavily from a 
-second difficulty, where the internal derivatives can
+The version of the algorithm proven correct
+suffers however heavily from a 
+second difficulty, where derivatives can
 grow to arbitrarily big sizes. 
 For example if we start with the
 regular expression $(a+aa)^*$ and take
@@ -1185,7 +1196,7 @@
 \end{tabular}
 \end{center}
  
-\noindent where after around 35 steps we run out of memory on a
+\noindent where after around 35 steps we usually run out of memory on a
 typical computer.  Clearly, the
 notation involving $\ZERO$s and $\ONE$s already suggests
 simplification rules that can be applied to regular
@@ -1242,6 +1253,7 @@
 There is also some newer work called
 Verbatim++~\cite{Verbatimpp}, which does not use derivatives, 
 but a deterministic finite automaton instead.
+We will also study this work in a later section.
 %An example that gives problem to automaton approaches would be
 %the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
 %It requires at least $2^{n+1}$ states to represent
@@ -1256,8 +1268,8 @@
 backtracking and error-prone matchers: a formally verified
 regular expression lexing algorithm
 that is both fast
-and correct by extending Ausaf et al.'s work.
-The end result is %a regular expression lexing algorithm that comes with 
+and correct. 
+The result is %a regular expression lexing algorithm that comes with 
 \begin{itemize}
 \item
 an improved version of  Sulzmann and Lu's bit-coded algorithm using 
@@ -1268,10 +1280,10 @@
 \item 
 a complexity-related property for that algorithm saying that the 
 internal data structure will
-remain finite,
+remain below a finite bound,
 \item
-and extension to
-the bounded repetitions construct with the correctness and finiteness property
+and an extension to
+the bounded repetition constructs with the correctness and finiteness property
 maintained.
 \end{itemize}
 \noindent
@@ -1285,34 +1297,33 @@
 Further improvements to the algorithm with an even stronger version of 
 simplification can be made. We conjecture that the resulting size of derivatives
 can be bounded by a cubic bound w.r.t. the size of the regular expression.
-
-
-
-
+We will give relevant code in Scala, 
+but do not give a formal proof for that in Isabelle/HOL.
+This is still future work.
 
 
 \section{Structure of the thesis}
 In chapter \ref{Inj} we will introduce the concepts
 and notations we 
 use for describing regular expressions and derivatives,
-and the first version of their lexing algorithm without bitcodes (including 
+and the first version of Sulzmann and Lu's lexing algorithm without bitcodes (including 
 its correctness proof).
 We will give their second lexing algorithm with bitcodes in chapter \ref{Bitcoded1}
 together with the correctness proof by Ausaf and Urban.
 Then we illustrate in chapter \ref{Bitcoded2}
 how Sulzmann and Lu's
-simplifications fail to simplify. We therefore introduce our version of the
-algorithm with simplification and 
-its correctness proof .  
+simplifications fail to simplify correctly. We therefore introduce our own version of the
+algorithm with correct simplifications and 
+its correctness proof.
 In chapter \ref{Finite} we give the second guarantee
 of our bitcoded algorithm, that is a finite bound on the size of any 
 regular expression's derivatives. 
 We also show how one can extend the
 algorithm to include bounded repetitions. 
 In chapter \ref{Cubic} we discuss stronger simplification rules which 
-improve the finite bound to a cubic bound.%and the NOT regular expression.
+improve the finite bound to a cubic bound. %and the NOT regular expression.
 Chapter \ref{RelatedWork} introduces relevant work for this thesis.
-Chapter \ref{Future} concludes with avenues of future research.
+Chapter \ref{Future} concludes and mentions avenues of future research.
  
 
 
--- a/ChengsongTanPhdThesis/main.tex	Sat Dec 24 11:55:04 2022 +0000
+++ b/ChengsongTanPhdThesis/main.tex	Fri Dec 30 01:52:32 2022 +0000
@@ -296,12 +296,17 @@
 are often treated as a constant factor.
 However with some regular expressions and inputs, existing implementations 
 often suffer from non-linear or even exponential running time, 
-giving to for example ReDoS (regular expression denial-of-service ) attacks. 
-To avoid these attacks, lexers with formalised correctness and running time related
+giving rise to ReDoS (regular expression denial-of-service) attacks. 
+To avoid these attacks, regular expression matchers and lexers with 
+formalised correctness and running time related
 properties are of interest because the guarantees apply to all inputs, not just a finite
 number of empirical test cases.
 
-Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm.
+Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. 
+Our simplification function is more aggressive than the one by Sulzmann and Lu.
+We also fix a problem in Sulzmann and Lu's simplification concerning their use of
+the $\textit{nub}$ function, which does not remove non-trivial duplicates.
+Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm.
 In this thesis we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our algorithm is correct and generates unique POSIX values; we also (ii) establish a finite bound for the size of the derivatives for every input string; we also
 (iii) give a program and a conjecture that the finite 
 bound can be improved to be cubic if stronger simplification rules are applied. 
@@ -329,9 +334,10 @@
 I want to thank Doctor Kathrin Stark, my SIGPLAN mentor, for offering brilliant advice
 at the late stage of my PhD. My transition from a PhD student to a postdoc researcher
 could not have been so smooth without Kathrin's mentoring.
-I want to thank Jeanna Wheeler, my UMO mentor, for helping me regulate my mental health
-and productivity, by being always encouraging
-and compassionate in her sessions.
+%I want to thank Jeanna Wheeler, my UMO mentor, for helping me regulate my mental health
+%and productivity, by being always encouraging
+%and compassionate in her sessions.
+I want to thank Jeanna Wheeler for helping me stay sane during my PhD and the COVID years, when an encouraging and compassionate person was much appreciated.
 
 I want to thank my father Haiyan Tan and my mother Yunan Cheng,
 for their unconditional love, and who I have not seen