--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Oct 04 00:25:09 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Oct 11 13:09:47 2022 +0100
@@ -201,11 +201,9 @@
be it in text-editors \parencite{atomEditor} with syntax highlighting and auto-completion;
command-line tools like $\mathit{grep}$ that facilitate easy
text-processing; network intrusion
-detection systems that reject suspicious traffic; or compiler
-front ends--the majority of the solutions to these tasks
-involve lexing with regular
-expressions.
-Given its usefulness and ubiquity, one would imagine that
+detection systems that inspect suspicious traffic; or compiler
+front ends.
+Given the usefulness and ubiquity of regular expressions, one would assume that
modern regular expression matching implementations
are mature and fully studied.
Indeed, in a popular programming language's regex engine,
@@ -222,26 +220,25 @@
%TODO: get source for SNORT/BRO's regex matching engine/speed
-Take $(a^*)^*\,b$ and ask whether
-strings of the form $aa..a$ match this regular
+Consider $(a^*)^*\,b$ and ask whether
+strings of the form $aa..a$ can be matched by this regular
expression. Obviously this is not the case---the expected $b$ in the last
position is missing. One would expect that modern regular expression
-matching engines can find this out very quickly. Alas, if one tries
-this example in JavaScript, Python or Java 8, even with strings of a small
-length, say around 30 $a$'s,
-the decision takes crazy time to finish (graph \ref{fig:aStarStarb}).
+matching engines can find this out very quickly. Surprisingly, if one tries
+this example in JavaScript, Python or Java 8, even with small strings,
+say of length around 30 $a$'s,
+the decision takes an absurdly long time to finish (see graphs in figure \ref{fig:aStarStarb}).
This is clearly exponential behaviour, and
-is triggered by some relatively simple regex patterns.
+is triggered by some relatively simple regular expressions.
Java 9 and newer
-versions improves this behaviour, but is still slow compared
-with the approach we are going to use.
-
+versions improve this behaviour somewhat, but are still slow compared
+with the approach we are going to use in this thesis.
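The effect is easy to reproduce with Python's built-in re module; the following is a quick, machine-dependent sketch (the timings are illustrative only):

```python
import re
import time

# Backtracking engines try exponentially many ways of splitting the
# a's between the two stars of (a*)* before giving up.
pattern = re.compile(r'(a*)*b')

for n in [14, 16, 18, 20]:
    s = 'a' * n
    start = time.perf_counter()
    result = pattern.match(s)
    elapsed = time.perf_counter() - start
    assert result is None   # the required final 'b' is missing
    print(f'n={n:2d}: {elapsed:.4f}s')
# Each increment of n roughly doubles the time; around n = 30 the
# attempt takes minutes on typical hardware.
```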
This superlinear blowup in regular expression engines
-had repeatedly caused grief in real life that they
-get a name for them--``catastrophic backtracking''.
+has repeatedly caused grief in ``real life'' and has been given
+the name ``catastrophic backtracking''; the culprits are often called ``evil'' regular expressions.
For example, on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
@@ -257,7 +254,7 @@
requests. This made the whole site become unavailable.
\begin{figure}[p]
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
@@ -298,28 +295,7 @@
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\end{axis}
-\end{tikzpicture}
- &
-\begin{tikzpicture}
-\begin{axis}[
- xlabel={$n$},
- x label style={at={(1.05,-0.05)}},
- %ylabel={time in secs},
- enlargelimits=false,
- xtick={0,5,...,30},
- xmax=33,
- ymax=35,
- ytick={0,5,...,30},
- scaled ticks=false,
- axis lines=left,
- width=5cm,
- height=4cm,
- legend entries={Java 8},
- legend pos=north west,
- legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
-\end{axis}
-\end{tikzpicture}\\
+\end{tikzpicture}\\
\begin{tikzpicture}
\begin{axis}[
xlabel={$n$},
@@ -334,10 +310,10 @@
axis lines=left,
width=5cm,
height=4cm,
- legend entries={Dart},
+ legend entries={Java 8},
legend pos=north west,
legend cell align=left]
-\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
@@ -355,128 +331,171 @@
axis lines=left,
width=5cm,
height=4cm,
+ legend entries={Dart},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
+\end{axis}
+\end{tikzpicture}\\
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
legend entries={Swift},
legend pos=north west,
legend cell align=left]
\addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
- & \\
-\multicolumn{3}{c}{Graphs}
+ &
+\begin{tikzpicture}
+\begin{axis}[
+ xlabel={$n$},
+ x label style={at={(1.05,-0.05)}},
+ %ylabel={time in secs},
+ enlargelimits=true,
+ %xtick={0,5000,...,40000},
+ %xmax=40000,
+ %ymax=35,
+ restrict x to domain*=0:40000,
+ restrict y to domain*=0:35,
+ %ytick={0,5,...,30},
+ %scaled ticks=false,
+ axis lines=left,
+ width=5cm,
+ height=4cm,
+  legend entries={Java 9+},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{2}{c}{Graphs}
\end{tabular}
\caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings
of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
- The reason for their superlinear behaviour is that they do a depth-first-search.
- If the string does not match, the engine starts to explore all possibilities.
+	The reason for their superlinear behaviour is that they do a depth-first search
+ using NFAs.
+ If the string does not match, the regular expression matching
+ engine starts to explore all possibilities.
}\label{fig:aStarStarb}
\end{figure}\afterpage{\clearpage}
A more recent example is a global outage of all Cloudflare servers on 2 July
-2019. A poorly written regular expression exhibited exponential
-behaviour and exhausted CPUs that serve HTTP traffic. Although the outage
+2019. A poorly written regular expression exhibited catastrophic backtracking
+and exhausted CPUs that serve HTTP traffic. Although the outage
had several causes, at the heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)}
These problems with regular expressions
are not isolated events that happen
very occasionally, but actually widespread.
-They occur so often that they get a
-name--Regular-Expression-Denial-Of-Service (ReDoS)
+They occur so often that they have a
+name: Regular Expression Denial of Service (ReDoS)
attack.
\citeauthor{Davis18} detected more
-than 1000 super-linear (SL) regular expressions
-in Node.js, Python core libraries, and npm and pypi.
+than 1000 evil regular expressions
+in Node.js, Python core libraries, npm and pypi.
They therefore concluded that evil regular expressions
-are problems "more than a parlour trick", but one that
-requires
-more research attention.
+are real problems rather than ``a parlour trick''.
This work aims to address this issue
with the help of formal proofs.
-We offer a lexing algorithm based
-on Brzozowski derivatives with certified correctness (in
+We describe a lexing algorithm based
+on Brzozowski derivatives with verified correctness (in
Isabelle/HOL)
-and finiteness property.
-Such properties guarantee the absence of
-catastrophic backtracking in most cases.
+and a finiteness property.
+Such properties %guarantee the absence of
+are an important step in preventing
+catastrophic backtracking once and for all.
We will give more details in the next sections
on (i) why the slow cases in graph \ref{fig:aStarStarb}
-can occur
+can occur in traditional regular expression engines
and (ii) why we choose our
-approach (Brzozowski derivatives and formal proofs).
+approach based on Brzozowski derivatives and formal proofs.
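To give a first taste of the approach, a Brzozowski-derivative matcher fits in a few dozen lines. The following is an illustrative Python rendering of our own (with a naive simplification step); the verified algorithm is formalised in Isabelle/HOL and is considerably more refined:

```python
# Regular expressions as tagged tuples:
# ('zero',) ('one',) ('chr', c) ('alt', r1, r2) ('seq', r1, r2) ('star', r)
ZERO, ONE = ('zero',), ('one',)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ('one', 'star'): return True
    if tag in ('zero', 'chr'): return False
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])   # seq

def der(c, r):
    """Brzozowski derivative of r with respect to character c."""
    tag = r[0]
    if tag in ('zero', 'one'):
        return ZERO
    if tag == 'chr':
        return ONE if r[1] == c else ZERO
    if tag == 'alt':
        return ('alt', der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        left = ('seq', der(c, r[1]), r[2])
        return ('alt', left, der(c, r[2])) if nullable(r[1]) else left
    return ('seq', der(c, r[1]), r)            # star

def simp(r):
    """Naive simplification that keeps derivatives small."""
    tag = r[0]
    if tag == 'alt':
        r1, r2 = simp(r[1]), simp(r[2])
        if r1 == ZERO: return r2
        if r2 == ZERO or r1 == r2: return r1
        return ('alt', r1, r2)
    if tag == 'seq':
        r1, r2 = simp(r[1]), simp(r[2])
        if ZERO in (r1, r2): return ZERO
        if r1 == ONE: return r2
        if r2 == ONE: return r1
        return ('seq', r1, r2)
    return r

def matches(r, s):
    for c in s:
        r = simp(der(c, r))
    return nullable(r)

# (a*)*b: well-behaved here, where backtracking engines blow up.
astarstarb = ('seq', ('star', ('star', ('chr', 'a'))), ('chr', 'b'))
print(matches(astarstarb, 'aaab'))    # True
print(matches(astarstarb, 'a' * 30))  # False
```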
-\section{Regex, and the Problems with Regex Matchers}
+\section{Preliminaries}%Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
-Theoretical results in automata theory say
+Theoretical results in automata theory state
that basic regular expression matching should be linear
w.r.t the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
-deterministic finite automata (DFA) before matching,
-which could be exponential\cite{Sakarovitch2009}.
+deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
-below, involving only characters, alternatives,
+below, involving only regular expressions for characters, alternatives,
sequences, and Kleene stars:
\[
- r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
+ r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
Modern regular expression matchers used by programmers,
however,
-support richer constructs such as bounded repetitions
+support much richer constructs, such as bounded repetitions
and back-references.
-To differentiate, people use the word \emph{regex} to refer
+To differentiate, we use the word \emph{regex} to refer
to those expressions with richer constructs while reserving the
term \emph{regular expression}
for the more traditional meaning in formal languages theory.
We follow this convention
in this thesis.
In the future, we aim to support all the popular features of regexes,
-but for this work we mainly look at regular expressions.
+but for this work we mainly look at basic regular expressions
+and bounded repetitions.
%Most modern regex libraries
%the so-called PCRE standard (Peral Compatible Regular Expressions)
%has the back-references
-Regexes come with a lot of constructs
-beyond the basic ones
+Regexes come with a number of constructs
that make it more convenient for
programmers to write regular expressions.
-Depending on the types of these constructs
+Depending on the types of constructs
the task of matching and lexing with them
-will have different levels of complexity increase.
-Some of those constructs are syntactic sugars that are
+will have different levels of complexity.
+Some of those constructs are just syntactic sugar, that is,
simply short hand notations
that save the programmers a few keystrokes.
-These will not cause trouble for regex libraries.
-
-\noindent
+These will not cause problems for regex libraries.
For example the
-non-binary alternative involving three or more choices:
+non-binary alternative involving three or more choices just means:
\[
(a | b | c) \stackrel{means}{=} ((a + b)+ c)
\]
-the range operator $-$ used to express the alternative
-of all characters between its operands in a concise way:
+Similarly, the range operator $-$, used to express the alternative
+of all characters between its operands, is just a concise notation:
\[
[0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)}
\]
-and the
-wildcard character $.$ used to refer to any single character:
+for an alternative. The
+wildcard character $.$ is used to refer to any single character,
\[
. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]
+except the newline.
-\noindent
\subsection{Bounded Repetitions}
-Some of those constructs do make the expressions much
+More interesting are bounded repetitions, which can
+make the regular expressions much
more compact.
-For example, the bounded regular expressions
-(where $n$ and $m$ are constant natural numbers)
-$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
-defined as
+They come in four variants:
+$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$
+(where $n$ and $m$ are constant natural numbers).
+As with the Kleene star, the set of strings, or language,
+a bounded regular expression can match
+is defined using the power operation on sets:
\begin{center}
\begin{tabular}{lcl}
$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
@@ -485,9 +504,9 @@
$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
\end{tabular}
\end{center}
-are exponentially smaller compared with
-their unfolded form: for example $r^{\{n\}}$
-as opposed to
+The attraction of bounded repetitions is that they can be
+used to avoid a blow up: for example $r^{\{n\}}$
+is a shorthand for
\[
\underbrace{r\ldots r}_\text{n copies of r}.
\]
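The power operation used in the definitions above is easy to make concrete; here is a small Python sketch in which finite sets of strings stand in for languages (the function names are ours):

```python
def lang_concat(L1, L2):
    """Concatenation of two languages: all pairwise concatenations."""
    return {s1 + s2 for s1 in L1 for s2 in L2}

def lang_pow(L, n):
    """(L)^n: n-fold concatenation, with L^0 = {''}."""
    result = {''}
    for _ in range(n):
        result = lang_concat(result, L)
    return result

def lang_bounded(L, n, m):
    """L(r^{n..m}): the union of L^i for n <= i <= m."""
    out = set()
    for i in range(n, m + 1):
        out |= lang_pow(L, i)
    return out

print(sorted(lang_pow({'a', 'b'}, 2)))    # ['aa', 'ab', 'ba', 'bb']
print(sorted(lang_bounded({'a'}, 1, 3)))  # ['a', 'aa', 'aaa']
```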
@@ -495,8 +514,10 @@
%them into their desugared forms
%will suffer from at least an exponential runtime increase.
-The problem here is that tools based on the classic notion of
-automata need to expand $r^{n}$ into $n$ connected
+
+The problem with matching
+is that tools based on the classic notion of
+automata need to expand $r^{\{n\}}$ into $n$ connected
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
@@ -512,11 +533,11 @@
When given the above countdown regular expression,
a small $n$ (a few dozen) would result in a
determinised automata
-with millions of states.}) under large counters.
-A classic example is the regular expression $(a+b)^* a (a+b)^{n}$
+with millions of states.}) for large counters.
+A classic example of this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
-an $\mathit{NFA}$ describing it would look like:
+the corresponding $\mathit{NFA}$ looks like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
\node[state,initial] (q_0) {$q_0$};
@@ -530,8 +551,8 @@
(q_2) edge node {a,b} (q_3);
\end{tikzpicture}
\end{center}
-which requires at least $2^3$ states
-for its subset construction.\footnote{The
+when turned into a DFA by the subset construction
+requires at least $2^3$ states.\footnote{The
red states are "countdown states" which counts down
the number of characters needed in addition to the current
string to make a successful match.
@@ -560,23 +581,10 @@
This is to represent all different
scenarios which "countdown" states are active.}
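The state count for the $n=2$ instance can be checked mechanically with a small subset-construction sketch in Python (the NFA encoding follows the diagram above; the state numbering is ours):

```python
# NFA for (a+b)* a (a+b)^2, following the diagram above:
# q0 loops on a and b, an a may additionally advance to q1,
# and two further characters reach the accepting state q3.
nfa = {
    (0, 'a'): {0, 1}, (0, 'b'): {0},
    (1, 'a'): {2},    (1, 'b'): {2},
    (2, 'a'): {3},    (2, 'b'): {3},
    (3, 'a'): set(),  (3, 'b'): set(),
}

def subset_construction(start):
    """Return all subset states reachable from {start}."""
    init = frozenset({start})
    seen, todo = {init}, [init]
    while todo:
        s = todo.pop()
        for c in 'ab':
            t = frozenset(q2 for q in s for q2 in nfa[(q, c)])
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

print(len(subset_construction(0)))  # 8 = 2^{2+1} reachable DFA states
```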
-One of the most recent work in the context of lexing
-%with this issue
-is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
-This is relevant work and we will compare later on
-our derivative-based matcher we are going to present.
-There is also some newer work called
-Verbatim++\cite{Verbatimpp}, which does not use derivatives,
-but deterministic finite automaton instead.
-%An example that gives problem to automaton approaches would be
-%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
-%It requires at least $2^{n+1}$ states to represent
-%as a DFA.
-
Bounded repetitions are very important because they
-tend to occur a lot in practical use.
-For example in the regex library RegExLib,
+tend to occur a lot in practical use,
+for example in the regex library RegExLib,
the rules library of Snort \cite{Snort1999}\footnote{
Snort is a network intrusion detection (NID) tool
for monitoring network traffic.
@@ -588,22 +596,23 @@
as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
more than half of the
-XSDs they found have bounded regular expressions in them.
-Often the counters are quite large, the largest up to ten million.
+XSDs they found on the Maven.org central repository
+have bounded regular expressions in them.
+Often the counters are quite large, with the largest being
+around ten million.
An example XSD they gave
-was:
-%\begin{verbatim}
-%<sequence minOccurs="0" maxOccurs="65535">
-% <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
-% <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-%</sequence>
-%\end{verbatim}
+is:
+\begin{verbatim}
+<sequence minOccurs="0" maxOccurs="65535">
+ <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+ <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+</sequence>
+\end{verbatim}
This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
satisfying certain constraints (such as
satisfying the floating point number format).
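Abstracting the two element types to single characters $a$ and $b$, the XSD constraint corresponds to a bounded-repetition regex that Python's re module happens to accept; a hypothetical sketch (ignoring the inner floating-point format):

```python
import re

# a stands for a TimeIncr element, b for a MotionParams element:
# (a b^{2..12})^{0..65535}
pattern = re.compile(r'(?:ab{2,12}){0,65535}')

print(bool(pattern.fullmatch('')))                      # True: zero occurrences allowed
print(bool(pattern.fullmatch('abb' + 'a' + 'b' * 12)))  # True: 2 and 12 b's
print(bool(pattern.fullmatch('ab')))                    # False: too few b's
```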
-
It is therefore quite unsatisfying that
some regular expressions matching libraries
impose adhoc limits
@@ -615,21 +624,22 @@
for being too big.
As Becchi and Crawley\cite{Becchi08} have pointed out,
the reason for these restrictions
-are that they simulate a non-deterministic finite
+is that they simulate a non-deterministic finite
automata (NFA) with a breadth-first search.
This way the number of active states could
be equal to the counter number.
When the counters are large,
the memory requirement could become
-infeasible, and the pattern needs to be rejected straight away.
+infeasible, and a regex engine
+like the one in Go will reject such a pattern straight away.
\begin{figure}[H]
\begin{center}
\begin{tikzpicture} [node distance = 2cm, on grid, auto]
\node (q0) [state, initial] {$0$};
\node (q1) [state, right = of q0] {$1$};
- \node (q2) [state, right = of q1] {$2$};
- \node (qdots) [right = of q2] {$\ldots$};
+ %\node (q2) [state, right = of q1] {$2$};
+ \node (qdots) [right = of q1] {$\ldots$};
\node (qn) [state, right = of qdots] {$n$};
\node (qn1) [state, right = of qn] {$n+1$};
\node (qn2) [state, right = of qn1] {$n+2$};
@@ -638,8 +648,8 @@
\path [-stealth, thick]
(q0) edge [loop above] node {a} ()
(q0) edge node {a} (q1)
- (q1) edge node {.} (q2)
- (q2) edge node {.} (qdots)
+ %(q1) edge node {.} (q2)
+ (q1) edge node {.} (qdots)
(qdots) edge node {.} (qn)
(qn) edge node {.} (qn1)
(qn1) edge node {b} (qn2)
@@ -672,14 +682,15 @@
These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
-These solutions can be quite effective,
+These solutions can be quite efficient,
with the ability to process
-gigabytes of string input per second
+gigabytes of string input per second
even with large counters \cite{Becchi08}.
-But formally reasoning about these automata can be challenging
-and un-intuitive.
-Therefore, correctness and runtime claims made about these solutions need to be
-taken with a grain of salt.
+But formal reasoning about these automata, especially in Isabelle,
+can be challenging
+and unintuitive.
+Therefore, we take correctness and runtime claims made about these solutions
+with a grain of salt.
In the work reported in \cite{CSL2022} and here,
we add better support using derivatives
@@ -1180,6 +1191,20 @@
together with a formal proof of the correctness with those simplifications.
+One of the most recent works in the context of lexing
+%with this issue
+is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
+This is relevant work for us and we will compare it later with
+the derivative-based matcher we are going to present.
+There is also some newer work called
+Verbatim++\cite{Verbatimpp}, which does not use derivatives,
+but a deterministic finite automaton instead.
+%An example that gives problem to automaton approaches would be
+%the regular expression $(a|b)^*a(a|b)^{\{n\}}$.
+%It requires at least $2^{n+1}$ states to represent
+%as a DFA.
+
+
%----------------------------------------------------------------------------------------
\section{Contribution}