authorChengsong
Sat, 24 Dec 2022 11:55:04 +0000
changeset 637 e3752aac8ec2
parent 636 0bcb4a7cb40c
child 638 dd9dde2d902b
ChengsongTanPhdThesis/Chapters/Future.tex
ChengsongTanPhdThesis/Chapters/Inj.tex
ChengsongTanPhdThesis/Chapters/Introduction.tex
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Future.tex	Sat Dec 24 11:55:04 2022 +0000
@@ -0,0 +1,387 @@
+% Chapter Template
+
+\chapter{Future Work and Conclusion} % Main chapter title
+
+\label{Future} 
+
+In this thesis, in order to address the problem of ReDoS attacks
+once and for all, we have set out to formalise the correctness
+proof of Sulzmann and Lu's
+lexing algorithm with simplifications \cite{Sulzmann2014}
+in Isabelle/HOL. 
+We have fixed inefficiencies in the original simplification
+algorithm, and established its correctness by proving
+that our algorithm outputs the same result as the original bit-coded
+lexer without simplifications (whose correctness was established in 
+previous work by Ausaf et al. in
+\cite{Ausaf} and \cite{AusafDyckhoffUrban2016}).
+The proof technique used in \cite{Ausaf} does not work in our case
+because the simplification function alters the structure of
+regular expressions.
+Despite having to try out several workarounds and being
+stuck for months or even years looking for proofs, 
+we were delighted to eventually discover a simple yet
+effective proof method:
+modelling individual
+simplification steps as small-step rewriting rules and proving
+equivalence between terms linked by these rules.
+
+In addition, we have proved a formal size bound on the 
+derivatives of regular expressions. The technique was to establish
+``closed forms'', informally
+described by
+Murugesan and Shanmuga Sundaram \cite{Murugesan2014}, 
+for the derivatives of compound regular expressions,
+and to use those closed forms to control the size.
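+As an illustration (and not the precise statements formalised in
+chapter \ref{Finite}), the simplest of these closed forms says that
+string derivatives (written $r\backslash s$ here) distribute over alternatives,
+\[
+	(r_1 + r_2)\backslash s \;=\; (r_1\backslash s) + (r_2\backslash s),
+\]
+while the closed form for sequences states, on the level of languages, that
+\[
+	L\;((r_1 \cdot r_2)\backslash s) \;=\; 
+	L\;((r_1\backslash s)\cdot r_2) \;\cup\; 
+	\bigcup \{\, L\;(r_2\backslash s'') \mid s = s'@s'' \,\land\, s' \in L\;r_1 \,\},
+\]
+so the derivative of a compound regular expression can be related to
+finitely many derivatives of its immediate sub-expressions, whose sizes
+can then be bounded.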
+The Isabelle/HOL code for our formalisation can be 
+found at
+\begin{center}
+\url{https://github.com/hellotommmy/posix}
+\end{center}
+\noindent
+Thanks to our theorem-prover-friendly approach,
+we believe that 
+this finiteness bound can be improved to a bound that is
+linear in the input length and
+cubic in the regular expression size, using a technique by
+Antimirov~\cite{Antimirov95}.
+Once formalised, this would be a guarantee for the absence of all super-linear behaviour.
+We are working out the
+details.
+
+
+To the best of our knowledge, no lexing libraries using Brzozowski derivatives
+have similar complexity-related bounds, 
+and claims about running time are usually speculative and backed only by empirical
+evidence on a few test cases.
+If a matching or lexing algorithm
+does not come with certain basic complexity-related 
+guarantees (for example, that the internal data structure size
+does not grow indefinitely), 
+then one cannot claim with confidence to have solved the problem
+of catastrophic backtracking.
+
+We believe our proof method is not specific to this
+algorithm, and intend to extend this approach
+to prove the correctness of a faster version
+of the lexer. The closed
+forms can then be re-used as well. Together
+with the idea from Antimirov \cite{Antimirov95},
+we plan to reduce the size bound to one that is polynomial
+in the size of the regular expression.
+%We have tried out a variety of other optimisations 
+%to make lexing/parsing with derivatives faster,
+%for example in \cite{Edelmann2021}, \cite{Darragh2020}
+%and \cite{Verbatimpp}.
+
+We have learned quite a few lessons in this process.
+As simple as the end result may seem,
+coming up with the initial proof idea with rewriting relations
+was the hardest part for us, as it required a deep understanding of what information
+is preserved during the simplification process.
+Having the first formalisation gave us much more confidence: it took
+us much less time to obtain the closed forms and the size bound, although they
+are more involved than the correctness proof.
+As has been demonstrated numerous times, 
+the usefulness of formal approaches cannot be overstated:
+they not only allowed us to find bugs in Sulzmann and Lu's simplification functions,
+but also helped us set up realistic expectations about the performance
+of algorithms. In the beginning, we believed that
+the $\blexersimp$ lexer defined in chapter \ref{Bitcoded2}
+was already able to achieve the linear bound claimed by
+Sulzmann and Lu.
+We then attempted to prove that $\llbracket \blexersimp\; r \; s\rrbracket$ 
+is $O(\llbracket r\rrbracket^c\cdot |s|)$,
+where $c$ is a small constant, making
+$\llbracket r\rrbracket^c$ a small constant factor.
+We were quite surprised that this did not go through despite a lot of effort.
+This led us to discover, in chapter \ref{Finite}, an example where this
+factor can in fact be exponentially large.
+We therefore learned that it is necessary to introduce the stronger simplifications
+of chapter \ref{Cubic}.
+Without formal proofs, we could not have discovered this so soon.
+
+
+In the future, we plan to extend this work in
+several directions.
+
+\section{Future Work}
+The most obvious next step is to formally establish the cubic bound and
+the correctness of $\blexerStrong$
+from chapter \ref{Cubic}.
+A cubic bound ($O(\llbracket r\rrbracket^3\cdot |s|)$) with respect to 
+the regular expression size would fulfill
+the linear complexity claim made by Sulzmann and Lu.
+
+With a size bound linear in the input established in theory, the next challenge would be
+to generate code that is competitive with commercial
+matchers. For that, a lot of optimisations are needed.
+We aim to integrate the zipper data structure into our lexer.
+The idea is very simple: using a zipper with multiple focuses
+just like Darragh \cite{Darragh2020} did, we could represent
+\[
+	x\cdot r + y \cdot r + \ldots
+\]
+as 
+\[
+	(x+y + \ldots) \cdot r.
+\]
+This would greatly reduce the amount of space needed
+when we have many terms like $x\cdot r$.
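+The following is a minimal sketch of this sharing idea in Scala, written
+independently of the zipper machinery and of our Isabelle/HOL formalisation;
+the datatype \texttt{Rexp} and its constructors are hypothetical names used
+only for this illustration:
+\begin{verbatim}
+// Factor alternatives with a common right factor:
+//   x . r  +  y . r  +  ...   ~~>   (x + y + ...) . r
+sealed trait Rexp
+case class ALT(rs: List[Rexp]) extends Rexp           // n-ary alternative
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp       // sequence
+case class CH(c: Char) extends Rexp                   // single character
+
+def factor(rs: List[Rexp]): List[Rexp] = {
+  val (seqs, rest) = rs.partition(_.isInstanceOf[SEQ])
+  // group sequence terms by their shared right factor r
+  val grouped = seqs.collect { case SEQ(x, r) => (x, r) }.groupBy(_._2)
+  grouped.toList.map {
+    case (r, List((x, _))) => SEQ(x, r)                  // only one term: keep it
+    case (r, xs)           => SEQ(ALT(xs.map(_._1)), r)  // share the tail r
+  } ++ rest
+}
+\end{verbatim}
+Note that this sketch deliberately ignores the POSIX ordering of
+alternatives; preserving that ordering is precisely the difficulty
+discussed next.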
+Some initial results have been obtained, but 
+significant changes are needed to make sure 
+that the lexer output conforms to the POSIX standard. 
+We aim to make use of Okui and Suzuki's labelling
+system \cite{Okui10} to ensure that regular expressions represented as zippers
+always maintain the POSIX ordering.
+%Formalising correctness of 
+%such a lexer using zippers will probably require using an imperative
+%framework with separation logic as it involves
+%mutable data structures, which is also interesting to look into.  
+
+To further optimise the algorithm, 
+we plan to add a deduplication function that detects when
+two regular expressions denote the same 
+language, using
+an efficient and verified equivalence checker such as the one by Krauss and Nipkow \cite{Krauss2011}.
+In this way, the internal data structures can be pruned more aggressively, 
+leading to better simplifications and ultimately faster algorithms.
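+A rough sketch of this pruning step, reusing the hypothetical \texttt{Rexp}
+datatype from the sketch above, is given below. A verified equivalence
+checker such as Krauss and Nipkow's \cite{Krauss2011} would identify many
+more expressions; purely for illustration, equivalence is here crudely
+approximated by structural equality after a basic normalisation of alternatives:
+\begin{verbatim}
+// Identify regular expressions up to flattening, deduplication and
+// reordering of alternatives (a crude stand-in for language equivalence).
+def norm(r: Rexp): Rexp = r match {
+  case ALT(rs) =>
+    val flat = rs.map(norm).flatMap {
+      case ALT(inner) => inner
+      case other      => List(other)
+    }
+    ALT(flat.distinct.sortBy(_.toString))
+  case SEQ(r1, r2) => SEQ(norm(r1), norm(r2))
+  case other       => other
+}
+
+// Keep only the first representative of each (approximated) equivalence class.
+def dedup(rs: List[Rexp]): List[Rexp] = {
+  val seen = scala.collection.mutable.Set[Rexp]()
+  rs.filter(r => seen.add(norm(r)))
+}
+\end{verbatim}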
+
+Traditional automaton approaches can be sped up by compiling
+multiple rules into the same automaton. This has been done
+in \cite{Kumar2006} and \cite{Becchi08}, for example.
+Currently, our lexer only deals with a single regular expression at a time.
+Handling multiple regular expressions together might open up more
+possibilities for simplification.
+
+As already mentioned in chapter \ref{RelatedWork},
+reducing the number of memory accesses can accelerate the 
+matching speed. It would be interesting to measure the memory
+bandwidth of our derivative-based matcher and improve it accordingly.
+
+Memoization has been used frequently in lexing and parsing to achieve
+better complexity results, for example in \cite{Reps1998},
+\cite{Verbatimpp}, \cite{Adams2016}, \cite{Darragh2020} and \cite{Edelmann2021}.
+We plan to explore the performance enhancements offered by memoization in our algorithm
+in a correctness-preserving way.
+The monadic data refinement technique that Lammich and Tuerk used
+in \cite{Lammich2012} to optimise Hopcroft's automaton minimisation
+algorithm seems quite interesting.
+We aim to learn from their refinement framework,
+which generates high-performance code with proofs
+that can be broken down into small, uncluttered steps.
+
+
+
+
+%We establish the correctness 
+%claim made by them about the 
+%, which were only proven with pencil and paper.
+
+
+%\section{Work on Back-References}
+%We introduced regular expressions with back-references
+%in chapter \ref{Introduction}.
+%We adopt the common practice of calling them rewbrs (Regular Expressions
+%With Back References) in this section for brevity.
+%It has been shown by Aho \cite{Aho1990}
+%that the k-vertex cover problem can be reduced
+%to the problem of rewbrs matching, and is therefore NP-complete.
+%Given the depth of the problem, the progress made at the full generality
+%of arbitrary rewbrs matching has been slow, with
+%theoretical work on them being 
+%fairly recent. 
+%
+%Campaneu et al. studied regexes with back-references
+%in the context of formal languages theory in 
+%their 2003 work \cite{campeanu2003formal}.
+%They devised a pumping lemma for Perl-like regexes, 
+%proving that the langugages denoted by them
+%is  properly  contained in context-sensitive
+%languages.
+%More interesting questions such as 
+%whether the language denoted by Perl-like regexes
+%can express palindromes ($\{w \mid w = w^R\}$)
+%are discussed in \cite{campeanu2009patterns} 
+%and \cite{CAMPEANU2009Intersect}.
+%Freydenberger \cite{Frey2013} investigated the 
+%expressive power of back-references. He showed several 
+%undecidability and decriptional complexity results 
+%of back-references, concluding that they add
+%great power to certain programming tasks but are difficult to harness.
+%An interesting question would then be
+%whether we can add restrictions to them, 
+%so that they become expressive enough for practical use such
+%as html files, but not too powerful.
+%Freydenberger and Schmid \cite{FREYDENBERGER20191}
+%introduced the notion of deterministic
+%regular expressions with back-references to achieve
+%a better balance between expressiveness and tractability.
+%
+%
+%Fernau and Schmid \cite{FERNAU2015287} and Schmid \cite{Schmid2012}
+%investigated the time complexity of different variants
+%of back-references.
+%We are not aware of any work that uses derivatives on back-references.
+%
+%See \cite{BERGLUND2022} for a survey
+%of these works and comparison between different
+%flavours  of back-references syntax (e.g. whether references can be circular, 
+%can labels be repeated etc.).
+%
+%
+%\subsection{Matchers and Lexers with Mechanised Proofs}
+%We are aware
+%of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
+%Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
+%of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
+%in Coq is given by Coquand and Siles \parencite{Coquand2012}.
+%Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
+% 
+%\subsection{Static Analysis of Evil Regex Patterns} 
+%When a regular expression does not behave as intended,
+%people usually try to rewrite the regex to some equivalent form
+%or they try to avoid the possibly problematic patterns completely,
+%for which many false positives exist\parencite{Davis18}.
+%Animated tools to "debug" regular expressions such as
+% \parencite{regexploit2021} \parencite{regex101} are also popular.
+%We are also aware of static analysis work on regular expressions that
+%aims to detect potentially expoential regex patterns. Rathnayake and Thielecke 
+%\parencite{Rathnayake2014StaticAF} proposed an algorithm
+%that detects regular expressions triggering exponential
+%behavious on backtracking matchers.
+%Weideman \parencite{Weideman2017Static} came up with 
+%non-linear polynomial worst-time estimates
+%for regexes, attack string that exploit the worst-time 
+%scenario, and "attack automata" that generates
+%attack strings.
+%
+%
+%
+%
+%
+%
+%Thanks to our theorem-prover-friendly approach,
+%we believe that 
+%this finiteness bound can be improved to a bound
+%linear to input and
+%cubic to the regular expression size using a technique by
+%Antimirov\cite{Antimirov95}.
+%Once formalised, this would be a guarantee for the absence of all super-linear behavious.
+%We are working out the
+%details.
+%
+% 
+%To our best knowledge, no lexing libraries using Brzozowski derivatives
+%have similar complexity-related bounds, 
+%and claims about running time are usually speculative and backed by empirical
+%evidence on a few test cases.
+%If a matching or lexing algorithm
+%does not come with certain basic complexity related 
+%guarantees (for examaple the internal data structure size
+%does not grow indefinitely), 
+%then they cannot claim with confidence having solved the problem
+%of catastrophic backtracking.
+%
+%
+%\section{Optimisations}
+%Perhaps the biggest problem that prevents derivative-based lexing 
+%from being more widely adopted
+%is that they are still not blindingly fast in practice, unable to 
+%reach throughputs like gigabytes per second, which the commercial
+%regular expression matchers such as Snort and Bro are able to achieve.
+%For our algorithm to be more attractive for practical use, we
+%need more correctness-preserving optimisations.
+%%So far our focus has been mainly on the bit-coded algorithm,
+%%but the injection-based lexing algorithm 
+%%could also be sped up in various ways:
+%%
+%One such optimisation is defining string derivatives
+%directly, without recursively decomposing them into 
+%character-by-character derivatives. For example, instead of
+%replacing $(r_1 + r_2)\backslash (c::cs)$ by
+%$((r_1 + r_2)\backslash c)\backslash cs$, we rather
+%calculate $(r_1\backslash (c::cs) + r_2\backslash (c::cs))$.
+%
+%
+%\subsection{Derivatives and Zippers}
+%Zipper is a data structure designed to focus on 
+%and navigate between local parts of a tree.
+%The idea is that often operations on a large tree only deal with
+%local regions each time.
+%It was first formally described by Huet \cite{HuetZipper}.
+%Typical applications of zippers involve text editor buffers
+%and proof system databases.
+%In our setting, the idea is to compactify the representation
+%of derivatives with zippers, thereby making our algorithm faster.
+%We draw our inspirations from several works on parsing, derivatives
+%and zippers.
+%
+%Edelmann et al. developed a formalised parser for LL(1) grammars using derivatives
+%\cite{Zippy2020}.
+%They adopted zippers to improve the speed, and argued that the runtime
+%complexity of their algorithm was linear with respect to the input string.
+%
+%The idea of using Brzozowski derivatives on general context-free 
+%parsing was first implemented
+%by Might et al. \ref{Might2011}. 
+%They used memoization and fixpoint construction to eliminate infinite recursion,
+%which is a major problem for using derivatives on context-free grammars.
+%The initial version was quite slow----exponential on the size of the input.
+%Adams et al. \ref{Adams2016} improved the speed and argued that their version
+%was cubic with respect to the input.
+%Darragh and Adams \cite{10.1145/3408990} further improved the performance
+%by using zippers in an innovative way--their zippers had multiple focuses
+%instead of just one in a traditional zipper to handle ambiguity.
+%Their algorithm was not formalised, though.
+%
+%We aim to  integrate the zipper data structure into our lexer.
+%The idea is very simple: using a zipper with multiple focuses
+%just like Darragh did, we could represent
+%\[
+%	x\cdot r + y \cdot r + \ldots
+%\]
+%as 
+%\[
+%	(x+y + \ldots) \cdot r.
+%\]
+%This would greatly reduce the amount of space needed
+%when we have many terms like $x\cdot r$.
+%
+%into the current lexer. 
+%This allows the lexer to not traverse the entire 
+%derivative tree generated
+%in every intermediate step, but only a small segment 
+%that is currently active. 
+%
+%Some initial results have been obtained, but more care needs to be taken to make sure 
+%that the lexer output conforms to the POSIX standard. Formalising correctness of 
+%such a lexer using zippers will probably require using an imperative
+%framework with separation logic as it involves
+%mutable data structures, which is also interesting to look into.  
+%
+%To further optimise the algorithm, 
+%I plan to add a ``deduplicate'' function that tells 
+%whether two regular expressions denote the same 
+%language using
+%an efficient and verified equivalence checker.
+%In this way, the internal data structures can be pruned more aggressively, 
+%leading to better simplifications and ultimately faster algorithms.
+%
+%
+%Some initial results 
+%We first give a brief introduction to what zippers are,
+%and other works
+%that apply zippers to derivatives
+%When dealing with large trees, it would be a waste to 
+%traverse the entire tree if
+%the operation only
+%involves a small fraction of it.
+%The idea is to put the focus on that subtree, turning other parts
+%of the tree into a context
+%
+%
+%One observation about our derivative-based lexing algorithm is that
+%the derivative operation sometimes traverses the entire regular expression
+%unnecessarily:
+
--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Fri Dec 23 13:27:56 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Sat Dec 24 11:55:04 2022 +0000
@@ -11,8 +11,8 @@
 In this chapter, we define the basic notions 
 for regular languages and regular expressions.
 This is essentially a description in ``English''
-of our formalisation in Isabelle/HOL.
-We also give the definition of what $\POSIX$ lexing means, 
+of the functions and datatypes of our formalisation in Isabelle/HOL.
+We also define what $\POSIX$ lexing means, 
 followed by a lexing algorithm by Sulzmann and Lu \parencite{Sulzmann2014} 
 that produces the output conforming
 to the $\POSIX$ standard\footnote{In what follows 
@@ -26,7 +26,7 @@
 \section{Basic Concepts}
 Formal language theory usually starts with an alphabet 
 denoting a set of characters.
-Here we just use the datatype of characters from Isabelle,
+Here we use the datatype of characters from Isabelle,
 which roughly corresponds to the ASCII characters.
 In what follows, we shall leave the information about the alphabet
 implicit.
@@ -40,9 +40,9 @@
 where $c$ is a variable ranging over characters.
 The $::$ stands for list cons and $[]$ for the empty
 list.
-A singleton list is sometimes written as $[c]$ for brevity.
+For brevity, a singleton list is sometimes written as $[c]$.
 Strings can be concatenated to form longer strings in the same
-way as we concatenate two lists, which we shall write as $s_1 @ s_2$.
+way we concatenate two lists, which we shall write as $s_1 @ s_2$.
 We omit the precise 
 recursive definition here.
 We overload this concatenation operator for two sets of strings:
@@ -121,7 +121,7 @@
 \end{tabular}	
 \end{center}
 \noindent
-and in the end test whether the set
+and in the end, test whether the set
 has the empty string.\footnote{We use the infix notation $A\backslash c$
 	instead of $\Der \; c \; A$ for brevity, as it is clear we are operating
 on languages rather than regular expressions.}
@@ -156,15 +156,18 @@
 \begin{itemize}
 \item{$\subseteq$}:\\
 The set 
-\[ \{s \mid c :: s \in A*\} \]
+\[ S_1 = \{s \mid c :: s \in A*\} \]
 is enclosed in the set
-\[ \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \} \]
-because whenever you have a string starting with a character 
-in the language of a Kleene star $A*$, 
-then that character together with some sub-string
-immediately after it will form the first iteration, 
-and the rest of the string will 
-be still in $A*$.
+\[ S_2 = \{s_1 @ s_2 \mid s_1 \, s_2.\;  s_1 \in \{s \mid c :: s \in A\} \land s_2 \in A* \}. \]
+This is because for any string $c::s$ satisfying $c::s \in A*$,
+%whenever you have a string starting with a character 
+%in the language of a Kleene star $A*$, 
+%then that
+the character $c$, together with a prefix of $s$
+%immediately after $c$ 
+forms the first iteration of $A*$, 
+and the rest of $s$ is still in $A*$.
+This coincides with the definition of $S_2$.
 \item{$\supseteq$}:\\
 Note that
 \[ \Der \; c \; (A*) = \Der \; c \;  (\{ [] \} \cup (A @ A*) ) \]
@@ -194,7 +197,7 @@
 \]
 \noindent
 We call them basic because we will introduce
-additional constructors in later chapters such as negation
+additional constructors in later chapters, such as negation
 and bounded repetitions.
 We use $\ZERO$ for the regular expression that
 matches no string, and $\ONE$ for the regular
@@ -223,7 +226,7 @@
 \noindent
 %Now with language derivatives of a language and regular expressions and
 %their language interpretations in place, we are ready to define derivatives on regular expressions.
-With $L$ we are ready to introduce Brzozowski derivatives on regular expressions.
+With $L$, we are ready to introduce Brzozowski derivatives on regular expressions.
 We do so by first introducing what properties it should satisfy.
 
 \subsection{Brzozowski Derivatives and a Regular Expression Matcher}
@@ -380,8 +383,8 @@
 	$s \in L \; r_{start} \iff [] \in L \; r_{end}$
 \end{property}
 \noindent
-Next we give the recursive definition of derivative on
-regular expressions, so that it satisfies the properties above.
+Next, we give the recursive definition of derivative on
+regular expressions so that it satisfies the properties above.
 The derivative function, written $r\backslash c$, 
 takes a regular expression $r$ and character $c$, and
 returns a new regular expression representing
@@ -441,7 +444,7 @@
 	(r^*) \backslash c \dn (r \backslash c)\cdot r^*.
 \]
 Again,
-the structure is the same as the language derivative of Kleene star: 
+the structure is the same as the language derivative of the Kleene star: 
 \[
 	\textit{Der} \;c \;(A*) \dn (\textit{Der}\; c A) @ (A*)
 \]
@@ -522,8 +525,8 @@
 \end{definition}
 
 \noindent
-Assuming the string is given as a sequence of characters, say $c_0c_1..c_n$, 
-this algorithm presented graphically is as follows:
+Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_n$, 
+this algorithm, presented graphically, is as follows:
 
 \begin{equation}\label{matcher}
 \begin{tikzcd}
@@ -540,7 +543,7 @@
 	$\textit{match} \; s\; r  = \textit{true} \; \textit{iff} \; s \in L(r)$
 \end{lemma}
 \begin{proof}
-	By induction on $s$ using property of derivatives:
+	By induction on $s$ using the property of derivatives:
 	lemma \ref{derDer}.
 \end{proof}
 \begin{figure}
@@ -566,7 +569,7 @@
 the algorithm can be excruciatingly slow, as shown in 
 \ref{NaiveMatcher}.
 Note that both axes are in logarithmic scale.
-Around two dozens characters
+Around two dozen characters
 would already ``explode'' the matcher with the regular expression 
 $(a^*)^*b$.
 To improve this situation, we need to introduce simplification
@@ -628,8 +631,8 @@
 is now ``tame''  in terms of the length of inputs,
 as shown in Figure \ref{BetterMatcher}.
 
-So far the story is use Brzozowski derivatives and
-simplify as much as possible and at the end test
+So far, the story is to use Brzozowski derivatives and
+simplify as much as possible, and at the end test
 whether the empty string is recognised 
 by the final derivative.
 But what if we want to 
@@ -727,7 +730,7 @@
 $\vdash \Seq(\Right\; a)(\Left \; bc ):(ab+a)(bc+c)$ hold
 and the values both flatten to $abc$.
 Lexers therefore have to disambiguate and choose only
-one of the values be generated. $\POSIX$ is one of the
+one of the values to be generated. $\POSIX$ is one of the
 disambiguation strategies that is widely adopted.
 
 Ausaf et al.\parencite{AusafDyckhoffUrban2016} 
@@ -796,10 +799,10 @@
 %\end{tikzpicture}
 %\caption{Maximum munch example: $s$ matches $r_{token1} \cdot r_{token2}$}\label{munch}
 %\end{figure}
-The above $\POSIX$ rules follows the intuition described below: 
+The above $\POSIX$ rules follow the intuition described below: 
 \begin{itemize}
 	\item (Left Priority)\\
-		Match the leftmost regular expression when multiple options of matching
+		Match the leftmost regular expression when multiple options for matching
 		are available. See P+L and P+R where in P+R $s$ cannot
 		be in the language of $L \; r_1$.
 	\item (Maximum munch)\\
@@ -809,12 +812,12 @@
 		$r_{part1}\cdot r_{part2}$, and we have two ways $s$ can be split:
 		Then the split that matches a longer string for the first part
 		$r_{part1}$ is preferred by this maximum munch rule.
-		This is caused by the side-condition 
+		The side-condition 
 		\begin{center}
 		$\nexists s_3 \; s_4. s_3 \neq [] \land s_3 @ s_4 = s_2 \land 
 		s_1@ s_3 \in L \; r_1 \land s_4 \in L \; r_2$
 		\end{center}
-		in PS.
+		in PS causes this.
 		%(See
 		%\ref{munch} for an illustration).
 \end{itemize}
@@ -883,8 +886,8 @@
 \;\;and\;\; (s_2'', r_2) \rightarrow v_2'' \;\;and \;\;s_1'' @s_2'' = s 
 \]
 cannot possibly form a $\POSIX$ value either, because
-by definition there is a candidate
-with longer initial string
+by definition, there is a candidate
+with a longer initial string
 $s_1$. Therefore, we know that the POSIX
 value $\Seq \; a \; b$ for $r_1 \cdot r_2$ matching
 $s$ must have the 
@@ -900,8 +903,8 @@
 is the same as $\Seq(v_1, v_2)$. 
 \end{proof}
 \noindent
-We have now defined what a POSIX value is and shown that it is unique,
-the problem is to generate
+We have now defined what a POSIX value is and shown that it is unique.
+The problem is to generate
 such a value in a lexing algorithm using derivatives.
 
 \subsection{Sulzmann and Lu's Injection-based Lexing Algorithm}
@@ -963,7 +966,7 @@
 \noindent
 After the $\mkeps$-call, Sulzmann and Lu inject back the characters one by one
 in reverse order as they were chopped off in the derivative phase.
-The fucntion for this is called $\inj$. This function 
+The function for this is called $\inj$. This function 
 operates on values, unlike $\backslash$ which operates on regular expressions.
 In the diagram below, $v_i$ stands for the (POSIX) value 
 for how the regular expression 
@@ -985,9 +988,9 @@
 	The first phase involves taking successive derivatives w.r.t the characters $c_0$,
 	$c_1$, and so on. These are the same operations as they have appeared in the matcher
 	\ref{matcher}. When the final derivative regular expression is nullable (contains the empty string),
-	then the second phase starts. First $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
-	the empty string , by always selecting the leftmost 
-	nullable regular expression. After that $\inj$ ``injects'' back the character in reverse order as they 
+	then the second phase starts. First, $\mkeps$ generates a POSIX value which tells us how $r_n$ matches
+	the empty string, by always selecting the leftmost 
+	nullable regular expression. After that, $\inj$ ``injects'' the characters back in the reverse order in which they 
 	appeared in the string, always preserving POSIXness.}\label{graph:inj}
 \end{figure}
 \noindent
@@ -1020,7 +1023,7 @@
 \noindent 
 The function recurses on 
 the shape of regular
-expressionsw  and values.
+expressions and values.
 Intuitively, each clause analyses 
 how $r_i$ could have transformed when being 
 derived by $c$, identifying which subpart
@@ -1028,13 +1031,13 @@
 to inject the character back into.
 Once the character is
 injected back into that sub-value, 
-$\inj$ assembles all parts together
+$\inj$ assembles all parts
 to form a new value.
 
 For instance, the last clause is an
 injection into a sequence value $v_{i+1}$
 whose second child
-value is a star, and the shape of the 
+value is a star and the shape of the 
 regular expression $r_i$ before injection 
 is a star.
 We therefore know 
@@ -1057,7 +1060,7 @@
 \]
 Finally, 
 $\inj \; r \;c \; v$ is prepended
-to the previous list of iterations, and then
+to the previous list of iterations and then
 wrapped under the $\Stars$
 constructor, giving us $\Stars \; ((\inj \; r \; c \; v) ::vs)$.
 
@@ -1088,7 +1091,7 @@
 				\end{tabular}
 			\end{center}
 			We know that there exists a unique pair of
-			$s_a$ and $s_b$ satisfaying	
+			$s_a$ and $s_b$ satisfying	
 				$(a \backslash c, s_a) \rightarrow v_a$,
 				$(b , s_b) \rightarrow v_b$, and
 				$\nexists s_3 \; s_4. s_3 \neq [] \land s_a @ s_3 \in 
@@ -1113,7 +1116,7 @@
 					& $=$ & $\Seq \; (\inj \;a \; c \; v_a) \; v_b$
 				\end{tabular}
 			\end{center}
-			With a similar reasoning, 
+			With similar reasoning, 
 
 			\[
 				(a\cdot b, (c::s_a)@s_b) \rightarrow \Seq \; (\inj \; a\;c \;v_a) \; v_b.
@@ -1133,7 +1136,7 @@
 			\[
 				(a, []) \rightarrow \mkeps \; a.
 			\]
-			Also by inductive hypothesis
+			Also, by inductive hypothesis
 			\[
 				(b, c::s) \rightarrow \inj\; b \; c \; v_b
 			\]
@@ -1154,7 +1157,7 @@
 				s_3 @s_4 = c::s  \land s_3 \in L \; a 
 				\land s_4 \in L \; b.
 			\]
-			(Which basically says there cannot be a longer 
+			(Which says there cannot be a longer 
 			initial split for $s$ other than the empty string.)
 			Therefore we have $\Seq \; (\mkeps \; a) \;(\inj \;b \; c\; v_b)$
 			as the POSIX value for $a\cdot b$.
@@ -1187,20 +1190,21 @@
 \end{theorem} 
 \begin{proof}
 By induction on $s$. $r$ generalising over an arbitrary regular expression.
-The $[]$ case is proven by  an application of lemma \ref{mePosix}, and the inductive case
+The $[]$ case is proven by an application of lemma \ref{mePosix}, and the inductive case
 by lemma \ref{injPosix}.
 \end{proof}
 \noindent
 As we did earlier in this chapter with the matcher, one can 
 introduce simplification on the regular expression in each derivative step.
-However, now one needs to do a backward phase and make sure
-the values align with the regular expression.
+However, for lexing one needs to do a backward phase (with respect to the forward derivative phase)
+and ensure that
+the values align with the regular expression at each step.
 Therefore one has to
 be careful not to break the correctness, as the injection 
 function heavily relies on the structure of 
 the regular expressions and values being aligned.
 This can be achieved by recording some extra rectification functions
-during the derivatives step, and applying these rectifications in 
+during the derivative steps and applying these rectifications in 
 each run during the injection phase.
 With extra care
 one can show that POSIXness will not be affected
@@ -1223,16 +1227,17 @@
 	\end{tabular}
 \end{center}
 
-However, with the simple-minded simplification rules allowed
-in an injection-based lexer, one can still end up
-with exploding derivatives.
-\section{A Case Requring More Aggressive Simplifications}
+However, one can still end up
+with exploding derivatives,
+even with the simple-minded simplification rules allowed
+in an injection-based lexer.
+\section{A Case Requiring More Aggressive Simplifications}
 For example, when starting with the regular
 expression $(a^* \cdot a^*)^*$ and building just over
 a dozen successive derivatives 
 w.r.t.~the character $a$, one obtains a derivative regular expression
 with millions of nodes (when viewed as a tree)
-even with the simple-minded simplification.
+even with the mentioned simplifications.
 \begin{figure}[H]
 \begin{center}
 \begin{tikzpicture}
@@ -1262,7 +1267,7 @@
 as an example.
 This is a highly ambiguous regular expression, with
 many ways to split up the string into multiple segments for
-different star iteratioins,
+different star iterations,
 and for each segment 
 multiple ways of splitting between 
 the two $a^*$ sub-expressions.
@@ -1331,7 +1336,7 @@
 A lexer without a good enough strategy to 
 deduplicate will naturally
 have an exponential runtime on highly
-ambiguous regular expressions, because there
+ambiguous regular expressions because there
 are exponentially many matches.
 For this particular example, it seems
 that the number of distinct matches growth
@@ -1342,12 +1347,12 @@
 $s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$ is  
 \[
 	\Stars\,
-	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]]
+	[\Seq \; (\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}]), \Stars\,[]].
 \]
-and at any moment the  subterms in a regular expression
-that will result in a POSIX value is only
+At any moment, the subterms of a regular expression
+that will potentially result in a POSIX value are only
+a minority among the many other terms,
+and one can remove those that cannot possibly 
+be POSIX.
 be POSIX.
 In the above example,
 \begin{equation}\label{eqn:growth2}
@@ -1370,23 +1375,23 @@
 		$\Stars \; [\Seq \; (\Stars \; [\Char \; a])\; (\Stars \; [\Char \; a])]  $ &  $(\text{term 2})$
 	\end{tabular}
 \end{center}
-Other terms with an underlying value such as
+Other terms with an underlying value, such as
 \[
 	\Stars \; [\Seq \; (\Stars \; [])\; (\Stars \; [\Char \; a, \Char \; a])]
 \]
-is simply too hopeless to contribute a POSIX lexical value,
+are too hopeless to contribute a POSIX lexical value,
 and are therefore thrown away.
 
 Ausaf et al. \cite{AusafDyckhoffUrban2016} 
 have come up with some simplification steps, however those steps
-are not yet sufficiently strong so that they achieve the above effects.
-And even with these relatively mild simplifications the proof
-is already quite a bit complicated than the theorem \ref{lexerCorrectness}.
-One would prove something like: 
+are not yet sufficiently strong to achieve the above effects.
+And even with these relatively mild simplifications, the proof
+is already quite a bit more complicated than that of theorem \ref{lexerCorrectness}.
+One would prove something like this: 
 \[
 	\textit{If}\; (\textit{snd} \; (\textit{simp} \; r\backslash c), s) \rightarrow  v  \;\;
 	\textit{then}\;\; (r, c::s) \rightarrow 
-	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v) 
+	\inj\;\, r\,  \;c \;\, ((\textit{fst} \; (\textit{simp} \; r \backslash c))\; v) 
 \]
 instead of the simple lemma \ref{injPosix}, where now $\textit{simp}$
 not only has to return a simplified regular expression,
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Dec 23 13:27:56 2022 +0000
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Dec 24 11:55:04 2022 +0000
@@ -1311,6 +1311,8 @@
 algorithm to include bounded repetitions. 
 In chapter \ref{Cubic} we discuss stronger simplification rules which 
 improve the finite bound to a cubic bound.%and the NOT regular expression.
+Chapter \ref{RelatedWork} discusses work related to this thesis.
+Chapter \ref{Future} concludes with avenues of future research.