more updates in section 4.2 and incorporating Christian comments
authorChengsong
Thu, 29 Jun 2023 04:17:48 +0100
changeset 653 bc5571c38d1f
parent 652 a4d692a9a289
child 654 2ad20ba5b178
more updates in section 4.2 and incorporating Christian comments
ChengsongTanPhdThesis/Chapters/Bitcoded1.tex
ChengsongTanPhdThesis/Chapters/Bitcoded2.tex
ChengsongTanPhdThesis/Chapters/Introduction.tex
ChengsongTanPhdThesis/main.tex
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Tue Jun 27 16:26:48 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded1.tex	Thu Jun 29 04:17:48 2023 +0100
@@ -9,9 +9,10 @@
 %its correctness proof in 
 %Chapter 3\ref{Chapter3}. 
 In this chapter, we are going to describe the bit-coded algorithm
-introduced by Sulzmann and Lu \parencite{Sulzmann2014} to address the growth problem of 
-derivatives of
-regular expressions. 
+introduced by Sulzmann and Lu \parencite{Sulzmann2014} and its correctness proof.
+%to address the growth problem of 
+%derivatives of
+%regular expressions. 
 We have implemented their algorithm in Scala and Isabelle, 
 and found problems 
 in their algorithm, such as de-duplication not working properly and redundant
--- a/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Tue Jun 27 16:26:48 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Bitcoded2.tex	Thu Jun 29 04:17:48 2023 +0100
@@ -11,20 +11,27 @@
 {\color{red} \rule{\linewidth}{0.5mm}}
 New content starts.
 {\color{red} \rule{\linewidth}{0.5mm}}
+\marginpar{\em Added a completely new overview section, highlighting the contributions.}
 
 \section{Overview}
 
 This chapter
 is the point from which novel contributions of this PhD project are introduced
-in detail, 
-and previous
-chapters are essential background work for setting the scene of the formal proof we
+in detail. 
+The material in the
+previous
+chapters provides the necessary
+background for setting the scene of the formal proof we
 are about to describe.
+
+The fundamental reason is that we cannot extend the correctness proof of
+theorem \ref{blexerCorrect}, because lemma \ref{retrieveStepwise} no longer
+holds once simplifications are involved.
+
 The proof details are included in this thesis
 because they provide the necessary context to explain why we need a
 new framework for the proof of $\blexersimp$, which involves
 simplifications that cause structural changes to the regular expression.
-a new formal proof of the correctness of $\blexersimp$, where the 
+We give a new formal proof of the correctness of $\blexersimp$, where the 
 proof of $\blexer$
 is not applicable in the sense that we cannot straightforwardly extend the
 proof of theorem \ref{blexerCorrect} because lemma \ref{retrieveStepwise} does
@@ -44,15 +51,15 @@
 In particular, the correctness theorem 
 of the un-optimised bit-coded lexer $\blexer$ in 
 chapter \ref{Bitcoded1} formalised by Ausaf et al.
-relies on lemma \ref{retrieveStepwise} that says
-any value can be retrieved in a stepwise manner:
-\begin{center}	
+relies crucially on lemma \ref{retrieveStepwise}, which says
+any value can be retrieved in a stepwise manner, namely:
+\begin{center}%eqref: this proposition needs to be referenced
 	$\vdash v : (r\backslash c) \implies \retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
 \end{center}
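+\noindent
+For a small illustrative instance (our own example, in the bit-annotation
+notation used later in this chapter, not a case from the formalisation),
+take $r = {}_{Z}a + {}_{S}b$ and $c = a$, so that
+$r\backslash a = {}_{Z}\ONE + {}_{S}\ZERO$ and $v = \Left \; \Empty$; then
+\begin{center}
+	$\retrieve \; ({}_{Z}\ONE + {}_{S}\ZERO) \; (\Left \; \Empty) \;=\; Z \;=\;
+	\retrieve \; ({}_{Z}a + {}_{S}b) \; (\Left \; (\Char \; a))$
+\end{center}
+where $\Left \; (\Char \; a) = \inj \; r \; a \; v$.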
 This no longer holds once we introduce simplifications.
-Simplifications are necessary to control the size of regular expressions 
-during derivatives by eliminating redundant 
-sub-expression with some procedure we call $\textit{bsimp}$.
+Simplifications are necessary to control the size of derivatives
+by eliminating redundant sub-expressions with a procedure we call
+$\textit{bsimp}$.
 We want to prove the correctness of $\blexersimp$ which integrates
 $\textit{bsimp}$ by applying it after each call to the derivative:
 \begin{center}
@@ -75,23 +82,33 @@
 we use the convenient notation $r_{c} \dn r\backslash c$
 and $v_{r}^{c} \dn \inj \;r \; c \; v$),
 but $\blexersimp$ introduces simplification after the derivative,
-getting us trouble in aligning the pairs:
+making it difficult to align the pairs:
 \begin{center}
-	$\vdash v: \textit{bsimp} \; r_{c} \implies \retrieve \; \textit{bsimp} \; r_c \; v =\retrieve \; r  \;(\mathord{?} v_{r}^{c}) $
+	$\vdash v: \textit{bsimp} \; r_{c} \implies \retrieve \; (\textit{bsimp} \; r_c) \; v =\retrieve \; r  \;(\mathord{?} v_{r}^{c}) $
 \end{center}
 \noindent
-It is quite clear that once we made 
+It is clear that once we make
 $v$ align with $\textit{bsimp} \; r_{c}$
 in the inhabitation relation, something different than $v_{r}^{c}$ needs to be plugged
 in for the above statement to hold.
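+For example, if $\textit{bsimp}$ rewrites $r_{c} = \ONE \cdot a$ to just $a$
+(a hypothetical but typical simplification step), then a value inhabiting the
+simplified expression has the shape $\Char \; a$, whereas injection expects a
+value of the shape $\Seq \; \Empty \; (\Char \; a)$ inhabiting $\ONE \cdot a$;
+the $\mathord{?}$ above has to undo the simplification on the value level.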
 Ausaf et al. \cite{AusafUrbanDyckhoff2016}
 made some initial attempts with this idea; see \cite{FahadThesis}
 for details.
+
+Their route works fine; however, it limits the kind of simplifications
+one can introduce, and we cannot use their idea for our very strong
+simplification rules.
+
 The other route is to dispose of lemma \ref{retrieveStepwise},
-and prove a slightly weakened inductive invariant instead.
+and prove a weakened inductive invariant instead.
 We adopt this approach in this thesis.
 
-We first introduce why the inductive invariant in $\blexer$'s proof
+Let us first explain why the requirement in $\blexer$'s proof
 is too strong, and suggest a few possible fixes, leading to
 our proof, which we believe is the most natural and effective method.
 
@@ -104,9 +121,84 @@
 The $\blexer$ proof relies on a lockstep POSIX
 correspondence between the lexical value and the
 regular expression in each derivative and injection.
-If we zoom into the diagram \ref{graph:in} and look specifically at
+If we zoom into the diagram \ref{graph:inj} and look specifically at
 the pairs $v_i, r_i$ and $v_{i+1},\, r_{i+1}$, we get the diagram demonstrating
 the invariant that the same bitcodes can be extracted from the pairs:
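+% The two `three sided' styles below draw a node border with one side left
+% open (the right side for `three sided', the left side for `three sided1');
+% they are intended for adjacent boxes that together display one regular
+% expression split into two parts.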
+\tikzset{three sided/.style={
+        draw=none,
+        append after command={
+            [-,shorten <= -0.5\pgflinewidth]
+            ([shift={(-1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
+        edge([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west) 
+            ([shift={( 0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
+        edge([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)            
+            ([shift={( 0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
+        edge([shift={(-1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
+        }
+    }
+}
+
+\tikzset{three sided1/.style={
+        draw=none,
+        append after command={
+            [-,shorten <= -0.5\pgflinewidth]
+            ([shift={(1.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north west)
+        edge([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east) 
+            ([shift={(-0.5\pgflinewidth,-0.5\pgflinewidth)}]\tikzlastnode.north east)
+        edge([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)            
+            ([shift={(-0.5\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south east)
+        edge([shift={(1.0\pgflinewidth,+0.5\pgflinewidth)}]\tikzlastnode.south west)
+        }
+    }
+}
+
+\begin{figure}[H]
+	\begin{tikzpicture}[->, >=stealth', shorten >= 1pt, auto, thick]
+		\node [rectangle, draw] (1)  at (-7, 2) {$\ldots$};
+		\node [rectangle, draw] (2) at  (-4, 2) {$_{bs'}(_Za+_Saa)^*$};
+		\node [rectangle, draw] (3) at  (4, 2) {$_{bs'}(_Z(_Z\ONE + _S(\ONE \cdot a)))\cdot(_Za+_Saa)^*$};
+		\node [rectangle, draw] (4) at  (7, 2) {$\ldots$};
+		\node [rectangle, draw] (5) at  (-7, -2) {$\ldots$};
+		\node [rectangle, draw] (6) at  (-4, -2) {$\Stars \; [\Left (a)]$};
+		\node [rectangle, draw] (7) at  ( 4, -2) {$\Seq (\Alt (\Left \; \Empty)) \; \Stars \, []$};
+		\node [rectangle, draw] (8) at  ( 7, -2) {$\ldots$};
+		\node [rectangle, draw] (9) at  (-7, -6) {$\ldots$};
+		\node [rectangle, draw] (10) at (-4, -6) {$\textit{bits} = \retrieve \; r_i\;v_i$};
+		\node [rectangle, draw] (11) at (4, -6) {$\textit{bits} = \retrieve \; r_{i+1}\;v_{i+1}$};
+		\node [rectangle, draw] (12) at  (7, -6) {$\ldots$};
+
+
+		\path (1) edge [] node {} (2);
+		\path (2) edge [] node {$\backslash a$} (3);
+
+%		\node [rectangle, draw] (r) at (-6, -1) {$(aa)^*(b+c)$};
+%		\node [rectangle, draw] (a) at (-6, 4)	  {$(aa)^*(_{Z}b + _{S}c)$};
+%		\path	(r)
+%			edge [] node {$\internalise$} (a);
+%		\node [rectangle, draw] (a1) at (-3, 1) {$(_{Z}(\ONE \cdot a) \cdot (aa)^*) (_{Z}b + _Sc)$};
+%		\path	(a)
+%			edge [] node {$\backslash a$} (a1);
+%
+%		\node [rectangle, draw, three sided] (a21) at (-2.5, 4) {$(_{Z}\ONE \cdot (aa)^*)$};
+%		\node [rectangle, draw, three sided1] (a22) at (-0.8, 4) {$(_{Z}b + _{S}c)$};
+%		\path	(a1)
+%			edge [] node {$\backslash a$} (a21);
+%		\node [rectangle, draw] (a3) at (0.5, 2) {$_{ZS}(_{Z}\ONE + \ZERO)$};
+%		\path	(a22)
+%			edge [] node {$\backslash b$} (a3);
+%		\path	(a21)
+%			edge [dashed, bend right] node {} (a3);
+%		\node [rectangle, draw] (bs) at (2, 4) {$ZSZ$};
+%		\path	(a3)
+%			edge [below] node {$\bmkeps$} (bs);
+%		\node [rectangle, draw] (v) at (3, 0) {$\Seq \; (\Stars\; [\Seq \; a \; a]) \; (\Left \; b)$};
+%		\path 	(bs)
+%			edge [] node {$\decode$} (v);
+
+
+	\end{tikzpicture}
+	\caption{$\blexer$'s stepwise invariant, illustrated with a derivative step of the regular expression $(a+aa)^*$}
+\end{figure}
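+
+\noindent
+The diagram amounts to one chain of equations: writing $r_0, r_1, \ldots,
+r_n$ for the derivatives and $v_n, \ldots, v_0$ for the values built up by
+injection, the invariant states that
+\begin{center}
+	$\retrieve \; r_0 \; v_0 \;=\; \retrieve \; r_1 \; v_1 \;=\; \cdots \;=\;
+	\retrieve \; r_n \; v_n \;=\; \bmkeps \; r_n$
+\end{center}
+which is lemma \ref{retrieveStepwise} applied once per derivative step.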
 
 When simplifications are added, the inhabitation relation no longer holds,
 causing the above diagram to break.
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Tue Jun 27 16:26:48 2023 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Jun 29 04:17:48 2023 +0100
@@ -107,6 +107,7 @@
 \def\Stars{\textit{Stars}}
 \def\Char{\textit{Char}}
 \def\Seq{\textit{Seq}}
+\def\Alt{\textit{Alt}}
 \def\Der{\textit{Der}}
 \def\Ders{\textit{Ders}}
 \def\nullable{\mathit{nullable}}
@@ -206,19 +207,30 @@
 
 Regular expressions, since their inception in the 1940s, 
 have been subject to extensive study and implementation. 
-Their primary application lies in lexing, 
-a process whereby an unstructured input string is disassembled into a tree of tokens
-with their guidance.
-This is often used to match strings that comprises of numerous fields, 
-where certain fields may recur or be omitted. 
-For example, the regular expression for recognising email addresses is
+Their primary application lies in text processing: finding
+matches and identifying patterns in a string.
+%It is often used to match strings that comprises of numerous fields, 
+%where certain fields may recur or be omitted. 
+For example, a simple regular expression that tries 
+to recognise email addresses is
+\marginpar{rephrased from "the regex for recognising" to "a simple regex that tries to match email"}
 \begin{center}
-\verb|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$|.
+$[a-z0-9.\_]^+@[a-z0-9.-]^+\backslash.[a-z]\{2,6\}$
+%$[a-z0-9._]^+@[a-z0-9.-]^+\.[a-z]{2,6}$.
 \end{center}
-Using this, regular expression matchers and lexers are able to extract 
-the domain names by the use of \verb|[a-zA-Z0-9.-]+|. 
-Consequently, they are an indispensible components in text processing tools 
-of software systems such as compilers, IDEs, and firewalls.
+\marginpar{Simplified example, but the distinction between . and escaped . is correct
+and therefore left unchanged.}
+
+%Using this, regular expression matchers and lexers are able to extract 
+%the domain names by the use of \verb|[a-zA-Z0-9.-]+|. 
+\marginpar{Rewrote explanation for the expression.}
+The bracketed sub-expressions are used to extract specific parts of an email address.
+The local part is recognised by the expression enclosed in
+the first pair of brackets, $[a-z0-9.\_]^+$, and after the ``@'' sign
+comes the part that recognises the domain, where $[a-z]\{2,6\}$ specifically
+matches the top-level domain.
+%Consequently, they are an indispensible components in text processing tools 
+%of software systems such as compilers, IDEs, and firewalls.
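+
+To make the extraction concrete, here is a small sketch in Scala (using
+Scala's built-in backtracking regex engine, not the derivative-based lexer
+developed in this thesis); the two capture groups correspond to the
+bracketed sub-expressions above:
+\begin{verbatim}
+// capture the local part and the domain of an email address
+val email = raw"([a-z0-9._]+)@([a-z0-9.-]+\.[a-z]{2,6})".r
+
+"jane.doe@example.org" match {
+  case email(local, domain) =>
+    println(s"local: $local, domain: $domain")  // jane.doe / example.org
+  case _ => println("no match")
+}
+\end{verbatim}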
 
 The study of regular expressions is ongoing due to an
 issue known as catastrophic backtracking. 
@@ -226,8 +238,8 @@
 matching or lexing engine exhibits a disproportionately long 
 runtime despite the simplicity of the input and expression.
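+
+The effect is easy to reproduce. The following sketch (in Scala, on top of
+the JVM's backtracking regex engine; the pattern is the classic textbook
+example, not one from this thesis) shows the matching time growing rapidly
+with the input length:
+\begin{verbatim}
+// (a+)+b can never match a run of a's, but a backtracking
+// engine tries a huge number of ways of splitting the a's
+// between the two loops before giving up
+val evil = "(a+)+b".r.pattern
+
+for (n <- 16 to 26 by 2) {
+  val input = "a" * n
+  val t0 = System.nanoTime()
+  evil.matcher(input).matches()               // always false
+  println(f"n = $n%2d: ${(System.nanoTime() - t0) / 1e6}%8.1f ms")
+}
+\end{verbatim}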
 
-The origin of catastrophic backtracking is rooted in the 
-ambiguities of lexing. 
+One cause of catastrophic backtracking lies in the
+ambiguities of lexing.\marginpar{rephrased ``the origin of catastrophic ...''} 
 In the process of matching a multi-character string with 
 a regular expression that encompasses several sub-expressions, 
 different positions can be designated to mark 
@@ -239,7 +251,8 @@
 
 Catastrophic backtracking facilitates a category of computationally inexpensive attacks known as Regular Expression Denial of Service (ReDoS) attacks. Here, an attacker merely dispatches a small attack string to a server, provoking high-complexity behaviours in the server's regular expression engine. Such attacks, whether intentional or accidental, have led to the failure of real-world systems (additional details can be found in the practical overview section in chapter \ref{Introduction}).
 
-Various disambiguation strategies are employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
+Various disambiguation strategies are 
+employed to select sub-matches, notably including Greedy and POSIX strategies. POSIX, the strategy most widely adopted in practice, invariably opts for the longest possible sub-match. Kuklewicz \cite{KuklewiczHaskell}, for instance, offers a descriptive definition of the POSIX rule in section 1, last paragraph:
 
 
 %Regular expressions 
@@ -283,6 +296,7 @@
 %There have been prose definitions like the following
 %by Kuklewicz \cite{KuklewiczHaskell}: 
 %described the POSIX rule as (section 1, last paragraph):
+\marginpar{\em Deleted some peripheral specifications.}
 \begin{quote}
 	\begin{itemize}
 		\item
@@ -291,16 +305,17 @@
 \item
 higher-level subpatterns have leftmost-longest priority over their component subpatterns\\
 \item
-REs have right associative concatenation which can be changed with parenthesis\\
-\item
-parenthesized subexpressions return the match from their last usage\\
-\item
-text of component subexpressions must be contained in the text of the 
-higher-level subexpressions\\
-\item
-if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
-\item
-if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
+	$\ldots$
+%REs have right associative concatenation which can be changed with parenthesis\\
+%\item
+%parenthesized subexpressions return the match from their last usage\\
+%\item
+%text of component subexpressions must be contained in the text of the 
+%higher-level subexpressions\\
+%\item
+%if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions\\
+%\item
+%if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string\\
 \end{itemize}
 \end{quote}
 However, the author noted that various lexers that claim to be POSIX 
--- a/ChengsongTanPhdThesis/main.tex	Tue Jun 27 16:26:48 2023 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Thu Jun 29 04:17:48 2023 +0100
@@ -291,10 +291,9 @@
 \begin{abstract}
 \addchaptertocentry{\abstractname} % Add the abstract to the table of contents
 %\addchap{Abstract} 
-\textbf{Problem: not like an abstract, more like a summary}
-\textbf{Goal for new abstract: more high-level and abstract, tell the problem and solution in a concise way.}
-
-New abstract:\\
+\marginpar{\em Reviewer feedback: Problem: not like an abstract, more like a summary.\\
+New abstract: more high-level and abstract; state the problem and solution in a concise way.
+}
 %Modern computer systems rely on lexing for essential applications such as
 %compilers, IDEs, file systems, and network intrusion detection systems.
 %These applications require correctness with respect to
@@ -316,26 +315,26 @@
 
 
-Old abstract:
+%Old abstract:
-This thesis is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover. 
-Classic results say that regular expression matching should be 
-linear with respect to the input. The size of the regular expressions
-are often treated as a constant factor.
-However with some regular expressions and inputs, existing implementations 
-often suffer from non-linear or even exponential running time, 
-giving rise to ReDoS (regular expression denial-of-service ) attacks. 
-To avoid these attacks, regular expression matchers and lexers with 
-formalised correctness and running time related
-properties are of interest because the guarantees apply to all inputs, not just a finite
-number of empirical test cases.
-
-Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. 
-Our simplification function is more aggressive than the one by Sulzmann and Lu.
-We also fix a problem in Sulzmann and Lu's simplification to do with their use of
-the $\textit{nub}$ function which does not remove non-trivial duplicates.
-Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm.
-In this thesis we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our algorithm is correct and generates unique POSIX values; we also (ii) establish a finite bound for the size of the derivatives for every input string; we also
-(iii) give a program and a conjecture that the finite 
-bound can be improved to be cubic if stronger simplification rules are applied. 
+%This thesis is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover. 
+%Classic results say that regular expression matching should be 
+%linear with respect to the input. The size of the regular expressions
+%are often treated as a constant factor.
+%However with some regular expressions and inputs, existing implementations 
+%often suffer from non-linear or even exponential running time, 
+%giving rise to ReDoS (regular expression denial-of-service ) attacks. 
+%To avoid these attacks, regular expression matchers and lexers with 
+%formalised correctness and running time related
+%properties are of interest because the guarantees apply to all inputs, not just a finite
+%number of empirical test cases.
+%
+%Sulzmann and Lu describe a lexing algorithm that calculates Brzozowski derivatives using bitcodes annotated to regular expressions. Their algorithm generates POSIX values which encode the information of how a regular expression matches a string—that is, which part of the string is matched by which part of the regular expression. This information is needed in the context of lexing in order to extract and to classify tokens. The purpose of the bitcodes is to generate POSIX values incrementally while derivatives are calculated. They also help with designing an “aggressive” simplification function that keeps the size of derivatives finitely bounded. 
+%Our simplification function is more aggressive than the one by Sulzmann and Lu.
+%We also fix a problem in Sulzmann and Lu's simplification to do with their use of
+%the $\textit{nub}$ function which does not remove non-trivial duplicates.
+%Without simplification the size of some derivatives can grow arbitrarily big resulting in an extremely slow lexing algorithm.
+%In this thesis we describe a variant of Sulzmann and Lu’s algorithm: Our variant is a recursive functional program, whereas Sulzmann and Lu’s version involves a fixpoint construction. We (i) prove in Isabelle/HOL that our algorithm is correct and generates unique POSIX values; we also (ii) establish a finite bound for the size of the derivatives for every input string; we also
+%(iii) give a program and a conjecture that the finite 
+%bound can be improved to be cubic if stronger simplification rules are applied.