ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 577 f47fc4840579
parent 576 3e1b699696b6
child 590 988e92a70704
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Aug 12 00:39:23 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Sun Aug 14 09:44:27 2022 +0100
@@ -9,20 +9,32 @@
 
 In this chapter we give a guarantee in terms of size: 
 given an annotated regular expression $a$, for any string $s$ 
-our algorithm's internal data structure 
-$\bderssimp{a}{s}$'s size
+our algorithm's internal annotated regular expression 
+size
 \begin{center}
 $\llbracket \bderssimp{a}{s} \rrbracket$
 \end{center}
 \noindent
 is finitely bounded
-by a constant $N_a$ that only depends on $a$.
+by a constant $N_a$ that only depends on $a$,
+where the size of an annotated regular expression is defined
+in terms of the number of nodes in its tree structure:
+\begin{center}
+\begin{tabular}{ccc}
+	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+\end{tabular}
+\end{center}
 
 \section{Formalising About Size}
+\noindent
 In our lexer $\blexersimp$,
 The regular expression is repeatedly being taken derivative of
 and then simplified.
-
 \begin{center}
 \begin{tikzpicture}[scale=2,node distance=1.4cm,
                     every node/.style={minimum size=9mm}]
@@ -37,36 +49,48 @@
 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
-\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
+\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$};
 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
 \end{tikzpicture}
 \end{center}
 \noindent
 As illustrated in the picture, 
-each time the derivative is taken derivative of, it grows,
-and when it is being simplified, it shrinks.
-The blue ones are the regular expressions after simplification,
-which would be smaller than before.
+each time the regular expression 
+is taken derivative of, it grows (the black nodes),
+and after simplification, it shrinks (the blue nodes).
+This intuition is depicted by the relative size
+difference between the black and blue nodes.
 We give a mechanised proof that after simplification 
 the regular expression's size (the blue ones)
 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
 a fixed $a$. 
-While it is not yet a direct formalisation of our lexer's complexity,
-it is a stepping stone towards a complexity proof because
-if the data structures out lexer has to traverse is large, the program
-will certainly be slow.
 
-The bound is not yet tight, and we seek to improve $N_a$ so that
+There are two problems with this finiteness proof, though.
+\begin{itemize}
+	\item
+First, It is not yet a direct formalisation of our lexer's complexity,
+as a complexity proof would require looking into 
+the time it takes to execute {\bf all} the operations
+involved in the lexer (simp, collect, decode), not just the derivative.
+\item
+Second, the bound is not yet tight, and we seek to improve $N_a$ so that
 it is polynomial on $\llbracket a \rrbracket$.
-We believe a formalisation of the complexity-related properties
-of the algorithm is important in our context, because we want to address
-catastrophic backtracking, which is not a correctness issue but
-in essence a performance issue, a formal proof can give
-the strongest assurance that such issues cannot arise 
-regardless of what the input is.
-This level of certainty cannot come from a pencil and paper proof or
-eimpirical evidence.
-
+\end{itemize}
+Still, we believe this size formalisation 
+of the algorithm is important in our context, because 
+\begin{itemize}
+	\item
+		Derivatives are the most important phases of our lexer algorithm.
+		Size properties about derivatives covers the majority of the algorithm
+		and is therefore a good indication of complexity of the entire program.
+	\item
+		What the size bound proof does ensure is an absence of 
+		catastrophic backtracking, which is prevalent in regular expression engines
+		in popular programming languages like Java.
+		We prove catastrophic backtracking cannot happen for {\bf all} inputs,
+		which is an advantage over work which 
+		only gives empirical evidence on some test cases.
+\end{itemize}
 For example, Sulzmann and Lu's made claimed that their algorithm
 with bitcodes and simplifications can lex in linear time with respect
 to the input string. This assumes that each
@@ -76,8 +100,9 @@
 of each derivative step to grow arbitrarily large.
 Here in our proof we state that such explosions cannot happen 
 with our simplification function.
-
-Here is a bird's eye view of how the proof of finiteness works: 
+\subsection{Overview of the Proof}
+Here is a bird's eye view of how the proof of finiteness works,
+which involves three steps:
 \begin{center}
   \begin{tikzpicture}[scale=1,font=\bf,
                       node/.style={
@@ -98,45 +123,45 @@
   \end{tikzpicture}
 \end{center}
 \noindent
-We explain the above steps one by one:
+We explain the steps one by one:
 \begin{itemize}
 \item
-We first use a new datatype, called $\textit{rrexp}$s, whose 
-inductive type definition and derivative and simplification operations are
+We first define a new datatype
+that is more straightforward to tweak 
+into the shape we want 
+compared with an annotated regular expression,
+called $\textit{rrexp}$s.
+Its inductive type definition and 
+derivative and simplification operations are
 almost identical to those of the annotated regular expressions,
 except that no bitcodes are attached.
-This new datatype is more straightforward to tweak 
-compared with an annotated regular expression.
 \item
 We have a set of equalities for this new datatype that enables one to
-rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
+rewrite $\rderssimp{r_1 \cdot r_2}{s}$ 
+and $\rderssimp{r^*}{s}$ (the most involved
+inductive cases)
 by a combinatioin of derivatives of their 
-children regular expressions ($r_1$, $r_2$, and $r$, respectively).
-These equalities are chained together to get into a shape
-that is very easy to estimate, which we call the \emph{Closed Forms}
+children regular expressions ($r_1$, $r_2$, and $r$, respectively),
+which we call the \emph{Closed Forms}
 of the derivatives.
 \item
-This closed form is controlled by terms that
-are easier to deal with, wich are in turn bounded loosely
+The Closed Forms of the regular expressions
+are controlled by terms that
+are easier to deal with.
+Using inductive hypothesis, these terms
+are in turn bounded loosely
 by a large yet constant number.
 \end{itemize}
 We give details of these steps in the next sections.
-The first step is to use something simpler than annotated regular expressions. 
+The first step is to use 
+$\textit{rrexp}$s,
+something simpler than
+annotated regular expressions. 
 
-\section{the $\mathbf{r}$-rexp datatype and the size functions}
+\section{the $\textit{rrexp}$ Datatype and Its Size Functions}
 We want to prove the size property about annotated regular expressions.
 The size is 
 written $\llbracket r\rrbracket$, whose intuitive definition is as below
-\begin{center}
-\begin{tabular}{ccc}
-	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
-	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
-	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
-$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
-$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
-$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
-\end{tabular}
-\end{center}
 \noindent
 We first note that $\llbracket \_ \rrbracket$
 is unaware of bitcodes, since