ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 590 988e92a70704
parent 577 f47fc4840579
child 591 b2d0de6aee18
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Wed Aug 31 12:51:53 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Wed Aug 31 23:57:42 2022 +0100
@@ -8,15 +8,14 @@
 %regex's derivatives. 
 
 In this chapter we give a guarantee in terms of size: 
-given an annotated regular expression $a$, for any string $s$ 
-our algorithm's internal annotated regular expression 
-size
+given an annotated regular expression $a$, for any string $s$
+our algorithm $\blexersimp$'s internal annotated regular expression 
+size  is finitely bounded
+by a constant $N_a$ that only depends on $a$:
 \begin{center}
-$\llbracket \bderssimp{a}{s} \rrbracket$
+$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
 \end{center}
 \noindent
-is finitely bounded
-by a constant $N_a$ that only depends on $a$,
 where the size of an annotated regular expression is defined
 in terms of the number of nodes in its tree structure:
 \begin{center}
@@ -26,80 +25,110 @@
 	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
-$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
 \end{tabular}
 \end{center}
-
+We believe this size formalisation 
+of the algorithm is important in our context, because 
+\begin{itemize}
+	\item
+		It is a stepping stone towards an ``absence of catastrophic-backtracking''
+		guarantee. The next step would be to refine the bound $N_a$ so that it
+		is polynomial on $\llbracket a\rrbracket$.
+	\item
+		The size bound proof gives us a higher confidence that
+		our simplification algorithm $\simp$ does not ``mis-behave''
+		like $\simpsulz$ does.
+		The bound is universal, which is an advantage over work which 
+		only gives empirical evidence on some test cases.
+\end{itemize}
 \section{Formalising About Size}
 \noindent
 In our lexer $\blexersimp$,
 The regular expression is repeatedly being taken derivative of
 and then simplified.
-\begin{center}
-\begin{tikzpicture}[scale=2,node distance=1.4cm,
-                    every node/.style={minimum size=9mm}]
-\node (r0)  {$r$};
-\node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$};
+\begin{figure}[H]
+\begin{tikzpicture}[scale=2,
+                    every node/.style={minimum size=11mm},
+		    ->,>=stealth',shorten >=1pt,auto,thick
+		    ]
+\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
+\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
-\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$r_{1s}$};
-\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {simp};
-\node (r2) [rectangle, draw=black, thick,  right=of r1s]{$r_2$};
+
+\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$};
+\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$};
+
+\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 12mm]{$a_2$};
 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
-\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$};
-\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
-\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
+
+\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$};
+\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$};
+
+\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$};
 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
-\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$};
-\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
+
+\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
+\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
 \end{tikzpicture}
-\end{center}
+\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
+\end{figure}
 \noindent
-As illustrated in the picture, 
-each time the regular expression 
-is taken derivative of, it grows (the black nodes),
-and after simplification, it shrinks (the blue nodes).
+Each time
+a derivative is taken, a regular expression might grow a bit,
+but simplification always takes care that 
+it stays small.
 This intuition is depicted by the relative size
-difference between the black and blue nodes.
-We give a mechanised proof that after simplification 
-the regular expression's size (the blue ones)
-$\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
-a fixed $a$. 
+change between the black and blue nodes:
+After $\simp$ the node always shrinks.
+Our proof says that all the blue nodes
+stay below a size bound $N_a$ determined by $a$.
 
-There are two problems with this finiteness proof, though.
-\begin{itemize}
-	\item
-First, It is not yet a direct formalisation of our lexer's complexity,
-as a complexity proof would require looking into 
-the time it takes to execute {\bf all} the operations
-involved in the lexer (simp, collect, decode), not just the derivative.
-\item
-Second, the bound is not yet tight, and we seek to improve $N_a$ so that
-it is polynomial on $\llbracket a \rrbracket$.
-\end{itemize}
-Still, we believe this size formalisation 
-of the algorithm is important in our context, because 
-\begin{itemize}
-	\item
-		Derivatives are the most important phases of our lexer algorithm.
-		Size properties about derivatives covers the majority of the algorithm
-		and is therefore a good indication of complexity of the entire program.
-	\item
-		What the size bound proof does ensure is an absence of 
-		catastrophic backtracking, which is prevalent in regular expression engines
-		in popular programming languages like Java.
-		We prove catastrophic backtracking cannot happen for {\bf all} inputs,
-		which is an advantage over work which 
-		only gives empirical evidence on some test cases.
-\end{itemize}
-For example, Sulzmann and Lu's made claimed that their algorithm
-with bitcodes and simplifications can lex in linear time with respect
-to the input string. This assumes that each
-derivative operation takes constant time. 
-However, it turns out that on certain cases their lexer 
+\noindent
+Sulzmann and Lu's assumed something similar about their algorithm,
+though in fact their algorithm's size might be better depicted by the following graph:
+\begin{figure}[H]
+\begin{tikzpicture}[scale=2,
+                    every node/.style={minimum size=11mm},
+		    ->,>=stealth',shorten >=1pt,auto,thick
+		    ]
+\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
+\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
+\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
+
+\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$};
+\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$};
+
+\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 17mm]{$a_2$};
+\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
+
+\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$};
+\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$};
+
+\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$};
+\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$};
+
+\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$};
+\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$};
+
+\node (rnn) [right = of rns, minimum size = 1mm]{};
+\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$};
+
+\end{tikzpicture}
+\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks}
+\end{figure}
+\noindent
+That is, on certain cases their lexer 
 will have an indefinite size explosion, causing the running time 
-of each derivative step to grow arbitrarily large.
-Here in our proof we state that such explosions cannot happen 
-with our simplification function.
+of each derivative step to grow arbitrarily large (for example 
+in \ref{SulzmannLuLexerTime}).
+The reason they made this mistake was that
+they tested out the run time of their
+lexer on particular examples such as $(a+b+ab)^*$
+and generalised to all cases, which
+cannot happen with our mecahnised proof.\\
+We give details of the proof in the next sections.
+
 \subsection{Overview of the Proof}
 Here is a bird's eye view of how the proof of finiteness works,
 which involves three steps:
@@ -112,45 +141,49 @@
                       top color=white,bottom color=black!20}]
 
 
-		      \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$};
-		      \node (A) at (0,0) [node,text width=1.6cm,  text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$};
-		      \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
+		      \node (0) at (-5,0) 
+			      [node, text width=1.8cm, text centered] 
+			      {$\llbracket \bderssimp{a}{s} \rrbracket$};
+		      \node (A) at (0,0) 
+			      [node,text width=1.6cm,  text centered] 
+			      {$\llbracket \rderssimp{r}{s} \rrbracket_r$};
+		      \node (B) at (3,0) 
+			      [node,text width=3.0cm, anchor=west, minimum width = 40mm] 
+			      {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
   \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};
  
-  \draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$r = a \downarrow_r$} (A); 
-  \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); 
-  \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
+  \draw [->,line width=0.5mm] (0) -- 
+	  node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); 
+  \draw [->,line width=0.5mm] (A) -- 
+	  node [above,pos=0.35] {$\quad =\ldots=$} (B); 
+  \draw [->,line width=0.5mm] (B) -- 
+	  node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
   \end{tikzpicture}
 \end{center}
 \noindent
 We explain the steps one by one:
 \begin{itemize}
-\item
-We first define a new datatype
-that is more straightforward to tweak 
-into the shape we want 
-compared with an annotated regular expression,
-called $\textit{rrexp}$s.
-Its inductive type definition and 
-derivative and simplification operations are
-almost identical to those of the annotated regular expressions,
-except that no bitcodes are attached.
-\item
-We have a set of equalities for this new datatype that enables one to
-rewrite $\rderssimp{r_1 \cdot r_2}{s}$ 
-and $\rderssimp{r^*}{s}$ (the most involved
-inductive cases)
-by a combinatioin of derivatives of their 
-children regular expressions ($r_1$, $r_2$, and $r$, respectively),
-which we call the \emph{Closed Forms}
-of the derivatives.
-\item
-The Closed Forms of the regular expressions
-are controlled by terms that
-are easier to deal with.
-Using inductive hypothesis, these terms
-are in turn bounded loosely
-by a large yet constant number.
+	\item
+		We first introduce the operations such as 
+		derivatives, simplification, size calculation, etc.
+		associated with $\rrexp$s, which we have given
+		a very brief introduction to in chapter \ref{Bitcoded2}.
+	\item
+		We have a set of equalities for this new datatype that enables one to
+		rewrite $\rderssimp{r_1 \cdot r_2}{s}$ 
+		and $\rderssimp{r^*}{s}$ (the most involved
+		inductive cases)
+		by a combinatioin of derivatives of their 
+		children regular expressions ($r_1$, $r_2$, and $r$, respectively),
+		which we call the \emph{Closed Forms}
+		of the derivatives.
+	\item
+		The Closed Forms of the regular expressions
+		are controlled by terms that
+		are easier to deal with.
+		Using inductive hypothesis, these terms
+		are in turn bounded loosely
+		by a large yet constant number.
 \end{itemize}
 We give details of these steps in the next sections.
 The first step is to use 
@@ -159,6 +192,15 @@
 annotated regular expressions. 
 
 \section{the $\textit{rrexp}$ Datatype and Its Size Functions}
+
+We first recap a bit about the new datatype
+we defined in \ref{rrexpDef},
+called $\textit{rrexp}$s.
+We chose $\rrexp$ over 
+the basic regular expressions
+because it is more straightforward to tweak 
+into the shape we want 
+compared with an annotated regular expression.
 We want to prove the size property about annotated regular expressions.
 The size is 
 written $\llbracket r\rrbracket$, whose intuitive definition is as below
@@ -232,6 +274,10 @@
 
 This does not hold for plain $\rexp$s. 
 
+
+		These operations are 
+		almost identical to those of the annotated regular expressions,
+		except that no bitcodes are attached.
 Of course, the bits which encode the lexing information would grow linearly with respect 
 to the input, which should be taken into account when we wish to tackle the runtime comlexity.
 But at the current stage 
@@ -2133,6 +2179,35 @@
 	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
 \end{center}
 
+There are two problems with this finiteness result, though.
+\begin{itemize}
+	\item
+First, It is not yet a direct formalisation of our lexer's complexity,
+as a complexity proof would require looking into 
+the time it takes to execute {\bf all} the operations
+involved in the lexer (simp, collect, decode), not just the derivative.
+\item
+Second, the bound is not yet tight, and we seek to improve $N_a$ so that
+it is polynomial on $\llbracket a \rrbracket$.
+\end{itemize}
+Still, we believe this contribution is fruitful,
+because
+\begin{itemize}
+	\item
+
+		The size proof can serve as a cornerstone for a complexity
+		formalisation.
+		Derivatives are the most important phases of our lexer algorithm.
+		Size properties about derivatives covers the majority of the algorithm
+		and is therefore a good indication of complexity of the entire program.
+	\item
+		The bound is already a strong indication that catastrophic
+		backtracking is much less likely to occur in our $\blexersimp$
+		algorithm.
+		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
+		so that the bound becomes polynomial.
+\end{itemize}
+		
 %----------------------------------------------------------------------------------------
 %	SECTION 4
 %----------------------------------------------------------------------------------------