lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex


-:86e0203db2da
+:988e92a70704
 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
 %of our bitcoded algorithm, that is a finite bound on the size of any
 %regex's derivatives.
 In this chapter we give a guarantee in terms of size:
 given an annotated regular expression $a$, for any string $s$
-our algorithm's internal annotated regular expression
+our algorithm $\blexersimp$'s internal annotated regular expression
-size
+size  is finitely bounded
-\begin{center}
+by a constant $N_a$ that only depends on $a$:
-$\llbracket \bderssimp{a}{s} \rrbracket$
+\begin{center}
-\end{center}
+$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
-\noindent
+\end{center}
-is finitely bounded
+\noindent
-by a constant $N_a$ that only depends on $a$,
 where the size of an annotated regular expression is defined
 in terms of the number of nodes in its tree structure:
 \begin{center}
 \begin{tabular}{ccc}
 	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
 	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
 	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\textit{sum}\;(\map \; (\llbracket \_ \rrbracket)\; as)   + 1$\\
-$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
 \end{tabular}
 \end{center}
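+\noindent
+As a small illustration of this definition (a sketch on a hypothetical example; the bitcode annotations $bs$, $bs_1$, $bs_2$ and $bs_3$ are arbitrary and do not contribute to the size), the annotated regular expression $_{bs}(_{bs_1}(_{bs_2}\mathbf{a} \cdot {}_{bs_3}\mathbf{b}))^*$ has four nodes:
+\begin{center}
+\begin{tabular}{lcl}
+	$\llbracket _{bs}(_{bs_1}(_{bs_2}\mathbf{a} \cdot {}_{bs_3}\mathbf{b}))^* \rrbracket$ & $=$ & $\llbracket _{bs_1}(_{bs_2}\mathbf{a} \cdot {}_{bs_3}\mathbf{b}) \rrbracket + 1$\\
+	& $=$ & $(\llbracket _{bs_2}\mathbf{a} \rrbracket + \llbracket _{bs_3}\mathbf{b} \rrbracket + 1) + 1$\\
+	& $=$ & $(1 + 1 + 1) + 1 = 4$
+\end{tabular}
+\end{center}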
+We believe this size formalisation
+of the algorithm is important in our context, because
+\begin{itemize}
+	\item
+		It is a stepping stone towards an ``absence of catastrophic-backtracking''
+		guarantee. The next step would be to refine the bound $N_a$ so that it
+		is polynomial in $\llbracket a\rrbracket$.
+	\item
+		The size bound proof gives us a higher confidence that
+		our simplification algorithm $\simp$ does not ``mis-behave''
+		like $\simpsulz$ does.
+		The bound is universal, which is an advantage over work which
+		only gives empirical evidence on some test cases.
+\end{itemize}
 \section{Formalising About Size}
 \noindent
 In our lexer $\blexersimp$,
 the derivative of the regular expression is repeatedly taken
 and then simplified.
-\begin{center}
+\begin{figure}[H]
-\begin{tikzpicture}[scale=2,node distance=1.4cm,
+\begin{tikzpicture}[scale=2,
-every node/.style={minimum size=9mm}]
+every node/.style={minimum size=11mm},
-\node (r0)  {$r$};
+		    ->,>=stealth',shorten >=1pt,auto,thick
-\node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$};
+		    ]
+\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
+\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
-\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$r_{1s}$};
-\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {simp};
+\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$};
-\node (r2) [rectangle, draw=black, thick,  right=of r1s]{$r_2$};
+\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$};
+\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 12mm]{$a_2$};
 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
-\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$};
-\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
+\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$};
-\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
+\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$};
+\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$};
 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
-\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$};
-\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
+\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
+\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
 \end{tikzpicture}
-\end{center}
+\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
-\noindent
+\end{figure}
-As illustrated in the picture,
+\noindent
-each time the regular expression
+Each time
-is taken derivative of, it grows (the black nodes),
+a derivative is taken, a regular expression might grow a bit,
-and after simplification, it shrinks (the blue nodes).
+but simplification always takes care that
+it stays small.
 This intuition is depicted by the relative size
-difference between the black and blue nodes.
+change between the black and blue nodes:
-We give a mechanised proof that after simplification
+After $\simp$ the node always shrinks.
-the regular expression's size (the blue ones)
+Our proof says that all the blue nodes
-$\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
+stay below a size bound $N_a$ determined by $a$.
-a fixed $a$.
+\noindent
-There are two problems with this finiteness proof, though.
+Sulzmann and Lu assumed something similar about their algorithm,
-\begin{itemize}
+though in fact their algorithm's size might be better depicted by the following graph:
-	\item
+\begin{figure}[H]
-First, It is not yet a direct formalisation of our lexer's complexity,
+\begin{tikzpicture}[scale=2,
-as a complexity proof would require looking into
+every node/.style={minimum size=11mm},
-the time it takes to execute {\bf all} the operations
+		    ->,>=stealth',shorten >=1pt,auto,thick
-involved in the lexer (simp, collect, decode), not just the derivative.
+		    ]
-\item
+\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
-Second, the bound is not yet tight, and we seek to improve $N_a$ so that
+\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
-it is polynomial on $\llbracket a \rrbracket$.
+\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
-\end{itemize}
-Still, we believe this size formalisation
+\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$};
-of the algorithm is important in our context, because
+\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$};
-\begin{itemize}
-	\item
+\node (r2) [rectangle, draw=black, thick,  right=of r1s, minimum size = 17mm]{$a_2$};
-		Derivatives are the most important phases of our lexer algorithm.
+\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
-		Size properties about derivatives covers the majority of the algorithm
-		and is therefore a good indication of complexity of the entire program.
+\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$};
-	\item
+\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$};
-		What the size bound proof does ensure is an absence of
-		catastrophic backtracking, which is prevalent in regular expression engines
+\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$};
-		in popular programming languages like Java.
+\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$};
-		We prove catastrophic backtracking cannot happen for {\bf all} inputs,
-		which is an advantage over work which
+\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$};
-		only gives empirical evidence on some test cases.
+\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$};
-\end{itemize}
-For example, Sulzmann and Lu's made claimed that their algorithm
+\node (rnn) [right = of rns, minimum size = 1mm]{};
-with bitcodes and simplifications can lex in linear time with respect
+\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$};
-to the input string. This assumes that each
-derivative operation takes constant time.
+\end{tikzpicture}
-However, it turns out that on certain cases their lexer
+\caption{Regular expression size change during Sulzmann and Lu's algorithm}\label{sulzShrinks}
+\end{figure}
+\noindent
+That is, in certain cases their lexer
 will have an indefinite size explosion, causing the running time
-of each derivative step to grow arbitrarily large.
+of each derivative step to grow arbitrarily large (for example
-Here in our proof we state that such explosions cannot happen
+in \ref{SulzmannLuLexerTime}).
-with our simplification function.
+They made this mistake because
+they measured the run time of their
+lexer on particular examples such as $(a+b+ab)^*$
+and generalised the results to all cases, an error that
+cannot happen with our mechanised proof.\\
+We give details of the proof in the next sections.
 \subsection{Overview of the Proof}
 Here is a bird's eye view of how the proof of finiteness works,
 which involves three steps:
 \begin{center}
 \begin{tikzpicture}[scale=1,font=\bf,
 ultra thick,draw=black!50,minimum height=18mm,
 minimum width=20mm,
 top color=white,bottom color=black!20}]
-		      \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$};
+		      \node (0) at (-5,0)
-		      \node (A) at (0,0) [node,text width=1.6cm,  text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$};
+			      [node, text width=1.8cm, text centered]
-		      \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
+			      {$\llbracket \bderssimp{a}{s} \rrbracket$};
+		      \node (A) at (0,0)
+			      [node,text width=1.6cm,  text centered]
+			      {$\llbracket \rderssimp{r}{s} \rrbracket_r$};
+		      \node (B) at (3,0)
+			      [node,text width=3.0cm, anchor=west, minimum width = 40mm]
+			      {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
 \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};
-\draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$r = a \downarrow_r$} (A);
+\draw [->,line width=0.5mm] (0) --
-\draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B);
+	  node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A);
-\draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C);
+\draw [->,line width=0.5mm] (A) --
+	  node [above,pos=0.35] {$\quad =\ldots=$} (B);
+\draw [->,line width=0.5mm] (B) --
+	  node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C);
 \end{tikzpicture}
 \end{center}
 \noindent
 We explain the steps one by one:
 \begin{itemize}
-\item
+	\item
-We first define a new datatype
+		We first introduce the operations such as
-that is more straightforward to tweak
+		derivatives, simplification, size calculation, etc.
-into the shape we want
+		associated with $\rrexp$s, which we have given
-compared with an annotated regular expression,
+		a very brief introduction to in chapter \ref{Bitcoded2}.
-called $\textit{rrexp}$s.
+	\item
-Its inductive type definition and
+		We have a set of equalities for this new datatype that enables one to
-derivative and simplification operations are
+		rewrite $\rderssimp{r_1 \cdot r_2}{s}$
-almost identical to those of the annotated regular expressions,
+		and $\rderssimp{r^*}{s}$ (the most involved
-except that no bitcodes are attached.
+		inductive cases)
-\item
+		by a combination of derivatives of their
-We have a set of equalities for this new datatype that enables one to
+		children regular expressions ($r_1$, $r_2$, and $r$, respectively),
-rewrite $\rderssimp{r_1 \cdot r_2}{s}$
+		which we call the \emph{Closed Forms}
-and $\rderssimp{r^*}{s}$ (the most involved
+		of the derivatives.
-inductive cases)
+	\item
-by a combinatioin of derivatives of their
+		The Closed Forms of the regular expressions
-children regular expressions ($r_1$, $r_2$, and $r$, respectively),
+		are controlled by terms that
-which we call the \emph{Closed Forms}
+		are easier to deal with.
-of the derivatives.
+		Using the inductive hypothesis, these terms
-\item
+		are in turn bounded loosely
-The Closed Forms of the regular expressions
+		by a large yet constant number.
-are controlled by terms that
-are easier to deal with.
-Using inductive hypothesis, these terms
-are in turn bounded loosely
-by a large yet constant number.
 \end{itemize}
 We give details of these steps in the next sections.
 The first step is to use
 $\textit{rrexp}$s,
 something simpler than
 annotated regular expressions.
 \section{The $\textit{rrexp}$ Datatype and Its Size Functions}
+We first recap a bit about the new datatype
+we defined in \ref{rrexpDef},
+called $\textit{rrexp}$s.
+We chose $\rrexp$ over
+the basic regular expressions
+because it is more straightforward to tweak
+into the shape we want
+compared with an annotated regular expression.
 We want to prove the size property about annotated regular expressions.
 The size is
 written $\llbracket r\rrbracket$, whose intuitive definition is as below
 \noindent
 We first note that $\llbracket \_ \rrbracket$
 $\rrexp$ give the exact correspondence between an annotated regular expression
 and its (r-)erased version:
 This does not hold for plain $\rexp$s.
+		These operations are
+		almost identical to those of the annotated regular expressions,
+		except that no bitcodes are attached.
 Of course, the bits which encode the lexing information would grow linearly with respect
 to the input, which should be taken into account when we wish to tackle the runtime complexity.
 But at the current stage
 we can safely ignore them.
 Similarly there is a size function for plain regular expressions:
 equality as below in the style of Arden's lemma:\\
 \begin{center}
 	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
 \end{center}
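+\noindent
+As a brief sanity check of this equality (a sketch that relies only on the standard unfolding $L(A^*) = L(A \cdot A^* + \ONE)$ of the star, which is assumed here rather than taken from this chapter):
+\begin{center}
+\begin{tabular}{lcl}
+	$L(A^*B)$ & $=$ & $L((A\cdot A^* + \ONE)\cdot B)$\\
+	& $=$ & $L(A\cdot A^* \cdot B + \ONE \cdot B)$\\
+	& $=$ & $L(A\cdot A^* \cdot B + B)$
+\end{tabular}
+\end{center}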
+There are two problems with this finiteness result, though.
+\begin{itemize}
+	\item
+First, it is not yet a direct formalisation of our lexer's complexity,
+as a complexity proof would require looking into
+the time it takes to execute {\bf all} the operations
+involved in the lexer (simp, collect, decode), not just the derivative.
+\item
+Second, the bound is not yet tight, and we seek to improve $N_a$ so that
+it is polynomial in $\llbracket a \rrbracket$.
+\end{itemize}
+Still, we believe this contribution is fruitful,
+because
+\begin{itemize}
+	\item
+		The size proof can serve as a cornerstone for a complexity
+		formalisation.
+		Derivatives are the most important phase of our lexer algorithm.
+		Size properties of derivatives cover the majority of the algorithm
+		and are therefore a good indication of the complexity of the entire program.
+	\item
+		The bound is already a strong indication that catastrophic
+		backtracking is much less likely to occur in our $\blexersimp$
+		algorithm.
+		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter
+		so that the bound becomes polynomial.
+\end{itemize}
 %----------------------------------------------------------------------------------------
 %	SECTION 4
 %----------------------------------------------------------------------------------------

changeset 590	988e92a70704
parent 577	f47fc4840579
child 591	b2d0de6aee18