lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex

-:3e1b699696b6
+:f47fc4840579
 %of our bitcoded algorithm, that is a finite bound on the size of any
 %regex's derivatives.
 In this chapter we give a guarantee in terms of size:
 given an annotated regular expression $a$, for any string $s$
-our algorithm's internal data structure
+our algorithm's internal annotated regular expression
-$\bderssimp{a}{s}$'s size
+size
 \begin{center}
 $\llbracket \bderssimp{a}{s} \rrbracket$
 \end{center}
 \noindent
 is finitely bounded
-by a constant $N_a$ that only depends on $a$.
+by a constant $N_a$ that only depends on $a$,
+where the size of an annotated regular expression is defined
+in terms of the number of nodes in its tree structure:
+\begin{center}
+\begin{tabular}{ccc}
+	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\textit{sum}\;(\map \; (\llbracket \_ \rrbracket)\; as) + 1$\\
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+\end{tabular}
+\end{center}
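+\noindent
+As a small worked example of this definition (the regular expression
+and the bitcode names $bs_1, \ldots, bs_4$ are chosen here purely for
+illustration and do not come from the formalisation), unfolding the
+clauses one at a time gives
+\begin{center}
+\begin{tabular}{lcl}
+	$\llbracket _{bs_1}(_{bs_2}\mathbf{a} \cdot {}_{bs_3}(_{bs_4}\mathbf{b})^*) \rrbracket$ & $=$ & $\llbracket _{bs_2}\mathbf{a} \rrbracket + \llbracket _{bs_3}(_{bs_4}\mathbf{b})^* \rrbracket + 1$\\
+	& $=$ & $1 + (\llbracket _{bs_4}\mathbf{b} \rrbracket + 1) + 1$\\
+	& $=$ & $1 + (1 + 1) + 1 \;=\; 4$
+\end{tabular}
+\end{center}
+\noindent
+which matches the four nodes (one sequence, one star and two characters)
+in the tree of this regular expression.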
 \section{Formalising About Size}
+\noindent
 In our lexer $\blexersimp$,
 the derivative of the regular expression is repeatedly taken
 and then simplified.
 \begin{center}
 \begin{tikzpicture}[scale=2,node distance=1.4cm,
 every node/.style={minimum size=9mm}]
 \node (r0)  {$r$};
 \node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$};
 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$};
 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
-\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
+\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$};
 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
 \end{tikzpicture}
 \end{center}
 \noindent
 As illustrated in the picture,
-each time the derivative is taken derivative of, it grows,
+each time the derivative of the regular expression
-and when it is being simplified, it shrinks.
+is taken, it grows (the black nodes),
-The blue ones are the regular expressions after simplification,
+and after simplification, it shrinks (the blue nodes).
-which would be smaller than before.
+This intuition is depicted by the relative size
+difference between the black and blue nodes.
 We give a mechanised proof that the size of the simplified
 regular expression $\bderssimp{a}{s}$ (the blue nodes)
 never exceeds a constant $N_a$ for
 a fixed $a$.
-While it is not yet a direct formalisation of our lexer's complexity,
-it is a stepping stone towards a complexity proof because
+There are two problems with this finiteness proof, though.
-if the data structures out lexer has to traverse is large, the program
+\begin{itemize}
-will certainly be slow.
+	\item
+First, it is not yet a direct formalisation of our lexer's complexity,
-The bound is not yet tight, and we seek to improve $N_a$ so that
+as a complexity proof would require looking into
+the time it takes to execute {\bf all} the operations
+involved in the lexer (simp, collect, decode), not just the derivative.
+\item
+Second, the bound is not yet tight, and we seek to improve $N_a$ so that
 it is polynomial in $\llbracket a \rrbracket$.
-We believe a formalisation of the complexity-related properties
+\end{itemize}
-of the algorithm is important in our context, because we want to address
+Still, we believe this size formalisation
-catastrophic backtracking, which is not a correctness issue but
+of the algorithm is important in our context, because
-in essence a performance issue, a formal proof can give
+\begin{itemize}
-the strongest assurance that such issues cannot arise
+	\item
-regardless of what the input is.
+		Derivatives are the most important phase of our lexer algorithm.
-This level of certainty cannot come from a pencil and paper proof or
+		Size properties of derivatives cover the majority of the algorithm
-eimpirical evidence.
+		and are therefore a good indication of the complexity of the entire program.
+	\item
+		What the size bound proof does ensure is an absence of
+		catastrophic backtracking, which is prevalent in regular expression engines
+		in popular programming languages like Java.
+		We prove catastrophic backtracking cannot happen for {\bf all} inputs,
+		which is an advantage over work which
+		only gives empirical evidence on some test cases.
+\end{itemize}
 For example, Sulzmann and Lu claimed that their algorithm
 with bitcodes and simplifications can lex in linear time with respect
 to the input string. This assumes that each
 derivative operation takes constant time.
 However, it turns out that in certain cases their lexer
 suffers from an unbounded size explosion, causing the running time
 of each derivative step to grow arbitrarily large.
 Here in our proof we state that such explosions cannot happen
 with our simplification function.
+\subsection{Overview of the Proof}
-Here is a bird's eye view of how the proof of finiteness works:
+Here is a bird's eye view of how the proof of finiteness works,
+which involves three steps:
 \begin{center}
 \begin{tikzpicture}[scale=1,font=\bf,
 node/.style={
 rectangle,rounded corners=3mm,
 ultra thick,draw=black!50,minimum height=18mm,
 \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B);
 \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C);
 \end{tikzpicture}
 \end{center}
 \noindent
-We explain the above steps one by one:
+We explain the steps one by one:
 \begin{itemize}
 \item
-We first use a new datatype, called $\textit{rrexp}$s, whose
+We first define a new datatype
-inductive type definition and derivative and simplification operations are
+called $\textit{rrexp}$,
+which is more straightforward to tweak
+into the shape we want
+than an annotated regular expression.
+Its inductive type definition and
+derivative and simplification operations are
 almost identical to those of the annotated regular expressions,
 except that no bitcodes are attached.
-This new datatype is more straightforward to tweak
-compared with an annotated regular expression.
 \item
 We have a set of equalities for this new datatype that enable one to
-rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
+rewrite $\rderssimp{r_1 \cdot r_2}{s}$
+and $\rderssimp{r^*}{s}$ (the most involved
+inductive cases)
 as a combination of derivatives of their
-children regular expressions ($r_1$, $r_2$, and $r$, respectively).
+children regular expressions ($r_1$, $r_2$, and $r$, respectively),
-These equalities are chained together to get into a shape
+which we call the \emph{Closed Forms}
-that is very easy to estimate, which we call the \emph{Closed Forms}
 of the derivatives.
 \item
-This closed form is controlled by terms that
+The Closed Forms of the regular expressions
-are easier to deal with, wich are in turn bounded loosely
+are controlled by terms that
+are easier to deal with.
+Using the inductive hypothesis, these terms
+are in turn bounded loosely
 by a large yet constant number.
 \end{itemize}
 We give details of these steps in the next sections.
-The first step is to use something simpler than annotated regular expressions.
+The first step is to use
+$\textit{rrexp}$s,
-\section{the $\mathbf{r}$-rexp datatype and the size functions}
+something simpler than
+annotated regular expressions.
+\section{The $\textit{rrexp}$ Datatype and Its Size Functions}
 We want to prove the size property about annotated regular expressions.
 The size is
 written $\llbracket r\rrbracket$, whose intuitive definition was given at the start of this chapter.
-\begin{center}
-\begin{tabular}{ccc}
-	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
-	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
-	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
-$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
-$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
-$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
-\end{tabular}
-\end{center}
 \noindent
 We first note that $\llbracket \_ \rrbracket$
 is unaware of bitcodes, since
 it only counts the number of nodes
 if we regard $r$ as a tree.
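 For instance (a minimal illustration, with $bs$ and $bs'$ standing for
 two arbitrary bitcode sequences that are assumed here only for the
 sake of the example), the clause for characters gives
 \begin{center}
 $\llbracket _{bs}\mathbf{c} \rrbracket \;=\; \llbracket _{bs'}\mathbf{c} \rrbracket \;=\; 1$,
 \end{center}
 \noindent
 so changing the annotation leaves the size unchanged.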

changeset 577	f47fc4840579
parent 576	3e1b699696b6
child 590	988e92a70704