diff -r 3e1b699696b6 -r f47fc4840579 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Aug 12 00:39:23 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Sun Aug 14 09:44:27 2022 +0100 @@ -9,20 +9,32 @@ In this chapter we give a guarantee in terms of size: given an annotated regular expression $a$, for any string $s$ -our algorithm's internal data structure -$\bderssimp{a}{s}$'s size +our algorithm's internal annotated regular expression +size \begin{center} $\llbracket \bderssimp{a}{s} \rrbracket$ \end{center} \noindent is finitely bounded -by a constant $N_a$ that only depends on $a$. +by a constant $N_a$ that only depends on $a$, +where the size of an annotated regular expression is defined +in terms of the number of nodes in its tree structure: +\begin{center} +\begin{tabular}{ccc} + $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ + $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ + $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ +$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ +$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ +$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ +\end{tabular} +\end{center} \section{Formalising About Size} +\noindent In our lexer $\blexersimp$, The regular expression is repeatedly being taken derivative of and then simplified. - \begin{center} \begin{tikzpicture}[scale=2,node distance=1.4cm, every node/.style={minimum size=9mm}] @@ -37,36 +49,48 @@ \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$}; \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$}; \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; -\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$}; +\node (v) [circle, draw = blue, thick, right=of rns, minimum size=7mm, right=2.7cm]{$v$}; \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode}; \end{tikzpicture} \end{center} \noindent As illustrated in the picture, -each time the derivative is taken derivative of, it grows, -and when it is being simplified, it shrinks. -The blue ones are the regular expressions after simplification, -which would be smaller than before. +each time the regular expression +is taken derivative of, it grows (the black nodes), +and after simplification, it shrinks (the blue nodes). +This intuition is depicted by the relative size +difference between the black and blue nodes. We give a mechanised proof that after simplification the regular expression's size (the blue ones) $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for a fixed $a$. -While it is not yet a direct formalisation of our lexer's complexity, -it is a stepping stone towards a complexity proof because -if the data structures out lexer has to traverse is large, the program -will certainly be slow. -The bound is not yet tight, and we seek to improve $N_a$ so that +There are two problems with this finiteness proof, though. +\begin{itemize} + \item +First, It is not yet a direct formalisation of our lexer's complexity, +as a complexity proof would require looking into +the time it takes to execute {\bf all} the operations +involved in the lexer (simp, collect, decode), not just the derivative. +\item +Second, the bound is not yet tight, and we seek to improve $N_a$ so that it is polynomial on $\llbracket a \rrbracket$. -We believe a formalisation of the complexity-related properties -of the algorithm is important in our context, because we want to address -catastrophic backtracking, which is not a correctness issue but -in essence a performance issue, a formal proof can give -the strongest assurance that such issues cannot arise -regardless of what the input is. -This level of certainty cannot come from a pencil and paper proof or -eimpirical evidence. - +\end{itemize} +Still, we believe this size formalisation +of the algorithm is important in our context, because +\begin{itemize} + \item + Derivatives are the most important phases of our lexer algorithm. + Size properties about derivatives covers the majority of the algorithm + and is therefore a good indication of complexity of the entire program. + \item + What the size bound proof does ensure is an absence of + catastrophic backtracking, which is prevalent in regular expression engines + in popular programming languages like Java. + We prove catastrophic backtracking cannot happen for {\bf all} inputs, + which is an advantage over work which + only gives empirical evidence on some test cases. +\end{itemize} For example, Sulzmann and Lu's made claimed that their algorithm with bitcodes and simplifications can lex in linear time with respect to the input string. This assumes that each @@ -76,8 +100,9 @@ of each derivative step to grow arbitrarily large. Here in our proof we state that such explosions cannot happen with our simplification function. - -Here is a bird's eye view of how the proof of finiteness works: +\subsection{Overview of the Proof} +Here is a bird's eye view of how the proof of finiteness works, +which involves three steps: \begin{center} \begin{tikzpicture}[scale=1,font=\bf, node/.style={ @@ -98,45 +123,45 @@ \end{tikzpicture} \end{center} \noindent -We explain the above steps one by one: +We explain the steps one by one: \begin{itemize} \item -We first use a new datatype, called $\textit{rrexp}$s, whose -inductive type definition and derivative and simplification operations are +We first define a new datatype +that is more straightforward to tweak +into the shape we want +compared with an annotated regular expression, +called $\textit{rrexp}$s. +Its inductive type definition and +derivative and simplification operations are almost identical to those of the annotated regular expressions, except that no bitcodes are attached. -This new datatype is more straightforward to tweak -compared with an annotated regular expression. \item We have a set of equalities for this new datatype that enables one to -rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc. +rewrite $\rderssimp{r_1 \cdot r_2}{s}$ +and $\rderssimp{r^*}{s}$ (the most involved +inductive cases) by a combinatioin of derivatives of their -children regular expressions ($r_1$, $r_2$, and $r$, respectively). -These equalities are chained together to get into a shape -that is very easy to estimate, which we call the \emph{Closed Forms} +children regular expressions ($r_1$, $r_2$, and $r$, respectively), +which we call the \emph{Closed Forms} of the derivatives. \item -This closed form is controlled by terms that -are easier to deal with, wich are in turn bounded loosely +The Closed Forms of the regular expressions +are controlled by terms that +are easier to deal with. +Using inductive hypothesis, these terms +are in turn bounded loosely by a large yet constant number. \end{itemize} We give details of these steps in the next sections. -The first step is to use something simpler than annotated regular expressions. +The first step is to use +$\textit{rrexp}$s, +something simpler than +annotated regular expressions. -\section{the $\mathbf{r}$-rexp datatype and the size functions} +\section{the $\textit{rrexp}$ Datatype and Its Size Functions} We want to prove the size property about annotated regular expressions. The size is written $\llbracket r\rrbracket$, whose intuitive definition is as below -\begin{center} -\begin{tabular}{ccc} - $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ - $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ - $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ -$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ -$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ -$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ -\end{tabular} -\end{center} \noindent We first note that $\llbracket \_ \rrbracket$ is unaware of bitcodes, since