lexing: comparison ChengsongTanPhdThesis/Chapters/Finite.tex

equal deleted inserted replaced

-:3178f0e948ac
+:3e1b699696b6
 \label{Finite}
 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
 %of our bitcoded algorithm, that is a finite bound on the size of any
 %regex's derivatives.
-In this chapter we give a guarantee in terms of time complexity:
+In this chapter we give a guarantee in terms of size:
-given a regular expression $r$, for any string $s$
+given an annotated regular expression $a$, for any string $s$
-our algorithm's internal data structure is finitely bounded.
+our algorithm's internal data structure
-Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal
+$\bderssimp{a}{s}$'s size
-data structure used in our $\blexer$)
+\begin{center}
-is bounded by a constant $N_r$, where $N$ only depends on the regular expression
+$\llbracket \bderssimp{a}{s} \rrbracket$
-$r$, not the string $s$.
+\end{center}
-When doing a time complexity analysis of any
+\noindent
-lexer/parser based on Brzozowski derivatives, one needs to take into account that
+is finitely bounded
-not only the "derivative steps".
+by a constant $N_a$ that only depends on $a$.
-%TODO: get a grpah for internal data structure growing arbitrary large
+\section{Formalising About Size}
+In our lexer $\blexersimp$,
+The regular expression is repeatedly being taken derivative of
-To obtain such a proof, we need to
+and then simplified.
+\begin{center}
+\begin{tikzpicture}[scale=2,node distance=1.4cm,
+every node/.style={minimum size=9mm}]
+\node (r0)  {$r$};
+\node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$};
+\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
+\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$r_{1s}$};
+\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {simp};
+\node (r2) [rectangle, draw=black, thick,  right=of r1s]{$r_2$};
+\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
+\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$};
+\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
+\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
+\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
+\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
+\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
+\end{tikzpicture}
+\end{center}
+\noindent
+As illustrated in the picture,
+each time the derivative is taken derivative of, it grows,
+and when it is being simplified, it shrinks.
+The blue ones are the regular expressions after simplification,
+which would be smaller than before.
+We give a mechanised proof that after simplification
+the regular expression's size (the blue ones)
+$\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
+a fixed $a$.
+While it is not yet a direct formalisation of our lexer's complexity,
+it is a stepping stone towards a complexity proof because
+if the data structures out lexer has to traverse is large, the program
+will certainly be slow.
+The bound is not yet tight, and we seek to improve $N_a$ so that
+it is polynomial on $\llbracket a \rrbracket$.
+We believe a formalisation of the complexity-related properties
+of the algorithm is important in our context, because we want to address
+catastrophic backtracking, which is not a correctness issue but
+in essence a performance issue, a formal proof can give
+the strongest assurance that such issues cannot arise
+regardless of what the input is.
+This level of certainty cannot come from a pencil and paper proof or
+eimpirical evidence.
+For example, Sulzmann and Lu's made claimed that their algorithm
+with bitcodes and simplifications can lex in linear time with respect
+to the input string. This assumes that each
+derivative operation takes constant time.
+However, it turns out that on certain cases their lexer
+will have an indefinite size explosion, causing the running time
+of each derivative step to grow arbitrarily large.
+Here in our proof we state that such explosions cannot happen
+with our simplification function.
+Here is a bird's eye view of how the proof of finiteness works:
+\begin{center}
+\begin{tikzpicture}[scale=1,font=\bf,
+node/.style={
+rectangle,rounded corners=3mm,
+ultra thick,draw=black!50,minimum height=18mm,
+minimum width=20mm,
+top color=white,bottom color=black!20}]
+		      \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$};
+		      \node (A) at (0,0) [node,text width=1.6cm,  text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$};
+		      \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
+\node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};
+\draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$r = a \downarrow_r$} (A);
+\draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B);
+\draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C);
+\end{tikzpicture}
+\end{center}
+\noindent
+We explain the above steps one by one:
 \begin{itemize}
 \item
-Define an new datatype for regular expressions that makes it easy
+We first use a new datatype, called $\textit{rrexp}$s, whose
-to reason about the size of an annotated regular expression.
+inductive type definition and derivative and simplification operations are
+almost identical to those of the annotated regular expressions,
+except that no bitcodes are attached.
+This new datatype is more straightforward to tweak
+compared with an annotated regular expression.
 \item
-A set of equalities for this new datatype that enables one to
+We have a set of equalities for this new datatype that enables one to
-rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.
+rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
-by their children regexes $r_1$, $r_2$, and $r$.
+by a combinatioin of derivatives of their
+children regular expressions ($r_1$, $r_2$, and $r$, respectively).
+These equalities are chained together to get into a shape
+that is very easy to estimate, which we call the \emph{Closed Forms}
+of the derivatives.
 \item
-Using those equalities to actually get those rewriting equations, which we call
+This closed form is controlled by terms that
-"closed forms".
+are easier to deal with, wich are in turn bounded loosely
-\item
+by a large yet constant number.
-Bound the closed forms, thereby bounding the original
-$\blexersimp$'s internal data structures.
 \end{itemize}
+We give details of these steps in the next sections.
+The first step is to use something simpler than annotated regular expressions.
 \section{the $\mathbf{r}$-rexp datatype and the size functions}
+We want to prove the size property about annotated regular expressions.
-We have a size function for bitcoded regular expressions, written
+The size is
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
+written $\llbracket r\rrbracket$, whose intuitive definition is as below
 \begin{center}
 \begin{tabular}{ccc}
 	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
 	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
 	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
 \end{tabular}
 \end{center}
+\noindent
-\noindent
+We first note that $\llbracket \_ \rrbracket$
+is unaware of bitcodes, since
+it only counts the number of nodes
+if we regard $r$ as a tree.
+This suggests that if we define a new datatype that is similar to plain $\rexp$s,
+\[			\rrexp ::=   \RZERO \mid  \RONE
+			 \mid  \RCHAR{c}
+			 \mid  \RSEQ{r_1}{r_2}
+			 \mid  \RALTS{rs}
+			 \mid \RSTAR{r}
+\]
+and define
+\begin{center}
+\begin{tabular}{lcl}
+$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
+$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
+	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
+$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
+$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
+\end{tabular}
+\end{center}
+\noindent
+we could calculate the size of annotated regular expressions in terms of
+its un-annotated counterpart:
+\begin{lemma}
+$\rsize{\rerase a} = \asize a$
+\end{lemma}
+\begin{proof}
+	By routine induction on the structure of $a$.
+\end{proof}
+\noindent
+We call them \emph{r}-regular expressions:
+we use
+$\rrexp$ as its type name, so as to make a distinction
+with plain regular expressions.
+In $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
+but keep everything else intact.
+To transform an annotated regular expression into an r-regular expression,
+we use the function \emph{rerase}, written $\downarrow_r$.
+The $r$ in the subscript of $\downarrow$ is to
+differentiate with the $\downarrow$ for the $\erase$ operation.
+Before we introduce more functions related to r-regular expressions,
+we first give out the reason why we take all the trouble
+defining a new datatype in the first place.
+\subsection{Motivation Behind a New Datatype}
+The main difference between a plain regular expression
+and an r-regular expression is that it can take
+non-binary arguments for its alternative constructor.
+This turned out to be necessary if we want our proofs to be
+simple.
+We initially started by using
+plain regular expressions and tried to prove
+equalities like
+\begin{center}
+	$\llbracket a \rrbracket = \llbracket a\downarrow \rrbracket_p$
+\end{center}
+and
+\[
+	\llbracket a \backslash_{bsimps} s \rrbracket =
+	\llbracket a \downarrow \backslash_s s
+\]
+One might be able to prove that
+$\llbracket a \downarrow \rrbracket_p \leq \llbracket a \rrbracket$.
+$\rrexp$ give the exact correspondence between an annotated regular expression
+and its (r-)erased version:
+This does not hold for plain $\rexp$s.
+Of course, the bits which encode the lexing information would grow linearly with respect
+to the input, which should be taken into account when we wish to tackle the runtime comlexity.
+But at the current stage
+we can safely ignore them.
 Similarly there is a size function for plain regular expressions:
 \begin{center}
 \begin{tabular}{ccc}
 	$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
 	$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
 \noindent
 An alternative regular expression with an empty list of children
 is turned into a $\ZERO$ during the
 $\erase$ function, thereby changing the size and structure of the regex.
 Therefore the equality in question does not hold.
 These will likely be fixable if we really want to use plain $\rexp$s for dealing
 with size, but we choose a more straightforward (or stupid) method by
-defining a new datatype that is similar to plain $\rexp$s but can take
-non-binary arguments for its alternative constructor,
-which we call $\rrexp$ to denote
-the difference between it and plain regular expressions.
-\[			\rrexp ::=   \RZERO \mid  \RONE
-			 \mid  \RCHAR{c}
-			 \mid  \RSEQ{r_1}{r_2}
-			 \mid  \RALTS{rs}
-			 \mid \RSTAR{r}
-\]
-For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
-but keep everything else intact.
-It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-We denote the operation of erasing the bits and turning an annotated regular expression
-into an $\rrexp{}$ as $\rerase{}$.
-\begin{center}
-\begin{tabular}{lcl}
-$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
-$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
-	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
-$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
-$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
-\end{tabular}
-\end{center}
-\noindent
-$\rrexp$ give the exact correspondence between an annotated regular expression
-and its (r-)erased version:
-\begin{lemma}
-$\rsize{\rerase a} = \asize a$
-\end{lemma}
-\noindent
-This does not hold for plain $\rexp$s.
 Similarly we could define the derivative  and simplification on
 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
 except that now they can operate on alternatives taking multiple arguments.
 	 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\
 $\sflat r$ & $=$ & $ r$
 \end{tabular}
 \end{center}
 \noindent
-$\sflataux{\_}$ breaks up nested alternative regexes
+$\sflataux{\_}$ breaks up nested alternative regular expressions
 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
 into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
 It will return the singleton list $[r]$ otherwise.
 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
 the output type a regular expression, not a list.
 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not
 count the possible size explosions of $r \backslash c$ themselves.
-Thanks to $\rflts$ and $\rDistinct$, we are able to open up regexes like
+Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like
 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'',
 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
 \end{center}
 \noindent
 %MAYBE TODO: introduce createdByStar
 Again these definitions are tailor-made for dealing with alternatives that have
 originated from a star's derivatives, so we do not attempt to open up all possible
-regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
+regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
 elements.
 We give a predicate for such "star-created" regular expressions:
 \begin{center}
 \begin{tabular}{lcr}
 &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
 Test data in the graphs from randomly generated regular expressions
 shows that the giant bounds are far from being hit.
 %a few sample regular experessions' derivatives
 %size change
-%TODO: giving regex1_size_change.data showing a few regexes' size changes
+%TODO: giving regex1_size_change.data showing a few regular expressions' size changes
 %w;r;t the input characters number, where the size is usually cubic in terms of original size
 %a*, aa*, aaa*, .....
-%randomly generated regexes
+%randomly generated regular expressions
 \begin{center}
 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
 xlabel={number of $a$'s},
 legend pos=north west,
 legend cell align=left]
 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
 \end{axis}
 \end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
+\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.}
 \end{tabular}
 \end{center}
 	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
 	where $a$ is not nullable.\\
 	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
 	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
 \end{center}
-We start with two syntactically different regexes,
+We start with two syntactically different regular expressions,
 and end up with the same derivative result.
 This is not surprising as we have such
 equality as below in the style of Arden's lemma:\\
 \begin{center}
 	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
 exponential $w.r.t$ the number of characters,
 but will eventually level off when the string $s$ is long enough.
 If they grow to a size exponential $w.r.t$ the original regex, our algorithm
 would still be slow.
 And unfortunately, we have concrete examples
-where such regexes grew exponentially large before levelling off:
+where such regular expressions grew exponentially large before levelling off:
 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
 size that is  exponential on the number $n$
 under our current simplification rules:
 %TODO: graph of a regex whose size increases exponentially.

changeset 576	3e1b699696b6
parent 564	3cbcd7cda0a9
child 577	f47fc4840579