ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 576 3e1b699696b6
parent 564 3cbcd7cda0a9
child 577 f47fc4840579
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Tue Aug 02 22:11:22 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Fri Aug 12 00:39:23 2022 +0100
@@ -7,42 +7,126 @@
 %of our bitcoded algorithm, that is a finite bound on the size of any 
 %regex's derivatives. 
 
-In this chapter we give a guarantee in terms of time complexity:
-given a regular expression $r$, for any string $s$ 
-our algorithm's internal data structure is finitely bounded.
-Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal
-data structure used in our $\blexer$)
-is bounded by a constant $N_r$, where $N$ only depends on the regular expression
-$r$, not the string $s$.
-When doing a time complexity analysis of any 
-lexer/parser based on Brzozowski derivatives, one needs to take into account that
-not only the "derivative steps".
+In this chapter we give a guarantee in terms of size: 
+given an annotated regular expression $a$, for any string $s$ 
+our algorithm's internal data structure 
+$\bderssimp{a}{s}$'s size
+\begin{center}
+$\llbracket \bderssimp{a}{s} \rrbracket$
+\end{center}
+\noindent
+is finitely bounded
+by a constant $N_a$ that only depends on $a$.
+
+\section{Formalising About Size}
+In our lexer $\blexersimp$,
+The regular expression is repeatedly being taken derivative of
+and then simplified.
 
-%TODO: get a grpah for internal data structure growing arbitrary large
+\begin{center}
+\begin{tikzpicture}[scale=2,node distance=1.4cm,
+                    every node/.style={minimum size=9mm}]
+\node (r0)  {$r$};
+\node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$};
+\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
+\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$r_{1s}$};
+\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {simp};
+\node (r2) [rectangle, draw=black, thick,  right=of r1s]{$r_2$};
+\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
+\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$};
+\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
+\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
+\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
+\node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
+\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
+\end{tikzpicture}
+\end{center}
+\noindent
+As illustrated in the picture, 
+each time the derivative is taken derivative of, it grows,
+and when it is being simplified, it shrinks.
+The blue ones are the regular expressions after simplification,
+which would be smaller than before.
+We give a mechanised proof that after simplification 
+the regular expression's size (the blue ones)
+$\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
+a fixed $a$. 
+While it is not yet a direct formalisation of our lexer's complexity,
+it is a stepping stone towards a complexity proof because
+if the data structures out lexer has to traverse is large, the program
+will certainly be slow.
+
+The bound is not yet tight, and we seek to improve $N_a$ so that
+it is polynomial on $\llbracket a \rrbracket$.
+We believe a formalisation of the complexity-related properties
+of the algorithm is important in our context, because we want to address
+catastrophic backtracking, which is not a correctness issue but
+in essence a performance issue, a formal proof can give
+the strongest assurance that such issues cannot arise 
+regardless of what the input is.
+This level of certainty cannot come from a pencil and paper proof or
+eimpirical evidence.
+
+For example, Sulzmann and Lu's made claimed that their algorithm
+with bitcodes and simplifications can lex in linear time with respect
+to the input string. This assumes that each
+derivative operation takes constant time. 
+However, it turns out that on certain cases their lexer 
+will have an indefinite size explosion, causing the running time 
+of each derivative step to grow arbitrarily large.
+Here in our proof we state that such explosions cannot happen 
+with our simplification function.
+
+Here is a bird's eye view of how the proof of finiteness works: 
+\begin{center}
+  \begin{tikzpicture}[scale=1,font=\bf,
+                      node/.style={
+                      rectangle,rounded corners=3mm,
+                      ultra thick,draw=black!50,minimum height=18mm, 
+                      minimum width=20mm,
+                      top color=white,bottom color=black!20}]
 
 
-To obtain such a proof, we need to 
+		      \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$};
+		      \node (A) at (0,0) [node,text width=1.6cm,  text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$};
+		      \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
+  \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};
+ 
+  \draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$r = a \downarrow_r$} (A); 
+  \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); 
+  \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
+  \end{tikzpicture}
+\end{center}
+\noindent
+We explain the above steps one by one:
 \begin{itemize}
 \item
-Define an new datatype for regular expressions that makes it easy
-to reason about the size of an annotated regular expression.
-\item
-A set of equalities for this new datatype that enables one to
-rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.
-by their children regexes $r_1$, $r_2$, and $r$.
+We first use a new datatype, called $\textit{rrexp}$s, whose 
+inductive type definition and derivative and simplification operations are
+almost identical to those of the annotated regular expressions,
+except that no bitcodes are attached.
+This new datatype is more straightforward to tweak 
+compared with an annotated regular expression.
 \item
-Using those equalities to actually get those rewriting equations, which we call
-"closed forms".
+We have a set of equalities for this new datatype that enables one to
+rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
+by a combinatioin of derivatives of their 
+children regular expressions ($r_1$, $r_2$, and $r$, respectively).
+These equalities are chained together to get into a shape
+that is very easy to estimate, which we call the \emph{Closed Forms}
+of the derivatives.
 \item
-Bound the closed forms, thereby bounding the original
-$\blexersimp$'s internal data structures.
+This closed form is controlled by terms that
+are easier to deal with, wich are in turn bounded loosely
+by a large yet constant number.
 \end{itemize}
+We give details of these steps in the next sections.
+The first step is to use something simpler than annotated regular expressions. 
 
 \section{the $\mathbf{r}$-rexp datatype and the size functions}
-
-We have a size function for bitcoded regular expressions, written
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
-
+We want to prove the size property about annotated regular expressions.
+The size is 
+written $\llbracket r\rrbracket$, whose intuitive definition is as below
 \begin{center}
 \begin{tabular}{ccc}
 	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
@@ -53,8 +137,80 @@
 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
 \end{tabular}
 \end{center}
+\noindent
+We first note that $\llbracket \_ \rrbracket$
+is unaware of bitcodes, since 
+it only counts the number of nodes
+if we regard $r$ as a tree.
+This suggests that if we define a new datatype that is similar to plain $\rexp$s,
+\[			\rrexp ::=   \RZERO \mid  \RONE
+			 \mid  \RCHAR{c}  
+			 \mid  \RSEQ{r_1}{r_2}
+			 \mid  \RALTS{rs}
+			 \mid \RSTAR{r}        
+\]
+and define 
+\begin{center}
+\begin{tabular}{lcl}
+$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
+$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
+	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
+$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
+$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
+\end{tabular}
+\end{center}
+\noindent
+we could calculate the size of annotated regular expressions in terms of
+its un-annotated counterpart: 
+\begin{lemma}
+$\rsize{\rerase a} = \asize a$
+\end{lemma}
+\begin{proof}
+	By routine induction on the structure of $a$.
+\end{proof}
+\noindent
+We call them \emph{r}-regular expressions:
+we use
+$\rrexp$ as its type name, so as to make a distinction  
+with plain regular expressions.
+In $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
+but keep everything else intact.
+To transform an annotated regular expression into an r-regular expression,
+we use the function \emph{rerase}, written $\downarrow_r$. 
+The $r$ in the subscript of $\downarrow$ is to 
+differentiate with the $\downarrow$ for the $\erase$ operation.
+Before we introduce more functions related to r-regular expressions,
+we first give out the reason why we take all the trouble 
+defining a new datatype in the first place.
+\subsection{Motivation Behind a New Datatype}
+The main difference between a plain regular expression
+and an r-regular expression is that it can take
+non-binary arguments for its alternative constructor.
+This turned out to be necessary if we want our proofs to be
+simple.
+We initially started by using 
+plain regular expressions and tried to prove
+equalities like 
+\begin{center}
+	$\llbracket a \rrbracket = \llbracket a\downarrow \rrbracket_p$
+\end{center}
+and 
+\[
+	\llbracket a \backslash_{bsimps} s \rrbracket = 
+	\llbracket a \downarrow \backslash_s s
+\]
+One might be able to prove that 
+$\llbracket a \downarrow \rrbracket_p \leq \llbracket a \rrbracket$.
+$\rrexp$ give the exact correspondence between an annotated regular expression
+and its (r-)erased version:
 
-\noindent
+This does not hold for plain $\rexp$s. 
+
+Of course, the bits which encode the lexing information would grow linearly with respect 
+to the input, which should be taken into account when we wish to tackle the runtime comlexity.
+But at the current stage 
+we can safely ignore them.
 Similarly there is a size function for plain regular expressions:
 \begin{center}
 \begin{tabular}{ccc}
@@ -103,41 +259,9 @@
  is turned into a $\ZERO$ during the
 $\erase$ function, thereby changing the size and structure of the regex.
 Therefore the equality in question does not hold.
+
 These will likely be fixable if we really want to use plain $\rexp$s for dealing
 with size, but we choose a more straightforward (or stupid) method by 
-defining a new datatype that is similar to plain $\rexp$s but can take
-non-binary arguments for its alternative constructor,
- which we call $\rrexp$ to denote
-the difference between it and plain regular expressions. 
-\[			\rrexp ::=   \RZERO \mid  \RONE
-			 \mid  \RCHAR{c}  
-			 \mid  \RSEQ{r_1}{r_2}
-			 \mid  \RALTS{rs}
-			 \mid \RSTAR{r}        
-\]
-For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
-but keep everything else intact.
-It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-We denote the operation of erasing the bits and turning an annotated regular expression 
-into an $\rrexp{}$ as $\rerase{}$.
-\begin{center}
-\begin{tabular}{lcl}
-$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
-$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
-	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
-$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
-$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
-\end{tabular}
-\end{center}
-\noindent
-$\rrexp$ give the exact correspondence between an annotated regular expression
-and its (r-)erased version:
-\begin{lemma}
-$\rsize{\rerase a} = \asize a$
-\end{lemma}
-\noindent
-This does not hold for plain $\rexp$s. 
 
 Similarly we could define the derivative  and simplification on 
 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
@@ -1166,7 +1290,7 @@
 \end{tabular}
 \end{center}
 \noindent
-$\sflataux{\_}$ breaks up nested alternative regexes 
+$\sflataux{\_}$ breaks up nested alternative regular expressions 
 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
 into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
 It will return the singleton list $[r]$ otherwise.
@@ -1263,7 +1387,7 @@
 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
 count the possible size explosions of $r \backslash c$ themselves.
 
-Thanks to $\rflts$ and $\rDistinct$, we are able to open up regexes like
+Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like
 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
@@ -1333,7 +1457,7 @@
 %MAYBE TODO: introduce createdByStar
 Again these definitions are tailor-made for dealing with alternatives that have
 originated from a star's derivatives, so we do not attempt to open up all possible 
-regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
+regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
 elements.
 We give a predicate for such "star-created" regular expressions:
 \begin{center}
@@ -1741,10 +1865,10 @@
 shows that the giant bounds are far from being hit.
 %a few sample regular experessions' derivatives
 %size change
-%TODO: giving regex1_size_change.data showing a few regexes' size changes 
+%TODO: giving regex1_size_change.data showing a few regular expressions' size changes 
 %w;r;t the input characters number, where the size is usually cubic in terms of original size
 %a*, aa*, aaa*, .....
-%randomly generated regexes
+%randomly generated regular expressions
 \begin{center}
 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
 \begin{tikzpicture}
@@ -1809,7 +1933,7 @@
 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
 \end{axis}
 \end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
+\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.}
 \end{tabular}    
 \end{center}  
 
@@ -1976,7 +2100,7 @@
 	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
 	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
 \end{center}
-We start with two syntactically different regexes,
+We start with two syntactically different regular expressions,
 and end up with the same derivative result.
 This is not surprising as we have such 
 equality as below in the style of Arden's lemma:\\
@@ -2006,7 +2130,7 @@
 If they grow to a size exponential $w.r.t$ the original regex, our algorithm
 would still be slow.
 And unfortunately, we have concrete examples
-where such regexes grew exponentially large before levelling off:
+where such regular expressions grew exponentially large before levelling off:
 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
  size that is  exponential on the number $n$