ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 543 b2bea5968b89
parent 532 cc54ce075db5
child 553 0f00d440f484
--- a/ChengsongTanPhdThesis/Chapters/Finite.tex	Tue Jun 14 18:06:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex	Thu Jun 23 16:09:40 2022 +0100
@@ -10,6 +10,17 @@
 In this chapter we give a guarantee in terms of time complexity:
 given a regular expression $r$, for any string $s$ 
 our algorithm's internal data structure is finitely bounded.
+Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal
+data structure used in our $\blexer$)
+is bounded by a constant $N_r$, where $N$ only depends on the regular expression
+$r$, not the string $s$.
+When doing a time complexity analysis of any 
+lexer/parser based on Brzozowski derivatives, one needs to take into account that
+not only the "derivative steps".
+
+%TODO: get a grpah for internal data structure growing arbitrary large
+
+
 To obtain such a proof, we need to 
 \begin{itemize}
 \item
@@ -34,32 +45,64 @@
 
 \begin{center}
 \begin{tabular}{ccc}
-$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
+	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+\end{tabular}
+\end{center}
+
+\noindent
+Similarly there is a size function for plain regular expressions:
+\begin{center}
+\begin{tabular}{ccc}
+	$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
+	$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
+	$\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
+$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
+$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
+$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
 \end{tabular}
 \end{center}
-(TODO: COMPLETE this defn and for $rs$)
 
-The size is based on a recursive function on the structure of the regex,
-not the bitcodes.
-Therefore we may as well talk about size of an annotated regular expression 
-in their un-annotated form:
+\noindent
+The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$
+is to get an equivalent form
+of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$ 
+is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$.
+We notice that while it is not so clear how to obtain
+a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2},
+not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$ 
+in the order as our lexer will result in the bit-codes dispensed differently),
+it is possible to get an slightly different representation of the unlifted versions:
+$ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$.
+This suggest setting the bounding function $f(a, s)$ as 
+$\llbracket  (a \backslash s)_\downarrow \rrbracket_p$, the plain size
+of the erased annotated regular expression.
+This requires the the regular expression accompanied by bitcodes
+to have the same size as its plain counterpart after erasure:
 \begin{center}
-$\asize(a) = \size(\erase(a))$. 
+	$\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$. 
 \end{center}
-(TODO: turn equals to roughly equals)
-
-But there is a minor nuisance here:
-the erase function actually messes with the structure of the regular expression:
+\noindent
+But there is a minor nuisance: 
+the erase function unavoidbly messes with the structure of the regular expression,
+due to the discrepancy between annotated regular expression's $\sum$ constructor
+and plain regular expression's $+$ constructor having different arity.
 \begin{center}
 \begin{tabular}{ccc}
-$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
+$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
+$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
+$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
 \end{tabular}
 \end{center}
-(TODO: complete this definition with singleton r in alts)
-
+\noindent
 An alternative regular expression with an empty list of children
- is turned into an $\ZERO$ during the
+ is turned into a $\ZERO$ during the
 $\erase$ function, thereby changing the size and structure of the regex.
+Therefore the equality in question does not hold.
 These will likely be fixable if we really want to use plain $\rexp$s for dealing
 with size, but we choose a more straightforward (or stupid) method by 
 defining a new datatype that is similar to plain $\rexp$s but can take
@@ -75,56 +118,105 @@
 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
 but keep everything else intact.
 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
 We denote the operation of erasing the bits and turning an annotated regular expression 
 into an $\rrexp{}$ as $\rerase{}$.
 \begin{center}
 \begin{tabular}{lcl}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
+$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
+$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
+	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
+$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
+$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
 \end{tabular}
 \end{center}
-%TODO: FINISH definition of rerase
+\noindent
+$\rrexp$ give the exact correspondence between an annotated regular expression
+and its (r-)erased version:
+\begin{lemma}
+$\rsize{\rerase a} = \asize a$
+\end{lemma}
+This does not hold for plain $\rexp$s. 
+
 Similarly we could define the derivative  and simplification on 
 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
 except that now they can operate on alternatives taking multiple arguments.
 
 \begin{center}
 \begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
+	$(\RALTS{rs})\; \backslash c$ & $\dn$ &  $\RALTS{\map\; (\_ \backslash c) \;rs}$\\
 (other clauses omitted)
+With the new $\rrexp$ datatype in place, one can define its size function,
+which precisely mirrors that of the annotated regular expressions:
+\end{tabular}
+\end{center}
+\noindent
+\begin{center}
+\begin{tabular}{ccc}
+	$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
+	$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
+	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as   + 1$\\
+$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$
 \end{tabular}
 \end{center}
+\noindent
 
-Now that $\rrexp$s do not have bitcodes on them, we can do the 
-duplicate removal without  an equivalence relation:
+\subsection{Lexing Related Functions for $\rrexp$}
+Everything else for $\rrexp$ will be precisely the same for annotated expressions,
+except that they do not involve rectifying and augmenting bit-encoded tokenization information.
+As expected, most functions are simpler, such as the derivative:
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\  
+  $(\ONE)\,\backslash_r c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         \ONE\;\textit{else}\;\ZERO$\\  
+  $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
+  $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
+  $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
+     $\textit{if}\;\textit{rnullable}\,r_1$\\
+					       & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
+					       & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
+  & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
+  $(r^*)\,\backslash_r c$ & $\dn$ &
+      $( r\,\backslash_r c)\cdot
+       (_{[]}r^*))$
+\end{tabular}    
+\end{center}  
+\noindent
+The simplification function is simplified without annotation causing superficial differences.
+Duplicate removal without  an equivalence relation:
 \begin{center}
 \begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
+	$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
+$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
            			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
 \end{tabular}
 \end{center}
 %TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions  mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
+\noindent
+With $\textit{rdistinct}$ one can chain together all the other modules
+of $\bsimp{\_}$ (removing the functionalities related to bit-sequences)
+and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.
+We omit these functions, as they are routine. Please refer to the formalisation
+(in file BasicIdentities.thy) for the exact definition.
+With $\rrexp$ the size caclulation of annotated regular expressions'
+simplification and derivatives can be done by the size of their unlifted 
+counterpart with the unlifted version of simplification and derivatives applied.
 \begin{lemma}
-$\rsize{\rerase a} = \asize a$
+	\begin{itemize}
+		\item
+$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
+\item
+$\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
+\end{itemize}
 \end{lemma}
-This lemma says that the size of an r-erased regex is the same as the annotated regex.
-this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the 
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
+\noindent
+In the following content, we will focus on $\rrexp$'s size bound.
+We will piece together this bound and show the same bound for annotated regular 
+expressions in the end.
 Unless stated otherwise in this chapter all $\textit{rexp}$s without
  bitcodes are seen as $\rrexp$s.
  We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
@@ -136,7 +228,135 @@
 %-----------------------------------
 %	SECTION ?
 %-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
+ \section{Roadmap to a Finiteness Proof}
+ Now that we have defined the $\rrexp$ datatype, and proven that its size changes
+ w.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions,
+ we aim to bound the size of $r \backslash s$ for any $\rrexp$  $r$.
+The way we do it is by two steps:
+\begin{itemize}
+	\item
+		First, we rewrite $r\backslash s$ into something else that is easier
+		to bound. This step is especially important for the inductive case 
+		$r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
+		but after simplification they will always be equal or smaller to a form consisting of an alternative
+		list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
+		The functions $f$ and $g$ are usually from $\flts$ $\distinctBy$ and other appearing in $\bsimp{_}$.
+		This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
+		right hand side the "closed form" of $r\backslash s$.
+	\item
+		Then, for such a sum  list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
+		by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only 
+		reduce the size of a regular expression, not adding to it. The closed form $f\; (g\; (\sum rs)) $ is controlled
+		by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$.
+\end{itemize}
+
+\section{Closed Forms}
+
+\subsection{Basic Properties needed for Closed Forms}
+
+\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
+The $\textit{rdistinct}$ function, as its name suggests, will
+remove duplicates according to the accumulator
+and leave only one of each different element in a list:
+\begin{lemma}
+	The function $\textit{rdistinct}$ satisfies the following
+	properties:
+	\begin{itemize}
+		\item
+			If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
+		\item
+			If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
+			All the elements in $rs'$ are unique.
+	\end{itemize}
+\end{lemma}
+\begin{proof}
+	The first part is by an induction on $rs$.
+	The second part can be proven by using the 
+	induction rules of $\rdistinct{\_}{\_}$.
+	
+\end{proof}
+
+\noindent
+$\rdistinct{\_}{\_}$ will cancel out all regular expression terms
+that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary
+list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$
+on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$
+without the prepending of $rs$:
+\begin{lemma}
+	If $rs \subseteq rset$, then 
+	$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
+\end{lemma}
+\begin{proof}
+	By induction on $rs$.
+\end{proof}
+\noindent
+On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
+then expanding the accumulator to include that element will not cause the output list to change:
+\begin{lemma}
+	The accumulator can be augmented to include elements not appearing in the input list,
+	and the output will not change.	
+	\begin{itemize}
+		\item
+		If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{\{r\} \cup acc}$.
+		\item
+			Particularly, when $acc = \varnothing$ and $rs$ de-duplicated, we have\\
+			\[ \rdistinct{rs}{\varnothing} = rs \]
+	\end{itemize}
+\end{lemma}
+\begin{proof}
+	The first half is by induction on $rs$. The second half is a corollary of the first.
+\end{proof}
+\noindent
+The next property gives the condition for
+when $\rdistinct{\_}{\_}$ becomes an identical mapping
+for any prefix of an input list, in other words, when can 
+we ``push out" the arguments of $\rdistinct{\_}{\_}$:
+\begin{lemma}
+	If $rs_1$'s elements are all unique, and not appearing in the accumulator $acc$, 
+	then it can be taken out and left untouched in the output:
+	\[\textit{rdistinct}\;  rs_1 @ rsa\;\, acc
+	= rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1)\]
+\end{lemma}
+
+\begin{proof}
+By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
+\end{proof}
+\subsubsection{The properties of $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS}$} 
+We give in this subsection some properties of how $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS} $ interact with each other and  with $@$, the concatenation operator.
+These will be helpful in later closed form proofs, when
+we want to transform the ways in which multiple functions involving
+those are composed together
+in interleaving derivative and  simplification steps.
+
+When the function $\textit{Rflts}$ 
+is applied to the concatenation of two lists, the output can be calculated by first applying the
+functions on two lists separately, and then concatenating them together.
+\begin{lemma}
+	The function $\rflts$ has the below properties:\\
+	\begin{itemize}
+		\item
+	$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
+\item
+	If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
+	\end{itemize}
+\end{lemma}
+\noindent
+\begin{proof}
+	By induction on $rs_1$.
+\end{proof}
+
+
+
+\subsection{A Closed Form for the Sequence Regular Expression}
+\begin{quote}\it
+	Claim: For regular expressions $r_1 \cdot r_2$, we claim that
+	\begin{center}
+		$ \rderssimp{r_1 \cdot r_2}{s} = 
+	\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+	\end{center}
+\end{quote}
+\noindent
+
 Before we get to the proof that says the intermediate result of our lexer will
 remain finitely bounded, which is an important efficiency/liveness guarantee,
 we shall first develop a few preparatory properties and definitions to