--- a/ChengsongTanPhdThesis/Chapters/Finite.tex Tue Jun 14 18:06:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Thu Jun 23 16:09:40 2022 +0100
@@ -10,6 +10,17 @@
In this chapter we give a guarantee in terms of time complexity:
given a regular expression $r$, for any string $s$
our algorithm's internal data structure is finitely bounded.
+Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal
+data structure used in our $\blexer$)
+is bounded by a constant $N_r$, where $N$ only depends on the regular expression
+$r$, not the string $s$.
+When doing a time complexity analysis of any
+lexer/parser based on Brzozowski derivatives, one needs to take into account that
+not only the "derivative steps".
+
+%TODO: get a grpah for internal data structure growing arbitrary large
+
+
To obtain such a proof, we need to
\begin{itemize}
\item
@@ -34,32 +45,64 @@
\begin{center}
\begin{tabular}{ccc}
-$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
+ $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
+ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\
+$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
+\end{tabular}
+\end{center}
+
+\noindent
+Similarly there is a size function for plain regular expressions:
+\begin{center}
+\begin{tabular}{ccc}
+ $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
+ $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
+$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
+$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
+$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$
\end{tabular}
\end{center}
-(TODO: COMPLETE this defn and for $rs$)
-The size is based on a recursive function on the structure of the regex,
-not the bitcodes.
-Therefore we may as well talk about size of an annotated regular expression
-in their un-annotated form:
+\noindent
+The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$
+is to get an equivalent form
+of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$
+is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$.
+We notice that while it is not so clear how to obtain
+a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2},
+not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$
+in the order as our lexer will result in the bit-codes dispensed differently),
+it is possible to get an slightly different representation of the unlifted versions:
+$ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$.
+This suggest setting the bounding function $f(a, s)$ as
+$\llbracket (a \backslash s)_\downarrow \rrbracket_p$, the plain size
+of the erased annotated regular expression.
+This requires the the regular expression accompanied by bitcodes
+to have the same size as its plain counterpart after erasure:
\begin{center}
-$\asize(a) = \size(\erase(a))$.
+ $\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$.
\end{center}
-(TODO: turn equals to roughly equals)
-
-But there is a minor nuisance here:
-the erase function actually messes with the structure of the regular expression:
+\noindent
+But there is a minor nuisance:
+the erase function unavoidbly messes with the structure of the regular expression,
+due to the discrepancy between annotated regular expression's $\sum$ constructor
+and plain regular expression's $+$ constructor having different arity.
\begin{center}
\begin{tabular}{ccc}
-$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
+$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\
+$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\
+$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$
\end{tabular}
\end{center}
-(TODO: complete this definition with singleton r in alts)
-
+\noindent
An alternative regular expression with an empty list of children
- is turned into an $\ZERO$ during the
+ is turned into a $\ZERO$ during the
$\erase$ function, thereby changing the size and structure of the regex.
+Therefore the equality in question does not hold.
These will likely be fixable if we really want to use plain $\rexp$s for dealing
with size, but we choose a more straightforward (or stupid) method by
defining a new datatype that is similar to plain $\rexp$s but can take
@@ -75,56 +118,105 @@
For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
but keep everything else intact.
It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
We denote the operation of erasing the bits and turning an annotated regular expression
into an $\rrexp{}$ as $\rerase{}$.
\begin{center}
\begin{tabular}{lcl}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
+$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
+$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
+ $\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
+$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
+$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
+$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
\end{tabular}
\end{center}
-%TODO: FINISH definition of rerase
+\noindent
+$\rrexp$ give the exact correspondence between an annotated regular expression
+and its (r-)erased version:
+\begin{lemma}
+$\rsize{\rerase a} = \asize a$
+\end{lemma}
+This does not hold for plain $\rexp$s.
+
Similarly we could define the derivative and simplification on
$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
except that now they can operate on alternatives taking multiple arguments.
\begin{center}
\begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\
+ $(\RALTS{rs})\; \backslash c$ & $\dn$ & $\RALTS{\map\; (\_ \backslash c) \;rs}$\\
(other clauses omitted)
+With the new $\rrexp$ datatype in place, one can define its size function,
+which precisely mirrors that of the annotated regular expressions:
+\end{tabular}
+\end{center}
+\noindent
+\begin{center}
+\begin{tabular}{ccc}
+ $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\
+ $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\
+ $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
+$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\
+$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\
+$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$
\end{tabular}
\end{center}
+\noindent
-Now that $\rrexp$s do not have bitcodes on them, we can do the
-duplicate removal without an equivalence relation:
+\subsection{Lexing Related Functions for $\rrexp$}
+Everything else for $\rrexp$ will be precisely the same for annotated expressions,
+except that they do not involve rectifying and augmenting bit-encoded tokenization information.
+As expected, most functions are simpler, such as the derivative:
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+ $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
+ $(\ONE)\,\backslash_r c$ & $\dn$ &
+ $\textit{if}\;c=d\; \;\textit{then}\;
+ \ONE\;\textit{else}\;\ZERO$\\
+ $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
+ $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
+ $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
+ $\textit{if}\;\textit{rnullable}\,r_1$\\
+ & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
+ & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\
+ & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
+ $(r^*)\,\backslash_r c$ & $\dn$ &
+ $( r\,\backslash_r c)\cdot
+ (_{[]}r^*))$
+\end{tabular}
+\end{center}
+\noindent
+The simplification function is simplified without annotation causing superficial differences.
+Duplicate removal without an equivalence relation:
\begin{center}
\begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
+ $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
+$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
& & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
\end{tabular}
\end{center}
%TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
+\noindent
+With $\textit{rdistinct}$ one can chain together all the other modules
+of $\bsimp{\_}$ (removing the functionalities related to bit-sequences)
+and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$.
+We omit these functions, as they are routine. Please refer to the formalisation
+(in file BasicIdentities.thy) for the exact definition.
+With $\rrexp$ the size caclulation of annotated regular expressions'
+simplification and derivatives can be done by the size of their unlifted
+counterpart with the unlifted version of simplification and derivatives applied.
\begin{lemma}
-$\rsize{\rerase a} = \asize a$
+ \begin{itemize}
+ \item
+$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
+\item
+$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
+\end{itemize}
\end{lemma}
-This lemma says that the size of an r-erased regex is the same as the annotated regex.
-this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
-\end{lemma}
+\noindent
+In the following content, we will focus on $\rrexp$'s size bound.
+We will piece together this bound and show the same bound for annotated regular
+expressions in the end.
Unless stated otherwise in this chapter all $\textit{rexp}$s without
bitcodes are seen as $\rrexp$s.
We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
@@ -136,7 +228,135 @@
%-----------------------------------
% SECTION ?
%-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
+ \section{Roadmap to a Finiteness Proof}
+ Now that we have defined the $\rrexp$ datatype, and proven that its size changes
+ w.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions,
+ we aim to bound the size of $r \backslash s$ for any $\rrexp$ $r$.
+The way we do it is by two steps:
+\begin{itemize}
+ \item
+ First, we rewrite $r\backslash s$ into something else that is easier
+ to bound. This step is especially important for the inductive case
+ $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
+ but after simplification they will always be equal or smaller to a form consisting of an alternative
+ list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
+ The functions $f$ and $g$ are usually from $\flts$ $\distinctBy$ and other appearing in $\bsimp{_}$.
+ This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
+ right hand side the "closed form" of $r\backslash s$.
+ \item
+ Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
+ by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only
+ reduce the size of a regular expression, not adding to it. The closed form $f\; (g\; (\sum rs)) $ is controlled
+ by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$.
+\end{itemize}
+
+\section{Closed Forms}
+
+\subsection{Basic Properties needed for Closed Forms}
+
+\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully}
+The $\textit{rdistinct}$ function, as its name suggests, will
+remove duplicates according to the accumulator
+and leave only one of each different element in a list:
+\begin{lemma}
+ The function $\textit{rdistinct}$ satisfies the following
+ properties:
+ \begin{itemize}
+ \item
+ If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
+ \item
+ If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
+ All the elements in $rs'$ are unique.
+ \end{itemize}
+\end{lemma}
+\begin{proof}
+ The first part is by an induction on $rs$.
+ The second part can be proven by using the
+ induction rules of $\rdistinct{\_}{\_}$.
+
+\end{proof}
+
+\noindent
+$\rdistinct{\_}{\_}$ will cancel out all regular expression terms
+that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary
+list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$
+on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$
+without the prepending of $rs$:
+\begin{lemma}
+ If $rs \subseteq rset$, then
+ $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$.
+\end{lemma}
+\begin{proof}
+ By induction on $rs$.
+\end{proof}
+\noindent
+On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated,
+then expanding the accumulator to include that element will not cause the output list to change:
+\begin{lemma}
+ The accumulator can be augmented to include elements not appearing in the input list,
+ and the output will not change.
+ \begin{itemize}
+ \item
+ If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{\{r\} \cup acc}$.
+ \item
+ Particularly, when $acc = \varnothing$ and $rs$ de-duplicated, we have\\
+ \[ \rdistinct{rs}{\varnothing} = rs \]
+ \end{itemize}
+\end{lemma}
+\begin{proof}
+ The first half is by induction on $rs$. The second half is a corollary of the first.
+\end{proof}
+\noindent
+The next property gives the condition for
+when $\rdistinct{\_}{\_}$ becomes an identical mapping
+for any prefix of an input list, in other words, when can
+we ``push out" the arguments of $\rdistinct{\_}{\_}$:
+\begin{lemma}
+ If $rs_1$'s elements are all unique, and not appearing in the accumulator $acc$,
+ then it can be taken out and left untouched in the output:
+ \[\textit{rdistinct}\; rs_1 @ rsa\;\, acc
+ = rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1)\]
+\end{lemma}
+
+\begin{proof}
+By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary.
+\end{proof}
+\subsubsection{The properties of $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS}$}
+We give in this subsection some properties of how $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS} $ interact with each other and with $@$, the concatenation operator.
+These will be helpful in later closed form proofs, when
+we want to transform the ways in which multiple functions involving
+those are composed together
+in interleaving derivative and simplification steps.
+
+When the function $\textit{Rflts}$
+is applied to the concatenation of two lists, the output can be calculated by first applying the
+functions on two lists separately, and then concatenating them together.
+\begin{lemma}
+ The function $\rflts$ has the below properties:\\
+ \begin{itemize}
+ \item
+ $\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
+\item
+ If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
+ \end{itemize}
+\end{lemma}
+\noindent
+\begin{proof}
+ By induction on $rs_1$.
+\end{proof}
+
+
+
+\subsection{A Closed Form for the Sequence Regular Expression}
+\begin{quote}\it
+ Claim: For regular expressions $r_1 \cdot r_2$, we claim that
+ \begin{center}
+ $ \rderssimp{r_1 \cdot r_2}{s} =
+ \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$
+ \end{center}
+\end{quote}
+\noindent
+
Before we get to the proof that says the intermediate result of our lexer will
remain finitely bounded, which is an important efficiency/liveness guarantee,
we shall first develop a few preparatory properties and definitions to