diff -r a7344c9afbaf -r b2bea5968b89 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Tue Jun 14 18:06:33 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Thu Jun 23 16:09:40 2022 +0100 @@ -10,6 +10,17 @@ In this chapter we give a guarantee in terms of time complexity: given a regular expression $r$, for any string $s$ our algorithm's internal data structure is finitely bounded. +Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal +data structure used in our $\blexer$) +is bounded by a constant $N_r$, where $N$ only depends on the regular expression +$r$, not the string $s$. +When doing a time complexity analysis of any +lexer/parser based on Brzozowski derivatives, one needs to take into account that +not only the "derivative steps". + +%TODO: get a grpah for internal data structure growing arbitrary large + + To obtain such a proof, we need to \begin{itemize} \item @@ -34,32 +45,64 @@ \begin{center} \begin{tabular}{ccc} -$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ + $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ + $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ + $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ +$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ +$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ +$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$ +\end{tabular} +\end{center} + +\noindent +Similarly there is a size function for plain regular expressions: +\begin{center} +\begin{tabular}{ccc} + $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ + $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ + $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\ +$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\ +$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\ +$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$ \end{tabular} \end{center} -(TODO: COMPLETE this defn and for $rs$) -The size is based on a recursive function on the structure of the regex, -not the bitcodes. -Therefore we may as well talk about size of an annotated regular expression -in their un-annotated form: +\noindent +The idea of obatining a bound for $\llbracket \bderssimp{a}{s} \rrbracket$ +is to get an equivalent form +of something like $\llbracket \bderssimp{a}{s}\rrbracket = f(a, s)$, where $f(a, s)$ +is easier to estimate than $\llbracket \bderssimp{a}{s}\rrbracket$. +We notice that while it is not so clear how to obtain +a metamorphic representation of $\bderssimp{a}{s}$ (as we argued in chapter \ref{Bitcoded2}, +not interleaving the application of the functions $\backslash$ and $\bsimp{\_}$ +in the order as our lexer will result in the bit-codes dispensed differently), +it is possible to get an slightly different representation of the unlifted versions: +$ (\bderssimp{a}{s})_\downarrow = (\erase \; \bsimp{a \backslash s})_\downarrow$. +This suggest setting the bounding function $f(a, s)$ as +$\llbracket (a \backslash s)_\downarrow \rrbracket_p$, the plain size +of the erased annotated regular expression. +This requires the the regular expression accompanied by bitcodes +to have the same size as its plain counterpart after erasure: \begin{center} -$\asize(a) = \size(\erase(a))$. + $\asize{a} \stackrel{?}{=} \llbracket \erase(a)\rrbracket_p$. \end{center} -(TODO: turn equals to roughly equals) - -But there is a minor nuisance here: -the erase function actually messes with the structure of the regular expression: +\noindent +But there is a minor nuisance: +the erase function unavoidbly messes with the structure of the regular expression, +due to the discrepancy between annotated regular expression's $\sum$ constructor +and plain regular expression's $+$ constructor having different arity. \begin{center} \begin{tabular}{ccc} -$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ +$\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\ +$\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ +$\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ \end{tabular} \end{center} -(TODO: complete this definition with singleton r in alts) - +\noindent An alternative regular expression with an empty list of children - is turned into an $\ZERO$ during the + is turned into a $\ZERO$ during the $\erase$ function, thereby changing the size and structure of the regex. +Therefore the equality in question does not hold. These will likely be fixable if we really want to use plain $\rexp$s for dealing with size, but we choose a more straightforward (or stupid) method by defining a new datatype that is similar to plain $\rexp$s but can take @@ -75,56 +118,105 @@ For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, but keep everything else intact. It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved -(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton -$\ALTS$). We denote the operation of erasing the bits and turning an annotated regular expression into an $\rrexp{}$ as $\rerase{}$. \begin{center} \begin{tabular}{lcl} -$\rerase{\AZERO}$ & $=$ & $\RZERO$\\ -$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ -$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ +$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ +$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\ + $\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\ +$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ +$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ +$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$ \end{tabular} \end{center} -%TODO: FINISH definition of rerase +\noindent +$\rrexp$ give the exact correspondence between an annotated regular expression +and its (r-)erased version: +\begin{lemma} +$\rsize{\rerase a} = \asize a$ +\end{lemma} +This does not hold for plain $\rexp$s. + Similarly we could define the derivative and simplification on $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, except that now they can operate on alternatives taking multiple arguments. \begin{center} \begin{tabular}{lcr} -$\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ + $(\RALTS{rs})\; \backslash c$ & $\dn$ & $\RALTS{\map\; (\_ \backslash c) \;rs}$\\ (other clauses omitted) +With the new $\rrexp$ datatype in place, one can define its size function, +which precisely mirrors that of the annotated regular expressions: +\end{tabular} +\end{center} +\noindent +\begin{center} +\begin{tabular}{ccc} + $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\ + $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\ + $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\ +$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\ +$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\ +$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$ \end{tabular} \end{center} +\noindent -Now that $\rrexp$s do not have bitcodes on them, we can do the -duplicate removal without an equivalence relation: +\subsection{Lexing Related Functions for $\rrexp$} +Everything else for $\rrexp$ will be precisely the same for annotated expressions, +except that they do not involve rectifying and augmenting bit-encoded tokenization information. +As expected, most functions are simpler, such as the derivative: +\begin{center} + \begin{tabular}{@{}lcl@{}} + $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ + $(\ONE)\,\backslash_r c$ & $\dn$ & + $\textit{if}\;c=d\; \;\textit{then}\; + \ONE\;\textit{else}\;\ZERO$\\ + $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ & + $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\ + $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ & + $\textit{if}\;\textit{rnullable}\,r_1$\\ + & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\ + & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\ + & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\ + $(r^*)\,\backslash_r c$ & $\dn$ & + $( r\,\backslash_r c)\cdot + (_{[]}r^*))$ +\end{tabular} +\end{center} +\noindent +The simplification function is simplified without annotation causing superficial differences. +Duplicate removal without an equivalence relation: \begin{center} \begin{tabular}{lcl} -$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ + $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\ +$\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ \end{tabular} \end{center} %TODO: definition of rsimp (maybe only the alternative clause) - - -The reason why these definitions mirror precisely the corresponding operations -on annotated regualar expressions is that we can calculate sizes more easily: - +\noindent +With $\textit{rdistinct}$ one can chain together all the other modules +of $\bsimp{\_}$ (removing the functionalities related to bit-sequences) +and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$. +We omit these functions, as they are routine. Please refer to the formalisation +(in file BasicIdentities.thy) for the exact definition. +With $\rrexp$ the size caclulation of annotated regular expressions' +simplification and derivatives can be done by the size of their unlifted +counterpart with the unlifted version of simplification and derivatives applied. \begin{lemma} -$\rsize{\rerase a} = \asize a$ + \begin{itemize} + \item +$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ +\item +$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ +\end{itemize} \end{lemma} -This lemma says that the size of an r-erased regex is the same as the annotated regex. -this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. -\begin{lemma} -$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ -\end{lemma} -Putting these two together we get a property that allows us to estimate the -size of an annotated regular expression derivative using its un-annotated counterpart: -\begin{lemma} -$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ -\end{lemma} +\noindent +In the following content, we will focus on $\rrexp$'s size bound. +We will piece together this bound and show the same bound for annotated regular +expressions in the end. Unless stated otherwise in this chapter all $\textit{rexp}$s without bitcodes are seen as $\rrexp$s. We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably @@ -136,7 +228,135 @@ %----------------------------------- % SECTION ? %----------------------------------- -\section{preparatory properties for getting a finiteness bound} + \section{Roadmap to a Finiteness Proof} + Now that we have defined the $\rrexp$ datatype, and proven that its size changes + w.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions, + we aim to bound the size of $r \backslash s$ for any $\rrexp$ $r$. +The way we do it is by two steps: +\begin{itemize} + \item + First, we rewrite $r\backslash s$ into something else that is easier + to bound. This step is especially important for the inductive case + $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way, + but after simplification they will always be equal or smaller to a form consisting of an alternative + list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application. + The functions $f$ and $g$ are usually from $\flts$ $\distinctBy$ and other appearing in $\bsimp{_}$. + This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the + right hand side the "closed form" of $r\backslash s$. + \item + Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size + by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only + reduce the size of a regular expression, not adding to it. The closed form $f\; (g\; (\sum rs)) $ is controlled + by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$. +\end{itemize} + +\section{Closed Forms} + +\subsection{Basic Properties needed for Closed Forms} + +\subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully} +The $\textit{rdistinct}$ function, as its name suggests, will +remove duplicates according to the accumulator +and leave only one of each different element in a list: +\begin{lemma} + The function $\textit{rdistinct}$ satisfies the following + properties: + \begin{itemize} + \item + If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. + \item + If list $rs'$ is the result of $\rdistinct{rs}{acc}$, + All the elements in $rs'$ are unique. + \end{itemize} +\end{lemma} +\begin{proof} + The first part is by an induction on $rs$. + The second part can be proven by using the + induction rules of $\rdistinct{\_}{\_}$. + +\end{proof} + +\noindent +$\rdistinct{\_}{\_}$ will cancel out all regular expression terms +that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary +list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$ +on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$ +without the prepending of $rs$: +\begin{lemma} + If $rs \subseteq rset$, then + $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$. +\end{lemma} +\begin{proof} + By induction on $rs$. +\end{proof} +\noindent +On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated, +then expanding the accumulator to include that element will not cause the output list to change: +\begin{lemma} + The accumulator can be augmented to include elements not appearing in the input list, + and the output will not change. + \begin{itemize} + \item + If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{\{r\} \cup acc}$. + \item + Particularly, when $acc = \varnothing$ and $rs$ de-duplicated, we have\\ + \[ \rdistinct{rs}{\varnothing} = rs \] + \end{itemize} +\end{lemma} +\begin{proof} + The first half is by induction on $rs$. The second half is a corollary of the first. +\end{proof} +\noindent +The next property gives the condition for +when $\rdistinct{\_}{\_}$ becomes an identical mapping +for any prefix of an input list, in other words, when can +we ``push out" the arguments of $\rdistinct{\_}{\_}$: +\begin{lemma} + If $rs_1$'s elements are all unique, and not appearing in the accumulator $acc$, + then it can be taken out and left untouched in the output: + \[\textit{rdistinct}\; rs_1 @ rsa\;\, acc + = rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1)\] +\end{lemma} + +\begin{proof} +By an induction on $rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary. +\end{proof} +\subsubsection{The properties of $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS}$} +We give in this subsection some properties of how $\backslash_r$, $\backslash_{rsimp}$, $\textit{Rflts}$ and $\textit{Rsimp}_{ALTS} $ interact with each other and with $@$, the concatenation operator. +These will be helpful in later closed form proofs, when +we want to transform the ways in which multiple functions involving +those are composed together +in interleaving derivative and simplification steps. + +When the function $\textit{Rflts}$ +is applied to the concatenation of two lists, the output can be calculated by first applying the +functions on two lists separately, and then concatenating them together. +\begin{lemma} + The function $\rflts$ has the below properties:\\ + \begin{itemize} + \item + $\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$ +\item + If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$ + \end{itemize} +\end{lemma} +\noindent +\begin{proof} + By induction on $rs_1$. +\end{proof} + + + +\subsection{A Closed Form for the Sequence Regular Expression} +\begin{quote}\it + Claim: For regular expressions $r_1 \cdot r_2$, we claim that + \begin{center} + $ \rderssimp{r_1 \cdot r_2}{s} = + \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ + \end{center} +\end{quote} +\noindent + Before we get to the proof that says the intermediate result of our lexer will remain finitely bounded, which is an important efficiency/liveness guarantee, we shall first develop a few preparatory properties and definitions to