diff -r 812e5d112f49 -r 671a83abccf3 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Fri Jul 01 13:02:15 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Mon Jul 04 12:27:13 2022 +0100 @@ -265,10 +265,21 @@ \section{Step One: Closed Forms} We transform the function application $\rderssimp{r}{s}$ - into an equivalent form $f\; (g \; (\sum rs))$. + into an equivalent + form $f\; (g \; (\sum rs))$. The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$. This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the right hand side the "closed form" of $r\backslash s$. + +\begin{quote}\it + Claim: For regular expressions $r_1 \cdot r_2$, we claim that + \begin{center} + $ \rderssimp{r_1 \cdot r_2}{s} = + \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ + \end{center} +\end{quote} +\noindent +We explain in detail how we reached those claims. \subsection{Basic Properties needed for Closed Forms} \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully} @@ -1045,37 +1056,100 @@ Before we go on to obtain them, some preliminary definitions are needed to make proof statements concise. -\section{"Closed Forms" of regular expressions' derivatives w.r.t strings} -We note that the different ways in which regular expressions are -nested do not matter under $\rsimp{}$: +\section{"Closed Forms" of Sequence Regular Expressions} +The problem of obataining a closed-form for sequence regular expression +is constructing $(r_1 \cdot r_2) \backslash_r s$ +if we are only allowed to use a combination of $r_1 \backslash s''$ +and $r_2 \backslash s''$ , where $s''$ is from $s$. +First let's look at a series of derivatives steps on a sequence +regular expression, assuming that each time the first +component of the sequence is always nullable): \begin{center} - To be proven:\\ - $\rsimp{r} \stackrel{?}{=} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ -and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$ + +$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ +$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad + \ldots$ + \end{center} -\noindent +Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as +a giant alternative taking a list of terms +$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$, +where the head of the list is always the term +representing a match involving only $r_1$, and the tail of the list consisting of +terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$. This intuition is also echoed by IndianPaper, where they gave a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$: \begin{center} - $((r_1 \cdot r_2) \backslash_r c_1) \stackrel{\backslash_r c_2}{=} - ((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1),\; r_2 \backslash_r c_1)) - \stackrel{\backslash_r c_2}{=} - ((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) - + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2) + \begin{tabular}{c} + $(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ + \rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) + \myequiv$\\ + \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2)) + + (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n) $ + \end{tabular} \end{center} \noindent -The $\delta \; b \; r$ function above turns the entire term into $\RZERO$ -if the boolean condition $b$ evaluates to false, and outputs $r$ otherwise. -The equality in their derivation steps should be interpretated -as language equivalence. To pin down the intuition into rigorous terms, -$\sflat{}$ is used to enable the transformation from +The equality in above should be interpretated +as language equivalence. +The $\delta$ function works similarly to that of +a Kronecker delta function: +\[ \delta \; b\; r\] +will produce $r$ +if $b$ evaluates to true, +and $\RZERO$ otherwise. +Note that their formulation +\[ + ((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) + + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2) +\] +does not faithfully +represent what the intermediate derivatives would actually look like +when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not +nullable in the head of the sequence. +For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable, +the regular expression would not look like +\[ + (r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO, +\] +but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the +first place. +In a closed-form one would want to take into account this +and generate the list of +regular expressions $r_2 \backslash_r s''$ with +string pairs $(s', s'')$ where $s'@s'' = s$ and +$r_1 \backslash s'$ nullable. +We denote the list consisting of such +strings $s''$ as $\vsuf{s}{r_1}$. + +The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: +\begin{center} +\begin{tabular}{lcl} +$\vsuf{[]}{\_} $ & $=$ & $[]$\\ +$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ + && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ +\end{tabular} +\end{center} +\noindent +The list is sorted in the order $r_2\backslash s''$ +appears in $(r_1\cdot r_2)\backslash s$. +In essence, $\vsuf{\_}{\_}$ is doing a +"virtual derivative" of $r_1 \cdot r_2$, but instead of producing +the entire result $(r_1 \cdot r_2) \backslash s$, +it only stores all the strings $s''$ such that $r_2 \backslash s''$ +are occurring terms in $(r_1\cdot r_2)\backslash s$. + +To make the closed form representation +more straightforward, +the flattetning function $\sflat{\_}$ is used to enable the transformation from a left-associative nested sequence of alternatives into a flattened list: -$r = \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{}}{\rightarrow} -(\ldots ((r_1 + r_2) + r_3) + \ldots)$ -The definitions $\sflat{}$, $\sflataux{}$ are given below. - +\[ + \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} +(\ldots ((r_1 + r_2) + r_3) + \ldots) +\] +\noindent +The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below. \begin{center} \begin{tabular}{ccc} $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ @@ -1091,118 +1165,363 @@ $\sflat r$ & $=$ & $ r$ \end{tabular} \end{center} -The intuition behind $\sflataux{\_}$ is to break up nested regexes +\noindent +$\sflataux{\_}$ breaks up nested alternative regexes of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape -into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. +into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. It will return the singleton list $[r]$ otherwise. $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps the output type a regular expression, not a list. -$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the - first element of the list of children of -an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression -$r_1 \cdot r_2 \backslash s$. -\subsection{The $\textit{vsuf}$ Function} -Note that $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) - + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)$ does not faithfully -represent what would the intermediate derivatives would actually look like. -For example, when $r_1$ and $r_1 \backslash_r c_1$ are nullable, -the regular expression would not look like $(r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO$, -but actually $r_1\backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the -first place. -In a closed-form one would want to take into account this -and generate the list of string pairs $(s', s'')$ where $s'@s'' = s$ such that -only $r_1 \backslash s'$ nullable. -With $\sflat{\_}$ and $\sflataux{\_}$ we make -precise what "closed forms" we have for the sequence derivatives and their simplifications, -in other words, how can we construct $(r_1 \cdot r_2) \backslash s$ -and $\bderssimp{(r_1\cdot r_2)}{s}$, -if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$) -and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over -the substring of $s$? -First let's look at a series of derivatives steps on a sequence -regular expression, (assuming) that each time the first -component of the sequence is always nullable): -\begin{center} +$\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the +first element of the list. + +With $\sflataux{}$ a preliminary to the closed form can be stated, +where the derivative of $r_1 \cdot r_2 \backslash s$ can be +flattened into a list whose head and tail meet the description +we gave earlier. +\begin{lemma}\label{seqSfau0} + $\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 + :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ +\end{lemma} +\begin{proof} + By an induction on the string $s$, where the inductive cases + are split as $[]$ and $xs @ [x]$. + Note the key identify holds: + \[ + \map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\; + \map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1})) + \] + = + \[ + \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1}) + \] + This enables the inductive case to go through. +\end{proof} +\noindent +Note that this lemma does $\mathbf{not}$ depend on any +specific definitions we used, +allowing people investigating derivatives to get an alternative +view of what $r_1 \cdot r_2$ is. -$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ -$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad - \ldots$ - -\end{center} -%TODO: cite indian paper -Indianpaper have come up with a slightly more formal way of putting the above process: +Now we are able to use this for the intuition that +the different ways in which regular expressions are +nested do not matter under $\rsimp{}$: \begin{center} -$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) + -\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots -+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$ + $\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ +and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$ \end{center} -where $\delta(b, r)$ will produce $r$ -if $b$ evaluates to true, -and $\ZERO$ otherwise. +Simply wrap with $\sum$ constructor and add +simplifications to both sides of \ref{seqSfau0} +and one gets +\begin{corollary}\label{seqClosedFormGeneral} + $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } + =\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: + \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ +\end{corollary} +Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), +it is possible to convert the above lemma to obtain a "closed form" +for derivatives nested with simplification: +\begin{lemma}\label{seqClosedForm} + $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) + :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$ +\end{lemma} +\begin{proof} + By a case analysis of string $s$. + When $s$ is empty list, the rewrite is straightforward. + When $s$ is a list, one could use the corollary \ref{seqSfau0}, + and lemma \ref{Simpders} to rewrite the left-hand-side. +\end{proof} +As a corollary for this closed form, one can estimate the size +of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using +an easier-to-handle expression: +\begin{corollary}\label{seqEstimate1} + \begin{center} - But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical - equivalence. To make this intuition useful - for a formal proof, we need something -stronger than language equivalence. -With the help of $\sflat{\_}$ we can state the equation in Indianpaper -more rigorously: -\begin{lemma} -$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ -\end{lemma} - -The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: + $\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) + :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$ + + \end{center} +\end{corollary} +\noindent +\subsection{Closed Forms for Star Regular Expressions} +We use a similar technique as $r_1 \cdot r_2$ case, +generating +all possible sub-strings $s'$ of $s$ +such that $r\backslash s' \cdot r^*$ will appear +as a term in $(r^*) \backslash s$. +The first function we define is a single-step +updating function $\starupdate$, which takes three arguments as input: +the new character $c$ to take derivative with, +the regular expression +$r$ directly under the star $r^*$, and the +list of strings $sSet$ for the derivative $r^* \backslash s$ +up til this point +such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ +(the equality is not exact, more on this later). +\begin{center} + \begin{tabular}{lcl} + $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ + $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ + & & $\textit{if} \; + (\rnullable \; (\rders \; r \; s))$ \\ + & & $\textit{then} \;\; (s @ [c]) :: [c] :: ( + \starupdate \; c \; r \; Ss)$ \\ + & & $\textit{else} \;\; (s @ [c]) :: ( + \starupdate \; c \; r \; Ss)$ + \end{tabular} +\end{center} +\noindent +As a generalisation from characters to strings, +$\starupdates$ takes a string instead of a character +as the first input argument, and is otherwise the same +as $\starupdate$. +\begin{center} + \begin{tabular}{lcl} + $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ + $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( + \starupdate \; c \; r \; Ss)$ + \end{tabular} +\end{center} +\noindent +For the star regular expression, +its derivatives can be seen as a nested gigantic +alternative similar to that of sequence regular expression's derivatives, +and therefore need +to be ``straightened out" as well. +The function for this would be $\hflat{}$ and $\hflataux{}$. +\begin{center} + \begin{tabular}{lcl} + $\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ + $\hflataux{r}$ & $\dn$ & $[r]$ + \end{tabular} +\end{center} \begin{center} -\begin{tabular}{lcl} -$\vsuf{[]}{\_} $ & $=$ & $[]$\\ -$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ - && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ -\end{tabular} -\end{center} -It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable, -and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in -the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$. -In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing -the entire result of $(r_1 \cdot r_2) \backslash s$, -it only stores all the $s''$ such that $r_2 \backslash s''$ are occurring terms in $(r_1\cdot r_2)\backslash s$. + \begin{tabular}{lcl} + $\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ + $\hflat{r}$ & $\dn$ & $r$ + \end{tabular} +\end{center} +\noindent +%MAYBE TODO: introduce createdByStar +We first introduce an inductive property +for $\starupdate$ and $\hflataux{\_}$, +it says if we do derivatives of $r^*$ +with a string that starts with $c$, +then flatten it out, +we obtain a list +of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$, +where $sSet = \starupdates \; s \; r \; [[c]]$. +\begin{lemma}\label{starHfauInduct} + $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = + \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; + (\starupdates \; s \; r_0 \; [[c]])$ +\end{lemma} +\begin{proof} + By an induction on $s$, the inductive cases + being $[]$ and $s@[c]$. +\end{proof} +\noindent +Here is a corollary that states the lemma in +a more intuitive way: +\begin{corollary} + $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot + (r^*))\; (\starupdates \; c\; r\; [[c]])$ +\end{corollary} +\noindent +Note that this is also agnostic of the simplification +function we defined, and is therefore of more general interest. + +Now adding the $\rsimp{}$ bit for closed forms, +we have +\begin{lemma} + $a :: rs \grewrites \hflataux{a} @ rs$ +\end{lemma} +\noindent +giving us +\begin{lemma}\label{cbsHfauRsimpeq1} + $\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$. +\end{lemma} +\noindent +This yields +\begin{lemma}\label{hfauRsimpeq2} +$\rsimp{r} = \rsimp{(\sum \hflataux{r})}$ +\end{lemma} +\noindent +Together with the rewriting relation +\begin{lemma}\label{starClosedForm6Hrewrites} + $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss + \scfrewrites + \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ +\end{lemma} +\noindent +We obtain the closed form for star regular expression: +\begin{lemma}\label{starClosedForm} + $\rderssimp{r^*}{c::s} = + \rsimp{ + (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; + (\starupdates \; s\; r \; [[c]]) + ) + ) + } + $ +\end{lemma} +\begin{proof} + By an induction on $s$. + The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2} + are used. +\end{proof} +\section{Estimating the Closed Forms' sizes} +We now summarize the closed forms below: +\begin{itemize} + \item + $\rderssimp{(\sum rs)}{s} \sequal + \sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ + \item + $\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 ) + :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$ + \item + + $\rderssimp{r^*}{c::s} = + \rsimp{ + (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; + (\starupdates \; s\; r \; [[c]]) + ) + ) + } + $ +\end{itemize} +\noindent +The closed forms on the left-hand-side +are all of the same shape: $\rsimp{ (\sum rs)} $. +Such regular expression will be bounded by the size of $\sum rs'$, +where every element in $rs'$ is distinct, and each element +can be described by some inductive sub-structures +(for example when $r = r_1 \cdot r_2$ then $rs'$ +will be solely comprised of $r_1 \backslash s'$ +and $r_2 \backslash s''$, $s'$ and $s''$ being +sub-strings of $s$). +which will each have a size uppder bound +according to inductive hypothesis, which controls $r \backslash s$. -With this we can also add simplifications to both sides and get +We elaborate the above reasoning by a series of lemmas +below, where straightforward proofs are omitted. \begin{lemma} -$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ + If $\forall r \in rs. \rsize{r} $ is less than or equal to $N$, + and $\textit{length} \; rs$ is less than or equal to $l$, + then $\rsize{\sum rs}$ is less than or equal to $l*N + 1$. +\end{lemma} +\noindent +If we define all regular expressions with size no +more than $N$ as $\sizeNregex \; N$: +\[ + \sizeNregex \; N \dn \{r \mid \rsize{r} \leq N \} +\] +Then such set is finite: +\begin{lemma}\label{finiteSizeN} + $\textit{isFinite}\; (\sizeNregex \; N)$ +\end{lemma} +\begin{proof} + By overestimating the set $\sizeNregex \; N + 1$ + using union of sets like + $\{r_1 \cdot r_2 \mid r_1 \in A + \text{and} + r_2 \in A\} + $ where $A = \sizeNregex \; N$. +\end{proof} +\noindent +From this we get a corollary that +if forall $r \in rs$, $\rsize{r} \leq N$, then the output of +$\rdistinct{rs}{\varnothing}$ is a list of regular +expressions of finite size depending on $N$ only. +\begin{corollary} + Assumes that for all $r \in rs. \rsize{r} \leq N$, + and the cardinality of $\sizeNregex \; N$ is $c_N$ + then$\rsize{\rdistinct{rs}{\varnothing}} \leq c*N$. +\end{corollary} +\noindent +We have proven that the output of $\rdistinct{rs'}{\varnothing}$ +is bounded by a constant $c_N$ depending only on $N$, +provided that each of $rs'$'s element +is bounded by $N$. +We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. + +We show how $\rdistinct$ and $\rflts$ +in the simplification function together is at least as +good as $\rdistinct{}{}$ alone. +\begin{lemma}\label{interactionFltsDB} + $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r + \leq + \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. \end{lemma} -Together with the idempotency property of $\rsimp{}$\ref{rsimpIdem}, -%TODO: state the idempotency property of rsimp -it is possible to convert the above lemma to obtain a "closed form" -for derivatives nested with simplification: -\begin{lemma} -$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ +\noindent +The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of +duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. + +Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$: +\begin{lemma}\label{altsSimpControl} + $\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$ \end{lemma} -And now the reason we have (1) in section 1 is clear: -$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, -is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$ - as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$. +\begin{proof} + By using \ref{interactionFltsDB}. +\end{proof} +\noindent +which says that the size of regular expression +is always smaller if we apply the full simplification +rather than just one component ($\rdistinct{}{}$). + + +Now we are ready to control the sizes of +$r_1 \cdot r_2 \backslash s$, $r^* \backslash s$. +\begin{theorem} +For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ +\end{theorem} +\noindent +\begin{proof} +We prove this by induction on $r$. The base cases for $\RZERO$, +$\RONE $ and $\RCHAR{c}$ are straightforward. +In the sequence $r_1 \cdot r_2$ case, +the inductive hypotheses state $\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and +$\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows +% +\begin{center} +\begin{tabular}{lcll} +& & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\ +& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 \; \; :: \; \; +\map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\ +& $\leq$ & $\llbracket \rsimp{(\sum(r_1 \backslash s \cdot r_2 :: +\map \; (\lambda s'. r_2\backslash_{rsimp} s') (\vsuf{s}{r})))} \rrbracket_r $ & (2) \\ + +\end{tabular} +\end{center} + +\end{proof} + +\noindent +where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\rderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). +The reason why we could write the derivatives of a sequence this way is described in section 2. +The term (2) is used to control (1). +That is because one can obtain an overall +smaller regex list +by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. +Section 3 is dedicated to its proof. +In (3) we know that $\llbracket \ASEQ{bs}{(\rderssimp{ r_1}{s}}{r_2}\rrbracket$ is +bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller +than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands +for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). +We reason similarly for $\STAR$.\medskip + +%----------------------------------- +% SECTION 2 +%----------------------------------- + %---------------------------------------------------------------------------------------- % SECTION 3 %---------------------------------------------------------------------------------------- -\section{interaction between $\distinctWith$ and $\flts$} -Note that it is not immediately obvious that -\begin{center} -$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $. -\end{center} -The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of -duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. - \subsection{A Closed Form for the Sequence Regular Expression} -\begin{quote}\it - Claim: For regular expressions $r_1 \cdot r_2$, we claim that - \begin{center} - $ \rderssimp{r_1 \cdot r_2}{s} = - \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ - \end{center} -\end{quote} \noindent Before we get to the proof that says the intermediate result of our lexer will @@ -1215,57 +1534,6 @@ but this time we will have stronger equalities established. -\section{Estimating the Closed Forms' sizes} - - The closed form $f\; (g\; (\sum rs)) $ is controlled - by the size of $rs'$, where every element in $rs'$ is distinct, and each element can be described by som e inductive sub-structures (for example when $r = r_1 \cdot r_2$ then $rs'$ will be solely comprised of $r_1 \backslash s'$ and $r_2 \backslash s''$, $s'$ and $s''$ being sub-strings of $s$), which will have a numeric uppder bound according to inductive hypothesis, which controls $r \backslash s$. -%----------------------------------- -% SECTION 2 -%----------------------------------- - -\begin{theorem} -For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ -\end{theorem} - -\noindent -\begin{proof} -We prove this by induction on $r$. The base cases for $\AZERO$, -$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is -for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction -hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and -$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows -% -\begin{center} -\begin{tabular}{lcll} -& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ -& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: - [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ -& $\leq$ & - $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: - [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ -& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + - \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ -& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + - \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ -& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) -\end{tabular} -\end{center} - - -\noindent -where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). -The reason why we could write the derivatives of a sequence this way is described in section 2. -The term (2) is used to control (1). -That is because one can obtain an overall -smaller regex list -by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. -Section 3 is dedicated to its proof. -In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is -bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller -than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands -for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). -We reason similarly for $\STAR$.\medskip -\end{proof} What guarantee does this bound give us?