diff -r 8c234a1bc7e0 -r b0f0d884a547 ChengsongTanPhdThesis/Chapters/Finite.tex --- a/ChengsongTanPhdThesis/Chapters/Finite.tex Tue Oct 11 13:09:47 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Finite.tex Wed Oct 12 14:01:33 2022 +0100 @@ -7,6 +7,19 @@ %of our bitcoded algorithm, that is a finite bound on the size of any %regex's derivatives. +\begin{figure} +\begin{center} + \begin{tabular}{ccc} + $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ + $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ + $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ + $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ + $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ + $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. + \end{tabular} +\end{center} +\caption{The size function of bitcoded regular expressions}\label{brexpSize} +\end{figure} In this chapter we give a guarantee in terms of size: given an annotated regular expression $a$, for any string $s$ our algorithm $\blexersimp$'s internal annotated regular expression @@ -17,26 +30,19 @@ \end{center} \noindent where the size of an annotated regular expression is defined -in terms of the number of nodes in its tree structure: -\begin{center} - \begin{tabular}{ccc} - $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ - $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ - $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ - $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ - $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ - $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. - \end{tabular} -\end{center} -We believe this size formalisation -of the algorithm is important in our context, because +in terms of the number of nodes in its tree structure (see figure \ref{brexpSize}). +We believe this size bound +is important in the context of POSIX lexing, because \begin{itemize} \item It is a stepping stone towards an ``absence of catastrophic-backtracking'' - guarantee. The next step would be to refine the bound $N_a$ so that it + guarantee. + If the internal data structures used by our algorithm cannot grow very large, + then our algorithm (which traverses those structures) cannot be too slow. + The next step would be to refine the bound $N_a$ so that it is polynomial on $\llbracket a\rrbracket$. \item - The size bound proof gives us a higher confidence that + Having it formalised gives us a higher confidence that our simplification algorithm $\simp$ does not ``mis-behave'' like $\simpsulz$ does. The bound is universal, which is an advantage over work which @@ -44,9 +50,9 @@ \end{itemize} \section{Formalising About Size} \noindent -In our lexer $\blexersimp$, -The regular expression is repeatedly being taken derivative of -and then simplified. +In our lexer ($\blexersimp$), +we take an annotated regular expression as input, +and repeately take derivative of and simplify it: \begin{figure}[H] \begin{tikzpicture}[scale=2, every node/.style={minimum size=11mm}, @@ -75,17 +81,17 @@ \end{figure} \noindent Each time -a derivative is taken, a regular expression might grow a bit, -but simplification always takes care that +a derivative is taken, the regular expression might grow. +However, the simplification that is immediately afterwards will always shrink it so that it stays small. This intuition is depicted by the relative size change between the black and blue nodes: After $\simp$ the node always shrinks. Our proof says that all the blue nodes -stay below a size bound $N_a$ determined by $a$. +stay below a size bound $N_a$ determined by the input $a$. \noindent -Sulzmann and Lu's assumed something similar about their algorithm, +Sulzmann and Lu's assumed a similar picture about their algorithm, though in fact their algorithm's size might be better depicted by the following graph: \begin{figure}[H] \begin{tikzpicture}[scale=2, @@ -118,19 +124,25 @@ \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks} \end{figure} \noindent -That is, on certain cases their lexer +The picture means that on certain cases their lexer (where they use $\simpsulz$ +as the simplification function) will have an indefinite size explosion, causing the running time -of each derivative step to grow arbitrarily large (for example +of each derivative step to grow continuously (for example in \ref{SulzmannLuLexerTime}). -The reason they made this mistake was that -they tested out the run time of their +They tested out the run time of their lexer on particular examples such as $(a+b+ab)^*$ -and generalised to all cases, which -cannot happen with our mecahnised proof.\\ -We give details of the proof in the next sections. +and claimed that their algorithm is linear w.r.t to the input. +With our mecahnised proof, we avoid this type of unintentional +generalisation.\\ + +Before delving in to the details of the formalisation, +we are going to provide an overview of it. +In the next subsection, we draw a picture of the bird's eye view +of the proof. + \subsection{Overview of the Proof} -Here is a bird's eye view of how the proof of finiteness works, +Here is a bird's eye view of the main components of the finiteness proof, which involves three steps: \begin{figure}[H] \begin{tikzpicture}[scale=1,font=\bf, @@ -187,7 +199,7 @@ \end{itemize} We will expand on these steps in the next sections.\\ -\section{The $\textit{Rrexp}$ Datatype and Its Lexing-Related Functions} +\section{The $\textit{Rrexp}$ Datatype} The first step is to define $\textit{rrexp}$s. They are without bitcodes, @@ -240,21 +252,8 @@ $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$ \end{tabular} \end{center} -\noindent -$\textit{Rerase}$ throws away the bitcodes -on the annotated regular expressions, -but keeps everything else intact. -Therefore it does not change the size -of an annotated regular expression: -\begin{lemma}\label{rsizeAsize} - $\rsize{\rerase a} = \asize a$ -\end{lemma} -\begin{proof} - By routine structural induction on $a$. -\end{proof} -\noindent -\subsection{Motivation Behind a New Datatype} +\subsection{Why a New Datatype?} The reason we take all the trouble defining a new datatype is that $\erase$ makes things harder. We initially started by using @@ -282,7 +281,7 @@ All these operations change the size and structure of an annotated regular expressions, adding unnecessary complexities to the size bound proof.\\ -For example, if we define the size of a basic regular expression +For example, if we define the size of a basic plain regular expression in the usual way, \begin{center} \begin{tabular}{ccc} @@ -297,19 +296,38 @@ \noindent Then the property \begin{center} - $\llbracket a \rrbracket = \llbracket a_\downarrow \rrbracket_p$ + $\llbracket a \rrbracket \stackrel{?}{=} \llbracket a_\downarrow \rrbracket_p$ \end{center} does not hold. +With $\textit{rerase}$, however, +only the bitcodes are thrown away. +Everything about the structure remains intact. +Therefore it does not change the size +of an annotated regular expression: +\begin{lemma}\label{rsizeAsize} + $\rsize{\rerase a} = \asize a$ +\end{lemma} +\begin{proof} + By routine structural induction on $a$. +\end{proof} +\noindent One might be able to prove an inequality such as $\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $ and then estimate $\llbracket a_\downarrow \rrbracket_p$, but we found our approach more straightforward.\\ -\subsection{Lexing Related Functions for $\rrexp$ such as $\backslash_r$, $\rdistincts$, and $\rsimp$} -The operations on r-regular expressions are -almost identical to those of the annotated regular expressions, -except that no bitcodes are used. For example, -the derivative operation becomes simpler:\\ +\subsection{Functions for R-regular Expressions} +We shall define the r-regular expression version +of $\blexer$ and $\blexersimp$ related functions. +We use $r$ as the prefix or subscript to differentiate +with the bitcoded version. +For example,$\backslash_r$, $\rdistincts$, and $\rsimp$ +as opposed to $\backslash$, $\distinctBy$, and $\bsimp$. +As promised, they are much simpler than their bitcoded counterparts. +%The operations on r-regular expressions are +%almost identical to those of the annotated regular expressions, +%except that no bitcodes are used. For example, +The derivative operation becomes simpler:\\ \begin{center} \begin{tabular}{@{}lcl@{}} $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ @@ -404,8 +422,14 @@ We do not define an r-regular expression version of $\blexersimp$, as our proof does not involve its use (and there is no bitcode to decode into a lexical value). -Everything about the size of annotated regular expressions -can be calculated via the size of r-regular expressions: +Now we are ready to introduce how r-regular expressions allow +us to prove the size bound on bitcoded regular expressions. + +\subsection{Using R-regular Expressions to Bound Bit-coded Regular Expressions} +Everything about the size of annotated regular expressions after the application +of function $\bsimp$ and $\backslash_{simps}$ +can be calculated via the size of r-regular expressions after the application +of $\rsimp$ and $\backslash_{rsimps}$: \begin{lemma}\label{sizeRelations} The following equalities hold: \begin{itemize} @@ -432,6 +456,7 @@ implies $\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$. \end{center} +From now on we Unless stated otherwise in the rest of this chapter all regular expressions without bitcodes are seen as r-regular expressions ($\rrexp$s). @@ -470,14 +495,9 @@ % %\begin{quote}\it % Claim: For regular expressions $r_1 \cdot r_2$, we claim that -% \begin{center} -% $ \rderssimp{r_1 \cdot r_2}{s} = -% \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ -% \end{center} %\end{quote} %\noindent %We explain in detail how we reached those claims. -\subsection{The Idea Behind Closed Forms} If we attempt to prove \begin{center} $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$ @@ -503,6 +523,7 @@ inductive hypotheses cannot be directly used. We have already seen that $(r_1 \cdot r_2)\backslash s$ and $(r^*)\backslash s$ can grow in a wild way. + The point is that they will be equivalent to a list of terms $\sum rs$, where each term in $rs$ will be made of $r_1 \backslash s' $, $r_2\backslash s'$, @@ -510,48 +531,60 @@ The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$ in the simplification which saves $rs$ from growing indefinitely. -Based on this idea, we sketch a proof by first showing the equality (where +Based on this idea, we develop a proof in two steps. +First, we show the equality (where $f$ and $g$ are functions that do not increase the size of the input) \begin{center} -$(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, +$r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, \end{center} -and then show the right-hand-side can be finitely bounded. +where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on. +For example, for $r_1 \cdot r_2$ we have the equality as + \begin{center} + $ \rderssimp{r_1 \cdot r_2}{s} = + \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ + \end{center} We call the right-hand-side the \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. -We will flesh out the proof sketch in the next section. +Second, we will bound the closed form of r-regular expressions +using some estimation techniques +and then piece it together +with lemma \ref{sizeRelations} to show the bitcoded regular expressions +in our $\blexersimp$ are finitely bounded. -\section{Details of Closed Forms and Bounds} +We will flesh out the first step of the proof we +sketched just now in the next section. + +\section{Closed Forms} In this section we introduce in detail how the closed forms are obtained for regular expressions' -derivatives and how they are bounded. -We start by proving some basic identities and inequalities +derivatives. +We start by proving some basic identities involving the simplification functions for r-regular expressions. -After that we use these identities to establish the -closed forms we need. -Finally, we prove the functions such as $\flts$ -will keep the size non-increasing. -Putting this together with a general bound -on the finiteness of distinct regular expressions -smaller than a certain size, we obtain a bound on -the closed forms. +After that we introduce the rewrite relations +$\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$ +$\rightsquigarrow_f$ and $\rightsquigarrow_g$. +These relations involves similar techniques in chapter \ref{Bitcoded2}. +Finally, we use these identities to establish the +closed forms of the alternative regular expression, +the sequence regular expression, and the star regular expression. %$r_1\cdot r_2$, $r^*$ and $\sum rs$. -\subsection{Some Basic Identities and Inequalities} +\subsection{Some Basic Identities} -In this subsection, we introduce lemmas +We now introduce lemmas that are repeatedly used in later proofs. Note that for the $\textit{rdistinct}$ function there will be a lot of conversion from lists to sets. -We use the name $set$ to refere to the +We use $set$ to refere to the function that converts a list $rs$ to the set containing all the elements in $rs$. \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication} The $\textit{rdistinct}$ function, as its name suggests, will -remove duplicates in an r-regular expression list. -It will also correctly exclude any elements that -is intially in the accumulator set. +de-duplicate an r-regular expression list. +It will also remove any elements that +is already in the accumulator set. \begin{lemma}\label{rdistinctDoesTheJob} %The function $\textit{rdistinct}$ satisfies the following %properties: @@ -559,7 +592,7 @@ recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} readily defined for testing - whether a list's elements are all unique. Then the following + whether a list's elements are unique. Then the following properties about $\textit{rdistinct}$ hold: \begin{itemize} \item @@ -581,10 +614,11 @@ \end{proof} \noindent -$\textit{rdistinct}$ will cancel out all regular expression terms -that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary -list $rs$ whose elements are all from the accumulator, and then call $\textit{rdistinct}$ -on the resulting list, the output will be as if we had called $\textit{rdistinct}$ +%$\textit{rdistinct}$ will out all regular expression terms +%that are in the accumulator, therefore +Concatenating a list $rs_a$ at the front of another +list $rs$ whose elements are all from the accumulator, and then calling $\textit{rdistinct}$ +on the merged list, the output will be as if we had called $\textit{rdistinct}$ without the prepending of $rs$: \begin{lemma}\label{rdistinctConcat} The elements appearing in the accumulator will always be removed. @@ -672,6 +706,8 @@ By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}. \end{proof} \noindent +The next lemma is a more general form of \ref{rdistinctConcat}, +it says that $\textit{rdistinct}$ is composable w.r.t list concatenation: \begin{lemma}\label{distinctRdistinctAppend} If $\;\; \textit{isDistinct} \; rs_1$, @@ -749,388 +785,16 @@ $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$ \end{lemma} \noindent -\subsubsection{$\textit{rsimp}$ Does Not Increment the Size} -Although it seems evident, we need a series -of non-trivial lemmas to establish this property. -\begin{lemma}\label{rsimpMonoLemmas} - \mbox{} - \begin{itemize} - \item - \[ - \llbracket \rsimpalts \; rs \rrbracket_r \leq - \llbracket \sum \; rs \rrbracket_r - \] - \item - \[ - \llbracket \rsimpseq \; r_1 \; r_2 \rrbracket_r \leq - \llbracket r_1 \cdot r_2 \rrbracket_r - \] - \item - \[ - \llbracket \rflts \; rs \rrbracket_r \leq - \llbracket rs \rrbracket_r - \] - \item - \[ - \llbracket \rDistinct \; rs \; ss \rrbracket_r \leq - \llbracket rs \rrbracket_r - \] - \item - If all elements $a$ in the set $as$ satisfy the property - that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq - \llbracket a \rrbracket_r$, then we have - \[ - \llbracket \; \rsimpalts \; (\textit{rdistinct} \; - (\textit{rflts} \; (\textit{map}\;\textit{rsimp} as)) \{\}) - \rrbracket \leq - \llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \; - \textit{rsimp} \; x))\; \{ \} ) \rrbracket_r - \] - \end{itemize} -\end{lemma} -\begin{proof} - Point 1, 3, 4 can be proven by an induction on $rs$. - Point 2 is by case analysis on $r_1$ and $r_2$. - The last part is a corollary of the previous ones. -\end{proof} -\noindent -With the lemmas for each inductive case in place, we are ready to get -the non-increasing property as a corollary: -\begin{corollary}\label{rsimpMono} - $\llbracket \textit{rsimp} \; r \rrbracket_r$ -\end{corollary} -\begin{proof} - By \ref{rsimpMonoLemmas}. -\end{proof} - -\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s} -Much like the definition of $L$ on plain regular expressions, one could also -define the language interpretation of $\rrexp$s. -\begin{center} - \begin{tabular}{lcl} - $RL \; (\ZERO)$ & $\dn$ & $\phi$\\ - $RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\ - $RL \; (c)$ & $\dn$ & $\{[c]\}$\\ - $RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ - $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ - $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ - \end{tabular} -\end{center} -\noindent -The main use of $RL$ is to establish some connections between $\rsimp{}$ -and $\rnullable{}$: -\begin{lemma} - The following properties hold: - \begin{itemize} - \item - If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$. - \item - $\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$. - \end{itemize} -\end{lemma} -\begin{proof} - The first part is by induction on $r$. - The second part is true because property - \[ RL \; r = RL \; (\rsimp{r})\] holds. -\end{proof} - -\subsubsection{$\rsimp{}$ is Non-Increasing} -In this subsection, we prove that the function $\rsimp{}$ does not -make the $\llbracket \rrbracket_r$ size increase. - - -\begin{lemma}\label{rsimpSize} - $\llbracket \rsimp{r} \rrbracket_r \leq \llbracket r \rrbracket_r$ -\end{lemma} -\subsubsection{Simplified $\textit{Rrexp}$s are Good} -We formalise the notion of ``good" regular expressions, -which means regular expressions that -are not fully simplified. For alternative regular expressions that means they -do not contain any nested alternatives like -\[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\] -or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]: -\begin{center} - \begin{tabular}{@{}lcl@{}} - $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ - $\good\; \RONE$ & $\dn$ & $\textit{true}$\\ - $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ - $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ - $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ - $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & - $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ - & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\ - $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ - $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ - $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ - $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ - $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ - \end{tabular} -\end{center} -\noindent -The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an -alternative, and false otherwise. -The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that -its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, -and unique: -\begin{lemma}\label{rsimpaltsGood} - If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, - then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$. -\end{lemma} -\noindent -We also note that -if a regular expression $r$ is good, then $\rflts$ on the singleton -list $[r]$ will not break goodness: -\begin{lemma}\label{flts2} - If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$. -\end{lemma} -\begin{proof} - By an induction on $r$. -\end{proof} -\noindent -The other observation we make about $\rsimp{r}$ is that it never -comes with nested alternatives, which we describe as the $\nonnested$ -property: -\begin{center} - \begin{tabular}{lcl} - $\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\ - $\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\ - $\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\ - $\nonnested \; \, r $ & $\dn$ & $\btrue$ - \end{tabular} -\end{center} -\noindent -The $\rflts$ function -always opens up nested alternatives, -which enables $\rsimp$ to be non-nested: +We need more equalities like the above to enable a closed form, +for which we need to introduce a few rewrite relations +to help +us obtain them. -\begin{lemma}\label{nonnestedRsimp} - $\nonnested \; (\rsimp{r})$ -\end{lemma} -\begin{proof} - By an induction on $r$. -\end{proof} -\noindent -With this we could prove that a regular expressions -after simplification and flattening and de-duplication, -will not contain any alternative regular expression directly: -\begin{lemma}\label{nonaltFltsRd} - If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ - then $\textit{nonAlt} \; x$. -\end{lemma} -\begin{proof} - By \ref{nonnestedRsimp}. -\end{proof} -\noindent -The other thing we know is that once $\rsimp{}$ had finished -processing an alternative regular expression, it will not -contain any $\RZERO$s, this is because all the recursive -calls to the simplification on the children regular expressions -make the children good, and $\rflts$ will not take out -any $\RZERO$s out of a good regular expression list, -and $\rdistinct{}$ will not mess with the result. -\begin{lemma}\label{flts3Obv} - The following are true: - \begin{itemize} - \item - If for all $r \in rs. \, \good \; r $ or $r = \RZERO$, - then for all $r \in \rflts\; rs. \, \good \; r$. - \item - If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$ - and for all $y$ such that $\llbracket y \rrbracket_r$ less than - $\llbracket rs \rrbracket_r + 1$, either - $\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$, - then $\good \; x$. - \end{itemize} -\end{lemma} -\begin{proof} - The first part is by induction on $rs$, where the induction - rule is the inductive cases for $\rflts$. - The second part is a corollary from the first part. -\end{proof} - -And this leads to good structural property of $\rsimp{}$, -that after simplification, a regular expression is -either good or $\RZERO$: -\begin{lemma}\label{good1} - For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$. -\end{lemma} -\begin{proof} - By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$. - Lemma \ref{rsimpSize} says that - $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to - $\llbracket r \rrbracket_r$. - Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, - Inductive hypothesis applies to the children regular expressions - $r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied - by that as well. - The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used - to ensure that goodness is preserved at the topmost level. -\end{proof} -We shall prove that any good regular expression is -a fixed-point for $\rsimp{}$. -First we prove an auxiliary lemma: -\begin{lemma}\label{goodaltsNonalt} - If $\good \; \sum rs$, then $\rflts\; rs = rs$. -\end{lemma} -\begin{proof} - By an induction on $\sum rs$. The inductive rules are the cases - for $\good$. -\end{proof} -\noindent -Now we are ready to prove that good regular expressions are invariant -of $\rsimp{}$ application: -\begin{lemma}\label{test} - If $\good \;r$ then $\rsimp{r} = r$. -\end{lemma} -\begin{proof} - By an induction on the inductive cases of $\good$, using lemmas - \ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}. - The lemma \ref{goodaltsNonalt} is used in the alternative - case where 2 or more elements are present in the list. -\end{proof} -\noindent -Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$, -which requires $\ref{good1}$ to go through smoothly. -It says that an application of $\rsimp_{ALTS}$ can be "absorbed", -if it its output is concatenated with a list and then applied to $\rflts$. -\begin{lemma}\label{flattenRsimpalts} - $\rflts \; ( (\rsimp_{ALTS} \; - (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: - \map \; \rsimp{} \; rs' ) = - \rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ ( - \map \; \rsimp{rs'}))$ - - -\end{lemma} -\begin{proof} - By \ref{good1}. -\end{proof} -\noindent - - - - - -We are also ready to prove that $\textit{rsimp}$ is idempotent. -\subsubsection{$\rsimp$ is Idempotent} -The idempotency of $\rsimp$ is very useful in -manipulating regular expression terms into desired -forms so that key steps allowing further rewriting to closed forms -are possible. -\begin{lemma}\label{rsimpIdem} - $\rsimp{r} = \rsimp{\rsimp{r}}$ -\end{lemma} - -\begin{proof} - By \ref{test} and \ref{good1}. -\end{proof} -\noindent -This property means we do not have to repeatedly -apply simplification in each step, which justifies -our definition of $\blexersimp$. - - -On the other hand, we could repeat the same $\rsimp{}$ applications -on regular expressions as many times as we want, if we have at least -one simplification applied to it, and apply it wherever we would like to: -\begin{corollary}\label{headOneMoreSimp} - The following properties hold, directly from \ref{rsimpIdem}: - - \begin{itemize} - \item - $\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$ - \item - $\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$ - \end{itemize} -\end{corollary} -\noindent -This will be useful in later closed form proof's rewriting steps. -Similarly, we point out the following useful facts below: -\begin{lemma} - The following equalities hold if $r = \rsimp{r'}$ for some $r'$: - \begin{itemize} - \item - If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$. - \item - If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$. - \item - $\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$. - \end{itemize} -\end{lemma} -\begin{proof} - By application of lemmas \ref{rsimpIdem} and \ref{good1}. -\end{proof} - -\noindent -With the idempotency of $\rsimp{}$ and its corollaries, -we can start proving some key equalities leading to the -closed forms. -Now presented are a few equivalent terms under $\rsimp{}$. -We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$. -\begin{lemma} - \begin{itemize} - The following equivalence hold: - \item - $\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$ - \item - $\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$ - \item - $\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$ - \item - $\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$ - \item - $\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$ -\end{itemize} -\end{lemma} -\begin{proof} - By induction on the lists involved. -\end{proof} -\noindent -The above allows us to prove -two similar equalities (which are a bit more involved). -It says that we could flatten out the elements -before simplification and still get the same result. -\begin{lemma}\label{simpFlatten3} - One can flatten the inside $\sum$ of a $\sum$ if it is being - simplified. Concretely, - \begin{itemize} - \item - If for all $r \in rs, rs', rs''$, we have $\good \; r $ - or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal - \sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary, - \item - $\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$ - \end{itemize} -\end{lemma} -\begin{proof} - By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}. - The second sub-lemma is a corollary of the previous. -\end{proof} -%Rewriting steps not put in--too long and complicated------------------------------- -\begin{comment} - \begin{center} - $\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\ - $\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\ - $\stackrel{by \ref{test}}{=} - \rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{ - \varnothing})$\\ - $\stackrel{by \ref{rdistinctConcatGeneral}}{=} - \rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{( - \rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\ - - \end{center} -\end{comment} -%Rewriting steps not put in--too long and complicated------------------------------- -\noindent -We need more equalities like the above to enable a closed form, -but to proceed we need to introduce two rewrite relations, -to make things smoother. \subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$} -Insired by the success we had in the correctness proof -in \ref{Bitcoded2}, where we invented -a term rewriting system to capture the similarity between terms, -we follow suit here defining simplification -steps as rewriting steps. This allows capturing +Inspired by the success we had in the correctness proof +in \ref{Bitcoded2}, +we follow suit here, defining atomic simplification +steps as ``small-step'' rewriting steps. This allows capturing similarities between terms that would be otherwise hard to express. @@ -1138,10 +802,12 @@ regular expression simplification, $\frewrite$ for rewrite of list of regular expressions that include all operations carried out in $\rflts$, and $\grewrite$ for -rewriting a list of regular expressions possible in both $\rflts$ and $\rdistinct{}{}$. +rewriting a list of regular expressions possible in both $\rflts$ and $\textit{rdistinct}$. Their reflexive transitive closures are used to denote zero or many steps, as was the case in the previous chapter. -The presentation will be more concise than that in \ref{Bitcoded2}. +As we have already +done something similar, the presentation about +these rewriting rules will be more concise than that in \ref{Bitcoded2}. To differentiate between the rewriting steps for annotated regular expressions and $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbol to mean atomic simplification transitions @@ -1149,9 +815,8 @@ -List of one-step rewrite rules for $\rrexp$ ($\hrewrite$): - +\begin{figure}[H] \begin{center} \begin{mathpar} \inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} @@ -1178,22 +843,30 @@ \end{mathpar} \end{center} +\caption{List of one-step rewrite rules for r-regular expressions ($\hrewrite$)}\label{hRewrite} +\end{figure} -List of rewrite rules for a list of regular expressions, +Like $\rightsquigarrow_s$, it is +convenient to define rewrite rules for a list of regular expressions, where each element can rewrite in many steps to the other (scf stands for li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the $\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions. +\begin{figure}[H] \begin{center} \begin{mathpar} \inferrule{}{[] \scfrewrites [] } + \inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'} \end{mathpar} \end{center} +\caption{List of one-step rewrite rules for a list of r-regular expressions}\label{scfRewrite} +\end{figure} %frewrite List of one-step rewrite rules for flattening a list of regular expressions($\frewrite$): +\begin{figure}[H] \begin{center} \begin{mathpar} \inferrule{}{\RZERO :: rs \frewrite rs \\} @@ -1203,9 +876,12 @@ \inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2} \end{mathpar} \end{center} +\caption{List of one-step rewrite rules characterising the $\rflts$ operation on a list}\label{fRewrites} +\end{figure} Lists of one-step rewrite rules for flattening and de-duplicating a list of regular expressions ($\grewrite$): +\begin{figure}[H] \begin{center} \begin{mathpar} \inferrule{}{\RZERO :: rs \grewrite rs \\} @@ -1217,23 +893,27 @@ \inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc} \end{mathpar} \end{center} - +\caption{List of one-step rewrite rules characterising the $\rflts$ and $\textit{rdistinct}$ +operations}\label{gRewrite} +\end{figure} \noindent We defined -two separate list rewriting definitions $\frewrite$ and $\grewrite$. +two separate list rewriting relations $\frewrite$ and $\grewrite$. The rewriting steps that take place during flattening is characterised by $\frewrite$. $\grewrite$ characterises both flattening and de-duplicating. Sometimes $\grewrites$ is slightly too powerful -so we would rather use $\frewrites$ because we only -need to prove about certain equivalence under the rewriting steps of $\frewrites$. +so we would rather use $\frewrites$ to prove +%because we only +equalities related to $\rflts$. +%certain equivalence under the rewriting steps of $\frewrites$. For example, when proving the closed-form for the alternative regular expression, -one of the rewriting steps would be: -\begin{lemma} +one of the equalities needed is: +\begin{center} $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) $ -\end{lemma} +\end{center} \noindent Proving this is by first showing \begin{lemma}\label{earlyLaterDerFrewrites} @@ -1241,13 +921,20 @@ \rflts \; (\map \; (\_ \backslash x) \; rs)$ \end{lemma} \noindent -and then using lemma +and then the equivalence between two terms +that can reduce in many steps to each other. \begin{lemma}\label{frewritesSimpeq} If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal \sum (\rDistinct \; rs_2 \; \varnothing)$. \end{lemma} \noindent -is a piece of cake. + +\begin{corollary} + $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal + \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) + $ +\end{corollary} + But this trick will not work for $\grewrites$. For example, a rewriting step in proving closed forms is: @@ -1268,14 +955,21 @@ For this rewriting step we will introduce some slightly more cumbersome proof technique in later sections. The point is that $\frewrite$ -allows us to prove equivalence in a straightforward two-step method that is -not possible for $\grewrite$, thereby reducing the complexity of the entire proof. +allows us to prove equivalence in a straightforward way that is +not possible for $\grewrite$. \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$} +In this part, we present lemmas stating +pairs of r-regular expressions or r-regular expression lists +where one could rewrite from one in many steps to the other. +Most of the proofs to these lemmas are straightforward, using +an induction on the inductive cases of the corresponding rewriting relations. +These proofs will therefore be omitted when this is the case. We present in the below lemma a few pairs of terms that are rewritable via $\grewrites$: \begin{lemma}\label{gstarRdistinctGeneral} + \mbox{} \begin{itemize} \item $rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$ @@ -1295,7 +989,7 @@ then they are equivalent under $\rsimp{}$: \begin{lemma}\label{grewritesSimpalts} If $rs_1 \grewrites rs_2$, then - we have the following equivalence hold: + we have the following equivalence: \begin{itemize} \item $\sum rs_1 \sequal \sum rs_2$ @@ -1360,10 +1054,13 @@ \end{lemma} \begin{proof} Part 1 is by lemma \ref{insideSimpRemoval}, - part 2 is by lemmas \ref{insideSimpRemoval} and \ref{distinctDer}. + part 2 is by lemma \ref{insideSimpRemoval} .%and \ref{distinctDer}. \end{proof} \noindent -This leads to the first closed form-- + +\subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$} +\subsubsection{Closed Form for Alternative Regular Expression} +Lemma \ref{Simpders} leads to the first closed form-- \begin{lemma}\label{altsClosedForm} \begin{center} $\rderssimp{(\sum rs)}{s} \sequal @@ -1398,18 +1095,11 @@ \end{corollary} \noindent The harder closed forms are the sequence and star ones. -Before we go on to obtain them, some preliminary definitions +Before we obtain them, some preliminary definitions are needed to make proof statements concise. - - -\subsection{Closed Forms} \subsubsection{Closed Form for Sequence Regular Expressions} -The problem of obataining a closed-form for sequence regular expression -is constructing $(r_1 \cdot r_2) \backslash_r s$ -if we are only allowed to use a combination of $r_1 \backslash s''$ -and $r_2 \backslash s''$ , where $s''$ is from $s$. First let's look at a series of derivatives steps on a sequence regular expression, assuming that each time the first component of the sequence is always nullable): @@ -1815,6 +1505,441 @@ +%---------------------------------------------------------------------------------------- +% SECTION ?? +%---------------------------------------------------------------------------------------- + +%----------------------------------- +% SECTION syntactic equivalence under simp +%----------------------------------- + + +%---------------------------------------------------------------------------------------- +% SECTION ALTS CLOSED FORM +%---------------------------------------------------------------------------------------- +%\section{A Closed Form for \textit{ALTS}} +%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. +% +% +%There are a few key steps, one of these steps is +% +% +% +%One might want to prove this by something a simple statement like: +% +%For this to hold we want the $\textit{distinct}$ function to pick up +%the elements before and after derivatives correctly: +%$r \in rset \equiv (rder x r) \in (rder x rset)$. +%which essentially requires that the function $\backslash$ is an injective mapping. +% +%Unfortunately the function $\backslash c$ is not an injective mapping. +% +%\subsection{function $\backslash c$ is not injective (1-to-1)} +%\begin{center} +% The derivative $w.r.t$ character $c$ is not one-to-one. +% Formally, +% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ +%\end{center} +%This property is trivially true for the +%character regex example: +%\begin{center} +% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ +%\end{center} +%But apart from the cases where the derivative +%output is $\ZERO$, are there non-trivial results +%of derivatives which contain strings? +%The answer is yes. +%For example, +%\begin{center} +% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ +% where $a$ is not nullable.\\ +% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ +% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ +%\end{center} +%We start with two syntactically different regular expressions, +%and end up with the same derivative result. +%This is not surprising as we have such +%equality as below in the style of Arden's lemma:\\ +%\begin{center} +% $L(A^*B) = L(A\cdot A^* \cdot B + B)$ +%\end{center} +\section{Bounding Closed Forms} + +In this section, we introduce how we formalised the bound +on closed forms. +We first prove that functions such as $\rflts$ +will not cause the size of r-regular expressions to grow. +Putting this together with a general bound +on the finiteness of distinct regular expressions +smaller than a certain size, we obtain a bound on +the closed forms. + +\subsection{$\textit{rsimp}$ Does Not Increment the Size} +Although it seems evident, we need a series +of non-trivial lemmas to establish that functions such as $\rflts$ +do not cause the regular expressions to grow. +\begin{lemma}\label{rsimpMonoLemmas} + \mbox{} + \begin{itemize} + \item + \[ + \llbracket \rsimpalts \; rs \rrbracket_r \leq + \llbracket \sum \; rs \rrbracket_r + \] + \item + \[ + \llbracket \rsimpseq \; r_1 \; r_2 \rrbracket_r \leq + \llbracket r_1 \cdot r_2 \rrbracket_r + \] + \item + \[ + \llbracket \rflts \; rs \rrbracket_r \leq + \llbracket rs \rrbracket_r + \] + \item + \[ + \llbracket \rDistinct \; rs \; ss \rrbracket_r \leq + \llbracket rs \rrbracket_r + \] + \item + If all elements $a$ in the set $as$ satisfy the property + that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq + \llbracket a \rrbracket_r$, then we have + \[ + \llbracket \; \rsimpalts \; (\textit{rdistinct} \; + (\textit{rflts} \; (\textit{map}\;\textit{rsimp} as)) \{\}) + \rrbracket \leq + \llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \; + \textit{rsimp} \; x))\; \{ \} ) \rrbracket_r + \] + \end{itemize} +\end{lemma} +\begin{proof} + Point 1, 3, 4 can be proven by an induction on $rs$. + Point 2 is by case analysis on $r_1$ and $r_2$. + The last part is a corollary of the previous ones. +\end{proof} +\noindent +With the lemmas for each inductive case in place, we are ready to get +the non-increasing property as a corollary: +\begin{corollary}\label{rsimpMono} + $\llbracket \textit{rsimp} \; r \rrbracket_r \leq \llbracket r \rrbracket_r$ +\end{corollary} +\begin{proof} + By \ref{rsimpMonoLemmas}. +\end{proof} + +\subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s} +Much like the definition of $L$ on plain regular expressions, one could also +define the language interpretation of $\rrexp$s. +\begin{center} + \begin{tabular}{lcl} + $RL \; (\ZERO)$ & $\dn$ & $\phi$\\ + $RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\ + $RL \; (c)$ & $\dn$ & $\{[c]\}$\\ + $RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ + $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ + $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ + \end{tabular} +\end{center} +\noindent +The main use of $RL$ is to establish some connections between $\rsimp{}$ +and $\rnullable{}$: +\begin{lemma} + The following properties hold: + \begin{itemize} + \item + If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$. + \item + $\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$. + \end{itemize} +\end{lemma} +\begin{proof} + The first part is by induction on $r$. + The second part is true because property + \[ RL \; r = RL \; (\rsimp{r})\] holds. +\end{proof} + +\subsubsection{Simplified $\textit{Rrexp}$s are Good} +We formalise the notion of ``good" regular expressions, +which means regular expressions that +are not fully simplified. For alternative regular expressions that means they +do not contain any nested alternatives like +\[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\] +or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]: +\begin{center} + \begin{tabular}{@{}lcl@{}} + $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ + $\good\; \RONE$ & $\dn$ & $\textit{true}$\\ + $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ + $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ + $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ + $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & + $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ + & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\ + $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ + $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ + $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ + $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ + $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ + \end{tabular} +\end{center} +\noindent +The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an +alternative, and false otherwise. +The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that +its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, +and unique: +\begin{lemma}\label{rsimpaltsGood} + If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, + then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$. +\end{lemma} +\noindent +We also note that +if a regular expression $r$ is good, then $\rflts$ on the singleton +list $[r]$ will not break goodness: +\begin{lemma}\label{flts2} + If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$. +\end{lemma} +\begin{proof} + By an induction on $r$. +\end{proof} +\noindent +The other observation we make about $\rsimp{r}$ is that it never +comes with nested alternatives, which we describe as the $\nonnested$ +property: +\begin{center} + \begin{tabular}{lcl} + $\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\ + $\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\ + $\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\ + $\nonnested \; \, r $ & $\dn$ & $\btrue$ + \end{tabular} +\end{center} +\noindent +The $\rflts$ function +always opens up nested alternatives, +which enables $\rsimp$ to be non-nested: + +\begin{lemma}\label{nonnestedRsimp} + $\nonnested \; (\rsimp{r})$ +\end{lemma} +\begin{proof} + By an induction on $r$. +\end{proof} +\noindent +With this we could prove that a regular expressions +after simplification and flattening and de-duplication, +will not contain any alternative regular expression directly: +\begin{lemma}\label{nonaltFltsRd} + If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ + then $\textit{nonAlt} \; x$. +\end{lemma} +\begin{proof} + By \ref{nonnestedRsimp}. +\end{proof} +\noindent +The other thing we know is that once $\rsimp{}$ had finished +processing an alternative regular expression, it will not +contain any $\RZERO$s, this is because all the recursive +calls to the simplification on the children regular expressions +make the children good, and $\rflts$ will not take out +any $\RZERO$s out of a good regular expression list, +and $\rdistinct{}$ will not mess with the result. +\begin{lemma}\label{flts3Obv} + The following are true: + \begin{itemize} + \item + If for all $r \in rs. \, \good \; r $ or $r = \RZERO$, + then for all $r \in \rflts\; rs. \, \good \; r$. + \item + If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$ + and for all $y$ such that $\llbracket y \rrbracket_r$ less than + $\llbracket rs \rrbracket_r + 1$, either + $\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$, + then $\good \; x$. + \end{itemize} +\end{lemma} +\begin{proof} + The first part is by induction on $rs$, where the induction + rule is the inductive cases for $\rflts$. + The second part is a corollary from the first part. +\end{proof} + +And this leads to good structural property of $\rsimp{}$, +that after simplification, a regular expression is +either good or $\RZERO$: +\begin{lemma}\label{good1} + For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$. +\end{lemma} +\begin{proof} + By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$. + Lemma \ref{rsimpSize} says that + $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to + $\llbracket r \rrbracket_r$. + Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, + Inductive hypothesis applies to the children regular expressions + $r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied + by that as well. + The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used + to ensure that goodness is preserved at the topmost level. +\end{proof} +We shall prove that any good regular expression is +a fixed-point for $\rsimp{}$. +First we prove an auxiliary lemma: +\begin{lemma}\label{goodaltsNonalt} + If $\good \; \sum rs$, then $\rflts\; rs = rs$. +\end{lemma} +\begin{proof} + By an induction on $\sum rs$. The inductive rules are the cases + for $\good$. +\end{proof} +\noindent +Now we are ready to prove that good regular expressions are invariant +of $\rsimp{}$ application: +\begin{lemma}\label{test} + If $\good \;r$ then $\rsimp{r} = r$. +\end{lemma} +\begin{proof} + By an induction on the inductive cases of $\good$, using lemmas + \ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}. + The lemma \ref{goodaltsNonalt} is used in the alternative + case where 2 or more elements are present in the list. +\end{proof} +\noindent +Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$, +which requires $\ref{good1}$ to go through smoothly. +It says that an application of $\rsimp_{ALTS}$ can be "absorbed", +if it its output is concatenated with a list and then applied to $\rflts$. +\begin{lemma}\label{flattenRsimpalts} + $\rflts \; ( (\rsimp_{ALTS} \; + (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: + \map \; \rsimp{} \; rs' ) = + \rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ ( + \map \; \rsimp{rs'}))$ + + +\end{lemma} +\begin{proof} + By \ref{good1}. +\end{proof} +\noindent + + + + + +We are also ready to prove that $\textit{rsimp}$ is idempotent. +\subsubsection{$\rsimp$ is Idempotent} +The idempotency of $\rsimp$ is very useful in +manipulating regular expression terms into desired +forms so that key steps allowing further rewriting to closed forms +are possible. +\begin{lemma}\label{rsimpIdem} + $\rsimp{r} = \rsimp{\rsimp{r}}$ +\end{lemma} + +\begin{proof} + By \ref{test} and \ref{good1}. +\end{proof} +\noindent +This property means we do not have to repeatedly +apply simplification in each step, which justifies +our definition of $\blexersimp$. + + +On the other hand, we could repeat the same $\rsimp{}$ applications +on regular expressions as many times as we want, if we have at least +one simplification applied to it, and apply it wherever we would like to: +\begin{corollary}\label{headOneMoreSimp} + The following properties hold, directly from \ref{rsimpIdem}: + + \begin{itemize} + \item + $\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$ + \item + $\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$ + \end{itemize} +\end{corollary} +\noindent +This will be useful in later closed form proof's rewriting steps. +Similarly, we point out the following useful facts below: +\begin{lemma} + The following equalities hold if $r = \rsimp{r'}$ for some $r'$: + \begin{itemize} + \item + If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$. + \item + If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$. + \item + $\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$. + \end{itemize} +\end{lemma} +\begin{proof} + By application of lemmas \ref{rsimpIdem} and \ref{good1}. +\end{proof} + +\noindent +With the idempotency of $\rsimp{}$ and its corollaries, +we can start proving some key equalities leading to the +closed forms. +Now presented are a few equivalent terms under $\rsimp{}$. +We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$. +\begin{lemma} + \begin{itemize} + The following equivalence hold: + \item + $\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$ + \item + $\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$ + \item + $\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$ + \item + $\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$ + \item + $\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$ +\end{itemize} +\end{lemma} +\begin{proof} + By induction on the lists involved. +\end{proof} +\noindent +The above allows us to prove +two similar equalities (which are a bit more involved). +It says that we could flatten out the elements +before simplification and still get the same result. +\begin{lemma}\label{simpFlatten3} + One can flatten the inside $\sum$ of a $\sum$ if it is being + simplified. Concretely, + \begin{itemize} + \item + If for all $r \in rs, rs', rs''$, we have $\good \; r $ + or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal + \sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary, + \item + $\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$ + \end{itemize} +\end{lemma} +\begin{proof} + By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}. + The second sub-lemma is a corollary of the previous. +\end{proof} +%Rewriting steps not put in--too long and complicated------------------------------- +\begin{comment} + \begin{center} + $\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\ + $\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\ + $\stackrel{by \ref{test}}{=} + \rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{ + \varnothing})$\\ + $\stackrel{by \ref{rdistinctConcatGeneral}}{=} + \rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{( + \rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\ + + \end{center} +\end{comment} +%Rewriting steps not put in--too long and complicated------------------------------- +\noindent \subsection{Estimating the Closed Forms' sizes} We now summarize the closed forms below: \begin{itemize} @@ -2254,66 +2379,7 @@ -%---------------------------------------------------------------------------------------- -% SECTION ?? -%---------------------------------------------------------------------------------------- - -%----------------------------------- -% SECTION syntactic equivalence under simp -%----------------------------------- - - -%---------------------------------------------------------------------------------------- -% SECTION ALTS CLOSED FORM -%---------------------------------------------------------------------------------------- -%\section{A Closed Form for \textit{ALTS}} -%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. -% -% -%There are a few key steps, one of these steps is -% -% -% -%One might want to prove this by something a simple statement like: -% -%For this to hold we want the $\textit{distinct}$ function to pick up -%the elements before and after derivatives correctly: -%$r \in rset \equiv (rder x r) \in (rder x rset)$. -%which essentially requires that the function $\backslash$ is an injective mapping. -% -%Unfortunately the function $\backslash c$ is not an injective mapping. -% -%\subsection{function $\backslash c$ is not injective (1-to-1)} -%\begin{center} -% The derivative $w.r.t$ character $c$ is not one-to-one. -% Formally, -% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ -%\end{center} -%This property is trivially true for the -%character regex example: -%\begin{center} -% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ -%\end{center} -%But apart from the cases where the derivative -%output is $\ZERO$, are there non-trivial results -%of derivatives which contain strings? -%The answer is yes. -%For example, -%\begin{center} -% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ -% where $a$ is not nullable.\\ -% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ -% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ -%\end{center} -%We start with two syntactically different regular expressions, -%and end up with the same derivative result. -%This is not surprising as we have such -%equality as below in the style of Arden's lemma:\\ -%\begin{center} -% $L(A^*B) = L(A\cdot A^* \cdot B + B)$ -%\end{center} - -\section{Further Improvements to the Bound} +\section{Possible Further Improvements} There are two problems with this finiteness result, though. \begin{itemize} \item