5 \label{Finite} |
5 \label{Finite} |
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
8 %regex's derivatives. |
8 %regex's derivatives. |
9 |
9 |
10 In this chapter we give a guarantee in terms of size: |
10 \begin{figure} |
11 given an annotated regular expression $a$, for any string $s$ |
|
12 our algorithm $\blexersimp$'s internal annotated regular expression |
|
13 size is finitely bounded |
|
14 by a constant $N_a$ that only depends on $a$: |
|
15 \begin{center} |
|
16 $\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$ |
|
17 \end{center} |
|
18 \noindent |
|
19 where the size of an annotated regular expression is defined |
|
20 in terms of the number of nodes in its tree structure: |
|
21 \begin{center} |
11 \begin{center} |
22 \begin{tabular}{ccc} |
12 \begin{tabular}{ccc} |
23 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
13 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
24 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
14 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
25 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
15 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
26 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
16 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
27 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
17 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
28 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. |
18 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. |
29 \end{tabular} |
19 \end{tabular} |
30 \end{center} |
20 \end{center} |
31 We believe this size formalisation |
21 \caption{The size function of bitcoded regular expressions}\label{brexpSize} |
32 of the algorithm is important in our context, because |
22 \end{figure} |
|
23 In this chapter we give a guarantee in terms of size: |
|
24 given an annotated regular expression $a$, for any string $s$ |
|
25 our algorithm $\blexersimp$'s internal annotated regular expression |
|
26 size is finitely bounded |
|
27 by a constant $N_a$ that only depends on $a$: |
|
28 \begin{center} |
|
29 $\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$ |
|
30 \end{center} |
|
31 \noindent |
|
32 where the size of an annotated regular expression is defined |
|
33 in terms of the number of nodes in its tree structure (see figure \ref{brexpSize}). |
|
34 We believe this size bound |
|
35 is important in the context of POSIX lexing, because |
33 \begin{itemize} |
36 \begin{itemize} |
34 \item |
37 \item |
35 It is a stepping stone towards an ``absence of catastrophic-backtracking'' |
38 It is a stepping stone towards an ``absence of catastrophic-backtracking'' |
36 guarantee. The next step would be to refine the bound $N_a$ so that it |
39 guarantee. |
|
40 If the internal data structures used by our algorithm cannot grow very large, |
|
41 then our algorithm (which traverses those structures) cannot be too slow. |
|
42 The next step would be to refine the bound $N_a$ so that it |
37 is polynomial on $\llbracket a\rrbracket$. |
43 is polynomial on $\llbracket a\rrbracket$. |
38 \item |
44 \item |
39 The size bound proof gives us a higher confidence that |
45 Having it formalised gives us a higher confidence that |
40 our simplification algorithm $\simp$ does not ``mis-behave'' |
46 our simplification algorithm $\simp$ does not ``mis-behave'' |
41 like $\simpsulz$ does. |
47 like $\simpsulz$ does. |
42 The bound is universal, which is an advantage over work which |
48 The bound is universal, which is an advantage over work which |
43 only gives empirical evidence on some test cases. |
49 only gives empirical evidence on some test cases. |
44 \end{itemize} |
50 \end{itemize} |
45 \section{Formalising About Size} |
51 \section{Formalising About Size} |
46 \noindent |
52 \noindent |
47 In our lexer $\blexersimp$, |
53 In our lexer ($\blexersimp$), |
48 The regular expression is repeatedly being taken derivative of |
54 we take an annotated regular expression as input, |
49 and then simplified. |
55 and repeately take derivative of and simplify it: |
50 \begin{figure}[H] |
56 \begin{figure}[H] |
51 \begin{tikzpicture}[scale=2, |
57 \begin{tikzpicture}[scale=2, |
52 every node/.style={minimum size=11mm}, |
58 every node/.style={minimum size=11mm}, |
53 ->,>=stealth',shorten >=1pt,auto,thick |
59 ->,>=stealth',shorten >=1pt,auto,thick |
54 ] |
60 ] |
501 $(r_1\cdot r_2) \backslash_{rsimps} s$ looks like, |
521 $(r_1\cdot r_2) \backslash_{rsimps} s$ looks like, |
502 and therefore $N_{r_1}$ and $N_{r_2}$ in the |
522 and therefore $N_{r_1}$ and $N_{r_2}$ in the |
503 inductive hypotheses cannot be directly used. |
523 inductive hypotheses cannot be directly used. |
504 We have already seen that $(r_1 \cdot r_2)\backslash s$ |
524 We have already seen that $(r_1 \cdot r_2)\backslash s$ |
505 and $(r^*)\backslash s$ can grow in a wild way. |
525 and $(r^*)\backslash s$ can grow in a wild way. |
|
526 |
506 The point is that they will be equivalent to a list of |
527 The point is that they will be equivalent to a list of |
507 terms $\sum rs$, where each term in $rs$ will |
528 terms $\sum rs$, where each term in $rs$ will |
508 be made of $r_1 \backslash s' $, $r_2\backslash s'$, |
529 be made of $r_1 \backslash s' $, $r_2\backslash s'$, |
509 and $r \backslash s'$ with $s' \in \textit{SubString} \; s$. |
530 and $r \backslash s'$ with $s' \in \textit{SubString} \; s$. |
510 The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$ |
531 The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$ |
511 in the simplification which saves $rs$ from growing indefinitely. |
532 in the simplification which saves $rs$ from growing indefinitely. |
512 |
533 |
513 Based on this idea, we sketch a proof by first showing the equality (where |
534 Based on this idea, we develop a proof in two steps. |
|
535 First, we show the equality (where |
514 $f$ and $g$ are functions that do not increase the size of the input) |
536 $f$ and $g$ are functions that do not increase the size of the input) |
515 \begin{center} |
537 \begin{center} |
516 $(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, |
538 $r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, |
517 \end{center} |
539 \end{center} |
518 and then show the right-hand-side can be finitely bounded. |
540 where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on. |
|
541 For example, for $r_1 \cdot r_2$ we have the equality as |
|
542 \begin{center} |
|
543 $ \rderssimp{r_1 \cdot r_2}{s} = |
|
544 \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ |
|
545 \end{center} |
519 We call the right-hand-side the |
546 We call the right-hand-side the |
520 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. |
547 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. |
521 We will flesh out the proof sketch in the next section. |
548 Second, we will bound the closed form of r-regular expressions |
522 |
549 using some estimation techniques |
523 \section{Details of Closed Forms and Bounds} |
550 and then piece it together |
|
551 with lemma \ref{sizeRelations} to show the bitcoded regular expressions |
|
552 in our $\blexersimp$ are finitely bounded. |
|
553 |
|
554 We will flesh out the first step of the proof we |
|
555 sketched just now in the next section. |
|
556 |
|
557 \section{Closed Forms} |
524 In this section we introduce in detail |
558 In this section we introduce in detail |
525 how the closed forms are obtained for regular expressions' |
559 how the closed forms are obtained for regular expressions' |
526 derivatives and how they are bounded. |
560 derivatives. |
527 We start by proving some basic identities and inequalities |
561 We start by proving some basic identities |
528 involving the simplification functions for r-regular expressions. |
562 involving the simplification functions for r-regular expressions. |
529 After that we use these identities to establish the |
563 After that we introduce the rewrite relations |
530 closed forms we need. |
564 $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$ |
531 Finally, we prove the functions such as $\flts$ |
565 $\rightsquigarrow_f$ and $\rightsquigarrow_g$. |
532 will keep the size non-increasing. |
566 These relations involves similar techniques in chapter \ref{Bitcoded2}. |
533 Putting this together with a general bound |
567 Finally, we use these identities to establish the |
534 on the finiteness of distinct regular expressions |
568 closed forms of the alternative regular expression, |
535 smaller than a certain size, we obtain a bound on |
569 the sequence regular expression, and the star regular expression. |
536 the closed forms. |
|
537 %$r_1\cdot r_2$, $r^*$ and $\sum rs$. |
570 %$r_1\cdot r_2$, $r^*$ and $\sum rs$. |
538 |
571 |
539 |
572 |
540 |
573 |
541 \subsection{Some Basic Identities and Inequalities} |
574 \subsection{Some Basic Identities} |
542 |
575 |
543 In this subsection, we introduce lemmas |
576 We now introduce lemmas |
544 that are repeatedly used in later proofs. |
577 that are repeatedly used in later proofs. |
545 Note that for the $\textit{rdistinct}$ function there |
578 Note that for the $\textit{rdistinct}$ function there |
546 will be a lot of conversion from lists to sets. |
579 will be a lot of conversion from lists to sets. |
547 We use the name $set$ to refere to the |
580 We use $set$ to refere to the |
548 function that converts a list $rs$ to the set |
581 function that converts a list $rs$ to the set |
549 containing all the elements in $rs$. |
582 containing all the elements in $rs$. |
550 \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication} |
583 \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication} |
551 The $\textit{rdistinct}$ function, as its name suggests, will |
584 The $\textit{rdistinct}$ function, as its name suggests, will |
552 remove duplicates in an r-regular expression list. |
585 de-duplicate an r-regular expression list. |
553 It will also correctly exclude any elements that |
586 It will also remove any elements that |
554 is intially in the accumulator set. |
587 is already in the accumulator set. |
555 \begin{lemma}\label{rdistinctDoesTheJob} |
588 \begin{lemma}\label{rdistinctDoesTheJob} |
556 %The function $\textit{rdistinct}$ satisfies the following |
589 %The function $\textit{rdistinct}$ satisfies the following |
557 %properties: |
590 %properties: |
558 Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its |
591 Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its |
559 recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} |
592 recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} |
560 readily defined |
593 readily defined |
561 for testing |
594 for testing |
562 whether a list's elements are all unique. Then the following |
595 whether a list's elements are unique. Then the following |
563 properties about $\textit{rdistinct}$ hold: |
596 properties about $\textit{rdistinct}$ hold: |
564 \begin{itemize} |
597 \begin{itemize} |
565 \item |
598 \item |
566 If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. |
599 If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. |
567 \item |
600 \item |
747 the alternative regular expression: |
783 the alternative regular expression: |
748 \begin{lemma}\label{rderRsimpAltsCommute} |
784 \begin{lemma}\label{rderRsimpAltsCommute} |
749 $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$ |
785 $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$ |
750 \end{lemma} |
786 \end{lemma} |
751 \noindent |
787 \noindent |
752 \subsubsection{$\textit{rsimp}$ Does Not Increment the Size} |
|
753 Although it seems evident, we need a series |
|
754 of non-trivial lemmas to establish this property. |
|
755 \begin{lemma}\label{rsimpMonoLemmas} |
|
756 \mbox{} |
|
757 \begin{itemize} |
|
758 \item |
|
759 \[ |
|
760 \llbracket \rsimpalts \; rs \rrbracket_r \leq |
|
761 \llbracket \sum \; rs \rrbracket_r |
|
762 \] |
|
763 \item |
|
764 \[ |
|
765 \llbracket \rsimpseq \; r_1 \; r_2 \rrbracket_r \leq |
|
766 \llbracket r_1 \cdot r_2 \rrbracket_r |
|
767 \] |
|
768 \item |
|
769 \[ |
|
770 \llbracket \rflts \; rs \rrbracket_r \leq |
|
771 \llbracket rs \rrbracket_r |
|
772 \] |
|
773 \item |
|
774 \[ |
|
775 \llbracket \rDistinct \; rs \; ss \rrbracket_r \leq |
|
776 \llbracket rs \rrbracket_r |
|
777 \] |
|
778 \item |
|
779 If all elements $a$ in the set $as$ satisfy the property |
|
780 that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq |
|
781 \llbracket a \rrbracket_r$, then we have |
|
782 \[ |
|
783 \llbracket \; \rsimpalts \; (\textit{rdistinct} \; |
|
784 (\textit{rflts} \; (\textit{map}\;\textit{rsimp} as)) \{\}) |
|
785 \rrbracket \leq |
|
786 \llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \; |
|
787 \textit{rsimp} \; x))\; \{ \} ) \rrbracket_r |
|
788 \] |
|
789 \end{itemize} |
|
790 \end{lemma} |
|
791 \begin{proof} |
|
792 Point 1, 3, 4 can be proven by an induction on $rs$. |
|
793 Point 2 is by case analysis on $r_1$ and $r_2$. |
|
794 The last part is a corollary of the previous ones. |
|
795 \end{proof} |
|
796 \noindent |
|
797 With the lemmas for each inductive case in place, we are ready to get |
|
798 the non-increasing property as a corollary: |
|
799 \begin{corollary}\label{rsimpMono} |
|
800 $\llbracket \textit{rsimp} \; r \rrbracket_r$ |
|
801 \end{corollary} |
|
802 \begin{proof} |
|
803 By \ref{rsimpMonoLemmas}. |
|
804 \end{proof} |
|
805 |
|
806 \subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s} |
|
807 Much like the definition of $L$ on plain regular expressions, one could also |
|
808 define the language interpretation of $\rrexp$s. |
|
809 \begin{center} |
|
810 \begin{tabular}{lcl} |
|
811 $RL \; (\ZERO)$ & $\dn$ & $\phi$\\ |
|
812 $RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\ |
|
813 $RL \; (c)$ & $\dn$ & $\{[c]\}$\\ |
|
814 $RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ |
|
815 $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ |
|
816 $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ |
|
817 \end{tabular} |
|
818 \end{center} |
|
819 \noindent |
|
820 The main use of $RL$ is to establish some connections between $\rsimp{}$ |
|
821 and $\rnullable{}$: |
|
822 \begin{lemma} |
|
823 The following properties hold: |
|
824 \begin{itemize} |
|
825 \item |
|
826 If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$. |
|
827 \item |
|
828 $\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$. |
|
829 \end{itemize} |
|
830 \end{lemma} |
|
831 \begin{proof} |
|
832 The first part is by induction on $r$. |
|
833 The second part is true because property |
|
834 \[ RL \; r = RL \; (\rsimp{r})\] holds. |
|
835 \end{proof} |
|
836 |
|
837 \subsubsection{$\rsimp{}$ is Non-Increasing} |
|
838 In this subsection, we prove that the function $\rsimp{}$ does not |
|
839 make the $\llbracket \rrbracket_r$ size increase. |
|
840 |
|
841 |
|
842 \begin{lemma}\label{rsimpSize} |
|
843 $\llbracket \rsimp{r} \rrbracket_r \leq \llbracket r \rrbracket_r$ |
|
844 \end{lemma} |
|
845 \subsubsection{Simplified $\textit{Rrexp}$s are Good} |
|
846 We formalise the notion of ``good" regular expressions, |
|
847 which means regular expressions that |
|
848 are not fully simplified. For alternative regular expressions that means they |
|
849 do not contain any nested alternatives like |
|
850 \[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\] |
|
851 or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]: |
|
852 \begin{center} |
|
853 \begin{tabular}{@{}lcl@{}} |
|
854 $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ |
|
855 $\good\; \RONE$ & $\dn$ & $\textit{true}$\\ |
|
856 $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ |
|
857 $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ |
|
858 $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ |
|
859 $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & |
|
860 $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ |
|
861 & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\ |
|
862 $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ |
|
863 $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ |
|
864 $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ |
|
865 $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ |
|
866 $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ |
|
867 \end{tabular} |
|
868 \end{center} |
|
869 \noindent |
|
870 The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an |
|
871 alternative, and false otherwise. |
|
872 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that |
|
873 its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, |
|
874 and unique: |
|
875 \begin{lemma}\label{rsimpaltsGood} |
|
876 If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, |
|
877 then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$. |
|
878 \end{lemma} |
|
879 \noindent |
|
880 We also note that |
|
881 if a regular expression $r$ is good, then $\rflts$ on the singleton |
|
882 list $[r]$ will not break goodness: |
|
883 \begin{lemma}\label{flts2} |
|
884 If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$. |
|
885 \end{lemma} |
|
886 \begin{proof} |
|
887 By an induction on $r$. |
|
888 \end{proof} |
|
889 \noindent |
|
890 The other observation we make about $\rsimp{r}$ is that it never |
|
891 comes with nested alternatives, which we describe as the $\nonnested$ |
|
892 property: |
|
893 \begin{center} |
|
894 \begin{tabular}{lcl} |
|
895 $\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\ |
|
896 $\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\ |
|
897 $\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\ |
|
898 $\nonnested \; \, r $ & $\dn$ & $\btrue$ |
|
899 \end{tabular} |
|
900 \end{center} |
|
901 \noindent |
|
902 The $\rflts$ function |
|
903 always opens up nested alternatives, |
|
904 which enables $\rsimp$ to be non-nested: |
|
905 |
|
906 \begin{lemma}\label{nonnestedRsimp} |
|
907 $\nonnested \; (\rsimp{r})$ |
|
908 \end{lemma} |
|
909 \begin{proof} |
|
910 By an induction on $r$. |
|
911 \end{proof} |
|
912 \noindent |
|
913 With this we could prove that a regular expressions |
|
914 after simplification and flattening and de-duplication, |
|
915 will not contain any alternative regular expression directly: |
|
916 \begin{lemma}\label{nonaltFltsRd} |
|
917 If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ |
|
918 then $\textit{nonAlt} \; x$. |
|
919 \end{lemma} |
|
920 \begin{proof} |
|
921 By \ref{nonnestedRsimp}. |
|
922 \end{proof} |
|
923 \noindent |
|
924 The other thing we know is that once $\rsimp{}$ had finished |
|
925 processing an alternative regular expression, it will not |
|
926 contain any $\RZERO$s, this is because all the recursive |
|
927 calls to the simplification on the children regular expressions |
|
928 make the children good, and $\rflts$ will not take out |
|
929 any $\RZERO$s out of a good regular expression list, |
|
930 and $\rdistinct{}$ will not mess with the result. |
|
931 \begin{lemma}\label{flts3Obv} |
|
932 The following are true: |
|
933 \begin{itemize} |
|
934 \item |
|
935 If for all $r \in rs. \, \good \; r $ or $r = \RZERO$, |
|
936 then for all $r \in \rflts\; rs. \, \good \; r$. |
|
937 \item |
|
938 If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$ |
|
939 and for all $y$ such that $\llbracket y \rrbracket_r$ less than |
|
940 $\llbracket rs \rrbracket_r + 1$, either |
|
941 $\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$, |
|
942 then $\good \; x$. |
|
943 \end{itemize} |
|
944 \end{lemma} |
|
945 \begin{proof} |
|
946 The first part is by induction on $rs$, where the induction |
|
947 rule is the inductive cases for $\rflts$. |
|
948 The second part is a corollary from the first part. |
|
949 \end{proof} |
|
950 |
|
951 And this leads to good structural property of $\rsimp{}$, |
|
952 that after simplification, a regular expression is |
|
953 either good or $\RZERO$: |
|
954 \begin{lemma}\label{good1} |
|
955 For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$. |
|
956 \end{lemma} |
|
957 \begin{proof} |
|
958 By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$. |
|
959 Lemma \ref{rsimpSize} says that |
|
960 $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to |
|
961 $\llbracket r \rrbracket_r$. |
|
962 Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, |
|
963 Inductive hypothesis applies to the children regular expressions |
|
964 $r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied |
|
965 by that as well. |
|
966 The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used |
|
967 to ensure that goodness is preserved at the topmost level. |
|
968 \end{proof} |
|
969 We shall prove that any good regular expression is |
|
970 a fixed-point for $\rsimp{}$. |
|
971 First we prove an auxiliary lemma: |
|
972 \begin{lemma}\label{goodaltsNonalt} |
|
973 If $\good \; \sum rs$, then $\rflts\; rs = rs$. |
|
974 \end{lemma} |
|
975 \begin{proof} |
|
976 By an induction on $\sum rs$. The inductive rules are the cases |
|
977 for $\good$. |
|
978 \end{proof} |
|
979 \noindent |
|
980 Now we are ready to prove that good regular expressions are invariant |
|
981 of $\rsimp{}$ application: |
|
982 \begin{lemma}\label{test} |
|
983 If $\good \;r$ then $\rsimp{r} = r$. |
|
984 \end{lemma} |
|
985 \begin{proof} |
|
986 By an induction on the inductive cases of $\good$, using lemmas |
|
987 \ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}. |
|
988 The lemma \ref{goodaltsNonalt} is used in the alternative |
|
989 case where 2 or more elements are present in the list. |
|
990 \end{proof} |
|
991 \noindent |
|
992 Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$, |
|
993 which requires $\ref{good1}$ to go through smoothly. |
|
994 It says that an application of $\rsimp_{ALTS}$ can be "absorbed", |
|
995 if it its output is concatenated with a list and then applied to $\rflts$. |
|
996 \begin{lemma}\label{flattenRsimpalts} |
|
997 $\rflts \; ( (\rsimp_{ALTS} \; |
|
998 (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: |
|
999 \map \; \rsimp{} \; rs' ) = |
|
1000 \rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ ( |
|
1001 \map \; \rsimp{rs'}))$ |
|
1002 |
|
1003 |
|
1004 \end{lemma} |
|
1005 \begin{proof} |
|
1006 By \ref{good1}. |
|
1007 \end{proof} |
|
1008 \noindent |
|
1009 |
|
1010 |
|
1011 |
|
1012 |
|
1013 |
|
1014 We are also ready to prove that $\textit{rsimp}$ is idempotent. |
|
1015 \subsubsection{$\rsimp$ is Idempotent} |
|
1016 The idempotency of $\rsimp$ is very useful in |
|
1017 manipulating regular expression terms into desired |
|
1018 forms so that key steps allowing further rewriting to closed forms |
|
1019 are possible. |
|
1020 \begin{lemma}\label{rsimpIdem} |
|
1021 $\rsimp{r} = \rsimp{\rsimp{r}}$ |
|
1022 \end{lemma} |
|
1023 |
|
1024 \begin{proof} |
|
1025 By \ref{test} and \ref{good1}. |
|
1026 \end{proof} |
|
1027 \noindent |
|
1028 This property means we do not have to repeatedly |
|
1029 apply simplification in each step, which justifies |
|
1030 our definition of $\blexersimp$. |
|
1031 |
|
1032 |
|
1033 On the other hand, we could repeat the same $\rsimp{}$ applications |
|
1034 on regular expressions as many times as we want, if we have at least |
|
1035 one simplification applied to it, and apply it wherever we would like to: |
|
1036 \begin{corollary}\label{headOneMoreSimp} |
|
1037 The following properties hold, directly from \ref{rsimpIdem}: |
|
1038 |
|
1039 \begin{itemize} |
|
1040 \item |
|
1041 $\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$ |
|
1042 \item |
|
1043 $\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$ |
|
1044 \end{itemize} |
|
1045 \end{corollary} |
|
1046 \noindent |
|
1047 This will be useful in later closed form proof's rewriting steps. |
|
1048 Similarly, we point out the following useful facts below: |
|
1049 \begin{lemma} |
|
1050 The following equalities hold if $r = \rsimp{r'}$ for some $r'$: |
|
1051 \begin{itemize} |
|
1052 \item |
|
1053 If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$. |
|
1054 \item |
|
1055 If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$. |
|
1056 \item |
|
1057 $\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$. |
|
1058 \end{itemize} |
|
1059 \end{lemma} |
|
1060 \begin{proof} |
|
1061 By application of lemmas \ref{rsimpIdem} and \ref{good1}. |
|
1062 \end{proof} |
|
1063 |
|
1064 \noindent |
|
1065 With the idempotency of $\rsimp{}$ and its corollaries, |
|
1066 we can start proving some key equalities leading to the |
|
1067 closed forms. |
|
1068 Now presented are a few equivalent terms under $\rsimp{}$. |
|
1069 We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$. |
|
1070 \begin{lemma} |
|
1071 \begin{itemize} |
|
1072 The following equivalence hold: |
|
1073 \item |
|
1074 $\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$ |
|
1075 \item |
|
1076 $\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$ |
|
1077 \item |
|
1078 $\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$ |
|
1079 \item |
|
1080 $\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$ |
|
1081 \item |
|
1082 $\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$ |
|
1083 \end{itemize} |
|
1084 \end{lemma} |
|
1085 \begin{proof} |
|
1086 By induction on the lists involved. |
|
1087 \end{proof} |
|
1088 \noindent |
|
1089 The above allows us to prove |
|
1090 two similar equalities (which are a bit more involved). |
|
1091 It says that we could flatten out the elements |
|
1092 before simplification and still get the same result. |
|
1093 \begin{lemma}\label{simpFlatten3} |
|
1094 One can flatten the inside $\sum$ of a $\sum$ if it is being |
|
1095 simplified. Concretely, |
|
1096 \begin{itemize} |
|
1097 \item |
|
1098 If for all $r \in rs, rs', rs''$, we have $\good \; r $ |
|
1099 or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal |
|
1100 \sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary, |
|
1101 \item |
|
1102 $\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$ |
|
1103 \end{itemize} |
|
1104 \end{lemma} |
|
1105 \begin{proof} |
|
1106 By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}. |
|
1107 The second sub-lemma is a corollary of the previous. |
|
1108 \end{proof} |
|
1109 %Rewriting steps not put in--too long and complicated------------------------------- |
|
1110 \begin{comment} |
|
1111 \begin{center} |
|
1112 $\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\ |
|
1113 $\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\ |
|
1114 $\stackrel{by \ref{test}}{=} |
|
1115 \rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{ |
|
1116 \varnothing})$\\ |
|
1117 $\stackrel{by \ref{rdistinctConcatGeneral}}{=} |
|
1118 \rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{( |
|
1119 \rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\ |
|
1120 |
|
1121 \end{center} |
|
1122 \end{comment} |
|
1123 %Rewriting steps not put in--too long and complicated------------------------------- |
|
1124 \noindent |
|
1125 We need more equalities like the above to enable a closed form, |
788 We need more equalities like the above to enable a closed form, |
1126 but to proceed we need to introduce two rewrite relations, |
789 for which we need to introduce a few rewrite relations |
1127 to make things smoother. |
790 to help |
|
791 us obtain them. |
|
792 |
1128 \subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$} |
793 \subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$} |
1129 Insired by the success we had in the correctness proof |
794 Inspired by the success we had in the correctness proof |
1130 in \ref{Bitcoded2}, where we invented |
795 in \ref{Bitcoded2}, |
1131 a term rewriting system to capture the similarity between terms, |
796 we follow suit here, defining atomic simplification |
1132 we follow suit here defining simplification |
797 steps as ``small-step'' rewriting steps. This allows capturing |
1133 steps as rewriting steps. This allows capturing |
|
1134 similarities between terms that would be otherwise |
798 similarities between terms that would be otherwise |
1135 hard to express. |
799 hard to express. |
1136 |
800 |
1137 We use $\hrewrite$ for one-step atomic rewrite of |
801 We use $\hrewrite$ for one-step atomic rewrite of |
1138 regular expression simplification, |
802 regular expression simplification, |
1139 $\frewrite$ for rewrite of list of regular expressions that |
803 $\frewrite$ for rewrite of list of regular expressions that |
1140 include all operations carried out in $\rflts$, and $\grewrite$ for |
804 include all operations carried out in $\rflts$, and $\grewrite$ for |
1141 rewriting a list of regular expressions possible in both $\rflts$ and $\rdistinct{}{}$. |
805 rewriting a list of regular expressions possible in both $\rflts$ and $\textit{rdistinct}$. |
1142 Their reflexive transitive closures are used to denote zero or many steps, |
806 Their reflexive transitive closures are used to denote zero or many steps, |
1143 as was the case in the previous chapter. |
807 as was the case in the previous chapter. |
1144 The presentation will be more concise than that in \ref{Bitcoded2}. |
808 As we have already |
|
809 done something similar, the presentation about |
|
810 these rewriting rules will be more concise than that in \ref{Bitcoded2}. |
1145 To differentiate between the rewriting steps for annotated regular expressions |
811 To differentiate between the rewriting steps for annotated regular expressions |
1146 and $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbol |
812 and $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbol |
1147 to mean atomic simplification transitions |
813 to mean atomic simplification transitions |
1148 of $\rrexp$s and $\rrexp$ lists, respectively. |
814 of $\rrexp$s and $\rrexp$ lists, respectively. |
1149 |
815 |
1150 |
816 |
1151 |
817 |
1152 List of one-step rewrite rules for $\rrexp$ ($\hrewrite$): |
818 |
1153 |
819 \begin{figure}[H] |
1154 |
|
1155 \begin{center} |
820 \begin{center} |
1156 \begin{mathpar} |
821 \begin{mathpar} |
1157 \inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} |
822 \inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} |
1158 |
823 |
1159 \inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\} |
824 \inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\} |
1813 |
1503 |
1814 |
1504 |
1815 |
1505 |
1816 |
1506 |
1817 |
1507 |
|
1508 %---------------------------------------------------------------------------------------- |
|
1509 % SECTION ?? |
|
1510 %---------------------------------------------------------------------------------------- |
|
1511 |
|
1512 %----------------------------------- |
|
1513 % SECTION syntactic equivalence under simp |
|
1514 %----------------------------------- |
|
1515 |
|
1516 |
|
1517 %---------------------------------------------------------------------------------------- |
|
1518 % SECTION ALTS CLOSED FORM |
|
1519 %---------------------------------------------------------------------------------------- |
|
1520 %\section{A Closed Form for \textit{ALTS}} |
|
1521 %Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
|
1522 % |
|
1523 % |
|
1524 %There are a few key steps, one of these steps is |
|
1525 % |
|
1526 % |
|
1527 % |
|
1528 %One might want to prove this by something a simple statement like: |
|
1529 % |
|
1530 %For this to hold we want the $\textit{distinct}$ function to pick up |
|
1531 %the elements before and after derivatives correctly: |
|
1532 %$r \in rset \equiv (rder x r) \in (rder x rset)$. |
|
1533 %which essentially requires that the function $\backslash$ is an injective mapping. |
|
1534 % |
|
1535 %Unfortunately the function $\backslash c$ is not an injective mapping. |
|
1536 % |
|
1537 %\subsection{function $\backslash c$ is not injective (1-to-1)} |
|
1538 %\begin{center} |
|
1539 % The derivative $w.r.t$ character $c$ is not one-to-one. |
|
1540 % Formally, |
|
1541 % $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
|
1542 %\end{center} |
|
1543 %This property is trivially true for the |
|
1544 %character regex example: |
|
1545 %\begin{center} |
|
1546 % $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
1547 %\end{center} |
|
1548 %But apart from the cases where the derivative |
|
1549 %output is $\ZERO$, are there non-trivial results |
|
1550 %of derivatives which contain strings? |
|
1551 %The answer is yes. |
|
1552 %For example, |
|
1553 %\begin{center} |
|
1554 % Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
|
1555 % where $a$ is not nullable.\\ |
|
1556 % $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
|
1557 % $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
|
1558 %\end{center} |
|
1559 %We start with two syntactically different regular expressions, |
|
1560 %and end up with the same derivative result. |
|
1561 %This is not surprising as we have such |
|
1562 %equality as below in the style of Arden's lemma:\\ |
|
1563 %\begin{center} |
|
1564 % $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
|
1565 %\end{center} |
|
1566 \section{Bounding Closed Forms} |
|
1567 |
|
1568 In this section, we introduce how we formalised the bound |
|
1569 on closed forms. |
|
1570 We first prove that functions such as $\rflts$ |
|
1571 will not cause the size of r-regular expressions to grow. |
|
1572 Putting this together with a general bound |
|
1573 on the finiteness of distinct regular expressions |
|
1574 smaller than a certain size, we obtain a bound on |
|
1575 the closed forms. |
|
1576 |
|
1577 \subsection{$\textit{rsimp}$ Does Not Increment the Size} |
|
1578 Although it seems evident, we need a series |
|
1579 of non-trivial lemmas to establish that functions such as $\rflts$ |
|
1580 do not cause the regular expressions to grow. |
|
1581 \begin{lemma}\label{rsimpMonoLemmas} |
|
1582 \mbox{} |
|
1583 \begin{itemize} |
|
1584 \item |
|
1585 \[ |
|
1586 \llbracket \rsimpalts \; rs \rrbracket_r \leq |
|
1587 \llbracket \sum \; rs \rrbracket_r |
|
1588 \] |
|
1589 \item |
|
1590 \[ |
|
1591 \llbracket \rsimpseq \; r_1 \; r_2 \rrbracket_r \leq |
|
1592 \llbracket r_1 \cdot r_2 \rrbracket_r |
|
1593 \] |
|
1594 \item |
|
1595 \[ |
|
1596 \llbracket \rflts \; rs \rrbracket_r \leq |
|
1597 \llbracket rs \rrbracket_r |
|
1598 \] |
|
1599 \item |
|
1600 \[ |
|
1601 \llbracket \rDistinct \; rs \; ss \rrbracket_r \leq |
|
1602 \llbracket rs \rrbracket_r |
|
1603 \] |
|
1604 \item |
|
1605 If all elements $a$ in the set $as$ satisfy the property |
|
1606 that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq |
|
1607 \llbracket a \rrbracket_r$, then we have |
|
1608 \[ |
|
1609 \llbracket \; \rsimpalts \; (\textit{rdistinct} \; |
|
1610 (\textit{rflts} \; (\textit{map}\;\textit{rsimp} as)) \{\}) |
|
1611 \rrbracket \leq |
|
1612 \llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \; |
|
1613 \textit{rsimp} \; x))\; \{ \} ) \rrbracket_r |
|
1614 \] |
|
1615 \end{itemize} |
|
1616 \end{lemma} |
|
1617 \begin{proof} |
|
1618 Point 1, 3, 4 can be proven by an induction on $rs$. |
|
1619 Point 2 is by case analysis on $r_1$ and $r_2$. |
|
1620 The last part is a corollary of the previous ones. |
|
1621 \end{proof} |
|
1622 \noindent |
|
1623 With the lemmas for each inductive case in place, we are ready to get |
|
1624 the non-increasing property as a corollary: |
|
1625 \begin{corollary}\label{rsimpMono} |
|
1626 $\llbracket \textit{rsimp} \; r \rrbracket_r \leq \llbracket r \rrbracket_r$ |
|
1627 \end{corollary} |
|
1628 \begin{proof} |
|
1629 By \ref{rsimpMonoLemmas}. |
|
1630 \end{proof} |
|
1631 |
|
1632 \subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s} |
|
1633 Much like the definition of $L$ on plain regular expressions, one could also |
|
1634 define the language interpretation of $\rrexp$s. |
|
1635 \begin{center} |
|
1636 \begin{tabular}{lcl} |
|
1637 $RL \; (\ZERO)$ & $\dn$ & $\phi$\\ |
|
1638 $RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\ |
|
1639 $RL \; (c)$ & $\dn$ & $\{[c]\}$\\ |
|
1640 $RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ |
|
1641 $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ |
|
1642 $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ |
|
1643 \end{tabular} |
|
1644 \end{center} |
|
1645 \noindent |
|
1646 The main use of $RL$ is to establish some connections between $\rsimp{}$ |
|
1647 and $\rnullable{}$: |
|
1648 \begin{lemma} |
|
1649 The following properties hold: |
|
1650 \begin{itemize} |
|
1651 \item |
|
1652 If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$. |
|
1653 \item |
|
1654 $\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$. |
|
1655 \end{itemize} |
|
1656 \end{lemma} |
|
1657 \begin{proof} |
|
1658 The first part is by induction on $r$. |
|
1659 The second part is true because property |
|
1660 \[ RL \; r = RL \; (\rsimp{r})\] holds. |
|
1661 \end{proof} |
|
1662 |
|
1663 \subsubsection{Simplified $\textit{Rrexp}$s are Good} |
|
1664 We formalise the notion of ``good" regular expressions, |
|
1665 which means regular expressions that |
|
1666 are not fully simplified. For alternative regular expressions that means they |
|
1667 do not contain any nested alternatives like |
|
1668 \[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\] |
|
1669 or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]: |
|
1670 \begin{center} |
|
1671 \begin{tabular}{@{}lcl@{}} |
|
1672 $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ |
|
1673 $\good\; \RONE$ & $\dn$ & $\textit{true}$\\ |
|
1674 $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ |
|
1675 $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ |
|
1676 $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ |
|
1677 $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & |
|
1678 $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ |
|
1679 & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\ |
|
1680 $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ |
|
1681 $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ |
|
1682 $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ |
|
1683 $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ |
|
1684 $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ |
|
1685 \end{tabular} |
|
1686 \end{center} |
|
1687 \noindent |
|
1688 The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an |
|
1689 alternative, and false otherwise. |
|
1690 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that |
|
1691 its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, |
|
1692 and unique: |
|
1693 \begin{lemma}\label{rsimpaltsGood} |
|
1694 If $rs \neq []$ and forall $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, |
|
1695 then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$. |
|
1696 \end{lemma} |
|
1697 \noindent |
|
1698 We also note that |
|
1699 if a regular expression $r$ is good, then $\rflts$ on the singleton |
|
1700 list $[r]$ will not break goodness: |
|
1701 \begin{lemma}\label{flts2} |
|
1702 If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$. |
|
1703 \end{lemma} |
|
1704 \begin{proof} |
|
1705 By an induction on $r$. |
|
1706 \end{proof} |
|
1707 \noindent |
|
1708 The other observation we make about $\rsimp{r}$ is that it never |
|
1709 comes with nested alternatives, which we describe as the $\nonnested$ |
|
1710 property: |
|
1711 \begin{center} |
|
1712 \begin{tabular}{lcl} |
|
1713 $\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\ |
|
1714 $\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\ |
|
1715 $\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\ |
|
1716 $\nonnested \; \, r $ & $\dn$ & $\btrue$ |
|
1717 \end{tabular} |
|
1718 \end{center} |
|
1719 \noindent |
|
1720 The $\rflts$ function |
|
1721 always opens up nested alternatives, |
|
1722 which enables $\rsimp$ to be non-nested: |
|
1723 |
|
1724 \begin{lemma}\label{nonnestedRsimp} |
|
1725 $\nonnested \; (\rsimp{r})$ |
|
1726 \end{lemma} |
|
1727 \begin{proof} |
|
1728 By an induction on $r$. |
|
1729 \end{proof} |
|
1730 \noindent |
|
1731 With this we could prove that a regular expressions |
|
1732 after simplification and flattening and de-duplication, |
|
1733 will not contain any alternative regular expression directly: |
|
1734 \begin{lemma}\label{nonaltFltsRd} |
|
1735 If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ |
|
1736 then $\textit{nonAlt} \; x$. |
|
1737 \end{lemma} |
|
1738 \begin{proof} |
|
1739 By \ref{nonnestedRsimp}. |
|
1740 \end{proof} |
|
1741 \noindent |
|
1742 The other thing we know is that once $\rsimp{}$ had finished |
|
1743 processing an alternative regular expression, it will not |
|
1744 contain any $\RZERO$s, this is because all the recursive |
|
1745 calls to the simplification on the children regular expressions |
|
1746 make the children good, and $\rflts$ will not take out |
|
1747 any $\RZERO$s out of a good regular expression list, |
|
1748 and $\rdistinct{}$ will not mess with the result. |
|
1749 \begin{lemma}\label{flts3Obv} |
|
1750 The following are true: |
|
1751 \begin{itemize} |
|
1752 \item |
|
1753 If for all $r \in rs. \, \good \; r $ or $r = \RZERO$, |
|
1754 then for all $r \in \rflts\; rs. \, \good \; r$. |
|
1755 \item |
|
1756 If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$ |
|
1757 and for all $y$ such that $\llbracket y \rrbracket_r$ less than |
|
1758 $\llbracket rs \rrbracket_r + 1$, either |
|
1759 $\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$, |
|
1760 then $\good \; x$. |
|
1761 \end{itemize} |
|
1762 \end{lemma} |
|
1763 \begin{proof} |
|
1764 The first part is by induction on $rs$, where the induction |
|
1765 rule is the inductive cases for $\rflts$. |
|
1766 The second part is a corollary from the first part. |
|
1767 \end{proof} |
|
1768 |
|
1769 And this leads to good structural property of $\rsimp{}$, |
|
1770 that after simplification, a regular expression is |
|
1771 either good or $\RZERO$: |
|
1772 \begin{lemma}\label{good1} |
|
1773 For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$. |
|
1774 \end{lemma} |
|
1775 \begin{proof} |
|
1776 By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$. |
|
1777 Lemma \ref{rsimpSize} says that |
|
1778 $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to |
|
1779 $\llbracket r \rrbracket_r$. |
|
1780 Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, |
|
1781 Inductive hypothesis applies to the children regular expressions |
|
1782 $r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied |
|
1783 by that as well. |
|
1784 The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used |
|
1785 to ensure that goodness is preserved at the topmost level. |
|
1786 \end{proof} |
|
1787 We shall prove that any good regular expression is |
|
1788 a fixed-point for $\rsimp{}$. |
|
1789 First we prove an auxiliary lemma: |
|
1790 \begin{lemma}\label{goodaltsNonalt} |
|
1791 If $\good \; \sum rs$, then $\rflts\; rs = rs$. |
|
1792 \end{lemma} |
|
1793 \begin{proof} |
|
1794 By an induction on $\sum rs$. The inductive rules are the cases |
|
1795 for $\good$. |
|
1796 \end{proof} |
|
1797 \noindent |
|
1798 Now we are ready to prove that good regular expressions are invariant |
|
1799 of $\rsimp{}$ application: |
|
1800 \begin{lemma}\label{test} |
|
1801 If $\good \;r$ then $\rsimp{r} = r$. |
|
1802 \end{lemma} |
|
1803 \begin{proof} |
|
1804 By an induction on the inductive cases of $\good$, using lemmas |
|
1805 \ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}. |
|
1806 The lemma \ref{goodaltsNonalt} is used in the alternative |
|
1807 case where 2 or more elements are present in the list. |
|
1808 \end{proof} |
|
1809 \noindent |
|
1810 Given below is a property involving $\rflts$, $\rdistinct{}{}$, $\rsimp{}$ and $\rsimp_{ALTS}$, |
|
1811 which requires $\ref{good1}$ to go through smoothly. |
|
1812 It says that an application of $\rsimp_{ALTS}$ can be "absorbed", |
|
1813 if it its output is concatenated with a list and then applied to $\rflts$. |
|
1814 \begin{lemma}\label{flattenRsimpalts} |
|
1815 $\rflts \; ( (\rsimp_{ALTS} \; |
|
1816 (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: |
|
1817 \map \; \rsimp{} \; rs' ) = |
|
1818 \rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ ( |
|
1819 \map \; \rsimp{rs'}))$ |
|
1820 |
|
1821 |
|
1822 \end{lemma} |
|
1823 \begin{proof} |
|
1824 By \ref{good1}. |
|
1825 \end{proof} |
|
1826 \noindent |
|
1827 |
|
1828 |
|
1829 |
|
1830 |
|
1831 |
|
1832 We are also ready to prove that $\textit{rsimp}$ is idempotent. |
|
1833 \subsubsection{$\rsimp$ is Idempotent} |
|
1834 The idempotency of $\rsimp$ is very useful in |
|
1835 manipulating regular expression terms into desired |
|
1836 forms so that key steps allowing further rewriting to closed forms |
|
1837 are possible. |
|
1838 \begin{lemma}\label{rsimpIdem} |
|
1839 $\rsimp{r} = \rsimp{\rsimp{r}}$ |
|
1840 \end{lemma} |
|
1841 |
|
1842 \begin{proof} |
|
1843 By \ref{test} and \ref{good1}. |
|
1844 \end{proof} |
|
1845 \noindent |
|
1846 This property means we do not have to repeatedly |
|
1847 apply simplification in each step, which justifies |
|
1848 our definition of $\blexersimp$. |
|
1849 |
|
1850 |
|
1851 On the other hand, we could repeat the same $\rsimp{}$ applications |
|
1852 on regular expressions as many times as we want, if we have at least |
|
1853 one simplification applied to it, and apply it wherever we would like to: |
|
1854 \begin{corollary}\label{headOneMoreSimp} |
|
1855 The following properties hold, directly from \ref{rsimpIdem}: |
|
1856 |
|
1857 \begin{itemize} |
|
1858 \item |
|
1859 $\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$ |
|
1860 \item |
|
1861 $\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$ |
|
1862 \end{itemize} |
|
1863 \end{corollary} |
|
1864 \noindent |
|
1865 This will be useful in later closed form proof's rewriting steps. |
|
1866 Similarly, we point out the following useful facts below: |
|
1867 \begin{lemma} |
|
1868 The following equalities hold if $r = \rsimp{r'}$ for some $r'$: |
|
1869 \begin{itemize} |
|
1870 \item |
|
1871 If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$. |
|
1872 \item |
|
1873 If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$. |
|
1874 \item |
|
1875 $\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$. |
|
1876 \end{itemize} |
|
1877 \end{lemma} |
|
1878 \begin{proof} |
|
1879 By application of lemmas \ref{rsimpIdem} and \ref{good1}. |
|
1880 \end{proof} |
|
1881 |
|
1882 \noindent |
|
1883 With the idempotency of $\rsimp{}$ and its corollaries, |
|
1884 we can start proving some key equalities leading to the |
|
1885 closed forms. |
|
1886 Now presented are a few equivalent terms under $\rsimp{}$. |
|
1887 We use $r_1 \sequal r_2 $ here to denote $\rsimp{r_1} = \rsimp{r_2}$. |
|
1888 \begin{lemma} |
|
1889 \begin{itemize} |
|
1890 The following equivalence hold: |
|
1891 \item |
|
1892 $\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$ |
|
1893 \item |
|
1894 $\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$ |
|
1895 \item |
|
1896 $\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$ |
|
1897 \item |
|
1898 $\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$ |
|
1899 \item |
|
1900 $\RALTS{rs} = \RALTS{\map \; \rsimp{} \; rs}$ |
|
1901 \end{itemize} |
|
1902 \end{lemma} |
|
1903 \begin{proof} |
|
1904 By induction on the lists involved. |
|
1905 \end{proof} |
|
1906 \noindent |
|
1907 The above allows us to prove |
|
1908 two similar equalities (which are a bit more involved). |
|
1909 It says that we could flatten out the elements |
|
1910 before simplification and still get the same result. |
|
1911 \begin{lemma}\label{simpFlatten3} |
|
1912 One can flatten the inside $\sum$ of a $\sum$ if it is being |
|
1913 simplified. Concretely, |
|
1914 \begin{itemize} |
|
1915 \item |
|
1916 If for all $r \in rs, rs', rs''$, we have $\good \; r $ |
|
1917 or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal |
|
1918 \sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary, |
|
1919 \item |
|
1920 $\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$ |
|
1921 \end{itemize} |
|
1922 \end{lemma} |
|
1923 \begin{proof} |
|
1924 By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}. |
|
1925 The second sub-lemma is a corollary of the previous. |
|
1926 \end{proof} |
|
1927 %Rewriting steps not put in--too long and complicated------------------------------- |
|
1928 \begin{comment} |
|
1929 \begin{center} |
|
1930 $\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\ |
|
1931 $\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\ |
|
1932 $\stackrel{by \ref{test}}{=} |
|
1933 \rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{ |
|
1934 \varnothing})$\\ |
|
1935 $\stackrel{by \ref{rdistinctConcatGeneral}}{=} |
|
1936 \rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{( |
|
1937 \rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\ |
|
1938 |
|
1939 \end{center} |
|
1940 \end{comment} |
|
1941 %Rewriting steps not put in--too long and complicated------------------------------- |
|
1942 \noindent |
1818 \subsection{Estimating the Closed Forms' sizes} |
1943 \subsection{Estimating the Closed Forms' sizes} |
1819 We now summarize the closed forms below: |
1944 We now summarize the closed forms below: |
1820 \begin{itemize} |
1945 \begin{itemize} |
1821 \item |
1946 \item |
1822 $\rderssimp{(\sum rs)}{s} \sequal |
1947 $\rderssimp{(\sum rs)}{s} \sequal |