37 is not just finite but polynomial in $\llbracket a\rrbracket$. |
37 is not just finite but polynomial in $\llbracket a\rrbracket$. |
38 \item |
38 \item |
39 Having the finite bound formalised |
39 Having the finite bound formalised |
40 gives us a higher confidence that |
40 gives us a higher confidence that |
41 our simplification algorithm $\simp$ does not ``mis-behave'' |
41 our simplification algorithm $\simp$ does not ``mis-behave'' |
42 like $\simpsulz$ does. |
42 like $\textit{simpSL}$ does. |
43 The bound is universal for a given regular expression, |
43 The bound is universal for a given regular expression, |
44 which is an advantage over work which |
44 which is an advantage over work which |
45 only gives empirical evidence on |
45 only gives empirical evidence on |
46 some test cases (see Verbatim++ work\cite{Verbatimpp}). |
46 some test cases (see for example Verbatim work \cite{Verbatimpp}). |
47 \end{itemize} |
47 \end{itemize} |
48 \noindent |
48 \noindent |
49 We then extend our $\blexersimp$ |
49 We then extend our $\blexersimp$ |
50 to support bounded repetitions ($r^{\{n\}}$). |
50 to support bounded repetitions ($r^{\{n\}}$). |
51 We update our formalisation of |
51 We update our formalisation of |
52 the correctness and finiteness properties to |
52 the correctness and finiteness properties to |
53 include this new construct. |
53 include this new construct. |
54 we can out-compete other verified lexers such as |
54 We show that we can out-compete other verified lexers such as |
55 Verbatim++ on bounded regular expressions. |
55 Verbatim++ on bounded regular expressions. |
56 |
56 |
57 In the next section we describe in more detail |
57 In the next section we describe in more detail |
58 what the finite bound means in our algorithm |
58 what the finite bound means in our algorithm |
59 and why the size of the internal data structures of |
59 and why the size of the internal data structures of |
60 a typical derivative-based lexer such as |
60 a typical derivative-based lexer such as |
61 Sulzmann and Lu's need formal treatment. |
61 Sulzmann and Lu's needs formal treatment. |
62 |
62 |
63 |
63 |
64 |
64 |
65 |
65 |
66 \section{Formalising About Size} |
66 \section{Formalising Size Bound of Derivatives} |
67 \noindent |
67 \noindent |
68 In our lexer ($\blexersimp$), |
68 In our lexer ($\blexersimp$), |
69 we take an annotated regular expression as input, |
69 we take an annotated regular expression as input, |
70 and repeately take derivative of and simplify it. |
70 and repeately take derivative of and simplify it. |
71 \begin{figure} |
71 \begin{figure} |
110 \end{figure} |
110 \end{figure} |
111 |
111 |
112 \noindent |
112 \noindent |
113 Each time |
113 Each time |
114 a derivative is taken, the regular expression might grow. |
114 a derivative is taken, the regular expression might grow. |
115 However, the simplification that is immediately afterwards will always shrink it so that |
115 However, the simplification that is immediately afterwards will often shrink it so that |
116 it stays relatively small. |
116 the overall size of the derivatives stays relatively small. |
117 This intuition is depicted by the relative size |
117 This intuition is depicted by the relative size |
118 change between the black and blue nodes: |
118 change between the black and blue nodes: |
119 After $\simp$ the node always shrinks. |
119 After $\simp$ the node shrinks. |
120 Our proof states that all the blue nodes |
120 Our proof states that all the blue nodes |
121 stay below a size bound $N_a$ determined by the input $a$. |
121 stay below a size bound $N_a$ determined by the input $a$. |
122 |
122 |
123 \noindent |
123 \noindent |
124 Sulzmann and Lu's assumed a similar picture about their algorithm, |
124 Sulzmann and Lu's assumed a similar picture about their algorithm, |
152 |
152 |
153 \end{tikzpicture} |
153 \end{tikzpicture} |
154 \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks} |
154 \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks} |
155 \end{figure} |
155 \end{figure} |
156 \noindent |
156 \noindent |
157 The picture means that on certain cases their lexer (where they use $\simpsulz$ |
157 The picture means that in some cases their lexer (where they use $\simpsulz$ |
158 as the simplification function) |
158 as the simplification function) |
159 will have a size explosion, causing the running time |
159 will have a size explosion, causing the running time |
160 of each derivative step to grow continuously (for example |
160 of each derivative step to grow continuously (for example |
161 in \ref{SulzmannLuLexerTime}). |
161 in \ref{SulzmannLuLexerTime}). |
162 They tested out the run time of their |
162 They tested out the run time of their |
163 lexer on particular examples such as $(a+b+ab)^*$ |
163 lexer on particular examples such as $(a+b+ab)^*$ |
164 and claimed that their algorithm is linear w.r.t to the input. |
164 and claimed that their algorithm is linear w.r.t to the input. |
165 With our mecahnised proof, we avoid this type of unintentional |
165 With our mecahnised proof, we avoid this type of unintentional |
166 generalisation.\\ |
166 generalisation. |
167 |
167 |
168 Before delving in to the details of the formalisation, |
168 Before delving in to the details of the formalisation, |
169 we are going to provide an overview of it. |
169 we are going to provide an overview of it in the next subsection. |
170 In the next subsection, we give a high-level view |
|
171 of the proof. |
|
172 |
170 |
173 |
171 |
174 \subsection{Overview of the Proof} |
172 \subsection{Overview of the Proof} |
175 A high-level overview of the main components of the finiteness proof |
173 A high-level overview of the main components of the finiteness proof |
176 is as follows: |
174 is as follows: |
233 |
231 |
234 \section{The $\textit{Rrexp}$ Datatype} |
232 \section{The $\textit{Rrexp}$ Datatype} |
235 The first step is to define |
233 The first step is to define |
236 $\textit{rrexp}$s. |
234 $\textit{rrexp}$s. |
237 They are annotated regular expressions without bitcodes, |
235 They are annotated regular expressions without bitcodes, |
238 allowing a much simpler size bound proof. |
236 allowing a more convenient size bound proof. |
239 %Of course, the bits which encode the lexing information |
237 %Of course, the bits which encode the lexing information |
240 %would grow linearly with respect |
238 %would grow linearly with respect |
241 %to the input, which should be taken into account when we wish to tackle the runtime comlexity. |
239 %to the input, which should be taken into account when we wish to tackle the runtime comlexity. |
242 %But for the sake of the structural size |
240 %But for the sake of the structural size |
243 %we can safely ignore them.\\ |
241 %we can safely ignore them.\\ |
246 \emph{r-regular expressions}, |
244 \emph{r-regular expressions}, |
247 was initially defined in \ref{rrexpDef}. |
245 was initially defined in \ref{rrexpDef}. |
248 The reason for the prefix $r$ is |
246 The reason for the prefix $r$ is |
249 to make a distinction |
247 to make a distinction |
250 with basic regular expressions. |
248 with basic regular expressions. |
|
249 We give here again the definition of $\rrexp$. |
251 \[ \rrexp ::= \RZERO \mid \RONE |
250 \[ \rrexp ::= \RZERO \mid \RONE |
252 \mid \RCHAR{c} |
251 \mid \RCHAR{c} |
253 \mid \RSEQ{r_1}{r_2} |
252 \mid \RSEQ{r_1}{r_2} |
254 \mid \RALTS{rs} |
253 \mid \RALTS{rs} |
255 \mid \RSTAR{r} |
254 \mid \RSTAR{r} |
306 $\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ |
305 $\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ |
307 $\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ |
306 $\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ |
308 \end{tabular} |
307 \end{tabular} |
309 \end{center} |
308 \end{center} |
310 \noindent |
309 \noindent |
311 As can be seen alternative regular expression with an empty argument list |
310 As can be seen, alternative regular expressions with an empty argument list |
312 will be turned into a $\ZERO$. |
311 will be turned into a $\ZERO$. |
313 The singleton alternative $\sum [r]$ becomes $r$ during the |
312 The singleton alternative $\sum [r]$ becomes $r$ during the |
314 $\erase$ function. |
313 $\erase$ function. |
315 The annotated regular expression $\sum[a, b, c]$ would turn into |
314 The annotated regular expression $\sum[a, b, c]$ would turn into |
316 $(a+(b+c))$. |
315 $(a+(b+c))$. |
317 All these operations change the size and structure of |
316 All these operations change the size and structure of |
318 an annotated regular expressions, adding unnecessary |
317 an annotated regular expressions, adding unnecessary |
319 complexities to the size bound proof.\\ |
318 complexities to the size bound proof. |
|
319 |
320 For example, if we define the size of a basic plain regular expression |
320 For example, if we define the size of a basic plain regular expression |
321 in the usual way, |
321 in the usual way, |
322 \begin{center} |
322 \begin{center} |
323 \begin{tabular}{lcl} |
323 \begin{tabular}{lcl} |
324 $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
324 $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
351 $\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $ |
351 $\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $ |
352 and then estimate $\llbracket a_\downarrow \rrbracket_p$, |
352 and then estimate $\llbracket a_\downarrow \rrbracket_p$, |
353 but we found our approach more straightforward.\\ |
353 but we found our approach more straightforward.\\ |
354 |
354 |
355 \subsection{Functions for R-regular Expressions} |
355 \subsection{Functions for R-regular Expressions} |
|
356 The downside of our approach is that we need to redefine |
|
357 several functions for $\rrexp$. |
356 In this section we shall define the r-regular expression version |
358 In this section we shall define the r-regular expression version |
357 of $\blexer$, and $\blexersimp$ related functions. |
359 of $\bder$, and $\textit{bsimp}$ related functions. |
358 We use $r$ as the prefix or subscript to differentiate |
360 We use $r$ as the prefix or subscript to differentiate |
359 with the bitcoded version. |
361 with the bitcoded version. |
360 %For example,$\backslash_r$, $\rdistincts$, and $\rsimp$ |
362 %For example,$\backslash_r$, $\rdistincts$, and $\rsimp$ |
361 %as opposed to $\backslash$, $\distinctBy$, and $\bsimp$. |
363 %as opposed to $\backslash$, $\distinctBy$, and $\bsimp$. |
362 %As promised, they are much simpler than their bitcoded counterparts. |
364 %As promised, they are much simpler than their bitcoded counterparts. |
585 \end{center} |
587 \end{center} |
586 where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on. |
588 where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on. |
587 For example, for $r_1 \cdot r_2$ we have the equality as |
589 For example, for $r_1 \cdot r_2$ we have the equality as |
588 \begin{center} |
590 \begin{center} |
589 $ \rderssimp{r_1 \cdot r_2}{s} = |
591 $ \rderssimp{r_1 \cdot r_2}{s} = |
590 \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ |
592 \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r_2}{\_} \;(\vsuf{s}{r_1})))}$ |
591 \end{center} |
593 \end{center} |
592 We call the right-hand-side the |
594 We call the right-hand-side the |
593 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. |
595 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. |
594 Second, we will bound the closed form of r-regular expressions |
596 Second, we will bound the closed form of r-regular expressions |
595 using some estimation techniques |
597 using some estimation techniques |
770 \end{lemma} |
772 \end{lemma} |
771 \begin{proof} |
773 \begin{proof} |
772 By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}. |
774 By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}. |
773 \end{proof} |
775 \end{proof} |
774 \noindent |
776 \noindent |
775 The next lemma is a more general form of \ref{rdistinctConcat}, |
777 The next lemma is a more general form of \ref{rdistinctConcat}; |
776 it says that |
778 It says that |
777 $\textit{rdistinct}$ is composable w.r.t list concatenation: |
779 $\textit{rdistinct}$ is composable w.r.t list concatenation: |
778 \begin{lemma}\label{distinctRdistinctAppend} |
780 \begin{lemma}\label{distinctRdistinctAppend} |
779 If $\;\; \textit{isDistinct} \; rs_1$, |
781 If $\;\; \textit{isDistinct} \; rs_1$, |
780 and $(set \; rs_1) \cap acc = \varnothing$, |
782 and $(set \; rs_1) \cap acc = \varnothing$, |
781 then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not |
783 then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not |
789 \end{proof} |
791 \end{proof} |
790 \noindent |
792 \noindent |
791 $\textit{rdistinct}$ needs to be applied only once, and |
793 $\textit{rdistinct}$ needs to be applied only once, and |
792 applying it multiple times does not make any difference: |
794 applying it multiple times does not make any difference: |
793 \begin{corollary}\label{distinctOnceEnough} |
795 \begin{corollary}\label{distinctOnceEnough} |
794 $\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; (rdistinct \; |
796 $\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; ( (rdistinct \; |
795 rs \{ \} @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$ |
797 rs \; \{ \}) @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$ |
796 \end{corollary} |
798 \end{corollary} |
797 \begin{proof} |
799 \begin{proof} |
798 By lemma \ref{distinctRdistinctAppend}. |
800 By lemma \ref{distinctRdistinctAppend}. |
799 \end{proof} |
801 \end{proof} |
800 |
802 |
843 last sub-lemma. |
845 last sub-lemma. |
844 \end{proof} |
846 \end{proof} |
845 \noindent |
847 \noindent |
846 Now we introduce the property that the operations |
848 Now we introduce the property that the operations |
847 derivative and $\rsimpalts$ |
849 derivative and $\rsimpalts$ |
848 commute, this will be used later on, when deriving the closed form for |
850 commute, this will be used later on when deriving the closed form for |
849 the alternative regular expression: |
851 the alternative regular expression: |
850 \begin{lemma}\label{rderRsimpAltsCommute} |
852 \begin{lemma}\label{rderRsimpAltsCommute} |
851 $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$ |
853 $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$ |
852 \end{lemma} |
854 \end{lemma} |
853 \begin{proof} |
855 \begin{proof} |
854 By induction on $rs$. |
856 By induction on $rs$. |
855 \end{proof} |
857 \end{proof} |
856 \noindent |
858 \noindent |
857 |
859 |
858 \subsubsection{The $RL$ Function: Language Interpretation of $\textit{Rrexp}$s} |
860 \subsubsection{The $RL$ Function: Language Interpretation for $\textit{Rrexp}$s} |
859 Much like the definition of $L$ on plain regular expressions, one can also |
861 Much like the definition of $L$ on plain regular expressions, one can also |
860 define the language interpretation of $\rrexp$s. |
862 define the language interpretation for $\rrexp$s. |
861 \begin{center} |
863 \begin{center} |
862 \begin{tabular}{lcl} |
864 \begin{tabular}{lcl} |
863 $RL \; (\ZERO)$ & $\dn$ & $\phi$\\ |
865 $RL \; (\ZERO_r)$ & $\dn$ & $\phi$\\ |
864 $RL \; (\ONE)$ & $\dn$ & $\{[]\}$\\ |
866 $RL \; (\ONE_r)$ & $\dn$ & $\{[]\}$\\ |
865 $RL \; (c)$ & $\dn$ & $\{[c]\}$\\ |
867 $RL \; (c)$ & $\dn$ & $\{[c]\}$\\ |
866 $RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ |
868 $RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ |
867 $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ |
869 $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ |
868 $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ |
870 $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ |
869 \end{tabular} |
871 \end{tabular} |
958 \end{lemma} |
960 \end{lemma} |
959 \begin{proof} |
961 \begin{proof} |
960 By induction on $r$. |
962 By induction on $r$. |
961 \end{proof} |
963 \end{proof} |
962 \noindent |
964 \noindent |
963 With this we could prove that a regular expressions |
965 With this we can prove that a regular expressions |
964 after simplification and flattening and de-duplication, |
966 after simplification and flattening and de-duplication, |
965 will not contain any alternative regular expression directly: |
967 will not contain any alternative regular expression directly: |
966 \begin{lemma}\label{nonaltFltsRd} |
968 \begin{lemma}\label{nonaltFltsRd} |
967 If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ |
969 If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ |
968 then $\textit{nonAlt} \; x$. |
970 then $\textit{nonAlt} \; x$. |
996 The first part is by induction on $rs$, where the induction |
998 The first part is by induction on $rs$, where the induction |
997 rule is the inductive cases for $\rflts$. |
999 rule is the inductive cases for $\rflts$. |
998 The second part is a corollary from the first part. |
1000 The second part is a corollary from the first part. |
999 \end{proof} |
1001 \end{proof} |
1000 |
1002 |
1001 And this leads to good structural property of $\rsimp{}$, |
1003 This leads to good structural property of $\rsimp{}$, |
1002 that after simplification, a regular expression is |
1004 that after simplification, a regular expression is |
1003 either good or $\RZERO$: |
1005 either good or $\RZERO$: |
1004 \begin{lemma}\label{good1} |
1006 \begin{lemma}\label{good1} |
1005 For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$. |
1007 For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$. |
1006 \end{lemma} |
1008 \end{lemma} |
1015 by that as well. |
1017 by that as well. |
1016 The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used |
1018 The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used |
1017 to ensure that goodness is preserved at the topmost level. |
1019 to ensure that goodness is preserved at the topmost level. |
1018 \end{proof} |
1020 \end{proof} |
1019 We shall prove that any good regular expression is |
1021 We shall prove that any good regular expression is |
1020 a fixed-point for $\rsimp{}$. |
1022 a fixed-point for $\textit{rsimp}$. |
1021 First we prove an auxiliary lemma: |
1023 First we prove an auxiliary lemma: |
1022 \begin{lemma}\label{goodaltsNonalt} |
1024 \begin{lemma}\label{goodaltsNonalt} |
1023 If $\good \; \sum rs$, then $\rflts\; rs = rs$. |
1025 If $\good \; \sum rs$, then $\rflts\; rs = rs$. |
1024 \end{lemma} |
1026 \end{lemma} |
1025 \begin{proof} |
1027 \begin{proof} |
1037 \ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}. |
1039 \ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}. |
1038 The lemma \ref{goodaltsNonalt} is used in the alternative |
1040 The lemma \ref{goodaltsNonalt} is used in the alternative |
1039 case where 2 or more elements are present in the list. |
1041 case where 2 or more elements are present in the list. |
1040 \end{proof} |
1042 \end{proof} |
1041 \noindent |
1043 \noindent |
1042 Given below is a property involving $\rflts$, $\textit{rdistinct}$, |
1044 Below we show a property involving $\rflts$, $\textit{rdistinct}$, |
1043 $\rsimp{}$ and $\rsimp_{ALTS}$, |
1045 $\rsimp{}$ and $\rsimp_{ALTS}$, |
1044 which requires $\ref{good1}$ to go through smoothly: |
1046 which requires $\ref{good1}$ to go through smoothly: |
1045 \begin{lemma}\label{flattenRsimpalts} |
1047 \begin{lemma}\label{flattenRsimpalts} |
1046 An application of $\rsimp_{ALTS}$ can be ``absorbed'', |
1048 An application of $\rsimp_{ALTS}$ can be ``absorbed'', |
1047 if its output is concatenated with a list and then applied to $\rflts$. |
1049 if its output is concatenated with a list and then applied to $\rflts$. |
1079 \end{proof} |
1081 \end{proof} |
1080 \noindent |
1082 \noindent |
1081 This property means we do not have to repeatedly |
1083 This property means we do not have to repeatedly |
1082 apply simplification in each step, which justifies |
1084 apply simplification in each step, which justifies |
1083 our definition of $\blexersimp$. |
1085 our definition of $\blexersimp$. |
|
1086 This is in contrast to the work of Sulzmann and Lu where |
|
1087 the simplification is applied in a fixpoint manner. |
1084 |
1088 |
1085 |
1089 |
1086 On the other hand, we can repeat the same $\rsimp{}$ applications |
1090 On the other hand, we can repeat the same $\rsimp{}$ applications |
1087 on regular expressions as many times as we want, if we have at least |
1091 on regular expressions as many times as we want, if we have at least |
1088 one simplification applied to it, and apply it wherever we would like to: |
1092 one simplification applied to it, and apply it wherever we need to: |
1089 \begin{corollary}\label{headOneMoreSimp} |
1093 \begin{corollary}\label{headOneMoreSimp} |
1090 The following properties hold, directly from \ref{rsimpIdem}: |
1094 The following properties hold, directly from \ref{rsimpIdem}: |
1091 |
1095 |
1092 \begin{itemize} |
1096 \begin{itemize} |
1093 \item |
1097 \item |
1113 \begin{proof} |
1117 \begin{proof} |
1114 By application of lemmas \ref{rsimpIdem} and \ref{good1}. |
1118 By application of lemmas \ref{rsimpIdem} and \ref{good1}. |
1115 \end{proof} |
1119 \end{proof} |
1116 |
1120 |
1117 \noindent |
1121 \noindent |
1118 With the idempotency of $\rsimp{}$ and its corollaries, |
1122 With the idempotency of $\textit{rsimp}$ and its corollaries, |
1119 we can start proving some key equalities leading to the |
1123 we can start proving some key equalities leading to the |
1120 closed forms. |
1124 closed forms. |
1121 Now presented are a few equivalent terms under $\rsimp{}$. |
1125 Next we present a few equivalent terms under $\textit{rsimp}$. |
1122 To make the notation more concise |
1126 To make the notation more concise |
1123 We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$: |
1127 We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$. |
1124 \begin{center} |
1128 %\begin{center} |
1125 \begin{tabular}{lcl} |
1129 %\begin{tabular}{lcl} |
1126 $a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$ |
1130 % $a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$ |
1127 \end{tabular} |
1131 %\end{tabular} |
1128 \end{center} |
1132 %\end{center} |
1129 \noindent |
1133 %\noindent |
1130 %\vspace{0em} |
1134 %\vspace{0em} |
1131 \begin{lemma} |
1135 \begin{lemma} |
1132 The following equivalence hold: |
1136 The following equivalence hold: |
1133 \begin{itemize} |
1137 \begin{itemize} |
1134 \item |
1138 \item |
1298 \end{figure} |
1302 \end{figure} |
1299 \noindent |
1303 \noindent |
1300 We define |
1304 We define |
1301 two separate list rewriting relations $\frewrite$ and $\grewrite$. |
1305 two separate list rewriting relations $\frewrite$ and $\grewrite$. |
1302 The rewriting steps that take place during |
1306 The rewriting steps that take place during |
1303 flattening is characterised by $\frewrite$. |
1307 flattening are characterised by $\frewrite$. |
1304 The rewrite relation $\grewrite$ characterises both flattening and de-duplicating. |
1308 The rewrite relation $\grewrite$ characterises both flattening and de-duplicating. |
1305 Sometimes $\grewrites$ is slightly too powerful |
1309 Sometimes $\grewrites$ is slightly too powerful |
1306 so we would rather use $\frewrites$ to prove |
1310 so we would rather use $\frewrites$ to prove |
1307 %because we only |
1311 %because we only |
1308 equalities related to $\rflts$. |
1312 equalities related to $\rflts$. |
1329 \end{lemma} |
1333 \end{lemma} |
1330 \noindent |
1334 \noindent |
1331 These two lemmas can both be proven using a straightforward induction (and |
1335 These two lemmas can both be proven using a straightforward induction (and |
1332 the proofs for them are therefore omitted). |
1336 the proofs for them are therefore omitted). |
1333 |
1337 |
1334 Now the above equalities can be derived like a breeze: |
1338 Now the above equalities can be derived with ease: |
1335 \begin{corollary} |
1339 \begin{corollary} |
1336 $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal |
1340 $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal |
1337 \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) |
1341 \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) |
1338 $ |
1342 $ |
1339 \end{corollary} |
1343 \end{corollary} |
1365 |
1369 |
1366 |
1370 |
1367 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$} |
1371 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$} |
1368 In this part, we present lemmas stating |
1372 In this part, we present lemmas stating |
1369 pairs of r-regular expressions and r-regular expression lists |
1373 pairs of r-regular expressions and r-regular expression lists |
1370 where one could rewrite from one in many steps to the other. |
1374 where one can rewrite from one in many steps to the other. |
1371 Most of the proofs to these lemmas are straightforward, using |
1375 Most of the proofs to these lemmas are straightforward, using |
1372 an induction on the corresponding rewriting relations. |
1376 an induction on the corresponding rewriting relations. |
1373 These proofs will therefore be omitted when this is the case. |
1377 These proofs will therefore be omitted when this is the case. |
1374 We present in the following lemma a few pairs of terms that are rewritable via |
1378 We present in the following lemma a few pairs of terms that are rewritable via |
1375 $\grewrites$: |
1379 $\grewrites$: |
1422 \end{itemize} |
1426 \end{itemize} |
1423 \end{lemma} |
1427 \end{lemma} |
1424 \noindent |
1428 \noindent |
1425 Now comes the core of the proof, |
1429 Now comes the core of the proof, |
1426 which says that once two lists are rewritable to each other, |
1430 which says that once two lists are rewritable to each other, |
1427 then they are equivalent under $\rsimp{}$: |
1431 then they are equivalent under $\textit{rsimp}$: |
1428 \begin{lemma} |
1432 \begin{lemma} |
1429 If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$. |
1433 If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$. |
1430 \end{lemma} |
1434 \end{lemma} |
1431 |
1435 |
1432 \noindent |
1436 \noindent |
1433 Similar to what we did in \ref{Bitcoded2}, |
1437 Similar to what we did in chapter \ref{Bitcoded2}, |
1434 we prove that if one can rewrite from one r-regular expression ($r$) |
1438 we prove that if one can rewrite from one r-regular expression ($r$) |
1435 to the other ($r'$), after taking derivatives one could still rewrite |
1439 to the other ($r'$), after taking derivatives one can still rewrite |
1436 the first ($r\backslash c$) to the other ($r'\backslash c$). |
1440 the first ($r\backslash c$) to the other ($r'\backslash c$). |
1437 \begin{lemma}\label{interleave} |
1441 \begin{lemma}\label{interleave} |
1438 If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$ |
1442 If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$ |
1439 \end{lemma} |
1443 \end{lemma} |
1440 \noindent |
1444 \noindent |
1554 \end{center} |
1558 \end{center} |
1555 \noindent |
1559 \noindent |
1556 The $\delta$ function |
1560 The $\delta$ function |
1557 returns $r$ when the boolean condition |
1561 returns $r$ when the boolean condition |
1558 $b$ evaluates to true and |
1562 $b$ evaluates to true and |
1559 $\ZERO$ otherwise: |
1563 $\ZERO_r$ otherwise: |
1560 \begin{center} |
1564 \begin{center} |
1561 \begin{tabular}{lcl} |
1565 \begin{tabular}{lcl} |
1562 $\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\ |
1566 $\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\ |
1563 & $\dn$ & $\ZERO \quad otherwise$ |
1567 & $\dn$ & $\ZERO_r \quad otherwise$ |
1564 \end{tabular} |
1568 \end{tabular} |
1565 \end{center} |
1569 \end{center} |
1566 \noindent |
1570 \noindent |
1567 Note that the term |
1571 Note that the term |
1568 \begin{center} |
1572 \begin{center} |
1585 \[ |
1589 \[ |
1586 r_1 \backslash_r c_1c_2 |
1590 r_1 \backslash_r c_1c_2 |
1587 \] |
1591 \] |
1588 instead of |
1592 instead of |
1589 \[ |
1593 \[ |
1590 (r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO. |
1594 (r_1 \backslash_r c_1c_2 + \ZERO_r ) + \ZERO_r. |
1591 \] |
1595 \] |
1592 The redundant $\ZERO$s will not be created in the |
1596 The redundant $\ZERO_r$s will not be created in the |
1593 first place. |
1597 first place. |
1594 In a closed-form one needs to take into account this (because |
1598 In a closed-form one needs to take into account this (because |
1595 closed forms require exact equality rather than language equivalence) |
1599 closed forms require exact equality rather than language equivalence) |
1596 and only generate the |
1600 and only generate the |
1597 $r_2 \backslash_r s''$ terms satisfying the property |
1601 $r_2 \backslash_r s''$ terms satisfying the property |
1627 is occurring in $(r_1\cdot r_2)\backslash s$. |
1631 is occurring in $(r_1\cdot r_2)\backslash s$. |
1628 |
1632 |
1629 With $\textit{Suffix}$ we are ready to express the |
1633 With $\textit{Suffix}$ we are ready to express the |
1630 sequence regular expression's closed form, |
1634 sequence regular expression's closed form, |
1631 but before doing so |
1635 but before doing so |
1632 more devices are needed. |
1636 more definitions are needed. |
1633 The first thing is the flattening function $\sflat{\_}$, |
1637 The first thing is the flattening function $\sflat{\_}$, |
1634 which takes an alternative regular expression and produces a flattened version |
1638 which takes an alternative regular expression and produces a flattened version |
1635 of that alternative regular expression. |
1639 of that alternative regular expression. |
1636 It is needed to convert |
1640 It is needed to convert |
1637 a left-associative nested sequence of alternatives into |
1641 a left-associative nested sequence of alternatives into |
1662 |
1666 |
1663 \begin{center} |
1667 \begin{center} |
1664 \begin{tabular}{lcl} |
1668 \begin{tabular}{lcl} |
1665 $\sflataux{[]}'$ & $ \dn $ & $ []$\\ |
1669 $\sflataux{[]}'$ & $ \dn $ & $ []$\\ |
1666 $\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\ |
1670 $\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\ |
1667 $\sflataux{r :: rs}$ & $\dn$ & $ r::rs$ |
1671 $\sflataux{r :: rs}'$ & $\dn$ & $ r::rs$ |
1668 \end{tabular} |
1672 \end{tabular} |
1669 \end{center} |
1673 \end{center} |
1670 \noindent |
1674 \noindent |
1671 $\sflataux{\_}$ breaks up nested alternative regular expressions |
1675 $\sflataux{\_}$ breaks up nested alternative regular expressions |
1672 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
1676 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
1701 \begin{proof} |
1705 \begin{proof} |
1702 By a reverse induction on the string $s$, where the inductive cases are $[]$ |
1706 By a reverse induction on the string $s$, where the inductive cases are $[]$ |
1703 and $xs @ [x]$. |
1707 and $xs @ [x]$. |
1704 \end{proof} |
1708 \end{proof} |
1705 \noindent |
1709 \noindent |
1706 If we have a regular expression $r$ whose shpae |
1710 If we have a regular expression $r$ whose shape |
1707 fits into those described by $\textit{createdBySequence}$, |
1711 fits into those described by $\textit{createdBySequence}$, |
1708 then we can convert between |
1712 then we can convert between |
1709 $r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with |
1713 $r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with |
1710 $\sflataux{\_}'$: |
1714 $\sflataux{\_}'$: |
1711 \begin{lemma}\label{sfauIdemDer} |
1715 \begin{lemma}\label{sfauIdemDer} |
1712 If $\textit{createdBySequence} \; r$, then |
1716 If $\textit{createdBySequence} \; r$, then |
1713 \begin{center} |
1717 \begin{center} |
1723 |
1727 |
1724 Now we are ready to express |
1728 Now we are ready to express |
1725 the shape of $r_1 \cdot r_2 \backslash s$ |
1729 the shape of $r_1 \cdot r_2 \backslash s$ |
1726 \begin{lemma}\label{seqSfau0} |
1730 \begin{lemma}\label{seqSfau0} |
1727 $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
1731 $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
1728 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ |
1732 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ |
1729 \end{lemma} |
1733 \end{lemma} |
1730 \begin{proof} |
1734 \begin{proof} |
1731 By a reverse induction on the string $s$, where the inductive cases |
1735 By a reverse induction on the string $s$, where the inductive cases |
1732 are $[]$ and $xs @ [x]$. |
1736 are $[]$ and $xs @ [x]$. |
1733 For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2) |
1737 For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2) |
1753 %allowing people investigating derivatives to get an alternative |
1757 %allowing people investigating derivatives to get an alternative |
1754 %view of what $r_1 \cdot r_2$ is. |
1758 %view of what $r_1 \cdot r_2$ is. |
1755 |
1759 |
1756 We now use $\textit{createdBySequence}$ and |
1760 We now use $\textit{createdBySequence}$ and |
1757 $\sflataux{\_}$ to describe an intuition |
1761 $\sflataux{\_}$ to describe an intuition |
1758 behind the closed form of the sequence closed form. |
1762 behind the sequence closed form. |
1759 If two regular expressions only differ in the way their |
1763 If two regular expressions only differ in the way their |
1760 alternatives are nested, then we should be able to get the same result |
1764 alternatives are nested, then we should be able to get the same result |
1761 once we apply simplification to both of them: |
1765 once we apply simplification to both of them: |
1762 \begin{lemma}\label{sflatRsimpeq} |
1766 \begin{lemma}\label{sflatRsimpeq} |
1763 If $r$ is created from a sequence through |
1767 If $r$ is created from a sequence through |
1784 \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ |
1788 \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ |
1785 \end{lemma} |
1789 \end{lemma} |
1786 \begin{proof} |
1790 \begin{proof} |
1787 We know that |
1791 We know that |
1788 $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
1792 $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
1789 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ |
1793 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ |
1790 holds |
1794 holds |
1791 by lemma \ref{seqSfau0}. |
1795 by lemma \ref{seqSfau0}. |
1792 This allows the theorem to go through because of lemma \ref{sflatRsimpeq}. |
1796 This allows the theorem to go through because of lemma \ref{sflatRsimpeq}. |
1793 \end{proof} |
1797 \end{proof} |
1794 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), |
1798 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), |
1843 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
1847 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
1844 \end{center} |
1848 \end{center} |
1845 which can be de-duplicated by $\rDistinct$ |
1849 which can be de-duplicated by $\rDistinct$ |
1846 and therefore bounded finitely. |
1850 and therefore bounded finitely. |
1847 |
1851 |
1848 and then de-duplicate terms of the form ($s'$ being a substring of $s$). |
1852 %and then de-duplicate terms of the form ($s'$ being a substring of $s$). |
1849 This allows us to use a similar technique as $r_1 \cdot r_2$ case, |
1853 %This allows us to use a similar technique as $r_1 \cdot r_2$ case, |
1850 |
1854 |
1851 Now the crux of this section is finding a suitable description |
1855 Now the crux of this section is finding a suitable description |
1852 for $rs$ where |
1856 for $rs$ where |
1853 \begin{center} |
1857 \begin{center} |
1854 $\rderssimp{r^*}{s} = \rsimp{\sum rs}$. |
1858 $\rderssimp{r^*}{s} = \rsimp{\sum rs}$. |
2066 By an induction on $s$, the inductive cases |
2070 By an induction on $s$, the inductive cases |
2067 being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used. |
2071 being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used. |
2068 \end{proof} |
2072 \end{proof} |
2069 \noindent |
2073 \noindent |
2070 |
2074 |
2071 $\hflataux{\_}$ has a similar effect as $\textit{flatten}$: |
2075 The function $\hflataux{\_}$ has a similar effect as $\textit{flatten}$: |
2072 \begin{lemma}\label{hflatauxGrewrites} |
2076 \begin{lemma}\label{hflatauxGrewrites} |
2073 $a :: rs \grewrites \hflataux{a} @ rs$ |
2077 $a :: rs \grewrites \hflataux{a} @ rs$ |
2074 \end{lemma} |
2078 \end{lemma} |
2075 \begin{proof} |
2079 \begin{proof} |
2076 By induction on $a$. $rs$ is set to take arbitrary values. |
2080 By induction on $a$. $rs$ is set to take arbitrary values. |
2084 \end{lemma} |
2088 \end{lemma} |
2085 |
2089 |
2086 \begin{proof} |
2090 \begin{proof} |
2087 By using the rewriting relation $\rightsquigarrow$ |
2091 By using the rewriting relation $\rightsquigarrow$ |
2088 \end{proof} |
2092 \end{proof} |
2089 And from this we obtain a proof that a star's derivative will be the same |
2093 And from this we obtain that a |
2090 as if it had all its nested alternatives created during deriving being flattened out: |
2094 regular expression created by star |
|
2095 is the same as its flattened version, up to equivalence under $\textit{bsimp}$. |
2091 For example, |
2096 For example, |
2092 \begin{lemma}\label{hfauRsimpeq2} |
2097 \begin{lemma}\label{hfauRsimpeq2} |
2093 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
2098 $\textit{createdByStar} \; r \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
2094 \end{lemma} |
2099 \end{lemma} |
2095 \begin{proof} |
2100 \begin{proof} |
2096 By structural induction on $r$, where the induction rules |
2101 By structural induction on $r$, where the induction rules |
2097 are these of $\createdByStar{\_}$. |
2102 are these of $\createdByStar{\_}$. |
2098 Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case. |
2103 Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case. |
2253 $\textit{finite} \; (\textit{sizeNregex} \; N)$ holds. |
2258 $\textit{finite} \; (\textit{sizeNregex} \; N)$ holds. |
2254 \end{lemma} |
2259 \end{lemma} |
2255 \begin{proof} |
2260 \begin{proof} |
2256 By splitting the set $\textit{sizeNregex} \; (N + 1)$ into |
2261 By splitting the set $\textit{sizeNregex} \; (N + 1)$ into |
2257 subsets by their categories: |
2262 subsets by their categories: |
2258 $\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$, |
2263 $\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$, |
2259 and so on. Each of these subsets are finitely bounded. |
2264 and so on. Each of these subsets are finitely bounded. |
2260 \end{proof} |
2265 \end{proof} |
2261 \noindent |
2266 \noindent |
2262 From this we get a corollary that |
2267 From this we get a corollary that |
2263 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of |
2268 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of |
2281 %output of $\textit{rdistinct} \; rs' \; \varnothing$ |
2286 %output of $\textit{rdistinct} \; rs' \; \varnothing$ |
2282 %is bounded by a constant $N * c_N$ depending only on $N$, |
2287 %is bounded by a constant $N * c_N$ depending only on $N$, |
2283 %provided that each of $rs'$'s element |
2288 %provided that each of $rs'$'s element |
2284 %is bounded by $N$. |
2289 %is bounded by $N$. |
2285 |
2290 |
2286 \subsection{$\textit{rsimp}$ Does Not Increment the Size} |
2291 \subsection{$\textit{rsimp}$ Does Not Increase the Size} |
2287 Although it seems evident, we need a series |
2292 Although it seems evident, we need a series |
2288 of non-trivial lemmas to establish that functions such as $\rflts$ |
2293 of non-trivial lemmas to establish that functions such as $\rflts$ |
2289 do not cause the regular expressions to grow. |
2294 do not cause the regular expressions to grow. |
2290 \begin{lemma}\label{rsimpMonoLemmas} |
2295 \begin{lemma}\label{rsimpMonoLemmas} |
2291 \mbox{} |
2296 \mbox{} |
2372 which will each have a size uppder bound |
2377 which will each have a size uppder bound |
2373 according to inductive hypothesis, which controls $r \backslash s$. |
2378 according to inductive hypothesis, which controls $r \backslash s$. |
2374 |
2379 |
2375 We elaborate the above reasoning by a series of lemmas |
2380 We elaborate the above reasoning by a series of lemmas |
2376 below, where straightforward proofs are omitted. |
2381 below, where straightforward proofs are omitted. |
2377 |
2382 %We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. |
2378 |
2383 We show that $\textit{rdistinct}$ and $\rflts$ |
2379 |
|
2380 |
|
2381 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. |
|
2382 |
|
2383 We show that $\rdistinct$ and $\rflts$ |
|
2384 working together is at least as |
2384 working together is at least as |
2385 good as $\rdistinct{}{}$ alone, which can be written as |
2385 good as $\textit{rdistinct}$ alone, which can be written as |
2386 \begin{center} |
2386 \begin{center} |
2387 $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
2387 $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
2388 \leq |
2388 \leq |
2389 \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
2389 \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
2390 \end{center} |
2390 \end{center} |
2553 \end{center} |
2553 \end{center} |
2554 \noindent |
2554 \noindent |
2555 (5) is by theorem \ref{starClosedForm}. |
2555 (5) is by theorem \ref{starClosedForm}. |
2556 (6) is by \ref{altsSimpControl}. |
2556 (6) is by \ref{altsSimpControl}. |
2557 (7) is by corollary \ref{finiteSizeNCorollary}. |
2557 (7) is by corollary \ref{finiteSizeNCorollary}. |
2558 Combining with the case when $s = []$, one gets |
2558 Combining with the case when $s = []$, one obtains |
2559 \begin{center} |
2559 \begin{center} |
2560 \begin{tabular}{lcll} |
2560 \begin{tabular}{lcll} |
2561 $\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r))) |
2561 $\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r))) |
2562 * (1 + (N + n_r)) $ & (8)\\ |
2562 * (1 + (N + n_r)) $ & (8)\\ |
2563 \end{tabular} |
2563 \end{tabular} |
2826 For the closed form to be bounded, we would like |
2826 For the closed form to be bounded, we would like |
2827 simplification to be applied to each term in the list. |
2827 simplification to be applied to each term in the list. |
2828 Therefore we introduce some variants of $\opterm$, |
2828 Therefore we introduce some variants of $\opterm$, |
2829 which help conveniently express the rewriting steps |
2829 which help conveniently express the rewriting steps |
2830 needed in the closed form proof. |
2830 needed in the closed form proof. |
|
2831 We have $\optermOsimp$, $\optermosimp$ and $\optermsimp$ |
|
2832 with slightly different spellings because they help the proof to go through: |
2831 \begin{center} |
2833 \begin{center} |
2832 \begin{tabular}{lcl} |
2834 \begin{tabular}{lcl} |
2833 $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
2835 $\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
2834 & & $\;\;\Some \; (s, n) \Rightarrow |
2836 & & $\;\;\Some \; (s, n) \Rightarrow |
2835 \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\ |
2837 \textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\ |
3082 \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$, |
3084 \textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$, |
3083 where $c_N = \textit{card} \; (\textit{sizeNregex} \; ( |
3085 where $c_N = \textit{card} \; (\textit{sizeNregex} \; ( |
3084 N + \llbracket r^{\{n\}} \rrbracket_r+1))$. |
3086 N + \llbracket r^{\{n\}} \rrbracket_r+1))$. |
3085 \end{theorem} |
3087 \end{theorem} |
3086 \begin{proof} |
3088 \begin{proof} |
3087 We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( |
3089 We have that for all regular expressions $r'$ in |
|
3090 \begin{center} |
|
3091 $\textit{set} \; (\map \; (\optermsimp \; r) \; ( |
3088 \nupdates \; s \; r \; [\Some \; ([c], n)]))$, |
3092 \nupdates \; s \; r \; [\Some \; ([c], n)]))$, |
|
3093 \end{center} |
3089 $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} |
3094 $r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} |
3090 \rrbracket_r + 1$ |
3095 \rrbracket_r + 1$ |
3091 because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot |
3096 because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot |
3092 r^{\{m\}}$ for some string $s'$ and number |
3097 r^{\{m\}}$ for some string $s'$ and number |
3093 $m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}). |
3098 $m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}). |
3114 |
3119 |
3115 |
3120 |
3116 \section{Comments and Future Improvements} |
3121 \section{Comments and Future Improvements} |
3117 \subsection{Some Experimental Results} |
3122 \subsection{Some Experimental Results} |
3118 What guarantee does this bound give us? |
3123 What guarantee does this bound give us? |
3119 Whatever the regex is, it will not grow indefinitely. |
3124 It states that whatever the regex is, it will not grow indefinitely. |
3120 Take our previous example $(a + aa)^*$ as an example: |
3125 Take our previous example $(a + aa)^*$ as an example: |
3121 \begin{center} |
3126 \begin{center} |
3122 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
3127 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
3123 \begin{tikzpicture} |
3128 \begin{tikzpicture} |
3124 \begin{axis}[ |
3129 \begin{axis}[ |
3235 We will discuss improvements to this bound in the next chapter. |
3240 We will discuss improvements to this bound in the next chapter. |
3236 |
3241 |
3237 |
3242 |
3238 |
3243 |
3239 \subsection{Possible Further Improvements} |
3244 \subsection{Possible Further Improvements} |
3240 There are two problems with this finiteness result, though.\\ |
3245 There are two problems with this finiteness result, though:\\ |
3241 (i) |
3246 (i) |
3242 First, it is not yet a direct formalisation of our lexer's complexity, |
3247 First, it is not yet a direct formalisation of our lexer's complexity, |
3243 as a complexity proof would require looking into |
3248 as a complexity proof would require looking into |
3244 the time it takes to execute {\bf all} the operations |
3249 the time it takes to execute {\bf all} the operations |
3245 involved in the lexer (simp, collect, decode), not just the derivative.\\ |
3250 involved in the lexer (simp, collect, decode), not just the derivative.\\ |
3259 \item |
3264 \item |
3260 The bound is already a strong indication that catastrophic |
3265 The bound is already a strong indication that catastrophic |
3261 backtracking is much less likely to occur in our $\blexersimp$ |
3266 backtracking is much less likely to occur in our $\blexersimp$ |
3262 algorithm. |
3267 algorithm. |
3263 We refine $\blexersimp$ with $\blexerStrong$ in the next chapter |
3268 We refine $\blexersimp$ with $\blexerStrong$ in the next chapter |
3264 so that the bound becomes polynomial. |
3269 so that we conjecture the bound becomes polynomial. |
3265 \end{itemize} |
3270 \end{itemize} |
3266 |
3271 |
3267 %---------------------------------------------------------------------------------------- |
3272 %---------------------------------------------------------------------------------------- |
3268 % SECTION 4 |
3273 % SECTION 4 |
3269 %---------------------------------------------------------------------------------------- |
3274 %---------------------------------------------------------------------------------------- |
3324 \begin{center} |
3329 \begin{center} |
3325 $(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
3330 $(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
3326 [1mm] |
3331 [1mm] |
3327 $(1 \leq m' \leq m )$ |
3332 $(1 \leq m' \leq m )$ |
3328 \end{center} |
3333 \end{center} |
3329 There at at least exponentially |
3334 There are at least exponentially |
3330 many such terms.\footnote{To be exact, these terms are |
3335 many such terms.\footnote{To be exact, these terms are |
3331 distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted, |
3336 distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted, |
3332 but the point is that the number is exponential. |
3337 but the point is that the number is exponential. |
3333 } |
3338 } |
3334 With each new input character taking the derivative against the intermediate result, more and more such distinct |
3339 With each new input character taking the derivative against the intermediate result, more and more such distinct |