474 % \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ |
475 % \rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r2}{\_} \;(\vsuf{s}{r_1})))}$ |
475 % \end{center} |
476 % \end{center} |
476 %\end{quote} |
477 %\end{quote} |
477 %\noindent |
478 %\noindent |
478 %We explain in detail how we reached those claims. |
479 %We explain in detail how we reached those claims. |
479 \subsection{Basic Properties needed for Closed Forms} |
480 \subsection{The Idea Behind Closed Forms} |
480 If we attempt to prove |
481 If we attempt to prove |
481 \begin{center} |
482 \begin{center} |
482 $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{bsimps} s \rrbracket_r \leq N_r$ |
483 $\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$ |
483 \end{center} |
484 \end{center} |
484 using a naive induction on the structure of $r$, |
485 using a naive induction on the structure of $r$, |
485 then we are stuck at the inductive cases such as |
486 then we are stuck at the inductive cases such as |
486 $r_1\cdot r_2$. |
487 $r_1\cdot r_2$. |
487 The inductive hypotheses are: |
488 The inductive hypotheses are: |
488 \begin{center} |
489 \begin{center} |
489 1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. |
490 1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. |
490 \;\;\forall s. \llbracket r_1 \backslash_{bsimps} s \rrbracket_r \leq N_{r_1}. $\\ |
491 \;\;\forall s. \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\ |
491 2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. |
492 2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. |
492 \;\; \forall s. \llbracket r_2 \backslash_{bsimps} s \rrbracket_r \leq N_{r_2}. $ |
493 \;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $ |
493 \end{center} |
494 \end{center} |
494 The inductive step to prove would be |
495 The inductive step to prove would be |
495 \begin{center} |
496 \begin{center} |
496 $\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. |
497 $\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. |
497 \llbracket (r_1 \cdot r_2) \backslash_{bsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$ |
498 \llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$ |
498 \end{center} |
499 \end{center} |
499 The problem is that it is not clear what |
500 The problem is that it is not clear what |
500 $(r_1\cdot r_2) \backslash_{bsimps} s$ looks like, |
501 $(r_1\cdot r_2) \backslash_{rsimps} s$ looks like, |
501 and therefore $N_{r_1}$ and $N_{r_2}$ in the |
502 and therefore $N_{r_1}$ and $N_{r_2}$ in the |
502 inductive hypotheses cannot be directly used. |
503 inductive hypotheses cannot be directly used. |
503 |
504 We have already seen that $(r_1 \cdot r_2)\backslash s$ |
504 |
505 and $(r^*)\backslash s$ can grow in a wild way. |
505 The next steps involves getting a closed form for |
506 The point is that they will be equivalent to a list of |
506 $\rderssimp{r}{s}$ and then obtaining |
507 terms $\sum rs$, where each term in $rs$ will |
507 an estimate for the closed form. |
508 be made of $r_1 \backslash s' $, $r_2\backslash s'$, |
|
509 and $r \backslash s'$ with $s' \in \textit{SubString} \; s$. |
|
510 The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$ |
|
511 in the simplification which saves $rs$ from growing indefinitely. |
|
512 |
|
513 Based on this idea, we sketch a proof by first showing the equality (where |
|
514 $f$ and $g$ are functions that do not increase the size of the input) |
|
515 \begin{center} |
|
516 $(r_1 \cdot r_2)\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, |
|
517 \end{center} |
|
518 and then show the right-hand-side can be finitely bounded. |
|
519 We call the right-hand-side the |
|
520 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. |
|
521 We will flesh out the proof sketch in the next section. |
|
522 |
|
523 \section{Details of Closed Forms and Bounds} |
|
524 In this section we introduce in detail |
|
525 how the closed forms are obtained for regular expressions' |
|
526 derivatives and how they are bounded. |
|
527 We start by proving some basic identities |
|
528 involving the simplification functions for r-regular expressions. |
|
529 After that we use these identities to establish the |
|
530 closed forms we need. |
|
531 Finally, we prove the functions such as $\flts$ |
|
532 will keep the size non-increasing. |
|
533 Putting this together with a general bound |
|
534 on the finiteness of distinct regular expressions |
|
535 smaller than a certain size, we obtain a bound on |
|
536 the closed forms. |
|
537 %$r_1\cdot r_2$, $r^*$ and $\sum rs$. |
|
538 |
|
539 |
|
540 |
|
541 \subsection{Some Basic Identities} |
|
542 |
|
543 |
508 \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully} |
544 \subsubsection{$\textit{rdistinct}$'s Deduplicates Successfully} |
509 The $\textit{rdistinct}$ function, as its name suggests, will |
545 The $\textit{rdistinct}$ function, as its name suggests, will |
510 remove duplicates in an \emph{r}$\textit{rexp}$ list, |
546 remove duplicates in an r-regular expression list. |
511 according to the accumulator |
547 It will also correctly exclude any elements that |
512 and leave only one of each different element in a list: |
548 is intially in the accumulator set. |
513 \begin{lemma}\label{rdistinctDoesTheJob} |
549 \begin{lemma}\label{rdistinctDoesTheJob} |
514 The function $\textit{rdistinct}$ satisfies the following |
550 %The function $\textit{rdistinct}$ satisfies the following |
515 properties: |
551 %properties: |
|
552 Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its |
|
553 recursive definition here, its Isabelle counterpart would be $\textit{distinct}$.} |
|
554 readily defined |
|
555 for testing |
|
556 whether a list's elements are all unique. Then the following |
|
557 properties about $\textit{rdistinct}$ hold: |
516 \begin{itemize} |
558 \begin{itemize} |
517 \item |
559 \item |
518 If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. |
560 If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. |
519 \item |
561 \item |
520 If list $rs'$ is the result of $\rdistinct{rs}{acc}$, |
562 %If list $rs'$ is the result of $\rdistinct{rs}{acc}$, |
521 then $\textit{isDistinct} \; rs'$. |
563 $\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$. |
522 \item |
564 \item |
523 $\rdistinct{rs}{acc} = rs - acc$ |
565 $\textit{set} \; (\rdistinct{rs}{acc}) |
|
566 = (textit{set} \; rs) - acc$ |
524 \end{itemize} |
567 \end{itemize} |
525 \end{lemma} |
568 \end{lemma} |
526 \noindent |
569 \noindent |
527 The predicate $\textit{isDistinct}$ is for testing |
|
528 whether a list's elements are all unique. It is defined |
|
529 recursively on the structure of a regular expression, |
|
530 and we omit the precise definition here. |
|
531 \begin{proof} |
570 \begin{proof} |
532 The first part is by an induction on $rs$. |
571 The first part is by an induction on $rs$. |
533 The second and third part can be proven by using the |
572 The second and third part can be proven by using the |
534 induction rules of $\rdistinct{\_}{\_}$. |
573 inductive cases of $\textit{rdistinct}$. |
535 |
574 |
536 \end{proof} |
575 \end{proof} |
537 |
576 |
538 \noindent |
577 \noindent |
539 $\rdistinct{\_}{\_}$ will cancel out all regular expression terms |
578 $\rdistinct{\_}{\_}$ will cancel out all regular expression terms |
540 that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary |
579 that are in the accumulator, therefore prepending a list $rs_a$ with an arbitrary |
541 list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$ |
580 list $rs$ whose elements are all from the accumulator, and then call $\rdistinct{\_}{\_}$ |
542 on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$ |
581 on the resulting list, the output will be as if we had called $\rdistinct{\_}{\_}$ |
543 without the prepending of $rs$: |
582 without the prepending of $rs$: |
544 \begin{lemma} |
583 \begin{lemma}\label{rdistinctConcat} |
545 The elements appearing in the accumulator will always be removed. |
584 The elements appearing in the accumulator will always be removed. |
546 More precisely, |
585 More precisely, |
547 \begin{itemize} |
586 \begin{itemize} |
548 \item |
587 \item |
549 If $rs \subseteq rset$, then |
588 If $rs \subseteq rset$, then |
550 $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$. |
589 $\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$. |
551 \item |
590 \item |
552 Furthermore, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, |
591 More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, |
553 then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$ |
592 then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$ |
554 \end{itemize} |
593 \end{itemize} |
555 \end{lemma} |
594 \end{lemma} |
556 |
595 |
557 \begin{proof} |
596 \begin{proof} |
558 By induction on $rs$. |
597 By induction on $rs$ and using \ref{rdistinctDoesTheJob}. |
559 \end{proof} |
598 \end{proof} |
560 \noindent |
599 \noindent |
561 On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated, |
600 On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated, |
562 then expanding the accumulator to include that element will not cause the output list to change: |
601 then expanding the accumulator to include that element will not cause the output list to change: |
563 \begin{lemma} |
602 \begin{lemma} |
1756 is bounded by a constant $c_N$ depending only on $N$, |
1806 is bounded by a constant $c_N$ depending only on $N$, |
1757 provided that each of $rs'$'s element |
1807 provided that each of $rs'$'s element |
1758 is bounded by $N$. |
1808 is bounded by $N$. |
1759 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. |
1809 We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. |
1760 |
1810 |
1761 We show how $\rdistinct$ and $\rflts$ |
1811 We show that $\rdistinct$ and $\rflts$ |
1762 in the simplification function together is at least as |
1812 working together is at least as |
1763 good as $\rdistinct{}{}$ alone. |
1813 good as $\rdistinct{}{}$ alone, which can be written as |
1764 \begin{lemma}\label{interactionFltsDB} |
1814 \begin{center} |
1765 $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
1815 $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
1766 \leq |
1816 \leq |
1767 \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
1817 \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
1768 \end{lemma} |
1818 \end{center} |
|
1819 We need this so that we know the outcome of our real |
|
1820 simplification is better than or equal to a rough estimate, |
|
1821 and therefore can be bounded by that estimate. |
|
1822 This is a bit harder to establish compared with proving |
|
1823 $\textit{flts}$ does not make a list larger (which can |
|
1824 be proven using routine induction): |
|
1825 \begin{center} |
|
1826 $\llbracket \textit{rflts}\; rs \rrbracket_r \leq |
|
1827 \llbracket \textit{rs} \rrbracket_r$ |
|
1828 \end{center} |
|
1829 We cannot simply prove how each helper function |
|
1830 reduces the size and then put them together: |
|
1831 From |
|
1832 \begin{center} |
|
1833 $\llbracket \textit{rflts}\; rs \rrbracket_r \leq |
|
1834 \llbracket \; \textit{rs} \rrbracket_r$ |
|
1835 \end{center} |
|
1836 and |
|
1837 \begin{center} |
|
1838 $\llbracket \textit{rdistinct} \; rs \; \varnothing \leq |
|
1839 \llbracket rs \rrbracket_r$ |
|
1840 \end{center} |
|
1841 one cannot imply |
|
1842 \begin{center} |
|
1843 $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
1844 \leq |
|
1845 \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
1846 \end{center} |
|
1847 What we can imply is that |
|
1848 \begin{center} |
|
1849 $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
1850 \leq |
|
1851 \llbracket rs \rrbracket_r$ |
|
1852 \end{center} |
|
1853 but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded. |
|
1854 The way we |
|
1855 get through this is by first proving a more general lemma |
|
1856 (so that the inductive case goes through): |
|
1857 \begin{lemma}\label{fltsSizeReductionAlts} |
|
1858 If we have three accumulator sets: |
|
1859 $noalts\_set$, $alts\_set$ and $corr\_set$, |
|
1860 satisfying: |
|
1861 \begin{itemize} |
|
1862 \item |
|
1863 $\forall r \in noalts\_set. \; \nexists xs.\; r = \sum xs$ |
|
1864 \item |
|
1865 $\forall r \in alts\_set. \; \exists xs. \; r = \sum xs |
|
1866 \; \textit{and} \; set \; xs \subseteq corr\_set$ |
|
1867 \end{itemize} |
|
1868 then we have that |
|
1869 \begin{center} |
|
1870 \begin{tabular}{lcl} |
|
1871 $\llbracket (\textit{rdistinct} \; (\textit{rflts} \; as) \; |
|
1872 (noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\ |
|
1873 $\llbracket (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup |
|
1874 \{ \ZERO \} )) \rrbracket_r$ & & \\ |
|
1875 \end{tabular} |
|
1876 \end{center} |
|
1877 holds. |
|
1878 \end{lemma} |
|
1879 \noindent |
|
1880 We need to split the accumulator into two parts: the part |
|
1881 which contains alternative regular expressions ($alts\_set$), and |
|
1882 the part without any of them($noalts\_set$). |
|
1883 The set $corr\_set$ is the corresponding set |
|
1884 of $alts\_set$ with all elements under the $\sum$ constructor |
|
1885 spilled out. |
|
1886 \begin{proof} |
|
1887 By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}. |
|
1888 \end{proof} |
|
1889 By setting all three sets to the empty set, one gets the desired size estimate: |
|
1890 \begin{corollary}\label{interactionFltsDB} |
|
1891 $\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
1892 \leq |
|
1893 \llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
1894 \end{corollary} |
|
1895 \begin{proof} |
|
1896 By using the lemma \ref{fltsSizeReductionAlts}. |
|
1897 \end{proof} |
1769 \noindent |
1898 \noindent |
1770 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
1899 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
1771 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
1900 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
1772 |
1901 |
1773 Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$: |
1902 Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$: |
2060 |
2190 |
2061 |
2191 |
2062 %---------------------------------------------------------------------------------------- |
2192 %---------------------------------------------------------------------------------------- |
2063 % SECTION ALTS CLOSED FORM |
2193 % SECTION ALTS CLOSED FORM |
2064 %---------------------------------------------------------------------------------------- |
2194 %---------------------------------------------------------------------------------------- |
2065 \section{A Closed Form for \textit{ALTS}} |
2195 %\section{A Closed Form for \textit{ALTS}} |
2066 Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
2196 %Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
2067 |
2197 % |
2068 |
2198 % |
2069 There are a few key steps, one of these steps is |
2199 %There are a few key steps, one of these steps is |
2070 |
2200 % |
2071 |
2201 % |
2072 |
2202 % |
2073 One might want to prove this by something a simple statement like: |
2203 %One might want to prove this by something a simple statement like: |
2074 |
2204 % |
2075 For this to hold we want the $\textit{distinct}$ function to pick up |
2205 %For this to hold we want the $\textit{distinct}$ function to pick up |
2076 the elements before and after derivatives correctly: |
2206 %the elements before and after derivatives correctly: |
2077 $r \in rset \equiv (rder x r) \in (rder x rset)$. |
2207 %$r \in rset \equiv (rder x r) \in (rder x rset)$. |
2078 which essentially requires that the function $\backslash$ is an injective mapping. |
2208 %which essentially requires that the function $\backslash$ is an injective mapping. |
2079 |
2209 % |
2080 Unfortunately the function $\backslash c$ is not an injective mapping. |
2210 %Unfortunately the function $\backslash c$ is not an injective mapping. |
2081 |
2211 % |
2082 \subsection{function $\backslash c$ is not injective (1-to-1)} |
2212 %\subsection{function $\backslash c$ is not injective (1-to-1)} |
2083 \begin{center} |
2213 %\begin{center} |
2084 The derivative $w.r.t$ character $c$ is not one-to-one. |
2214 % The derivative $w.r.t$ character $c$ is not one-to-one. |
2085 Formally, |
2215 % Formally, |
2086 $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
2216 % $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
2087 \end{center} |
2217 %\end{center} |
2088 This property is trivially true for the |
2218 %This property is trivially true for the |
2089 character regex example: |
2219 %character regex example: |
2090 \begin{center} |
2220 %\begin{center} |
2091 $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
2221 % $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
2092 \end{center} |
2222 %\end{center} |
2093 But apart from the cases where the derivative |
2223 %But apart from the cases where the derivative |
2094 output is $\ZERO$, are there non-trivial results |
2224 %output is $\ZERO$, are there non-trivial results |
2095 of derivatives which contain strings? |
2225 %of derivatives which contain strings? |
2096 The answer is yes. |
2226 %The answer is yes. |
2097 For example, |
2227 %For example, |
2098 \begin{center} |
2228 %\begin{center} |
2099 Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
2229 % Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
2100 where $a$ is not nullable.\\ |
2230 % where $a$ is not nullable.\\ |
2101 $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
2231 % $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
2102 $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
2232 % $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
2103 \end{center} |
2233 %\end{center} |
2104 We start with two syntactically different regular expressions, |
2234 %We start with two syntactically different regular expressions, |
2105 and end up with the same derivative result. |
2235 %and end up with the same derivative result. |
2106 This is not surprising as we have such |
2236 %This is not surprising as we have such |
2107 equality as below in the style of Arden's lemma:\\ |
2237 %equality as below in the style of Arden's lemma:\\ |
2108 \begin{center} |
2238 %\begin{center} |
2109 $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
2239 % $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
2110 \end{center} |
2240 %\end{center} |
2111 |
2241 |
|
2242 \section{Further Improvements to the Bound} |
2112 There are two problems with this finiteness result, though. |
2243 There are two problems with this finiteness result, though. |
2113 \begin{itemize} |
2244 \begin{itemize} |
2114 \item |
2245 \item |
2115 First, It is not yet a direct formalisation of our lexer's complexity, |
2246 First, It is not yet a direct formalisation of our lexer's complexity, |
2116 as a complexity proof would require looking into |
2247 as a complexity proof would require looking into |