207 We omit these functions, as they are routine. Please refer to the formalisation |
207 We omit these functions, as they are routine. Please refer to the formalisation |
208 (in file BasicIdentities.thy) for the exact definition. |
208 (in file BasicIdentities.thy) for the exact definition. |
209 With $\rrexp$ the size caclulation of annotated regular expressions' |
209 With $\rrexp$ the size caclulation of annotated regular expressions' |
210 simplification and derivatives can be done by the size of their unlifted |
210 simplification and derivatives can be done by the size of their unlifted |
211 counterpart with the unlifted version of simplification and derivatives applied. |
211 counterpart with the unlifted version of simplification and derivatives applied. |
212 \begin{lemma} |
212 \begin{lemma}\label{sizeRelations} |
213 The following equalities hold: |
213 The following equalities hold: |
214 \begin{itemize} |
214 \begin{itemize} |
215 \item |
215 \item |
216 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ |
216 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ |
217 \item |
217 \item |
1242 |
1242 |
1243 \end{center} |
1243 \end{center} |
1244 \end{corollary} |
1244 \end{corollary} |
1245 \noindent |
1245 \noindent |
1246 \subsection{Closed Forms for Star Regular Expressions} |
1246 \subsection{Closed Forms for Star Regular Expressions} |
1247 We use a similar technique as $r_1 \cdot r_2$ case, |
1247 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
1248 generating |
1248 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
|
1249 the property of the $\distinct$ function. |
|
1250 Now we try to get a bound on $r^* \backslash s$ as well. |
|
1251 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
|
1252 \begin{center} |
|
1253 |
|
1254 $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad |
|
1255 r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad |
|
1256 (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} |
|
1257 \quad \ldots$ |
|
1258 |
|
1259 \end{center} |
|
1260 When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
|
1261 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
|
1262 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size |
|
1263 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not |
|
1264 count the possible size explosions of $r \backslash c$ themselves. |
|
1265 |
|
1266 Thanks to $\rflts$ and $\rDistinct$, we are able to open up regexes like |
|
1267 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
|
1268 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
|
1269 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', |
|
1270 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
|
1271 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). |
|
1272 This allows us to use a similar technique as $r_1 \cdot r_2$ case, |
|
1273 where the crux is to get an equivalent form of |
|
1274 $\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$. |
|
1275 This requires generating |
1249 all possible sub-strings $s'$ of $s$ |
1276 all possible sub-strings $s'$ of $s$ |
1250 such that $r\backslash s' \cdot r^*$ will appear |
1277 such that $r\backslash s' \cdot r^*$ will appear |
1251 as a term in $(r^*) \backslash s$. |
1278 as a term in $(r^*) \backslash s$. |
1252 The first function we define is a single-step |
1279 The first function we define is a single-step |
1253 updating function $\starupdate$, which takes three arguments as input: |
1280 updating function $\starupdate$, which takes three arguments as input: |
1302 $\hflat{r}$ & $\dn$ & $r$ |
1329 $\hflat{r}$ & $\dn$ & $r$ |
1303 \end{tabular} |
1330 \end{tabular} |
1304 \end{center} |
1331 \end{center} |
1305 \noindent |
1332 \noindent |
1306 %MAYBE TODO: introduce createdByStar |
1333 %MAYBE TODO: introduce createdByStar |
|
1334 Again these definitions are tailor-made for dealing with alternatives that have |
|
1335 originated from a star's derivatives, so we do not attempt to open up all possible |
|
1336 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
|
1337 elements. |
|
1338 We give a predicate for such "star-created" regular expressions: |
|
1339 \begin{center} |
|
1340 \begin{tabular}{lcr} |
|
1341 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
|
1342 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
|
1343 \end{tabular} |
|
1344 \end{center} |
|
1345 |
|
1346 These definitions allows us the flexibility to talk about |
|
1347 regular expressions in their most convenient format, |
|
1348 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ |
|
1349 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. |
|
1350 These definitions help express that certain classes of syntatically |
|
1351 distinct regular expressions are actually the same under simplification. |
|
1352 This is not entirely true for annotated regular expressions: |
|
1353 %TODO: bsimp bders \neq bderssimp |
|
1354 \begin{center} |
|
1355 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
|
1356 \end{center} |
|
1357 For bit-codes, the order in which simplification is applied |
|
1358 might cause a difference in the location they are placed. |
|
1359 If we want something like |
|
1360 \begin{center} |
|
1361 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
|
1362 \end{center} |
|
1363 Some "canonicalization" procedure is required, |
|
1364 which either pushes all the common bitcodes to nodes |
|
1365 as senior as possible: |
|
1366 \begin{center} |
|
1367 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
1368 \end{center} |
|
1369 or does the reverse. However bitcodes are not of interest if we are talking about |
|
1370 the $\llbracket r \rrbracket$ size of a regex. |
|
1371 Therefore for the ease and simplicity of producing a |
|
1372 proof for a size bound, we are happy to restrict ourselves to |
|
1373 unannotated regular expressions, and obtain such equalities as |
|
1374 \begin{lemma} |
|
1375 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
|
1376 \end{lemma} |
|
1377 |
|
1378 \begin{proof} |
|
1379 By using the rewriting relation $\rightsquigarrow$ |
|
1380 \end{proof} |
|
1381 %TODO: rsimp sflat |
|
1382 And from this we obtain a proof that a star's derivative will be the same |
|
1383 as if it had all its nested alternatives created during deriving being flattened out: |
|
1384 For example, |
|
1385 \begin{lemma} |
|
1386 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
|
1387 \end{lemma} |
|
1388 \begin{proof} |
|
1389 By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. |
|
1390 \end{proof} |
|
1391 % The simplification of a flattened out regular expression, provided it comes |
|
1392 %from the derivative of a star, is the same as the one nested. |
|
1393 |
|
1394 |
|
1395 |
1307 We first introduce an inductive property |
1396 We first introduce an inductive property |
1308 for $\starupdate$ and $\hflataux{\_}$, |
1397 for $\starupdate$ and $\hflataux{\_}$, |
1309 it says if we do derivatives of $r^*$ |
1398 it says if we do derivatives of $r^*$ |
1310 with a string that starts with $c$, |
1399 with a string that starts with $c$, |
1311 then flatten it out, |
1400 then flatten it out, |
1889 \end{center} |
1985 \end{center} |
1890 |
1986 |
1891 %---------------------------------------------------------------------------------------- |
1987 %---------------------------------------------------------------------------------------- |
1892 % SECTION 4 |
1988 % SECTION 4 |
1893 %---------------------------------------------------------------------------------------- |
1989 %---------------------------------------------------------------------------------------- |
1894 \section{A Bound for the Star Regular Expression} |
|
1895 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
|
1896 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
|
1897 the property of the $\distinct$ function. |
|
1898 Now we try to get a bound on $r^* \backslash s$ as well. |
|
1899 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
|
1900 \begin{center} |
|
1901 |
|
1902 $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad |
|
1903 r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad |
|
1904 (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} |
|
1905 \quad \ldots$ |
|
1906 |
|
1907 \end{center} |
|
1908 When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
|
1909 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
|
1910 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size |
|
1911 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not |
|
1912 count the possible size explosions of $r \backslash c$ themselves. |
|
1913 |
|
1914 Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like |
|
1915 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
|
1916 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
|
1917 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). |
|
1918 For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$: |
|
1919 %TODO: definitions of and \hflataux \hflat |
|
1920 \begin{center} |
|
1921 \begin{tabular}{ccc} |
|
1922 $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
1923 $\hflataux r$ & $=$ & $ [r]$ |
|
1924 \end{tabular} |
|
1925 \end{center} |
|
1926 |
|
1927 \begin{center} |
|
1928 \begin{tabular}{ccc} |
|
1929 $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ |
|
1930 $\hflat r$ & $=$ & $ r$ |
|
1931 \end{tabular} |
|
1932 \end{center}s |
|
1933 Again these definitions are tailor-made for dealing with alternatives that have |
|
1934 originated from a star's derivatives, so we don't attempt to open up all possible |
|
1935 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
|
1936 elements. |
|
1937 We give a predicate for such "star-created" regular expressions: |
|
1938 \begin{center} |
|
1939 \begin{tabular}{lcr} |
|
1940 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
|
1941 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
|
1942 \end{tabular} |
|
1943 \end{center} |
|
1944 |
|
1945 These definitions allows us the flexibility to talk about |
|
1946 regular expressions in their most convenient format, |
|
1947 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ |
|
1948 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. |
|
1949 These definitions help express that certain classes of syntatically |
|
1950 distinct regular expressions are actually the same under simplification. |
|
1951 This is not entirely true for annotated regular expressions: |
|
1952 %TODO: bsimp bders \neq bderssimp |
|
1953 \begin{center} |
|
1954 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
|
1955 \end{center} |
|
1956 For bit-codes, the order in which simplification is applied |
|
1957 might cause a difference in the location they are placed. |
|
1958 If we want something like |
|
1959 \begin{center} |
|
1960 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
|
1961 \end{center} |
|
1962 Some "canonicalization" procedure is required, |
|
1963 which either pushes all the common bitcodes to nodes |
|
1964 as senior as possible: |
|
1965 \begin{center} |
|
1966 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
1967 \end{center} |
|
1968 or does the reverse. However bitcodes are not of interest if we are talking about |
|
1969 the $\llbracket r \rrbracket$ size of a regex. |
|
1970 Therefore for the ease and simplicity of producing a |
|
1971 proof for a size bound, we are happy to restrict ourselves to |
|
1972 unannotated regular expressions, and obtain such equalities as |
|
1973 \begin{lemma} |
|
1974 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
|
1975 \end{lemma} |
|
1976 |
|
1977 \begin{proof} |
|
1978 By using the rewriting relation $\rightsquigarrow$ |
|
1979 \end{proof} |
|
1980 %TODO: rsimp sflat |
|
1981 And from this we obtain a proof that a star's derivative will be the same |
|
1982 as if it had all its nested alternatives created during deriving being flattened out: |
|
1983 For example, |
|
1984 \begin{lemma} |
|
1985 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
|
1986 \end{lemma} |
|
1987 \begin{proof} |
|
1988 By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. |
|
1989 \end{proof} |
|
1990 % The simplification of a flattened out regular expression, provided it comes |
|
1991 %from the derivative of a star, is the same as the one nested. |
|
1992 |
|
1993 |
1990 |
1994 |
1991 |
1995 |
1992 |
1996 |
1993 |
1997 |
1994 |