6 \label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
6 \label{Bitcoded2} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX} |
7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
7 %Then we illustrate how the algorithm without bitcodes falls short for such aggressive |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
8 %simplifications and therefore introduce our version of the bitcoded algorithm and |
9 %its correctness proof in |
9 %its correctness proof in |
10 %Chapter 3\ref{Chapter3}. |
10 %Chapter 3\ref{Chapter3}. |
|
11 |
|
12 |
|
13 |
|
14 Now we introduce the simplifications, which is why we introduce the |
|
15 bitcodes in the first place. |
|
16 |
|
17 \subsection*{Simplification Rules} |
|
18 |
|
19 This section introduces aggressive (in terms of size) simplification rules |
|
20 on annotated regular expressions |
|
21 to keep derivatives small. Such simplifications are promising |
|
22 as we have |
|
23 generated test data that show |
|
24 that a good tight bound can be achieved. We could only |
|
25 partially cover the search space as there are infinitely many regular |
|
26 expressions and strings. |
|
27 |
|
28 One modification we introduced is to allow a list of annotated regular |
|
29 expressions in the $\sum$ constructor. This allows us to not just |
|
30 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but |
|
31 also unnecessary ``copies'' of regular expressions (very similar to |
|
32 simplifying $r + r$ to just $r$, but in a more general setting). Another |
|
33 modification is that we use simplification rules inspired by Antimirov's |
|
34 work on partial derivatives. They maintain the idea that only the first |
|
35 ``copy'' of a regular expression in an alternative contributes to the |
|
36 calculation of a POSIX value. All subsequent copies can be pruned away from |
|
37 the regular expression. A recursive definition of our simplification function |
|
38 that looks somewhat similar to our Scala code is given below: |
|
39 %\comment{Use $\ZERO$, $\ONE$ and so on. |
|
40 %Is it $ALTS$ or $ALTS$?}\\ |
|
41 |
|
42 \begin{center} |
|
43 \begin{tabular}{@{}lcl@{}} |
|
44 |
|
45 $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
|
46 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
|
47 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
|
48 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
|
49 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
|
50 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ |
|
51 |
|
52 $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map} \; simp \; as)) \; \textit{match} $ \\ |
|
53 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
|
54 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
|
55 &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
|
56 |
|
57 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
|
58 \end{tabular} |
|
59 \end{center} |
|
60 |
|
61 \noindent |
|
62 The simplification does a pattern matching on the regular expression. |
|
63 When it detected that the regular expression is an alternative or |
|
64 sequence, it will try to simplify its child regular expressions |
|
65 recursively and then see if one of the children turns into $\ZERO$ or |
|
66 $\ONE$, which might trigger further simplification at the current level. |
|
67 The most involved part is the $\sum$ clause, where we use two |
|
68 auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested |
|
69 alternatives and reduce as many duplicates as possible. Function |
|
70 $\textit{distinct}$ keeps the first occurring copy only and removes all later ones |
|
71 when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s. |
|
72 Its recursive definition is given below: |
|
73 |
|
74 \begin{center} |
|
75 \begin{tabular}{@{}lcl@{}} |
|
76 $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
|
77 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
|
78 $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ |
|
79 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) |
|
80 \end{tabular} |
|
81 \end{center} |
|
82 |
|
83 \noindent |
|
84 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
|
85 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
|
86 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
|
87 |
|
88 Having defined the $\simp$ function, |
|
89 we can use the previous notation of natural |
|
90 extension from derivative w.r.t.~character to derivative |
|
91 w.r.t.~string:%\comment{simp in the [] case?} |
|
92 |
|
93 \begin{center} |
|
94 \begin{tabular}{lcl} |
|
95 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
96 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
97 \end{tabular} |
|
98 \end{center} |
|
99 |
|
100 \noindent |
|
101 to obtain an optimised version of the algorithm: |
|
102 |
|
103 \begin{center} |
|
104 \begin{tabular}{lcl} |
|
105 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
106 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
107 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
108 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
109 & & $\;\;\textit{else}\;\textit{None}$ |
|
110 \end{tabular} |
|
111 \end{center} |
|
112 |
|
113 \noindent |
|
114 This algorithm keeps the regular expression size small, for example, |
|
115 with this simplification our previous $(a + aa)^*$ example's 8000 nodes |
|
116 will be reduced to just 17 and stays constant, no matter how long the |
|
117 input string is. |
|
118 |
|
119 |
|
120 |
|
121 |
|
122 |
|
123 |
11 |
124 |
12 |
125 |
13 |
126 |
14 |
127 |
15 %---------------------------------------------------------------------------------------- |
128 %---------------------------------------------------------------------------------------- |