10 %Chapter 3\ref{Chapter3}. |
10 %Chapter 3\ref{Chapter3}. |
11 |
11 |
12 |
12 |
13 |
13 |
14 In this chapter we introduce simplifications |
14 In this chapter we introduce simplifications |
15 on annotated regular expressions that can be applied to |
15 for annotated regular expressions that can be applied to |
16 each intermediate derivative result. This allows |
16 each intermediate derivative result. This allows |
17 us to make $\blexer$ much more efficient. |
17 us to make $\blexer$ much more efficient. |
18 Sulzmann and Lu already had some bit-coded simplifications, |
18 Sulzmann and Lu already introduced some simplifications for bitcoded regular expressions, |
19 but their simplification functions were inefficient. |
19 but their simplification functions were inefficient and in some cases needs fixing. |
20 We contrast our simplification function |
20 %We contrast our simplification function |
21 with Sulzmann and Lu's, indicating the simplicity of our algorithm. |
21 %with Sulzmann and Lu's, indicating the simplicity of our algorithm. |
22 This is another case for the usefulness |
22 %This is another case for the usefulness |
23 and reliability of formal proofs on algorithms. |
23 %and reliability of formal proofs on algorithms. |
24 These ``aggressive'' simplifications would not be possible in the injection-based |
24 %These ``aggressive'' simplifications would not be possible in the injection-based |
25 lexing we introduced in chapter \ref{Inj}. |
25 %lexing we introduced in chapter \ref{Inj}. |
26 We then prove the correctness with the improved version of |
26 %We then prove the correctness with the improved version of |
27 $\blexer$, called $\blexersimp$, by establishing |
27 %$\blexer$, called $\blexersimp$, by establishing |
28 $\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system. |
28 %$\blexer \; r \; s= \blexersimp \; r \; s$ using a term rewriting system. |
29 |
29 % |
30 \section{Simplifications by Sulzmann and Lu} |
30 \section{Simplifications by Sulzmann and Lu} |
31 Consider the derivatives of examples such as $(a^*a^*)^*$ |
31 Consider the derivatives of the following example $(a^*a^*)^*$: |
32 and $(a^* + (aa)^*)^*$: |
32 %and $(a^* + (aa)^*)^*$: |
33 \begin{center} |
33 \begin{center} |
34 $(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} $\\ |
34 \begin{tabular}{lcl} |
35 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^* \stackrel{\backslash a}{\longrightarrow} \ldots$ |
35 $(a^*a^*)^*$ & $ \stackrel{\backslash a}{\longrightarrow}$ & |
36 \end{center} |
36 $ (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ |
37 \noindent |
37 & |
38 As can be seen, there is a lot of duplication |
38 $ \stackrel{\backslash a}{\longrightarrow} $ & |
39 in the example we have already mentioned in |
39 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$\\ |
40 \ref{eqn:growth2}. |
40 & $\stackrel{\backslash a}{ |
|
41 \longrightarrow} $ & $\ldots$\\ |
|
42 \end{tabular} |
|
43 \end{center} |
|
44 \noindent |
|
45 As can be seen, there are serveral duplications. |
41 A simple-minded simplification function cannot simplify |
46 A simple-minded simplification function cannot simplify |
42 the third regular expression in the above chain of derivative |
47 the third regular expression in the above chain of derivative |
43 regular expressions, namely |
48 regular expressions, namely |
44 \begin{center} |
49 \begin{center} |
45 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
50 $((a^*a^* + a^*) + a^*)\cdot(a^*a^*)^* + (a^*a^* + a^*)\cdot(a^*a^*)^*$ |
50 One would expect a better simplification function to work in the |
55 One would expect a better simplification function to work in the |
51 following way: |
56 following way: |
52 \begin{gather*} |
57 \begin{gather*} |
53 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
58 ((a^*a^* + \underbrace{a^*}_\text{A})+\underbrace{a^*}_\text{duplicate of A})\cdot(a^*a^*)^* + |
54 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ |
59 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this}.\\ |
55 \bigg\downarrow \\ |
60 \bigg\downarrow (1) \\ |
56 (a^*a^* + a^* |
61 (a^*a^* + a^* |
57 \color{gray} + a^* \color{black})\cdot(a^*a^*)^* + |
62 \color{gray} + a^* \color{black})\cdot(a^*a^*)^* + |
58 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\ |
63 \underbrace{(a^*a^* + a^*)\cdot(a^*a^*)^*}_\text{further simp removes this} \\ |
59 \bigg\downarrow \\ |
64 \bigg\downarrow (2) \\ |
60 (a^*a^* + a^* |
65 (a^*a^* + a^* |
61 )\cdot(a^*a^*)^* |
66 )\cdot(a^*a^*)^* |
62 \color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\ |
67 \color{gray} + (a^*a^* + a^*) \cdot(a^*a^*)^*\\ |
63 \bigg\downarrow \\ |
68 \bigg\downarrow (3) \\ |
64 (a^*a^* + a^* |
69 (a^*a^* + a^* |
65 )\cdot(a^*a^*)^* |
70 )\cdot(a^*a^*)^* |
66 \end{gather*} |
71 \end{gather*} |
67 \noindent |
72 \noindent |
68 In the first step, the nested alternative regular expression |
73 In the first step, the nested alternative regular expression |
69 $(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$. |
74 $(a^*a^* + a^*) + a^*$ is flattened into $a^*a^* + a^* + a^*$. |
70 Now the third term $a^*$ is clearly identified as a duplicate |
75 Now the third term $a^*$ can clearly be identified as a duplicate |
71 and therefore removed in the second step. This causes the two |
76 and therefore removed in the second step. |
|
77 This causes the two |
72 top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ |
78 top-level terms to become the same and the second $(a^*a^*+a^*)\cdot(a^*a^*)^*$ |
73 removed in the final step.\\ |
79 removed in the final step. |
74 This motivating example is from testing Sulzmann and Lu's |
80 Sulzmann and Lu's simplification function (using our notations) can achieve this |
75 algorithm: their simplification does |
81 simplification: |
76 not work! |
|
77 Consider their simplification (using our notations): |
|
78 \begin{center} |
82 \begin{center} |
79 \begin{tabular}{lcl} |
83 \begin{tabular}{lcl} |
80 $\simpsulz \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & |
84 $\textit{simp}\_{SL} \; _{bs}(_{bs'}\ONE \cdot r)$ & $\dn$ & |
81 $\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\ |
85 $\textit{if} \; (\textit{zeroable} \; r)\; \textit{then} \;\; \ZERO$\\ |
82 & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\ |
86 & &$\textit{else}\;\; \fuse \; (bs@ bs') \; r$\\ |
83 $\simpsulz \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} |
87 $\textit{simp}\_{SL} \;(_{bs}r_1\cdot r_2)$ & $\dn$ & $\textit{if} |
84 \; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\; |
88 \; (\textit{zeroable} \; r_1 \; \textit{or} \; \textit{zeroable}\; r_2)\; |
85 \textit{then} \;\; \ZERO$\\ |
89 \textit{then} \;\; \ZERO$\\ |
86 & & $\textit{else}\;\;_{bs}((\simpsulz \;r_1)\cdot |
90 & & $\textit{else}\;\;_{bs}((\textit{simp}\_{SL} \;r_1)\cdot |
87 (\simpsulz \; r_2))$\\ |
91 (\textit{simp}\_{SL} \; r_2))$\\ |
88 $\simpsulz \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\ |
92 $\textit{simp}\_{SL} \; _{bs}\sum []$ & $\dn$ & $\ZERO$\\ |
89 $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
93 $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
90 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
94 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
91 $\simpsulz \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\simpsulz \; r)$\\ |
95 $\textit{simp}\_{SL} \; _{bs}\sum[r]$ & $\dn$ & $\fuse \; bs \; (\textit{simp}\_{SL} \; r)$\\ |
92 $\simpsulz \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum |
96 $\textit{simp}\_{SL} \; _{bs}\sum(r::rs)$ & $\dn$ & $_{bs}\sum |
93 (\nub \; (\filter \; (\not \circ \zeroable)\;((\simpsulz \; r) :: \map \; \simpsulz \; rs)))$\\ |
97 (\nub \; (\filter \; (\not \circ \zeroable)\;((\textit{simp}\_{SL} \; r) :: \map \; \textit{simp}\_{SL} \; rs)))$\\ |
94 |
98 |
95 \end{tabular} |
99 \end{tabular} |
96 \end{center} |
100 \end{center} |
97 \noindent |
101 \noindent |
98 where the $\textit{zeroable}$ predicate |
102 The $\textit{zeroable}$ predicate |
99 tests whether the regular expression |
103 tests whether the regular expression |
100 is equivalent to $\ZERO$, |
104 is equivalent to $\ZERO$, and |
101 can be defined as: |
105 can be defined as: |
102 \begin{center} |
106 \begin{center} |
103 \begin{tabular}{lcl} |
107 \begin{tabular}{lcl} |
104 $\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\; |
108 $\zeroable \; _{bs}\sum (r::rs)$ & $\dn$ & $\zeroable \; r\;\; \land \;\; |
105 \zeroable \;_{[]}\sum\;rs $\\ |
109 \zeroable \;_{[]}\sum\;rs $\\ |
109 $\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\ |
113 $\zeroable\;_{bs}\ONE$ & $\dn$ & $\textit{false}$\\ |
110 $\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$ |
114 $\zeroable\;_{bs}\ZERO$ & $\dn$ & $\textit{true}$ |
111 \end{tabular} |
115 \end{tabular} |
112 \end{center} |
116 \end{center} |
113 \noindent |
117 \noindent |
114 They suggested that the $\simpsulz $ function should be |
118 The |
|
119 \begin{center} |
|
120 \begin{tabular}{lcl} |
|
121 $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
|
122 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
|
123 \end{tabular} |
|
124 \end{center} |
|
125 \noindent |
|
126 clause does flatten the alternative as required in step (1), |
|
127 but $\textit{simp}\_{SL}$ is insufficient if we want to do steps (2) and (3), |
|
128 as these ``identical'' terms have different bit-annotations. |
|
129 They also suggested that the $\textit{simp}\_{SL} $ function should be |
115 applied repeatedly until a fixpoint is reached. |
130 applied repeatedly until a fixpoint is reached. |
116 We call this construction $\textit{sulzSimp}$: |
131 We call this construction $\textit{SLSimp}$: |
117 \begin{center} |
132 \begin{center} |
118 \begin{tabular}{lcl} |
133 \begin{tabular}{lcl} |
119 $\textit{sulzSimp} \; r$ & $\dn$ & |
134 $\textit{SLSimp} \; r$ & $\dn$ & |
120 $\textit{while}((\simpsulz \; r)\; \cancel{=} \; r)$ \\ |
135 $\textit{while}((\textit{simp}\_{SL} \; r)\; \cancel{=} \; r)$ \\ |
121 & & $\quad r := \simpsulz \; r$\\ |
136 & & $\quad r := \textit{simp}\_{SL} \; r$\\ |
122 & & $\textit{return} \; r$ |
137 & & $\textit{return} \; r$ |
123 \end{tabular} |
138 \end{tabular} |
124 \end{center} |
139 \end{center} |
125 We call the operation of alternatingly |
140 We call the operation of alternatingly |
126 applying derivatives and simplifications |
141 applying derivatives and simplifications |
127 (until the string is exhausted) Sulz-simp-derivative, |
142 (until the string is exhausted) Sulz-simp-derivative, |
128 written $\backslash_{sulzSimp}$: |
143 written $\backslash_{SLSimp}$: |
129 \begin{center} |
144 \begin{center} |
130 \begin{tabular}{lcl} |
145 \begin{tabular}{lcl} |
131 $r \backslash_{sulzSimp} (c\!::\!s) $ & $\dn$ & $(\textit{sulzSimp} \; (r \backslash c)) \backslash_{sulzSimp}\, s$ \\ |
146 $r \backslash_{SLSimp} (c\!::\!s) $ & $\dn$ & $(\textit{SLSimp} \; (r \backslash c)) \backslash_{SLSimp}\, s$ \\ |
132 $r \backslash_{sulzSimp} [\,] $ & $\dn$ & $r$ |
147 $r \backslash_{SLSimp} [\,] $ & $\dn$ & $r$ |
133 \end{tabular} |
148 \end{tabular} |
134 \end{center} |
149 \end{center} |
135 \noindent |
150 \noindent |
136 After the derivatives have been taken, the bitcodes |
151 After the derivatives have been taken, the bitcodes |
137 are extracted and decoded in the same manner |
152 are extracted and decoded in the same manner |
138 as $\blexer$: |
153 as $\blexer$: |
139 \begin{center} |
154 \begin{center} |
140 \begin{tabular}{lcl} |
155 \begin{tabular}{lcl} |
141 $\textit{blexer\_sulzSimp}\;r\,s$ & $\dn$ & |
156 $\textit{blexer\_SLSimp}\;r\,s$ & $\dn$ & |
142 $\textit{let}\;a = (r^\uparrow)\backslash_{sulzSimp}\, s\;\textit{in}$\\ |
157 $\textit{let}\;a = (r^\uparrow)\backslash_{SLSimp}\, s\;\textit{in}$\\ |
143 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
158 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
144 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
159 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
145 & & $\;\;\textit{else}\;\textit{None}$ |
160 & & $\;\;\textit{else}\;\textit{None}$ |
146 \end{tabular} |
161 \end{tabular} |
147 \end{center} |
162 \end{center} |
148 \noindent |
163 \noindent |
149 We implemented this lexing algorithm in Scala, |
164 We implemented this lexing algorithm in Scala, |
150 and found that the final derivative regular expression |
165 and found that the final derivative regular expression |
151 size grows exponentially fast: |
166 size grows exponentially fast (note the logarithmic scale): |
152 \begin{figure}[H] |
167 \begin{figure}[H] |
153 \centering |
168 \centering |
154 \begin{tikzpicture} |
169 \begin{tikzpicture} |
155 \begin{axis}[ |
170 \begin{axis}[ |
156 xlabel={$n$}, |
171 xlabel={$n$}, |
197 simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input. |
212 simp, fuse, mkEpsBC and isPhi leads to subcalls whose number is bound by the size of the regular expression involved. We claim that thanks to aggressively applying simp this size remains finite. Hence, we can argue that the above mentioned functions/operations have constant time complexity which implies that we can incrementally compute bit-coded parse trees in linear time in the size of the input. |
198 \end{quote} |
213 \end{quote} |
199 \noindent |
214 \noindent |
200 The assumption that the size of the regular expressions |
215 The assumption that the size of the regular expressions |
201 in the algorithm |
216 in the algorithm |
202 would stay below a finite constant is not ture. |
217 would stay below a finite constant is not true, not in the |
203 The main reason behind this is that (i) The $\textit{nub}$ |
218 examples we considered. |
|
219 The main reason behind this is that (i) the $\textit{nub}$ |
204 function requires identical annotations between two |
220 function requires identical annotations between two |
205 annotated regular expressions to qualify as duplicates, |
221 annotated regular expressions to qualify as duplicates, |
206 and cannot simplify the cases like $_{SZZ}a^*+_{SZS}a^*$ |
222 and therefore cannot simplify cases like $_{SZZ}a^*+_{SZS}a^*$ |
207 even if both $a^*$ denote the same language. |
223 even if both $a^*$ denote the same language, and |
208 (ii) The ``flattening'' only applies to the head of the list |
224 (ii) the ``flattening'' only applies to the head of the list |
209 in the |
225 in the |
210 \begin{center} |
226 \begin{center} |
211 \begin{tabular}{lcl} |
227 \begin{tabular}{lcl} |
212 |
228 |
213 $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
229 $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2)$ & $\dn$ & |
214 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
230 $_{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$\\ |
215 \end{tabular} |
231 \end{tabular} |
216 \end{center} |
232 \end{center} |
217 \noindent |
233 \noindent |
218 clause, and therefore is not thorough enough to simplify all |
234 clause, and therefore is not strong enough to simplify all |
219 needed parts of the regular expression.\\ |
235 needed parts of the regular expression. Moreover, even if the regular expressions size |
220 In addition to that, even if the regular expressions size |
|
221 do stay finite, one has to take into account that |
236 do stay finite, one has to take into account that |
222 the $\simpsulz$ function is applied many times |
237 the $\textit{simp}\_{SL}$ function is applied many times |
223 in each derivative step, and that number is not necessarily |
238 in each derivative step, and that number is not necessarily |
224 a constant with respect to the size of the regular expression. |
239 a constant with respect to the size of the regular expression. |
225 To not get ``caught off guard'' by |
240 %To not get ``caught off guard'' by |
226 these counterexamples, |
241 %these counterexamples, |
227 one needs to be more careful when designing the |
242 %one needs to be more careful when designing the |
228 simplification function and making claims about them. |
243 %simplification function and making claims about them. |
229 |
244 |
230 \section{Our $\textit{Simp}$ Function} |
245 \section{Our $\textit{Simp}$ Function} |
231 We will now introduce our simplification function, |
246 We will now introduce our simplification function. |
232 by making a contrast with $\simpsulz$. |
247 %by making a contrast with $\textit{simp}\_{SL}$. |
233 We describe |
248 We also describe |
234 the ideas behind components in their algorithm |
249 the ideas behind Sulzmann and Lu's $\textit{simp}\_{SL}$ |
235 and why they fail to achieve the desired effect, followed |
250 algorithm |
236 by our solution. These solutions come with correctness |
251 and why it fails to achieve the desired effect. |
237 statements that are backed up by formal proofs. |
252 Our simplification function comes with a formal |
|
253 correctness proof. |
238 \subsection{Flattening Nested Alternatives} |
254 \subsection{Flattening Nested Alternatives} |
239 The idea behind the clause |
255 The idea behind the clause |
240 \begin{center} |
256 \begin{center} |
241 $\simpsulz \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad |
257 $\textit{simp}\_{SL} \; _{bs}\sum ((_{bs'}\sum rs_1) :: rs_2) \quad \dn \quad |
242 _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$ |
258 _{bs}\sum ((\map \; (\fuse \; bs')\; rs_1) @ rs_2)$ |
243 \end{center} |
259 \end{center} |
244 is that it allows |
260 is that it allows |
245 duplicate removal of regular expressions at different |
261 duplicate removal of regular expressions at different |
246 ``levels'' of alternatives. |
262 ``levels'' of alternatives. |
248 following simplification: |
264 following simplification: |
249 |
265 |
250 \begin{center} |
266 \begin{center} |
251 $(a+r)+r \longrightarrow a+r$ |
267 $(a+r)+r \longrightarrow a+r$ |
252 \end{center} |
268 \end{center} |
253 The problem here is that only the head element |
269 The problem is that only the head element |
254 is ``spilled out'', |
270 is ``spilled out''. |
255 whereas we would want to flatten |
271 It is more desirable |
256 an entire list to open up possibilities for further simplifications. |
272 to flatten |
|
273 an entire list to open up possibilities for further simplifications |
|
274 with later regular expressions. |
257 Not flattening the rest of the elements also means that |
275 Not flattening the rest of the elements also means that |
258 the later de-duplication processs |
276 the later de-duplication processs |
259 does not fully remove further duplicates. |
277 does not fully remove further duplicates. |
260 For example, |
278 For example, |
261 using $\simpsulz$ we could not |
279 using $\textit{simp}\_{SL}$ we cannot |
262 simplify |
280 simplify |
263 \begin{center} |
281 \begin{center} |
264 $((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+ |
282 $((a^* a^*)+\underline{(a^* + a^*)})\cdot (a^*a^*)^*+ |
265 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$ |
283 ((a^*a^*)+a^*)\cdot (a^*a^*)^*$ |
266 \end{center} |
284 \end{center} |
267 due to the underlined part not in the first element |
285 due to the underlined part not being in the head |
268 of the alternative.\\ |
286 of the alternative. |
269 We define a flatten operation that flattens not only |
287 |
270 the first regular expression of an alternative, |
288 We define our flatten operation so that it flattens |
271 but the entire list: |
289 the entire list: |
272 \begin{center} |
290 \begin{center} |
273 \begin{tabular}{@{}lcl@{}} |
291 \begin{tabular}{@{}lcl@{}} |
274 $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
292 $\textit{flts} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
275 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\ |
293 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flts} \; as' $ \\ |
276 $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\ |
294 $\textit{flts} \; \ZERO :: as'$ & $\dn$ & $ \textit{flts} \; \textit{as'} $ \\ |
280 \noindent |
298 \noindent |
281 Our $\flts$ operation |
299 Our $\flts$ operation |
282 also throws away $\ZERO$s |
300 also throws away $\ZERO$s |
283 as they do not contribute to a lexing result. |
301 as they do not contribute to a lexing result. |
284 \subsection{Duplicate Removal} |
302 \subsection{Duplicate Removal} |
285 After flattening is done, we are ready to deduplicate. |
303 After flattening is done, we can deduplicate. |
286 The de-duplicate function is called $\distinctBy$, |
304 The de-duplicate function is called $\distinctBy$, |
287 and that is where we make our second improvement over |
305 and that is where we make our second improvement over |
288 Sulzmann and Lu's. |
306 Sulzmann and Lu's simplification method. |
289 The process goes as follows: |
307 The process goes as follows: |
290 \begin{center} |
308 \begin{center} |
291 $rs \stackrel{\textit{flts}}{\longrightarrow} |
309 $rs \stackrel{\textit{flts}}{\longrightarrow} |
292 rs_{flat} |
310 rs_{flat} |
293 \xrightarrow{\distinctBy \; |
311 \xrightarrow{\distinctBy \; |
325 regular expressions we have already seen |
343 regular expressions we have already seen |
326 will definitely not contribute to a POSIX value, |
344 will definitely not contribute to a POSIX value, |
327 even if they are attached with different bitcodes. |
345 even if they are attached with different bitcodes. |
328 These duplicates therefore need to be removed. |
346 These duplicates therefore need to be removed. |
329 To achieve this, we call $\rerases$ as the function $f$ during the distinction |
347 To achieve this, we call $\rerases$ as the function $f$ during the distinction |
330 operation.\\ |
348 operation. The function |
331 $\rerases$ is very similar to $\erase$, except that it preserves the structure |
349 $\rerases$ is very similar to $\erase$, except that it preserves the structure |
332 when erasing an alternative regular expression. |
350 when erasing an alternative regular expression. |
333 The reason why we use $\rerases$ instead of $\erase$ is that |
351 The reason why we use $\rerases$ instead of $\erase$ is that |
334 it keeps the structures of alternative |
352 it keeps the structures of alternative |
335 annotated regular expressions |
353 annotated regular expressions |
336 whereas $\erase$ would turn it back into a binary structure. |
354 whereas $\erase$ would turn it back into a binary tree structure. |
337 Not having to mess with the structure |
355 Not having to mess with the structure |
338 greatly simplifies the finiteness proof in chapter |
356 greatly simplifies the finiteness proof in chapter |
339 \ref{Finite} (we will follow up with more details there). |
357 \ref{Finite}. |
340 We give the definitions of $\rerases$ here together with |
358 We give the definitions of $\rerases$ here together with |
341 the new datatype used by $\rerases$ (as our plain |
359 the new datatype used by $\rerases$ (as our plain |
342 regular expression datatype does not allow non-binary alternatives). |
360 regular expression datatype does not allow non-binary alternatives). |
343 For the moment the reader can just think of |
361 For now we can think of |
344 $\rerases$ as $\erase$ and $\rrexp$ as plain regular expressions. |
362 $\rerases$ as the function $(\_)_\downarrow$ defined in chapter \ref{Bitcoded1} |
|
363 and $\rrexp$ as plain regular expressions, but having a general list constructor |
|
364 for alternatives: |
345 \begin{figure}[H] |
365 \begin{figure}[H] |
346 \begin{center} |
366 \begin{center} |
347 $\rrexp ::= \RZERO \mid \RONE |
367 $\rrexp ::= \RZERO \mid \RONE |
348 \mid \RCHAR{c} |
368 \mid \RCHAR{c} |
349 \mid \RSEQ{r_1}{r_2} |
369 \mid \RSEQ{r_1}{r_2} |
402 The most involved part is the $\sum$ clause, where we first call $\flts$ on |
419 The most involved part is the $\sum$ clause, where we first call $\flts$ on |
403 the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$. |
420 the simplified children regular expression list $\textit{map}\; \textit{bsimp}\; \textit{as}$. |
404 and then call $\distinctBy$ on that list, the predicate determining whether two |
421 and then call $\distinctBy$ on that list, the predicate determining whether two |
405 elements are the same is $\rerases \; r_1 = \rerases\; r_2$. |
422 elements are the same is $\rerases \; r_1 = \rerases\; r_2$. |
406 Finally, depending on whether the regular expression list $as'$ has turned into a |
423 Finally, depending on whether the regular expression list $as'$ has turned into a |
407 singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{AALTS}$ |
424 singleton or empty list after $\flts$ and $\distinctBy$, $\textit{bsimp}_{ALTS}$ |
408 decides whether to keep the current level constructor $\sum$ as it is, and |
425 decides whether to keep the current level constructor $\sum$ as it is, and |
409 removes it when there are less than two elements: |
426 removes it when there are less than two elements: |
410 \begin{center} |
427 \begin{center} |
411 \begin{tabular}{lcl} |
428 \begin{tabular}{lcl} |
412 $\textit{bsimp}_{AALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ |
429 $\textit{bsimp}_{ALTS} \; bs \; as'$ & $ \dn$ & $ as' \; \textit{match}$\\ |
413 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
430 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
414 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
431 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
415 &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
432 &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
416 \end{tabular} |
433 \end{tabular} |
417 |
434 |
418 \end{center} |
435 \end{center} |
419 Having defined the $\bsimp$ function, |
436 Having defined the $\textit{bsimp}$ function, |
420 we add it as a phase after a derivative is taken, |
437 we add it as a phase after a derivative is taken. |
421 so it stays small: |
|
422 \begin{center} |
438 \begin{center} |
423 \begin{tabular}{lcl} |
439 \begin{tabular}{lcl} |
424 $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$ |
440 $r \backslash_{bsimp} s$ & $\dn$ & $\textit{bsimp}(r \backslash s)$ |
425 \end{tabular} |
441 \end{tabular} |
426 \end{center} |
442 \end{center} |
427 %Following previous notations |
443 %Following previous notations |
428 %when extending from derivatives w.r.t.~character to derivative |
444 %when extending from derivatives w.r.t.~character to derivative |
429 %w.r.t.~string, we define the derivative that nests simplifications |
445 %w.r.t.~string, we define the derivative that nests simplifications |
430 %with derivatives:%\comment{simp in the [] case?} |
446 %with derivatives:%\comment{simp in the [] case?} |
431 We extend this from character to string: |
447 We extend this from characters to strings: |
432 \begin{center} |
448 \begin{center} |
433 \begin{tabular}{lcl} |
449 \begin{tabular}{lcl} |
434 $r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\ |
450 $r \backslash_{bsimps} (c\!::\!s) $ & $\dn$ & $(r \backslash_{bsimp}\, c) \backslash_{bsimps}\, s$ \\ |
435 $r \backslash_{bsimps} [\,] $ & $\dn$ & $r$ |
451 $r \backslash_{bsimps} [\,] $ & $\dn$ & $r$ |
436 \end{tabular} |
452 \end{tabular} |
497 \caption{Our Improvement over Sulzmann and Lu's in terms of size} |
513 \caption{Our Improvement over Sulzmann and Lu's in terms of size} |
498 \end{figure} |
514 \end{figure} |
499 \noindent |
515 \noindent |
500 Given the size difference, it is not |
516 Given the size difference, it is not |
501 surprising that our $\blexersimp$ significantly outperforms |
517 surprising that our $\blexersimp$ significantly outperforms |
502 $\textit{blexer\_sulzSimp}$. |
518 $\textit{blexer\_SLSimp}$. |
503 In the next section we are going to establish the |
519 In the next section we are going to establish that our |
504 first important property of our lexer--the correctness. |
520 simplification preserves the correctness of the algorithm. |
505 %---------------------------------------------------------------------------------------- |
521 %---------------------------------------------------------------------------------------- |
506 % SECTION rewrite relation |
522 % SECTION rewrite relation |
507 %---------------------------------------------------------------------------------------- |
523 %---------------------------------------------------------------------------------------- |
508 \section{Correctness of $\blexersimp$} |
524 \section{Correctness of $\blexersimp$} |
509 In this section we give details |
|
510 of the correctness proof of $\blexersimp$, |
|
511 one of the contributions of this thesis.\\ |
|
512 We first introduce the rewriting relation \emph{rrewrite} |
525 We first introduce the rewriting relation \emph{rrewrite} |
513 ($\rrewrite$) between two regular expressions, |
526 ($\rrewrite$) between two regular expressions, |
514 which expresses an atomic |
527 which stands for an atomic |
515 simplification. |
528 simplification. |
516 We then prove properties about |
529 We then prove properties about |
517 this rewriting relation and its reflexive transitive closure. |
530 this rewriting relation and its reflexive transitive closure. |
518 Finally we leverage these properties to show |
531 Finally we leverage these properties to show |
519 an equivalence between the internal data structures of |
532 an equivalence between the results generated by |
520 $\blexer$ and $\blexersimp$. |
533 $\blexer$ and $\blexersimp$. |
521 |
534 |
522 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)} |
535 \subsection{The Rewriting Relation $\rrewrite$($\rightsquigarrow$)} |
523 In the $\blexer$'s correctness proof, we |
536 In the $\blexer$'s correctness proof, we |
524 did not directly derive the fact that $\blexer$ generates the POSIX value, |
537 did not directly derive the fact that $\blexer$ generates the POSIX value, |
525 but first proved that $\blexer$ is linked with $\lexer$. |
538 but first proved that $\blexer$ generates the same result as $\lexer$. |
526 Then we re-use |
539 Then we re-use |
527 the correctness of $\lexer$ |
540 the correctness of $\lexer$ |
528 to obtain |
541 to obtain |
529 \begin{center} |
542 \begin{center} |
530 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$. |
543 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$\\ |
531 \end{center} |
544 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; |
|
545 r\;s = \None$. |
|
546 \end{center} |
|
547 %\begin{center} |
|
548 % $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer \; r \;s = v$. |
|
549 %\end{center} |
532 Here we apply this |
550 Here we apply this |
533 modularised technique again |
551 modularised technique again |
534 by first proving that |
552 by first proving that |
535 $\blexersimp \; r \; s $ |
553 $\blexersimp \; r \; s $ |
536 produces the same output as $\blexer \; r\; s$, |
554 produces the same output as $\blexer \; r\; s$, |
537 and then piecing it together with |
555 and then piecing it together with |
538 $\blexer$'s correctness to achieve our main |
556 $\blexer$'s correctness to achieve our main |
539 theorem:\footnote{ The case when |
557 theorem: |
540 $s$ is not in $L \; r$, is routine to establish.} |
558 \begin{center} |
541 \begin{center} |
559 $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = \Some \;v$ |
542 $(r, s) \rightarrow v \; \; \textit{iff} \;\; \blexersimp \; r \; s = v$ |
560 \\ |
|
561 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; |
|
562 r\;s = \None$ |
543 \end{center} |
563 \end{center} |
544 \noindent |
564 \noindent |
545 The overall idea for the proof |
565 The overall idea for the proof |
546 of $\blexer \;r \;s = \blexersimp \; r \;s$ |
566 of $\blexer \;r \;s = \blexersimp \; r \;s$ |
547 is that the transition from $r$ to $\textit{bsimp}\; r$ can be |
567 is that the transition from $r$ to $\textit{bsimp}\; r$ can be |
548 broken down into finitely many rewrite steps: |
568 broken down into smaller rewrite steps of the form: |
549 \begin{center} |
569 \begin{center} |
550 $r \rightsquigarrow^* \textit{bsimp} \; r$ |
570 $r \rightsquigarrow^* \textit{bsimp} \; r$ |
551 \end{center} |
571 \end{center} |
552 where each rewrite step, written $\rightsquigarrow$, |
572 where each rewrite step, written $\rightsquigarrow$, |
553 is an ``atomic'' simplification that |
573 is an ``atomic'' simplification that |
554 is similar to a small-step reduction in operational semantics: |
574 is similar to a small-step reduction in operational semantics ( |
|
575 see figure \ref{rrewriteRules} for the rules): |
555 \begin{figure}[H] |
576 \begin{figure}[H] |
556 \begin{mathpar} |
577 \begin{mathpar} |
557 \inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\} |
578 \inferrule * [Right = $S\ZERO_l$]{\vspace{0em}}{_{bs} \ZERO \cdot r_2 \rightsquigarrow \ZERO\\} |
558 |
579 |
559 \inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\} |
580 \inferrule * [Right = $S\ZERO_r$]{\vspace{0em}}{_{bs} r_1 \cdot \ZERO \rightsquigarrow \ZERO\\} |
594 to be removed provided a regular expression |
615 to be removed provided a regular expression |
595 earlier in the list can match the same strings. |
616 earlier in the list can match the same strings. |
596 }\label{rrewriteRules} |
617 }\label{rrewriteRules} |
597 \end{figure} |
618 \end{figure} |
598 \noindent |
619 \noindent |
599 The rules such as $LT$ and $LH$ are for rewriting between two regular expression lists |
620 The rules $LT$ and $LH$ are for rewriting two regular expression lists |
600 such that one regular expression |
621 such that one regular expression |
601 in the left-hand-side list is rewritable in one step |
622 in the left-hand-side list is rewritable in one step |
602 to the right-hand-side's regular expression at the same position. |
623 to the right-hand-side's regular expression at the same position. |
603 This helps with defining the ``context rules'' such as $AL$.\\ |
624 This helps with defining the ``context rule'' $AL$.\\ |
604 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ |
625 The reflexive transitive closure of $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$ |
605 are defined in the usual way: |
626 are defined in the usual way: |
606 \begin{figure}[H] |
627 \begin{figure}[H] |
607 \centering |
628 \centering |
608 \begin{mathpar} |
629 \begin{mathpar} |
617 \caption{The Reflexive Transitive Closure of |
638 \caption{The Reflexive Transitive Closure of |
618 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure} |
639 $\rightsquigarrow$ and $\stackrel{s}{\rightsquigarrow}$}\label{transClosure} |
619 \end{figure} |
640 \end{figure} |
620 %Two rewritable terms will remain rewritable to each other |
641 %Two rewritable terms will remain rewritable to each other |
621 %even after a derivative is taken: |
642 %even after a derivative is taken: |
622 Rewriting is preserved under derivatives, |
643 The main point of our rewriting relation |
|
644 is that it is preserved under derivatives, |
623 namely |
645 namely |
624 \begin{center} |
646 \begin{center} |
625 $r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$ |
647 $r_1 \rightsquigarrow r_2 \implies (r_1 \backslash c) \rightsquigarrow^* (r_2 \backslash c)$ |
626 \end{center} |
648 \end{center} |
627 And finally, if two terms are rewritable to each other, |
649 And also, if two terms are rewritable to each other, |
628 then they produce the same bitcodes: |
650 then they produce the same bitcodes: |
629 \begin{center} |
651 \begin{center} |
630 $r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$ |
652 $r \rightsquigarrow^* r' \;\; \textit{then} \; \; \bmkeps \; r = \bmkeps \; r'$ |
631 \end{center} |
653 \end{center} |
632 The decoding phase of both $\blexer$ and $\blexersimp$ |
654 The decoding phase of both $\blexer$ and $\blexersimp$ |
633 are the same, which means that if they get the same |
655 are the same, which means that if they receive the same |
634 bitcodes before the decoding phase, |
656 bitcodes before the decoding phase, |
635 they get the same value after decoding is done. |
657 they generate the same value after decoding is done. |
636 We will prove the three properties |
658 We will prove the three properties |
637 we mentioned above in the next sub-section. |
659 we mentioned above in the next sub-section. |
638 \subsection{Important Properties of $\rightsquigarrow$} |
660 \subsection{Important Properties of $\rightsquigarrow$} |
639 First we prove some basic facts |
661 First we prove some basic facts |
640 about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$, |
662 about $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$, |
734 $\stackrel{s*}{\rightsquigarrow}$ and using part one to three. |
756 $\stackrel{s*}{\rightsquigarrow}$ and using part one to three. |
735 The fifth part is a corollary of part four. |
757 The fifth part is a corollary of part four. |
736 The last part is proven by rule induction again on $\rightsquigarrow^*$. |
758 The last part is proven by rule induction again on $\rightsquigarrow^*$. |
737 \end{proof} |
759 \end{proof} |
738 \noindent |
760 \noindent |
739 Now we are ready to give the proofs of the below properties: |
761 Now we are ready to give the proofs of the following properties: |
740 \begin{itemize} |
762 \begin{itemize} |
741 \item |
763 \item |
742 $(r \rightsquigarrow^* r'\land \bnullable \; r_1) |
764 $r \rightsquigarrow^* r'\land \bnullable \; r_1 |
743 \implies \bmkeps \; r = \bmkeps \; r'$. \\ |
765 \implies \bmkeps \; r = \bmkeps \; r'$. \\ |
744 \item |
766 \item |
745 $r \rightsquigarrow^* \textit{bsimp} \;r$.\\ |
767 $r \rightsquigarrow^* \textit{bsimp} \;r$.\\ |
746 \item |
768 \item |
747 $r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\ |
769 $r \rightsquigarrow r' \implies r \backslash c \rightsquigarrow^* r'\backslash c$.\\ |
748 \end{itemize} |
770 \end{itemize} |
749 These properties would work together towards the correctness theorem. |
771 |
750 \subsubsection{Property 1: $(r \rightsquigarrow^* r'\land \bnullable \; r_1) |
772 \subsubsection{Property 1: $r \rightsquigarrow^* r'\land \bnullable \; r_1 |
751 \implies \bmkeps \; r = \bmkeps \; r'$} |
773 \implies \bmkeps \; r = \bmkeps \; r'$} |
752 Intuitively, this property says we can |
774 Intuitively, this property says we can |
753 extract the same bitcodes using $\bmkeps$ from the nullable |
775 extract the same bitcodes using $\bmkeps$ from the nullable |
754 components of two regular expressions $r$ and $r'$, |
776 components of two regular expressions $r$ and $r'$, |
755 if we can rewrite from one to the other in finitely |
777 if we can rewrite from one to the other in finitely |
810 \end{lemma} |
832 \end{lemma} |
811 \begin{proof} |
833 \begin{proof} |
812 By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$. |
834 By rule induction over the cases that lead to $r_1 \rightsquigarrow r_2$. |
813 \end{proof} |
835 \end{proof} |
814 \noindent |
836 \noindent |
815 With lemma \ref{rewriteBmkepsAux} we are ready to prove its |
837 With lemma \ref{rewriteBmkepsAux} in place we are ready to prove its |
816 many-step version: |
838 many-step version: |
817 \begin{lemma} |
839 \begin{lemma} |
818 $\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$ |
840 $\text{If} \;\; r \stackrel{*}{\rightsquigarrow} r' \;\; \text{and} \;\; \bnullable \; r, \;\;\; \text{then} \;\; \bmkeps \; r = \bmkeps \; r'$ |
819 \end{lemma} |
841 \end{lemma} |
820 \begin{proof} |
842 \begin{proof} |
821 By rule induction of $\stackrel{*}{\rightsquigarrow} $. |
843 By rule induction of $\stackrel{*}{\rightsquigarrow} $. Lemma |
822 $\ref{rewritesBnullable}$ tells us both $r$ and $r'$ are nullable. |
844 $\ref{rewritesBnullable}$ gives us both $r$ and $r'$ are nullable. |
823 \ref{rewriteBmkepsAux} solves the inductive case. |
845 The lemma \ref{rewriteBmkepsAux} solves the inductive case. |
824 \end{proof} |
846 \end{proof} |
825 |
847 |
826 \subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$} |
848 \subsubsection{Property 2: $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$} |
827 Now we get to the ``meaty'' part of the proof, |
849 Now we get to the key part of the proof, |
828 which says that our simplification's helper functions |
850 which says that our simplification's helper functions |
829 such as $\distinctBy$ and $\flts$ conform to |
851 such as $\distinctBy$ and $\flts$ describe |
830 the $\stackrel{s*}{\rightsquigarrow}$ and |
852 reducts of $\stackrel{s*}{\rightsquigarrow}$ and |
831 $\rightsquigarrow^* $ rewriting relations.\\ |
853 $\rightsquigarrow^* $.\\ |
832 The first lemma to prove is a more general version of |
854 The first lemma to prove is a more general version of |
833 $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$: |
855 $rs_ 1 \rightsquigarrow^* \distinctBy \; rs_1 \; \phi$: |
834 \begin{lemma} |
856 \begin{lemma} |
835 $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} |
857 $rs_1 @ rs_2 \stackrel{s*}{\rightsquigarrow} |
836 (rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$ |
858 (rs_1 @ (\distinctBy \; rs_2 \; \; \rerases \;\; (\map\;\; \rerases \; \; rs_1)))$ |
863 \begin{lemma}\label{bsimpaltsPreserves} |
885 \begin{lemma}\label{bsimpaltsPreserves} |
864 $_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$ |
886 $_{bs} \sum rs \stackrel{*}{\rightsquigarrow} \bsimpalts \; _{bs} \; rs$ |
865 \end{lemma} |
887 \end{lemma} |
866 \noindent |
888 \noindent |
867 The simplification function |
889 The simplification function |
868 $\textit{bsimp}$ only transforms the regex $r$ using steps specified by |
890 $\textit{bsimp}$ only transforms the regular expression using steps specified by |
869 $\rightsquigarrow^*$ and nothing else. |
891 $\rightsquigarrow^*$ and nothing else: |
870 \begin{lemma} |
892 \begin{lemma} |
871 $r \stackrel{*}{\rightsquigarrow} \bsimp{r}$ |
893 $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ |
872 \end{lemma} |
894 \end{lemma} |
873 \begin{proof} |
895 \begin{proof} |
874 By an induction on $r$. |
896 By an induction on $r$. |
875 The most involved case would be the alternative, |
897 The most involved case is the alternative, |
876 where we use lemmas \ref{bsimpaltsPreserves}, |
898 where we use lemmas \ref{bsimpaltsPreserves}, |
877 \ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\ |
899 \ref{fltsPreserves} and \ref{dBPreserves} to do a series of rewriting:\\ |
878 \begin{center} |
900 \begin{center} |
879 \begin{tabular}{lcl} |
901 \begin{tabular}{lcl} |
880 $rs$ & $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\ |
902 $rs$ & $\stackrel{s*}{\rightsquigarrow}$ & $ \map \; \textit{bsimp} \; rs$\\ |
881 & $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\ |
903 & $\stackrel{s*}{\rightsquigarrow}$ & $ \flts \; (\map \; \textit{bsimp} \; rs)$\\ |
882 & $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \; |
904 & $\stackrel{s*}{\rightsquigarrow}$ & $ \distinctBy \; |
883 (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\ |
905 (\flts \; (\map \; \textit{bsimp}\; rs)) \; \rerases \; \phi$\\ |
884 \end{tabular} |
906 \end{tabular} |
885 \end{center} |
907 \end{center} |
886 Using this we derive the following rewrite relation:\\ |
908 Using this we can derive the following rewrite sequence:\\ |
887 \begin{center} |
909 \begin{center} |
888 \begin{tabular}{lcl} |
910 \begin{tabular}{lcl} |
889 $r$ & $=$ & $_{bs}\sum rs$\\[1.5ex] |
911 $r$ & $=$ & $_{bs}\sum rs$\\[1.5ex] |
890 & $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex] |
912 & $\rightsquigarrow^*$ & $\bsimpalts \; bs \; rs$ \\[1.5ex] |
891 & $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex] |
913 & $\rightsquigarrow^*$ & $\ldots$ \\ [1.5ex] |
925 \begin{corollary}\label{rewritesBder} |
946 \begin{corollary}\label{rewritesBder} |
926 $r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^* |
947 $r_1 \rightsquigarrow^* r_2 \implies r_1 \backslash c \rightsquigarrow^* |
927 r_2 \backslash c$ |
948 r_2 \backslash c$ |
928 \end{corollary} |
949 \end{corollary} |
929 \begin{proof} |
950 \begin{proof} |
930 By rule induction of $\stackrel{*}{\rightsquigarrow} $ and using the previous lemma \ref{rewriteBder}. |
951 By rule induction of $\stackrel{*}{\rightsquigarrow} $ and lemma \ref{rewriteBder}. |
931 \end{proof} |
952 \end{proof} |
932 \noindent |
953 \noindent |
933 This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$ |
954 This can be extended and combined with $r \rightsquigarrow^* \textit{bsimp} \; r$ |
934 to obtain the rewritability between |
955 to obtain the correspondence between |
935 $\blexer$ and $\blexersimp$'s intermediate |
956 $\blexer$ and $\blexersimp$'s intermediate |
936 derivative regular expressions |
957 derivative regular expressions |
937 \begin{lemma}\label{bderBderssimp} |
958 \begin{lemma}\label{bderBderssimp} |
938 $a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $ |
959 $a \backslash s \rightsquigarrow^* \bderssimp{a}{s} $ |
939 \end{lemma} |
960 \end{lemma} |
940 \begin{proof} |
961 \begin{proof} |
941 By an induction on $s$. |
962 By an induction on $s$. |
942 \end{proof} |
963 \end{proof} |
943 \subsection{Main Theorem} |
964 \subsection{Main Theorem} |
944 Now with \ref{bderBderssimp} we are ready for the main theorem. |
965 Now with \ref{bderBderssimp} in place we are ready for the main theorem. |
945 \begin{theorem} |
966 \begin{theorem} |
946 $\blexer \; r \; s = \blexersimp{r}{s}$ |
967 $\blexer \; r \; s = \blexersimp{r}{s}$ |
947 \end{theorem} |
968 \end{theorem} |
948 \noindent |
969 \noindent |
949 \begin{proof} |
970 \begin{proof} |
950 One can rewrite in many steps from the original lexer's |
971 We can rewrite in many steps from the original lexer's |
951 derivative regular expressions to the |
972 derivative regular expressions to the |
952 lexer with simplification applied (by lemma \ref{bderBderssimp}): |
973 lexer with simplification applied (by lemma \ref{bderBderssimp}): |
953 \begin{center} |
974 \begin{center} |
954 $a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $. |
975 $a \backslash s \stackrel{*}{\rightsquigarrow} \bderssimp{a}{s} $. |
955 \end{center} |
976 \end{center} |
956 we know that they give out the same bits, if the lexing result is a match: |
977 We know that they generate the same bits, if the lexing result is a match: |
957 \begin{center} |
978 \begin{center} |
958 $\bnullable \; (a \backslash s) |
979 $\bnullable \; (a \backslash s) |
959 \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ |
980 \implies \bmkeps \; (a \backslash s) = \bmkeps \; (\bderssimp{a}{s})$ |
960 \end{center} |
981 \end{center} |
961 Now that they give out the same bits, we know that they give the same value after decoding. |
982 Now that they generate the same bits, we know that they give the same value after decoding. |
962 \begin{center} |
983 \begin{center} |
963 $\bnullable \; (a \backslash s) |
984 $\bnullable \; (a \backslash s) |
964 \implies \decode \; r \; (\bmkeps \; (a \backslash s)) = |
985 \implies \decode \; r \; (\bmkeps \; (a \backslash s)) = |
965 \decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$ |
986 \decode \; r \; (\bmkeps \; (\bderssimp{a}{s}))$ |
966 \end{center} |
987 \end{center} |
967 Which is equivalent to our proof goal: |
988 Which is required by our proof goal: |
968 \begin{center} |
989 \begin{center} |
969 $\blexer \; r \; s = \blexersimp \; r \; s$. |
990 $\blexer \; r \; s = \blexersimp \; r \; s$. |
970 \end{center} |
991 \end{center} |
971 \end{proof} |
992 \end{proof} |
972 \noindent |
993 \noindent |
973 As a corollary, |
994 As a corollary, |
974 we link this result with the lemma we proved earlier that |
995 we can link this result with the lemma we proved earlier that |
975 \begin{center} |
996 \begin{center} |
976 $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = v$ |
997 $(r, s) \rightarrow v \;\; \textit{iff}\;\; \blexer \; r \; s = \Some \;v$\\ |
|
998 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexer\; |
|
999 r\;s = \None$. |
977 \end{center} |
1000 \end{center} |
978 and obtain the corollary that the bit-coded lexer with simplification is |
1001 and obtain the corollary that the bit-coded lexer with simplification is |
979 indeed correctly outputting POSIX lexing result, if such a result exists. |
1002 indeed correctly outputting POSIX lexing result, if such a result exists. |
980 \begin{corollary} |
1003 \begin{corollary} |
981 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s $ |
1004 $(r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp \; r\; s = \Some \; v$\\ |
|
1005 $\nexists v. \; (r, s) \rightarrow v \;\; \textit{iff} \;\; \blexersimp\; |
|
1006 r\;s = \None$. |
982 \end{corollary} |
1007 \end{corollary} |
983 |
1008 |
984 \subsection{Comments on the Proof Techniques Used} |
1009 \subsection{Comments on the Proof Techniques Used} |
985 Straightforward and simple as the proof may seem, |
1010 Straightforward and simple as the proof may seem, |
986 the efforts we spent obtaining it were far from trivial.\\ |
1011 the efforts we spent obtaining it were far from trivial. |
987 We initially attempted to re-use the argument |
1012 We initially attempted to re-use the argument |
988 in \cref{flex_retrieve}. |
1013 in \cref{flex_retrieve}. |
989 The problem was that both functions $\inj$ and $\retrieve$ require |
1014 The problem is that both functions $\inj$ and $\retrieve$ require |
990 that the annotated regular expressions stay unsimplified, |
1015 that the annotated regular expressions stay unsimplified, |
991 so that one can |
1016 so that one can |
992 correctly compare $v_{i+1}$ and $r_i$ and $v_i$ |
1017 correctly compare $v_{i+1}$ and $r_i$ and $v_i$ |
993 in diagram \ref{graph:inj} and |
1018 in diagram \ref{graph:inj} and |
994 ``fit the key into the lock hole''. |
1019 ``fit the key into the lock hole''. |
1039 final, |
1064 final, |
1040 one could come up new rules |
1065 one could come up new rules |
1041 such as |
1066 such as |
1042 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow |
1067 $\SEQ r_1 \cdot (\SEQ r_1 \cdot r_3) \rightarrow |
1043 \SEQs [r_1, r_2, r_3]$. |
1068 \SEQs [r_1, r_2, r_3]$. |
1044 This does not fit with the proof technique |
1069 However this does not fit with the proof technique |
1045 of our main theorem, but seem to not violate the POSIX |
1070 of our main theorem, but seem to not violate the POSIX |
1046 property.\\ |
1071 property. |
1047 Having correctness property is good. |
1072 |
1048 But we would also a guarantee that the lexer is not slow in |
1073 Having established the correctness of our |
1049 some sense, for exampe, not grinding to a halt regardless of the input. |
1074 $\blexersimp$, in the next chapter we shall prove that with our $\simp$ function, |
1050 As we have already seen, Sulzmann and Lu's simplification function |
1075 for a given $r$, the derivative size is always |
1051 $\simpsulz$ cannot achieve this, because their claim that |
|
1052 the regular expression size does not grow arbitrary large |
|
1053 was not true. |
|
1054 In the next chapter we shall prove that with our $\simp$, |
|
1055 for a given $r$, the internal derivative size is always |
|
1056 finitely bounded by a constant. |
1076 finitely bounded by a constant. |