98 negation and back-references. |
98 negation and back-references. |
99 \end{abstract} |
99 \end{abstract} |
100 |
100 |
101 \section{Introduction} |
101 \section{Introduction} |
102 |
102 |
103 While we believe derivatives of regular expressions is a beautiful |
103 While we believe derivatives of regular expressions, written |
104 concept (in terms of ease of implementing them in functional |
104 $r\backslash s$, are a beautiful concept (in terms of ease of |
105 programming languages and in terms of reasoning about them formally), |
105 implementing them in functional programming languages and in terms of |
106 they have one major drawback: every derivative step can make regular |
106 reasoning about them formally), they have one major drawback: every |
107 expressions grow drastically in size. This in turn has negative effect |
107 derivative step can make regular expressions grow drastically in |
108 on the runtime of the corresponding lexing algorithms. Consider for |
108 size. This in turn has negative effect on the runtime of the |
109 example the regular expression $(a+aa)^*$ and the short string |
109 corresponding lexing algorithms. Consider for example the regular |
110 $aaaaaaaaaaaa$. The corresponding derivative is already 8668 nodes |
110 expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The |
111 assuming the derivative is seen as a tree. The reason for the poor |
111 corresponding derivative contains already 8668 nodes where we assume |
112 runtime of the lexing algorithms is that they need to traverse such |
112 the derivative is given as a tree. The reason for the poor runtime of |
113 trees over and over again. The solution is to find a complete set of |
113 the derivative-based lexing algorithms is that they need to traverse |
114 simplification rules that keep the sizes of derivatives uniformly |
114 such trees over and over again. The solution is to find a complete set |
|
115 of simplification rules that keep the sizes of derivatives uniformly |
115 small. |
116 small. |
116 |
117 |
117 For reasons beyond this report, it turns out that a complete set of |
118 For reasons beyond this report, it turns out that a complete set of |
118 simplification rules depends on values being encoded as |
119 simplification rules depends on values being encoded as |
119 bitsequences.\footnote{Values are the results the lexing algorithms |
120 bitsequences.\footnote{Values are the results the lexing algorithms |
120 generate; they encode how a regular expression matched a string.} We |
121 generate; they encode how a regular expression matched a string.} We |
121 already know that the lexing algorithm using bitsequences but |
122 already know that the lexing algorithm using bitsequences but |
122 \emph{without} simplification is correct, albeilt horribly |
123 \emph{without} simplification is correct, albeilt horribly |
123 slow. Therefore in the past 6 months we were trying to prove that the |
124 slow. Therefore in the past 6 months I was trying to prove that the |
124 algorithm using bitsequences plus our simplification rules is |
125 algorithm using bitsequences plus our simplification rules is |
125 correct. Formally this amounts to show that |
126 also correct. Formally this amounts to show that |
126 |
127 |
127 \begin{equation}\label{mainthm} |
128 \begin{equation}\label{mainthm} |
128 \blexers \; r \; s = \blexer \;r\;s |
129 \blexers \; r \; s = \blexer \;r\;s |
129 \end{equation} |
130 \end{equation} |
130 |
131 |
131 \noindent |
132 \noindent |
132 whereby $\blexers$ simplifies (makes derivatives smaller) in each |
133 whereby $\blexers$ simplifies (makes derivatives smaller) in each |
133 step, whereas with $\blexer$ the size can grow exponentially. This |
134 step, whereas with $\blexer$ the size can grow exponentially. This |
134 would be an important milestone for the thesis, because we already |
135 would be an important milestone for my thesis, because we already |
135 have a very good idea how to establish that our set our simplification |
136 have a very good idea how to establish that our set our simplification |
136 rules keeps the size of derivativs below a relatively tight bound. |
137 rules keeps the size of derivativs below a relatively tight bound. |
137 |
138 |
138 In order to prove the main theorem in \eqref{mainthm}, we need to prove the |
139 In order to prove the main theorem in \eqref{mainthm}, we need to prove that the |
139 two functions produce the same output. The definition of these functions |
140 two functions produce the same output. The definition of these two functions |
140 is shown below. |
141 is shown below. |
141 |
142 |
142 \begin{center} |
143 \begin{center} |
143 \begin{tabular}{lcl} |
144 \begin{tabular}{lcl} |
144 $\textit{blexer}\;r\,s$ & $\dn$ & |
145 $\textit{blexer}\;r\,s$ & $\dn$ & |
159 \end{tabular} |
160 \end{tabular} |
160 \end{center} |
161 \end{center} |
161 \noindent |
162 \noindent |
162 In these definitions $(r^\uparrow)$ is a kind of coding function that |
163 In these definitions $(r^\uparrow)$ is a kind of coding function that |
163 is the same in each case, similarly the decode and the \textit{bmkeps} |
164 is the same in each case, similarly the decode and the \textit{bmkeps} |
164 are functions that are the same in each case. Our main theorem |
165 are functions that are the same in each case. Our main |
165 \eqref{mainthm} therefore boils down to proving the following two |
166 theorem~\eqref{mainthm} therefore boils down to proving the following |
166 propositions (depending on which branch the if-else clause takes). It |
167 two propositions (depending on which branch the if-else clause |
167 establishes how the derivatives \emph{with} simplification do not |
168 takes). They establish how the derivatives \emph{with} simplification |
168 change the computed result: |
169 do not change the computed result: |
169 |
170 |
170 \begin{itemize} |
171 \begin{itemize} |
171 \item{(a)} If a string $s$ is in the language of $L(r)$, then \\ |
172 \item{(a)} If a string $s$ is in the language of $L(r)$, then \\ |
172 $\textit{bmkeps} (r^\uparrow)\backslash_{simp}\,s = \textit{bmkeps} (r^\uparrow)\backslash s$,\\ |
173 $\textit{bmkeps} (r^\uparrow)\backslash_{simp}\,s = \textit{bmkeps} (r^\uparrow)\backslash s$,\\ |
173 \item{(b)} If a string $s$ is in the language $L(r)$, then |
174 \item{(b)} If a string $s$ is in the language $L(r)$, then |
174 $\rup \backslash_{simp} \,s$ is not nullable. |
175 $\rup \backslash_{simp} \,s$ is not nullable. |
175 \end{itemize} |
176 \end{itemize} |
176 |
177 |
177 \noindent |
178 \noindent |
178 We have already proved in Isabelle the second part. This is actually |
179 We have already proved the second part in Isabelle. This is actually |
179 not too difficult because we can show that simplification does not |
180 not too difficult because we can show that simplification does not |
180 change the language of simplified regular expressions. |
181 change the language of simplified regular expressions. |
181 |
182 |
182 If we can prove the first part, that is the bitsequence algorithm with |
183 If we can prove the first part, that is the bitsequence algorithm with |
183 simplification produces the same result as the one without |
184 simplification produces the same result as the one without |
222 %\begin{center} |
225 %\begin{center} |
223 %$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$ |
226 %$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$ |
224 %\end{center} |
227 %\end{center} |
225 |
228 |
226 \noindent |
229 \noindent |
227 This proof used two ``tricks''. One is that it defined a \flex-function |
230 The proof uses two ``tricks''. One is that it uses a \flex-function |
228 |
231 |
229 \begin{center} |
232 \begin{center} |
230 \begin{tabular}{lcl} |
233 \begin{tabular}{lcl} |
231 $\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\ |
234 $\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\ |
232 $\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$ |
235 $\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$ |
233 \end{tabular} |
236 \end{tabular} |
234 \end{center} |
237 \end{center} |
235 |
238 |
236 \noindent |
239 \noindent |
237 and then proved for the right-hand side in \eqref{lexer} |
240 and then proves for the right-hand side in \eqref{lexer} |
238 |
241 |
239 \begin{center} |
242 \begin{center} |
240 $\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; r\backslash s)$ |
243 $\lexer \;r\; s = \flex \;\textit{id} \; r\;s \;(\mkeps \; (r\backslash s))$ |
241 \end{center}. |
244 \end{center}. |
242 |
245 |
243 |
246 |
244 \noindent\rule[1.5ex]{\linewidth}{1pt} |
247 |
245 |
248 |
246 \noindent |
249 \noindent |
247 The $\flex$-function essentially does lexing by |
250 The $\flex$-function essentially does lexing by |
248 stacking up injection functions while doing derivatives. |
251 stacking up injection functions while doing derivatives. |
249 |
252 |
250 explicitly showing the order of characters being |
253 explicitly showing the order of characters being |
251 injected back in each step. |
254 injected back in each step. |
252 With $\flex$ we can write $\lexer$ this way: |
255 With $\flex$ we can write $\lexer$ this way: |
|
256 |
253 \begin{center} |
257 \begin{center} |
254 $\lexer \;r\; s = \flex \;id \; r\;s \;(\mkeps r\backslash s)$ |
258 $\lexer \;r\; s = \flex \;id \; r\;s \;(\mkeps r\backslash s)$ |
255 \end{center} |
259 \end{center} |
256 \noindent |
260 |
257 $\flex$ focuses on |
261 %\noindent |
258 the injections instead |
262 %$\flex$ focuses on the injections instead of the derivatives , |
259 of the derivatives , |
263 %compared to the original definition of $\lexer$, which puts equal |
260 compared |
264 %amount of emphasis on injection and derivative with respect to each |
261 to the original definition of $\lexer$, |
265 %character: |
262 which puts equal amount of emphasis on |
266 |
263 injection and derivative with respect to each character: |
267 %\begin{center} |
264 \begin{center} |
268 %\begin{tabular}{lcl} |
265 \begin{tabular}{lcl} |
269 %$\textit{lexer} \; r\; (c\!::\!s) $ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \;s \; %\textit{of}$ \\ |
266 $\textit{lexer} \; r\; (c\!::\!s) $ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \;s \; \textit{of}$ \\ |
270 % & & $\textit{None} \; \Longrightarrow \; \textit{None}$\\ |
267 & & $\textit{None} \; \Longrightarrow \; \textit{None}$\\ |
271 % & & $\textbar \; v \; \Longrightarrow \; \inj \; r\;c\;v$\\ |
268 & & $\textbar \; v \; \Longrightarrow \; \inj \; r\;c\;v$\\ |
272 %$\textit{lexer} \; r\; [\,] $ & $\dn$ & $\textit{if} \; \nullable (r) \; \textit{then} \; \mkeps% (r) \; \textit{else} \;None$ |
269 $\textit{lexer} \; r\; [\,] $ & $\dn$ & $\textit{if} \; \nullable (r) \; \textit{then} \; \mkeps (r) \; \textit{else} \;None$ |
273 %\end{tabular} |
270 \end{tabular} |
274 %\end{center} |
271 \end{center} |
275 |
272 \noindent |
276 \noindent |
273 Using this feature of $\flex$ we can rewrite the lexing |
277 The crux in the existing proof is how $\flex$ relates to injection, namely |
274 $w.r.t \; s @ [c]$ in term of lexing |
278 |
275 $w.r.t \; s$: |
|
276 \begin{center} |
279 \begin{center} |
277 $\flex \; r \; id \; (s@[c]) \; v = \flex \; r \; id \; s \; (inj \; (r\backslash s) \; c\; v)$. |
280 $\flex \; r \; id \; (s@[c]) \; v = \flex \; r \; id \; s \; (inj \; (r\backslash s) \; c\; v)$. |
278 \end{center} |
281 \end{center} |
279 \noindent |
282 |
280 this allows us to use |
283 \noindent |
281 the inductive hypothesis to get |
284 This property allows one to rewrite an induction hypothesis like |
|
285 |
282 \begin{center} |
286 \begin{center} |
283 $ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$ |
287 $ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$ |
284 \end{center} |
288 \end{center} |
|
289 |
|
290 \noindent\rule[1.5ex]{\linewidth}{4pt} |
|
291 There is no mention of retrieve yet .... this is the second trick in the |
|
292 existing proof. I am not sure whether you need to explain annotated regular |
|
293 expressions much earlier - maybe before the ``existing proof'' section, or |
|
294 evan earlier. |
|
295 |
|
296 \noindent\rule[1.5ex]{\linewidth}{4pt} |
|
297 |
285 \noindent |
298 \noindent |
286 By using a property of retrieve we have the $\textit{RHS}$ of the above equality is |
299 By using a property of retrieve we have the $\textit{RHS}$ of the above equality is |
287 $decode (retrieve (r^\uparrow \backslash(s @ [c])) v) r$, and this gives the |
300 $decode (retrieve (r^\uparrow \backslash(s @ [c])) v) r$, and this gives the |
288 main lemma result: |
301 main lemma result: |
|
302 |
289 \begin{center} |
303 \begin{center} |
290 $ \flex \;r\; id \; (s@[c]) \; v =\textit{decode}(\textit{retrieve} (\rup \backslash (s@[c])) \;v) r$ |
304 $ \flex \;r\; id \; (s@[c]) \; v =\textit{decode}(\textit{retrieve} (\rup \backslash (s@[c])) \;v) r$ |
291 \end{center} |
305 \end{center} |
|
306 |
|
307 |
|
308 |
|
309 |
292 \noindent |
310 \noindent |
293 To use this lemma result for our |
311 To use this lemma result for our |
294 correctness proof, simply replace the $v$ in the |
312 correctness proof, simply replace the $v$ in the |
295 $\textit{RHS}$ of the above equality with |
313 $\textit{RHS}$ of the above equality with |
296 $\mkeps\;(r\backslash (s@[c]))$, and apply the lemma that |
314 $\mkeps\;(r\backslash (s@[c]))$, and apply the lemma that |