applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. Being fast depends on a complete set of
simplification rules, some of which have been put forward by Sulzmann
and Lu. We have extended their rules in order to obtain a tight bound
on the size of regular expressions. We have tested these extended
rules, but have not yet formally established their correctness. We
have also not yet looked at extended regular expressions, such as
bounded repetitions, negation and back-references.
\end{abstract}

\section{Introduction}

While we believe derivatives of regular expressions are a beautiful
concept (in terms of the ease of implementing them in functional
programming languages and of reasoning about them formally), they
have one major drawback: every derivative step can make regular
expressions grow drastically in size. This in turn has negative
effects on the runtime of the corresponding lexing algorithms.
Consider for example the regular expression $(a+aa)^*$ and the short
string $aaaaaaaaaaaa$. The corresponding derivative already contains
8668 nodes when seen as a tree. The reason for the poor runtime of
the lexing algorithms is that they need to traverse such trees over
and over again. The solution is to find a complete set of
simplification rules that keep the sizes of derivatives uniformly
small.
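
To illustrate the blow-up, the following is a minimal Scala sketch
(our own illustration, not the code of the report) of Brzozowski
derivatives together with a size function; folding the derivative
over the string $aaaaaaaaaaaa$ already yields a tree with thousands
of nodes.

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp                  // matches no string
case object ONE extends Rexp                   // matches the empty string
case class CHR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// Brzozowski derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// number of nodes of a regular expression seen as a tree
def size(r: Rexp): Int = r match {
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1) => 1 + size(r1)
  case _ => 1
}

// derivative of (a + aa)* w.r.t. "aaaaaaaaaaaa": without
// simplification the size grows into the thousands
val r = STAR(ALT(CHR('a'), SEQ(CHR('a'), CHR('a'))))
println(size("aaaaaaaaaaaa".toList.foldLeft(r: Rexp)((r, c) => der(c, r))))
\end{verbatim}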

For reasons beyond this report, it turns out that a complete set of
simplification rules depends on values being encoded as
bitsequences.\footnote{Values are the results the lexing algorithms
generate; they encode how a regular expression matched a string.} We
already know that the lexing algorithm using bitsequences but
\emph{without} simplification is correct, albeit horribly
slow. Therefore in the past 6 months we were trying to prove that the
algorithm using bitsequences plus our simplification rules is
correct. Formally this amounts to showing that

\begin{equation}\label{mainthm}
\blexers \; r \; s = \blexer \;r\;s
\end{equation}

\noindent
whereby $\blexers$ simplifies (makes derivatives smaller) in each
step, whereas with $\blexer$ the size can grow exponentially. This
would be an important milestone for the thesis, because we already
have a very good idea how to establish that our set of simplification
rules keeps the size of derivatives below a relatively tight bound.
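
To give an impression of the bitsequence encoding, the following is a
hypothetical Scala sketch of Sulzmann and Lu-style annotated regular
expressions, where every constructor carries a bitsequence (here
represented as Booleans); the names and the marker-bit convention are
our own illustration, not the report's actual definitions.

\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Boolean]) extends ARexp
case class ACHAR(bs: List[Boolean], c: Char) extends ARexp
case class AALT(bs: List[Boolean], r1: ARexp, r2: ARexp) extends ARexp
case class ASEQ(bs: List[Boolean], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Boolean], r: ARexp) extends ARexp

def bnullable(r: ARexp): Boolean = r match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALT(_, r1, r2) => bnullable(r1) || bnullable(r2)
  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
  case ASTAR(_, _) => true
}

// bmkeps collects the bits of a nullable annotated regular
// expression; from these bits a value can later be decoded
// (partial: only defined on nullable expressions)
def bmkeps(r: ARexp): List[Boolean] = r match {
  case AONE(bs) => bs
  case AALT(bs, r1, r2) =>
    if (bnullable(r1)) bs ++ bmkeps(r1) else bs ++ bmkeps(r2)
  case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
  case ASTAR(bs, _) => bs ++ List(true) // marker: end of iterations
}
\end{verbatim}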

In order to prove the main theorem in \eqref{mainthm}, we need to
prove that the two functions produce the same output. The definitions
of these functions are shown below.

\begin{center}
\begin{tabular}{lcl}
 & & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
 & & $\;\; \textit{else}\;\textit{None}$
\end{tabular}
\end{center}
\noindent
In these definitions $(r^\uparrow)$ is a kind of coding function that
is the same in each case; similarly the \textit{decode} and
\textit{bmkeps} functions are the same in each case. Our main theorem
\eqref{mainthm} therefore boils down to proving the following two
propositions (depending on which branch the if-else clause takes).
They establish that the derivatives \emph{with} simplification do not
change the computed result:

\begin{itemize}
\item{(a)} If a string $s$ is in the language $L(r)$, then \\
$\textit{bmkeps} (r^\uparrow)\backslash_{simp}\,s = \textit{bmkeps} (r^\uparrow)\backslash s$,\\
\item{(b)} If a string $s$ is \emph{not} in the language $L(r)$, then
$\rup \backslash_{simp} \,s$ is not nullable.
\end{itemize}

\noindent
We have already proved the second part in Isabelle. This is actually
not too difficult because we can show that simplification does not
change the language of regular expressions.

If we can prove the first part, that is that the bitsequence algorithm
with simplification produces the same result as the one without
simplification, then we are done. Unfortunately that part requires
more effort, because simplification does not only need to \emph{not}
change the language, but also not change the value (that is the
computed result).
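
To illustrate the difference, consider a hypothetical sketch of basic
simplification rules on ordinary regular expressions (our own
illustration of the kind of rules meant, reusing the Rexp datatype
from the earlier sketch):

\begin{verbatim}
// basic rules: r + ZERO => r, ZERO ~ r => ZERO, ONE ~ r => r,
// and r + r => r (deduplication of alternatives)
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2
    case (s1, ZERO) => s1
    case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (_, ZERO) => ZERO
    case (ONE, s2) => s2
    case (s1, ONE) => s1
    case (s1, s2) => SEQ(s1, s2)
  }
  case _ => r
}
\end{verbatim}

\noindent
All of these rules clearly preserve the language. But they can change
the shape of the computed value: deduplicating $r + r$ to $r$ removes
the alternative that a Right-value would have referred to, and
rewriting $\textit{ONE} \cdot r$ to $r$ drops a sequence value. After
such steps the value has to be repaired accordingly, which is exactly
where the extra proof effort lies.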

For this we have started with looking at the original proof that
established that the bitsequence algorithm produces the same result
as the algorithm not using bitsequences. Formally this proof
established

\begin{equation}\label{lexer}
\blexer \; (r^\uparrow) \; s = \lexer \;r \;s
\end{equation}

\noindent
This proof used two ``tricks''. One is that it defined a
$\flex$-function

\begin{center}
\begin{tabular}{lcl}
$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
$\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$
\end{tabular}
\end{center}

\noindent
and then proved an equality in terms of $\flex$ for the right-hand
side of \eqref{lexer} (shown below). The $\flex$-function essentially
does lexing by stacking up injection functions while doing
derivatives, explicitly showing the order in which the characters are
injected back in each step.
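
As a concrete illustration, the following is a minimal Scala
transcription of $\flex$, assuming the Rexp datatype and der from the
earlier sketch; the value datatype and the mkeps and inj functions
follow the standard injection-based lexer (the names are our own, not
the report's code).

\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class LeftV(v: Val) extends Val
case class RightV(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// mkeps: a value witnessing how a nullable r matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) LeftV(mkeps(r1)) else RightV(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

// inj: injects the character c back into a value for der(c, r),
// giving a value for r
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), LeftV(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), RightV(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, _), LeftV(v1)) => LeftV(inj(r1, c, v1))
  case (ALT(_, r2), RightV(v2)) => RightV(inj(r2, c, v2))
  case (CHR(_), Empty) => Chr(c)
}

// flex stacks up one injection function per derivative step
def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val = s match {
  case Nil => f
  case c :: cs => flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
}

// lexer r s = flex id r s (mkeps (r \ s)) -- assuming s is in L(r)
def ders(s: List[Char], r: Rexp): Rexp =
  s.foldLeft(r)((r, c) => der(c, r))
def lexer(r: Rexp, s: List[Char]): Val =
  flex(r, identity, s)(mkeps(ders(s, r)))
\end{verbatim}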

With $\flex$ we can write $\lexer$ this way:
\begin{center}
$\lexer \;r\; s = \flex \;id \; r\;s \;(\mkeps \; r\backslash s)$