10 |
10 |
11 So far our algorithm based on derivatives was only able to say |
11 So far our algorithm based on derivatives was only able to say |
12 yes or no depending on whether a string was matched by regular |
12 yes or no depending on whether a string was matched by regular |
13 expression or not. Often a more interesting question is to |
13 expression or not. Often a more interesting question is to |
14 find out \emph{how} a regular expression matched a string? |
14 find out \emph{how} a regular expression matched a string? |
15 Answering this question will help us with the problem we are |
15 Answering this question will also help us with the problem we |
16 after, namely tokenising an input string. The algorithm we |
16 are after, namely tokenising an input string. The algorithm we |
17 will be looking at for this was designed by Sulzmann \& Lu in |
17 will be looking at for this was designed by Sulzmann \& Lu in |
18 a rather recent paper. A link to it is provided on KEATS, in |
18 a rather recent paper. A link to it is provided on KEATS, in |
19 case you are interested.\footnote{In my humble opinion this is |
19 case you are interested.\footnote{In my humble opinion this is |
20 an interesting instance of the research literature: it |
20 an interesting instance of the research literature: it |
21 contains a very neat idea, but its presentation is rather |
21 contains a very neat idea, but its presentation is rather |
71 corresponding to the two alternatives. Note that $r^*$ is |
71 corresponding to the two alternatives. Note that $r^*$ is |
72 associated with a list of values, one for each copy of $r$ |
72 associated with a list of values, one for each copy of $r$ |
73 that was needed to match the string. This means we might also |
73 that was needed to match the string. This means we might also |
74 return the empty list $[]$, if no copy was needed. |
74 return the empty list $[]$, if no copy was needed. |
75 |
75 |
76 Graphically the algorithm by Sulzmann \& Lu can be represneted |
76 To emphasise the connection between regular expressions and |
|
77 values, I have in my implementation the convention that |
|
78 regular expressions are written entirely with upper-case |
|
79 letters, while values just start with a single upper-case |
|
80 character. My definition of values in Scala is below. I use |
|
81 this in the REPL of Scala; when I use the Scala compiler |
|
82 I need to rename some constructors, because Scala on Macs |
|
83 does not like classes that are called \pcode{EMPTY} and |
|
84 \pcode{Empty}. |
|
85 |
|
86 {\small\lstinputlisting[language=Scala,numbers=none] |
|
87 {../progs/app02.scala}} |
|
88 |
|
89 |
|
90 Graphically the algorithm by Sulzmann \& Lu can be illustrated |
77 by the picture in Figure~\ref{Sulz} where the path from the |
91 by the picture in Figure~\ref{Sulz} where the path from the |
78 left to the right involving $der/nullable$ is the first phase |
92 left to the right involving $der/nullable$ is the first phase |
79 of the algorithm and $mkeps/inj$, the path from right to left, |
93 of the algorithm and $mkeps/inj$, the path from right to left, |
80 the second phase. This picture shows the steps required when a |
94 the second phase. This picture shows the steps required when a |
81 regular expression, say $r_1$, matches the string $abc$. We |
95 regular expression, say $r_1$, matches the string $abc$. We |
130 these regular expression cannot match the empty string. Note |
144 these regular expression cannot match the empty string. Note |
131 also that in case of alternatives we give preference to the |
145 also that in case of alternatives we give preference to the |
132 regular expression on the left-hand side. This will become |
146 regular expression on the left-hand side. This will become |
133 important later on. |
147 important later on. |
134 |
148 |
135 The second phase of the algorithm is organised recursively |
149 The second phase of the algorithm is organised so that it will |
136 such that it will calculate a value for how the derivative |
150 calculate a value for how the derivative regular expression |
137 regular expression has matched a string whose first character |
151 has matched a string whose first character has been chopped |
138 has been chopped off. Now we need a function that reverses |
152 off. Now we need a function that reverses this ``chopping |
139 this ``chopping off'' for values. The corresponding function |
153 off'' for values. The corresponding function is called $inj$ |
140 is called $inj$ for injection. This function takes three |
154 for injection. This function takes three arguments: the first |
141 arguments: the first one is a regular expression for which we |
155 one is a regular expression for which we want to calculate the |
142 want to calculate the value, the second is the character we |
156 value, the second is the character we want to inject and the |
143 want to inject and the third argument is the value where we |
157 third argument is the value where we will inject the |
144 will inject the character. The result of this function is a |
158 character. The result of this function is a new value. The |
145 new value. The definition of $inj$ is as follows: |
159 definition of $inj$ is as follows: |
146 |
160 |
147 \begin{center} |
161 \begin{center} |
148 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
162 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
149 $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
163 $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
150 $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ |
164 $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ |
159 \noindent This definition is by recursion on the regular |
173 \noindent This definition is by recursion on the regular |
160 expression and by analysing the shape of the values. Therefore |
174 expression and by analysing the shape of the values. Therefore |
161 there are, for example, three cases for sequnece regular |
175 there are, for example, three cases for sequnece regular |
162 expressions. The last clause for the star regular expression |
176 expressions. The last clause for the star regular expression |
163 returns a list where the first element is $inj\,r\,c\,v$ and |
177 returns a list where the first element is $inj\,r\,c\,v$ and |
164 the other elements are $vs$. That mean $\_\,::\,\_$ should be |
178 the other elements are $vs$. That means $\_\,::\,\_$ should be |
165 read as list cons. |
179 read as list cons. |
166 |
180 |
167 To understand what is going on, it might be best to do some |
181 To understand what is going on, it might be best to do some |
168 example calculations and compare with Figure~\ref{Sulz}. For |
182 example calculations and compare them with Figure~\ref{Sulz}. |
169 this note that we have not yet dealt with the need of |
183 For this note that we have not yet dealt with the need of |
170 simplifying regular expressions (this will be a topic on its |
184 simplifying regular expressions (this will be a topic on its |
171 own later). Suppose the regular expression is $a \cdot (b |
185 own later). Suppose the regular expression is $a \cdot (b |
172 \cdot c)$ and the input string is $abc$. The derivatives from |
186 \cdot c)$ and the input string is $abc$. The derivatives from |
173 the first phase are as follows: |
187 the first phase are as follows: |
174 |
188 |
213 |
227 |
214 \begin{center} |
228 \begin{center} |
215 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ |
229 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ |
216 \end{center} |
230 \end{center} |
217 |
231 |
218 \noindent Finally we need to inject back the letter $a$ into |
232 \noindent which is again the correct result for how $r_2$ |
219 $v_2$ giving the final result |
233 matched the string $bc$. Finally we need to inject back the |
|
234 letter $a$ into $v_2$ giving the final result |
220 |
235 |
221 \begin{center} |
236 \begin{center} |
222 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ |
237 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ |
223 \end{center} |
238 \end{center} |
224 |
239 |
225 \noindent This now corresponds to how the regular |
240 \noindent This now corresponds to how the regular |
226 expression $a \cdot (b \cdot c)$ matched the string $abc$. |
241 expression $a \cdot (b \cdot c)$ matched the string $abc$. |
227 |
242 |
228 There are a few auxiliary functions that are of interest |
243 There are a few auxiliary functions that are of interest |
229 in analysing this algorithm. One is called \emph{flatten}, |
244 when analysing this algorithm. One is called \emph{flatten}, |
230 written $|\_|$, which extracts the string ``underlying'' a |
245 written $|\_|$, which extracts the string ``underlying'' a |
231 value. It is defined recursively as |
246 value. It is defined recursively as |
232 |
247 |
233 \begin{center} |
248 \begin{center} |
234 \begin{tabular}{lcl} |
249 \begin{tabular}{lcl} |
252 $|v_2|$: & $bc$\\ |
267 $|v_2|$: & $bc$\\ |
253 $|v_1|$: & $abc$ |
268 $|v_1|$: & $abc$ |
254 \end{tabular} |
269 \end{tabular} |
255 \end{center} |
270 \end{center} |
256 |
271 |
257 \noindent |
272 \noindent This indicates that $inj$ indeed is injecting, or |
258 This indicates that $inj$ indeed is injecting, or adding, back |
273 adding, back a character into the value. Next we look at how |
259 a character into the value. |
274 simplification can be included into this algorithm. |
260 |
275 |
261 |
276 |
262 \subsubsection*{Simplification} |
277 \subsubsection*{Simplification} |
263 |
278 |
264 Generally the matching algorithms based on derivatives do |
279 Generally the matching algorithms based on derivatives do |
265 poorly unless the regular expressions are simplified after |
280 poorly unless the regular expressions are simplified after |
266 each derivatives step. But this is a bit more involved in |
281 each derivative step. But this is a bit more involved in the |
267 algorithm of Sulzmann \& Lu. Consider the last derivation |
282 algorithm of Sulzmann \& Lu. So what follows might require you |
268 step in our running example |
283 to read several times before it makes sense and also might |
269 |
284 require that you do some example calculations. As a first |
270 \begin{center} |
285 example consider the last derivation step in our earlier |
271 $r_4 = der\,c\,r_3 = (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$ |
286 example: |
272 \end{center} |
287 |
273 |
288 \begin{center} |
274 \noindent Simplifying the result would just give us |
289 $r_4 = der\,c\,r_3 = |
275 $\epsilon$. Running $mkeps$ on this regular expression would |
290 (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$ |
276 then provide us with $Empty$ instead of $Right(Right(Empty))$ |
291 \end{center} |
277 that was obtained without the simplification. The problem is |
292 |
278 we need to recreate this more complicated value, rather than |
293 \noindent Simplifying this regular expression would just give |
279 just $Empty$. |
294 us $\epsilon$. Running $mkeps$ with this regular expression as |
280 |
295 input, however, would then provide us with $Empty$ instead of |
281 This requires what I call \emph{rectification functions}. They |
296 $Right(Right(Empty))$ that was obtained without the |
282 need to be calculated whenever a regular expression gets |
297 simplification. The problem is we need to recreate this more |
|
298 complicated value, rather than just $Empty$. |
|
299 |
|
300 This will require what I call \emph{rectification functions}. |
|
301 They need to be calculated whenever a regular expression gets |
283 simplified. Rectification functions take a value as argument |
302 simplified. Rectification functions take a value as argument |
284 and return a (rectified) value. Our simplification rules so |
303 and return a (rectified) value. Let us first take a look again |
285 far are |
304 at our simplification rules: |
286 |
305 |
287 \begin{center} |
306 \begin{center} |
288 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l} |
307 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l} |
289 $r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ |
308 $r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ |
290 $\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ |
309 $\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ |
295 $r + r$ & $\mapsto$ & $r$\\ |
314 $r + r$ & $\mapsto$ & $r$\\ |
296 \end{tabular} |
315 \end{tabular} |
297 \end{center} |
316 \end{center} |
298 |
317 |
299 \noindent Applying them to $r_4$ will require several nested |
318 \noindent Applying them to $r_4$ will require several nested |
300 simplifications in order end up with just $\epsilon$. |
319 simplifications in order end up with just $\epsilon$. However, |
301 |
320 it is possible to apply them in a depth-first, or inside-out, |
302 We can implement this by letting simp return not just a |
321 manner in order to calculate this simplified regular |
303 (simplified) regular expression, but also a rectification |
322 expression. |
304 function. Let us consider the alternative case, say $r_1 + |
323 |
305 r_2$, first. We would first simplify the component regular |
324 The rectification we can implement this by letting simp return |
306 expressions $r_1$ and $r_2.$ This will return simplified |
325 not just a (simplified) regular expression, but also a |
307 versions (if they can be simplified), say $r_{1s}$ and |
326 rectification function. Let us consider the alternative case, |
308 $r_{2s}$, but also two rectification functions $f_{1s}$ and |
327 $r_1 + r_2$, first. By going depth-first, we first simplify |
309 $f_{2s}$. We need to assemble them in order to obtain a |
328 the component regular expressions $r_1$ and $r_2.$ This will |
310 rectified value for $r_1 + r_2$. In case $r_{1s}$ simplified |
329 return simplified versions (if they can be simplified), say |
311 to $\varnothing$, we would continue the derivative calculation |
330 $r_{1s}$ and $r_{2s}$, but also two rectification functions |
312 with $r_{2s}$. The Sulzmann \& Lu algorithm would return a |
331 $f_{1s}$ and $f_{2s}$. We need to assemble them in order to |
313 corresponding value, say $v_{2s}$. But now this needs to |
332 obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$ |
314 be ``rectified'' to the value |
333 simplified to $\varnothing$, we continue the derivative |
|
334 calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would |
|
335 return a corresponding value, say $v_{2s}$. But now this value |
|
336 needs to be ``rectified'' to the value |
315 |
337 |
316 \begin{center} |
338 \begin{center} |
317 $Right(v_{2s})$ |
339 $Right(v_{2s})$ |
318 \end{center} |
340 \end{center} |
319 |
341 |
320 \noindent Unfortunately, this is not enough because there |
342 \noindent The reason is that we look for the value that tells |
321 might be some simplifications that happened inside $r_2$ |
343 us how $r_1 + r_2$ could have matched the string, not just |
322 and for which the simplification function retuned also |
344 $r_2$ or $r_{2s}$. Unfortunately, this is still not the right |
323 a rectification function $f_{2s}$. So in fact we need to |
345 value in general because there might be some simplifications |
324 apply this one too which gives |
346 that happened inside $r_2$ and for which the simplification |
|
347 function retuned also a rectification function $f_{2s}$. So in |
|
348 fact we need to apply this one too which gives |
325 |
349 |
326 \begin{center} |
350 \begin{center} |
327 $Right(f_{2s}(v_{2s}))$ |
351 $Right(f_{2s}(v_{2s}))$ |
328 \end{center} |
352 \end{center} |
329 |
353 |
330 \noindent So if we want to return this as function, |
354 \noindent This is now the correct, or rectified, value. Since |
331 we would need to return |
355 the simplification will be done in the first phase of the |
|
356 algorithm, but the rectification needs to be done to the |
|
357 values in the second phase, it is advantageous to calculate |
|
358 the rectification as a function, remember this function and |
|
359 then apply the value to this function during the second phase. |
|
360 So if we want to implement the rectification as function, we |
|
361 would need to return |
332 |
362 |
333 \begin{center} |
363 \begin{center} |
334 $\lambda v.\,Right(f_{2s}(v))$ |
364 $\lambda v.\,Right(f_{2s}(v))$ |
335 \end{center} |
365 \end{center} |
336 |
366 |
337 \noindent which is the lambda-calculus notation for |
367 \noindent which is the lambda-calculus notation for |
338 a function that expects a value $v$ and returns everything |
368 a function that expects a value $v$ and returns everything |
339 after the dot where $v$ is replaced by whatever value is |
369 after the dot where $v$ is replaced by whatever value is |
340 given. |
370 given. |
341 |
371 |
342 Let us package these ideas into a single function (still only |
372 Let us package this idea with rectification functions |
343 considering the alternative case): |
373 into a single function (still only considering the alternative |
|
374 case): |
344 |
375 |
345 \begin{center} |
376 \begin{center} |
346 \begin{tabular}{l} |
377 \begin{tabular}{l} |
347 $simp(r)$:\\ |
378 $simp(r)$:\\ |
348 \quad case $r = r_1 + r_2$\\ |
379 \quad case $r = r_1 + r_2$\\ |
350 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ |
381 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ |
351 \qquad case $r_{1s} = \varnothing$: |
382 \qquad case $r_{1s} = \varnothing$: |
352 return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\ |
383 return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\ |
353 \qquad case $r_{2s} = \varnothing$: |
384 \qquad case $r_{2s} = \varnothing$: |
354 return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ |
385 return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ |
|
386 \qquad case $r_{1s} = r_{2s}$: |
|
387 return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ |
355 \qquad otherwise: |
388 \qquad otherwise: |
356 return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\ |
389 return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\ |
357 \end{tabular} |
390 \end{tabular} |
358 \end{center} |
391 \end{center} |
359 |
392 |
360 \noindent We first recursively call the simlification with $r_1$ |
393 \noindent We first recursively call the simplification with |
361 and $r_2$. This gives simplified regular expressions, $r_{1s}$ |
394 $r_1$ and $r_2$. This gives simplified regular expressions, |
362 and $r_{2s}$, as well as two rectification functions $f_{1s}$ |
395 $r_{1s}$ and $r_{2s}$, as well as two rectification functions |
363 and $f_{2s}$. We next need to test whether the simplified |
396 $f_{1s}$ and $f_{2s}$. We next need to test whether the |
364 regular expressions are $\varnothing$ so as to make further |
397 simplified regular expressions are $\varnothing$ so as to make |
365 simplifications. In case $r_{1s}$ is $\varnothing$ then |
398 further simplifications. In case $r_{1s}$ is $\varnothing$, |
366 we can return $r_{2s}$ (the other alternative). However |
399 then we can return $r_{2s}$ (the other alternative). However |
367 we need to now build a rectification function, which as |
400 for this we need to build a corresponding rectification |
368 said above is |
401 function, which as said above is |
369 |
402 |
370 \begin{center} |
403 \begin{center} |
371 $\lambda v.\,Right(f_{2s}(v))$ |
404 $\lambda v.\,Right(f_{2s}(v))$ |
372 \end{center} |
405 \end{center} |
373 |
406 |
374 \noindent The case where $r_{2s} = \varnothing$ is similar. |
407 \noindent The case where $r_{2s} = \varnothing$ is similar: |
375 We return $r_{1s}$ but now have to rectify such that we return |
408 We return $r_{1s}$ and rectify with $Left(\_)$ and the |
|
409 other calculated rectification function $f_{1s}$. This gives |
376 |
410 |
377 \begin{center} |
411 \begin{center} |
378 $\lambda v.\,Left(f_{1s}(v))$ |
412 $\lambda v.\,Left(f_{1s}(v))$ |
379 \end{center} |
413 \end{center} |
380 |
414 |
381 \noindent Note that in this case we have to apply $f_{1s}$, |
415 \noindent The next case where $r_{1s} = r_{2s}$ can be treated |
382 not $f_{2s}$, which is responsible to rectify the inner parts |
416 like the one where $r_{2s} = \varnothing$. We return $r_{1s}$ |
383 of $v$. The otherwise-case is slightly interesting. In this |
417 and rectify with $Left(\_)$ and so on. |
384 case neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and no |
418 |
385 further simplification can be applied. Accordingly, we return |
419 |
386 $r_{1s} + r_{2s}$ as the simplified regular expression. In |
420 The otherwise-case is slightly more complicated. In this case |
387 principle we also do not have to do any rectification, because |
421 neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and also |
388 no simplification was done in this case. But this is actually |
422 $r_{1s} \not= r_{2s}$, which means no further simplification |
389 not true: There might have been simplifications inside |
423 can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as |
390 $r_{1s}$ and $r_2s$. We therefore need to take into account |
424 the simplified regular expression. In principle we also do not |
391 the calculated rectification functions $f_{1s}$ and $f_{2s}$. |
425 have to do any rectification, because no simplification was |
392 We can do this by defining a rectification function $f_{alt}$ |
426 done in this case. But this is actually not true: There might |
393 which takes two rectification functions as arguments |
427 have been simplifications inside $r_{1s}$ and $r_{2s}$. We |
|
428 therefore need to take into account the calculated |
|
429 rectification functions $f_{1s}$ and $f_{2s}$. We can do this |
|
430 by defining a rectification function $f_{alt}$ which takes two |
|
431 rectification functions as arguments and applies them |
|
432 according to whether the value is of the form $Left(\_)$ or |
|
433 $Right(\_)$: |
394 |
434 |
395 \begin{center} |
435 \begin{center} |
396 \begin{tabular}{l@{\hspace{1mm}}l} |
436 \begin{tabular}{l@{\hspace{1mm}}l} |
397 $f_{alt}(f_1, f_2) \dn$\\ |
437 $f_{alt}(f_1, f_2) \dn$\\ |
398 \qquad $\lambda v.\,$ case $v = Left(v')$: |
438 \qquad $\lambda v.\,$ case $v = Left(v')$: |
428 \qquad otherwise: |
463 \qquad otherwise: |
429 return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\ |
464 return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\ |
430 \end{tabular} |
465 \end{tabular} |
431 \end{center} |
466 \end{center} |
432 |
467 |
433 \noindent whereby $f_{seq}$ is again pushing the two |
468 \noindent whereby in the last line $f_{seq}$ is again pushing |
434 rectification functions into the two components of |
469 the two rectification functions into the two components of the |
435 the Seq-value: |
470 Seq-value: |
436 |
471 |
437 \begin{center} |
472 \begin{center} |
438 \begin{tabular}{l@{\hspace{1mm}}l} |
473 \begin{tabular}{l@{\hspace{1mm}}l} |
439 $f_{seq}(f_1, f_2) \dn$\\ |
474 $f_{seq}(f_1, f_2) \dn$\\ |
440 \qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: |
475 \qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: |
441 & return $Seq(f_1(v_1), f_2(v_2))$\\ |
476 & return $Seq(f_1(v_1), f_2(v_2))$\\ |
442 \end{tabular} |
477 \end{tabular} |
443 \end{center} |
478 \end{center} |
444 |
479 |
445 \noindent Note that in the case of $r_{1s}$ (similarly $r_{2s}$) |
480 \noindent Note that in the case of $r_{1s} = \varnothing$ |
446 we use the function $f_{error}$ for rectification. If you |
481 (similarly $r_{2s}$) we use the function $f_{error}$ for |
447 think carefully, then you will see that this function will |
482 rectification. If you think carefully, then you will realise |
448 actually never been called. Because a sequence with |
483 that this function will actually never been called. This is |
449 $\varnothing$ will never recognise any string and therefore |
484 because a sequence with $\varnothing$ will never recognise any |
450 the second phase of the algorithm would never been called. |
485 string and therefore the second phase of the algorithm would |
451 The simplification function still expects us to give a |
486 never been called. The simplification function still expects |
452 function. So in my own implementation I just returned |
487 us to give a function. So in my own implementation I just |
453 a function which raises an error. |
488 returned a function which raises an error. In the case |
454 |
489 where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have |
|
490 to create a sequence where the first component is a rectified |
|
491 version of $Empty$. Therefore we call $f_{1s}$ with $Empty$. |
|
492 |
|
493 Since we only simplify regular expressions of the form $r_1 + |
|
494 r_2$ and $r_1 \cdot r_2$ we do not have to do anything else |
|
495 in the remaining cases. The rectification function will |
|
496 be just the identity, which in lambda-calculus terms is |
|
497 just |
|
498 |
|
499 \begin{center} |
|
500 $\lambda v.\,v$ |
|
501 \end{center} |
|
502 |
|
503 \noindent This completes the high-level version of the |
|
504 simplification function, which is also shown again in |
|
505 Figure~\ref{simp}. This can now be used in a \emph{lexing |
|
506 function} as follows: |
|
507 |
|
508 \begin{figure}[t] |
|
509 \begin{center} |
|
510 \begin{tabular}{l} |
|
511 $simp(r)$:\\ |
|
512 \quad case $r = r_1 + r_2$\\ |
|
513 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ |
|
514 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ |
|
515 \qquad case $r_{1s} = \varnothing$: |
|
516 return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\ |
|
517 \qquad case $r_{2s} = \varnothing$: |
|
518 return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ |
|
519 \qquad case $r_{1s} = r_{2s}$: |
|
520 return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\ |
|
521 \qquad otherwise: |
|
522 return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$ |
|
523 \medskip\\ |
|
524 |
|
525 \quad case $r = r_1 \cdot r_2$\\ |
|
526 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ |
|
527 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ |
|
528 \qquad case $r_{1s} = \varnothing$: |
|
529 return $(\varnothing, f_{error})$\\ |
|
530 \qquad case $r_{2s} = \varnothing$: |
|
531 return $(\varnothing, f_{error})$\\ |
|
532 \qquad case $r_{1s} = \epsilon$: |
|
533 return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\ |
|
534 \qquad case $r_{2s} = \epsilon$: |
|
535 return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\ |
|
536 \qquad otherwise: |
|
537 return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$ |
|
538 \medskip\\ |
|
539 |
|
540 \quad otherwise:\\ |
|
541 \qquad return $(r, \lambda v.\,v)$\\ |
|
542 \end{tabular} |
|
543 \end{center} |
|
544 \caption{The simplification function that returns a simplified |
|
545 regular expression and a rectification function.\label{simp}} |
|
546 \end{figure} |
|
547 |
|
548 \begin{center} |
|
549 \begin{tabular}{lcl} |
|
550 $lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\ |
|
551 & & else $error$\\ |
|
552 $lex\,r\,c\!::\!s$ & $\dn$ & let |
|
553 $(r_{simp}, f_{rect}) = simp(der(c, r))$\\ |
|
554 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ |
|
555 \end{tabular} |
|
556 \end{center} |
|
557 |
|
558 \noindent This corresponds to the $matches$ function we |
|
559 have seen in earlier lectures. In the first clause we are |
|
560 given an empty string, $[]$, and need to test wether the |
|
561 regular expression is $nullable$. If yes we can proceed |
|
562 normally and just return the value calculated by $mkeps$. |
|
563 The second clause is for strings where the first character is |
|
564 $c$ and the rest of the string is $s$. We first build the |
|
565 derivative of $r$ with respect to $c$; simplify the resulting |
|
566 regulare expression. We continue lexing with the simplified |
|
567 regular expression and the string $s$. Whatever will be |
|
568 returned as value, we sill rectify using the $f_{rect}$ |
|
569 from the simplification and finally inject $c$ back into |
|
570 the (rectified) value. |
455 |
571 |
456 |
572 |
457 \subsubsection*{Records and Tokenisation} |
573 \subsubsection*{Records and Tokenisation} |
458 |
574 |
459 \newpage |
575 \newpage |