handouts/ho04.tex
changeset 287 2c50b8b5886c
parent 286 19020b75d75e
child 288 39aeca14af8c
equal deleted inserted replaced
286:19020b75d75e 287:2c50b8b5886c
    10 
    10 
    11 So far our algorithm based on derivatives was only able to say
    11 So far our algorithm based on derivatives was only able to say
    12 yes or no depending on whether a string was matched by regular
    12 yes or no depending on whether a string was matched by regular
    13 expression or not. Often a more interesting question is to
    13 expression or not. Often a more interesting question is to
    14 find out \emph{how} a regular expression matched a string?
    14 find out \emph{how} a regular expression matched a string?
    15 Answering this question will help us with the problem we are
    15 Answering this question will also help us with the problem we
    16 after, namely tokenising an input string. The algorithm we
    16 are after, namely tokenising an input string. The algorithm we
    17 will be looking at for this was designed by Sulzmann \& Lu in
    17 will be looking at for this was designed by Sulzmann \& Lu in
    18 a rather recent paper. A link to it is provided on KEATS, in
    18 a rather recent paper. A link to it is provided on KEATS, in
    19 case you are interested.\footnote{In my humble opinion this is
    19 case you are interested.\footnote{In my humble opinion this is
    20 an interesting instance of the research literature: it
    20 an interesting instance of the research literature: it
    21 contains a very neat idea, but its presentation is rather
    21 contains a very neat idea, but its presentation is rather
    71 corresponding to the two alternatives. Note that $r^*$ is
    71 corresponding to the two alternatives. Note that $r^*$ is
    72 associated with a list of values, one for each copy of $r$
    72 associated with a list of values, one for each copy of $r$
    73 that was needed to match the string. This means we might also
    73 that was needed to match the string. This means we might also
    74 return the empty list $[]$, if no copy was needed.
    74 return the empty list $[]$, if no copy was needed.
    75 
    75 
    76 Graphically the algorithm by Sulzmann \& Lu can be represneted
    76 To emphasise the connection between regular expressions and
       
    77 values, I have in my implementation the convention that
       
    78 regular expressions are written entirely with upper-case
       
    79 letters, while values just start with a single upper-case
       
    80 character. My definition of values in Scala is below. I use 
       
    81 this in the REPL of Scala; when I use the Scala compiler
       
    82 I need to rename some constructors, because Scala on Macs
       
    83 does not like classes that are called \pcode{EMPTY} and
       
    84 \pcode{Empty}.
       
    85  
       
    86 {\small\lstinputlisting[language=Scala,numbers=none]
       
    87 {../progs/app02.scala}}
       
    88 
       
    89 
       
    90 Graphically the algorithm by Sulzmann \& Lu can be illustrated
    77 by the picture in Figure~\ref{Sulz} where the path from the
    91 by the picture in Figure~\ref{Sulz} where the path from the
    78 left to the right involving $der/nullable$ is the first phase
    92 left to the right involving $der/nullable$ is the first phase
    79 of the algorithm and $mkeps/inj$, the path from right to left,
    93 of the algorithm and $mkeps/inj$, the path from right to left,
    80 the second phase. This picture shows the steps required when a
    94 the second phase. This picture shows the steps required when a
    81 regular expression, say $r_1$, matches the string $abc$. We
    95 regular expression, say $r_1$, matches the string $abc$. We
   130 these regular expression cannot match the empty string. Note
   144 these regular expression cannot match the empty string. Note
   131 also that in case of alternatives we give preference to the
   145 also that in case of alternatives we give preference to the
   132 regular expression on the left-hand side. This will become
   146 regular expression on the left-hand side. This will become
   133 important later on.
   147 important later on.
   134 
   148 
   135 The second phase of the algorithm is organised recursively
   149 The second phase of the algorithm is organised so that it will
   136 such that it will calculate a value for how the derivative
   150 calculate a value for how the derivative regular expression
   137 regular expression has matched a string whose first character
   151 has matched a string whose first character has been chopped
   138 has been chopped off. Now we need a function that reverses
   152 off. Now we need a function that reverses this ``chopping
   139 this ``chopping off'' for values. The corresponding function
   153 off'' for values. The corresponding function is called $inj$
   140 is called $inj$ for injection. This function takes three
   154 for injection. This function takes three arguments: the first
   141 arguments: the first one is a regular expression for which we
   155 one is a regular expression for which we want to calculate the
   142 want to calculate the value, the second is the character we
   156 value, the second is the character we want to inject and the
   143 want to inject and the third argument is the value where we
   157 third argument is the value where we will inject the
   144 will inject the character. The result of this function is a
   158 character. The result of this function is a new value. The
   145 new value. The definition of $inj$ is as follows: 
   159 definition of $inj$ is as follows: 
   146 
   160 
   147 \begin{center}
   161 \begin{center}
   148 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   162 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   149   $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
   163   $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
   150   $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
   164   $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
   159 \noindent This definition is by recursion on the regular
   173 \noindent This definition is by recursion on the regular
   160 expression and by analysing the shape of the values. Therefore
   174 expression and by analysing the shape of the values. Therefore
   161 there are, for example, three cases for sequnece regular
   175 there are, for example, three cases for sequnece regular
   162 expressions. The last clause for the star regular expression
   176 expressions. The last clause for the star regular expression
   163 returns a list where the first element is $inj\,r\,c\,v$ and
   177 returns a list where the first element is $inj\,r\,c\,v$ and
   164 the other elements are $vs$. That mean $\_\,::\,\_$ should be 
   178 the other elements are $vs$. That means $\_\,::\,\_$ should be 
   165 read as list cons.
   179 read as list cons.
   166 
   180 
   167 To understand what is going on, it might be best to do some
   181 To understand what is going on, it might be best to do some
   168 example calculations and compare with Figure~\ref{Sulz}. For
   182 example calculations and compare them with Figure~\ref{Sulz}.
   169 this note that we have not yet dealt with the need of
   183 For this note that we have not yet dealt with the need of
   170 simplifying regular expressions (this will be a topic on its
   184 simplifying regular expressions (this will be a topic on its
   171 own later). Suppose the regular expression is $a \cdot (b
   185 own later). Suppose the regular expression is $a \cdot (b
   172 \cdot c)$ and the input string is $abc$. The derivatives from
   186 \cdot c)$ and the input string is $abc$. The derivatives from
   173 the first phase are as follows:
   187 the first phase are as follows:
   174 
   188 
   213 
   227 
   214 \begin{center}
   228 \begin{center}
   215 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
   229 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
   216 \end{center}
   230 \end{center}
   217 
   231 
   218 \noindent Finally we need to inject back the letter $a$ into
   232 \noindent which is again the correct result for how $r_2$
   219 $v_2$ giving the final result
   233 matched the string $bc$. Finally we need to inject back the
       
   234 letter $a$ into $v_2$ giving the final result
   220 
   235 
   221 \begin{center}
   236 \begin{center}
   222 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
   237 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
   223 \end{center}
   238 \end{center}
   224 
   239 
   225 \noindent This now corresponds to how the regular
   240 \noindent This now corresponds to how the regular
   226 expression $a \cdot (b \cdot c)$ matched the string $abc$.
   241 expression $a \cdot (b \cdot c)$ matched the string $abc$.
   227 
   242 
   228 There are a few auxiliary functions that are of interest
   243 There are a few auxiliary functions that are of interest
   229 in analysing this algorithm. One is called \emph{flatten},
   244 when analysing this algorithm. One is called \emph{flatten},
   230 written $|\_|$, which extracts the string ``underlying'' a 
   245 written $|\_|$, which extracts the string ``underlying'' a 
   231 value. It is defined recursively as
   246 value. It is defined recursively as
   232 
   247 
   233 \begin{center}
   248 \begin{center}
   234 \begin{tabular}{lcl}
   249 \begin{tabular}{lcl}
   252 $|v_2|$: & $bc$\\
   267 $|v_2|$: & $bc$\\
   253 $|v_1|$: & $abc$
   268 $|v_1|$: & $abc$
   254 \end{tabular}
   269 \end{tabular}
   255 \end{center}
   270 \end{center}
   256 
   271 
   257 \noindent 
   272 \noindent This indicates that $inj$ indeed is injecting, or
   258 This indicates that $inj$ indeed is injecting, or adding, back
   273 adding, back a character into the value. Next we look at how
   259 a character into the value.
   274 simplification can be included into this algorithm.
   260 
   275 
   261 
   276 
   262 \subsubsection*{Simplification}
   277 \subsubsection*{Simplification}
   263 
   278 
   264 Generally the matching algorithms based on derivatives do
   279 Generally the matching algorithms based on derivatives do
   265 poorly unless the regular expressions are simplified after
   280 poorly unless the regular expressions are simplified after
   266 each derivatives step. But this is a bit more involved in 
   281 each derivative step. But this is a bit more involved in the
   267 algorithm of Sulzmann \& Lu. Consider the last derivation
   282 algorithm of Sulzmann \& Lu. So what follows might require you
   268 step in our running example
   283 to read several times before it makes sense and also might
   269 
   284 require that you do some example calculations. As a first
   270 \begin{center}
   285 example consider the last derivation step in our earlier
   271 $r_4 = der\,c\,r_3 = (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
   286 example:
   272 \end{center}
   287 
   273 
   288 \begin{center}
   274 \noindent Simplifying the result would just give us
   289 $r_4 = der\,c\,r_3 = 
   275 $\epsilon$. Running $mkeps$ on this regular expression would
   290 (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
   276 then provide us with $Empty$ instead of $Right(Right(Empty))$
   291 \end{center}
   277 that was obtained without the simplification. The problem is
   292 
   278 we need to recreate this more complicated value, rather than 
   293 \noindent Simplifying this regular expression would just give
   279 just $Empty$.
   294 us $\epsilon$. Running $mkeps$ with this regular expression as
   280 
   295 input, however, would then provide us with $Empty$ instead of
   281 This requires what I call \emph{rectification functions}. They
   296 $Right(Right(Empty))$ that was obtained without the
   282 need to be calculated whenever a regular expression gets
   297 simplification. The problem is we need to recreate this more
       
   298 complicated value, rather than just $Empty$.
       
   299 
       
   300 This will require what I call \emph{rectification functions}.
       
   301 They need to be calculated whenever a regular expression gets
   283 simplified. Rectification functions take a value as argument
   302 simplified. Rectification functions take a value as argument
   284 and return a (rectified) value. Our simplification rules so
   303 and return a (rectified) value. Let us first take a look again
   285 far are
   304 at our simplification rules:
   286 
   305 
   287 \begin{center}
   306 \begin{center}
   288 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
   307 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
   289 $r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
   308 $r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
   290 $\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
   309 $\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
   295 $r + r$ & $\mapsto$ & $r$\\ 
   314 $r + r$ & $\mapsto$ & $r$\\ 
   296 \end{tabular}
   315 \end{tabular}
   297 \end{center}
   316 \end{center}
   298 
   317 
   299 \noindent Applying them to $r_4$ will require several nested
   318 \noindent Applying them to $r_4$ will require several nested
   300 simplifications in order end up with just $\epsilon$. 
   319 simplifications in order end up with just $\epsilon$. However,
   301 
   320 it is possible to apply them in a depth-first, or inside-out,
   302 We can implement this by letting simp return not just a
   321 manner in order to calculate this simplified regular
   303 (simplified) regular expression, but also a rectification
   322 expression.
   304 function. Let us consider the alternative case, say $r_1 +
   323 
   305 r_2$, first. We would first simplify the component regular
   324 The rectification we can implement this by letting simp return
   306 expressions $r_1$ and $r_2.$ This will return simplified
   325 not just a (simplified) regular expression, but also a
   307 versions (if they can be simplified), say $r_{1s}$ and
   326 rectification function. Let us consider the alternative case,
   308 $r_{2s}$, but also two rectification functions $f_{1s}$ and
   327 $r_1 + r_2$, first. By going depth-first, we first simplify
   309 $f_{2s}$. We need to assemble them in order to obtain a
   328 the component regular expressions $r_1$ and $r_2.$ This will
   310 rectified value for $r_1 + r_2$. In case $r_{1s}$ simplified
   329 return simplified versions (if they can be simplified), say
   311 to $\varnothing$, we would continue the derivative calculation
   330 $r_{1s}$ and $r_{2s}$, but also two rectification functions
   312 with $r_{2s}$. The Sulzmann \& Lu algorithm would return a
   331 $f_{1s}$ and $f_{2s}$. We need to assemble them in order to
   313 corresponding value, say $v_{2s}$. But now this needs to
   332 obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$
   314 be ``rectified'' to the value 
   333 simplified to $\varnothing$, we continue the derivative
       
   334 calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would
       
   335 return a corresponding value, say $v_{2s}$. But now this value
       
   336 needs to be ``rectified'' to the value 
   315 
   337 
   316 \begin{center}
   338 \begin{center}
   317 $Right(v_{2s})$
   339 $Right(v_{2s})$
   318 \end{center}
   340 \end{center}
   319 
   341 
   320 \noindent Unfortunately, this is not enough because there
   342 \noindent The reason is that we look for the value that tells
   321 might be some simplifications that happened inside $r_2$ 
   343 us how $r_1 + r_2$ could have matched the string, not just
   322 and for which the simplification function retuned also
   344 $r_2$ or $r_{2s}$. Unfortunately, this is still not the right
   323 a rectification function $f_{2s}$. So in fact we need to
   345 value in general because there might be some simplifications
   324 apply this one too which gives
   346 that happened inside $r_2$ and for which the simplification
       
   347 function retuned also a rectification function $f_{2s}$. So in
       
   348 fact we need to apply this one too which gives
   325 
   349 
   326 \begin{center}
   350 \begin{center}
   327 $Right(f_{2s}(v_{2s}))$
   351 $Right(f_{2s}(v_{2s}))$
   328 \end{center}
   352 \end{center}
   329 
   353 
   330 \noindent So if we want to return this as function,
   354 \noindent This is now the correct, or rectified, value. Since
   331 we would need to return
   355 the simplification will be done in the first phase of the
       
   356 algorithm, but the rectification needs to be done to the
       
   357 values in the second phase, it is advantageous to calculate
       
   358 the rectification as a function, remember this function and
       
   359 then apply the value to this function during the second phase.
       
   360 So if we want to implement the rectification as function, we 
       
   361 would need to return
   332 
   362 
   333 \begin{center}
   363 \begin{center}
   334 $\lambda v.\,Right(f_{2s}(v))$
   364 $\lambda v.\,Right(f_{2s}(v))$
   335 \end{center}
   365 \end{center}
   336 
   366 
   337 \noindent which is the lambda-calculus notation for
   367 \noindent which is the lambda-calculus notation for
   338 a function that expects a value $v$ and returns everything
   368 a function that expects a value $v$ and returns everything
   339 after the dot where $v$ is replaced by whatever value is 
   369 after the dot where $v$ is replaced by whatever value is 
   340 given.
   370 given.
   341 
   371 
   342 Let us package these ideas into a single function (still only
   372 Let us package this idea with rectification functions
   343 considering the alternative case):
   373 into a single function (still only considering the alternative
       
   374 case):
   344 
   375 
   345 \begin{center}
   376 \begin{center}
   346 \begin{tabular}{l}
   377 \begin{tabular}{l}
   347 $simp(r)$:\\
   378 $simp(r)$:\\
   348 \quad case $r = r_1 + r_2$\\
   379 \quad case $r = r_1 + r_2$\\
   350 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
   381 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
   351 \qquad case $r_{1s} = \varnothing$: 
   382 \qquad case $r_{1s} = \varnothing$: 
   352        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
   383        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
   353 \qquad case $r_{2s} = \varnothing$: 
   384 \qquad case $r_{2s} = \varnothing$: 
   354        return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
   385        return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
       
   386 \qquad case $r_{1s} = r_{2s}$:
       
   387        return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
   355 \qquad otherwise: 
   388 \qquad otherwise: 
   356        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
   389        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
   357 \end{tabular}
   390 \end{tabular}
   358 \end{center}
   391 \end{center}
   359 
   392 
   360 \noindent We first recursively call the simlification with $r_1$
   393 \noindent We first recursively call the simplification with
   361 and $r_2$. This gives simplified regular expressions, $r_{1s}$
   394 $r_1$ and $r_2$. This gives simplified regular expressions,
   362 and $r_{2s}$, as well as two rectification functions $f_{1s}$
   395 $r_{1s}$ and $r_{2s}$, as well as two rectification functions
   363 and $f_{2s}$. We next need to test whether the simplified 
   396 $f_{1s}$ and $f_{2s}$. We next need to test whether the
   364 regular expressions are $\varnothing$ so as to make further 
   397 simplified regular expressions are $\varnothing$ so as to make
   365 simplifications. In case $r_{1s}$ is $\varnothing$ then 
   398 further simplifications. In case $r_{1s}$ is $\varnothing$,
   366 we can return $r_{2s}$ (the other alternative). However
   399 then we can return $r_{2s}$ (the other alternative). However
   367 we need to now build a rectification function, which as
   400 for this we need to build a corresponding rectification 
   368 said above is
   401 function, which as said above is
   369 
   402 
   370 \begin{center}
   403 \begin{center}
   371 $\lambda v.\,Right(f_{2s}(v))$
   404 $\lambda v.\,Right(f_{2s}(v))$
   372 \end{center}
   405 \end{center}
   373 
   406 
   374 \noindent The case where $r_{2s} = \varnothing$ is similar.
   407 \noindent The case where $r_{2s} = \varnothing$ is similar:
   375 We return $r_{1s}$ but now have to rectify such that we return
   408 We return $r_{1s}$ and rectify with $Left(\_)$ and the
       
   409 other calculated rectification function $f_{1s}$. This gives
   376 
   410 
   377 \begin{center}
   411 \begin{center}
   378 $\lambda v.\,Left(f_{1s}(v))$
   412 $\lambda v.\,Left(f_{1s}(v))$
   379 \end{center}
   413 \end{center}
   380 
   414 
   381 \noindent Note that in this case we have to apply $f_{1s}$,
   415 \noindent The next case where $r_{1s} = r_{2s}$ can be treated
   382 not $f_{2s}$, which is responsible to rectify the inner parts
   416 like the one where $r_{2s} = \varnothing$. We return $r_{1s}$
   383 of $v$. The otherwise-case is slightly interesting. In this
   417 and rectify with $Left(\_)$ and so on.
   384 case neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and no
   418 
   385 further simplification can be applied. Accordingly, we return
   419 
   386 $r_{1s} + r_{2s}$ as the simplified regular expression. In
   420 The otherwise-case is slightly more complicated. In this case
   387 principle we also do not have to do any rectification, because
   421 neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and also
   388 no simplification was done in this case. But this is actually
   422 $r_{1s} \not= r_{2s}$, which means no further simplification
   389 not true: There might have been simplifications inside
   423 can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
   390 $r_{1s}$ and $r_2s$. We therefore need to take into account
   424 the simplified regular expression. In principle we also do not
   391 the calculated rectification functions $f_{1s}$ and $f_{2s}$.
   425 have to do any rectification, because no simplification was
   392 We can do this by defining a rectification function $f_{alt}$
   426 done in this case. But this is actually not true: There might
   393 which takes two rectification functions as arguments
   427 have been simplifications inside $r_{1s}$ and $r_{2s}$. We
       
   428 therefore need to take into account the calculated
       
   429 rectification functions $f_{1s}$ and $f_{2s}$. We can do this
       
   430 by defining a rectification function $f_{alt}$ which takes two
       
   431 rectification functions as arguments and applies them
       
   432 according to whether the value is of the form $Left(\_)$ or
       
   433 $Right(\_)$:
   394 
   434 
   395 \begin{center}
   435 \begin{center}
   396 \begin{tabular}{l@{\hspace{1mm}}l}
   436 \begin{tabular}{l@{\hspace{1mm}}l}
   397 $f_{alt}(f_1, f_2) \dn$\\
   437 $f_{alt}(f_1, f_2) \dn$\\
   398 \qquad $\lambda v.\,$ case $v = Left(v')$: 
   438 \qquad $\lambda v.\,$ case $v = Left(v')$: 
   400 \qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
   440 \qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
   401       & return $Right(f_2(v'))$\\      
   441       & return $Right(f_2(v'))$\\      
   402 \end{tabular}
   442 \end{tabular}
   403 \end{center}
   443 \end{center}
   404 
   444 
   405 \noindent In essence we need to apply in this case 
   445 The other interesting case with simplification is the sequence
   406 the appropriate rectification function to the inner part
   446 case. In this case the main simplification function is as
   407 of the value $v$, whereby $v$ can only be of the form 
   447 follows
   408 $Right(\_)$ or $Left(\_)$.
       
   409 
       
   410 The other interesting case with simplification is the 
       
   411 sequence case. Here the main simplification function 
       
   412 is as follows
       
   413 
   448 
   414 \begin{center}
   449 \begin{center}
   415 \begin{tabular}{l}
   450 \begin{tabular}{l}
   416 $simp(r)$:\qquad\qquad (continued)\\
   451 $simp(r)$:\qquad\qquad (continued)\\
   417 \quad case $r = r_1 \cdot r_2$\\
   452 \quad case $r = r_1 \cdot r_2$\\
   428 \qquad otherwise: 
   463 \qquad otherwise: 
   429        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
   464        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
   430 \end{tabular}
   465 \end{tabular}
   431 \end{center}
   466 \end{center}
   432 
   467 
   433 \noindent whereby $f_{seq}$ is again pushing the two 
   468 \noindent whereby in the last line $f_{seq}$ is again pushing
   434 rectification functions into the two components of
   469 the two rectification functions into the two components of the
   435 the Seq-value:
   470 Seq-value:
   436 
   471 
   437 \begin{center}
   472 \begin{center}
   438 \begin{tabular}{l@{\hspace{1mm}}l}
   473 \begin{tabular}{l@{\hspace{1mm}}l}
   439 $f_{seq}(f_1, f_2) \dn$\\
   474 $f_{seq}(f_1, f_2) \dn$\\
   440 \qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: 
   475 \qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: 
   441       & return $Seq(f_1(v_1), f_2(v_2))$\\
   476       & return $Seq(f_1(v_1), f_2(v_2))$\\
   442 \end{tabular}
   477 \end{tabular}
   443 \end{center}
   478 \end{center}
   444 
   479 
   445 \noindent Note that in the case of $r_{1s}$ (similarly $r_{2s}$)
   480 \noindent Note that in the case of $r_{1s} = \varnothing$
   446 we use the function $f_{error}$ for rectification. If you 
   481 (similarly $r_{2s}$) we use the function $f_{error}$ for
   447 think carefully, then you will see that this function will
   482 rectification. If you think carefully, then you will realise
   448 actually never been called. Because a sequence with 
   483 that this function will actually never been called. This is
   449 $\varnothing$ will never recognise any string and therefore
   484 because a sequence with $\varnothing$ will never recognise any
   450 the second phase of the algorithm would never been called.
   485 string and therefore the second phase of the algorithm would
   451 The simplification function still expects us to give a 
   486 never been called. The simplification function still expects
   452 function. So in my own implementation I just returned 
   487 us to give a function. So in my own implementation I just
   453 a function which raises an error. 
   488 returned a function which raises an error. In the case
   454 
   489 where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have
       
   490 to create a sequence where the first component is a rectified
       
   491 version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
       
   492 
       
   493 Since we only simplify regular expressions of the form $r_1 +
       
   494 r_2$ and $r_1 \cdot r_2$ we do not have to do anything else
       
   495 in the remaining cases. The rectification function will
       
   496 be just the identity, which in lambda-calculus terms is
       
   497 just
       
   498 
       
   499 \begin{center}
       
   500 $\lambda v.\,v$
       
   501 \end{center}
       
   502 
       
   503 \noindent This completes the high-level version of the
       
   504 simplification function, which is also shown again in 
       
   505 Figure~\ref{simp}. This can now be used in a \emph{lexing
       
   506 function} as follows:
       
   507 
       
   508 \begin{figure}[t]
       
   509 \begin{center}
       
   510 \begin{tabular}{l}
       
   511 $simp(r)$:\\
       
   512 \quad case $r = r_1 + r_2$\\
       
   513 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
       
   514 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
       
   515 \qquad case $r_{1s} = \varnothing$: 
       
   516        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
       
   517 \qquad case $r_{2s} = \varnothing$: 
       
   518        return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
       
   519 \qquad case $r_{1s} = r_{2s}$:
       
   520        return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
       
   521 \qquad otherwise: 
       
   522        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
       
   523        \medskip\\
       
   524 
       
   525 \quad case $r = r_1 \cdot r_2$\\
       
   526 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
       
   527 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
       
   528 \qquad case $r_{1s} = \varnothing$: 
       
   529        return $(\varnothing, f_{error})$\\
       
   530 \qquad case $r_{2s} = \varnothing$: 
       
   531        return $(\varnothing, f_{error})$\\
       
   532 \qquad case $r_{1s} = \epsilon$: 
       
   533 return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
       
   534 \qquad case $r_{2s} = \epsilon$: 
       
   535 return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
       
   536 \qquad otherwise: 
       
   537        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
       
   538        \medskip\\
       
   539 
       
   540 \quad otherwise:\\
       
   541 \qquad return $(r, \lambda v.\,v)$\\
       
   542 \end{tabular}
       
   543 \end{center}
       
   544 \caption{The simplification function that returns a simplified 
       
   545 regular expression and a rectification function.\label{simp}}
       
   546 \end{figure}
       
   547 
       
   548 \begin{center}
       
   549 \begin{tabular}{lcl}
       
   550 $lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\
       
   551              &       & else $error$\\
       
   552 $lex\,r\,c\!::\!s$ & $\dn$ & let 
       
   553    $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
       
   554 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
       
   555 \end{tabular}
       
   556 \end{center}
       
   557 
       
   558 \noindent This corresponds to the $matches$ function we
       
   559 have seen in earlier lectures. In the first clause we are
       
   560 given an empty string, $[]$, and need to test wether the
       
   561 regular expression is $nullable$. If yes we can proceed
       
   562 normally and just return the value calculated by $mkeps$.
       
   563 The second clause is for strings where the first character is
       
   564 $c$ and the rest of the string is $s$. We first build the
       
   565 derivative of $r$ with respect to $c$; simplify the resulting 
       
   566 regulare expression. We continue lexing with the simplified
       
   567 regular expression and the string $s$. Whatever will be
       
   568 returned as value, we sill rectify using the $f_{rect}$
       
   569 from the simplification and finally inject $c$ back into
       
   570 the (rectified) value.
   455 
   571 
   456 
   572 
   457 \subsubsection*{Records and Tokenisation}
   573 \subsubsection*{Records and Tokenisation}
   458 
   574 
   459 \newpage
   575 \newpage