handouts/ho04.tex
changeset 326 94700593a2d5
parent 319 e7b110f93697
child 350 c4e7caa06c74
equal deleted inserted replaced
325:794c599cee53 326:94700593a2d5
   205 \begin{center}
   205 \begin{center}
   206 $v_4:\;Right(Right(Empty))$
   206 $v_4:\;Right(Right(Empty))$
   207 \end{center}
   207 \end{center}
   208 
   208 
   209 \noindent The point is that from this value we can directly
   209 \noindent The point is that from this value we can directly
   210 read off which part of $r_4$ matched the empty string. Next we
   210 read off which part of $r_4$ matched the empty string: take
   211 have to ``inject'' the last character, that is $c$ in the
   211 the right-alternative first, and then the right-alternative
   212 running example, into this value $v_4$ in order to calculate
   212 again. 
   213 how $r_3$ could have matched the string $c$. According to the
   213 
   214 definition of $inj$ we obtain
   214 Next we have to ``inject'' the last character, that is $c$ in
       
   215 the running example, into this value $v_4$ in order to
       
   216 calculate how $r_3$ could have matched the string $c$.
       
   217 According to the definition of $inj$ we obtain
   215 
   218 
   216 \begin{center}
   219 \begin{center}
   217 $v_3:\;Right(Seq(Empty, Char(c)))$
   220 $v_3:\;Right(Seq(Empty, Char(c)))$
   218 \end{center}
   221 \end{center}
   219 
   222 
   277 Generally the matching algorithms based on derivatives do
   280 Generally the matching algorithms based on derivatives do
   278 poorly unless the regular expressions are simplified after
   281 poorly unless the regular expressions are simplified after
   279 each derivative step. But this is a bit more involved in the
   282 each derivative step. But this is a bit more involved in the
   280 algorithm of Sulzmann \& Lu. So what follows might require you
   283 algorithm of Sulzmann \& Lu. So what follows might require you
   281 to read several times before it makes sense and also might
   284 to read several times before it makes sense and also might
   282 require that you do some example calculations. As a first
   285 require that you do some example calculations yourself. As a
   283 example consider the last derivation step in our earlier
   286 first example consider the last derivation step in our earlier
   284 example:
   287 example:
   285 
   288 
   286 \begin{center}
   289 \begin{center}
   287 $r_4 = der\,c\,r_3 = 
   290 $r_4 = der\,c\,r_3 = 
   288 (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
   291 (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
   291 \noindent Simplifying this regular expression would just give
   294 \noindent Simplifying this regular expression would just give
   292 us $\epsilon$. Running $mkeps$ with this regular expression as
   295 us $\epsilon$. Running $mkeps$ with this regular expression as
   293 input, however, would then provide us with $Empty$ instead of
   296 input, however, would then provide us with $Empty$ instead of
   294 $Right(Right(Empty))$ that was obtained without the
   297 $Right(Right(Empty))$ that was obtained without the
   295 simplification. The problem is we need to recreate this more
   298 simplification. The problem is we need to recreate this more
   296 complicated value, rather than just $Empty$.
   299 complicated value, rather than just return $Empty$.
   297 
   300 
   298 This will require what I call \emph{rectification functions}.
   301 This will require what I call \emph{rectification functions}.
   299 They need to be calculated whenever a regular expression gets
   302 They need to be calculated whenever a regular expression gets
   300 simplified. Rectification functions take a value as argument
   303 simplified. Rectification functions take a value as argument
   301 and return a (rectified) value. Let us first take a look again
   304 and return a (rectified) value. Let us first take a look again
   317 simplifications in order end up with just $\epsilon$. However,
   320 simplifications in order end up with just $\epsilon$. However,
   318 it is possible to apply them in a depth-first, or inside-out,
   321 it is possible to apply them in a depth-first, or inside-out,
   319 manner in order to calculate this simplified regular
   322 manner in order to calculate this simplified regular
   320 expression.
   323 expression.
   321 
   324 
   322 The rectification we can implement this by letting simp return
   325 The rectification we can implement by letting simp return
   323 not just a (simplified) regular expression, but also a
   326 not just a (simplified) regular expression, but also a
   324 rectification function. Let us consider the alternative case,
   327 rectification function. Let us consider the alternative case,
   325 $r_1 + r_2$, first. By going depth-first, we first simplify
   328 $r_1 + r_2$, first. By going depth-first, we first simplify
   326 the component regular expressions $r_1$ and $r_2.$ This will
   329 the component regular expressions $r_1$ and $r_2.$ This will
   327 return simplified versions (if they can be simplified), say
   330 return simplified versions (if they can be simplified), say
   481 that this function will actually never been called. This is
   484 that this function will actually never been called. This is
   482 because a sequence with $\varnothing$ will never recognise any
   485 because a sequence with $\varnothing$ will never recognise any
   483 string and therefore the second phase of the algorithm would
   486 string and therefore the second phase of the algorithm would
   484 never been called. The simplification function still expects
   487 never been called. The simplification function still expects
   485 us to give a function. So in my own implementation I just
   488 us to give a function. So in my own implementation I just
   486 returned a function which raises an error. In the case
   489 returned a function that raises an error. In the case
   487 where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have
   490 where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have
   488 to create a sequence where the first component is a rectified
   491 to create a sequence where the first component is a rectified
   489 version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
   492 version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
   490 
   493 
   491 Since we only simplify regular expressions of the form $r_1 +
   494 Since we only simplify regular expressions of the form $r_1 +
   551    $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
   554    $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
   552 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
   555 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
   553 \end{tabular}
   556 \end{tabular}
   554 \end{center}
   557 \end{center}
   555 
   558 
   556 \noindent This corresponds to the $matches$ function we
   559 \noindent This corresponds to the $matches$ function we have
   557 have seen in earlier lectures. In the first clause we are
   560 seen in earlier lectures. In the first clause we are given an
   558 given an empty string, $[]$, and need to test wether the
   561 empty string, $[]$, and need to test wether the regular
   559 regular expression is $nullable$. If yes we can proceed
   562 expression is $nullable$. If yes, we can proceed normally and
   560 normally and just return the value calculated by $mkeps$.
   563 just return the value calculated by $mkeps$. The second clause
   561 The second clause is for strings where the first character is
   564 is for strings where the first character is $c$, say, and the
   562 $c$ and the rest of the string is $s$. We first build the
   565 rest of the string is $s$. We first build the derivative of
   563 derivative of $r$ with respect to $c$; simplify the resulting 
   566 $r$ with respect to $c$; simplify the resulting regular
   564 regulare expression. We continue lexing with the simplified
   567 expression. We continue lexing with the simplified regular
   565 regular expression and the string $s$. Whatever will be
   568 expression and the string $s$. Whatever will be returned as
   566 returned as value, we sill rectify using the $f_{rect}$
   569 value, we sill need to rectify using the $f_{rect}$ from the
   567 from the simplification and finally inject $c$ back into
   570 simplification and finally inject $c$ back into the
   568 the (rectified) value.
   571 (rectified) value.
   569 
   572 
   570 
   573 
   571 \subsubsection*{Records}
   574 \subsubsection*{Records}
   572 
   575 
   573 Remember we want to tokenize input strings, that means
   576 Remember we wanted to tokenize input strings, that means
   574 splitting strings into their ``word'' components. But
   577 splitting strings into their ``word'' components. Furthermore
   575 furthermore we want to classify each token as being a keyword
   578 we want to classify each token as being a keyword or
   576 or identifier and so on. For this one more feature will be
   579 identifier and so on. For this one more feature will be
   577 required, which I call \emph{record}. While values record
   580 required, which I call a \emph{record} regular expression.
   578 precisely how a regular expression matches a string, 
   581 While values encode how a regular expression matches a string,
   579 records can be used to focus on some particular
   582 records can be used to ``focus'' on some particular parts of
   580 parts of the regular expression and forget about others.
   583 the regular expression and ``forget'' about others. 
   581 Let us look at an example. 
   584 
   582 
   585 Let us look at an example. Suppose you have the regular
   583 Suppose you have the regular expression $ab + ac$. Clearly
   586 expression $a\cdot b + a\cdot c$. Clearly this regular expression can only
   584 this regular expression can only recognise two strings. But
   587 recognise two strings. But suppose you are not interested
   585 suppose you are not interested whether it can recognise $ab$
   588 whether it can recognise $ab$ or $ac$, but rather if it
   586 or $ac$, but rather if it matched, then what was the last
   589 matched, then what was the last character of the matched
   587 character of the matched string\ldots either $b$ or $c$.
   590 string\ldots either $b$ or $c$. You can do this by annotating
   588 You can do this by annotating the regular expression with
   591 the regular expression with a record, written in general
   589 a record, written $(x:r)$, where $x$ is just an identifier
   592 $(x:r)$, where $x$ is just an identifier (in my implementation
   590 (in my implementation a plain string) and $r$ is a regular
   593 a plain string) and $r$ is a regular expression. A record will
   591 expression. A record will be regarded as a regular expression.
   594 be regarded as a regular expression. The extended definition
   592 The extended definition in Scala looks as follows:
   595 in Scala therefore looks as follows:
   593 
   596 
   594 {\small\lstinputlisting[language=Scala]
   597 {\small\lstinputlisting[language=Scala]
   595 {../progs/app03.scala}}
   598 {../progs/app03.scala}}
   596 
   599 
   597 \noindent Since we regard records as regular expressions
   600 \noindent Since we regard records as regular expressions we
   598 we need to extend the functions $nullable$ and $der$. 
   601 need to extend the functions $nullable$ and $der$. Similarly
   599 Similarly $mkeps$ and $inj$ need to be extended and they 
   602 $mkeps$ and $inj$ need to be extended. This means we also need
   600 sometimes can return a particular value for records. 
   603 to extend the definition of values, which in Scala looks as
   601 This means we also need to extend the definition of values.
   604 follows:
   602 The extended definition in Scala looks as follows:
       
   603 
   605 
   604 {\small\lstinputlisting[language=Scala]
   606 {\small\lstinputlisting[language=Scala]
   605 {../progs/app04.scala}}
   607 {../progs/app04.scala}}
   606 
   608 
   607 \noindent Let us now look at the purpose of records more
   609 \noindent Let us now look at the purpose of records more
   608 closely and lets return to our question whether the string
   610 closely and let us return to our question whether the string
   609 terminated in a $b$ or $c$. We can do this as follows: we
   611 terminated in a $b$ or $c$. We can do this as follows: we
   610 annotate the regular expression $ab + ac$ with a record
   612 annotate the regular expression $ab + ac$ with a record
   611 as follows
   613 as follows
   612 
   614 
   613 \begin{center}
   615 \begin{center}
   614 $a(x:b) + a(x:c)$
   616 $a\cdot (x:b) + a\cdot (x:c)$
   615 \end{center}
   617 \end{center}
   616 
   618 
   617 \noindent This regular expression can still only recognise
   619 \noindent This regular expression can still only recognise
   618 the strings $ab$ and $ac$, but we can now use a function
   620 the strings $ab$ and $ac$, but we can now use a function
   619 that takes a value and returns all records. I call this
   621 that takes a value and returns all records. I call this
   620 function \emph{env} for environment\ldots it builds a list
   622 function \emph{env} for environment\ldots it builds a list
   621 of identifiers associated with their string. This function
   623 of identifiers associated with a string. This function
   622 can be defined as follows:
   624 can be defined as follows:
   623 
   625 
   624 \begin{center}
   626 \begin{center}
   625 \begin{tabular}{lcl}
   627 \begin{tabular}{lcl}
   626   $env(Empty)$     & $\dn$ & $[]$\\
   628   $env(Empty)$     & $\dn$ & $[]$\\
   632      $env(v_1) \,@\ldots @\, env(v_n)$\\
   634      $env(v_1) \,@\ldots @\, env(v_n)$\\
   633   $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
   635   $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
   634 \end{tabular}
   636 \end{tabular}
   635 \end{center}
   637 \end{center}
   636 
   638 
   637 \noindent where in the last clause we use the flatten function 
   639 \noindent where in the last clause we use the flatten function
   638 defined earlier. The function $env$ ``picks'' out all 
   640 defined earlier. As can be seen, the function $env$ ``picks''
   639 underlying strings where a record is given. Since there can be 
   641 out all underlying strings where a record is given. Since
   640 more than one, the environment will potentially contain
   642 there can be more than one, the environment will potentially
   641 many ``recordings''. If we now postprocess the value 
   643 contain many ``records''. If we now postprocess the value
   642 calculated by $lex$ extracting all recordings using $env$, 
   644 calculated by $lex$ extracting all records using $env$, we can
   643 we can answer the question whether the last element in the
   645 answer the question whether the last element in the string was
   644 string was an $b$ or a $c$. Lets see this in action: if
   646 an $b$ or a $c$. Lets see this in action: if we use $a\cdot b
   645 we use $ab + ac$ and $ac$ the calculated value will be
   647 + a\cdot c$ and $ac$ the calculated value will be
   646 
   648 
   647 \begin{center}
   649 \begin{center}
   648 $Right(Seq(Char(a), Char(c)))$
   650 $Right(Seq(Char(a), Char(c)))$
   649 \end{center}
   651 \end{center}
   650 
   652 
   651 \noindent If we use instead $a(x:b) + a(x:c)$ and
   653 \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and
   652 use the $env$ function to extract the recording for 
   654 use the $env$ function to extract the recording for 
   653 $x$ we obtain
   655 $x$ we obtain
   654 
   656 
   655 \begin{center}
   657 \begin{center}
   656 $[(x:c)]$
   658 $[(x:c)]$
   659 \noindent If we had given the string $ab$ instead, then the
   661 \noindent If we had given the string $ab$ instead, then the
   660 record would have been $[(x:b)]$. The fun starts if we 
   662 record would have been $[(x:b)]$. The fun starts if we 
   661 iterate this. Consider the regular expression 
   663 iterate this. Consider the regular expression 
   662 
   664 
   663 \begin{center}
   665 \begin{center}
   664 $(a(x:b) + a(y:c))^*$
   666 $(a\cdot (x:b) + a\cdot (y:c))^*$
   665 \end{center}
   667 \end{center}
   666 
   668 
   667 \noindent and the string $ababacabacab$. This string is 
   669 \noindent and the string $ababacabacab$. This string is 
   668 clearly matched by the regular expression, but we are only
   670 clearly matched by the regular expression, but we are only
   669 interested in the sequence of $b$s and $c$s. Using $env$
   671 interested in the sequence of $b$s and $c$s. Using $env$