handouts/ho04.tex
changeset 288 39aeca14af8c
parent 287 2c50b8b5886c
child 296 796b9b81ac8d
equal deleted inserted replaced
287:2c50b8b5886c 288:39aeca14af8c
   568 returned as value, we sill rectify using the $f_{rect}$
   568 returned as value, we sill rectify using the $f_{rect}$
   569 from the simplification and finally inject $c$ back into
   569 from the simplification and finally inject $c$ back into
   570 the (rectified) value.
   570 the (rectified) value.
   571 
   571 
   572 
   572 
   573 \subsubsection*{Records and Tokenisation}
   573 \subsubsection*{Records}
   574 
   574 
   575 \newpage
   575 Remember we want to tokenize input strings, that means
   576 Algorithm by Sulzmann, Lexing
   576 splitting strings into their ``word'' components. But
   577 
   577 furthermore we want to classify each token as being a keyword
       
   578 or identifier and so on. For this one more feature will be
       
   579 required, which I call \emph{record}. While values record
       
   580 precisely how a regular expression matches a string, 
       
   581 records can be used to focus on some particular
       
   582 parts of the regular expression and forget about others.
       
   583 Let us look at an example. 
       
   584 
       
   585 Suppose you have the regular expression $ab + ac$. Clearly
       
   586 this regular expression can only recognise two strings. But
       
   587 suppose you are not interested whether it can recognise $ab$
       
   588 or $ac$, but rather if it matched, then what was the last
       
   589 character of the matched string\ldots either $b$ or $c$.
       
   590 You can do this by annotating the regular expression with
       
   591 a record, written $(x:r)$, where $x$ is just an identifier
       
   592 (in my implementation a plain string) and $r$ is a regular
       
   593 expression. A record will be regarded as a regular expression.
       
   594 The extended definition in Scala looks as follows:
       
   595 
       
   596 {\small\lstinputlisting[language=Scala]
       
   597 {../progs/app03.scala}}
       
   598 
       
   599 \noindent Since we regard records as regular expressions
       
   600 we need to extend the functions $nullable$ and $der$. 
       
   601 Similarly $mkeps$ and $inj$ need to be extended and they 
       
   602 sometimes can return a particular value for records. 
       
   603 This means we also need to extend the definition of values.
       
   604 The extended definition in Scala looks as follows:
       
   605 
       
   606 {\small\lstinputlisting[language=Scala]
       
   607 {../progs/app04.scala}}
       
   608 
       
   609 \noindent Let us now look at the purpose of records more
       
   610 closely and lets return to our question whether the string
       
   611 terminated in a $b$ or $c$. We can do this as follows: we
       
   612 annotate the regular expression $ab + ac$ with a record
       
   613 as follows
       
   614 
       
   615 \begin{center}
       
   616 $a(x:b) + a(x:c)$
       
   617 \end{center}
       
   618 
       
   619 \noindent This regular expression can still only recognise
       
   620 the strings $ab$ and $ac$, but we can now use a function
       
   621 that takes a value and returns all records. I call this
       
   622 function \emph{env} for environment\ldots it builds a list
       
   623 of identifiers associated with their string. This function
       
   624 can be defined as follows:
       
   625 
       
   626 \begin{center}
       
   627 \begin{tabular}{lcl}
       
   628   $env(Empty)$     & $\dn$ & $[]$\\
       
   629   $env(Char(c))$   & $\dn$ & $[]$\\
       
   630   $env(Left(v))$   & $\dn$ & $env(v)$\\
       
   631   $env(Right(v))$  & $\dn$ & $env(v)$\\
       
   632   $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
       
   633   $env([v_1,\ldots ,v_n])$ & $\dn$ & 
       
   634      $env(v_1) \,@\ldots @\, env(v_n)$\\
       
   635   $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
       
   636 \end{tabular}
       
   637 \end{center}
       
   638 
       
   639 \noindent where in the last clause we use the flatten function 
       
   640 defined earlier. The function $env$ ``picks'' out all 
       
   641 underlying strings where a record is given. Since there can be 
       
   642 more than one, the environment will potentially contain
       
   643 many ``recordings''. If we now postprocess the value 
       
   644 calculated by $lex$ extracting all recordings using $env$, 
       
   645 we can answer the question whether the last element in the
       
   646 string was an $b$ or a $c$. Lets see this in action: if
       
   647 we use $ab + ac$ and $ac$ the calculated value will be
       
   648 
       
   649 \begin{center}
       
   650 $Right(Seq(Char(a), Char(c)))$
       
   651 \end{center}
       
   652 
       
   653 \noindent If we use instead $a(x:b) + a(x:c)$ and
       
   654 use the $env$ function to extract the recording for 
       
   655 $x$ we obtain
       
   656 
       
   657 \begin{center}
       
   658 $[(x:c)]$
       
   659 \end{center}
       
   660 
       
   661 \noindent If we had given the string $ab$ instead, then the
       
   662 record would have been $[(x:b)]$. The fun starts if we 
       
   663 iterate this. Consider the regular expression 
       
   664 
       
   665 \begin{center}
       
   666 $(a(x:b) + a(y:c))^*$
       
   667 \end{center}
       
   668 
       
   669 \noindent and the string $ababacabacab$. This string is 
       
   670 clearly matched by the regular expression, but we are only
       
   671 interested in the sequence of $b$s and $c$s. Using $env$
       
   672 we obtain
       
   673 
       
   674 \begin{center}
       
   675 $[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$
       
   676 \end{center}
       
   677 
       
   678 \noindent While this feature might look silly, it is in fact
       
   679 quite useful. For example if we want to match the name of
       
   680 an email we might use the regular expression
       
   681 
       
   682 \[
       
   683 (name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
       
   684 (domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot 
       
   685 (top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
       
   686 \]
       
   687 
       
   688 \noindent Then if we match the email address
       
   689 
       
   690 \[
       
   691 \texttt{christian.urban@kcl.ac.uk}
       
   692 \]
       
   693 
       
   694 \noindent we can use the $env$ function and find out
       
   695 what the name, domain and top-level part of the email
       
   696 address are:
       
   697 
       
   698 \begin{center}
       
   699 $[(name:\texttt{christian.urban}), 
       
   700   (domain:\texttt{kcl}), 
       
   701   (top\_level:\texttt{ac.uk})]$
       
   702 \end{center}
       
   703 
       
   704 \noindent As you will see in the next lecture, this is now all
       
   705 we need to tokenise an input string and classify each token.
   578 \end{document}
   706 \end{document}
   579 
   707 
   580 %%% Local Variables: 
   708 %%% Local Variables: 
   581 %%% mode: latex
   709 %%% mode: latex
   582 %%% TeX-master: t
   710 %%% TeX-master: t