568 returned as value, we sill rectify using the $f_{rect}$ |
568 returned as value, we sill rectify using the $f_{rect}$ |
569 from the simplification and finally inject $c$ back into |
569 from the simplification and finally inject $c$ back into |
570 the (rectified) value. |
570 the (rectified) value. |
571 |
571 |
572 |
572 |
573 \subsubsection*{Records and Tokenisation} |
573 \subsubsection*{Records} |
574 |
574 |
575 \newpage |
575 Remember we want to tokenize input strings, that means |
576 Algorithm by Sulzmann, Lexing |
576 splitting strings into their ``word'' components. But |
577 |
577 furthermore we want to classify each token as being a keyword |
|
578 or identifier and so on. For this one more feature will be |
|
579 required, which I call \emph{record}. While values record |
|
580 precisely how a regular expression matches a string, |
|
581 records can be used to focus on some particular |
|
582 parts of the regular expression and forget about others. |
|
583 Let us look at an example. |
|
584 |
|
585 Suppose you have the regular expression $ab + ac$. Clearly |
|
586 this regular expression can only recognise two strings. But |
|
587 suppose you are not interested whether it can recognise $ab$ |
|
588 or $ac$, but rather if it matched, then what was the last |
|
589 character of the matched string\ldots either $b$ or $c$. |
|
590 You can do this by annotating the regular expression with |
|
591 a record, written $(x:r)$, where $x$ is just an identifier |
|
592 (in my implementation a plain string) and $r$ is a regular |
|
593 expression. A record will be regarded as a regular expression. |
|
594 The extended definition in Scala looks as follows: |
|
595 |
|
596 {\small\lstinputlisting[language=Scala] |
|
597 {../progs/app03.scala}} |
|
598 |
|
599 \noindent Since we regard records as regular expressions |
|
600 we need to extend the functions $nullable$ and $der$. |
|
601 Similarly $mkeps$ and $inj$ need to be extended and they |
|
602 sometimes can return a particular value for records. |
|
603 This means we also need to extend the definition of values. |
|
604 The extended definition in Scala looks as follows: |
|
605 |
|
606 {\small\lstinputlisting[language=Scala] |
|
607 {../progs/app04.scala}} |
|
608 |
|
609 \noindent Let us now look at the purpose of records more |
|
610 closely and lets return to our question whether the string |
|
611 terminated in a $b$ or $c$. We can do this as follows: we |
|
612 annotate the regular expression $ab + ac$ with a record |
|
613 as follows |
|
614 |
|
615 \begin{center} |
|
616 $a(x:b) + a(x:c)$ |
|
617 \end{center} |
|
618 |
|
619 \noindent This regular expression can still only recognise |
|
620 the strings $ab$ and $ac$, but we can now use a function |
|
621 that takes a value and returns all records. I call this |
|
622 function \emph{env} for environment\ldots it builds a list |
|
623 of identifiers associated with their string. This function |
|
624 can be defined as follows: |
|
625 |
|
626 \begin{center} |
|
627 \begin{tabular}{lcl} |
|
628 $env(Empty)$ & $\dn$ & $[]$\\ |
|
629 $env(Char(c))$ & $\dn$ & $[]$\\ |
|
630 $env(Left(v))$ & $\dn$ & $env(v)$\\ |
|
631 $env(Right(v))$ & $\dn$ & $env(v)$\\ |
|
632 $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\ |
|
633 $env([v_1,\ldots ,v_n])$ & $\dn$ & |
|
634 $env(v_1) \,@\ldots @\, env(v_n)$\\ |
|
635 $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\ |
|
636 \end{tabular} |
|
637 \end{center} |
|
638 |
|
639 \noindent where in the last clause we use the flatten function |
|
640 defined earlier. The function $env$ ``picks'' out all |
|
641 underlying strings where a record is given. Since there can be |
|
642 more than one, the environment will potentially contain |
|
643 many ``recordings''. If we now postprocess the value |
|
644 calculated by $lex$ extracting all recordings using $env$, |
|
645 we can answer the question whether the last element in the |
|
646 string was an $b$ or a $c$. Lets see this in action: if |
|
647 we use $ab + ac$ and $ac$ the calculated value will be |
|
648 |
|
649 \begin{center} |
|
650 $Right(Seq(Char(a), Char(c)))$ |
|
651 \end{center} |
|
652 |
|
653 \noindent If we use instead $a(x:b) + a(x:c)$ and |
|
654 use the $env$ function to extract the recording for |
|
655 $x$ we obtain |
|
656 |
|
657 \begin{center} |
|
658 $[(x:c)]$ |
|
659 \end{center} |
|
660 |
|
661 \noindent If we had given the string $ab$ instead, then the |
|
662 record would have been $[(x:b)]$. The fun starts if we |
|
663 iterate this. Consider the regular expression |
|
664 |
|
665 \begin{center} |
|
666 $(a(x:b) + a(y:c))^*$ |
|
667 \end{center} |
|
668 |
|
669 \noindent and the string $ababacabacab$. This string is |
|
670 clearly matched by the regular expression, but we are only |
|
671 interested in the sequence of $b$s and $c$s. Using $env$ |
|
672 we obtain |
|
673 |
|
674 \begin{center} |
|
675 $[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$ |
|
676 \end{center} |
|
677 |
|
678 \noindent While this feature might look silly, it is in fact |
|
679 quite useful. For example if we want to match the name of |
|
680 an email we might use the regular expression |
|
681 |
|
682 \[ |
|
683 (name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot |
|
684 (domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot |
|
685 (top\_level: [a\mbox{-}z\,.]^{\{2,6\}}) |
|
686 \] |
|
687 |
|
688 \noindent Then if we match the email address |
|
689 |
|
690 \[ |
|
691 \texttt{christian.urban@kcl.ac.uk} |
|
692 \] |
|
693 |
|
694 \noindent we can use the $env$ function and find out |
|
695 what the name, domain and top-level part of the email |
|
696 address are: |
|
697 |
|
698 \begin{center} |
|
699 $[(name:\texttt{christian.urban}), |
|
700 (domain:\texttt{kcl}), |
|
701 (top\_level:\texttt{ac.uk})]$ |
|
702 \end{center} |
|
703 |
|
704 \noindent As you will see in the next lecture, this is now all |
|
705 we need to tokenise an input string and classify each token. |
578 \end{document} |
706 \end{document} |
579 |
707 |
580 %%% Local Variables: |
708 %%% Local Variables: |
581 %%% mode: latex |
709 %%% mode: latex |
582 %%% TeX-master: t |
710 %%% TeX-master: t |