551 $(r_{simp}, f_{rect}) = simp(der(c, r))$\\ |
554 $(r_{simp}, f_{rect}) = simp(der(c, r))$\\ |
552 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ |
555 & & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ |
553 \end{tabular} |
556 \end{tabular} |
554 \end{center} |
557 \end{center} |
555 |
558 |
556 \noindent This corresponds to the $matches$ function we |
559 \noindent This corresponds to the $matches$ function we have |
557 have seen in earlier lectures. In the first clause we are |
560 seen in earlier lectures. In the first clause we are given an |
558 given an empty string, $[]$, and need to test wether the |
561 empty string, $[]$, and need to test wether the regular |
559 regular expression is $nullable$. If yes we can proceed |
562 expression is $nullable$. If yes, we can proceed normally and |
560 normally and just return the value calculated by $mkeps$. |
563 just return the value calculated by $mkeps$. The second clause |
561 The second clause is for strings where the first character is |
564 is for strings where the first character is $c$, say, and the |
562 $c$ and the rest of the string is $s$. We first build the |
565 rest of the string is $s$. We first build the derivative of |
563 derivative of $r$ with respect to $c$; simplify the resulting |
566 $r$ with respect to $c$; simplify the resulting regular |
564 regulare expression. We continue lexing with the simplified |
567 expression. We continue lexing with the simplified regular |
565 regular expression and the string $s$. Whatever will be |
568 expression and the string $s$. Whatever will be returned as |
566 returned as value, we sill rectify using the $f_{rect}$ |
569 value, we sill need to rectify using the $f_{rect}$ from the |
567 from the simplification and finally inject $c$ back into |
570 simplification and finally inject $c$ back into the |
568 the (rectified) value. |
571 (rectified) value. |
569 |
572 |
570 |
573 |
571 \subsubsection*{Records} |
574 \subsubsection*{Records} |
572 |
575 |
573 Remember we want to tokenize input strings, that means |
576 Remember we wanted to tokenize input strings, that means |
574 splitting strings into their ``word'' components. But |
577 splitting strings into their ``word'' components. Furthermore |
575 furthermore we want to classify each token as being a keyword |
578 we want to classify each token as being a keyword or |
576 or identifier and so on. For this one more feature will be |
579 identifier and so on. For this one more feature will be |
577 required, which I call \emph{record}. While values record |
580 required, which I call a \emph{record} regular expression. |
578 precisely how a regular expression matches a string, |
581 While values encode how a regular expression matches a string, |
579 records can be used to focus on some particular |
582 records can be used to ``focus'' on some particular parts of |
580 parts of the regular expression and forget about others. |
583 the regular expression and ``forget'' about others. |
581 Let us look at an example. |
584 |
582 |
585 Let us look at an example. Suppose you have the regular |
583 Suppose you have the regular expression $ab + ac$. Clearly |
586 expression $a\cdot b + a\cdot c$. Clearly this regular expression can only |
584 this regular expression can only recognise two strings. But |
587 recognise two strings. But suppose you are not interested |
585 suppose you are not interested whether it can recognise $ab$ |
588 whether it can recognise $ab$ or $ac$, but rather if it |
586 or $ac$, but rather if it matched, then what was the last |
589 matched, then what was the last character of the matched |
587 character of the matched string\ldots either $b$ or $c$. |
590 string\ldots either $b$ or $c$. You can do this by annotating |
588 You can do this by annotating the regular expression with |
591 the regular expression with a record, written in general |
589 a record, written $(x:r)$, where $x$ is just an identifier |
592 $(x:r)$, where $x$ is just an identifier (in my implementation |
590 (in my implementation a plain string) and $r$ is a regular |
593 a plain string) and $r$ is a regular expression. A record will |
591 expression. A record will be regarded as a regular expression. |
594 be regarded as a regular expression. The extended definition |
592 The extended definition in Scala looks as follows: |
595 in Scala therefore looks as follows: |
593 |
596 |
594 {\small\lstinputlisting[language=Scala] |
597 {\small\lstinputlisting[language=Scala] |
595 {../progs/app03.scala}} |
598 {../progs/app03.scala}} |
596 |
599 |
597 \noindent Since we regard records as regular expressions |
600 \noindent Since we regard records as regular expressions we |
598 we need to extend the functions $nullable$ and $der$. |
601 need to extend the functions $nullable$ and $der$. Similarly |
599 Similarly $mkeps$ and $inj$ need to be extended and they |
602 $mkeps$ and $inj$ need to be extended. This means we also need |
600 sometimes can return a particular value for records. |
603 to extend the definition of values, which in Scala looks as |
601 This means we also need to extend the definition of values. |
604 follows: |
602 The extended definition in Scala looks as follows: |
|
603 |
605 |
604 {\small\lstinputlisting[language=Scala] |
606 {\small\lstinputlisting[language=Scala] |
605 {../progs/app04.scala}} |
607 {../progs/app04.scala}} |
606 |
608 |
607 \noindent Let us now look at the purpose of records more |
609 \noindent Let us now look at the purpose of records more |
608 closely and lets return to our question whether the string |
610 closely and let us return to our question whether the string |
609 terminated in a $b$ or $c$. We can do this as follows: we |
611 terminated in a $b$ or $c$. We can do this as follows: we |
610 annotate the regular expression $ab + ac$ with a record |
612 annotate the regular expression $ab + ac$ with a record |
611 as follows |
613 as follows |
612 |
614 |
613 \begin{center} |
615 \begin{center} |
614 $a(x:b) + a(x:c)$ |
616 $a\cdot (x:b) + a\cdot (x:c)$ |
615 \end{center} |
617 \end{center} |
616 |
618 |
617 \noindent This regular expression can still only recognise |
619 \noindent This regular expression can still only recognise |
618 the strings $ab$ and $ac$, but we can now use a function |
620 the strings $ab$ and $ac$, but we can now use a function |
619 that takes a value and returns all records. I call this |
621 that takes a value and returns all records. I call this |
620 function \emph{env} for environment\ldots it builds a list |
622 function \emph{env} for environment\ldots it builds a list |
621 of identifiers associated with their string. This function |
623 of identifiers associated with a string. This function |
622 can be defined as follows: |
624 can be defined as follows: |
623 |
625 |
624 \begin{center} |
626 \begin{center} |
625 \begin{tabular}{lcl} |
627 \begin{tabular}{lcl} |
626 $env(Empty)$ & $\dn$ & $[]$\\ |
628 $env(Empty)$ & $\dn$ & $[]$\\ |
632 $env(v_1) \,@\ldots @\, env(v_n)$\\ |
634 $env(v_1) \,@\ldots @\, env(v_n)$\\ |
633 $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\ |
635 $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\ |
634 \end{tabular} |
636 \end{tabular} |
635 \end{center} |
637 \end{center} |
636 |
638 |
637 \noindent where in the last clause we use the flatten function |
639 \noindent where in the last clause we use the flatten function |
638 defined earlier. The function $env$ ``picks'' out all |
640 defined earlier. As can be seen, the function $env$ ``picks'' |
639 underlying strings where a record is given. Since there can be |
641 out all underlying strings where a record is given. Since |
640 more than one, the environment will potentially contain |
642 there can be more than one, the environment will potentially |
641 many ``recordings''. If we now postprocess the value |
643 contain many ``records''. If we now postprocess the value |
642 calculated by $lex$ extracting all recordings using $env$, |
644 calculated by $lex$ extracting all records using $env$, we can |
643 we can answer the question whether the last element in the |
645 answer the question whether the last element in the string was |
644 string was an $b$ or a $c$. Lets see this in action: if |
646 an $b$ or a $c$. Lets see this in action: if we use $a\cdot b |
645 we use $ab + ac$ and $ac$ the calculated value will be |
647 + a\cdot c$ and $ac$ the calculated value will be |
646 |
648 |
647 \begin{center} |
649 \begin{center} |
648 $Right(Seq(Char(a), Char(c)))$ |
650 $Right(Seq(Char(a), Char(c)))$ |
649 \end{center} |
651 \end{center} |
650 |
652 |
651 \noindent If we use instead $a(x:b) + a(x:c)$ and |
653 \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and |
652 use the $env$ function to extract the recording for |
654 use the $env$ function to extract the recording for |
653 $x$ we obtain |
655 $x$ we obtain |
654 |
656 |
655 \begin{center} |
657 \begin{center} |
656 $[(x:c)]$ |
658 $[(x:c)]$ |