89 shown below. I use the convention that |
89 shown below. I use the convention that |
90 regular expressions are written entirely with upper-case |
90 regular expressions are written entirely with upper-case |
91 letters, whereas values start with a single upper-case |
91 letters, whereas values start with a single upper-case |
92 character and the rest are lower-case letters. |
92 character and the rest are lower-case letters. |
93 |
93 |
94 {\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor= |
94 {\small% |
95 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
95 \begin{lstlisting}[language=Scala,numbers=none] |
96 {../progs/app01.scala}} |
96 enum Rexp { |
97 |
97 case ZERO |
98 |
98 case ONE |
99 {\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor= |
99 case CHAR(c: Char) |
100 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
100 case ALT(r1: Rexp, r2: Rexp) |
101 {../progs/app02.scala}} |
101 case SEQ(r1: Rexp, r2: Rexp) |
|
102 case STAR(r: Rexp) |
|
103 } |
|
104 \end{lstlisting} |
|
105 |
|
106 \begin{lstlisting}[language=Scala,numbers=none] |
|
107 enum Val { |
|
108 case Empty |
|
109 case Chr(c: Char) |
|
110 case Sequ(v1: Val, v2: Val) |
|
111 case Left(v: Val) |
|
112 case Right(v: Val) |
|
113 case Stars(vs: List[Val]) |
|
114 } |
|
115 \end{lstlisting}} |
|
116 |
|
117 %{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor= |
|
118 % {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
119 %{../progs/app01.scala}} |
|
120 % |
|
121 % |
|
122 %{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor= |
|
123 % {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
124 %{../progs/app02.scala}} |
102 |
125 |
103 |
126 |
104 Graphically the algorithm by Sulzmann \& Lu can be illustrated |
127 Graphically the algorithm by Sulzmann \& Lu can be illustrated |
105 by the picture in Figure~\ref{Sulz} where the path from the |
128 by the picture in Figure~\ref{Sulz} where the path from the |
106 left to the right involving $\der/\nullable$ is the first phase |
129 left to the right involving $\der/\nullable$ is the first phase |
108 the second phase. This picture shows the steps required when a |
131 the second phase. This picture shows the steps required when a |
109 regular expression, say $r_1$, matches the string $abc$. We |
132 regular expression, say $r_1$, matches the string $abc$. We |
110 first build the three derivatives (according to $a$, $b$ and |
133 first build the three derivatives (according to $a$, $b$ and |
111 $c$). We then use $\nullable$ to find out whether the resulting |
134 $c$). We then use $\nullable$ to find out whether the resulting |
112 regular expression can match the empty string. If yes, we call |
135 regular expression can match the empty string. If yes, we call |
113 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular |
136 the function $\textit{mkeps}$. This function calculates a value for how a regular |
114 expression has matched the empty string. Its definition |
137 expression has matched the empty string. Its definition |
115 is as follows: |
138 is as follows: |
116 |
139 |
117 \begin{figure}[t] |
140 \begin{figure}[t] |
118 \begin{center} |
141 \begin{center} |
233 \noindent If there had been a $\ONE$ on the left, then |
256 \noindent If there had been a $\ONE$ on the left, then |
234 $\textit{mkeps}$ would have returned something of the form |
257 $\textit{mkeps}$ would have returned something of the form |
235 $\Left(\ldots)$. The point is that from this value we can |
258 $\Left(\ldots)$. The point is that from this value we can |
236 directly read off which part of $r_4$ matched the empty |
259 directly read off which part of $r_4$ matched the empty |
237 string: take the right-alternative first, and then the |
260 string: take the right-alternative first, and then the |
238 right-alternative again, then you got to the $\ONE$. |
261 right-alternative again, then you got to the $\ONE$ (indicated by |
|
262 \textit{Empty}). |
239 |
263 |
240 Next we have to ``inject'' the last character, that is $c$ in |
264 Next we have to ``inject'' the last character, that is $c$ in |
241 the running example, into this value $v_4$ in order to |
265 the running example, into this value $v_4$ in order to |
242 calculate how $r_3$ could have matched the string $c$. |
266 calculate how $r_3$ could have matched the string $c$. |
243 For this we call injection with $\inj(r_3, c, v_4)$. |
267 For this we call injection with $\inj(r_3, c, v_4)$. |
267 \begin{center} |
291 \begin{center} |
268 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ |
292 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ |
269 \end{center} |
293 \end{center} |
270 |
294 |
271 \noindent This value corresponds to how the regular expression $r_1$, |
295 \noindent This value corresponds to how the regular expression $r_1$, |
272 namely $a \cdot (b \cdot c)$, matched the string $abc$. |
296 namely $a \cdot (b \cdot c)$, matched the string $abc$. I let you |
|
297 think whether it is the expected value. |
273 |
298 |
274 |
299 |
275 There are a few auxiliary functions that are of interest |
300 There are a few auxiliary functions that are of interest |
276 when analysing this algorithm. One is called \emph{flatten}, |
301 when analysing this algorithm. One is called \emph{flatten}, |
277 written $|\_|$, which extracts the string ``underlying'' a |
302 written $|\_|$, which extracts the string ``underlying'' a |
305 adding, back a character into the value. |
330 adding, back a character into the value. |
306 |
331 |
307 The definition of $\inj$ might still be very puzzling and each clause |
332 The definition of $\inj$ might still be very puzzling and each clause |
308 might look arbitrary, but there is in fact a relatively simple idea |
333 might look arbitrary, but there is in fact a relatively simple idea |
309 behind it. Ultimately the $\inj$-functions is determined by the |
334 behind it. Ultimately the $\inj$-functions is determined by the |
310 derivative functions. For this consider one of the ``squares'' from |
335 derivative function. For this consider one of the ``squares'' from |
311 Figure~\ref{Sulz}: |
336 Figure~\ref{Sulz}: |
312 |
337 |
313 \begin{center} |
338 \begin{center} |
314 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
339 \begin{tikzpicture}[scale=2,node distance=1.2cm, |
315 every node/.style={minimum size=7mm}] |
340 every node/.style={minimum size=7mm}] |
773 $(x:r)$, where $x$ is just an identifier (in my implementation |
798 $(x:r)$, where $x$ is just an identifier (in my implementation |
774 a plain string) and $r$ is a regular expression. A record will |
799 a plain string) and $r$ is a regular expression. A record will |
775 be regarded as a regular expression. The extended definition |
800 be regarded as a regular expression. The extended definition |
776 in Scala therefore looks as follows: |
801 in Scala therefore looks as follows: |
777 |
802 |
778 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor= |
803 {\small% |
779 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
804 \begin{lstlisting}[language=Scala,numbers=none] |
780 {../progs/app03.scala}} |
805 enum Rexp { |
|
806 case ZERO |
|
807 case ONE |
|
808 case CHAR(c: Char) |
|
809 case ALT(r1: Rexp, r2: Rexp) |
|
810 case SEQ(r1: Rexp, r2: Rexp) |
|
811 case STAR(r: Rexp) |
|
812 case RECD(label: String, r: Rexp) |
|
813 } |
|
814 \end{lstlisting}} |
|
815 |
781 |
816 |
782 \noindent Since we regard records as regular expressions we |
817 \noindent Since we regard records as regular expressions we |
783 need to extend the functions $\nullable$ and $\der$. Similarly |
818 need to extend the functions $\nullable$ and $\der$. Similarly |
784 $\textit{mkeps}$ and $\inj$ need to be extended. This means we also need |
819 $\textit{mkeps}$ and $\inj$ need to be extended. This means we also need |
785 to extend the definition of values, which in Scala looks as |
820 to extend the definition of values, which in Scala looks as |
786 follows: |
821 follows: |
787 |
822 |
788 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor= |
823 {\small% |
789 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
824 \begin{lstlisting}[language=Scala,numbers=none] |
790 {../progs/app04.scala}} |
825 enum Val { |
|
826 case Empty |
|
827 case Chr(c: Char) |
|
828 case Sequ(v1: Val, v2: Val) |
|
829 case Left(v: Val) |
|
830 case Right(v: Val) |
|
831 case Stars(vs: List[Val]) |
|
832 case Rec(label: String, v: Val) |
|
833 } |
|
834 \end{lstlisting}} |
|
835 |
791 |
836 |
792 \noindent Let us now look at the purpose of records more |
837 \noindent Let us now look at the purpose of records more |
793 closely and let us return to our question whether the string |
838 closely and let us return to our question whether the string |
794 terminated in a $b$ or $c$. We can do this as follows: we |
839 terminated in a $b$ or $c$. We can do this as follows: we |
795 annotate the regular expression $ab + ac$ with a record |
840 annotate the regular expression $ab + ac$ with a record |
801 |
846 |
802 \noindent This regular expression can still only recognise |
847 \noindent This regular expression can still only recognise |
803 the strings $ab$ and $ac$, but we can now use a function |
848 the strings $ab$ and $ac$, but we can now use a function |
804 that takes a value and returns all records. I call this |
849 that takes a value and returns all records. I call this |
805 function \emph{env} for environment\ldots it builds a list |
850 function \emph{env} for environment\ldots it builds a list |
806 of identifiers associated with a string. This function |
851 of labels associated with a string. This function |
807 can be defined as follows: |
852 can be defined as follows: |
808 |
853 |
809 \begin{center} |
854 \begin{center} |
810 \begin{tabular}{lcl} |
855 \begin{tabular}{lcl} |
811 $env(Empty)$ & $\dn$ & $[]$\\ |
856 $env(Empty)$ & $\dn$ & $[]$\\ |
824 out all underlying strings where a record is given. Since |
869 out all underlying strings where a record is given. Since |
825 there can be more than one, the environment will potentially |
870 there can be more than one, the environment will potentially |
826 contain many ``records''. If we now postprocess the value |
871 contain many ``records''. If we now postprocess the value |
827 calculated by $lex$ extracting all records using $env$, we can |
872 calculated by $lex$ extracting all records using $env$, we can |
828 answer the question whether the last element in the string was |
873 answer the question whether the last element in the string was |
829 an $b$ or a $c$. Lets see this in action: if we use $a\cdot b |
874 an $b$ or a $c$. Lets see this in action: if we use the regular expression |
830 + a\cdot c$ and $ac$ the calculated value will be |
875 $a\cdot b + a\cdot c$ and the string$ac$, then the calculated value of \textit{lex} |
|
876 will be |
831 |
877 |
832 \begin{center} |
878 \begin{center} |
833 $Right(Seq(Char(a), Char(c)))$ |
879 $Right(Seq(Char(a), Char(c)))$ |
834 \end{center} |
880 \end{center} |
835 |
881 |
836 \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and |
882 \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and |
837 use the $env$ function to extract the recording for |
883 use the $env$ function to extract what is recorded for |
838 $x$ we obtain |
884 $x$ we obtain |
839 |
885 |
840 \begin{center} |
886 \begin{center} |
841 $[(x:c)]$ |
887 $[(x:c)]$ |
842 \end{center} |
888 \end{center} |
843 |
889 |
844 \noindent If we had given the string $ab$ instead, then the |
890 \noindent If we had given the string $ab$ instead, then the |
845 record would have been $[(x:b)]$. The fun starts if we |
891 record would have been $[(x:b)]$. Note that $x$ is just an arbitrary |
|
892 label we use to identify a place in the regular expression |
|
893 where we are interested in what substring has been matched. The fun starts if we |
846 iterate this. Consider the regular expression |
894 iterate this. Consider the regular expression |
847 |
895 |
848 \begin{center} |
896 \begin{center} |
849 $(a\cdot (x:b) + a\cdot (y:c))^*$ |
897 $(a\cdot (x:b) + a\cdot (y:c))^*$ |
850 \end{center} |
898 \end{center} |
924 (b, (Begin + End))\;+ \\ |
972 (b, (Begin + End))\;+ \\ |
925 (w, WhiteSpaces) |
973 (w, WhiteSpaces) |
926 \end{array}\right)^{\mbox{\LARGE{}*}}$ |
974 \end{array}\right)^{\mbox{\LARGE{}*}}$ |
927 \end{center} |
975 \end{center} |
928 |
976 |
929 \noindent and ask the algorithm by Sulzmann \& Lu to lex, say |
977 \noindent The string $k$ stands for keywords, $i$ for identifiers, $o$ for |
930 the following string |
978 operations and so on. These labels essentially allow us to classify what |
|
979 strings we match with each alternative. Then we can ask the algorithm |
|
980 by Sulzmann \& Lu to lex the following string |
931 |
981 |
932 \begin{center}\large |
982 \begin{center}\large |
933 \code{"if true then then 42 else +"} |
983 \code{"if true then then 42 else +"} |
934 \end{center} |
984 \end{center} |
935 |
985 |