handouts/ho04.tex
changeset 1020 36799f7b9702
parent 940 1c1fbf45a03c
child 1021 5990e03b2ba8
equal deleted inserted replaced
1019:f71399fe3fdc 1020:36799f7b9702
    89 shown below. I use the convention that
    89 shown below. I use the convention that
    90 regular expressions are written entirely with upper-case
    90 regular expressions are written entirely with upper-case
    91 letters, whereas values start with a single upper-case
    91 letters, whereas values start with a single upper-case
    92 character and the rest are lower-case letters.
    92 character and the rest are lower-case letters.
    93  
    93  
    94 {\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
    94 {\small%
    95                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
    95 \begin{lstlisting}[language=Scala,numbers=none]
    96 {../progs/app01.scala}}
    96  enum Rexp {
    97 
    97    case ZERO
    98  
    98    case ONE
    99 {\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
    99    case CHAR(c: Char)
   100                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
   100    case ALT(r1: Rexp, r2: Rexp)
   101 {../progs/app02.scala}}
   101    case SEQ(r1: Rexp, r2: Rexp)
       
   102    case STAR(r: Rexp)
       
   103 }
       
   104 \end{lstlisting}
       
   105 
       
   106 \begin{lstlisting}[language=Scala,numbers=none]
       
   107 enum Val {
       
   108    case Empty
       
   109    case Chr(c: Char)
       
   110    case Sequ(v1: Val, v2: Val)
       
   111    case Left(v: Val)
       
   112    case Right(v: Val)
       
   113    case Stars(vs: List[Val])
       
   114 }
       
   115 \end{lstlisting}}
       
   116 
       
   117 %{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
       
   118 %                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
       
   119 %{../progs/app01.scala}}
       
   120 %
       
   121 % 
       
   122 %{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
       
   123 %                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
       
   124 %{../progs/app02.scala}}
   102 
   125 
   103 
   126 
   104 Graphically the algorithm by Sulzmann \& Lu can be illustrated
   127 Graphically the algorithm by Sulzmann \& Lu can be illustrated
   105 by the picture in Figure~\ref{Sulz} where the path from the
   128 by the picture in Figure~\ref{Sulz} where the path from the
   106 left to the right involving $\der/\nullable$ is the first phase
   129 left to the right involving $\der/\nullable$ is the first phase
   108 the second phase. This picture shows the steps required when a
   131 the second phase. This picture shows the steps required when a
   109 regular expression, say $r_1$, matches the string $abc$. We
   132 regular expression, say $r_1$, matches the string $abc$. We
   110 first build the three derivatives (according to $a$, $b$ and
   133 first build the three derivatives (according to $a$, $b$ and
   111 $c$). We then use $\nullable$ to find out whether the resulting
   134 $c$). We then use $\nullable$ to find out whether the resulting
   112 regular expression can match the empty string. If yes, we call
   135 regular expression can match the empty string. If yes, we call
   113 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
   136 the function $\textit{mkeps}$. This function calculates a value for how a regular
   114 expression has matched the empty string. Its definition
   137 expression has matched the empty string. Its definition
   115 is as follows:
   138 is as follows:
   116 
   139 
   117 \begin{figure}[t]
   140 \begin{figure}[t]
   118 \begin{center}
   141 \begin{center}
   233 \noindent If there had been a $\ONE$ on the left, then
   256 \noindent If there had been a $\ONE$ on the left, then
   234 $\textit{mkeps}$ would have returned something of the form
   257 $\textit{mkeps}$ would have returned something of the form
   235 $\Left(\ldots)$. The point is that from this value we can
   258 $\Left(\ldots)$. The point is that from this value we can
   236 directly read off which part of $r_4$ matched the empty
   259 directly read off which part of $r_4$ matched the empty
   237 string: take the right-alternative first, and then the
   260 string: take the right-alternative first, and then the
   238 right-alternative again, then you got to the $\ONE$.
   261 right-alternative again, then you got to the $\ONE$ (indicated by
       
   262 \textit{Empty}).
   239 
   263 
   240 Next we have to ``inject'' the last character, that is $c$ in
   264 Next we have to ``inject'' the last character, that is $c$ in
   241 the running example, into this value $v_4$ in order to
   265 the running example, into this value $v_4$ in order to
   242 calculate how $r_3$ could have matched the string $c$.
   266 calculate how $r_3$ could have matched the string $c$.
   243 For this we call injection with $\inj(r_3, c, v_4)$.
   267 For this we call injection with $\inj(r_3, c, v_4)$.
   267 \begin{center}
   291 \begin{center}
   268 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
   292 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
   269 \end{center}
   293 \end{center}
   270 
   294 
   271 \noindent This value corresponds to how the regular expression $r_1$,
   295 \noindent This value corresponds to how the regular expression $r_1$,
   272 namely $a \cdot (b \cdot c)$, matched the string $abc$.
   296 namely $a \cdot (b \cdot c)$, matched the string $abc$. I let you
       
   297 think whether it is the expected value.
   273 
   298 
   274 
   299 
   275 There are a few auxiliary functions that are of interest
   300 There are a few auxiliary functions that are of interest
   276 when analysing this algorithm. One is called \emph{flatten},
   301 when analysing this algorithm. One is called \emph{flatten},
   277 written $|\_|$, which extracts the string ``underlying'' a 
   302 written $|\_|$, which extracts the string ``underlying'' a 
   305 adding, back a character into the value. 
   330 adding, back a character into the value. 
   306 
   331 
   307 The definition of $\inj$ might still be very puzzling and each clause
   332 The definition of $\inj$ might still be very puzzling and each clause
   308 might look arbitrary, but there is in fact a relatively simple idea
   333 might look arbitrary, but there is in fact a relatively simple idea
   309 behind it. Ultimately the $\inj$-functions is determined by the
   334 behind it. Ultimately the $\inj$-functions is determined by the
   310 derivative functions. For this consider one of the ``squares'' from
   335 derivative function. For this consider one of the ``squares'' from
   311 Figure~\ref{Sulz}:
   336 Figure~\ref{Sulz}:
   312 
   337 
   313 \begin{center}
   338 \begin{center}
   314 \begin{tikzpicture}[scale=2,node distance=1.2cm,
   339 \begin{tikzpicture}[scale=2,node distance=1.2cm,
   315                     every node/.style={minimum size=7mm}]
   340                     every node/.style={minimum size=7mm}]
   773 $(x:r)$, where $x$ is just an identifier (in my implementation
   798 $(x:r)$, where $x$ is just an identifier (in my implementation
   774 a plain string) and $r$ is a regular expression. A record will
   799 a plain string) and $r$ is a regular expression. A record will
   775 be regarded as a regular expression. The extended definition
   800 be regarded as a regular expression. The extended definition
   776 in Scala therefore looks as follows:
   801 in Scala therefore looks as follows:
   777 
   802 
   778 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
   803 {\small%
   779                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
   804 \begin{lstlisting}[language=Scala,numbers=none]
   780 {../progs/app03.scala}}
   805  enum Rexp {
       
   806    case ZERO
       
   807    case ONE
       
   808    case CHAR(c: Char)
       
   809    case ALT(r1: Rexp, r2: Rexp)
       
   810    case SEQ(r1: Rexp, r2: Rexp)
       
   811    case STAR(r: Rexp)
       
   812    case RECD(label: String, r: Rexp)
       
   813 }
       
   814 \end{lstlisting}}
       
   815 
   781 
   816 
   782 \noindent Since we regard records as regular expressions we
   817 \noindent Since we regard records as regular expressions we
   783 need to extend the functions $\nullable$ and $\der$. Similarly
   818 need to extend the functions $\nullable$ and $\der$. Similarly
   784 $\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
   819 $\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
   785 to extend the definition of values, which in Scala looks as
   820 to extend the definition of values, which in Scala looks as
   786 follows:
   821 follows:
   787 
   822 
   788 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
   823 {\small%
   789                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
   824 \begin{lstlisting}[language=Scala,numbers=none]
   790 {../progs/app04.scala}}
   825 enum Val {
       
   826    case Empty
       
   827    case Chr(c: Char)
       
   828    case Sequ(v1: Val, v2: Val)
       
   829    case Left(v: Val)
       
   830    case Right(v: Val)
       
   831    case Stars(vs: List[Val])
       
   832    case Rec(label: String, v: Val)
       
   833 }
       
   834 \end{lstlisting}}
       
   835 
   791 
   836 
   792 \noindent Let us now look at the purpose of records more
   837 \noindent Let us now look at the purpose of records more
   793 closely and let us return to our question whether the string
   838 closely and let us return to our question whether the string
   794 terminated in a $b$ or $c$. We can do this as follows: we
   839 terminated in a $b$ or $c$. We can do this as follows: we
   795 annotate the regular expression $ab + ac$ with a record
   840 annotate the regular expression $ab + ac$ with a record
   801 
   846 
   802 \noindent This regular expression can still only recognise
   847 \noindent This regular expression can still only recognise
   803 the strings $ab$ and $ac$, but we can now use a function
   848 the strings $ab$ and $ac$, but we can now use a function
   804 that takes a value and returns all records. I call this
   849 that takes a value and returns all records. I call this
   805 function \emph{env} for environment\ldots it builds a list
   850 function \emph{env} for environment\ldots it builds a list
   806 of identifiers associated with a string. This function
   851 of labels associated with a string. This function
   807 can be defined as follows:
   852 can be defined as follows:
   808 
   853 
   809 \begin{center}
   854 \begin{center}
   810 \begin{tabular}{lcl}
   855 \begin{tabular}{lcl}
   811   $env(Empty)$     & $\dn$ & $[]$\\
   856   $env(Empty)$     & $\dn$ & $[]$\\
   824 out all underlying strings where a record is given. Since
   869 out all underlying strings where a record is given. Since
   825 there can be more than one, the environment will potentially
   870 there can be more than one, the environment will potentially
   826 contain many ``records''. If we now postprocess the value
   871 contain many ``records''. If we now postprocess the value
   827 calculated by $lex$ extracting all records using $env$, we can
   872 calculated by $lex$ extracting all records using $env$, we can
   828 answer the question whether the last element in the string was
   873 answer the question whether the last element in the string was
   829 an $b$ or a $c$. Lets see this in action: if we use $a\cdot b
   874 an $b$ or a $c$. Lets see this in action: if we use the regular expression 
   830 + a\cdot c$ and $ac$ the calculated value will be
   875 $a\cdot b + a\cdot c$ and the string$ac$, then the calculated value of \textit{lex} 
       
   876 will be
   831 
   877 
   832 \begin{center}
   878 \begin{center}
   833 $Right(Seq(Char(a), Char(c)))$
   879 $Right(Seq(Char(a), Char(c)))$
   834 \end{center}
   880 \end{center}
   835 
   881 
   836 \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and
   882 \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and
   837 use the $env$ function to extract the recording for 
   883 use the $env$ function to extract what is recorded for 
   838 $x$ we obtain
   884 $x$ we obtain
   839 
   885 
   840 \begin{center}
   886 \begin{center}
   841 $[(x:c)]$
   887 $[(x:c)]$
   842 \end{center}
   888 \end{center}
   843 
   889 
   844 \noindent If we had given the string $ab$ instead, then the
   890 \noindent If we had given the string $ab$ instead, then the
   845 record would have been $[(x:b)]$. The fun starts if we 
   891 record would have been $[(x:b)]$. Note that $x$ is just an arbitrary 
       
   892 label we use to identify a place in the regular expression 
       
   893 where we are interested in what substring has been matched. The fun starts if we 
   846 iterate this. Consider the regular expression 
   894 iterate this. Consider the regular expression 
   847 
   895 
   848 \begin{center}
   896 \begin{center}
   849 $(a\cdot (x:b) + a\cdot (y:c))^*$
   897 $(a\cdot (x:b) + a\cdot (y:c))^*$
   850 \end{center}
   898 \end{center}
   924        (b, (Begin + End))\;+ \\
   972        (b, (Begin + End))\;+ \\
   925        (w, WhiteSpaces)
   973        (w, WhiteSpaces)
   926       \end{array}\right)^{\mbox{\LARGE{}*}}$
   974       \end{array}\right)^{\mbox{\LARGE{}*}}$
   927 \end{center}
   975 \end{center}
   928 
   976 
   929 \noindent and ask the algorithm by Sulzmann \& Lu to lex, say
   977 \noindent The string $k$ stands for keywords, $i$ for identifiers, $o$ for 
   930 the following string
   978 operations and so on. These labels essentially allow us to classify what 
       
   979 strings we match with each alternative.  Then we can ask the algorithm 
       
   980 by Sulzmann \& Lu to lex the following string
   931 
   981 
   932 \begin{center}\large
   982 \begin{center}\large
   933 \code{"if true then then 42 else +"}
   983 \code{"if true then then 42 else +"}
   934 \end{center}
   984 \end{center}
   935 
   985