afl-material: comparison handouts/ho04.tex

-:2c50b8b5886c
+:39aeca14af8c
 returned as value, we will rectify using the $f_{rect}$
 from the simplification and finally inject $c$ back into
 the (rectified) value.
-\subsubsection*{Records and Tokenisation}
+\subsubsection*{Records}
-\newpage
+Remember we want to tokenise input strings, which means
-Algorithm by Sulzmann, Lexing
+splitting strings into their ``word'' components. Furthermore,
+we want to classify each token as being a keyword, an
+identifier and so on. For this one more feature will be
+required, which I call a \emph{record}. While values record
+precisely how a regular expression matches a string,
+records can be used to focus on some particular
+parts of the regular expression and forget about others.
+Let us look at an example.
+Suppose you have the regular expression $ab + ac$. Clearly
+this regular expression can only recognise two strings. But
+suppose you are not interested in whether it can recognise $ab$
+or $ac$, but rather, if it matched, in what the last
+character of the matched string was\ldots either $b$ or $c$.
+You can do this by annotating the regular expression with
+a record, written $(x:r)$, where $x$ is just an identifier
+(in my implementation a plain string) and $r$ is a regular
+expression. A record will be regarded as a regular expression.
+The extended definition of regular expressions in Scala looks as follows:
+{\small\lstinputlisting[language=Scala]
+{../progs/app03.scala}}
+\noindent Since we regard records as regular expressions
+we need to extend the functions $nullable$ and $der$.
+Similarly, $mkeps$ and $inj$ need to be extended, and for
+records they return a new kind of value.
+This means we also need to extend the definition of values.
+The extended definition of values in Scala looks as follows:
+{\small\lstinputlisting[language=Scala]
+{../progs/app04.scala}}
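+\noindent To make the extension of the four functions more
+concrete, here is one plausible set of clauses for records.
+This is only a sketch: the argument order and the exact
+clauses are assumptions and need not coincide with the ones
+used in the implementation.
+\begin{center}
+\begin{tabular}{lcl}
+$nullable(x:r)$    & $\dn$ & $nullable(r)$\\
+$der\,c\,(x:r)$    & $\dn$ & $der\,c\,r$\\
+$mkeps(x:r)$       & $\dn$ & $Rec(x:mkeps(r))$\\
+$inj\,(x:r)\,c\,v$ & $\dn$ & $Rec(x:inj\,r\,c\,v)$\\
+\end{tabular}
+\end{center}
+\noindent In this sketch the derivative forgets the record,
+while $mkeps$ and $inj$ put it back around the value they
+construct for $r$.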
+\noindent Let us now look at the purpose of records more
+closely and return to our question of whether the string
+ended in a $b$ or a $c$. To answer it, we annotate the
+regular expression $ab + ac$ with records as follows
+\begin{center}
+$a(x:b) + a(x:c)$
+\end{center}
+\noindent This regular expression can still only recognise
+the strings $ab$ and $ac$, but we can now use a function
+that takes a value and collects what each record matched. I
+call this function \emph{env} for environment\ldots it builds
+a list of identifiers, each paired with the string it matched.
+This function can be defined as follows:
+\begin{center}
+\begin{tabular}{lcl}
+$env(Empty)$     & $\dn$ & $[]$\\
+$env(Char(c))$   & $\dn$ & $[]$\\
+$env(Left(v))$   & $\dn$ & $env(v)$\\
+$env(Right(v))$  & $\dn$ & $env(v)$\\
+$env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
+$env([v_1,\ldots ,v_n])$ & $\dn$ &
+$env(v_1) \,@\ldots @\, env(v_n)$\\
+$env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
+\end{tabular}
+\end{center}
+\noindent where in the last clause we use the flatten function
+defined earlier. The function $env$ ``picks'' out the
+underlying string wherever a record is given. Since there can be
+more than one record, the environment will potentially contain
+many ``recordings''. If we now postprocess the value
+calculated by $lex$ by extracting all recordings using $env$,
+we can answer the question of whether the last element in the
+string was a $b$ or a $c$. Let us see this in action: if
+we use the regular expression $ab + ac$ and the string $ac$,
+the calculated value will be
+\begin{center}
+$Right(Seq(Char(a), Char(c)))$
+\end{center}
+\noindent If we use instead $a(x:b) + a(x:c)$ and
+use the $env$ function to extract the recording for
+$x$ we obtain
+\begin{center}
+$[(x:c)]$
+\end{center}
+\noindent If we had given the string $ab$ instead, then the
+environment would have been $[(x:b)]$. The fun starts if we
+iterate this. Consider the regular expression
+\begin{center}
+$(a(x:b) + a(y:c))^*$
+\end{center}
+\noindent and the string $ababacabacab$. This string is
+clearly matched by the regular expression, but we are only
+interested in the sequence of $b$s and $c$s. Using $env$
+we obtain
+\begin{center}
+$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$
+\end{center}
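+\noindent The function $env$ and the two examples above can be
+sketched in Scala as follows. This is only a minimal sketch:
+the names \texttt{RecordsDemo}, \texttt{Chr}, \texttt{Sequ},
+\texttt{Stars}, \texttt{exAc} and \texttt{exStar} are assumptions
+made for this illustration and need not agree with the listings
+above, and the two values are built by hand rather than
+calculated by $lex$.
+
+{\small
+\begin{lstlisting}[language=Scala]
+object RecordsDemo {
+  // values, extended by a constructor for records
+  sealed trait Val
+  case object Empty extends Val
+  case class Chr(c: Char) extends Val
+  case class Sequ(v1: Val, v2: Val) extends Val
+  case class Left(v: Val) extends Val
+  case class Right(v: Val) extends Val
+  case class Stars(vs: List[Val]) extends Val
+  case class Rec(x: String, v: Val) extends Val
+
+  // the underlying string of a value, written |v| in the text
+  def flatten(v: Val): String = v match {
+    case Empty => ""
+    case Chr(c) => c.toString
+    case Left(u) => flatten(u)
+    case Right(u) => flatten(u)
+    case Sequ(u1, u2) => flatten(u1) + flatten(u2)
+    case Stars(us) => us.map(flatten).mkString
+    case Rec(_, u) => flatten(u)
+  }
+
+  // collects all recordings; one case per clause of env
+  def env(v: Val): List[(String, String)] = v match {
+    case Empty => Nil
+    case Chr(_) => Nil
+    case Left(u) => env(u)
+    case Right(u) => env(u)
+    case Sequ(u1, u2) => env(u1) ::: env(u2)
+    case Stars(us) => us.flatMap(env)
+    case Rec(x, u) => (x, flatten(u)) :: env(u)
+  }
+
+  // hand-built value for a(x:b) + a(x:c) and the string ac
+  val exAc: Val = Right(Sequ(Chr('a'), Rec("x", Chr('c'))))
+
+  // hand-built value for (a(x:b) + a(y:c))^* and ababacabacab
+  val ab: Val = Left(Sequ(Chr('a'), Rec("x", Chr('b'))))
+  val ac: Val = Right(Sequ(Chr('a'), Rec("y", Chr('c'))))
+  val exStar: Val = Stars(List(ab, ab, ac, ab, ac, ab))
+}
+\end{lstlisting}
+}
+
+\noindent Evaluating \texttt{RecordsDemo.env(RecordsDemo.exAc)}
+gives the one-element list pairing $x$ with $c$, in agreement
+with $[(x:c)]$ above; applying \texttt{env} to \texttt{exStar}
+yields the six recordings of the iterated example in order.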
+\noindent While this feature might look silly, it is in fact
+quite useful. For example, if we want to extract the parts of
+an email address, we might use the regular expression
+\[
+(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot
+(domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot
+(top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
+\]
+\noindent Then if we match the email address
+\[
+\texttt{christian.urban@kcl.ac.uk}
+\]
+\noindent we can use the $env$ function and find out
+what the name, domain and top-level part of the email
+address are:
+\begin{center}
+$[(name:\texttt{christian.urban}),
+(domain:\texttt{kcl}),
+(top\_level:\texttt{ac.uk})]$
+\end{center}
+\noindent As you will see in the next lecture, this is now all
+we need to tokenise an input string and classify each token.
 \end{document}
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: t

changeset 288	39aeca14af8c
parent 287	2c50b8b5886c
child 296	796b9b81ac8d