--- a/handouts/ho05.tex Sat Oct 26 10:06:58 2013 +0100
+++ b/handouts/ho05.tex Sun Oct 27 00:08:57 2013 +0100
@@ -146,7 +146,7 @@
\end{center}
\noindent
-we expect it will be split up as follows
+we expect that it is split up as follows
\begin{center}
\tt
@@ -168,21 +168,22 @@
\end{center}
\noindent
-This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
+This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
-be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
-this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any
-whitespace. Another reason for recognising whitespaces explicitly is
-that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also all comments.
+be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
+in the example above, this is not always the case. As can be seen, the three components in \texttt{x+2} are
+not separated by any whitespace. Another reason for recognising whitespaces explicitly is
+that in some languages, for example Python, whitespaces matter, that is, they carry meaning. However, in
+our small language we will eventually just filter out all whitespaces and also all comments.
-Lexing will not just separate a string into its components, but also classify the components, that
-is explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
-For the moment, though, we will only focus on the simpler problem of just splitting a string into components.
+Lexing does not just separate a string into its components, but also classifies the components,
+that is, it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an
+identifier and so on.
+For the moment, though, we will only focus on the simpler problem of just splitting up
+a string into components.
There are a few subtleties we need to consider first. For example, say the string is
\begin{center}
-\texttt{\Grid{iffoo\VS\ldots}}
+\texttt{\Grid{iffoo\VS}}\;\ldots
\end{center}
\noindent
@@ -191,11 +192,11 @@
single identifier. The choice that is often made in lexers is to look for the longest possible match.
This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
-Unfortunately, the convention about the longest match does not yet make the whole process
+Unfortunately, the convention about the longest match does not yet make the process
of lexing completely deterministic. Consider the string
\begin{center}
-\texttt{\Grid{then\VS\dots}}
+\texttt{\Grid{then\VS}}\;\dots
\end{center}
\noindent
@@ -210,10 +211,8 @@
So even if both regular expressions match in the example above,
we give preference to the regular expression for keywords.
-Let us see how our algorithm for lexing works in detail. The regular
-expressions and their ranking are shown above. For our algorithm it will
-be helpful to have a look at the function \emph{zeroable}
-defined as follows:
+Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
+and $der$, it will use the function \emph{zeroable} defined as follows:
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
@@ -227,8 +226,8 @@
\end{center}
\noindent
-In contrast to the function $nullable(r)$, which test whether a regular expression
-can match the empty string, the $zeroable$ function identifies whether a regular
+Recall that the function $nullable(r)$ tests whether a regular expression
+can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
expression cannot match anything at all. The mathematical way of stating this
property is
@@ -237,19 +236,99 @@
\end{center}
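+\noindent
+To make the definitions more concrete, the following is a small Scala sketch of a regular
+expression datatype together with $nullable$, $der$ and $zeroable$. It is only meant as an
+illustration: the constructor names (\texttt{ZERO}, \texttt{ONE}, \texttt{CHAR}, \texttt{ALT},
+\texttt{SEQ}, \texttt{STAR}) and the code itself are one possible rendering of the definitions
+in this handout, not a reference implementation.
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                    // matches nothing
+case object ONE extends Rexp                     // matches only the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// zeroable(r) is true iff r cannot match anything at all, i.e. L(r) = {}
+def zeroable(r: Rexp): Boolean = r match {
+  case ZERO => true
+  case ONE => false
+  case CHAR(_) => false
+  case ALT(r1, r2) => zeroable(r1) && zeroable(r2)
+  case SEQ(r1, r2) => zeroable(r1) || zeroable(r2)
+  case STAR(_) => false
+}
+\end{verbatim}
+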
\noindent
-Let us fix a set of regular expressions $rs$.
-The crucial idea of the algorithm working with $rs$ and the string, say
+For what follows let us fix a set of regular expressions $rs$ as being
+
+\begin{center}
+\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
+\end{center}
+
+\noindent
+specifying the ``words''
+of our programming language. The algorithm takes as input the set $rs$ and a string, say
\begin{center}
-\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}\grid{\ldots}}
+\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
+\end{center}
+
+\noindent
+and tries to chop off one word from the beginning of the string. If none of the
+regular expressions in $rs$ matches, we will just return
+the empty string.
+
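+For the steps below it is handy to also have concrete definitions of these three regular
+expressions in code. The following ones are purely illustrative assumptions made for this
+sketch, built from the constructors of the Scala fragment above; they are just enough to
+replay the example that follows:
+
+\begin{verbatim}
+// illustrative only: one possible choice for the three token
+// regular expressions, using the constructors from the sketch above
+def string2rexp(s: String): Rexp =
+  s.toList.foldRight(ONE: Rexp)((c, r) => SEQ(CHAR(c), r))
+
+def anyOf(cs: Seq[Char]): Rexp =
+  cs.foldRight(ZERO: Rexp)((c, r) => ALT(CHAR(c), r))
+
+val LETTER = anyOf('a' to 'z')
+val DIGIT  = anyOf('0' to '9')
+
+val KEYWORD    = ALT(string2rexp("if"), string2rexp("then"))
+val IDENT      = SEQ(LETTER, STAR(ALT(LETTER, DIGIT)))
+val WHITESPACE = SEQ(CHAR(' '), STAR(CHAR(' ')))  // one or more blanks
+
+val rs = List(KEYWORD, IDENT, WHITESPACE)         // order = precedence
+\end{verbatim}
+
+\noindent
+With these definitions the derivative tables shown below can be reproduced, for example by
+evaluating \texttt{zeroable(der('i', WHITESPACE))}, which yields \texttt{true}.
+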
+The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
+to the first character $c_1$. Then we take the results and continue with
+building the derivatives with respect to $c_2$ until we have either exhausted our
+input string or all of the regular expressions are ``zeroable''. If the input string is
+
+\begin{center}
+\texttt{\Grid{if2\VS}}\;\dots
+\end{center}
+
+\noindent
+then building the derivatives with respect to \texttt{i} gives
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\
+ $der\;\texttt{i}\;(\textit{IDENT})$ & no\\
+ $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
+\end{tabular}
\end{center}
\noindent
-is to build the derivative of all regular expressions in $rs$ with respect
-to the first character $c_1$. Then we take the results and continue with
-building the derivatives with respect to $c_2$
-until we have either exhausted our input string or all of the regular expressions
-are ``zeroable''.
+We can then eliminate \textit{WHITESPACE} as a potential candidate, because once a regular
+expression is zeroable, taking further derivatives cannot make it non-zeroable again. That
+leaves the other two regular expressions as potential candidates and we have to consider the
+next character from the input string
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\
+ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\
+\end{tabular}
+\end{center}
+
+\noindent
+Since both answers are `no', we have to go on and consider the next character, \texttt{2}, from the input string
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\
+ $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\
+\end{tabular}
+\end{center}
+
+\noindent
+Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
+know how much of the input string should be considered as an \textit{IDENT}. So we
+also have to go on and consider the next derivative.
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\
+\end{tabular}
+\end{center}
+
+\noindent
+Since the answer is now `yes', we can stop: once all derivatives are
+zeroable, we know the regular expressions cannot match any more letters from
+the input. We then only have to go back to the last derivative that is nullable,
+which in this case is
+
+\begin{center}
+$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$
+\end{center}
+
+\noindent
+which means we recognised an identifier. In case more than one regular expression is
+nullable, we choose the one with the highest precedence.
+You can try out this case with the string \texttt{\Grid{then\VS}}\;\dots{} from above: there
+both the regular expression for keywords and the one for identifiers match, and the
+convention gives preference to the keyword.
+
+
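+Finally, here is a Scala sketch that puts the above steps together. It is one possible
+rendering of the procedure described in this handout, not a reference implementation: the
+name \texttt{munch}, the representation of precedence as the order of the regular
+expressions in \texttt{rs}, and the result \texttt{("", -1)} for ``no match'' are all
+assumptions made only for this sketch. It uses the datatype and the illustrative
+definitions from the two earlier sketches.
+
+\begin{verbatim}
+// chop off one word: find the longest prefix of the input matched by
+// one of the regular expressions in rs; if several of them match this
+// prefix, the one that comes first in rs (highest precedence) wins;
+// if nothing matches, return ("", -1)
+def munch(rs: List[Rexp], s: List[Char]): (String, Int) = {
+  def walk(current: List[Rexp], rest: List[Char],
+           consumed: Int, best: (Int, Int)): (Int, Int) = {
+    // if some derivative is nullable, the characters read so far form a
+    // match; indexWhere picks the first (highest-precedence) nullable one
+    val best1 = current.indexWhere(r => nullable(r)) match {
+      case -1 => best
+      case i  => (consumed, i)
+    }
+    rest match {
+      case Nil => best1
+      case c :: cs =>
+        val next = current.map(r => der(c, r))
+        if (next.forall(zeroable)) best1  // nothing can match further: stop
+        else walk(next, cs, consumed + 1, best1)
+    }
+  }
+  val (len, which) = walk(rs, s, 0, (0, -1))
+  (s.take(len).mkString, which)
+}
+
+// replaying the example from above: the longest match is the
+// identifier "if2" (index 1 in rs)
+println(munch(rs, "if2 then x".toList))   // prints (if2,1)
+\end{verbatim}
+
+\noindent
+A complete lexer would repeatedly apply such a function to the remaining input and record,
+via the returned index, whether the chopped-off word is a keyword, an identifier or a
+whitespace.
+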
\end{document}