--- a/handouts/ho05.tex Sun Oct 27 00:08:57 2013 +0100
+++ b/handouts/ho05.tex Sun Oct 27 01:07:25 2013 +0100
@@ -176,7 +176,7 @@
that in some languages, for example Python, whitespaces matters, that is carry meaning. However in
our small language we will eventually just filter out all whitespaces and also all comments.
-Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
+Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
For the moment, though, we will only focus on the simpler problem of just splitting up
a string into components.
@@ -190,7 +190,7 @@
then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
single identifier. The choice that is often made in lexers is to look for the longest possible match.
-This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
+This leaves \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
Unfortunately, the convention about the longest match does not yet make the process
of lexing completely deterministic. Consider the string
@@ -258,7 +258,7 @@
The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
to the first character $c_1$. Then we take the results and continue with
building the derivatives with respect to $c_2$ until we have either exhausted our
-input string or all of the regular expressions are ``zeroable''. If the input string is
+input string or all of the regular expressions are ``zeroable''. Suppose the input string is
\begin{center}
\texttt{\Grid{if2\VS}}\;\dots
@@ -277,10 +277,10 @@
\end{center}
\noindent
-We can eliminate then \textit{WHITESPACE} as a potential candidate, because
+We can eliminate \textit{WHITESPACE} as a potential candidate, because
no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
two regular expressions as potential candidate and we have to consider the
-next character from the input string
+next character, \texttt{f}, from the input string
\begin{center}
\begin{tabular}{l|c}
@@ -291,7 +291,7 @@
\end{center}
\noindent
-Since both are `no', we have to go on with \texttt{2} from the input string
+Since both are `no', we have to continue with \texttt{2} from the input string
\begin{center}
\begin{tabular}{l|c}
@@ -304,7 +304,7 @@
\noindent
Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
know how much of the input string should be considered as an \textit{IDENT}. So we
-also have to go on and consider the next derivative.
+still have to continue and consider the next derivative.
\begin{center}
\begin{tabular}{l|c}
@@ -314,7 +314,7 @@
\end{center}
\noindent
-Since the answer is now `yes' we can stop: once all derivatives are
+Since the answer is now `yes' also in this case, we can stop: once all derivatives are
zeroable, we know the regular expressions cannot match any more letters from
the input. In this case we only have to go back to the derivative that is nullable. In this case it
is
@@ -324,9 +324,46 @@
\end{center}
\noindent
-which means we recognised an identifier. In case there is a choice of more than one
+which means we recognised an identifier. In case where there is a choice of more than one
regular expressions that are nullable, then we choose the one with the highest precedence.
-You can try out this case
+You can try out such a case with the input string
+
+\begin{center}
+\texttt{\Grid{then\VS}}\;\dots
+\end{center}
+
+\noindent
+which can both be recognised as a keyword, but also an identifier.
+
+While in the example above the last nullable derivative is the one directly before
+the derivative turns zeroable, this is not always the case. Imagine, identifiers can be
+letters, as permuted by the regular expression \textit{IDENT}, but must end with an
+undercore.
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\
+\end{tabular}
+\end{center}
+
+\noindent
+If we use \textit{NEWIDENT} with the input string
+
+\begin{center}
+\texttt{\Grid{iffoo\VS}}\;\ldots
+\end{center}
+
+\noindent
+then it will only become $zeroable$ after the $\VS$ has been analysed. Then we have to go back to the
+first \texttt{f} because only
+
+\begin{center}
+ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$
+\end{center}
+
+\noindent
+is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining
+string needs to be consumed by other regular expressions or lead to a lexing error.