Binary file handouts/ho05.pdf has changed
--- a/handouts/ho05.tex Fri Oct 25 17:06:19 2013 +0100
+++ b/handouts/ho05.tex Fri Oct 25 17:55:35 2013 +0100
@@ -164,13 +164,26 @@
\end{center}
\noindent
-Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process
-of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
+Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
that in some languages, for example Python, whitespace matters. However in our small language we will eventually filter out all whitespaces and also comments.
+Lexing will not just separate the input into its components, but also classify the components, that
+is, explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
+But for the moment we will only focus on the simpler problem of separating a string into components.
+There are a few subtleties we need to consider first. For example if the input string is
+
+\begin{center}
+\texttt{\Grid{iffoo\VS\ldots}}
+\end{center}
+
+\noindent
+then there are two possibilities: either we regard the input as the keyword \texttt{if} followed
+by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
+single identifier. The choice that is often made in lexers is to look for the longest possible match,
+that is, to regard the input as the single identifier \texttt{iffoo} (since it is longer than \texttt{if}).
\end{document}