# HG changeset patch
# User Christian Urban
# Date 1382720135 -3600
# Node ID 51d6b8b828c4bd4d31e5443fbf6e4dd01c39dde7
# Parent 70ab41cb610e1696d0b1e8b3e0e6c6a456eb86c8
added

diff -r 70ab41cb610e -r 51d6b8b828c4 handouts/ho05.pdf
Binary file handouts/ho05.pdf has changed
diff -r 70ab41cb610e -r 51d6b8b828c4 handouts/ho05.tex
--- a/handouts/ho05.tex	Fri Oct 25 17:06:19 2013 +0100
+++ b/handouts/ho05.tex	Fri Oct 25 17:55:35 2013 +0100
@@ -164,13 +164,26 @@
 \end{center}
 
 \noindent
-Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process
-of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
+Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
 It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
 that in some languages, for example Python, whitespace matters. However in our small language
 we will eventually filter out all whitespaces and also comments.
 
+Lexing will not just separate the input into its components, but also classify the components, that
+is, explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
+But for the moment we will only focus on the simpler problem of separating a string into components.
+There are a few subtleties we need to consider first. For example, if the input string is
+
+\begin{center}
+\texttt{\Grid{iffoo\VS\ldots}}
+\end{center}
+
+\noindent
+then there are two possibilities: either we regard the input as the keyword \texttt{if} followed
+by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
+single identifier. The choice that is often made in lexers is to look for the longest possible match,
+that is, to regard the input as a single identifier \texttt{iffoo} (since it is longer than \texttt{if}).
 
 \end{document}