afl-material: comparison handouts/ho05.tex


-:70ab41cb610e
+:51d6b8b828c4
 \Grid{+}
 \Grid{3}
 \end{center}
 \noindent
-Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{}  is a whitespace and so on. This process
+Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{}  is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
-of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
 It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
 that in some languages, for example Python, whitespace matters. However in our small language we will eventually filter out all whitespaces and also comments.
+Lexing will not just separate the input into its components, but also classify them, that
+is, explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
+But for the moment we will focus only on the simpler problem of separating a string into components.
+There are a few subtleties we need to consider first. For example, if the input string is
+\begin{center}
+\texttt{\Grid{iffoo\VS\ldots}}
+\end{center}
+\noindent
+then there are two possibilities: either we regard the input as the keyword \texttt{if} followed
+by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
+single identifier. The choice that is often made in lexers is to look for the longest possible match,
+that is, to regard the input as the single identifier \texttt{iffoo} (since it is longer than \texttt{if}).
 \end{document}
 %%% Local Variables:
 %%% mode: latex

changeset 154	51d6b8b828c4
parent 153	70ab41cb610e
child 155	9b2d128765e1
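
The longest-match rule introduced in this changeset can be sketched in a few lines of Python. This is only an illustration, not the handout's own lexer: the token names, the regular expressions in `TOKEN_SPECS`, and the helpers `longest_match` and `lex` are all assumptions made for the example. On a tie in length, the sketch prefers the class listed first, so `if` on its own is still classified as a keyword.

```python
import re

# Hypothetical token classes for illustration only; the regular
# expressions used in the handout may differ.
TOKEN_SPECS = [
    ("KEYWORD", re.compile(r"if|then|else")),
    ("IDENT", re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("OP", re.compile(r"[+\-*/]")),
    ("WHITESPACE", re.compile(r"[ \t\n]+")),
]


def longest_match(s, pos):
    """Return (class name, end position) of the longest match at pos, or None."""
    best = None
    for name, rx in TOKEN_SPECS:
        m = rx.match(s, pos)
        # Strict '>' means that on equal length the earlier class wins.
        if m and m.end() > pos and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best


def lex(s):
    """Split s into (class name, text) pairs using the longest-match rule."""
    tokens = []
    pos = 0
    while pos < len(s):
        found = longest_match(s, pos)
        if found is None:
            raise ValueError(f"no token matches at position {pos}")
        name, end = found
        tokens.append((name, s[pos:end]))
        pos = end
    return tokens


if __name__ == "__main__":
    print(lex("iffoo"))
    print(lex("if true"))
    print(lex("x+2"))
```

With these assumed classes, `iffoo` comes out as a single identifier because the identifier match is longer than the keyword match, and `x+2` is split into three components even though it contains no whitespace.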