handouts/ho05.tex
changeset 154 51d6b8b828c4
parent 153 70ab41cb610e
child 155 9b2d128765e1
153:70ab41cb610e 154:51d6b8b828c4
   162 \Grid{+}
   162 \Grid{+}
   163 \Grid{3}
   163 \Grid{3}
   164 \end{center}
   164 \end{center}
   165 
   165 
   166 \noindent
   166 \noindent
   167 Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{}  is a whitespace and so on. This process 
   167 Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{}  is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
   168 of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
       
   169 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   168 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   170 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
   169 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
   171 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
   170 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
   172 that in some languages, for example Python, whitespace matters. However, in our small language we will eventually filter out all whitespaces and also comments.
   171 that in some languages, for example Python, whitespace matters. However, in our small language we will eventually filter out all whitespaces and also comments.
   173 
   172 
       
   173 Lexing will not just separate the input into its components, but also classify the components, that
       
   174 is, explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
       
   175 But for the moment we will only focus on the simpler problem of separating a string into components.
       
   176 There are a few subtleties we need to consider first. For example, if the input string is
       
   177 
       
   178 \begin{center}
       
   179 \texttt{\Grid{iffoo\VS\ldots}}
       
   180 \end{center}
       
   181 
       
   182 \noindent
       
   183 then there are two possibilities: either we regard the input as the keyword \texttt{if} followed
       
   184 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
       
   185 single identifier. The choice that is often made in lexers is to look for the longest possible match,
       
   186 that is, to regard the input as a single identifier \texttt{iffoo} (since it is longer than \texttt{if}).
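
\noindent
The longest-match choice can be illustrated with a minimal sketch in Python. The token classes and their regular expressions below are assumptions made up for this illustration (the handout's own definitions of \textit{KEYWORD} and identifiers may differ), and the names \texttt{TOKEN\_PATTERNS}, \texttt{longest\_match} and \texttt{split\_into\_components} are hypothetical. At each position every pattern is tried and the longest lexeme wins; on a tie the class listed first is kept, so \texttt{if} on its own is still a keyword.

```python
import re

# Hypothetical token classes; these patterns are illustrative assumptions,
# not the definitions used elsewhere in the handout.
TOKEN_PATTERNS = [
    ("KEYWORD", re.compile(r"if|then|else")),
    ("IDENT", re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("OP", re.compile(r"[+]")),
    ("WHITESPACE", re.compile(r"[ ]+")),
]


def longest_match(text, pos):
    """Return (class name, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in TOKEN_PATTERNS:
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            continue  # no match, or only an empty match
        # Strictly longer wins; on a tie the earlier-listed class is kept.
        if best is None or len(m.group()) > len(best[1]):
            best = (name, m.group())
    return best


def split_into_components(text):
    """Split text into (class name, lexeme) pairs using the longest-match rule."""
    pos, components = 0, []
    while pos < len(text):
        token = longest_match(text, pos)
        if token is None:
            raise ValueError(f"no token class matches at position {pos}")
        components.append(token)
        pos += len(token[1])
    return components
```

\noindent
Under these assumed patterns, \texttt{split\_into\_components("iffoo ")} yields a single identifier \texttt{iffoo} followed by a whitespace, whereas \texttt{split\_into\_components("if true")} yields the keyword \texttt{if}, a whitespace and the identifier \texttt{true}.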
   174 
   187 
   175 \end{document}
   188 \end{document}
   176 
   189 
   177 %%% Local Variables: 
   190 %%% Local Variables: 
   178 %%% mode: latex
   191 %%% mode: latex