\Grid{+}
\Grid{3}
\end{center}

\noindent
Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process
of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
that in some languages, for example Python, whitespace matters. However, in our small language we will eventually filter out all whitespaces and also comments.
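
As a small illustration of why splitting at whitespace is not enough, the following Python sketch compares a plain whitespace split with a split driven by a regular expression. The input string and the pattern are assumptions made only for this example; they are not the token definitions of our language.

\begin{verbatim}
import re

# Illustrative input containing the fragments discussed above.
source = "if true then x+2"

# Splitting at whitespace leaves x+2 as a single piece.
by_whitespace = source.split()

# A hypothetical pattern: words, numbers, the + sign, or a run of whitespace.
pieces = re.findall(r"[a-z]+|[0-9]+|\+|\s+", source)

print(by_whitespace)
print(pieces)
\end{verbatim}

\noindent
Here \texttt{by\_whitespace} still contains \texttt{x+2} as one component, whereas \texttt{pieces} separates it into \texttt{x}, \texttt{+} and \texttt{2} and keeps each whitespace as a component of its own.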

Lexing will not just separate the input into its components, but also classify the components, that
is, explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
But for the moment we will only focus on the simpler problem of separating a string into components.
There are a few subtleties we need to consider first. For example, if the input string is

\begin{center}
\texttt{\Grid{iffoo\VS\ldots}}
\end{center}

\noindent
then there are two possibilities: either we regard the input as the keyword \texttt{if} followed
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
single identifier. The choice that is often made in lexers is to look for the longest possible match,
that is, to regard the input as a single identifier \texttt{iffoo} (since it is longer than \texttt{if}).
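
The longest-match choice can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the token classes in \texttt{TOKEN\_CLASSES}, their patterns, and the helper names \texttt{longest\_match} and \texttt{lex} are hypothetical and are not the definitions used in this document. The sketch also assumes that, when two classes match a string of the same length, the class listed first wins, so that a lone \texttt{if} is still classified as a keyword.

\begin{verbatim}
import re

# Hypothetical token classes; names and patterns are assumptions
# made for this example only.
TOKEN_CLASSES = [
    ("KEYWORD", r"if|then|else"),
    ("IDENT", r"[a-z][a-z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("OP", r"[+\-*/]"),
    ("WHITESPACE", r"[ \t\n]+"),
]

def longest_match(text, pos):
    """Return (class, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in TOKEN_CLASSES:
        m = re.compile(pattern).match(text, pos)
        if m is None or m.end() == pos:
            continue
        # Only a strictly longer match replaces the current best,
        # so on a tie the class listed first is kept.
        if best is None or len(m.group()) > len(best[1]):
            best = (name, m.group())
    return best

def lex(text):
    """Split text into (class, lexeme) pairs using longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        tok = longest_match(text, pos)
        if tok is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens

print(lex("iffoo "))
print(lex("if foo"))
\end{verbatim}

\noindent
With these assumed classes, \texttt{lex("iffoo ")} yields a single identifier \texttt{iffoo} followed by a whitespace, while \texttt{lex("if foo")} yields the keyword \texttt{if}, a whitespace and the identifier \texttt{foo}.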

\end{document}

%%% Local Variables:
%%% mode: latex