166 \Grid{+}\, |
166 \Grid{+}\, |
167 \Grid{3} |
167 \Grid{3} |
168 \end{center} |
168 \end{center} |
169 |
169 |
170 \noindent |
170 \noindent |
171 This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}. |
171 This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}. |
172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, |
172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, |
173 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace, |
173 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace |
174 this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any |
174 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are |
175 whitespace. Another reason for recognising whitespaces explicitly is |
175 not separated by any whitespace. Another reason for recognising whitespaces explicitly is |
176 that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also all comments. |
176 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in |
177 |
177 our small language we will eventually just filter out all whitespaces and also all comments. |
178 Lexing will not just separate a string into its components, but also classify the components, that |
178 |
179 is explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
179 Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
180 For the moment, though, we will only focus on the simpler problem of just splitting a string into components. |
180 For the moment, though, we will only focus on the simpler problem of just splitting up |
|
181 a string into components. |
181 |
182 |
182 There are a few subtleties we need to consider first. For example, say the string is |
183 There are a few subtleties we need to consider first. For example, say the string is |
183 |
184 |
184 \begin{center} |
185 \begin{center} |
185 \texttt{\Grid{iffoo\VS\ldots}} |
186 \texttt{\Grid{iffoo\VS}}\;\ldots |
186 \end{center} |
187 \end{center} |
187 |
188 |
188 \noindent |
189 \noindent |
189 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed |
190 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed |
190 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
191 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
191 single identifier. The choice that is often made in lexers is to look for the longest possible match. |
192 single identifier. The choice that is often made in lexers is to look for the longest possible match. |
192 This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}). |
193 This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}). |
193 |
194 |
194 Unfortunately, the convention about the longest match does not yet make the whole process |
195 Unfortunately, the convention about the longest match does not yet make the process |
195 of lexing completely deterministic. Consider the string |
196 of lexing completely deterministic. Consider the string |
196 |
197 |
197 \begin{center} |
198 \begin{center} |
198 \texttt{\Grid{then\VS\dots}} |
199 \texttt{\Grid{then\VS}}\;\dots |
199 \end{center} |
200 \end{center} |
200 |
201 |
201 \noindent |
202 \noindent |
202 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our |
203 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our |
203 regular expressions. In our running example we just use the ranking |
204 regular expressions. In our running example we just use the ranking |
208 |
209 |
209 \noindent |
210 \noindent |
210 So even if both regular expressions match in the example above, |
211 So even if both regular expressions match in the example above, |
211 we give preference to the regular expression for keywords. |
212 we give preference to the regular expression for keywords. |
212 |
213 |
213 Let us see how our algorithm for lexing works in detail. The regular |
214 Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$ |
214 expressions and their ranking are shown above. For our algorithm it will |
215 and $der$, it will use the function \emph{zeroable} defined as follows: |
215 be helpful to have a look at the function \emph{zeroable} |
|
216 defined as follows: |
|
217 |
216 |
218 \begin{center} |
217 \begin{center} |
219 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
218 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
220 $zeroable(\varnothing)$ & $\dn$ & $true$\\ |
219 $zeroable(\varnothing)$ & $\dn$ & $true$\\ |
221 $zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\ |
220 $zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\ |
225 $zeroable (r^*)$ & $\dn$ & $f\!alse$\\ |
224 $zeroable (r^*)$ & $\dn$ & $f\!alse$\\ |
226 \end{tabular} |
225 \end{tabular} |
227 \end{center} |
226 \end{center} |
228 |
227 |
229 \noindent |
228 \noindent |
230 In contrast to the function $nullable(r)$, which test whether a regular expression |
229 Recall that the function $nullable(r)$ tests whether a regular expression |
231 can match the empty string, the $zeroable$ function identifies whether a regular |
230 can match the empty string. The function $zeroable$, on the other hand, tests whether a regular |
232 expression cannot match anything at all. The mathematical way of stating this |
231 expression cannot match anything at all. The mathematical way of stating this |
233 property is |
232 property is |
234 |
233 |
235 \begin{center} |
234 \begin{center} |
236 $zeroable(r)$ if and only if $L(r) = \varnothing$ |
235 $zeroable(r)$ if and only if $L(r) = \varnothing$ |
237 \end{center} |
236 \end{center} |
238 |
237 |
239 \noindent |
238 \noindent |
240 Let us fix a set of regular expressions $rs$. |
239 For what follows let us fix a set of regular expressions $rs$ as being |
241 The crucial idea of the algorithm working with $rs$ and the string, say |
240 |
242 |
241 \begin{center} |
243 \begin{center} |
242 \textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE} |
244 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}\grid{\ldots}} |
243 \end{center} |
245 \end{center} |
244 |
246 |
245 \noindent |
247 \noindent |
246 specifying the ``words'' |
248 is to build the derivative of all regular expressions in $rs$ with respect |
247 of our programming language. The algorithm takes as input the $rs$ and a string, say |
|
248 |
|
249 \begin{center} |
|
250 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots |
|
251 \end{center} |
|
252 |
|
253 \noindent |
|
254 and tries to chop off one word from the beginning of the string. If none of the |
|
255 regular expression in $rs$ matches, we will just return |
|
256 the empty string. |
|
257 |
|
258 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect |
249 to the first character $c_1$. Then we take the results and continue with |
259 to the first character $c_1$. Then we take the results and continue with |
250 building the derivatives with respect to $c_2$ |
260 building the derivatives with respect to $c_2$ until we have either exhausted our |
251 until we have either exhausted our input string or all of the regular expressions |
261 input string or all of the regular expressions are ``zeroable''. If the input string is |
252 are ``zeroable''. |
262 |
|
263 \begin{center} |
|
264 \texttt{\Grid{if2\VS}}\;\dots |
|
265 \end{center} |
|
266 |
|
267 \noindent |
|
268 then building the derivatives with respect to \texttt{i} gives |
|
269 |
|
270 \begin{center} |
|
271 \begin{tabular}{l|c} |
|
272 & $zeroable$\\\hline |
|
273 $der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\ |
|
274 $der\;\texttt{i}\;(\textit{IDENT})$ & no\\ |
|
275 $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\ |
|
276 \end{tabular} |
|
277 \end{center} |
|
278 |
|
279 \noindent |
|
280 We can eliminate then \textit{WHITESPACE} as a potential candidate, because |
|
281 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other |
|
282 two regular expressions as potential candidate and we have to consider the |
|
283 next character from the input string |
|
284 |
|
285 \begin{center} |
|
286 \begin{tabular}{l|c} |
|
287 & $zeroable$\\\hline |
|
288 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\ |
|
289 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\ |
|
290 \end{tabular} |
|
291 \end{center} |
|
292 |
|
293 \noindent |
|
294 Since both are `no', we have to go on with \texttt{2} from the input string |
|
295 |
|
296 \begin{center} |
|
297 \begin{tabular}{l|c} |
|
298 & $zeroable$\\\hline |
|
299 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\ |
|
300 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\ |
|
301 \end{tabular} |
|
302 \end{center} |
|
303 |
|
304 \noindent |
|
305 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet |
|
306 know how much of the input string should be considered as an \textit{IDENT}. So we |
|
307 also have to go on and consider the next derivative. |
|
308 |
|
309 \begin{center} |
|
310 \begin{tabular}{l|c} |
|
311 & $zeroable$\\\hline |
|
312 $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\ |
|
313 \end{tabular} |
|
314 \end{center} |
|
315 |
|
316 \noindent |
|
317 Since the answer is now `yes' we can stop: once all derivatives are |
|
318 zeroable, we know the regular expressions cannot match any more letters from |
|
319 the input. In this case we only have to go back to the derivative that is nullable. In this case it |
|
320 is |
|
321 |
|
322 \begin{center} |
|
323 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ |
|
324 \end{center} |
|
325 |
|
326 \noindent |
|
327 which means we recognised an identifier. In case there is a choice of more than one |
|
328 regular expressions that are nullable, then we choose the one with the highest precedence. |
|
329 You can try out this case |
|
330 |
|
331 |
253 |
332 |
254 \end{document} |
333 \end{document} |
255 |
334 |
256 %%% Local Variables: |
335 %%% Local Variables: |
257 %%% mode: latex |
336 %%% mode: latex |