174 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are |
174 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are |
175 not separated by any whitespace. Another reason for recognising whitespaces explicitly is |
175 not separated by any whitespace. Another reason for recognising whitespaces explicitly is |
176 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in |
176 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in |
177 our small language we will eventually just filter out all whitespaces and also all comments. |
177 our small language we will eventually just filter out all whitespaces and also all comments. |
178 |
178 |
179 Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
179 Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
180 For the moment, though, we will only focus on the simpler problem of just splitting up |
180 For the moment, though, we will only focus on the simpler problem of just splitting up |
181 a string into components. |
181 a string into components. |
182 |
182 |
183 There are a few subtleties we need to consider first. For example, say the string is |
183 There are a few subtleties we need to consider first. For example, say the string is |
184 |
184 |
188 |
188 |
189 \noindent |
189 \noindent |
190 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed |
190 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed |
191 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
191 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
192 single identifier. The choice that is often made in lexers is to look for the longest possible match. |
192 single identifier. The choice that is often made in lexers is to look for the longest possible match. |
193 This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}). |
193 This leaves \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}). |
194 |
194 |
195 Unfortunately, the convention about the longest match does not yet make the process |
195 Unfortunately, the convention about the longest match does not yet make the process |
196 of lexing completely deterministic. Consider the string |
196 of lexing completely deterministic. Consider the string |
197 |
197 |
198 \begin{center} |
198 \begin{center} |
256 the empty string. |
256 the empty string. |
257 |
257 |
258 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect |
258 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect |
259 to the first character $c_1$. Then we take the results and continue with |
259 to the first character $c_1$. Then we take the results and continue with |
260 building the derivatives with respect to $c_2$ until we have either exhausted our |
260 building the derivatives with respect to $c_2$ until we have either exhausted our |
261 input string or all of the regular expressions are ``zeroable''. If the input string is |
261 input string or all of the regular expressions are ``zeroable''. Suppose the input string is |
262 |
262 |
263 \begin{center} |
263 \begin{center} |
264 \texttt{\Grid{if2\VS}}\;\dots |
264 \texttt{\Grid{if2\VS}}\;\dots |
265 \end{center} |
265 \end{center} |
266 |
266 |
275 $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\ |
275 $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\ |
276 \end{tabular} |
276 \end{tabular} |
277 \end{center} |
277 \end{center} |
278 |
278 |
279 \noindent |
279 \noindent |
280 We can eliminate then \textit{WHITESPACE} as a potential candidate, because |
280 We can eliminate \textit{WHITESPACE} as a potential candidate, because |
281 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other |
281 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other |
282 two regular expressions as potential candidate and we have to consider the |
282 two regular expressions as potential candidate and we have to consider the |
283 next character from the input string |
283 next character, \texttt{f}, from the input string |
284 |
284 |
285 \begin{center} |
285 \begin{center} |
286 \begin{tabular}{l|c} |
286 \begin{tabular}{l|c} |
287 & $zeroable$\\\hline |
287 & $zeroable$\\\hline |
288 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\ |
288 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\ |
289 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\ |
289 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\ |
290 \end{tabular} |
290 \end{tabular} |
291 \end{center} |
291 \end{center} |
292 |
292 |
293 \noindent |
293 \noindent |
294 Since both are `no', we have to go on with \texttt{2} from the input string |
294 Since both are `no', we have to continue with \texttt{2} from the input string |
295 |
295 |
296 \begin{center} |
296 \begin{center} |
297 \begin{tabular}{l|c} |
297 \begin{tabular}{l|c} |
298 & $zeroable$\\\hline |
298 & $zeroable$\\\hline |
299 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\ |
299 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\ |
302 \end{center} |
302 \end{center} |
303 |
303 |
304 \noindent |
304 \noindent |
305 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet |
305 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet |
306 know how much of the input string should be considered as an \textit{IDENT}. So we |
306 know how much of the input string should be considered as an \textit{IDENT}. So we |
307 also have to go on and consider the next derivative. |
307 still have to continue and consider the next derivative. |
308 |
308 |
309 \begin{center} |
309 \begin{center} |
310 \begin{tabular}{l|c} |
310 \begin{tabular}{l|c} |
311 & $zeroable$\\\hline |
311 & $zeroable$\\\hline |
312 $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\ |
312 $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\ |
313 \end{tabular} |
313 \end{tabular} |
314 \end{center} |
314 \end{center} |
315 |
315 |
316 \noindent |
316 \noindent |
317 Since the answer is now `yes' we can stop: once all derivatives are |
317 Since the answer is now `yes' also in this case, we can stop: once all derivatives are |
318 zeroable, we know the regular expressions cannot match any more letters from |
318 zeroable, we know the regular expressions cannot match any more letters from |
319 the input. In this case we only have to go back to the derivative that is nullable. In this case it |
319 the input. In this case we only have to go back to the derivative that is nullable. In this case it |
320 is |
320 is |
321 |
321 |
322 \begin{center} |
322 \begin{center} |
323 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ |
323 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ |
324 \end{center} |
324 \end{center} |
325 |
325 |
326 \noindent |
326 \noindent |
327 which means we recognised an identifier. In case there is a choice of more than one |
327 which means we recognised an identifier. In case where there is a choice of more than one |
328 regular expressions that are nullable, then we choose the one with the highest precedence. |
328 regular expressions that are nullable, then we choose the one with the highest precedence. |
329 You can try out this case |
329 You can try out such a case with the input string |
|
330 |
|
331 \begin{center} |
|
332 \texttt{\Grid{then\VS}}\;\dots |
|
333 \end{center} |
|
334 |
|
335 \noindent |
|
336 which can both be recognised as a keyword, but also an identifier. |
|
337 |
|
338 While in the example above the last nullable derivative is the one directly before |
|
339 the derivative turns zeroable, this is not always the case. Imagine, identifiers can be |
|
340 letters, as permuted by the regular expression \textit{IDENT}, but must end with an |
|
341 undercore. |
|
342 |
|
343 \begin{center} |
|
344 \begin{tabular}{lcl} |
|
345 $\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ |
|
346 \end{tabular} |
|
347 \end{center} |
|
348 |
|
349 \noindent |
|
350 If we use \textit{NEWIDENT} with the input string |
|
351 |
|
352 \begin{center} |
|
353 \texttt{\Grid{iffoo\VS}}\;\ldots |
|
354 \end{center} |
|
355 |
|
356 \noindent |
|
357 then it will only become $zeroable$ after the $\VS$ has been analysed. Then we have to go back to the |
|
358 first \texttt{f} because only |
|
359 |
|
360 \begin{center} |
|
361 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ |
|
362 \end{center} |
|
363 |
|
364 \noindent |
|
365 is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining |
|
366 string needs to be consumed by other regular expressions or lead to a lexing error. |
330 |
367 |
331 |
368 |
332 |
369 |
333 \end{document} |
370 \end{document} |
334 |
371 |