handouts/ho05.tex
changeset 162 edcd84c7b491
parent 161 bfcf838dda4d
child 163 89d6d89d9844
equal deleted inserted replaced
161:bfcf838dda4d 162:edcd84c7b491
   174 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are 
   174 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are 
   175 not separated by any whitespace. Another reason for recognising whitespaces explicitly is
   175 not separated by any whitespace. Another reason for recognising whitespaces explicitly is
   176 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in 
   176 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in 
   177 our small language we will eventually just filter out all whitespaces and also all comments.
   177 our small language we will eventually just filter out all whitespaces and also all comments.
   178 
   178 
   179 Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   179 Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   180 For the moment, though, we will only focus on the simpler problem of just splitting up 
   180 For the moment, though, we will only focus on the simpler problem of just splitting up 
   181 a string into components.
   181 a string into components.
   182 
   182 
   183 There are a few subtleties  we need to consider first. For example, say the string is
   183 There are a few subtleties  we need to consider first. For example, say the string is
   184 
   184 
   188 
   188 
   189 \noindent
   189 \noindent
   190 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
   190 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
   191 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   191 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   192 single identifier. The choice that is often made in lexers is to look for the longest possible match.
   192 single identifier. The choice that is often made in lexers is to look for the longest possible match.
   193 This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
   193 This leaves  \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
   194 
   194 
   195 Unfortunately, the convention about the longest match does not yet make the process 
   195 Unfortunately, the convention about the longest match does not yet make the process 
   196 of lexing completely deterministic. Consider the string
   196 of lexing completely deterministic. Consider the string
   197 
   197 
   198 \begin{center}
   198 \begin{center}
   256 the empty string.
   256 the empty string.
   257 
   257 
   258 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
   258 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
   259 to the first character $c_1$. Then we take the results and continue with 
   259 to the first character $c_1$. Then we take the results and continue with 
   260 building the derivatives with respect to $c_2$ until we have either exhausted our 
   260 building the derivatives with respect to $c_2$ until we have either exhausted our 
   261 input string or all of the regular expressions are ``zeroable''.  If the input string is 
   261 input string or all of the regular expressions are ``zeroable''.  Suppose the input string is 
   262 
   262 
   263 \begin{center}
   263 \begin{center}
   264 \texttt{\Grid{if2\VS}}\;\dots
   264 \texttt{\Grid{if2\VS}}\;\dots
   265 \end{center}
   265 \end{center}
   266 
   266 
   275  $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
   275  $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
   276 \end{tabular}
   276 \end{tabular}
   277 \end{center}
   277 \end{center}
   278 
   278 
   279 \noindent
   279 \noindent
   280 We can eliminate then \textit{WHITESPACE} as a potential candidate, because
   280 We can eliminate \textit{WHITESPACE} as a potential candidate, because
   281 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
   281 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
   282 two regular expressions as potential candidate and we have to consider the
   282 two regular expressions as potential candidate and we have to consider the
   283 next character from the input string
   283 next character, \texttt{f}, from the input string
   284 
   284 
   285 \begin{center}
   285 \begin{center}
   286 \begin{tabular}{l|c}
   286 \begin{tabular}{l|c}
   287  & $zeroable$\\\hline
   287  & $zeroable$\\\hline
   288  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
   288  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
   289  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
   289  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
   290 \end{tabular}
   290 \end{tabular}
   291 \end{center}
   291 \end{center}
   292 
   292 
   293 \noindent
   293 \noindent
   294 Since both are `no', we have to go on with \texttt{2} from the input string
   294 Since both are `no', we have to continue with \texttt{2} from the input string
   295 
   295 
   296 \begin{center}
   296 \begin{center}
   297 \begin{tabular}{l|c}
   297 \begin{tabular}{l|c}
   298  & $zeroable$\\\hline
   298  & $zeroable$\\\hline
   299  $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
   299  $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
   302 \end{center}
   302 \end{center}
   303 
   303 
   304 \noindent
   304 \noindent
   305 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
   305 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
   306 know how much of the input string should be considered as an \textit{IDENT}. So we
   306 know how much of the input string should be considered as an \textit{IDENT}. So we
   307 also have to go on and consider the next derivative.
   307 still have to continue and consider the next derivative.
   308 
   308 
   309 \begin{center}
   309 \begin{center}
   310 \begin{tabular}{l|c}
   310 \begin{tabular}{l|c}
   311  & $zeroable$\\\hline
   311  & $zeroable$\\\hline
   312  $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
   312  $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
   313 \end{tabular}
   313 \end{tabular}
   314 \end{center}
   314 \end{center}
   315 
   315 
   316 \noindent
   316 \noindent
   317 Since the answer is now `yes' we can stop: once all derivatives are
   317 Since the answer is now `yes' also in this case, we can stop: once all derivatives are
   318 zeroable, we know the regular expressions cannot match any more letters from
   318 zeroable, we know the regular expressions cannot match any more letters from
   319 the input. In this case we only have to go back to the derivative that is nullable.  In this case it
   319 the input. In this case we only have to go back to the derivative that is nullable.  In this case it
   320 is
   320 is
   321 
   321 
   322 \begin{center}
   322 \begin{center}
   323 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ 
   323 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ 
   324 \end{center} 
   324 \end{center} 
   325 
   325 
   326 \noindent
   326 \noindent
   327 which means we recognised an identifier. In case there is a choice of more than one
   327 which means we recognised an identifier. In case where there is a choice of more than one
   328 regular expressions that are nullable, then we choose the one with the highest precedence. 
   328 regular expressions that are nullable, then we choose the one with the highest precedence. 
   329 You can try out this case
   329 You can try out such a case with the input string
       
   330 
       
   331 \begin{center}
       
   332 \texttt{\Grid{then\VS}}\;\dots
       
   333 \end{center}
       
   334 
       
   335 \noindent
       
   336 which can both be recognised as a keyword, but also an identifier. 
       
   337 
       
   338 While in the example above the last nullable derivative is the one directly before
       
   339 the derivative turns zeroable, this is not always the case. Imagine, identifiers can be 
       
   340 letters, as permuted by the regular expression \textit{IDENT}, but must end with an 
       
   341 undercore.
       
   342 
       
   343 \begin{center}
       
   344 \begin{tabular}{lcl}
       
   345 $\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ 
       
   346 \end{tabular}
       
   347 \end{center}
       
   348 
       
   349 \noindent
       
   350 If we use \textit{NEWIDENT} with the input string
       
   351 
       
   352 \begin{center}
       
   353 \texttt{\Grid{iffoo\VS}}\;\ldots
       
   354 \end{center}
       
   355 
       
   356 \noindent
       
   357 then it will only become $zeroable$ after the $\VS$ has been analysed. Then we have to go back to the
       
   358 first \texttt{f} because only 
       
   359 
       
   360 \begin{center}
       
   361  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$  
       
   362 \end{center}
       
   363 
       
   364 \noindent
       
   365 is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining
       
   366 string needs to be consumed by other regular expressions or lead to a lexing error.
   330 
   367 
   331 
   368 
   332 
   369 
   333 \end{document}
   370 \end{document}
   334 
   371