handouts/ho05.tex
changeset 161 bfcf838dda4d
parent 160 8134c3b981e0
child 162 edcd84c7b491
equal deleted inserted replaced
160:8134c3b981e0 161:bfcf838dda4d
   144 \begin{center}
   144 \begin{center}
   145 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
   145 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
   146 \end{center}
   146 \end{center}
   147 
   147 
   148 \noindent
   148 \noindent
   149 we expect it will be split up as follows
   149 we expect it is split up as follows
   150 
   150 
   151 \begin{center}
   151 \begin{center}
   152 \tt
   152 \tt
   153 \Grid{if}\,
   153 \Grid{if}\,
   154 \Grid{\VS}\, 
   154 \Grid{\VS}\, 
   166 \Grid{+}\,
   166 \Grid{+}\,
   167 \Grid{3}
   167 \Grid{3}
   168 \end{center}
   168 \end{center}
   169 
   169 
   170 \noindent
   170 \noindent
   171 This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
   171 This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
   172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   173 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
   173 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
   174 this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any
   174 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are 
   175 whitespace. Another reason for recognising whitespaces explicitly is
   175 not separated by any whitespace. Another reason for recognising whitespaces explicitly is
   176 that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also all comments.
   176 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in 
   177 
   177 our small language we will eventually just filter out all whitespaces and also all comments.
   178 Lexing will not just separate a string into its components, but also classify the components, that
   178 
   179 is explicitly record that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   179 Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   180 For the moment, though, we will only focus on the simpler problem of just splitting a string into components.
   180 For the moment, though, we will only focus on the simpler problem of just splitting up 
       
   181 a string into components.
   181 
   182 
   182 There are a few subtleties  we need to consider first. For example, say the string is
   183 There are a few subtleties  we need to consider first. For example, say the string is
   183 
   184 
   184 \begin{center}
   185 \begin{center}
   185 \texttt{\Grid{iffoo\VS\ldots}}
   186 \texttt{\Grid{iffoo\VS}}\;\ldots
   186 \end{center}
   187 \end{center}
   187 
   188 
   188 \noindent
   189 \noindent
   189 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
   190 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
   190 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   191 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   191 single identifier. The choice that is often made in lexers is to look for the longest possible match.
   192 single identifier. The choice that is often made in lexers is to look for the longest possible match.
   192 This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
   193 This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
   193 
   194 
   194 Unfortunately, the convention about the longest match does not yet make the whole process 
   195 Unfortunately, the convention about the longest match does not yet make the process 
   195 of lexing completely deterministic. Consider the string
   196 of lexing completely deterministic. Consider the string
   196 
   197 
   197 \begin{center}
   198 \begin{center}
   198 \texttt{\Grid{then\VS\dots}}
   199 \texttt{\Grid{then\VS}}\;\dots
   199 \end{center}
   200 \end{center}
   200 
   201 
   201 \noindent
   202 \noindent
   202 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our 
   203 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our 
   203 regular expressions. In our running example we just use the ranking
   204 regular expressions. In our running example we just use the ranking
   208 
   209 
   209 \noindent
   210 \noindent
   210 So even if both regular expressions match in the example above,
   211 So even if both regular expressions match in the example above,
   211 we give preference to the regular expression for keywords.
   212 we give preference to the regular expression for keywords.
   212 
   213 
   213 Let us see how our algorithm for lexing works in detail. The regular 
   214 Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
   214 expressions and their ranking are shown above. For our algorithm it will  
   215 and $der$, it will  use the function \emph{zeroable} defined as follows:
   215 be helpful to have a look at the function \emph{zeroable}
       
   216 defined as follows:
       
   217 
   216 
   218 \begin{center}
   217 \begin{center}
   219 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   218 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   220 $zeroable(\varnothing)$      & $\dn$ & $true$\\
   219 $zeroable(\varnothing)$      & $\dn$ & $true$\\
   221 $zeroable(\epsilon)$           & $\dn$ &  $f\!alse$\\
   220 $zeroable(\epsilon)$           & $\dn$ &  $f\!alse$\\
   225 $zeroable (r^*)$                  & $\dn$ & $f\!alse$\\
   224 $zeroable (r^*)$                  & $\dn$ & $f\!alse$\\
   226 \end{tabular}
   225 \end{tabular}
   227 \end{center}
   226 \end{center}
   228 
   227 
   229 \noindent
   228 \noindent
   230 In contrast to the function $nullable(r)$, which test whether a regular expression
   229 Recall that the function $nullable(r)$ tests whether a regular expression
   231 can match the empty string, the $zeroable$ function identifies whether a regular
   230 can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
   232 expression cannot match anything at all. The mathematical way of stating this
   231 expression cannot match anything at all. The mathematical way of stating this
   233 property is
   232 property is
   234 
   233 
   235 \begin{center}
   234 \begin{center}
   236 $zeroable(r)$ if and only if $L(r) = \varnothing$
   235 $zeroable(r)$ if and only if $L(r) = \varnothing$
   237 \end{center}
   236 \end{center}
   238 
   237 
   239 \noindent
   238 \noindent
   240 Let us fix a set of regular expressions $rs$.
   239 For what follows let us fix a set of regular expressions $rs$ as being 
   241 The crucial idea of the algorithm working with $rs$ and the string, say
   240 
   242 
   241 \begin{center}
   243 \begin{center}
   242 \textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
   244 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}\grid{\ldots}}
   243 \end{center}
   245 \end{center}
   244 
   246 
   245 \noindent
   247 \noindent
   246 specifying the ``words'' 
   248 is to build the derivative of all regular expressions in $rs$ with respect
   247 of our programming language. The algorithm takes as input the $rs$ and a string, say
       
   248 
       
   249 \begin{center}
       
   250 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
       
   251 \end{center}
       
   252 
       
   253 \noindent
       
   254 and tries to chop off one word from the beginning of the string. If none of the
       
   255 regular expression in $rs$ matches, we will just return
       
   256 the empty string.
       
   257 
       
   258 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
   249 to the first character $c_1$. Then we take the results and continue with 
   259 to the first character $c_1$. Then we take the results and continue with 
   250 building the derivatives with respect to $c_2$
   260 building the derivatives with respect to $c_2$ until we have either exhausted our 
   251 until we have either exhausted our input string or all of the regular expressions
   261 input string or all of the regular expressions are ``zeroable''.  If the input string is 
   252 are ``zeroable''.
   262 
       
   263 \begin{center}
       
   264 \texttt{\Grid{if2\VS}}\;\dots
       
   265 \end{center}
       
   266 
       
   267 \noindent
       
   268 then building the derivatives with respect to \texttt{i} gives
       
   269 
       
   270 \begin{center}
       
   271 \begin{tabular}{l|c}
       
   272  & $zeroable$\\\hline
       
   273  $der\;\texttt{i}\;(\textit{KEYWORD})$      & no\\
       
   274  $der\;\texttt{i}\;(\textit{IDENT})$              & no\\
       
   275  $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
       
   276 \end{tabular}
       
   277 \end{center}
       
   278 
       
   279 \noindent
       
   280 We can eliminate then \textit{WHITESPACE} as a potential candidate, because
       
   281 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
       
   282 two regular expressions as potential candidate and we have to consider the
       
   283 next character from the input string
       
   284 
       
   285 \begin{center}
       
   286 \begin{tabular}{l|c}
       
   287  & $zeroable$\\\hline
       
   288  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
       
   289  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
       
   290 \end{tabular}
       
   291 \end{center}
       
   292 
       
   293 \noindent
       
   294 Since both are `no', we have to go on with \texttt{2} from the input string
       
   295 
       
   296 \begin{center}
       
   297 \begin{tabular}{l|c}
       
   298  & $zeroable$\\\hline
       
   299  $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
       
   300  $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$              & no\\
       
   301 \end{tabular}
       
   302 \end{center}
       
   303 
       
   304 \noindent
       
   305 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
       
   306 know how much of the input string should be considered as an \textit{IDENT}. So we
       
   307 also have to go on and consider the next derivative.
       
   308 
       
   309 \begin{center}
       
   310 \begin{tabular}{l|c}
       
   311  & $zeroable$\\\hline
       
   312  $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
       
   313 \end{tabular}
       
   314 \end{center}
       
   315 
       
   316 \noindent
       
   317 Since the answer is now `yes' we can stop: once all derivatives are
       
   318 zeroable, we know the regular expressions cannot match any more letters from
       
   319 the input. In this case we only have to go back to the derivative that is nullable.  In this case it
       
   320 is
       
   321 
       
   322 \begin{center}
       
   323 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ 
       
   324 \end{center} 
       
   325 
       
   326 \noindent
       
   327 which means we recognised an identifier. In case there is a choice of more than one
       
   328 regular expressions that are nullable, then we choose the one with the highest precedence. 
       
   329 You can try out this case
       
   330 
       
   331 
   253 
   332 
   254 \end{document}
   333 \end{document}
   255 
   334 
   256 %%% Local Variables: 
   335 %%% Local Variables: 
   257 %%% mode: latex
   336 %%% mode: latex