handouts/ho05-bak.tex
changeset 358 b3129cff41e9
parent 292 7ed2a25dd115
child 871 94b84d880c2b
equal deleted inserted replaced
357:603e171a7b48 358:b3129cff41e9
       
     1 \documentclass{article}
       
     2 \usepackage{../style}
       
     3 \usepackage{../langs}
       
     4 \usepackage{../graphics}
       
     5 
       
     6 \begin{document}
       
     7 
       
     8 \section*{Handout 5 (Lexing)}
       
     9 
       
    10 Whenever you want to design a new programming language or
       
    11 implement a compiler for an existing language, the first task
       
    12 is to fix the basic ``words'' of the language. For example
       
    13 what are the keywords, or reserved words, of the language,
       
    14 what are permitted identifiers, numbers, expressions and so
       
    15 on. One convenient way to do this is, of course, by using
       
    16 regular expressions. 
       
    17 
       
    18 In this course we want to take a closer look at the WHILE
       
    19 programming language. This is a simple imperative programming
       
    20 language consisting of arithmetic expressions, assignments,
       
    21 if-statements and loops. For example the Fibonacci program can
       
    22 be written in this language as follows:
       
    23 
       
    24 \begin{center}
       
    25 \mbox{\lstinputlisting[language=while]{../progs/fib.while}}
       
    26 \end{center}
       
    27 
       
    28 \noindent
       
    29 The keywords in this language will be
       
    30 
       
    31 \begin{center}
       
    32 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
       
    33 \end{center}
       
    34 
       
    35 \noindent
       
    36 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
       
    37 and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''
       
    38 is in our programming language and what comments should look like.
       
    39 As a first try, we might specify the regular expressions for our language roughly as follows
       
    40 
       
    41 \begin{center}
       
    42 \begin{tabular}{lcl}
       
    43 $\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
       
    44 $\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
       
    45 $\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
       
    46 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
       
    47 $\textit{OP}$      &  $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
       
    48 $\textit{NUM}$   &  $:=$ & $\textit{DIGIT}^+$\\
       
    49 $\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
       
    50 \end{tabular}
       
    51 \end{center}
       
    52 
       
    53 Having the regular expressions in place, the problem we have to solve is: 
       
    54 given a string of our programming language, which regular expression 
       
    55 matches which part of the string. By solving this problem, we can split up a string 
       
    56 of our language into components. 
       
    57 For example given the input string
       
    58 
       
    59 \begin{center}
       
    60 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
       
    61 \end{center}
       
    62 
       
    63 \noindent
       
    64 we expect it is split up as follows
       
    65 
       
    66 \begin{center}
       
    67 \tt
       
    68 \Grid{if}\,
       
    69 \Grid{\VS}\, 
       
    70 \Grid{true}\, 
       
    71 \Grid{\VS}\, 
       
    72 \Grid{then}\, 
       
    73 \Grid{\VS}\, 
       
    74 \Grid{x}\, 
       
    75 \Grid{+}\, 
       
    76 \Grid{2}\,
       
    77 \Grid{\VS}\, 
       
    78 \Grid{else}\,
       
    79 \Grid{\VS}\,
       
    80 \Grid{x}\,
       
    81 \Grid{+}\,
       
    82 \Grid{3}
       
    83 \end{center}
       
    84 
       
    85 \noindent
       
    86 This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
       
    87 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
       
    88 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
       
    89 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are 
       
    90 not separated by any whitespace. Another reason for recognising whitespaces explicitly is
       
    91 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in 
       
    92 our small language we will eventually just filter out all whitespaces and also all comments.
       
    93 
       
    94 Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
       
    95 For the moment, though, we will only focus on the simpler problem of just splitting up 
       
    96 a string into components.
       
    97 
       
    98 There are a few subtleties  we need to consider first. For example, say the string is
       
    99 
       
   100 \begin{center}
       
   101 \texttt{\Grid{iffoo\VS}}\;\ldots
       
   102 \end{center}
       
   103 
       
   104 \noindent
       
   105 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
       
   106 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
       
   107 single identifier. The choice that is often made in lexers is to look for the longest possible match.
       
   108 This leaves  \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
       
   109 
       
   110 Unfortunately, the convention about the longest match does not yet make the process 
       
   111 of lexing completely deterministic. Consider the string
       
   112 
       
   113 \begin{center}
       
   114 \texttt{\Grid{then\VS}}\;\dots
       
   115 \end{center}
       
   116 
       
   117 \noindent
       
   118 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our 
       
   119 regular expressions. In our running example we just use the ranking
       
   120 
       
   121 \[
       
   122 \textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
       
   123 \]
       
   124 
       
   125 \noindent
       
   126 So even if both regular expressions match in the example above,
       
   127 we give preference to the regular expression for keywords.
       
   128 
       
   129 Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
       
   130 and $der$, it will  use the function \emph{zeroable} defined as follows:
       
   131 
       
   132 \begin{center}
       
   133 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
       
   134 $zeroable(\varnothing)$      & $\dn$ & $true$\\
       
   135 $zeroable(\epsilon)$           & $\dn$ &  $f\!alse$\\
       
   136 $zeroable (c)$                    & $\dn$ &  $f\!alse$\\
       
   137 $zeroable (r_1 + r_2)$        & $\dn$ &  $zeroable(r_1) \wedge zeroable(r_2)$ \\ 
       
   138 $zeroable (r_1 \cdot r_2)$  & $\dn$ &  $zeroable(r_1) \vee zeroable(r_2)$ \\
       
   139 $zeroable (r^*)$                  & $\dn$ & $f\!alse$\\
       
   140 \end{tabular}
       
   141 \end{center}
       
   142 
       
   143 \noindent
       
   144 Recall that the function $nullable(r)$ tests whether a regular expression
       
   145 can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
       
   146 expression cannot match anything at all. The mathematical way of stating this
       
   147 property is
       
   148 
       
   149 \begin{center}
       
   150 $zeroable(r)$ if and only if $L(r) = \varnothing$
       
   151 \end{center}
       
   152 
       
   153 \noindent
       
   154 For what follows let us fix a set of regular expressions $rs$ as being 
       
   155 
       
   156 \begin{center}
       
   157 \textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
       
   158 \end{center}
       
   159 
       
   160 \noindent
       
   161 specifying the ``words'' 
       
   162 of our programming language. The algorithm takes as input the $rs$ and a string, say
       
   163 
       
   164 \begin{center}
       
   165 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
       
   166 \end{center}
       
   167 
       
   168 \noindent
       
   169 and tries to chop off one word from the beginning of the string. If none of the
       
   170 regular expression in $rs$ matches, we will just return
       
   171 the empty string.
       
   172 
       
   173 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
       
   174 to the first character $c_1$. Then we take the results and continue with 
       
   175 building the derivatives with respect to $c_2$ until we have either exhausted our 
       
   176 input string or all of the regular expressions are ``zeroable''.  Suppose the input string is 
       
   177 
       
   178 \begin{center}
       
   179 \texttt{\Grid{if2\VS}}\;\dots
       
   180 \end{center}
       
   181 
       
   182 \noindent
       
   183 then building the derivatives with respect to \texttt{i} gives
       
   184 
       
   185 \begin{center}
       
   186 \begin{tabular}{l|c}
       
   187  & $zeroable$\\\hline
       
   188  $der\;\texttt{i}\;(\textit{KEYWORD})$      & no\\
       
   189  $der\;\texttt{i}\;(\textit{IDENT})$              & no\\
       
   190  $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
       
   191 \end{tabular}
       
   192 \end{center}
       
   193 
       
   194 \noindent
       
   195 We can eliminate \textit{WHITESPACE} as a potential candidate, because
       
   196 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
       
   197 two regular expressions as potential candidate and we have to consider the
       
   198 next character, \texttt{f}, from the input string
       
   199 
       
   200 \begin{center}
       
   201 \begin{tabular}{l|c}
       
   202  & $zeroable$\\\hline
       
   203  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
       
   204  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
       
   205 \end{tabular}
       
   206 \end{center}
       
   207 
       
   208 \noindent
       
   209 Since both are `no', we have to continue with \texttt{2} from the input string
       
   210 
       
   211 \begin{center}
       
   212 \begin{tabular}{l|c}
       
   213  & $zeroable$\\\hline
       
   214  $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
       
   215  $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$              & no\\
       
   216 \end{tabular}
       
   217 \end{center}
       
   218 
       
   219 \noindent
       
   220 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
       
   221 know how much of the input string should be considered as an \textit{IDENT}. So we
       
   222 still have to continue and consider the next derivative.
       
   223 
       
   224 \begin{center}
       
   225 \begin{tabular}{l|c}
       
   226  & $zeroable$\\\hline
       
   227  $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
       
   228 \end{tabular}
       
   229 \end{center}
       
   230 
       
   231 \noindent
       
   232 Since the answer is now `yes' also in this case, we can stop: once all derivatives are
       
   233 zeroable, we know the regular expressions cannot match any more letters from
       
   234 the input. In this case we only have to go back to the derivative that is nullable.  In this case it
       
   235 is
       
   236 
       
   237 \begin{center}
       
   238 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ 
       
   239 \end{center} 
       
   240 
       
   241 \noindent
       
   242 which means we recognised an identifier. In case where there is a choice of more than one
       
   243 regular expressions that are nullable, then we choose the one with the highest precedence. 
       
   244 You can try out such a case with the input string
       
   245 
       
   246 \begin{center}
       
   247 \texttt{\Grid{then\VS}}\;\dots
       
   248 \end{center}
       
   249 
       
   250 \noindent
       
   251 which can both be recognised as a keyword, but also an identifier. 
       
   252 
       
   253 While in the example above the last nullable derivative is the one directly before
       
   254 the derivative turns zeroable, this is not always the case. Imagine, identifiers can be 
       
   255 letters, as permuted by the regular expression \textit{IDENT}, but must end with an 
       
   256 undercore.
       
   257 
       
   258 \begin{center}
       
   259 \begin{tabular}{lcl}
       
   260 $\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ 
       
   261 \end{tabular}
       
   262 \end{center}
       
   263 
       
   264 \noindent
       
   265 If we use \textit{NEWIDENT} with the input string
       
   266 
       
   267 \begin{center}
       
   268 \texttt{\Grid{iffoo\VS}}\;\ldots
       
   269 \end{center}
       
   270 
       
   271 \noindent
       
   272 then it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to the
       
   273 first \texttt{f} because only 
       
   274 
       
   275 \begin{center}
       
   276  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$  
       
   277 \end{center}
       
   278 
       
   279 \noindent
       
   280 is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining
       
   281 string needs to be consumed by other regular expressions or lead to a lexing error.
       
   282 
       
   283 
       
   284 
       
   285 \end{document}
       
   286 
       
   287 %%% Local Variables: 
       
   288 %%% mode: latex
       
   289 %%% TeX-master: t
       
   290 %%% End: