|
1 \documentclass{article} |
|
2 \usepackage{../style} |
|
3 \usepackage{../langs} |
|
4 \usepackage{../graphics} |
|
5 |
|
6 \begin{document} |
|
7 |
|
8 \section*{Handout 5 (Lexing)} |
|
9 |
|
10 Whenever you want to design a new programming language or |
|
11 implement a compiler for an existing language, the first task |
|
12 is to fix the basic ``words'' of the language. For example |
|
13 what are the keywords, or reserved words, of the language, |
|
14 what are permitted identifiers, numbers, expressions and so |
|
15 on. One convenient way to do this is, of course, by using |
|
16 regular expressions. |
|
17 |
|
18 In this course we want to take a closer look at the WHILE |
|
19 programming language. This is a simple imperative programming |
|
20 language consisting of arithmetic expressions, assignments, |
|
21 if-statements and loops. For example the Fibonacci program can |
|
22 be written in this language as follows: |
|
23 |
|
24 \begin{center} |
|
25 \mbox{\lstinputlisting[language=while]{../progs/fib.while}} |
|
26 \end{center} |
|
27 |
|
28 \noindent |
|
29 The keywords in this language will be |
|
30 |
|
31 \begin{center} |
|
32 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read} |
|
33 \end{center} |
|
34 |
|
35 \noindent |
|
36 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers |
|
37 and strings (which we however ignore for the moment). We also need to specify what the ``whitespace'' |
|
38 is in our programming language and what comments should look like. |
|
39 As a first try, we might specify the regular expressions for our language roughly as follows |
|
40 |
|
41 \begin{center} |
|
42 \begin{tabular}{lcl} |
|
43 $\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\ |
|
44 $\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\ |
|
45 $\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\ |
|
46 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ |
|
47 $\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\ |
|
48 $\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\ |
|
49 $\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$ |
|
50 \end{tabular} |
|
51 \end{center} |
|
52 |
|
53 Having the regular expressions in place, the problem we have to solve is: |
|
54 given a string of our programming language, which regular expression |
|
55 matches which part of the string. By solving this problem, we can split up a string |
|
56 of our language into components. |
|
57 For example given the input string |
|
58 |
|
59 \begin{center} |
|
60 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}} |
|
61 \end{center} |
|
62 |
|
63 \noindent |
|
64 we expect it is split up as follows |
|
65 |
|
66 \begin{center} |
|
67 \tt |
|
68 \Grid{if}\, |
|
69 \Grid{\VS}\, |
|
70 \Grid{true}\, |
|
71 \Grid{\VS}\, |
|
72 \Grid{then}\, |
|
73 \Grid{\VS}\, |
|
74 \Grid{x}\, |
|
75 \Grid{+}\, |
|
76 \Grid{2}\, |
|
77 \Grid{\VS}\, |
|
78 \Grid{else}\, |
|
79 \Grid{\VS}\, |
|
80 \Grid{x}\, |
|
81 \Grid{+}\, |
|
82 \Grid{3} |
|
83 \end{center} |
|
84 |
|
85 \noindent |
|
86 This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}. |
|
87 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, |
|
88 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace |
|
89 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are |
|
90 not separated by any whitespace. Another reason for recognising whitespaces explicitly is |
|
91 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in |
|
92 our small language we will eventually just filter out all whitespaces and also all comments. |
|
93 |
|
94 Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
|
95 For the moment, though, we will only focus on the simpler problem of just splitting up |
|
96 a string into components. |
|
97 |
|
98 There are a few subtleties we need to consider first. For example, say the string is |
|
99 |
|
100 \begin{center} |
|
101 \texttt{\Grid{iffoo\VS}}\;\ldots |
|
102 \end{center} |
|
103 |
|
104 \noindent |
|
105 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed |
|
106 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
|
107 single identifier. The choice that is often made in lexers is to look for the longest possible match. |
|
108 This leaves \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}). |
|
109 |
|
110 Unfortunately, the convention about the longest match does not yet make the process |
|
111 of lexing completely deterministic. Consider the string |
|
112 |
|
113 \begin{center} |
|
114 \texttt{\Grid{then\VS}}\;\dots |
|
115 \end{center} |
|
116 |
|
117 \noindent |
|
118 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our |
|
119 regular expressions. In our running example we just use the ranking |
|
120 |
|
121 \[ |
|
122 \textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots |
|
123 \] |
|
124 |
|
125 \noindent |
|
126 So even if both regular expressions match in the example above, |
|
127 we give preference to the regular expression for keywords. |
|
128 |
|
129 Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$ |
|
130 and $der$, it will use the function \emph{zeroable} defined as follows: |
|
131 |
|
132 \begin{center} |
|
133 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
|
134 $zeroable(\varnothing)$ & $\dn$ & $true$\\ |
|
135 $zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\ |
|
136 $zeroable (c)$ & $\dn$ & $f\!alse$\\ |
|
137 $zeroable (r_1 + r_2)$ & $\dn$ & $zeroable(r_1) \wedge zeroable(r_2)$ \\ |
|
138 $zeroable (r_1 \cdot r_2)$ & $\dn$ & $zeroable(r_1) \vee zeroable(r_2)$ \\ |
|
139 $zeroable (r^*)$ & $\dn$ & $f\!alse$\\ |
|
140 \end{tabular} |
|
141 \end{center} |
|
142 |
|
143 \noindent |
|
144 Recall that the function $nullable(r)$ tests whether a regular expression |
|
145 can match the empty string. The function $zeroable$, on the other hand, tests whether a regular |
|
146 expression cannot match anything at all. The mathematical way of stating this |
|
147 property is |
|
148 |
|
149 \begin{center} |
|
150 $zeroable(r)$ if and only if $L(r) = \varnothing$ |
|
151 \end{center} |
|
152 |
|
153 \noindent |
|
154 For what follows let us fix a set of regular expressions $rs$ as being |
|
155 |
|
156 \begin{center} |
|
157 \textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE} |
|
158 \end{center} |
|
159 |
|
160 \noindent |
|
161 specifying the ``words'' |
|
162 of our programming language. The algorithm takes as input the $rs$ and a string, say |
|
163 |
|
164 \begin{center} |
|
165 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots |
|
166 \end{center} |
|
167 |
|
168 \noindent |
|
169 and tries to chop off one word from the beginning of the string. If none of the |
|
170 regular expression in $rs$ matches, we will just return |
|
171 the empty string. |
|
172 |
|
173 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect |
|
174 to the first character $c_1$. Then we take the results and continue with |
|
175 building the derivatives with respect to $c_2$ until we have either exhausted our |
|
176 input string or all of the regular expressions are ``zeroable''. Suppose the input string is |
|
177 |
|
178 \begin{center} |
|
179 \texttt{\Grid{if2\VS}}\;\dots |
|
180 \end{center} |
|
181 |
|
182 \noindent |
|
183 then building the derivatives with respect to \texttt{i} gives |
|
184 |
|
185 \begin{center} |
|
186 \begin{tabular}{l|c} |
|
187 & $zeroable$\\\hline |
|
188 $der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\ |
|
189 $der\;\texttt{i}\;(\textit{IDENT})$ & no\\ |
|
190 $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\ |
|
191 \end{tabular} |
|
192 \end{center} |
|
193 |
|
194 \noindent |
|
195 We can eliminate \textit{WHITESPACE} as a potential candidate, because |
|
196 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other |
|
197 two regular expressions as potential candidate and we have to consider the |
|
198 next character, \texttt{f}, from the input string |
|
199 |
|
200 \begin{center} |
|
201 \begin{tabular}{l|c} |
|
202 & $zeroable$\\\hline |
|
203 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\ |
|
204 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\ |
|
205 \end{tabular} |
|
206 \end{center} |
|
207 |
|
208 \noindent |
|
209 Since both are `no', we have to continue with \texttt{2} from the input string |
|
210 |
|
211 \begin{center} |
|
212 \begin{tabular}{l|c} |
|
213 & $zeroable$\\\hline |
|
214 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\ |
|
215 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\ |
|
216 \end{tabular} |
|
217 \end{center} |
|
218 |
|
219 \noindent |
|
220 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet |
|
221 know how much of the input string should be considered as an \textit{IDENT}. So we |
|
222 still have to continue and consider the next derivative. |
|
223 |
|
224 \begin{center} |
|
225 \begin{tabular}{l|c} |
|
226 & $zeroable$\\\hline |
|
227 $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\ |
|
228 \end{tabular} |
|
229 \end{center} |
|
230 |
|
231 \noindent |
|
232 Since the answer is now `yes' also in this case, we can stop: once all derivatives are |
|
233 zeroable, we know the regular expressions cannot match any more letters from |
|
234 the input. In this case we only have to go back to the derivative that is nullable. In this case it |
|
235 is |
|
236 |
|
237 \begin{center} |
|
238 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ |
|
239 \end{center} |
|
240 |
|
241 \noindent |
|
242 which means we recognised an identifier. In case where there is a choice of more than one |
|
243 regular expressions that are nullable, then we choose the one with the highest precedence. |
|
244 You can try out such a case with the input string |
|
245 |
|
246 \begin{center} |
|
247 \texttt{\Grid{then\VS}}\;\dots |
|
248 \end{center} |
|
249 |
|
250 \noindent |
|
251 which can both be recognised as a keyword, but also an identifier. |
|
252 |
|
253 While in the example above the last nullable derivative is the one directly before |
|
254 the derivative turns zeroable, this is not always the case. Imagine, identifiers can be |
|
255 letters, as permuted by the regular expression \textit{IDENT}, but must end with an |
|
256 undercore. |
|
257 |
|
258 \begin{center} |
|
259 \begin{tabular}{lcl} |
|
260 $\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ |
|
261 \end{tabular} |
|
262 \end{center} |
|
263 |
|
264 \noindent |
|
265 If we use \textit{NEWIDENT} with the input string |
|
266 |
|
267 \begin{center} |
|
268 \texttt{\Grid{iffoo\VS}}\;\ldots |
|
269 \end{center} |
|
270 |
|
271 \noindent |
|
272 then it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to the |
|
273 first \texttt{f} because only |
|
274 |
|
275 \begin{center} |
|
276 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ |
|
277 \end{center} |
|
278 |
|
279 \noindent |
|
280 is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining |
|
281 string needs to be consumed by other regular expressions or lead to a lexing error. |
|
282 |
|
283 |
|
284 |
|
285 \end{document} |
|
286 |
|
287 %%% Local Variables: |
|
288 %%% mode: latex |
|
289 %%% TeX-master: t |
|
290 %%% End: |