8 \begin{document} |
8 \begin{document} |
9 |
9 |
10 \section*{Handout 7 (Compilation)} |
10 \section*{Handout 7 (Compilation)} |
11 |
11 |
12 The purpose of a compiler is to transform a program, a human |
12 The purpose of a compiler is to transform a program, a human |
13 can write, into code the machine can run as fast as possible. |
13 can read and write, into code the machine can run as fast as |
14 The fastest code would be machine code the CPU can run |
14 possible. The fastest code would be machine code the CPU can |
15 directly, but it is often enough to improve the speed of a |
15 run directly, but it is often enough to improve the speed of a |
16 program by just targeting a virtual machine. This produces not |
16 program by just targeting a virtual machine. This produces not |
17 the fastest possible code, but code that is fast enough and |
17 the fastest possible code, but code that is fast enough and |
18 has the advantage that the virtual machine care of things a |
18 has the advantage that the virtual machine takes care of |
19 compiler would normally need to take care of (like explicit |
19 things a compiler would normally need to take care of (like |
20 memory management). |
20 explicit memory management). |
21 |
21 |
22 As an example we will implement a compiler for the very simple |
22 As a first example we will implement a compiler for the very |
23 While-language. We will be generating code for the Java |
23 simple While-language. It will generate code for the Java |
24 Virtual Machine. This is a stack-based virtual machine, a fact |
24 Virtual Machine (JVM). This is a stack-based virtual machine, |
25 which will make it easy to generate code for arithmetic |
25 a fact which will make it easy to generate code for arithmetic |
26 expressions. For example for generating code for the |
26 expressions. For example for generating code for the |
27 expression $1 + 2$ we need to generate the following three |
27 expression $1 + 2$ we need to generate the following three |
28 instructions |
28 instructions |
29 |
29 |
30 \begin{lstlisting}[numbers=none] |
30 \begin{lstlisting}[numbers=none] |
33 iadd |
33 iadd |
34 \end{lstlisting} |
34 \end{lstlisting} |
35 |
35 |
36 \noindent The first instruction loads the constant $1$ onto |
36 \noindent The first instruction loads the constant $1$ onto |
37 the stack, the next one $2$, the third instruction adds both |
37 the stack, the next one $2$, the third instruction adds both |
38 numbers together replacing the top elements of the stack with |
38 numbers together replacing the top two elements of the stack |
39 the result $3$. For simplicity, we will throughout consider |
39 with the result $3$. For simplicity, we will throughout |
40 only integer numbers and results. Therefore we can use the |
40 consider only integer numbers and results. Therefore we can |
41 instructions \code{iadd}, \code{isub}, \code{imul}, |
41 use the JVM instructions \code{iadd}, \code{isub}, |
42 \code{idiv} and so on. The \code{i} stands for integer |
42 \code{imul}, \code{idiv} and so on. The \code{i} stands for |
43 instructions in the JVM (alternatives are \code{d} for doubles, |
43 integer instructions in the JVM (alternatives are \code{d} for |
44 \code{l} for longs and \code{f} for floats). |
44 doubles, \code{l} for longs and \code{f} for floats). |
45 |
45 |
46 Recall our grammar for arithmetic expressions ($E$ is the |
46 Recall our grammar for arithmetic expressions ($E$ is the |
47 starting symbol): |
47 starting symbol): |
48 |
48 |
49 |
49 |
99 expressions will always be on top of the stack. Note, that a |
99 expressions will always be on top of the stack. Note, that a |
100 different bracketing of the expression, for example $(1 + (2 * |
100 different bracketing of the expression, for example $(1 + (2 * |
101 3)) + (4 - 3)$, produces a different abstract syntax tree and |
101 3)) + (4 - 3)$, produces a different abstract syntax tree and |
102 thus potentially also a different list of instructions. |
102 thus potentially also a different list of instructions. |
103 Generating code in this fashion is rather easy to implement: |
103 Generating code in this fashion is rather easy to implement: |
104 it can be done with the following \textit{compile}-function, |
104 it can be done with the following recursive |
105 which takes the abstract syntax tree as argument: |
105 \textit{compile}-function, which takes the abstract syntax |
|
106 tree as argument: |
106 |
107 |
107 \begin{center} |
108 \begin{center} |
108 \begin{tabular}{lcl} |
109 \begin{tabular}{lcl} |
109 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\ |
110 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\ |
110 $\textit{compile}(a_1 + a_2)$ & $\dn$ & |
111 $\textit{compile}(a_1 + a_2)$ & $\dn$ & |
169 index the variable $x$ maps to. |
170 index the variable $x$ maps to. |
170 |
171 |
171 There is a similar \textit{compile}-function for boolean |
172 There is a similar \textit{compile}-function for boolean |
172 expressions, but it includes a ``trick'' to do with |
173 expressions, but it includes a ``trick'' to do with |
173 \pcode{if}- and \pcode{while}-statements. To explain the issue |
174 \pcode{if}- and \pcode{while}-statements. To explain the issue |
174 let us explain first the compilation of statements of the |
175 let us first describe the compilation of statements of the |
175 While-language. The clause for \pcode{skip} is trivial, since |
176 While-language. The clause for \pcode{skip} is trivial, since |
176 we do not have to generate any instruction |
177 we do not have to generate any instruction |
177 |
178 |
178 \begin{center} |
179 \begin{center} |
179 \begin{tabular}{lcl} |
180 \begin{tabular}{lcl} |
180 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\ |
181 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\ |
181 \end{tabular} |
182 \end{tabular} |
182 \end{center} |
183 \end{center} |
183 |
184 |
184 \noindent Note that the \textit{compile}-function for |
185 \noindent $[]$ is the empty list of instructions. Note that |
185 statements returns a pair, a list of instructions (in this |
186 the \textit{compile}-function for statements returns a pair, a |
186 case the empty list) and an environment for variables. The |
187 list of instructions (in this case the empty list) and an |
187 reason for the environment is that assignments in the |
188 environment for variables. The reason for the environment is |
188 While-language might change the environment---clearly if a |
189 that assignments in the While-language might change the |
189 variable is used for the first time, we need to allocate a new |
190 environment---clearly if a variable is used for the first |
190 index and if it has been used before, we need to be able to |
191 time, we need to allocate a new index and if it has been used |
191 retrieve the associated index. This is reflected in |
192 before, we need to be able to retrieve the associated index. |
192 the clause for compiling assignments: |
193 This is reflected in the clause for compiling assignments: |
193 |
194 |
194 \begin{center} |
195 \begin{center} |
195 \begin{tabular}{lcl} |
196 \begin{tabular}{lcl} |
196 $\text{compile}(x := a, E)$ & $\dn$ & |
197 $\textit{compile}(x := a, E)$ & $\dn$ & |
197 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$ |
198 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$ |
198 \end{tabular} |
199 \end{tabular} |
199 \end{center} |
200 \end{center} |
200 |
201 |
201 \noindent We first generate code for the right-hand side of |
202 \noindent We first generate code for the right-hand side of |
227 |
228 |
228 \begin{lstlisting}[mathescape,language={},numbers=none] |
229 \begin{lstlisting}[mathescape,language={},numbers=none] |
229 if $b$ then $cs_1$ else $cs_2$ |
230 if $b$ then $cs_1$ else $cs_2$ |
230 \end{lstlisting} |
231 \end{lstlisting} |
231 |
232 |
232 \noindent where $b$ is a boolean expression and the $cs_i$ |
233 \noindent where $b$ is a boolean expression and the $cs_{1/2}$ |
233 are the instructions for each \pcode{if}-branch. Lets assume |
234 are the statements for each \pcode{if}-branch. Lets assume |
234 we already generated code for $b$ and $cs_{1/2}$. Then in the |
235 we already generated code for $b$ and $cs_{1/2}$. Then in the |
235 true-case the control-flow of the program needs to be |
236 true-case the control-flow of the program needs to be |
236 |
237 |
237 \begin{center} |
238 \begin{center} |
238 \begin{tikzpicture}[node distance=2mm and 4mm, |
239 \begin{tikzpicture}[node distance=2mm and 4mm, |
291 if-condition is false) from the end of the code for the |
292 if-condition is false) from the end of the code for the |
292 boolean to the beginning of the instructions $cs_2$. Once we |
293 boolean to the beginning of the instructions $cs_2$. Once we |
293 are finished with running $cs_2$ we can continue with whatever |
294 are finished with running $cs_2$ we can continue with whatever |
294 code comes after the if-statement. |
295 code comes after the if-statement. |
295 |
296 |
296 The \pcode{goto} and conditional jumps need addresses to where |
297 The \pcode{goto} and the conditional jumps need addresses to |
297 the jump should go. Since we are generating assembly code for |
298 where the jump should go. Since we are generating assembly |
298 the JVM, we do not actually have to give addresses, but need |
299 code for the JVM, we do not actually have to give (numeric) |
299 to attach labels to our code. These labels specify a target |
300 addresses, but can just attach (symbolic) labels to our code. |
300 for a jump. Therefore the labels need to be unique, as |
301 These labels specify a target for a jump. Therefore the labels |
301 otherwise it would be ambiguous where a jump should go. |
302 need to be unique, as otherwise it would be ambiguous where a |
302 A labels, say \pcode{L}, is attached to code like |
303 jump should go to. A label, say \pcode{L}, is attached to code |
|
304 like |
303 |
305 |
304 \begin{lstlisting}[mathescape,numbers=none] |
306 \begin{lstlisting}[mathescape,numbers=none] |
305 L: |
307 L: |
306 $instr_1$ |
308 $instr_1$ |
307 $instr_2$ |
309 $instr_2$ |
308 $\vdots$ |
310 $\vdots$ |
309 \end{lstlisting} |
311 \end{lstlisting} |
|
312 |
|
313 \noindent where a label is indicated by a colon. |
310 |
314 |
311 Recall the ``trick'' with compiling boolean expressions: the |
315 Recall the ``trick'' with compiling boolean expressions: the |
312 \textit{compile}-function for boolean expressions takes three |
316 \textit{compile}-function for boolean expressions takes three |
313 arguments: an abstract syntax tree, an environment for |
317 arguments: an abstract syntax tree, an environment for |
314 variable indices and also the label, $lab$, to where an conditional |
318 variable indices and also the label, $lab$, to where an conditional |
320 $\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ |
324 $\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ |
321 \multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$} |
325 \multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$} |
322 \end{tabular} |
326 \end{tabular} |
323 \end{center} |
327 \end{center} |
324 |
328 |
325 \noindent |
329 \noindent where we are first generating code for the |
326 We are generating code for the subexpressions $a_1$ and $a_2$. |
330 subexpressions $a_1$ and $a_2$. This will mean after running |
327 This will mean after running the corresponding code there will |
331 the corresponding code there will be two integers on top of |
328 be two integers on top of the stack. If they are equal, we do |
332 the stack. If they are equal, we do not have to do anything |
329 not have to do anything and just continue with the next |
333 (except for popping them off from the stack) and just continue |
330 instructions (see control-flow of ifs above). However if they |
334 with the next instructions (see control-flow of ifs above). |
331 are \emph{not} equal, then we need to (conditionally) jump to |
335 However if they are \emph{not} equal, then we need to |
332 the label $lab$. This can be done with the instruction |
336 (conditionally) jump to the label $lab$. This can be done with |
|
337 the instruction |
333 |
338 |
334 \begin{lstlisting}[mathescape,numbers=none] |
339 \begin{lstlisting}[mathescape,numbers=none] |
335 if_icmpne $lab$ |
340 if_icmpne $lab$ |
336 \end{lstlisting} |
341 \end{lstlisting} |
337 |
342 |
338 \noindent Other jump instructions for boolean operators are |
343 \noindent Other jump instructions for boolean operators are |
339 |
344 |
340 \begin{center} |
345 \begin{center} |
341 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l} |
346 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l} |
342 $=$ & $\Rightarrow$ & \pcode{if_icmpne}\\ |
|
343 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\ |
347 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\ |
344 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\ |
348 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\ |
345 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\ |
349 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\ |
346 \end{tabular} |
350 \end{tabular} |
347 \end{center} |
351 \end{center} |
348 |
352 |
349 \noindent and so on. I leave it to you to extend the |
353 \noindent and so on. I leave it to you to extend the |
350 \textit{compile}-function for the other boolean expressions. |
354 \textit{compile}-function for the other boolean expressions. |
351 Note that we need to jump whenever the boolean is \emph{not} |
355 Note that we need to jump whenever the boolean is \emph{not} |
352 true, which means we have to ``negate'' the jump---equals |
356 true, which means we have to ``negate'' the jump |
353 becomes not-equal, less becomes greater-or-equal. If you do |
357 condition---equals becomes not-equal, less becomes |
354 not like this design (it can be the source of some nasty, |
358 greater-or-equal. If you do not like this design (it can be |
355 hard-to-detect errors), you can also change the layout of the |
359 the source of some nasty, hard-to-detect errors), you can also |
356 code and first give the code for the else-branch and then for |
360 change the layout of the code and first give the code for the |
357 the if-branch. |
361 else-branch and then for the if-branch. However in the case |
|
362 of while-loops this way of generating code still seems |
|
363 the most convenient. |
358 |
364 |
359 We are now ready to give the compile function for |
365 We are now ready to give the compile function for |
360 if-statments--remember this function returns for staments a |
366 if-statments---remember this function returns for staments a |
361 pair consisting of the code and an environment: |
367 pair consisting of the code and an environment: |
362 |
368 |
363 \begin{center} |
369 \begin{center} |
364 \begin{tabular}{lcl} |
370 \begin{tabular}{lcl} |
365 $\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\ |
371 $\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\ |