16 the fastest possible code, but code that is fast enough and |
17 the fastest possible code, but code that is fast enough and |
17 has the advantage that the virtual machine care of things a |
18 has the advantage that the virtual machine care of things a |
18 compiler would normally need to take care of (like explicit |
19 compiler would normally need to take care of (like explicit |
19 memory management). |
20 memory management). |
20 |
21 |
21 We will be generating code for the Java Virtual Machine. This |
22 As an example we will implement a compiler for the very simple |
22 is a stack-based virtual machine which will make it easy to |
23 While-language. We will be generating code for the Java |
23 generate code for arithmetic expressions. For example for |
24 Virtual Machine. This is a stack-based virtual machine, a fact |
24 generating code for the expression $1 + 2$ we need to issue |
25 which will make it easy to generate code for arithmetic |
25 the following three instructions |
26 expressions. For example for generating code for the |
|
27 expression $1 + 2$ we need to generate the following three |
|
28 instructions |
26 |
29 |
27 \begin{lstlisting}[numbers=none] |
30 \begin{lstlisting}[numbers=none] |
28 ldc 1 |
31 ldc 1 |
29 ldc 2 |
32 ldc 2 |
30 iadd |
33 iadd |
31 \end{lstlisting} |
34 \end{lstlisting} |
32 |
35 |
33 \noindent The first instruction loads the constant $1$ on the |
36 \noindent The first instruction loads the constant $1$ onto |
34 stack, the next one $2$, the third instruction add both |
37 the stack, the next one $2$, the third instruction adds both |
35 numbers together replacing the top elements of the stack with |
38 numbers together replacing the top elements of the stack with |
36 the result $3$. We will throughout consider only integer |
39 the result $3$. For simplicity, we will throughout consider |
37 numbers and results, therefore we can use the instructions |
40 only integer numbers and results. Therefore we can use the |
38 \code{iadd}, \code{isub}, \code{imul}, \code{idiv} and so on. |
41 instructions \code{iadd}, \code{isub}, \code{imul}, |
39 The \code{i} stands for integer instructions (alternatives are |
42 \code{idiv} and so on. The \code{i} stands for integer |
40 \code{d} for doubles, \code{l} for longs and \code{f} for |
43 instructions in the JVM (alternatives are \code{d} for doubles, |
41 floats). |
44 \code{l} for longs and \code{f} for floats). |
42 |
45 |
43 Recall our grammar for arithmetic expressions: |
46 Recall our grammar for arithmetic expressions ($E$ is the |
|
47 starting symbol): |
44 |
48 |
45 |
49 |
46 \begin{plstx}[rhs style=, margin=3cm] |
50 \begin{plstx}[rhs style=, margin=3cm] |
47 : \meta{E} ::= \meta{T} $+$ \meta{E} |
51 : \meta{E} ::= \meta{T} $+$ \meta{E} |
48 | \meta{T} $-$ \meta{E} |
52 | \meta{T} $-$ \meta{E} |
108 $\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & |
116 $\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & |
109 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\ |
117 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\ |
110 \end{tabular} |
118 \end{tabular} |
111 \end{center} |
119 \end{center} |
112 |
120 |
113 \noindent However, our arithmetic expressions can also contain |
121 However, our arithmetic expressions can also contain |
114 variables. We will represent them as \emph{local variables} in |
122 variables. We will represent them as \emph{local variables} in |
115 the JVM. Essentially, local variables are an array or pointers |
123 the JVM. Essentially, local variables are an array or pointers |
116 to the memory containing in our case only integers. Looking up |
124 to memory cells, containing in our case only integers. Looking |
117 a variable can be done by the instruction |
125 up a variable can be done with the instruction |
118 |
126 |
119 \begin{lstlisting}[mathescape,numbers=none] |
127 \begin{lstlisting}[mathescape,numbers=none] |
120 iload $index$ |
128 iload $index$ |
121 \end{lstlisting} |
129 \end{lstlisting} |
122 |
130 |
123 \noindent |
131 \noindent |
124 which places the content of the local variable $index$ onto |
132 which places the content of the local variable $index$ onto |
125 thestack. Storing the top of the stack into a local variable |
133 the stack. Storing the top of the stack into a local variable |
126 can be done by the instruction |
134 can be done by the instruction |
127 |
135 |
128 \begin{lstlisting}[mathescape,numbers=none] |
136 \begin{lstlisting}[mathescape,numbers=none] |
129 istore $index$ |
137 istore $index$ |
130 \end{lstlisting} |
138 \end{lstlisting} |
131 |
139 |
132 \noindent Note that this also pops off the top of the stack. |
140 \noindent Note that this also pops off the top of the stack. |
133 One problem we have to overcome is that local variables are |
141 One problem we have to overcome, however, is that local |
134 addressed, not by identifiers, but by numbers (starting from |
142 variables are addressed, not by identifiers, but by numbers |
135 $0$). Therefore our compiler needs to maintain a kind of |
143 (starting from $0$). Therefore our compiler needs to maintain |
136 environment (similar to the interpreter) where variables are |
144 a kind of environment where variables are associated to |
137 associated to numbers. This association needs to be unique: if |
145 numbers. This association needs to be unique: if we muddle up |
138 we muddle up the numbers, then we essentially confuse |
146 the numbers, then we essentially confuse variables and the |
139 variables and the result will usually be an erroneous result. |
147 consequence will usually be an erroneous result. Our extended |
140 Therefore our \textit{compile}-function will take two |
148 \textit{compile}-function for arithmetic expressions will |
141 arguments: the abstract syntax tree and the environment, $E$, |
149 therefore take two arguments: the abstract syntax tree and the |
142 that maps identifiers to index-numbers. |
150 environment, $E$, that maps identifiers to index-numbers. |
143 |
151 |
144 \begin{center} |
152 \begin{center} |
145 \begin{tabular}{lcl} |
153 \begin{tabular}{lcl} |
146 $\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\ |
154 $\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\ |
147 $\textit{compile}(a_1 + a_2, E)$ & $\dn$ & |
155 $\textit{compile}(a_1 + a_2, E)$ & $\dn$ & |
155 $\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\ |
163 $\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\ |
156 \end{tabular} |
164 \end{tabular} |
157 \end{center} |
165 \end{center} |
158 |
166 |
159 \noindent In the last line we generate the code for variables |
167 \noindent In the last line we generate the code for variables |
160 where $E(x)$ stands for looking up to which index the variable |
168 where $E(x)$ stands for looking up the environment to which |
161 $x$ maps to. |
169 index the variable $x$ maps to. |
162 |
170 |
|
171 There is a similar \textit{compile}-function for boolean |
|
172 expressions, but it includes a ``trick'' to do with |
|
173 \pcode{if}- and \pcode{while}-statements. To explain the issue |
|
174 let us explain first the compilation of statements of the |
|
175 While-language. The clause for \pcode{skip} is trivial, since |
|
176 we do not have to generate any instruction |
|
177 |
|
178 \begin{center} |
|
179 \begin{tabular}{lcl} |
|
180 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\ |
|
181 \end{tabular} |
|
182 \end{center} |
|
183 |
|
184 \noindent Note that the \textit{compile}-function for |
|
185 statements returns a pair, a list of instructions (in this |
|
186 case the empty list) and an environment for variables. The |
|
187 reason for the environment is that assignments in the |
|
188 While-language might change the environment---clearly if a |
|
189 variable is used for the first time, we need to allocate a new |
|
190 index and if it has been used before, we need to be able to |
|
191 retrieve the associated index. This is reflected in |
|
192 the clause for compiling assignments: |
|
193 |
|
194 \begin{center} |
|
195 \begin{tabular}{lcl} |
|
196 $\text{compile}(x := a, E)$ & $\dn$ & |
|
197 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$ |
|
198 \end{tabular} |
|
199 \end{center} |
|
200 |
|
201 \noindent We first generate code for the right-hand side of |
|
202 the assignment and then add an \pcode{istore}-instruction at |
|
203 the end. By convention the result of the arithmetic expression |
|
204 $a$ will be on top of the stack. After the \pcode{istore} |
|
205 instruction, the result will be stored in the index |
|
206 corresponding to the variable $x$. If the variable $x$ has |
|
207 been used before in the program, we just need to look up what |
|
208 the index is and return the environment unchanged (that is in |
|
209 this case $E' = E$). However, if this is the first encounter |
|
210 of the variable $x$ in the program, then we have to augment |
|
211 the environment and assign $x$ with the largest index in $E$ |
|
212 plus one (that is $E' = E(x \mapsto largest\_index + 1)$). |
|
213 That means for the assignment $x := x + 1$ we generate the |
|
214 following code |
|
215 |
|
216 \begin{lstlisting}[mathescape,numbers=none] |
|
217 iload $n_x$ |
|
218 ldc 1 |
|
219 iadd |
|
220 istore $n_x$ |
|
221 \end{lstlisting} |
|
222 |
|
223 \noindent |
|
224 where $n_x$ is the index for the variable $x$. |
|
225 |
|
226 More complicated is the code for \pcode{if}-statments, say |
|
227 |
|
228 \begin{lstlisting}[mathescape,language={},numbers=none] |
|
229 if $b$ then $cs_1$ else $cs_2$ |
|
230 \end{lstlisting} |
|
231 |
|
232 \noindent where $b$ is a boolean expression and the $cs_i$ |
|
233 are the instructions for each \pcode{if}-branch. Lets assume |
|
234 we already generated code for $b$ and $cs_{1/2}$. Then in the |
|
235 true-case the control-flow of the program needs to be |
|
236 |
|
237 \begin{center} |
|
238 \begin{tikzpicture}[node distance=2mm and 4mm, |
|
239 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, |
|
240 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
|
241 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
|
242 \node (A1) [point] {}; |
|
243 \node (b) [block, right=of A1] {code of $b$}; |
|
244 \node (A2) [point, right=of b] {}; |
|
245 \node (cs1) [block, right=of A2] {code of $cs_1$}; |
|
246 \node (A3) [point, right=of cs1] {}; |
|
247 \node (cs2) [block, right=of A3] {code of $cs_2$}; |
|
248 \node (A4) [point, right=of cs2] {}; |
|
249 |
|
250 \draw (A1) edge [->, black, line width=1mm] (b); |
|
251 \draw (b) edge [->, black, line width=1mm] (cs1); |
|
252 \draw (cs1) edge [->, black, line width=1mm] (A3); |
|
253 \draw (A3) edge [->, black, skip loop] (A4); |
|
254 \node [below=of cs2] {\raisebox{-5mm}{\small{}jump}}; |
|
255 \end{tikzpicture} |
|
256 \end{center} |
|
257 |
|
258 \noindent where we start with running the code for $b$; since |
|
259 we are in the true case we continue with running the code for |
|
260 $cs_1$. After this however, we must not run the code for |
|
261 $cs_2$, but always jump after the last instruction of $cs_2$ |
|
262 (the code for the \pcode{else}-branch). Note that this jump is |
|
263 unconditional, meaning we always have to jump to the end of |
|
264 $cs_2$. The corresponding instruction of the JVM is |
|
265 \pcode{goto}. In case $b$ turns out to be false we need the |
|
266 control-flow |
|
267 |
|
268 \begin{center} |
|
269 \begin{tikzpicture}[node distance=2mm and 4mm, |
|
270 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, |
|
271 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
|
272 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
|
273 \node (A1) [point] {}; |
|
274 \node (b) [block, right=of A1] {code of $b$}; |
|
275 \node (A2) [point, right=of b] {}; |
|
276 \node (cs1) [block, right=of A2] {code of $cs_1$}; |
|
277 \node (A3) [point, right=of cs1] {}; |
|
278 \node (cs2) [block, right=of A3] {code of $cs_2$}; |
|
279 \node (A4) [point, right=of cs2] {}; |
|
280 |
|
281 \draw (A1) edge [->, black, line width=1mm] (b); |
|
282 \draw (b) edge [->, black, line width=1mm] (A2); |
|
283 \draw (A2) edge [skip loop] (A3); |
|
284 \draw (A3) edge [->, black, line width=1mm] (cs2); |
|
285 \draw (cs2) edge [->,black, line width=1mm] (A4); |
|
286 \node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}}; |
|
287 \end{tikzpicture} |
|
288 \end{center} |
|
289 |
|
290 \noindent where we now need a conditional jump (if the |
|
291 if-condition is false) from the end of the code for the |
|
292 boolean to the beginning of the instructions $cs_2$. Once we |
|
293 are finished with running $cs_2$ we can continue with whatever |
|
294 code comes after the if-statement. |
|
295 |
|
296 The \pcode{goto} and conditional jumps need addresses to where |
|
297 the jump should go. Since we are generating assembly code for |
|
298 the JVM, we do not actually have to give addresses, but need |
|
299 to attach labels to our code. These labels specify a target |
|
300 for a jump. Therefore the labels need to be unique, as |
|
301 otherwise it would be ambiguous where a jump should go. |
|
302 A labels, say \pcode{L}, is attached to code like |
|
303 |
|
304 \begin{lstlisting}[mathescape,numbers=none] |
|
305 L: |
|
306 $instr_1$ |
|
307 $instr_2$ |
|
308 $\vdots$ |
|
309 \end{lstlisting} |
|
310 |
|
311 Recall the ``trick'' with compiling boolean expressions: the |
|
312 \textit{compile}-function for boolean expressions takes three |
|
313 arguments: an abstract syntax tree, an environment for |
|
314 variable indices and also the label, $lab$, to where an conditional |
|
315 jump needs to go. The clause for the expression $a_1 = a_2$, |
|
316 for example, is as follows: |
|
317 |
|
318 \begin{center} |
|
319 \begin{tabular}{lcl} |
|
320 $\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ |
|
321 \multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$} |
|
322 \end{tabular} |
|
323 \end{center} |
|
324 |
|
325 \noindent |
|
326 We are generating code for the subexpressions $a_1$ and $a_2$. |
|
327 This will mean after running the corresponding code there will |
|
328 be two integers on top of the stack. If they are equal, we do |
|
329 not have to do anything and just continue with the next |
|
330 instructions (see control-flow of ifs above). However if they |
|
331 are \emph{not} equal, then we need to (conditionally) jump to |
|
332 the label $lab$. This can be done with the instruction |
|
333 |
|
334 \begin{lstlisting}[mathescape,numbers=none] |
|
335 if_icmpne $lab$ |
|
336 \end{lstlisting} |
|
337 |
|
338 \noindent Other jump instructions for boolean operators are |
|
339 |
|
340 \begin{center} |
|
341 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l} |
|
342 $=$ & $\Rightarrow$ & \pcode{if_icmpne}\\ |
|
343 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\ |
|
344 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\ |
|
345 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\ |
|
346 \end{tabular} |
|
347 \end{center} |
|
348 |
|
349 \noindent and so on. I leave it to you to extend the |
|
350 \textit{compile}-function for the other boolean expressions. |
|
351 Note that we need to jump whenever the boolean is \emph{not} |
|
352 true, which means we have to ``negate'' the jump---equals |
|
353 becomes not-equal, less becomes greater-or-equal. If you do |
|
354 not like this design (it can be the source of some nasty, |
|
355 hard-to-detect errors), you can also change the layout of the |
|
356 code and first give the code for the else-branch and then for |
|
357 the if-branch. |
|
358 |
|
359 We are now ready to give the compile function for |
|
360 if-statments--remember this function returns for staments a |
|
361 pair consisting of the code and an environment: |
|
362 |
|
363 \begin{center} |
|
364 \begin{tabular}{lcl} |
|
365 $\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\ |
|
366 \multicolumn{3}{l}{$\qquad l_\textit{ifelse}\;$ (fresh label)}\\ |
|
367 \multicolumn{3}{l}{$\qquad l_\textit{ifend}\;$ (fresh label)}\\ |
|
368 \multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\ |
|
369 \multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\ |
|
370 \multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, l_\textit{ifelse})$}\\ |
|
371 \multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\ |
|
372 \multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;l_\textit{ifend}$}\\ |
|
373 \multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifelse}:$}\\ |
|
374 \multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\ |
|
375 \multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifend}:, E'')$}\\ |
|
376 \end{tabular} |
|
377 \end{center} |
163 |
378 |
164 \end{document} |
379 \end{document} |
165 |
380 |
166 %%% Local Variables: |
381 %%% Local Variables: |
167 %%% mode: latex |
382 %%% mode: latex |