22 that is often pretty fast. This way of producing code has the advantage that |
24 that is often pretty fast. This way of producing code has the advantage that |
23 the virtual machine takes care of things a compiler would normally need |
25 the virtual machine takes care of things a compiler would normally need |
24 to take care of (like explicit memory management). |
26 to take care of (like explicit memory management). |
25 |
27 |
26 As a first example in this module we will implement a compiler for the |
28 As a first example in this module we will implement a compiler for the |
27 very simple While-language. It will generate code for the Java Virtual |
29 very simple WHILE-language that we parsed in the last lecture. The |
28 Machine (JVM). Unfortunately the Java ecosystem does not come with an |
30 compiler will target the Java Virtual Machine (JVM), but not directly. |
29 assembler which would be handy for our compiler-endeavour (unlike |
31 Pictorially the compiler will work as follows: |
30 Microsoft's Common Language Infrastructure for the .Net platform which |
32 |
31 has an assembler out-of-the-box). As a substitute we use in this module |
33 \begin{center} |
32 the 3rd-party programs Jasmin and Krakatau |
34 \begin{tikzpicture}[scale=1,font=\bf, |
|
35 node/.style={ |
|
36 rectangle,rounded corners=3mm, |
|
37 ultra thick,draw=black!50,minimum height=18mm, |
|
38 minimum width=20mm, |
|
39 top color=white,bottom color=black!20}] |
|
40 |
|
41 \node (0) at (-3,0) {}; |
|
42 \node (A) at (0,0) [node,text width=1.6cm,text centered] {our compiler}; |
|
43 \node (B) at (3.5,0) [node,text width=1.6cm,text centered] {Jasmin / Krakatau}; |
|
44 \node (C) at (7.5,0) [node] {JVM}; |
|
45 |
|
46 \draw [->,line width=2.5mm] (0) -- node [above,pos=0.35] {*.while} (A); |
|
47 \draw [->,line width=2.5mm] (A) -- node [above,pos=0.35] {*.j} (B); |
|
48 \draw [->,line width=2.5mm] (B) -- node [above,pos=0.35] {*.class} (C); |
|
49 \end{tikzpicture} |
|
50 \end{center} |
|
51 |
|
52 \noindent |
|
53 The input will be WHILE-programs; the output will be assembly files |
|
54 (with the ending .j). Assembly files essentially contain human-readable |
|
55 machine code, meaning they are not just bits and bytes, but rather |
|
56 something you can read and understand---with a bit of practice of |
|
57 course. An \emph{assembler} will then translate the assembly files into |
|
58 unreadable class or binary files the JVM can run. Unfortunately, the |
|
59 Java ecosystem does not come with an assembler which would be handy for |
|
60 our compiler-endeavour (unlike Microsoft's Common Language |
|
61 Infrastructure for the .Net platform which has an assembler |
|
62 out-of-the-box). As a substitute we shall therefore use the 3rd-party |
|
63 programs Jasmin and Krakatau |
33 |
64 |
34 \begin{itemize} |
65 \begin{itemize} |
35 \item \url{http://jasmin.sourceforge.net} |
66 \item \url{http://jasmin.sourceforge.net} |
36 \item \url{https://github.com/Storyyeller/Krakatau} |
67 \item \url{https://github.com/Storyyeller/Krakatau} |
37 \end{itemize} |
68 \end{itemize} |
154 |
186 |
155 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
187 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
156 istore $index$ |
188 istore $index$ |
157 \end{lstlisting} |
189 \end{lstlisting} |
158 |
190 |
159 \noindent Note that this also pops off the top of the stack. |
191 \noindent Note that this also pops off the top of the stack. One problem |
160 One problem we have to overcome, however, is that local |
192 we have to overcome, however, is that local variables are addressed, not |
161 variables are addressed, not by identifiers, but by numbers |
193 by identifiers (like \texttt{x}, \texttt{foo} and so on), but by numbers |
162 (starting from $0$). Therefore our compiler needs to maintain |
194 (starting from $0$). Therefore our compiler needs to maintain a kind of |
163 a kind of environment where variables are associated to |
195 environment where variables are associated to numbers. This association |
164 numbers. This association needs to be unique: if we muddle up |
196 needs to be unique: if we muddle up the numbers, then we essentially |
165 the numbers, then we essentially confuse variables and the |
197 confuse variables and the consequence will usually be an erroneous |
166 consequence will usually be an erroneous result. Our extended |
198 result. Our extended \textit{compile}-function for arithmetic |
167 \textit{compile}-function for arithmetic expressions will |
199 expressions will therefore take two arguments: the abstract syntax tree |
168 therefore take two arguments: the abstract syntax tree and an |
200 and an environment, $E$, that maps identifiers to index-numbers. |
169 environment, $E$, that maps identifiers to index-numbers. |
|
170 |
201 |
171 \begin{center} |
202 \begin{center} |
172 \begin{tabular}{lcl} |
203 \begin{tabular}{lcl} |
173 $\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\ |
204 $\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\ |
174 $\textit{compile}(a_1 + a_2, E)$ & $\dn$ & |
205 $\textit{compile}(a_1 + a_2, E)$ & $\dn$ & |
181 $\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{idiv}$\\ |
212 $\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{idiv}$\\ |
182 $\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\ |
213 $\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\ |
183 \end{tabular} |
214 \end{tabular} |
184 \end{center} |
215 \end{center} |
185 |
216 |
186 \noindent In the last line we generate the code for variables |
217 \noindent In the last line we generate the code for variables where |
187 where $E(x)$ stands for looking up the environment to which |
218 $E(x)$ stands for looking up the environment to which index the variable |
188 index the variable $x$ maps to. This is similar to an interpreter, |
219 $x$ maps to. This is similar to the interpreter we saw earlier in the |
189 which also needs an environment: the difference is that the |
220 module, which also needs an environment: the difference is that the |
190 interpreter maintains a mapping from variables to current values (what is the |
221 interpreter maintains a mapping from variables to current values (what |
191 currently the value of a variable), while compilers need a mapping |
222 is the currently the value of a variable?), while compilers need a |
192 from variables to memory locations (where can I find the current |
223 mapping from variables to memory locations (where can I find the current |
193 value for the variable in memory). |
224 value for the variable in memory?). |
194 |
225 |
195 There is a similar \textit{compile}-function for boolean |
226 There is a similar \textit{compile}-function for boolean |
196 expressions, but it includes a ``trick'' to do with |
227 expressions, but it includes a ``trick'' to do with |
197 \pcode{if}- and \pcode{while}-statements. To explain the issue |
228 \pcode{if}- and \pcode{while}-statements. To explain the issue |
198 let us first describe the compilation of statements of the |
229 let us first describe the compilation of statements of the |
199 While-language. The clause for \pcode{skip} is trivial, since |
230 WHILE-language. The clause for \pcode{skip} is trivial, since |
200 we do not have to generate any instruction |
231 we do not have to generate any instruction |
201 |
232 |
202 \begin{center} |
233 \begin{center} |
203 \begin{tabular}{lcl} |
234 \begin{tabular}{lcl} |
204 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\ |
235 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\ |
221 $\textit{compile}(x := a, E)$ & $\dn$ & |
252 $\textit{compile}(x := a, E)$ & $\dn$ & |
222 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$ |
253 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$ |
223 \end{tabular} |
254 \end{tabular} |
224 \end{center} |
255 \end{center} |
225 |
256 |
226 \noindent We first generate code for the right-hand side of |
257 \noindent We first generate code for the right-hand side of the |
227 the assignment and then add an \pcode{istore}-instruction at |
258 assignment (that is the arithmetic expression $a$) and then add an |
228 the end. By convention the result of the arithmetic expression |
259 \pcode{istore}-instruction at the end. By convention running the code |
229 $a$ will be on top of the stack. After the \pcode{istore} |
260 for the arithmetic expression $a$ will leave the result on top of the |
230 instruction, the result will be stored in the index |
261 stack. After that the \pcode{istore} instruction, the result will be |
231 corresponding to the variable $x$. If the variable $x$ has |
262 stored in the index corresponding to the variable $x$. If the variable |
232 been used before in the program, we just need to look up what |
263 $x$ has been used before in the program, we just need to look up what |
233 the index is and return the environment unchanged (that is in |
264 the index is and return the environment unchanged (that is in this case |
234 this case $E' = E$). However, if this is the first encounter |
265 $E' = E$). However, if this is the first encounter of the variable $x$ |
235 of the variable $x$ in the program, then we have to augment |
266 in the program, then we have to augment the environment and assign $x$ |
236 the environment and assign $x$ with the largest index in $E$ |
267 with the largest index in $E$ plus one (that is $E' = E(x \mapsto |
237 plus one (that is $E' = E(x \mapsto largest\_index + 1)$). |
268 largest\_index + 1)$). To sum up, for the assignment $x := x + 1$ we |
238 To sum up, for the assignment $x := x + 1$ we generate the |
269 generate the following code |
239 following code |
|
240 |
270 |
241 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
271 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
242 iload $n_x$ |
272 iload $n_x$ |
243 ldc 1 |
273 ldc 1 |
244 iadd |
274 iadd |
254 $index \;=\; E\textit{.getOrElse}(x, |E|)$ |
284 $index \;=\; E\textit{.getOrElse}(x, |E|)$ |
255 \end{tabular} |
285 \end{tabular} |
256 \end{center} |
286 \end{center} |
257 |
287 |
258 \noindent |
288 \noindent |
259 In case the environment $E$ contains an index for $x$, we return it. |
289 This implements the idea that in case the environment $E$ contains an |
260 Otherwise we ``create'' a new index by returning the size $|E|$ of the |
290 index for $x$, we return it. Otherwise we ``create'' a new index by |
261 environment (that will be an index that is guaranteed to be not used |
291 returning the size $|E|$ of the environment (that will be an index that |
262 yet). In all this we take advantage of the JVM which provides us with |
292 is guaranteed not to be used yet). In all this we take advantage of the |
263 a potentially limitless supply of places where we can store values |
293 JVM which provides us with a potentially limitless supply of places |
264 of variables. |
294 where we can store values of variables. |
265 |
295 |
266 A bit more complicated is the generation of code for |
296 A bit more complicated is the generation of code for |
267 \pcode{if}-statements, say |
297 \pcode{if}-statements, say |
268 |
298 |
269 \begin{lstlisting}[mathescape,language={},numbers=none] |
299 \begin{lstlisting}[mathescape,language={},numbers=none] |
270 if $b$ then $cs_1$ else $cs_2$ |
300 if $b$ then $cs_1$ else $cs_2$ |
271 \end{lstlisting} |
301 \end{lstlisting} |
272 |
302 |
273 \noindent where $b$ is a boolean expression and where both $cs_{1/2}$ |
303 \noindent where $b$ is a boolean expression and where both $cs_{1/2}$ |
274 are the statements for each of the \pcode{if}-branches. Lets assume we |
304 are the statements for each of the \pcode{if}-branches. Let us assume we |
275 already generated code for $b$ and $cs_{1/2}$. Then in the true-case the |
305 already generated code for $b$ and and the two if-branches $cs_{1/2}$. |
276 control-flow of the program needs to behave as |
306 Then in the true-case the control-flow of the program needs to behave as |
277 |
307 |
278 \begin{center} |
308 |
279 \begin{tikzpicture}[node distance=2mm and 4mm, |
309 \begin{center} |
280 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, |
310 \begin{tikzpicture}[node distance=2mm and 4mm,line cap=round, |
|
311 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm, |
|
312 top color=white,bottom color=black!20}, |
281 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
313 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
282 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
314 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
283 \node (A1) [point] {}; |
315 \node (A1) [point] {}; |
284 \node (b) [block, right=of A1] {code of $b$}; |
316 \node (b) [block, right=of A1] {code of $b$}; |
285 \node (A2) [point, right=of b] {}; |
317 \node (A2) [point, right=of b] {}; |
288 \node (cs2) [block, right=of A3] {code of $cs_2$}; |
320 \node (cs2) [block, right=of A3] {code of $cs_2$}; |
289 \node (A4) [point, right=of cs2] {}; |
321 \node (A4) [point, right=of cs2] {}; |
290 |
322 |
291 \draw (A1) edge [->, black, line width=1mm] (b); |
323 \draw (A1) edge [->, black, line width=1mm] (b); |
292 \draw (b) edge [->, black, line width=1mm] (cs1); |
324 \draw (b) edge [->, black, line width=1mm] (cs1); |
293 \draw (cs1) edge [->, black, line width=1mm] (A3); |
325 \draw (cs1) edge [->, black, line width=1mm,shorten >= -0.5mm] (A3); |
294 \draw (A3) edge [->, black, skip loop] (A4); |
326 \draw (A3) edge [->, black, skip loop] (A4); |
295 \node [below=of cs2] {\raisebox{-5mm}{\small{}jump}}; |
327 \node [below=of cs2] {\raisebox{-5mm}{\small{}jump}}; |
296 \end{tikzpicture} |
328 \end{tikzpicture} |
297 \end{center} |
329 \end{center} |
298 |
330 |
299 \noindent where we start with running the code for $b$; since |
331 \noindent where we start with running the code for $b$; since |
300 we are in the true case we continue with running the code for |
332 we are in the true case we continue with running the code for |
301 $cs_1$. After this however, we must not run the code for |
333 $cs_1$. After this however, we must not run the code for |
302 $cs_2$, but always jump after the last instruction of $cs_2$ |
334 $cs_2$, but always jump to after the last instruction of $cs_2$ |
303 (the code for the \pcode{else}-branch). Note that this jump is |
335 (the code for the \pcode{else}-branch). Note that this jump is |
304 unconditional, meaning we always have to jump to the end of |
336 unconditional, meaning we always have to jump to the end of |
305 $cs_2$. The corresponding instruction of the JVM is |
337 $cs_2$. The corresponding instruction of the JVM is |
306 \pcode{goto}. In case $b$ turns out to be false we need the |
338 \pcode{goto}. In case $b$ turns out to be false we need the |
307 control-flow |
339 control-flow |
308 |
340 |
309 \begin{center} |
341 \begin{center} |
310 \begin{tikzpicture}[node distance=2mm and 4mm, |
342 \begin{tikzpicture}[node distance=2mm and 4mm,line cap=round, |
311 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, |
343 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm, |
|
344 top color=white,bottom color=black!20}, |
312 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
345 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
313 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
346 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
314 \node (A1) [point] {}; |
347 \node (A1) [point] {}; |
315 \node (b) [block, right=of A1] {code of $b$}; |
348 \node (b) [block, right=of A1] {code of $b$}; |
316 \node (A2) [point, right=of b] {}; |
349 \node (A2) [point, right=of b] {}; |
318 \node (A3) [point, right=of cs1] {}; |
351 \node (A3) [point, right=of cs1] {}; |
319 \node (cs2) [block, right=of A3] {code of $cs_2$}; |
352 \node (cs2) [block, right=of A3] {code of $cs_2$}; |
320 \node (A4) [point, right=of cs2] {}; |
353 \node (A4) [point, right=of cs2] {}; |
321 |
354 |
322 \draw (A1) edge [->, black, line width=1mm] (b); |
355 \draw (A1) edge [->, black, line width=1mm] (b); |
323 \draw (b) edge [->, black, line width=1mm] (A2); |
356 \draw (b) edge [->, black, line width=1mm,shorten >= -0.5mm] (A2); |
324 \draw (A2) edge [skip loop] (A3); |
357 \draw (A2) edge [skip loop] (A3); |
325 \draw (A3) edge [->, black, line width=1mm] (cs2); |
358 \draw (A3) edge [->, black, line width=1mm] (cs2); |
326 \draw (cs2) edge [->,black, line width=1mm] (A4); |
359 \draw (cs2) edge [->,black, line width=1mm] (A4); |
327 \node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}}; |
360 \node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}}; |
328 \end{tikzpicture} |
361 \end{tikzpicture} |
381 |
414 |
382 \begin{lstlisting}[mathescape,numbers=none,language=JVMIS] |
415 \begin{lstlisting}[mathescape,numbers=none,language=JVMIS] |
383 if_icmpne $lab$ |
416 if_icmpne $lab$ |
384 \end{lstlisting} |
417 \end{lstlisting} |
385 |
418 |
386 \noindent Other jump instructions for boolean operators are |
419 To sum up, the third argument in the compile function for booleans |
|
420 specifies where to jump, in case the condition is \emph{not} true. I |
|
421 leave it to you to extend the \textit{compile}-function for the other |
|
422 boolean expressions. Note that we need to jump whenever the boolean is |
|
423 \emph{not} true, which means we have to ``negate'' the jump |
|
424 condition---equals becomes not-equal, less becomes greater-or-equal. |
|
425 Other jump instructions for boolean operators are |
387 |
426 |
388 \begin{center} |
427 \begin{center} |
389 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l} |
428 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l} |
390 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\ |
429 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\ |
391 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\ |
430 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\ |
392 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\ |
431 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\ |
393 \end{tabular} |
432 \end{tabular} |
394 \end{center} |
433 \end{center} |
395 |
434 |
396 \noindent and so on. I leave it to you to extend the |
435 \noindent and so on. If you do not like this design (it can be the |
397 \textit{compile}-function for the other boolean expressions. Note that |
|
398 we need to jump whenever the boolean is \emph{not} true, which means we |
|
399 have to ``negate'' the jump condition---equals becomes not-equal, less |
|
400 becomes greater-or-equal. If you do not like this design (it can be the |
|
401 source of some nasty, hard-to-detect errors), you can also change the |
436 source of some nasty, hard-to-detect errors), you can also change the |
402 layout of the code and first give the code for the else-branch and then |
437 layout of the code and first give the code for the else-branch and then |
403 for the if-branch. However in the case of while-loops this |
438 for the if-branch. However in the case of while-loops this |
404 ``upside-down-inside-out'' way of generating code still seems the most |
439 ``upside-down-inside-out'' way of generating code still seems the most |
405 convenient. |
440 convenient. |
479 \pcode{while} $b$ \pcode{do} $cs$, is very similar. In case |
514 \pcode{while} $b$ \pcode{do} $cs$, is very similar. In case |
480 the condition is true and we need to do another iteration, |
515 the condition is true and we need to do another iteration, |
481 and the control-flow needs to be as follows |
516 and the control-flow needs to be as follows |
482 |
517 |
483 \begin{center} |
518 \begin{center} |
484 \begin{tikzpicture}[node distance=2mm and 4mm, |
519 \begin{tikzpicture}[node distance=2mm and 4mm,line cap=round, |
485 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, |
520 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm, |
|
521 top color=white,bottom color=black!20}, |
486 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
522 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
487 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
523 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
488 \node (A0) [point, left=of A1] {}; |
524 \node (A0) [point, left=of A1] {}; |
489 \node (A1) [point] {}; |
525 \node (A1) [point] {}; |
490 \node (b) [block, right=of A1] {code of $b$}; |
526 \node (b) [block, right=of A1] {code of $b$}; |
493 \node (A3) [point, right=of cs1] {}; |
529 \node (A3) [point, right=of cs1] {}; |
494 \node (A4) [point, right=of A3] {}; |
530 \node (A4) [point, right=of A3] {}; |
495 |
531 |
496 \draw (A0) edge [->, black, line width=1mm] (b); |
532 \draw (A0) edge [->, black, line width=1mm] (b); |
497 \draw (b) edge [->, black, line width=1mm] (cs1); |
533 \draw (b) edge [->, black, line width=1mm] (cs1); |
498 \draw (cs1) edge [->, black, line width=1mm] (A3); |
534 \draw (cs1) edge [->, black, line width=1mm,shorten >= -0.5mm] (A3); |
499 \draw (A3) edge [->,skip loop] (A1); |
535 \draw (A3) edge [->,skip loop] (A1); |
500 \end{tikzpicture} |
536 \end{tikzpicture} |
501 \end{center} |
537 \end{center} |
502 |
538 |
503 \noindent Whereas if the condition is \emph{not} true, we |
539 \noindent Whereas if the condition is \emph{not} true, we |
504 need to jump out of the loop, which gives the following |
540 need to jump out of the loop, which gives the following |
505 control flow. |
541 control flow. |
506 |
542 |
507 \begin{center} |
543 \begin{center} |
508 \begin{tikzpicture}[node distance=2mm and 4mm, |
544 \begin{tikzpicture}[node distance=2mm and 4mm,line cap=round, |
509 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, |
545 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm, |
|
546 top color=white,bottom color=black!20}, |
510 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
547 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
511 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
548 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}] |
512 \node (A0) [point, left=of A1] {}; |
549 \node (A0) [point, left=of A1] {}; |
513 \node (A1) [point] {}; |
550 \node (A1) [point] {}; |
514 \node (b) [block, right=of A1] {code of $b$}; |
551 \node (b) [block, right=of A1] {code of $b$}; |
570 \draw[->,very thick] (LC) edge [->,to path={-- ++(10mm,0mm) |
607 \draw[->,very thick] (LC) edge [->,to path={-- ++(10mm,0mm) |
571 -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (LD.east); |
608 -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (LD.east); |
572 \end{tikzpicture} |
609 \end{tikzpicture} |
573 |
610 |
574 \noindent |
611 \noindent |
575 I leave it to you to read the code and follow its controlflow. |
612 As said, I leave it to you to decide whether the code implements |
576 |
613 the usual controlflow of while-loops. |
577 Next we need to consider the statement \pcode{write x}, which |
614 |
578 can be used to print out the content of a variable. For this |
615 Next we need to consider the statement \pcode{write x}, which can be |
579 we need to use a Java library function. In order to avoid |
616 used to print out the content of a variable. For this we shall use a |
580 having to generate a lot of code for each |
617 Java library function. In order to avoid having to generate a lot of |
581 \pcode{write}-command, we use a separate helper-method and |
618 code for each \pcode{write}-command, we use a separate helper-method and |
582 just call this method with an argument (which needs to be |
619 just call this method with an appropriate argument (which of course |
583 placed onto the stack). The code of the helper-method is as |
620 needs to be placed onto the stack). The code of the helper-method is as |
584 follows. |
621 follows. |
585 |
|
586 |
622 |
587 \begin{lstlisting}[language=JVMIS,numbers=left] |
623 \begin{lstlisting}[language=JVMIS,numbers=left] |
588 .method public static write(I)V |
624 .method public static write(I)V |
589 .limit locals 1 |
625 .limit locals 1 |
590 .limit stack 2 |
626 .limit stack 2 |
599 called \pcode{write}. It takes a single integer argument |
635 called \pcode{write}. It takes a single integer argument |
600 indicated by the \pcode{(I)} and returns no result, indicated |
636 indicated by the \pcode{(I)} and returns no result, indicated |
601 by the \pcode{V}. Since the method has only one argument, we |
637 by the \pcode{V}. Since the method has only one argument, we |
602 only need a single local variable (Line~2) and a stack with |
638 only need a single local variable (Line~2) and a stack with |
603 two cells will be sufficient (Line 3). Line 4 instructs the |
639 two cells will be sufficient (Line 3). Line 4 instructs the |
604 JVM to get the value of the field \pcode{out} of the class |
640 JVM to get the value of the mem \pcode{out} of the class |
605 \pcode{java/lang/System}. It expects the value to be of type |
641 \pcode{java/lang/System}. It expects the value to be of type |
606 \pcode{java/io/PrintStream}. A reference to this value will be |
642 \pcode{java/io/PrintStream}. A reference to this value will be |
607 placed on the stack. Line~5 copies the integer we want to |
643 placed on the stack. Line~5 copies the integer we want to |
608 print out onto the stack. In the next line we call the method |
644 print out onto the stack. In the next line we call the method |
609 \pcode{println} (from the class \pcode{java/io/PrintStream}). |
645 \pcode{println} (from the class \pcode{java/io/PrintStream}). |
624 top of the stack and then call \pcode{write}. The \pcode{XXX} |
660 top of the stack and then call \pcode{write}. The \pcode{XXX} |
625 need to be replaced by an appropriate class name (this will be |
661 need to be replaced by an appropriate class name (this will be |
626 explained shortly). |
662 explained shortly). |
627 |
663 |
628 |
664 |
629 \begin{figure}[t] |
665 By generating code for a WHILE-program, we end up with a list |
630 \begin{lstlisting}[mathescape,language=JVMIS,numbers=left] |
|
631 .class public XXX.XXX |
|
632 .super java/lang/Object |
|
633 |
|
634 .method public <init>()V |
|
635 aload_0 |
|
636 invokenonvirtual java/lang/Object/<init>()V |
|
637 return |
|
638 .end method |
|
639 |
|
640 .method public static main([Ljava/lang/String;)V |
|
641 .limit locals 200 |
|
642 .limit stack 200 |
|
643 |
|
644 $\textit{\ldots{}here comes the compiled code\ldots}$ |
|
645 |
|
646 return |
|
647 .end method |
|
648 \end{lstlisting} |
|
649 \caption{Boilerplate code needed for running generated code.\label{boiler}} |
|
650 \end{figure} |
|
651 |
|
652 |
|
653 By generating code for a While-program, we end up with a list |
|
654 of (JVM assembly) instructions. Unfortunately, there is a bit |
666 of (JVM assembly) instructions. Unfortunately, there is a bit |
655 more boilerplate code needed before these instructions can be |
667 more boilerplate code needed before these instructions can be |
656 run. The complete code is shown in Figure~\ref{boiler}. This |
668 run. The complete code is shown in Figure~\ref{boiler}. This |
657 boilerplate code is very specific to the JVM. If we target any |
669 boilerplate code is very specific to the JVM. If we target any |
658 other virtual machine or a machine language, then we would |
670 other virtual machine or a machine language, then we would |
661 is called \emph{before} the \pcode{main}-method in Lines 10 to |
673 is called \emph{before} the \pcode{main}-method in Lines 10 to |
662 17. Interesting are the Lines 11 and 12 where we hardwire that |
674 17. Interesting are the Lines 11 and 12 where we hardwire that |
663 the stack of our programs will never be larger than 200 and |
675 the stack of our programs will never be larger than 200 and |
664 that the maximum number of variables is also 200. This seem to |
676 that the maximum number of variables is also 200. This seem to |
665 be conservative default values that allow is to run some |
677 be conservative default values that allow is to run some |
666 simple While-programs. In a real compiler, we would of course |
678 simple WHILE-programs. In a real compiler, we would of course |
667 need to work harder and find out appropriate values for the |
679 need to work harder and find out appropriate values for the |
668 stack and local variables. |
680 stack and local variables. |
|
681 |
|
682 \begin{figure}[t] |
|
683 \begin{lstlisting}[mathescape,language=JVMIS,numbers=left] |
|
684 .class public XXX.XXX |
|
685 .super java/lang/Object |
|
686 |
|
687 .method public static main([Ljava/lang/String;)V |
|
688 .limit locals 200 |
|
689 .limit stack 200 |
|
690 |
|
691 $\textit{\ldots{}here comes the compiled code\ldots}$ |
|
692 |
|
693 return |
|
694 .end method |
|
695 \end{lstlisting} |
|
696 \caption{The boilerplate code needed for running generated code.\label{boiler}} |
|
697 \end{figure} |
|
698 |
669 |
699 |
670 To sum up, in Figure~\ref{test} is the complete code generated |
700 To sum up, in Figure~\ref{test} is the complete code generated |
671 for the slightly nonsensical program |
701 for the slightly nonsensical program |
672 |
702 |
673 \begin{lstlisting}[mathescape,language=While] |
703 \begin{lstlisting}[mathescape,language=While] |
681 bytecode is then understood by the JVM and can be run by just invoking |
711 bytecode is then understood by the JVM and can be run by just invoking |
682 the \pcode{java}-program. |
712 the \pcode{java}-program. |
683 |
713 |
684 |
714 |
685 \begin{figure}[p] |
715 \begin{figure}[p] |
686 \lstinputlisting[language=JVMIS]{../progs/test-small.j} |
716 \lstinputlisting[language=JVMIS,mathescape]{../progs/test-small.j} |
687 \caption{Generated code for a test program. This code can be |
717 \begin{tikzpicture}[remember picture,overlay] |
688 processed by an Java assembler producing a class-file, which |
718 \draw[|<->|,very thick] (LA.north) -- (LB.south) |
689 can be run by the {\tt{}java}-program.\label{test}} |
719 node[left=0mm,midway] {\footnotesize\texttt{x\,:=\,1\,+\,2}}; |
|
720 \draw[|<->|,very thick] (LC.north) -- (LD.south) |
|
721 node[left=0mm,midway] {\footnotesize\texttt{write x}}; |
|
722 \end{tikzpicture} |
|
723 \caption{The generated code for the test program \texttt{x := 1 + 2; write |
|
724 x}. This code can be processed by a Java assembler producing a |
|
725 class-file, which can then be run by the {\tt{}java}-program.\label{test}} |
690 \end{figure} |
726 \end{figure} |
691 |
727 |
692 \subsection*{Arrays} |
728 \subsection*{Arrays} |
693 |
729 |
694 Maybe a useful addition to the While-language would be arrays. This |
730 Maybe a useful addition to the WHILE-language would be arrays. This |
695 would let us generate more interesting While-programs by translating |
731 would allow us to generate more interesting WHILE-programs by |
696 BF*** programs into equivalent While-code. So in this section lets have |
732 translating BF*** programs into equivalent WHILE-code. Therefore in this |
697 a look at how we can support the following three constructions |
733 section let us have a look at how we can support the following three |
|
734 constructions |
698 |
735 |
699 \begin{lstlisting}[mathescape,language=While] |
736 \begin{lstlisting}[mathescape,language=While] |
700 new arr[15000] |
737 new(arr[15000]) |
701 x := 3 + arr[3 + y] |
738 x := 3 + arr[3 + y] |
702 arr[42 * n] := ... |
739 arr[42 * n] := ... |
703 \end{lstlisting} |
740 \end{lstlisting} |
704 |
741 |
705 \noindent |
742 \noindent |
706 The first construct is for creating new arrays, in this instance the |
743 The first construct is for creating new arrays. In this instance the |
707 name of the array is \pcode{arr} and it can hold 15000 integers. The |
744 name of the array is \pcode{arr} and it can hold 15000 integers. We do |
708 second is for referencing an array cell inside an arithmetic |
745 not support ``dynamic'' arrays, that is the size of our arrays will |
709 expression---we need to be able to look up the contents of an array at |
746 always be fixed. The second construct is for referencing an array cell |
710 an index determined by an arithmetic expression. Similarly in the line |
747 inside an arithmetic expression---we need to be able to look up the |
711 below, we need to be able to update the content of an array at an |
748 contents of an array at an index determined by an arithmetic expression. |
712 calculated index. |
749 Similarly in the line below, we need to be able to update the content of |
|
750 an array at an calculated index. |
713 |
751 |
714 For creating a new array we can generate the following three JVM |
752 For creating a new array we can generate the following three JVM |
715 instructions: |
753 instructions: |
716 |
754 |
717 \begin{lstlisting}[mathescape,language=JVMIS] |
755 \begin{lstlisting}[mathescape,language=JVMIS] |
719 newarray int |
757 newarray int |
720 astore loc_var |
758 astore loc_var |
721 \end{lstlisting} |
759 \end{lstlisting} |
722 |
760 |
723 \noindent |
761 \noindent |
724 First we need to put the dimension of the array onto the stack. The next |
762 First we need to put the size of the array onto the stack. The next |
725 instruction creates the array. With the last we can store the array as a |
763 instruction creates the array. In this case the array contains |
|
764 \texttt{int}s. With the last instruction we can store the array as a |
726 local variable (like the ``simple'' variables from the previous |
765 local variable (like the ``simple'' variables from the previous |
727 section). The use of a local variable for each array allows us to have |
766 section). The use of a local variable for each array allows us to have |
728 multiple arrays in a While-program. For looking up an element in an |
767 multiple arrays in a WHILE-program. For looking up an element in an |
729 array we can use the following JVM code |
768 array we can use the following JVM code |
730 |
769 |
731 \begin{lstlisting}[mathescape,language=JVMIS] |
770 \begin{lstlisting}[mathescape,language=JVMIS] |
732 aload loc_var |
771 aload loc_var |
733 index_aexp |
772 index_aexp |
734 iaload |
773 iaload |
735 \end{lstlisting} |
774 \end{lstlisting} |
736 |
775 |
737 \noindent |
776 \noindent |
738 The first instruction loads the ``pointer'' to the array onto the stack. |
777 The first instruction loads the ``pointer'', or local variable, to the |
739 Then we have some instructions corresponding to the index where we want |
778 array onto the stack. Then we have some instructions calculating the |
740 to look up the array. The idea is that these instructions will leave a |
779 index where we want to look up the array. The idea is that these |
741 concrete number on the stack, which will be the index into the array we |
780 instructions will leave a concrete number on the top of the stack, which |
742 need. Finally we need to tell the JVM to load the corresponding element |
781 will be the index into the array we need. Finally we need to tell the |
743 onto the stack. Updating an array at an index with a value is as |
782 JVM to load the corresponding element onto the stack. Updating an array |
744 follows. |
783 at an index with a value is as follows. |
745 |
784 |
746 \begin{lstlisting}[mathescape,language=JVMIS] |
785 \begin{lstlisting}[mathescape,language=JVMIS] |
747 aload loc_var |
786 aload loc_var |
748 index_aexp |
787 index_aexp |
749 value_aexp |
788 value_aexp |
750 iastore |
789 iastore |
751 \end{lstlisting} |
790 \end{lstlisting} |
752 |
791 |
753 \noindent |
792 \noindent |
754 Again the first instruction loads the ``pointer'' to the array onto the |
793 Again the first instruction loads the local variable of |
755 stack. Then we have some instructions corresponding to the index where |
794 the array onto the stack. Then we have some instructions calculating |
756 we want to update the array. After that come the instructions for with |
795 the index where we want to update the array. After that come the |
757 what value we want to update the array. The last line contains the |
796 instructions for with which value we want to update the array. The last |
758 instruction for updating the array. |
797 line contains the instruction for updating the array. |
759 |
798 |
760 Next we need to modify our grammar rules for our While-language: it |
799 Next we need to modify our grammar rules for our WHILE-language: it |
761 seems best to extend the rule for factors in arithmetic expressions with |
800 seems best to extend the rule for factors in arithmetic expressions with |
762 a rule for looking up an array. |
801 a rule for looking up an array. |
763 |
802 |
764 \begin{plstx}[rhs style=, margin=3cm] |
803 \begin{plstx}[rhs style=, margin=3cm] |
765 : \meta{E} ::= \meta{T} $+$ \meta{E} |
804 : \meta{E} ::= \meta{T} $+$ \meta{E} |
779 by an identifier and the brackets. There are two new rules for statements, |
818 by an identifier and the brackets. There are two new rules for statements, |
780 one for creating an array and one for array assignment: |
819 one for creating an array and one for array assignment: |
781 |
820 |
782 \begin{plstx}[rhs style=, margin=2cm, one per line] |
821 \begin{plstx}[rhs style=, margin=2cm, one per line] |
783 : \meta{Stmt} ::= \ldots |
822 : \meta{Stmt} ::= \ldots |
784 | \texttt{new}\; \meta{Id}\,[\,\meta{Num}\,] |
823 | \texttt{new}(\meta{Id}\,[\,\meta{Num}\,]) |
785 | \meta{Id}\,[\,\meta{E}\,]\,:=\,\meta{E}\\ |
824 | \meta{Id}\,[\,\meta{E}\,]\,:=\,\meta{E}\\ |
786 \end{plstx} |
825 \end{plstx} |
787 |
826 |
788 With this in place we can turn back to the idea of creating While |
827 With this in place we can turn back to the idea of creating |
789 programs by translating BF programs. This is a relatively easy task |
828 WHILE-programs by translating BF programs. This is a relatively easy |
790 because BF only has eight instructions (we will actually only have seven |
829 task because BF has only eight instructions (we will actually implement |
791 because we can omit the read-in instruction from BF). But also translating |
830 seven because we can omit the read-in instruction from BF). What makes |
792 BF-loops is going to be easy since they straightforwardly can be |
831 this translation easy is that BF-loops can be straightforwardly |
793 represented by While-loops. The Scala code for the translation is |
832 represented as while-loops. The Scala code for the translation is as |
794 as follows: |
833 follows: |
795 |
834 |
796 \begin{lstlisting}[language=Scala,numbers=left] |
835 \begin{lstlisting}[language=Scala,numbers=left] |
797 def instr(c: Char) : String = c match { |
836 def instr(c: Char) : String = c match { |
798 case '>' => "ptr := ptr + 1;" |
837 case '>' => "ptr := ptr + 1;" |
799 case '<' => "ptr := ptr - 1;" |
838 case '<' => "ptr := ptr - 1;" |
800 case '+' => "field[ptr] := field[ptr] + 1;" |
839 case '+' => "mem[ptr] := mem [ptr] + 1;" |
801 case '-' => "field[ptr] := field[ptr] - 1;" |
840 case '-' => "mem [ptr] := mem [ptr] - 1;" |
802 case '.' => "x := field[ptr]; write x;" |
841 case '.' => "x := mem [ptr]; write x;" |
803 case '[' => "while (field[ptr] != 0) do {" |
842 case '[' => "while (mem [ptr] != 0) do {" |
804 case ']' => "skip};" |
843 case ']' => "skip};" |
805 case _ => "" |
844 case _ => "" |
806 } |
845 } |
807 \end{lstlisting} |
846 \end{lstlisting} |
808 |
847 |
809 \noindent |
848 \noindent |
810 The idea behind the translation is that BF-programs operate on an array, |
849 The idea behind the translation is that BF-programs operate on an array, |
811 called \texttt{field}. The BP-memory pointer into this array is |
850 called here \texttt{mem}. The BP-memory pointer into this array is |
812 represented as the variable \texttt{ptr}. The BF-instructions \code{>} |
851 represented as the variable \texttt{ptr}. As usual the BF-instructions |
813 and \code{<} increase, respectively decrease, \texttt{ptr}. The |
852 \code{>} and \code{<} increase, respectively decrease, \texttt{ptr}. The |
814 instructions \code{+} and \code{-} update a cell in \texttt{field}. In |
853 instructions \code{+} and \code{-} update a cell in \texttt{mem}. In |
815 Line 6 we need to first assign a field-cell to an auxiliary variable |
854 Line 6 we need to first assign a \texttt{mem}-cell to an auxiliary variable |
816 since we have not changed our write functions in order to cope with |
855 since we have not changed our write functions in order to cope with |
817 writing out any array-content directly. Lines 7 and 8 are for |
856 writing out any array-content directly. Lines 7 and 8 are for |
818 translating BF-loops. Line 8 is interesting in the sense that we need to |
857 translating BF-loops. Line 8 is interesting in the sense that we need to |
819 generate a \code{skip} instruction just before finishing with the |
858 generate a \code{skip} instruction just before finishing with the |
820 closing \code{"\}"}. The reason is that we are rather pedantic about |
859 closing \code{"\}"}. The reason is that we are rather pedantic about |
821 semicolons in our While-grammar: the last command cannot have a |
860 semicolons in our WHILE-grammar: the last command cannot have a |
822 semicolon---adding a \code{skip} works around this snag. Putting |
861 semicolon---adding a \code{skip} works around this snag. Putting all |
823 all this together is we can generate While-programs with more than |
862 this together is we can generate WHILE-programs with more than 400 |
824 400 instructions and then run the compiled JVM code for such programs. |
863 instructions and then run the compiled JVM code for such programs. |
825 |
864 \bigskip |
|
865 |
|
866 \noindent |
|
867 Hooooray, we can finally run the BF-mandelbrot program on the JVM and it |
|
868 completes within 20 seconds (after nearly 10 minutes of parsing the |
|
869 corresponding WHILE-program and generating 270K of a class file). Try |
|
870 replicating the 20 secs with an interpreter! OK, we now face the |
|
871 nagging question about what to do next\ldots |
|
872 |
|
873 \subsection*{Added Value} |
|
874 |
|
875 As you have probably seen, the compiler writer has a lot of freedom |
|
876 about how to generate code from what the progarmmer wrote as program. |
|
877 The only condition is that generated code should behave as expected by |
|
878 the programmer. Then all is fine\ldots mission accomplished! But |
|
879 sometimes the compiler writer is expected to go an extra mile, or even |
|
880 miles. Suppose we are given the following WHILE-program: |
|
881 |
|
882 \begin{lstlisting}[mathescape,language=While] |
|
883 new(arr[10]); |
|
884 arr[14] := 3 + arr[13] |
|
885 \end{lstlisting} |
|
886 |
|
887 \noindent |
|
888 While admittedly this is a contrived program, and probably not meant to |
|
889 be like this by any sane programmer, it is supposed to make the |
|
890 following point: We generate an array of size 10, and then try to access |
|
891 the non-existing element at index 13 and even updating element with |
|
892 index 14. Obviously this is baloney. However, our compiler generates |
|
893 code for this program without any questions asked. We can even run this |
|
894 code on the JVM\ldots of course the result is an exception trace where |
|
895 the JVM yells at us for doing naughty things. (This is much better than |
|
896 C, for example, where such errors are not prevented and as a result |
|
897 insidious attacks can be mounted against such kind C-programs. I assume |
|
898 everyone has heard about \emph{Buffer Overflow Attacks}.) |
|
899 |
|
900 Imagine we do not want to rely in our compiler on the JVM for producing |
|
901 an annoying, but safe exception trace, rather we want to handle such |
|
902 situations ourselves. Lets assume we want to handle them in the |
|
903 following way: if the programmer access a field out-of-bounds, we just |
|
904 return the default 0, and if a programmer wants to update an |
|
905 out-of-bounds filed, we quietly ignore this update. |
826 |
906 |
827 \end{document} |
907 \end{document} |
828 |
908 |
829 %%% Local Variables: |
909 %%% Local Variables: |
830 %%% mode: latex |
910 %%% mode: latex |