17 \section*{Handout 7 (Compilation)} |
17 \section*{Handout 7 (Compilation)} |
18 |
18 |
19 The purpose of a compiler is to transform a program a human can read and |
19 The purpose of a compiler is to transform a program a human can read and |
20 write into code the machine can run as fast as possible. The fastest |
20 write into code the machine can run as fast as possible. The fastest |
21 code would be machine code the CPU can run directly, but it is often |
21 code would be machine code the CPU can run directly, but it is often |
22 good enough for improving the speed of a program to target a |
22 good enough for improving the speed of a program to target a virtual |
23 virtual machine. This produces not the fastest possible code, but code |
23 machine instead. This produces not the fastest possible code, but code |
24 that is often pretty fast. This way of producing code has the advantage that |
24 that is often pretty fast. This way of producing code has the advantage |
25 the virtual machine takes care of things a compiler would normally need |
25 that the virtual machine takes care of things a compiler would normally |
26 to take care of (like explicit memory management). |
26 need to take care of (hairy things like explicit memory management). |
27 |
27 |
28 As a first example in this module we will implement a compiler for the |
28 As a first example in this module we will implement a compiler for the |
29 very simple WHILE-language that we parsed in the last lecture. The |
29 very simple WHILE-language that we parsed in the last lecture. The |
30 compiler will target the Java Virtual Machine (JVM), but not directly. |
30 compiler will target the Java Virtual Machine (JVM), but not directly. |
31 Pictorially the compiler will work as follows: |
31 Pictorially the compiler will work as follows: |
49 \end{tikzpicture} |
49 \end{tikzpicture} |
50 \end{center} |
50 \end{center} |
51 |
51 |
52 \noindent |
52 \noindent |
53 The input will be WHILE-programs; the output will be assembly files |
53 The input will be WHILE-programs; the output will be assembly files |
54 (with the ending .j). Assembly files essentially contain human-readable |
54 (with the file extension .j). Assembly files essentially contain |
55 machine code, meaning they are not just bits and bytes, but rather |
55 human-readable machine code, meaning they are not just bits and bytes, |
56 something you can read and understand---with a bit of practice of |
56 but rather something you can read and understand---with a bit of |
57 course. An \emph{assembler} will then translate the assembly files into |
57 practice of course. An \emph{assembler} will then translate the assembly |
58 unreadable class or binary files the JVM can run. Unfortunately, the |
58 files into unreadable class or binary files the JVM can run. |
59 Java ecosystem does not come with an assembler which would be handy for |
59 Unfortunately, the Java ecosystem does not come with an assembler which |
60 our compiler-endeavour (unlike Microsoft's Common Language |
60 would be handy for our compiler-endeavour (unlike Microsoft's Common |
61 Infrastructure for the .Net platform which has an assembler |
61 Language Infrastructure for the .Net platform which has an assembler |
62 out-of-the-box). As a substitute we shall therefore use the 3rd-party |
62 out-of-the-box). As a substitute we shall therefore use the 3rd-party |
63 programs Jasmin and Krakatau |
63 programs Jasmin and Krakatau |
64 |
64 |
65 \begin{itemize} |
65 \begin{itemize} |
66 \item \url{http://jasmin.sourceforge.net} |
66 \item \url{http://jasmin.sourceforge.net} |
84 ldc 1 |
84 ldc 1 |
85 ldc 2 |
85 ldc 2 |
86 iadd |
86 iadd |
87 \end{lstlisting} |
87 \end{lstlisting} |
88 |
88 |
89 \noindent The first instruction loads the constant $1$ onto |
89 \noindent The first instruction loads the constant $1$ onto the stack, |
90 the stack, the next one loads $2$, the third instruction adds both |
90 the next one loads $2$, the third instruction adds both numbers together |
91 numbers together replacing the top two elements of the stack |
91 replacing the top two elements of the stack with the result $3$. For |
92 with the result $3$. For simplicity, we will throughout |
92 simplicity, we will consider throughout only integer numbers. This means |
93 consider only integer numbers. Therefore we can |
93 our main JVM instructions for arithmetic will be \code{iadd}, |
94 use the JVM instructions \code{iadd}, \code{isub}, |
94 \code{isub}, \code{imul}, \code{idiv} and so on. The \code{i} stands for |
95 \code{imul}, \code{idiv} and so on. The \code{i} stands for |
95 integer instructions in the JVM (alternatives are \code{d} for doubles, |
96 integer instructions in the JVM (alternatives are \code{d} for |
96 \code{l} for longs and \code{f} for floats etc). |
97 doubles, \code{l} for longs and \code{f} for floats). |
|
98 |
97 |
99 Recall our grammar for arithmetic expressions (\meta{E} is the |
98 Recall our grammar for arithmetic expressions (\meta{E} is the |
100 starting symbol): |
99 starting symbol): |
101 |
100 |
102 |
101 |
147 top of the stack (I leave this to you to verify; the meaning of each |
146 top of the stack (I leave this to you to verify; the meaning of each |
148 instruction should be clear). The result being on the top of the stack |
147 instruction should be clear). The result being on the top of the stack |
149 will be an important convention we always observe in our compiler. Note, |
148 will be an important convention we always observe in our compiler. Note, |
150 that a different bracketing of the expression, for example $(1 + (2 * |
149 that a different bracketing of the expression, for example $(1 + (2 * |
151 3)) + (4 - 3)$, produces a different abstract syntax tree and thus also |
150 3)) + (4 - 3)$, produces a different abstract syntax tree and thus also |
152 a different list of instructions. Generating code in this |
151 a different list of instructions. |
153 post-order-traversal fashion is rather easy to implement: it can be done |
152 |
154 with the following recursive \textit{compile}-function, which takes the |
153 Generating code in this post-order-traversal fashion is rather easy to |
155 abstract syntax tree as an argument: |
154 implement: it can be done with the following recursive |
|
155 \textit{compile}-function, which takes the abstract syntax tree as an |
|
156 argument: |
156 |
157 |
157 \begin{center} |
158 \begin{center} |
158 \begin{tabular}{lcl} |
159 \begin{tabular}{lcl} |
159 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\ |
160 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\ |
160 $\textit{compile}(a_1 + a_2)$ & $\dn$ & |
161 $\textit{compile}(a_1 + a_2)$ & $\dn$ & |
166 $\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & |
167 $\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & |
167 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\ |
168 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\ |
168 \end{tabular} |
169 \end{tabular} |
169 \end{center} |
170 \end{center} |
170 |
171 |
171 This is all fine, but our arithmetic expressions can clearly contain |
172 \noindent |
172 variables and then this code will not be good enough. To fix this we |
173 This is all fine, but our arithmetic expressions can contain variables |
173 will represent our variables as the \emph{local variables} in the JVM. |
174 and we have not considered them yet. To fix this we will represent our |
174 Essentially, local variables are an array or pointers to memory cells, |
175 variables as the \emph{local variables} of the JVM. Essentially, local |
175 containing in our case only integers. Looking up a variable can be done |
176 variables are an array or pointers to memory cells, containing in our |
176 with the instruction |
177 case only integers. Looking up a variable can be done with the |
|
178 instruction |
177 |
179 |
178 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
180 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
179 iload $index$ |
181 iload $index$ |
180 \end{lstlisting} |
182 \end{lstlisting} |
181 |
183 |
610 |
612 |
611 \noindent |
613 \noindent |
612 As said, I leave it to you to decide whether the code implements |
614 As said, I leave it to you to decide whether the code implements |
613 the usual controlflow of while-loops. |
615 the usual controlflow of while-loops. |
614 |
616 |
615 Next we need to consider the statement \pcode{write x}, which can be |
617 Next we need to consider the WHILE-statement \pcode{write x}, which can |
616 used to print out the content of a variable. For this we shall use a |
618 be used to print out the content of a variable. For this we shall use a |
617 Java library function. In order to avoid having to generate a lot of |
619 Java library function. In order to avoid having to generate a lot of |
618 code for each \pcode{write}-command, we use a separate helper-method and |
620 code for each \pcode{write}-command, we use a separate helper-method and |
619 just call this method with an appropriate argument (which of course |
621 just call this method with an appropriate argument (which of course |
620 needs to be placed onto the stack). The code of the helper-method is as |
622 needs to be placed onto the stack). The code of the helper-method is as |
621 follows. |
623 follows. |
622 |
624 |
623 \begin{lstlisting}[language=JVMIS,numbers=left] |
625 \begin{lstlisting}[language=JVMIS,numbers=left,basicstyle=\ttfamily\small] |
624 .method public static write(I)V |
626 .method public static write(I)V |
625 .limit locals 1 |
627 .limit locals 1 |
626 .limit stack 2 |
628 .limit stack 2 |
627 getstatic java/lang/System/out Ljava/io/PrintStream; |
629 getstatic java/lang/System/out Ljava/io/PrintStream; |
628 iload 0 |
630 iload 0 |
629 invokevirtual java/io/PrintStream/println(I)V |
631 invokevirtual java/io/PrintStream/println(I)V |
630 return |
632 return |
631 .end method |
633 .end method |
632 \end{lstlisting} |
634 \end{lstlisting} |
633 |
635 |
634 \noindent The first line marks the beginning of the method, |
636 \noindent The first line marks the beginning of the method, called |
635 called \pcode{write}. It takes a single integer argument |
637 \pcode{write}. It takes a single integer argument indicated by the |
636 indicated by the \pcode{(I)} and returns no result, indicated |
638 \pcode{(I)} and returns no result, indicated by the \pcode{V} (for |
637 by the \pcode{V}. Since the method has only one argument, we |
639 void). Since the method has only one argument, we only need a single |
638 only need a single local variable (Line~2) and a stack with |
640 local variable (Line~2) and a stack with two cells will be sufficient |
639 two cells will be sufficient (Line 3). Line 4 instructs the |
641 (Line 3). Line 4 instructs the JVM to get the value of the member |
640 JVM to get the value of the mem \pcode{out} of the class |
642 \pcode{out} of the class \pcode{java/lang/System}. It expects the value |
641 \pcode{java/lang/System}. It expects the value to be of type |
643 to be of type \pcode{java/io/PrintStream}. A reference to this value |
642 \pcode{java/io/PrintStream}. A reference to this value will be |
644 will be placed on the stack.\footnote{Note the syntax \texttt{L |
643 placed on the stack. Line~5 copies the integer we want to |
645 \ldots{};} for the \texttt{PrintStream} type is not an typo. Somehow the |
644 print out onto the stack. In the next line we call the method |
646 designers of Jasmin decided that this syntax is pleasing to the eye. So |
645 \pcode{println} (from the class \pcode{java/io/PrintStream}). |
647 if you wanted to have strings in your Jasmin code, you would need to |
646 We want to print out an integer and do not expect anything |
648 write \texttt{Ljava/lang/String;}\;. If you want arrays of one dimension, |
647 back (that is why the type annotation is \pcode{(I)V}). The |
649 then use \texttt{[\ldots}; two dimensions, use \texttt{[[\ldots} and |
648 \pcode{return}-instruction in the next line changes the |
650 so on. Looks all very ugly to my eyes.} Line~5 copies the integer we |
649 control-flow back to the place from where \pcode{write} was |
651 want to print out onto the stack. In the line after that we call the |
650 called. This method needs to be part of a header that is |
652 method \pcode{println} (from the class \pcode{java/io/PrintStream}). We |
651 included in any code we generate. The helper-method |
653 want to print out an integer and do not expect anything back (that is |
652 \pcode{write} can be invoked with the two instructions |
654 why the type annotation is \pcode{(I)V}). The \pcode{return}-instruction |
|
655 in the next line changes the control-flow back to the place from where |
|
656 \pcode{write} was called. This method needs to be part of a header that |
|
657 is included in any code we generate. The helper-method \pcode{write} can |
|
658 be invoked with the two instructions |
653 |
659 |
654 \begin{lstlisting}[mathescape,language=JVMIS] |
660 \begin{lstlisting}[mathescape,language=JVMIS] |
655 iload $E(x)$ |
661 iload $E(x)$ |
656 invokestatic XXX/XXX/write(I)V |
662 invokestatic XXX/XXX/write(I)V |
657 \end{lstlisting} |
663 \end{lstlisting} |
660 top of the stack and then call \pcode{write}. The \pcode{XXX} |
666 top of the stack and then call \pcode{write}. The \pcode{XXX} |
661 need to be replaced by an appropriate class name (this will be |
667 need to be replaced by an appropriate class name (this will be |
662 explained shortly). |
668 explained shortly). |
663 |
669 |
664 |
670 |
665 By generating code for a WHILE-program, we end up with a list |
671 By generating code for a WHILE-program, we end up with a list of (JVM |
666 of (JVM assembly) instructions. Unfortunately, there is a bit |
672 assembly) instructions. Unfortunately, there is a bit more boilerplate |
667 more boilerplate code needed before these instructions can be |
673 code needed before these instructions can be run. Essentially we have to |
668 run. The complete code is shown in Figure~\ref{boiler}. This |
674 enclose them inside a Java \texttt{main}-method. The corresponding code |
669 boilerplate code is very specific to the JVM. If we target any |
675 is shown in Figure~\ref{boiler}. This boilerplate code is very specific |
670 other virtual machine or a machine language, then we would |
676 to the JVM. If we target any other virtual machine or a machine |
671 need to change this code. Lines 4 to 8 in Figure~\ref{boiler} |
677 language, then we would need to change this code. Interesting are the |
672 contain a method for object creation in the JVM; this method |
678 Lines 5 and 6 where we hardwire that the stack of our programs will |
673 is called \emph{before} the \pcode{main}-method in Lines 10 to |
679 never be larger than 200 and that the maximum number of variables is |
674 17. Interesting are the Lines 11 and 12 where we hardwire that |
680 also 200. This seem to be conservative default values that allow is to |
675 the stack of our programs will never be larger than 200 and |
681 run some simple WHILE-programs. In a real compiler, we would of course |
676 that the maximum number of variables is also 200. This seem to |
682 need to work harder and find out appropriate values for the stack and |
677 be conservative default values that allow is to run some |
683 local variables. |
678 simple WHILE-programs. In a real compiler, we would of course |
|
679 need to work harder and find out appropriate values for the |
|
680 stack and local variables. |
|
681 |
684 |
682 \begin{figure}[t] |
685 \begin{figure}[t] |
683 \begin{lstlisting}[mathescape,language=JVMIS,numbers=left] |
686 \begin{lstlisting}[mathescape,language=JVMIS,numbers=left] |
684 .class public XXX.XXX |
687 .class public XXX.XXX |
685 .super java/lang/Object |
688 .super java/lang/Object |
707 |
712 |
708 \noindent I let you read the code and make sure the code behaves as |
713 \noindent I let you read the code and make sure the code behaves as |
709 expected. Having this code at our disposal, we need the assembler to |
714 expected. Having this code at our disposal, we need the assembler to |
710 translate the generated code into JVM bytecode (a class file). This |
715 translate the generated code into JVM bytecode (a class file). This |
711 bytecode is then understood by the JVM and can be run by just invoking |
716 bytecode is then understood by the JVM and can be run by just invoking |
712 the \pcode{java}-program. |
717 the \pcode{java}-program. Again I let you do the work. |
713 |
718 |
714 |
719 |
715 \begin{figure}[p] |
720 \begin{figure}[p] |
716 \lstinputlisting[language=JVMIS,mathescape]{../progs/test-small.j} |
721 \lstinputlisting[language=JVMIS,mathescape,basicstyle=\ttfamily\small]{../progs/test-small.j} |
717 \begin{tikzpicture}[remember picture,overlay] |
722 \begin{tikzpicture}[remember picture,overlay] |
718 \draw[|<->|,very thick] (LA.north) -- (LB.south) |
723 \draw[|<->|,very thick] (LA.north) -- (LB.south) |
719 node[left=0mm,midway] {\footnotesize\texttt{x\,:=\,1\,+\,2}}; |
724 node[left=0mm,midway] {\footnotesize\texttt{x\,:=\,1\,+\,2}}; |
720 \draw[|<->|,very thick] (LC.north) -- (LD.south) |
725 \draw[|<->|,very thick] (LC.north) -- (LD.south) |
721 node[left=0mm,midway] {\footnotesize\texttt{write x}}; |
726 node[left=0mm,midway] {\footnotesize\texttt{write x}}; |
870 replicating the 20 secs with an interpreter! OK, we now face the |
875 replicating the 20 secs with an interpreter! OK, we now face the |
871 nagging question about what to do next\ldots |
876 nagging question about what to do next\ldots |
872 |
877 |
873 \subsection*{Added Value} |
878 \subsection*{Added Value} |
874 |
879 |
|
880 % 33296 bytes -> 21882 |
|
881 % shave off 2 seconds |
|
882 |
875 As you have probably seen, the compiler writer has a lot of freedom |
883 As you have probably seen, the compiler writer has a lot of freedom |
876 about how to generate code from what the progarmmer wrote as program. |
884 about how to generate code from what the progarmmer wrote as program. |
877 The only condition is that generated code should behave as expected by |
885 The only condition is that generated code should behave as expected by |
878 the programmer. Then all is fine\ldots mission accomplished! But |
886 the programmer. Then all is fine\ldots mission accomplished! But |
879 sometimes the compiler writer is expected to go an extra mile, or even |
887 sometimes the compiler writer is expected to go an extra mile, or even |
900 Imagine we do not want to rely in our compiler on the JVM for producing |
908 Imagine we do not want to rely in our compiler on the JVM for producing |
901 an annoying, but safe exception trace, rather we want to handle such |
909 an annoying, but safe exception trace, rather we want to handle such |
902 situations ourselves. Lets assume we want to handle them in the |
910 situations ourselves. Lets assume we want to handle them in the |
903 following way: if the programmer access a field out-of-bounds, we just |
911 following way: if the programmer access a field out-of-bounds, we just |
904 return the default 0, and if a programmer wants to update an |
912 return the default 0, and if a programmer wants to update an |
905 out-of-bounds filed, we quietly ignore this update. |
913 out-of-bounds filed, we want to ``quietly'' ignore this update. |
|
914 |
|
915 |
|
916 arraylength |
|
917 |
906 |
918 |
907 \end{document} |
919 \end{document} |
908 |
920 |
909 %%% Local Variables: |
921 %%% Local Variables: |
910 %%% mode: latex |
922 %%% mode: latex |