\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../grammar}\usepackage{../graphics}\begin{document}\section*{Handout 7 (Compilation)}The purpose of a compiler is to transform a program, a humancan write, into code the machine can run as fast as possible.The fastest code would be machine code the CPU can rundirectly, but it is often enough to improve the speed of aprogram by just targeting a virtual machine. This produces notthe fastest possible code, but code that is fast enough andhas the advantage that the virtual machine care of things acompiler would normally need to take care of (like explicitmemory management).As an example we will implement a compiler for the very simpleWhile-language. We will be generating code for the JavaVirtual Machine. This is a stack-based virtual machine, a factwhich will make it easy to generate code for arithmeticexpressions. For example for generating code for theexpression $1 + 2$ we need to generate the following threeinstructions\begin{lstlisting}[numbers=none]ldc 1ldc 2iadd \end{lstlisting}\noindent The first instruction loads the constant $1$ ontothe stack, the next one $2$, the third instruction adds bothnumbers together replacing the top elements of the stack withthe result $3$. For simplicity, we will throughout consideronly integer numbers and results. Therefore we can use theinstructions \code{iadd}, \code{isub}, \code{imul},\code{idiv} and so on. The \code{i} stands for integerinstructions in the JVM (alternatives are \code{d} for doubles,\code{l} for longs and \code{f} for floats).Recall our grammar for arithmetic expressions ($E$ is thestarting symbol):\begin{plstx}[rhs style=, margin=3cm]: \meta{E} ::= \meta{T} $+$ \meta{E} | \meta{T} $-$ \meta{E} | \meta{T}\\: \meta{T} ::= \meta{F} $*$ \meta{T} | \meta{F} $\backslash$ \meta{T} | \meta{F}\\: \meta{F} ::= ( \meta{E} ) | \meta{Id} | \meta{Num}\\\end{plstx}\noindent where \meta{Id} stands for variables and\meta{Num} for numbers. For the moment let us omit variables fromarithmetic expressions. Our parser will take this grammar andproduce abstract syntax trees. Forexample for the expression $1 + ((2 * 3) + (4 - 3))$ it willproduce the following tree.\begin{center}\begin{tikzpicture}\Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]\end{tikzpicture}\end{center}\noindent To generate code for this expression, we need totraverse this tree in post-order fashion and emit code foreach node---this traversal in post-order fashion will producecode for a stack-machine (what the JVM is). Doing so for thetree above generates the instructions\begin{lstlisting}[numbers=none]ldc 1 ldc 2 ldc 3 imul ldc 4 ldc 3 isub iadd iadd\end{lstlisting}\noindent If we ``run'' these instructions, the result $8$will be on top of the stack (I leave this to you to verify;the meaning of each instruction should be clear). The resultbeing on the top of the stack will be a convention we alwaysobserve in our compiler, that is the results of arithmeticexpressions will always be on top of the stack. Note, that adifferent bracketing of the expression, for example $(1 + (2 *3)) + (4 - 3)$, produces a different abstract syntax tree andthus potentially also a different list of instructions.Generating code in this fashion is rather easy to implement:it can be done with the following \textit{compile}-function,which takes the abstract syntax tree as argument:\begin{center}\begin{tabular}{lcl}$\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\$\textit{compile}(a_1 + a_2)$ & $\dn$ &$\textit{compile}(a_1) \;@\;\textit{compile}(a_2)\;@\; \pcode{iadd}$\\$\textit{compile}(a_1 - a_2)$ & $\dn$ & $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{isub}$\\$\textit{compile}(a_1 * a_2)$ & $\dn$ & $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{imul}$\\$\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\\end{tabular}\end{center}However, our arithmetic expressions can also containvariables. We will represent them as \emph{local variables} inthe JVM. Essentially, local variables are an array or pointersto memory cells, containing in our case only integers. Lookingup a variable can be done with the instruction\begin{lstlisting}[mathescape,numbers=none]iload $index$\end{lstlisting}\noindent which places the content of the local variable $index$ onto the stack. Storing the top of the stack into a local variable can be done by the instruction\begin{lstlisting}[mathescape,numbers=none]istore $index$\end{lstlisting}\noindent Note that this also pops off the top of the stack.One problem we have to overcome, however, is that localvariables are addressed, not by identifiers, but by numbers(starting from $0$). Therefore our compiler needs to maintaina kind of environment where variables are associated tonumbers. This association needs to be unique: if we muddle upthe numbers, then we essentially confuse variables and theconsequence will usually be an erroneous result. Our extended\textit{compile}-function for arithmetic expressions willtherefore take two arguments: the abstract syntax tree and theenvironment, $E$, that maps identifiers to index-numbers.\begin{center}\begin{tabular}{lcl}$\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\$\textit{compile}(a_1 + a_2, E)$ & $\dn$ & $\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{iadd}$\\$\textit{compile}(a_1 - a_2, E)$ & $\dn$ &$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{isub}$\\$\textit{compile}(a_1 * a_2, E)$ & $\dn$ &$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{imul}$\\$\textit{compile}(a_1 \backslash a_2, E)$ & $\dn$ & $\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{idiv}$\\$\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\\end{tabular}\end{center}\noindent In the last line we generate the code for variableswhere $E(x)$ stands for looking up the environment to whichindex the variable $x$ maps to.There is a similar \textit{compile}-function for booleanexpressions, but it includes a ``trick'' to do with\pcode{if}- and \pcode{while}-statements. To explain the issuelet us explain first the compilation of statements of theWhile-language. The clause for \pcode{skip} is trivial, sincewe do not have to generate any instruction\begin{center}\begin{tabular}{lcl}$\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\\end{tabular}\end{center}\noindent Note that the \textit{compile}-function forstatements returns a pair, a list of instructions (in thiscase the empty list) and an environment for variables. Thereason for the environment is that assignments in theWhile-language might change the environment---clearly if avariable is used for the first time, we need to allocate a newindex and if it has been used before, we need to be able toretrieve the associated index. This is reflected inthe clause for compiling assignments:\begin{center}\begin{tabular}{lcl}$\text{compile}(x := a, E)$ & $\dn$ & $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$\end{tabular}\end{center}\noindent We first generate code for the right-hand side ofthe assignment and then add an \pcode{istore}-instruction atthe end. By convention the result of the arithmetic expression$a$ will be on top of the stack. After the \pcode{istore}instruction, the result will be stored in the indexcorresponding to the variable $x$. If the variable $x$ hasbeen used before in the program, we just need to look up whatthe index is and return the environment unchanged (that is inthis case $E' = E$). However, if this is the first encounter of the variable $x$ in the program, then we have to augment the environment and assign $x$ with the largest index in $E$plus one (that is $E' = E(x \mapsto largest\_index + 1)$). That means for the assignment $x := x + 1$ we generate thefollowing code\begin{lstlisting}[mathescape,numbers=none]iload $n_x$ldc 1iaddistore $n_x$\end{lstlisting}\noindent where $n_x$ is the index for the variable $x$.More complicated is the code for \pcode{if}-statments, say\begin{lstlisting}[mathescape,language={},numbers=none]if $b$ then $cs_1$ else $cs_2$\end{lstlisting}\noindent where $b$ is a boolean expression and the $cs_i$are the instructions for each \pcode{if}-branch. Lets assumewe already generated code for $b$ and $cs_{1/2}$. Then in thetrue-case the control-flow of the program needs to be\begin{center}\begin{tikzpicture}[node distance=2mm and 4mm, block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]\node (A1) [point] {};\node (b) [block, right=of A1] {code of $b$};\node (A2) [point, right=of b] {};\node (cs1) [block, right=of A2] {code of $cs_1$};\node (A3) [point, right=of cs1] {};\node (cs2) [block, right=of A3] {code of $cs_2$};\node (A4) [point, right=of cs2] {};\draw (A1) edge [->, black, line width=1mm] (b);\draw (b) edge [->, black, line width=1mm] (cs1);\draw (cs1) edge [->, black, line width=1mm] (A3);\draw (A3) edge [->, black, skip loop] (A4);\node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};\end{tikzpicture}\end{center}\noindent where we start with running the code for $b$; sincewe are in the true case we continue with running the code for$cs_1$. After this however, we must not run the code for$cs_2$, but always jump after the last instruction of $cs_2$(the code for the \pcode{else}-branch). Note that this jump isunconditional, meaning we always have to jump to the end of$cs_2$. The corresponding instruction of the JVM is\pcode{goto}. In case $b$ turns out to be false we need thecontrol-flow\begin{center}\begin{tikzpicture}[node distance=2mm and 4mm, block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]\node (A1) [point] {};\node (b) [block, right=of A1] {code of $b$};\node (A2) [point, right=of b] {};\node (cs1) [block, right=of A2] {code of $cs_1$};\node (A3) [point, right=of cs1] {};\node (cs2) [block, right=of A3] {code of $cs_2$};\node (A4) [point, right=of cs2] {};\draw (A1) edge [->, black, line width=1mm] (b);\draw (b) edge [->, black, line width=1mm] (A2);\draw (A2) edge [skip loop] (A3);\draw (A3) edge [->, black, line width=1mm] (cs2);\draw (cs2) edge [->,black, line width=1mm] (A4);\node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};\end{tikzpicture}\end{center}\noindent where we now need a conditional jump (if theif-condition is false) from the end of the code for the boolean to the beginning of the instructions $cs_2$. Once we are finished with running $cs_2$ we can continue with whatevercode comes after the if-statement.The \pcode{goto} and conditional jumps need addresses to wherethe jump should go. Since we are generating assembly code forthe JVM, we do not actually have to give addresses, but needto attach labels to our code. These labels specify a targetfor a jump. Therefore the labels need to be unique, asotherwise it would be ambiguous where a jump should go. A labels, say \pcode{L}, is attached to code like\begin{lstlisting}[mathescape,numbers=none]L: $instr_1$ $instr_2$ $\vdots$\end{lstlisting}Recall the ``trick'' with compiling boolean expressions: the \textit{compile}-function for boolean expressions takes threearguments: an abstract syntax tree, an environment for variable indices and also the label, $lab$, to where an conditional jump needs to go. The clause for the expression $a_1 = a_2$, for example, is as follows:\begin{center}\begin{tabular}{lcl}$\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ \multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}\end{tabular}\end{center}\noindent We are generating code for the subexpressions $a_1$ and $a_2$. This will mean after running the corresponding code there willbe two integers on top of the stack. If they are equal, we do not have to do anything and just continue with the next instructions (see control-flow of ifs above). However if they are \emph{not} equal, then we need to (conditionally) jump to the label $lab$. This can be done with the instruction\begin{lstlisting}[mathescape,numbers=none]if_icmpne $lab$\end{lstlisting}\noindent Other jump instructions for boolean operators are\begin{center}\begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}$=$ & $\Rightarrow$ & \pcode{if_icmpne}\\$\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\$<$ & $\Rightarrow$ & \pcode{if_icmpge}\\$\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\\end{tabular}\end{center}\noindent and so on. I leave it to you to extend the\textit{compile}-function for the other boolean expressions.Note that we need to jump whenever the boolean is \emph{not}true, which means we have to ``negate'' the jump---equalsbecomes not-equal, less becomes greater-or-equal. If you donot like this design (it can be the source of some nasty,hard-to-detect errors), you can also change the layout of thecode and first give the code for the else-branch and then forthe if-branch.We are now ready to give the compile function for if-statments--remember this function returns for staments a pair consisting of the code and an environment:\begin{center}\begin{tabular}{lcl}$\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\ \multicolumn{3}{l}{$\qquad L_\textit{ifelse}\;$ (fresh label)}\\\multicolumn{3}{l}{$\qquad L_\textit{ifend}\;$ (fresh label)}\\\multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\\multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\\multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, L_\textit{ifelse})$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;L_\textit{ifend}$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_\textit{ifelse}:$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_\textit{ifend}:, E'')$}\\\end{tabular}\end{center}\noindent In the first two lines we generate two fresh labelsfor the jump addresses (just before the else-branch and justafter). In the next two lines we generate the instructions forthe two branches, $is_1$ and $is_2$. The final code willbe first the code for $b$ (including the label just-before-the-else-branch), then the \pcode{goto} for afterthe else-branch, the label $L_\textit{ifesle}$, followed bythe instructions for the else-branch, followed by the after-the-else-branch label. Consider for example the if-statement:\begin{lstlisting}[mathescape,numbers=none,language={}]if 1 = 1 then x := 2 else y := 3\end{lstlisting}\noindent The generated code is as follows:\begin{lstlisting}[mathescape,language={}] ldc 1 ldc 2 if_icmpne L_ifelse $\quad\tikz[remember picture] \node (C) {\mbox{}};$ ldc 2 istore 0 goto L_ifend $\quad\tikz[remember picture] \node (A) {\mbox{}};$L_ifelse: $\quad\tikz[remember picture] \node[] (D) {\mbox{}};$ ldc 3 istore 1L_ifend: $\quad\tikz[remember picture] \node[] (B) {\mbox{}};$\end{lstlisting}\begin{tikzpicture}[remember picture,overlay] \draw[->,very thick] (A) edge [->,to path={-- ++(10mm,0mm) -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (B.east); \draw[->,very thick] (C) edge [->,to path={-- ++(10mm,0mm) -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (D.east);\end{tikzpicture}\noindent The first three lines correspond to the the boolean expression $1 = 1$. The jump for when this boolean expressionis false is in Line~3. Lines 4-6 corresponds to the if-branch; the else-branch is in Lines 8 and 9. Note carefully how theenvironment $E$ is threaded through the calls of \textit{compile}. The function receives an environment $E$, but it might extend it when compiling the if-branch, yielding $E'$. This happens for example in the if-statement above whenever the variable \code{x} has not been used before. Similarly with the environment $E''$ for the second callto \textit{compile}. $E''$ is also the environment that needto be returned.The compilation of the while-loops, say \pcode{while} $b$ \pcode{do} $cs$, is very similar. In casethe condition is true and we need to do another iteration, the control-flow needs to be as follows\begin{center}\begin{tikzpicture}[node distance=2mm and 4mm, block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]\node (A0) [point, left=of A1] {};\node (A1) [point] {};\node (b) [block, right=of A1] {code of $b$};\node (A2) [point, right=of b] {};\node (cs1) [block, right=of A2] {code of $cs$};\node (A3) [point, right=of cs1] {};\node (A4) [point, right=of A3] {};\draw (A0) edge [->, black, line width=1mm] (b);\draw (b) edge [->, black, line width=1mm] (cs1);\draw (cs1) edge [->, black, line width=1mm] (A3);\draw (A3) edge [->,skip loop] (A1);\end{tikzpicture}\end{center}\noindent While if the condition is \emph{not} true weneed to jump out of the loop, which gives the followingcontrol flow.\begin{center}\begin{tikzpicture}[node distance=2mm and 4mm, block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]\node (A0) [point, left=of A1] {};\node (A1) [point] {};\node (b) [block, right=of A1] {code of $b$};\node (A2) [point, right=of b] {};\node (cs1) [block, right=of A2] {code of $cs$};\node (A3) [point, right=of cs1] {};\node (A4) [point, right=of A3] {};\draw (A0) edge [->, black, line width=1mm] (b);\draw (b) edge [->, black, line width=1mm] (A2);\draw (A2) edge [skip loop] (A3);\draw (A3) edge [->, black, line width=1mm] (A4);\end{tikzpicture}\end{center}\noindent Again we can use the \textit{compile}-function forboolean expressions to insert the appropriate jump to theend of the loop (label $L_{wend}$).\begin{center}\begin{tabular}{lcl}$\textit{compile}(\pcode{while}\; b\; \pcode{do} \;cs, E)$ & $\dn$\\ \multicolumn{3}{l}{$\qquad L_{wbegin}\;$ (fresh label)}\\\multicolumn{3}{l}{$\qquad L_{wend}\;$ (fresh label)}\\\multicolumn{3}{l}{$\qquad (is, E') = \textit{compile}(cs_1, E)$}\\\multicolumn{3}{l}{$\qquad(L_{wbegin}:$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\;\textit{compile}(b, E, L_{wend})$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\;is$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\; \text{goto}\;l_{wbegin}$}\\\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_{wend}:, E')$}\\\end{tabular}\end{center}Next we need to consider the statement \pcode{write x}, whichcan be used to write out the content of a variable. For thiswe need to use a Java library function. In order to avoidhaving to generate a lot of code for each\pcode{write}-command, we use a separate method and just callthis method with an argument (which needs to be placed on thestack). The code is as follows.\begin{lstlisting}[language=JVMIS].method public static write(I)V .limit locals 1 .limit stack 2 getstatic java/lang/System/out Ljava/io/PrintStream; iload 0 invokevirtual java/io/PrintStream/println(I)V return .end method\end{lstlisting}\noindent The first line marks the beginning of the method,called \pcode{write}. It takes a single integer argumentindicated by the \pcode{(I)} and returns no result, indicatedby the \pcode{V}. Since the method has only one argument, weonly need a single local variable (Line~2) and a stack withtwo cells will be sufficient (Line 3). Line 4 instructs theJVM to get the value of the field \pcode{out} of the class\pcode{java/lang/System}. It expects the value to be of type\pcode{java/io/PrintStream}. A reference to this value will beplaced on the stack. Line~5 copies the integer we want toprint out onto the stack. In the next line we call the method\pcode{println} (from the class \pcode{java/io/PrintStream}).We want to print out an integer and do not expect anythingback (that is why the type annotation \pcode{(I)V}). The\pcode{return}-instruction in the next line changes thecontrol-flow back to the place from where \pcode{write} wascalled. This method needs to be part of a header that isincluded in any code we generate. The method \pcode{write} canbe invoked with the two instructions\begin{lstlisting}[mathescape,language=JVMIS]iload $E(x)$ invokestatic XXX/XXX/write(I)V\end{lstlisting}\noindent where we first place the variable to be printed onthe stack and then call \pcode{write}. The \pcode{XXX} need tobe replaced by appropriate class name (this will be explainedshortly).\begin{figure}[t]\begin{lstlisting}[mathescape,language=JVMIS].class public XXX.XXX.super java/lang/Object.method public <init>()V aload_0 invokenonvirtual java/lang/Object/<init>()V return.end method.method public static main([Ljava/lang/String;)V .limit locals 200 .limit stack 200 $\textit{\ldots{}here comes the compiled code\ldots}$ return.end method\end{lstlisting}\caption{Boilerplate code needed for running generated code.\label{boiler}}\end{figure}By generating code for a While-program, we end up with a listof (JVM assembly) instructions. Unfortunately, there is a bitmore boilerplate code needed before these instructions can berun. The code is shown in Figure~\ref{boiler}. Thisboilerplate code is very specific to the JVM. Lines 4 to 8 contains a method for object creation in the JVM and is called \emph{before} the \pcode{main}-method in Lines 10 to 17. Interesting are the Lines 11 and 12 where we hardwire that the stack of our program will never be larger than 200 and that the maximum number of variables is also 200. This seem conservative default values that allow is to run some simpleWhile-programs. In a real compiler, we would of course need towork harder and find out appropriate values for the stack and local variables.To sum up, in Figure~\ref{test} is the complete code generatedfor the slightly non-sensical program\begin{lstlisting}[mathescape,language=While]x := 1 + 2;write x\end{lstlisting}\noindent Having this code at our disposal, we need theassembler to translate the generated code into bytecode (aclass file) that can be understood by the JVM.\begin{figure}[p]\lstinputlisting{../progs/test-small.j}\caption{Generated code for a test program.\label{test}}\end{figure}\end{document}%%% Local Variables: %%% mode: latex %%% TeX-master: t%%% End: