% handouts/ho07.tex
% author: Christian Urban <christian dot urban at kcl dot ac dot uk>
% Tue, 17 Nov 2015 04:38:01 +0000

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
\usepackage{../graphics}


\begin{document}

\section*{Handout 7 (Compilation)}

The purpose of a compiler is to transform a program that a human
can write into code the machine can run as fast as possible.
The fastest code would be machine code the CPU can run
directly, but it is often enough to improve the speed of a
program by just targeting a virtual machine. This does not produce
the fastest possible code, but code that is fast enough and
has the advantage that the virtual machine takes care of things a
compiler would normally need to take care of (like explicit
memory management).

As an example we will implement a compiler for the very simple
While-language. We will be generating code for the Java
Virtual Machine. This is a stack-based virtual machine, a fact
which will make it easy to generate code for arithmetic
expressions. For example, to generate code for the
expression $1 + 2$ we need the following three
instructions:

\begin{lstlisting}[numbers=none]
ldc 1
ldc 2
iadd 
\end{lstlisting}

\noindent The first instruction loads the constant $1$ onto
the stack, the next one loads $2$, and the third instruction adds both
numbers together, replacing the top two elements of the stack with
the result $3$. For simplicity, we will consider
only integer numbers and results throughout. Therefore we can use the
instructions \code{iadd}, \code{isub}, \code{imul},
\code{idiv} and so on. The \code{i} stands for integer
instructions in the JVM (alternatives are \code{d} for doubles,
\code{l} for longs and \code{f} for floats).
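
\noindent As a small illustration, the three instructions for
$1 + 2$ change the stack as follows. The comments are our own
annotation and show the stack after each instruction, with the
top of the stack on the right.

\begin{lstlisting}[numbers=none]
ldc 1      ; stack: 1
ldc 2      ; stack: 1 2
iadd       ; stack: 3
\end{lstlisting}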

Recall our grammar for arithmetic expressions ($E$ is the
starting symbol):


\begin{plstx}[rhs style=, margin=3cm]
: \meta{E} ::= \meta{T} $+$ \meta{E}
         | \meta{T} $-$ \meta{E}
         | \meta{T}\\
: \meta{T} ::= \meta{F} $*$ \meta{T}
          | \meta{F} $\backslash$ \meta{T}
          | \meta{F}\\
: \meta{F} ::= ( \meta{E} )
          | \meta{Id}
          | \meta{Num}\\
\end{plstx}


\noindent where \meta{Id} stands for variables and
\meta{Num} for numbers. For the moment let us omit variables from
arithmetic expressions. Our parser will take this grammar and
produce abstract syntax trees. For
example for the expression $1 + ((2 * 3) + (4 - 3))$ it will
produce the following tree.

\begin{center}
\begin{tikzpicture}
\Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
\end{tikzpicture}
\end{center}

\noindent To generate code for this expression, we need to
traverse this tree in post-order fashion and emit code for
each node. Such a post-order traversal produces exactly the
kind of code a stack-machine needs (and the JVM is a
stack-machine). Doing so for the
tree above generates the instructions

\begin{lstlisting}[numbers=none]
ldc 1 
ldc 2 
ldc 3 
imul 
ldc 4 
ldc 3 
isub 
iadd 
iadd
\end{lstlisting}

\noindent If we ``run'' these instructions, the result $8$
will be on top of the stack (I leave this to you to verify;
the meaning of each instruction should be clear). The result
being on top of the stack is a convention we always
observe in our compiler: the results of arithmetic
expressions will always be on top of the stack. Note that a
different bracketing of the expression, for example $(1 + (2 *
3)) + (4 - 3)$, produces a different abstract syntax tree and
thus potentially also a different list of instructions.
Generating code in this fashion is rather easy to implement:
it can be done with the following \textit{compile}-function,
which takes the abstract syntax tree as argument:

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
$\textit{compile}(a_1 + a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\;\textit{compile}(a_2)\;@\; \pcode{iadd}$\\
$\textit{compile}(a_1 - a_2)$ & $\dn$ & 
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{isub}$\\
$\textit{compile}(a_1 * a_2)$ & $\dn$ & 
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{imul}$\\
$\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & 
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\
\end{tabular}
\end{center}
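
\noindent As a worked example (our own unfolding of the clauses
above), consider the differently bracketed expression
$(1 + (2 * 3)) + (4 - 3)$. The outermost clause gives

\begin{center}
\begin{tabular}{l}
$\textit{compile}(1 + (2 * 3)) \;@\; \textit{compile}(4 - 3) \;@\; \pcode{iadd}$
\end{tabular}
\end{center}

\noindent and unfolding the two inner calls in the same way
yields the following instructions. The comments show the stack
after each instruction, with the top of the stack on the right.

\begin{lstlisting}[numbers=none]
ldc 1      ; stack: 1
ldc 2      ; stack: 1 2
ldc 3      ; stack: 1 2 3
imul       ; stack: 1 6
iadd       ; stack: 7
ldc 4      ; stack: 7 4
ldc 3      ; stack: 7 4 3
isub       ; stack: 7 1
iadd       ; stack: 8
\end{lstlisting}

\noindent The order of the instructions differs from the one for
$1 + ((2 * 3) + (4 - 3))$, but the result on top of the stack is
again $8$.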

However, our arithmetic expressions can also contain
variables. We will represent them as \emph{local variables} in
the JVM. Essentially, local variables are an array of
memory cells, containing in our case only integers. Looking
up a variable can be done with the instruction

\begin{lstlisting}[mathescape,numbers=none]
iload $index$
\end{lstlisting}

\noindent 
which places the content of the local variable $index$ onto 
the stack. Storing the top of the stack into a local variable 
can be done by the instruction

\begin{lstlisting}[mathescape,numbers=none]
istore $index$
\end{lstlisting}

\noindent Note that this also pops off the top of the stack.
One problem we have to overcome, however, is that local
variables are addressed, not by identifiers, but by numbers
(starting from $0$). Therefore our compiler needs to maintain
a kind of environment where variables are associated with
numbers. This association needs to be unique: if we muddle up
the numbers, then we essentially confuse variables and the
consequence will usually be an erroneous result. Our extended
\textit{compile}-function for arithmetic expressions will
therefore take two arguments: the abstract syntax tree and the
environment, $E$, that maps identifiers to index-numbers.

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\
$\textit{compile}(a_1 + a_2, E)$ & $\dn$ & 
$\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{iadd}$\\
$\textit{compile}(a_1 - a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{isub}$\\
$\textit{compile}(a_1 * a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{imul}$\\
$\textit{compile}(a_1 \backslash a_2, E)$ & $\dn$ & 
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{idiv}$\\
$\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\
\end{tabular}
\end{center}

\noindent In the last line we generate the code for variables,
where $E(x)$ stands for looking up in the environment the
index to which the variable $x$ maps.
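
\noindent For instance, assume an environment $E$ with
$E(x) = 0$ and $E(y) = 1$ (these indices are chosen purely for
illustration). For the expression $x + 2 * y$, which our grammar
parses as $x + (2 * y)$, the clauses above then give the
instructions

\begin{lstlisting}[numbers=none]
iload 0    ; value of x
ldc 2
iload 1    ; value of y
imul       ; 2 * y
iadd       ; x + (2 * y)
\end{lstlisting}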

There is a similar \textit{compile}-function for boolean
expressions, but it includes a ``trick'' to do with
\pcode{if}- and \pcode{while}-statements. To explain the issue
let us first look at the compilation of statements of the
While-language. The clause for \pcode{skip} is trivial, since
we do not have to generate any instructions:

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
\end{tabular}
\end{center}

\noindent Note that the \textit{compile}-function for
statements returns a pair, a list of instructions (in this
case the empty list) and an environment for variables. The
reason for the environment is that assignments in the
While-language might change the environment---clearly if a
variable is used for the first time, we need to allocate a new
index and if it has been used before, we need to be able to
retrieve the associated index. This is reflected in
the clause for compiling assignments:

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(x := a, E)$ & $\dn$ & 
$(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
\end{tabular}
\end{center}

\noindent We first generate code for the right-hand side of
the assignment and then add an \pcode{istore}-instruction at
the end. By convention the result of the arithmetic expression
$a$ will be on top of the stack. After the \pcode{istore}
instruction, the result will be stored in the index
corresponding to the variable $x$. If the variable $x$ has
been used before in the program, we just need to look up what
the index is and return the environment unchanged (that is in
this case $E' = E$). However, if this is the first encounter 
of the variable $x$ in the program, then we have to augment 
the environment and assign to $x$ the largest index in $E$
plus one (that is $E' = E(x \mapsto largest\_index + 1)$). 
This means that for the assignment $x := x + 1$ we generate the
following code

\begin{lstlisting}[mathescape,numbers=none]
iload $n_x$
ldc 1
iadd
istore $n_x$
\end{lstlisting}

\noindent 
where $n_x$ is the index for the variable $x$.
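
To illustrate how the environment grows, here is a small worked
example of our own. We assume that we start with the empty
environment and that the first variable is given index $0$ (the
text above only says that indices start from $0$). Compiling
$x := 5$ then yields

\begin{lstlisting}[numbers=none]
ldc 5
istore 0   ; x is new and gets index 0
\end{lstlisting}

\noindent together with the environment $\{x \mapsto 0\}$.
Compiling $y := x + 1$ in this extended environment yields

\begin{lstlisting}[numbers=none]
iload 0    ; x is looked up and has index 0
ldc 1
iadd
istore 1   ; y is new and gets index 1
\end{lstlisting}

\noindent together with the environment
$\{x \mapsto 0, y \mapsto 1\}$.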

More complicated is the code for \pcode{if}-statements, say

\begin{lstlisting}[mathescape,language={},numbers=none]
if $b$ then $cs_1$ else $cs_2$
\end{lstlisting}

\noindent where $b$ is a boolean expression and the $cs_i$
are the statements for each \pcode{if}-branch. Let us assume
we already generated code for $b$ and $cs_{1/2}$. Then in the
true-case the control-flow of the program needs to be
\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A1) [point] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs_1$};
\node (A3) [point, right=of cs1] {};
\node (cs2) [block, right=of A3] {code of $cs_2$};
\node (A4) [point, right=of cs2] {};

\draw (A1) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (cs1);
\draw (cs1) edge [->, black, line width=1mm] (A3);
\draw (A3) edge [->, black, skip loop] (A4);
\node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};
\end{tikzpicture}
\end{center}

\noindent where we start with running the code for $b$; since
we are in the true case we continue with running the code for
$cs_1$. After this however, we must not run the code for
$cs_2$, but always jump after the last instruction of $cs_2$
(the code for the \pcode{else}-branch). Note that this jump is
unconditional, meaning we always have to jump to the end of
$cs_2$. The corresponding instruction of the JVM is
\pcode{goto}. In case $b$ turns out to be false we need the
control-flow

\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A1) [point] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs_1$};
\node (A3) [point, right=of cs1] {};
\node (cs2) [block, right=of A3] {code of $cs_2$};
\node (A4) [point, right=of cs2] {};

\draw (A1) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (A2);
\draw (A2) edge [skip loop] (A3);
\draw (A3) edge [->, black, line width=1mm] (cs2);
\draw (cs2) edge [->,black, line width=1mm] (A4);
\node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};
\end{tikzpicture}
\end{center}

\noindent where we now need a conditional jump (if the
if-condition is false) from the end of the code for the 
boolean to the beginning of the instructions $cs_2$. Once we 
are finished with running $cs_2$ we can continue with whatever
code comes after the if-statement.

The \pcode{goto} and conditional jumps need addresses to where
the jump should go. Since we are generating assembly code for
the JVM, we do not actually have to give addresses, but need
to attach labels to our code. These labels specify a target
for a jump. Therefore the labels need to be unique, as
otherwise it would be ambiguous where a jump should go. 
A label, say \pcode{L}, is attached to code like

\begin{lstlisting}[mathescape,numbers=none]
L:
  $instr_1$
  $instr_2$
    $\vdots$
\end{lstlisting}
 
Recall the ``trick'' with compiling boolean expressions: the 
\textit{compile}-function for boolean expressions takes three
arguments: an abstract syntax tree, an environment for 
variable indices and also the label, $lab$, to where a conditional 
jump needs to go. The clause for the expression $a_1 = a_2$, 
for example, is as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ 
\multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}
\end{tabular}
\end{center}

\noindent 
We are generating code for the subexpressions $a_1$ and $a_2$. 
This will mean after running the corresponding code there will
be two integers on top of the stack. If they are equal, we do 
not have to do anything and just continue with the next 
instructions (see control-flow of ifs above). However if they 
are \emph{not} equal, then we need to (conditionally) jump to 
the label $lab$. This can be done with the instruction

\begin{lstlisting}[mathescape,numbers=none]
if_icmpne $lab$
\end{lstlisting}

\noindent Other jump instructions for boolean operators are

\begin{center}
\begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
$=$ & $\Rightarrow$ & \pcode{if_icmpne}\\
$\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
$<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
$\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
\end{tabular}
\end{center}

\noindent and so on. I leave it to you to extend the
\textit{compile}-function for the other boolean expressions.
Note that we need to jump whenever the boolean is \emph{not}
true, which means we have to ``negate'' the jump---equals
becomes not-equal, less becomes greater-or-equal. If you do
not like this design (it can be the source of some nasty,
hard-to-detect errors), you can also change the layout of the
code and first give the code for the else-branch and then for
the if-branch.
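
As an illustration (our own example), consider the boolean
expression $x < 10$ with an assumed index $E(x) = 0$ and a jump
target $lab$. The clause for $<$ is not spelled out above; we
assume it follows the same pattern as the clause for $=$, with
the jump instruction taken from the table:

\begin{lstlisting}[mathescape,numbers=none]
iload 0
ldc 10
if_icmpge $lab$
\end{lstlisting}

\noindent If $x$ is less than $10$, the jump is not taken and
execution continues with the next instruction; otherwise control
jumps to $lab$.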

We are now ready to give the compile function for 
if-statements (remember that for statements this function returns a 
pair consisting of the code and an environment):

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\ 
\multicolumn{3}{l}{$\qquad L_\textit{ifelse}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad L_\textit{ifend}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\
\multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\
\multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, L_\textit{ifelse})$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;L_\textit{ifend}$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_\textit{ifelse}:$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_\textit{ifend}:, E'')$}\\
\end{tabular}
\end{center}

\noindent In the first two lines we generate two fresh labels
for the jump addresses (just before the else-branch and just
after). In the next two lines we generate the instructions for
the two branches, $is_1$ and $is_2$. The final code will
be first the code for $b$ (including the jump to the label 
just before the else-branch), then the instructions $is_1$ for
the if-branch, then the \pcode{goto} to after
the else-branch, the label $L_\textit{ifelse}$, followed by
the instructions for the else-branch, followed by the 
after-the-else-branch label. Consider for example the 
if-statement:

\begin{lstlisting}[mathescape,numbers=none,language={}]
if 1 = 1 then x := 2 else y := 3
\end{lstlisting}

\noindent 
The generated code is as follows:

\begin{lstlisting}[mathescape,language={}]
   ldc 1
   ldc 1
   if_icmpne L_ifelse $\quad\tikz[remember picture] \node (C) {\mbox{}};$
   ldc 2
   istore 0
   goto L_ifend $\quad\tikz[remember picture] \node (A) {\mbox{}};$
L_ifelse: $\quad\tikz[remember picture] \node[] (D) {\mbox{}};$
   ldc 3
   istore 1
L_ifend: $\quad\tikz[remember picture] \node[] (B) {\mbox{}};$
\end{lstlisting}

\begin{tikzpicture}[remember picture,overlay]
  \draw[->,very thick] (A) edge [->,to path={-- ++(10mm,0mm) 
           -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (B.east);
  \draw[->,very thick] (C) edge [->,to path={-- ++(10mm,0mm) 
           -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (D.east);
\end{tikzpicture}

\noindent The first three lines correspond to the boolean 
expression $1 = 1$. The jump for when this boolean expression
is false is in Line~3. Lines 4-6 correspond to the if-branch; 
the else-branch is in Lines 8 and 9. Note carefully how the
environment $E$ is threaded through the calls of 
\textit{compile}. The function receives an environment $E$, 
but it might extend it when compiling the if-branch, yielding 
$E'$. This happens for example in the if-statement above 
whenever the variable \code{x} has not been used before. 
The same holds for the environment $E''$ produced by the second call
to \textit{compile}. $E''$ is also the environment that needs
to be returned.

The compilation of the while-loops, say 
\pcode{while} $b$ \pcode{do} $cs$, is very similar. In case
the condition is true and we need to do another iteration, the 
control-flow needs to be as follows

\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A1) [point] {};
\node (A0) [point, left=of A1] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs$};
\node (A3) [point, right=of cs1] {};
\node (A4) [point, right=of A3] {};

\draw (A0) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (cs1);
\draw (cs1) edge [->, black, line width=1mm] (A3);
\draw (A3) edge [->,skip loop] (A1);
\end{tikzpicture}
\end{center}

\noindent If, however, the condition is \emph{not} true, we
need to jump out of the loop, which gives the following
control flow.

\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A1) [point] {};
\node (A0) [point, left=of A1] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs$};
\node (A3) [point, right=of cs1] {};
\node (A4) [point, right=of A3] {};

\draw (A0) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (A2);
\draw (A2) edge [skip loop] (A3);
\draw (A3) edge [->, black, line width=1mm] (A4);
\end{tikzpicture}
\end{center}

\noindent Again we can use the \textit{compile}-function for
boolean expressions to insert the appropriate jump to the
end of the loop (label $L_{wend}$).

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(\pcode{while}\; b\; \pcode{do} \;cs, E)$ & $\dn$\\ 
\multicolumn{3}{l}{$\qquad L_{wbegin}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad L_{wend}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad (is, E') = \textit{compile}(cs, E)$}\\
\multicolumn{3}{l}{$\qquad(L_{wbegin}:$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;\textit{compile}(b, E, L_{wend})$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;L_{wbegin}$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_{wend}:, E')$}\\
\end{tabular}
\end{center}
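
\noindent As a worked example of our own, consider the loop

\begin{lstlisting}[mathescape,numbers=none,language={}]
while 0 < x do x := x - 1
\end{lstlisting}

\noindent and assume that the variable $x$ already has index $0$
in the environment. Writing \pcode{L_wbegin} and \pcode{L_wend}
for the two fresh labels, and taking the jump instruction for
$<$ from the table given earlier, the clause produces the
following code:

\begin{lstlisting}[numbers=none,language={}]
L_wbegin:
   ldc 0
   iload 0
   if_icmpge L_wend
   iload 0
   ldc 1
   isub
   istore 0
   goto L_wbegin
L_wend:
\end{lstlisting}

\noindent The first three instructions after \pcode{L_wbegin}
are the code for the condition, which is run again at every
iteration because the \pcode{goto} jumps back to the beginning
of the loop. The conditional jump leaves the loop as soon as
$0 < x$ no longer holds.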

Next we need to consider the statement \pcode{write x}, which
can be used to write out the content of a variable. For this
we need to use a Java library function. In order to avoid
having to generate a lot of code for each
\pcode{write}-command, we use a separate method and just call
this method with an argument (which needs to be placed on the
stack). The code is as follows.


\begin{lstlisting}[language=JVMIS]
.method public static write(I)V 
    .limit locals 1 
    .limit stack 2 
    getstatic java/lang/System/out Ljava/io/PrintStream; 
    iload 0
    invokevirtual java/io/PrintStream/println(I)V 
    return 
.end method
\end{lstlisting}

\noindent The first line marks the beginning of the method,
called \pcode{write}. It takes a single integer argument
indicated by the \pcode{(I)} and returns no result, indicated
by the \pcode{V}. Since the method has only one argument, we
only need a single local variable (Line~2) and a stack with
two cells will be sufficient (Line 3). Line 4 instructs the
JVM to get the value of the field \pcode{out} of the class
\pcode{java/lang/System}. It expects the value to be of type
\pcode{java/io/PrintStream}. A reference to this value will be
placed on the stack. Line~5 copies the integer we want to
print out onto the stack. In the next line we call the method
\pcode{println} (from the class \pcode{java/io/PrintStream}).
We want to print out an integer and do not expect anything
back (hence the type annotation \pcode{(I)V}). The
\pcode{return}-instruction in the next line changes the
control-flow back to the place from where \pcode{write} was
called. This method needs to be part of a header that is
included in any code we generate. The method \pcode{write} can
be invoked with the two instructions

\begin{lstlisting}[mathescape,language=JVMIS]
iload $E(x)$ 
invokestatic XXX/XXX/write(I)V
\end{lstlisting}

\noindent where we first place the variable to be printed on
the stack and then call \pcode{write}. The \pcode{XXX} need to
be replaced by an appropriate class name (this will be explained
shortly).


\begin{figure}[t]
\begin{lstlisting}[mathescape,language=JVMIS]
.class public XXX.XXX
.super java/lang/Object

.method public <init>()V
    aload_0
    invokenonvirtual java/lang/Object/<init>()V
    return
.end method

.method public static main([Ljava/lang/String;)V
    .limit locals 200
    .limit stack 200

      $\textit{\ldots{}here comes the compiled code\ldots}$

    return
.end method
\end{lstlisting}
\caption{Boilerplate code needed for running generated code.\label{boiler}}
\end{figure}


By generating code for a While-program, we end up with a list
of (JVM assembly) instructions. Unfortunately, there is a bit
more boilerplate code needed before these instructions can be
run. The code is shown in Figure~\ref{boiler}. This
boilerplate code is very specific to the JVM. Lines 4 to 8 
contain the method for object creation in the JVM (the constructor); 
the \pcode{main}-method is in Lines 10 to 17. 
Of interest are Lines 11 and 12, where we hardwire that 
the maximum number of local variables is 200 and that the 
stack of our program will never be larger than 200. These seem to be 
conservative default values that allow us to run some simple
While-programs. In a real compiler, we would of course need to
work harder and find out appropriate values for the stack and 
local variables.
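
To see how the pieces fit together, here is a sketch of a
complete file for a program consisting of the assignment
$x := 1 + 2$ followed by \pcode{write x}. Several things in it
are assumptions of ours: the class name \pcode{Demo} is a
hypothetical choice for the \pcode{XXX} placeholders, the two
statements are simply compiled one after the other, $x$ is given
index $0$, and the file is meant for an assembler that accepts
this syntax (for example Jasmin).

\begin{lstlisting}[language=JVMIS,numbers=none]
.class public Demo
.super java/lang/Object

.method public <init>()V
    aload_0
    invokenonvirtual java/lang/Object/<init>()V
    return
.end method

.method public static write(I)V
    .limit locals 1
    .limit stack 2
    getstatic java/lang/System/out Ljava/io/PrintStream;
    iload 0
    invokevirtual java/io/PrintStream/println(I)V
    return
.end method

.method public static main([Ljava/lang/String;)V
    .limit locals 200
    .limit stack 200

    ldc 1
    ldc 2
    iadd
    istore 0
    iload 0
    invokestatic Demo/write(I)V

    return
.end method
\end{lstlisting}

\noindent Once assembled and run, this program should print $3$.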


\end{document}

%%% Local Variables: 
%%% mode: latex  
%%% TeX-master: t
%%% End: