% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
\usepackage{../graphics}
\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2017, 2018, 2019}
\section*{Handout 7 (Compilation)}
The purpose of a compiler is to transform a program a human can read and
write into code the machine can run as fast as possible. The fastest
code would be machine code the CPU can run directly, but it is often
good enough for improving the speed of a program to target a
virtual machine. This produces not the fastest possible code, but code
that is often pretty fast. This way of producing code has the advantage that
the virtual machine takes care of things a compiler would normally need
to take care of (like explicit memory management).
As a first example in this module we will implement a compiler for the
very simple While-language. It will generate code for the Java Virtual
Machine (JVM). Unfortunately the Java ecosystem does not come with an
assembler which would be handy for our compiler-endeavour (unlike
Microsoft's Common Language Infrastructure for the .Net platform which
has an assembler out-of-the-box). As a substitute we use in this module
the 3rd-party programs Jasmin and Krakatau
\begin{itemize}
\item \url{http://jasmin.sourceforge.net}
\item \url{https://github.com/Storyyeller/Krakatau}
\end{itemize}
\noindent
The first is a Java program and the second a program written in Python.
Each of them allow us to generate \emph{assembly} files that are still
readable by humans, as opposed to class-files which are pretty much just
(horrible) zeros and ones. Jasmin (respectively Krakatau) will then take
an assembly file as input and generate the corresponding class file for
us.
Good about the JVM is that it is a stack-based virtual machine, a fact
which will make it easy to generate code for arithmetic expressions. For
example when compiling the expression $1 + 2$ we need to generate the
following three instructions
\begin{lstlisting}[language=JVMIS,numbers=none]
ldc 1
ldc 2
iadd
\end{lstlisting}
\noindent The first instruction loads the constant $1$ onto
the stack, the next one loads $2$, the third instruction adds both
numbers together replacing the top two elements of the stack
with the result $3$. For simplicity, we will throughout
consider only integer numbers and results. Therefore we can
use the JVM instructions \code{iadd}, \code{isub},
\code{imul}, \code{idiv} and so on. The \code{i} stands for
integer instructions in the JVM (alternatives are \code{d} for
doubles, \code{l} for longs and \code{f} for floats).
Recall our grammar for arithmetic expressions (\meta{E} is the
starting symbol):
\begin{plstx}[rhs style=, margin=3cm]
: \meta{E} ::= \meta{T} $+$ \meta{E}
| \meta{T} $-$ \meta{E}
| \meta{T}\\
: \meta{T} ::= \meta{F} $*$ \meta{T}
| \meta{F} $\backslash$ \meta{T}
| \meta{F}\\
: \meta{F} ::= ( \meta{E} )
| \meta{Id}
| \meta{Num}\\
\end{plstx}
\noindent where \meta{Id} stands for variables and \meta{Num}
for numbers. For the moment let us omit variables from arithmetic
expressions. Our parser will take this grammar and given an input
produce abstract syntax trees. For example we will obtain for the
expression $1 + ((2 * 3) + (4 - 3))$ the following tree.
\begin{center}
\begin{tikzpicture}
\Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
\end{tikzpicture}
\end{center}
\noindent To generate JVM code for this expression, we need to
traverse this tree in post-order fashion and emit code for
each node---this traversal in post-order fashion will produce
code for a stack-machine (what the JVM is). Doing so for the
tree above generates the instructions
\begin{lstlisting}[language=JVMIS,numbers=none]
ldc 1
ldc 2
ldc 3
imul
ldc 4
ldc 3
isub
iadd
iadd
\end{lstlisting}
\noindent If we ``run'' these instructions, the result $8$ will be on
top of the stack (I leave this to you to verify; the meaning of each
instruction should be clear). The result being on the top of the stack
will be an important convention we always observe in our compiler. Note,
that a different bracketing of the expression, for example $(1 + (2 *
3)) + (4 - 3)$, produces a different abstract syntax tree and thus also
a different list of instructions. Generating code in this
post-order-traversal fashion is rather easy to implement: it can be done
with the following recursive \textit{compile}-function, which takes the
abstract syntax tree as argument:
\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
$\textit{compile}(a_1 + a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\;\textit{compile}(a_2)\;@\; \pcode{iadd}$\\
$\textit{compile}(a_1 - a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{isub}$\\
$\textit{compile}(a_1 * a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{imul}$\\
$\textit{compile}(a_1 \backslash a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\
\end{tabular}
\end{center}
However, our arithmetic expressions can also contain
variables. We will represent them as \emph{local variables} in
the JVM. Essentially, local variables are an array or pointers
to memory cells, containing in our case only integers. Looking
up a variable can be done with the instruction
\begin{lstlisting}[language=JVMIS,mathescape,numbers=none]
iload $index$
\end{lstlisting}
\noindent
which places the content of the local variable $index$ onto
the stack. Storing the top of the stack into a local variable
can be done by the instruction
\begin{lstlisting}[language=JVMIS,mathescape,numbers=none]
istore $index$
\end{lstlisting}
\noindent Note that this also pops off the top of the stack.
One problem we have to overcome, however, is that local
variables are addressed, not by identifiers, but by numbers
(starting from $0$). Therefore our compiler needs to maintain
a kind of environment where variables are associated to
numbers. This association needs to be unique: if we muddle up
the numbers, then we essentially confuse variables and the
consequence will usually be an erroneous result. Our extended
\textit{compile}-function for arithmetic expressions will
therefore take two arguments: the abstract syntax tree and the
environment, $E$, that maps identifiers to index-numbers.
\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\
$\textit{compile}(a_1 + a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{iadd}$\\
$\textit{compile}(a_1 - a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{isub}$\\
$\textit{compile}(a_1 * a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{imul}$\\
$\textit{compile}(a_1 \backslash a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{idiv}$\\
$\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\
\end{tabular}
\end{center}
\noindent In the last line we generate the code for variables
where $E(x)$ stands for looking up the environment to which
index the variable $x$ maps to. This is similar to an interpreter,
which also needs an environment: the difference is that the
interpreter maintains a mapping from variables to current values (what is the
currently the value of a variable), while compilers need a mapping
from variables to memory locations (where can I find the current
value for the variable in memory).
There is a similar \textit{compile}-function for boolean
expressions, but it includes a ``trick'' to do with
\pcode{if}- and \pcode{while}-statements. To explain the issue
let us first describe the compilation of statements of the
While-language. The clause for \pcode{skip} is trivial, since
we do not have to generate any instruction
\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
\end{tabular}
\end{center}
\noindent whereby $[]$ is the empty list of instructions. Note that
the \textit{compile}-function for statements returns a pair, a
list of instructions (in this case the empty list) and an
environment for variables. The reason for the environment is
that assignments in the While-language might change the
environment---clearly if a variable is used for the first
time, we need to allocate a new index and if it has been used
before, then we need to be able to retrieve the associated index.
This is reflected in the clause for compiling assignments, say
$\textit{x := a}$:
\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(x := a, E)$ & $\dn$ &
$(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
\end{tabular}
\end{center}
\noindent We first generate code for the right-hand side of
the assignment and then add an \pcode{istore}-instruction at
the end. By convention the result of the arithmetic expression
$a$ will be on top of the stack. After the \pcode{istore}
instruction, the result will be stored in the index
corresponding to the variable $x$. If the variable $x$ has
been used before in the program, we just need to look up what
the index is and return the environment unchanged (that is in
this case $E' = E$). However, if this is the first encounter
of the variable $x$ in the program, then we have to augment
the environment and assign $x$ with the largest index in $E$
plus one (that is $E' = E(x \mapsto largest\_index + 1)$).
This means for the assignment $x := x + 1$ we generate the
following code
\begin{lstlisting}[language=JVMIS,mathescape,numbers=none]
iload $n_x$
ldc 1
iadd
istore $n_x$
\end{lstlisting}
\noindent
where $n_x$ is the index (or pointer to the memory) for the variable $x$ . The code for
looking-up the index for the variable is as follow:
\begin{center}
\begin{tabular}{lcl}
$index \;=\; E\textit{.getOrElse}(x, |E|)$
\end{tabular}
\end{center}
\noindent
In case the environment $E$ contains an index for $x$, we return it.
Otherwise we ``create'' a new index by returning the size $|E|$ of the
environment (that will be an index that is guaranteed to be not used
yet).
A bit more complicated is the generation of code for \pcode{if}-statements, say
\begin{lstlisting}[mathescape,language={},numbers=none]
if $b$ then $cs_1$ else $cs_2$
\end{lstlisting}
\noindent where $b$ is a boolean expression and the $cs_{1/2}$ are the
statements for each of the \pcode{if}-branches. Lets assume we already
generated code for $b$ and $cs_{1/2}$. Then in the true-case the
control-flow of the program needs to be
\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A1) [point] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs_1$};
\node (A3) [point, right=of cs1] {};
\node (cs2) [block, right=of A3] {code of $cs_2$};
\node (A4) [point, right=of cs2] {};
\draw (A1) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (cs1);
\draw (cs1) edge [->, black, line width=1mm] (A3);
\draw (A3) edge [->, black, skip loop] (A4);
\node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};
\end{tikzpicture}
\end{center}
\noindent where we start with running the code for $b$; since
we are in the true case we continue with running the code for
$cs_1$. After this however, we must not run the code for
$cs_2$, but always jump after the last instruction of $cs_2$
(the code for the \pcode{else}-branch). Note that this jump is
unconditional, meaning we always have to jump to the end of
$cs_2$. The corresponding instruction of the JVM is
\pcode{goto}. In case $b$ turns out to be false we need the
control-flow
\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A1) [point] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs_1$};
\node (A3) [point, right=of cs1] {};
\node (cs2) [block, right=of A3] {code of $cs_2$};
\node (A4) [point, right=of cs2] {};
\draw (A1) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (A2);
\draw (A2) edge [skip loop] (A3);
\draw (A3) edge [->, black, line width=1mm] (cs2);
\draw (cs2) edge [->,black, line width=1mm] (A4);
\node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};
\end{tikzpicture}
\end{center}
\noindent where we now need a conditional jump (if the
if-condition is false) from the end of the code for the
boolean to the beginning of the instructions $cs_2$. Once we
are finished with running $cs_2$ we can continue with whatever
code comes after the if-statement.
The \pcode{goto} and the conditional jumps need addresses to
where the jump should go. Since we are generating assembly
code for the JVM, we do not actually have to give (numeric)
addresses, but can just attach (symbolic) labels to our code.
These labels specify a target for a jump. Therefore the labels
need to be unique, as otherwise it would be ambiguous where a
jump should go to. A label, say \pcode{L}, is attached to code
like
\begin{lstlisting}[mathescape,numbers=none]
L:
$instr_1$
$instr_2$
$\vdots$
\end{lstlisting}
\noindent where a label is indicated by a colon.
Recall the ``trick'' with compiling boolean expressions: the
\textit{compile}-function for boolean expressions takes three
arguments: an abstract syntax tree, an environment for
variable indices and also the label, $lab$, to where an conditional
jump needs to go. The clause for the expression $a_1 = a_2$,
for example, is as follows:
\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\
\multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}
\end{tabular}
\end{center}
\noindent where we are first generating code for the
subexpressions $a_1$ and $a_2$. This will mean after running
the corresponding code there will be two integers on top of
the stack. If they are equal, we do not have to do anything
(except for popping them off from the stack) and just continue
with the next instructions (see control-flow of ifs above).
However if they are \emph{not} equal, then we need to
(conditionally) jump to the label $lab$. This can be done with
the instruction
\begin{lstlisting}[mathescape,numbers=none]
if_icmpne $lab$
\end{lstlisting}
\noindent Other jump instructions for boolean operators are
\begin{center}
\begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
$\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
$<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
$\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
\end{tabular}
\end{center}
\noindent and so on. I leave it to you to extend the
\textit{compile}-function for the other boolean expressions.
Note that we need to jump whenever the boolean is \emph{not}
true, which means we have to ``negate'' the jump
condition---equals becomes not-equal, less becomes
greater-or-equal. If you do not like this design (it can be
the source of some nasty, hard-to-detect errors), you can also
change the layout of the code and first give the code for the
else-branch and then for the if-branch. However in the case
of while-loops this way of generating code still seems
the most convenient.
We are now ready to give the compile function for
if-statements---remember this function returns for statements a
pair consisting of the code and an environment:
\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\
\multicolumn{3}{l}{$\qquad L_\textit{ifelse}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad L_\textit{ifend}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\
\multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\
\multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, L_\textit{ifelse})$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;L_\textit{ifend}$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_\textit{ifelse}:$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_\textit{ifend}:, E'')$}\\
\end{tabular}
\end{center}
\noindent In the first two lines we generate two fresh labels
for the jump addresses (just before the else-branch and just
after). In the next two lines we generate the instructions for
the two branches, $is_1$ and $is_2$. The final code will
be first the code for $b$ (including the label
just-before-the-else-branch), then the \pcode{goto} for after
the else-branch, the label $L_\textit{ifesle}$, followed by
the instructions for the else-branch, followed by the
after-the-else-branch label. Consider for example the
if-statement:
\begin{lstlisting}[mathescape,numbers=none,language=While]
if 1 = 1 then x := 2 else y := 3
\end{lstlisting}
\noindent
The generated code is as follows:
\begin{lstlisting}[language=JVMIS,mathescape,numbers=left]
ldc 1
ldc 1
if_icmpne L_ifelse $\quad\tikz[remember picture] \node (C) {\mbox{}};$
ldc 2
istore 0
goto L_ifend $\quad\tikz[remember picture] \node (A) {\mbox{}};$
L_ifelse: $\quad\tikz[remember picture] \node[] (D) {\mbox{}};$
ldc 3
istore 1
L_ifend: $\quad\tikz[remember picture] \node[] (B) {\mbox{}};$
\end{lstlisting}
\begin{tikzpicture}[remember picture,overlay]
\draw[->,very thick] (A) edge [->,to path={-- ++(10mm,0mm)
-- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (B.east);
\draw[->,very thick] (C) edge [->,to path={-- ++(10mm,0mm)
-- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (D.east);
\end{tikzpicture}
\noindent The first three lines correspond to the the boolean
expression $1 = 1$. The jump for when this boolean expression
is false is in Line~3. Lines 4-6 corresponds to the if-branch;
the else-branch is in Lines 8 and 9. Note carefully how the
environment $E$ is threaded through the recursive calls of
\textit{compile}. The function receives an environment $E$,
but it might extend it when compiling the if-branch, yielding
$E'$. This happens for example in the if-statement above
whenever the variable \code{x} has not been used before.
Similarly with the environment $E''$ for the second call to
\textit{compile}. $E''$ is also the environment that needs to
be returned as part of the answer.
The compilation of the while-loops, say
\pcode{while} $b$ \pcode{do} $cs$, is very similar. In case
the condition is true and we need to do another iteration,
and the control-flow needs to be as follows
\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A0) [point, left=of A1] {};
\node (A1) [point] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs$};
\node (A3) [point, right=of cs1] {};
\node (A4) [point, right=of A3] {};
\draw (A0) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (cs1);
\draw (cs1) edge [->, black, line width=1mm] (A3);
\draw (A3) edge [->,skip loop] (A1);
\end{tikzpicture}
\end{center}
\noindent Whereas if the condition is \emph{not} true, we
need to jump out of the loop, which gives the following
control flow.
\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A0) [point, left=of A1] {};
\node (A1) [point] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs$};
\node (A3) [point, right=of cs1] {};
\node (A4) [point, right=of A3] {};
\draw (A0) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (A2);
\draw (A2) edge [skip loop] (A3);
\draw (A3) edge [->, black, line width=1mm] (A4);
\end{tikzpicture}
\end{center}
\noindent Again we can use the \textit{compile}-function for
boolean expressions to insert the appropriate jump to the
end of the loop (label $L_{wend}$ below).
\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(\pcode{while}\; b\; \pcode{do} \;cs, E)$ & $\dn$\\
\multicolumn{3}{l}{$\qquad L_{wbegin}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad L_{wend}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad (is, E') = \textit{compile}(cs_1, E)$}\\
\multicolumn{3}{l}{$\qquad(L_{wbegin}:$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;\textit{compile}(b, E, L_{wend})$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\; \text{goto}\;L_{wbegin}$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_{wend}:, E')$}\\
\end{tabular}
\end{center}
\noindent I let you go through how this clause works. As an example
you can consider the while-loop
\begin{lstlisting}[mathescape,numbers=none,language=While]
while x <= 10 do x := x + 1
\end{lstlisting}
\noindent yielding the following code
\begin{lstlisting}[language=JVMIS,mathescape,numbers=left]
L_wbegin: $\quad\tikz[remember picture] \node[] (LB) {\mbox{}};$
iload 0
ldc 10
if_icmpgt L_wend $\quad\tikz[remember picture] \node (LC) {\mbox{}};$
iload 0
ldc 1
iadd
istore 0
goto L_wbegin $\quad\tikz[remember picture] \node (LA) {\mbox{}};$
L_wend: $\quad\tikz[remember picture] \node[] (LD) {\mbox{}};$
\end{lstlisting}
\begin{tikzpicture}[remember picture,overlay]
\draw[->,very thick] (LA) edge [->,to path={-- ++(10mm,0mm)
-- ++(0mm,17.3mm) |- (\tikztotarget)},line width=1mm] (LB.east);
\draw[->,very thick] (LC) edge [->,to path={-- ++(10mm,0mm)
-- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (LD.east);
\end{tikzpicture}
\noindent
I leave it to you to read the code and follow its controlflow.
Next we need to consider the statement \pcode{write x}, which
can be used to print out the content of a variable. For this
we need to use a Java library function. In order to avoid
having to generate a lot of code for each
\pcode{write}-command, we use a separate helper-method and
just call this method with an argument (which needs to be
placed onto the stack). The code of the helper-method is as
follows.
\begin{lstlisting}[language=JVMIS,numbers=left]
.method public static write(I)V
.limit locals 1
.limit stack 2
getstatic java/lang/System/out Ljava/io/PrintStream;
iload 0
invokevirtual java/io/PrintStream/println(I)V
return
.end method
\end{lstlisting}
\noindent The first line marks the beginning of the method,
called \pcode{write}. It takes a single integer argument
indicated by the \pcode{(I)} and returns no result, indicated
by the \pcode{V}. Since the method has only one argument, we
only need a single local variable (Line~2) and a stack with
two cells will be sufficient (Line 3). Line 4 instructs the
JVM to get the value of the field \pcode{out} of the class
\pcode{java/lang/System}. It expects the value to be of type
\pcode{java/io/PrintStream}. A reference to this value will be
placed on the stack. Line~5 copies the integer we want to
print out onto the stack. In the next line we call the method
\pcode{println} (from the class \pcode{java/io/PrintStream}).
We want to print out an integer and do not expect anything
back (that is why the type annotation is \pcode{(I)V}). The
\pcode{return}-instruction in the next line changes the
control-flow back to the place from where \pcode{write} was
called. This method needs to be part of a header that is
included in any code we generate. The helper-method
\pcode{write} can be invoked with the two instructions
\begin{lstlisting}[mathescape,language=JVMIS]
iload $E(x)$
invokestatic XXX/XXX/write(I)V
\end{lstlisting}
\noindent where we first place the variable to be printed on
top of the stack and then call \pcode{write}. The \pcode{XXX}
need to be replaced by an appropriate class name (this will be
explained shortly).
\begin{figure}[t]
\begin{lstlisting}[mathescape,language=JVMIS,numbers=left]
.class public XXX.XXX
.super java/lang/Object
.method public <init>()V
aload_0
invokenonvirtual java/lang/Object/<init>()V
return
.end method
.method public static main([Ljava/lang/String;)V
.limit locals 200
.limit stack 200
$\textit{\ldots{}here comes the compiled code\ldots}$
return
.end method
\end{lstlisting}
\caption{Boilerplate code needed for running generated code.\label{boiler}}
\end{figure}
By generating code for a While-program, we end up with a list
of (JVM assembly) instructions. Unfortunately, there is a bit
more boilerplate code needed before these instructions can be
run. The complete code is shown in Figure~\ref{boiler}. This
boilerplate code is very specific to the JVM. If we target any
other virtual machine or a machine language, then we would
need to change this code. Lines 4 to 8 in Figure~\ref{boiler}
contain a method for object creation in the JVM; this method
is called \emph{before} the \pcode{main}-method in Lines 10 to
17. Interesting are the Lines 11 and 12 where we hardwire that
the stack of our programs will never be larger than 200 and
that the maximum number of variables is also 200. This seem to
be conservative default values that allow is to run some
simple While-programs. In a real compiler, we would of course
need to work harder and find out appropriate values for the
stack and local variables.
To sum up, in Figure~\ref{test} is the complete code generated
for the slightly nonsensical program
\begin{lstlisting}[mathescape,language=While]
x := 1 + 2;
write x
\end{lstlisting}
\noindent Having this code at our disposal, we need the
assembler to translate the generated code into JVM bytecode (a
class file). This bytecode is understood by the JVM and can be
run by just invoking the \pcode{java}-program.
\begin{figure}[p]
\lstinputlisting[language=JVMIS]{../progs/test-small.j}
\caption{Generated code for a test program. This code can be
processed by an Java assembler producing a class-file, which
can be run by the {\tt{}java}-program.\label{test}}
\end{figure}
\subsection*{Arrays}
Maybe a useful addition to the While-language would be arrays. This
would let us generate more interesting While-programs by translating
BF*** programs into equivalent While-code. So in this section lets have
a look at how we can support the following three constructions
\begin{lstlisting}[mathescape,language=While]
new arr[15000]
x := 3 + arr[3 + y]
arr[42 * n] := ...
\end{lstlisting}
\noindent
The first construct is for creating new arrays, in this instance the
name of the array is \pcode{arr} and it can hold 15000 integers. The
second is for referencing an array cell inside an arithmetic
expression---we need to be able to look up the contents of an array at
an index determined by an arithmetic expression. Similarly, we need to
be able to update the content of an array at an calculated index---the
3rd feature.
For creating a new array we can generate the following three JVM
instructions:
\begin{lstlisting}[mathescape,language=JVMIS]
ldc number
newarray int
astore loc_var
\end{lstlisting}
\noindent
First we need to put the dimension of the array onto the stack. The next
instruction creates the array. With the last we can store the array as a
local variable (like the ``simple'' variables from the previous
section). The use of a local variable allows us to have multiple arrays
in a While-program. For looking up an element in an array we can use the
following JVM code
\begin{lstlisting}[mathescape,language=JVMIS]
aload loc_var
index_aexp
iaload
\end{lstlisting}
\noindent
The first instruction loads the ``pointer'' to the array onto the stack.
Then we have some instructions corresponding to the index where we want
to look up the array. The idea is that these instructions will leave a
concrete number on the stack, which will be the index into the array we
need. Finally we need to tell the JVM to load the corresponding element
onto the stack. Updating an array at an index with a value is as
follows.
\begin{lstlisting}[mathescape,language=JVMIS]
aload loc_var
index_aexp
value_aexp
iastore
\end{lstlisting}
\noindent
Again the first instruction loads the ``pointer'' to the array onto the
stack. Then we have some instructions corresponding to the index where
we want to update the array. After that come the instructions for with what
value we want to update the array. Finally the instruction for
updating the array.
Next we need to update our grammar rules: it seems best to extend the
rule for factors in arithmetic expressions with a rule for looking up
an array.
\begin{plstx}[rhs style=, margin=3cm]
: \meta{E} ::= \meta{T} $+$ \meta{E}
| \meta{T} $-$ \meta{E}
| \meta{T}\\
: \meta{T} ::= \meta{F} $*$ \meta{T}
| \meta{F} $\backslash$ \meta{T}
| \meta{F}\\
: \meta{F} ::= ( \meta{E} )
| $\underbrace{\meta{Id}\,[\,\meta{E}\,]}_{new}$
| \meta{Id}
| \meta{Num}\\
\end{plstx}
\noindent
There is no problem with left-recursion as the \meta{E} is ``protected''
by an identifier and brackets. There are two new rules in statements,
namely creation of an array and array assignment:
\begin{plstx}[rhs style=, margin=2cm, one per line]
: \meta{Stmt} ::= \ldots
| \texttt{new}\; \meta{Id}\,[\,\meta{Num}\,]
| \meta{Id}\,[\,\meta{E}\,]\,:=\,\meta{E}\\
\end{plstx}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: