\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
\usepackage{../graphics}


\begin{document}

\section*{Handout 7 (Compilation)}

The purpose of a compiler is to transform a program that a human
can write into code the machine can run as fast as possible.
The fastest code would be machine code the CPU can run
directly, but it is often enough to improve the speed of a
program by just targeting a virtual machine. This produces not
the fastest possible code, but code that is fast enough and
has the advantage that the virtual machine takes care of things a
compiler would normally need to take care of (like explicit
memory management).

As an example we will implement a compiler for the very simple
While-language. We will be generating code for the Java
Virtual Machine. This is a stack-based virtual machine, a fact
which will make it easy to generate code for arithmetic
expressions. For example, to generate code for the
expression $1 + 2$ we need to generate the following three
instructions

\begin{lstlisting}[numbers=none]
ldc 1
ldc 2
iadd
\end{lstlisting}

\noindent The first instruction loads the constant $1$ onto
the stack, the next one $2$, and the third instruction adds both
numbers together, replacing the top two elements of the stack with
the result $3$. For simplicity, we will throughout consider
only integer numbers and results. Therefore we can use the
instructions \code{iadd}, \code{isub}, \code{imul},
\code{idiv} and so on. The \code{i} stands for integer
instructions in the JVM (alternatives are \code{d} for doubles,
\code{l} for longs and \code{f} for floats).
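
To see concretely what these instructions do to the stack, here is a
minimal Python sketch of a stack machine. It is only an illustration
and not how the JVM is implemented: the function \pcode{run} and the
encoding of instructions as tuples are assumptions made for this
example.

\begin{lstlisting}[language=Python,numbers=none]
# Minimal stack-machine sketch (illustration only, not the real JVM).
# An instruction is a tuple: ("ldc", n) or ("iadd",), ("isub",), ...
def run(instrs):
    stack = []
    for op, *args in instrs:
        if op == "ldc":                 # push a constant
            stack.append(args[0])
            continue
        right = stack.pop()             # topmost value
        left = stack.pop()              # value below it
        if op == "iadd":
            stack.append(left + right)
        elif op == "isub":
            stack.append(left - right)
        elif op == "imul":
            stack.append(left * right)
        elif op == "idiv":
            # the JVM's idiv truncates towards zero, unlike Python's //
            quotient = abs(left) // abs(right)
            stack.append(quotient if (left < 0) == (right < 0) else -quotient)
        else:
            raise ValueError("unknown instruction: " + op)
    return stack

print(run([("ldc", 1), ("ldc", 2), ("iadd",)]))  # [3]
\end{lstlisting}

\noindent After the three instructions the stack contains only the
result $3$: the two operands have been popped and replaced by their
sum.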

Recall our grammar for arithmetic expressions ($E$ is the
starting symbol):


\begin{plstx}[rhs style=, margin=3cm]
: \meta{E} ::= \meta{T} $+$ \meta{E}
| \meta{T} $-$ \meta{E}
| \meta{T}\\
: \meta{T} ::= \meta{F} $*$ \meta{T}
| \meta{F} $\backslash$ \meta{T}
| \meta{F}\\
: \meta{F} ::= ( \meta{E} )
| \meta{Id}
| \meta{Num}\\
\end{plstx}


\noindent where \meta{Id} stands for variables and
\meta{Num} for numbers. For the moment let us omit variables from
arithmetic expressions. Our parser will take this grammar and
produce abstract syntax trees. For
example, for the expression $1 + ((2 * 3) + (4 - 3))$ it will
produce the following tree.

\begin{center}
\begin{tikzpicture}
\Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
\end{tikzpicture}
\end{center}

\noindent To generate code for this expression, we need to
traverse this tree in post-order fashion and emit code for
each node; a post-order traversal produces exactly the kind of
code a stack-machine (which is what the JVM is) expects. Doing so for the
tree above generates the instructions

\begin{lstlisting}[numbers=none]
ldc 1
ldc 2
ldc 3
imul
ldc 4
ldc 3
isub
iadd
iadd
\end{lstlisting}

\noindent If we ``run'' these instructions, the result $8$
will be on top of the stack (I leave this to you to verify;
the meaning of each instruction should be clear). The result
being on the top of the stack will be a convention we always
observe in our compiler, that is the results of arithmetic
expressions will always be on top of the stack. Note that a
different bracketing of the expression, for example $(1 + (2 *
3)) + (4 - 3)$, produces a different abstract syntax tree and
thus potentially also a different list of instructions.
Generating code in this fashion is rather easy to implement:
it can be done with the following \textit{compile}-function,
which takes the abstract syntax tree as argument (here $@$
stands for appending lists of instructions):

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
$\textit{compile}(a_1 + a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\;\textit{compile}(a_2)\;@\; \pcode{iadd}$\\
$\textit{compile}(a_1 - a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{isub}$\\
$\textit{compile}(a_1 * a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{imul}$\\
$\textit{compile}(a_1 \backslash a_2)$ & $\dn$ &
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\
\end{tabular}
\end{center}
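
The clauses above translate almost literally into a recursive
function. The following Python sketch is one possible rendering, not
the definitive implementation: the name \pcode{compile_aexp}, the
encoding of trees as nested tuples and the use of \pcode{"/"} as the
division symbol are assumptions made for this example.

\begin{lstlisting}[language=Python,numbers=none]
# AST encoding (an assumption of this sketch): an int is a number,
# a triple (op, left, right) is a binary operation.
OPCODES = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}

def compile_aexp(a):
    if isinstance(a, int):
        return ["ldc " + str(a)]
    op, left, right = a
    # post-order: code for both operands first, then the operator
    return compile_aexp(left) + compile_aexp(right) + [OPCODES[op]]

# the tree for 1 + ((2 * 3) + (4 - 3))
tree = ("+", 1, ("+", ("*", 2, 3), ("-", 4, 3)))
print("\n".join(compile_aexp(tree)))
\end{lstlisting}

\noindent For the tree shown earlier this prints the same nine
instructions as listed above. Compiling the differently bracketed
expression $(1 + (2 * 3)) + (4 - 3)$ in the same way places the first
\pcode{iadd} directly after \pcode{imul}, which illustrates that the
instruction list depends on the shape of the tree.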

However, our arithmetic expressions can also contain
variables. We will represent them as \emph{local variables} in
the JVM. Essentially, local variables are an array of
memory cells, containing in our case only integers. Looking
up a variable can be done with the instruction

\begin{lstlisting}[mathescape,numbers=none]
iload $index$
\end{lstlisting}

\noindent
which places the content of the local variable $index$ onto
the stack. Storing the top of the stack into a local variable
can be done by the instruction

\begin{lstlisting}[mathescape,numbers=none]
istore $index$
\end{lstlisting}

\noindent Note that this also pops off the top of the stack.
One problem we have to overcome, however, is that local
variables are addressed, not by identifiers, but by numbers
(starting from $0$). Therefore our compiler needs to maintain
a kind of environment where variables are associated with
numbers. This association needs to be unique: if we muddle up
the numbers, then we essentially confuse variables and the
consequence will usually be an erroneous result. Our extended
\textit{compile}-function for arithmetic expressions will
therefore take two arguments: the abstract syntax tree and the
environment, $E$, that maps identifiers to index-numbers.

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\
$\textit{compile}(a_1 + a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{iadd}$\\
$\textit{compile}(a_1 - a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{isub}$\\
$\textit{compile}(a_1 * a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{imul}$\\
$\textit{compile}(a_1 \backslash a_2, E)$ & $\dn$ &
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{idiv}$\\
$\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\
\end{tabular}
\end{center}

\noindent In the last line we generate the code for variables,
where $E(x)$ stands for looking up in the environment the
index to which the variable $x$ maps.
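
As a rough illustration, the environment can be modelled by a
dictionary from identifiers to indices. In the Python sketch below the
name \pcode{compile_aexp}, the tuple encoding of trees and the
representation of variables as strings are assumptions made for this
example only.

\begin{lstlisting}[language=Python,numbers=none]
# AST encoding (an assumption of this sketch): an int is a number,
# a str is a variable, a triple (op, left, right) is an operation.
OPCODES = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}

def compile_aexp(a, env):
    if isinstance(a, int):
        return ["ldc " + str(a)]
    if isinstance(a, str):
        # E(x): look up the index of the variable
        return ["iload " + str(env[a])]
    op, left, right = a
    return compile_aexp(left, env) + compile_aexp(right, env) + [OPCODES[op]]

env = {"x": 0, "y": 1}
print(compile_aexp(("*", ("+", "x", 1), "y"), env))
# ['iload 0', 'ldc 1', 'iadd', 'iload 1', 'imul']
\end{lstlisting}

\noindent In this sketch a variable that is not in the environment
makes the lookup fail with a \pcode{KeyError}; a real compiler would
report this more gracefully.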

There is a similar \textit{compile}-function for boolean
expressions, but it includes a ``trick'' to do with
\pcode{if}- and \pcode{while}-statements. To explain the issue,
let us first look at the compilation of statements of the
While-language. The clause for \pcode{skip} is trivial, since
we do not have to generate any instruction

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
\end{tabular}
\end{center}

\noindent Note that the \textit{compile}-function for
statements returns a pair, a list of instructions (in this
case the empty list) and an environment for variables. The
reason for the environment is that assignments in the
While-language might change the environment---clearly if a
variable is used for the first time, we need to allocate a new
index and if it has been used before, we need to be able to
retrieve the associated index. This is reflected in
the clause for compiling assignments:

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(x := a, E)$ & $\dn$ &
$(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
\end{tabular}
\end{center}

\noindent We first generate code for the right-hand side of
the assignment and then add an \pcode{istore}-instruction at
the end. By convention the result of the arithmetic expression
$a$ will be on top of the stack. After the \pcode{istore}
instruction, the result will be stored in the local variable
whose $index$ corresponds to the variable $x$, that is $index = E'(x)$.
If the variable $x$ has
been used before in the program, we just need to look up what
the index is and return the environment unchanged (that is in
this case $E' = E$). However, if this is the first encounter
of the variable $x$ in the program, then we have to augment
the environment and assign $x$ the largest index in $E$
plus one (that is $E' = E(x \mapsto \textit{largest\_index} + 1)$).
That means for the assignment $x := x + 1$ we generate the
following code

\begin{lstlisting}[mathescape,numbers=none]
iload $n_x$
ldc 1
iadd
istore $n_x$
\end{lstlisting}

\noindent
where $n_x$ is the index for the variable $x$.
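
The two cases for $E'$ can be sketched in Python as follows. This is
only one way to realise the clause: the names \pcode{compile_assign}
and \pcode{compile_aexp}, the dictionary environment and the choice of
index $0$ for the very first variable are assumptions made for this
example.

\begin{lstlisting}[language=Python,numbers=none]
OPCODES = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}

def compile_aexp(a, env):
    if isinstance(a, int):
        return ["ldc " + str(a)]
    if isinstance(a, str):
        return ["iload " + str(env[a])]
    op, left, right = a
    return compile_aexp(left, env) + compile_aexp(right, env) + [OPCODES[op]]

def compile_assign(x, a, env):
    code = compile_aexp(a, env)      # right-hand side first
    if x in env:
        new_env = env                # seen before: E' = E
    else:
        new_env = dict(env)          # copy, so E itself stays unchanged
        # largest index plus one; an empty environment starts at 0
        new_env[x] = max(env.values(), default=-1) + 1
    return code + ["istore " + str(new_env[x])], new_env

code, env = compile_assign("x", ("+", "x", 1), {"x": 0})
print(code)        # ['iload 0', 'ldc 1', 'iadd', 'istore 0']
code2, env2 = compile_assign("y", 5, env)
print(code2, env2) # ['ldc 5', 'istore 1'] {'x': 0, 'y': 1}
\end{lstlisting}

\noindent The first call reproduces the code for $x := x + 1$ with
$n_x = 0$; the second call shows a new variable $y$ receiving the next
free index.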

More complicated is the code for \pcode{if}-statements, say

\begin{lstlisting}[mathescape,language={},numbers=none]
if $b$ then $cs_1$ else $cs_2$
\end{lstlisting}

\noindent where $b$ is a boolean expression and the $cs_i$
are the instructions for each \pcode{if}-branch. Let's assume
we already generated code for $b$ and $cs_{1/2}$. Then in the
true-case the control-flow of the program needs to be

\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A1) [point] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs_1$};
\node (A3) [point, right=of cs1] {};
\node (cs2) [block, right=of A3] {code of $cs_2$};
\node (A4) [point, right=of cs2] {};

\draw (A1) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (cs1);
\draw (cs1) edge [->, black, line width=1mm] (A3);
\draw (A3) edge [->, black, skip loop] (A4);
\node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};
\end{tikzpicture}
\end{center}

\noindent where we start with running the code for $b$; since
we are in the true case we continue with running the code for
$cs_1$. After this, however, we must not run the code for
$cs_2$, but always jump to just after the last instruction of $cs_2$
(the code for the \pcode{else}-branch). Note that this jump is
unconditional, meaning we always have to jump to the end of
$cs_2$. The corresponding instruction of the JVM is
\pcode{goto}. In case $b$ turns out to be false we need the
control-flow

\begin{center}
\begin{tikzpicture}[node distance=2mm and 4mm,
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
\node (A1) [point] {};
\node (b) [block, right=of A1] {code of $b$};
\node (A2) [point, right=of b] {};
\node (cs1) [block, right=of A2] {code of $cs_1$};
\node (A3) [point, right=of cs1] {};
\node (cs2) [block, right=of A3] {code of $cs_2$};
\node (A4) [point, right=of cs2] {};

\draw (A1) edge [->, black, line width=1mm] (b);
\draw (b) edge [->, black, line width=1mm] (A2);
\draw (A2) edge [skip loop] (A3);
\draw (A3) edge [->, black, line width=1mm] (cs2);
\draw (cs2) edge [->,black, line width=1mm] (A4);
\node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};
\end{tikzpicture}
\end{center}

\noindent where we now need a conditional jump (if the
if-condition is false) from the end of the code for the
boolean to the beginning of the instructions $cs_2$. Once we
are finished with running $cs_2$ we can continue with whatever
code comes after the if-statement.

The \pcode{goto} and conditional jumps need the address
the jump should go to. Since we are generating assembly code for
the JVM, we do not actually have to give addresses, but need
to attach labels to our code. These labels specify a target
for a jump. Therefore the labels need to be unique, as
otherwise it would be ambiguous where a jump should go.
A label, say \pcode{L}, is attached to code like

\begin{lstlisting}[mathescape,numbers=none]
L:
$instr_1$
$instr_2$
$\vdots$
\end{lstlisting}
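
One simple way to guarantee unique labels is to append a running
counter to each label name. The Python sketch below illustrates the
idea; the helper names \pcode{fresh} and \pcode{attach} and the naming
scheme are hypothetical and chosen only for this example.

\begin{lstlisting}[language=Python,numbers=none]
import itertools

_label_counter = itertools.count()   # yields 0, 1, 2, ...

def fresh(prefix):
    # every call consumes a new number, so no label is handed out twice
    return prefix + "_" + str(next(_label_counter))

def attach(label, instrs):
    # put the label definition in front of a list of instructions
    return [label + ":"] + instrs

l1 = fresh("L")
l2 = fresh("L")
print(l1, l2)   # L_0 L_1
print(attach(l1, ["ldc 1", "ldc 2", "iadd"]))
# ['L_0:', 'ldc 1', 'ldc 2', 'iadd']
\end{lstlisting}

\noindent Even though both labels were requested with the same
prefix, they differ, so a jump to either of them is unambiguous.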

Recall the ``trick'' with compiling boolean expressions. The
\textit{compile}-function for boolean expressions takes three
arguments: an abstract syntax tree, an environment for
variable indices and also the label, $lab$, to where a conditional
jump needs to go. The clause for the expression $a_1 = a_2$,
for example, is as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\
\multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}
\end{tabular}
\end{center}

\noindent
We first generate code for the subexpressions $a_1$ and $a_2$.
This means that after running the corresponding code there will
be two integers on top of the stack. If they are equal, we do
not have to do anything and just continue with the next
instructions (see control-flow of ifs above). However, if they
are \emph{not} equal, then we need to (conditionally) jump to
the label $lab$. This can be done with the instruction

\begin{lstlisting}[mathescape,numbers=none]
if_icmpne $lab$
\end{lstlisting}

\noindent Other jump instructions for boolean operators are

\begin{center}
\begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
$=$ & $\Rightarrow$ & \pcode{if_icmpne}\\
$\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
$<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
$\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
\end{tabular}
\end{center}

\noindent and so on. I leave it to you to extend the
\textit{compile}-function for the other boolean expressions.
Note that we need to jump whenever the boolean is \emph{not}
true, which means we have to ``negate'' the jump---equals
becomes not-equal, less becomes greater-or-equal. If you do
not like this design (it can be the source of some nasty,
hard-to-detect errors), you can also change the layout of the
code and first give the code for the else-branch and then for
the if-branch.
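
The ``negated'' jumps from the table can be captured in a lookup
table. The following Python sketch shows one way to do this; the names
\pcode{compile_bexp} and \pcode{NEGATED_JUMP}, the tuple encoding of
expressions and the spelling of the operators as strings are
assumptions made for this example.

\begin{lstlisting}[language=Python,numbers=none]
OPCODES = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}

# each comparison maps to the jump taken when the comparison is FALSE
NEGATED_JUMP = {"=": "if_icmpne", "!=": "if_icmpeq",
                "<": "if_icmpge", "<=": "if_icmpgt"}

def compile_aexp(a, env):
    if isinstance(a, int):
        return ["ldc " + str(a)]
    if isinstance(a, str):
        return ["iload " + str(env[a])]
    op, left, right = a
    return compile_aexp(left, env) + compile_aexp(right, env) + [OPCODES[op]]

def compile_bexp(b, env, lab):
    op, a1, a2 = b
    # both operands end up on the stack, then we jump away if b is false
    return (compile_aexp(a1, env) + compile_aexp(a2, env)
            + [NEGATED_JUMP[op] + " " + lab])

print(compile_bexp(("<", "x", 10), {"x": 0}, "L_else"))
# ['iload 0', 'ldc 10', 'if_icmpge L_else']
\end{lstlisting}

\noindent For $x < 10$ the generated jump is \pcode{if_icmpge}: control
leaves for the label exactly when $x \ge 10$, that is when the
condition fails.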

We are now ready to give the compile function for
if-statements; remember that for statements this function returns a
pair consisting of the code and an environment:

\begin{center}
\begin{tabular}{lcl}
$\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\
\multicolumn{3}{l}{$\qquad l_\textit{ifelse}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad l_\textit{ifend}\;$ (fresh label)}\\
\multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\
\multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\
\multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, l_\textit{ifelse})$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;l_\textit{ifend}$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifelse}:$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\
\multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifend}:, E'')$}\\
\end{tabular}
\end{center}
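
Putting the pieces together, the clause for if-statements might be
sketched in Python as below. This is an illustration under several
assumptions rather than the definitive implementation: statements are
encoded as tuples, each branch is a list of statements handled by a
hypothetical helper \pcode{compile_block}, and fresh labels come from
a simple counter.

\begin{lstlisting}[language=Python,numbers=none]
import itertools

OPCODES = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}
NEGATED_JUMP = {"=": "if_icmpne", "!=": "if_icmpeq",
                "<": "if_icmpge", "<=": "if_icmpgt"}
_label_counter = itertools.count()

def fresh(prefix):
    return prefix + "_" + str(next(_label_counter))

def compile_aexp(a, env):
    if isinstance(a, int):
        return ["ldc " + str(a)]
    if isinstance(a, str):
        return ["iload " + str(env[a])]
    op, left, right = a
    return compile_aexp(left, env) + compile_aexp(right, env) + [OPCODES[op]]

def compile_bexp(b, env, lab):
    op, a1, a2 = b
    return (compile_aexp(a1, env) + compile_aexp(a2, env)
            + [NEGATED_JUMP[op] + " " + lab])

def compile_stmt(s, env):
    if s[0] == "skip":
        return [], env
    if s[0] == "assign":
        _, x, a = s
        code = compile_aexp(a, env)
        new_env = env if x in env else {**env, x: max(env.values(), default=-1) + 1}
        return code + ["istore " + str(new_env[x])], new_env
    if s[0] == "if":
        _, b, cs1, cs2 = s
        l_ifelse = fresh("ifelse")
        l_ifend = fresh("ifend")
        is1, env1 = compile_block(cs1, env)    # (is_1, E')
        is2, env2 = compile_block(cs2, env1)   # (is_2, E'')
        code = (compile_bexp(b, env, l_ifelse) + is1
                + ["goto " + l_ifend] + [l_ifelse + ":"]
                + is2 + [l_ifend + ":"])
        return code, env2
    raise ValueError("unknown statement: " + str(s[0]))

def compile_block(stmts, env):
    # compile statements in sequence, threading the environment through
    code = []
    for s in stmts:
        instrs, env = compile_stmt(s, env)
        code = code + instrs
    return code, env

# if x = 0 then y := 1 else y := 2
prog = ("if", ("=", "x", 0), [("assign", "y", 1)], [("assign", "y", 2)])
code, final_env = compile_stmt(prog, {"x": 0})
print("\n".join(code))
\end{lstlisting}

\noindent Run on its own, this sketch prints \pcode{iload 0},
\pcode{ldc 0}, \pcode{if_icmpne ifelse_0}, \pcode{ldc 1},
\pcode{istore 1}, \pcode{goto ifend_1}, \pcode{ifelse_0:},
\pcode{ldc 2}, \pcode{istore 1} and \pcode{ifend_1:}, which follows
the layout of the clause above. Note that $y$ receives index $1$ in the
then-branch and keeps it in the else-branch, because the environment
$E'$ is passed on when compiling $cs_2$.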

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: