% !TEX program = xelatex+ −
\documentclass{article}+ −
\usepackage{../style}+ −
\usepackage{../langs}+ −
\usepackage{../grammar}+ −
\usepackage{../graphics}+ −
+ −
%% add safety check on references...whether it is above 0 and below the + −
%% index+ −
+ −
+ −
+ −
+ −
\begin{document}+ −
\fnote{\copyright{} Christian Urban, King's College London, 2017, 2018, 2019, 2020}+ −
+ −
\section*{Handout 7 (Compilation)}+ −
+ −
The purpose of a compiler is to transform a program a human can read and+ −
write into code the machine can run as fast as possible. The fastest+ −
code would be machine code the CPU can run directly, but it is often+ −
good enough for improving the speed of a program to target a+ −
virtual machine. This produces not the fastest possible code, but code+ −
that is often pretty fast. This way of producing code has the advantage that+ −
the virtual machine takes care of things a compiler would normally need+ −
to take care of (like explicit memory management). + −
+ −
As a first example in this module we will implement a compiler for the+ −
very simple WHILE-language that we parsed in the last lecture. The+ −
compiler will target the Java Virtual Machine (JVM), but not directly.+ −
Pictorially the compiler will work as follows:+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[scale=1,font=\bf,+ −
node/.style={+ −
rectangle,rounded corners=3mm,+ −
ultra thick,draw=black!50,minimum height=18mm, + −
minimum width=20mm,+ −
top color=white,bottom color=black!20}]+ −
+ −
\node (0) at (-3,0) {}; + −
\node (A) at (0,0) [node,text width=1.6cm,text centered] {our compiler};+ −
\node (B) at (3.5,0) [node,text width=1.6cm,text centered] {Jasmin / Krakatau};+ −
\node (C) at (7.5,0) [node] {JVM};+ −
+ −
\draw [->,line width=2.5mm] (0) -- node [above,pos=0.35] {*.while} (A); + −
\draw [->,line width=2.5mm] (A) -- node [above,pos=0.35] {*.j} (B); + −
\draw [->,line width=2.5mm] (B) -- node [above,pos=0.35] {*.class} (C); + −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
The input will be WHILE-programs; the output will be assembly files+ −
(with the ending .j). Assembly files essentially contain human-readable+ −
machine code, meaning they are not just bits and bytes, but rather+ −
something you can read and understand---with a bit of practice of+ −
course. An \emph{assembler} will then translate the assembly files into+ −
unreadable class or binary files the JVM can run. Unfortunately, the+ −
Java ecosystem does not come with an assembler which would be handy for+ −
our compiler-endeavour (unlike Microsoft's Common Language+ −
Infrastructure for the .Net platform which has an assembler+ −
out-of-the-box). As a substitute we shall therefore use the 3rd-party+ −
programs Jasmin and Krakatau + −
+ −
\begin{itemize}+ −
\item \url{http://jasmin.sourceforge.net}+ −
\item \url{https://github.com/Storyyeller/Krakatau}+ −
\end{itemize}+ −
+ −
\noindent+ −
The first is a Java program and the second a program written in Python.+ −
Each of them allow us to generate \emph{assembly} files that are still+ −
readable by humans, as opposed to class-files which are pretty much just+ −
(horrible) zeros and ones. Jasmin (respectively Krakatau) will then take+ −
an assembly file as input and generate the corresponding class file for+ −
us. + −
+ −
Good about the JVM is that it is a stack-based virtual machine, a fact+ −
which will make it easy to generate code for arithmetic expressions. For+ −
example when compiling the expression $1 + 2$ we need to generate the+ −
following three instructions+ −
+ −
\begin{lstlisting}[language=JVMIS,numbers=none]+ −
ldc 1+ −
ldc 2+ −
iadd + −
\end{lstlisting}+ −
+ −
\noindent The first instruction loads the constant $1$ onto+ −
the stack, the next one loads $2$, the third instruction adds both+ −
numbers together replacing the top two elements of the stack+ −
with the result $3$. For simplicity, we will throughout+ −
consider only integer numbers. Therefore we can+ −
use the JVM instructions \code{iadd}, \code{isub},+ −
\code{imul}, \code{idiv} and so on. The \code{i} stands for+ −
integer instructions in the JVM (alternatives are \code{d} for+ −
doubles, \code{l} for longs and \code{f} for floats).+ −
+ −
Recall our grammar for arithmetic expressions (\meta{E} is the+ −
starting symbol):+ −
+ −
+ −
\begin{plstx}[rhs style=, margin=3cm]+ −
: \meta{E} ::= \meta{T} $+$ \meta{E}+ −
| \meta{T} $-$ \meta{E}+ −
| \meta{T}\\+ −
: \meta{T} ::= \meta{F} $*$ \meta{T}+ −
| \meta{F} $\backslash$ \meta{T}+ −
| \meta{F}\\+ −
: \meta{F} ::= ( \meta{E} )+ −
| \meta{Id}+ −
| \meta{Num}\\+ −
\end{plstx}+ −
+ −
+ −
\noindent where \meta{Id} stands for variables and \meta{Num}+ −
for numbers. For the moment let us omit variables from arithmetic+ −
expressions. Our parser will take this grammar and given an input+ −
produce abstract syntax trees. For example we will obtain for the+ −
expression $1 + ((2 * 3) + (4 - 3))$ the following tree.+ −
+ −
\begin{center}+ −
\begin{tikzpicture}+ −
\Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent To generate JVM code for this expression, we need to traverse+ −
this tree in \emph{post-order} fashion and emit code for each+ −
node---this traversal in \emph{post-order} fashion will produce code for+ −
a stack-machine (which is what the JVM is). Doing so for the tree above+ −
generates the instructions+ −
+ −
\begin{lstlisting}[language=JVMIS,numbers=none]+ −
ldc 1 + −
ldc 2 + −
ldc 3 + −
imul + −
ldc 4 + −
ldc 3 + −
isub + −
iadd + −
iadd+ −
\end{lstlisting}+ −
+ −
\noindent If we ``run'' these instructions, the result $8$ will be on+ −
top of the stack (I leave this to you to verify; the meaning of each+ −
instruction should be clear). The result being on the top of the stack+ −
will be an important convention we always observe in our compiler. Note,+ −
that a different bracketing of the expression, for example $(1 + (2 *+ −
3)) + (4 - 3)$, produces a different abstract syntax tree and thus also+ −
a different list of instructions. Generating code in this+ −
post-order-traversal fashion is rather easy to implement: it can be done+ −
with the following recursive \textit{compile}-function, which takes the+ −
abstract syntax tree as an argument:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\+ −
$\textit{compile}(a_1 + a_2)$ & $\dn$ &+ −
$\textit{compile}(a_1) \;@\;\textit{compile}(a_2)\;@\; \pcode{iadd}$\\+ −
$\textit{compile}(a_1 - a_2)$ & $\dn$ & + −
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{isub}$\\+ −
$\textit{compile}(a_1 * a_2)$ & $\dn$ & + −
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{imul}$\\+ −
$\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & + −
$\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
This is all fine, but our arithmetic expressions can clearly contain+ −
variables and then this code will not be good enough. To fix this we+ −
will represent our variables as the \emph{local variables} in the JVM.+ −
Essentially, local variables are an array or pointers to memory cells,+ −
containing in our case only integers. Looking up a variable can be done+ −
with the instruction+ −
+ −
\begin{lstlisting}[language=JVMIS,mathescape,numbers=none]+ −
iload $index$+ −
\end{lstlisting}+ −
+ −
\noindent + −
which places the content of the local variable $index$ onto + −
the stack. Storing the top of the stack into a local variable + −
can be done by the instruction+ −
+ −
\begin{lstlisting}[language=JVMIS,mathescape,numbers=none]+ −
istore $index$+ −
\end{lstlisting}+ −
+ −
\noindent Note that this also pops off the top of the stack. One problem+ −
we have to overcome, however, is that local variables are addressed, not+ −
by identifiers (like \texttt{x}, \texttt{foo} and so on), but by numbers+ −
(starting from $0$). Therefore our compiler needs to maintain a kind of+ −
environment where variables are associated to numbers. This association+ −
needs to be unique: if we muddle up the numbers, then we essentially+ −
confuse variables and the consequence will usually be an erroneous+ −
result. Our extended \textit{compile}-function for arithmetic+ −
expressions will therefore take two arguments: the abstract syntax tree+ −
and an environment, $E$, that maps identifiers to index-numbers.+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\+ −
$\textit{compile}(a_1 + a_2, E)$ & $\dn$ & + −
$\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{iadd}$\\+ −
$\textit{compile}(a_1 - a_2, E)$ & $\dn$ &+ −
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{isub}$\\+ −
$\textit{compile}(a_1 * a_2, E)$ & $\dn$ &+ −
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{imul}$\\+ −
$\textit{compile}(a_1 \backslash a_2, E)$ & $\dn$ & + −
$\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{idiv}$\\+ −
$\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent In the last line we generate the code for variables where+ −
$E(x)$ stands for looking up the environment to which index the variable+ −
$x$ maps to. This is similar to the interpreter we saw earlier in the+ −
module, which also needs an environment: the difference is that the+ −
interpreter maintains a mapping from variables to current values (what+ −
is the currently the value of a variable?), while compilers need a+ −
mapping from variables to memory locations (where can I find the current+ −
value for the variable in memory?).+ −
+ −
There is a similar \textit{compile}-function for boolean+ −
expressions, but it includes a ``trick'' to do with+ −
\pcode{if}- and \pcode{while}-statements. To explain the issue+ −
let us first describe the compilation of statements of the+ −
WHILE-language. The clause for \pcode{skip} is trivial, since+ −
we do not have to generate any instruction+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent whereby $[]$ is the empty list of instructions. Note that+ −
the \textit{compile}-function for statements returns a pair, a+ −
list of instructions (in this case the empty list) and an+ −
environment for variables. The reason for the environment is+ −
that assignments in the WHILE-language might change the+ −
environment---clearly if a variable is used for the first+ −
time, we need to allocate a new index and if it has been used+ −
before, then we need to be able to retrieve the associated index.+ −
This is reflected in the clause for compiling assignments, say+ −
$\textit{x := a}$:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{compile}(x := a, E)$ & $\dn$ & + −
$(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent We first generate code for the right-hand side of the+ −
assignment (that is the arithmetic expression $a$) and then add an+ −
\pcode{istore}-instruction at the end. By convention running the code+ −
for the arithmetic expression $a$ will leave the result on top of the+ −
stack. After that the \pcode{istore} instruction, the result will be+ −
stored in the index corresponding to the variable $x$. If the variable+ −
$x$ has been used before in the program, we just need to look up what+ −
the index is and return the environment unchanged (that is in this case+ −
$E' = E$). However, if this is the first encounter of the variable $x$+ −
in the program, then we have to augment the environment and assign $x$+ −
with the largest index in $E$ plus one (that is $E' = E(x \mapsto+ −
largest\_index + 1)$). To sum up, for the assignment $x := x + 1$ we+ −
generate the following code+ −
+ −
\begin{lstlisting}[language=JVMIS,mathescape,numbers=none]+ −
iload $n_x$+ −
ldc 1+ −
iadd+ −
istore $n_x$+ −
\end{lstlisting}+ −
+ −
\noindent + −
where $n_x$ is the index (or pointer to the memory) for the variable+ −
$x$. The code for looking-up the index for the variable is as follow:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$index \;=\; E\textit{.getOrElse}(x, |E|)$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
This implements the idea that in case the environment $E$ contains an+ −
index for $x$, we return it. Otherwise we ``create'' a new index by+ −
returning the size $|E|$ of the environment (that will be an index that+ −
is guaranteed not to be used yet). In all this we take advantage of the+ −
JVM which provides us with a potentially limitless supply of places+ −
where we can store values of variables.+ −
+ −
A bit more complicated is the generation of code for+ −
\pcode{if}-statements, say+ −
+ −
\begin{lstlisting}[mathescape,language={},numbers=none]+ −
if $b$ then $cs_1$ else $cs_2$+ −
\end{lstlisting}+ −
+ −
\noindent where $b$ is a boolean expression and where both $cs_{1/2}$+ −
are the statements for each of the \pcode{if}-branches. Let us assume we+ −
already generated code for $b$ and and the two if-branches $cs_{1/2}$.+ −
Then in the true-case the control-flow of the program needs to behave as+ −
+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=2mm and 4mm,line cap=round,+ −
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm,+ −
top color=white,bottom color=black!20},+ −
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},+ −
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]+ −
\node (A1) [point] {};+ −
\node (b) [block, right=of A1] {code of $b$};+ −
\node (A2) [point, right=of b] {};+ −
\node (cs1) [block, right=of A2] {code of $cs_1$};+ −
\node (A3) [point, right=of cs1] {};+ −
\node (cs2) [block, right=of A3] {code of $cs_2$};+ −
\node (A4) [point, right=of cs2] {};+ −
+ −
\draw (A1) edge [->, black, line width=1mm] (b);+ −
\draw (b) edge [->, black, line width=1mm] (cs1);+ −
\draw (cs1) edge [->, black, line width=1mm,shorten >= -0.5mm] (A3);+ −
\draw (A3) edge [->, black, skip loop] (A4);+ −
\node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent where we start with running the code for $b$; since+ −
we are in the true case we continue with running the code for+ −
$cs_1$. After this however, we must not run the code for+ −
$cs_2$, but always jump to after the last instruction of $cs_2$+ −
(the code for the \pcode{else}-branch). Note that this jump is+ −
unconditional, meaning we always have to jump to the end of+ −
$cs_2$. The corresponding instruction of the JVM is+ −
\pcode{goto}. In case $b$ turns out to be false we need the+ −
control-flow+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=2mm and 4mm,line cap=round,+ −
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm,+ −
top color=white,bottom color=black!20},+ −
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},+ −
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]+ −
\node (A1) [point] {};+ −
\node (b) [block, right=of A1] {code of $b$};+ −
\node (A2) [point, right=of b] {};+ −
\node (cs1) [block, right=of A2] {code of $cs_1$};+ −
\node (A3) [point, right=of cs1] {};+ −
\node (cs2) [block, right=of A3] {code of $cs_2$};+ −
\node (A4) [point, right=of cs2] {};+ −
+ −
\draw (A1) edge [->, black, line width=1mm] (b);+ −
\draw (b) edge [->, black, line width=1mm,shorten >= -0.5mm] (A2);+ −
\draw (A2) edge [skip loop] (A3);+ −
\draw (A3) edge [->, black, line width=1mm] (cs2);+ −
\draw (cs2) edge [->,black, line width=1mm] (A4);+ −
\node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent where we now need a conditional jump (if the+ −
if-condition is false) from the end of the code for the + −
boolean to the beginning of the instructions $cs_2$. Once we + −
are finished with running $cs_2$ we can continue with whatever+ −
code comes after the if-statement.+ −
+ −
The \pcode{goto} and the conditional jumps need addresses to+ −
where the jump should go. Since we are generating assembly+ −
code for the JVM, we do not actually have to give (numeric)+ −
addresses, but can just attach (symbolic) labels to our code.+ −
These labels specify a target for a jump. Therefore the labels+ −
need to be unique, as otherwise it would be ambiguous where a+ −
jump should go to. A label, say \pcode{L}, is attached to code+ −
like+ −
+ −
\begin{lstlisting}[mathescape,numbers=none]+ −
L:+ −
$instr_1$+ −
$instr_2$+ −
$\vdots$+ −
\end{lstlisting}+ −
+ −
\noindent where the label needs to be followed by a colon. The task of+ −
the assembler (in our case Jasmin or Krakatau) is to resolve the labels+ −
to actual (numeric) addresses, for example jump 10 instructions forward,+ −
or 20 instructions backwards.+ −
+ −
Recall the ``trick'' with compiling boolean expressions: the + −
\textit{compile}-function for boolean expressions takes three+ −
arguments: an abstract syntax tree, an environment for + −
variable indices and also the label, $lab$, to where an conditional + −
jump needs to go. The clause for the expression $a_1 = a_2$, + −
for example, is as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ + −
\multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent where we are first generating code for the+ −
subexpressions $a_1$ and $a_2$. This will mean after running+ −
the corresponding code there will be two integers on top of+ −
the stack. If they are equal, we do not have to do anything+ −
(except for popping them off from the stack) and just continue+ −
with the next instructions (see control-flow of ifs above).+ −
However if they are \emph{not} equal, then we need to+ −
(conditionally) jump to the label $lab$. This can be done with+ −
the instruction+ −
+ −
\begin{lstlisting}[mathescape,numbers=none,language=JVMIS]+ −
if_icmpne $lab$+ −
\end{lstlisting}+ −
+ −
To sum up, the third argument in the compile function for booleans+ −
specifies where to jump, in case the condition is \emph{not} true. I+ −
leave it to you to extend the \textit{compile}-function for the other+ −
boolean expressions. Note that we need to jump whenever the boolean is+ −
\emph{not} true, which means we have to ``negate'' the jump+ −
condition---equals becomes not-equal, less becomes greater-or-equal. + −
Other jump instructions for boolean operators are+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}+ −
$\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\+ −
$<$ & $\Rightarrow$ & \pcode{if_icmpge}\\+ −
$\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent and so on. If you do not like this design (it can be the+ −
source of some nasty, hard-to-detect errors), you can also change the+ −
layout of the code and first give the code for the else-branch and then+ −
for the if-branch. However in the case of while-loops this+ −
``upside-down-inside-out'' way of generating code still seems the most+ −
convenient.+ −
+ −
We are now ready to give the compile function for + −
if-statements---remember this function returns for statements a + −
pair consisting of the code and an environment:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\ + −
\multicolumn{3}{l}{$\qquad L_\textit{ifelse}\;$ (fresh label)}\\+ −
\multicolumn{3}{l}{$\qquad L_\textit{ifend}\;$ (fresh label)}\\+ −
\multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\+ −
\multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\+ −
\multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, L_\textit{ifelse})$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;L_\textit{ifend}$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_\textit{ifelse}:$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_\textit{ifend}:, E'')$}\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent In the first two lines we generate two fresh labels+ −
for the jump addresses (just before the else-branch and just+ −
after). In the next two lines we generate the instructions for+ −
the two branches, $is_1$ and $is_2$. The final code will+ −
be first the code for $b$ (including the label + −
just-before-the-else-branch), then the \pcode{goto} for after+ −
the else-branch, the label $L_\textit{ifesle}$, followed by+ −
the instructions for the else-branch, followed by the + −
after-the-else-branch label. Consider for example the + −
if-statement:+ −
+ −
\begin{lstlisting}[mathescape,numbers=none,language=While]+ −
if 1 = 1 then x := 2 else y := 3+ −
\end{lstlisting}+ −
+ −
\noindent + −
The generated code is as follows:+ −
+ −
\begin{lstlisting}[language=JVMIS,mathescape,numbers=left]+ −
ldc 1+ −
ldc 1+ −
if_icmpne L_ifelse $\quad\tikz[remember picture] \node (C) {\mbox{}};$+ −
ldc 2+ −
istore 0+ −
goto L_ifend $\quad\tikz[remember picture] \node (A) {\mbox{}};$+ −
L_ifelse: $\quad\tikz[remember picture] \node[] (D) {\mbox{}};$+ −
ldc 3+ −
istore 1+ −
L_ifend: $\quad\tikz[remember picture] \node[] (B) {\mbox{}};$+ −
\end{lstlisting}+ −
+ −
\begin{tikzpicture}[remember picture,overlay]+ −
\draw[->,very thick] (A) edge [->,to path={-- ++(10mm,0mm) + −
-- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (B.east);+ −
\draw[->,very thick] (C) edge [->,to path={-- ++(10mm,0mm) + −
-- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (D.east);+ −
\end{tikzpicture}+ −
+ −
\noindent The first three lines correspond to the the boolean+ −
expression $1 = 1$. The jump for when this boolean expression+ −
is false is in Line~3. Lines 4-6 corresponds to the if-branch;+ −
the else-branch is in Lines 8 and 9. Note carefully how the+ −
environment $E$ is threaded through the recursive calls of+ −
\textit{compile}. The function receives an environment $E$,+ −
but it might extend it when compiling the if-branch, yielding+ −
$E'$. This happens for example in the if-statement above+ −
whenever the variable \code{x} has not been used before.+ −
Similarly with the environment $E''$ for the second call to+ −
\textit{compile}. $E''$ is also the environment that needs to+ −
be returned as part of the answer.+ −
+ −
The compilation of the while-loops, say + −
\pcode{while} $b$ \pcode{do} $cs$, is very similar. In case+ −
the condition is true and we need to do another iteration, + −
and the control-flow needs to be as follows+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=2mm and 4mm,line cap=round,+ −
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm,+ −
top color=white,bottom color=black!20},+ −
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},+ −
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]+ −
\node (A0) [point, left=of A1] {};+ −
\node (A1) [point] {};+ −
\node (b) [block, right=of A1] {code of $b$};+ −
\node (A2) [point, right=of b] {};+ −
\node (cs1) [block, right=of A2] {code of $cs$};+ −
\node (A3) [point, right=of cs1] {};+ −
\node (A4) [point, right=of A3] {};+ −
+ −
\draw (A0) edge [->, black, line width=1mm] (b);+ −
\draw (b) edge [->, black, line width=1mm] (cs1);+ −
\draw (cs1) edge [->, black, line width=1mm,shorten >= -0.5mm] (A3);+ −
\draw (A3) edge [->,skip loop] (A1);+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent Whereas if the condition is \emph{not} true, we+ −
need to jump out of the loop, which gives the following+ −
control flow.+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[node distance=2mm and 4mm,line cap=round,+ −
block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm,+ −
top color=white,bottom color=black!20},+ −
point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},+ −
skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]+ −
\node (A0) [point, left=of A1] {};+ −
\node (A1) [point] {};+ −
\node (b) [block, right=of A1] {code of $b$};+ −
\node (A2) [point, right=of b] {};+ −
\node (cs1) [block, right=of A2] {code of $cs$};+ −
\node (A3) [point, right=of cs1] {};+ −
\node (A4) [point, right=of A3] {};+ −
+ −
\draw (A0) edge [->, black, line width=1mm] (b);+ −
\draw (b) edge [->, black, line width=1mm,shorten >= -0.5mm] (A2);+ −
\draw (A2) edge [skip loop] (A3);+ −
\draw (A3) edge [->, black, line width=1mm] (A4);+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent Again we can use the \textit{compile}-function for+ −
boolean expressions to insert the appropriate jump to the+ −
end of the loop (label $L_{wend}$ below).+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{compile}(\pcode{while}\; b\; \pcode{do} \;cs, E)$ & $\dn$\\ + −
\multicolumn{3}{l}{$\qquad L_{wbegin}\;$ (fresh label)}\\+ −
\multicolumn{3}{l}{$\qquad L_{wend}\;$ (fresh label)}\\+ −
\multicolumn{3}{l}{$\qquad (is, E') = \textit{compile}(cs_1, E)$}\\+ −
\multicolumn{3}{l}{$\qquad(L_{wbegin}:$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\;\textit{compile}(b, E, L_{wend})$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\;is$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\; \text{goto}\;L_{wbegin}$}\\+ −
\multicolumn{3}{l}{$\qquad\phantom{(}@\;L_{wend}:, E')$}\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent I let you go through how this clause works. As an example+ −
you can consider the while-loop+ −
+ −
\begin{lstlisting}[mathescape,numbers=none,language=While]+ −
while x <= 10 do x := x + 1+ −
\end{lstlisting}+ −
+ −
\noindent yielding the following code+ −
+ −
\begin{lstlisting}[language=JVMIS,mathescape,numbers=left]+ −
L_wbegin: $\quad\tikz[remember picture] \node[] (LB) {\mbox{}};$+ −
iload 0+ −
ldc 10+ −
if_icmpgt L_wend $\quad\tikz[remember picture] \node (LC) {\mbox{}};$+ −
iload 0+ −
ldc 1+ −
iadd+ −
istore 0+ −
goto L_wbegin $\quad\tikz[remember picture] \node (LA) {\mbox{}};$+ −
L_wend: $\quad\tikz[remember picture] \node[] (LD) {\mbox{}};$+ −
\end{lstlisting}+ −
+ −
\begin{tikzpicture}[remember picture,overlay]+ −
\draw[->,very thick] (LA) edge [->,to path={-- ++(10mm,0mm) + −
-- ++(0mm,17.3mm) |- (\tikztotarget)},line width=1mm] (LB.east);+ −
\draw[->,very thick] (LC) edge [->,to path={-- ++(10mm,0mm) + −
-- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (LD.east);+ −
\end{tikzpicture}+ −
+ −
\noindent+ −
As said, I leave it to you to decide whether the code implements+ −
the usual controlflow of while-loops.+ −
+ −
Next we need to consider the statement \pcode{write x}, which can be+ −
used to print out the content of a variable. For this we shall use a+ −
Java library function. In order to avoid having to generate a lot of+ −
code for each \pcode{write}-command, we use a separate helper-method and+ −
just call this method with an appropriate argument (which of course+ −
needs to be placed onto the stack). The code of the helper-method is as+ −
follows.+ −
+ −
\begin{lstlisting}[language=JVMIS,numbers=left]+ −
.method public static write(I)V + −
.limit locals 1 + −
.limit stack 2 + −
getstatic java/lang/System/out Ljava/io/PrintStream; + −
iload 0+ −
invokevirtual java/io/PrintStream/println(I)V + −
return + −
.end method+ −
\end{lstlisting}+ −
+ −
\noindent The first line marks the beginning of the method,+ −
called \pcode{write}. It takes a single integer argument+ −
indicated by the \pcode{(I)} and returns no result, indicated+ −
by the \pcode{V}. Since the method has only one argument, we+ −
only need a single local variable (Line~2) and a stack with+ −
two cells will be sufficient (Line 3). Line 4 instructs the+ −
JVM to get the value of the mem \pcode{out} of the class+ −
\pcode{java/lang/System}. It expects the value to be of type+ −
\pcode{java/io/PrintStream}. A reference to this value will be+ −
placed on the stack. Line~5 copies the integer we want to+ −
print out onto the stack. In the next line we call the method+ −
\pcode{println} (from the class \pcode{java/io/PrintStream}).+ −
We want to print out an integer and do not expect anything+ −
back (that is why the type annotation is \pcode{(I)V}). The+ −
\pcode{return}-instruction in the next line changes the+ −
control-flow back to the place from where \pcode{write} was+ −
called. This method needs to be part of a header that is+ −
included in any code we generate. The helper-method+ −
\pcode{write} can be invoked with the two instructions+ −
+ −
\begin{lstlisting}[mathescape,language=JVMIS]+ −
iload $E(x)$ + −
invokestatic XXX/XXX/write(I)V+ −
\end{lstlisting}+ −
+ −
\noindent where we first place the variable to be printed on+ −
top of the stack and then call \pcode{write}. The \pcode{XXX}+ −
need to be replaced by an appropriate class name (this will be+ −
explained shortly).+ −
+ −
+ −
By generating code for a WHILE-program, we end up with a list+ −
of (JVM assembly) instructions. Unfortunately, there is a bit+ −
more boilerplate code needed before these instructions can be+ −
run. The complete code is shown in Figure~\ref{boiler}. This+ −
boilerplate code is very specific to the JVM. If we target any+ −
other virtual machine or a machine language, then we would+ −
need to change this code. Lines 4 to 8 in Figure~\ref{boiler}+ −
contain a method for object creation in the JVM; this method+ −
is called \emph{before} the \pcode{main}-method in Lines 10 to+ −
17. Interesting are the Lines 11 and 12 where we hardwire that+ −
the stack of our programs will never be larger than 200 and+ −
that the maximum number of variables is also 200. This seem to+ −
be conservative default values that allow is to run some+ −
simple WHILE-programs. In a real compiler, we would of course+ −
need to work harder and find out appropriate values for the+ −
stack and local variables.+ −
+ −
\begin{figure}[t]+ −
\begin{lstlisting}[mathescape,language=JVMIS,numbers=left]+ −
.class public XXX.XXX+ −
.super java/lang/Object+ −
+ −
.method public static main([Ljava/lang/String;)V+ −
.limit locals 200+ −
.limit stack 200+ −
+ −
$\textit{\ldots{}here comes the compiled code\ldots}$+ −
+ −
return+ −
.end method+ −
\end{lstlisting}+ −
\caption{The boilerplate code needed for running generated code.\label{boiler}}+ −
\end{figure}+ −
+ −
+ −
To sum up, in Figure~\ref{test} is the complete code generated+ −
for the slightly nonsensical program+ −
+ −
\begin{lstlisting}[mathescape,language=While]+ −
x := 1 + 2;+ −
write x+ −
\end{lstlisting}+ −
+ −
\noindent I let you read the code and make sure the code behaves as+ −
expected. Having this code at our disposal, we need the assembler to+ −
translate the generated code into JVM bytecode (a class file). This+ −
bytecode is then understood by the JVM and can be run by just invoking+ −
the \pcode{java}-program.+ −
+ −
+ −
\begin{figure}[p]+ −
\lstinputlisting[language=JVMIS,mathescape]{../progs/test-small.j}+ −
\begin{tikzpicture}[remember picture,overlay]+ −
\draw[|<->|,very thick] (LA.north) -- (LB.south) + −
node[left=0mm,midway] {\footnotesize\texttt{x\,:=\,1\,+\,2}}; + −
\draw[|<->|,very thick] (LC.north) -- (LD.south)+ −
node[left=0mm,midway] {\footnotesize\texttt{write x}};+ −
\end{tikzpicture}+ −
\caption{The generated code for the test program \texttt{x := 1 + 2; write+ −
x}. This code can be processed by a Java assembler producing a+ −
class-file, which can then be run by the {\tt{}java}-program.\label{test}}+ −
\end{figure}+ −
+ −
\subsection*{Arrays}+ −
+ −
Maybe a useful addition to the WHILE-language would be arrays. This+ −
would allow us to generate more interesting WHILE-programs by+ −
translating BF*** programs into equivalent WHILE-code. Therefore in this+ −
section let us have a look at how we can support the following three+ −
constructions+ −
+ −
\begin{lstlisting}[mathescape,language=While]+ −
new(arr[15000])+ −
x := 3 + arr[3 + y]+ −
arr[42 * n] := ...+ −
\end{lstlisting}+ −
+ −
\noindent+ −
The first construct is for creating new arrays. In this instance the+ −
name of the array is \pcode{arr} and it can hold 15000 integers. We do+ −
not support ``dynamic'' arrays, that is the size of our arrays will+ −
always be fixed. The second construct is for referencing an array cell+ −
inside an arithmetic expression---we need to be able to look up the+ −
contents of an array at an index determined by an arithmetic expression.+ −
Similarly in the line below, we need to be able to update the content of+ −
an array at an calculated index. + −
+ −
For creating a new array we can generate the following three JVM+ −
instructions:+ −
+ −
\begin{lstlisting}[mathescape,language=JVMIS]+ −
ldc number + −
newarray int + −
astore loc_var+ −
\end{lstlisting}+ −
+ −
\noindent+ −
First we need to put the size of the array onto the stack. The next+ −
instruction creates the array. In this case the array contains+ −
\texttt{int}s. With the last instruction we can store the array as a+ −
local variable (like the ``simple'' variables from the previous+ −
section). The use of a local variable for each array allows us to have+ −
multiple arrays in a WHILE-program. For looking up an element in an+ −
array we can use the following JVM code+ −
+ −
\begin{lstlisting}[mathescape,language=JVMIS]+ −
aload loc_var + −
index_aexp + −
iaload+ −
\end{lstlisting}+ −
+ −
\noindent+ −
The first instruction loads the ``pointer'', or local variable, to the+ −
array onto the stack. Then we have some instructions calculating the+ −
index where we want to look up the array. The idea is that these+ −
instructions will leave a concrete number on the top of the stack, which+ −
will be the index into the array we need. Finally we need to tell the+ −
JVM to load the corresponding element onto the stack. Updating an array+ −
at an index with a value is as follows.+ −
+ −
\begin{lstlisting}[mathescape,language=JVMIS]+ −
aload loc_var + −
index_aexp + −
value_aexp + −
iastore+ −
\end{lstlisting}+ −
+ −
\noindent+ −
Again the first instruction loads the local variable of+ −
the array onto the stack. Then we have some instructions calculating+ −
the index where we want to update the array. After that come the+ −
instructions for with which value we want to update the array. The last+ −
line contains the instruction for updating the array.+ −
+ −
Next we need to modify our grammar rules for our WHILE-language: it+ −
seems best to extend the rule for factors in arithmetic expressions with+ −
a rule for looking up an array.+ −
+ −
\begin{plstx}[rhs style=, margin=3cm]+ −
: \meta{E} ::= \meta{T} $+$ \meta{E}+ −
| \meta{T} $-$ \meta{E}+ −
| \meta{T}\\+ −
: \meta{T} ::= \meta{F} $*$ \meta{T}+ −
| \meta{F} $\backslash$ \meta{T}+ −
| \meta{F}\\+ −
: \meta{F} ::= ( \meta{E} )+ −
| $\underbrace{\meta{Id}\,[\,\meta{E}\,]}_{new}$+ −
| \meta{Id}+ −
| \meta{Num}\\+ −
\end{plstx}+ −
+ −
\noindent+ −
There is no problem with left-recursion as the \meta{E} is ``protected''+ −
by an identifier and the brackets. There are two new rules for statements,+ −
one for creating an array and one for array assignment:+ −
+ −
\begin{plstx}[rhs style=, margin=2cm, one per line]+ −
: \meta{Stmt} ::= \ldots+ −
| \texttt{new}(\meta{Id}\,[\,\meta{Num}\,]) + −
| \meta{Id}\,[\,\meta{E}\,]\,:=\,\meta{E}\\+ −
\end{plstx}+ −
+ −
With this in place we can turn back to the idea of creating+ −
WHILE-programs by translating BF programs. This is a relatively easy+ −
task because BF has only eight instructions (we will actually implement+ −
seven because we can omit the read-in instruction from BF). What makes+ −
this translation easy is that BF-loops can be straightforwardly+ −
represented as while-loops. The Scala code for the translation is as+ −
follows:+ −
+ −
\begin{lstlisting}[language=Scala,numbers=left]+ −
def instr(c: Char) : String = c match {+ −
case '>' => "ptr := ptr + 1;"+ −
case '<' => "ptr := ptr - 1;"+ −
case '+' => "mem[ptr] := mem [ptr] + 1;"+ −
case '-' => "mem [ptr] := mem [ptr] - 1;"+ −
case '.' => "x := mem [ptr]; write x;"+ −
case '[' => "while (mem [ptr] != 0) do {"+ −
case ']' => "skip};"+ −
case _ => ""+ −
}+ −
\end{lstlisting}+ −
+ −
\noindent + −
The idea behind the translation is that BF-programs operate on an array,+ −
called here \texttt{mem}. The BP-memory pointer into this array is+ −
represented as the variable \texttt{ptr}. As usual the BF-instructions+ −
\code{>} and \code{<} increase, respectively decrease, \texttt{ptr}. The+ −
instructions \code{+} and \code{-} update a cell in \texttt{mem}. In+ −
Line 6 we need to first assign a \texttt{mem}-cell to an auxiliary variable+ −
since we have not changed our write functions in order to cope with+ −
writing out any array-content directly. Lines 7 and 8 are for+ −
translating BF-loops. Line 8 is interesting in the sense that we need to+ −
generate a \code{skip} instruction just before finishing with the+ −
closing \code{"\}"}. The reason is that we are rather pedantic about+ −
semicolons in our WHILE-grammar: the last command cannot have a+ −
semicolon---adding a \code{skip} works around this snag. Putting all+ −
this together is we can generate WHILE-programs with more than 400+ −
instructions and then run the compiled JVM code for such programs.+ −
\bigskip+ −
+ −
\noindent+ −
Hooooray, we can finally run the BF-mandelbrot program on the JVM and it+ −
completes within 20 seconds (after nearly 10 minutes of parsing the+ −
corresponding WHILE-program and generating 270K of a class file). Try+ −
replicating the 20 secs with an interpreter! OK, we now face the+ −
nagging question about what to do next\ldots+ −
+ −
\subsection*{Added Value}+ −
+ −
As you have probably seen, the compiler writer has a lot of freedom+ −
about how to generate code from what the progarmmer wrote as program.+ −
The only condition is that generated code should behave as expected by+ −
the programmer. Then all is fine\ldots mission accomplished! But+ −
sometimes the compiler writer is expected to go an extra mile, or even+ −
miles. Suppose we are given the following WHILE-program:+ −
+ −
\begin{lstlisting}[mathescape,language=While]+ −
new(arr[10]);+ −
arr[14] := 3 + arr[13]+ −
\end{lstlisting}+ −
+ −
\noindent + −
While admittedly this is a contrived program, and probably not meant to+ −
be like this by any sane programmer, it is supposed to make the+ −
following point: We generate an array of size 10, and then try to access+ −
the non-existing element at index 13 and even updating element with+ −
index 14. Obviously this is baloney. However, our compiler generates+ −
code for this program without any questions asked. We can even run this+ −
code on the JVM\ldots of course the result is an exception trace where+ −
the JVM yells at us for doing naughty things. (This is much better than+ −
C, for example, where such errors are not prevented and as a result+ −
insidious attacks can be mounted against such kind C-programs. I assume+ −
everyone has heard about \emph{Buffer Overflow Attacks}.) + −
+ −
Imagine we do not want to rely in our compiler on the JVM for producing+ −
an annoying, but safe exception trace, rather we want to handle such+ −
situations ourselves. Lets assume we want to handle them in the+ −
following way: if the programmer access a field out-of-bounds, we just+ −
return the default 0, and if a programmer wants to update an+ −
out-of-bounds filed, we quietly ignore this update.+ −
+ −
\end{document}+ −
+ −
%%% Local Variables: + −
%%% mode: latex + −
%%% TeX-master: t+ −
%%% End: + −