% !TEX program = xelatex\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\usepackage{../grammar}%%\usepackage{multicol}%%\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}\begin{document}\fnote{\copyright{} Christian Urban, King's College London, 2019}\section*{Handout 9 (LLVM, SSA and CPS)}Reflecting on our tiny compiler targetting the JVM, the code generationpart was actually not so hard, no? Pretty much just some post-traversalof the abstract syntax tree, yes? One of the main reason for this easeis that the JVM is a stack-based virtual machine and it is therefore nothard to translate arithmetic expressions into a sequence of instructionsmanipulating the stack. The problem is that ``real'' CPUs, althoughsupporting stack operations, are not really designed to be \emph{stackmachines}. The design of CPUs is more like, here is a chunk ofmemory---compiler, or better compiler writers, do something with it.Consequently, modern compilers need to go the extra mile in order togenerate code that is much easier and faster to process by CPUs. To makethis all tractable for this module, we target the LLVM IntermediateLanguage. In this way we can take advantage of the tools coming withLLVM. For example we do not have to worry about things like registerallocations.\bigskip \noindent LLVM\footnote{\url{http://llvm.org}} is a beautiful examplethat projects from Academia can make a difference in the world. LLVMstarted in 2000 as a project by two researchers at the University ofIllinois at Urbana-Champaign. At the time the behemoth of compilers wasgcc with its myriad of front-ends for other languages (e.g.~Fortran,Ada, Go, Objective-C, Pascal etc). The problem was that gcc morphed overtime into a monolithic gigantic piece of m\ldots ehm software, which youcould not mess about in an afternoon. In contrast, LLVM is designed tobe a modular suite of tools with which you could play around easily andtry out something new. LLVM became a big player once Apple hired one ofthe original developers (I cannot remember the reason why Apple did notwant to use gcc, but maybe they were also just disgusted by its bigmonolithic codebase). Anyway, LLVM is now the big player and gcc is moreor less legacy. This does not mean that programming languages like C andC++ are dying out any time soon---they are nicely supported by LLVM.Targetting the LLVM Intermediate Language, or IntermediateRepresentation (short LLVM-IR), also means we can profit from the verymodular structure of the LLVM compiler and let for example the compilergenerate code for X86, or ARM etc. That means we can be agnostic aboutwhere our code actually runs. However, what we have to do is to generatecode in \emph{Static Single-Assignment} format (short SSA), because thatis what the LLVM-IR expects from us. LLVM-IR is the intermediate formatthat LLVM uses for doing cool things, like targetting strangearchitectures, optimising code and allocating memory efficiently. The idea behind the SSA format is to use very simple variableassignments where every variable is assigned only once. The assignmentsalso need to be primitive in the sense that they can be just simpleoperations like addition, multiplication, jumps, comparisons and so on.An idealised snippet of a program in SSA is \begin{lstlisting}[language=LLVM,numbers=none] x := 1 y := 2 z := x + y\end{lstlisting}\noindent where every variable is used only once (we could not write\texttt{x := x + y} in the last line for example). There aresophisticated algorithms for imperative languages, like C, thatefficiently transform a high-level program into SSA format. But we canignore them here. We want to compile a functional language and therethings get much more interesting than just sophisticated. We will needto have a look at CPS translations, where the CPS stands forContinuation-Passing-Style---basically black programming art orabracadabra programming. So sit tight.\subsection*{LLVM-IR}Before we start, lets first have a look at the \emph{LLVM IntermediateRepresentation}. What is good about our simple Fun language is that itbasically only contains expressions (be they arithmetic expressions orboolean expressions). The exception is function definitions. Luckily,for them we can use the mechanism of defining functions in LLVM-IR. Forexample the simple Fun program \begin{lstlisting}[language=Scala,numbers=none]def sqr(x) = x * x\end{lstlisting}\noindentcan be compiled into the following LLVM-IR function:\begin{lstlisting}[language=LLVM]define i32 @sqr(i32 %x) { %tmp = mul i32 %x, %x ret i32 %tmp} \end{lstlisting}\noindent First to notice is that all variable names in the LLVM-IR areprefixed by \texttt{\%}; function names need to be prefixed with @.Also, the LLVM-IR is a fully typed language. The \texttt{i32} type standsfor a 32-bit integer. There are also types for 64-bit integers, chars(\texttt{i8}), floats, arrays and even pointer types. In teh code above,\texttt{sqr} takes an argument of type \texttt{i32} and produces aresult of type \texttt{i32}. Each arithmetic operation, like addition ormultiplication, are also prefixed with the type they operate on.Obviously these types need to match up\ldots{} but since we have in ourprograms only integers, \texttt{i32} everywhere will do. Conveniently, you can use the program \texttt{lli}, which comes withLLVM, to interpret programs written in the LLVM-IR. So you can easilycheck whether the code you produced actually works. To get a runningprogram that does something interesting you need to add some boilerplateabout printing out numbers and a main-function that is the entrypointfor the program (see Figure~\ref{lli}). You can generate a binary forthis program using \texttt{llc}-compiler in order to generate an objectfile and then use gcc (clang) for generating a binary:\begin{lstlisting}[language=bash,numbers=none]llc -filetype=obj sqr.llgcc sqr.o -o a.out./a.out\end{lstlisting}\begin{figure}[t]\small \lstinputlisting[language=LLVM,numbers=left]{../progs/sqr.ll}\caption{An LLVM-IR program for calculating the square function. The interesting function is \texttt{sqr} in Lines 13 -- 16. The mainfunction calls \texttt{sqr} and prints out the result. The othercode is boilerplate for printing out integers.\label{lli}}\end{figure} \subsection*{Our Own Intermediate Language}Remember compilers have to solve the problem of bridging the gap between``high-level'' programs and ``low-level'' hardware. If the gap is toolwide then a good strategy is to lay a stepping stone somewhere inbetween. The LLVM-IR itself is such a stepping stone to make the task ofgenerating code easier. Like a real compiler we will use anotherstepping stone which I call \emph{K-language}. For this rememberexpressions (and boolean expressions) in the Fun language are given bythe code on top of Figure~\ref{absfun}\begin{figure}[p]\small\begin{lstlisting}[language=Scala,numbers=none]// Fun-language (expressions)abstract class Exp abstract class BExp case class Call(name: String, args: List[Exp]) extends Expcase class If(a: BExp, e1: Exp, e2: Exp) extends Expcase class Write(e: Exp) extends Expcase class Var(s: String) extends Expcase class Num(i: Int) extends Expcase class Aop(o: String, a1: Exp, a2: Exp) extends Expcase class Sequence(e1: Exp, e2: Exp) extends Expcase class Bop(o: String, a1: Exp, a2: Exp) extends BExp // K-language (K-expressions, K-values)abstract class KExpabstract class KValcase class KVar(s: String) extends KValcase class KNum(i: Int) extends KValcase class Kop(o: String, v1: KVal, v2: KVal) extends KValcase class KCall(o: String, vrs: List[KVal]) extends KValcase class KWrite(v: KVal) extends KValcase class KIf(x1: String, e1: KExp, e2: KExp) extends KExpcase class KLet(x: String, e1: KVal, e2: KExp) extends KExpcase class KReturn(v: KVal) extends KExp\end{lstlisting}\caption{Abstract syntax trees for the Fun language.\label{absfun}}\end{figure}\subsection*{CPS-Translations}Another reason why it makes sense to go the extra mile is that stackinstructions are very difficult to optimise---you cannot just re-arrangeinstructions without messing about with what is calculated on the stack.Also it is hard to find out if all the calculations on the stack areactually necessary and not by chance dead code. The JVM has for all thissophisticated machinery to make such ``high-level'' code still run fast,but let's say that for the sake of argument we do not want to rely onit. We want to generate fast code ourselves. This means we have to workaround the intricacies of what instructions CPUs can actually process.\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: