--- a/handouts/ho09.tex Fri Nov 01 13:21:51 2019 +0000
+++ b/handouts/ho09.tex Wed Nov 06 15:04:40 2019 +0000
@@ -16,25 +16,25 @@
Reflecting on our tiny compiler targetting the JVM, the code generation
part was actually not so hard, no? Pretty much just some post-traversal
-of the abstract syntax tree, yes? One of the main reason for this ease
-is that the JVM is a stack-based virtual machine and it is therefore not
-hard to translate arithmetic expressions into a sequence of instructions
-manipulating the stack. The problem is that ``real'' CPUs, although
-supporting stack operations, are not really designed to be \emph{stack
-machines}. The design of CPUs is more like, here is a chunk of
-memory---compiler, or better compiler writers, do something with it.
-Consequently, modern compilers need to go the extra mile in order to
-generate code that is much easier and faster to process by CPUs. To make
-this all tractable for this module, we target the LLVM Intermediate
-Language. In this way we can take advantage of the tools coming with
-LLVM. For example we do not have to worry about things like register
-allocations.\bigskip
+of the abstract syntax tree, yes? One of the reasons for this ease is
+that the JVM is a stack-based virtual machine and it is therefore not
+hard to translate deeply-nested arithmetic expressions into a sequence
+of instructions manipulating the stack. The problem is that ``real''
+CPUs, although supporting stack operations, are not really designed to
+be \emph{stack machines}. The design of CPUs is more like, here is a
+chunk of memory---compiler, or better compiler writers, do something
+with it. Consequently, modern compilers need to go the extra mile in
+order to generate code that is much easier and faster to process by
+CPUs. To make this all tractable for this module, we target the LLVM
+Intermediate Language. In this way we can take advantage of the tools
+coming with LLVM. For example we do not have to worry about things like
+register allocations.\bigskip
\noindent LLVM\footnote{\url{http://llvm.org}} is a beautiful example
that projects from Academia can make a difference in the world. LLVM
started in 2000 as a project by two researchers at the University of
Illinois at Urbana-Champaign. At the time the behemoth of compilers was
-gcc with its myriad of front-ends for other languages (e.g.~Fortran,
+gcc with its myriad of front-ends for other languages (C++, Fortran,
Ada, Go, Objective-C, Pascal etc). The problem was that gcc morphed over
time into a monolithic gigantic piece of m\ldots ehm software, which you
could not mess about in an afternoon. In contrast, LLVM is designed to
@@ -46,30 +46,32 @@
or less legacy. This does not mean that programming languages like C and
C++ are dying out any time soon---they are nicely supported by LLVM.
-Targetting the LLVM Intermediate Language, or Intermediate
-Representation (short LLVM-IR), also means we can profit from the very
+We will target the LLVM Intermediate Language, or Intermediate
+Representation (short LLVM-IR). As a result we can benefit from the
modular structure of the LLVM compiler and let for example the compiler
-generate code for X86, or ARM etc. That means we can be agnostic about
-where our code actually runs. However, what we have to do is to generate
-code in \emph{Static Single-Assignment} format (short SSA), because that
-is what the LLVM-IR expects from us. LLVM-IR is the intermediate format
-that LLVM uses for doing cool things, like targetting strange
-architectures, optimising code and allocating memory efficiently.
+generate code for different CPUs, like X86 or ARM. That means we can be
+agnostic about where our code actually runs. We can also be ignorant
+about optimising code and allocating memory efficiently. However, what
+we have to do is to generate code in \emph{Static Single-Assignment}
+format (short SSA), because that is what the LLVM-IR expects from us.
The idea behind the SSA format is to use very simple variable
assignments where every variable is assigned only once. The assignments
also need to be primitive in the sense that they can be just simple
operations like addition, multiplication, jumps, comparisons and so on.
-An idealised snippet of a program in SSA is
-
-\begin{lstlisting}[language=LLVM,numbers=none]
- x := 1
- y := 2
- z := x + y
+Say, we have an expression $((1 + a) + (3 + (b * 5)))$, then the
+SSA is
+
+\begin{lstlisting}[language=LLVMIR,numbers=left]
+let tmp0 = add 1 a in
+let tmp1 = mul b 5 in
+let tmp2 = add 3 tmp1 in
+let tmp3 = add tmp0 tmp2 in
+ tmp3
\end{lstlisting}
\noindent where every variable is used only once (we could not write
-\texttt{x := x + y} in the last line for example). There are
+\texttt{tmp1 = add 3 tmp1} in Line 3 for example). There are
sophisticated algorithms for imperative languages, like C, that
efficiently transform a high-level program into SSA format. But we can
ignore them here. We want to compile a functional language and there
@@ -81,11 +83,11 @@
\subsection*{LLVM-IR}
Before we start, lets first have a look at the \emph{LLVM Intermediate
-Representation}. What is good about our simple Fun language is that it
-basically only contains expressions (be they arithmetic expressions or
-boolean expressions). The exception is function definitions. Luckily,
-for them we can use the mechanism of defining functions in LLVM-IR. For
-example the simple Fun program
+Representation} in more detail. What is good about our toy Fun language
+is that it basically only contains expressions (be they arithmetic
+expressions or boolean expressions or if-expressions). The exception are
+function definitions. Luckily, for them we can use the mechanism of
+defining functions in LLVM-IR. For example the simple Fun program
\begin{lstlisting}[language=Scala,numbers=none]
@@ -102,30 +104,33 @@
}
\end{lstlisting}
-\noindent First to notice is that all variable names in the LLVM-IR are
-prefixed by \texttt{\%}; function names need to be prefixed with @.
-Also, the LLVM-IR is a fully typed language. The \texttt{i32} type stands
-for a 32-bit integer. There are also types for 64-bit integers, chars
-(\texttt{i8}), floats, arrays and even pointer types. In teh code above,
-\texttt{sqr} takes an argument of type \texttt{i32} and produces a
-result of type \texttt{i32}. Each arithmetic operation, like addition or
-multiplication, are also prefixed with the type they operate on.
-Obviously these types need to match up\ldots{} but since we have in our
-programs only integers, \texttt{i32} everywhere will do.
+\noindent First notice that all variable names in the LLVM-IR are
+prefixed by \texttt{\%}; function names need to be prefixed with
+@-symbol. Also, the LLVM-IR is a fully typed language. The \texttt{i32}
+type stands for 32-bit integers. There are also types for 64-bit
+integers (\texttt{i64}), chars (\texttt{i8}), floats, arrays and even
+pointer types. In the code above, \texttt{sqr} takes an argument of type
+\texttt{i32} and produces a result of type \texttt{i32} (the result type
+is before the function name, like in C). Each arithmetic operation, like
+addition or multiplication, are also prefixed with the type they operate
+on. Obviously these types need to match up\ldots{} but since we have in
+our programs only integers, \texttt{i32} everywhere will do.
Conveniently, you can use the program \texttt{lli}, which comes with
LLVM, to interpret programs written in the LLVM-IR. So you can easily
check whether the code you produced actually works. To get a running
program that does something interesting you need to add some boilerplate
about printing out numbers and a main-function that is the entrypoint
-for the program (see Figure~\ref{lli}). You can generate a binary for
-this program using \texttt{llc}-compiler in order to generate an object
-file and then use gcc (clang) for generating a binary:
+for the program (see Figure~\ref{lli} for a complete listing). You can
+generate a binary for this program using \texttt{llc}-compiler and
+\texttt{gcc}---\texttt{llc} can generate an object file and then you can
+use \texttt{gcc} (that is clang) for generating an executable binary:
\begin{lstlisting}[language=bash,numbers=none]
llc -filetype=obj sqr.ll
gcc sqr.o -o a.out
./a.out
+> 25
\end{lstlisting}
\begin{figure}[t]\small
@@ -141,13 +146,41 @@
\subsection*{Our Own Intermediate Language}
Remember compilers have to solve the problem of bridging the gap between
-``high-level'' programs and ``low-level'' hardware. If the gap is tool
-wide then a good strategy is to lay a stepping stone somewhere in
-between. The LLVM-IR itself is such a stepping stone to make the task of
-generating code easier. Like a real compiler we will use another
-stepping stone which I call \emph{K-language}. For this remember
-expressions (and boolean expressions) in the Fun language are given by
-the code on top of Figure~\ref{absfun}
+``high-level'' programs and ``low-level'' hardware. If the gap is too
+wide for one step, then a good strategy is to lay a stepping stone
+somewhere in between. The LLVM-IR itself is such a stepping stone to
+make the task of generating and optimising code easier. Like a real
+compiler we will use our own stepping stone which I call the
+\emph{K-language}. For this remember expressions (and boolean
+expressions) in the Fun language. For convenience the Scala code is
+shown on top of Figure~\ref{absfun}. Below is the code for the
+K-language. There are two kinds of syntactic entities, namely
+\emph{K-values} and \emph{K-expressions}. The central point of the
+K-language is the \texttt{KLet}-constructor. For this recall that
+arithmetic expressions such as $((1 + a) + (3 + (b * 5)))$ need
+to be broken up into smaller ``atomic'' steps, like so
+
+\begin{lstlisting}[language=LLVMIR,numbers=none]
+let tmp0 = add 1 a in
+let tmp1 = mul b 5 in
+let tmp2 = add 3 tmp1 in
+let tmp3 = add tmp0 tmp2 in
+ tmp3
+\end{lstlisting}
+
+\noindent
+Here \texttt{tmp3} will contain the result of what the expression stands
+for. In each step we can only perform an ``atomic'' operation, like
+addition or multiplication. We could not for example have an
+if-condition on the right-hand side of an equals. These constraints
+enforced upon us because how the SSA format works in the LLVM-IR. By
+having in \texttt{KLet}, first a string (standing for an intermediate
+result) and second a value, we can fulfil this constraint---there is no
+way we could write anything else than a value. To sum up, K-values are
+the atomic operations that can be on the right-hand side of equal-signs.
+The K-language is restricted such that it is easy to generate the SSA format
+for the LLVM-IR.
+
\begin{figure}[p]\small
@@ -178,7 +211,7 @@
case class KWrite(v: KVal) extends KVal
case class KIf(x1: String, e1: KExp, e2: KExp) extends KExp
-case class KLet(x: String, e1: KVal, e2: KExp) extends KExp
+case class KLet(x: String, v: KVal, e: KExp) extends KExp
case class KReturn(v: KVal) extends KExp
\end{lstlisting}
\caption{Abstract syntax trees for the Fun language.\label{absfun}}