handouts/ho09.tex
changeset 680 eecc4d5a2172
parent 679 8fc109f36b78
child 700 52263ffd17b9
equal deleted inserted replaced
679:8fc109f36b78 680:eecc4d5a2172

\section*{Handout 9 (LLVM, SSA and CPS)}

Reflecting on our tiny compiler targeting the JVM, the code generation
part was actually not so hard, no? Pretty much just some post-order
traversal of the abstract syntax tree, yes? One of the reasons for this
ease is that the JVM is a stack-based virtual machine and it is
therefore not hard to translate deeply-nested arithmetic expressions
into a sequence of instructions manipulating the stack. The problem is
that ``real'' CPUs, although supporting stack operations, are not really
designed to be \emph{stack machines}.  The design of CPUs is more like,
here is a chunk of memory---compiler, or better compiler writers, do
something with it. Consequently, modern compilers need to go the extra
mile in order to generate code that is much easier and faster for CPUs
to process. To make this all tractable for this module, we target the
LLVM Intermediate Language. In this way we can take advantage of the
tools coming with LLVM. For example we do not have to worry about things
like register allocation.\bigskip

\noindent LLVM\footnote{\url{http://llvm.org}} is a beautiful example
that projects from academia can make a difference in the world. LLVM
started in 2000 as a project by two researchers at the University of
Illinois at Urbana-Champaign. At the time the behemoth of compilers was
gcc with its myriad of front-ends for other languages (C++, Fortran,
Ada, Go, Objective-C, Pascal etc.). The problem was that gcc morphed
over time into a monolithic gigantic piece of m\ldots ehm software,
which you could not mess about in an afternoon. In contrast, LLVM is
designed to be a modular suite of tools with which you could play around
easily and try out something new. LLVM became a big player once Apple
hired one of its main developers (maybe because Apple did not want to
use gcc, but maybe they were also just disgusted by its big monolithic
codebase). Anyway, LLVM is now the big player and gcc is more or less
legacy. This does not mean that programming languages like C and C++
are dying out any time soon---they are nicely supported by LLVM.

We will target the LLVM Intermediate Language, or Intermediate
Representation (short LLVM-IR). As a result we can benefit from the
modular structure of the LLVM compiler and, for example, let the
compiler generate code for different CPUs, like X86 or ARM. That means
we can be agnostic about where our code actually runs. We can also be
ignorant about optimising code and allocating memory efficiently.
However, what we have to do is to generate code in \emph{Static
Single-Assignment} format (short SSA), because that is what the LLVM-IR
expects from us.

The idea behind the SSA format is to use very simple variable
assignments where every variable is assigned only once. The assignments
also need to be primitive in the sense that they can be just simple
operations like addition, multiplication, jumps, comparisons and so on.
Say we have the expression $((1 + a) + (3 + (b * 5)))$, then the
corresponding SSA is

\begin{lstlisting}[language=LLVMIR,numbers=left]
let tmp0 = add 1 a in
let tmp1 = mul b 5 in
let tmp2 = add 3 tmp1 in
let tmp3 = add tmp0 tmp2 in
  tmp3
\end{lstlisting}

\noindent where every variable is assigned only once (we could not
write \texttt{tmp1 = add 3 tmp1} in Line 3 for example).  There are
sophisticated algorithms for imperative languages, like C, that
efficiently transform a high-level program into SSA format. But we can
ignore them here. We want to compile a functional language and there
things get much more interesting than just sophisticated. We will need
to have a look at CPS translations, where the CPS stands for
\emph{Continuation-Passing-Style}---basically a form of abracadabra
programming. So sit tight.
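
The flattening of a nested arithmetic expression into such a sequence of
let-bindings can be sketched in a few lines of Scala. The snippet below
is only a sketch, not the module's actual code: the mini-AST
(\texttt{Num}, \texttt{Var}, \texttt{Aop}) and the helper
\texttt{fresh} are hypothetical names chosen for illustration.

\begin{lstlisting}[language=Scala,numbers=none]
// hypothetical mini-AST for arithmetic expressions
sealed trait Expr
case class Num(n: Int) extends Expr
case class Var(s: String) extends Expr
case class Aop(o: String, a1: Expr, a2: Expr) extends Expr

// generates fresh names tmp0, tmp1, tmp2, ...
var counter = -1
def fresh() : String = { counter += 1 ; s"tmp$counter" }

// returns the list of atomic let-bindings together with the
// name holding the overall result; every temporary is
// assigned exactly once
def flatten(e: Expr) : (List[String], String) = e match {
  case Num(n) => (Nil, n.toString)
  case Var(s) => (Nil, s)
  case Aop(o, a1, a2) =>
    val (bs1, v1) = flatten(a1)
    val (bs2, v2) = flatten(a2)
    val t = fresh()
    (bs1 ::: bs2 ::: List(s"let $t = $o $v1 $v2 in"), t)
}

// ((1 + a) + (3 + (b * 5)))
val e = Aop("add", Aop("add", Num(1), Var("a")),
                   Aop("add", Num(3), Aop("mul", Var("b"), Num(5))))
\end{lstlisting}

\noindent Calling \texttt{flatten(e)} produces exactly the four
let-bindings shown above, with \texttt{tmp3} naming the final result.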

\subsection*{LLVM-IR}

Before we start, let's first have a look at the \emph{LLVM Intermediate
Representation} in more detail. What is good about our toy Fun language
is that it basically only contains expressions (be they arithmetic
expressions or boolean expressions or if-expressions). The exceptions
are function definitions. Luckily, for them we can use the mechanism of
defining functions in LLVM-IR. For example the simple Fun program

\begin{lstlisting}[language=Scala,numbers=none]
def sqr(x) = x * x
\end{lstlisting}

\noindent can be compiled into the following LLVM-IR function:

\begin{lstlisting}[language=LLVM,numbers=none]
define i32 @sqr(i32 %x) {
   %tmp = mul i32 %x, %x
   ret i32 %tmp
}
\end{lstlisting}

\noindent First notice that all variable names in the LLVM-IR are
prefixed by \texttt{\%}; function names need to be prefixed with the
@-symbol. Also, the LLVM-IR is a fully typed language. The \texttt{i32}
type stands for 32-bit integers. There are also types for 64-bit
integers (\texttt{i64}), chars (\texttt{i8}), floats, arrays and even
pointer types. In the code above, \texttt{sqr} takes an argument of type
\texttt{i32} and produces a result of type \texttt{i32} (the result type
is given before the function name, like in C). Each arithmetic
operation, like addition or multiplication, is also prefixed with the
type it operates on. Obviously these types need to match up\ldots{} but
since we have in our programs only integers, \texttt{i32} everywhere
will do.

Conveniently, you can use the program \texttt{lli}, which comes with
LLVM, to interpret programs written in the LLVM-IR. So you can easily
check whether the code you produced actually works. To get a running
program that does something interesting you need to add some boilerplate
about printing out numbers and a main-function that is the entrypoint
for the program (see Figure~\ref{lli} for a complete listing). You can
generate a binary for this program using the \texttt{llc}-compiler and
\texttt{gcc}---\texttt{llc} generates an object file, which you can then
turn into an executable binary with \texttt{gcc} (that is, clang):

\begin{lstlisting}[language=bash,numbers=none]
llc -filetype=obj sqr.ll
gcc sqr.o -o a.out
./a.out
> 25
\end{lstlisting}

\begin{figure}[t]\small
\lstinputlisting[language=LLVM,numbers=left]{../progs/sqr.ll}
\caption{An LLVM-IR program for calculating the square function. The
  main-function prints out the result of calling
  \texttt{sqr}.\label{lli}}
\end{figure}

\subsection*{Our Own Intermediate Language}

Remember compilers have to solve the problem of bridging the gap between
``high-level'' programs and ``low-level'' hardware. If the gap is too
wide for one step, then a good strategy is to lay a stepping stone
somewhere in between. The LLVM-IR itself is such a stepping stone to
make the task of generating and optimising code easier. Like a real
compiler we will use our own stepping stone which I call the
\emph{K-language}. For this, recall the expressions (and boolean
expressions) of the Fun language; for convenience the Scala code is
shown on top of Figure~\ref{absfun}. Below it is the code for the
K-language. There are two kinds of syntactic entities, namely
\emph{K-values} and \emph{K-expressions}. The central point of the
K-language is the \texttt{KLet}-constructor. For this recall that
arithmetic expressions such as $((1 + a) + (3 + (b * 5)))$ need to be
broken up into smaller ``atomic'' steps, like so

\begin{lstlisting}[language=LLVMIR,numbers=none]
let tmp0 = add 1 a in
let tmp1 = mul b 5 in
let tmp2 = add 3 tmp1 in
let tmp3 = add tmp0 tmp2 in
  tmp3
\end{lstlisting}

\noindent
Here \texttt{tmp3} will contain the result of what the whole expression
stands for. In each step we can only perform an ``atomic'' operation,
like addition or multiplication. We could not, for example, have an
if-condition on the right-hand side of an equals. These constraints are
enforced upon us because of how the SSA format works in the LLVM-IR. By
having in \texttt{KLet} first a string (standing for an intermediate
result) and second a value, we can fulfil this constraint---there is no
way we could write anything other than a value. To sum up, K-values are
the atomic operations that can be on the right-hand side of equal-signs.
The K-language is restricted such that it is easy to generate the SSA
format for the LLVM-IR.
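
The nesting of \texttt{KLet}s mirrors this sequence of atomic steps
directly. Below is a small hand-written sketch: \texttt{KVar} and
\texttt{KNum} stand for variables and numbers (these constructors are
assumed here and are not part of the excerpt in Figure~\ref{absfun}),
and the printer functions \texttt{pv} and \texttt{pe} are hypothetical
helpers for illustration.

\begin{lstlisting}[language=Scala,numbers=none]
sealed trait KVal
case class KVar(s: String) extends KVal
case class KNum(i: Int) extends KVal
case class Kop(o: String, v1: KVal, v2: KVal) extends KVal

sealed trait KExp
case class KLet(x: String, v: KVal, e: KExp) extends KExp
case class KReturn(v: KVal) extends KExp

// print K-values and K-expressions in the let-syntax used above
def pv(v: KVal) : String = v match {
  case KVar(s) => s
  case KNum(i) => i.toString
  case Kop(o, v1, v2) => s"$o ${pv(v1)} ${pv(v2)}"
}

def pe(e: KExp) : String = e match {
  case KLet(x, v, rest) => s"let $x = ${pv(v)} in\n${pe(rest)}"
  case KReturn(v) => s"  ${pv(v)}"
}

// ((1 + a) + (3 + (b * 5))) written out as nested KLets
val example =
  KLet("tmp0", Kop("add", KNum(1), KVar("a")),
  KLet("tmp1", Kop("mul", KVar("b"), KNum(5)),
  KLet("tmp2", Kop("add", KNum(3), KVar("tmp1")),
  KLet("tmp3", Kop("add", KVar("tmp0"), KVar("tmp2")),
  KReturn(KVar("tmp3"))))))
\end{lstlisting}

\noindent Printing \texttt{example} with \texttt{pe} reproduces the
five lines of the listing above, one per \texttt{KLet}, ending in
\texttt{tmp3}.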
       

\begin{figure}[p]\small
\begin{lstlisting}[language=Scala,numbers=none]
// Fun-language (expressions)

// K-language (K-values, K-expressions)
case class Kop(o: String, v1: KVal, v2: KVal) extends KVal
case class KCall(o: String, vrs: List[KVal]) extends KVal
case class KWrite(v: KVal) extends KVal

case class KIf(x1: String, e1: KExp, e2: KExp) extends KExp
case class KLet(x: String, v: KVal, e: KExp) extends KExp
case class KReturn(v: KVal) extends KExp
\end{lstlisting}
\caption{Abstract syntax trees for the Fun language.\label{absfun}}
\end{figure}