--- a/handouts/ho07.tex Mon Nov 16 15:16:26 2015 +0000
+++ b/handouts/ho07.tex Tue Nov 17 01:58:50 2015 +0000
@@ -2,11 +2,12 @@
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
-\usepackage{tikz-qtree}
+\usepackage{../graphics}
+
\begin{document}
-\section*{Handout 7 (Compilation of the WHILE-Language)}
+\section*{Handout 7 (Compilation)}
The purpose of a compiler is to transform a program, a human
can write, into code the machine can run as fast as possible.
@@ -18,11 +19,13 @@
compiler would normally need to take care of (like explicit
memory management).
-We will be generating code for the Java Virtual Machine. This
-is a stack-based virtual machine which will make it easy to
-generate code for arithmetic expressions. For example for
-generating code for the expression $1 + 2$ we need to issue
-the following three instructions
+As an example we will implement a compiler for the very simple
+While-language. We will be generating code for the Java
+Virtual Machine. This is a stack-based virtual machine, a fact
+which will make it easy to generate code for arithmetic
+expressions. For example for generating code for the
+expression $1 + 2$ we need to generate the following three
+instructions
\begin{lstlisting}[numbers=none]
ldc 1
@@ -30,17 +33,18 @@
iadd
\end{lstlisting}
-\noindent The first instruction loads the constant $1$ on the
-stack, the next one $2$, the third instruction add both
-numbers together replacing the top elements of the stack with
-the result $3$. We will throughout consider only integer
-numbers and results, therefore we can use the instructions
-\code{iadd}, \code{isub}, \code{imul}, \code{idiv} and so on.
-The \code{i} stands for integer instructions (alternatives are
-\code{d} for doubles, \code{l} for longs and \code{f} for
-floats).
+\noindent The first instruction loads the constant $1$ onto
+the stack, the next one $2$, the third instruction adds both
+numbers together replacing the top elements of the stack with
+the result $3$. For simplicity, we will throughout consider
+only integer numbers and results. Therefore we can use the
+instructions \code{iadd}, \code{isub}, \code{imul},
+\code{idiv} and so on. The \code{i} stands for integer
+instructions in the JVM (alternatives are \code{d} for doubles,
+\code{l} for longs and \code{f} for floats).
-Recall our grammar for arithmetic expressions:
+Recall our grammar for arithmetic expressions ($E$ is the
+starting symbol):
\begin{plstx}[rhs style=, margin=3cm]
: \meta{E} ::= \meta{T} $+$ \meta{E}
@@ -55,9 +59,9 @@
\noindent where \meta{Id} stands for variables and
-\meta{Num} for number. For the moment let us omit variables from
+\meta{Num} for numbers. For the moment let us omit variables from
arithmetic expressions. Our parser will take this grammar and
-produce abstract syntax trees, for
+produce abstract syntax trees. For
example for the expression $1 + ((2 * 3) + (4 - 3))$ it will
produce the following tree.
@@ -65,10 +69,11 @@
\begin{tikzpicture}
\Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
\end{tikzpicture}
\end{center}
-\noindent
-To generate code for this expression, we need to traverse this
-tree in post-order fashion---this will produce code for a
-stack-machine, like the JVM. Doing so gives the instructions
+\noindent To generate code for this expression, we need to
+traverse this tree in post-order fashion and emit code for
+each node---this traversal in post-order fashion will produce
+code for a stack-machine (what the JVM is). Doing so for the
+tree above generates the instructions
\begin{lstlisting}[numbers=none]
ldc 1
@@ -83,14 +88,17 @@
\end{lstlisting}
\noindent If we ``run'' these instructions, the result $8$
-will be on top of the stack. This will be a convention we
-always observe, namely that the results of arithmetic
+will be on top of the stack (I leave this to you to verify;
+the meaning of each instruction should be clear). The result
+being on the top of the stack will be a convention we always
+observe in our compiler, that is the results of arithmetic
expressions will always be on top of the stack. Note, that a
-different bracketing, for example $(1 + (2 * 3)) + (4 - 3)$,
-produces a different abstract syntax tree and thus potentially
-also a different list of instructions. Generating code in this
-fashion is rather simple: it can be done by the following
-\textit{compile}-function:
+different bracketing of the expression, for example $(1 + (2 *
+3)) + (4 - 3)$, produces a different abstract syntax tree and
+thus potentially also a different list of instructions.
+Generating code in this fashion is rather easy to implement:
+it can be done with the following \textit{compile}-function,
+which takes the abstract syntax tree as argument:
\begin{center}
\begin{tabular}{lcl}
@@ -106,11 +114,11 @@
\end{tabular}
\end{center}
-\noindent However, our arithmetic expressions can also contain
+However, our arithmetic expressions can also contain
variables. We will represent them as \emph{local variables} in
the JVM. Essentially, local variables are an array or pointers
-to the memory containing in our case only integers. Looking up
-a variable can be done by the instruction
+to memory cells, containing in our case only integers. Looking
+up a variable can be done with the instruction
\begin{lstlisting}[mathescape,numbers=none]
iload $index$
@@ -118,7 +126,7 @@
\noindent
which places the content of the local variable $index$ onto
-thestack. Storing the top of the stack into a local variable
+the stack. Storing the top of the stack into a local variable
can be done by the instruction
\begin{lstlisting}[mathescape,numbers=none]
@@ -126,16 +134,16 @@
\end{lstlisting}
\noindent Note that this also pops off the top of the stack.
-One problem we have to overcome is that local variables are
-addressed, not by identifiers, but by numbers (starting from
-$0$). Therefore our compiler needs to maintain a kind of
-environment (similar to the interpreter) where variables are
-associated to numbers. This association needs to be unique: if
-we muddle up the numbers, then we essentially confuse
-variables and the result will usually be an erroneous result.
-Therefore our \textit{compile}-function will take two
-arguments: the abstract syntax tree and the environment, $E$,
-that maps identifiers to index-numbers.
+One problem we have to overcome, however, is that local
+variables are addressed, not by identifiers, but by numbers
+(starting from $0$). Therefore our compiler needs to maintain
+a kind of environment where variables are associated to
+numbers. This association needs to be unique: if we muddle up
+the numbers, then we essentially confuse variables and the
+consequence will usually be an erroneous result. Our extended
+\textit{compile}-function for arithmetic expressions will
+therefore take two arguments: the abstract syntax tree and the
+environment, $E$, that maps identifiers to index-numbers.
\begin{center}
\begin{tabular}{lcl}
@@ -153,9 +161,216 @@
\end{center}
\noindent In the last line we generate the code for variables
-where $E(x)$ stands for looking up to which index the variable
-$x$ maps to.
+where $E(x)$ stands for looking up the environment to which
+index the variable $x$ maps to.
+
+There is a similar \textit{compile}-function for boolean
+expressions, but it includes a ``trick'' to do with
+\pcode{if}- and \pcode{while}-statements. To explain the issue
+let us explain first the compilation of statements of the
+While-language. The clause for \pcode{skip} is trivial, since
+we do not have to generate any instruction
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
+\end{tabular}
+\end{center}
+
+\noindent Note that the \textit{compile}-function for
+statements returns a pair, a list of instructions (in this
+case the empty list) and an environment for variables. The
+reason for the environment is that assignments in the
+While-language might change the environment---clearly if a
+variable is used for the first time, we need to allocate a new
+index and if it has been used before, we need to be able to
+retrieve the associated index. This is reflected in
+the clause for compiling assignments:
+
+\begin{center}
+\begin{tabular}{lcl}
+$\text{compile}(x := a, E)$ & $\dn$ &
+$(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
+\end{tabular}
+\end{center}
+
+\noindent We first generate code for the right-hand side of
+the assignment and then add an \pcode{istore}-instruction at
+the end. By convention the result of the arithmetic expression
+$a$ will be on top of the stack. After the \pcode{istore}
+instruction, the result will be stored in the index
+corresponding to the variable $x$. If the variable $x$ has
+been used before in the program, we just need to look up what
+the index is and return the environment unchanged (that is in
+this case $E' = E$). However, if this is the first encounter
+of the variable $x$ in the program, then we have to augment
+the environment and assign $x$ with the largest index in $E$
+plus one (that is $E' = E(x \mapsto largest\_index + 1)$).
+That means for the assignment $x := x + 1$ we generate the
+following code
+
+\begin{lstlisting}[mathescape,numbers=none]
+iload $n_x$
+ldc 1
+iadd
+istore $n_x$
+\end{lstlisting}
+
+\noindent
+where $n_x$ is the index for the variable $x$.
+
+More complicated is the code for \pcode{if}-statments, say
+
+\begin{lstlisting}[mathescape,language={},numbers=none]
+if $b$ then $cs_1$ else $cs_2$
+\end{lstlisting}
+
+\noindent where $b$ is a boolean expression and the $cs_i$
+are the instructions for each \pcode{if}-branch. Lets assume
+we already generated code for $b$ and $cs_{1/2}$. Then in the
+true-case the control-flow of the program needs to be
+
+\begin{center}
+\begin{tikzpicture}[node distance=2mm and 4mm,
+ block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
+ point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
+ skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
+\node (A1) [point] {};
+\node (b) [block, right=of A1] {code of $b$};
+\node (A2) [point, right=of b] {};
+\node (cs1) [block, right=of A2] {code of $cs_1$};
+\node (A3) [point, right=of cs1] {};
+\node (cs2) [block, right=of A3] {code of $cs_2$};
+\node (A4) [point, right=of cs2] {};
+
+\draw (A1) edge [->, black, line width=1mm] (b);
+\draw (b) edge [->, black, line width=1mm] (cs1);
+\draw (cs1) edge [->, black, line width=1mm] (A3);
+\draw (A3) edge [->, black, skip loop] (A4);
+\node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};
+\end{tikzpicture}
+\end{center}
+
+\noindent where we start with running the code for $b$; since
+we are in the true case we continue with running the code for
+$cs_1$. After this however, we must not run the code for
+$cs_2$, but always jump after the last instruction of $cs_2$
+(the code for the \pcode{else}-branch). Note that this jump is
+unconditional, meaning we always have to jump to the end of
+$cs_2$. The corresponding instruction of the JVM is
+\pcode{goto}. In case $b$ turns out to be false we need the
+control-flow
+\begin{center}
+\begin{tikzpicture}[node distance=2mm and 4mm,
+ block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
+ point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
+ skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
+\node (A1) [point] {};
+\node (b) [block, right=of A1] {code of $b$};
+\node (A2) [point, right=of b] {};
+\node (cs1) [block, right=of A2] {code of $cs_1$};
+\node (A3) [point, right=of cs1] {};
+\node (cs2) [block, right=of A3] {code of $cs_2$};
+\node (A4) [point, right=of cs2] {};
+
+\draw (A1) edge [->, black, line width=1mm] (b);
+\draw (b) edge [->, black, line width=1mm] (A2);
+\draw (A2) edge [skip loop] (A3);
+\draw (A3) edge [->, black, line width=1mm] (cs2);
+\draw (cs2) edge [->,black, line width=1mm] (A4);
+\node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};
+\end{tikzpicture}
+\end{center}
+
+\noindent where we now need a conditional jump (if the
+if-condition is false) from the end of the code for the
+boolean to the beginning of the instructions $cs_2$. Once we
+are finished with running $cs_2$ we can continue with whatever
+code comes after the if-statement.
+
+The \pcode{goto} and conditional jumps need addresses to where
+the jump should go. Since we are generating assembly code for
+the JVM, we do not actually have to give addresses, but need
+to attach labels to our code. These labels specify a target
+for a jump. Therefore the labels need to be unique, as
+otherwise it would be ambiguous where a jump should go.
+A labels, say \pcode{L}, is attached to code like
+
+\begin{lstlisting}[mathescape,numbers=none]
+L:
+ $instr_1$
+ $instr_2$
+ $\vdots$
+\end{lstlisting}
+
+Recall the ``trick'' with compiling boolean expressions: the
+\textit{compile}-function for boolean expressions takes three
+arguments: an abstract syntax tree, an environment for
+variable indices and also the label, $lab$, to where an conditional
+jump needs to go. The clause for the expression $a_1 = a_2$,
+for example, is as follows:
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\
+\multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}
+\end{tabular}
+\end{center}
+
+\noindent
+We are generating code for the subexpressions $a_1$ and $a_2$.
+This will mean after running the corresponding code there will
+be two integers on top of the stack. If they are equal, we do
+not have to do anything and just continue with the next
+instructions (see control-flow of ifs above). However if they
+are \emph{not} equal, then we need to (conditionally) jump to
+the label $lab$. This can be done with the instruction
+
+\begin{lstlisting}[mathescape,numbers=none]
+if_icmpne $lab$
+\end{lstlisting}
+
+\noindent Other jump instructions for boolean operators are
+
+\begin{center}
+\begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
+$=$ & $\Rightarrow$ & \pcode{if_icmpne}\\
+$\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
+$<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
+$\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
+\end{tabular}
+\end{center}
+
+\noindent and so on. I leave it to you to extend the
+\textit{compile}-function for the other boolean expressions.
+Note that we need to jump whenever the boolean is \emph{not}
+true, which means we have to ``negate'' the jump---equals
+becomes not-equal, less becomes greater-or-equal. If you do
+not like this design (it can be the source of some nasty,
+hard-to-detect errors), you can also change the layout of the
+code and first give the code for the else-branch and then for
+the if-branch.
+
+We are now ready to give the compile function for
+if-statments--remember this function returns for staments a
+pair consisting of the code and an environment:
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\
+\multicolumn{3}{l}{$\qquad l_\textit{ifelse}\;$ (fresh label)}\\
+\multicolumn{3}{l}{$\qquad l_\textit{ifend}\;$ (fresh label)}\\
+\multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\
+\multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\
+\multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, l_\textit{ifelse})$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;l_\textit{ifend}$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifelse}:$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifend}:, E'')$}\\
+\end{tabular}
+\end{center}
\end{document}