diff -r bf36664a3196 -r af65ffff9cdd handouts/ho07.tex --- a/handouts/ho07.tex Tue Nov 17 04:53:14 2015 +0000 +++ b/handouts/ho07.tex Tue Nov 17 13:31:19 2015 +0000 @@ -10,19 +10,19 @@ \section*{Handout 7 (Compilation)} The purpose of a compiler is to transform a program, a human -can write, into code the machine can run as fast as possible. -The fastest code would be machine code the CPU can run -directly, but it is often enough to improve the speed of a +can read and write, into code the machine can run as fast as +possible. The fastest code would be machine code the CPU can +run directly, but it is often enough to improve the speed of a program by just targeting a virtual machine. This produces not the fastest possible code, but code that is fast enough and -has the advantage that the virtual machine care of things a -compiler would normally need to take care of (like explicit -memory management). +has the advantage that the virtual machine takes care of +things a compiler would normally need to take care of (like +explicit memory management). -As an example we will implement a compiler for the very simple -While-language. We will be generating code for the Java -Virtual Machine. This is a stack-based virtual machine, a fact -which will make it easy to generate code for arithmetic +As a first example we will implement a compiler for the very +simple While-language. It will generate code for the Java +Virtual Machine (JVM). This is a stack-based virtual machine, +a fact which will make it easy to generate code for arithmetic expressions. For example for generating code for the expression $1 + 2$ we need to generate the following three instructions @@ -35,13 +35,13 @@ \noindent The first instruction loads the constant $1$ onto the stack, the next one $2$, the third instruction adds both -numbers together replacing the top elements of the stack with -the result $3$. For simplicity, we will throughout consider -only integer numbers and results. Therefore we can use the -instructions \code{iadd}, \code{isub}, \code{imul}, -\code{idiv} and so on. The \code{i} stands for integer -instructions in the JVM (alternatives are \code{d} for doubles, -\code{l} for longs and \code{f} for floats). +numbers together replacing the top two elements of the stack +with the result $3$. For simplicity, we will throughout +consider only integer numbers and results. Therefore we can +use the JVM instructions \code{iadd}, \code{isub}, +\code{imul}, \code{idiv} and so on. The \code{i} stands for +integer instructions in the JVM (alternatives are \code{d} for +doubles, \code{l} for longs and \code{f} for floats). Recall our grammar for arithmetic expressions ($E$ is the starting symbol): @@ -58,12 +58,12 @@ | \meta{Num}\\ \end{plstx} -\noindent where \meta{Id} stands for variables and -\meta{Num} for numbers. For the moment let us omit variables from +\noindent where \meta{Id} stands for variables and \meta{Num} +for numbers. For the moment let us omit variables from arithmetic expressions. Our parser will take this grammar and -produce abstract syntax trees. For -example for the expression $1 + ((2 * 3) + (4 - 3))$ it will -produce the following tree. +given an input produce abstract syntax trees. For example for +the expression $1 + ((2 * 3) + (4 - 3))$ it will produce the +following tree. \begin{center} \begin{tikzpicture} \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]] \end{tikzpicture} @@ -97,8 +97,9 @@ 3)) + (4 - 3)$, produces a different abstract syntax tree and thus potentially also a different list of instructions. Generating code in this fashion is rather easy to implement: -it can be done with the following \textit{compile}-function, -which takes the abstract syntax tree as argument: +it can be done with the following recursive +\textit{compile}-function, which takes the abstract syntax +tree as argument: \begin{center} \begin{tabular}{lcl} @@ -167,7 +168,7 @@ There is a similar \textit{compile}-function for boolean expressions, but it includes a ``trick'' to do with \pcode{if}- and \pcode{while}-statements. To explain the issue -let us explain first the compilation of statements of the +let us first describe the compilation of statements of the While-language. The clause for \pcode{skip} is trivial, since we do not have to generate any instruction @@ -177,19 +178,19 @@ \end{tabular} \end{center} -\noindent Note that the \textit{compile}-function for -statements returns a pair, a list of instructions (in this -case the empty list) and an environment for variables. The -reason for the environment is that assignments in the -While-language might change the environment---clearly if a -variable is used for the first time, we need to allocate a new -index and if it has been used before, we need to be able to -retrieve the associated index. This is reflected in -the clause for compiling assignments: +\noindent $[]$ is the empty list of instructions. Note that +the \textit{compile}-function for statements returns a pair, a +list of instructions (in this case the empty list) and an +environment for variables. The reason for the environment is +that assignments in the While-language might change the +environment---clearly if a variable is used for the first +time, we need to allocate a new index and if it has been used +before, we need to be able to retrieve the associated index. +This is reflected in the clause for compiling assignments: \begin{center} \begin{tabular}{lcl} -$\text{compile}(x := a, E)$ & $\dn$ & +$\textit{compile}(x := a, E)$ & $\dn$ & $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$ \end{tabular} \end{center} @@ -225,8 +226,8 @@ if $b$ then $cs_1$ else $cs_2$ \end{lstlisting} -\noindent where $b$ is a boolean expression and the $cs_i$ -are the instructions for each \pcode{if}-branch. Lets assume +\noindent where $b$ is a boolean expression and the $cs_{1/2}$ +are the statements for each \pcode{if}-branch. Lets assume we already generated code for $b$ and $cs_{1/2}$. Then in the true-case the control-flow of the program needs to be @@ -289,13 +290,14 @@ are finished with running $cs_2$ we can continue with whatever code comes after the if-statement. -The \pcode{goto} and conditional jumps need addresses to where -the jump should go. Since we are generating assembly code for -the JVM, we do not actually have to give addresses, but need -to attach labels to our code. These labels specify a target -for a jump. Therefore the labels need to be unique, as -otherwise it would be ambiguous where a jump should go. -A labels, say \pcode{L}, is attached to code like +The \pcode{goto} and the conditional jumps need addresses to +where the jump should go. Since we are generating assembly +code for the JVM, we do not actually have to give (numeric) +addresses, but can just attach (symbolic) labels to our code. +These labels specify a target for a jump. Therefore the labels +need to be unique, as otherwise it would be ambiguous where a +jump should go to. A label, say \pcode{L}, is attached to code +like \begin{lstlisting}[mathescape,numbers=none] L: @@ -304,6 +306,8 @@ $\vdots$ \end{lstlisting} +\noindent where a label is indicated by a colon. + Recall the ``trick'' with compiling boolean expressions: the \textit{compile}-function for boolean expressions takes three arguments: an abstract syntax tree, an environment for @@ -318,14 +322,15 @@ \end{tabular} \end{center} -\noindent -We are generating code for the subexpressions $a_1$ and $a_2$. -This will mean after running the corresponding code there will -be two integers on top of the stack. If they are equal, we do -not have to do anything and just continue with the next -instructions (see control-flow of ifs above). However if they -are \emph{not} equal, then we need to (conditionally) jump to -the label $lab$. This can be done with the instruction +\noindent where we are first generating code for the +subexpressions $a_1$ and $a_2$. This will mean after running +the corresponding code there will be two integers on top of +the stack. If they are equal, we do not have to do anything +(except for popping them off from the stack) and just continue +with the next instructions (see control-flow of ifs above). +However if they are \emph{not} equal, then we need to +(conditionally) jump to the label $lab$. This can be done with +the instruction \begin{lstlisting}[mathescape,numbers=none] if_icmpne $lab$ @@ -335,7 +340,6 @@ \begin{center} \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l} -$=$ & $\Rightarrow$ & \pcode{if_icmpne}\\ $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\ $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\ $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\ @@ -345,15 +349,17 @@ \noindent and so on. I leave it to you to extend the \textit{compile}-function for the other boolean expressions. Note that we need to jump whenever the boolean is \emph{not} -true, which means we have to ``negate'' the jump---equals -becomes not-equal, less becomes greater-or-equal. If you do -not like this design (it can be the source of some nasty, -hard-to-detect errors), you can also change the layout of the -code and first give the code for the else-branch and then for -the if-branch. +true, which means we have to ``negate'' the jump +condition---equals becomes not-equal, less becomes +greater-or-equal. If you do not like this design (it can be +the source of some nasty, hard-to-detect errors), you can also +change the layout of the code and first give the code for the +else-branch and then for the if-branch. However in the case +of while-loops this way of generating code still seems +the most convenient. We are now ready to give the compile function for -if-statments--remember this function returns for staments a +if-statments---remember this function returns for staments a pair consisting of the code and an environment: \begin{center}