afl-material: comparison handouts/ho07.tex

-:0b1a92b305cf
+:d6af4b1239de
 \documentclass{article}
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../grammar}
-\usepackage{tikz-qtree}
+\usepackage{../graphics}
 \begin{document}
-\section*{Handout 7 (Compilation of the WHILE-Language)}
+\section*{Handout 7 (Compilation)}
 The purpose of a compiler is to transform a program a human
 can write into code the machine can run as fast as possible.
 The fastest code would be machine code the CPU can run
 directly, but it is often enough to improve the speed of a
 the fastest possible code, but code that is fast enough and
 has the advantage that the virtual machine takes care of things a
 compiler would normally need to take care of (like explicit
 memory management).
-We will be generating code for the Java Virtual Machine. This
+As an example we will implement a compiler for the very simple
-is a stack-based virtual machine which will make it easy to
+While-language. We will be generating code for the Java
-generate code for arithmetic expressions. For example for
+Virtual Machine. This is a stack-based virtual machine, a fact
-generating code for the expression $1 + 2$ we need to issue
+which will make it easy to generate code for arithmetic
-the following three instructions
+expressions. For example, to compile the
+expression $1 + 2$ we need to generate the following three
+instructions
 \begin{lstlisting}[numbers=none]
 ldc 1
 ldc 2
 iadd
 \end{lstlisting}
-\noindent The first instruction loads the constant $1$ on the
+\noindent The first instruction loads the constant $1$ onto
-stack, the next one $2$, the third instruction add both
+the stack, the next one $2$, the third instruction adds both
 numbers together, replacing the top two elements of the stack with
-the result $3$. We will throughout consider only integer
+the result $3$. For simplicity, we will throughout consider
-numbers and results, therefore we can use the instructions
+only integer numbers and results. Therefore we can use the
-\code{iadd}, \code{isub}, \code{imul}, \code{idiv} and so on.
+instructions \code{iadd}, \code{isub}, \code{imul},
-The \code{i} stands for integer instructions (alternatives are
+\code{idiv} and so on. The \code{i} stands for integer
-\code{d} for doubles, \code{l} for longs and \code{f} for
+instructions in the JVM (alternatives are \code{d} for doubles,
-floats).
+\code{l} for longs and \code{f} for floats).
-Recall our grammar for arithmetic expressions:
+Recall our grammar for arithmetic expressions ($E$ is the
+starting symbol):
 \begin{plstx}[rhs style=, margin=3cm]
 : \meta{E} ::= \meta{T} $+$ \meta{E}
 | \meta{T} $-$ \meta{E}
 | \meta{Num}\\
 \end{plstx}
 \noindent where \meta{Id} stands for variables and
-\meta{Num} for number. For the moment let us omit variables from
+\meta{Num} for numbers. For the moment let us omit variables from
 arithmetic expressions. Our parser will take this grammar and
-produce abstract syntax trees, for
+produce abstract syntax trees. For
 example, for the expression $1 + ((2 * 3) + (4 - 3))$ it will
 produce the following tree.
 \begin{center}
 \begin{tikzpicture}
 \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
 \end{tikzpicture}
 \end{center}
-\noindent
+\noindent To generate code for this expression, we need to
-To generate code for this expression, we need to traverse this
+traverse this tree in post-order fashion and emit code for
-tree in post-order fashion---this will produce code for a
+each node. This post-order traversal will produce
-stack-machine, like the JVM. Doing so gives the instructions
+code for a stack machine (which the JVM is). Doing so for the
+tree above generates the instructions
 \begin{lstlisting}[numbers=none]
 ldc 1
 ldc 2
 ldc 3
 imul
 ldc 4
 ldc 3
 isub
 iadd
 iadd
 \end{lstlisting}
 \noindent If we ``run'' these instructions, the result $8$
-will be on top of the stack. This will be a convention we
+will be on top of the stack (I leave this to you to verify;
-always observe, namely that the results of arithmetic
+the meaning of each instruction should be clear). The result
+being on the top of the stack will be a convention we always
+observe in our compiler, that is the results of arithmetic
 expressions will always be on top of the stack. Note that a
-different bracketing, for example $(1 + (2 * 3)) + (4 - 3)$,
+different bracketing of the expression, for example $(1 + (2 *
-produces a different abstract syntax tree and thus potentially
+3)) + (4 - 3)$, produces a different abstract syntax tree and
-also a different list of instructions. Generating code in this
+thus potentially also a different list of instructions.
-fashion is rather simple: it can be done by the following
+Generating code in this fashion is rather easy to implement:
-\textit{compile}-function:
+it can be done with the following \textit{compile}-function,
+which takes the abstract syntax tree as argument:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
 $\textit{compile}(a_1 + a_2)$ & $\dn$ &
 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{iadd}$\\
 $\textit{compile}(a_1 \backslash a_2)$ & $\dn$ &
 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\
 \end{tabular}
 \end{center}
-\noindent However, our arithmetic expressions can also contain
+However, our arithmetic expressions can also contain
 variables. We will represent them as \emph{local variables} in
 the JVM. Essentially, local variables are an array of pointers
-to the memory containing in our case only integers. Looking up
+to memory cells, containing in our case only integers. Looking
-a variable can be done by the instruction
+up a variable can be done with the instruction
 \begin{lstlisting}[mathescape,numbers=none]
 iload $index$
 \end{lstlisting}
 \noindent
 which places the content of the local variable $index$ onto
-thestack. Storing the top of the stack into a local variable
+the stack. Storing the top of the stack into a local variable
 can be done by the instruction
 \begin{lstlisting}[mathescape,numbers=none]
 istore $index$
 \end{lstlisting}
 \noindent Note that this also pops off the top of the stack.
-One problem we have to overcome is that local variables are
+One problem we have to overcome, however, is that local
-addressed, not by identifiers, but by numbers (starting from
+variables are addressed, not by identifiers, but by numbers
-$0$). Therefore our compiler needs to maintain a kind of
+(starting from $0$). Therefore our compiler needs to maintain
-environment (similar to the interpreter) where variables are
+a kind of environment where variables are associated with
-associated to numbers. This association needs to be unique: if
+numbers. This association needs to be unique: if we muddle up
-we muddle up the numbers, then we essentially confuse
+the numbers, then we essentially confuse variables and the
-variables and the result will usually be an erroneous result.
+consequence will usually be an erroneous result. Our extended
-Therefore our \textit{compile}-function will take two
+\textit{compile}-function for arithmetic expressions will
-arguments: the abstract syntax tree and the environment, $E$,
+therefore take two arguments: the abstract syntax tree and the
-that maps identifiers to index-numbers.
+environment, $E$, that maps identifiers to index-numbers.
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\
 $\textit{compile}(a_1 + a_2, E)$ & $\dn$ &
 $\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{iadd}$\\
 $\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\
 \end{tabular}
 \end{center}
 \noindent In the last line we generate the code for variables
-where $E(x)$ stands for looking up to which index the variable
+where $E(x)$ stands for looking up in the environment the
-$x$ maps to.
+index to which the variable $x$ maps.
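\noindent The extended function with an environment might look roughly like this in Python. As an assumption for this sketch only, the environment $E$ is a dictionary from variable names to indices, numbers are plain integers, variables are strings and operations are tuples \texttt{(op, a1, a2)}.

```python
ARITH_OPS = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}

def compile_aexp(a, env):
    """Compile an arithmetic expression; env maps variable names to indices."""
    if isinstance(a, int):            # number
        return [f"ldc {a}"]
    if isinstance(a, str):            # variable: look up its index E(x)
        return [f"iload {env[a]}"]
    op, a1, a2 = a                    # binary operation
    return compile_aexp(a1, env) + compile_aexp(a2, env) + [ARITH_OPS[op]]

# Hypothetical environment: x lives at index 0, y at index 1.
env = {"x": 0, "y": 1}
code = compile_aexp(("+", "x", ("*", "y", 2)), env)
```

In this sketch a variable that is missing from the environment raises a \texttt{KeyError}; how a real compiler should report such a case is left open here.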
+There is a similar \textit{compile}-function for boolean
+expressions, but it includes a ``trick'' to do with
+\pcode{if}- and \pcode{while}-statements. To explain the issue,
+let us first look at the compilation of statements of the
+While-language. The clause for \pcode{skip} is trivial, since
+we do not have to generate any instructions:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
+\end{tabular}
+\end{center}
+\noindent Note that the \textit{compile}-function for
+statements returns a pair, a list of instructions (in this
+case the empty list) and an environment for variables. The
+reason for the environment is that assignments in the
+While-language might change the environment---clearly if a
+variable is used for the first time, we need to allocate a new
+index and if it has been used before, we need to be able to
+retrieve the associated index. This is reflected in
+the clause for compiling assignments:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{compile}(x := a, E)$ & $\dn$ &
+$(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
+\end{tabular}
+\end{center}
+\noindent We first generate code for the right-hand side of
+the assignment and then add an \pcode{istore}-instruction at
+the end. By convention the result of the arithmetic expression
+$a$ will be on top of the stack. After the \pcode{istore}
+instruction, the result will be stored in the local variable
+at the index corresponding to $x$. If the variable $x$ has
+been used before in the program, we just need to look up what
+the index is and return the environment unchanged (that is, in
+this case $E' = E$). However, if this is the first encounter
+of the variable $x$ in the program, then we have to augment
+the environment and assign to $x$ the largest index in $E$
+plus one (that is, $E' = E(x \mapsto largest\_index + 1)$).
+That means for the assignment $x := x + 1$ we generate the
+following code
+\begin{lstlisting}[mathescape,numbers=none]
+iload $n_x$
+ldc 1
+iadd
+istore $n_x$
+\end{lstlisting}
+\noindent
+where $n_x$ is the index for the variable $x$.
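\noindent The clause for assignments could be sketched in Python as below. The dictionary-based environment, the tuple encoding of expressions and the helper name \texttt{compile\_assign} are assumptions for illustration; the choice of index $0$ for the very first variable follows the remark that indices start from $0$.

```python
ARITH_OPS = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}

def compile_aexp(a, env):
    if isinstance(a, int):
        return [f"ldc {a}"]
    if isinstance(a, str):
        return [f"iload {env[a]}"]
    op, a1, a2 = a
    return compile_aexp(a1, env) + compile_aexp(a2, env) + [ARITH_OPS[op]]

def compile_assign(x, a, env):
    """Return (instructions, new environment) for the assignment x := a."""
    if x in env:
        new_env = env                                   # E' = E
    else:
        next_index = max(env.values(), default=-1) + 1  # largest index plus one
        new_env = {**env, x: next_index}                # E' = E(x -> next_index)
    # The right-hand side is compiled in the old environment E.
    return compile_aexp(a, env) + [f"istore {new_env[x]}"], new_env

# x := x + 1, where x is already known at index 0
code1, env1 = compile_assign("x", ("+", "x", 1), {"x": 0})
# y := 5, where y is seen for the first time
code2, env2 = compile_assign("y", 5, env1)
```

The first call leaves the environment unchanged, while the second one allocates a new index for \texttt{y}.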
+More complicated is the code for \pcode{if}-statements, say
+\begin{lstlisting}[mathescape,language={},numbers=none]
+if $b$ then $cs_1$ else $cs_2$
+\end{lstlisting}
+\noindent where $b$ is a boolean expression and the $cs_i$
+are the instructions for each \pcode{if}-branch. Let us assume
+we have already generated code for $b$ and $cs_{1/2}$. Then in the
+true-case the control-flow of the program needs to be
+\begin{center}
+\begin{tikzpicture}[node distance=2mm and 4mm,
+block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
+point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
+skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
+\node (A1) [point] {};
+\node (b) [block, right=of A1] {code of $b$};
+\node (A2) [point, right=of b] {};
+\node (cs1) [block, right=of A2] {code of $cs_1$};
+\node (A3) [point, right=of cs1] {};
+\node (cs2) [block, right=of A3] {code of $cs_2$};
+\node (A4) [point, right=of cs2] {};
+\draw (A1) edge [->, black, line width=1mm] (b);
+\draw (b) edge [->, black, line width=1mm] (cs1);
+\draw (cs1) edge [->, black, line width=1mm] (A3);
+\draw (A3) edge [->, black, skip loop] (A4);
+\node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};
+\end{tikzpicture}
+\end{center}
+\noindent where we start with running the code for $b$; since
+we are in the true case we continue with running the code for
+$cs_1$. After this, however, we must not run the code for
+$cs_2$, but always jump to just after the last instruction of $cs_2$
+(the code for the \pcode{else}-branch). Note that this jump is
+unconditional, meaning we always have to jump to the end of
+$cs_2$. The corresponding instruction of the JVM is
+\pcode{goto}. In case $b$ turns out to be false we need the
+control-flow
+\begin{center}
+\begin{tikzpicture}[node distance=2mm and 4mm,
+block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
+point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
+skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
+\node (A1) [point] {};
+\node (b) [block, right=of A1] {code of $b$};
+\node (A2) [point, right=of b] {};
+\node (cs1) [block, right=of A2] {code of $cs_1$};
+\node (A3) [point, right=of cs1] {};
+\node (cs2) [block, right=of A3] {code of $cs_2$};
+\node (A4) [point, right=of cs2] {};
+\draw (A1) edge [->, black, line width=1mm] (b);
+\draw (b) edge [->, black, line width=1mm] (A2);
+\draw (A2) edge [skip loop] (A3);
+\draw (A3) edge [->, black, line width=1mm] (cs2);
+\draw (cs2) edge [->,black, line width=1mm] (A4);
+\node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};
+\end{tikzpicture}
+\end{center}
+\noindent where we now need a conditional jump (if the
+if-condition is false) from the end of the code for the
+boolean to the beginning of the instructions $cs_2$. Once we
+are finished with running $cs_2$ we can continue with whatever
+code comes after the if-statement.
+The \pcode{goto} and conditional jumps need the address to which
+the jump should go. Since we are generating assembly code for
+the JVM, we do not actually have to give addresses, but need
+to attach labels to our code. These labels specify a target
+for a jump. Therefore the labels need to be unique, as
+otherwise it would be ambiguous where a jump should go.
+A label, say \pcode{L}, is attached to code like
+\begin{lstlisting}[mathescape,numbers=none]
+L:
+$instr_1$
+$instr_2$
+$\vdots$
+\end{lstlisting}
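\noindent One simple way to obtain unique labels is to append a running counter to a prefix. The following Python sketch shows this idea; the function name \texttt{fresh} and the naming scheme of the labels are assumptions for illustration, not something fixed by the handout or by the JVM.

```python
import itertools

# A single global counter guarantees that no two labels coincide.
_label_counter = itertools.count()

def fresh(prefix):
    """Return a label that has not been handed out before."""
    return f"{prefix}_{next(_label_counter)}"

l1 = fresh("If_else")
l2 = fresh("If_else")   # same prefix, but a different label
```

Because every call advances the counter, two labels are different even when they are requested with the same prefix.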
+Recall the ``trick'' with compiling boolean expressions: the
+\textit{compile}-function for boolean expressions takes three
+arguments: an abstract syntax tree, an environment for
+variable indices and also the label, $lab$, to which a conditional
+jump needs to go. The clause for the expression $a_1 = a_2$,
+for example, is as follows:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\
+\multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}
+\end{tabular}
+\end{center}
+\noindent
+We are generating code for the subexpressions $a_1$ and $a_2$.
+This will mean after running the corresponding code there will
+be two integers on top of the stack. If they are equal, we do
+not have to do anything and just continue with the next
+instructions (see control-flow of ifs above). However if they
+are \emph{not} equal, then we need to (conditionally) jump to
+the label $lab$. This can be done with the instruction
+\begin{lstlisting}[mathescape,numbers=none]
+if_icmpne $lab$
+\end{lstlisting}
+\noindent Other jump instructions for boolean operators are
+\begin{center}
+\begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
+$=$ & $\Rightarrow$ & \pcode{if_icmpne}\\
+$\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
+$<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
+$\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
+\end{tabular}
+\end{center}
+\noindent and so on. I leave it to you to extend the
+\textit{compile}-function for the other boolean expressions.
+Note that we need to jump whenever the boolean is \emph{not}
+true, which means we have to ``negate'' the jump---equals
+becomes not-equal, less becomes greater-or-equal. If you do
+not like this design (it can be the source of some nasty,
+hard-to-detect errors), you can also change the layout of the
+code and first give the code for the else-branch and then for
+the if-branch.
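\noindent The ``negated'' jumps from the table above can be put into a lookup table, as in the following Python sketch. The tuple encoding of boolean expressions \texttt{(op, a1, a2)}, the spelling \texttt{"!="} and \texttt{"<="} for the operators and the dictionary-based environment are assumptions made for this sketch.

```python
ARITH_OPS = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}

# Each comparison maps to the jump that fires when the comparison is false.
NEGATED_JUMPS = {
    "=": "if_icmpne",
    "!=": "if_icmpeq",
    "<": "if_icmpge",
    "<=": "if_icmpgt",
}

def compile_aexp(a, env):
    if isinstance(a, int):
        return [f"ldc {a}"]
    if isinstance(a, str):
        return [f"iload {env[a]}"]
    op, a1, a2 = a
    return compile_aexp(a1, env) + compile_aexp(a2, env) + [ARITH_OPS[op]]

def compile_bexp(b, env, lab):
    """Push both operands, then jump to lab if the comparison does not hold."""
    op, a1, a2 = b
    return (compile_aexp(a1, env)
            + compile_aexp(a2, env)
            + [f"{NEGATED_JUMPS[op]} {lab}"])

# x < 10, jumping to the (hypothetical) label L_else when it is false
code = compile_bexp(("<", "x", 10), {"x": 0}, "L_else")
```

Keeping the negation in one table, rather than scattered over several clauses, is one way to reduce the risk of the hard-to-detect errors mentioned above.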
+We are now ready to give the compile function for
+if-statements; remember that for statements this function returns a
+pair consisting of the code and an environment:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\
+\multicolumn{3}{l}{$\qquad l_\textit{ifelse}\;$ (fresh label)}\\
+\multicolumn{3}{l}{$\qquad l_\textit{ifend}\;$ (fresh label)}\\
+\multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\
+\multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\
+\multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, l_\textit{ifelse})$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;l_\textit{ifend}$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifelse}:$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\
+\multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifend}:, E'')$}\\
+\end{tabular}
+\end{center}
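\noindent Putting the pieces together, the clause for \pcode{if}-statements might be sketched in Python as follows. All names (\texttt{compile\_stmt}, \texttt{compile\_block}, \texttt{fresh}), the tuple encoding of statements and the representation of the branches as lists of statements are assumptions for this sketch only.

```python
import itertools

ARITH_OPS = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}
NEGATED_JUMPS = {"=": "if_icmpne", "!=": "if_icmpeq",
                 "<": "if_icmpge", "<=": "if_icmpgt"}
_label_counter = itertools.count()

def fresh(prefix):
    return f"{prefix}_{next(_label_counter)}"

def compile_aexp(a, env):
    if isinstance(a, int):
        return [f"ldc {a}"]
    if isinstance(a, str):
        return [f"iload {env[a]}"]
    op, a1, a2 = a
    return compile_aexp(a1, env) + compile_aexp(a2, env) + [ARITH_OPS[op]]

def compile_bexp(b, env, lab):
    op, a1, a2 = b
    return compile_aexp(a1, env) + compile_aexp(a2, env) + [f"{NEGATED_JUMPS[op]} {lab}"]

def compile_stmt(s, env):
    """Return (instructions, environment) for a single statement."""
    if s[0] == "skip":
        return [], env
    if s[0] == "assign":
        _, x, a = s
        new_env = env if x in env else {**env, x: max(env.values(), default=-1) + 1}
        return compile_aexp(a, env) + [f"istore {new_env[x]}"], new_env
    if s[0] == "if":
        _, b, cs1, cs2 = s
        l_ifelse, l_ifend = fresh("If_else"), fresh("If_end")
        is1, env1 = compile_block(cs1, env)    # (is_1, E')
        is2, env2 = compile_block(cs2, env1)   # (is_2, E'')
        code = (compile_bexp(b, env, l_ifelse)
                + is1
                + [f"goto {l_ifend}", f"{l_ifelse}:"]
                + is2
                + [f"{l_ifend}:"])
        return code, env2
    raise ValueError(f"unknown statement: {s!r}")

def compile_block(stmts, env):
    """Compile a list of statements, threading the environment through."""
    code = []
    for s in stmts:
        instrs, env = compile_stmt(s, env)
        code = code + instrs
    return code, env

# if x = 0 then y := 1 else y := 2, with x already at index 0
stmt = ("if", ("=", "x", 0), [("assign", "y", 1)], [("assign", "y", 2)])
code, env = compile_stmt(stmt, {"x": 0})
```

In this example the variable \texttt{y} receives its index while compiling the first branch, and the second branch then reuses that index because it is compiled in the environment returned by the first.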
 \end{document}
 %%% Local Variables:
 %%% mode: latex

changeset 372	d6af4b1239de
parent 370	a65767fe5d71
child 373	b018234c9126