afl-material: comparison handouts/ho07.tex


-:2fcd7c2da729
+:4980f421b3b0
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../grammar}
 \usepackage{../graphics}
 %% add safety check on references...whether it is above 0 and below the
 %% index
 \begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2017, 2018, 2019}
+\fnote{\copyright{} Christian Urban, King's College London, 2017, 2018, 2019, 2020}
 \section*{Handout 7 (Compilation)}
 The purpose of a compiler is to transform a program a human can read and
 write into code the machine can run as fast as possible. The fastest
 that is often pretty fast. This way of producing code has the advantage that
 the virtual machine takes care of things a compiler would normally need
 to take care of (like explicit memory management).
 As a first example in this module we will implement a compiler for the
-very simple While-language. It will generate code for the Java Virtual
+very simple WHILE-language that we parsed in the last lecture. The
-Machine (JVM). Unfortunately the Java ecosystem does not come with an
+compiler will target the Java Virtual Machine (JVM), but not directly.
-assembler which would be handy for our  compiler-endeavour  (unlike
+Pictorially the compiler will work as follows:
-Microsoft's  Common Language Infrastructure for the .Net platform which
-has an assembler out-of-the-box). As a substitute we use in this module
+\begin{center}
-the 3rd-party programs Jasmin and Krakatau
+\begin{tikzpicture}[scale=1,font=\bf,
+node/.style={
+rectangle,rounded corners=3mm,
+ultra thick,draw=black!50,minimum height=18mm,
+minimum width=20mm,
+top color=white,bottom color=black!20}]
+\node (0) at (-3,0) {};
+\node (A) at (0,0) [node,text width=1.6cm,text centered] {our compiler};
+\node (B) at (3.5,0) [node,text width=1.6cm,text centered] {Jasmin / Krakatau};
+\node (C) at (7.5,0) [node] {JVM};
+\draw [->,line width=2.5mm] (0) -- node [above,pos=0.35] {*.while} (A);
+\draw [->,line width=2.5mm] (A) -- node [above,pos=0.35] {*.j} (B);
+\draw [->,line width=2.5mm] (B) -- node [above,pos=0.35] {*.class} (C);
+\end{tikzpicture}
+\end{center}
+\noindent
+The input will be WHILE-programs; the output will be assembly files
+(with the ending .j). Assembly files essentially contain human-readable
+machine code, meaning they are not just bits and bytes, but rather
+something you can read and understand---with a bit of practice of
+course. An \emph{assembler} will then translate the assembly files into
+unreadable class or binary files the JVM can run. Unfortunately, the
+Java ecosystem does not come with an assembler which would be handy for
+our compiler-endeavour  (unlike Microsoft's  Common Language
+Infrastructure for the .Net platform which has an assembler
+out-of-the-box). As a substitute we shall therefore use the 3rd-party
+programs Jasmin and Krakatau
 \begin{itemize}
 \item \url{http://jasmin.sourceforge.net}
 \item \url{https://github.com/Storyyeller/Krakatau}
 \end{itemize}
 \begin{tikzpicture}
 \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
 \end{tikzpicture}
 \end{center}
-\noindent To generate JVM code for this expression, we need to
+\noindent To generate JVM code for this expression, we need to traverse
-traverse this tree in post-order fashion and emit code for
+this tree in \emph{post-order} fashion and emit code for each
-each node---this traversal in post-order fashion will produce
+node---this traversal in \emph{post-order} fashion will produce code for
-code for a stack-machine (what the JVM is). Doing so for the
+a stack-machine (which is what the JVM is). Doing so for the tree above
-tree above generates the instructions
+generates the instructions
 \begin{lstlisting}[language=JVMIS,numbers=none]
 ldc 1
 ldc 2
 ldc 3
 that a different bracketing of the expression, for example $(1 + (2 *
 3)) + (4 - 3)$, produces a different abstract syntax tree and thus also
 a different list of instructions. Generating code in this
 post-order-traversal fashion is rather easy to implement: it can be done
 with the following recursive \textit{compile}-function, which takes the
-abstract syntax tree as argument:
+abstract syntax tree as an argument:
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
 $\textit{compile}(a_1 + a_2)$ & $\dn$ &
 $\textit{compile}(a_1 \backslash a_2)$ & $\dn$ &
 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\
 \end{tabular}
 \end{center}
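 \noindent To make this concrete, here is a minimal Scala sketch of the
 \textit{compile}-function. The datatype \texttt{AExp} and its
 constructor names are assumptions made only for this illustration; the
 syntax trees produced by our parser might look different.
 \begin{lstlisting}[language=Scala,numbers=none]
 // hypothetical abstract syntax trees for arithmetic expressions
 sealed trait AExp
 case class Num(n: Int) extends AExp
 case class Aop(op: String, a1: AExp, a2: AExp) extends AExp
 // post-order traversal: first both arguments, then the operation
 def compile(a: AExp) : List[String] = a match {
   case Num(n) => List(s"ldc $n")
   case Aop("+", a1, a2) => compile(a1) ++ compile(a2) ++ List("iadd")
   case Aop("-", a1, a2) => compile(a1) ++ compile(a2) ++ List("isub")
   case Aop("*", a1, a2) => compile(a1) ++ compile(a2) ++ List("imul")
   case Aop("/", a1, a2) => compile(a1) ++ compile(a2) ++ List("idiv")
   case Aop(op, _, _) => sys.error(s"unknown operator $op")
 }
 // the tree from above: 1 + ((2 * 3) + (4 - 3))
 val ex = Aop("+", Num(1),
              Aop("+", Aop("*", Num(2), Num(3)),
                       Aop("-", Num(4), Num(3))))
 \end{lstlisting}
 \noindent With these definitions \texttt{compile(ex)} returns the list
 \texttt{ldc 1}, \texttt{ldc 2}, \texttt{ldc 3}, \texttt{imul},
 \texttt{ldc 4}, \texttt{ldc 3}, \texttt{isub}, \texttt{iadd},
 \texttt{iadd}.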
-However, our arithmetic expressions can also contain
+This is all fine, but our arithmetic expressions can clearly contain
-variables. We will represent them as \emph{local variables} in
+variables and then this code will not be good enough. To fix this we
-the JVM. Essentially, local variables are an array or pointers
+will represent our variables as the \emph{local variables} in the JVM.
-to memory cells, containing in our case only integers. Looking
+Essentially, local variables are an array of pointers to memory cells,
-up a variable can be done with the instruction
+containing in our case only integers. Looking up a variable can be done
+with the instruction
 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none]
 iload $index$
 \end{lstlisting}
 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none]
 istore $index$
 \end{lstlisting}
-\noindent Note that this also pops off the top of the stack.
+\noindent Note that this also pops off the top of the stack. One problem
-One problem we have to overcome, however, is that local
+we have to overcome, however, is that local variables are addressed, not
-variables are addressed, not by identifiers, but by numbers
+by identifiers (like \texttt{x}, \texttt{foo} and so on), but by numbers
-(starting from $0$). Therefore our compiler needs to maintain
+(starting from $0$). Therefore our compiler needs to maintain a kind of
-a kind of environment where variables are associated to
+environment where variables are associated to numbers. This association
-numbers. This association needs to be unique: if we muddle up
+needs to be unique: if we muddle up the numbers, then we essentially
-the numbers, then we essentially confuse variables and the
+confuse variables and the consequence will usually be an erroneous
-consequence will usually be an erroneous result. Our extended
+result. Our extended \textit{compile}-function for arithmetic
-\textit{compile}-function for arithmetic expressions will
+expressions will therefore take two arguments: the abstract syntax tree
-therefore take two arguments: the abstract syntax tree and an
+and an environment, $E$, that maps identifiers to index-numbers.
-environment, $E$, that maps identifiers to index-numbers.
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\
 $\textit{compile}(a_1 + a_2, E)$ & $\dn$ &
 $\textit{compile}(a_1, E) \;@\; \textit{compile}(a_2, E)\;@\; \pcode{idiv}$\\
 $\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\
 \end{tabular}
 \end{center}
-\noindent In the last line we generate the code for variables
+\noindent In the last line we generate the code for variables where
-where $E(x)$ stands for looking up the environment to which
+$E(x)$ stands for looking up in the environment which index the variable
-index the variable $x$ maps to. This is similar to an interpreter,
+$x$ maps to. This is similar to the interpreter we saw earlier in the
-which also needs an environment: the difference is that the
+module, which also needs an environment: the difference is that the
-interpreter maintains a mapping from variables to current values (what is the
+interpreter maintains a mapping from variables to current values (what
-currently the value of a variable), while compilers need a mapping
+is currently the value of a variable?), while compilers need a
-from variables to memory locations (where can I find the current
+mapping from variables to memory locations (where can I find the current
-value for the variable in memory).
+value for the variable in memory?).
 There is a similar \textit{compile}-function for boolean
 expressions, but it includes a ``trick'' to do with
 \pcode{if}- and \pcode{while}-statements. To explain the issue
 let us first describe the compilation of statements of the
-While-language. The clause for \pcode{skip} is trivial, since
+WHILE-language. The clause for \pcode{skip} is trivial, since
 we do not have to generate any instruction
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
 \noindent whereby $[]$ is the empty list of instructions. Note that
 the \textit{compile}-function for statements returns a pair, a
 list of instructions (in this case the empty list) and an
 environment for variables. The reason for the environment is
-that assignments in the While-language might change the
+that assignments in the WHILE-language might change the
 environment---clearly if a variable is used for the first
 time, we need to allocate a new index and if it has been used
 before, then we need to be able to retrieve the associated index.
 This is reflected in the clause for compiling assignments, say
 $\textit{x := a}$:
 $\textit{compile}(x := a, E)$ & $\dn$ &
 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
 \end{tabular}
 \end{center}
-\noindent We first generate code for the right-hand side of
+\noindent We first generate code for the right-hand side of the
-the assignment and then add an \pcode{istore}-instruction at
+assignment (that is the arithmetic expression $a$) and then add an
-the end. By convention the result of the arithmetic expression
+\pcode{istore}-instruction at the end. By convention running the code
-$a$ will be on top of the stack. After the \pcode{istore}
+for the arithmetic expression $a$ will leave the result on top of the
-instruction, the result will be stored in the index
+stack. After the \pcode{istore} instruction, the result will be
-corresponding to the variable $x$. If the variable $x$ has
+stored in the index corresponding to the variable $x$. If the variable
-been used before in the program, we just need to look up what
+$x$ has been used before in the program, we just need to look up what
-the index is and return the environment unchanged (that is in
+the index is and return the environment unchanged (that is in this case
-this case $E' = E$). However, if this is the first encounter
+$E' = E$). However, if this is the first encounter of the variable $x$
-of the variable $x$ in the program, then we have to augment
+in the program, then we have to augment the environment and assign $x$
-the environment and assign $x$ with the largest index in $E$
+with the largest index in $E$ plus one (that is $E' = E(x \mapsto
-plus one (that is $E' = E(x \mapsto largest\_index + 1)$).
+largest\_index + 1)$). To sum up, for the assignment $x := x + 1$ we
-To sum up, for the assignment $x := x + 1$ we generate the
+generate the following code
-following code
 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none]
 iload $n_x$
 ldc 1
 iadd
 $index \;=\; E\textit{.getOrElse}(x, |E|)$
 \end{tabular}
 \end{center}
 \noindent
-In case the environment $E$ contains an index for $x$, we return it.
+This implements the idea that in case the environment $E$ contains an
-Otherwise we ``create'' a new index by returning the size $|E|$ of the
+index for $x$, we return it. Otherwise we ``create'' a new index by
-environment (that will be an index that is guaranteed to be not used
+returning the size $|E|$ of the environment (that will be an index that
-yet). In all this we take advantage of the JVM which provides us with
+is guaranteed not to be used yet). In all this we take advantage of the
-a potentially limitless supply of places where we can store values
+JVM which provides us with a potentially limitless supply of places
-of variables.
+where we can store values of variables.
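 \noindent The handling of environments can be sketched in Scala as
 follows. The datatypes and function names are again assumptions made for
 this illustration. Using the size of the environment as a fresh index
 only works because indices are handed out consecutively starting from
 $0$.
 \begin{lstlisting}[language=Scala,numbers=none]
 // hypothetical syntax trees, reduced to what this sketch needs
 sealed trait AExp
 case class Num(n: Int) extends AExp
 case class Var(x: String) extends AExp
 case class Add(a1: AExp, a2: AExp) extends AExp
 case class Assign(x: String, a: AExp)
 // environments map identifiers to local-variable indices
 type Env = Map[String, Int]
 def compileAExp(a: AExp, env: Env) : List[String] = a match {
   case Num(n) => List(s"ldc $n")
   case Var(x) => List(s"iload ${env(x)}") // fails if x is unknown
   case Add(a1, a2) =>
     compileAExp(a1, env) ++ compileAExp(a2, env) ++ List("iadd")
 }
 def compileAssign(s: Assign, env: Env) : (List[String], Env) = {
   // the old index if x is known, otherwise a fresh one
   val index = env.getOrElse(s.x, env.size)
   (compileAExp(s.a, env) ++ List(s"istore $index"),
    env + (s.x -> index))
 }
 \end{lstlisting}
 \noindent For example, compiling \texttt{x := x + 1} in the environment
 \texttt{Map("x" -> 0)} yields the instructions \texttt{iload 0},
 \texttt{ldc 1}, \texttt{iadd}, \texttt{istore 0} and leaves the
 environment unchanged.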
 A bit more complicated is the generation of code for
 \pcode{if}-statements, say
 \begin{lstlisting}[mathescape,language={},numbers=none]
 if $b$ then $cs_1$ else $cs_2$
 \end{lstlisting}
 \noindent where $b$ is a boolean expression and where both $cs_{1/2}$
-are the statements for each of the \pcode{if}-branches. Lets assume we
+are the statements for each of the \pcode{if}-branches. Let us assume we
-already generated code for $b$ and $cs_{1/2}$. Then in the true-case the
+already generated code for $b$ and the two if-branches $cs_{1/2}$.
-control-flow of the program needs to behave as
+Then in the true-case the control-flow of the program needs to behave as
-\begin{center}
-\begin{tikzpicture}[node distance=2mm and 4mm,
+\begin{center}
-block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
+\begin{tikzpicture}[node distance=2mm and 4mm,line cap=round,
+block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm,
+top color=white,bottom color=black!20},
 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
 \node (A1) [point] {};
 \node (b) [block, right=of A1] {code of $b$};
 \node (A2) [point, right=of b] {};
 \node (cs2) [block, right=of A3] {code of $cs_2$};
 \node (A4) [point, right=of cs2] {};
 \draw (A1) edge [->, black, line width=1mm] (b);
 \draw (b) edge [->, black, line width=1mm] (cs1);
-\draw (cs1) edge [->, black, line width=1mm] (A3);
+\draw (cs1) edge [->, black, line width=1mm,shorten >= -0.5mm] (A3);
 \draw (A3) edge [->, black, skip loop] (A4);
 \node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};
 \end{tikzpicture}
 \end{center}
 \noindent where we start with running the code for $b$; since
 we are in the true case we continue with running the code for
 $cs_1$. After this however, we must not run the code for
-$cs_2$, but always jump after the last instruction of $cs_2$
+$cs_2$, but always jump to after the last instruction of $cs_2$
 (the code for the \pcode{else}-branch). Note that this jump is
 unconditional, meaning we always have to jump to the end of
 $cs_2$. The corresponding instruction of the JVM is
 \pcode{goto}. In case $b$ turns out to be false we need the
 control-flow
 \begin{center}
-\begin{tikzpicture}[node distance=2mm and 4mm,
+\begin{tikzpicture}[node distance=2mm and 4mm,line cap=round,
-block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
+block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm,
+top color=white,bottom color=black!20},
 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
 \node (A1) [point] {};
 \node (b) [block, right=of A1] {code of $b$};
 \node (A2) [point, right=of b] {};
 \node (A3) [point, right=of cs1] {};
 \node (cs2) [block, right=of A3] {code of $cs_2$};
 \node (A4) [point, right=of cs2] {};
 \draw (A1) edge [->, black, line width=1mm] (b);
-\draw (b) edge [->, black, line width=1mm] (A2);
+\draw (b) edge [->, black, line width=1mm,shorten >= -0.5mm] (A2);
 \draw (A2) edge [skip loop] (A3);
 \draw (A3) edge [->, black, line width=1mm] (cs2);
 \draw (cs2) edge [->,black, line width=1mm] (A4);
 \node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};
 \end{tikzpicture}
 $instr_1$
 $instr_2$
 $\vdots$
 \end{lstlisting}
-\noindent where a label is indicated by a colon. The task of the
+\noindent where the label needs to be followed by a colon. The task of
-assmbler (in our case Jasmin or Krakatau) is to resolve the labels
+the assembler (in our case Jasmin or Krakatau) is to resolve the labels
-to actual addresses, for example jump 10 instructions forward,
+to actual (numeric) addresses, for example jump 10 instructions forward,
 or 20 instructions backwards.
 Recall the ``trick'' with compiling boolean expressions: the
 \textit{compile}-function for boolean expressions takes three
 arguments: an abstract syntax tree, an environment for
 \begin{lstlisting}[mathescape,numbers=none,language=JVMIS]
 if_icmpne $lab$
 \end{lstlisting}
-\noindent Other jump instructions for boolean operators are
+To sum up, the third argument in the compile function for booleans
+specifies where to jump, in case the condition is \emph{not} true. I
+leave it to you to extend the \textit{compile}-function for the other
+boolean expressions. Note that we need to jump whenever the boolean is
+\emph{not} true, which means we have to ``negate'' the jump
+condition---equals becomes not-equal, less becomes greater-or-equal.
+Other jump instructions for boolean operators are
 \begin{center}
 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
 \end{tabular}
 \end{center}
-\noindent and so on. I leave it to you to extend the
+\noindent and so on.   If you do not like this design (it can be the
-\textit{compile}-function for the other boolean expressions. Note that
-we need to jump whenever the boolean is \emph{not} true, which means we
-have to ``negate'' the jump condition---equals becomes not-equal, less
-becomes greater-or-equal. If you do not like this design (it can be the
 source of some nasty, hard-to-detect errors), you can also change the
 layout of the code and first give the code for the else-branch and then
 for the if-branch. However in the case of while-loops this
 ``upside-down-inside-out'' way of generating code still seems the most
 convenient.
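 \noindent The ``negated'' jumps and the layout of the code for
 \pcode{if}-statements can be sketched in Scala as follows. This is only
 an illustration under assumptions: the datatypes, the label names and
 the counter \texttt{n} for generating fresh labels are all made up, and
 the code of the two branches is taken as already generated.
 \begin{lstlisting}[language=Scala,numbers=none]
 // hypothetical syntax trees, reduced to what this sketch needs
 sealed trait AExp
 case class Num(n: Int) extends AExp
 case class Var(x: String) extends AExp
 case class Bop(op: String, a1: AExp, a2: AExp)
 type Env = Map[String, Int]
 def compileAExp(a: AExp, env: Env) : List[String] = a match {
   case Num(n) => List(s"ldc $n")
   case Var(x) => List(s"iload ${env(x)}")
 }
 // jump to the label lab whenever the condition is NOT true
 def compileBExp(b: Bop, env: Env, lab: String) : List[String] = {
   val jump = b.op match {
     case "==" => "if_icmpne"
     case "!=" => "if_icmpeq"
     case "<"  => "if_icmpge"
     case "<=" => "if_icmpgt"
   }
   compileAExp(b.a1, env) ++ compileAExp(b.a2, env) ++
   List(s"$jump $lab")
 }
 // code of b, code of cs1, goto, else-label, code of cs2, end-label
 def compileIf(b: Bop, cs1: List[String], cs2: List[String],
               env: Env, n: Int) : List[String] = {
   val (ifElse, ifEnd) = (s"If_else_$n", s"If_end_$n")
   compileBExp(b, env, ifElse) ++ cs1 ++
   List(s"goto $ifEnd", s"$ifElse:") ++ cs2 ++ List(s"$ifEnd:")
 }
 \end{lstlisting}
 \noindent Note that the unconditional \pcode{goto} sits at the end of
 the code for the if-branch, while the conditional jump generated for $b$
 targets the label in front of the else-branch.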
 \pcode{while} $b$ \pcode{do} $cs$, is very similar. In case
 the condition is true and we need to do another iteration,
 and the control-flow needs to be as follows
 \begin{center}
-\begin{tikzpicture}[node distance=2mm and 4mm,
+\begin{tikzpicture}[node distance=2mm and 4mm,line cap=round,
-block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
+block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm,
+top color=white,bottom color=black!20},
 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
 \node (A0) [point, left=of A1] {};
 \node (A1) [point] {};
 \node (b) [block, right=of A1] {code of $b$};
 \node (A3) [point, right=of cs1] {};
 \node (A4) [point, right=of A3] {};
 \draw (A0) edge [->, black, line width=1mm] (b);
 \draw (b) edge [->, black, line width=1mm] (cs1);
-\draw (cs1) edge [->, black, line width=1mm] (A3);
+\draw (cs1) edge [->, black, line width=1mm,shorten >= -0.5mm] (A3);
 \draw (A3) edge [->,skip loop] (A1);
 \end{tikzpicture}
 \end{center}
 \noindent Whereas if the condition is \emph{not} true, we
 need to jump out of the loop, which gives the following
 control flow.
 \begin{center}
-\begin{tikzpicture}[node distance=2mm and 4mm,
+\begin{tikzpicture}[node distance=2mm and 4mm,line cap=round,
-block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
+block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm,
+top color=white,bottom color=black!20},
 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
 skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
 \node (A0) [point, left=of A1] {};
 \node (A1) [point] {};
 \node (b) [block, right=of A1] {code of $b$};
 \node (cs1) [block, right=of A2] {code of $cs$};
 \node (A3) [point, right=of cs1] {};
 \node (A4) [point, right=of A3] {};
 \draw (A0) edge [->, black, line width=1mm] (b);
-\draw (b) edge [->, black, line width=1mm] (A2);
+\draw (b) edge [->, black, line width=1mm,shorten >= -0.5mm] (A2);
 \draw (A2) edge [skip loop] (A3);
 \draw (A3) edge [->, black, line width=1mm] (A4);
 \end{tikzpicture}
 \end{center}
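 \noindent Both control flows can be obtained from a single layout of the
 code, which the following Scala sketch illustrates. The function name,
 the label names and the counter \texttt{n} are assumptions for this
 illustration. The code for the condition is passed in as a function
 that, given a label, produces instructions jumping to this label in case
 the condition is \emph{not} true.
 \begin{lstlisting}[language=Scala,numbers=none]
 // cond: code of b, jumping to the given label if b is NOT true
 // body: the already generated code of cs
 // n:    a counter used for generating fresh label names
 def compileWhile(cond: String => List[String],
                  body: List[String], n: Int) : List[String] = {
   val (begin, end) = (s"Loop_begin_$n", s"Loop_end_$n")
   List(s"$begin:") ++ cond(end) ++ body ++
   List(s"goto $begin", s"$end:")
 }
 // while x < 10 do x := x + 1, with x stored at index 0
 val loop = compileWhile(
   lab => List("iload 0", "ldc 10", s"if_icmpge $lab"),
   List("iload 0", "ldc 1", "iadd", "istore 0"), 0)
 \end{lstlisting}
 \noindent The unconditional \pcode{goto} at the end of the body jumps
 back to the beginning of the loop, where the condition is tested again.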
 \draw[->,very thick] (LC) edge [->,to path={-- ++(10mm,0mm)
 -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (LD.east);
 \end{tikzpicture}
 \noindent
-I leave it to you to read the code and follow its controlflow.
+As said, I leave it to you to decide whether the code implements
+the usual control flow of while-loops.
-Next we need to consider the statement \pcode{write x}, which
-can be used to print out the content of a variable. For this
+Next we need to consider the statement \pcode{write x}, which can be
-we need to use a Java library function. In order to avoid
+used to print out the content of a variable. For this we shall use a
-having to generate a lot of code for each
+Java library function. In order to avoid having to generate a lot of
-\pcode{write}-command, we use a separate helper-method and
+code for each \pcode{write}-command, we use a separate helper-method and
-just call this method with an argument (which needs to be
+just call this method with an appropriate argument (which of course
-placed onto the stack). The code of the helper-method is as
+needs to be placed onto the stack). The code of the helper-method is as
 follows.
 \begin{lstlisting}[language=JVMIS,numbers=left]
 .method public static write(I)V
 .limit locals 1
 .limit stack 2
 called \pcode{write}. It takes a single integer argument
 indicated by the \pcode{(I)} and returns no result, indicated
 by the \pcode{V}. Since the method has only one argument, we
 only need a single local variable (Line~2) and a stack with
 two cells will be sufficient (Line 3). Line 4 instructs the
-JVM to get the value of the field \pcode{out} of the class
+JVM to get the value of the field \pcode{out} of the class
 \pcode{java/lang/System}. It expects the value to be of type
 \pcode{java/io/PrintStream}. A reference to this value will be
 placed on the stack. Line~5 copies the integer we want to
 print out onto the stack. In the next line we call the method
 \pcode{println} (from the class \pcode{java/io/PrintStream}).
 top of the stack and then call \pcode{write}. The \pcode{XXX}
 need to be replaced by an appropriate class name (this will be
 explained shortly).
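 \noindent In Scala the code generation for \pcode{write x} might look as
 follows. This is a sketch under assumptions: the function name is made
 up, the argument \texttt{className} stands for the name that replaces
 the placeholder \pcode{XXX}, and the call is assumed to have the shape
 \texttt{invokestatic XXX/XXX/write(I)V}.
 \begin{lstlisting}[language=Scala,numbers=none]
 // env maps variables to indices; className replaces XXX
 def compileWrite(x: String, env: Map[String, Int],
                  className: String) : List[String] =
   List(s"iload ${env(x)}",  // put the value of x onto the stack
        s"invokestatic $className/$className/write(I)V")
 \end{lstlisting}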
-\begin{figure}[t]
+By generating code for a WHILE-program, we end up with a list
-\begin{lstlisting}[mathescape,language=JVMIS,numbers=left]
-.class public XXX.XXX
-.super java/lang/Object
-.method public <init>()V
-aload_0
-invokenonvirtual java/lang/Object/<init>()V
-return
-.end method
-.method public static main([Ljava/lang/String;)V
-.limit locals 200
-.limit stack 200
-$\textit{\ldots{}here comes the compiled code\ldots}$
-return
-.end method
-\end{lstlisting}
-\caption{Boilerplate code needed for running generated code.\label{boiler}}
-\end{figure}
-By generating code for a While-program, we end up with a list
 of (JVM assembly) instructions. Unfortunately, there is a bit
 more boilerplate code needed before these instructions can be
 run. The complete code is shown in Figure~\ref{boiler}. This
 boilerplate code is very specific to the JVM. If we target any
 other virtual machine or a machine language, then we would
 is called \emph{before} the \pcode{main}-method in Lines 10 to
 17. Interesting are the Lines 11 and 12 where we hardwire that
 the stack of our programs will never be larger than 200 and
 that the maximum number of variables is also 200. These seem to
 be conservative default values that allow us to run some
-simple While-programs. In a real compiler, we would of course
+simple WHILE-programs. In a real compiler, we would of course
 need to work harder and find out appropriate values for the
 stack and local variables.
+\begin{figure}[t]
+\begin{lstlisting}[mathescape,language=JVMIS,numbers=left]
+.class public XXX.XXX
+.super java/lang/Object
+.method public static main([Ljava/lang/String;)V
+.limit locals 200
+.limit stack 200
+$\textit{\ldots{}here comes the compiled code\ldots}$
+return
+.end method
+\end{lstlisting}
+\caption{The boilerplate code needed for running generated code.\label{boiler}}
+\end{figure}
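 \noindent Wrapping the generated instructions into this boilerplate is a
 matter of simple string manipulation, as the following Scala sketch
 shows. The function name is an assumption, and the sketch covers only
 the lines shown in the figure (the \pcode{write} helper-method would
 need to be added in the same way).
 \begin{lstlisting}[language=Scala,numbers=none]
 // className replaces the placeholder XXX of the figure
 def wrap(instrs: List[String], className: String) : String = {
   val header = List(
     s".class public $className.$className",
     ".super java/lang/Object",
     ".method public static main([Ljava/lang/String;)V",
     ".limit locals 200",
     ".limit stack 200")
   val footer = List("return", ".end method")
   (header ++ instrs ++ footer).mkString("\n")
 }
 \end{lstlisting}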
 To sum up, in Figure~\ref{test} is the complete code generated
 for the slightly nonsensical program
 \begin{lstlisting}[mathescape,language=While]
 bytecode is then understood by the JVM and can be run by just invoking
 the \pcode{java}-program.
 \begin{figure}[p]
-\lstinputlisting[language=JVMIS]{../progs/test-small.j}
+\lstinputlisting[language=JVMIS,mathescape]{../progs/test-small.j}
-\caption{Generated code for a test program. This code can be
+\begin{tikzpicture}[remember picture,overlay]
-processed by an Java assembler producing a class-file, which
+\draw[|<->|,very thick] (LA.north) -- (LB.south)
-can be run by the {\tt{}java}-program.\label{test}}
+node[left=0mm,midway] {\footnotesize\texttt{x\,:=\,1\,+\,2}};
+\draw[|<->|,very thick] (LC.north) -- (LD.south)
+node[left=0mm,midway] {\footnotesize\texttt{write x}};
+\end{tikzpicture}
+\caption{The generated code for the test program \texttt{x := 1 + 2; write
+x}. This code can be processed by a Java assembler producing a
+class-file, which can then be run by the {\tt{}java}-program.\label{test}}
 \end{figure}
 \subsection*{Arrays}
-Maybe a useful addition to the While-language would be arrays. This
+Maybe a useful addition to the WHILE-language would be arrays. This
-would let us generate more interesting While-programs by translating
+would allow us to generate more interesting WHILE-programs by
-BF*** programs into equivalent While-code. So in this section lets have
+translating BF*** programs into equivalent WHILE-code. Therefore in this
-a look at how we can support the following three constructions
+section let us have a look at how we can support the following three
+constructions
 \begin{lstlisting}[mathescape,language=While]
-new arr[15000]
+new(arr[15000])
 x := 3 + arr[3 + y]
 arr[42 * n] := ...
 \end{lstlisting}
 \noindent
-The first construct is for creating new arrays, in this instance the
+The first construct is for creating new arrays. In this instance the
-name of the array is \pcode{arr} and it can hold 15000 integers. The
+name of the array is \pcode{arr} and it can hold 15000 integers. We do
-second is for referencing an array cell inside an arithmetic
+not support ``dynamic'' arrays, that is the size of our arrays will
-expression---we need to be able to look up the contents of an array at
+always be fixed. The second construct is for referencing an array cell
-an index determined by an arithmetic expression. Similarly in the line
+inside an arithmetic expression---we need to be able to look up the
-below, we need to be able to update the content of an array at an
+contents of an array at an index determined by an arithmetic expression.
-calculated index.
+Similarly in the line below, we need to be able to update the content of
+an array at a calculated index.
 For creating a new array we can generate the following three JVM
 instructions:
 \begin{lstlisting}[mathescape,language=JVMIS]
 newarray int
 astore loc_var
 \end{lstlisting}
 \noindent
-First we need to put the dimension of the array onto the stack. The next
+First we need to put the size of the array onto the stack. The next
-instruction creates the array. With the last we can store the array as a
+instruction creates the array. In this case the array contains
+\texttt{int}s. With the last instruction we can store the array as a
 local variable (like the ``simple'' variables from the previous
 section). The use of a local variable for each array allows us to have
-multiple arrays in a While-program. For looking up an element in an
+multiple arrays in a WHILE-program. For looking up an element in an
 array we can use the following JVM code
 \begin{lstlisting}[mathescape,language=JVMIS]
 aload loc_var
 index_aexp
 iaload
 \end{lstlisting}
 \noindent
-The first instruction loads the ``pointer'' to the array onto the stack.
+The first instruction loads the ``pointer'', or local variable, to the
-Then we have some instructions corresponding to the index where we want
+array onto the stack. Then we have some instructions calculating the
-to look up the array. The idea is that these instructions will leave a
+index where we want to look up the array. The idea is that these
-concrete number on the stack, which will be the index into the array we
+instructions will leave a concrete number on the top of the stack, which
-need. Finally we need to tell the JVM to load the corresponding element
+will be the index into the array we need. Finally we need to tell the
-onto the stack. Updating an array at an index with a value is as
+JVM to load the corresponding element onto the stack. Updating an array
-follows.
+at an index with a value is as follows.
 \begin{lstlisting}[mathescape,language=JVMIS]
 aload loc_var
 index_aexp
 value_aexp
 iastore
 \end{lstlisting}
 \noindent
-Again the first instruction loads the ``pointer'' to the array onto the
+Again the first instruction loads the local variable of
-stack. Then we have some instructions corresponding to the index where
+the array onto the stack. Then we have some instructions calculating
-we want to update the array. After that come the instructions for with
+the index where we want to update the array. After that come the
-what value we want to update the array. The last line contains the
+instructions for the value with which we want to update the array. The last
-instruction for updating the array.
+line contains the instruction for updating the array.
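 \noindent The three code patterns for arrays can be generated along the
 following lines. This Scala sketch rests on assumptions: the function
 names are made up, the size is put onto the stack with \pcode{ldc}, and
 the instructions for the index and the value are taken as already
 generated.
 \begin{lstlisting}[language=Scala,numbers=none]
 // environments map array names to local-variable indices
 type Env = Map[String, Int]
 def compileNew(arr: String, size: Int,
                env: Env) : (List[String], Env) = {
   val loc = env.getOrElse(arr, env.size)
   (List(s"ldc $size", "newarray int", s"astore $loc"),
    env + (arr -> loc))
 }
 // index: the generated code for the index expression
 def compileLookup(arr: String, index: List[String],
                   env: Env) : List[String] =
   List(s"aload ${env(arr)}") ++ index ++ List("iaload")
 // value: the generated code for the right-hand side
 def compileUpdate(arr: String, index: List[String],
                   value: List[String], env: Env) : List[String] =
   List(s"aload ${env(arr)}") ++ index ++ value ++ List("iastore")
 \end{lstlisting}
 \noindent Since arrays and ``simple'' variables share the same
 environment in this sketch, they also share the supply of indices.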
-Next we need to modify our grammar rules for our While-language: it
+Next we need to modify our grammar rules for our WHILE-language: it
 seems best to extend the rule for factors in arithmetic expressions with
 a rule for looking up an array.
 \begin{plstx}[rhs style=, margin=3cm]
 : \meta{E} ::= \meta{T} $+$ \meta{E}
 by an identifier and the brackets. There are two new rules for statements,
 one for creating an array and one for array assignment:
 \begin{plstx}[rhs style=, margin=2cm, one per line]
 : \meta{Stmt} ::=  \ldots
-| \texttt{new}\; \meta{Id}\,[\,\meta{Num}\,]
+| \texttt{new}(\meta{Id}\,[\,\meta{Num}\,])
 | \meta{Id}\,[\,\meta{E}\,]\,:=\,\meta{E}\\
 \end{plstx}
-With this in place we can turn back to the idea of creating While
+With this in place we can turn back to the idea of creating
-programs by translating BF programs. This is a relatively easy task
+WHILE-programs by translating BF programs. This is a relatively easy
-because BF only has eight instructions (we will actually only have seven
+task because BF has only eight instructions (we will actually implement
-because we can omit the read-in instruction from BF). But also translating
+seven because we can omit the read-in instruction from BF). What makes
-BF-loops is going to be easy since they straightforwardly can be
+this translation easy is that BF-loops can be straightforwardly
-represented by While-loops. The Scala code for the translation is
+represented as while-loops. The Scala code for the translation is as
-as follows:
+follows:
 \begin{lstlisting}[language=Scala,numbers=left]
 def instr(c: Char) : String = c match {
 case '>' => "ptr := ptr + 1;"
 case '<' => "ptr := ptr - 1;"
-case '+' => "field[ptr] := field[ptr] + 1;"
+case '+' => "mem[ptr] := mem[ptr] + 1;"
-case '-' => "field[ptr] := field[ptr] - 1;"
+case '-' => "mem[ptr] := mem[ptr] - 1;"
-case '.' => "x := field[ptr]; write x;"
+case '.' => "x := mem[ptr]; write x;"
-case '['  => "while (field[ptr] != 0) do {"
+case '['  => "while (mem[ptr] != 0) do {"
 case ']'  => "skip};"
 case _ => ""
 }
 \end{lstlisting}
 \noindent
 The idea behind the translation is that BF-programs operate on an array,
-called \texttt{field}. The BP-memory pointer into this array is
+here called \texttt{mem}. The BF-memory pointer into this array is
-represented as the variable \texttt{ptr}. The BF-instructions \code{>}
+represented as the variable \texttt{ptr}. As usual the BF-instructions
-and \code{<} increase, respectively decrease, \texttt{ptr}. The
+\code{>} and \code{<} increase, respectively decrease, \texttt{ptr}. The
-instructions \code{+} and \code{-} update a cell in \texttt{field}. In
+instructions \code{+} and \code{-} update a cell in \texttt{mem}. In
-Line 6 we need to first assign a field-cell to an auxiliary variable
+Line 6 we need to first assign a \texttt{mem}-cell to an auxiliary variable
 since we have not changed our write functions in order to cope with
 writing out any array-content directly. Lines 7 and 8 are for
 translating BF-loops. Line 8 is interesting in the sense that we need to
 generate a \code{skip} instruction just before finishing with the
 closing \code{"\}"}. The reason is that we are rather pedantic about
-semicolons in our While-grammar: the last command cannot have a
+semicolons in our WHILE-grammar: the last command cannot have a
-semicolon---adding a \code{skip} works around this snag. Putting
+semicolon---adding a \code{skip} works around this snag. Putting all
-all this together is we can generate While-programs with more than
+this together, we can generate WHILE-programs with more than 400
-400 instructions and then run the compiled JVM code for such programs.
+instructions and then run the compiled JVM code for such programs.
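 \noindent Translating a whole BF-program then amounts to translating
 each character and gluing the results together, as in the following
 Scala sketch. The function name \texttt{translate}, the array size of
 30000 and the initial value of \texttt{ptr} are assumptions made for
 this illustration.
 \begin{lstlisting}[language=Scala,numbers=none]
 def instr(c: Char) : String = c match {
   case '>' => "ptr := ptr + 1;"
   case '<' => "ptr := ptr - 1;"
   case '+' => "mem[ptr] := mem[ptr] + 1;"
   case '-' => "mem[ptr] := mem[ptr] - 1;"
   case '.' => "x := mem[ptr]; write x;"
   case '['  => "while (mem[ptr] != 0) do {"
   case ']'  => "skip};"
   case _ => ""
 }
 // create the array, set the pointer, then translate each character;
 // the final skip is needed since the last command cannot have a
 // semicolon
 def translate(prog: String) : String =
   "new(mem[30000]); ptr := 15000;" +
   prog.toList.map(instr).mkString + "skip"
 \end{lstlisting}
 \noindent For example the BF-program \texttt{[-]} is translated into a
 while-loop whose body decrements the current cell and then ends in a
 \code{skip}.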
+\bigskip
+\noindent
+Hooooray, we can finally run the BF-mandelbrot program on the JVM and it
+completes within 20 seconds (after nearly 10 minutes of parsing the
+corresponding WHILE-program and generating 270K of a class file). Try
+replicating the 20 secs with an interpreter! OK, we now face the
+nagging question about what to do next\ldots
+\subsection*{Added Value}
+As you have probably seen, the compiler writer has a lot of freedom
+about how to generate code from what the programmer wrote as a program.
+The only condition is that the generated code should behave as expected by
+the programmer. Then all is fine\ldots mission accomplished! But
+sometimes the compiler writer is expected to go an extra mile, or even
+miles. Suppose we are given the following WHILE-program:
+\begin{lstlisting}[mathescape,language=While]
+new(arr[10]);
+arr[14] := 3 + arr[13]
+\end{lstlisting}
+\noindent
+While admittedly this is a contrived program, and probably not meant to
+be like this by any sane programmer, it is supposed to make the
+following point: We generate an array of size 10, and then try to access
+the non-existing element at index 13 and even update the element with
+index 14. Obviously this is baloney. However, our compiler generates
+code for this program without any questions asked. We can even run this
+code on the JVM\ldots of course the result is an exception trace where
+the JVM yells at us for doing naughty things. (This is much better than
+C, for example, where such errors are not prevented and as a result
+insidious attacks can be mounted against such kinds of C-programs. I assume
+everyone has heard about \emph{Buffer Overflow Attacks}.)
+Imagine we do not want to rely in our compiler on the JVM for producing
+an annoying, but safe exception trace; rather we want to handle such
+situations ourselves. Let us assume we want to handle them in the
+following way: if the programmer accesses a field out-of-bounds, we just
+return the default 0, and if a programmer wants to update an
+out-of-bounds field, we quietly ignore this update.
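 \noindent One possible way of generating such ``safe'' code for looking
 up an array element is sketched below in Scala. This is only an
 illustration of the idea and not necessarily the intended solution: the
 function name, the label names and the counter \texttt{n} are made up.
 The comments show the contents of the stack after each instruction,
 with the top of the stack on the right.
 \begin{lstlisting}[language=Scala,numbers=none]
 // loc:   the local variable holding the array
 // index: the generated code for the index expression
 // n:     a counter used for generating fresh label names
 def compileSafeLookup(loc: Int, index: List[String],
                       n: Int) : List[String] = {
   val (out, end) = (s"Out_of_bounds_$n", s"Lookup_end_$n")
   index ++ List(
     "dup",              // i, i
     s"iflt $out",       // i         (jumps if i < 0)
     "dup",              // i, i
     s"aload $loc",      // i, i, array
     "arraylength",      // i, i, length
     s"if_icmpge $out",  // i         (jumps if i >= length)
     s"aload $loc",      // i, array
     "swap",             // array, i
     "iaload",           // element
     s"goto $end",
     s"$out:",
     "pop",              // discard the offending index
     "ldc 0",            // the default value 0
     s"$end:")
 }
 \end{lstlisting}
 \noindent In both cases exactly one integer is left on top of the stack,
 so the surrounding code does not need to know whether the index was
 within bounds. Updates can be guarded by similar tests.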
 \end{document}
 %%% Local Variables:
 %%% mode: latex

changeset 708	4980f421b3b0
parent 705	bfc8703b1527
child 709	c112a6cb5e52