handouts/ho07.tex
changeset 372 d6af4b1239de
parent 370 a65767fe5d71
child 373 b018234c9126
equal deleted inserted replaced
371:0b1a92b305cf 372:d6af4b1239de
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{../style}
     2 \usepackage{../style}
     3 \usepackage{../langs}
     3 \usepackage{../langs}
     4 \usepackage{../grammar}
     4 \usepackage{../grammar}
     5 \usepackage{tikz-qtree}
     5 \usepackage{../graphics}
       
     6 
     6 
     7 
     7 \begin{document}
     8 \begin{document}
     8 
     9 
     9 \section*{Handout 7 (Compilation of the WHILE-Language)}
    10 \section*{Handout 7 (Compilation)}
    10 
    11 
    11 The purpose of a compiler is to transform a program, a human
    12 The purpose of a compiler is to transform a program, a human
    12 can write, into code the machine can run as fast as possible.
    13 can write, into code the machine can run as fast as possible.
    13 The fastest code would be machine code the CPU can run
    14 The fastest code would be machine code the CPU can run
    14 directly, but it is often enough to improve the speed of a
    15 directly, but it is often enough to improve the speed of a
    16 the fastest possible code, but code that is fast enough and
    17 the fastest possible code, but code that is fast enough and
    17 has the advantage that the virtual machine care of things a
    18 has the advantage that the virtual machine care of things a
    18 compiler would normally need to take care of (like explicit
    19 compiler would normally need to take care of (like explicit
    19 memory management).
    20 memory management).
    20 
    21 
    21 We will be generating code for the Java Virtual Machine. This
    22 As an example we will implement a compiler for the very simple
    22 is a stack-based virtual machine which will make it easy to
    23 While-language. We will be generating code for the Java
    23 generate code for arithmetic expressions. For example for 
    24 Virtual Machine. This is a stack-based virtual machine, a fact
    24 generating code for the expression $1 + 2$ we need to issue
    25 which will make it easy to generate code for arithmetic
    25 the following three instructions
    26 expressions. For example for generating code for the
       
    27 expression $1 + 2$ we need to generate the following three
       
    28 instructions
    26 
    29 
    27 \begin{lstlisting}[numbers=none]
    30 \begin{lstlisting}[numbers=none]
    28 ldc 1
    31 ldc 1
    29 ldc 2
    32 ldc 2
    30 iadd 
    33 iadd 
    31 \end{lstlisting}
    34 \end{lstlisting}
    32 
    35 
    33 \noindent The first instruction loads the constant $1$ on the 
    36 \noindent The first instruction loads the constant $1$ onto
    34 stack, the next one $2$, the third instruction add both 
    37 the stack, the next one $2$, the third instruction adds both
    35 numbers together replacing the top elements of the stack with 
    38 numbers together replacing the top elements of the stack with
    36 the result $3$. We will throughout consider only integer 
    39 the result $3$. For simplicity, we will throughout consider
    37 numbers and results, therefore we can use the instructions
    40 only integer numbers and results. Therefore we can use the
    38 \code{iadd}, \code{isub}, \code{imul}, \code{idiv} and so on.
    41 instructions \code{iadd}, \code{isub}, \code{imul},
    39 The \code{i} stands for integer instructions (alternatives are
    42 \code{idiv} and so on. The \code{i} stands for integer
    40 \code{d} for doubles, \code{l} for longs and \code{f} for 
    43 instructions in the JVM (alternatives are \code{d} for doubles,
    41 floats).
    44 \code{l} for longs and \code{f} for floats).
    42 
    45 
    43 Recall our grammar for arithmetic expressions:
    46 Recall our grammar for arithmetic expressions ($E$ is the
       
    47 starting symbol):
    44 
    48 
    45 
    49 
    46 \begin{plstx}[rhs style=, margin=3cm]
    50 \begin{plstx}[rhs style=, margin=3cm]
    47 : \meta{E} ::= \meta{T} $+$ \meta{E}
    51 : \meta{E} ::= \meta{T} $+$ \meta{E}
    48          | \meta{T} $-$ \meta{E}
    52          | \meta{T} $-$ \meta{E}
    55           | \meta{Num}\\
    59           | \meta{Num}\\
    56 \end{plstx}
    60 \end{plstx}
    57 
    61 
    58 
    62 
    59 \noindent where \meta{Id} stands for variables and
    63 \noindent where \meta{Id} stands for variables and
    60 \meta{Num} for number. For the moment let us omit variables from
    64 \meta{Num} for numbers. For the moment let us omit variables from
    61 arithmetic expressions. Our parser will take this grammar and
    65 arithmetic expressions. Our parser will take this grammar and
    62 produce abstract syntax trees, for
    66 produce abstract syntax trees. For
    63 example for the expression $1 + ((2 * 3) + (4 - 3))$ it will
    67 example for the expression $1 + ((2 * 3) + (4 - 3))$ it will
    64 produce the following tree.
    68 produce the following tree.
    65 
    69 
    66 \begin{center}
    70 \begin{center}
    67 \begin{tikzpicture}
    71 \begin{tikzpicture}
    68 \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
    72 \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
    69 \end{tikzpicture}
    73 \end{tikzpicture}
    70 \end{center}
    74 \end{center}
    71 
    75 
    72 \noindent 
    76 \noindent To generate code for this expression, we need to
    73 To generate code for this expression, we need to traverse this
    77 traverse this tree in post-order fashion and emit code for
    74 tree in post-order fashion---this will produce code for a 
    78 each node---this traversal in post-order fashion will produce
    75 stack-machine, like the JVM. Doing so gives the instructions
    79 code for a stack-machine (what the JVM is). Doing so for the
       
    80 tree above generates the instructions
    76 
    81 
    77 \begin{lstlisting}[numbers=none]
    82 \begin{lstlisting}[numbers=none]
    78 ldc 1 
    83 ldc 1 
    79 ldc 2 
    84 ldc 2 
    80 ldc 3 
    85 ldc 3 
    85 iadd 
    90 iadd 
    86 iadd
    91 iadd
    87 \end{lstlisting}
    92 \end{lstlisting}
    88 
    93 
    89 \noindent If we ``run'' these instructions, the result $8$
    94 \noindent If we ``run'' these instructions, the result $8$
    90 will be on top of the stack. This will be a convention we
    95 will be on top of the stack (I leave this to you to verify;
    91 always observe, namely that the results of arithmetic
    96 the meaning of each instruction should be clear). The result
       
    97 being on the top of the stack will be a convention we always
       
    98 observe in our compiler, that is the results of arithmetic
    92 expressions will always be on top of the stack. Note, that a
    99 expressions will always be on top of the stack. Note, that a
    93 different bracketing, for example $(1 + (2 * 3)) + (4 - 3)$,
   100 different bracketing of the expression, for example $(1 + (2 *
    94 produces a different abstract syntax tree and thus potentially
   101 3)) + (4 - 3)$, produces a different abstract syntax tree and
    95 also a different list of instructions. Generating code in this
   102 thus potentially also a different list of instructions.
    96 fashion is rather simple: it can be done by the following
   103 Generating code in this fashion is rather easy to implement:
    97 \textit{compile}-function:
   104 it can be done with the following \textit{compile}-function,
       
   105 which takes the abstract syntax tree as argument:
    98 
   106 
    99 \begin{center}
   107 \begin{center}
   100 \begin{tabular}{lcl}
   108 \begin{tabular}{lcl}
   101 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
   109 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
   102 $\textit{compile}(a_1 + a_2)$ & $\dn$ &
   110 $\textit{compile}(a_1 + a_2)$ & $\dn$ &
   108 $\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & 
   116 $\textit{compile}(a_1 \backslash a_2)$ & $\dn$ & 
   109 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\
   117 $\textit{compile}(a_1) \;@\; \textit{compile}(a_2)\;@\; \pcode{idiv}$\\
   110 \end{tabular}
   118 \end{tabular}
   111 \end{center}
   119 \end{center}
   112 
   120 
   113 \noindent However, our arithmetic expressions can also contain
   121 However, our arithmetic expressions can also contain
   114 variables. We will represent them as \emph{local variables} in
   122 variables. We will represent them as \emph{local variables} in
   115 the JVM. Essentially, local variables are an array or pointers
   123 the JVM. Essentially, local variables are an array or pointers
   116 to the memory containing in our case only integers. Looking up
   124 to memory cells, containing in our case only integers. Looking
   117 a variable can be done by the instruction
   125 up a variable can be done with the instruction
   118 
   126 
   119 \begin{lstlisting}[mathescape,numbers=none]
   127 \begin{lstlisting}[mathescape,numbers=none]
   120 iload $index$
   128 iload $index$
   121 \end{lstlisting}
   129 \end{lstlisting}
   122 
   130 
   123 \noindent 
   131 \noindent 
   124 which places the content of the local variable $index$ onto 
   132 which places the content of the local variable $index$ onto 
   125 thestack. Storing the top of the stack into a local variable 
   133 the stack. Storing the top of the stack into a local variable 
   126 can be done by the instruction
   134 can be done by the instruction
   127 
   135 
   128 \begin{lstlisting}[mathescape,numbers=none]
   136 \begin{lstlisting}[mathescape,numbers=none]
   129 istore $index$
   137 istore $index$
   130 \end{lstlisting}
   138 \end{lstlisting}
   131 
   139 
   132 \noindent Note that this also pops off the top of the stack.
   140 \noindent Note that this also pops off the top of the stack.
   133 One problem we have to overcome is that local variables are
   141 One problem we have to overcome, however, is that local
   134 addressed, not by identifiers, but by numbers (starting from
   142 variables are addressed, not by identifiers, but by numbers
   135 $0$). Therefore our compiler needs to maintain a kind of
   143 (starting from $0$). Therefore our compiler needs to maintain
   136 environment (similar to the interpreter) where variables are
   144 a kind of environment where variables are associated to
   137 associated to numbers. This association needs to be unique: if
   145 numbers. This association needs to be unique: if we muddle up
   138 we muddle up the numbers, then we essentially confuse
   146 the numbers, then we essentially confuse variables and the
   139 variables and the result will usually be an erroneous result.
   147 consequence will usually be an erroneous result. Our extended
   140 Therefore our \textit{compile}-function will take two
   148 \textit{compile}-function for arithmetic expressions will
   141 arguments: the abstract syntax tree and the environment, $E$,
   149 therefore take two arguments: the abstract syntax tree and the
   142 that maps identifiers to index-numbers.
   150 environment, $E$, that maps identifiers to index-numbers.
   143 
   151 
   144 \begin{center}
   152 \begin{center}
   145 \begin{tabular}{lcl}
   153 \begin{tabular}{lcl}
   146 $\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\
   154 $\textit{compile}(n, E)$ & $\dn$ & $\pcode{ldc}\;n$\\
   147 $\textit{compile}(a_1 + a_2, E)$ & $\dn$ & 
   155 $\textit{compile}(a_1 + a_2, E)$ & $\dn$ & 
   155 $\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\
   163 $\textit{compile}(x, E)$ & $\dn$ & $\pcode{iload}\;E(x)$\\
   156 \end{tabular}
   164 \end{tabular}
   157 \end{center}
   165 \end{center}
   158 
   166 
   159 \noindent In the last line we generate the code for variables
   167 \noindent In the last line we generate the code for variables
   160 where $E(x)$ stands for looking up to which index the variable
   168 where $E(x)$ stands for looking up the environment to which
   161 $x$ maps to.
   169 index the variable $x$ maps to.
   162 
   170 
       
   171 There is a similar \textit{compile}-function for boolean
       
   172 expressions, but it includes a ``trick'' to do with
       
   173 \pcode{if}- and \pcode{while}-statements. To explain the issue
       
   174 let us explain first the compilation of statements of the
       
   175 While-language. The clause for \pcode{skip} is trivial, since
       
   176 we do not have to generate any instruction
       
   177 
       
   178 \begin{center}
       
   179 \begin{tabular}{lcl}
       
   180 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
       
   181 \end{tabular}
       
   182 \end{center}
       
   183 
       
   184 \noindent Note that the \textit{compile}-function for
       
   185 statements returns a pair, a list of instructions (in this
       
   186 case the empty list) and an environment for variables. The
       
   187 reason for the environment is that assignments in the
       
   188 While-language might change the environment---clearly if a
       
   189 variable is used for the first time, we need to allocate a new
       
   190 index and if it has been used before, we need to be able to
       
   191 retrieve the associated index. This is reflected in
       
   192 the clause for compiling assignments:
       
   193 
       
   194 \begin{center}
       
   195 \begin{tabular}{lcl}
       
   196 $\text{compile}(x := a, E)$ & $\dn$ & 
       
   197 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
       
   198 \end{tabular}
       
   199 \end{center}
       
   200 
       
   201 \noindent We first generate code for the right-hand side of
       
   202 the assignment and then add an \pcode{istore}-instruction at
       
   203 the end. By convention the result of the arithmetic expression
       
   204 $a$ will be on top of the stack. After the \pcode{istore}
       
   205 instruction, the result will be stored in the index
       
   206 corresponding to the variable $x$. If the variable $x$ has
       
   207 been used before in the program, we just need to look up what
       
   208 the index is and return the environment unchanged (that is in
       
   209 this case $E' = E$). However, if this is the first encounter 
       
   210 of the variable $x$ in the program, then we have to augment 
       
   211 the environment and assign $x$ with the largest index in $E$
       
   212 plus one (that is $E' = E(x \mapsto largest\_index + 1)$). 
       
   213 That means for the assignment $x := x + 1$ we generate the
       
   214 following code
       
   215 
       
   216 \begin{lstlisting}[mathescape,numbers=none]
       
   217 iload $n_x$
       
   218 ldc 1
       
   219 iadd
       
   220 istore $n_x$
       
   221 \end{lstlisting}
       
   222 
       
   223 \noindent 
       
   224 where $n_x$ is the index for the variable $x$.
       
   225 
       
   226 More complicated is the code for \pcode{if}-statments, say
       
   227 
       
   228 \begin{lstlisting}[mathescape,language={},numbers=none]
       
   229 if $b$ then $cs_1$ else $cs_2$
       
   230 \end{lstlisting}
       
   231 
       
   232 \noindent where $b$ is a boolean expression and the $cs_i$
       
   233 are the instructions for each \pcode{if}-branch. Lets assume
       
   234 we already generated code for $b$ and $cs_{1/2}$. Then in the
       
   235 true-case the control-flow of the program needs to be
       
   236 
       
   237 \begin{center}
       
   238 \begin{tikzpicture}[node distance=2mm and 4mm,
       
   239  block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
       
   240  point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
       
   241  skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
       
   242 \node (A1) [point] {};
       
   243 \node (b) [block, right=of A1] {code of $b$};
       
   244 \node (A2) [point, right=of b] {};
       
   245 \node (cs1) [block, right=of A2] {code of $cs_1$};
       
   246 \node (A3) [point, right=of cs1] {};
       
   247 \node (cs2) [block, right=of A3] {code of $cs_2$};
       
   248 \node (A4) [point, right=of cs2] {};
       
   249 
       
   250 \draw (A1) edge [->, black, line width=1mm] (b);
       
   251 \draw (b) edge [->, black, line width=1mm] (cs1);
       
   252 \draw (cs1) edge [->, black, line width=1mm] (A3);
       
   253 \draw (A3) edge [->, black, skip loop] (A4);
       
   254 \node [below=of cs2] {\raisebox{-5mm}{\small{}jump}};
       
   255 \end{tikzpicture}
       
   256 \end{center}
       
   257 
       
   258 \noindent where we start with running the code for $b$; since
       
   259 we are in the true case we continue with running the code for
       
   260 $cs_1$. After this however, we must not run the code for
       
   261 $cs_2$, but always jump after the last instruction of $cs_2$
       
   262 (the code for the \pcode{else}-branch). Note that this jump is
       
   263 unconditional, meaning we always have to jump to the end of
       
   264 $cs_2$. The corresponding instruction of the JVM is
       
   265 \pcode{goto}. In case $b$ turns out to be false we need the
       
   266 control-flow
       
   267 
       
   268 \begin{center}
       
   269 \begin{tikzpicture}[node distance=2mm and 4mm,
       
   270  block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm},
       
   271  point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red},
       
   272  skip loop/.style={black, line width=1mm, to path={-- ++(0,-10mm) -| (\tikztotarget)}}]
       
   273 \node (A1) [point] {};
       
   274 \node (b) [block, right=of A1] {code of $b$};
       
   275 \node (A2) [point, right=of b] {};
       
   276 \node (cs1) [block, right=of A2] {code of $cs_1$};
       
   277 \node (A3) [point, right=of cs1] {};
       
   278 \node (cs2) [block, right=of A3] {code of $cs_2$};
       
   279 \node (A4) [point, right=of cs2] {};
       
   280 
       
   281 \draw (A1) edge [->, black, line width=1mm] (b);
       
   282 \draw (b) edge [->, black, line width=1mm] (A2);
       
   283 \draw (A2) edge [skip loop] (A3);
       
   284 \draw (A3) edge [->, black, line width=1mm] (cs2);
       
   285 \draw (cs2) edge [->,black, line width=1mm] (A4);
       
   286 \node [below=of cs1] {\raisebox{-5mm}{\small{}conditional jump}};
       
   287 \end{tikzpicture}
       
   288 \end{center}
       
   289 
       
   290 \noindent where we now need a conditional jump (if the
       
   291 if-condition is false) from the end of the code for the 
       
   292 boolean to the beginning of the instructions $cs_2$. Once we 
       
   293 are finished with running $cs_2$ we can continue with whatever
       
   294 code comes after the if-statement.
       
   295 
       
   296 The \pcode{goto} and conditional jumps need addresses to where
       
   297 the jump should go. Since we are generating assembly code for
       
   298 the JVM, we do not actually have to give addresses, but need
       
   299 to attach labels to our code. These labels specify a target
       
   300 for a jump. Therefore the labels need to be unique, as
       
   301 otherwise it would be ambiguous where a jump should go. 
       
   302 A labels, say \pcode{L}, is attached to code like
       
   303 
       
   304 \begin{lstlisting}[mathescape,numbers=none]
       
   305 L:
       
   306   $instr_1$
       
   307   $instr_2$
       
   308     $\vdots$
       
   309 \end{lstlisting}
       
   310  
       
   311 Recall the ``trick'' with compiling boolean expressions: the 
       
   312 \textit{compile}-function for boolean expressions takes three
       
   313 arguments: an abstract syntax tree, an environment for 
       
   314 variable indices and also the label, $lab$, to where an conditional 
       
   315 jump needs to go. The clause for the expression $a_1 = a_2$, 
       
   316 for example, is as follows:
       
   317 
       
   318 \begin{center}
       
   319 \begin{tabular}{lcl}
       
   320 $\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ 
       
   321 \multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}
       
   322 \end{tabular}
       
   323 \end{center}
       
   324 
       
   325 \noindent 
       
   326 We are generating code for the subexpressions $a_1$ and $a_2$. 
       
   327 This will mean after running the corresponding code there will
       
   328 be two integers on top of the stack. If they are equal, we do 
       
   329 not have to do anything and just continue with the next 
       
   330 instructions (see control-flow of ifs above). However if they 
       
   331 are \emph{not} equal, then we need to (conditionally) jump to 
       
   332 the label $lab$. This can be done with the instruction
       
   333 
       
   334 \begin{lstlisting}[mathescape,numbers=none]
       
   335 if_icmpne $lab$
       
   336 \end{lstlisting}
       
   337 
       
   338 \noindent Other jump instructions for boolean operators are
       
   339 
       
   340 \begin{center}
       
   341 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
       
   342 $=$ & $\Rightarrow$ & \pcode{if_icmpne}\\
       
   343 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
       
   344 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
       
   345 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
       
   346 \end{tabular}
       
   347 \end{center}
       
   348 
       
   349 \noindent and so on. I leave it to you to extend the
       
   350 \textit{compile}-function for the other boolean expressions.
       
   351 Note that we need to jump whenever the boolean is \emph{not}
       
   352 true, which means we have to ``negate'' the jump---equals
       
   353 becomes not-equal, less becomes greater-or-equal. If you do
       
   354 not like this design (it can be the source of some nasty,
       
   355 hard-to-detect errors), you can also change the layout of the
       
   356 code and first give the code for the else-branch and then for
       
   357 the if-branch.
       
   358 
       
   359 We are now ready to give the compile function for 
       
   360 if-statments--remember this function returns for staments a 
       
   361 pair consisting of the code and an environment:
       
   362 
       
   363 \begin{center}
       
   364 \begin{tabular}{lcl}
       
   365 $\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\ 
       
   366 \multicolumn{3}{l}{$\qquad l_\textit{ifelse}\;$ (fresh label)}\\
       
   367 \multicolumn{3}{l}{$\qquad l_\textit{ifend}\;$ (fresh label)}\\
       
   368 \multicolumn{3}{l}{$\qquad (is_1, E') = \textit{compile}(cs_1, E)$}\\
       
   369 \multicolumn{3}{l}{$\qquad (is_2, E'') = \textit{compile}(cs_2, E')$}\\
       
   370 \multicolumn{3}{l}{$\qquad(\textit{compile}(b, E, l_\textit{ifelse})$}\\
       
   371 \multicolumn{3}{l}{$\qquad\phantom{(}@\;is_1$}\\
       
   372 \multicolumn{3}{l}{$\qquad\phantom{(}@\; \pcode{goto}\;l_\textit{ifend}$}\\
       
   373 \multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifelse}:$}\\
       
   374 \multicolumn{3}{l}{$\qquad\phantom{(}@\;is_2$}\\
       
   375 \multicolumn{3}{l}{$\qquad\phantom{(}@\;l_\textit{ifend}:, E'')$}\\
       
   376 \end{tabular}
       
   377 \end{center}
   163 
   378 
   164 \end{document}
   379 \end{document}
   165 
   380 
   166 %%% Local Variables: 
   381 %%% Local Variables: 
   167 %%% mode: latex  
   382 %%% mode: latex