handouts/ho07.tex
changeset 376 af65ffff9cdd
parent 375 bf36664a3196
child 377 a052a83f562e
equal deleted inserted replaced
375:bf36664a3196 376:af65ffff9cdd
     8 \begin{document}
     8 \begin{document}
     9 
     9 
    10 \section*{Handout 7 (Compilation)}
    10 \section*{Handout 7 (Compilation)}
    11 
    11 
    12 The purpose of a compiler is to transform a program, a human
    12 The purpose of a compiler is to transform a program, a human
    13 can write, into code the machine can run as fast as possible.
    13 can read and write, into code the machine can run as fast as
    14 The fastest code would be machine code the CPU can run
    14 possible. The fastest code would be machine code the CPU can
    15 directly, but it is often enough to improve the speed of a
    15 run directly, but it is often enough to improve the speed of a
    16 program by just targeting a virtual machine. This produces not
    16 program by just targeting a virtual machine. This produces not
    17 the fastest possible code, but code that is fast enough and
    17 the fastest possible code, but code that is fast enough and
    18 has the advantage that the virtual machine care of things a
    18 has the advantage that the virtual machine takes care of
    19 compiler would normally need to take care of (like explicit
    19 things a compiler would normally need to take care of (like
    20 memory management).
    20 explicit memory management).
    21 
    21 
    22 As an example we will implement a compiler for the very simple
    22 As a first example we will implement a compiler for the very
    23 While-language. We will be generating code for the Java
    23 simple While-language. It will generate code for the Java
    24 Virtual Machine. This is a stack-based virtual machine, a fact
    24 Virtual Machine (JVM). This is a stack-based virtual machine,
    25 which will make it easy to generate code for arithmetic
    25 a fact which will make it easy to generate code for arithmetic
    26 expressions. For example for generating code for the
    26 expressions. For example for generating code for the
    27 expression $1 + 2$ we need to generate the following three
    27 expression $1 + 2$ we need to generate the following three
    28 instructions
    28 instructions
    29 
    29 
    30 \begin{lstlisting}[numbers=none]
    30 \begin{lstlisting}[numbers=none]
    33 iadd 
    33 iadd 
    34 \end{lstlisting}
    34 \end{lstlisting}
    35 
    35 
    36 \noindent The first instruction loads the constant $1$ onto
    36 \noindent The first instruction loads the constant $1$ onto
    37 the stack, the next one $2$, the third instruction adds both
    37 the stack, the next one $2$, the third instruction adds both
    38 numbers together replacing the top elements of the stack with
    38 numbers together replacing the top two elements of the stack
    39 the result $3$. For simplicity, we will throughout consider
    39 with the result $3$. For simplicity, we will throughout
    40 only integer numbers and results. Therefore we can use the
    40 consider only integer numbers and results. Therefore we can
    41 instructions \code{iadd}, \code{isub}, \code{imul},
    41 use the JVM instructions \code{iadd}, \code{isub},
    42 \code{idiv} and so on. The \code{i} stands for integer
    42 \code{imul}, \code{idiv} and so on. The \code{i} stands for
    43 instructions in the JVM (alternatives are \code{d} for doubles,
    43 integer instructions in the JVM (alternatives are \code{d} for
    44 \code{l} for longs and \code{f} for floats).
    44 doubles, \code{l} for longs and \code{f} for floats).
    45 
    45 
    46 Recall our grammar for arithmetic expressions ($E$ is the
    46 Recall our grammar for arithmetic expressions ($E$ is the
    47 starting symbol):
    47 starting symbol):
    48 
    48 
    49 
    49 
    58           | \meta{Id}
    58           | \meta{Id}
    59           | \meta{Num}\\
    59           | \meta{Num}\\
    60 \end{plstx}
    60 \end{plstx}
    61 
    61 
    62 
    62 
    63 \noindent where \meta{Id} stands for variables and
    63 \noindent where \meta{Id} stands for variables and \meta{Num}
    64 \meta{Num} for numbers. For the moment let us omit variables from
    64 for numbers. For the moment let us omit variables from
    65 arithmetic expressions. Our parser will take this grammar and
    65 arithmetic expressions. Our parser will take this grammar and
    66 produce abstract syntax trees. For
    66 given an input produce abstract syntax trees. For example for
    67 example for the expression $1 + ((2 * 3) + (4 - 3))$ it will
    67 the expression $1 + ((2 * 3) + (4 - 3))$ it will produce the
    68 produce the following tree.
    68 following tree.
    69 
    69 
    70 \begin{center}
    70 \begin{center}
    71 \begin{tikzpicture}
    71 \begin{tikzpicture}
    72 \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
    72 \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]]
    73 \end{tikzpicture}
    73 \end{tikzpicture}
    99 expressions will always be on top of the stack. Note, that a
    99 expressions will always be on top of the stack. Note, that a
   100 different bracketing of the expression, for example $(1 + (2 *
   100 different bracketing of the expression, for example $(1 + (2 *
   101 3)) + (4 - 3)$, produces a different abstract syntax tree and
   101 3)) + (4 - 3)$, produces a different abstract syntax tree and
   102 thus potentially also a different list of instructions.
   102 thus potentially also a different list of instructions.
   103 Generating code in this fashion is rather easy to implement:
   103 Generating code in this fashion is rather easy to implement:
   104 it can be done with the following \textit{compile}-function,
   104 it can be done with the following recursive
   105 which takes the abstract syntax tree as argument:
   105 \textit{compile}-function, which takes the abstract syntax
       
   106 tree as argument:
   106 
   107 
   107 \begin{center}
   108 \begin{center}
   108 \begin{tabular}{lcl}
   109 \begin{tabular}{lcl}
   109 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
   110 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\
   110 $\textit{compile}(a_1 + a_2)$ & $\dn$ &
   111 $\textit{compile}(a_1 + a_2)$ & $\dn$ &
   169 index the variable $x$ maps to.
   170 index the variable $x$ maps to.
   170 
   171 
   171 There is a similar \textit{compile}-function for boolean
   172 There is a similar \textit{compile}-function for boolean
   172 expressions, but it includes a ``trick'' to do with
   173 expressions, but it includes a ``trick'' to do with
   173 \pcode{if}- and \pcode{while}-statements. To explain the issue
   174 \pcode{if}- and \pcode{while}-statements. To explain the issue
   174 let us explain first the compilation of statements of the
   175 let us first describe the compilation of statements of the
   175 While-language. The clause for \pcode{skip} is trivial, since
   176 While-language. The clause for \pcode{skip} is trivial, since
   176 we do not have to generate any instruction
   177 we do not have to generate any instruction
   177 
   178 
   178 \begin{center}
   179 \begin{center}
   179 \begin{tabular}{lcl}
   180 \begin{tabular}{lcl}
   180 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
   181 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\
   181 \end{tabular}
   182 \end{tabular}
   182 \end{center}
   183 \end{center}
   183 
   184 
   184 \noindent Note that the \textit{compile}-function for
   185 \noindent $[]$ is the empty list of instructions. Note that
   185 statements returns a pair, a list of instructions (in this
   186 the \textit{compile}-function for statements returns a pair, a
   186 case the empty list) and an environment for variables. The
   187 list of instructions (in this case the empty list) and an
   187 reason for the environment is that assignments in the
   188 environment for variables. The reason for the environment is
   188 While-language might change the environment---clearly if a
   189 that assignments in the While-language might change the
   189 variable is used for the first time, we need to allocate a new
   190 environment---clearly if a variable is used for the first
   190 index and if it has been used before, we need to be able to
   191 time, we need to allocate a new index and if it has been used
   191 retrieve the associated index. This is reflected in
   192 before, we need to be able to retrieve the associated index.
   192 the clause for compiling assignments:
   193 This is reflected in the clause for compiling assignments:
   193 
   194 
   194 \begin{center}
   195 \begin{center}
   195 \begin{tabular}{lcl}
   196 \begin{tabular}{lcl}
   196 $\text{compile}(x := a, E)$ & $\dn$ & 
   197 $\textit{compile}(x := a, E)$ & $\dn$ & 
   197 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
   198 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$
   198 \end{tabular}
   199 \end{tabular}
   199 \end{center}
   200 \end{center}
   200 
   201 
   201 \noindent We first generate code for the right-hand side of
   202 \noindent We first generate code for the right-hand side of
   227 
   228 
   228 \begin{lstlisting}[mathescape,language={},numbers=none]
   229 \begin{lstlisting}[mathescape,language={},numbers=none]
   229 if $b$ then $cs_1$ else $cs_2$
   230 if $b$ then $cs_1$ else $cs_2$
   230 \end{lstlisting}
   231 \end{lstlisting}
   231 
   232 
   232 \noindent where $b$ is a boolean expression and the $cs_i$
   233 \noindent where $b$ is a boolean expression and the $cs_{1/2}$
   233 are the instructions for each \pcode{if}-branch. Lets assume
   234 are the statements for each \pcode{if}-branch. Lets assume
   234 we already generated code for $b$ and $cs_{1/2}$. Then in the
   235 we already generated code for $b$ and $cs_{1/2}$. Then in the
   235 true-case the control-flow of the program needs to be
   236 true-case the control-flow of the program needs to be
   236 
   237 
   237 \begin{center}
   238 \begin{center}
   238 \begin{tikzpicture}[node distance=2mm and 4mm,
   239 \begin{tikzpicture}[node distance=2mm and 4mm,
   291 if-condition is false) from the end of the code for the 
   292 if-condition is false) from the end of the code for the 
   292 boolean to the beginning of the instructions $cs_2$. Once we 
   293 boolean to the beginning of the instructions $cs_2$. Once we 
   293 are finished with running $cs_2$ we can continue with whatever
   294 are finished with running $cs_2$ we can continue with whatever
   294 code comes after the if-statement.
   295 code comes after the if-statement.
   295 
   296 
   296 The \pcode{goto} and conditional jumps need addresses to where
   297 The \pcode{goto} and the conditional jumps need addresses to
   297 the jump should go. Since we are generating assembly code for
   298 where the jump should go. Since we are generating assembly
   298 the JVM, we do not actually have to give addresses, but need
   299 code for the JVM, we do not actually have to give (numeric)
   299 to attach labels to our code. These labels specify a target
   300 addresses, but can just attach (symbolic) labels to our code.
   300 for a jump. Therefore the labels need to be unique, as
   301 These labels specify a target for a jump. Therefore the labels
   301 otherwise it would be ambiguous where a jump should go. 
   302 need to be unique, as otherwise it would be ambiguous where a
   302 A labels, say \pcode{L}, is attached to code like
   303 jump should go to. A label, say \pcode{L}, is attached to code
       
   304 like
   303 
   305 
   304 \begin{lstlisting}[mathescape,numbers=none]
   306 \begin{lstlisting}[mathescape,numbers=none]
   305 L:
   307 L:
   306   $instr_1$
   308   $instr_1$
   307   $instr_2$
   309   $instr_2$
   308     $\vdots$
   310     $\vdots$
   309 \end{lstlisting}
   311 \end{lstlisting}
       
   312  
       
   313 \noindent where a label is indicated by a colon. 
   310  
   314  
   311 Recall the ``trick'' with compiling boolean expressions: the 
   315 Recall the ``trick'' with compiling boolean expressions: the 
   312 \textit{compile}-function for boolean expressions takes three
   316 \textit{compile}-function for boolean expressions takes three
   313 arguments: an abstract syntax tree, an environment for 
   317 arguments: an abstract syntax tree, an environment for 
   314 variable indices and also the label, $lab$, to where an conditional 
   318 variable indices and also the label, $lab$, to where an conditional 
   320 $\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ 
   324 $\textit{compile}(a_1 = a_2, E, lab)$ & $\dn$\\ 
   321 \multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}
   325 \multicolumn{3}{l}{$\qquad\textit{compile}(a_1, E) \;@\;\textit{compile}(a_2, E)\;@\; \pcode{if_icmpne}\;lab$}
   322 \end{tabular}
   326 \end{tabular}
   323 \end{center}
   327 \end{center}
   324 
   328 
   325 \noindent 
   329 \noindent where we are first generating code for the
   326 We are generating code for the subexpressions $a_1$ and $a_2$. 
   330 subexpressions $a_1$ and $a_2$. This will mean after running
   327 This will mean after running the corresponding code there will
   331 the corresponding code there will be two integers on top of
   328 be two integers on top of the stack. If they are equal, we do 
   332 the stack. If they are equal, we do not have to do anything
   329 not have to do anything and just continue with the next 
   333 (except for popping them off from the stack) and just continue
   330 instructions (see control-flow of ifs above). However if they 
   334 with the next instructions (see control-flow of ifs above).
   331 are \emph{not} equal, then we need to (conditionally) jump to 
   335 However if they are \emph{not} equal, then we need to
   332 the label $lab$. This can be done with the instruction
   336 (conditionally) jump to the label $lab$. This can be done with
       
   337 the instruction
   333 
   338 
   334 \begin{lstlisting}[mathescape,numbers=none]
   339 \begin{lstlisting}[mathescape,numbers=none]
   335 if_icmpne $lab$
   340 if_icmpne $lab$
   336 \end{lstlisting}
   341 \end{lstlisting}
   337 
   342 
   338 \noindent Other jump instructions for boolean operators are
   343 \noindent Other jump instructions for boolean operators are
   339 
   344 
   340 \begin{center}
   345 \begin{center}
   341 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
   346 \begin{tabular}{l@{\hspace{10mm}}c@{\hspace{10mm}}l}
   342 $=$ & $\Rightarrow$ & \pcode{if_icmpne}\\
       
   343 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
   347 $\not=$ & $\Rightarrow$ & \pcode{if_icmpeq}\\
   344 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
   348 $<$ & $\Rightarrow$ & \pcode{if_icmpge}\\
   345 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
   349 $\le$ & $\Rightarrow$ & \pcode{if_icmpgt}\\
   346 \end{tabular}
   350 \end{tabular}
   347 \end{center}
   351 \end{center}
   348 
   352 
   349 \noindent and so on. I leave it to you to extend the
   353 \noindent and so on. I leave it to you to extend the
   350 \textit{compile}-function for the other boolean expressions.
   354 \textit{compile}-function for the other boolean expressions.
   351 Note that we need to jump whenever the boolean is \emph{not}
   355 Note that we need to jump whenever the boolean is \emph{not}
   352 true, which means we have to ``negate'' the jump---equals
   356 true, which means we have to ``negate'' the jump
   353 becomes not-equal, less becomes greater-or-equal. If you do
   357 condition---equals becomes not-equal, less becomes
   354 not like this design (it can be the source of some nasty,
   358 greater-or-equal. If you do not like this design (it can be
   355 hard-to-detect errors), you can also change the layout of the
   359 the source of some nasty, hard-to-detect errors), you can also
   356 code and first give the code for the else-branch and then for
   360 change the layout of the code and first give the code for the
   357 the if-branch.
   361 else-branch and then for the if-branch. However in the case
       
   362 of while-loops this way of generating code still seems
       
   363 the most convenient.
   358 
   364 
   359 We are now ready to give the compile function for 
   365 We are now ready to give the compile function for 
   360 if-statments--remember this function returns for staments a 
   366 if-statments---remember this function returns for staments a 
   361 pair consisting of the code and an environment:
   367 pair consisting of the code and an environment:
   362 
   368 
   363 \begin{center}
   369 \begin{center}
   364 \begin{tabular}{lcl}
   370 \begin{tabular}{lcl}
   365 $\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\ 
   371 $\textit{compile}(\pcode{if}\;b\;\pcode{then}\; cs_1\;\pcode{else}\; cs_2, E)$ & $\dn$\\