9 \begin{document} |
9 \begin{document} |
10 \fnote{\copyright{} Christian Urban, King's College London, 2017, 2018, 2019} |
10 \fnote{\copyright{} Christian Urban, King's College London, 2017, 2018, 2019} |
11 |
11 |
12 \section*{Handout 7 (Compilation)} |
12 \section*{Handout 7 (Compilation)} |
13 |
13 |
|
14 |
|
15 |
14 The purpose of a compiler is to transform a program a human can read and |
16 The purpose of a compiler is to transform a program a human can read and |
15 write into code the machine can run as fast as possible. The fastest |
17 write into code the machine can run as fast as possible. The fastest |
16 code would be machine code the CPU can run directly, but it is often |
18 code would be machine code the CPU can run directly, but it is often |
17 good enough for improving the speed of a program to just target a |
19 good enough for improving the speed of a program to target a |
18 virtual machine. This produces not the fastest possible code, but code |
20 virtual machine. This produces not the fastest possible code, but code |
19 that is pretty enough. This way of producing code has the advantage that |
21 that is often pretty fast. This way of producing code has the advantage that |
20 the virtual machine takes care of things a compiler would normally need |
22 the virtual machine takes care of things a compiler would normally need |
21 to take care of (like explicit memory management). An interesting answer |
23 to take care of (like explicit memory management). |
22 for why studying compilers is given by John Regher in his compiler |
|
23 blog:\footnote{\url{http://blog.regehr.org/archives/1419}} |
|
24 |
|
25 \begin{quote}\it{}``We can start off with a couple of observations |
|
26 about the role of compilers. First, hardware is getting weirder |
|
27 rather than getting clocked faster: almost all processors are |
|
28 multicores and it looks like there is increasing asymmetry in |
|
29 resources across cores. Processors come with vector units, crypto |
|
30 accelerators, bit twiddling instructions, and lots of features to |
|
31 make virtualization and concurrency work. We have DSPs, GPUs, |
|
32 big.little, and Xeon Phi. This is only scratching the |
|
33 surface. Second, we’re getting tired of low-level languages and |
|
34 their associated security disasters, we want to write new code, to |
|
35 whatever extent possible, in safer, higher-level |
|
36 languages. Compilers are caught right in the middle of these |
|
37 opposing trends: one of their main jobs is to help bridge the large |
|
38 and growing gap between increasingly high-level languages and |
|
39 increasingly wacky platforms. It’s effectively a perpetual |
|
40 employment act for solid compiler hackers.'' |
|
41 \end{quote} |
|
42 |
24 |
43 As a first example in this module we will implement a compiler for the |
25 As a first example in this module we will implement a compiler for the |
44 very simple While-language. It will generate code for the Java Virtual |
26 very simple While-language. It will generate code for the Java Virtual |
45 Machine (JVM). This is a stack-based virtual machine, a fact which will |
27 Machine (JVM). Unfortunately the Java ecosystem does not come with an |
46 make it easy to generate code for arithmetic expressions. For example |
28 assembler which would be handy for our compiler-endeavour (unlike |
47 when compiling $1 + 2$ we need to generate the following three |
29 Microsoft's Common Language Infrastructure for the .Net platform which |
48 instructions |
30 has an assembler out-of-the-box). As a substitute we use in this module |
|
31 the 3rd-party programs Jasmin and Krakatau |
|
32 |
|
33 \begin{itemize} |
|
34 \item \url{http://jasmin.sourceforge.net} |
|
35 \item \url{https://github.com/Storyyeller/Krakatau} |
|
36 \end{itemize} |
|
37 |
|
38 \noindent |
|
39 The first is a Java program and the second a program written in Python. |
|
40 Each of them allow us to generate \emph{assembly} files that are still |
|
41 readable by humans, as opposed to class-files which are pretty much just |
|
42 (horrible) zeros and ones. Jasmin (respectively Krakatau) will then take |
|
43 an assembly file as input and generate the corresponding class file for |
|
44 us. |
|
45 |
|
46 Good about the JVM is that it is a stack-based virtual machine, a fact |
|
47 which will make it easy to generate code for arithmetic expressions. For |
|
48 example when compiling the expression $1 + 2$ we need to generate the |
|
49 following three instructions |
49 |
50 |
50 \begin{lstlisting}[language=JVMIS,numbers=none] |
51 \begin{lstlisting}[language=JVMIS,numbers=none] |
51 ldc 1 |
52 ldc 1 |
52 ldc 2 |
53 ldc 2 |
53 iadd |
54 iadd |
111 \end{lstlisting} |
112 \end{lstlisting} |
112 |
113 |
113 \noindent If we ``run'' these instructions, the result $8$ will be on |
114 \noindent If we ``run'' these instructions, the result $8$ will be on |
114 top of the stack (I leave this to you to verify; the meaning of each |
115 top of the stack (I leave this to you to verify; the meaning of each |
115 instruction should be clear). The result being on the top of the stack |
116 instruction should be clear). The result being on the top of the stack |
116 will be a convention we always observe in our compiler, that is the |
117 will be an important convention we always observe in our compiler. Note, |
117 results of arithmetic expressions will always be on top of the stack. |
118 that a different bracketing of the expression, for example $(1 + (2 * |
118 Note, that a different bracketing of the expression, for example $(1 + |
119 3)) + (4 - 3)$, produces a different abstract syntax tree and thus also |
119 (2 * 3)) + (4 - 3)$, produces a different abstract syntax tree and thus |
120 a different list of instructions. Generating code in this |
120 also a different list of instructions. Generating code in this |
|
121 post-order-traversal fashion is rather easy to implement: it can be done |
121 post-order-traversal fashion is rather easy to implement: it can be done |
122 with the following recursive \textit{compile}-function, which takes the |
122 with the following recursive \textit{compile}-function, which takes the |
123 abstract syntax tree as argument: |
123 abstract syntax tree as argument: |
124 |
124 |
125 \begin{center} |
125 \begin{center} |
182 \end{tabular} |
182 \end{tabular} |
183 \end{center} |
183 \end{center} |
184 |
184 |
185 \noindent In the last line we generate the code for variables |
185 \noindent In the last line we generate the code for variables |
186 where $E(x)$ stands for looking up the environment to which |
186 where $E(x)$ stands for looking up the environment to which |
187 index the variable $x$ maps to. |
187 index the variable $x$ maps to. This is similar to an interpreter, |
|
188 which also needs an environment: the difference is that the |
|
189 interpreter maintains a mapping from variables to current values (what is the |
|
190 currently the value of a variable), while compilers need a mapping |
|
191 from variables to memory locations (where can I find the current |
|
192 value for the variable in memory). |
188 |
193 |
189 There is a similar \textit{compile}-function for boolean |
194 There is a similar \textit{compile}-function for boolean |
190 expressions, but it includes a ``trick'' to do with |
195 expressions, but it includes a ``trick'' to do with |
191 \pcode{if}- and \pcode{while}-statements. To explain the issue |
196 \pcode{if}- and \pcode{while}-statements. To explain the issue |
192 let us first describe the compilation of statements of the |
197 let us first describe the compilation of statements of the |
204 list of instructions (in this case the empty list) and an |
209 list of instructions (in this case the empty list) and an |
205 environment for variables. The reason for the environment is |
210 environment for variables. The reason for the environment is |
206 that assignments in the While-language might change the |
211 that assignments in the While-language might change the |
207 environment---clearly if a variable is used for the first |
212 environment---clearly if a variable is used for the first |
208 time, we need to allocate a new index and if it has been used |
213 time, we need to allocate a new index and if it has been used |
209 before, we need to be able to retrieve the associated index. |
214 before, then we need to be able to retrieve the associated index. |
210 This is reflected in the clause for compiling assignments: |
215 This is reflected in the clause for compiling assignments, say |
|
216 $\textit{x := a}$: |
211 |
217 |
212 \begin{center} |
218 \begin{center} |
213 \begin{tabular}{lcl} |
219 \begin{tabular}{lcl} |
214 $\textit{compile}(x := a, E)$ & $\dn$ & |
220 $\textit{compile}(x := a, E)$ & $\dn$ & |
215 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$ |
221 $(\textit{compile}(a, E) \;@\;\pcode{istore}\;index, E')$ |
226 the index is and return the environment unchanged (that is in |
232 the index is and return the environment unchanged (that is in |
227 this case $E' = E$). However, if this is the first encounter |
233 this case $E' = E$). However, if this is the first encounter |
228 of the variable $x$ in the program, then we have to augment |
234 of the variable $x$ in the program, then we have to augment |
229 the environment and assign $x$ with the largest index in $E$ |
235 the environment and assign $x$ with the largest index in $E$ |
230 plus one (that is $E' = E(x \mapsto largest\_index + 1)$). |
236 plus one (that is $E' = E(x \mapsto largest\_index + 1)$). |
231 That means for the assignment $x := x + 1$ we generate the |
237 This means for the assignment $x := x + 1$ we generate the |
232 following code |
238 following code |
233 |
239 |
234 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
240 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
235 iload $n_x$ |
241 iload $n_x$ |
236 ldc 1 |
242 ldc 1 |
237 iadd |
243 iadd |
238 istore $n_x$ |
244 istore $n_x$ |
239 \end{lstlisting} |
245 \end{lstlisting} |
240 |
246 |
241 \noindent |
247 \noindent |
242 where $n_x$ is the index for the variable $x$. The code for |
248 where $n_x$ is the index (or pointer to the memory) for the variable $x$ . The code for |
243 looking-up the index for the variable is as follow: |
249 looking-up the index for the variable is as follow: |
244 |
250 |
245 \begin{center} |
251 \begin{center} |
246 \begin{tabular}{lcl} |
252 \begin{tabular}{lcl} |
247 $index \;=\; \textit{if}\;(E(x).\textit{isDefind})\;\textit{then}\;E(x)\;\textit{else}\;|E|$ |
253 $index \;=\; E\textit{.getOrElse}(x, |E|)$ |
248 \end{tabular} |
254 \end{tabular} |
249 \end{center} |
255 \end{center} |
250 |
256 |
251 \noindent |
257 \noindent |
252 In case the environment $E$ contains an index for $x$, we return it. |
258 In case the environment $E$ contains an index for $x$, we return it. |
253 Otherwise we ``create'' a new index by returning the size $|E|$ of the |
259 Otherwise we ``create'' a new index by returning the size $|E|$ of the |
254 environment (that will be an index that is guaranteed to be not used |
260 environment (that will be an index that is guaranteed to be not used |
255 yet). |
261 yet). |
256 |
262 |
257 More complicated is the generation of code for \pcode{if}-statements, say |
263 A bit more complicated is the generation of code for \pcode{if}-statements, say |
258 |
264 |
259 \begin{lstlisting}[mathescape,language={},numbers=none] |
265 \begin{lstlisting}[mathescape,language={},numbers=none] |
260 if $b$ then $cs_1$ else $cs_2$ |
266 if $b$ then $cs_1$ else $cs_2$ |
261 \end{lstlisting} |
267 \end{lstlisting} |
262 |
268 |
421 the else-branch, the label $L_\textit{ifesle}$, followed by |
427 the else-branch, the label $L_\textit{ifesle}$, followed by |
422 the instructions for the else-branch, followed by the |
428 the instructions for the else-branch, followed by the |
423 after-the-else-branch label. Consider for example the |
429 after-the-else-branch label. Consider for example the |
424 if-statement: |
430 if-statement: |
425 |
431 |
426 \begin{lstlisting}[mathescape,numbers=none,language={}] |
432 \begin{lstlisting}[mathescape,numbers=none,language=While] |
427 if 1 = 1 then x := 2 else y := 3 |
433 if 1 = 1 then x := 2 else y := 3 |
428 \end{lstlisting} |
434 \end{lstlisting} |
429 |
435 |
430 \noindent |
436 \noindent |
431 The generated code is as follows: |
437 The generated code is as follows: |
432 |
438 |
433 \begin{lstlisting}[language=JVMIS,mathescape,language={}] |
439 \begin{lstlisting}[language=JVMIS,mathescape,numbers=left] |
434 ldc 1 |
440 ldc 1 |
435 ldc 1 |
441 ldc 1 |
436 if_icmpne L_ifelse $\quad\tikz[remember picture] \node (C) {\mbox{}};$ |
442 if_icmpne L_ifelse $\quad\tikz[remember picture] \node (C) {\mbox{}};$ |
437 ldc 2 |
443 ldc 2 |
438 istore 0 |
444 istore 0 |
531 \end{center} |
537 \end{center} |
532 |
538 |
533 \noindent I let you go through how this clause works. As an example |
539 \noindent I let you go through how this clause works. As an example |
534 you can consider the while-loop |
540 you can consider the while-loop |
535 |
541 |
536 \begin{lstlisting}[mathescape,numbers=none,language={}] |
542 \begin{lstlisting}[mathescape,numbers=none,language=While] |
537 while x <= 10 do x := x + 1 |
543 while x <= 10 do x := x + 1 |
538 \end{lstlisting} |
544 \end{lstlisting} |
539 |
545 |
540 \noindent yielding the following code |
546 \noindent yielding the following code |
541 |
547 |
542 \begin{lstlisting}[language=JVMIS,mathescape,language={}] |
548 \begin{lstlisting}[language=JVMIS,mathescape,numbers=left] |
543 L_wbegin: $\quad\tikz[remember picture] \node[] (LB) {\mbox{}};$ |
549 L_wbegin: $\quad\tikz[remember picture] \node[] (LB) {\mbox{}};$ |
544 iload 0 |
550 iload 0 |
545 ldc 10 |
551 ldc 10 |
546 if_icmpgt L_wend $\quad\tikz[remember picture] \node (LC) {\mbox{}};$ |
552 if_icmpgt L_wend $\quad\tikz[remember picture] \node (LC) {\mbox{}};$ |
547 iload 0 |
553 iload 0 |
557 -- ++(0mm,17.3mm) |- (\tikztotarget)},line width=1mm] (LB.east); |
563 -- ++(0mm,17.3mm) |- (\tikztotarget)},line width=1mm] (LB.east); |
558 \draw[->,very thick] (LC) edge [->,to path={-- ++(10mm,0mm) |
564 \draw[->,very thick] (LC) edge [->,to path={-- ++(10mm,0mm) |
559 -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (LD.east); |
565 -- ++(0mm,-17.3mm) |- (\tikztotarget)},line width=1mm] (LD.east); |
560 \end{tikzpicture} |
566 \end{tikzpicture} |
561 |
567 |
|
568 \noindent |
|
569 I leave it to you to read the code and follow its controlflow. |
562 |
570 |
563 Next we need to consider the statement \pcode{write x}, which |
571 Next we need to consider the statement \pcode{write x}, which |
564 can be used to print out the content of a variable. For this |
572 can be used to print out the content of a variable. For this |
565 we need to use a Java library function. In order to avoid |
573 we need to use a Java library function. In order to avoid |
566 having to generate a lot of code for each |
574 having to generate a lot of code for each |
672 \caption{Generated code for a test program. This code can be |
680 \caption{Generated code for a test program. This code can be |
673 processed by an Java assembler producing a class-file, which |
681 processed by an Java assembler producing a class-file, which |
674 can be run by the {\tt{}java}-program.\label{test}} |
682 can be run by the {\tt{}java}-program.\label{test}} |
675 \end{figure} |
683 \end{figure} |
676 |
684 |
|
685 \subsection*{Arrays} |
|
686 |
|
687 Maybe a useful addition to the While-language are arrays. This |
|
688 lets us generate more interesting While-programs by translating |
|
689 BF*** programs into equivalent While-programs. This means we |
|
690 want to support the following three constructions |
|
691 |
|
692 \begin{lstlisting}[mathescape,language=While] |
|
693 new arr[15000] |
|
694 x := 3 + arr[3 + y] |
|
695 arr[42 * n] := ... |
|
696 \end{lstlisting} |
|
697 |
|
698 \noindent |
|
699 The first one creates a new array with name \pcode{arr} that can |
|
700 hold a given number of integers. The second is referencing an array |
|
701 inside a arithmetic expression. Essentially we have to be able to |
|
702 look up the contents of an array at an index. Similarly we need to be |
|
703 able to update the content of an array (3rd line). The latter two |
|
704 constructions state that the index to an array can be given by |
|
705 an arithmetic expression. For creating a new array we need to generate |
|
706 the following JVM instructions: |
|
707 |
|
708 \begin{lstlisting}[mathescape,language=JVMIS] |
|
709 ldc number |
|
710 newarray int |
|
711 astore loc_var |
|
712 \end{lstlisting} |
|
713 |
|
714 \noindent |
|
715 First we need to put the dimension of the array onto the |
|
716 stack, then next instruction creates the array and last we |
|
717 need to store the array in a local variable (like ``simple'' variables). |
|
718 For looking up an element in an array we can use the following code |
|
719 |
|
720 \begin{lstlisting}[mathescape,language=JVMIS] |
|
721 aload loc_var |
|
722 index_aexp |
|
723 iaload |
|
724 \end{lstlisting} |
|
725 |
|
726 \noindent |
|
727 This first loads the ``pointer'' to the array onto the stack. Then we have the |
|
728 instructions corresponding to the index where we want to look up the array. The |
|
729 idea is that these instructions will leave a concrete number on the stack, which is |
|
730 the index. Finally we just need to load the corresponding element onto the stack. |
|
731 |
677 \end{document} |
732 \end{document} |
678 |
733 |
679 %%% Local Variables: |
734 %%% Local Variables: |
680 %%% mode: latex |
735 %%% mode: latex |
681 %%% TeX-master: t |
736 %%% TeX-master: t |