5 \usepackage{../grammar} |
5 \usepackage{../grammar} |
6 \usepackage{../graphics} |
6 \usepackage{../graphics} |
7 |
7 |
8 |
8 |
9 \begin{document} |
9 \begin{document} |
|
10 \fnote{\copyright{} Christian Urban, King's College London, 2017, 2018, 2019} |
10 |
11 |
11 \section*{Handout 7 (Compilation)} |
12 \section*{Handout 7 (Compilation)} |
12 |
13 |
13 The purpose of a compiler is to transform a program, a human |
14 The purpose of a compiler is to transform a program a human can read and |
14 can read and write, into code the machine can run as fast as |
15 write into code the machine can run as fast as possible. The fastest |
15 possible. The fastest code would be machine code the CPU can |
16 code would be machine code the CPU can run directly, but it is often |
16 run directly, but it is often enough to improve the speed of a |
17 good enough for improving the speed of a program to just target a |
17 program by just targeting a virtual machine. This produces not |
18 virtual machine. This produces not the fastest possible code, but code |
18 the fastest possible code, but code that is fast enough and |
19 that is pretty enough. This way of producing code has the advantage that |
19 has the advantage that the virtual machine takes care of |
20 the virtual machine takes care of things a compiler would normally need |
20 things a compiler would normally need to take care of (like |
21 to take care of (like explicit memory management). An interesting answer |
21 explicit memory management). Why study compilers? John Regher |
22 for why studying compilers is given by John Regher in his compiler |
22 gives this answer in his compiler blog:\footnote{\url{http://blog.regehr.org/archives/1419}} |
23 blog:\footnote{\url{http://blog.regehr.org/archives/1419}} |
23 |
24 |
24 \begin{quote}\it{}``We can start off with a couple of observations |
25 \begin{quote}\it{}``We can start off with a couple of observations |
25 about the role of compilers. First, hardware is getting weirder |
26 about the role of compilers. First, hardware is getting weirder |
26 rather than getting clocked faster: almost all processors are |
27 rather than getting clocked faster: almost all processors are |
27 multicores and it looks like there is increasing asymmetry in |
28 multicores and it looks like there is increasing asymmetry in |
39 employment act for solid compiler hackers.'' |
40 employment act for solid compiler hackers.'' |
40 \end{quote} |
41 \end{quote} |
41 |
42 |
42 As a first example in this module we will implement a compiler for the |
43 As a first example in this module we will implement a compiler for the |
43 very simple While-language. It will generate code for the Java Virtual |
44 very simple While-language. It will generate code for the Java Virtual |
44 Machine (JVM). This is a stack-based virtual machine, a fact which |
45 Machine (JVM). This is a stack-based virtual machine, a fact which will |
45 will make it easy to generate code for arithmetic expressions. For |
46 make it easy to generate code for arithmetic expressions. For example |
46 example for generating code for the expression $1 + 2$ we need to |
47 when compiling $1 + 2$ we need to generate the following three |
47 generate the following three instructions |
48 instructions |
48 |
49 |
49 \begin{lstlisting}[numbers=none] |
50 \begin{lstlisting}[language=JVMIS,numbers=none] |
50 ldc 1 |
51 ldc 1 |
51 ldc 2 |
52 ldc 2 |
52 iadd |
53 iadd |
53 \end{lstlisting} |
54 \end{lstlisting} |
54 |
55 |
55 \noindent The first instruction loads the constant $1$ onto |
56 \noindent The first instruction loads the constant $1$ onto |
56 the stack, the next one $2$, the third instruction adds both |
57 the stack, the next one loads $2$, the third instruction adds both |
57 numbers together replacing the top two elements of the stack |
58 numbers together replacing the top two elements of the stack |
58 with the result $3$. For simplicity, we will throughout |
59 with the result $3$. For simplicity, we will throughout |
59 consider only integer numbers and results. Therefore we can |
60 consider only integer numbers and results. Therefore we can |
60 use the JVM instructions \code{iadd}, \code{isub}, |
61 use the JVM instructions \code{iadd}, \code{isub}, |
61 \code{imul}, \code{idiv} and so on. The \code{i} stands for |
62 \code{imul}, \code{idiv} and so on. The \code{i} stands for |
78 | \meta{Num}\\ |
79 | \meta{Num}\\ |
79 \end{plstx} |
80 \end{plstx} |
80 |
81 |
81 |
82 |
82 \noindent where \meta{Id} stands for variables and \meta{Num} |
83 \noindent where \meta{Id} stands for variables and \meta{Num} |
83 for numbers. For the moment let us omit variables from |
84 for numbers. For the moment let us omit variables from arithmetic |
84 arithmetic expressions. Our parser will take this grammar and |
85 expressions. Our parser will take this grammar and given an input |
85 given an input produce abstract syntax trees. For example for |
86 produce abstract syntax trees. For example we will obtain for the |
86 the expression $1 + ((2 * 3) + (4 - 3))$ it will produce the |
87 expression $1 + ((2 * 3) + (4 - 3))$ the following tree. |
87 following tree. |
|
88 |
88 |
89 \begin{center} |
89 \begin{center} |
90 \begin{tikzpicture} |
90 \begin{tikzpicture} |
91 \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]] |
91 \Tree [.$+$ [.$1$ ] [.$+$ [.$*$ $2$ $3$ ] [.$-$ $4$ $3$ ]]] |
92 \end{tikzpicture} |
92 \end{tikzpicture} |
93 \end{center} |
93 \end{center} |
94 |
94 |
95 \noindent To generate code for this expression, we need to |
95 \noindent To generate JVM code for this expression, we need to |
96 traverse this tree in post-order fashion and emit code for |
96 traverse this tree in post-order fashion and emit code for |
97 each node---this traversal in post-order fashion will produce |
97 each node---this traversal in post-order fashion will produce |
98 code for a stack-machine (what the JVM is). Doing so for the |
98 code for a stack-machine (what the JVM is). Doing so for the |
99 tree above generates the instructions |
99 tree above generates the instructions |
100 |
100 |
101 \begin{lstlisting}[numbers=none] |
101 \begin{lstlisting}[language=JVMIS,numbers=none] |
102 ldc 1 |
102 ldc 1 |
103 ldc 2 |
103 ldc 2 |
104 ldc 3 |
104 ldc 3 |
105 imul |
105 imul |
106 ldc 4 |
106 ldc 4 |
108 isub |
108 isub |
109 iadd |
109 iadd |
110 iadd |
110 iadd |
111 \end{lstlisting} |
111 \end{lstlisting} |
112 |
112 |
113 \noindent If we ``run'' these instructions, the result $8$ |
113 \noindent If we ``run'' these instructions, the result $8$ will be on |
114 will be on top of the stack (I leave this to you to verify; |
114 top of the stack (I leave this to you to verify; the meaning of each |
115 the meaning of each instruction should be clear). The result |
115 instruction should be clear). The result being on the top of the stack |
116 being on the top of the stack will be a convention we always |
116 will be a convention we always observe in our compiler, that is the |
117 observe in our compiler, that is the results of arithmetic |
117 results of arithmetic expressions will always be on top of the stack. |
118 expressions will always be on top of the stack. Note, that a |
118 Note, that a different bracketing of the expression, for example $(1 + |
119 different bracketing of the expression, for example $(1 + (2 * |
119 (2 * 3)) + (4 - 3)$, produces a different abstract syntax tree and thus |
120 3)) + (4 - 3)$, produces a different abstract syntax tree and |
120 also a different list of instructions. Generating code in this |
121 thus potentially also a different list of instructions. |
121 post-order-traversal fashion is rather easy to implement: it can be done |
122 Generating code in this fashion is rather easy to implement: |
122 with the following recursive \textit{compile}-function, which takes the |
123 it can be done with the following recursive |
123 abstract syntax tree as argument: |
124 \textit{compile}-function, which takes the abstract syntax |
|
125 tree as argument: |
|
126 |
124 |
127 \begin{center} |
125 \begin{center} |
128 \begin{tabular}{lcl} |
126 \begin{tabular}{lcl} |
129 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\ |
127 $\textit{compile}(n)$ & $\dn$ & $\pcode{ldc}\; n$\\ |
130 $\textit{compile}(a_1 + a_2)$ & $\dn$ & |
128 $\textit{compile}(a_1 + a_2)$ & $\dn$ & |
142 variables. We will represent them as \emph{local variables} in |
140 variables. We will represent them as \emph{local variables} in |
143 the JVM. Essentially, local variables are an array or pointers |
141 the JVM. Essentially, local variables are an array or pointers |
144 to memory cells, containing in our case only integers. Looking |
142 to memory cells, containing in our case only integers. Looking |
145 up a variable can be done with the instruction |
143 up a variable can be done with the instruction |
146 |
144 |
147 \begin{lstlisting}[mathescape,numbers=none] |
145 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
148 iload $index$ |
146 iload $index$ |
149 \end{lstlisting} |
147 \end{lstlisting} |
150 |
148 |
151 \noindent |
149 \noindent |
152 which places the content of the local variable $index$ onto |
150 which places the content of the local variable $index$ onto |
153 the stack. Storing the top of the stack into a local variable |
151 the stack. Storing the top of the stack into a local variable |
154 can be done by the instruction |
152 can be done by the instruction |
155 |
153 |
156 \begin{lstlisting}[mathescape,numbers=none] |
154 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
157 istore $index$ |
155 istore $index$ |
158 \end{lstlisting} |
156 \end{lstlisting} |
159 |
157 |
160 \noindent Note that this also pops off the top of the stack. |
158 \noindent Note that this also pops off the top of the stack. |
161 One problem we have to overcome, however, is that local |
159 One problem we have to overcome, however, is that local |
199 \begin{tabular}{lcl} |
197 \begin{tabular}{lcl} |
200 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\ |
198 $\textit{compile}(\pcode{skip}, E)$ & $\dn$ & $([], E)$\\ |
201 \end{tabular} |
199 \end{tabular} |
202 \end{center} |
200 \end{center} |
203 |
201 |
204 \noindent $[]$ is the empty list of instructions. Note that |
202 \noindent whereby $[]$ is the empty list of instructions. Note that |
205 the \textit{compile}-function for statements returns a pair, a |
203 the \textit{compile}-function for statements returns a pair, a |
206 list of instructions (in this case the empty list) and an |
204 list of instructions (in this case the empty list) and an |
207 environment for variables. The reason for the environment is |
205 environment for variables. The reason for the environment is |
208 that assignments in the While-language might change the |
206 that assignments in the While-language might change the |
209 environment---clearly if a variable is used for the first |
207 environment---clearly if a variable is used for the first |
231 the environment and assign $x$ with the largest index in $E$ |
229 the environment and assign $x$ with the largest index in $E$ |
232 plus one (that is $E' = E(x \mapsto largest\_index + 1)$). |
230 plus one (that is $E' = E(x \mapsto largest\_index + 1)$). |
233 That means for the assignment $x := x + 1$ we generate the |
231 That means for the assignment $x := x + 1$ we generate the |
234 following code |
232 following code |
235 |
233 |
236 \begin{lstlisting}[mathescape,numbers=none] |
234 \begin{lstlisting}[language=JVMIS,mathescape,numbers=none] |
237 iload $n_x$ |
235 iload $n_x$ |
238 ldc 1 |
236 ldc 1 |
239 iadd |
237 iadd |
240 istore $n_x$ |
238 istore $n_x$ |
241 \end{lstlisting} |
239 \end{lstlisting} |
242 |
240 |
243 \noindent |
241 \noindent |
244 where $n_x$ is the index for the variable $x$. |
242 where $n_x$ is the index for the variable $x$. The code for |
245 |
243 looking-up the index for the variable is as follow: |
246 More complicated is the code for \pcode{if}-statements, say |
244 |
|
245 \begin{center} |
|
246 \begin{tabular}{lcl} |
|
247 $index \;=\; \textit{if}\;(E(x).\textit{isDefind})\;\textit{then}\;E(x)\;\textit{else}\;|E|$ |
|
248 \end{tabular} |
|
249 \end{center} |
|
250 |
|
251 \noindent |
|
252 In case the environment $E$ contains an index for $x$, we return it. |
|
253 Otherwise we ``create'' a new index by returning the size $|E|$ of the |
|
254 environment (that will be an index that is guaranteed to be not used |
|
255 yet). |
|
256 |
|
257 More complicated is the generation of code for \pcode{if}-statements, say |
247 |
258 |
248 \begin{lstlisting}[mathescape,language={},numbers=none] |
259 \begin{lstlisting}[mathescape,language={},numbers=none] |
249 if $b$ then $cs_1$ else $cs_2$ |
260 if $b$ then $cs_1$ else $cs_2$ |
250 \end{lstlisting} |
261 \end{lstlisting} |
251 |
262 |
252 \noindent where $b$ is a boolean expression and the $cs_{1/2}$ |
263 \noindent where $b$ is a boolean expression and the $cs_{1/2}$ are the |
253 are the statements for each \pcode{if}-branch. Lets assume |
264 statements for each of the \pcode{if}-branches. Lets assume we already |
254 we already generated code for $b$ and $cs_{1/2}$. Then in the |
265 generated code for $b$ and $cs_{1/2}$. Then in the true-case the |
255 true-case the control-flow of the program needs to be |
266 control-flow of the program needs to be |
256 |
267 |
257 \begin{center} |
268 \begin{center} |
258 \begin{tikzpicture}[node distance=2mm and 4mm, |
269 \begin{tikzpicture}[node distance=2mm and 4mm, |
259 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, |
270 block/.style={rectangle, minimum size=1cm, draw=black, line width=1mm}, |
260 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
271 point/.style={rectangle, inner sep=0mm, minimum size=0mm, fill=red}, |
526 while x <= 10 do x := x + 1 |
537 while x <= 10 do x := x + 1 |
527 \end{lstlisting} |
538 \end{lstlisting} |
528 |
539 |
529 \noindent yielding the following code |
540 \noindent yielding the following code |
530 |
541 |
531 \begin{lstlisting}[mathescape,language={}] |
542 \begin{lstlisting}[language=JVMIS,mathescape,language={}] |
532 L_wbegin: $\quad\tikz[remember picture] \node[] (LB) {\mbox{}};$ |
543 L_wbegin: $\quad\tikz[remember picture] \node[] (LB) {\mbox{}};$ |
533 iload 0 |
544 iload 0 |
534 ldc 10 |
545 ldc 10 |
535 if_icmpgt L_wend $\quad\tikz[remember picture] \node (LC) {\mbox{}};$ |
546 if_icmpgt L_wend $\quad\tikz[remember picture] \node (LC) {\mbox{}};$ |
536 iload 0 |
547 iload 0 |