handouts/ho09.tex
changeset 677 decfd8cf8180
parent 539 ed8f014217be
child 678 ff3b48da282c
equal deleted inserted replaced
676:62d168bf7ac8 677:decfd8cf8180
       
     1 % !TEX program = xelatex
     1 \documentclass{article}
     2 \documentclass{article}
     2 \usepackage{../style}
     3 \usepackage{../style}
     3 \usepackage{../langs}
     4 \usepackage{../langs}
     4 \usepackage{../graphics}
     5 \usepackage{../graphics}
     5 \usepackage{../grammar}
     6 \usepackage{../grammar}
     6 \usepackage{multicol}
     7 %%\usepackage{multicol}
     7 
     8 
     8 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}
     9 %%\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}
     9 
    10 
    10 \begin{document}
    11 \begin{document}
    11 \fnote{\copyright{} Christian Urban, King's College London, 2014}
    12 \fnote{\copyright{} Christian Urban, King's College London, 2019}
    12 
       
    13 %% why are shuttle flights so good with software
       
    14 %%http://www.fastcompany.com/28121/they-write-right-stuff
       
    15 
       
    16 \section*{Handout 9 (Static Analysis)}
       
    17 
       
    18 If we want to improve the safety and security of our programs,
       
    19 we need a more principled approach to programming. Testing is
       
    20 good, but as Edsger Dijkstra famously wrote: 
       
    21 
       
    22 \begin{quote}\it 
       
    23 ``Program testing can be a very effective way to show the
       
    24 \underline{\smash{presence}} of bugs, but it is hopelessly
       
    25 inadequate for showing their \underline{\smash{absence}}.''
       
    26 \end{quote}
       
    27 
       
    28 \noindent While such a more principled approach has been the
       
    29 subject of intense study for a long, long time, only in the
       
    30 past few years some impressive results have been achieved. One
       
    31 is the complete formalisation and (mathematical) verification
       
    32 of a microkernel operating system called seL4.
       
    33 
       
    34 \begin{center}
       
    35 \url{http://sel4.systems}
       
    36 \end{center}
       
    37 
       
    38 \noindent In 2011 this work was included in the MIT Technology
       
    39 Review in the annual list of the world’s ten most important
       
    40 emerging
       
    41 technologies.\footnote{\url{http://www2.technologyreview.com/tr10/?year=2011}}
       
    42 While this work is impressive, its technical details are too
       
    43 enormous for an explanation here. Therefore let us look at
       
    44 something much simpler, namely finding out properties about
       
    45 programs using \emph{static analysis}.
       
    46 
       
    47 Static analysis is a technique that checks properties of a
       
    48 program without actually running the program. This should
       
    49 raise alarm bells with you---because almost all interesting
       
    50 properties about programs  are equivalent to the halting
       
    51 problem, which we know is undecidable. For example estimating
       
    52 the memory consumption of programs is in general undecidable,
       
    53 just like the halting problem. Static analysis circumvents
       
    54 this undecidability-problem by essentially allowing answers
       
    55 \emph{yes} and \emph{no}, but also \emph{don't know}. With
       
    56 this ``trick'' even the halting problem becomes
       
    57 decidable\ldots{}for example we could always say \emph{don't
       
    58 know}. Of course this would be silly. The point is that we
       
    59 should be striving for a method that answers as often as
       
    60 possible either \emph{yes} or \emph{no}---just in cases when
       
    61 it is too difficult we fall back on the
       
    62 \emph{don't-know}-answer. This might sound all like abstract
       
    63 nonsense. Therefore let us look at a concrete example.
       
    64 
    13 
    65 
    14 
    66 \subsubsection*{A Simple, Idealised Programming Language}
    15 \section*{Handout 9 (LLVM, SSA and CPS)}
    67 
    16 
    68 Our starting point is a small, idealised programming language.
    17 Reflecting on our tiny compiler targetting the JVM, code generation was
    69 It is idealised because we cut several corners in comparison
    18 actually not so hard, no? One of the main reason for this ease is that
    70 with real programming languages. The language we will study
    19 the JVM is a stack-based virtual machine and it is, for example, not
    71 contains, amongst other things, variables holding integers.
    20 hard to translate arithmetic expressions into instructions manipulating
    72 Using static analysis, we want to find out what the sign of
    21 the stack. The problem is that ``real'' CPUs, although supporting stack
    73 these integers (positive or negative) will be when the program
    22 operations, are not really \emph{stack machines}. They are just not
    74 runs. This sign-analysis seems like a very simple problem. But
    23 optimised for this way of calculating things. The design of CPUs is more
    75 even such simple problems, if approached naively, are in
    24 like, here is a piece of memory---compiler, or compiler writer, do
    76 general undecidable, just like Turing's halting problem. I let
    25 something with it. Otherwise get lost. So in the name of raw speed,
    77 you think why?
    26 modern compilers go the extra mile and generate code that is much easier
       
    27 and faster to process by CPUs. 
    78 
    28 
       
    29 Another reason why it makes sense to go the extra mile is that stack
       
    30 instructions are very difficult to optimise---you cannot just re-arrange
       
    31 instructions without messing about with what is calculated on the stack.
       
    32 Also it is hard to find out if all the calculations on the stack are
       
    33 actually necessary and not by chance dead code. The JVM has for all this
       
    34 sophisticated machinery to make such ``high-level'' code still run fast,
       
    35 but let's say that for the sake of argument we want to not rely on it.
       
    36 We want to generate fast code ourselves. This means we have to work around
       
    37 the intricacies of what instructions CPUs can actually process. To make
       
    38 this all tractable, we target the LLVM Intermediate Language. In this way
       
    39 we can take advantage of the tools coming with LLVM and for example do not
       
    40 have to worry about things like that CPUs have only a limited amount of
       
    41 registers. 
    79 
    42 
    80 Is sign-analysis of variables an interesting problem? Well,
    43 LLVM\footnote{\url{http://llvm.org}} is a beautiful example that
    81 yes---if a compiler can find out that for example a variable
    44 projects from Academia can make a difference in the world. LLVM was
    82 will never be negative and this variable is used as an index
    45 started in 2000 by two researchers at the  University of Illinois at
    83 for an array, then the compiler does not need to generate code
    46 Urbana–Champaign. At the time the behemoth of compilers was gcc with its
    84 for an underflow-check. Remember some languages are immune to
    47 myriad of front-ends for other languages (e.g.~gfortran, Ada, Go,
    85 buffer-overflow attacks, but they need to add underflow and
    48 Objective-C, Pascal etc). The problem was that gcc morphed over time
    86 overflow checks everywhere. According to John Regher, an
    49 into a monolithic gigantic piece of m\ldots ehm software, which you could
    87 expert in the field of compilers, overflow checks can cause
    50 not mess about in an afternoon. In contrast LLVM was a modular suite of
    88 5-10\% slowdown, in some languages even 100\% for tight
    51 tools with which you could play around easily and try out something new.
    89 loops.\footnote{\url{http://blog.regehr.org/archives/1154}} If
    52 LLVM became a big player once Apple hired one of the original developers
    90 the compiler can omit the underflow check, for example, then
    53 (I cannot remember the reason why Apple did not want to use gcc, but
    91 this can potentially drastically speed up the generated code. 
    54 maybe they were also just disgusted by big monolithic codebase). Anyway,
       
    55 LLVM is now the big player and gcc is more or less legacy. This does not
       
    56 mean that programming languages like C and C++ are dying out any time
       
    57 soon---they are nicely supported by LLVM.
    92 
    58 
    93 What do programs in our simple programming language look like?
    59 Targetting the LLVM-IR also means we can profit from the very modular
    94 The following grammar gives a first specification:
    60 structure of the LLVM compiler and let for example the compiler generate
       
    61 code for X86, or ARM etc. We can be agnostic about where our code
       
    62 actually runs. However, what we have to do is to generate code in
       
    63 \emph{Static Single-Assignment} format (short SSA), because that is what
       
    64 the LLVM-IR expects from us and which is the intermediate format that
       
    65 LLVM can use to do cool things (like targetting strange architectures)
       
    66 and allocating memory efficiently. 
    95 
    67 
    96 \begin{multicols}{2}
    68 The idea behind SSA is to use very simple variable assignments where
    97 \begin{plstx}[rhs style=,one per line,left margin=9mm]
    69 every variable is assigned only once. The assignments also need to be
    98 : \meta{Stmt} ::= \meta{label} \texttt{:}
    70 extremely primitive in the sense that they can be just simple operations
    99                 | \meta{var} \texttt{:=} \meta{Exp}
    71 like addition, multiplication, jumps and so on. A typical program in SSA
   100                 | \texttt{jmp?} \meta{Exp} \meta{label}
    72 is 
   101                 | \texttt{goto} \meta{label}\\
       
   102 : \meta{Prog} ::= \meta{Stmt} \ldots{} \meta{Stmt}\\
       
   103 \end{plstx}
       
   104 \columnbreak
       
   105 \begin{plstx}[rhs style=,one per line]
       
   106 : \meta{Exp} ::= \meta{Exp} \texttt{+} \meta{Exp}
       
   107                | \meta{Exp} \texttt{*} \meta{Exp}
       
   108                | \meta{Exp} \texttt{=} \meta{Exp} 
       
   109                | \meta{num}
       
   110                | \meta{var}\\
       
   111 \end{plstx}
       
   112 \end{multicols}
       
   113 
    73 
   114 \noindent I assume you are familiar with such
    74 \begin{lstlisting}[language=LLVM,numbers=none]
   115 grammars.\footnote{\url{http://en.wikipedia.org/wiki/Backus–Naur_Form}}
    75     x := 1
   116 There are three main syntactic categories: \emph{statments}
    76     y := 2     
   117 and \emph{expressions} as well as \emph{programs}, which are
    77     z := x + y
   118 sequences of statements. Statements are either labels,
       
   119 variable assignments, conditional jumps (\pcode{jmp?}) and
       
   120 unconditional jumps (\pcode{goto}). Labels are just strings,
       
   121 which can be used as the target of a jump. We assume that in
       
   122 every program the labels are unique---if there is a clash,
       
   123 then we do not know where to jump to. The conditional jumps
       
   124 and variable assignments involve (arithmetic) expressions.
       
   125 Expressions are either numbers, variables or compound
       
   126 expressions built up from \pcode{+}, \pcode{*} and \emph{=}
       
   127 (for simplicity reasons we do not consider any other
       
   128 operations). We assume we have negative and positive numbers,
       
   129 \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
       
   130 \pcode{2}\ldots{} An example program that calculates the
       
   131 factorial of 5 is in our programming language as follows:
       
   132 
       
   133 \begin{lstlisting}[language={},xleftmargin=10mm]
       
   134       a := 1
       
   135       n := 5 
       
   136 top:  
       
   137       jmp? n = 0 done 
       
   138       a := a * n 
       
   139       n := n + -1 
       
   140       goto top 
       
   141 done:
       
   142 \end{lstlisting}
    78 \end{lstlisting}
   143 
    79 
   144 \noindent As can be seen each line of the program contains a
    80 \noindent where every variable is used only once. You cannot for example have
   145 statement. In the first two lines we assign values to the
    81 
   146 variables \pcode{a} and \pcode{n}. In line 4 we test whether
    82 \begin{lstlisting}[language=LLVM,numbers=none]
   147 \pcode{n} is zero, in which case we jump to the end of the
    83     x := 1
   148 program marked with the label \pcode{done}. If \pcode{n} is
    84     y := 2     
   149 not zero, we multiply the content of \pcode{a} by \pcode{n},
    85     x := x + y
   150 decrease \pcode{n} by one and jump back to the beginning of
       
   151 the loop, marked with the label \pcode{top}. Another program
       
   152 in our language is shown in Figure~\ref{fib}. I let you think
       
   153 what it calculates.
       
   154  
       
   155 \begin{figure}[t]
       
   156 \begin{lstlisting}[numbers=none,
       
   157                    language={},xleftmargin=10mm]
       
   158       n := 6
       
   159       m1 := 0
       
   160       m2 := 1
       
   161 loop: 
       
   162       jmp? n = 0 done
       
   163       tmp := m2
       
   164       m2 := m1 + m2
       
   165       m1 := tmp
       
   166       n := n + -1
       
   167       goto top
       
   168 done:
       
   169 \end{lstlisting}
    86 \end{lstlisting}
   170 \caption{A mystery program in our idealised programming language.
       
   171 Try to find out what it calculates! \label{fib}}
       
   172 \end{figure}
       
   173 
    87 
   174 Even if our language is rather small, it is still Turing
    88 \noindent because in this snippet \texttt{x} is assigned twice. There
   175 complete---meaning quite powerful. However, discussing this
    89 are sophisticated algorithm for imperative languages, like C, for
   176 fact in more detail would lead us too far astray. Clearly, our
    90 efficiently transforming a program into SSA format, but we do not have
   177 programming is rather low-level and not very comfortable for
    91 to be concerned about those. We want to compile a functional language
   178 writing programs. It is inspired by real machine code, which
    92 and there things get much more interesting than just sophisticated. We
   179 is the code that is executed by a CPU. So a more interesting
    93 will need to have a look at CPS translations, which stands for
   180 question is what is missing in comparison with real machine
    94 Continuation-Passing-Style---basically black art. So sit tight.
   181 code? Well, not much\ldots{}in principle. Real machine code,
       
   182 of course, contains many more arithmetic instructions (not
       
   183 just addition and multiplication) and many more conditional
       
   184 jumps. We could add these to our language if we wanted, but
       
   185 complexity is really beside the point here. Furthermore, real
       
   186 machine code has many instructions for manipulating memory. We
       
   187 do not have this at all. This is actually a more serious
       
   188 simplification because we assume numbers to be arbitrary small
       
   189 or large, which is not the case with real machine code. In
       
   190 real machine code, basic number formats have a range and might
       
   191 over-flow or under-flow this range. Also the number of
       
   192 variables in our programs is potentially unlimited, while
       
   193 memory in an actual computer, of course, is always limited. To
       
   194 sum up, our language might look ridiculously simple, but it is
       
   195 not too far removed from practically relevant issues.
       
   196 
    95 
       
    96 \subsection*{CPS-Translations}
   197 
    97 
   198 \subsubsection*{An Interpreter}
    98 What is good about our simple fun language is that it basically only 
       
    99 contains expressions (be they arithmetic expressions or boolean expressions).
       
   100 The only exceptions are function definitions, for which we can easily
       
   101 use the mechanism of defining functions in LLVM-IR. For example the simple
       
   102 fun program 
   199 
   103 
   200 Designing a language is like playing god: you can say what
   104 \begin{lstlisting}[language=Scala,numbers=none]
   201 names for variables you allow; what programs should look like;
   105 def dble(x) = x * x
   202 most importantly you can decide what each part of the program
   106 \end{lstlisting}
   203 should mean and do. While our language is quite simple and the
       
   204 meaning of statements, for example, is rather straightforward,
       
   205 there are still places where we need to make real choices. For
       
   206 example consider the conditional jumps, say the one in the
       
   207 factorial program: 
       
   208 
   107 
   209 \begin{center}
   108 \noindent
   210 \code{jmp? n = 0 done}
   109 can be compiled into the following LLVM-IR function:
   211 \end{center}
       
   212 
   110 
   213 \noindent How should they work? We could introduce Booleans
   111 \begin{lstlisting}[language=LLVM]
   214 (\pcode{true} and \pcode{false}) and then jump only when the
   112 define i32 dble(i32 %x) {
   215 condition is \pcode{true}. However, since we have numbers in
   113    %tmp =  mul i32 %x, % x
   216 our language anyway, why not just encoding \pcode{true} as
   114    ret i32 %tmp
   217 one, and \pcode{false} as zero? In this way we can dispense
   115 }    
   218 with the additional concept of Booleans.
   116 \end{lstlisting}
   219 
   117 
   220 I hope the above discussion makes it already clear we need to
       
   221 be a bit more careful with our programs. Below we shall
       
   222 describe an interpreter for our programming language, which
       
   223 specifies exactly how programs are supposed to be
       
   224 run\ldots{}at least we will specify this for all \emph{good}
       
   225 programs. By good programs I mean where all variables are
       
   226 initialised, for example. Our interpreter will just crash if
       
   227 it cannot find out the value for a variable when it is not
       
   228 initialised. Also, we will assume that labels in good programs
       
   229 are unique, otherwise our programs will calculate ``garbage''.
       
   230 
       
   231 First we will pre-process our programs. This will simplify the
       
   232 definition of our interpreter later on. By pre-processing our
       
   233 programs we will transform programs into \emph{snippets}. A
       
   234 snippet is a label and all the code that comes after the
       
   235 label. This essentially means a snippet is a \emph{map} from
       
   236 labels to code.\footnote{Be sure you know what maps are. In a
       
   237 programming context they are often represented as association
       
   238 list where some data is associated with a key.} 
       
   239 
       
   240 Given that programs are sequences (or lists) of statements, we
       
   241 can easily calculate the snippets by just traversing this
       
   242 sequence and recursively generating the map. Suppose a program
       
   243 is of the general form
       
   244 
       
   245 \begin{center}
       
   246 $stmt_1\;stmt_2\; \ldots\; stmt_n$
       
   247 \end{center}
       
   248 
       
   249 \noindent The idea is to go through this sequence of
       
   250 statements one by one and check whether they are a label. If
       
   251 yes, we add the label and the remaining statements to our map.
       
   252 If no, we just continue with the next statement. To come up
       
   253 with a recursive definition for generating snippets, let us
       
   254 write $[]$ for the program that does not contain any
       
   255 statement. Consider the following definition:
       
   256 
       
   257 \begin{center}
       
   258 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   259 $\textit{snippets}([])$ & $\dn$ & $\varnothing$\\
       
   260 $\textit{snippets}(stmt\;\; rest)$ & $\dn$ &
       
   261 $\begin{cases}
       
   262    \textit{snippets}(rest)[label := rest] & \text{if}\;stmt = \textit{label:}\\
       
   263    \textit{snippets}(rest)                & \text{otherwise}
       
   264 \end{cases}$   
       
   265 \end{tabular}
       
   266 \end{center}
       
   267 
       
   268 \noindent In the first clause we just return the empty map for
       
   269 the program that does not contain any statement. In the second
       
   270 clause, we have to distinguish the case where the first
       
   271 statement is a label or not. As said before, if not, then we
       
   272 just ``throw away'' the label and recursively calculate the
       
   273 snippets for the rest of the program (the otherwise clause).
       
   274 If yes, then we do the same, but also update the map so that
       
   275 $label$ now points to the rest of the statements (the if
       
   276 clause). This looks all realtively straightforward, but there
       
   277 is one small problem we need to overcome: our two programs
       
   278 shown so far have no label as \emph{entry point}---that is
       
   279 where the execution is supposed to start. We usually assume
       
   280 that the first statement will be run first. To make this the
       
   281 default, it is convenient if we add to all our programs a
       
   282 default label, say \pcode{""} (the empty string). With this we
       
   283 can define our pre-processing of programs as follows
       
   284 
       
   285 \begin{center}
       
   286 $\textit{preproc}(prog) \dn \textit{snippets}(\pcode{"":}\;\; prog)$ 
       
   287 \end{center} 
       
   288 
       
   289 \noindent Let us see how this pans out in practice. If we
       
   290 pre-process the factorial program shown earlier, we obtain the 
       
   291 following map:
       
   292 
       
   293 \begin{center}
       
   294 \small
       
   295 \lstset{numbers=none,
       
   296         language={},
       
   297         xleftmargin=0mm,
       
   298         aboveskip=0.5mm,
       
   299         belowskip=0.5mm,
       
   300         frame=single,
       
   301         framerule=0mm,
       
   302         framesep=0mm}
       
   303 \begin{tikzpicture}[node distance=0mm]
       
   304 \node (A1) [draw]{\pcode{""}};
       
   305 \node (B1) [right=of A1] {$\mapsto$};
       
   306 \node (C1) [right=of B1,anchor=north west] {
       
   307 \begin{minipage}{3.5cm}
       
   308 \begin{lstlisting}[language={},xleftmargin=0mm]
       
   309   a := 1
       
   310   n := 5 
       
   311 top:  
       
   312   jmp? n = 0 done 
       
   313   a := a * n 
       
   314   n := n + -1 
       
   315   goto top 
       
   316 done:
       
   317 \end{lstlisting}
       
   318 \end{minipage}};
       
   319 \node (A2) [right=of C1.north east,draw] {\pcode{top}};
       
   320 \node (B2) [right=of A2] {$\mapsto$};
       
   321 \node (C2) [right=of B2, anchor=north west] {
       
   322 \begin{minipage}{3.5cm}
       
   323 \begin{lstlisting}[language={},xleftmargin=0mm]
       
   324   jmp? n = 0 done 
       
   325   a := a * n 
       
   326   n := n + -1 
       
   327   goto top 
       
   328 done:
       
   329 \end{lstlisting} 
       
   330 \end{minipage}}; 
       
   331 \node (A3) [right=of C2.north east,draw] {\pcode{done}};
       
   332 \node (B3) [right=of A3] {$\mapsto$};
       
   333 \node (C3) [right=of B3] {$[]$};
       
   334 \end{tikzpicture}
       
   335 \end{center}
       
   336 
       
   337 \noindent I highlighted the \emph{keys} in this map. Since
       
   338 there are three labels in the factorial program (remember we
       
   339 added \pcode{""}), there are three keys. When running the
       
   340 factorial program and encountering a jump, then we only have
       
   341 to consult this snippets-map in order to find out what the
       
   342 next statements should be.
       
   343 
       
   344 We should now be in the position to define how a program
       
   345 should be run. In the context of interpreters, this
       
   346 ``running'' of programs is often called \emph{evaluation}. Let
       
   347 us start with the definition of how expressions are evaluated.
       
   348 A first attempt might be the following recursive function:
       
   349 
       
   350 \begin{center}
       
   351 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   352 $\textit{eval\_exp}(\texttt{n})$ & $\dn$ & $n$
       
   353   \qquad\text{if}\; \texttt{n}\; \text{is a number like} 
       
   354   \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
       
   355 \pcode{2}\ldots{}\\
       
   356 $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, 
       
   357                     \texttt{e}_\texttt{2})$ & 
       
   358   $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) + 
       
   359            \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\
       
   360 $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, 
       
   361                     \texttt{e}_\texttt{2})$ & 
       
   362   $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) * 
       
   363            \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\
       
   364 $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, 
       
   365                     \texttt{e}_\texttt{2})$ & 
       
   366   $\dn$ & $\begin{cases}
       
   367   1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}) =
       
   368                  \textit{eval\_exp}(\texttt{e}_\texttt{2})\\
       
   369   0 & \text{otherwise}
       
   370   \end{cases}$          
       
   371 \end{tabular}
       
   372 \end{center}
       
   373 
       
   374 \noindent While this should look all rather intuitive`, still
       
   375 be very careful. There is a subtlety which can be easily
       
   376 overlooked: The function \textit{eval\_exp} takes an
       
   377 expression of our programming language as input and returns a
       
   378 number as output. Therefore whenever we encounter a number in
       
   379 our program, we just return this number---this is defined in
       
   380 the first clause above. Whenever we encounter an addition,
       
   381 well then we first evaluate the left-hand side
       
   382 $\texttt{e}_\texttt{1}$ of the addition (this will give a
       
   383 number), then evaluate the right-hand side
       
   384 $\texttt{e}_\texttt{2}$ (this gives another number), and
       
   385 finally add both numbers together. Here is the subtlety: on
       
   386 the left-hand side of the $\dn$ we have a \texttt{+} (in the
       
   387 teletype font) which is the symbol for addition in our
       
   388 programming language. On the right-hand side we have $+$ which
       
   389 stands for the arithmetic operation from ``mathematics'' of
       
   390 adding two numbers. These are rather different concepts---one
       
   391 is a symbol (which we made up), and the other a mathematical
       
   392 operation. When we will have a look at an actual
       
   393 implementation of our interpreter, the mathematical operation
       
   394 will be the function for addition from the programming
       
   395 language in which we \underline{\smash{implement}} our
       
   396 interpreter. While the \texttt{+} is just a symbol that is
       
   397 used in our programming language. Clearly we have to use a
       
   398 symbol that is a good mnemonic for addition, otherwise we will
       
   399 confuse the programmers working with our language. Therefore
       
   400 we use $\texttt{+}$. A similar choice is made for times in the
       
   401 third clause and equality in the fourth clause. 
       
   402 
       
   403 Remember I wrote at the beginning of this section about being
       
   404 god when designing a programming language. You can see this
       
   405 here: we need to give meaning to symbols. At the moment
       
   406 however, we are a poor fallible god. Look again at the grammar
       
   407 of our programming language and our definition. Clearly, an
       
   408 expression can contain variables. So far we have ignored them.
       
   409 What should our interpreter do with variables? They might
       
   410 change during the evaluation of a program. For example the
       
   411 variable \pcode{n} in the factorial program counts down from 5
       
   412 down to 0. How can we improve our definition above to give also
       
   413 an answer whenever our interpreter encounters a variable in an
       
   414 expression? The solution is to add an \emph{environment},
       
   415 written $env$, as an additional input argument to our
       
   416 \textit{eval\_exp} function.
       
   417  
       
   418 \begin{center}
       
   419 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   420 $\textit{eval\_exp}(\texttt{n}, env)$ & $\dn$ & $n$
       
   421   \qquad\text{if}\; \texttt{n}\; \text{is a number like} 
       
   422   \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
       
   423 \pcode{2}\ldots{}\\
       
   424 $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, 
       
   425                     \texttt{e}_\texttt{2}, env)$ & 
       
   426   $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) + 
       
   427            \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\
       
   428 $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, 
       
   429                     \texttt{e}_\texttt{2}, env)$ & 
       
   430   $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) * 
       
   431            \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\
       
   432 $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, 
       
   433                     \texttt{e}_\texttt{2}, env)$ & 
       
   434   $\dn$ & $\begin{cases}
       
   435   1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) =
       
   436                  \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)\\
       
   437   0 & \text{otherwise}
       
   438   \end{cases}$\\
       
   439 $\textit{eval\_exp}(\texttt{x}, env)$ & $\dn$ & $env(x)$      
       
   440 \end{tabular}
       
   441 \end{center}
       
   442  
       
   443 \noindent This environment $env$ also acts like a map: it
       
   444 associates variables with their current values. For example
       
   445 after evaluating the first two lines in our factorial
       
   446 program, such an environment might look as follows
       
   447 
       
   448 \begin{center}
       
   449 \begin{tabular}{ll}
       
   450 $\fbox{\texttt{a}} \mapsto \texttt{1}$ &
       
   451 $\fbox{\texttt{n}} \mapsto \texttt{5}$
       
   452 \end{tabular}
       
   453 \end{center}
       
   454 
       
   455 \noindent Again I highlighted the keys. In the clause for
       
   456 variables, we can therefore consult this environment and
       
   457 return whatever value is currently stored for this variable.
       
   458 This is written $env(x)$---meaning we query this map with $x$
       
   459 we obtain the corresponding number. You might ask what happens
       
   460 if an environment does not contain any value for, say, the
       
   461 variable $x$? Well, then our interpreter just ``crashes'', or
       
   462 more precisely will raise an exception. In this case we have a
       
   463 ``bad'' program that tried to use a variable before it was
       
   464 initialised. The programmer should not have done this. In a
       
   465 real programming language, we would of course try a bit harder
       
   466 and for example give an error at compile time, or design our
       
   467 language in such a way that this can never happen. With the
       
   468 second version of \textit{eval\_exp} we completed our
       
   469 definition for evaluating expressions.
       
   470  
       
   471 Next comes the evaluation function for statements. We define
       
   472 this function in such a way that we recursively evaluate a
       
   473 whole sequence of statements. Assume a program $p$ (you want
       
   474 to evaluate) and its pre-processed snippets $sn$. Then we can
       
   475 define:
       
   476 
       
   477 \begin{center}
       
   478 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
   479 $\textit{eval\_stmts}([], env)$ & $\dn$ & $env$\\
       
   480 $\textit{eval\_stmts}(\texttt{label:}\;rest, env)$ &
       
   481   $\dn$ & $\textit{eval\_stmts}(rest, env)$ \\ 
       
   482 $\textit{eval\_stmts}(\texttt{x\,:=\,e}\;rest, env)$ & 
       
   483   $\dn$ & $\textit{eval\_stmts}(rest, 
       
   484            env[x := \textit{eval\_exp}(\texttt{e}, env)])$\\  
       
   485 $\textit{eval\_stmts}(\texttt{goto\,lbl}\;rest, env)$ 
       
   486  & $\dn$ & $\textit{eval\_stmts}(sn(\texttt{lbl}), env)$\\
       
   487 $\textit{eval\_stmts}(\texttt{jmp?\,e\,lbl}\;rest, env)$ 
       
   488  & $\dn$ & $\begin{cases}\begin{array}{@{}l@{\hspace{-12mm}}r@{}}
       
   489  \textit{eval\_stmts}(sn(\texttt{lbl}), env)\\ 
       
   490  & \text{if}\;\textit{eval\_exp}(\texttt{e}, env) = 1\\
       
   491  \textit{eval\_stmts}(rest, env) & \text{otherwise}\\
       
   492  \end{array}\end{cases}$  
       
   493 \end{tabular}
       
   494 \end{center}
       
   495 
       
   496 \noindent The first clause is for the empty program, or when
       
   497 we arrived at the end of the program. In this case we just
       
   498 return the environment. The second clause is for when the next
       
   499 statement is a label. That means the program is of the form
       
   500 $\texttt{label:}\;rest$ where the label is some string and
       
   501 $rest$ stands for all following statements. This case is easy,
       
   502 because our evaluation function just discards the label and
       
   503 evaluates the rest of the statements (we already extracted all
       
   504 important information about labels when we pre-processed our
       
   505 programs and generated the snippets). The third clause is for
       
   506 variable assignments. Again we just evaluate the rest for the
       
   507 statements, but with a modified environment---since the
       
   508 variable assignment is supposed to introduce a new variable or
       
   509 change the current value of a variable. For this modification
       
   510 of the environment we first evaluate the expression
       
   511 $\texttt{e}$ using our evaluation function for expressions.
       
   512 This gives us a number. Then we assign this number to the
       
   513 variable $x$ in the environment. This modified environment
       
   514 will be used to evaluate the rest of the program. The fourth
       
   515 clause is for the unconditional jump to a label, called
       
   516 \texttt{lbl}. That means we have to look up in our snippets
       
   517 map $sn$ what are the next statements for this label.
       
   518 Therefore we will continue with evaluating, not with the rest
       
   519 of the program, but with the statements stored in the
       
   520 snippets-map under the label $\texttt{lbl}$. The fifth clause
       
   521 for conditional jumps is similar, but to decide whether to
       
   522 make the jump we first need to evaluate the expression
       
   523 $\texttt{e}$ in order to find out whether it is $1$. If yes,
       
   524 we jump, otherwise we just continue with evaluating the rest
       
   525 of the program.
       
   526 
       
   527 Our interpreter works in two stages: First we pre-process our
       
   528 program generating the snippets map $sn$, say. Second we call
       
   529 the evaluation function with the default entry point and the
       
   530 empty environment:
       
   531 
       
   532 \begin{center}
       
   533 $\textit{eval\_stmts}(sn(\pcode{""}), \varnothing)$
       
   534 \end{center}
       
   535  
       
   536 \noindent It is interesting to note that our interpreter when
       
   537 it comes to the end of the program returns an environment. Our
       
   538 programming language does not contain any constructs for input
       
   539 and output. Therefore this environment is the only effect we
       
   540 can observe when running the program (apart from that our
       
   541 interpreter might need some time before finishing the
       
   542 evaluation of the program and the CPU getting hot). Evaluating 
       
   543 the factorial program with our interpreter we receive as
       
   544 ``answer''-environment
       
   545 
       
   546 \begin{center}
       
   547 \begin{tabular}{ll}
       
   548 $\fbox{\texttt{a}} \mapsto \texttt{120}$ &
       
   549 $\fbox{\texttt{n}} \mapsto \texttt{0}$
       
   550 \end{tabular}
       
   551 \end{center}
       
   552 
       
   553 \noindent While the discussion above should have illustrated
       
   554 the ideas, in order to do some serious calculations, we clearly
       
   555 need to implement the interpreter.
       
   556 
       
   557 
       
   558 \subsubsection*{Scala Code for the Interpreter}
       
   559 
       
   560 Functional programming languages are very convenient for
       
   561 implementations of interpreters. A good choice for a
       
   562 functional programming language is Scala, a programming
       
   563 language that combines functional and object-oriented
       
   564 pro\-gramming-styles. It has received in the last five years or
       
   565 so quite a bit of attention. One reason for this attention is
       
   566 that, like the Java programming language, Scala compiles to
       
   567 the Java Virtual Machine (JVM) and therefore Scala programs
       
   568 can run under MacOSX, Linux and Windows.\footnote{There are
       
   569 also experimental backends for Android and JavaScript.} Unlike
       
   570 Java, however, Scala often allows programmers to write very
       
   571 concise and elegant code. Some therefore say Scala is the much
       
   572 better Java. A number of companies, The Guardian, Twitter,
       
   573 Coursera, FourSquare, LinkedIn to name a few, either use Scala
       
   574 exclusively in production code, or at least to some
       
   575 substantial degree. If you want to try out Scala yourself, the
       
   576 Scala compiler can be downloaded from
       
   577 
       
   578 \begin{quote}
       
   579 \url{http://www.scala-lang.org}
       
   580 \end{quote}
       
   581 
       
   582 Let us have a look at the Scala code shown in
       
   583 Figure~\ref{code}. It shows the entire code for the
       
   584 interpreter, though the implementation is admittedly no
       
   585 frills.
       
   586 
       
   587 \begin{figure}[t]
       
   588 \small
       
   589 \lstinputlisting[language=Scala]{../progs/inter.scala}
       
   590 \caption{The entire code of the interpreter for our
       
   591 idealised programming language.\label{code}}
       
   592 \end{figure}
       
   593 
       
   594 
       
   595 \subsubsection*{Static Analysis}
       
   596 
       
   597 Finally we can come back to our original problem, namely 
       
   598 finding out what the signs of variables are 
       
   599 
       
   600 \begin{center}
       
   601 
       
   602 
       
   603 \end{center}
       
   604  
   118  
   605 \end{document}
   119 \end{document}
   606 
   120 
   607 %% list of static analysers for C
       
   608 http://spinroot.com/static/index.html
       
   609 
       
   610 %% NASA coding rules for C
       
   611 http://pixelscommander.com/wp-content/uploads/2014/12/P10.pdf
       
   612 
   121 
   613 %%% Local Variables: 
   122 %%% Local Variables: 
   614 %%% mode: latex
   123 %%% mode: latex
   615 %%% TeX-master: t
   124 %%% TeX-master: t
   616 %%% End: 
   125 %%% End: