| 675 |      1 | % !TEX program = xelatex
 | 
| 539 |      2 | \documentclass{article}
 | 
|  |      3 | \usepackage{../style}
 | 
|  |      4 | \usepackage{../langs}
 | 
|  |      5 | \usepackage{../graphics}
 | 
|  |      6 | \usepackage{../grammar}
 | 
|  |      7 | \usepackage{multicol}
 | 
|  |      8 | 
 | 
| 675 |      9 | %%\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}
 | 
| 539 |     10 | 
 | 
|  |     11 | \begin{document}
 | 
|  |     12 | \fnote{\copyright{} Christian Urban, King's College London, 2014}
 | 
|  |     13 | 
 | 
|  |     14 | %% why are shuttle flights so good with software
 | 
|  |     15 | %%http://www.fastcompany.com/28121/they-write-right-stuff
 | 
|  |     16 | 
 | 
| 675 |     17 | \section*{Handout 10 (Static Analysis)}
 | 
| 539 |     18 | 
 | 
|  |     19 | If we want to improve the safety and security of our programs,
 | 
|  |     20 | we need a more principled approach to programming. Testing is
 | 
|  |     21 | good, but as Edsger Dijkstra famously wrote: 
 | 
|  |     22 | 
 | 
|  |     23 | \begin{quote}\it 
 | 
|  |     24 | ``Program testing can be a very effective way to show the
 | 
|  |     25 | \underline{\smash{presence}} of bugs, but it is hopelessly
 | 
|  |     26 | inadequate for showing their \underline{\smash{absence}}.''
 | 
|  |     27 | \end{quote}
 | 
|  |     28 | 
 | 
|  |     29 | \noindent While such a more principled approach has been the
 | 
|  |     30 | subject of intense study for a long, long time, only in the
 | 
|  |     31 | past few years some impressive results have been achieved. One
 | 
|  |     32 | is the complete formalisation and (mathematical) verification
 | 
|  |     33 | of a microkernel operating system called seL4.
 | 
|  |     34 | 
 | 
|  |     35 | \begin{center}
 | 
|  |     36 | \url{http://sel4.systems}
 | 
|  |     37 | \end{center}
 | 
|  |     38 | 
 | 
|  |     39 | \noindent In 2011 this work was included in the MIT Technology
 | 
|  |     40 | Review in the annual list of the world’s ten most important
 | 
|  |     41 | emerging
 | 
|  |     42 | technologies.\footnote{\url{http://www2.technologyreview.com/tr10/?year=2011}}
 | 
|  |     43 | While this work is impressive, its technical details are too
 | 
|  |     44 | enormous for an explanation here. Therefore let us look at
 | 
|  |     45 | something much simpler, namely finding out properties about
 | 
|  |     46 | programs using \emph{static analysis}.
 | 
|  |     47 | 
 | 
|  |     48 | Static analysis is a technique that checks properties of a
 | 
|  |     49 | program without actually running the program. This should
 | 
|  |     50 | raise alarm bells with you---because almost all interesting
 | 
|  |     51 | properties about programs  are equivalent to the halting
 | 
|  |     52 | problem, which we know is undecidable. For example estimating
 | 
|  |     53 | the memory consumption of programs is in general undecidable,
 | 
|  |     54 | just like the halting problem. Static analysis circumvents
 | 
|  |     55 | this undecidability-problem by essentially allowing answers
 | 
|  |     56 | \emph{yes} and \emph{no}, but also \emph{don't know}. With
 | 
|  |     57 | this ``trick'' even the halting problem becomes
 | 
|  |     58 | decidable\ldots{}for example we could always say \emph{don't
 | 
|  |     59 | know}. Of course this would be silly. The point is that we
 | 
|  |     60 | should be striving for a method that answers as often as
 | 
|  |     61 | possible either \emph{yes} or \emph{no}---just in cases when
 | 
|  |     62 | it is too difficult we fall back on the
 | 
|  |     63 | \emph{don't-know}-answer. This might sound all like abstract
 | 
|  |     64 | nonsense. Therefore let us look at a concrete example.
 | 
|  |     65 | 
 | 
|  |     66 | 
 | 
|  |     67 | \subsubsection*{A Simple, Idealised Programming Language}
 | 
|  |     68 | 
 | 
|  |     69 | Our starting point is a small, idealised programming language.
 | 
|  |     70 | It is idealised because we cut several corners in comparison
 | 
|  |     71 | with real programming languages. The language we will study
 | 
|  |     72 | contains, amongst other things, variables holding integers.
 | 
|  |     73 | Using static analysis, we want to find out what the sign of
 | 
|  |     74 | these integers (positive or negative) will be when the program
 | 
|  |     75 | runs. This sign-analysis seems like a very simple problem. But
 | 
|  |     76 | even such simple problems, if approached naively, are in
 | 
|  |     77 | general undecidable, just like Turing's halting problem. I let
 | 
|  |     78 | you think why?
 | 
|  |     79 | 
 | 
|  |     80 | 
 | 
|  |     81 | Is sign-analysis of variables an interesting problem? Well,
 | 
|  |     82 | yes---if a compiler can find out that for example a variable
 | 
|  |     83 | will never be negative and this variable is used as an index
 | 
|  |     84 | for an array, then the compiler does not need to generate code
 | 
|  |     85 | for an underflow-check. Remember some languages are immune to
 | 
|  |     86 | buffer-overflow attacks, but they need to add underflow and
 | 
|  |     87 | overflow checks everywhere. According to John Regher, an
 | 
|  |     88 | expert in the field of compilers, overflow checks can cause
 | 
|  |     89 | 5-10\% slowdown, in some languages even 100\% for tight
 | 
|  |     90 | loops.\footnote{\url{http://blog.regehr.org/archives/1154}} If
 | 
|  |     91 | the compiler can omit the underflow check, for example, then
 | 
|  |     92 | this can potentially drastically speed up the generated code. 
 | 
|  |     93 | 
 | 
|  |     94 | What do programs in our simple programming language look like?
 | 
|  |     95 | The following grammar gives a first specification:
 | 
|  |     96 | 
 | 
|  |     97 | \begin{multicols}{2}
 | 
|  |     98 | \begin{plstx}[rhs style=,one per line,left margin=9mm]
 | 
|  |     99 | : \meta{Stmt} ::= \meta{label} \texttt{:}
 | 
|  |    100 |                 | \meta{var} \texttt{:=} \meta{Exp}
 | 
|  |    101 |                 | \texttt{jmp?} \meta{Exp} \meta{label}
 | 
|  |    102 |                 | \texttt{goto} \meta{label}\\
 | 
|  |    103 | : \meta{Prog} ::= \meta{Stmt} \ldots{} \meta{Stmt}\\
 | 
|  |    104 | \end{plstx}
 | 
|  |    105 | \columnbreak
 | 
|  |    106 | \begin{plstx}[rhs style=,one per line]
: \meta{Exp} ::= \meta{Exp} \texttt{+} \meta{Exp}
 | 
|  |    107 |                | \meta{Exp} \texttt{*} \meta{Exp}
 | 
|  |    108 |                | \meta{Exp} \texttt{=} \meta{Exp} 
 | 
|  |    109 |                | \meta{num}
 | 
|  |    110 |                | \meta{var}\\
 | 
|  |    111 | \end{plstx}
 | 
|  |    112 | \end{multicols}
 | 
|  |    113 | 
 | 
|  |    114 | \noindent I assume you are familiar with such
 | 
|  |    115 | grammars.\footnote{\url{http://en.wikipedia.org/wiki/Backus–Naur_Form}}
 | 
|  |    116 | There are three main syntactic categories: \emph{statments}
 | 
|  |    117 | and \emph{expressions} as well as \emph{programs}, which are
 | 
|  |    118 | sequences of statements. Statements are either labels,
 | 
|  |    119 | variable assignments, conditional jumps (\pcode{jmp?}) and
 | 
|  |    120 | unconditional jumps (\pcode{goto}). Labels are just strings,
 | 
|  |    121 | which can be used as the target of a jump. We assume that in
 | 
|  |    122 | every program the labels are unique---if there is a clash,
 | 
|  |    123 | then we do not know where to jump to. The conditional jumps
 | 
|  |    124 | and variable assignments involve (arithmetic) expressions.
 | 
|  |    125 | Expressions are either numbers, variables or compound
 | 
|  |    126 | expressions built up from \pcode{+}, \pcode{*} and \emph{=}
 | 
|  |    127 | (for simplicity reasons we do not consider any other
 | 
|  |    128 | operations). We assume we have negative and positive numbers,
 | 
|  |    129 | \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
 | 
|  |    130 | \pcode{2}\ldots{} An example program that calculates the
 | 
|  |    131 | factorial of 5 is in our programming language as follows:
 | 
|  |    132 | 
 | 
|  |    133 | \begin{lstlisting}[language={},xleftmargin=10mm]
 | 
|  |    134 |       a := 1
 | 
|  |    135 |       n := 5 
 | 
|  |    136 | top:  
 | 
|  |    137 |       jmp? n = 0 done 
 | 
|  |    138 |       a := a * n 
 | 
|  |    139 |       n := n + -1 
 | 
|  |    140 |       goto top 
 | 
|  |    141 | done:
 | 
|  |    142 | \end{lstlisting}
 | 
|  |    143 | 
 | 
|  |    144 | \noindent As can be seen each line of the program contains a
 | 
|  |    145 | statement. In the first two lines we assign values to the
 | 
|  |    146 | variables \pcode{a} and \pcode{n}. In line 4 we test whether
 | 
|  |    147 | \pcode{n} is zero, in which case we jump to the end of the
 | 
|  |    148 | program marked with the label \pcode{done}. If \pcode{n} is
 | 
|  |    149 | not zero, we multiply the content of \pcode{a} by \pcode{n},
 | 
|  |    150 | decrease \pcode{n} by one and jump back to the beginning of
 | 
|  |    151 | the loop, marked with the label \pcode{top}. Another program
 | 
|  |    152 | in our language is shown in Figure~\ref{fib}. I let you think
 | 
|  |    153 | what it calculates.
 | 
|  |    154 |  
 | 
|  |    155 | \begin{figure}[t]
 | 
|  |    156 | \begin{lstlisting}[numbers=none,
 | 
|  |    157 |                    language={},xleftmargin=10mm]
 | 
|  |    158 |       n := 6
 | 
|  |    159 |       m1 := 0
 | 
|  |    160 |       m2 := 1
 | 
|  |    161 | loop: 
 | 
|  |    162 |       jmp? n = 0 done
 | 
|  |    163 |       tmp := m2
 | 
|  |    164 |       m2 := m1 + m2
 | 
|  |    165 |       m1 := tmp
 | 
|  |    166 |       n := n + -1
 | 
|  |    167 |       goto top
 | 
|  |    168 | done:
 | 
|  |    169 | \end{lstlisting}
 | 
|  |    170 | \caption{A mystery program in our idealised programming language.
 | 
|  |    171 | Try to find out what it calculates! \label{fib}}
 | 
|  |    172 | \end{figure}
 | 
|  |    173 | 
 | 
|  |    174 | Even if our language is rather small, it is still Turing
 | 
|  |    175 | complete---meaning quite powerful. However, discussing this
 | 
|  |    176 | fact in more detail would lead us too far astray. Clearly, our
 | 
|  |    177 | programming is rather low-level and not very comfortable for
 | 
|  |    178 | writing programs. It is inspired by real machine code, which
 | 
|  |    179 | is the code that is executed by a CPU. So a more interesting
 | 
|  |    180 | question is what is missing in comparison with real machine
 | 
|  |    181 | code? Well, not much\ldots{}in principle. Real machine code,
 | 
|  |    182 | of course, contains many more arithmetic instructions (not
 | 
|  |    183 | just addition and multiplication) and many more conditional
 | 
|  |    184 | jumps. We could add these to our language if we wanted, but
 | 
|  |    185 | complexity is really beside the point here. Furthermore, real
 | 
|  |    186 | machine code has many instructions for manipulating memory. We
 | 
|  |    187 | do not have this at all. This is actually a more serious
 | 
|  |    188 | simplification because we assume numbers to be arbitrary small
 | 
|  |    189 | or large, which is not the case with real machine code. In
 | 
|  |    190 | real machine code, basic number formats have a range and might
 | 
|  |    191 | over-flow or under-flow this range. Also the number of
 | 
|  |    192 | variables in our programs is potentially unlimited, while
 | 
|  |    193 | memory in an actual computer, of course, is always limited. To
 | 
|  |    194 | sum up, our language might look ridiculously simple, but it is
 | 
|  |    195 | not too far removed from practically relevant issues.
 | 
|  |    196 | 
 | 
|  |    197 | 
 | 
|  |    198 | \subsubsection*{An Interpreter}
 | 
|  |    199 | 
 | 
|  |    200 | Designing a language is like playing god: you can say what
 | 
|  |    201 | names for variables you allow; what programs should look like;
 | 
|  |    202 | most importantly you can decide what each part of the program
 | 
|  |    203 | should mean and do. While our language is quite simple and the
 | 
|  |    204 | meaning of statements, for example, is rather straightforward,
 | 
|  |    205 | there are still places where we need to make real choices. For
 | 
|  |    206 | example consider the conditional jumps, say the one in the
 | 
|  |    207 | factorial program: 
 | 
|  |    208 | 
 | 
|  |    209 | \begin{center}
 | 
|  |    210 | \code{jmp? n = 0 done}
 | 
|  |    211 | \end{center}
 | 
|  |    212 | 
 | 
|  |    213 | \noindent How should they work? We could introduce Booleans
 | 
|  |    214 | (\pcode{true} and \pcode{false}) and then jump only when the
 | 
|  |    215 | condition is \pcode{true}. However, since we have numbers in
 | 
|  |    216 | our language anyway, why not just encoding \pcode{true} as
 | 
|  |    217 | one, and \pcode{false} as zero? In this way we can dispense
 | 
|  |    218 | with the additional concept of Booleans.
 | 
|  |    219 | 
 | 
|  |    220 | I hope the above discussion makes it already clear we need to
 | 
|  |    221 | be a bit more careful with our programs. Below we shall
 | 
|  |    222 | describe an interpreter for our programming language, which
 | 
|  |    223 | specifies exactly how programs are supposed to be
 | 
|  |    224 | run\ldots{}at least we will specify this for all \emph{good}
 | 
|  |    225 | programs. By good programs I mean where all variables are
 | 
|  |    226 | initialised, for example. Our interpreter will just crash if
 | 
|  |    227 | it cannot find out the value for a variable when it is not
 | 
|  |    228 | initialised. Also, we will assume that labels in good programs
 | 
|  |    229 | are unique, otherwise our programs will calculate ``garbage''.
 | 
|  |    230 | 
 | 
|  |    231 | First we will pre-process our programs. This will simplify the
 | 
|  |    232 | definition of our interpreter later on. By pre-processing our
 | 
|  |    233 | programs we will transform programs into \emph{snippets}. A
 | 
|  |    234 | snippet is a label and all the code that comes after the
 | 
|  |    235 | label. This essentially means a snippet is a \emph{map} from
 | 
|  |    236 | labels to code.\footnote{Be sure you know what maps are. In a
 | 
|  |    237 | programming context they are often represented as association
 | 
|  |    238 | list where some data is associated with a key.} 
 | 
|  |    239 | 
 | 
|  |    240 | Given that programs are sequences (or lists) of statements, we
 | 
|  |    241 | can easily calculate the snippets by just traversing this
 | 
|  |    242 | sequence and recursively generating the map. Suppose a program
 | 
|  |    243 | is of the general form
 | 
|  |    244 | 
 | 
|  |    245 | \begin{center}
 | 
|  |    246 | $stmt_1\;stmt_2\; \ldots\; stmt_n$
 | 
|  |    247 | \end{center}
 | 
|  |    248 | 
 | 
|  |    249 | \noindent The idea is to go through this sequence of
 | 
|  |    250 | statements one by one and check whether they are a label. If
 | 
|  |    251 | yes, we add the label and the remaining statements to our map.
 | 
|  |    252 | If no, we just continue with the next statement. To come up
 | 
|  |    253 | with a recursive definition for generating snippets, let us
 | 
|  |    254 | write $[]$ for the program that does not contain any
 | 
|  |    255 | statement. Consider the following definition:
 | 
|  |    256 | 
 | 
|  |    257 | \begin{center}
 | 
|  |    258 | \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
|  |    259 | $\textit{snippets}([])$ & $\dn$ & $\varnothing$\\
 | 
|  |    260 | $\textit{snippets}(stmt\;\; rest)$ & $\dn$ &
 | 
|  |    261 | $\begin{cases}
 | 
|  |    262 |    \textit{snippets}(rest)[label := rest] & \text{if}\;stmt = \textit{label:}\\
 | 
|  |    263 |    \textit{snippets}(rest)                & \text{otherwise}
 | 
|  |    264 | \end{cases}$   
 | 
|  |    265 | \end{tabular}
 | 
|  |    266 | \end{center}
 | 
|  |    267 | 
 | 
|  |    268 | \noindent In the first clause we just return the empty map for
 | 
|  |    269 | the program that does not contain any statement. In the second
 | 
|  |    270 | clause, we have to distinguish the case where the first
 | 
|  |    271 | statement is a label or not. As said before, if not, then we
 | 
|  |    272 | just ``throw away'' the label and recursively calculate the
 | 
|  |    273 | snippets for the rest of the program (the otherwise clause).
 | 
|  |    274 | If yes, then we do the same, but also update the map so that
 | 
|  |    275 | $label$ now points to the rest of the statements (the if
 | 
|  |    276 | clause). This looks all realtively straightforward, but there
 | 
|  |    277 | is one small problem we need to overcome: our two programs
 | 
|  |    278 | shown so far have no label as \emph{entry point}---that is
 | 
|  |    279 | where the execution is supposed to start. We usually assume
 | 
|  |    280 | that the first statement will be run first. To make this the
 | 
|  |    281 | default, it is convenient if we add to all our programs a
 | 
|  |    282 | default label, say \pcode{""} (the empty string). With this we
 | 
|  |    283 | can define our pre-processing of programs as follows
 | 
|  |    284 | 
 | 
|  |    285 | \begin{center}
 | 
|  |    286 | $\textit{preproc}(prog) \dn \textit{snippets}(\pcode{"":}\;\; prog)$ 
 | 
|  |    287 | \end{center} 
 | 
|  |    288 | 
 | 
|  |    289 | \noindent Let us see how this pans out in practice. If we
 | 
|  |    290 | pre-process the factorial program shown earlier, we obtain the 
 | 
|  |    291 | following map:
 | 
|  |    292 | 
 | 
|  |    293 | \begin{center}
 | 
|  |    294 | \small
 | 
|  |    295 | \lstset{numbers=none,
 | 
|  |    296 |         language={},
 | 
|  |    297 |         xleftmargin=0mm,
 | 
|  |    298 |         aboveskip=0.5mm,
 | 
|  |    299 |         belowskip=0.5mm,
 | 
|  |    300 |         frame=single,
 | 
|  |    301 |         framerule=0mm,
 | 
|  |    302 |         framesep=0mm}
 | 
|  |    303 | \begin{tikzpicture}[node distance=0mm]
 | 
|  |    304 | \node (A1) [draw]{\pcode{""}};
 | 
|  |    305 | \node (B1) [right=of A1] {$\mapsto$};
 | 
|  |    306 | \node (C1) [right=of B1,anchor=north west] {
 | 
|  |    307 | \begin{minipage}{3.5cm}
 | 
|  |    308 | \begin{lstlisting}[language={},xleftmargin=0mm]
 | 
|  |    309 |   a := 1
 | 
|  |    310 |   n := 5 
 | 
|  |    311 | top:  
 | 
|  |    312 |   jmp? n = 0 done 
 | 
|  |    313 |   a := a * n 
 | 
|  |    314 |   n := n + -1 
 | 
|  |    315 |   goto top 
 | 
|  |    316 | done:
 | 
|  |    317 | \end{lstlisting}
 | 
|  |    318 | \end{minipage}};
 | 
|  |    319 | \node (A2) [right=of C1.north east,draw] {\pcode{top}};
 | 
|  |    320 | \node (B2) [right=of A2] {$\mapsto$};
 | 
|  |    321 | \node (C2) [right=of B2, anchor=north west] {
 | 
|  |    322 | \begin{minipage}{3.5cm}
 | 
|  |    323 | \begin{lstlisting}[language={},xleftmargin=0mm]
 | 
|  |    324 |   jmp? n = 0 done 
 | 
|  |    325 |   a := a * n 
 | 
|  |    326 |   n := n + -1 
 | 
|  |    327 |   goto top 
 | 
|  |    328 | done:
 | 
|  |    329 | \end{lstlisting} 
 | 
|  |    330 | \end{minipage}}; 
 | 
|  |    331 | \node (A3) [right=of C2.north east,draw] {\pcode{done}};
 | 
|  |    332 | \node (B3) [right=of A3] {$\mapsto$};
 | 
|  |    333 | \node (C3) [right=of B3] {$[]$};
 | 
|  |    334 | \end{tikzpicture}
 | 
|  |    335 | \end{center}
 | 
|  |    336 | 
 | 
|  |    337 | \noindent I highlighted the \emph{keys} in this map. Since
 | 
|  |    338 | there are three labels in the factorial program (remember we
 | 
|  |    339 | added \pcode{""}), there are three keys. When running the
 | 
|  |    340 | factorial program and encountering a jump, then we only have
 | 
|  |    341 | to consult this snippets-map in order to find out what the
 | 
|  |    342 | next statements should be.
 | 
|  |    343 | 
 | 
|  |    344 | We should now be in the position to define how a program
 | 
|  |    345 | should be run. In the context of interpreters, this
 | 
|  |    346 | ``running'' of programs is often called \emph{evaluation}. Let
 | 
|  |    347 | us start with the definition of how expressions are evaluated.
 | 
|  |    348 | A first attempt might be the following recursive function:
 | 
|  |    349 | 
 | 
|  |    350 | \begin{center}
 | 
|  |    351 | \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
|  |    352 | $\textit{eval\_exp}(\texttt{n})$ & $\dn$ & $n$
 | 
|  |    353 |   \qquad\text{if}\; \texttt{n}\; \text{is a number like} 
 | 
|  |    354 |   \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
 | 
|  |    355 | \pcode{2}\ldots{}\\
 | 
|  |    356 | $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, 
 | 
|  |    357 |                     \texttt{e}_\texttt{2})$ & 
 | 
|  |    358 |   $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) + 
 | 
|  |    359 |            \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\
 | 
|  |    360 | $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, 
 | 
|  |    361 |                     \texttt{e}_\texttt{2})$ & 
 | 
|  |    362 |   $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) * 
 | 
|  |    363 |            \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\
 | 
|  |    364 | $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, 
 | 
|  |    365 |                     \texttt{e}_\texttt{2})$ & 
 | 
|  |    366 |   $\dn$ & $\begin{cases}
 | 
|  |    367 |   1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}) =
 | 
|  |    368 |                  \textit{eval\_exp}(\texttt{e}_\texttt{2})\\
 | 
|  |    369 |   0 & \text{otherwise}
 | 
|  |    370 |   \end{cases}$          
 | 
|  |    371 | \end{tabular}
 | 
|  |    372 | \end{center}
 | 
|  |    373 | 
 | 
|  |    374 | \noindent While this should look all rather intuitive`, still
 | 
|  |    375 | be very careful. There is a subtlety which can be easily
 | 
|  |    376 | overlooked: The function \textit{eval\_exp} takes an
 | 
|  |    377 | expression of our programming language as input and returns a
 | 
|  |    378 | number as output. Therefore whenever we encounter a number in
 | 
|  |    379 | our program, we just return this number---this is defined in
 | 
|  |    380 | the first clause above. Whenever we encounter an addition,
 | 
|  |    381 | well then we first evaluate the left-hand side
 | 
|  |    382 | $\texttt{e}_\texttt{1}$ of the addition (this will give a
 | 
|  |    383 | number), then evaluate the right-hand side
 | 
|  |    384 | $\texttt{e}_\texttt{2}$ (this gives another number), and
 | 
|  |    385 | finally add both numbers together. Here is the subtlety: on
 | 
|  |    386 | the left-hand side of the $\dn$ we have a \texttt{+} (in the
 | 
|  |    387 | teletype font) which is the symbol for addition in our
 | 
|  |    388 | programming language. On the right-hand side we have $+$ which
 | 
|  |    389 | stands for the arithmetic operation from ``mathematics'' of
 | 
|  |    390 | adding two numbers. These are rather different concepts---one
 | 
|  |    391 | is a symbol (which we made up), and the other a mathematical
 | 
|  |    392 | operation. When we will have a look at an actual
 | 
|  |    393 | implementation of our interpreter, the mathematical operation
 | 
|  |    394 | will be the function for addition from the programming
 | 
|  |    395 | language in which we \underline{\smash{implement}} our
 | 
|  |    396 | interpreter. While the \texttt{+} is just a symbol that is
 | 
|  |    397 | used in our programming language. Clearly we have to use a
 | 
|  |    398 | symbol that is a good mnemonic for addition, otherwise we will
 | 
|  |    399 | confuse the programmers working with our language. Therefore
 | 
|  |    400 | we use $\texttt{+}$. A similar choice is made for times in the
 | 
|  |    401 | third clause and equality in the fourth clause. 
 | 
|  |    402 | 
 | 
|  |    403 | Remember I wrote at the beginning of this section about being
 | 
|  |    404 | god when designing a programming language. You can see this
 | 
|  |    405 | here: we need to give meaning to symbols. At the moment
 | 
|  |    406 | however, we are a poor fallible god. Look again at the grammar
 | 
|  |    407 | of our programming language and our definition. Clearly, an
 | 
|  |    408 | expression can contain variables. So far we have ignored them.
 | 
|  |    409 | What should our interpreter do with variables? They might
 | 
|  |    410 | change during the evaluation of a program. For example the
 | 
|  |    411 | variable \pcode{n} in the factorial program counts down from 5
 | 
|  |    412 | down to 0. How can we improve our definition above to give also
 | 
|  |    413 | an answer whenever our interpreter encounters a variable in an
 | 
|  |    414 | expression? The solution is to add an \emph{environment},
 | 
|  |    415 | written $env$, as an additional input argument to our
 | 
|  |    416 | \textit{eval\_exp} function.
 | 
|  |    417 |  
 | 
|  |    418 | \begin{center}
 | 
|  |    419 | \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
|  |    420 | $\textit{eval\_exp}(\texttt{n}, env)$ & $\dn$ & $n$
 | 
|  |    421 |   \qquad\text{if}\; \texttt{n}\; \text{is a number like} 
 | 
|  |    422 |   \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
 | 
|  |    423 | \pcode{2}\ldots{}\\
 | 
|  |    424 | $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, 
 | 
|  |    425 |                     \texttt{e}_\texttt{2}, env)$ & 
 | 
|  |    426 |   $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) + 
 | 
|  |    427 |            \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\
 | 
|  |    428 | $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, 
 | 
|  |    429 |                     \texttt{e}_\texttt{2}, env)$ & 
 | 
|  |    430 |   $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) * 
 | 
|  |    431 |            \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\
 | 
|  |    432 | $\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, 
 | 
|  |    433 |                     \texttt{e}_\texttt{2}, env)$ & 
 | 
|  |    434 |   $\dn$ & $\begin{cases}
 | 
|  |    435 |   1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) =
 | 
|  |    436 |                  \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)\\
 | 
|  |    437 |   0 & \text{otherwise}
 | 
|  |    438 |   \end{cases}$\\
 | 
|  |    439 | $\textit{eval\_exp}(\texttt{x}, env)$ & $\dn$ & $env(x)$      
 | 
|  |    440 | \end{tabular}
 | 
|  |    441 | \end{center}
 | 
|  |    442 |  
 | 
|  |    443 | \noindent This environment $env$ also acts like a map: it
 | 
|  |    444 | associates variables with their current values. For example
 | 
|  |    445 | after evaluating the first two lines in our factorial
 | 
|  |    446 | program, such an environment might look as follows
 | 
|  |    447 | 
 | 
|  |    448 | \begin{center}
 | 
|  |    449 | \begin{tabular}{ll}
 | 
|  |    450 | $\fbox{\texttt{a}} \mapsto \texttt{1}$ &
 | 
|  |    451 | $\fbox{\texttt{n}} \mapsto \texttt{5}$
 | 
|  |    452 | \end{tabular}
 | 
|  |    453 | \end{center}
 | 
|  |    454 | 
 | 
|  |    455 | \noindent Again I highlighted the keys. In the clause for
 | 
|  |    456 | variables, we can therefore consult this environment and
 | 
|  |    457 | return whatever value is currently stored for this variable.
 | 
|  |    458 | This is written $env(x)$---meaning we query this map with $x$
 | 
|  |    459 | we obtain the corresponding number. You might ask what happens
 | 
|  |    460 | if an environment does not contain any value for, say, the
 | 
|  |    461 | variable $x$? Well, then our interpreter just ``crashes'', or
 | 
|  |    462 | more precisely will raise an exception. In this case we have a
 | 
|  |    463 | ``bad'' program that tried to use a variable before it was
 | 
|  |    464 | initialised. The programmer should not have done this. In a
 | 
|  |    465 | real programming language, we would of course try a bit harder
 | 
|  |    466 | and for example give an error at compile time, or design our
 | 
|  |    467 | language in such a way that this can never happen. With the
 | 
|  |    468 | second version of \textit{eval\_exp} we completed our
 | 
|  |    469 | definition for evaluating expressions.
 | 
|  |    470 |  
 | 
|  |    471 | Next comes the evaluation function for statements. We define
 | 
|  |    472 | this function in such a way that we recursively evaluate a
 | 
|  |    473 | whole sequence of statements. Assume a program $p$ (you want
 | 
|  |    474 | to evaluate) and its pre-processed snippets $sn$. Then we can
 | 
|  |    475 | define:
 | 
|  |    476 | 
 | 
|  |    477 | \begin{center}
 | 
|  |    478 | \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 | 
|  |    479 | $\textit{eval\_stmts}([], env)$ & $\dn$ & $env$\\
 | 
|  |    480 | $\textit{eval\_stmts}(\texttt{label:}\;rest, env)$ &
 | 
|  |    481 |   $\dn$ & $\textit{eval\_stmts}(rest, env)$ \\ 
 | 
|  |    482 | $\textit{eval\_stmts}(\texttt{x\,:=\,e}\;rest, env)$ & 
 | 
|  |    483 |   $\dn$ & $\textit{eval\_stmts}(rest, 
 | 
|  |    484 |            env[x := \textit{eval\_exp}(\texttt{e}, env)])$\\  
 | 
|  |    485 | $\textit{eval\_stmts}(\texttt{goto\,lbl}\;rest, env)$ 
 | 
|  |    486 |  & $\dn$ & $\textit{eval\_stmts}(sn(\texttt{lbl}), env)$\\
 | 
|  |    487 | $\textit{eval\_stmts}(\texttt{jmp?\,e\,lbl}\;rest, env)$ 
 | 
|  |    488 |  & $\dn$ & $\begin{cases}\begin{array}{@{}l@{\hspace{-12mm}}r@{}}
 | 
|  |    489 |  \textit{eval\_stmts}(sn(\texttt{lbl}), env)\\ 
 | 
|  |    490 |  & \text{if}\;\textit{eval\_exp}(\texttt{e}, env) = 1\\
 | 
|  |    491 |  \textit{eval\_stmts}(rest, env) & \text{otherwise}\\
 | 
|  |    492 |  \end{array}\end{cases}$  
 | 
|  |    493 | \end{tabular}
 | 
|  |    494 | \end{center}
 | 
|  |    495 | 
 | 
|  |    496 | \noindent The first clause is for the empty program, or when
 | 
|  |    497 | we arrived at the end of the program. In this case we just
 | 
|  |    498 | return the environment. The second clause is for when the next
 | 
|  |    499 | statement is a label. That means the program is of the form
 | 
|  |    500 | $\texttt{label:}\;rest$ where the label is some string and
 | 
|  |    501 | $rest$ stands for all following statements. This case is easy,
 | 
|  |    502 | because our evaluation function just discards the label and
 | 
|  |    503 | evaluates the rest of the statements (we already extracted all
 | 
|  |    504 | important information about labels when we pre-processed our
 | 
|  |    505 | programs and generated the snippets). The third clause is for
 | 
|  |    506 | variable assignments. Again we just evaluate the rest for the
 | 
|  |    507 | statements, but with a modified environment---since the
 | 
|  |    508 | variable assignment is supposed to introduce a new variable or
 | 
|  |    509 | change the current value of a variable. For this modification
 | 
|  |    510 | of the environment we first evaluate the expression
 | 
|  |    511 | $\texttt{e}$ using our evaluation function for expressions.
 | 
|  |    512 | This gives us a number. Then we assign this number to the
 | 
|  |    513 | variable $x$ in the environment. This modified environment
 | 
|  |    514 | will be used to evaluate the rest of the program. The fourth
 | 
|  |    515 | clause is for the unconditional jump to a label, called
 | 
|  |    516 | \texttt{lbl}. That means we have to look up in our snippets
 | 
|  |    517 | map $sn$ what are the next statements for this label.
 | 
|  |    518 | Therefore we will continue with evaluating, not with the rest
 | 
|  |    519 | of the program, but with the statements stored in the
 | 
|  |    520 | snippets-map under the label $\texttt{lbl}$. The fifth clause
 | 
|  |    521 | for conditional jumps is similar, but to decide whether to
 | 
|  |    522 | make the jump we first need to evaluate the expression
 | 
|  |    523 | $\texttt{e}$ in order to find out whether it is $1$. If yes,
 | 
|  |    524 | we jump, otherwise we just continue with evaluating the rest
 | 
|  |    525 | of the program.
 | 
|  |    526 | 
 | 
|  |    527 | Our interpreter works in two stages: First we pre-process our
 | 
|  |    528 | program generating the snippets map $sn$, say. Second we call
 | 
|  |    529 | the evaluation function with the default entry point and the
 | 
|  |    530 | empty environment:
 | 
|  |    531 | 
 | 
|  |    532 | \begin{center}
 | 
|  |    533 | $\textit{eval\_stmts}(sn(\pcode{""}), \varnothing)$
 | 
|  |    534 | \end{center}
 | 
|  |    535 |  
 | 
|  |    536 | \noindent It is interesting to note that our interpreter when
 | 
|  |    537 | it comes to the end of the program returns an environment. Our
 | 
|  |    538 | programming language does not contain any constructs for input
 | 
|  |    539 | and output. Therefore this environment is the only effect we
 | 
|  |    540 | can observe when running the program (apart from that our
 | 
|  |    541 | interpreter might need some time before finishing the
 | 
|  |    542 | evaluation of the program and the CPU getting hot). Evaluating 
 | 
|  |    543 | the factorial program with our interpreter we receive as
 | 
|  |    544 | ``answer''-environment
 | 
|  |    545 | 
 | 
|  |    546 | \begin{center}
 | 
|  |    547 | \begin{tabular}{ll}
 | 
|  |    548 | $\fbox{\texttt{a}} \mapsto \texttt{120}$ &
 | 
|  |    549 | $\fbox{\texttt{n}} \mapsto \texttt{0}$
 | 
|  |    550 | \end{tabular}
 | 
|  |    551 | \end{center}
 | 
|  |    552 | 
 | 
|  |    553 | \noindent While the discussion above should have illustrated
 | 
|  |    554 | the ideas, in order to do some serious calculations, we clearly
 | 
|  |    555 | need to implement the interpreter.
 | 
|  |    556 | 
 | 
|  |    557 | 
 | 
|  |    558 | \subsubsection*{Scala Code for the Interpreter}
 | 
|  |    559 | 
 | 
|  |    560 | Functional programming languages are very convenient for
 | 
|  |    561 | implementations of interpreters. A good choice for a
 | 
|  |    562 | functional programming language is Scala, a programming
 | 
|  |    563 | language that combines functional and object-oriented
 | 
|  |    564 | pro\-gramming-styles. It has received in the last five years or
 | 
|  |    565 | so quite a bit of attention. One reason for this attention is
 | 
|  |    566 | that, like the Java programming language, Scala compiles to
 | 
|  |    567 | the Java Virtual Machine (JVM) and therefore Scala programs
 | 
|  |    568 | can run under MacOSX, Linux and Windows.\footnote{There are
 | 
|  |    569 | also experimental backends for Android and JavaScript.} Unlike
 | 
|  |    570 | Java, however, Scala often allows programmers to write very
 | 
|  |    571 | concise and elegant code. Some therefore say Scala is the much
 | 
|  |    572 | better Java. A number of companies, The Guardian, Twitter,
 | 
|  |    573 | Coursera, FourSquare, LinkedIn to name a few, either use Scala
 | 
|  |    574 | exclusively in production code, or at least to some
 | 
|  |    575 | substantial degree. If you want to try out Scala yourself, the
 | 
|  |    576 | Scala compiler can be downloaded from
 | 
|  |    577 | 
 | 
|  |    578 | \begin{quote}
 | 
|  |    579 | \url{http://www.scala-lang.org}
 | 
|  |    580 | \end{quote}
 | 
|  |    581 | 
 | 
|  |    582 | Let us have a look at the Scala code shown in
 | 
|  |    583 | Figure~\ref{code}. It shows the entire code for the
 | 
|  |    584 | interpreter, though the implementation is admittedly no
 | 
|  |    585 | frills.
 | 
|  |    586 | 
 | 
|  |    587 | \begin{figure}[t]
 | 
|  |    588 | \small
 | 
|  |    589 | \lstinputlisting[language=Scala]{../progs/inter.scala}
 | 
|  |    590 | \caption{The entire code of the interpreter for our
 | 
|  |    591 | idealised programming language.\label{code}}
 | 
|  |    592 | \end{figure}
 | 
|  |    593 | 
 | 
|  |    594 | 
 | 
|  |    595 | \subsubsection*{Static Analysis}
 | 
|  |    596 | 
 | 
|  |    597 | Finally we can come back to our original problem, namely 
 | 
|  |    598 | finding out what the signs of variables are 
 | 
|  |    599 | 
 | 
|  |    600 | \begin{center}
 | 
|  |    601 | 
 | 
|  |    602 | 
 | 
|  |    603 | \end{center}
 | 
|  |    604 |  
 | 
|  |    605 | \end{document}
 | 
|  |    606 | 
 | 
|  |    607 | %% list of static analysers for C
 | 
|  |    608 | http://spinroot.com/static/index.html
 | 
|  |    609 | 
 | 
|  |    610 | %% NASA coding rules for C
 | 
|  |    611 | http://pixelscommander.com/wp-content/uploads/2014/12/P10.pdf
 | 
|  |    612 | 
 | 
|  |    613 | %%% Local Variables: 
 | 
|  |    614 | %%% mode: latex
 | 
|  |    615 | %%% TeX-master: t
 | 
|  |    616 | %%% End: 
 |