| 
675
 | 
     1  | 
% !TEX program = xelatex
  | 
| 
539
 | 
     2  | 
\documentclass{article}
 | 
| 
 | 
     3  | 
\usepackage{../style}
 | 
| 
 | 
     4  | 
\usepackage{../langs}
 | 
| 
 | 
     5  | 
\usepackage{../graphics}
 | 
| 
 | 
     6  | 
\usepackage{../grammar}
 | 
| 
 | 
     7  | 
\usepackage{multicol}
 | 
| 
 | 
     8  | 
  | 
| 
675
 | 
     9  | 
%%\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}
 | 
| 
539
 | 
    10  | 
  | 
| 
 | 
    11  | 
\begin{document}
 | 
| 
 | 
    12  | 
\fnote{\copyright{} Christian Urban, King's College London, 2014}
 | 
| 
 | 
    13  | 
  | 
| 
 | 
    14  | 
%% why are shuttle flights so good with software
  | 
| 
 | 
    15  | 
%%http://www.fastcompany.com/28121/they-write-right-stuff
  | 
| 
 | 
    16  | 
  | 
| 
675
 | 
    17  | 
\section*{Handout 10 (Static Analysis)}
 | 
| 
539
 | 
    18  | 
  | 
| 
 | 
    19  | 
If we want to improve the safety and security of our programs,
  | 
| 
 | 
    20  | 
we need a more principled approach to programming. Testing is
  | 
| 
 | 
    21  | 
good, but as Edsger Dijkstra famously wrote: 
  | 
| 
 | 
    22  | 
  | 
| 
 | 
    23  | 
\begin{quote}\it 
 | 
| 
 | 
    24  | 
``Program testing can be a very effective way to show the
  | 
| 
 | 
    25  | 
\underline{\smash{presence}} of bugs, but it is hopelessly
 | 
| 
 | 
    26  | 
inadequate for showing their \underline{\smash{absence}}.''
 | 
| 
 | 
    27  | 
\end{quote}
 | 
| 
 | 
    28  | 
  | 
| 
 | 
    29  | 
\noindent While such a more principled approach has been the
  | 
| 
 | 
    30  | 
subject of intense study for a long, long time, only in the
  | 
| 
 | 
    31  | 
past few years some impressive results have been achieved. One
  | 
| 
 | 
    32  | 
is the complete formalisation and (mathematical) verification
  | 
| 
 | 
    33  | 
of a microkernel operating system called seL4.
  | 
| 
 | 
    34  | 
  | 
| 
 | 
    35  | 
\begin{center}
 | 
| 
 | 
    36  | 
\url{http://sel4.systems}
 | 
| 
 | 
    37  | 
\end{center}
 | 
| 
 | 
    38  | 
  | 
| 
 | 
    39  | 
\noindent In 2011 this work was included in the MIT Technology
  | 
| 
 | 
    40  | 
Review in the annual list of the world’s ten most important
  | 
| 
 | 
    41  | 
emerging
  | 
| 
 | 
    42  | 
technologies.\footnote{\url{http://www2.technologyreview.com/tr10/?year=2011}}
 | 
| 
 | 
    43  | 
While this work is impressive, its technical details are too
  | 
| 
 | 
    44  | 
enormous for an explanation here. Therefore let us look at
  | 
| 
 | 
    45  | 
something much simpler, namely finding out properties about
  | 
| 
 | 
    46  | 
programs using \emph{static analysis}.
 | 
| 
 | 
    47  | 
  | 
| 
 | 
    48  | 
Static analysis is a technique that checks properties of a
  | 
| 
 | 
    49  | 
program without actually running the program. This should
  | 
| 
 | 
    50  | 
raise alarm bells with you---because almost all interesting
  | 
| 
 | 
    51  | 
properties about programs  are equivalent to the halting
  | 
| 
 | 
    52  | 
problem, which we know is undecidable. For example estimating
  | 
| 
 | 
    53  | 
the memory consumption of programs is in general undecidable,
  | 
| 
 | 
    54  | 
just like the halting problem. Static analysis circumvents
  | 
| 
 | 
    55  | 
this undecidability-problem by essentially allowing answers
  | 
| 
 | 
    56  | 
\emph{yes} and \emph{no}, but also \emph{don't know}. With
 | 
| 
 | 
    57  | 
this ``trick'' even the halting problem becomes
  | 
| 
 | 
    58  | 
decidable\ldots{}for example we could always say \emph{don't
 | 
| 
 | 
    59  | 
know}. Of course this would be silly. The point is that we
  | 
| 
 | 
    60  | 
should be striving for a method that answers as often as
  | 
| 
 | 
    61  | 
possible either \emph{yes} or \emph{no}---just in cases when
 | 
| 
 | 
    62  | 
it is too difficult we fall back on the
  | 
| 
 | 
    63  | 
\emph{don't-know}-answer. This might sound all like abstract
 | 
| 
 | 
    64  | 
nonsense. Therefore let us look at a concrete example.
  | 
| 
 | 
    65  | 
  | 
| 
 | 
    66  | 
  | 
| 
 | 
    67  | 
\subsubsection*{A Simple, Idealised Programming Language}
 | 
| 
 | 
    68  | 
  | 
| 
 | 
    69  | 
Our starting point is a small, idealised programming language.
  | 
| 
 | 
    70  | 
It is idealised because we cut several corners in comparison
  | 
| 
 | 
    71  | 
with real programming languages. The language we will study
  | 
| 
 | 
    72  | 
contains, amongst other things, variables holding integers.
  | 
| 
 | 
    73  | 
Using static analysis, we want to find out what the sign of
  | 
| 
 | 
    74  | 
these integers (positive or negative) will be when the program
  | 
| 
 | 
    75  | 
runs. This sign-analysis seems like a very simple problem. But
  | 
| 
 | 
    76  | 
even such simple problems, if approached naively, are in
  | 
| 
 | 
    77  | 
general undecidable, just like Turing's halting problem. I let
  | 
| 
 | 
    78  | 
you think why?
  | 
| 
 | 
    79  | 
  | 
| 
 | 
    80  | 
  | 
| 
 | 
    81  | 
Is sign-analysis of variables an interesting problem? Well,
  | 
| 
 | 
    82  | 
yes---if a compiler can find out that for example a variable
  | 
| 
 | 
    83  | 
will never be negative and this variable is used as an index
  | 
| 
 | 
    84  | 
for an array, then the compiler does not need to generate code
  | 
| 
 | 
    85  | 
for an underflow-check. Remember some languages are immune to
  | 
| 
 | 
    86  | 
buffer-overflow attacks, but they need to add underflow and
  | 
| 
 | 
    87  | 
overflow checks everywhere. According to John Regher, an
  | 
| 
 | 
    88  | 
expert in the field of compilers, overflow checks can cause
  | 
| 
 | 
    89  | 
5-10\% slowdown, in some languages even 100\% for tight
  | 
| 
 | 
    90  | 
loops.\footnote{\url{http://blog.regehr.org/archives/1154}} If
 | 
| 
 | 
    91  | 
the compiler can omit the underflow check, for example, then
  | 
| 
 | 
    92  | 
this can potentially drastically speed up the generated code. 
  | 
| 
 | 
    93  | 
  | 
| 
 | 
    94  | 
What do programs in our simple programming language look like?
  | 
| 
 | 
    95  | 
The following grammar gives a first specification:
  | 
| 
 | 
    96  | 
  | 
| 
 | 
    97  | 
\begin{multicols}{2}
 | 
| 
 | 
    98  | 
\begin{plstx}[rhs style=,one per line,left margin=9mm]
 | 
| 
 | 
    99  | 
: \meta{Stmt} ::= \meta{label} \texttt{:}
 | 
| 
 | 
   100  | 
                | \meta{var} \texttt{:=} \meta{Exp}
 | 
| 
 | 
   101  | 
                | \texttt{jmp?} \meta{Exp} \meta{label}
 | 
| 
 | 
   102  | 
                | \texttt{goto} \meta{label}\\
 | 
| 
 | 
   103  | 
: \meta{Prog} ::= \meta{Stmt} \ldots{} \meta{Stmt}\\
 | 
| 
 | 
   104  | 
\end{plstx}
 | 
| 
 | 
   105  | 
\columnbreak
  | 
| 
 | 
   106  | 
\begin{plstx}[rhs style=,one per line]
: \meta{Exp} ::= \meta{Exp} \texttt{+} \meta{Exp}
 | 
| 
 | 
   107  | 
               | \meta{Exp} \texttt{*} \meta{Exp}
 | 
| 
 | 
   108  | 
               | \meta{Exp} \texttt{=} \meta{Exp} 
 | 
| 
 | 
   109  | 
               | \meta{num}
 | 
| 
 | 
   110  | 
               | \meta{var}\\
 | 
| 
 | 
   111  | 
\end{plstx}
 | 
| 
 | 
   112  | 
\end{multicols}
 | 
| 
 | 
   113  | 
  | 
| 
 | 
   114  | 
\noindent I assume you are familiar with such
  | 
| 
 | 
   115  | 
grammars.\footnote{\url{http://en.wikipedia.org/wiki/Backus–Naur_Form}}
 | 
| 
 | 
   116  | 
There are three main syntactic categories: \emph{statments}
 | 
| 
 | 
   117  | 
and \emph{expressions} as well as \emph{programs}, which are
 | 
| 
 | 
   118  | 
sequences of statements. Statements are either labels,
  | 
| 
 | 
   119  | 
variable assignments, conditional jumps (\pcode{jmp?}) and
 | 
| 
 | 
   120  | 
unconditional jumps (\pcode{goto}). Labels are just strings,
 | 
| 
 | 
   121  | 
which can be used as the target of a jump. We assume that in
  | 
| 
 | 
   122  | 
every program the labels are unique---if there is a clash,
  | 
| 
 | 
   123  | 
then we do not know where to jump to. The conditional jumps
  | 
| 
 | 
   124  | 
and variable assignments involve (arithmetic) expressions.
  | 
| 
 | 
   125  | 
Expressions are either numbers, variables or compound
  | 
| 
 | 
   126  | 
expressions built up from \pcode{+}, \pcode{*} and \emph{=}
 | 
| 
 | 
   127  | 
(for simplicity reasons we do not consider any other
  | 
| 
 | 
   128  | 
operations). We assume we have negative and positive numbers,
  | 
| 
 | 
   129  | 
\ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
 | 
| 
 | 
   130  | 
\pcode{2}\ldots{} An example program that calculates the
 | 
| 
 | 
   131  | 
factorial of 5 is in our programming language as follows:
  | 
| 
 | 
   132  | 
  | 
| 
 | 
   133  | 
\begin{lstlisting}[language={},xleftmargin=10mm]
 | 
| 
 | 
   134  | 
      a := 1
  | 
| 
 | 
   135  | 
      n := 5 
  | 
| 
 | 
   136  | 
top:  
  | 
| 
 | 
   137  | 
      jmp? n = 0 done 
  | 
| 
 | 
   138  | 
      a := a * n 
  | 
| 
 | 
   139  | 
      n := n + -1 
  | 
| 
 | 
   140  | 
      goto top 
  | 
| 
 | 
   141  | 
done:
  | 
| 
 | 
   142  | 
\end{lstlisting}
 | 
| 
 | 
   143  | 
  | 
| 
 | 
   144  | 
\noindent As can be seen each line of the program contains a
  | 
| 
 | 
   145  | 
statement. In the first two lines we assign values to the
  | 
| 
 | 
   146  | 
variables \pcode{a} and \pcode{n}. In line 4 we test whether
 | 
| 
 | 
   147  | 
\pcode{n} is zero, in which case we jump to the end of the
 | 
| 
 | 
   148  | 
program marked with the label \pcode{done}. If \pcode{n} is
 | 
| 
 | 
   149  | 
not zero, we multiply the content of \pcode{a} by \pcode{n},
 | 
| 
 | 
   150  | 
decrease \pcode{n} by one and jump back to the beginning of
 | 
| 
 | 
   151  | 
the loop, marked with the label \pcode{top}. Another program
 | 
| 
 | 
   152  | 
in our language is shown in Figure~\ref{fib}. I let you think
 | 
| 
 | 
   153  | 
what it calculates.
  | 
| 
 | 
   154  | 
 
  | 
| 
 | 
   155  | 
\begin{figure}[t]
 | 
| 
 | 
   156  | 
\begin{lstlisting}[numbers=none,
 | 
| 
 | 
   157  | 
                   language={},xleftmargin=10mm]
 | 
| 
 | 
   158  | 
      n := 6
  | 
| 
 | 
   159  | 
      m1 := 0
  | 
| 
 | 
   160  | 
      m2 := 1
  | 
| 
 | 
   161  | 
loop: 
  | 
| 
 | 
   162  | 
      jmp? n = 0 done
  | 
| 
 | 
   163  | 
      tmp := m2
  | 
| 
 | 
   164  | 
      m2 := m1 + m2
  | 
| 
 | 
   165  | 
      m1 := tmp
  | 
| 
 | 
   166  | 
      n := n + -1
  | 
| 
 | 
   167  | 
      goto top
  | 
| 
 | 
   168  | 
done:
  | 
| 
 | 
   169  | 
\end{lstlisting}
 | 
| 
 | 
   170  | 
\caption{A mystery program in our idealised programming language.
 | 
| 
 | 
   171  | 
Try to find out what it calculates! \label{fib}}
 | 
| 
 | 
   172  | 
\end{figure}
 | 
| 
 | 
   173  | 
  | 
| 
 | 
   174  | 
Even if our language is rather small, it is still Turing
  | 
| 
 | 
   175  | 
complete---meaning quite powerful. However, discussing this
  | 
| 
 | 
   176  | 
fact in more detail would lead us too far astray. Clearly, our
  | 
| 
 | 
   177  | 
programming is rather low-level and not very comfortable for
  | 
| 
 | 
   178  | 
writing programs. It is inspired by real machine code, which
  | 
| 
 | 
   179  | 
is the code that is executed by a CPU. So a more interesting
  | 
| 
 | 
   180  | 
question is what is missing in comparison with real machine
  | 
| 
 | 
   181  | 
code? Well, not much\ldots{}in principle. Real machine code,
 | 
| 
 | 
   182  | 
of course, contains many more arithmetic instructions (not
  | 
| 
 | 
   183  | 
just addition and multiplication) and many more conditional
  | 
| 
 | 
   184  | 
jumps. We could add these to our language if we wanted, but
  | 
| 
 | 
   185  | 
complexity is really beside the point here. Furthermore, real
  | 
| 
 | 
   186  | 
machine code has many instructions for manipulating memory. We
  | 
| 
 | 
   187  | 
do not have this at all. This is actually a more serious
  | 
| 
 | 
   188  | 
simplification because we assume numbers to be arbitrary small
  | 
| 
 | 
   189  | 
or large, which is not the case with real machine code. In
  | 
| 
 | 
   190  | 
real machine code, basic number formats have a range and might
  | 
| 
 | 
   191  | 
over-flow or under-flow this range. Also the number of
  | 
| 
 | 
   192  | 
variables in our programs is potentially unlimited, while
  | 
| 
 | 
   193  | 
memory in an actual computer, of course, is always limited. To
  | 
| 
 | 
   194  | 
sum up, our language might look ridiculously simple, but it is
  | 
| 
 | 
   195  | 
not too far removed from practically relevant issues.
  | 
| 
 | 
   196  | 
  | 
| 
 | 
   197  | 
  | 
| 
 | 
   198  | 
\subsubsection*{An Interpreter}
 | 
| 
 | 
   199  | 
  | 
| 
 | 
   200  | 
Designing a language is like playing god: you can say what
  | 
| 
 | 
   201  | 
names for variables you allow; what programs should look like;
  | 
| 
 | 
   202  | 
most importantly you can decide what each part of the program
  | 
| 
 | 
   203  | 
should mean and do. While our language is quite simple and the
  | 
| 
 | 
   204  | 
meaning of statements, for example, is rather straightforward,
  | 
| 
 | 
   205  | 
there are still places where we need to make real choices. For
  | 
| 
 | 
   206  | 
example consider the conditional jumps, say the one in the
  | 
| 
 | 
   207  | 
factorial program: 
  | 
| 
 | 
   208  | 
  | 
| 
 | 
   209  | 
\begin{center}
 | 
| 
 | 
   210  | 
\code{jmp? n = 0 done}
 | 
| 
 | 
   211  | 
\end{center}
 | 
| 
 | 
   212  | 
  | 
| 
 | 
   213  | 
\noindent How should they work? We could introduce Booleans
  | 
| 
 | 
   214  | 
(\pcode{true} and \pcode{false}) and then jump only when the
 | 
| 
 | 
   215  | 
condition is \pcode{true}. However, since we have numbers in
 | 
| 
 | 
   216  | 
our language anyway, why not just encoding \pcode{true} as
 | 
| 
 | 
   217  | 
one, and \pcode{false} as zero? In this way we can dispense
 | 
| 
 | 
   218  | 
with the additional concept of Booleans.
  | 
| 
 | 
   219  | 
  | 
| 
 | 
   220  | 
I hope the above discussion makes it already clear we need to
  | 
| 
 | 
   221  | 
be a bit more careful with our programs. Below we shall
  | 
| 
 | 
   222  | 
describe an interpreter for our programming language, which
  | 
| 
 | 
   223  | 
specifies exactly how programs are supposed to be
  | 
| 
 | 
   224  | 
run\ldots{}at least we will specify this for all \emph{good}
 | 
| 
 | 
   225  | 
programs. By good programs I mean where all variables are
  | 
| 
 | 
   226  | 
initialised, for example. Our interpreter will just crash if
  | 
| 
 | 
   227  | 
it cannot find out the value for a variable when it is not
  | 
| 
 | 
   228  | 
initialised. Also, we will assume that labels in good programs
  | 
| 
 | 
   229  | 
are unique, otherwise our programs will calculate ``garbage''.
  | 
| 
 | 
   230  | 
  | 
| 
 | 
   231  | 
First we will pre-process our programs. This will simplify the
  | 
| 
 | 
   232  | 
definition of our interpreter later on. By pre-processing our
  | 
| 
 | 
   233  | 
programs we will transform programs into \emph{snippets}. A
 | 
| 
 | 
   234  | 
snippet is a label and all the code that comes after the
  | 
| 
 | 
   235  | 
label. This essentially means a snippet is a \emph{map} from
 | 
| 
 | 
   236  | 
labels to code.\footnote{Be sure you know what maps are. In a
 | 
| 
 | 
   237  | 
programming context they are often represented as association
  | 
| 
 | 
   238  | 
list where some data is associated with a key.} 
  | 
| 
 | 
   239  | 
  | 
| 
 | 
   240  | 
Given that programs are sequences (or lists) of statements, we
  | 
| 
 | 
   241  | 
can easily calculate the snippets by just traversing this
  | 
| 
 | 
   242  | 
sequence and recursively generating the map. Suppose a program
  | 
| 
 | 
   243  | 
is of the general form
  | 
| 
 | 
   244  | 
  | 
| 
 | 
   245  | 
\begin{center}
 | 
| 
 | 
   246  | 
$stmt_1\;stmt_2\; \ldots\; stmt_n$
  | 
| 
 | 
   247  | 
\end{center}
 | 
| 
 | 
   248  | 
  | 
| 
 | 
   249  | 
\noindent The idea is to go through this sequence of
  | 
| 
 | 
   250  | 
statements one by one and check whether they are a label. If
  | 
| 
 | 
   251  | 
yes, we add the label and the remaining statements to our map.
  | 
| 
 | 
   252  | 
If no, we just continue with the next statement. To come up
  | 
| 
 | 
   253  | 
with a recursive definition for generating snippets, let us
  | 
| 
 | 
   254  | 
write $[]$ for the program that does not contain any
  | 
| 
 | 
   255  | 
statement. Consider the following definition:
  | 
| 
 | 
   256  | 
  | 
| 
 | 
   257  | 
\begin{center}
 | 
| 
 | 
   258  | 
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
| 
 | 
   259  | 
$\textit{snippets}([])$ & $\dn$ & $\varnothing$\\
 | 
| 
 | 
   260  | 
$\textit{snippets}(stmt\;\; rest)$ & $\dn$ &
 | 
| 
 | 
   261  | 
$\begin{cases}
 | 
| 
 | 
   262  | 
   \textit{snippets}(rest)[label := rest] & \text{if}\;stmt = \textit{label:}\\
 | 
| 
 | 
   263  | 
   \textit{snippets}(rest)                & \text{otherwise}
 | 
| 
 | 
   264  | 
\end{cases}$   
 | 
| 
 | 
   265  | 
\end{tabular}
 | 
| 
 | 
   266  | 
\end{center}
 | 
| 
 | 
   267  | 
  | 
| 
 | 
   268  | 
\noindent In the first clause we just return the empty map for
  | 
| 
 | 
   269  | 
the program that does not contain any statement. In the second
  | 
| 
 | 
   270  | 
clause, we have to distinguish the case where the first
  | 
| 
 | 
   271  | 
statement is a label or not. As said before, if not, then we
  | 
| 
 | 
   272  | 
just ``throw away'' the label and recursively calculate the
  | 
| 
 | 
   273  | 
snippets for the rest of the program (the otherwise clause).
  | 
| 
 | 
   274  | 
If yes, then we do the same, but also update the map so that
  | 
| 
 | 
   275  | 
$label$ now points to the rest of the statements (the if
  | 
| 
 | 
   276  | 
clause). This looks all realtively straightforward, but there
  | 
| 
 | 
   277  | 
is one small problem we need to overcome: our two programs
  | 
| 
 | 
   278  | 
shown so far have no label as \emph{entry point}---that is
 | 
| 
 | 
   279  | 
where the execution is supposed to start. We usually assume
  | 
| 
 | 
   280  | 
that the first statement will be run first. To make this the
  | 
| 
 | 
   281  | 
default, it is convenient if we add to all our programs a
  | 
| 
 | 
   282  | 
default label, say \pcode{""} (the empty string). With this we
 | 
| 
 | 
   283  | 
can define our pre-processing of programs as follows
  | 
| 
 | 
   284  | 
  | 
| 
 | 
   285  | 
\begin{center}
 | 
| 
 | 
   286  | 
$\textit{preproc}(prog) \dn \textit{snippets}(\pcode{"":}\;\; prog)$ 
 | 
| 
 | 
   287  | 
\end{center} 
 | 
| 
 | 
   288  | 
  | 
| 
 | 
   289  | 
\noindent Let us see how this pans out in practice. If we
  | 
| 
 | 
   290  | 
pre-process the factorial program shown earlier, we obtain the 
  | 
| 
 | 
   291  | 
following map:
  | 
| 
 | 
   292  | 
  | 
| 
 | 
   293  | 
\begin{center}
 | 
| 
 | 
   294  | 
\small
  | 
| 
 | 
   295  | 
\lstset{numbers=none,
 | 
| 
 | 
   296  | 
        language={},
 | 
| 
 | 
   297  | 
        xleftmargin=0mm,
  | 
| 
 | 
   298  | 
        aboveskip=0.5mm,
  | 
| 
 | 
   299  | 
        belowskip=0.5mm,
  | 
| 
 | 
   300  | 
        frame=single,
  | 
| 
 | 
   301  | 
        framerule=0mm,
  | 
| 
 | 
   302  | 
        framesep=0mm}
  | 
| 
 | 
   303  | 
\begin{tikzpicture}[node distance=0mm]
 | 
| 
 | 
   304  | 
\node (A1) [draw]{\pcode{""}};
 | 
| 
 | 
   305  | 
\node (B1) [right=of A1] {$\mapsto$};
 | 
| 
 | 
   306  | 
\node (C1) [right=of B1,anchor=north west] {
 | 
| 
 | 
   307  | 
\begin{minipage}{3.5cm}
 | 
| 
 | 
   308  | 
\begin{lstlisting}[language={},xleftmargin=0mm]
 | 
| 
 | 
   309  | 
  a := 1
  | 
| 
 | 
   310  | 
  n := 5 
  | 
| 
 | 
   311  | 
top:  
  | 
| 
 | 
   312  | 
  jmp? n = 0 done 
  | 
| 
 | 
   313  | 
  a := a * n 
  | 
| 
 | 
   314  | 
  n := n + -1 
  | 
| 
 | 
   315  | 
  goto top 
  | 
| 
 | 
   316  | 
done:
  | 
| 
 | 
   317  | 
\end{lstlisting}
 | 
| 
 | 
   318  | 
\end{minipage}};
 | 
| 
 | 
   319  | 
\node (A2) [right=of C1.north east,draw] {\pcode{top}};
 | 
| 
 | 
   320  | 
\node (B2) [right=of A2] {$\mapsto$};
 | 
| 
 | 
   321  | 
\node (C2) [right=of B2, anchor=north west] {
 | 
| 
 | 
   322  | 
\begin{minipage}{3.5cm}
 | 
| 
 | 
   323  | 
\begin{lstlisting}[language={},xleftmargin=0mm]
 | 
| 
 | 
   324  | 
  jmp? n = 0 done 
  | 
| 
 | 
   325  | 
  a := a * n 
  | 
| 
 | 
   326  | 
  n := n + -1 
  | 
| 
 | 
   327  | 
  goto top 
  | 
| 
 | 
   328  | 
done:
  | 
| 
 | 
   329  | 
\end{lstlisting} 
 | 
| 
 | 
   330  | 
\end{minipage}}; 
 | 
| 
 | 
   331  | 
\node (A3) [right=of C2.north east,draw] {\pcode{done}};
 | 
| 
 | 
   332  | 
\node (B3) [right=of A3] {$\mapsto$};
 | 
| 
 | 
   333  | 
\node (C3) [right=of B3] {$[]$};
 | 
| 
 | 
   334  | 
\end{tikzpicture}
 | 
| 
 | 
   335  | 
\end{center}
 | 
| 
 | 
   336  | 
  | 
| 
 | 
   337  | 
\noindent I highlighted the \emph{keys} in this map. Since
 | 
| 
 | 
   338  | 
there are three labels in the factorial program (remember we
  | 
| 
 | 
   339  | 
added \pcode{""}), there are three keys. When running the
 | 
| 
 | 
   340  | 
factorial program and encountering a jump, then we only have
  | 
| 
 | 
   341  | 
to consult this snippets-map in order to find out what the
  | 
| 
 | 
   342  | 
next statements should be.
  | 
| 
 | 
   343  | 
  | 
| 
 | 
   344  | 
We should now be in the position to define how a program
  | 
| 
 | 
   345  | 
should be run. In the context of interpreters, this
  | 
| 
 | 
   346  | 
``running'' of programs is often called \emph{evaluation}. Let
 | 
| 
 | 
   347  | 
us start with the definition of how expressions are evaluated.
  | 
| 
 | 
   348  | 
A first attempt might be the following recursive function:
  | 
| 
 | 
   349  | 
  | 
| 
 | 
   350  | 
\begin{center}
 | 
| 
 | 
   351  | 
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
| 
 | 
   352  | 
$\textit{eval\_exp}(\texttt{n})$ & $\dn$ & $n$
 | 
| 
 | 
   353  | 
  \qquad\text{if}\; \texttt{n}\; \text{is a number like} 
 | 
| 
 | 
   354  | 
  \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
 | 
| 
 | 
   355  | 
\pcode{2}\ldots{}\\
 | 
| 
 | 
   356  | 
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, 
 | 
| 
 | 
   357  | 
                    \texttt{e}_\texttt{2})$ & 
 | 
| 
 | 
   358  | 
  $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) + 
 | 
| 
 | 
   359  | 
           \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\
 | 
| 
 | 
   360  | 
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, 
 | 
| 
 | 
   361  | 
                    \texttt{e}_\texttt{2})$ & 
 | 
| 
 | 
   362  | 
  $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) * 
 | 
| 
 | 
   363  | 
           \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\
 | 
| 
 | 
   364  | 
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, 
 | 
| 
 | 
   365  | 
                    \texttt{e}_\texttt{2})$ & 
 | 
| 
 | 
   366  | 
  $\dn$ & $\begin{cases}
 | 
| 
 | 
   367  | 
  1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}) =
 | 
| 
 | 
   368  | 
                 \textit{eval\_exp}(\texttt{e}_\texttt{2})\\
 | 
| 
 | 
   369  | 
  0 & \text{otherwise}
 | 
| 
 | 
   370  | 
  \end{cases}$          
 | 
| 
 | 
   371  | 
\end{tabular}
 | 
| 
 | 
   372  | 
\end{center}
 | 
| 
 | 
   373  | 
  | 
| 
 | 
   374  | 
\noindent While this should look all rather intuitive`, still
  | 
| 
 | 
   375  | 
be very careful. There is a subtlety which can be easily
  | 
| 
 | 
   376  | 
overlooked: The function \textit{eval\_exp} takes an
 | 
| 
 | 
   377  | 
expression of our programming language as input and returns a
  | 
| 
 | 
   378  | 
number as output. Therefore whenever we encounter a number in
  | 
| 
 | 
   379  | 
our program, we just return this number---this is defined in
  | 
| 
 | 
   380  | 
the first clause above. Whenever we encounter an addition,
  | 
| 
 | 
   381  | 
well then we first evaluate the left-hand side
  | 
| 
 | 
   382  | 
$\texttt{e}_\texttt{1}$ of the addition (this will give a
 | 
| 
 | 
   383  | 
number), then evaluate the right-hand side
  | 
| 
 | 
   384  | 
$\texttt{e}_\texttt{2}$ (this gives another number), and
 | 
| 
 | 
   385  | 
finally add both numbers together. Here is the subtlety: on
  | 
| 
 | 
   386  | 
the left-hand side of the $\dn$ we have a \texttt{+} (in the
 | 
| 
 | 
   387  | 
teletype font) which is the symbol for addition in our
  | 
| 
 | 
   388  | 
programming language. On the right-hand side we have $+$ which
  | 
| 
 | 
   389  | 
stands for the arithmetic operation from ``mathematics'' of
  | 
| 
 | 
   390  | 
adding two numbers. These are rather different concepts---one
  | 
| 
 | 
   391  | 
is a symbol (which we made up), and the other a mathematical
  | 
| 
 | 
   392  | 
operation. When we will have a look at an actual
  | 
| 
 | 
   393  | 
implementation of our interpreter, the mathematical operation
  | 
| 
 | 
   394  | 
will be the function for addition from the programming
  | 
| 
 | 
   395  | 
language in which we \underline{\smash{implement}} our
 | 
| 
 | 
   396  | 
interpreter. While the \texttt{+} is just a symbol that is
 | 
| 
 | 
   397  | 
used in our programming language. Clearly we have to use a
  | 
| 
 | 
   398  | 
symbol that is a good mnemonic for addition, otherwise we will
  | 
| 
 | 
   399  | 
confuse the programmers working with our language. Therefore
  | 
| 
 | 
   400  | 
we use $\texttt{+}$. A similar choice is made for times in the
 | 
| 
 | 
   401  | 
third clause and equality in the fourth clause. 
  | 
| 
 | 
   402  | 
  | 
| 
 | 
   403  | 
Remember I wrote at the beginning of this section about being
  | 
| 
 | 
   404  | 
god when designing a programming language. You can see this
  | 
| 
 | 
   405  | 
here: we need to give meaning to symbols. At the moment
  | 
| 
 | 
   406  | 
however, we are a poor fallible god. Look again at the grammar
  | 
| 
 | 
   407  | 
of our programming language and our definition. Clearly, an
  | 
| 
 | 
   408  | 
expression can contain variables. So far we have ignored them.
  | 
| 
 | 
   409  | 
What should our interpreter do with variables? They might
  | 
| 
 | 
   410  | 
change during the evaluation of a program. For example the
  | 
| 
 | 
   411  | 
variable \pcode{n} in the factorial program counts down from 5
 | 
| 
 | 
   412  | 
down to 0. How can we improve our definition above to give also
  | 
| 
 | 
   413  | 
an answer whenever our interpreter encounters a variable in an
  | 
| 
 | 
   414  | 
expression? The solution is to add an \emph{environment},
 | 
| 
 | 
   415  | 
written $env$, as an additional input argument to our
  | 
| 
 | 
   416  | 
\textit{eval\_exp} function.
 | 
| 
 | 
   417  | 
 
  | 
| 
 | 
   418  | 
\begin{center}
 | 
| 
 | 
   419  | 
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
| 
 | 
   420  | 
$\textit{eval\_exp}(\texttt{n}, env)$ & $\dn$ & $n$
 | 
| 
 | 
   421  | 
  \qquad\text{if}\; \texttt{n}\; \text{is a number like} 
 | 
| 
 | 
   422  | 
  \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
 | 
| 
 | 
   423  | 
\pcode{2}\ldots{}\\
 | 
| 
 | 
   424  | 
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, 
 | 
| 
 | 
   425  | 
                    \texttt{e}_\texttt{2}, env)$ & 
 | 
| 
 | 
   426  | 
  $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) + 
 | 
| 
 | 
   427  | 
           \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\
 | 
| 
 | 
   428  | 
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, 
 | 
| 
 | 
   429  | 
                    \texttt{e}_\texttt{2}, env)$ & 
 | 
| 
 | 
   430  | 
  $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) * 
 | 
| 
 | 
   431  | 
           \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\
 | 
| 
 | 
   432  | 
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, 
 | 
| 
 | 
   433  | 
                    \texttt{e}_\texttt{2}, env)$ & 
 | 
| 
 | 
   434  | 
  $\dn$ & $\begin{cases}
 | 
| 
 | 
   435  | 
  1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) =
 | 
| 
 | 
   436  | 
                 \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)\\
 | 
| 
 | 
   437  | 
  0 & \text{otherwise}
 | 
| 
 | 
   438  | 
  \end{cases}$\\
 | 
| 
 | 
   439  | 
$\textit{eval\_exp}(\texttt{x}, env)$ & $\dn$ & $env(x)$      
 | 
| 
 | 
   440  | 
\end{tabular}
 | 
| 
 | 
   441  | 
\end{center}
 | 
| 
 | 
   442  | 
 
  | 
| 
 | 
   443  | 
\noindent This environment $env$ also acts like a map: it
  | 
| 
 | 
   444  | 
associates variables with their current values. For example
  | 
| 
 | 
   445  | 
after evaluating the first two lines in our factorial
  | 
| 
 | 
   446  | 
program, such an environment might look as follows
  | 
| 
 | 
   447  | 
  | 
| 
 | 
   448  | 
\begin{center}
 | 
| 
 | 
   449  | 
\begin{tabular}{ll}
 | 
| 
 | 
   450  | 
$\fbox{\texttt{a}} \mapsto \texttt{1}$ &
 | 
| 
 | 
   451  | 
$\fbox{\texttt{n}} \mapsto \texttt{5}$
 | 
| 
 | 
   452  | 
\end{tabular}
 | 
| 
 | 
   453  | 
\end{center}
 | 
| 
 | 
   454  | 
  | 
| 
 | 
   455  | 
\noindent Again I highlighted the keys. In the clause for
  | 
| 
 | 
   456  | 
variables, we can therefore consult this environment and
  | 
| 
 | 
   457  | 
return whatever value is currently stored for this variable.
  | 
| 
 | 
   458  | 
This is written $env(x)$---meaning we query this map with $x$
  | 
| 
 | 
   459  | 
we obtain the corresponding number. You might ask what happens
  | 
| 
 | 
   460  | 
if an environment does not contain any value for, say, the
  | 
| 
 | 
   461  | 
variable $x$? Well, then our interpreter just ``crashes'', or
  | 
| 
 | 
   462  | 
more precisely will raise an exception. In this case we have a
  | 
| 
 | 
   463  | 
``bad'' program that tried to use a variable before it was
  | 
| 
 | 
   464  | 
initialised. The programmer should not have done this. In a
  | 
| 
 | 
   465  | 
real programming language, we would of course try a bit harder
  | 
| 
 | 
   466  | 
and for example give an error at compile time, or design our
  | 
| 
 | 
   467  | 
language in such a way that this can never happen. With the
  | 
| 
 | 
   468  | 
second version of \textit{eval\_exp} we completed our
 | 
| 
 | 
   469  | 
definition for evaluating expressions.
  | 
| 
 | 
   470  | 
 
  | 
| 
 | 
   471  | 
Next comes the evaluation function for statements. We define
  | 
| 
 | 
   472  | 
this function in such a way that we recursively evaluate a
  | 
| 
 | 
   473  | 
whole sequence of statements. Assume a program $p$ (you want
  | 
| 
 | 
   474  | 
to evaluate) and its pre-processed snippets $sn$. Then we can
  | 
| 
 | 
   475  | 
define:
  | 
| 
 | 
   476  | 
  | 
| 
 | 
   477  | 
\begin{center}
 | 
| 
 | 
   478  | 
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 | 
| 
 | 
   479  | 
$\textit{eval\_stmts}([], env)$ & $\dn$ & $env$\\
 | 
| 
 | 
   480  | 
$\textit{eval\_stmts}(\texttt{label:}\;rest, env)$ &
 | 
| 
 | 
   481  | 
  $\dn$ & $\textit{eval\_stmts}(rest, env)$ \\ 
 | 
| 
 | 
   482  | 
$\textit{eval\_stmts}(\texttt{x\,:=\,e}\;rest, env)$ & 
 | 
| 
 | 
   483  | 
  $\dn$ & $\textit{eval\_stmts}(rest, 
 | 
| 
 | 
   484  | 
           env[x := \textit{eval\_exp}(\texttt{e}, env)])$\\  
 | 
| 
 | 
   485  | 
$\textit{eval\_stmts}(\texttt{goto\,lbl}\;rest, env)$ 
 | 
| 
 | 
   486  | 
 & $\dn$ & $\textit{eval\_stmts}(sn(\texttt{lbl}), env)$\\
 | 
| 
 | 
   487  | 
$\textit{eval\_stmts}(\texttt{jmp?\,e\,lbl}\;rest, env)$ 
 | 
| 
 | 
   488  | 
 & $\dn$ & $\begin{cases}\begin{array}{@{}l@{\hspace{-12mm}}r@{}}
 | 
| 
 | 
   489  | 
 \textit{eval\_stmts}(sn(\texttt{lbl}), env)\\ 
 | 
| 
 | 
   490  | 
 & \text{if}\;\textit{eval\_exp}(\texttt{e}, env) = 1\\
 | 
| 
 | 
   491  | 
 \textit{eval\_stmts}(rest, env) & \text{otherwise}\\
 | 
| 
 | 
   492  | 
 \end{array}\end{cases}$  
 | 
| 
 | 
   493  | 
\end{tabular}
 | 
| 
 | 
   494  | 
\end{center}
 | 
| 
 | 
   495  | 
  | 
| 
 | 
   496  | 
\noindent The first clause is for the empty program, or when
  | 
| 
 | 
   497  | 
we arrived at the end of the program. In this case we just
  | 
| 
 | 
   498  | 
return the environment. The second clause is for when the next
  | 
| 
 | 
   499  | 
statement is a label. That means the program is of the form
  | 
| 
 | 
   500  | 
$\texttt{label:}\;rest$ where the label is some string and
 | 
| 
 | 
   501  | 
$rest$ stands for all following statements. This case is easy,
  | 
| 
 | 
   502  | 
because our evaluation function just discards the label and
  | 
| 
 | 
   503  | 
evaluates the rest of the statements (we already extracted all
  | 
| 
 | 
   504  | 
important information about labels when we pre-processed our
  | 
| 
 | 
   505  | 
programs and generated the snippets). The third clause is for
  | 
| 
 | 
   506  | 
variable assignments. Again we just evaluate the rest for the
  | 
| 
 | 
   507  | 
statements, but with a modified environment---since the
  | 
| 
 | 
   508  | 
variable assignment is supposed to introduce a new variable or
  | 
| 
 | 
   509  | 
change the current value of a variable. For this modification
  | 
| 
 | 
   510  | 
of the environment we first evaluate the expression
  | 
| 
 | 
   511  | 
$\texttt{e}$ using our evaluation function for expressions.
 | 
| 
 | 
   512  | 
This gives us a number. Then we assign this number to the
  | 
| 
 | 
   513  | 
variable $x$ in the environment. This modified environment
  | 
| 
 | 
   514  | 
will be used to evaluate the rest of the program. The fourth
  | 
| 
 | 
   515  | 
clause is for the unconditional jump to a label, called
  | 
| 
 | 
   516  | 
\texttt{lbl}. That means we have to look up in our snippets
 | 
| 
 | 
   517  | 
map $sn$ what are the next statements for this label.
  | 
| 
 | 
   518  | 
Therefore we will continue with evaluating, not with the rest
  | 
| 
 | 
   519  | 
of the program, but with the statements stored in the
  | 
| 
 | 
   520  | 
snippets-map under the label $\texttt{lbl}$. The fifth clause
 | 
| 
 | 
   521  | 
for conditional jumps is similar, but to decide whether to
  | 
| 
 | 
   522  | 
make the jump we first need to evaluate the expression
  | 
| 
 | 
   523  | 
$\texttt{e}$ in order to find out whether it is $1$. If yes,
 | 
| 
 | 
   524  | 
we jump, otherwise we just continue with evaluating the rest
  | 
| 
 | 
   525  | 
of the program.
  | 
| 
 | 
   526  | 
  | 
| 
 | 
   527  | 
Our interpreter works in two stages: First we pre-process our
  | 
| 
 | 
   528  | 
program generating the snippets map $sn$, say. Second we call
  | 
| 
 | 
   529  | 
the evaluation function with the default entry point and the
  | 
| 
 | 
   530  | 
empty environment:
  | 
| 
 | 
   531  | 
  | 
| 
 | 
   532  | 
\begin{center}
 | 
| 
 | 
   533  | 
$\textit{eval\_stmts}(sn(\pcode{""}), \varnothing)$
 | 
| 
 | 
   534  | 
\end{center}
 | 
| 
 | 
   535  | 
 
  | 
| 
 | 
   536  | 
\noindent It is interesting to note that our interpreter when
  | 
| 
 | 
   537  | 
it comes to the end of the program returns an environment. Our
  | 
| 
 | 
   538  | 
programming language does not contain any constructs for input
  | 
| 
 | 
   539  | 
and output. Therefore this environment is the only effect we
  | 
| 
 | 
   540  | 
can observe when running the program (apart from that our
  | 
| 
 | 
   541  | 
interpreter might need some time before finishing the
  | 
| 
 | 
   542  | 
evaluation of the program and the CPU getting hot). Evaluating 
  | 
| 
 | 
   543  | 
the factorial program with our interpreter we receive as
  | 
| 
 | 
   544  | 
``answer''-environment
  | 
| 
 | 
   545  | 
  | 
| 
 | 
   546  | 
\begin{center}
 | 
| 
 | 
   547  | 
\begin{tabular}{ll}
 | 
| 
 | 
   548  | 
$\fbox{\texttt{a}} \mapsto \texttt{120}$ &
 | 
| 
 | 
   549  | 
$\fbox{\texttt{n}} \mapsto \texttt{0}$
 | 
| 
 | 
   550  | 
\end{tabular}
 | 
| 
 | 
   551  | 
\end{center}
 | 
| 
 | 
   552  | 
  | 
| 
 | 
   553  | 
\noindent While the discussion above should have illustrated
  | 
| 
 | 
   554  | 
the ideas, in order to do some serious calculations, we clearly
  | 
| 
 | 
   555  | 
need to implement the interpreter.
  | 
| 
 | 
   556  | 
  | 
| 
 | 
   557  | 
  | 
| 
 | 
   558  | 
\subsubsection*{Scala Code for the Interpreter}
 | 
| 
 | 
   559  | 
  | 
| 
 | 
   560  | 
Functional programming languages are very convenient for
  | 
| 
 | 
   561  | 
implementations of interpreters. A good choice for a
  | 
| 
 | 
   562  | 
functional programming language is Scala, a programming
  | 
| 
 | 
   563  | 
language that combines functional and object-oriented
  | 
| 
 | 
   564  | 
pro\-gramming-styles. It has received in the last five years or
  | 
| 
 | 
   565  | 
so quite a bit of attention. One reason for this attention is
  | 
| 
 | 
   566  | 
that, like the Java programming language, Scala compiles to
  | 
| 
 | 
   567  | 
the Java Virtual Machine (JVM) and therefore Scala programs
  | 
| 
 | 
   568  | 
can run under MacOSX, Linux and Windows.\footnote{There are
 | 
| 
 | 
   569  | 
also experimental backends for Android and JavaScript.} Unlike
  | 
| 
 | 
   570  | 
Java, however, Scala often allows programmers to write very
  | 
| 
 | 
   571  | 
concise and elegant code. Some therefore say Scala is the much
  | 
| 
 | 
   572  | 
better Java. A number of companies, The Guardian, Twitter,
  | 
| 
 | 
   573  | 
Coursera, FourSquare, LinkedIn to name a few, either use Scala
  | 
| 
 | 
   574  | 
exclusively in production code, or at least to some
  | 
| 
 | 
   575  | 
substantial degree. If you want to try out Scala yourself, the
  | 
| 
 | 
   576  | 
Scala compiler can be downloaded from
  | 
| 
 | 
   577  | 
  | 
| 
 | 
   578  | 
\begin{quote}
 | 
| 
 | 
   579  | 
\url{http://www.scala-lang.org}
 | 
| 
 | 
   580  | 
\end{quote}
 | 
| 
 | 
   581  | 
  | 
| 
 | 
   582  | 
Let us have a look at the Scala code shown in
  | 
| 
 | 
   583  | 
Figure~\ref{code}. It shows the entire code for the
 | 
| 
 | 
   584  | 
interpreter, though the implementation is admittedly no
  | 
| 
 | 
   585  | 
frills.
  | 
| 
 | 
   586  | 
  | 
| 
871
 | 
   587  | 
%\begin{figure}[t]
 | 
| 
 | 
   588  | 
%\small
  | 
| 
 | 
   589  | 
%\lstinputlisting[language=Scala]{../progs/inter.scala}
 | 
| 
 | 
   590  | 
%\caption{The entire code of the interpreter for our
 | 
| 
 | 
   591  | 
%idealised programming language.\label{code}}
 | 
| 
 | 
   592  | 
%\end{figure}
 | 
| 
539
 | 
   593  | 
  | 
| 
 | 
   594  | 
  | 
| 
 | 
   595  | 
\subsubsection*{Static Analysis}
 | 
| 
 | 
   596  | 
  | 
| 
 | 
   597  | 
Finally we can come back to our original problem, namely 
  | 
| 
 | 
   598  | 
finding out what the signs of variables are 
  | 
| 
 | 
   599  | 
  | 
| 
 | 
   600  | 
\begin{center}
 | 
| 
 | 
   601  | 
  | 
| 
 | 
   602  | 
  | 
| 
 | 
   603  | 
\end{center}
 | 
| 
 | 
   604  | 
 
  | 
| 
 | 
   605  | 
\end{document}
 | 
| 
 | 
   606  | 
  | 
| 
 | 
   607  | 
%% list of static analysers for C
  | 
| 
 | 
   608  | 
http://spinroot.com/static/index.html
  | 
| 
 | 
   609  | 
  | 
| 
 | 
   610  | 
%% NASA coding rules for C
  | 
| 
 | 
   611  | 
http://pixelscommander.com/wp-content/uploads/2014/12/P10.pdf
  | 
| 
 | 
   612  | 
  | 
| 
 | 
   613  | 
%%% Local Variables: 
  | 
| 
 | 
   614  | 
%%% mode: latex
  | 
| 
 | 
   615  | 
%%% TeX-master: t
  | 
| 
 | 
   616  | 
%%% End: 
  |