handouts/ho09.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Wed, 17 Dec 2014 17:50:34 +0000
changeset 354 8e5e84b14041
parent 352 da5713bcdbb0
child 355 619073c37649
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../grammar}
\usepackage{multicol}
\usepackage{array}

\begin{document}

\section*{Handout 9 (Static Analysis)}

If we want to improve the safety and security of our programs,
we need a more principled approach to programming. Testing is
good, but as Dijkstra famously wrote: 

\begin{quote}\it 
``Program testing can be a very effective way to show the
\underline{\smash{presence}} of bugs, but it is hopelessly
inadequate for showing their \underline{\smash{absence}}.''
\end{quote}

\noindent While such a more principled approach has been the
subject of intense study for a long, long time, only in the
past few years some impressive results have been achieved. One
is the complete formalisation and (mathematical) verification
of a microkernel operating system called seL4.

\begin{center}
\url{http://sel4.systems}
\end{center}

\noindent In 2011 this work was included in the MIT Technology
Review in the annual list of the world’s ten most important
emerging
technologies.\footnote{\url{http://www2.technologyreview.com/tr10/?year=2011}}
While this work is impressive, its technical details are too
enormous for an explanation here. Therefore let us look at
something much simpler, namely finding out properties about
programs using \emph{static analysis}.

Static analysis is a technique that checks properties of a
program without actually running the program. This should
raise alarm bells with you---because almost all interesting
properties about programs  are equivalent to the halting
problem, which we know is undecidable. For example estimating
the memory consumption of programs is in general undecidable,
just like the halting problem. Static analysis circumvents
this undecidability-problem by essentially allowing answers
\emph{yes} and \emph{no}, but also \emph{don't know}. With
this ``trick'' even the halting problem becomes
decidable\ldots{}for example we could always say \emph{don't
know}. Of course this would be silly. The point is that we
should be striving for a method that answers as often as
possible either \emph{yes} or \emph{no}---just in cases when
it is too difficult we fall back on the
\emph{don't-know}-answer. This might sound all like abstract
nonsense. Therefore let us look at a concrete example.


\subsubsection*{A Simple, Idealised Programming Language}

Our starting point is a small, idealised programming language.
It is idealised because we cut several corners in comparison
with real programming languages. The language we will study
contains, amongst other things, variables holding integers.
Using static analysis, we want to find out what the sign of
these integers (positive or negative) will be when the program
runs. This sign-analysis seems like a very simple problem. But
teven such simple problems, if approached naively, are in
general undecidable, just like Turing's halting problem. I let
you think why?


Is sign-analysis of variables an interesting problem? Well,
yes---if a compiler can find out that for example a variable
will never be negative and this variable is used as an index
for an array, then the comopiler does not need to generate code
for an underflow-test. Remember some languages are immune to
buffer-overflow attacks, but they need to add underflow and
overflow checks everywhere. If the compiler can omit the
underflow test, for example, then this can potentially
drastically speed up the generated code.

What do programs in our simple programming language look like?
The following grammar gives a first specification:

\begin{multicols}{2}
\begin{plstx}[rhs style=,one per line,left margin=9mm]
: \meta{Stmt} ::= \meta{label} \texttt{:}
                | \meta{var} \texttt{:=} \meta{Exp}
                | \texttt{jmp?} \meta{Exp} \meta{label}
                | \texttt{goto} \meta{label}\\
: \meta{Prog} ::= \meta{Stmt} \ldots{} \meta{Stmt}\\
\end{plstx}
\columnbreak
\begin{plstx}[rhs style=,one per line]
: \meta{Exp} ::= \meta{Exp} \texttt{+} \meta{Exp}
               | \meta{Exp} \texttt{*} \meta{Exp}
               | \meta{Exp} \texttt{=} \meta{Exp} 
               | \meta{num}
               | \meta{var}\\
\end{plstx}
\end{multicols}

\noindent I assume you are familiar with such
grammars.\footnote{\url{http://en.wikipedia.org/wiki/Backus–Naur_Form}}
There are three main syntactic categories: \emph{statments}
and \emph{expressions} as well as \emph{programs}, which are
sequences of statements. Statements are either labels,
variable assignments, conditional jumps (\pcode{jmp?}) and
unconditional jumps (\pcode{goto}). Labels are just strings,
which can be used as the target of a jump. We assume that in
every program the labels are unique---if there is a clash we
do not know where to jump to. The conditional jumps and
variable assignments involve (arithmetic) expressions.
Expressions are either numbers, variables or compound
expressions built up from \pcode{+}, \pcode{*} and \emph{=}
(for simplicity reasons we do not consider any other
operations). We assume we have negative and positive numbers,
\ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
\pcode{2}\ldots{} An example program that calculates the
factorial of 5 is in oure programming language as follows:

\begin{lstlisting}[language={},xleftmargin=10mm]
      a := 1
      n := 5 
top:  
      jmp? n = 0 done 
      a := a * n 
      n := n + -1 
      goto top 
done:
\end{lstlisting}

\noindent As can be seen each line of the program contains a
statement. In the first two lines we assign values to the
variables \pcode{a} and \pcode{n}. In line 4 we test whether
\pcode{n} is zero, in which case we jump to the end of the
program marked with the label \pcode{done}. If \pcode{n} is
not zero, we multiply the content of \pcode{a} by \pcode{n},
decrease \pcode{n} by one and jump back to the beginning of
the loop, marked with the label \pcode{top}. Another program
in our language is shown in Figure~\ref{fib}. I let you think
what it calculates.
 
\begin{figure}[t]
\begin{lstlisting}[numbers=none,
                   language={},xleftmargin=10mm]
      n := 6
      m1 := 0
      m2 := 1
loop: 
      jmp? n = 0 done
      tmp := m2
      m2 := m1 + m2
      m1 := tmp
      n := n + -1
      goto top
done:
\end{lstlisting}
\caption{A mystery program in our idealised programming language.
Try to find out what it calculates! \label{fib}}
\end{figure}

Even if our language is rather small, it is still Turing
complete---meaning quite powerful. However, discussing this
fact in more detail would lead us too far astray. Clearly, our
programming is rather low-level and not very comfortable for
writing programs. It is inspired by real machine code, which
is the code that is executed by a CPU. So a more interesting
question is what is missing in comparison with real machine
code? Well, not much\ldots{}in principle. Real machine code,
of course, contains many more arithmetic instructions (not
just addition and multiplication) and many more conditional
jumps. We could add these to our language if we wanted, but
complexity is really beside the point here. Furthermore, real
machine code has many instructions for manipulating memory. We
do not have this at all. This is actually a more serious
simplification because we assume numbers to be arbitrary small
or large, which is not the case with real machine code. In
real code basic number formats have a range and might
over-flow or under-flow from this range. Also the number of
variables in our programs is potentially unlimited, while
memory in an actual computer, of course, is always limited
somehow on any actual. To sum up, our language might look
ridiculously simple, but it is not much removed from
practically relevant issues.


\subsubsection*{An Interpreter}

Designing a language is like playing god: you can say what
names for variables you allow; what programs should look like;
most importantly you can decide what each part of the program
should mean and do. While our language is rather simple and
the meaning of statements, for example, is rather
straightforward, there are still places where we need to make
real choices. For example consider the conditional jumps, say
the one in the factorial program: 

\begin{center}
\code{jmp? n = 0 done}
\end{center}

\noindent How should they work? We could introduce Booleans
(\pcode{true} and \pcode{false}) and then jump only when the
condition is \pcode{true}. However, since we have numbers in
our language anyway, why not just encoding \pcode{true} as
one, and \pcode{false} as zero? In this way we can dispense
with the additional concept of Booleans, {\bf but also we could
replace the jump above by

\begin{center}
\code{jmp? 1 + (n + -n) done}
\end{center}

\noindent which behaves exactly the same. But what does it
actually mean that two jumps behave the same? Or two programs
for that matter?}

I hope the above discussion makes it already clear we need to
be a bit more careful with our programs. Below we shall
describe an interpreter for our programming language, which
specifies exactly how programs are supposed to be
run\ldots{}at least we will specify this for all \emph{good}
programs. By good programs I mean where all variables are
initialised, for example. Our interpreter will just crash if
it cannot find out the value for a variable, in case it is not
initialised. Also we will assume that labels in good programs
are unique, otherwise our programs will calculate ``garbage''.

First we will pre-process our programs. This will simplify the
definition of our interpreter later on. We will transform
programs into \emph{snippets}. A snippet is a label and all
code that comes after the label. This essentially means a
snippet is a \emph{map} from labels to code. Given that
programs are sequences (or lists) of statements, we can easily
calculate the snippets by just recursively generating the map.
Suppose a program is of the general form

\begin{center}
$stmt_1\;stmt_2\; \ldots\; stmt_n$
\end{center}

\noindent In order to calculate the snippets of such a
program, we have to go through the statements one by one and
check whether they are a label. If yes, we add the label and
the remaining statements to our map. If no we just continue
with the next statement. To come up with a recursive
definition for generating snippets, let us write $[]$ for the
program that does not contain any statement. Consider the
following definition:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{snippets}([])$ & $\dn$ & $\varnothing$\\
$\textit{snippets}(stmt\;\; rest)$ & $\dn$ &
$\begin{cases}
   \textit{snippets}(rest)[label := rest] & \text{if}\;stmt = \textit{label:}\\
   \textit{snippets}(rest)                & \text{otherwise}
\end{cases}$   
\end{tabular}
\end{center}

\noindent In the first clause we just return the empty map for
the program that does not contain any statements. In the
second clause, we have to distinguish the case where the first
statement is a label or not. If not, then we just ``throw
away'' the label and recursively calculate the snippets for
the rest of the program. If yes, then we do the same, but also
update the map so that $label$ now points to the rest of the
statements. There is one small problem we need to overcome:
our two programs have no label as entry point---that is where
the execution starts. We always assume that the first
statement will be run first. For us it seems convenient if
we add to all our programs a default label, say 
\pcode{""} (the empty string). With this we can define
our preprocessing of programs as follows

\begin{center}
$\textit{preproc}(prog) \dn \textit{snippets}(\pcode{"":}\;\; prog)$ 
\end{center} 

\noindent Let us see how this pans out in practice. If we
preprocess the factorial program shown earlier we obtain the 
following map:

\begin{center}
\small
\lstset{numbers=none,
        language={},
        xleftmargin=0mm,
        aboveskip=0.5mm,
        frame=single,
        framerule=0mm,
        framesep=0mm}
\begin{tikzpicture}[node distance=0mm]
\node (A1) [draw]{\pcode{""}};
\node (B1) [right=of A1] {$\mapsto$};
\node (C1) [right=of B1,anchor=north west] {
\begin{minipage}{3.5cm}
\begin{lstlisting}[language={},xleftmargin=0mm]
  a := 1
  n := 5 
top:  
  jmp? n = 0 done 
  a := a * n 
  n := n + -1 
  goto top 
done:
\end{lstlisting}
\end{minipage}};
\node (A2) [right=of C1.north east,draw] {\pcode{top}};
\node (B2) [right=of A2] {$\mapsto$};
\node (C2) [right=of B2, anchor=north west] {
\begin{minipage}{3.5cm}
\begin{lstlisting}[language={},xleftmargin=0mm]
  jmp? n = 0 done 
  a := a * n 
  n := n + -1 
  goto top 
done:
\end{lstlisting} 
\end{minipage}}; 
\node (A3) [right=of C2.north east,draw] {\pcode{done}};
\node (B3) [right=of A3] {$\mapsto$};
\node (C3) [right=of B3] {$[]$};
\end{tikzpicture}
\end{center}

\noindent I highlighted the \emph{keys} in this map. Since
there are three labels in the factorial program, there are 
three keys.

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: