handouts/ho10.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Sat, 04 Sep 2021 14:08:09 +0100
changeset 834 d3e38dd3b449
parent 675 d665e7dd66d7
child 871 94b84d880c2b
permissions -rw-r--r--
cwupdates

% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../grammar}
\usepackage{multicol}

%%\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}

\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014}

%% why are shuttle flights so good with software
%%http://www.fastcompany.com/28121/they-write-right-stuff

\section*{Handout 10 (Static Analysis)}

If we want to improve the safety and security of our programs,
we need a more principled approach to programming. Testing is
good, but as Edsger Dijkstra famously wrote: 

\begin{quote}\it 
``Program testing can be a very effective way to show the
\underline{\smash{presence}} of bugs, but it is hopelessly
inadequate for showing their \underline{\smash{absence}}.''
\end{quote}

\noindent While such a more principled approach has been the
subject of intense study for a long, long time, only in the
past few years some impressive results have been achieved. One
is the complete formalisation and (mathematical) verification
of a microkernel operating system called seL4.

\begin{center}
\url{http://sel4.systems}
\end{center}

\noindent In 2011 this work was included in the MIT Technology
Review in the annual list of the world’s ten most important
emerging
technologies.\footnote{\url{http://www2.technologyreview.com/tr10/?year=2011}}
While this work is impressive, its technical details are too
enormous for an explanation here. Therefore let us look at
something much simpler, namely finding out properties about
programs using \emph{static analysis}.

Static analysis is a technique that checks properties of a
program without actually running the program. This should
raise alarm bells with you---because almost all interesting
properties about programs  are equivalent to the halting
problem, which we know is undecidable. For example estimating
the memory consumption of programs is in general undecidable,
just like the halting problem. Static analysis circumvents
this undecidability-problem by essentially allowing answers
\emph{yes} and \emph{no}, but also \emph{don't know}. With
this ``trick'' even the halting problem becomes
decidable\ldots{}for example we could always say \emph{don't
know}. Of course this would be silly. The point is that we
should be striving for a method that answers as often as
possible either \emph{yes} or \emph{no}---just in cases when
it is too difficult we fall back on the
\emph{don't-know}-answer. This might sound all like abstract
nonsense. Therefore let us look at a concrete example.


\subsubsection*{A Simple, Idealised Programming Language}

Our starting point is a small, idealised programming language.
It is idealised because we cut several corners in comparison
with real programming languages. The language we will study
contains, amongst other things, variables holding integers.
Using static analysis, we want to find out what the sign of
these integers (positive or negative) will be when the program
runs. This sign-analysis seems like a very simple problem. But
even such simple problems, if approached naively, are in
general undecidable, just like Turing's halting problem. I let
you think why?


Is sign-analysis of variables an interesting problem? Well,
yes---if a compiler can find out that for example a variable
will never be negative and this variable is used as an index
for an array, then the compiler does not need to generate code
for an underflow-check. Remember some languages are immune to
buffer-overflow attacks, but they need to add underflow and
overflow checks everywhere. According to John Regher, an
expert in the field of compilers, overflow checks can cause
5-10\% slowdown, in some languages even 100\% for tight
loops.\footnote{\url{http://blog.regehr.org/archives/1154}} If
the compiler can omit the underflow check, for example, then
this can potentially drastically speed up the generated code. 

What do programs in our simple programming language look like?
The following grammar gives a first specification:

\begin{multicols}{2}
\begin{plstx}[rhs style=,one per line,left margin=9mm]
: \meta{Stmt} ::= \meta{label} \texttt{:}
                | \meta{var} \texttt{:=} \meta{Exp}
                | \texttt{jmp?} \meta{Exp} \meta{label}
                | \texttt{goto} \meta{label}\\
: \meta{Prog} ::= \meta{Stmt} \ldots{} \meta{Stmt}\\
\end{plstx}
\columnbreak
\begin{plstx}[rhs style=,one per line]
: \meta{Exp} ::= \meta{Exp} \texttt{+} \meta{Exp}
               | \meta{Exp} \texttt{*} \meta{Exp}
               | \meta{Exp} \texttt{=} \meta{Exp} 
               | \meta{num}
               | \meta{var}\\
\end{plstx}
\end{multicols}

\noindent I assume you are familiar with such
grammars.\footnote{\url{http://en.wikipedia.org/wiki/Backus–Naur_Form}}
There are three main syntactic categories: \emph{statments}
and \emph{expressions} as well as \emph{programs}, which are
sequences of statements. Statements are either labels,
variable assignments, conditional jumps (\pcode{jmp?}) and
unconditional jumps (\pcode{goto}). Labels are just strings,
which can be used as the target of a jump. We assume that in
every program the labels are unique---if there is a clash,
then we do not know where to jump to. The conditional jumps
and variable assignments involve (arithmetic) expressions.
Expressions are either numbers, variables or compound
expressions built up from \pcode{+}, \pcode{*} and \emph{=}
(for simplicity reasons we do not consider any other
operations). We assume we have negative and positive numbers,
\ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
\pcode{2}\ldots{} An example program that calculates the
factorial of 5 is in our programming language as follows:

\begin{lstlisting}[language={},xleftmargin=10mm]
      a := 1
      n := 5 
top:  
      jmp? n = 0 done 
      a := a * n 
      n := n + -1 
      goto top 
done:
\end{lstlisting}

\noindent As can be seen each line of the program contains a
statement. In the first two lines we assign values to the
variables \pcode{a} and \pcode{n}. In line 4 we test whether
\pcode{n} is zero, in which case we jump to the end of the
program marked with the label \pcode{done}. If \pcode{n} is
not zero, we multiply the content of \pcode{a} by \pcode{n},
decrease \pcode{n} by one and jump back to the beginning of
the loop, marked with the label \pcode{top}. Another program
in our language is shown in Figure~\ref{fib}. I let you think
what it calculates.
 
\begin{figure}[t]
\begin{lstlisting}[numbers=none,
                   language={},xleftmargin=10mm]
      n := 6
      m1 := 0
      m2 := 1
loop: 
      jmp? n = 0 done
      tmp := m2
      m2 := m1 + m2
      m1 := tmp
      n := n + -1
      goto top
done:
\end{lstlisting}
\caption{A mystery program in our idealised programming language.
Try to find out what it calculates! \label{fib}}
\end{figure}

Even if our language is rather small, it is still Turing
complete---meaning quite powerful. However, discussing this
fact in more detail would lead us too far astray. Clearly, our
programming is rather low-level and not very comfortable for
writing programs. It is inspired by real machine code, which
is the code that is executed by a CPU. So a more interesting
question is what is missing in comparison with real machine
code? Well, not much\ldots{}in principle. Real machine code,
of course, contains many more arithmetic instructions (not
just addition and multiplication) and many more conditional
jumps. We could add these to our language if we wanted, but
complexity is really beside the point here. Furthermore, real
machine code has many instructions for manipulating memory. We
do not have this at all. This is actually a more serious
simplification because we assume numbers to be arbitrary small
or large, which is not the case with real machine code. In
real machine code, basic number formats have a range and might
over-flow or under-flow this range. Also the number of
variables in our programs is potentially unlimited, while
memory in an actual computer, of course, is always limited. To
sum up, our language might look ridiculously simple, but it is
not too far removed from practically relevant issues.


\subsubsection*{An Interpreter}

Designing a language is like playing god: you can say what
names for variables you allow; what programs should look like;
most importantly you can decide what each part of the program
should mean and do. While our language is quite simple and the
meaning of statements, for example, is rather straightforward,
there are still places where we need to make real choices. For
example consider the conditional jumps, say the one in the
factorial program: 

\begin{center}
\code{jmp? n = 0 done}
\end{center}

\noindent How should they work? We could introduce Booleans
(\pcode{true} and \pcode{false}) and then jump only when the
condition is \pcode{true}. However, since we have numbers in
our language anyway, why not just encoding \pcode{true} as
one, and \pcode{false} as zero? In this way we can dispense
with the additional concept of Booleans.

I hope the above discussion makes it already clear we need to
be a bit more careful with our programs. Below we shall
describe an interpreter for our programming language, which
specifies exactly how programs are supposed to be
run\ldots{}at least we will specify this for all \emph{good}
programs. By good programs I mean where all variables are
initialised, for example. Our interpreter will just crash if
it cannot find out the value for a variable when it is not
initialised. Also, we will assume that labels in good programs
are unique, otherwise our programs will calculate ``garbage''.

First we will pre-process our programs. This will simplify the
definition of our interpreter later on. By pre-processing our
programs we will transform programs into \emph{snippets}. A
snippet is a label and all the code that comes after the
label. This essentially means a snippet is a \emph{map} from
labels to code.\footnote{Be sure you know what maps are. In a
programming context they are often represented as association
list where some data is associated with a key.} 

Given that programs are sequences (or lists) of statements, we
can easily calculate the snippets by just traversing this
sequence and recursively generating the map. Suppose a program
is of the general form

\begin{center}
$stmt_1\;stmt_2\; \ldots\; stmt_n$
\end{center}

\noindent The idea is to go through this sequence of
statements one by one and check whether they are a label. If
yes, we add the label and the remaining statements to our map.
If no, we just continue with the next statement. To come up
with a recursive definition for generating snippets, let us
write $[]$ for the program that does not contain any
statement. Consider the following definition:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{snippets}([])$ & $\dn$ & $\varnothing$\\
$\textit{snippets}(stmt\;\; rest)$ & $\dn$ &
$\begin{cases}
   \textit{snippets}(rest)[label := rest] & \text{if}\;stmt = \textit{label:}\\
   \textit{snippets}(rest)                & \text{otherwise}
\end{cases}$   
\end{tabular}
\end{center}

\noindent In the first clause we just return the empty map for
the program that does not contain any statement. In the second
clause, we have to distinguish the case where the first
statement is a label or not. As said before, if not, then we
just ``throw away'' the label and recursively calculate the
snippets for the rest of the program (the otherwise clause).
If yes, then we do the same, but also update the map so that
$label$ now points to the rest of the statements (the if
clause). This looks all realtively straightforward, but there
is one small problem we need to overcome: our two programs
shown so far have no label as \emph{entry point}---that is
where the execution is supposed to start. We usually assume
that the first statement will be run first. To make this the
default, it is convenient if we add to all our programs a
default label, say \pcode{""} (the empty string). With this we
can define our pre-processing of programs as follows

\begin{center}
$\textit{preproc}(prog) \dn \textit{snippets}(\pcode{"":}\;\; prog)$ 
\end{center} 

\noindent Let us see how this pans out in practice. If we
pre-process the factorial program shown earlier, we obtain the 
following map:

\begin{center}
\small
\lstset{numbers=none,
        language={},
        xleftmargin=0mm,
        aboveskip=0.5mm,
        belowskip=0.5mm,
        frame=single,
        framerule=0mm,
        framesep=0mm}
\begin{tikzpicture}[node distance=0mm]
\node (A1) [draw]{\pcode{""}};
\node (B1) [right=of A1] {$\mapsto$};
\node (C1) [right=of B1,anchor=north west] {
\begin{minipage}{3.5cm}
\begin{lstlisting}[language={},xleftmargin=0mm]
  a := 1
  n := 5 
top:  
  jmp? n = 0 done 
  a := a * n 
  n := n + -1 
  goto top 
done:
\end{lstlisting}
\end{minipage}};
\node (A2) [right=of C1.north east,draw] {\pcode{top}};
\node (B2) [right=of A2] {$\mapsto$};
\node (C2) [right=of B2, anchor=north west] {
\begin{minipage}{3.5cm}
\begin{lstlisting}[language={},xleftmargin=0mm]
  jmp? n = 0 done 
  a := a * n 
  n := n + -1 
  goto top 
done:
\end{lstlisting} 
\end{minipage}}; 
\node (A3) [right=of C2.north east,draw] {\pcode{done}};
\node (B3) [right=of A3] {$\mapsto$};
\node (C3) [right=of B3] {$[]$};
\end{tikzpicture}
\end{center}

\noindent I highlighted the \emph{keys} in this map. Since
there are three labels in the factorial program (remember we
added \pcode{""}), there are three keys. When running the
factorial program and encountering a jump, then we only have
to consult this snippets-map in order to find out what the
next statements should be.

We should now be in the position to define how a program
should be run. In the context of interpreters, this
``running'' of programs is often called \emph{evaluation}. Let
us start with the definition of how expressions are evaluated.
A first attempt might be the following recursive function:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{eval\_exp}(\texttt{n})$ & $\dn$ & $n$
  \qquad\text{if}\; \texttt{n}\; \text{is a number like} 
  \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
\pcode{2}\ldots{}\\
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, 
                    \texttt{e}_\texttt{2})$ & 
  $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) + 
           \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, 
                    \texttt{e}_\texttt{2})$ & 
  $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) * 
           \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, 
                    \texttt{e}_\texttt{2})$ & 
  $\dn$ & $\begin{cases}
  1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}) =
                 \textit{eval\_exp}(\texttt{e}_\texttt{2})\\
  0 & \text{otherwise}
  \end{cases}$          
\end{tabular}
\end{center}

\noindent While this should look all rather intuitive`, still
be very careful. There is a subtlety which can be easily
overlooked: The function \textit{eval\_exp} takes an
expression of our programming language as input and returns a
number as output. Therefore whenever we encounter a number in
our program, we just return this number---this is defined in
the first clause above. Whenever we encounter an addition,
well then we first evaluate the left-hand side
$\texttt{e}_\texttt{1}$ of the addition (this will give a
number), then evaluate the right-hand side
$\texttt{e}_\texttt{2}$ (this gives another number), and
finally add both numbers together. Here is the subtlety: on
the left-hand side of the $\dn$ we have a \texttt{+} (in the
teletype font) which is the symbol for addition in our
programming language. On the right-hand side we have $+$ which
stands for the arithmetic operation from ``mathematics'' of
adding two numbers. These are rather different concepts---one
is a symbol (which we made up), and the other a mathematical
operation. When we will have a look at an actual
implementation of our interpreter, the mathematical operation
will be the function for addition from the programming
language in which we \underline{\smash{implement}} our
interpreter. While the \texttt{+} is just a symbol that is
used in our programming language. Clearly we have to use a
symbol that is a good mnemonic for addition, otherwise we will
confuse the programmers working with our language. Therefore
we use $\texttt{+}$. A similar choice is made for times in the
third clause and equality in the fourth clause. 

Remember I wrote at the beginning of this section about being
god when designing a programming language. You can see this
here: we need to give meaning to symbols. At the moment
however, we are a poor fallible god. Look again at the grammar
of our programming language and our definition. Clearly, an
expression can contain variables. So far we have ignored them.
What should our interpreter do with variables? They might
change during the evaluation of a program. For example the
variable \pcode{n} in the factorial program counts down from 5
down to 0. How can we improve our definition above to give also
an answer whenever our interpreter encounters a variable in an
expression? The solution is to add an \emph{environment},
written $env$, as an additional input argument to our
\textit{eval\_exp} function.
 
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
$\textit{eval\_exp}(\texttt{n}, env)$ & $\dn$ & $n$
  \qquad\text{if}\; \texttt{n}\; \text{is a number like} 
  \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},
\pcode{2}\ldots{}\\
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, 
                    \texttt{e}_\texttt{2}, env)$ & 
  $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) + 
           \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, 
                    \texttt{e}_\texttt{2}, env)$ & 
  $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) * 
           \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\
$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, 
                    \texttt{e}_\texttt{2}, env)$ & 
  $\dn$ & $\begin{cases}
  1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) =
                 \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)\\
  0 & \text{otherwise}
  \end{cases}$\\
$\textit{eval\_exp}(\texttt{x}, env)$ & $\dn$ & $env(x)$      
\end{tabular}
\end{center}
 
\noindent This environment $env$ also acts like a map: it
associates variables with their current values. For example
after evaluating the first two lines in our factorial
program, such an environment might look as follows

\begin{center}
\begin{tabular}{ll}
$\fbox{\texttt{a}} \mapsto \texttt{1}$ &
$\fbox{\texttt{n}} \mapsto \texttt{5}$
\end{tabular}
\end{center}

\noindent Again I highlighted the keys. In the clause for
variables, we can therefore consult this environment and
return whatever value is currently stored for this variable.
This is written $env(x)$---meaning we query this map with $x$
we obtain the corresponding number. You might ask what happens
if an environment does not contain any value for, say, the
variable $x$? Well, then our interpreter just ``crashes'', or
more precisely will raise an exception. In this case we have a
``bad'' program that tried to use a variable before it was
initialised. The programmer should not have done this. In a
real programming language, we would of course try a bit harder
and for example give an error at compile time, or design our
language in such a way that this can never happen. With the
second version of \textit{eval\_exp} we completed our
definition for evaluating expressions.
 
Next comes the evaluation function for statements. We define
this function in such a way that we recursively evaluate a
whole sequence of statements. Assume a program $p$ (you want
to evaluate) and its pre-processed snippets $sn$. Then we can
define:

\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{eval\_stmts}([], env)$ & $\dn$ & $env$\\
$\textit{eval\_stmts}(\texttt{label:}\;rest, env)$ &
  $\dn$ & $\textit{eval\_stmts}(rest, env)$ \\ 
$\textit{eval\_stmts}(\texttt{x\,:=\,e}\;rest, env)$ & 
  $\dn$ & $\textit{eval\_stmts}(rest, 
           env[x := \textit{eval\_exp}(\texttt{e}, env)])$\\  
$\textit{eval\_stmts}(\texttt{goto\,lbl}\;rest, env)$ 
 & $\dn$ & $\textit{eval\_stmts}(sn(\texttt{lbl}), env)$\\
$\textit{eval\_stmts}(\texttt{jmp?\,e\,lbl}\;rest, env)$ 
 & $\dn$ & $\begin{cases}\begin{array}{@{}l@{\hspace{-12mm}}r@{}}
 \textit{eval\_stmts}(sn(\texttt{lbl}), env)\\ 
 & \text{if}\;\textit{eval\_exp}(\texttt{e}, env) = 1\\
 \textit{eval\_stmts}(rest, env) & \text{otherwise}\\
 \end{array}\end{cases}$  
\end{tabular}
\end{center}

\noindent The first clause is for the empty program, or when
we arrived at the end of the program. In this case we just
return the environment. The second clause is for when the next
statement is a label. That means the program is of the form
$\texttt{label:}\;rest$ where the label is some string and
$rest$ stands for all following statements. This case is easy,
because our evaluation function just discards the label and
evaluates the rest of the statements (we already extracted all
important information about labels when we pre-processed our
programs and generated the snippets). The third clause is for
variable assignments. Again we just evaluate the rest for the
statements, but with a modified environment---since the
variable assignment is supposed to introduce a new variable or
change the current value of a variable. For this modification
of the environment we first evaluate the expression
$\texttt{e}$ using our evaluation function for expressions.
This gives us a number. Then we assign this number to the
variable $x$ in the environment. This modified environment
will be used to evaluate the rest of the program. The fourth
clause is for the unconditional jump to a label, called
\texttt{lbl}. That means we have to look up in our snippets
map $sn$ what are the next statements for this label.
Therefore we will continue with evaluating, not with the rest
of the program, but with the statements stored in the
snippets-map under the label $\texttt{lbl}$. The fifth clause
for conditional jumps is similar, but to decide whether to
make the jump we first need to evaluate the expression
$\texttt{e}$ in order to find out whether it is $1$. If yes,
we jump, otherwise we just continue with evaluating the rest
of the program.

Our interpreter works in two stages: First we pre-process our
program generating the snippets map $sn$, say. Second we call
the evaluation function with the default entry point and the
empty environment:

\begin{center}
$\textit{eval\_stmts}(sn(\pcode{""}), \varnothing)$
\end{center}
 
\noindent It is interesting to note that our interpreter when
it comes to the end of the program returns an environment. Our
programming language does not contain any constructs for input
and output. Therefore this environment is the only effect we
can observe when running the program (apart from that our
interpreter might need some time before finishing the
evaluation of the program and the CPU getting hot). Evaluating 
the factorial program with our interpreter we receive as
``answer''-environment

\begin{center}
\begin{tabular}{ll}
$\fbox{\texttt{a}} \mapsto \texttt{120}$ &
$\fbox{\texttt{n}} \mapsto \texttt{0}$
\end{tabular}
\end{center}

\noindent While the discussion above should have illustrated
the ideas, in order to do some serious calculations, we clearly
need to implement the interpreter.


\subsubsection*{Scala Code for the Interpreter}

Functional programming languages are very convenient for
implementations of interpreters. A good choice for a
functional programming language is Scala, a programming
language that combines functional and object-oriented
pro\-gramming-styles. It has received in the last five years or
so quite a bit of attention. One reason for this attention is
that, like the Java programming language, Scala compiles to
the Java Virtual Machine (JVM) and therefore Scala programs
can run under MacOSX, Linux and Windows.\footnote{There are
also experimental backends for Android and JavaScript.} Unlike
Java, however, Scala often allows programmers to write very
concise and elegant code. Some therefore say Scala is the much
better Java. A number of companies, The Guardian, Twitter,
Coursera, FourSquare, LinkedIn to name a few, either use Scala
exclusively in production code, or at least to some
substantial degree. If you want to try out Scala yourself, the
Scala compiler can be downloaded from

\begin{quote}
\url{http://www.scala-lang.org}
\end{quote}

Let us have a look at the Scala code shown in
Figure~\ref{code}. It shows the entire code for the
interpreter, though the implementation is admittedly no
frills.

\begin{figure}[t]
\small
\lstinputlisting[language=Scala]{../progs/inter.scala}
\caption{The entire code of the interpreter for our
idealised programming language.\label{code}}
\end{figure}


\subsubsection*{Static Analysis}

Finally we can come back to our original problem, namely 
finding out what the signs of variables are 

\begin{center}


\end{center}
 
\end{document}

%% list of static analysers for C
http://spinroot.com/static/index.html

%% NASA coding rules for C
http://pixelscommander.com/wp-content/uploads/2014/12/P10.pdf

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: