diff -r 280e057558b8 -r 7a6e8f603e08 handouts/ho09.tex --- a/handouts/ho09.tex Sat Sep 23 19:39:53 2017 +0100 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,615 +0,0 @@ -\documentclass{article} -\usepackage{../style} -\usepackage{../langs} -\usepackage{../graphics} -\usepackage{../grammar} -\usepackage{multicol} - -\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}} - -\begin{document} -\fnote{\copyright{} Christian Urban, King's College London, 2014} - -%% why are shuttle flights so good with software -%%http://www.fastcompany.com/28121/they-write-right-stuff - -\section*{Handout 9 (Static Analysis)} - -If we want to improve the safety and security of our programs, -we need a more principled approach to programming. Testing is -good, but as Edsger Dijkstra famously wrote: - -\begin{quote}\it -``Program testing can be a very effective way to show the -\underline{\smash{presence}} of bugs, but it is hopelessly -inadequate for showing their \underline{\smash{absence}}.'' -\end{quote} - -\noindent While such a more principled approach has been the -subject of intense study for a long, long time, only in the -past few years some impressive results have been achieved. One -is the complete formalisation and (mathematical) verification -of a microkernel operating system called seL4. - -\begin{center} -\url{http://sel4.systems} -\end{center} - -\noindent In 2011 this work was included in the MIT Technology -Review in the annual list of the world’s ten most important -emerging -technologies.\footnote{\url{http://www2.technologyreview.com/tr10/?year=2011}} -While this work is impressive, its technical details are too -enormous for an explanation here. Therefore let us look at -something much simpler, namely finding out properties about -programs using \emph{static analysis}. - -Static analysis is a technique that checks properties of a -program without actually running the program. This should -raise alarm bells with you---because almost all interesting -properties about programs are equivalent to the halting -problem, which we know is undecidable. For example estimating -the memory consumption of programs is in general undecidable, -just like the halting problem. Static analysis circumvents -this undecidability-problem by essentially allowing answers -\emph{yes} and \emph{no}, but also \emph{don't know}. With -this ``trick'' even the halting problem becomes -decidable\ldots{}for example we could always say \emph{don't -know}. Of course this would be silly. The point is that we -should be striving for a method that answers as often as -possible either \emph{yes} or \emph{no}---just in cases when -it is too difficult we fall back on the -\emph{don't-know}-answer. This might sound all like abstract -nonsense. Therefore let us look at a concrete example. - - -\subsubsection*{A Simple, Idealised Programming Language} - -Our starting point is a small, idealised programming language. -It is idealised because we cut several corners in comparison -with real programming languages. The language we will study -contains, amongst other things, variables holding integers. -Using static analysis, we want to find out what the sign of -these integers (positive or negative) will be when the program -runs. This sign-analysis seems like a very simple problem. But -even such simple problems, if approached naively, are in -general undecidable, just like Turing's halting problem. I let -you think why? - - -Is sign-analysis of variables an interesting problem? Well, -yes---if a compiler can find out that for example a variable -will never be negative and this variable is used as an index -for an array, then the compiler does not need to generate code -for an underflow-check. Remember some languages are immune to -buffer-overflow attacks, but they need to add underflow and -overflow checks everywhere. According to John Regher, an -expert in the field of compilers, overflow checks can cause -5-10\% slowdown, in some languages even 100\% for tight -loops.\footnote{\url{http://blog.regehr.org/archives/1154}} If -the compiler can omit the underflow check, for example, then -this can potentially drastically speed up the generated code. - -What do programs in our simple programming language look like? -The following grammar gives a first specification: - -\begin{multicols}{2} -\begin{plstx}[rhs style=,one per line,left margin=9mm] -: \meta{Stmt} ::= \meta{label} \texttt{:} - | \meta{var} \texttt{:=} \meta{Exp} - | \texttt{jmp?} \meta{Exp} \meta{label} - | \texttt{goto} \meta{label}\\ -: \meta{Prog} ::= \meta{Stmt} \ldots{} \meta{Stmt}\\ -\end{plstx} -\columnbreak -\begin{plstx}[rhs style=,one per line] : \meta{Exp} ::= \meta{Exp} \texttt{+} \meta{Exp} - | \meta{Exp} \texttt{*} \meta{Exp} - | \meta{Exp} \texttt{=} \meta{Exp} - | \meta{num} - | \meta{var}\\ -\end{plstx} -\end{multicols} - -\noindent I assume you are familiar with such -grammars.\footnote{\url{http://en.wikipedia.org/wiki/Backus–Naur_Form}} -There are three main syntactic categories: \emph{statments} -and \emph{expressions} as well as \emph{programs}, which are -sequences of statements. Statements are either labels, -variable assignments, conditional jumps (\pcode{jmp?}) and -unconditional jumps (\pcode{goto}). Labels are just strings, -which can be used as the target of a jump. We assume that in -every program the labels are unique---if there is a clash, -then we do not know where to jump to. The conditional jumps -and variable assignments involve (arithmetic) expressions. -Expressions are either numbers, variables or compound -expressions built up from \pcode{+}, \pcode{*} and \emph{=} -(for simplicity reasons we do not consider any other -operations). We assume we have negative and positive numbers, -\ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1}, -\pcode{2}\ldots{} An example program that calculates the -factorial of 5 is in our programming language as follows: - -\begin{lstlisting}[language={},xleftmargin=10mm] - a := 1 - n := 5 -top: - jmp? n = 0 done - a := a * n - n := n + -1 - goto top -done: -\end{lstlisting} - -\noindent As can be seen each line of the program contains a -statement. In the first two lines we assign values to the -variables \pcode{a} and \pcode{n}. In line 4 we test whether -\pcode{n} is zero, in which case we jump to the end of the -program marked with the label \pcode{done}. If \pcode{n} is -not zero, we multiply the content of \pcode{a} by \pcode{n}, -decrease \pcode{n} by one and jump back to the beginning of -the loop, marked with the label \pcode{top}. Another program -in our language is shown in Figure~\ref{fib}. I let you think -what it calculates. - -\begin{figure}[t] -\begin{lstlisting}[numbers=none, - language={},xleftmargin=10mm] - n := 6 - m1 := 0 - m2 := 1 -loop: - jmp? n = 0 done - tmp := m2 - m2 := m1 + m2 - m1 := tmp - n := n + -1 - goto top -done: -\end{lstlisting} -\caption{A mystery program in our idealised programming language. -Try to find out what it calculates! \label{fib}} -\end{figure} - -Even if our language is rather small, it is still Turing -complete---meaning quite powerful. However, discussing this -fact in more detail would lead us too far astray. Clearly, our -programming is rather low-level and not very comfortable for -writing programs. It is inspired by real machine code, which -is the code that is executed by a CPU. So a more interesting -question is what is missing in comparison with real machine -code? Well, not much\ldots{}in principle. Real machine code, -of course, contains many more arithmetic instructions (not -just addition and multiplication) and many more conditional -jumps. We could add these to our language if we wanted, but -complexity is really beside the point here. Furthermore, real -machine code has many instructions for manipulating memory. We -do not have this at all. This is actually a more serious -simplification because we assume numbers to be arbitrary small -or large, which is not the case with real machine code. In -real machine code, basic number formats have a range and might -over-flow or under-flow this range. Also the number of -variables in our programs is potentially unlimited, while -memory in an actual computer, of course, is always limited. To -sum up, our language might look ridiculously simple, but it is -not too far removed from practically relevant issues. - - -\subsubsection*{An Interpreter} - -Designing a language is like playing god: you can say what -names for variables you allow; what programs should look like; -most importantly you can decide what each part of the program -should mean and do. While our language is quite simple and the -meaning of statements, for example, is rather straightforward, -there are still places where we need to make real choices. For -example consider the conditional jumps, say the one in the -factorial program: - -\begin{center} -\code{jmp? n = 0 done} -\end{center} - -\noindent How should they work? We could introduce Booleans -(\pcode{true} and \pcode{false}) and then jump only when the -condition is \pcode{true}. However, since we have numbers in -our language anyway, why not just encoding \pcode{true} as -one, and \pcode{false} as zero? In this way we can dispense -with the additional concept of Booleans. - -I hope the above discussion makes it already clear we need to -be a bit more careful with our programs. Below we shall -describe an interpreter for our programming language, which -specifies exactly how programs are supposed to be -run\ldots{}at least we will specify this for all \emph{good} -programs. By good programs I mean where all variables are -initialised, for example. Our interpreter will just crash if -it cannot find out the value for a variable when it is not -initialised. Also, we will assume that labels in good programs -are unique, otherwise our programs will calculate ``garbage''. - -First we will pre-process our programs. This will simplify the -definition of our interpreter later on. By pre-processing our -programs we will transform programs into \emph{snippets}. A -snippet is a label and all the code that comes after the -label. This essentially means a snippet is a \emph{map} from -labels to code.\footnote{Be sure you know what maps are. In a -programming context they are often represented as association -list where some data is associated with a key.} - -Given that programs are sequences (or lists) of statements, we -can easily calculate the snippets by just traversing this -sequence and recursively generating the map. Suppose a program -is of the general form - -\begin{center} -$stmt_1\;stmt_2\; \ldots\; stmt_n$ -\end{center} - -\noindent The idea is to go through this sequence of -statements one by one and check whether they are a label. If -yes, we add the label and the remaining statements to our map. -If no, we just continue with the next statement. To come up -with a recursive definition for generating snippets, let us -write $[]$ for the program that does not contain any -statement. Consider the following definition: - -\begin{center} -\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} -$\textit{snippets}([])$ & $\dn$ & $\varnothing$\\ -$\textit{snippets}(stmt\;\; rest)$ & $\dn$ & -$\begin{cases} - \textit{snippets}(rest)[label := rest] & \text{if}\;stmt = \textit{label:}\\ - \textit{snippets}(rest) & \text{otherwise} -\end{cases}$ -\end{tabular} -\end{center} - -\noindent In the first clause we just return the empty map for -the program that does not contain any statement. In the second -clause, we have to distinguish the case where the first -statement is a label or not. As said before, if not, then we -just ``throw away'' the label and recursively calculate the -snippets for the rest of the program (the otherwise clause). -If yes, then we do the same, but also update the map so that -$label$ now points to the rest of the statements (the if -clause). This looks all realtively straightforward, but there -is one small problem we need to overcome: our two programs -shown so far have no label as \emph{entry point}---that is -where the execution is supposed to start. We usually assume -that the first statement will be run first. To make this the -default, it is convenient if we add to all our programs a -default label, say \pcode{""} (the empty string). With this we -can define our pre-processing of programs as follows - -\begin{center} -$\textit{preproc}(prog) \dn \textit{snippets}(\pcode{"":}\;\; prog)$ -\end{center} - -\noindent Let us see how this pans out in practice. If we -pre-process the factorial program shown earlier, we obtain the -following map: - -\begin{center} -\small -\lstset{numbers=none, - language={}, - xleftmargin=0mm, - aboveskip=0.5mm, - belowskip=0.5mm, - frame=single, - framerule=0mm, - framesep=0mm} -\begin{tikzpicture}[node distance=0mm] -\node (A1) [draw]{\pcode{""}}; -\node (B1) [right=of A1] {$\mapsto$}; -\node (C1) [right=of B1,anchor=north west] { -\begin{minipage}{3.5cm} -\begin{lstlisting}[language={},xleftmargin=0mm] - a := 1 - n := 5 -top: - jmp? n = 0 done - a := a * n - n := n + -1 - goto top -done: -\end{lstlisting} -\end{minipage}}; -\node (A2) [right=of C1.north east,draw] {\pcode{top}}; -\node (B2) [right=of A2] {$\mapsto$}; -\node (C2) [right=of B2, anchor=north west] { -\begin{minipage}{3.5cm} -\begin{lstlisting}[language={},xleftmargin=0mm] - jmp? n = 0 done - a := a * n - n := n + -1 - goto top -done: -\end{lstlisting} -\end{minipage}}; -\node (A3) [right=of C2.north east,draw] {\pcode{done}}; -\node (B3) [right=of A3] {$\mapsto$}; -\node (C3) [right=of B3] {$[]$}; -\end{tikzpicture} -\end{center} - -\noindent I highlighted the \emph{keys} in this map. Since -there are three labels in the factorial program (remember we -added \pcode{""}), there are three keys. When running the -factorial program and encountering a jump, then we only have -to consult this snippets-map in order to find out what the -next statements should be. - -We should now be in the position to define how a program -should be run. In the context of interpreters, this -``running'' of programs is often called \emph{evaluation}. Let -us start with the definition of how expressions are evaluated. -A first attempt might be the following recursive function: - -\begin{center} -\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} -$\textit{eval\_exp}(\texttt{n})$ & $\dn$ & $n$ - \qquad\text{if}\; \texttt{n}\; \text{is a number like} - \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1}, -\pcode{2}\ldots{}\\ -$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, - \texttt{e}_\texttt{2})$ & - $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) + - \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\ -$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, - \texttt{e}_\texttt{2})$ & - $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) * - \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\ -$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, - \texttt{e}_\texttt{2})$ & - $\dn$ & $\begin{cases} - 1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}) = - \textit{eval\_exp}(\texttt{e}_\texttt{2})\\ - 0 & \text{otherwise} - \end{cases}$ -\end{tabular} -\end{center} - -\noindent While this should look all rather intuitive`, still -be very careful. There is a subtlety which can be easily -overlooked: The function \textit{eval\_exp} takes an -expression of our programming language as input and returns a -number as output. Therefore whenever we encounter a number in -our program, we just return this number---this is defined in -the first clause above. Whenever we encounter an addition, -well then we first evaluate the left-hand side -$\texttt{e}_\texttt{1}$ of the addition (this will give a -number), then evaluate the right-hand side -$\texttt{e}_\texttt{2}$ (this gives another number), and -finally add both numbers together. Here is the subtlety: on -the left-hand side of the $\dn$ we have a \texttt{+} (in the -teletype font) which is the symbol for addition in our -programming language. On the right-hand side we have $+$ which -stands for the arithmetic operation from ``mathematics'' of -adding two numbers. These are rather different concepts---one -is a symbol (which we made up), and the other a mathematical -operation. When we will have a look at an actual -implementation of our interpreter, the mathematical operation -will be the function for addition from the programming -language in which we \underline{\smash{implement}} our -interpreter. While the \texttt{+} is just a symbol that is -used in our programming language. Clearly we have to use a -symbol that is a good mnemonic for addition, otherwise we will -confuse the programmers working with our language. Therefore -we use $\texttt{+}$. A similar choice is made for times in the -third clause and equality in the fourth clause. - -Remember I wrote at the beginning of this section about being -god when designing a programming language. You can see this -here: we need to give meaning to symbols. At the moment -however, we are a poor fallible god. Look again at the grammar -of our programming language and our definition. Clearly, an -expression can contain variables. So far we have ignored them. -What should our interpreter do with variables? They might -change during the evaluation of a program. For example the -variable \pcode{n} in the factorial program counts down from 5 -down to 0. How can we improve our definition above to give also -an answer whenever our interpreter encounters a variable in an -expression? The solution is to add an \emph{environment}, -written $env$, as an additional input argument to our -\textit{eval\_exp} function. - -\begin{center} -\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} -$\textit{eval\_exp}(\texttt{n}, env)$ & $\dn$ & $n$ - \qquad\text{if}\; \texttt{n}\; \text{is a number like} - \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1}, -\pcode{2}\ldots{}\\ -$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, - \texttt{e}_\texttt{2}, env)$ & - $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) + - \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\ -$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, - \texttt{e}_\texttt{2}, env)$ & - $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) * - \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\ -$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, - \texttt{e}_\texttt{2}, env)$ & - $\dn$ & $\begin{cases} - 1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) = - \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)\\ - 0 & \text{otherwise} - \end{cases}$\\ -$\textit{eval\_exp}(\texttt{x}, env)$ & $\dn$ & $env(x)$ -\end{tabular} -\end{center} - -\noindent This environment $env$ also acts like a map: it -associates variables with their current values. For example -after evaluating the first two lines in our factorial -program, such an environment might look as follows - -\begin{center} -\begin{tabular}{ll} -$\fbox{\texttt{a}} \mapsto \texttt{1}$ & -$\fbox{\texttt{n}} \mapsto \texttt{5}$ -\end{tabular} -\end{center} - -\noindent Again I highlighted the keys. In the clause for -variables, we can therefore consult this environment and -return whatever value is currently stored for this variable. -This is written $env(x)$---meaning we query this map with $x$ -we obtain the corresponding number. You might ask what happens -if an environment does not contain any value for, say, the -variable $x$? Well, then our interpreter just ``crashes'', or -more precisely will raise an exception. In this case we have a -``bad'' program that tried to use a variable before it was -initialised. The programmer should not have done this. In a -real programming language, we would of course try a bit harder -and for example give an error at compile time, or design our -language in such a way that this can never happen. With the -second version of \textit{eval\_exp} we completed our -definition for evaluating expressions. - -Next comes the evaluation function for statements. We define -this function in such a way that we recursively evaluate a -whole sequence of statements. Assume a program $p$ (you want -to evaluate) and its pre-processed snippets $sn$. Then we can -define: - -\begin{center} -\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} -$\textit{eval\_stmts}([], env)$ & $\dn$ & $env$\\ -$\textit{eval\_stmts}(\texttt{label:}\;rest, env)$ & - $\dn$ & $\textit{eval\_stmts}(rest, env)$ \\ -$\textit{eval\_stmts}(\texttt{x\,:=\,e}\;rest, env)$ & - $\dn$ & $\textit{eval\_stmts}(rest, - env[x := \textit{eval\_exp}(\texttt{e}, env)])$\\ -$\textit{eval\_stmts}(\texttt{goto\,lbl}\;rest, env)$ - & $\dn$ & $\textit{eval\_stmts}(sn(\texttt{lbl}), env)$\\ -$\textit{eval\_stmts}(\texttt{jmp?\,e\,lbl}\;rest, env)$ - & $\dn$ & $\begin{cases}\begin{array}{@{}l@{\hspace{-12mm}}r@{}} - \textit{eval\_stmts}(sn(\texttt{lbl}), env)\\ - & \text{if}\;\textit{eval\_exp}(\texttt{e}, env) = 1\\ - \textit{eval\_stmts}(rest, env) & \text{otherwise}\\ - \end{array}\end{cases}$ -\end{tabular} -\end{center} - -\noindent The first clause is for the empty program, or when -we arrived at the end of the program. In this case we just -return the environment. The second clause is for when the next -statement is a label. That means the program is of the form -$\texttt{label:}\;rest$ where the label is some string and -$rest$ stands for all following statements. This case is easy, -because our evaluation function just discards the label and -evaluates the rest of the statements (we already extracted all -important information about labels when we pre-processed our -programs and generated the snippets). The third clause is for -variable assignments. Again we just evaluate the rest for the -statements, but with a modified environment---since the -variable assignment is supposed to introduce a new variable or -change the current value of a variable. For this modification -of the environment we first evaluate the expression -$\texttt{e}$ using our evaluation function for expressions. -This gives us a number. Then we assign this number to the -variable $x$ in the environment. This modified environment -will be used to evaluate the rest of the program. The fourth -clause is for the unconditional jump to a label, called -\texttt{lbl}. That means we have to look up in our snippets -map $sn$ what are the next statements for this label. -Therefore we will continue with evaluating, not with the rest -of the program, but with the statements stored in the -snippets-map under the label $\texttt{lbl}$. The fifth clause -for conditional jumps is similar, but to decide whether to -make the jump we first need to evaluate the expression -$\texttt{e}$ in order to find out whether it is $1$. If yes, -we jump, otherwise we just continue with evaluating the rest -of the program. - -Our interpreter works in two stages: First we pre-process our -program generating the snippets map $sn$, say. Second we call -the evaluation function with the default entry point and the -empty environment: - -\begin{center} -$\textit{eval\_stmts}(sn(\pcode{""}), \varnothing)$ -\end{center} - -\noindent It is interesting to note that our interpreter when -it comes to the end of the program returns an environment. Our -programming language does not contain any constructs for input -and output. Therefore this environment is the only effect we -can observe when running the program (apart from that our -interpreter might need some time before finishing the -evaluation of the program and the CPU getting hot). Evaluating -the factorial program with our interpreter we receive as -``answer''-environment - -\begin{center} -\begin{tabular}{ll} -$\fbox{\texttt{a}} \mapsto \texttt{120}$ & -$\fbox{\texttt{n}} \mapsto \texttt{0}$ -\end{tabular} -\end{center} - -\noindent While the discussion above should have illustrated -the ideas, in order to do some serious calculations, we clearly -need to implement the interpreter. - - -\subsubsection*{Scala Code for the Interpreter} - -Functional programming languages are very convenient for -implementations of interpreters. A good choice for a -functional programming language is Scala, a programming -language that combines functional and object-oriented -pro\-gramming-styles. It has received in the last five years or -so quite a bit of attention. One reason for this attention is -that, like the Java programming language, Scala compiles to -the Java Virtual Machine (JVM) and therefore Scala programs -can run under MacOSX, Linux and Windows.\footnote{There are -also experimental backends for Android and JavaScript.} Unlike -Java, however, Scala often allows programmers to write very -concise and elegant code. Some therefore say Scala is the much -better Java. A number of companies, The Guardian, Twitter, -Coursera, FourSquare, LinkedIn to name a few, either use Scala -exclusively in production code, or at least to some -substantial degree. If you want to try out Scala yourself, the -Scala compiler can be downloaded from - -\begin{quote} -\url{http://www.scala-lang.org} -\end{quote} - -Let us have a look at the Scala code shown in -Figure~\ref{code}. It shows the entire code for the -interpreter, though the implementation is admittedly no -frills. - -\begin{figure}[t] -\small -\lstinputlisting[language=Scala]{../progs/inter.scala} -\caption{The entire code of the interpreter for our -idealised programming language.\label{code}} -\end{figure} - - -\subsubsection*{Static Analysis} - -Finally we can come back to our original problem, namely -finding out what the signs of variables are - -\begin{center} - - -\end{center} - -\end{document} - -%% list of static analysers for C -http://spinroot.com/static/index.html - -%% NASA coding rules for C -http://pixelscommander.com/wp-content/uploads/2014/12/P10.pdf - -%%% Local Variables: -%%% mode: latex -%%% TeX-master: t -%%% End: