diff -r e0a41a1f24cf -r d665e7dd66d7 handouts/ho10.tex --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/handouts/ho10.tex Sun Oct 27 13:45:54 2019 +0000 @@ -0,0 +1,616 @@ +% !TEX program = xelatex +\documentclass{article} +\usepackage{../style} +\usepackage{../langs} +\usepackage{../graphics} +\usepackage{../grammar} +\usepackage{multicol} + +%%\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}} + +\begin{document} +\fnote{\copyright{} Christian Urban, King's College London, 2014} + +%% why are shuttle flights so good with software +%%http://www.fastcompany.com/28121/they-write-right-stuff + +\section*{Handout 10 (Static Analysis)} + +If we want to improve the safety and security of our programs, +we need a more principled approach to programming. Testing is +good, but as Edsger Dijkstra famously wrote: + +\begin{quote}\it +``Program testing can be a very effective way to show the +\underline{\smash{presence}} of bugs, but it is hopelessly +inadequate for showing their \underline{\smash{absence}}.'' +\end{quote} + +\noindent While such a more principled approach has been the +subject of intense study for a long, long time, only in the +past few years some impressive results have been achieved. One +is the complete formalisation and (mathematical) verification +of a microkernel operating system called seL4. + +\begin{center} +\url{http://sel4.systems} +\end{center} + +\noindent In 2011 this work was included in the MIT Technology +Review in the annual list of the world’s ten most important +emerging +technologies.\footnote{\url{http://www2.technologyreview.com/tr10/?year=2011}} +While this work is impressive, its technical details are too +enormous for an explanation here. Therefore let us look at +something much simpler, namely finding out properties about +programs using \emph{static analysis}. + +Static analysis is a technique that checks properties of a +program without actually running the program. This should +raise alarm bells with you---because almost all interesting +properties about programs are equivalent to the halting +problem, which we know is undecidable. For example estimating +the memory consumption of programs is in general undecidable, +just like the halting problem. Static analysis circumvents +this undecidability-problem by essentially allowing answers +\emph{yes} and \emph{no}, but also \emph{don't know}. With +this ``trick'' even the halting problem becomes +decidable\ldots{}for example we could always say \emph{don't +know}. Of course this would be silly. The point is that we +should be striving for a method that answers as often as +possible either \emph{yes} or \emph{no}---just in cases when +it is too difficult we fall back on the +\emph{don't-know}-answer. This might sound all like abstract +nonsense. Therefore let us look at a concrete example. + + +\subsubsection*{A Simple, Idealised Programming Language} + +Our starting point is a small, idealised programming language. +It is idealised because we cut several corners in comparison +with real programming languages. The language we will study +contains, amongst other things, variables holding integers. +Using static analysis, we want to find out what the sign of +these integers (positive or negative) will be when the program +runs. This sign-analysis seems like a very simple problem. But +even such simple problems, if approached naively, are in +general undecidable, just like Turing's halting problem. I let +you think why? + + +Is sign-analysis of variables an interesting problem? Well, +yes---if a compiler can find out that for example a variable +will never be negative and this variable is used as an index +for an array, then the compiler does not need to generate code +for an underflow-check. Remember some languages are immune to +buffer-overflow attacks, but they need to add underflow and +overflow checks everywhere. According to John Regher, an +expert in the field of compilers, overflow checks can cause +5-10\% slowdown, in some languages even 100\% for tight +loops.\footnote{\url{http://blog.regehr.org/archives/1154}} If +the compiler can omit the underflow check, for example, then +this can potentially drastically speed up the generated code. + +What do programs in our simple programming language look like? +The following grammar gives a first specification: + +\begin{multicols}{2} +\begin{plstx}[rhs style=,one per line,left margin=9mm] +: \meta{Stmt} ::= \meta{label} \texttt{:} + | \meta{var} \texttt{:=} \meta{Exp} + | \texttt{jmp?} \meta{Exp} \meta{label} + | \texttt{goto} \meta{label}\\ +: \meta{Prog} ::= \meta{Stmt} \ldots{} \meta{Stmt}\\ +\end{plstx} +\columnbreak +\begin{plstx}[rhs style=,one per line] : \meta{Exp} ::= \meta{Exp} \texttt{+} \meta{Exp} + | \meta{Exp} \texttt{*} \meta{Exp} + | \meta{Exp} \texttt{=} \meta{Exp} + | \meta{num} + | \meta{var}\\ +\end{plstx} +\end{multicols} + +\noindent I assume you are familiar with such +grammars.\footnote{\url{http://en.wikipedia.org/wiki/Backus–Naur_Form}} +There are three main syntactic categories: \emph{statments} +and \emph{expressions} as well as \emph{programs}, which are +sequences of statements. Statements are either labels, +variable assignments, conditional jumps (\pcode{jmp?}) and +unconditional jumps (\pcode{goto}). Labels are just strings, +which can be used as the target of a jump. We assume that in +every program the labels are unique---if there is a clash, +then we do not know where to jump to. The conditional jumps +and variable assignments involve (arithmetic) expressions. +Expressions are either numbers, variables or compound +expressions built up from \pcode{+}, \pcode{*} and \emph{=} +(for simplicity reasons we do not consider any other +operations). We assume we have negative and positive numbers, +\ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1}, +\pcode{2}\ldots{} An example program that calculates the +factorial of 5 is in our programming language as follows: + +\begin{lstlisting}[language={},xleftmargin=10mm] + a := 1 + n := 5 +top: + jmp? n = 0 done + a := a * n + n := n + -1 + goto top +done: +\end{lstlisting} + +\noindent As can be seen each line of the program contains a +statement. In the first two lines we assign values to the +variables \pcode{a} and \pcode{n}. In line 4 we test whether +\pcode{n} is zero, in which case we jump to the end of the +program marked with the label \pcode{done}. If \pcode{n} is +not zero, we multiply the content of \pcode{a} by \pcode{n}, +decrease \pcode{n} by one and jump back to the beginning of +the loop, marked with the label \pcode{top}. Another program +in our language is shown in Figure~\ref{fib}. I let you think +what it calculates. + +\begin{figure}[t] +\begin{lstlisting}[numbers=none, + language={},xleftmargin=10mm] + n := 6 + m1 := 0 + m2 := 1 +loop: + jmp? n = 0 done + tmp := m2 + m2 := m1 + m2 + m1 := tmp + n := n + -1 + goto top +done: +\end{lstlisting} +\caption{A mystery program in our idealised programming language. +Try to find out what it calculates! \label{fib}} +\end{figure} + +Even if our language is rather small, it is still Turing +complete---meaning quite powerful. However, discussing this +fact in more detail would lead us too far astray. Clearly, our +programming is rather low-level and not very comfortable for +writing programs. It is inspired by real machine code, which +is the code that is executed by a CPU. So a more interesting +question is what is missing in comparison with real machine +code? Well, not much\ldots{}in principle. Real machine code, +of course, contains many more arithmetic instructions (not +just addition and multiplication) and many more conditional +jumps. We could add these to our language if we wanted, but +complexity is really beside the point here. Furthermore, real +machine code has many instructions for manipulating memory. We +do not have this at all. This is actually a more serious +simplification because we assume numbers to be arbitrary small +or large, which is not the case with real machine code. In +real machine code, basic number formats have a range and might +over-flow or under-flow this range. Also the number of +variables in our programs is potentially unlimited, while +memory in an actual computer, of course, is always limited. To +sum up, our language might look ridiculously simple, but it is +not too far removed from practically relevant issues. + + +\subsubsection*{An Interpreter} + +Designing a language is like playing god: you can say what +names for variables you allow; what programs should look like; +most importantly you can decide what each part of the program +should mean and do. While our language is quite simple and the +meaning of statements, for example, is rather straightforward, +there are still places where we need to make real choices. For +example consider the conditional jumps, say the one in the +factorial program: + +\begin{center} +\code{jmp? n = 0 done} +\end{center} + +\noindent How should they work? We could introduce Booleans +(\pcode{true} and \pcode{false}) and then jump only when the +condition is \pcode{true}. However, since we have numbers in +our language anyway, why not just encoding \pcode{true} as +one, and \pcode{false} as zero? In this way we can dispense +with the additional concept of Booleans. + +I hope the above discussion makes it already clear we need to +be a bit more careful with our programs. Below we shall +describe an interpreter for our programming language, which +specifies exactly how programs are supposed to be +run\ldots{}at least we will specify this for all \emph{good} +programs. By good programs I mean where all variables are +initialised, for example. Our interpreter will just crash if +it cannot find out the value for a variable when it is not +initialised. Also, we will assume that labels in good programs +are unique, otherwise our programs will calculate ``garbage''. + +First we will pre-process our programs. This will simplify the +definition of our interpreter later on. By pre-processing our +programs we will transform programs into \emph{snippets}. A +snippet is a label and all the code that comes after the +label. This essentially means a snippet is a \emph{map} from +labels to code.\footnote{Be sure you know what maps are. In a +programming context they are often represented as association +list where some data is associated with a key.} + +Given that programs are sequences (or lists) of statements, we +can easily calculate the snippets by just traversing this +sequence and recursively generating the map. Suppose a program +is of the general form + +\begin{center} +$stmt_1\;stmt_2\; \ldots\; stmt_n$ +\end{center} + +\noindent The idea is to go through this sequence of +statements one by one and check whether they are a label. If +yes, we add the label and the remaining statements to our map. +If no, we just continue with the next statement. To come up +with a recursive definition for generating snippets, let us +write $[]$ for the program that does not contain any +statement. Consider the following definition: + +\begin{center} +\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} +$\textit{snippets}([])$ & $\dn$ & $\varnothing$\\ +$\textit{snippets}(stmt\;\; rest)$ & $\dn$ & +$\begin{cases} + \textit{snippets}(rest)[label := rest] & \text{if}\;stmt = \textit{label:}\\ + \textit{snippets}(rest) & \text{otherwise} +\end{cases}$ +\end{tabular} +\end{center} + +\noindent In the first clause we just return the empty map for +the program that does not contain any statement. In the second +clause, we have to distinguish the case where the first +statement is a label or not. As said before, if not, then we +just ``throw away'' the label and recursively calculate the +snippets for the rest of the program (the otherwise clause). +If yes, then we do the same, but also update the map so that +$label$ now points to the rest of the statements (the if +clause). This looks all realtively straightforward, but there +is one small problem we need to overcome: our two programs +shown so far have no label as \emph{entry point}---that is +where the execution is supposed to start. We usually assume +that the first statement will be run first. To make this the +default, it is convenient if we add to all our programs a +default label, say \pcode{""} (the empty string). With this we +can define our pre-processing of programs as follows + +\begin{center} +$\textit{preproc}(prog) \dn \textit{snippets}(\pcode{"":}\;\; prog)$ +\end{center} + +\noindent Let us see how this pans out in practice. If we +pre-process the factorial program shown earlier, we obtain the +following map: + +\begin{center} +\small +\lstset{numbers=none, + language={}, + xleftmargin=0mm, + aboveskip=0.5mm, + belowskip=0.5mm, + frame=single, + framerule=0mm, + framesep=0mm} +\begin{tikzpicture}[node distance=0mm] +\node (A1) [draw]{\pcode{""}}; +\node (B1) [right=of A1] {$\mapsto$}; +\node (C1) [right=of B1,anchor=north west] { +\begin{minipage}{3.5cm} +\begin{lstlisting}[language={},xleftmargin=0mm] + a := 1 + n := 5 +top: + jmp? n = 0 done + a := a * n + n := n + -1 + goto top +done: +\end{lstlisting} +\end{minipage}}; +\node (A2) [right=of C1.north east,draw] {\pcode{top}}; +\node (B2) [right=of A2] {$\mapsto$}; +\node (C2) [right=of B2, anchor=north west] { +\begin{minipage}{3.5cm} +\begin{lstlisting}[language={},xleftmargin=0mm] + jmp? n = 0 done + a := a * n + n := n + -1 + goto top +done: +\end{lstlisting} +\end{minipage}}; +\node (A3) [right=of C2.north east,draw] {\pcode{done}}; +\node (B3) [right=of A3] {$\mapsto$}; +\node (C3) [right=of B3] {$[]$}; +\end{tikzpicture} +\end{center} + +\noindent I highlighted the \emph{keys} in this map. Since +there are three labels in the factorial program (remember we +added \pcode{""}), there are three keys. When running the +factorial program and encountering a jump, then we only have +to consult this snippets-map in order to find out what the +next statements should be. + +We should now be in the position to define how a program +should be run. In the context of interpreters, this +``running'' of programs is often called \emph{evaluation}. Let +us start with the definition of how expressions are evaluated. +A first attempt might be the following recursive function: + +\begin{center} +\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} +$\textit{eval\_exp}(\texttt{n})$ & $\dn$ & $n$ + \qquad\text{if}\; \texttt{n}\; \text{is a number like} + \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1}, +\pcode{2}\ldots{}\\ +$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, + \texttt{e}_\texttt{2})$ & + $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) + + \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\ +$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, + \texttt{e}_\texttt{2})$ & + $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) * + \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\ +$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, + \texttt{e}_\texttt{2})$ & + $\dn$ & $\begin{cases} + 1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}) = + \textit{eval\_exp}(\texttt{e}_\texttt{2})\\ + 0 & \text{otherwise} + \end{cases}$ +\end{tabular} +\end{center} + +\noindent While this should look all rather intuitive`, still +be very careful. There is a subtlety which can be easily +overlooked: The function \textit{eval\_exp} takes an +expression of our programming language as input and returns a +number as output. Therefore whenever we encounter a number in +our program, we just return this number---this is defined in +the first clause above. Whenever we encounter an addition, +well then we first evaluate the left-hand side +$\texttt{e}_\texttt{1}$ of the addition (this will give a +number), then evaluate the right-hand side +$\texttt{e}_\texttt{2}$ (this gives another number), and +finally add both numbers together. Here is the subtlety: on +the left-hand side of the $\dn$ we have a \texttt{+} (in the +teletype font) which is the symbol for addition in our +programming language. On the right-hand side we have $+$ which +stands for the arithmetic operation from ``mathematics'' of +adding two numbers. These are rather different concepts---one +is a symbol (which we made up), and the other a mathematical +operation. When we will have a look at an actual +implementation of our interpreter, the mathematical operation +will be the function for addition from the programming +language in which we \underline{\smash{implement}} our +interpreter. While the \texttt{+} is just a symbol that is +used in our programming language. Clearly we have to use a +symbol that is a good mnemonic for addition, otherwise we will +confuse the programmers working with our language. Therefore +we use $\texttt{+}$. A similar choice is made for times in the +third clause and equality in the fourth clause. + +Remember I wrote at the beginning of this section about being +god when designing a programming language. You can see this +here: we need to give meaning to symbols. At the moment +however, we are a poor fallible god. Look again at the grammar +of our programming language and our definition. Clearly, an +expression can contain variables. So far we have ignored them. +What should our interpreter do with variables? They might +change during the evaluation of a program. For example the +variable \pcode{n} in the factorial program counts down from 5 +down to 0. How can we improve our definition above to give also +an answer whenever our interpreter encounters a variable in an +expression? The solution is to add an \emph{environment}, +written $env$, as an additional input argument to our +\textit{eval\_exp} function. + +\begin{center} +\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} +$\textit{eval\_exp}(\texttt{n}, env)$ & $\dn$ & $n$ + \qquad\text{if}\; \texttt{n}\; \text{is a number like} + \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1}, +\pcode{2}\ldots{}\\ +$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, + \texttt{e}_\texttt{2}, env)$ & + $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) + + \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\ +$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, + \texttt{e}_\texttt{2}, env)$ & + $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) * + \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\ +$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, + \texttt{e}_\texttt{2}, env)$ & + $\dn$ & $\begin{cases} + 1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) = + \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)\\ + 0 & \text{otherwise} + \end{cases}$\\ +$\textit{eval\_exp}(\texttt{x}, env)$ & $\dn$ & $env(x)$ +\end{tabular} +\end{center} + +\noindent This environment $env$ also acts like a map: it +associates variables with their current values. For example +after evaluating the first two lines in our factorial +program, such an environment might look as follows + +\begin{center} +\begin{tabular}{ll} +$\fbox{\texttt{a}} \mapsto \texttt{1}$ & +$\fbox{\texttt{n}} \mapsto \texttt{5}$ +\end{tabular} +\end{center} + +\noindent Again I highlighted the keys. In the clause for +variables, we can therefore consult this environment and +return whatever value is currently stored for this variable. +This is written $env(x)$---meaning we query this map with $x$ +we obtain the corresponding number. You might ask what happens +if an environment does not contain any value for, say, the +variable $x$? Well, then our interpreter just ``crashes'', or +more precisely will raise an exception. In this case we have a +``bad'' program that tried to use a variable before it was +initialised. The programmer should not have done this. In a +real programming language, we would of course try a bit harder +and for example give an error at compile time, or design our +language in such a way that this can never happen. With the +second version of \textit{eval\_exp} we completed our +definition for evaluating expressions. + +Next comes the evaluation function for statements. We define +this function in such a way that we recursively evaluate a +whole sequence of statements. Assume a program $p$ (you want +to evaluate) and its pre-processed snippets $sn$. Then we can +define: + +\begin{center} +\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} +$\textit{eval\_stmts}([], env)$ & $\dn$ & $env$\\ +$\textit{eval\_stmts}(\texttt{label:}\;rest, env)$ & + $\dn$ & $\textit{eval\_stmts}(rest, env)$ \\ +$\textit{eval\_stmts}(\texttt{x\,:=\,e}\;rest, env)$ & + $\dn$ & $\textit{eval\_stmts}(rest, + env[x := \textit{eval\_exp}(\texttt{e}, env)])$\\ +$\textit{eval\_stmts}(\texttt{goto\,lbl}\;rest, env)$ + & $\dn$ & $\textit{eval\_stmts}(sn(\texttt{lbl}), env)$\\ +$\textit{eval\_stmts}(\texttt{jmp?\,e\,lbl}\;rest, env)$ + & $\dn$ & $\begin{cases}\begin{array}{@{}l@{\hspace{-12mm}}r@{}} + \textit{eval\_stmts}(sn(\texttt{lbl}), env)\\ + & \text{if}\;\textit{eval\_exp}(\texttt{e}, env) = 1\\ + \textit{eval\_stmts}(rest, env) & \text{otherwise}\\ + \end{array}\end{cases}$ +\end{tabular} +\end{center} + +\noindent The first clause is for the empty program, or when +we arrived at the end of the program. In this case we just +return the environment. The second clause is for when the next +statement is a label. That means the program is of the form +$\texttt{label:}\;rest$ where the label is some string and +$rest$ stands for all following statements. This case is easy, +because our evaluation function just discards the label and +evaluates the rest of the statements (we already extracted all +important information about labels when we pre-processed our +programs and generated the snippets). The third clause is for +variable assignments. Again we just evaluate the rest for the +statements, but with a modified environment---since the +variable assignment is supposed to introduce a new variable or +change the current value of a variable. For this modification +of the environment we first evaluate the expression +$\texttt{e}$ using our evaluation function for expressions. +This gives us a number. Then we assign this number to the +variable $x$ in the environment. This modified environment +will be used to evaluate the rest of the program. The fourth +clause is for the unconditional jump to a label, called +\texttt{lbl}. That means we have to look up in our snippets +map $sn$ what are the next statements for this label. +Therefore we will continue with evaluating, not with the rest +of the program, but with the statements stored in the +snippets-map under the label $\texttt{lbl}$. The fifth clause +for conditional jumps is similar, but to decide whether to +make the jump we first need to evaluate the expression +$\texttt{e}$ in order to find out whether it is $1$. If yes, +we jump, otherwise we just continue with evaluating the rest +of the program. + +Our interpreter works in two stages: First we pre-process our +program generating the snippets map $sn$, say. Second we call +the evaluation function with the default entry point and the +empty environment: + +\begin{center} +$\textit{eval\_stmts}(sn(\pcode{""}), \varnothing)$ +\end{center} + +\noindent It is interesting to note that our interpreter when +it comes to the end of the program returns an environment. Our +programming language does not contain any constructs for input +and output. Therefore this environment is the only effect we +can observe when running the program (apart from that our +interpreter might need some time before finishing the +evaluation of the program and the CPU getting hot). Evaluating +the factorial program with our interpreter we receive as +``answer''-environment + +\begin{center} +\begin{tabular}{ll} +$\fbox{\texttt{a}} \mapsto \texttt{120}$ & +$\fbox{\texttt{n}} \mapsto \texttt{0}$ +\end{tabular} +\end{center} + +\noindent While the discussion above should have illustrated +the ideas, in order to do some serious calculations, we clearly +need to implement the interpreter. + + +\subsubsection*{Scala Code for the Interpreter} + +Functional programming languages are very convenient for +implementations of interpreters. A good choice for a +functional programming language is Scala, a programming +language that combines functional and object-oriented +pro\-gramming-styles. It has received in the last five years or +so quite a bit of attention. One reason for this attention is +that, like the Java programming language, Scala compiles to +the Java Virtual Machine (JVM) and therefore Scala programs +can run under MacOSX, Linux and Windows.\footnote{There are +also experimental backends for Android and JavaScript.} Unlike +Java, however, Scala often allows programmers to write very +concise and elegant code. Some therefore say Scala is the much +better Java. A number of companies, The Guardian, Twitter, +Coursera, FourSquare, LinkedIn to name a few, either use Scala +exclusively in production code, or at least to some +substantial degree. If you want to try out Scala yourself, the +Scala compiler can be downloaded from + +\begin{quote} +\url{http://www.scala-lang.org} +\end{quote} + +Let us have a look at the Scala code shown in +Figure~\ref{code}. It shows the entire code for the +interpreter, though the implementation is admittedly no +frills. + +\begin{figure}[t] +\small +\lstinputlisting[language=Scala]{../progs/inter.scala} +\caption{The entire code of the interpreter for our +idealised programming language.\label{code}} +\end{figure} + + +\subsubsection*{Static Analysis} + +Finally we can come back to our original problem, namely +finding out what the signs of variables are + +\begin{center} + + +\end{center} + +\end{document} + +%% list of static analysers for C +http://spinroot.com/static/index.html + +%% NASA coding rules for C +http://pixelscommander.com/wp-content/uploads/2014/12/P10.pdf + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: t +%%% End: