% !TEX program = xelatex\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\usepackage{../grammar}\usepackage{multicol}%%\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}\begin{document}\fnote{\copyright{} Christian Urban, King's College London, 2014}%% why are shuttle flights so good with software%%http://www.fastcompany.com/28121/they-write-right-stuff\section*{Handout 10 (Static Analysis)}If we want to improve the safety and security of our programs,we need a more principled approach to programming. Testing isgood, but as Edsger Dijkstra famously wrote: \begin{quote}\it ``Program testing can be a very effective way to show the\underline{\smash{presence}} of bugs, but it is hopelesslyinadequate for showing their \underline{\smash{absence}}.''\end{quote}\noindent While such a more principled approach has been thesubject of intense study for a long, long time, only in thepast few years some impressive results have been achieved. Oneis the complete formalisation and (mathematical) verificationof a microkernel operating system called seL4.\begin{center}\url{http://sel4.systems}\end{center}\noindent In 2011 this work was included in the MIT TechnologyReview in the annual list of the world’s ten most importantemergingtechnologies.\footnote{\url{http://www2.technologyreview.com/tr10/?year=2011}}While this work is impressive, its technical details are tooenormous for an explanation here. Therefore let us look atsomething much simpler, namely finding out properties aboutprograms using \emph{static analysis}.Static analysis is a technique that checks properties of aprogram without actually running the program. This shouldraise alarm bells with you---because almost all interestingproperties about programs are equivalent to the haltingproblem, which we know is undecidable. For example estimatingthe memory consumption of programs is in general undecidable,just like the halting problem. Static analysis circumventsthis undecidability-problem by essentially allowing answers\emph{yes} and \emph{no}, but also \emph{don't know}. Withthis ``trick'' even the halting problem becomesdecidable\ldots{}for example we could always say \emph{don'tknow}. Of course this would be silly. The point is that weshould be striving for a method that answers as often aspossible either \emph{yes} or \emph{no}---just in cases whenit is too difficult we fall back on the\emph{don't-know}-answer. This might sound all like abstractnonsense. Therefore let us look at a concrete example.\subsubsection*{A Simple, Idealised Programming Language}Our starting point is a small, idealised programming language.It is idealised because we cut several corners in comparisonwith real programming languages. The language we will studycontains, amongst other things, variables holding integers.Using static analysis, we want to find out what the sign ofthese integers (positive or negative) will be when the programruns. This sign-analysis seems like a very simple problem. Buteven such simple problems, if approached naively, are ingeneral undecidable, just like Turing's halting problem. I letyou think why?Is sign-analysis of variables an interesting problem? Well,yes---if a compiler can find out that for example a variablewill never be negative and this variable is used as an indexfor an array, then the compiler does not need to generate codefor an underflow-check. Remember some languages are immune tobuffer-overflow attacks, but they need to add underflow andoverflow checks everywhere. According to John Regher, anexpert in the field of compilers, overflow checks can cause5-10\% slowdown, in some languages even 100\% for tightloops.\footnote{\url{http://blog.regehr.org/archives/1154}} Ifthe compiler can omit the underflow check, for example, thenthis can potentially drastically speed up the generated code. What do programs in our simple programming language look like?The following grammar gives a first specification:\begin{multicols}{2}\begin{plstx}[rhs style=,one per line,left margin=9mm]: \meta{Stmt} ::= \meta{label} \texttt{:} | \meta{var} \texttt{:=} \meta{Exp} | \texttt{jmp?} \meta{Exp} \meta{label} | \texttt{goto} \meta{label}\\: \meta{Prog} ::= \meta{Stmt} \ldots{} \meta{Stmt}\\\end{plstx}\columnbreak\begin{plstx}[rhs style=,one per line]: \meta{Exp} ::= \meta{Exp} \texttt{+} \meta{Exp} | \meta{Exp} \texttt{*} \meta{Exp} | \meta{Exp} \texttt{=} \meta{Exp} | \meta{num} | \meta{var}\\\end{plstx}\end{multicols}\noindent I assume you are familiar with suchgrammars.\footnote{\url{http://en.wikipedia.org/wiki/Backus–Naur_Form}}There are three main syntactic categories: \emph{statments}and \emph{expressions} as well as \emph{programs}, which aresequences of statements. Statements are either labels,variable assignments, conditional jumps (\pcode{jmp?}) andunconditional jumps (\pcode{goto}). Labels are just strings,which can be used as the target of a jump. We assume that inevery program the labels are unique---if there is a clash,then we do not know where to jump to. The conditional jumpsand variable assignments involve (arithmetic) expressions.Expressions are either numbers, variables or compoundexpressions built up from \pcode{+}, \pcode{*} and \emph{=}(for simplicity reasons we do not consider any otheroperations). We assume we have negative and positive numbers,\ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},\pcode{2}\ldots{} An example program that calculates thefactorial of 5 is in our programming language as follows:\begin{lstlisting}[language={},xleftmargin=10mm] a := 1 n := 5 top: jmp? n = 0 done a := a * n n := n + -1 goto top done:\end{lstlisting}\noindent As can be seen each line of the program contains astatement. In the first two lines we assign values to thevariables \pcode{a} and \pcode{n}. In line 4 we test whether\pcode{n} is zero, in which case we jump to the end of theprogram marked with the label \pcode{done}. If \pcode{n} isnot zero, we multiply the content of \pcode{a} by \pcode{n},decrease \pcode{n} by one and jump back to the beginning ofthe loop, marked with the label \pcode{top}. Another programin our language is shown in Figure~\ref{fib}. I let you thinkwhat it calculates.\begin{figure}[t]\begin{lstlisting}[numbers=none, language={},xleftmargin=10mm] n := 6 m1 := 0 m2 := 1loop: jmp? n = 0 done tmp := m2 m2 := m1 + m2 m1 := tmp n := n + -1 goto topdone:\end{lstlisting}\caption{A mystery program in our idealised programming language.Try to find out what it calculates! \label{fib}}\end{figure}Even if our language is rather small, it is still Turingcomplete---meaning quite powerful. However, discussing thisfact in more detail would lead us too far astray. Clearly, ourprogramming is rather low-level and not very comfortable forwriting programs. It is inspired by real machine code, whichis the code that is executed by a CPU. So a more interestingquestion is what is missing in comparison with real machinecode? Well, not much\ldots{}in principle. Real machine code,of course, contains many more arithmetic instructions (notjust addition and multiplication) and many more conditionaljumps. We could add these to our language if we wanted, butcomplexity is really beside the point here. Furthermore, realmachine code has many instructions for manipulating memory. Wedo not have this at all. This is actually a more serioussimplification because we assume numbers to be arbitrary smallor large, which is not the case with real machine code. Inreal machine code, basic number formats have a range and mightover-flow or under-flow this range. Also the number ofvariables in our programs is potentially unlimited, whilememory in an actual computer, of course, is always limited. Tosum up, our language might look ridiculously simple, but it isnot too far removed from practically relevant issues.\subsubsection*{An Interpreter}Designing a language is like playing god: you can say whatnames for variables you allow; what programs should look like;most importantly you can decide what each part of the programshould mean and do. While our language is quite simple and themeaning of statements, for example, is rather straightforward,there are still places where we need to make real choices. Forexample consider the conditional jumps, say the one in thefactorial program: \begin{center}\code{jmp? n = 0 done}\end{center}\noindent How should they work? We could introduce Booleans(\pcode{true} and \pcode{false}) and then jump only when thecondition is \pcode{true}. However, since we have numbers inour language anyway, why not just encoding \pcode{true} asone, and \pcode{false} as zero? In this way we can dispensewith the additional concept of Booleans.I hope the above discussion makes it already clear we need tobe a bit more careful with our programs. Below we shalldescribe an interpreter for our programming language, whichspecifies exactly how programs are supposed to berun\ldots{}at least we will specify this for all \emph{good}programs. By good programs I mean where all variables areinitialised, for example. Our interpreter will just crash ifit cannot find out the value for a variable when it is notinitialised. Also, we will assume that labels in good programsare unique, otherwise our programs will calculate ``garbage''.First we will pre-process our programs. This will simplify thedefinition of our interpreter later on. By pre-processing ourprograms we will transform programs into \emph{snippets}. Asnippet is a label and all the code that comes after thelabel. This essentially means a snippet is a \emph{map} fromlabels to code.\footnote{Be sure you know what maps are. In aprogramming context they are often represented as associationlist where some data is associated with a key.} Given that programs are sequences (or lists) of statements, wecan easily calculate the snippets by just traversing thissequence and recursively generating the map. Suppose a programis of the general form\begin{center}$stmt_1\;stmt_2\; \ldots\; stmt_n$\end{center}\noindent The idea is to go through this sequence ofstatements one by one and check whether they are a label. Ifyes, we add the label and the remaining statements to our map.If no, we just continue with the next statement. To come upwith a recursive definition for generating snippets, let uswrite $[]$ for the program that does not contain anystatement. Consider the following definition:\begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}$\textit{snippets}([])$ & $\dn$ & $\varnothing$\\$\textit{snippets}(stmt\;\; rest)$ & $\dn$ &$\begin{cases} \textit{snippets}(rest)[label := rest] & \text{if}\;stmt = \textit{label:}\\ \textit{snippets}(rest) & \text{otherwise}\end{cases}$ \end{tabular}\end{center}\noindent In the first clause we just return the empty map forthe program that does not contain any statement. In the secondclause, we have to distinguish the case where the firststatement is a label or not. As said before, if not, then wejust ``throw away'' the label and recursively calculate thesnippets for the rest of the program (the otherwise clause).If yes, then we do the same, but also update the map so that$label$ now points to the rest of the statements (the ifclause). This looks all realtively straightforward, but thereis one small problem we need to overcome: our two programsshown so far have no label as \emph{entry point}---that iswhere the execution is supposed to start. We usually assumethat the first statement will be run first. To make this thedefault, it is convenient if we add to all our programs adefault label, say \pcode{""} (the empty string). With this wecan define our pre-processing of programs as follows\begin{center}$\textit{preproc}(prog) \dn \textit{snippets}(\pcode{"":}\;\; prog)$ \end{center} \noindent Let us see how this pans out in practice. If wepre-process the factorial program shown earlier, we obtain the following map:\begin{center}\small\lstset{numbers=none, language={}, xleftmargin=0mm, aboveskip=0.5mm, belowskip=0.5mm, frame=single, framerule=0mm, framesep=0mm}\begin{tikzpicture}[node distance=0mm]\node (A1) [draw]{\pcode{""}};\node (B1) [right=of A1] {$\mapsto$};\node (C1) [right=of B1,anchor=north west] {\begin{minipage}{3.5cm}\begin{lstlisting}[language={},xleftmargin=0mm] a := 1 n := 5 top: jmp? n = 0 done a := a * n n := n + -1 goto top done:\end{lstlisting}\end{minipage}};\node (A2) [right=of C1.north east,draw] {\pcode{top}};\node (B2) [right=of A2] {$\mapsto$};\node (C2) [right=of B2, anchor=north west] {\begin{minipage}{3.5cm}\begin{lstlisting}[language={},xleftmargin=0mm] jmp? n = 0 done a := a * n n := n + -1 goto top done:\end{lstlisting} \end{minipage}}; \node (A3) [right=of C2.north east,draw] {\pcode{done}};\node (B3) [right=of A3] {$\mapsto$};\node (C3) [right=of B3] {$[]$};\end{tikzpicture}\end{center}\noindent I highlighted the \emph{keys} in this map. Sincethere are three labels in the factorial program (remember weadded \pcode{""}), there are three keys. When running thefactorial program and encountering a jump, then we only haveto consult this snippets-map in order to find out what thenext statements should be.We should now be in the position to define how a programshould be run. In the context of interpreters, this``running'' of programs is often called \emph{evaluation}. Letus start with the definition of how expressions are evaluated.A first attempt might be the following recursive function:\begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}$\textit{eval\_exp}(\texttt{n})$ & $\dn$ & $n$ \qquad\text{if}\; \texttt{n}\; \text{is a number like} \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},\pcode{2}\ldots{}\\$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, \texttt{e}_\texttt{2})$ & $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) + \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, \texttt{e}_\texttt{2})$ & $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}) * \textit{eval\_exp}(\texttt{e}_\texttt{2})$\\$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, \texttt{e}_\texttt{2})$ & $\dn$ & $\begin{cases} 1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}) = \textit{eval\_exp}(\texttt{e}_\texttt{2})\\ 0 & \text{otherwise} \end{cases}$ \end{tabular}\end{center}\noindent While this should look all rather intuitive`, stillbe very careful. There is a subtlety which can be easilyoverlooked: The function \textit{eval\_exp} takes anexpression of our programming language as input and returns anumber as output. Therefore whenever we encounter a number inour program, we just return this number---this is defined inthe first clause above. Whenever we encounter an addition,well then we first evaluate the left-hand side$\texttt{e}_\texttt{1}$ of the addition (this will give anumber), then evaluate the right-hand side$\texttt{e}_\texttt{2}$ (this gives another number), andfinally add both numbers together. Here is the subtlety: onthe left-hand side of the $\dn$ we have a \texttt{+} (in theteletype font) which is the symbol for addition in ourprogramming language. On the right-hand side we have $+$ whichstands for the arithmetic operation from ``mathematics'' ofadding two numbers. These are rather different concepts---oneis a symbol (which we made up), and the other a mathematicaloperation. When we will have a look at an actualimplementation of our interpreter, the mathematical operationwill be the function for addition from the programminglanguage in which we \underline{\smash{implement}} ourinterpreter. While the \texttt{+} is just a symbol that isused in our programming language. Clearly we have to use asymbol that is a good mnemonic for addition, otherwise we willconfuse the programmers working with our language. Thereforewe use $\texttt{+}$. A similar choice is made for times in thethird clause and equality in the fourth clause. Remember I wrote at the beginning of this section about beinggod when designing a programming language. You can see thishere: we need to give meaning to symbols. At the momenthowever, we are a poor fallible god. Look again at the grammarof our programming language and our definition. Clearly, anexpression can contain variables. So far we have ignored them.What should our interpreter do with variables? They mightchange during the evaluation of a program. For example thevariable \pcode{n} in the factorial program counts down from 5down to 0. How can we improve our definition above to give alsoan answer whenever our interpreter encounters a variable in anexpression? The solution is to add an \emph{environment},written $env$, as an additional input argument to our\textit{eval\_exp} function.\begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}$\textit{eval\_exp}(\texttt{n}, env)$ & $\dn$ & $n$ \qquad\text{if}\; \texttt{n}\; \text{is a number like} \ldots \pcode{-2}, \pcode{-1}, \pcode{0}, \pcode{1},\pcode{2}\ldots{}\\$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{+}\, \texttt{e}_\texttt{2}, env)$ & $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) + \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{*}\, \texttt{e}_\texttt{2}, env)$ & $\dn$ & $\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) * \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)$\\$\textit{eval\_exp}(\texttt{e}_\texttt{1} \,\texttt{=}\, \texttt{e}_\texttt{2}, env)$ & $\dn$ & $\begin{cases} 1 & \text{if}\;\textit{eval\_exp}(\texttt{e}_\texttt{1}, env) = \textit{eval\_exp}(\texttt{e}_\texttt{2}, env)\\ 0 & \text{otherwise} \end{cases}$\\$\textit{eval\_exp}(\texttt{x}, env)$ & $\dn$ & $env(x)$ \end{tabular}\end{center}\noindent This environment $env$ also acts like a map: itassociates variables with their current values. For exampleafter evaluating the first two lines in our factorialprogram, such an environment might look as follows\begin{center}\begin{tabular}{ll}$\fbox{\texttt{a}} \mapsto \texttt{1}$ &$\fbox{\texttt{n}} \mapsto \texttt{5}$\end{tabular}\end{center}\noindent Again I highlighted the keys. In the clause forvariables, we can therefore consult this environment andreturn whatever value is currently stored for this variable.This is written $env(x)$---meaning we query this map with $x$we obtain the corresponding number. You might ask what happensif an environment does not contain any value for, say, thevariable $x$? Well, then our interpreter just ``crashes'', ormore precisely will raise an exception. In this case we have a``bad'' program that tried to use a variable before it wasinitialised. The programmer should not have done this. In areal programming language, we would of course try a bit harderand for example give an error at compile time, or design ourlanguage in such a way that this can never happen. With thesecond version of \textit{eval\_exp} we completed ourdefinition for evaluating expressions.Next comes the evaluation function for statements. We definethis function in such a way that we recursively evaluate awhole sequence of statements. Assume a program $p$ (you wantto evaluate) and its pre-processed snippets $sn$. Then we candefine:\begin{center}\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}$\textit{eval\_stmts}([], env)$ & $\dn$ & $env$\\$\textit{eval\_stmts}(\texttt{label:}\;rest, env)$ & $\dn$ & $\textit{eval\_stmts}(rest, env)$ \\ $\textit{eval\_stmts}(\texttt{x\,:=\,e}\;rest, env)$ & $\dn$ & $\textit{eval\_stmts}(rest, env[x := \textit{eval\_exp}(\texttt{e}, env)])$\\ $\textit{eval\_stmts}(\texttt{goto\,lbl}\;rest, env)$ & $\dn$ & $\textit{eval\_stmts}(sn(\texttt{lbl}), env)$\\$\textit{eval\_stmts}(\texttt{jmp?\,e\,lbl}\;rest, env)$ & $\dn$ & $\begin{cases}\begin{array}{@{}l@{\hspace{-12mm}}r@{}} \textit{eval\_stmts}(sn(\texttt{lbl}), env)\\ & \text{if}\;\textit{eval\_exp}(\texttt{e}, env) = 1\\ \textit{eval\_stmts}(rest, env) & \text{otherwise}\\ \end{array}\end{cases}$ \end{tabular}\end{center}\noindent The first clause is for the empty program, or whenwe arrived at the end of the program. In this case we justreturn the environment. The second clause is for when the nextstatement is a label. That means the program is of the form$\texttt{label:}\;rest$ where the label is some string and$rest$ stands for all following statements. This case is easy,because our evaluation function just discards the label andevaluates the rest of the statements (we already extracted allimportant information about labels when we pre-processed ourprograms and generated the snippets). The third clause is forvariable assignments. Again we just evaluate the rest for thestatements, but with a modified environment---since thevariable assignment is supposed to introduce a new variable orchange the current value of a variable. For this modificationof the environment we first evaluate the expression$\texttt{e}$ using our evaluation function for expressions.This gives us a number. Then we assign this number to thevariable $x$ in the environment. This modified environmentwill be used to evaluate the rest of the program. The fourthclause is for the unconditional jump to a label, called\texttt{lbl}. That means we have to look up in our snippetsmap $sn$ what are the next statements for this label.Therefore we will continue with evaluating, not with the restof the program, but with the statements stored in thesnippets-map under the label $\texttt{lbl}$. The fifth clausefor conditional jumps is similar, but to decide whether tomake the jump we first need to evaluate the expression$\texttt{e}$ in order to find out whether it is $1$. If yes,we jump, otherwise we just continue with evaluating the restof the program.Our interpreter works in two stages: First we pre-process ourprogram generating the snippets map $sn$, say. Second we callthe evaluation function with the default entry point and theempty environment:\begin{center}$\textit{eval\_stmts}(sn(\pcode{""}), \varnothing)$\end{center}\noindent It is interesting to note that our interpreter whenit comes to the end of the program returns an environment. Ourprogramming language does not contain any constructs for inputand output. Therefore this environment is the only effect wecan observe when running the program (apart from that ourinterpreter might need some time before finishing theevaluation of the program and the CPU getting hot). Evaluating the factorial program with our interpreter we receive as``answer''-environment\begin{center}\begin{tabular}{ll}$\fbox{\texttt{a}} \mapsto \texttt{120}$ &$\fbox{\texttt{n}} \mapsto \texttt{0}$\end{tabular}\end{center}\noindent While the discussion above should have illustratedthe ideas, in order to do some serious calculations, we clearlyneed to implement the interpreter.\subsubsection*{Scala Code for the Interpreter}Functional programming languages are very convenient forimplementations of interpreters. A good choice for afunctional programming language is Scala, a programminglanguage that combines functional and object-orientedpro\-gramming-styles. It has received in the last five years orso quite a bit of attention. One reason for this attention isthat, like the Java programming language, Scala compiles tothe Java Virtual Machine (JVM) and therefore Scala programscan run under MacOSX, Linux and Windows.\footnote{There arealso experimental backends for Android and JavaScript.} UnlikeJava, however, Scala often allows programmers to write veryconcise and elegant code. Some therefore say Scala is the muchbetter Java. A number of companies, The Guardian, Twitter,Coursera, FourSquare, LinkedIn to name a few, either use Scalaexclusively in production code, or at least to somesubstantial degree. If you want to try out Scala yourself, theScala compiler can be downloaded from\begin{quote}\url{http://www.scala-lang.org}\end{quote}Let us have a look at the Scala code shown inFigure~\ref{code}. It shows the entire code for theinterpreter, though the implementation is admittedly nofrills.%\begin{figure}[t]%\small%\lstinputlisting[language=Scala]{../progs/inter.scala}%\caption{The entire code of the interpreter for our%idealised programming language.\label{code}}%\end{figure}\subsubsection*{Static Analysis}Finally we can come back to our original problem, namely finding out what the signs of variables are \begin{center}\end{center}\end{document}%% list of static analysers for Chttp://spinroot.com/static/index.html%% NASA coding rules for Chttp://pixelscommander.com/wp-content/uploads/2014/12/P10.pdf%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: