diff -r 01fe5aba8781 -r 7807863c4196 handouts/ho03.tex --- a/handouts/ho03.tex Tue Oct 07 12:48:07 2014 +0100 +++ b/handouts/ho03.tex Thu Oct 09 14:41:36 2014 +0100 @@ -21,7 +21,9 @@ computer science students, but who said that criminal hackers restrict themselves to everyday fare? Not to mention the free-riding script-kiddies who use this technology without -even knowing what the underlying ideas are. +even knowing what the underlying ideas are. If you want to be +a good security engineer who needs to defend such attacks, +then better you know the details. For buffer overflow attacks to work, a number of innocent design decisions, which are really benign on their own, need @@ -34,7 +36,7 @@ ``forefathers'' can really be blamed, but as mentioned above we should already be way beyond the point that buffer overflow attacks are worth a thought. Unfortunately, this is far from -the truth. I let you think why? +the truth. I let you ponder why? One such ``benign'' design decision is how the memory is laid out into different regions for each process. @@ -64,18 +66,19 @@ this region is read-only). The heap stores all data the programmer explicitly allocates. For us the most interesting region is the stack, which contains data mostly associated -with the ``control flow'' of the program. Notice that the stack -grows from a higher addresses to lower addresses. That means -that older items on the stack will be stored behind newer -items. Let's look a bit closer what happens with the stack. -Consider the the trivial C program. +with the control flow of the program. Notice that the stack +grows from a higher addresses to lower addresses. That means +that older items on the stack will be stored behind, or after, +newer items. Let's look a bit closer what happens with the +stack when a program is running. Consider the following simple +C program. \lstinputlisting[language=C]{../progs/example1.c} -\noindent The main function calls \code{foo} with three -argument. Foo contains two (local) buffers. The interesting -point is what will the stack looks like after Line 3 has been -executed? The answer is as follows: +\noindent The \code{main} function calls \code{foo} with three +arguments. \code{Foo} contains two (local) buffers. The +interesting point for us will be what will the stack loke +like after Line 3 has been executed? The answer is as follows: \begin{center} \begin{tikzpicture}[scale=0.65] @@ -128,8 +131,42 @@ needed in order to clean up the stack to the last level---in fact there is no cleaning involved, but just the top of the stack will be set back. The two buffers are also on the stack, -because they are local data within \code{foo}. +because they are local data within \code{foo}. So in the +middle is a snapshot of the stack after Line 3 has been +executed. In case you are familiar with assembly instructions +you can also read off this behaviour from the machine +code that the \code{gcc} compiler generates for the program +above:\footnote{You can make \pcode{gcc} generate assembly +instructions if you call it with the \pcode{-S} option, +for example \pcode{gcc -S out in.c}\;. Or you can look +at this code by using the debugger. This will be explained +later.}. +\begin{center}\small +\begin{tabular}[t]{@{}c@{\hspace{8mm}}c@{}} +{\lstinputlisting[language={[x86masm]Assembler}, + morekeywords={movl},xleftmargin=5mm] + {../progs/example1a.s}} & +{\lstinputlisting[language={[x86masm]Assembler}, + morekeywords={movl,movw},xleftmargin=5mm] + {../progs/example1b.s}} +\end{tabular} +\end{center} + +\noindent On the left you can see how the function +\pcode{main} prepares in Lines 2 to 7 the stack, before +calling the function \pcode{foo}. You can see that the +numbers 3, 2, 1 are stored on the stack (the register +\code{$esp} refers to the top of the stack). On the right +you can see how the function \pcode{foo} stores the two local +buffers onto the stack and initialises them with the given +data (Lines 2 to 9). Since there is no real computation +going on inside \pcode{foo} the function then just restores +the stack to its old state and crucially sets the return +address where the computation should resume (Line 9 in the +code on the left hand side). The instruction \code{ret} then +transfers control back to the function \pcode{main} to the +teh instruction just after the call, namely Line 9. Another part of the ``conspiracy'' is that library functions in C look typically as follows: @@ -142,13 +179,204 @@ to a destination \pcode{dst}. The important point is that it copies the data until it reaches a zero-byte (\code{"\\0"}). +The central idea of the buffer overflow attack is to overwrite +the return address on the stack which states where the control +flow of the program should resume once the function at hand +has finished its computation. So if we have somewhere in a +function a local a buffer, say + +\begin{center} +\code{char buf[8];} +\end{center} + +\noindent +then the corresponding stack will look as follows + +\begin{center} + \begin{tikzpicture}[scale=0.65] + %\draw[step=1cm] (-3,-1) grid (3,8); + \draw[gray!20,fill=gray!20] (-1, 0) rectangle (1,-1); + \draw[line width=1mm] (-1,-1.2) -- (-1,6.4); + \draw[line width=1mm] ( 1,-1.2) -- ( 1,6.4); + \draw (0,-1) node[anchor=south] {\tt main}; + \draw[line width=1mm] (-1,0) -- (1,0); + \draw (0,0) node[anchor=south] {\tt arg$_3$=3}; + \draw[line width=1mm] (-1,1) -- (1,1); + \draw (0,1) node[anchor=south] {\tt arg$_2$=2}; + \draw[line width=1mm] (-1,2) -- (1,2); + \draw (0,2) node[anchor=south] {\tt arg$_1$=1}; + \draw[line width=1mm] (-1,3) -- (1,3); + \draw (0,3.1) node[anchor=south] {\tt ret}; + \draw[line width=1mm] (-1,4) -- (1,4); + \draw (0,4) node[anchor=south] {\small\tt last sp}; + \draw[line width=1mm] (-1,5) -- (1,5); + \draw (0,5) node[anchor=south] {\tt buf}; + \draw[line width=1mm] (-1,6) -- (1,6); + \draw (2,5.1) node[anchor=south] {\code{$esp}}; + \draw[<-,line width=0.5mm] (1.1,6) -- (2.5,6); + + \draw[->,line width=0.5mm] (1,4.5) -- (2.5,4.5); + \draw (2.6,4.1) node[anchor=south west] {\code{??}}; + + \draw[->,line width=0.5mm] (1,3.5) -- (2.5,3.5); + \draw (2.6,3.1) node[anchor=south west] {\tt jump to \code{\\x080483f4}}; +\end{tikzpicture} +\end{center} + +\noindent We need to fill this over its limit of +8 characters so that it overwrites the stack pointer +and then overwrites the return address. If, for example, +we want to jump to a specific address in memory, say, +\pcode{\\x080483f4} then we need to fill the +buffer for example as follows + +\begin{center} +\code{char buf[8] = "AAAAAAAABBBB\\xf4\\x83\\x04\\x08";} +\end{center} + +\noindent The first 8 \pcode{A}s fill the buffer to the rim; +the next four \pcode{B}s overwrite the stack pointer (with +what data we overwrite this part is usually not important); +then comes the address we want to jump to. Notice that we have +to give the address in the reverse order. All addresses on +Intel CPUs need to be given in this way. Since the string is +enclosed in double quotes, the C convention is that the string +internally will automatically be terminated by a zero-byte. If +the programmer uses functions like \pcode{strcpy} for filling +the buffer \pcode{buf}, then we can be sure it will overwrite +the stack in this manner---since it will copy everything up +to the zero-byte. + +What the outcome of such an attack is can be illustrated with +the code shown in Figure~\ref{C2}. Under ``normal operation'' +this program ask for a login-name and a password (both are +represented as strings). Both of which are stored in buffers +of length 8. The function \pcode{match} tests whether two such +strings are equal. If yes, then the function lets you in +(by printing \pcode{Welcome}). If not, it denies access +(by printing \pcode{Wrong identity}). The vulnerable function +is \code{get_line} in Lines 11 to 19. This function does not +take any precautions about the buffer of 8 characters being +filled beyond this 8-character-limit. The buffer overflow +can be triggered by inputing something, like \pcode{foo}, for +the login name and then the specially crafted string as +password: + +\begin{center} +\code{AAAAAAAABBBB\\x2c\\x85\\x04\\x08\\n} +\end{center} + +\noindent The address happens to be the one for the function +\pcode{welcome()}. This means even with this input (where the +login name and password clearly do not match) the program will +still print out \pcode{Welcome}. The only information we need +for this attack is to know where the function +\pcode{welcome()} starts in memory. This information can be +easily obtained by starting the program inside the debugger +and disassembling this function. + +\begin{lstlisting}[numbers=none,language={[x86masm]Assembler}, + morekeywords={movl,movw}] +$ gdb C2 +GNU gdb (GDB) 7.2-ubuntu +(gdb) disassemble welcome +\end{lstlisting} + +\noindent +The output will be something like this + +\begin{lstlisting}[numbers=none,language={[x86masm]Assembler}, + morekeywords={movl,movw}] +0x0804852c <+0>: push %ebp +0x0804852d <+1>: mov %esp,%ebp +0x0804852f <+3>: sub $0x4,%esp +0x08048532 <+6>: movl $0x8048690,(%esp) +0x08048539 <+13>: call 0x80483a4 +0x0804853e <+18>: movl $0x0,(%esp) +0x08048545 <+25>: call 0x80483b4 +\end{lstlisting} + +\noindent indicating that the function \pcode{welcome()} +starts at address \pcode{0x0804852c}. + \begin{figure}[p] \lstinputlisting[language=C]{../progs/C2.c} \caption{A suspicious login implementation.\label{C2}} \end{figure} +This kind of attack was very popular with commercial programs +that needed a key to be unlocked. Historically, hackers first +broke the rather weak encryption of these locking mechanisms. +After the encryption had been made stronger, hackers used +buffer overflow attacks as shown above to jump directly to +the part of the program that was intended to be only available +after the correct key was typed in by the user. + +\subsection*{Paylods} + +Unfortunately, much more harm can be caused by buffer overflow +attacks. This is achieved by injecting code that will be run +once the return address is appropriately modified. Typically +the code that will be injected is for running a shell. In +order to be send as part of the string that is overflowing the +buffer, we need the code to be encoded as a sequence of +characters + +\lstinputlisting[language=C,numbers=none]{../progs/o1.c} + +\noindent These characters represent the machine code +for opening a shell. It seems obtaining such a string +requires higher-education in the architecture of the +target system. But it is actually relatively simple: First +there are many ready-made strings available---just a quick +Google query away. Second, tools like the debugger can help +us again. We can just write the code we want in C, for +example this would be the program to start a shell + +\lstinputlisting[language=C,numbers=none]{../progs/shell.c} + +\noindent Once compiled, we can use the debugger to obtain +the machine code, or even the ready made encoding as character +sequence. + +While easy, obtaining this string is not entirely trivial. +Remember the functions in C that copy or fill buffers work +such that they copy everything until the zero byte is reached. +Unfortunately the ``vanilla'' output from the debugger for the +shell-program will contain such zero bytes. So a +post-processing phase is needed to rewrite the machine code +such that it does not contain any zero bytes. This is like +some works of literature that have been rewritten so that the +letter 'i', for example, is avoided. For rewriting the machine +code you might need to use clever tricks like + +\begin{lstlisting}[numbers=none,language={[x86masm]Assembler}] +xor %eax, %eax +\end{lstlisting} + +\noindent This instruction does not contain any zero byte when +encoded, but produces a zero byte on the stack. + +Having removed the zero bytes we can craft the string that +will be send to our target computer. It is typically of the +form + +\begin{center} + \begin{tikzpicture}[scale=0.7] + \draw[line width=1mm] (-2, -1) rectangle (2,3); + \draw[line width=1mm] (-2,1.9) -- (2,1.9); + \draw (0,2.5) node {\large\tt shell code}; + \draw[line width=1mm,fill=black] (0.3, -1) rectangle (2,-0.7); + \draw[->,line width=0.3mm] (1.05, -1) -- (1.05,-1.7) -- + (-3,-1.7) -- (-3, 3.7) -- (-1.9, 3.7) -- (-1.9, 3.1); + \draw (-2, 3) node[anchor=north east] {\LARGE\tt "}; + \draw ( 2,-0.9) node[anchor=west] {\LARGE\tt "}; + \end{tikzpicture} +\end{center} + + \bigskip\bigskip -\subsubsection*{A Crash-Course on GDB} +\subsubsection*{A Crash-Course for GDB} \begin{itemize} \item \texttt{(l)ist n} -- listing the source file from line