# HG changeset patch
# User Christian Urban
# Date 1415927858 0
# Node ID b1ba3d88696ed5e0092543ffd96b8507574b6d05
# Parent 2a814c06ae038ed416bcb3d7ad0027046d128371
updated

diff -r 2a814c06ae03 -r b1ba3d88696e handouts/ho07.pdf
Binary file handouts/ho07.pdf has changed
diff -r 2a814c06ae03 -r b1ba3d88696e handouts/ho07.tex
--- a/handouts/ho07.tex Thu Nov 13 18:53:25 2014 +0000
+++ b/handouts/ho07.tex Fri Nov 14 01:17:38 2014 +0000
@@ -15,19 +15,21 @@
 stage of development with privacy. Nobody really knows what it is about or what it is good for. All seems very hazy. The result is that the world of ``privacy'' looks a little bit
-like the old Wild West. For example, UCAS, a charity set up to
-help students apply to universities, has a commercial unit
-that happily sells your email addresses to anybody who forks
-out enough money in order to bombard you with spam. Yes, you
-can opt out very often, but in case of UCAS any opt-out will
-limit also legit emails you might actually be interested
+like the old Wild West. Anything seems to go.
+
+For example, UCAS, a charity set up to help students apply
+to universities, has a commercial unit that happily sells your
+email addresses to anybody who forks out enough money in order
+to be able to bombard you with spam. Yes, you can opt out very
+often in such ``schemes'', but in case of UCAS any opt-out
+will also limit legitimate emails you might actually be interested
 in.\footnote{The main objectionable point, in my opinion, is that the \emph{charity} everybody has to use for HE applications has actually very honourable goals (e.g.~assist applicants in gaining access to universities), but in their small print (or better under the link ``About us'') reveals they set up their organisation so that they can also
-shamelessly sell email addresses the ``harvest''. Everything
+shamelessly sell email addresses they ``harvest''. Everything
 is of course very legal\ldots{}moral?\ldots{}well that is in the eye of the beholder. See:
@@ -35,10 +37,11 @@
 or \url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}

-Verizon, an ISP who provides you with connectivity, has found
-a ``nice'' side-business too: When you have enabled all
-privacy guards in your browser, the few you have at your
-disposal, Verizon happily adds a kind of cookie to your
+Another example: Verizon, an ISP that provides you with
+connectivity, has found a ``nice'' side-business too: when you
+have enabled all privacy guards in your browser, the few you
+have at your disposal, Verizon happily adds a kind of cookie
+to your
 HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}} As shown in the picture below, this cookie will be sent to every web-site you visit. The web-sites then can forward the
@@ -61,7 +64,7 @@
 Why does privacy matter? Nobody, I think, has a conclusive answer to this question. Maybe the following four notions
-clarify the picture somewhat:
+help clarify the overall picture somewhat:

 \begin{itemize}
 \item \textbf{Secrecy} is the mechanism used to limit the
@@ -91,7 +94,7 @@
 like to disclose my location data, because thieves might break into my house if they know I am away at work. Privacy is essentially everything which `shouldn't be
-      anybodies business'.
+      anybody's business'.

 \end{itemize}
@@ -99,66 +102,116 @@
 definitions, the problem with privacy is that it is an extremely fine line what should stay private and what should not.

 For example, since I am working in academia, I am very
-happy to be essentially a digital exhibitionist: I am happy to
+happy to be a digital exhibitionist: I am very happy to
 disclose all `trivia' related to my work on my personal web-page. This is a kind of bragging that is normal in
-academia (at least in the CS field). I am even happy that
-Google maintains a profile about all of my academic papers and
-their citations.
+academia (at least in the field of CS), even expected if you
+look for a job. I am even happy that Google maintains a
+profile about all my academic papers and their citations.

-On the other hand I would be very peeved if anybody had a too
-close look on my private live---it shouldn't be anybodies
-business. The reason is that knowledge about my private life
-usually is used against me. As mentioned above, public
-location data might mean I get robbed. If supermarkets build a
-profile of my shopping habits, they will use it to
-\emph{their} advantage---surely not to \emph{my} advantage.
-Also whatever might be collected about my life will always be
-an incomplete, or even misleading, picture---I am sure my
-creditworthiness score was temporarily(?) destroyed by not
-having a regular income in this country (before coming to
-King's I worked in Munich). To correct such incomplete or
-flawed data there is, since recently, a law that allows you to
-check what information is held about you for determining your
+On the other hand I would be very irritated if anybody I do
+not know had too close a look at my private life---it
+shouldn't be anybody's business. The reason is that knowledge
+about my private life is usually used against me. As mentioned
+above, public location data might mean I get robbed. If
+supermarkets build a profile of my shopping habits, they will
+use it to \emph{their} advantage---surely not to \emph{my}
+advantage. Also whatever might be collected about my life will
+always be an incomplete, or even misleading, picture---for
+example, I am sure my creditworthiness score was temporarily(?)
+destroyed by not having a regular income in this country
+(before coming to King's I worked in Munich for five years).
+To correct such incomplete or flawed credit-history data there
+is now a law that allows you to check what
+information is held about you for determining your
 creditworthiness. But this concerns only a very small part of the data that is held about me/you.

-This is an endless field. I let you ponder about the two
-statements that are often float about in discussions about
-privacy:
+To cut a long story short, I let you ponder the two
+statements that are often voiced in discussions about privacy:

 \begin{itemize}

 \item \textit{``You have zero privacy anyway. Get over it.''}\\
-\mbox{}\hfill{}Scott Mcnealy (CEO of Sun)
+\mbox{}\hfill{}{\small{}by Scott McNealy (CEO of Sun)}

 \item \textit{``If you have nothing to hide, you have nothing to fear.''}

 \end{itemize}

-\noindent There are some technical problems that are easier to
-discuss and that often have privacy implications. The problem
-I want to focus on is how to safely disclose datasets. What
-can go wrong with this can be illustrated with three examples:
+\noindent An article that attempts a deeper analysis appeared
+in 2011 in the Chronicle of Higher Education:
+
+\begin{center}
+\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/}
+\end{center}
+
+\noindent Funnily, or maybe not so funnily, the author of this
+article carefully tries to construct an argument that not only
+attacks the nothing-to-hide statement in cases where
+governments \& Co collect people's deepest secrets, or
+pictures of people's naked bodies, but one that applies also
+in cases where governments ``only'' collect data relevant to,
+say, preventing terrorism. The fun is, of course, that in 2011
+we could just not imagine that respected governments would do
+such infantile things as intercepting people's nude photos.
+Well, since Snowden we know some people at the NSA did and
+then shared such photos among colleagues as a ``fringe
+benefit''.
+
+
+\subsubsection*{Re-Identification Attacks}
+
+Apart from philosophical arguments, there are fortunately also
+some real technical problems with privacy implications. The
+problem I want to focus on in this handout is how to safely
+disclose datasets containing potentially private data, say
+health data. What can go wrong with such disclosures can be
+illustrated with four examples (a small sketch of the
+underlying linkage idea follows the list):

 \begin{itemize}

-\item In 2006 a then young company called Netflix offered a 1
+\item In 2006, a then young company called Netflix offered a 1
       Mio \$ prize to anybody who could improve their movie rating algorithm. For this they disclosed a dataset
-      containing 10\% of all Netflix users (appr.~500K). They
-      removed names, but included numerical ratings as well as
-      times of ratings. Though some information was perturbed
-      (i.e., slightly modified).
+      containing 10\% of all Netflix users at the time
+      (appr.~500K). They removed names, but included numerical
+      ratings of movies as well as times of ratings. Though
+      some information was perturbed (i.e., slightly
+      modified).

-      Two researchers took that data and compared it with
-      public data available from the International Movie
-      Database (IMDb). They found that 98 \% of the entries
-      could be re-identified: either by their ratings or by
-      the dates the ratings were uploaded.
+      Two researchers had a closer look at this anonymised
+      data and compared it with public data available from the
+      International Movie Database (IMDb). They found that 98
+      \% of the entries could be re-identified in the Netflix
+      dataset: either by their ratings or by the dates the
+      ratings were uploaded. The result was a class-action
+      suit against Netflix, which was only recently resolved
+      and involved a lot of money.

-\item In the 1990, medical databases were routinely made
-      publicised for research purposes. This was done in
-      anonymised form with names removed, but birth dates,
-      gender, ZIP-code were retained.
+\item In the 1990s, medical datasets were often made public
+      for research purposes. This was done in anonymised form
+      with names removed, but birth dates, gender and ZIP-code
+      were retained. In one case where such data was made
+      public about state employees in Massachusetts, the then
+      governor assured the public that the released dataset
+      protected patient privacy by deleting identifiers. A
+      graduate student could not resist and cross-referenced
+      public voter data with the medical data on birth dates,
+      gender and ZIP-code. The result was that she could send
+      the governor his own hospital record.
+
+\item In 2006, AOL published 20 million Web search queries
+      collected from 650,000 users (names had been deleted).
+      This was again for research purposes. However, within
+      days an old lady, Thelma Arnold, from Lilburn, Georgia
+      (11,596 inhabitants), was identified as user No.~4417749
+      in this dataset. It turned out that search engine
+      queries are windows into people's private lives.
+
+\item Genome-Wide Association Studies (GWAS) was a public
+      database of gene-frequency studies linked to diseases.
+      It turned out that partial DNA information was enough
+      to identify whether an individual had taken part in a
+      study. The database was closed down in 2008.

 \end{itemize}
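+
+\noindent To see why a handful of innocuous-looking attributes can be
+enough to re-identify people, here is a small linkage-attack sketch
+in Python. It is purely illustrative and not taken from any of the
+studies above: all records, names and column names are made up. The
+idea is simply to join an ``anonymised'' release with a public
+dataset on the shared quasi-identifiers birth date, gender and
+ZIP-code.
+
+{\small
+\begin{verbatim}
+# linkage-attack sketch: join an "anonymised" release with a
+# public dataset on shared quasi-identifiers (all data made up)
+
+anonymised_medical = [
+    {"birth": "1945-07-31", "sex": "F", "zip": "02138",
+     "diagnosis": "hypertension"},
+    {"birth": "1962-01-15", "sex": "M", "zip": "30047",
+     "diagnosis": "asthma"},
+]
+
+public_voter_roll = [
+    {"name": "A. Smith", "birth": "1945-07-31", "sex": "F",
+     "zip": "02138"},
+    {"name": "B. Jones", "birth": "1962-01-15", "sex": "M",
+     "zip": "30047"},
+]
+
+def quasi_id(record):
+    # the attributes shared by both datasets
+    return (record["birth"], record["sex"], record["zip"])
+
+# index the public data by quasi-identifier, then look up each
+# "anonymous" medical record
+index = {quasi_id(r): r["name"] for r in public_voter_roll}
+for rec in anonymised_medical:
+    name = index.get(quasi_id(rec))
+    if name is not None:
+        print(name, "->", rec["diagnosis"])
+\end{verbatim}}
+
+\noindent If a combination of birth date, gender and ZIP-code is
+unique in the population (which it very often is), the join above
+re-identifies the corresponding record despite the missing names.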