# HG changeset patch # User Christian Urban # Date 1416006280 0 # Node ID 1d243ac510780f81d734a6d1f0a0c33e1552b37e # Parent c913fe9bfd59825c8bb647850ffd2e233289caf0 updated diff -r c913fe9bfd59 -r 1d243ac51078 handouts/ho07.pdf Binary file handouts/ho07.pdf has changed diff -r c913fe9bfd59 -r 1d243ac51078 handouts/ho07.tex --- a/handouts/ho07.tex Fri Nov 14 14:03:15 2014 +0000 +++ b/handouts/ho07.tex Fri Nov 14 23:04:40 2014 +0000 @@ -14,27 +14,27 @@ novelty---a car. In my humble opinion, we are at the same stage of development with privacy. Nobody really knows what it is about or what it is good for. All seems very hazy. There -are a few laws (cookie law, right-to-be-forgotten) which -address problems with privacy, but even if they are well +are a few laws (e.g.~cookie law, right-to-be-forgotten law) +which address problems with privacy, but even if they are well intentioned, they either back-fire or are already obsolete because of newer technologies. The result is that the world of ``privacy'' looks a little bit like the old Wild West. -For example, UCAS, a charity set up to help students to apply -to universities, has a commercial unit that happily sells your -email addresses to anybody who forks out enough money in order -to be able to bombard you with spam. Yes, you can opt out very -often in such ``schemes'', but in case of UCAS any opt-out -will limit also legit emails you might actually be interested -in.\footnote{The main objectionable point, in my opinion, is -that the \emph{charity} everybody has to use for HE -applications has actually very honourable goals (e.g.~assist -applicants in gaining access to universities), but in their -small print (or better under the link ``About us'') reveals -they set up their organisation so that they can also -shamelessly sell email addresses they ``harvest''. Everything -is of course very legal\ldots{}moral?\ldots{}well that is in -the eye of the beholder. 
See: +For example, UCAS, a charity set up to help students with +applying to universities, has a commercial unit that happily +sells your email addresses to anybody who forks out enough +money in order to be able to bombard you with spam. Yes, you +can opt out very often in such ``schemes'', but in case of +UCAS any opt-out will limit also legit emails you might +actually be interested in.\footnote{The main objectionable +point, in my opinion, is that the \emph{charity} everybody has +to use for HE applications has actually very honourable goals +(e.g.~assist applicants in gaining access to universities), +but the small print (or better the link ``About +us'') reveals they set up their organisation so that they can +also shamelessly sell the email addresses they ``harvest''. +Everything is of course very legal\ldots{}moral?\ldots{}well +that is in the eye of the beholder. See: \url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities} or @@ -43,7 +43,7 @@ Another example: Verizon, an ISP who is supposed to provide you just with connectivity, has found a ``nice'' side-business too: When you have enabled all privacy guards in your browser -(the few you have at your disposal) Verizon happily adds a +(the few you have at your disposal), Verizon happily adds a kind of cookie to your HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}} As shown in the picture below, this cookie will be sent to @@ -93,12 +93,12 @@ \item \textbf{Privacy} is the ability or right to protect your personal secrets (secrecy for the benefit of an individual). For example, in a job interview, I might - not like to disclose that I am pregnant, if I were - a woman, or that I am a father. Similarly, I might not - like to disclose my location data, because thieves might - break into my house if they know I am away at work. - Privacy is essentially everything which `shouldn't be - anybody's business'. 
+ not like to disclose that I am pregnant, if I were a + woman, or that I am a father, lest they not hire + me. Similarly, I might not like to disclose my location + data, because thieves might break into my house if they + know I am away at work. Privacy is essentially + everything which ``shouldn't be anybody's business''. \end{itemize} @@ -121,15 +121,15 @@ supermarkets build a profile of my shopping habits, they will use it to \emph{their} advantage---surely not to \emph{my} advantage. Also whatever might be collected about my life will -always be an incomplete, or even misleading, picture---for -example I am sure my creditworthiness score was temporarily(?) -destroyed by not having a regular income in this country -(before coming to King's I worked in Munich for five years). -To correct such incomplete or flawed credit history data there -is, since recently, a law that allows you to check what -information is held about you for determining your -creditworthiness. But this concerns only a very small part of -the data that is held about me/you. +always be an incomplete, or even misleading, picture. For +example, I am pretty sure my creditworthiness score was +temporarily(?) destroyed by not having a regular income in +this country (before coming to King's I worked in Munich for +five years). To correct such incomplete or flawed credit +history data there is, since recently, a law that allows you +to check what information is held about you for determining +your creditworthiness. But this concerns only a very small +part of the data that is held about me/you. To see how private matter can lead really to the wrong conclusions, take the example of Stephen Hawking: When he was @@ -138,14 +138,15 @@ they have employed Hawking? Now, he is enjoying his 70+ birthday. Clearly personal medical data needs to stay private. 
A movie which has this topic as its main focus is Gattaca from -1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}} +1997, in case you would like to watch +it.\footnote{\url{http://www.imdb.com/title/tt0119177/}} To cut a long story short, I let you ponder about the two -statements that often voiced in discussions about privacy: +statements that are often voiced in discussions about privacy: \begin{itemize} -\item \textit{``You have zero privacy anyway. Get over it.''}\\ +\item \textit{``You have zero privacy anyway. Get over it.''} \mbox{}\hfill{}{\small{}by Scott Mcnealy (CEO of Sun)} \item \textit{``If you have nothing to hide, you have nothing @@ -163,7 +164,7 @@ \noindent Funnily, or maybe not so funnily, the author of this article carefully tries to construct an argument that does not only attack the nothing-to-hide statement in cases where -governments \& Co collect people's deepest secrets, or +governments \& co collect people's deepest secrets, or pictures of people's naked bodies, but an argument that applies also in cases where governments ``only'' collect data relevant to, say, preventing terrorism. The fun is of course @@ -179,9 +180,9 @@ Apart from philosophical musings, there are fortunately also some real technical problems with privacy. The problem I want to focus on in this handout is how to safely disclose datasets -containing very potentially private data, say health data. What can -go wrong with such disclosures can be illustrated with four -well-known examples: +containing potentially very private data, say health records. +What can go wrong with such disclosures can be illustrated +with four well-known examples: \begin{itemize} \item In 2006, a then young company called Netflix offered a 1 @@ -189,9 +190,9 @@ rating algorithm. For this they disclosed a dataset containing 10\% of all Netflix users at the time (appr.~500K). They removed names, but included numerical - ratings of movies as well as times of ratings. 
Though - some information was perturbed (i.e., slightly - modified). + ratings of movies as well as times when ratings were + uploaded, though some information was perturbed (i.e., + slightly modified). Two researchers had a closer look at this anonymised data and compared it with public data available from the @@ -212,13 +213,13 @@ deleting identifiers. A graduate student could not resist cross-referencing - public voter data with the released data including birth - dates, gender and ZIP-code. The result was that she - could send the governor his own hospital record. It - turns out that birth dates, gender and ZIP-code uniquely - identify 87\% of people in the US. This work resulted - in a number of laws prescribing which private data - cannot be released in such datasets. + public voter data with the released data that still + included birth dates, gender and ZIP-code. The result + was that she could send the governor his own hospital + record. It turns out that birth dates, gender and + ZIP-code uniquely identify 87\% of people in the US. + This work resulted in a number of laws prescribing which + private data cannot be released in such datasets. \item In 2006, AOL published 20 million Web search queries collected from 650,000 users (names had been deleted). @@ -232,54 +233,55 @@ \item Genome-Wide Association Studies (GWAS) was a public database of gene-frequency studies linked to diseases. It would essentially record that people who have a - disease, say diabetes, have also these genes. In order + disease, say diabetes, also have certain genes. In order to maintain privacy, the dataset would only include - aggregate information. In case of DNA data this was - achieved by mixing the DNA of many individuals (having - a disease) into a single solution. Then this mixture - was sequenced and included in the dataset. The idea - was that the agregate information would still be helpful - to researchers, but would protect the DNA data of - individuals. 
+ aggregate information. In case of DNA data this + aggregation was achieved by mixing the DNA of many + individuals (having a disease) into a single solution. + Then this mixture was sequenced and included in the + dataset. The idea was that the aggregate information + would still be helpful to researchers, but would protect + the DNA data of individuals. - In 2007 a forensic computer scientist showed that - individuals can be still identified. For this he used + In 2007 a forensic computer scientist showed that + individuals can still be identified. For this he used the DNA data from a comparison group (people from the general public) and ``subtracted'' this data from the - published data. He was left with data that included - all ``special'' DNA-markers of the individuals - present in the original mixture. He essentially deleted - the ``background noise''. Now the problem with - DNA data is that it is of such a high resolution that - even if the mixture contained maybe 100 individuals, - you can now detect whether an individual was included - in the mixture or not. + published data. He was left with data that included all + ``special'' DNA-markers of the individuals present in + the original mixture. He essentially deleted the + ``background noise'' in the published data. The + problem with DNA data is that it is of such a high + resolution that even if the mixture contained maybe 100 + individuals, you can now detect whether an individual + was included in the mixture or not. This result changed completely how DNA data is nowadays published for research purposes. After the success of the human-genome project with a very open culture of exchanging data, it became much more difficult to - anonymise datasuch that patient's privacy is preserved. + anonymise data so that patients' privacy is preserved. The public GWAS database was taken offline in 2008. \end{itemize} \noindent There are many lessons that can be learned from -these examples. 
One is that when making data public in -anonymised form you want to achieve \emph{forward privacy}. -This means, no matter of what other data that is also available -or will be released later, the data does not compromise -an individual's privacy. This principle was violated by the -data in the Netflix and governor of Massachusetts cases. There -additional data allowed one to re-identify individuals in the -dataset. In case of GWAS a new technique of re-identification -compromised the privacy of people on the list. -The case of the AOL dataset shows clearly how incomplete such -data can be: Although the queries uniquely identified the -old lady, she also looked up diseases that her friends had, -which had nothing to do with her. Any rational analysis of her -query data must have concluded, the lady is on her deathbed, -while she was actually very much alive and kicking. +these examples. One is that when making datasets public in +anonymised form, you want to achieve \emph{forward privacy}. +This means that, no matter what other data is also available +or will be released later, the data in the original dataset +does not compromise an individual's privacy. This principle +was violated by the availability of ``outside data'' in the +Netflix and governor of Massachusetts cases. The additional +data permitted a re-identification of individuals in the +dataset. In case of GWAS a new technique of re-identification +compromised the privacy of people in the dataset. The case of +the AOL dataset shows clearly how incomplete such data can be: +Although the queries uniquely identified the older lady, she +also looked up diseases that her friends had, which had +nothing to do with her. Any rational analysis of her query +data would therefore have concluded that the lady is on her +deathbed, while she was actually very much alive and kicking. 
\subsubsection*{Differential Privacy} @@ -305,11 +307,14 @@ \end{tabular} \end{center} +\ldots + + \subsubsection*{Further Reading} A readable article about how supermarkets mine your shopping habits (especially how they prey on young exhausted families -;o) appeared in 2012 in a New York Times article. +;o) appeared in 2012 in the New York Times: \begin{center} \url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html} @@ -329,10 +334,10 @@ \url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf} \end{center} -\noindent An article that sheds light on the paradox that +\noindent An article that sheds light on the paradox that people usually worry about privacy invasions of little -significance, and overlook that might cause significant -damage: +significance, and overlook the privacy invasions that might +cause significant damage: \begin{center} \url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf}