handouts/ho07.tex
changeset 313 1d243ac51078
parent 312 c913fe9bfd59
child 314 e01f55e7485a
of them. The person with the red flag was intended to warn the
public, for example horse owners, about the impending
novelty---a car. In my humble opinion, we are at the same
stage of development with privacy. Nobody really knows what it
is about or what it is good for. It all seems very hazy. There
are a few laws (e.g.~the cookie law and the right-to-be-forgotten
law) which address problems with privacy, but even if they are
well intentioned, they either backfire or are already obsolete
because of newer technologies. The result is that the world of
``privacy'' looks a little bit like the old Wild West.

For example, UCAS, a charity set up to help students apply to
universities, has a commercial unit that happily sells your
email addresses to anybody who forks out enough money in order
to be able to bombard you with spam. Yes, you can often opt
out of such ``schemes'', but in the case of UCAS any opt-out
also limits the legitimate emails you might actually be
interested in.\footnote{The main objectionable point, in my
opinion, is that the \emph{charity} everybody has to use for
HE applications has actually very honourable goals
(e.g.~assisting applicants in gaining access to universities),
but the small print (or rather the link ``About us'') reveals
that they set up their organisation so that they can also
shamelessly sell the email addresses they ``harvest''.
Everything is of course very legal\ldots{}moral?\ldots{}well,
that is in the eye of the beholder. See:

\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities}
or
\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}

Another example: Verizon, an ISP that is supposed to provide
you just with connectivity, has found a ``nice'' side-business
too: even when you have enabled all privacy guards in your
browser (the few you have at your disposal), Verizon happily
adds a kind of cookie to your
HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
As shown in the picture below, this cookie will be sent to
every web-site you visit. The web-sites can then forward the
cookie to advertisers who in turn pay Verizon to tell them
        voting, exam marking and so on.

\item \textbf{Privacy} is the ability or right to protect your
      personal secrets (secrecy for the benefit of an
      individual). For example, in a job interview, I might
      not like to disclose that I am pregnant, if I were a
      woman, or that I am a father, lest they then decide not
      to hire me. Similarly, I might not like to disclose my
      location data, because thieves might break into my house
      if they know I am away at work. Privacy is essentially
      everything which ``shouldn't be anybody's business''.

\end{itemize}

\noindent While this might provide us with some rough
definitions, the problem with privacy is that it is an
about my private life usually is used against me. As mentioned
above, public location data might mean I get robbed. If
supermarkets build a profile of my shopping habits, they will
use it to \emph{their} advantage---surely not to \emph{my}
advantage. Also, whatever might be collected about my life will
always be an incomplete, or even misleading, picture. For
example, I am pretty sure my creditworthiness score was
temporarily(?) destroyed by not having a regular income in
this country (before coming to King's I worked in Munich for
five years). To correct such incomplete or flawed credit
history data there is now a law that allows you to check what
information is held about you for determining your
creditworthiness. But this concerns only a very small part of
the data that is held about me/you.

To see how private matters can really lead to the wrong
conclusions, take the example of Stephen Hawking: when he was
diagnosed with his disease, he was given a life expectancy of
two years. If employers had known about such problems, would
they have employed Hawking? Now, he is well past his 70th
birthday. Clearly personal medical data needs to stay private.
A movie which has this topic as its main focus is Gattaca from
1997, in case you would like to watch
it.\footnote{\url{http://www.imdb.com/title/tt0119177/}}

To cut a long story short, I will let you ponder over the two
statements that are often voiced in discussions about privacy:

\begin{itemize}
\item \textit{``You have zero privacy anyway. Get over it.''}
\mbox{}\hfill{}{\small{}by Scott McNealy (CEO of Sun)}

\item \textit{``If you have nothing to hide, you have nothing
to fear.''}
\end{itemize}
\end{center}

\noindent Funnily, or maybe not so funnily, the author of this
article carefully tries to construct an argument that not only
attacks the nothing-to-hide statement in cases where
governments \& Co.\ collect people's deepest secrets, or
pictures of people's naked bodies, but also applies in cases
where governments ``only'' collect data relevant to, say,
preventing terrorism. The fun is of course that in 2011 we
could just not imagine that respected governments would do
such infantile things as intercepting
\subsubsection*{Re-Identification Attacks}

Apart from philosophical musings, there are fortunately also
some real technical problems with privacy. The problem I want
to focus on in this handout is how to safely disclose datasets
containing potentially very private data, say health records.
What can go wrong with such disclosures can be illustrated
with four well-known examples:

\begin{itemize}
\item In 2006, a then young company called Netflix offered a
      \$1 Mio prize to anybody who could improve their movie
      rating algorithm. For this they disclosed a dataset
      containing 10\% of all Netflix users at the time
      (appr.~500K). They removed names, but included numerical
      ratings of movies as well as the times when ratings were
      uploaded, though some information was perturbed (i.e.,
      slightly modified).

      Two researchers had a closer look at this anonymised
      data and compared it with public data available from the
      International Movie Database (IMDb). They found that
      98\% of the entries could be re-identified in the
      made public, the then governor assured the public that
      the released dataset protected patient privacy by
      deleting identifiers.

      A graduate student could not resist cross-referencing
      public voter data with the released data that still
      included birth dates, gender and ZIP-code. The result
      was that she could send the governor his own hospital
      record. It turns out that birth dates, gender and
      ZIP-code uniquely identify 87\% of people in the US (a
      small sketch of such a \emph{linkage attack} is shown
      after this list). This work resulted in a number of laws
      prescribing which private data cannot be released in
      such datasets.

\item In 2006, AOL published 20 million Web search queries
      collected from 650,000 users (names had been deleted).
      This was again done for research purposes. However,
      within days an old lady, Thelma Arnold, from Lilburn,
      lives.

\item Genome-Wide Association Studies (GWAS) was a public
      database of gene-frequency studies linked to diseases.
      It would essentially record that people who have a
      disease, say diabetes, also have certain genes. In order
      to maintain privacy, the dataset would only include
      aggregate information. In the case of DNA data, this
      aggregation was achieved by mixing the DNA of many
      individuals (having a disease) into a single solution.
      Then this mixture was sequenced and included in the
      dataset. The idea was that the aggregate information
      would still be helpful to researchers, but would protect
      the DNA data of individuals.

      In 2007 a forensic computer scientist showed that
      individuals can still be identified. For this he used
      the DNA data from a comparison group (people from the
      general public) and ``subtracted'' this data from the
      published data. He was left with data that included all
      the ``special'' DNA-markers of the individuals present
      in the original mixture. He essentially deleted the
      ``background noise'' in the published data. The problem
      with DNA data is that it is of such a high resolution
      that even if the mixture contained maybe 100
      individuals, you can now detect whether an individual
      was included in the mixture or not.

      This result changed completely how DNA data is nowadays
      published for research purposes. After the success of
      the human-genome project with a very open culture of
      exchanging data, it became much more difficult to
      anonymise data so that patients' privacy is preserved.
      The public GWAS database was taken offline in 2008.

\end{itemize}
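
\noindent Just to give a feeling for how little machinery such
a \emph{linkage attack} needs, below is a small Python sketch
(my own illustration with made-up names and records, not the
code used in the actual cases above) that joins a
``de-identified'' medical dataset with a public voter roll on
the quasi-identifiers birth date, gender and ZIP-code. A
back-of-envelope calculation shows why this works so well:
roughly $36{,}000$ possible birth dates times $2$ genders times
about $40{,}000$ ZIP-codes gives billions of combinations, far
more than there are people in the US, so most combinations
single out at most one person.

{\small
\begin{verbatim}
# Illustrative linkage attack; all names and records are made up.

# "De-identified" medical records: names removed, but the
# quasi-identifiers (birth date, gender, ZIP-code) are kept.
medical = [
    {"birth": "1945-07-31", "gender": "M", "zip": "02138",
     "diagnosis": "heart disease"},
    {"birth": "1962-03-14", "gender": "F", "zip": "02139",
     "diagnosis": "diabetes"},
]

# Public voter roll: contains names *and* the same quasi-identifiers.
voters = [
    {"name": "Alice Smith", "birth": "1962-03-14",
     "gender": "F", "zip": "02139"},
    {"name": "Bob Jones",   "birth": "1945-07-31",
     "gender": "M", "zip": "02138"},
]

# The whole "attack" is just a join on (birth, gender, zip).
index = {(v["birth"], v["gender"], v["zip"]): v["name"] for v in voters}

for record in medical:
    key = (record["birth"], record["gender"], record["zip"])
    if key in index:
        print(index[key], "->", record["diagnosis"])
\end{verbatim}
}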

\noindent There are many lessons that can be learned from
these examples. One is that when making datasets public in
anonymised form, you want to achieve \emph{forward privacy}.
This means that, no matter what other data is also available
or will be released later, the data in the original dataset
does not compromise an individual's privacy. This principle
was violated by the availability of ``outside data'' in the
Netflix and governor of Massachusetts cases. The additional
data permitted a re-identification of individuals in the
dataset. In the case of GWAS a new technique of
re-identification compromised the privacy of people in the
dataset. The case of the AOL dataset shows clearly how
incomplete such data can be: although the queries uniquely
identified the old lady, she also looked up diseases that her
friends had, which had nothing to do with her. Any rational
analysis of her query data must therefore have concluded that
the lady was on her deathbed, while she was actually very much
alive and kicking.

\subsubsection*{Differential Privacy}

Differential privacy is one of the few methods that try to
achieve forward privacy with large datasets. The basic idea
Database\\
$x_1, \ldots, x_n$
\end{tabular}
\end{center}
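
\noindent As a rough sketch of the standard textbook
definition: a randomised mechanism $M$, which answers queries
about such databases, satisfies $\epsilon$-differential privacy
if for any two databases $D_1$ and $D_2$ that differ in the
data of at most one individual, and for any set $S$ of possible
answers,

\[
\Pr[M(D_1) \in S] \;\leq\; e^{\epsilon} \cdot \Pr[M(D_2) \in S]
\]

\noindent Intuitively, whether or not your data is included in
the database makes almost no difference to the answers anybody
can observe.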

\ldots

\subsubsection*{Further Reading}

A readable article about how supermarkets mine your shopping
habits (especially how they prey on young exhausted families
;o) appeared in 2012 in the New York Times:

\begin{center}
\url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html}
\end{center}


\begin{center}
\url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf}
\end{center}

\noindent An article that sheds light on the paradox that
people usually worry about privacy invasions of little
significance, and overlook the privacy invasions that might
cause significant damage:

\begin{center}
\url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf}
\end{center}
