handouts/ho07.tex
changeset 311 8befc029ca1e
parent 310 591b62e1f86a
child 312 c913fe9bfd59
were such a novelty that most people did not know what to make
of them. The person with the red flag was intended to warn the
public, for example horse owners, about the impending
novelty---a car. In my humble opinion, we are at the same
stage of development with privacy. Nobody really knows what it
is about or what it is good for. All seems very hazy. There
are a few laws (cookie law, right-to-be-forgotten) which
address problems with privacy, but even if they are well
intentioned, they either back-fire or are already obsolete
because of newer technologies. The result is that the world of
``privacy'' looks a little bit like the old Wild West.

For example, UCAS, a charity set up to help students to apply
to universities, has a commercial unit that happily sells your
email addresses to anybody who forks out enough money in order
to be able to bombard you with spam. Yes, you can opt out very

\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities}
or
\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}

Another example: Verizon, an ISP who is supposed to provide
you just with connectivity, has found a ``nice'' side-business
too: When you have enabled all privacy guards in your browser
(the few you have at your disposal), Verizon happily adds a
kind of cookie to your
HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
As shown in the picture below, this cookie will be sent to
every web-site you visit. The web-sites can then forward the
cookie to advertisers who in turn pay Verizon to tell them
everything they want to know about the person who just made
this request, that is you.

\begin{center}
\includegraphics[scale=0.19]{../pics/verizon.png}
\end{center}

\noindent How disgusting! Even worse, Verizon is not known for
being the cheapest ISP on the planet (completely the
contrary), and also not known for providing the fastest
connections; rather, it is one of the few ISPs in
the US with a quasi-monopolistic ``market distribution''.

Well, we could go on and on\ldots{}and we have not even
started yet on all the naughty things NSA \& Friends are
up to. Why does privacy actually matter? Nobody, I think, has
a conclusive answer to this question yet. Maybe the following
four notions help with clarifying the overall picture
somewhat:

\begin{itemize}
\item \textbf{Secrecy} is the mechanism used to limit the
      number of principals with access to information (e.g.,
      cryptography or access controls). For example I better
is, since recently, a law that allows you to check what
information is held about you for determining your
creditworthiness. But this concerns only a very small part of
the data that is held about me/you.

To see how private matters can really lead to the wrong
conclusions, take the example of Stephen Hawking: When he was
diagnosed with his disease, he was given a life expectancy of
two years. If employers had known about such problems, would
they have employed Hawking? Now, he is enjoying his 70+
birthday. Clearly personal medical data needs to stay private.
A movie which has this topic as its main focus is Gattaca from
1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}}

To cut a long story short, I let you ponder the two
statements that are often voiced in discussions about privacy:

\begin{itemize}

\item \textit{``If you have nothing to hide, you have nothing
to fear.''}
\end{itemize}

\noindent If you want to read up further on this topic, I can
recommend the following article that appeared in 2011 in the
Chronicle of Higher Education:

\begin{center}
\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/}
\end{center}

\subsubsection*{Re-Identification Attacks}

Apart from philosophical musings, there are fortunately also
some real technical problems with privacy. The problem I want
to focus on in this handout is how to safely disclose datasets
containing potentially very private data, say health data.
What can go wrong with such disclosures can be illustrated
with four well-known examples:

\begin{itemize}
\item In 2006, a then young company called Netflix offered a 1
      million dollar prize to anybody who could improve its
      movie recommendation algorithm. For this it published an
      anonymised dataset of customer ratings: names were
      removed, ratings and dates were kept, and
      some information was perturbed (i.e., slightly
      modified).

      Two researchers had a closer look at this anonymised
      data and compared it with public data available from the
      International Movie Database (IMDb). They found that
      98\% of the entries could be re-identified in the
      Netflix dataset: either by their ratings or by the dates
      the ratings were uploaded. The result was a class-action
      suit against Netflix, which was only recently resolved,
      involving a lot of money.

\item In the 1990s, medical datasets were often made public
      for research purposes. This was done in anonymised form
      with names removed, but birth dates, gender and ZIP-code
      were retained. In one case where such data about
      hospital visits of state employees in Massachusetts was
      made public, the then governor assured the public that
      the released dataset protected patient privacy by
      deleting identifiers.

      A graduate student could not resist cross-referencing
      public voter data with the released data including birth
      dates, gender and ZIP-code. The result was that she
      could send the governor his own hospital record. It
      turns out that birth dates, gender and ZIP-code uniquely
      identify 87\% of people in the US (a small sketch of
      such a cross-referencing join is given after this list).
      This work resulted in a number of laws prescribing which
      private data cannot be released in such datasets.

\item In 2006, AOL published 20 million Web search queries
      collected from 650,000 users (names had been deleted).
      This was again done for research purposes. However,
      within days an old lady, Thelma Arnold, from Lilburn,
      Georgia (11,596 inhabitants), was identified as user
      No.~4417749 in this dataset. It turned out that search
      engine queries are deep windows into people's private
      lives.

\item Genome-Wide Association Studies (GWAS) was a public
      database of gene-frequency studies linked to diseases.
      It would essentially record that people who have a
      disease, say diabetes, also have these genes. In order
      to maintain privacy, the dataset would only include
      aggregate information. In the case of DNA data this was
      achieved by mixing the DNA of many individuals (having
      a disease) into a single solution. Then this mixture
      was sequenced and included in the dataset. The idea
      was that the aggregate information would still be
      helpful to researchers, but would protect the DNA data
      of individuals.

      In 2007 a forensic computer scientist showed that
      individuals can still be identified. For this he used
      the DNA data from a comparison group (people from the
      general public) and ``subtracted'' this data from the
      published data. He was left with data that included
      all ``special'' DNA-markers of the individuals
      present in the original mixture. He essentially deleted
      the ``background noise''. Now the problem with
      DNA data is that it is of such a high resolution that
      even if the mixture contained maybe 100 individuals,
      you can now detect whether an individual was included
      in the mixture or not (a toy version of this
      ``subtraction'' is also sketched after this list).

      This result completely changed how DNA data is nowadays
      published for research purposes. After the success of
      the human-genome project, with its very open culture of
      exchanging data, it became much more difficult to
      anonymise data such that patients' privacy is preserved.
      The public GWAS database was taken offline in 2008.

\end{itemize}
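
\noindent To make the cross-referencing in the Massachusetts
example a bit more concrete, here is a small Python sketch of
such a linkage join. All names, records and field names below
are made up for illustration; the only point is that a join on
the quasi-identifiers birth date, gender and ZIP-code attaches
names to the supposedly anonymous records.

\begin{verbatim}
# made-up "anonymised" hospital records: names removed, but
# birth date, sex and ZIP-code retained
hospital_records = [
    {"dob": "1945-07-31", "sex": "M", "zip": "02138", "diagnosis": "..."},
    {"dob": "1962-03-14", "sex": "F", "zip": "02139", "diagnosis": "..."},
]

# made-up public voter roll: names plus the same three attributes
voter_roll = [
    {"name": "A. Smith", "dob": "1945-07-31", "sex": "M", "zip": "02138"},
    {"name": "B. Jones", "dob": "1962-03-14", "sex": "F", "zip": "02139"},
]

def reidentify(records, roll):
    # index the voter roll by the triple (dob, sex, zip) and join:
    # since this triple is unique for 87% of people in the US, the
    # join attaches a name to an "anonymous" medical record
    index = {(v["dob"], v["sex"], v["zip"]): v["name"] for v in roll}
    return [(index.get((r["dob"], r["sex"], r["zip"])), r["diagnosis"])
            for r in records]

print(reidentify(hospital_records, voter_roll))
# [('A. Smith', '...'), ('B. Jones', '...')]
\end{verbatim}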
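
\noindent The ``subtraction'' in the GWAS example can be
sketched in a similarly simplified way. This toy version
treats DNA-markers as plain sets (the real attack works
statistically on marker frequencies) and all marker names are
invented:

\begin{verbatim}
# made-up DNA-markers seen in the published mixture of the disease
# group, and in a comparison group drawn from the general public
mixture_markers    = {"m1", "m2", "m3", "m4", "m5", "m6"}
comparison_markers = {"m1", "m2", "m4"}        # the "background noise"

# subtracting the background leaves the "special" markers of the
# individuals who contributed to the original mixture
special_markers = mixture_markers - comparison_markers

# an individual's own (made-up) special markers: if they all show up,
# the individual was very likely part of the disease group
individual_markers = {"m3", "m6"}
print(individual_markers <= special_markers)   # True: likely included
\end{verbatim}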
       
\noindent There are many lessons that can be learned from
these examples. One is that when making data public in
anonymised form you want to achieve \emph{forward privacy}.
This means that no matter what other data is also available,
or will be released later, the data does not compromise
an individual's privacy. This principle was violated by the
data in the Netflix and governor of Massachusetts cases: there
the additional data allowed one to re-identify individuals in
the dataset. In the case of GWAS, a new technique of
re-identification compromised the privacy of people on the
list. The case of the AOL dataset shows clearly how incomplete
such data can be: Although the queries uniquely identified the
old lady, she also looked up diseases that her friends had,
which had nothing to do with her. Any rational analysis of her
query data must have concluded that the lady was on her
deathbed, while she was actually very much alive and kicking.
       
\subsubsection*{Differential Privacy}

Differential privacy is one of the few methods that try to
achieve forward privacy with large datasets. The basic idea
is to add appropriate noise, or errors, to any query of the
dataset. The intention is to make the result of a query
insensitive to individual entries in the database. The hope is
that the added error does not eliminate the ``signal'' one is
looking for by querying the dataset.

\begin{center}
User\;\;\;\;
\begin{tabular}{c}
tell me $f(x)$ $\Rightarrow$\\
$\Leftarrow$ $f(x) + \text{noise}$
\end{tabular}
\;\;\;\;\begin{tabular}{@{}c}
Database\\
$x_1, \ldots, x_n$
\end{tabular}
\end{center}
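
\noindent To give a feel for what such a noisy query could
look like, here is a small Python sketch of a counting query
answered with Laplace-distributed noise (one common choice;
the privacy parameter \texttt{epsilon} below is a made-up
value for illustration):

\begin{verbatim}
import random

# toy database x_1,...,x_n: each entry is 1 if the individual has
# some sensitive property (say, a disease) and 0 otherwise
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

def f(db):
    # the exact query result f(x): how many entries are 1
    return sum(db)

def noisy_f(db, epsilon=0.5):
    # f(x) plus Laplace noise of scale 1/epsilon (the difference of
    # two exponential random variables is Laplace-distributed);
    # adding or removing a single individual changes f(x) by at
    # most 1, so the noise masks any single entry's contribution
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return f(db) + noise

print(f(x))        # exact answer: 6
print(noisy_f(x))  # perturbed answer, e.g. 6.8 or 4.3
\end{verbatim}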

\end{document}

http://randomwalker.info/teaching/fall-2012-privacy-technologies/?