of them. The person with the red flag was intended to warn the
public, for example horse owners, about the impending
novelty---a car. In my humble opinion, we are at the same
stage of development with privacy. Nobody really knows what it
is about or what it is good for. All seems very hazy. There
are a few laws (e.g.~cookie law, right-to-be-forgotten law)
which address problems with privacy, but even if they are well
intentioned, they either back-fire or are already obsolete
because of newer technologies. The result is that the world of
``privacy'' looks a little bit like the old Wild West.

For example, UCAS, a charity set up to help students apply to
universities, has a commercial unit that happily sells your
email addresses to anybody who forks out enough money in order
to be able to bombard you with spam. Yes, you can very often
opt out of such ``schemes'', but in the case of UCAS any
opt-out will also limit legit emails you might actually be
interested in.\footnote{The main objectionable point, in my
opinion, is that the \emph{charity} everybody has to use for
HE applications has actually very honourable goals
(e.g.~assist applicants in gaining access to universities),
but the small print (or better the link ``About us'') reveals
that they set up their organisation so that they can also
shamelessly sell the email addresses they ``harvest''.
Everything is of course very legal\ldots{}moral?\ldots{}well
that is in the eye of the beholder. See:

\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities}
or
\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}

Another example: Verizon, an ISP which is supposed to provide
you just with connectivity, has found a ``nice'' side-business
too: When you have enabled all privacy guards in your browser
(the few you have at your disposal), Verizon happily adds a
kind of cookie to your
HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
As shown in the picture below, this cookie will be sent to
every web-site you visit. The web-sites can then forward the
cookie to advertisers who in turn pay Verizon to tell them
voting, exam marking and so on.

\item \textbf{Privacy} is the ability or right to protect
      your personal secrets (secrecy for the benefit of an
      individual). For example, in a job interview, I might
      not like to disclose that I am pregnant, if I were a
      woman, or that I am a father, lest they decide not to
      hire me. Similarly, I might not like to disclose my
      location data, because thieves might break into my
      house if they know I am away at work. Privacy is
      essentially everything which ``shouldn't be anybody's
      business''.

\end{itemize}

\noindent While this might provide us with some rough
definitions, the problem with privacy is that it is an
about my private life is usually used against me. As mentioned
above, public location data might mean I get robbed. If
supermarkets build a profile of my shopping habits, they will
use it to \emph{their} advantage---surely not to \emph{my}
advantage. Also, whatever might be collected about my life will
always be an incomplete, or even misleading, picture. For
example I am pretty sure my creditworthiness score was
temporarily(?) destroyed by not having a regular income in
this country (before coming to King's I worked in Munich for
five years). To correct such incomplete or flawed credit
history data there is now a law that allows you to check what
information is held about you for determining your
creditworthiness. But this concerns only a very small part of
the data that is held about me/you.

To see how private matters can really lead to the wrong
conclusions, take the example of Stephen Hawking: When he was
diagnosed with his disease, he was given a life expectancy of
two years. If employers had known about such problems, would
they have employed Hawking? Now, he is well past his 70th
birthday. Clearly personal medical data needs to stay private.
A movie which has this topic as its main focus is Gattaca from
1997, in case you would like to watch
it.\footnote{\url{http://www.imdb.com/title/tt0119177/}}

To cut a long story short, I let you ponder the two statements
that are often voiced in discussions about privacy:

\begin{itemize}
\item \textit{``You have zero privacy anyway. Get over it.''}\\
\mbox{}\hfill{}{\small{}by Scott McNealy (CEO of Sun)}

\item \textit{``If you have nothing to hide, you have nothing
to fear.''}
\end{itemize}
\end{center}

\noindent Funnily, or maybe not so funnily, the author of this
article carefully tries to construct an argument that not only
attacks the nothing-to-hide statement in cases where
governments \& co collect people's deepest secrets, or
pictures of people's naked bodies, but that also applies in
cases where governments ``only'' collect data relevant to,
say, preventing terrorism. The fun is of course that in 2011
we could just not imagine that respected governments would do
such infantile things as intercepting
\subsubsection*{Re-Identification Attacks}

Apart from philosophical musings, there are fortunately also
some real technical problems with privacy. The problem I want
to focus on in this handout is how to safely disclose datasets
containing potentially very private data, say health records.
What can go wrong with such disclosures can be illustrated
with four well-known examples:

\begin{itemize}
\item In 2006, a then young company called Netflix offered a
      \$1 million prize to anybody who could improve their
      movie rating algorithm. For this they disclosed a
      dataset containing 10\% of all Netflix users at the time
      (approx.~500K). They removed names, but included
      numerical ratings of movies as well as times when
      ratings were uploaded, though some information was
      perturbed (i.e., slightly modified).

      Two researchers had a closer look at this anonymised
      data and compared it with public data available from the
      International Movie Database (IMDb). They found that
      98\% of the entries could be re-identified in the
      made public, the then governor assured the public that
      the released dataset protected patient privacy by
      deleting identifiers.

      A graduate student could not resist cross-referencing
      public voter data with the released data that still
      included birth dates, gender and ZIP-code. The result
      was that she could send the governor his own hospital
      record. It turns out that birth dates, gender and
      ZIP-code uniquely identify 87\% of people in the US.
      This work resulted in a number of laws prescribing which
      private data cannot be released in such datasets. (A toy
      sketch of such a linkage attack is given just after this
      list.)

\item In 2006, AOL published 20 million Web search queries
      collected from 650,000 users (names had been deleted).
      This was again done for research purposes. However,
      within days an old lady, Thelma Arnold, from Lilburn,
      lives.

\item Genome-Wide Association Studies (GWAS) was a public
      database of gene-frequency studies linked to diseases.
      It would essentially record that people who have a
      disease, say diabetes, also have certain genes. In order
      to maintain privacy, the dataset would only include
      aggregate information. In the case of DNA data this
      aggregation was achieved by mixing the DNA of many
      individuals (having a disease) into a single solution.
      Then this mixture was sequenced and included in the
      dataset. The idea was that the aggregate information
      would still be helpful to researchers, but would protect
      the DNA data of individuals.

      In 2007 a forensic computer scientist showed that
      individuals can still be identified. For this he used
      the DNA data from a comparison group (people from the
      general public) and ``subtracted'' this data from the
      published data. He was left with data that included all
      ``special'' DNA-markers of the individuals present in
      the original mixture. He essentially deleted the
      ``background noise'' in the published data. The problem
      with DNA data is that it is of such a high resolution
      that even if the mixture contained maybe 100
      individuals, you can now detect whether an individual
      was included in the mixture or not. (A small simulation
      of this kind of test is sketched below, after the
      discussion of forward privacy.)

      This result completely changed how DNA data is nowadays
      published for research purposes. After the success of
      the human-genome project with a very open culture of
      exchanging data, it became much more difficult to
      anonymise data so that patients' privacy is preserved.
      The public GWAS database was taken offline in 2008.

\end{itemize}
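\par\medskip
\noindent
To make the linkage attack from the governor-of-Massachusetts
example above concrete, here is a minimal Python sketch. It is
my own toy illustration, not code from the original study, and
all names and records in it are invented. It joins an
``anonymised'' medical table with a public voter roll on the
quasi-identifiers birth date, gender and ZIP-code; the join
succeeds only because this triple is unique for the vast
majority of people.

\begin{verbatim}
# Toy linkage attack: re-identify "anonymised" medical records
# by joining them with a public voter roll on quasi-identifiers.
# All records below are invented for illustration.

medical = [   # names removed, quasi-identifiers kept
    {"dob": "1962-01-15", "sex": "F", "zip": "02139",
     "diagnosis": "asthma"},
    {"dob": "1945-07-31", "sex": "M", "zip": "02138",
     "diagnosis": "hypertension"},
]

voters = [    # public record: names *and* quasi-identifiers
    {"name": "Alice Smith", "dob": "1962-01-15",
     "sex": "F", "zip": "02139"},
    {"name": "Bob Jones",   "dob": "1980-03-02",
     "sex": "M", "zip": "02140"},
]

def qid(r):
    # the quasi-identifier: (birth date, gender, ZIP)
    return (r["dob"], r["sex"], r["zip"])

names_by_qid = {qid(v): v["name"] for v in voters}

for rec in medical:
    name = names_by_qid.get(qid(rec))
    if name is not None:      # unique match => re-identified
        print(name, "->", rec["diagnosis"])
\end{verbatim}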
\noindent There are many lessons that can be learned from
these examples. One is that when making datasets public in
anonymised form, you want to achieve \emph{forward privacy}.
This means, no matter what other data is also available or
will be released later, the data in the original dataset does
not compromise an individual's privacy. This principle was
violated by the availability of ``outside data'' in the
Netflix and governor of Massachusetts cases: the additional
data permitted a re-identification of individuals in the
dataset. In the case of GWAS a new technique of
re-identification compromised the privacy of people in the
dataset. The case of the AOL dataset shows clearly how
incomplete such data can be: although the queries uniquely
identified the older lady, she also looked up diseases that
her friends had, which had nothing to do with her. Any
rational analysis of her query data must therefore have
concluded that the lady was on her deathbed, while she was
actually very much alive and kicking.
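\par\medskip
\noindent
To get a feel for how an attack on aggregate DNA data can
work, here is a small, self-contained Python simulation. It is
a toy model of my own, not the forensic scientist's actual
method, and all ``genomes'' in it are randomly generated. Only
the per-marker frequencies of the case mixture are published,
yet a simple score---correlating a genome's deviation from the
reference population with the mixture's deviation---tends to
be clearly positive for someone who was in the mixture and
close to zero for an outsider.

\begin{verbatim}
# Toy membership test on aggregate DNA data (simulated).
# Published: only per-marker frequencies of the case mixture.
import random

random.seed(1)
num_markers = 2000
ref_freq = [random.uniform(0.1, 0.9) for _ in range(num_markers)]

def random_genome():
    # 1 = marker present, 0 = absent, drawn from reference freqs
    return [1 if random.random() < p else 0 for p in ref_freq]

cases = [random_genome() for _ in range(100)]  # study group
mixture_freq = [sum(col) / len(cases) for col in zip(*cases)]

def membership_score(genome):
    # Members pull the mixture slightly towards their own
    # genome; outsiders do not (the "background noise" cancels).
    return sum((g - p) * (m - p)
               for g, m, p in zip(genome, mixture_freq, ref_freq))

print("participant:", round(membership_score(cases[0]), 2))
print("outsider:   ", round(membership_score(random_genome()), 2))
\end{verbatim}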
\subsubsection*{Differential Privacy}

Differential privacy is one of the few methods that tries to
achieve forward privacy with large datasets. The basic idea