updated
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 14 Nov 2014 01:17:38 +0000
changeset 309 b1ba3d88696e
parent 308 2a814c06ae03
child 310 591b62e1f86a
handouts/ho07.pdf
handouts/ho07.tex
Binary file handouts/ho07.pdf has changed
--- a/handouts/ho07.tex	Thu Nov 13 18:53:25 2014 +0000
+++ b/handouts/ho07.tex	Fri Nov 14 01:17:38 2014 +0000
@@ -15,19 +15,21 @@
 stage of development with privacy. Nobody really knows what it
 is about or what it is good for. All seems very hazy. The
 result is that the world of ``privacy'' looks a little bit
-like the old Wild West. For example, UCAS, a charity set up to
-help students apply to universities, has a commercial unit
-that happily sells your email addresses to anybody who forks
-out enough money in order to bombard you with spam. Yes, you
-can opt out very often, but in case of UCAS any opt-out will
-limit also legit emails you might actually be interested
+like the old Wild West. Anything goes, it seems. 
+
+For example, UCAS, a charity set up to help students to apply
+to universities, has a commercial unit that happily sells your
+email addresses to anybody who forks out enough money in order
+to be able to bombard you with spam. Yes, you can very often
+opt out of such ``schemes'', but in the case of UCAS any
+opt-out will also limit legitimate emails you might actually
+be interested
 in.\footnote{The main objectionable point, in my opinion, is
 that the \emph{charity} everybody has to use for HE
 applications actually has very honourable goals
 (e.g.~assisting applicants in gaining access to universities),
 but its small print (or rather the link ``About us'') reveals
 that they set up their organisation so that they can also
-shamelessly sell email addresses the ``harvest''. Everything
+shamelessly sell email addresses they ``harvest''. Everything
 is of course very legal\ldots{}moral?\ldots{}well that is in
 the eye of the beholder. See:
 
@@ -35,10 +37,11 @@
 or
 \url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}
 
-Verizon, an ISP who provides you with connectivity, has found
-a ``nice'' side-business too: When you have enabled all
-privacy guards in your browser, the few you have at your
-disposal, Verizon happily adds a kind of cookie to your
+Another example: Verizon, an ISP that provides you with
+connectivity, has found a ``nice'' side-business too: even
+when you have enabled all the privacy guards in your browser
+(the few you have at your disposal), Verizon happily adds a
+kind of cookie to your
 HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
 As shown in the picture below, this cookie will be sent to
 every web-site you visit. The web-sites can then forward the
@@ -61,7 +64,7 @@
 
 Why does privacy matter? Nobody, I think, has a conclusive
 answer to this question. Maybe the following four notions
-clarify the picture somewhat: 
+help to clarify the overall picture somewhat: 
 
 \begin{itemize}
 \item \textbf{Secrecy} is the mechanism used to limit the
@@ -91,7 +94,7 @@
       like to disclose my location data, because thieves might
       break into my house if they know I am away at work. 
       Privacy is essentially everything which `shouldn't be
-      anybodies business'.
+      anybody's business'.
 
 \end{itemize}
 
@@ -99,66 +102,116 @@
 definitions, the problem with privacy is that there is an
 extremely fine line between what should stay private and what
 should not. For example, since I am working in academia, I am very
-happy to be essentially a digital exhibitionist: I am happy to
+happy to be a digital exhibitionist: I am quite willing to
 disclose all `trivia' related to my work on my personal
 web-page. This is a kind of bragging that is normal in
-academia (at least in the CS field). I am even happy that
-Google maintains a profile about all of my academic papers and
-their citations. 
+academia (at least in the field of CS), and even expected if
+you are looking for a job. I am also happy that Google
+maintains a profile of all my academic papers and their
+citations. 
 
-On the other hand I would be very peeved if anybody had a too
-close look on my private live---it shouldn't be anybodies
-business. The reason is that knowledge about my private life
-usually is used against me. As mentioned above, public
-location data might mean I get robbed. If supermarkets build a
-profile of my shopping habits, they will use it to
-\emph{their} advantage---surely not to \emph{my} advantage.
-Also whatever might be collected about my life will always be
-an incomplete, or even misleading, picture---I am sure my
-creditworthiness score was temporarily(?) destroyed by not
-having a regular income in this country (before coming to
-King's I worked in Munich). To correct such incomplete or
-flawed data there is, since recently, a law that allows you to
-check what information is held about you for determining your
+On the other hand I would be very irritated if anybody I do
+not know took too close a look at my private life---it
+shouldn't be anybody's business. The reason is that knowledge
+about my private life is usually used against me. As mentioned
+above, public location data might mean I get robbed. If
+supermarkets build a profile of my shopping habits, they will
+use it to \emph{their} advantage---surely not to \emph{my}
+advantage. Also, whatever might be collected about my life
+will always be an incomplete, or even misleading, picture---for
+example, I am sure my creditworthiness score was temporarily(?)
+destroyed by my not having had a regular income in this country
+(before coming to King's I worked in Munich for five years).
+To correct such incomplete or flawed credit-history data there
+is now a law that allows you to check what information is held
+about you for determining your
 creditworthiness. But this concerns only a very small part of
 the data that is held about me/you.
 
-This is an endless field. I let you ponder about the two
-statements that are often float about in discussions about
-privacy:
+To cut a long story short, I leave you to ponder the two
+statements that are often voiced in discussions about privacy:
 
 \begin{itemize}
 \item \textit{``You have zero privacy anyway. Get over it.''}\\
-\mbox{}\hfill{}Scott Mcnealy (CEO of Sun)
+\mbox{}\hfill{}{\small{}by Scott McNealy (CEO of Sun)}
 
 \item \textit{``If you have nothing to hide, you have nothing 
 to fear.''}
 \end{itemize}
  
-\noindent There are some technical problems that are easier to
-discuss and that often have privacy implications. The problem
-I want to focus on is how to safely disclose datasets. What
-can go wrong with this can be illustrated with three examples:
+\noindent An article that attempts a deeper analysis appeared
+in 2011 in the Chronicle of Higher Education:
+
+\begin{center} 
+\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/} 
+\end{center} 
+
+\noindent Funnily, or maybe not so funnily, the author of this
+article carefully constructs an argument that does not only
+attack the nothing-to-hide statement in cases where
+governments \& Co collect people's deepest secrets, or
+pictures of people's naked bodies, but that also applies in
+cases where governments ``only'' collect data relevant to,
+say, preventing terrorism. The funny thing is, of course, that
+in 2011 we could just not imagine that respected governments
+would do such infantile things as intercepting people's nude
+photos. Well, since Snowden we know that some people at the
+NSA did exactly that and then shared such photos among
+colleagues as a ``fringe benefit''.  
+
+
+\subsubsection*{Re-Identification Attacks} 
+
+Apart from philosophical arguments, there are fortunately also
+some real technical problems with privacy implications. The
+problem I want to focus on in this handout is how to safely
+disclose datasets containing potentially private data, say
+health data. What can go wrong with such disclosures can be
+illustrated with four examples:
 
 \begin{itemize}
-\item In 2006 a then young company called Netflix offered a 1
+\item In 2006, a then young company called Netflix offered a
       \$1 million prize to anybody who could improve their movie
       rating algorithm. For this they disclosed a dataset
-      containing 10\% of all Netflix users (appr.~500K). They
-      removed names, but included numerical ratings as well as
-      times of ratings. Though some information was perturbed
-      (i.e., slightly modified).
+      containing 10\% of all Netflix users at the time
+      (approx.~500K). They removed names, but included
+      numerical ratings of movies as well as the times of the
+      ratings, though some of this information was perturbed
+      (i.e., slightly modified).
       
-      Two researchers took that data and compared it with
-      public data available from the International Movie
-      Database (IMDb). They found that 98 \% of the entries
-      could be re-identified: either by their ratings or by
-      the dates the ratings were uploaded. 
+      Two researchers had a closer look at this anonymised
+      data and compared it with public data available from the
+      Internet Movie Database (IMDb). They found that 98\% of
+      the entries in the Netflix dataset could be
+      re-identified: either by their ratings or by the dates
+      the ratings were uploaded. The result was a class-action
+      suit against Netflix, which was only recently settled
+      and involved a lot of money.
 
-\item In the 1990, medical databases were routinely made
-      publicised for research purposes. This was done in
-      anonymised form with names removed, but birth dates,
-      gender, ZIP-code were retained.
+\item In the 1990s, medical datasets were often made public
+      for research purposes. This was done in anonymised form
+      with names removed, but birth dates, gender and
+      ZIP-codes retained. In one case where such data was made
+      public about state employees in Massachusetts, the then
+      governor assured the public that the released dataset
+      protected patient privacy by deleting identifiers. A
+      graduate student could not resist and cross-referenced
+      public voter data with the medical data, using birth
+      dates, gender and ZIP-codes (a toy sketch of such a
+      linkage attack is given after this list). The result was
+      that she could send the governor his own hospital
+      record.
+ 
+\item In 2006, AOL published 20 million web-search queries
+      collected from 650,000 users (names had been deleted).
+      This was again done for research purposes. However,
+      within days an elderly lady, Thelma Arnold, from
+      Lilburn, Georgia (11,596 inhabitants), was identified as
+      user No.~4417749 in this dataset. It turned out that
+      search engine queries are windows into people's private
+      lives. 
+  
+\item Genome-Wide Association Studies (GWAS) was a public
+      database of gene-frequency studies linked to diseases.
+      It turned out that only partial DNA information was
+      needed in order to identify whether an individual was
+      part of a study. The database was closed down in 2008.
       
 \end{itemize}
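+
+\noindent At least the first two examples rest on the same
+simple idea: attributes that look harmless on their own (movie
+ratings and their dates, or birth dates, gender and ZIP-codes)
+can act as \emph{quasi-identifiers} and be joined with a
+second, public dataset in order to re-identify individuals.
+Just to give a flavour of how simple such a linkage attack can
+be, below is a small Python sketch. The datasets, field names
+and values in it are entirely invented for illustration; the
+real attacks of course worked on much larger and messier data.
+
+{\small
+\begin{verbatim}
+# toy sketch of a linkage (re-identification) attack: join an
+# "anonymised" medical dataset with a public voter list via the
+# quasi-identifiers birth date, gender and ZIP-code
+# (all data below is invented)
+
+medical = [   # names removed, quasi-identifiers kept
+  {"birth": "1945-07-31", "gender": "m", "zip": "02138",
+   "diagnosis": "heart disease"},
+  {"birth": "1988-01-02", "gender": "f", "zip": "02139",
+   "diagnosis": "flu"},
+]
+
+voters = [    # public record: names plus quasi-identifiers
+  {"name": "A. Smith", "birth": "1945-07-31", "gender": "m",
+   "zip": "02138"},
+  {"name": "B. Jones", "birth": "1970-05-05", "gender": "f",
+   "zip": "02139"},
+]
+
+# index the voter list by the quasi-identifier triple
+index = {(v["birth"], v["gender"], v["zip"]): v["name"]
+         for v in voters}
+
+# every medical record whose triple also occurs in the voter
+# list is re-identified
+for m in medical:
+  key = (m["birth"], m["gender"], m["zip"])
+  if key in index:
+    print(index[key], "->", m["diagnosis"])
+\end{verbatim}}
+
+\noindent Running this prints the re-identified pair consisting
+of the (invented) name A.~Smith and the diagnosis heart
+disease. The point is that the medical dataset itself never
+contained a name; the name comes entirely from the join with
+the second, public dataset.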