author Christian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 14 Nov 2014 23:04:40 +0000 (2014-11-14)
changeset 313 1d243ac51078
parent 312 c913fe9bfd59
child 314 e01f55e7485a
Binary file handouts/ho07.pdf has changed
--- a/handouts/ho07.tex	Fri Nov 14 14:03:15 2014 +0000
+++ b/handouts/ho07.tex	Fri Nov 14 23:04:40 2014 +0000
@@ -14,27 +14,27 @@
 novelty---a car. In my humble opinion, we are at the same
 stage of development with privacy. Nobody really knows what it
 is about or what it is good for. All seems very hazy. There
-are a few laws (cookie law, right-to-be-forgotten) which
-address problems with privacy, but even if they are well
+are a few laws (e.g.~cookie law, right-to-be-forgotten law)
+which address problems with privacy, but even if they are well
 intentioned, they either back-fire or are already obsolete
 because of newer technologies. The result is that the world of
 ``privacy'' looks a little bit like the old Wild West.
-For example, UCAS, a charity set up to help students to apply
-to universities, has a commercial unit that happily sells your
-email addresses to anybody who forks out enough money in order
-to be able to bombard you with spam. Yes, you can opt out very
-often in such ``schemes'', but in case of UCAS any opt-out
-will limit also legit emails you might actually be interested
-in.\footnote{The main objectionable point, in my opinion, is
-that the \emph{charity} everybody has to use for HE
-applications has actually very honourable goals (e.g.~assist
-applicants in gaining access to universities), but in their
-small print (or better under the link ``About us'') reveals
-they set up their organisation so that they can also
-shamelessly sell email addresses they ``harvest''. Everything
-is of course very legal\ldots{}moral?\ldots{}well that is in
-the eye of the beholder. See:
+For example, UCAS, a charity set up to help students with
+applying to universities, has a commercial unit that happily
+sells your email addresses to anybody who forks out enough
+money in order to be able to bombard you with spam. Yes, you
+can opt out very often in such ``schemes'', but in case of
+UCAS any opt-out will also limit legit emails you might
+actually be interested in.\footnote{The main objectionable
+point, in my opinion, is that the \emph{charity} everybody has
+to use for HE applications has actually very honourable goals
+(e.g.~assist applicants in gaining access to universities),
+but the small print (or better the link ``About
+us'') reveals they set up their organisation so that they can
+also shamelessly sell the email addresses they ``harvest''.
+Everything is of course very legal\ldots{}moral?\ldots{}well
+that is in the eye of the beholder. See:
@@ -43,7 +43,7 @@
 Another example: Verizon, an ISP who is supposed to provide
 you just with connectivity, has found a ``nice'' side-business
 too: When you have enabled all privacy guards in your browser
-(the few you have at your disposal) Verizon happily adds a
+(the few you have at your disposal), Verizon happily adds a
 kind of cookie to your
 As shown in the picture below, this cookie will be sent to
@@ -93,12 +93,12 @@
 \item \textbf{Privacy} is the ability or right to protect your
       personal secrets (secrecy for the benefit of an
       individual). For example, in a job interview, I might
-      not like to disclose that I am pregnant, if I were
-      a woman, or that I am a father. Similarly, I might not
-      like to disclose my location data, because thieves might
-      break into my house if they know I am away at work. 
-      Privacy is essentially everything which `shouldn't be
-      anybody's business'.
+      not like to disclose that I am pregnant, if I were a
+      woman, or that I am a father, lest they not hire
+      me. Similarly, I might not like to disclose my location
+      data, because thieves might break into my house if they
+      know I am away at work. Privacy is essentially
+      everything which ``shouldn't be anybody's business''.
@@ -121,15 +121,15 @@
 supermarkets build a profile of my shopping habits, they will
 use it to \emph{their} advantage---surely not to \emph{my}
 advantage. Also whatever might be collected about my life will
-always be an incomplete, or even misleading, picture---for
-example I am sure my creditworthiness score was temporarily(?)
-destroyed by not having a regular income in this country
-(before coming to King's I worked in Munich for five years).
-To correct such incomplete or flawed credit history data there
-is, since recently, a law that allows you to check what
-information is held about you for determining your
-creditworthiness. But this concerns only a very small part of
-the data that is held about me/you.
+always be an incomplete, or even misleading, picture. For
+example, I am pretty sure my creditworthiness score was
+temporarily(?) destroyed by not having a regular income in
+this country (before coming to King's I worked in Munich for
+five years). To correct such incomplete or flawed credit
+history data there is, since recently, a law that allows you
+to check what information is held about you for determining
+your creditworthiness. But this concerns only a very small
+part of the data that is held about me/you.
 To see how private matter can lead really to the wrong
 conclusions, take the example of Stephen Hawking: When he was
@@ -138,14 +138,15 @@
 they have employed Hawking? Now, he is enjoying his 70+
 birthday. Clearly personal medical data needs to stay private.
 A movie which has this topic as its main focus is Gattaca from
+1997, in case you like to watch
 To cut a long story short, I let you ponder about the two
-statements that often voiced in discussions about privacy:
+statements that are often voiced in discussions about privacy:
-\item \textit{``You have zero privacy anyway. Get over it.''}\\
+\item \textit{``You have zero privacy anyway. Get over it.''}
 \mbox{}\hfill{}{\small{}by Scott McNealy (CEO of Sun)}
 \item \textit{``If you have nothing to hide, you have nothing 
@@ -163,7 +164,7 @@
 \noindent Funnily, or maybe not so funnily, the author of this
 article carefully tries to construct an argument that does not
 only attack the nothing-to-hide statement in cases where
-governments \& Co collect people's deepest secrets, or
+governments \& co collect people's deepest secrets, or
 pictures of people's naked bodies, but an argument that
 applies also in cases where governments ``only'' collect data
 relevant to, say, preventing terrorism. The fun is of course
@@ -179,9 +180,9 @@
 Apart from philosophical musings, there are fortunately also
 some real technical problems with privacy. The problem I want
 to focus on in this handout is how to safely disclose datasets
-containing very potentially private data, say health data. What can
-go wrong with such disclosures can be illustrated with four
-well-known examples:
+containing potentially very private data, say health records.
+What can go wrong with such disclosures can be illustrated
+with four well-known examples:
 \item In 2006, a then young company called Netflix offered a 1
@@ -189,9 +190,9 @@
       rating algorithm. For this they disclosed a dataset
       containing 10\% of all Netflix users at the time
       (appr.~500K). They removed names, but included numerical
-      ratings of movies as well as times of ratings. Though
-      some information was perturbed (i.e., slightly
-      modified).
+      ratings of movies as well as times when ratings were
+      uploaded, though some information was perturbed (i.e.,
+      slightly modified).
       Two researchers had a closer look at this anonymised
       data and compared it with public data available from the
@@ -212,13 +213,13 @@
       deleting identifiers. 
       A graduate student could not resist cross-referencing
-      public voter data with the released data including birth
-      dates, gender and ZIP-code. The result was that she
-      could send the governor his own hospital record. It
-      turns out that birth dates, gender and ZIP-code uniquely
-      identify 87\% of people in the US. This work resulted
-      in a number of laws prescribing which private data
-      cannot be released in such datasets.
+      public voter data with the released data that still
+      included birth dates, gender and ZIP-code. The result
+      was that she could send the governor his own hospital
+      record. It turns out that birth dates, gender and
+      ZIP-code uniquely identify 87\% of people in the US.
+      This work resulted in a number of laws prescribing which
+      private data cannot be released in such datasets.
 \item In 2006, AOL published 20 million Web search queries
       collected from 650,000 users (names had been deleted).
@@ -232,54 +233,55 @@
 \item Genome-Wide Association Studies (GWAS) was a public
       database of gene-frequency studies linked to diseases.
       It would essentially record that people who have a
-      disease, say diabetes, have also these genes. In order
+      disease, say diabetes, also have certain genes. In order
       to maintain privacy, the dataset would only include
-      aggregate information. In case of DNA data this was 
-      achieved by mixing the DNA of many individuals (having
-      a disease) into a single solution. Then this mixture 
-      was sequenced and included in the dataset. The idea
-      was that the agregate information would still be helpful
-      to researchers, but would protect the DNA data of 
-      individuals. 
+      aggregate information. In case of DNA data this
+      aggregation was achieved by mixing the DNA of many
+      individuals (having a disease) into a single solution.
+      Then this mixture was sequenced and included in the
+      dataset. The idea was that the aggregate information
+      would still be helpful to researchers, but would protect
+      the DNA data of individuals. 
-      In 2007 a forensic computer scientist showed that 
-      individuals can be still identified. For this he used
+      In 2007 a forensic computer scientist showed that
+      individuals can still be identified. For this he used
       the DNA data from a comparison group (people from the
       general public) and ``subtracted'' this data from the
-      published data. He was left with data that included
-      all ``special'' DNA-markers of the individuals
-      present in the original mixture. He essentially deleted
-      the ``background noise''. Now the problem with
-      DNA data is that it is of such a high resolution that
-      even if the mixture contained maybe 100 individuals,
-      you can now detect whether an individual was included
-      in the mixture or not.
+      published data. He was left with data that included all
+      ``special'' DNA-markers of the individuals present in
+      the original mixture. He essentially deleted the
+      ``background noise'' in the published data. The
+      problem with DNA data is that it is of such a high
+      resolution that even if the mixture contained maybe 100
+      individuals, you can now detect whether an individual
+      was included in the mixture or not.
       This result changed completely how DNA data is nowadays
       published for research purposes. After the success of 
       the human-genome project with a very open culture of
       exchanging data, it became much more difficult to 
-      anonymise datasuch that patient's privacy is preserved.
+      anonymise data so that patients' privacy is preserved.
       The public GWAS database was taken offline in 2008.
 \noindent There are many lessons that can be learned from
-these examples. One is that when making data public in 
-anonymised form you want to achieve \emph{forward privacy}.
-This means, no matter of what other data that is also available
-or will be released later, the data does not compromise
-an individual's privacy. This principle was violated by the 
-data in the Netflix and governor of Massachusetts cases. There
-additional data allowed one to re-identify individuals in the
-dataset. In case of GWAS a new technique of re-identification 
-compromised the privacy of people on the list.
-The case of the AOL dataset shows clearly how incomplete such 
-data can be: Although the queries uniquely identified the
-old lady, she also looked up diseases that her friends had,
-which had nothing to do with her. Any rational analysis of her
-query data must have concluded, the lady is on her deathbed, 
-while she was actually very much alive and kicking.
+these examples. One is that when making datasets public in
+anonymised form, you want to achieve \emph{forward privacy}.
+This means that, no matter what other data is also available
+or will be released later, the data in the original dataset
+does not compromise an individual's privacy. This principle
+was violated by the availability of ``outside data'' in the
+Netflix and governor of Massachusetts cases. The additional
+data permitted a re-identification of individuals in the
+dataset. In case of GWAS a new technique of re-identification
+compromised the privacy of people in the dataset. The case of
+the AOL dataset shows clearly how incomplete such data can be:
+Although the queries uniquely identified the older lady, she
+also looked up diseases that her friends had, which had
+nothing to do with her. Any rational analysis of her query
+data must therefore have concluded that the lady is on her
+deathbed, while she was actually very much alive and kicking.
 \subsubsection*{Differential Privacy}
@@ -305,11 +307,14 @@
 \subsubsection*{Further Reading}
 A readable article about how supermarkets mine your shopping
 habits (especially how they prey on young exhausted families
-;o) appeared in 2012 in a New York Times article.
+;o) appeared in 2012 in the New York Times:
@@ -329,10 +334,10 @@
-\noindent An article that sheds light on the paradox that 
+\noindent An article that sheds light on the paradox that
 people usually worry about privacy invasions of little
-significance, and overlook that might cause significant 
+significance, and overlook the privacy invasion that might
+cause significant damage: