--- a/handouts/ho07.tex Fri Nov 14 14:03:15 2014 +0000
+++ b/handouts/ho07.tex Fri Nov 14 23:04:40 2014 +0000
@@ -14,27 +14,27 @@
novelty---a car. In my humble opinion, we are at the same
stage of development with privacy. Nobody really knows what it
is about or what it is good for. All seems very hazy. There
-are a few laws (cookie law, right-to-be-forgotten) which
-address problems with privacy, but even if they are well
+are a few laws (e.g.~cookie law, right-to-be-forgotten law)
+which address problems with privacy, but even if they are well
intentioned, they either back-fire or are already obsolete
because of newer technologies. The result is that the world of
``privacy'' looks a little bit like the old Wild West.
-For example, UCAS, a charity set up to help students to apply
-to universities, has a commercial unit that happily sells your
-email addresses to anybody who forks out enough money in order
-to be able to bombard you with spam. Yes, you can opt out very
-often in such ``schemes'', but in case of UCAS any opt-out
-will limit also legit emails you might actually be interested
-in.\footnote{The main objectionable point, in my opinion, is
-that the \emph{charity} everybody has to use for HE
-applications has actually very honourable goals (e.g.~assist
-applicants in gaining access to universities), but in their
-small print (or better under the link ``About us'') reveals
-they set up their organisation so that they can also
-shamelessly sell email addresses they ``harvest''. Everything
-is of course very legal\ldots{}moral?\ldots{}well that is in
-the eye of the beholder. See:
+For example, UCAS, a charity set up to help students with
+applying to universities, has a commercial unit that happily
+sells your email addresses to anybody who forks out enough
+money in order to be able to bombard you with spam. Yes, you
+can very often opt out of such ``schemes'', but in the case
+of UCAS any opt-out will also limit legit emails you might
+actually be interested in.\footnote{The main objectionable
+point, in my opinion, is that the \emph{charity} everybody has
+to use for HE applications actually has very honourable goals
+(e.g.~assist applicants in gaining access to universities),
+but the small print (or rather the link ``About
+us'') reveals they set up their organisation so that they can
+also shamelessly sell the email addresses they ``harvest''.
+Everything is of course very legal\ldots{}moral?\ldots{}well
+that is in the eye of the beholder. See:
\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities}
or
@@ -43,7 +43,7 @@
Another example: Verizon, an ISP who is supposed to provide
you just with connectivity, has found a ``nice'' side-business
too: When you have enabled all privacy guards in your browser
-(the few you have at your disposal) Verizon happily adds a
+(the few you have at your disposal), Verizon happily adds a
kind of cookie to your
HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
As shown in the picture below, this cookie will be sent to
@@ -93,12 +93,12 @@
\item \textbf{Privacy} is the ability or right to protect your
personal secrets (secrecy for the benefit of an
individual). For example, in a job interview, I might
- not like to disclose that I am pregnant, if I were
- a woman, or that I am a father. Similarly, I might not
- like to disclose my location data, because thieves might
- break into my house if they know I am away at work.
- Privacy is essentially everything which `shouldn't be
- anybody's business'.
+ not like to disclose that I am pregnant, if I were a
+      woman, or that I am a father, for fear that they might
+      not hire me. Similarly, I might not like to disclose my
+      location
+ data, because thieves might break into my house if they
+ know I am away at work. Privacy is essentially
+ everything which ``shouldn't be anybody's business''.
\end{itemize}
@@ -121,15 +121,15 @@
supermarkets build a profile of my shopping habits, they will
use it to \emph{their} advantage---surely not to \emph{my}
advantage. Also whatever might be collected about my life will
-always be an incomplete, or even misleading, picture---for
-example I am sure my creditworthiness score was temporarily(?)
-destroyed by not having a regular income in this country
-(before coming to King's I worked in Munich for five years).
-To correct such incomplete or flawed credit history data there
-is, since recently, a law that allows you to check what
-information is held about you for determining your
-creditworthiness. But this concerns only a very small part of
-the data that is held about me/you.
+always be an incomplete, or even misleading, picture. For
+example, I am pretty sure my creditworthiness score was
+temporarily(?) destroyed by not having a regular income in
+this country (before coming to King's I worked in Munich for
+five years). To correct such incomplete or flawed credit
+history data, a recently introduced law allows you to check
+what information is held about you for determining your
+creditworthiness. But this concerns only a very small
+part of the data that is held about me/you.
To see how private matters can really lead to the wrong
conclusions, take the example of Stephen Hawking: When he was
@@ -138,14 +138,15 @@
they have employed Hawking? Now, he is enjoying his 70+
birthday. Clearly personal medical data needs to stay private.
A movie which has this topic as its main focus is Gattaca from
-1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}}
+1997, in case you would like to watch
+it.\footnote{\url{http://www.imdb.com/title/tt0119177/}}
To cut a long story short, I leave you to ponder the two
-statements that often voiced in discussions about privacy:
+statements that are often voiced in discussions about privacy:
\begin{itemize}
-\item \textit{``You have zero privacy anyway. Get over it.''}\\
+\item \textit{``You have zero privacy anyway. Get over it.''}
\mbox{}\hfill{}{\small{}by Scott McNealy (CEO of Sun)}
\item \textit{``If you have nothing to hide, you have nothing
@@ -163,7 +164,7 @@
\noindent Funnily, or maybe not so funnily, the author of this
article carefully tries to construct an argument that not
only attacks the nothing-to-hide statement in cases where
-governments \& Co collect people's deepest secrets, or
+governments \& co collect people's deepest secrets, or
pictures of people's naked bodies, but an argument that
also applies in cases where governments ``only'' collect data
relevant to, say, preventing terrorism. The fun is of course
@@ -179,9 +180,9 @@
Apart from philosophical musings, there are fortunately also
some real technical problems with privacy. The problem I want
to focus on in this handout is how to safely disclose datasets
-containing very potentially private data, say health data. What can
-go wrong with such disclosures can be illustrated with four
-well-known examples:
+containing potentially very private data, say health records.
+What can go wrong with such disclosures can be illustrated
+with four well-known examples:
\begin{itemize}
\item In 2006, a then young company called Netflix offered a 1
@@ -189,9 +190,9 @@
rating algorithm. For this they disclosed a dataset
containing 10\% of all Netflix users at the time
(appr.~500K). They removed names, but included numerical
- ratings of movies as well as times of ratings. Though
- some information was perturbed (i.e., slightly
- modified).
+ ratings of movies as well as times when ratings were
+      uploaded, though some information was perturbed (i.e.,
+ slightly modified).
Two researchers had a closer look at this anonymised
data and compared it with public data available from the
@@ -212,13 +213,13 @@
deleting identifiers.
A graduate student could not resist cross-referencing
- public voter data with the released data including birth
- dates, gender and ZIP-code. The result was that she
- could send the governor his own hospital record. It
- turns out that birth dates, gender and ZIP-code uniquely
- identify 87\% of people in the US. This work resulted
- in a number of laws prescribing which private data
- cannot be released in such datasets.
+ public voter data with the released data that still
+ included birth dates, gender and ZIP-code. The result
+ was that she could send the governor his own hospital
+ record. It turns out that birth dates, gender and
+      ZIP-code uniquely identify 87\% of people in the US
+      (see the code sketch after this list).
+ This work resulted in a number of laws prescribing which
+ private data cannot be released in such datasets.
\item In 2006, AOL published 20 million Web search queries
collected from 650,000 users (names had been deleted).
@@ -232,54 +233,55 @@
\item Genome-Wide Association Studies (GWAS) was a public
database of gene-frequency studies linked to diseases.
It would essentially record that people who have a
- disease, say diabetes, have also these genes. In order
+      disease, say diabetes, also have certain genes. In order
to maintain privacy, the dataset would only include
- aggregate information. In case of DNA data this was
- achieved by mixing the DNA of many individuals (having
- a disease) into a single solution. Then this mixture
- was sequenced and included in the dataset. The idea
- was that the agregate information would still be helpful
- to researchers, but would protect the DNA data of
- individuals.
+      aggregate information. In the case of DNA data this
+ aggregation was achieved by mixing the DNA of many
+ individuals (having a disease) into a single solution.
+ Then this mixture was sequenced and included in the
+ dataset. The idea was that the aggregate information
+ would still be helpful to researchers, but would protect
+ the DNA data of individuals.
- In 2007 a forensic computer scientist showed that
- individuals can be still identified. For this he used
+ In 2007 a forensic computer scientist showed that
+ individuals can still be identified. For this he used
the DNA data from a comparison group (people from the
general public) and ``subtracted'' this data from the
- published data. He was left with data that included
- all ``special'' DNA-markers of the individuals
- present in the original mixture. He essentially deleted
- the ``background noise''. Now the problem with
- DNA data is that it is of such a high resolution that
- even if the mixture contained maybe 100 individuals,
- you can now detect whether an individual was included
- in the mixture or not.
+ published data. He was left with data that included all
+ ``special'' DNA-markers of the individuals present in
+ the original mixture. He essentially deleted the
+ ``background noise'' in the published data. The
+ problem with DNA data is that it is of such a high
+ resolution that even if the mixture contained maybe 100
+      individuals, you can now detect whether an individual
+      was included in the mixture or not (a toy version of
+      this ``subtraction'' idea is sketched after this list).
This result changed completely how DNA data is nowadays
published for research purposes. After the success of
the human-genome project with a very open culture of
exchanging data, it became much more difficult to
- anonymise datasuch that patient's privacy is preserved.
+      anonymise data so that patients' privacy is preserved.
The public GWAS database was taken offline in 2008.
\end{itemize}
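+
+\noindent To give a feel for the uniqueness point in the
+Massachusetts example, here is a small sketch (in Python, with
+a handful of entirely made-up records, not the actual data): a
+record can be re-identified by linking it with, say, a voter
+roll whenever its combination of birth date, gender and
+ZIP-code occurs only once in the dataset.
+
+\begin{verbatim}
+# count how many records are unique on the three
+# "quasi-identifiers": birth date, gender, ZIP-code
+from collections import Counter
+
+records = [
+    ("1945-07-31", "m", "02138"),
+    ("1945-07-31", "m", "02139"),
+    ("1962-01-12", "f", "02138"),
+    ("1962-01-12", "f", "02138"),  # shared combination -> not unique
+    ("1971-10-02", "m", "02141"),
+]
+
+counts = Counter(records)
+unique = [r for r in records if counts[r] == 1]
+print(len(unique) / len(records))  # fraction uniquely identifiable
+\end{verbatim}
+
+\noindent In this toy dataset 3 out of 5 records are unique on
+these three attributes; the 87\% mentioned above is the
+analogous fraction for the US population.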
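+
+\noindent The ``subtraction'' idea in the GWAS example can also
+be made concrete with a toy numerical sketch. All numbers below
+are invented and the simple score is only a stand-in for the
+actual forensic test; the point is merely that, with tens of
+thousands of DNA-markers, the published mixture frequencies are
+pulled ever so slightly towards the markers of everybody in the
+mixture, and these tiny pulls add up to a detectable signal.
+
+\begin{verbatim}
+import random
+random.seed(1)
+
+num_markers, group_size = 10_000, 100
+
+# invented allele frequencies of a public comparison group
+ref = [random.uniform(0.1, 0.9) for _ in range(num_markers)]
+
+def genotype(freqs):
+    # per marker: fraction of minor alleles, i.e. 0, 0.5 or 1
+    return [(random.random() < f) * 0.5 +
+            (random.random() < f) * 0.5 for f in freqs]
+
+alice = genotype(ref)                        # person we test for
+others = [genotype(ref) for _ in range(group_size - 1)]
+
+# published aggregate: average marker frequency of the mixture
+mixture = [(alice[i] + sum(p[i] for p in others)) / group_size
+           for i in range(num_markers)]
+
+def score(person):
+    # positive sum suggests membership in the mixture
+    return sum(abs(person[i] - ref[i]) - abs(person[i] - mixture[i])
+               for i in range(num_markers))
+
+print(score(alice))          # clearly positive: alice is in the mixture
+print(score(genotype(ref)))  # a random outsider: close to zero
+\end{verbatim}
+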
\noindent There are many lessons that can be learned from
-these examples. One is that when making data public in
-anonymised form you want to achieve \emph{forward privacy}.
-This means, no matter of what other data that is also available
-or will be released later, the data does not compromise
-an individual's privacy. This principle was violated by the
-data in the Netflix and governor of Massachusetts cases. There
-additional data allowed one to re-identify individuals in the
-dataset. In case of GWAS a new technique of re-identification
-compromised the privacy of people on the list.
-The case of the AOL dataset shows clearly how incomplete such
-data can be: Although the queries uniquely identified the
-old lady, she also looked up diseases that her friends had,
-which had nothing to do with her. Any rational analysis of her
-query data must have concluded, the lady is on her deathbed,
-while she was actually very much alive and kicking.
+these examples. One is that when making datasets public in
+anonymised form, you want to achieve \emph{forward privacy}.
+This means that, no matter what other data is also available
+or will be released later, the data in the original dataset
+does not compromise an individual's privacy. This principle
+was violated by the availability of ``outside data'' in the
+Netflix and governor of Massachusetts cases. The additional
+data permitted a re-identification of individuals in the
+dataset. In the case of GWAS a new technique of re-identification
+compromised the privacy of people in the dataset. The case of
+the AOL dataset shows clearly how incomplete such data can be:
+Although the queries uniquely identified the older lady, she
+also looked up diseases that her friends had, which had
+nothing to do with her. Any rational analysis of her query
+data must therefore have concluded that the lady was on her
+deathbed, while she was actually very much alive and kicking.
\subsubsection*{Differential Privacy}
@@ -305,11 +307,14 @@
\end{tabular}
\end{center}
+\ldots
+
+
\subsubsection*{Further Reading}
A readable article about how supermarkets mine your shopping
habits (especially how they prey on young exhausted families
-;o) appeared in 2012 in a New York Times article.
+;o) appeared in 2012 in the New York Times:
\begin{center}
\url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html}
@@ -329,10 +334,10 @@
\url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf}
\end{center}
-\noindent An article that sheds light on the paradox that
+\noindent An article that sheds light on the paradox that
people usually worry about privacy invasions of little
-significance, and overlook that might cause significant
-damage:
+significance, and overlook the privacy invasions that might
+cause significant damage:
\begin{center}
\url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf}