--- a/handouts/ho07.tex Fri Nov 14 01:37:46 2014 +0000
+++ b/handouts/ho07.tex Fri Nov 14 13:38:30 2014 +0000
@@ -13,9 +13,12 @@
public, for example horse owners, about the impending
novelty---a car. In my humble opinion, we are at the same
stage of development with privacy. Nobody really knows what it
-is about or what it is good for. All seems very hazy. The
-result is that the world of ``privacy'' looks a little bit
-like the old Wild West. Anything seems to go.
+is about or what it is good for. It all seems very hazy. There
+are a few laws (cookie law, right-to-be-forgotten) that
+address problems with privacy, but even if they are
+well-intentioned, they either backfire or are already obsolete
+because of newer technologies. The result is that the world of
+``privacy'' looks a little bit like the old Wild West.
For example, UCAS, a charity set up to help students to apply
to universities, has a commercial unit that happily sells your
@@ -37,11 +40,11 @@
or
\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}
-Another example: Verizon, an ISP who provides you with
-connectivity, has found a ``nice'' side-business too: When you
-have enabled all privacy guards in your browser, the few you
-have at your disposal, Verizon happily adds a kind of cookie
-to your
+Another example: Verizon, an ISP that is supposed to provide
+you with just connectivity, has found a ``nice'' side-business
+too: when you have enabled all privacy guards in your browser
+(the few you have at your disposal), Verizon happily adds a
+kind of cookie to your
HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
As shown in the picture below, this cookie will be sent to
every web-site you visit. The web-sites can then forward the
@@ -50,7 +53,7 @@
this request, that is you.
\begin{center}
-\includegraphics[scale=0.21]{../pics/verizon.png}
+\includegraphics[scale=0.19]{../pics/verizon.png}
\end{center}
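+
+\noindent To see what this looks like on the receiving end,
+here is a hypothetical sketch of a web-server that reads the
+injected header (called X-UIDH in Verizon's case, according to
+the analysis cited above). The code is only an illustration of
+the idea, not Verizon's or any web-site's actual code.
+
+{\small\begin{verbatim}
+from http.server import BaseHTTPRequestHandler, HTTPServer
+
+class Handler(BaseHTTPRequestHandler):
+    def do_GET(self):
+        # the tracking ID the ISP adds to each request
+        uid = self.headers.get("X-UIDH")
+        if uid is not None:
+            print("tracking id:", uid)
+        self.send_response(200)
+        self.end_headers()
+
+HTTPServer(("", 8000), Handler).serve_forever()
+\end{verbatim}}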
\noindent How disgusting? Even worse, Verizon is not known for
@@ -62,9 +65,10 @@
Well, we could go on and on\ldots{}and we have not even
started yet with all the naughty things NSA \& Friends are
-up to. Why does privacy matter? Nobody, I think, has a
-conclusive answer to this question yet. Maybe the following four
-notions help with clarifying the overall picture somewhat:
+up to. Why does privacy actually matter? Nobody, I think, has
+a conclusive answer to this question yet. Maybe the following
+four notions help to clarify the overall picture somewhat:
\begin{itemize}
\item \textbf{Secrecy} is the mechanism used to limit the
@@ -127,11 +131,15 @@
creditworthiness. But this concerns only a very small part of
the data that is held about me/you.
-Take the example of Stephen Hawking: when he was diagnosed
-with his disease, he was given a life expectancy of two years.
-If an employer would know about such problems, would they have
-employed Hawking? Now he is enjoying his 70+ birthday.
-Clearly personal medical data needs to stay private.
+To see how private matters can really lead to the wrong
+conclusions, take the example of Stephen Hawking: when he was
+diagnosed with his disease, he was given a life expectancy of
+two years. If employers had known about such problems, would
+they have employed Hawking? Now he is well past his 70th
+birthday. Clearly, personal medical data needs to stay private.
+A movie that has this topic as its main focus is Gattaca from
+1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}}
+
To cut a long story short, I let you ponder the two
statements that are often voiced in discussions about privacy:
@@ -144,8 +152,9 @@
to fear.''}
\end{itemize}
-\noindent An article that attempts a deeper analysis appeared
-in 2011 in the Chronicle of Higher Education
+\noindent If you want to read up on this topic further, I can
+recommend the following article, which appeared in 2011 in the
+Chronicle of Higher Education:
\begin{center}
\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/}
@@ -170,7 +179,7 @@
Apart from philosophical musings, there are fortunately also
some real technical problems with privacy. The problem I want
to focus on in this handout is how to safely disclose datasets
-containing potentially private data, say health data. What can
+containing potentially very private data, say health data. What can
go wrong with such disclosures can be illustrated with four
well-known examples:
@@ -186,27 +195,30 @@
Two researchers had a closer look at this anonymised
data and compared it with public data available from the
- International Movie Database (IMDb). They found that 98
- \% of the entries could be re-identified in the Netflix
- dataset: either by their ratings or by the dates the
- ratings were uploaded. The result was a class-action
+      Internet Movie Database (IMDb). They found that
+ 98\% of the entries could be re-identified in the
+ Netflix dataset: either by their ratings or by the dates
+ the ratings were uploaded. The result was a class-action
suit against Netflix, which was only recently resolved
involving a lot of money.
\item In the 1990s, medical datasets were often made public
for research purposes. This was done in anonymised form
- with names removed, but birth dates, gender, ZIP-code
+ with names removed, but birth dates, gender and ZIP-code
were retained. In one case where such data about
hospital visits of state employees in Massachusetts was
made public, the then governor assured the public that
the released dataset protected patient privacy by
- deleting identifiers. A graduate student could not
- resist cross-referencing public voter data with the
- released data including birth dates, gender and
- ZIP-code. The result was that she could send the
- governor his own hospital record. It turns out that
- birth dates, gender and ZIP-code uniquely identify 87\%
- people in the US.
+      deleting identifiers.
+
+      A graduate student could not resist cross-referencing
+      public voter data with the released data, including
+      birth dates, gender and ZIP-code. The result was that
+      she could send the governor his own hospital record.
+      It turns out that birth dates, gender and ZIP-code
+      uniquely identify 87\% of people in the US. This work
+      resulted in a number of laws regulating which private
+      data may not be released in such datasets. (A small
+      sketch of this kind of linkage attack is shown after
+      this list.)
\item In 2006, AOL published 20 million Web search queries
collected from 650,000 users (names had been deleted).
@@ -217,16 +229,82 @@
engine queries are deep windows into people's private
lives.
-\item Genomic-Wide Association Studies (GWAS) was a public
+\item Genome-Wide Association Studies (GWAS) was a public
database of gene-frequency studies linked to diseases.
-
+      It would essentially record that people who have a
+      disease, say diabetes, also have these genes. In
+      order to maintain privacy, the dataset would only
+      include aggregate information. In the case of DNA
+      data this was achieved by mixing the DNA of many
+      individuals (all having the disease) into a single
+      solution. Then this mixture was sequenced and
+      included in the dataset. The idea was that the
+      aggregate information would still be helpful to
+      researchers, but would protect the DNA data of
+      individuals.
+
+      In 2007 a forensic computer scientist showed that
+      individuals can still be identified. For this he used
+      the DNA data from a comparison group (people from the
+      general public) and ``subtracted'' this data from the
+      published data. He was left with data that included
+      all ``special'' DNA-markers of the individuals
+      present in the original mixture. He essentially
+      deleted the ``background noise''. The problem with
+      DNA data is that it is of such a high resolution that
+      even if the mixture contained maybe 100 individuals,
+      one can detect whether an individual was included in
+      the mixture or not. (A toy simulation of this attack
+      is sketched after this list.)
- you only needed partial DNA information in order to
- identify whether an individual was part of the study —
- DB closed in 2008
+      This result completely changed how DNA data is
+      nowadays published for research purposes. After the
+      success of the human-genome project, with its very
+      open culture of exchanging data, it became much more
+      difficult to anonymise data such that patients'
+      privacy is preserved. The public GWAS database was
+      taken offline in 2008.
\end{itemize}
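+
+\noindent To make the linkage attack in the Massachusetts
+case more concrete, here is a minimal sketch in Python. All
+records, names and diagnoses in it are invented for
+illustration; the point is only that a plain join on the
+quasi-identifiers birth date, gender and ZIP-code
+re-identifies ``anonymised'' records.
+
+{\small\begin{verbatim}
+# hospital data: names removed, "anonymised"
+hospital = [
+    {"birth": "1945-07-31", "sex": "M",
+     "zip": "02138", "diagnosis": "hypertension"},
+]
+
+# public voter roll: names included
+voters = [
+    {"name": "J. Doe", "birth": "1945-07-31",
+     "sex": "M", "zip": "02138"},
+]
+
+# the "attack": join both datasets on the
+# quasi-identifiers (birth date, sex, ZIP)
+for v in voters:
+    for h in hospital:
+        if (v["birth"], v["sex"], v["zip"]) == \
+           (h["birth"], h["sex"], h["zip"]):
+            print(v["name"], "->", h["diagnosis"])
+\end{verbatim}}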
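+
+\noindent The GWAS re-identification can also be illustrated
+with a toy simulation, assuming a much simplified model of
+DNA data (one 0/1 marker per position, invented frequencies).
+For each marker we test whether a target's genotype is closer
+to the published mixture frequencies than to the background
+frequencies; summed over many markers, members of the mixture
+stand out.
+
+{\small\begin{verbatim}
+import numpy as np
+rng = np.random.default_rng(1)
+
+m = 20000                           # DNA markers
+ref = rng.uniform(0.1, 0.9, m)      # background freqs
+study = rng.random((100, m)) < ref  # 100 individuals
+pool = study.mean(axis=0)           # published mixture
+
+def score(g):
+    # positive score: g is closer to the mixture
+    # than to the general population
+    return np.sum(np.abs(g - ref) - np.abs(g - pool))
+
+insider = study[0].astype(float)    # was in the mixture
+outsider = (rng.random(m) < ref).astype(float)
+print(score(insider), score(outsider))
+# the insider scores visibly higher than the outsider
+\end{verbatim}}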
+\noindent There are many lessons that can be learned from
+these examples. One is that when making data public in
+anonymised form you want to achieve \emph{forward privacy}.
+This means that no matter what other data is also available
+or will be released later, the data does not compromise
+an individual's privacy. This principle was violated by the
+data in the Netflix and governor of Massachusetts cases:
+there, additional data allowed one to re-identify individuals
+in the dataset. In the case of GWAS a new technique of
+re-identification compromised the privacy of people on the
+list.
+The case of the AOL dataset clearly shows how incomplete such
+data can be: although the queries uniquely identified the
+old lady, she also looked up diseases that her friends had,
+which had nothing to do with her. Any rational analysis of her
+query data must have concluded that the lady was on her
+deathbed, while she was actually very much alive and kicking.
+
+\subsubsection*{Differential Privacy}
+
+Differential privacy is one of the few methods that try to
+achieve forward privacy with large datasets. The basic idea
+is to add appropriate noise, or errors, to any query of the
+dataset. The intention is to make the result of a query
+insensitive to individual entries in the database. The hope is
+that the added error does not eliminate the ``signal'' one is
+looking for by querying the dataset.
+
+\begin{center}
+User\;\;\;\;
+\begin{tabular}{c}
+tell me $f(x)$ $\Rightarrow$\\
+$\Leftarrow$ $f(x) + \text{noise}$
+\end{tabular}
+\;\;\;\;\begin{tabular}{@{}c}
+Database\\
+$x_1, \ldots, x_n$
+\end{tabular}
+\end{center}
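+
+\noindent A minimal sketch of how such noisy answers can be
+generated is given below. It uses the Laplace mechanism for a
+simple count query; the database entries and the privacy
+parameter $\epsilon$ are made up for illustration. A count
+query changes by at most 1 when one individual is added or
+removed, so noise drawn from a Laplace distribution with
+scale $1/\epsilon$ is enough to mask any single entry.
+
+{\small\begin{verbatim}
+import numpy as np
+
+def private_count(db, pred, epsilon):
+    true_answer = sum(1 for x in db if pred(x))
+    # count queries have sensitivity 1, so Laplace
+    # noise with scale 1/epsilon suffices
+    noise = np.random.laplace(0.0, 1.0 / epsilon)
+    return true_answer + noise
+
+db = [{"age": 34, "diabetes": True},
+      {"age": 51, "diabetes": False},
+      {"age": 29, "diabetes": True}]
+
+print(private_count(db, lambda r: r["diabetes"], 0.5))
+\end{verbatim}}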
+
\end{document}