--- a/handouts/ho07.tex Thu Nov 13 18:53:25 2014 +0000
+++ b/handouts/ho07.tex Fri Nov 14 01:17:38 2014 +0000
@@ -15,19 +15,21 @@
stage of development with privacy. Nobody really knows what it
is about or what it is good for. All seems very hazy. The
result is that the world of ``privacy'' looks a little bit
-like the old Wild West. For example, UCAS, a charity set up to
-help students apply to universities, has a commercial unit
-that happily sells your email addresses to anybody who forks
-out enough money in order to bombard you with spam. Yes, you
-can opt out very often, but in case of UCAS any opt-out will
-limit also legit emails you might actually be interested
+like the old Wild West. Anything seems to go.
+
+For example, UCAS, a charity set up to help students to apply
+to universities, has a commercial unit that happily sells your
+email addresses to anybody who forks out enough money in order
+to be able to bombard you with spam. Yes, you can very often
+opt out of such ``schemes'', but in the case of UCAS any opt-out
+will also limit legit emails you might actually be interested
in.\footnote{The main objectionable point, in my opinion, is
that the \emph{charity} everybody has to use for HE
applications has actually very honourable goals (e.g.~assist
applicants in gaining access to universities), but in their
small print (or better under the link ``About us'') reveals
they set up their organisation so that they can also
-shamelessly sell email addresses the ``harvest''. Everything
+shamelessly sell email addresses they ``harvest''. Everything
is of course very legal\ldots{}moral?\ldots{}well that is in
the eye of the beholder. See:
@@ -35,10 +37,11 @@
or
\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}
-Verizon, an ISP who provides you with connectivity, has found
-a ``nice'' side-business too: When you have enabled all
-privacy guards in your browser, the few you have at your
-disposal, Verizon happily adds a kind of cookie to your
+Another example: Verizon, an ISP that provides you with
+connectivity, has found a ``nice'' side-business too: When you
+have enabled all privacy guards in your browser, the few you
+have at your disposal, Verizon happily adds a kind of cookie
+to your
HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
As shown in the picture below, this cookie will be sent to
every web-site you visit. The web-sites then can forward the
@@ -61,7 +64,7 @@
Why does privacy matter? Nobody, I think, has a conclusive
answer to this question. Maybe the following four notions
-clarify the picture somewhat:
+help with clarifying the overall picture somewhat:
\begin{itemize}
\item \textbf{Secrecy} is the mechanism used to limit the
@@ -91,7 +94,7 @@
like to disclose my location data, because thieves might
break into my house if they know I am away at work.
Privacy is essentially everything which `shouldn't be
- anybodies business'.
+ anybody's business'.
\end{itemize}
@@ -99,66 +102,116 @@
definitions, the problem with privacy is that it is an
extremely fine line what should stay private and what should
not. For example, since I am working in academia, I am very
-happy to be essentially a digital exhibitionist: I am happy to
+happy to be a digital exhibitionist: I am very happy to
disclose all `trivia' related to my work on my personal
web-page. This is a kind of bragging that is normal in
-academia (at least in the CS field). I am even happy that
-Google maintains a profile about all of my academic papers and
-their citations.
+academia (at least in the field of CS), even expected if you
+look for a job. I am even happy that Google maintains a
+profile about all my academic papers and their citations.
-On the other hand I would be very peeved if anybody had a too
-close look on my private live---it shouldn't be anybodies
-business. The reason is that knowledge about my private life
-usually is used against me. As mentioned above, public
-location data might mean I get robbed. If supermarkets build a
-profile of my shopping habits, they will use it to
-\emph{their} advantage---surely not to \emph{my} advantage.
-Also whatever might be collected about my life will always be
-an incomplete, or even misleading, picture---I am sure my
-creditworthiness score was temporarily(?) destroyed by not
-having a regular income in this country (before coming to
-King's I worked in Munich). To correct such incomplete or
-flawed data there is, since recently, a law that allows you to
-check what information is held about you for determining your
+On the other hand, I would be very irritated if anybody I do
+not know took too close a look at my private life---it
+shouldn't be anybody's business. The reason is that knowledge
+about my private life usually is used against me. As mentioned
+above, public location data might mean I get robbed. If
+supermarkets build a profile of my shopping habits, they will
+use it to \emph{their} advantage---surely not to \emph{my}
+advantage. Also whatever might be collected about my life will
+always be an incomplete, or even misleading, picture---for
+example I am sure my creditworthiness score was temporarily(?)
+destroyed by not having a regular income in this country
+(before coming to King's I worked in Munich for five years).
+To correct such incomplete or flawed credit-history data there
+is now a law that allows you to check what
+information is held about you for determining your
creditworthiness. But this concerns only a very small part of
the data that is held about me/you.
-This is an endless field. I let you ponder about the two
-statements that are often float about in discussions about
-privacy:
+To cut a long story short, I will let you ponder the two
+statements that are often voiced in discussions about privacy:
\begin{itemize}
\item \textit{``You have zero privacy anyway. Get over it.''}\\
-\mbox{}\hfill{}Scott Mcnealy (CEO of Sun)
+\mbox{}\hfill{}{\small{}by Scott McNealy (CEO of Sun)}
\item \textit{``If you have nothing to hide, you have nothing
to fear.''}
\end{itemize}
-\noindent There are some technical problems that are easier to
-discuss and that often have privacy implications. The problem
-I want to focus on is how to safely disclose datasets. What
-can go wrong with this can be illustrated with three examples:
+\noindent An article that attempts a deeper analysis appeared
+in 2011 in the Chronicle of Higher Education
+
+\begin{center}
+\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/}
+\end{center}
+
+\noindent Funnily, or maybe not so funnily, the author of this
+article carefully tries to construct an argument that does not
+only attack the nothing-to-hide statement in cases where
+governments \& Co collect people's deepest secrets, or
+pictures of people's naked bodies, but an argument that
+applies also in cases where governments ``only'' collect data
+relevant to, say, preventing terrorism. The funny part is, of
+course, that in 2011 we just could not imagine that respected
+governments would do such infantile things as intercepting
+people's nude photos. Well, since Snowden we know some people
+at the NSA did and then shared such photos among colleagues as
+a ``fringe benefit''.
+
+
+\subsubsection*{Re-Identification Attacks}
+
+Apart from philosophical arguments, there are fortunately also
+some real technical problems with privacy implications. The
+problem I want to focus on in this handout is how to safely
+disclose datasets containing potentially private data, say
+health data. What can go wrong with such disclosures can be
+illustrated with four examples:
\begin{itemize}
-\item In 2006 a then young company called Netflix offered a 1
+\item In 2006, a then young company called Netflix offered a 1
Mio \$ prize to anybody who could improve their movie
rating algorithm. For this they disclosed a dataset
- containing 10\% of all Netflix users (appr.~500K). They
- removed names, but included numerical ratings as well as
- times of ratings. Though some information was perturbed
- (i.e., slightly modified).
+ containing 10\% of all Netflix users at the time
+ (appr.~500K). They removed names, but included numerical
+ ratings of movies as well as times of ratings, though
+ some information was perturbed (i.e., slightly
+ modified).
- Two researchers took that data and compared it with
- public data available from the International Movie
- Database (IMDb). They found that 98 \% of the entries
- could be re-identified: either by their ratings or by
- the dates the ratings were uploaded.
+ Two researchers had a closer look at this anonymised
+ data and compared it with public data available from the
+ Internet Movie Database (IMDb). They found that 98\%
+ of the entries could be re-identified in the Netflix
+ dataset: either by their ratings or by the dates the
+ ratings were uploaded. The result was a class-action
+ suit against Netflix, which was only recently resolved
+ and involved a lot of money.
-\item In the 1990, medical databases were routinely made
- publicised for research purposes. This was done in
- anonymised form with names removed, but birth dates,
- gender, ZIP-code were retained.
+\item In the 1990s, medical datasets were often made public for
+ research purposes. This was done in anonymised form with
+ names removed, but birth dates, gender and ZIP code were
+ retained. In one case where such data was made public
+ about state employees in Massachusetts, the then
+ governor assured the public that the released dataset
+ protected patient privacy by deleting identifiers. A
+ graduate student could not resist and cross-referenced
+ public voter data with the birth dates, gender and
+ ZIP codes in the released data. The result was that she could send
+ the governor his own hospital record.
+
+\item In 2006, AOL published 20 million Web search queries
+ collected from 650,000 users (names had been deleted).
+ This was again for research purposes. However, within
+ days an old lady, Thelma Arnold, from Lilburn, Georgia
+ (11,596 inhabitants), was identified as user No.~4417749
+ in this dataset. It turned out that search engine
+ queries are windows into people's private lives.
+
+\item Genome-Wide Association Studies (GWAS) was a public
+ database of gene-frequency studies linked to diseases.
+ It turned out that you only needed partial DNA
+ information in order to identify whether an individual
+ was part of the study. The database was closed in 2008.
\end{itemize}
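
\noindent The cross-referencing in the Massachusetts example can be
illustrated with a small sketch. It is not the method the graduate
student actually used, only a minimal illustration of the idea of
linking two datasets on birth date, gender and ZIP code. All records,
names, field names and the function \texttt{link} are made up for this
illustration.

```python
from collections import defaultdict

# Hypothetical toy data: every record and name below is invented.
# "Anonymised" medical records: names removed, quasi-identifiers retained.
medical_records = [
    {"birth_date": "1950-01-15", "gender": "M", "zip": "02139", "diagnosis": "diagnosis A"},
    {"birth_date": "1962-07-03", "gender": "F", "zip": "02139", "diagnosis": "diagnosis B"},
    {"birth_date": "1962-07-03", "gender": "F", "zip": "02139", "diagnosis": "diagnosis C"},
    {"birth_date": "1975-11-30", "gender": "F", "zip": "02141", "diagnosis": "diagnosis D"},
]

# Public voter list: contains names together with the same three attributes.
voter_list = [
    {"name": "Alice Example", "birth_date": "1975-11-30", "gender": "F", "zip": "02141"},
    {"name": "Bob Example", "birth_date": "1950-01-15", "gender": "M", "zip": "02139"},
    {"name": "Carol Example", "birth_date": "1962-07-03", "gender": "F", "zip": "02139"},
]

QUASI_IDENTIFIERS = ("birth_date", "gender", "zip")


def key(record):
    """Project a record onto its quasi-identifiers."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(medical, voters):
    """Return {name: diagnosis} for every combination of quasi-identifiers
    that occurs exactly once in each of the two datasets."""
    medical_by_key = defaultdict(list)
    for rec in medical:
        medical_by_key[key(rec)].append(rec)

    voters_by_key = defaultdict(list)
    for voter in voters:
        voters_by_key[key(voter)].append(voter)

    reidentified = {}
    for k, matching_voters in voters_by_key.items():
        matching_records = medical_by_key.get(k, [])
        # Only an unambiguous one-to-one match re-identifies a person.
        if len(matching_voters) == 1 and len(matching_records) == 1:
            reidentified[matching_voters[0]["name"]] = matching_records[0]["diagnosis"]
    return reidentified


result = link(medical_records, voter_list)
for name, diagnosis in sorted(result.items()):
    print(name, "->", diagnosis)
```

\noindent In this toy data two of the three voters are matched to
exactly one medical record. The third voter shares her birth date,
gender and ZIP code with two records and therefore stays ambiguous.
The sketch suggests why deleting names alone does not help when the
remaining attributes are nearly unique.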