\documentclass{article}
\usepackage{../style}
\usepackage{../graphics}

\begin{document}

\section*{Handout 7 (Privacy)}

The first motor car was invented around 1886. For ten years, until 1896, the
law in the UK (and elsewhere) required a person to walk in front of any moving
car waving a red flag. Cars were such a novelty that most people did not know
what to make of them. The person with the red flag was intended to warn the
public, for example horse owners, about the impending novelty---a car. In my
humble opinion, we are at the same stage of development with privacy. Nobody
really knows what it is about or what it is good for. All seems very hazy.
There are a few laws (e.g.~the cookie law, the right-to-be-forgotten law)
which address problems with privacy, but even if they are well-intentioned,
they either backfire or are already obsolete because of newer technologies.
The result is that the world of ``privacy'' looks a little bit like the old
Wild West---lawless and mythical.

For example, UCAS, a charity set up to help students with applying to
universities, has a commercial unit that happily sells your email address to
anybody who forks out enough money to bombard you with spam. Yes, you can very
often opt out of such ``schemes'', but in the case of UCAS any opt-out will
also limit legitimate emails you might actually be interested
in.\footnote{The main objectionable point, in my opinion, is that the
\emph{charity} everybody has to use for HE applications actually has very
honourable goals (e.g.~assisting applicants in gaining access to
universities), but the small print (or rather the link ``About us'') reveals
that they set up their organisation so that they can also shamelessly sell the
email addresses they ``harvest''. Everything is of course very
legal\ldots{}moral?\ldots{}well, that is in the eye of the beholder. See:
\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities} or
\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}

Another example: Verizon, an ISP that is supposed to provide you just with
connectivity, has found a ``nice'' side-business too: even when you have
enabled all privacy guards in your browser (the few you have at your
disposal), Verizon happily adds a kind of cookie to your
HTTP requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
As shown in the picture below, this cookie is sent to every website you
visit. The websites can then forward the cookie to advertisers, who in turn
pay Verizon to tell them everything they want to know about the person who
just made the request, that is you.

\begin{center}
\includegraphics[scale=0.17]{../pics/verizon.png}
\end{center}

\noindent How disgusting! Even worse, Verizon is not known for being the
cheapest ISP on the planet (quite the contrary), and also not known for
providing the fastest possible speeds, but rather for being among the few ISPs
in the US with a quasi-monopolistic ``market distribution''.
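\noindent To make this mechanism a bit more concrete, below is a minimal
Python sketch of what such ISP-level header injection amounts to. The header
name \texttt{X-UIDH} comes from the webpolicy.org write-up cited above; the
token derivation and all function and variable names are invented for
illustration and are not Verizon's actual implementation.

\begin{verbatim}
# Sketch of ISP-level header injection (illustrative only).
import hashlib

def isp_inject(request_headers, subscriber_id, secret="isp-secret"):
    """Add a stable, opaque per-subscriber token to an outgoing request."""
    token = hashlib.sha256((secret + subscriber_id).encode()).hexdigest()[:16]
    injected = dict(request_headers)
    injected["X-UIDH"] = token          # the same token on *every* request
    return injected

# The same subscriber visits two unrelated sites ...
req_a = isp_inject({"Host": "news.example"}, subscriber_id="alice")
req_b = isp_inject({"Host": "shop.example"}, subscriber_id="alice")

# ... and both sites (or any advertiser they forward the header to)
# see the identical identifier, even though the browser sent no cookie.
assert req_a["X-UIDH"] == req_b["X-UIDH"]
\end{verbatim}

\noindent The point of the sketch is that the identifier is attached outside
the browser, so clearing cookies or switching on private-browsing mode does
not remove it.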
Well, we could go on and on\ldots{}and we have not even started with all the
naughty things the NSA \& Friends are up to. Why does privacy actually matter?
Nobody, I think, has a conclusive answer to this question yet. Maybe the
following four notions help with clarifying the overall picture somewhat:

\begin{itemize}
\item \textbf{Secrecy} is the mechanism used to limit the number of
  principals with access to information (e.g., cryptography or access
  controls). For example, I had better keep my password secret; otherwise
  people from the wrong side of the law might impersonate me.

\item \textbf{Confidentiality} is the obligation to protect the secrets of
  other people or organisations (secrecy for the benefit of an organisation).
  For example, as a staff member at King's I have access to data, even private
  data, which I am allowed to use in my work but not allowed to disclose to
  anyone else.

\item \textbf{Anonymity} is the ability to leave no evidence of an activity
  (e.g., sharing a secret). This is not the same as privacy---anonymity is
  needed in many circumstances, for example for whistle-blowers, voting, exam
  marking and so on.

\item \textbf{Privacy} is the ability or right to protect your personal
  secrets (secrecy for the benefit of an individual). For example, in a job
  interview I might not want to disclose that I am pregnant (if I were a
  woman) or that I am a father, lest they decide not to hire me. Similarly, I
  might not want to disclose my location data, because thieves might break
  into my house if they know I am away at work. Privacy is essentially
  everything which ``shouldn't be anybody's business''.
\end{itemize}

\noindent While this might provide us with some rough definitions, the problem
with privacy is that there is an extremely fine line between what should stay
private and what should not. For example, since I work in academia, I am every
so often very happy to be a digital exhibitionist: I am very happy to disclose
all `trivia' related to my work on my personal web-page. This is a kind of
bragging that is normal in academia (at least in the field of CS), even
expected if you are looking for a job. I am even happy that Google maintains a
profile of all my academic papers and their citations.

On the other hand, I would be very irritated if anybody I do not know looked
too closely at my private life---it shouldn't be anybody's business. The
reason is that knowledge about my private life is usually used against me. As
mentioned above, public location data might mean I get robbed. If supermarkets
build a profile of my shopping habits, they will use it to \emph{their}
advantage---surely not to \emph{my} advantage. Also, whatever is collected
about my life will always be an incomplete, or even misleading, picture. For
example, I am pretty sure my creditworthiness score was temporarily(?)
destroyed by my not having a regular income in this country (before coming to
King's I worked in Munich for five years). To correct such incomplete or
flawed credit-history data there is now a law that allows you to check what
information is held about you for determining your creditworthiness. But this
concerns only a very small part of the data that is held about me/you.

To see how private matters can really lead to the wrong conclusions, take the
example of Stephen Hawking: when he was diagnosed with his disease, he was
given a life expectancy of two years. If employers had known about such
problems, would they have employed Hawking? Now he is enjoying his 70+
birthday. Clearly, personal medical data needs to stay private.

To cut a long story short, I will let you ponder the two statements which are
often voiced in discussions about privacy:

\begin{itemize}
\item \textit{``You have zero privacy anyway. Get over it.''}\\
\mbox{}\hfill{}{\small{}(by Scott McNealy, former CEO of Sun)}

\item \textit{``If you have nothing to hide, you have nothing to fear.''}
\end{itemize}
\noindent If you would like to watch a movie which has this topic as its main
focus, I recommend \emph{Gattaca} from
1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}} If you want to read
up on this topic, I can recommend the following article that appeared in 2011
in the Chronicle of Higher Education:

\begin{center}
\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/}
\end{center}

\noindent Funnily, or maybe not so funnily, the author of this article
carefully tries to construct an argument that does not only attack the
nothing-to-hide statement in cases where governments \& Co.\ collect people's
deepest secrets, or pictures of people's naked bodies, but one that also
applies in cases where governments ``only'' collect data relevant to, say,
preventing terrorism. The fun is of course that in 2011 we just could not
imagine that respected governments would do such infantile things as
intercepting people's nude photos. Well, since Snowden we know that some
people at the NSA did exactly that and then shared such photos among
colleagues as a ``fringe benefit''.

\subsubsection*{Re-Identification Attacks}

Apart from philosophical musings, there are fortunately also some real
technical problems with privacy. The problem I want to focus on in this
handout is how to safely disclose datasets containing potentially very private
data, say health records. What can go wrong with such disclosures can be
illustrated with four well-known examples:

\begin{itemize}
\item In 2006, a then young company called Netflix offered a \$1 million
  prize to anybody who could improve their movie-rating algorithm. For this
  they disclosed a dataset containing 10\% of all Netflix users at the time
  (approx.~500K). They removed names, but included numerical ratings of movies
  as well as the times when the ratings were uploaded, though some information
  was perturbed (i.e., slightly modified). Two researchers had a closer look
  at this anonymised data and compared it with public data available from the
  Internet Movie Database (IMDb). They found that 98\% of the entries could be
  re-identified in the Netflix dataset: either by their ratings or by the
  dates the ratings were uploaded. The result was a class-action lawsuit
  against Netflix, which was only recently resolved and involved a lot of
  money.

\item In the 1990s, medical datasets were often made public for research
  purposes. This was done in anonymised form, with names removed but birth
  dates, gender and ZIP code retained. In one case, where such data about
  hospital visits of state employees in Massachusetts was made public, the
  then governor assured the public that the released dataset protected patient
  privacy because identifiers had been deleted. A graduate student could not
  resist cross-referencing public voter data with the released data, which
  still included birth dates, gender and ZIP code; a sketch of this kind of
  linkage attack is given after this list. The result was that she could send
  the governor his own hospital record. It turns out that birth date, gender
  and ZIP code uniquely identify 87\% of people in the US. This work resulted
  in a number of laws prescribing which private data cannot be released in
  such datasets.

\item In 2006, AOL published 20 million web search queries collected from
  650,000 users (names had been deleted). This was again done for research
  purposes. However, within days an elderly lady, Thelma Arnold, from Lilburn,
  Georgia (11,596 inhabitants), was identified as user No.~4417749 in this
  dataset. It turned out that search-engine queries are deep windows into
  people's private lives.

\item Genome-Wide Association Studies (GWAS) was a public database of
  gene-frequency studies linked to diseases. It would essentially record that
  people who have a disease, say diabetes, also have certain genes. In order
  to maintain privacy, the dataset would only include aggregate information.
  In the case of DNA data this aggregation was achieved by mixing the DNA of
  many individuals (having a disease) into a single solution. This mixture was
  then sequenced and included in the dataset. The idea was that the aggregate
  information would still be helpful to researchers, but would protect the DNA
  data of individuals. In 2007 a forensic computer scientist showed that
  individuals can still be identified. For this he used the DNA data from a
  comparison group (people from the general public) and ``subtracted'' this
  data from the published data. He was left with data that included all
  ``special'' DNA markers of the individuals present in the original mixture.
  He essentially deleted the ``background noise'' in the published data. The
  problem with DNA data is that it is of such a high resolution that even if
  the mixture contained maybe 100 individuals, you can now detect whether an
  individual was included in the mixture or not. This result completely
  changed how DNA data is nowadays published for research purposes. After the
  success of the human-genome project, with its very open culture of
  exchanging data, it became much more difficult to anonymise data so that
  patients' privacy is preserved. The public GWAS database was taken offline
  in 2008.
\end{itemize}
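\noindent As promised above, here is a minimal Python sketch of the kind of
linkage attack used in the Massachusetts (and, with different join keys, the
Netflix) case: two datasets that each look harmless on their own are joined on
shared quasi-identifiers, here birth date, gender and ZIP code. All records
below are invented for illustration.

\begin{verbatim}
# "Anonymised" medical data: names removed, quasi-identifiers kept.
hospital_records = [
    {"dob": "1945-07-31", "gender": "M", "zip": "02138", "diagnosis": "..."},
    {"dob": "1962-01-12", "gender": "F", "zip": "02139", "diagnosis": "..."},
]

# Public voter roll: names *and* the same quasi-identifiers.
voter_roll = [
    {"name": "W. Weld", "dob": "1945-07-31", "gender": "M", "zip": "02138"},
    {"name": "J. Doe",  "dob": "1962-01-12", "gender": "F", "zip": "02139"},
]

def reidentify(records, roll):
    """Join the two datasets on the quasi-identifiers (dob, gender, zip)."""
    index = {(v["dob"], v["gender"], v["zip"]): v["name"] for v in roll}
    return [(index.get((r["dob"], r["gender"], r["zip"])), r["diagnosis"])
            for r in records]

print(reidentify(hospital_records, voter_roll))   # names re-attached
\end{verbatim}

\noindent For the Netflix dataset the join key would be a handful of movie
ratings and their upload dates instead of birth date, gender and ZIP code, but
the principle is exactly the same.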
\noindent There are many lessons that can be learned from these examples. One
is that when making datasets public in anonymised form, you want to achieve
\emph{forward privacy}. This means that no matter what other data is also
available or will be released later, the data in the original dataset does not
compromise an individual's privacy. This principle was violated by the
availability of ``outside data'' in the Netflix and governor-of-Massachusetts
cases: the additional data permitted the re-identification of individuals in
the dataset. In the case of GWAS, a new technique of re-identification
compromised the privacy of people in the dataset. The case of the AOL dataset
clearly shows how incomplete such data can be: although the queries uniquely
identified the elderly lady, she also looked up diseases that her friends had,
which had nothing to do with her. Any rational analysis of her query data
would therefore have concluded that the lady was on her deathbed, while she
was actually very much alive and kicking.

\subsubsection*{Differential Privacy}

Differential privacy is one of the few methods that tries to achieve forward
privacy. The basic idea is to add appropriate noise, or errors, to any query
of the dataset. The intention is to make the result of a query insensitive to
individual entries in the database; that means the results are approximately
the same no matter whether a particular individual is in the dataset or not.
The hope is that the added error does not eliminate the ``signal'' one is
looking for in the dataset.
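\noindent One standard way of adding such noise is the so-called Laplace
mechanism. The Python sketch below is a toy version for counting queries; the
choice of \texttt{epsilon}, the records and all names are invented for
illustration.

\begin{verbatim}
# Toy sketch of the Laplace mechanism for a counting query.
# A count changes by at most 1 when one person is added or removed,
# so Laplace noise with scale 1/epsilon hides any single individual.
import random

def laplace_noise(scale):
    # The difference of two exponentials is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_count(database, predicate, epsilon=0.5):
    true_count = sum(1 for row in database if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon)

db = [{"name": "A", "diabetes": True},
      {"name": "B", "diabetes": False},
      {"name": "C", "diabetes": True}]

# The answer is roughly the same whether or not any one person is in db.
print(noisy_count(db, lambda r: r["diabetes"]))
\end{verbatim}

\noindent Lowering \texttt{epsilon} adds more noise (more privacy, less
accuracy); raising it does the opposite.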
%\begin{center}
%User\;\;\;\;
%\begin{tabular}{c}
%tell me $f(x)$ $\Rightarrow$\\
%$\Leftarrow$ $f(x) + \text{noise}$
%\end{tabular}
%\;\;\;\;\begin{tabular}{@{}c}
%Database\\
%$x_1, \ldots, x_n$
%\end{tabular}
%\end{center}
%
%\begin{center}
%\begin{tabular}{l|l}
%Staff & Salary\\\hline
%$PM$ & \pounds{107}\\
%$PF$ & \pounds{102}\\
%$LM_1$ & \pounds{101}\\
%$LF_2$ & \pounds{97}\\
%$LM_3$ & \pounds{100}\\
%$LM_4$ & \pounds{99}\\
%$LF_5$ & \pounds{98}
%\end{tabular}
%\end{center}
%
%\begin{center}
%\begin{tikzpicture}
%\begin{axis}[symbolic y coords={salary},
% ytick=data,
% height=3cm]
%\addplot+[jump mark mid] coordinates
%{(0,salary) (0.1,salary)
% (0.4,salary) (0.5,salary)
% (0.8,salary) (0.9,salary)};
%\end{axis}
%\end{tikzpicture}
%\end{center}
%
%\begin{tikzpicture}[outline/.style={draw=#1,fill=#1!20}]
% \node [outline=red] {red box};
% \node [outline=blue] at (0,-1) {blue box};
%\end{tikzpicture}

\ldots

\subsubsection*{Further Reading}

Two cool articles: one about how somebody obtained the New York taxicab
dataset via the Freedom of Information Law, and one in which someone else
showed how easy it is to mine that dataset for private information:

\begin{center}
\begin{tabular}{p{0.8\textwidth}}
\url{http://chriswhong.com/open-data/foil_nyc_taxi/}\\
\url{http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset}
\end{tabular}
\end{center}

\noindent A readable article about how supermarkets mine your shopping habits
(especially how they prey on new, exhausted parents ;o) appeared in 2012 in
the New York Times:

\begin{center}
\url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html}
\end{center}

\noindent An article that analyses privacy and shopping habits from a more
economic point of view is available from:

\begin{center}
\url{http://www.dtc.umn.edu/~odlyzko/doc/privacy.economics.pdf}
\end{center}

\noindent An attempt to untangle the web of current technology for spying on
consumers is published in:

\begin{center}
\url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf}
\end{center}

\noindent An article that sheds light on the paradox that people usually worry
about privacy invasions of little significance, but overlook the privacy
invasions that might cause significant damage:

\begin{center}
\url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf}
\end{center}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: