\documentclass{article}
\usepackage{../style}
\usepackage{../graphics}

\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015}

%https://www.theguardian.com/technology/2016/oct/04/yahoo-secret-email-program-nsa-fbi
%https://nakedsecurity.sophos.com/2015/11/12/california-collects-owns-and-sells-infants-dna-samples/
%http://randomwalker.info/teaching/fall-2012-privacy-technologies/?
%https://josephhall.org/papers/NYU-MCC-1303-S2012_privacy_syllabus.pdf
%http://www.jetlaw.org/wp-content/uploads/2014/06/Bambauer_Final.pdf
%http://www.cs.cmu.edu/~yuxiangw/docs/Differential%20Privacy.pdf
%https://www.youtube.com/watch?v=Gx13lgEudtU
%https://fpf.org/wp-content/uploads/Differential-Privacy-as-a-Response-to-the-Reidentification-Threat-Klinefelter-and-Chin.pdf
%http://research.neustar.biz/2014/09/08/differential-privacy-the-basics/
%
%=====
%Tim Greene, Network World, 17 Dec 2015 (via ACM TechNews, 18 Dec 2015)
%
%Massachusetts Institute of Technology (MIT) researchers' experimental
%Vuvuzela messaging system offers more privacy than The Onion Router (Tor)
%by rendering text messages sent through it untraceable. MIT Ph.D. student
%David Lazar says Vuvuzela resists traffic analysis attacks, while Tor
%cannot. The researchers say the system functions no matter how many
%parties are using it to communicate, and it employs encryption and a set
%of servers to conceal whether or not parties are participating in
%text-based dialogues. "Vuvuzela prevents an adversary from learning which
%pairs of users are communicating, as long as just one out of [the]
%servers is not compromised, even for users who continue to use Vuvuzela
%for years," they note. Vuvuzela can support millions of users hosted on
%commodity servers deployed by a single group of users. Instead of
%anonymizing users, Vuvuzela prevents outside observers from
%differentiating between people sending messages, receiving messages, or
%neither, according to Lazar.
%The system imposes noise on the client-server traffic which cannot be
%distinguished from actual messages, and all communications are
%triple-wrapped in encryption by three servers. "Vuvuzela guarantees
%privacy as long as one of the servers is uncompromised, so using more
%servers increases security at the cost of increased message latency,"
%Lazar notes.
%http://orange.hosting.lsoft.com/trk/click?ref=znwrbbrs9_5-e70bx2d991x066779&
%
%%%%% canvas tracking
%%https://freedom-to-tinker.com/blog/englehardt/the-princeton-web-census-a-1-million-site-measurement-and-analysis-of-web-privacy/
%
%%%% cupid re-identification attack
%% https://nakedsecurity.sophos.com/2016/05/20/published-personal-data-on-70000-okcupid-users-taken-down-after-dmca-order/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nakedsecurity+%28Naked+Security+-+Sophos%29
%
%Differential privacy
%=====================
%https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
%
%Differential privacy, translated from Apple-speak, is the statistical
%science of trying to learn as much as possible about a group while
%learning as little as possible about any individual in it.
%
%As Roth notes when he refers to a "mathematical proof", differential
%privacy doesn't merely try to obfuscate or "anonymize" users' data. That
%anonymization approach, he argues, tends to fail. In 2007, for instance,
%Netflix released a large collection of its viewers' film ratings as part
%of a competition to optimize its recommendations, removing people's
%names and other identifying details and publishing only their Netflix
%ratings. But researchers soon cross-referenced the Netflix data with
%public review data on IMDB to match up similar patterns of
%recommendations between the sites and add names back into Netflix's
%supposedly anonymous database.
%
%As an example of that last method, Microsoft's Dwork points to the
%technique in which a survey asks if the respondent has ever, say, broken
%a law.
%But first, the survey asks them to flip a coin. If the result is tails,
%they should answer honestly. If the result is heads, they're instructed
%to flip the coin again and then answer "yes" for heads or "no" for
%tails. The resulting random noise can be subtracted from the results
%with a bit of algebra, and every respondent is protected from punishment
%if they admitted to lawbreaking.
%
%https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
%
% Windows 10 data sent back to Microsoft (Cortana)
%
%Here's a non-exhaustive list of data sent back: location data, text
%input, voice input, touch input, webpages you visit, and telemetry data
%regarding your general usage of your computer, including which programs
%you run and for how long.
%
% Businesses are already using customised pricing online based on
% information they can glean about you. It is hard to know how widespread
% the practice is; companies keep their pricing strategies closely
% guarded and are wary of the bad PR price discrimination could
% pose. However, it is clear that a number of large retailers are
% experimenting with it. Staples, for example, has offered discounted
% prices based on whether rival stores are within 20 miles of its
% customers' location. Office Depot has admitted to using its customers'
% browsing history and location to vary its range of offers and
% products. A 2014 study from Northeastern University found evidence of
% "steering" or differential pricing at four out of 10 general
% merchandise websites and five out of five travel websites. (Steering is
% when a company doesn't give you a customised price, but points you
% towards more expensive options if it thinks you will pay more.)
% The online travel company Orbitz raised headlines in 2012 when it
% emerged that the firm was pointing Mac users towards higher-priced
% hotel rooms than PC users.

%% government will overwrite your wishes if it is anonymous
%% https://www.lightbluetouchpaper.org/2016/12/05/government-u-turn-on-health-privacy/

%% corporate surveillance / privacy - report and CCC talk
%% http://crackedlabs.org/en/networksofcontrol
%% https://media.ccc.de/v/33c3-8414-corporate_surveillance_digital_tracking_big_data_privacy#video&t=2933

\section*{Handout 6 (Privacy)}

The first motor car was invented around 1886. For ten years, until 1896,
the law in the UK (and elsewhere) required a person to walk in front of
any moving car waving a red flag. Cars were such a novelty that most
people did not know what to make of them. The person with the red flag
was intended to warn the public, for example horse owners, about the
impending novelty---a car. In my humble opinion, we are at the same stage
of development with privacy. Nobody really knows what it is about or what
it is good for. All seems very hazy. There are a few laws (e.g.~cookie
law, right-to-be-forgotten law) which address problems with privacy, but
even if they are well intentioned, they either back-fire or are already
obsolete because of newer technologies. The result is that the world of
``privacy'' looks a little bit like the old Wild West---lawless and
mythical.

We would have hoped that after Snowden, Western governments would be a
bit more sensitive and enlightened about the topic of privacy, but this
is far from the truth. Ross Anderson wrote the following in his
blog\footnote{\url{https://www.lightbluetouchpaper.org/2016/02/11/report-on-the-ip-bill/}}
about the approach taken in the US to lessons learned from the Snowden
leaks and contrasts this with the new snooping bill that is being
considered in the UK parliament:

\begin{quote}\it
``The comparison with the USA is stark. There, all three branches of
government realised they'd gone too far after Snowden.
President Obama set up the NSA review group, and implemented most of its
recommendations by executive order; the judiciary made changes to the
procedures of the FISA Court; and Congress failed to renew the data
retention provisions in the Patriot Act (aided by the judiciary). Yet
here in Britain the response is just to take Henry VIII powers to
legalise all the illegal things that GCHQ had been up to, and hope that
the European courts won't strike the law down yet again.''
\end{quote}

\noindent Unfortunately, big organisations besides governments also seem
to take an unenlightened approach to privacy. For example, UCAS, a
charity set up to help students with applying to universities in the UK,
has a commercial unit that happily sells your email address to anybody
who forks out enough money for bombarding you with spam. Yes, you can
very often opt out from such ``schemes'', but in the case of UCAS any
opt-out will also limit legitimate emails you might actually be
interested in.\footnote{The main objectionable point, in my opinion, is
that the \emph{charity} everybody has to use for HE applications
actually has very honourable goals (e.g.~assisting applicants in gaining
access to universities), but the small print (or rather the link ``About
us'') reveals they set up their organisation so that they can also
shamelessly sell the email addresses they ``harvest''. Everything is of
course very legal\ldots{}ethical?\ldots{}well, that is in the eye of the
beholder.
See:
\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities}
or
\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}

Another example: Verizon, an ISP that is supposed to provide you with
just connectivity, has found a ``nice'' side-business too: even when you
have enabled all privacy guards in your browser (the few you have at
your disposal), Verizon happily adds a kind of cookie to your
HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
As shown in the picture below, this cookie will be sent to every
web-site you visit. The web-sites can then forward the cookie to
advertisers who in turn pay Verizon to tell them everything they want to
know about the person who just made this request, that is you.

\begin{center}
\includegraphics[scale=0.16]{../pics/verizon.png}
\end{center}

\noindent How disgusting! Even worse, Verizon is not known for being the
cheapest ISP on the planet (quite the contrary), and also not known for
providing the fastest possible speeds, but rather for being among the
few ISPs in the US with a quasi-monopolistic ``market distribution''.

Well, we could go on and on\ldots{}and we have not even started yet with
all the naughty things the NSA \& Friends are up to. Why does privacy
actually matter? Nobody, I think, has a conclusive answer to this
question yet. Maybe the following four notions help with clarifying the
overall picture somewhat:

\begin{itemize}
\item \textbf{Secrecy} is the mechanism used to limit the number of
  principals with access to information (e.g., cryptography or access
  controls). For example, I better keep my password secret, otherwise
  people from the wrong side of the law might impersonate me.

\item \textbf{Confidentiality} is the obligation to protect the secrets
  of other people or organisations (secrecy for the benefit of an
  organisation).
For example, as a staff member at King's I have access to data, even
  private data, which I am allowed to use in my work but not allowed to
  disclose to anyone else.

\item \textbf{Anonymity} is the ability to leave no evidence of an
  activity (e.g., sharing a secret). This is not the same as
  privacy---anonymity is required in many circumstances, for example for
  whistle-blowers, voting, exam marking and so on.

\item \textbf{Privacy} is the ability or right to protect your personal
  secrets (secrecy for the benefit of an individual). For example, in a
  job interview, I might not like to disclose that I am pregnant, if I
  were a woman, or that I am a father, lest they might not hire
  me. Similarly, I might not like to disclose my location data, because
  thieves might break into my house if they know I am away at
  work. Privacy is essentially everything which ``shouldn't be anybody's
  business''.
\end{itemize}

\noindent While this might provide us with some rough definitions, the
problem with privacy is that it is an extremely fine line between what
should stay private and what should not. For example, since I am working
in academia, I am every so often very happy to be a digital
exhibitionist: I am very happy to disclose all `trivia' related to my
work on my personal web-page. This is a kind of bragging that is normal
in academia (at least in the field of CS), even expected if you look for
a job. I am even happy that Google maintains a profile about all my
academic papers and their citations.

On the other hand, I would be very irritated if anybody I do not know
took too close a look at my private life---it shouldn't be anybody's
business. The reason is that knowledge about my private life can often
be used against me. As mentioned above, public location data might mean
I get robbed. If supermarkets build a profile of my shopping habits,
they will use it to \emph{their} advantage---surely not to \emph{my}
advantage.
Also, whatever might be collected about my life will always be an
incomplete, or even misleading, picture. For example, I am pretty sure
my creditworthiness score was temporarily(?) destroyed by my not having
had a regular income in this country (before coming to King's I worked
in Munich for five years). To correct such incomplete or flawed credit
history data there is, since recently, a law that allows you to check
what information is held about you for determining your
creditworthiness. But this concerns only a very small part of the data
that is held about me/you. And what about cases where the data is simply
wrong or outdated (do we then need a right-to-be-forgotten)?

To see how private matters can really lead to the wrong conclusions,
take the example of Stephen Hawking: when he was diagnosed with his
disease, he was given a life expectancy of two years. If employers had
known about such problems, would they have employed Hawking? Now, he is
enjoying his 70+ birthday. Clearly, personal medical data needs to stay
private.

To cut a long story short, I let you ponder the two statements which are
often voiced in discussions about privacy:

\begin{itemize}
\item \textit{``You have zero privacy anyway.
Get over it.''}\\
\mbox{}\hfill{}{\small{}(by Scott McNealy, former CEO of Sun)}

\item \textit{``If you have nothing to hide, you have nothing to
  fear.''}
\end{itemize}

\noindent If you like to watch a movie which has this topic as its main
focus, I recommend \emph{Gattaca} from
1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}} If you want
to read up on this topic, I can recommend the following article that
appeared in 2011 in the Chronicle of Higher Education:

\begin{center}
\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/}
\end{center}

\noindent Funnily, or maybe not so funnily, the author of this article
carefully tries to construct an argument that does not only attack the
nothing-to-hide statement in cases where governments \& co collect
people's deepest secrets, or pictures of people's naked bodies, but an
argument that also applies in cases where governments ``only'' collect
data relevant to, say, preventing terrorism. The fun is of course that
in 2011 we could simply not imagine that respected governments would do
such infantile things as intercepting people's nude photos. Well, since
Snowden we know some people at the NSA did exactly that and then shared
such photos among colleagues as a ``fringe benefit''.

\subsubsection*{Re-Identification Attacks}

Apart from philosophical musings, there are fortunately also some real
technical problems with privacy. The problem I want to focus on in this
handout is how to safely disclose datasets containing potentially very
private data, say health records. What can go wrong with such
disclosures can be illustrated with four well-known examples:

\begin{itemize}
\item In 2006, a then young company called Netflix offered a
  \$1~million prize to anybody who could improve their movie rating
  algorithm. For this they disclosed a dataset containing 10\% of all
  Netflix users at the time (approx.~500K). They removed names, but
  included numerical ratings of movies as well as the times when ratings
  were uploaded.
Some information was perturbed, i.e., slightly modified. Two
  researchers had a closer look at this anonymised data and compared it
  with public data available from the International Movie Database
  (IMDb). They found that 98\% of the entries could be re-identified in
  the Netflix dataset: either by their ratings or by the dates the
  ratings were uploaded. The result was a class-action suit against
  Netflix, which was only recently resolved, involving a lot of money.

\item In the 1990s, medical datasets were often made public for
  research purposes. This was done in anonymised form with names
  removed, but birth dates, gender and ZIP-code retained. In one case
  where such data about hospital visits of state employees in
  Massachusetts was made public, the then governor assured the public
  that the released dataset protected patient privacy by deleting
  identifiers. A graduate student could not resist cross-referencing
  public voter data with the released data that still included birth
  dates, gender and ZIP-code. The result was that she could send the
  governor his own hospital record. It turns out that birth date, gender
  and ZIP-code uniquely identify 87\% of people in the US. This work
  resulted in a number of laws prescribing which private data cannot be
  released in such datasets.

\item In 2006, AOL published 20 million web search queries collected
  from 650,000 users (names had been deleted). This was again done for
  research purposes. However, within days an old lady, Thelma Arnold,
  from Lilburn, Georgia (11,596 inhabitants), was identified as user
  No.~4417749 in this dataset. It turned out that search engine queries
  are deep windows into people's private lives.

\item Genome-Wide Association Studies (GWAS) was a public database of
  gene-frequency studies linked to diseases. It would essentially record
  that people who have a disease, say diabetes, also have certain
  genes. In order to maintain privacy, the dataset would only include
  aggregate information.
In the case of DNA data, this aggregation was achieved by mixing the
  DNA of many individuals (having a disease) into a single
  solution. Then this mixture was sequenced and included in the
  dataset. The idea was that the aggregate information would still be
  helpful to researchers, but would protect the DNA data of
  individuals. In 2007 a forensic computer scientist showed that
  individuals can still be identified. For this he used the DNA data
  from a comparison group (people from the general public) and
  ``subtracted'' this data from the published data. He was left with
  data that included all ``special'' DNA-markers of the individuals
  present in the original mixture. He essentially deleted the
  ``background noise'' in the published data. The problem with DNA data
  is that it is of such a high resolution that even if the mixture
  contained maybe 100 individuals, you can with current technology
  detect whether an individual was included in the mixture or not. This
  result completely changed how DNA data is nowadays published for
  research purposes. After the success of the human-genome project, with
  its very open culture of exchanging data, it became much more
  difficult to anonymise data so that patients' privacy is
  preserved. The public GWAS database was taken offline in 2008.
\end{itemize}

\noindent There are many lessons that can be learned from these
examples. One is that when making datasets public in anonymised form,
you want to achieve \emph{forward privacy}. This means that no matter
what other data is also available or will be released later, the data in
the original dataset does not compromise an individual's privacy. This
principle was violated by the availability of ``outside data'' in the
Netflix and governor of Massachusetts cases. The additional data
permitted a re-identification of individuals in the dataset. In the case
of GWAS, a new technique of re-identification compromised the privacy of
people in the dataset.
The case of the AOL dataset shows clearly how incomplete such data can
be: although the queries uniquely identified the old lady, she also
looked up diseases that her friends had, which had nothing to do with
her. Any rational analysis of her query data must therefore have
concluded that the lady was on her death bed, while she was actually
very much alive and kicking.

In 2016, Yahoo released the so far largest machine learning dataset to
the research community. It includes approximately 13.5~TBytes of data
representing around 100 billion events, collected by recording the
interactions (in anonymised form) of about 20M users with news items
from February 2015 to May 2015. Yahoo's gracious goal is to promote
independent research in the fields of large-scale machine learning and
recommender systems. It remains to be seen whether this data will really
only be used for that purpose.

\subsubsection*{Differential Privacy}

Differential privacy is one of the few methods that tries to achieve
forward privacy. The basic idea is to add appropriate noise, or errors,
to any query of the dataset. The intention is to make the result of a
query insensitive to individual entries in the database. That means the
results are approximately the same no matter whether a particular
individual is in the dataset or not.
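For the mathematically inclined, the usual formalisation of this idea
(due to Dwork and others; the notation below is the standard one from
the differential-privacy literature, not specific to this handout) is as
follows: a randomised query mechanism $M$ satisfies
$\varepsilon$-differential privacy if for any two datasets $D$ and $D'$
that differ in at most one entry, and for every set $S$ of possible
answers,

\[
\Pr[M(D) \in S] \;\leq\; e^{\varepsilon}\cdot \Pr[M(D') \in S]
\]

\noindent That is, the distribution of answers barely changes when any
single individual's data is added to, or removed from, the dataset. For
a numerical query $f$, one common way to obtain this guarantee is the
\emph{Laplace mechanism}: instead of $f(D)$, return
$f(D) + \text{Lap}(\Delta f/\varepsilon)$, where $\Delta f$ is the
\emph{sensitivity} of $f$, that is the maximum amount by which $f$ can
change when a single entry of the dataset changes. A smaller
$\varepsilon$ gives stronger privacy, but also noisier answers.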
The hope is that the added error does not eliminate the ``signal'' one
is looking for in the dataset.

%\begin{center}
%User\;\;\;\;
%\begin{tabular}{c}
%tell me $f(x)$ $\Rightarrow$\\
%$\Leftarrow$ $f(x) + \text{noise}$
%\end{tabular}
%\;\;\;\;\begin{tabular}{@{}c}
%Database\\
%$x_1, \ldots, x_n$
%\end{tabular}
%\end{center}
%
%\begin{center}
%\begin{tabular}{l|l}
%Staff & Salary\\\hline
%$PM$ & \pounds{107}\\
%$PF$ & \pounds{102}\\
%$LM_1$ & \pounds{101}\\
%$LF_2$ & \pounds{97}\\
%$LM_3$ & \pounds{100}\\
%$LM_4$ & \pounds{99}\\
%$LF_5$ & \pounds{98}
%\end{tabular}
%\end{center}
%
%\begin{center}
%\begin{tikzpicture}
%\begin{axis}[symbolic y coords={salary},
% ytick=data,
% height=3cm]
%\addplot+[jump mark mid] coordinates
%{(0,salary) (0.1,salary)
% (0.4,salary) (0.5,salary)
% (0.8,salary) (0.9,salary)};
%\end{axis}
%\end{tikzpicture}
%\end{center}
%
%\begin{tikzpicture}[outline/.style={draw=#1,fill=#1!20}]
% \node [outline=red] {red box};
% \node [outline=blue] at (0,-1) {blue box};
%\end{tikzpicture}

\ldots

\subsubsection*{Further Reading}

Two cool articles: one about how somebody obtained the New York taxicab
dataset via the Freedom of Information Law, and one in which someone
else showed how easy it is to mine this dataset for private information:

\begin{center}\small
\begin{tabular}{p{0.78\textwidth}}
\url{http://chriswhong.com/open-data/foil_nyc_taxi/}\smallskip\\
\url{http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset}
\end{tabular}
\end{center}

\noindent A readable article about how supermarkets mine your shopping
habits (especially how they prey on exhausted new parents ;o) appeared
in 2012 in the New York Times:

\begin{center}
\url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html}
\end{center}

\noindent An article that analyses privacy and shopping habits from a
more economic point of view is available from:

\begin{center}
\url{http://www.dtc.umn.edu/~odlyzko/doc/privacy.economics.pdf}
\end{center}

\noindent An attempt to untangle the web of current technology for
spying on
consumers is published in:

\begin{center}
\url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf}
\end{center}

\noindent An article that sheds light on the paradox that people usually
worry about privacy invasions of little significance, and overlook the
privacy invasions that might cause significant damage:

\begin{center}
\url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf}
\end{center}

\noindent An interesting idea for fighting back against tracking is
implemented by:

\begin{center}
\url{https://adnauseam.io}
\end{center}

\noindent And a paper that predicts ad-blockers will in the end win over
anti-ad-blocking:

\begin{center}
\url{http://randomwalker.info/publications/ad-blocking-framework-techniques.pdf}
\end{center}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: