diff -r 280e057558b8 -r 7a6e8f603e08 handouts/ho06.tex --- a/handouts/ho06.tex Sat Sep 23 19:39:53 2017 +0100 +++ b/handouts/ho06.tex Sat Sep 23 19:52:27 2017 +0100 @@ -1,620 +1,559 @@ \documentclass{article} \usepackage{../style} +\usepackage{../graphics} \begin{document} \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015} -\section*{Handout 6 (Zero-Knowledge Proofs)} +%https://www.theguardian.com/technology/2016/oct/04/yahoo-secret-email-program-nsa-fbi +%https://nakedsecurity.sophos.com/2015/11/12/california-collects-owns-and-sells-infants-dna-samples/ +%http://randomwalker.info/teaching/fall-2012-privacy-technologies/? +%https://josephhall.org/papers/NYU-MCC-1303-S2012_privacy_syllabus.pdf +%http://www.jetlaw.org/wp-content/uploads/2014/06/Bambauer_Final.pdf +%http://www.cs.cmu.edu/~yuxiangw/docs/Differential%20Privacy.pdf +%https://www.youtube.com/watch?v=Gx13lgEudtU +%https://fpf.org/wp-content/uploads/Differential-Privacy-as-a-Response-to-the-Reidentification-Threat-Klinefelter-and-Chin.pdf +%http://research.neustar.biz/2014/09/08/differential-privacy-the-basics/ + +%===== +%Tim Greene, Network World, 17 Dec 2015 (via ACM TechNews, 18 Dec 2015) +% +%Massachusetts Institute of Technology (MIT) researchers' experimental +%Vuvuzela messaging system offers more privacy than The Onion Router (Tor) by +%rendering text messages sent through it untraceable. MIT Ph.D. student +%David Lazar says Vuvuzela resists traffic analysis attacks, while Tor +%cannot. The researchers say the system functions no matter how many parties +%are using it to communicate, and it employs encryption and a set of servers +%to conceal whether or not parties are participating in text-based dialogues. +%"Vuvuzela prevents an adversary from learning which pairs of users are +%communicating, as long as just one out of [the] servers is not compromised, +%even for users who continue to use Vuvuzela for years," they note. Vuvuzela +%can support millions of users hosted on commodity servers deployed by a +%single group of users. Instead of anonymizing users, Vuvuzela prevents +%outside observers from differentiating between people sending messages, +%receiving messages, or neither, according to Lazar. The system imposes +%noise on the client-server traffic which cannot be distinguished from actual +%messages, and all communications are triple-wrapped in encryption by three +%servers. "Vuvuzela guarantees privacy as long as one of the servers is +%uncompromised, so using more servers increases security at the cost of +%increased message latency," Lazar notes. +%http://orange.hosting.lsoft.com/trk/click?ref=znwrbbrs9_5-e70bx2d991x066779& -Zero-knowledge proofs (short ZKP) solve a paradoxical puzzle: -How to convince somebody else that one knows a secret, without -revealing what the secret actually is? This sounds like a -problem the Mad Hatter from Alice in Wonderland would occupy -himself with, but actually there some serious and not so -serious applications of it. For example, if you solve -crosswords with your friend, say Bob, you might want to -convince him that you found a solution for one question, but -of course you do not want to reveal the solution, as this -might give Bob an advantage somewhere else in the crossword. +%%%% +%% canvas tracking +%%https://freedom-to-tinker.com/blog/englehardt/the-princeton-web-census-a-1-million-site-measurement-and-analysis-of-web-privacy/ + +%%% +%% cupit re-identification attack +%% https://nakedsecurity.sophos.com/2016/05/20/published-personal-data-on-70000-okcupid-users-taken-down-after-dmca-order/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nakedsecurity+%28Naked+Security+-+Sophos%29 + +%Differential privacy +%===================== +%https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/ + +%Differential privacy, translated from Apple-speak, is the +%statistical science of trying to learn as much as possible +%about a group while learning as little as possible about any +%individual in it. -So how to convince Bob that you know the answer (or a secret)? -One way would be to come up with the following protocol: -Suppose the answer is \textit{folio}. Then look up the -definition of \textit{folio} in a dictionary. Say you find: +%As Roth notes when he refers to a “mathematical proof,” +%differential privacy doesn’t merely try to obfuscate or +%“anonymize” users’ data. That anonymization approach, he +%argues, tends to fail. In 2007, for instance, Netflix released +%a large collection of its viewers’ film ratings as part of a +%competition to optimize its recommendations, removing people’s +%names and other identifying details and publishing only their +%Netflix ratings. But researchers soon cross-referenced the +%Netflix data with public review data on IMDB to match up +%similar patterns of recommendations between the sites and add +%names back into Netflix’s supposedly anonymous database. + +%As an example of that last method, Microsoft’s Dwork points to +%the technique in which a survey asks if the respondent has +%ever, say, broken a law. But first, the survey asks them to +%flip a coin. If the result is tails, they should answer +%honestly. If the result is heads, they’re instructed to flip +%the coin again and then answer “yes” for heads or “no” for +%tails. The resulting random noise can be subtracted from the +%results with a bit of algebra, and every respondent is +%protected from punishment if they admitted to lawbreaking. -\begin{quote} -``an \textit{individual} leaf of paper or parchment, either -loose as one of a series or forming part of a bound volume, -which is numbered on the recto or front side only.'' -\end{quote} +%https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf + +% Windows 10 data send back to Microsoft (Cortana) +%Here’s a non-exhaustive list of data sent back: location data, text +%input, voice input, touch input, webpages you visit, and telemetry +%data regarding your general usage of your computer, including which +%programs you run and for how long. -\noindent -Take the first non-small word\footnote{Let's say the, a, an, -or, and \ldots fall into the category of small words.} in this definition, -in this case \textit{individual}, and look up the definition -of this word, say +% Businesses are already using customised pricing online based on +% information they can glean about you. It is hard to know how +% widespread the practice is; companies keep their pricing strategies +% closely guarded and are wary of the bad PR price discrimination +% could pose. However, it is clear that a number of large retailers +% are experimenting with it. Staples, for example, has offered +% discounted prices based on whether rival stores are within 20 miles +% of its customers’ location. Office Depot has admitted to using its +% customers’ browsing history and location to vary its range of offers +% and products. A 2014 study from Northeastern University found +% evidence of “steering” or differential pricing at four out of 10 +% general merchandise websites and five out of five travel +% websites. (Steering is when a company doesn’t give you a customised +% price, but points you towards more expensive options if it thinks +% you will pay more.) The online travel company Orbitz raised +% headlines in 2012 when it emerged that the firm was pointing Mac +% users towards higher-priced hotel rooms than PC users. + + +%%% government will overwrite your wishes if it is annoymous +%% https://www.lightbluetouchpaper.org/2016/12/05/government-u-turn-on-health-privacy/ + +%% corporate surveilance / privacy - report and CC3C talk +%% http://crackedlabs.org/en/networksofcontrol +%% https://media.ccc.de/v/33c3-8414-corporate_surveillance_digital_tracking_big_data_privacy#video&t=2933 + +\section*{Handout 6 (Privacy)} -\begin{quote} -``a single \textit{human} being as distinct from a group'' -\end{quote} +The first motor car was invented around 1886. For ten years, +until 1896, the law in the UK (and elsewhere) required a +person to walk in front of any moving car waving a red flag. +Cars were such a novelty that most people did not know what to +make of them. The person with the red flag was intended to +warn the public, for example horse owners, about the impending +novelty---a car. In my humble opinion, we are at the same +stage of development with privacy. Nobody really knows what it +is about or what it is good for. All seems very hazy. There +are a few laws (e.g.~cookie law, right-to-be-forgotten law) +which address problems with privacy, but even if they are well +intentioned, they either back-fire or are already obsolete +because of newer technologies. The result is that the world of +``privacy'' looks a little bit like the old Wild +West---lawless and mythical. -\noindent -In this definition take the second non-small word, that -is \textit{human}, and again look up the definition of this -word. This will yield +We would have hoped that after Snowden, Western governments +would be a bit more sensitive and enlightned about the topic +of privacy, but this is far from the truth. Ross Anderson +wrote the following in his blog\footnote{\url{https://www.lightbluetouchpaper.org/2016/02/11/report-on-the-ip-bill/}} about the approach taken in +the US to lessons learned from the Snowden leaks and contrasts +this with the new snooping bill that is considered in the UK +parliament: -\begin{quote} -``relating to or \textit{characteristic} of humankind'' +\begin{quote}\it +``The comparison with the USA is stark. There, all three +branches of government realised they'd gone too far after +Snowden. President Obama set up the NSA review group, and +implemented most of its recommendations by executive order; +the judiciary made changes to the procedures of the FISA +Court; and Congress failed to renew the data retention +provisions in the Patriot Act (aided by the judiciary). Yet +here in Britain the response is just to take Henry VIII powers +to legalise all the illegal things that GCHQ had been up to, +and hope that the European courts won't strike the law down +yet again.'' \end{quote} -\noindent -You could go on looking up the definition of the third -non-small word in this definition and so on. But let us assume -you agreed with Bob to stop after three iterations with the -third non-article word in the last definition, that is -\textit{or}. Now, instead of sending to Bob the solution -\textit{folio}, you send to him \textit{characteristic}. +\noindent Unfortunately, also big organisations besides +governments seem to take an unenlightened approach to privacy. +For example, UCAS, a charity set up to help students with +applying to universities in the UK, has a commercial unit that +happily sells your email addresses to anybody who forks out +enough money for bombarding you with spam. Yes, you can opt +out very often from such ``schemes'', but in case of UCAS any +opt-out will limit also legit emails you might actually be +interested in.\footnote{The main objectionable point, in my +opinion, is that the \emph{charity} everybody has to use for +HE applications has actually very honourable goals +(e.g.~assist applicants in gaining access to universities), +but the small print (or better the link ``About us'') reveals +they set up their organisation so that they can also +shamelessly sell the email addresses they ``harvest''. +Everything is of course very legal\ldots{}ethical?\ldots{}well +that is in the eye of the beholder. See: -How can Bob verify that you know the solution? Well, once he -solved it himself, he can use the dictionary and follow the -same ``trail'' as you did. If the final word agrees with what -you had sent him, he must infer you knew the solution earlier -than him. This protocol works like a one-way hash function -because \textit{characteristic} does not give any hint as to -what was the first word was. I leave you to think why this -protocol avoids small words? +\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities} +or +\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}} -After Bob found his solution and verified that according to -the protocol it ``maps'' also to \textit{characteristic}, can -he be entirely sure no cheating is going on? Not with 100\% -certainty. It could have been possible that he was given -\textit{characteristic} as the word, but it derived from a -different word. This might seem very unlikely, but at least -theoretical it is a possibility. Protocols based on -zero-knowledge proofs will produce a similar result---they -give an answer that might be erroneous in a very small number -of cases. The point is to iterate them long enough so that the -theoretical possibility of cheating is negligibly small. +Another example: Verizon, an ISP who is supposed to provide +you just with connectivity, has found a ``nice'' side-business +too: When you have enabled all privacy guards in your browser +(the few you have at your disposal), Verizon happily adds a +kind of cookie to your +HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}} +As shown in the picture below, this cookie will be sent to +every web-site you visit. The web-sites then can forward the +cookie to advertisers who in turn pay Verizon to tell them +everything they want to know about the person who just made +this request, that is you. + +\begin{center} +\includegraphics[scale=0.16]{../pics/verizon.png} +\end{center} -By the way, the authors of the paper ``Dismantling Megamos -Crypto: Wirelessly Lockpicking a Vehicle Immobilizer'' who -were barred from publishing their results used also a hash to -prove they did the work and (presumably) managed to get into -cars without a key; see Figure~\ref{paper}. This is very -similar to the method above about crosswords: They like to -prove that they did the work, but not giving out the -``solution''. But this also shows what the problem with such a -method is: yes, we can hide the secret temporarily, but if -somebody else wants to verify it, then the secret has to be -made public. Bob needs to know that \textit{folio} is the -solution before he can verify the claim of Alice that she had -the solution first. Similarly with the car-crypto paper: we -needed to wait until September 2015 when the authors were -finally able to publish their findings in order to verify the -hash. Zero-knowledge proofs, in contrast, can be immediately -checked, even if the secret is not public yet and perhaps -never will be. - -\begin{figure} -\begin{center} -\addtolength{\fboxsep}{4mm} -\fbox{\includegraphics[scale=0.4]{../pics/Dismantling_Megamos_Crypto.png}} -\end{center} -\caption{The authors of this paper used a hash in order to prove -later that they have managed to break into cars.\label{paper}} -\end{figure} +\noindent How disgusting! Even worse, Verizon is not known for +being the cheapest ISP on the planet (completely the +contrary), and also not known for providing the fastest +possible speeds, but rather for being among the few ISPs in +the US with a quasi-monopolistic ``market distribution''. -\subsubsection*{ZKP: An Illustrative Example} +Well, we could go on and on\ldots{}and that has not even +started us yet with all the naughty things NSA \& Friends are +up to. Why does privacy actually matter? Nobody, I think, has +a conclusive answer to this question yet. Maybe the following +four notions help with clarifying the overall picture +somewhat: -The idea behind zero-knowledge proofs is not very obvious and -will surely take some time for you to digest. Therefore let us -start with a simple illustrative example. The example will not -be perfect, but hopefully explain the gist of the idea. The -example is called Alibaba's cave, which graphically looks as -follows:\footnote{The example is taken from an -article titled ``How to Explain Zero-Knowledge Protocols -to Your Children'' available from -\url{http://pages.cs.wisc.edu/~mkowalcz/628.pdf}.} +\begin{itemize} +\item \textbf{Secrecy} is the mechanism used to limit the + number of principals with access to information (e.g., + cryptography or access controls). For example I better + keep my password secret, otherwise people from the wrong + side of the law might impersonate me. + +\item \textbf{Confidentiality} is the obligation to protect + the secrets of other people or organisations (secrecy + for the benefit of an organisation). For example as a + staff member at King's I have access to data, even + private data, I am allowed to use in my work but not + allowed to disclose to anyone else. -\begin{center} -\begin{tabular}{c@{\hspace{8mm}}c@{\hspace{8mm}}c} -\includegraphics[scale=0.1]{../pics/alibaba1.png} & -\includegraphics[scale=0.1]{../pics/alibaba2.png} & -\includegraphics[scale=0.1]{../pics/alibaba3.png} \\ -Step 1 & Step 2 & Step 3 -\end{tabular} -\end{center} +\item \textbf{Anonymity} is the ability to leave no evidence of + an activity (e.g., sharing a secret). This is not equal + with privacy---anonymity is required in many + circumstances, for example for whistle-blowers, + voting, exam marking and so on. -\noindent Let us take a closer look at the picture in Step 1: -The cave has a tunnel which forks at some point. Both forks -are connected in a loop. At the deep end of the loop is a -magic wand. The point of the magic wand is that Alice knows -the secret word for how to open it. She wants to keep the word -secret, but wants to convince Bob that she knows it. +\item \textbf{Privacy} is the ability or right to protect your + personal secrets (secrecy for the benefit of an + individual). For example, in a job interview, I might + not like to disclose that I am pregnant, if I were a + woman, or that I am a father. Lest they might not hire + me. Similarly, I might not like to disclose my location + data, because thieves might break into my house if they + know I am away at work. Privacy is essentially + everything which ``shouldn't be anybody's business''. + +\end{itemize} + +\noindent While this might provide us with some rough +definitions, the problem with privacy is that it is an +extremely fine line what should stay private and what should +not. For example, since I am working in academia, I am every +so often very happy to be a digital exhibitionist: I am very +happy to disclose all `trivia' related to my work on my +personal web-page. This is a kind of bragging that is normal +in academia (at least in the field of CS), even expected if +you look for a job. I am even happy that Google maintains a +profile about all my academic papers and their citations. -One way of course would be to let Bob follow her, but then he -would also find out the secret. So this does not work. To find -a solution, let us first fix the rules: At the beginning Alice -and Bob are outside the cave. Alice goes in alone and takes -either tunnel labelled $A$ in the picture, or the other tunnel -labelled $B$. She waits at the magic wand for the instructions -from Bob, who also goes into the gave and observes what -happens at the fork. He has no knowledge which tunnel Alice -took and calls out (in Step 2) that she should emerge from tunnel -$A$, for example. If she knows the secret for opening the -wand, this will not be a problem for Alice. If she was already -in the $A$-segment of the tunnel, then she just comes back. If -she was in the $B$-segment of the tunnel she will say the magic -word and goes through the wand to emerge from $A$ as requested -by Bob. +On the other hand I would be very irritated if anybody I do +not know had a too close look on my private live---it +shouldn't be anybody's business. The reason is that knowledge +about my private life can often be used against me. As mentioned +above, public location data might mean I get robbed. If +supermarkets build a profile of my shopping habits, they will +use it to \emph{their} advantage---surely not to \emph{my} +advantage. Also whatever might be collected about my life will +always be an incomplete, or even misleading, picture. For +example I am pretty sure my creditworthiness score was +temporarily(?) destroyed by not having a regular income in +this country (before coming to King's I worked in Munich for +five years). To correct such incomplete or flawed credit +history data there is, since recently, a law that allows you +to check what information is held about you for determining +your creditworthiness. But this concerns only a very small +part of the data that is held about me/you. Also +what about cases where data is wrong or outdated (but do we +need a right-to be forgotten). + + +To see how private matter can lead really to the wrong +conclusions, take the example of Stephen Hawking: When he was +diagnosed with his disease, he was given a life expectancy of +two years. If employers would know about such problems, would +they have employed Hawking? Now, he is enjoying his 70+ +birthday. Clearly personal medical data needs to stay private. + +To cut a long story short, I let you ponder about the two +statements which are often voiced in discussions about privacy: -Let us have a look at the case where Alice cheats, that is not -knows the secret. She would still go into the cave and use, -for example the $B$-segment of the tunnel. If now Bob says she -should emerge from $B$, she is lucky. But if he says she -should emerge from $A$ then Alice is in trouble: Bob will find -out she does not actually know the secret. So in order to fool -Bob she needs to anticipate his call, and already go into the -corresponding tunnel. This of course also does not work, since -Bob can make any call he wants. Consequently in order to find -out whether Alice cheats, Bob just needs to repeat this -protocol many times. Each time Alice has a chance of -$\frac{1}{2}$ to be lucky or being found out. Iterating this -$n$ times means she must be right every time and when -cheating: the probability for this is $\frac{1}{2}^n$, number -that for already relatively small $n$, say 10, is incredibly -small. +\begin{itemize} +\item \textit{``You have zero privacy anyway. Get over +it.''}\\ +\mbox{}\hfill{}{\small{}(by Scott Mcnealy, former CEO of Sun)} + +\item \textit{``If you have nothing to hide, you have nothing +to fear.''} +\end{itemize} + +\noindent If you like to watch a movie which has this topic as +its main focus I recommend \emph{Gattaca} from +1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}} If +you want to read up on this topic, I can recommend the +following article that appeared in 2011 in the Chronicle of +Higher Education: + +\begin{center} +\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/} +\end{center} + +\noindent Funnily, or maybe not so funnily, the author of this +article carefully tries to construct an argument that does not +only attack the nothing-to-hide statement in cases where +governments \& co collect people's deepest secrets, or +pictures of people's naked bodies, but an argument that +applies also in cases where governments ``only'' collect data +relevant to, say, preventing terrorism. The fun is of course +that in 2011 we could just not imagine that respected +governments would do such infantile things as intercepting +people's nude photos. Well, since Snowden we know some people +at the NSA did exactly that and then shared such photos among +colleagues as ``fringe benefit''. -There are some interesting observations we can make about -Alibaba's cave and the ZKP protocol between Alice and Bob: +\subsubsection*{Re-Identification Attacks} + +Apart from philosophical musings, there are fortunately also +some real technical problems with privacy. The problem I want +to focus on in this handout is how to safely disclose datasets +containing potentially very private data, say health records. +What can go wrong with such disclosures can be illustrated +with four well-known examples: \begin{itemize} -\item {\bf Completeness} If Alice knows the secret, Bob - accepts Alice ``proof'' for sure. There is no error - possible in that Bob thinks Alice cheats when she - actually knows the secret. -\item {\bf Soundness} If Alice does not know the secret, - Bob accepts her ``proof'' with a very small probability. - If, as in the example above, the probability of being - able to hide cheating is $\frac{1}{2}$ in each round - it will be $\frac{1}{2}^n$ after $n$-rounds, which even - for small $n$ say $> 10$ is very small indeed. -\item {\bf Zero-Knowledge} Even if Bob accepts - the proof by Alice, he cannot convince anybody else. -\end{itemize} +\item In 2006, a then young company called Netflix offered a 1 + Mio \$ prize to anybody who could improve their movie + rating algorithm. For this they disclosed a dataset + containing 10\% of all Netflix users at the time + (appr.~500K). They removed names, but included numerical + ratings of movies as well as times when ratings were + uploaded. Though some information was perturbed (i.e., + slightly modified). + + Two researchers had a closer look at this anonymised + data and compared it with public data available from the + International Movie Database (IMDb). They found that + 98\% of the entries could be re-identified in the + Netflix dataset: either by their ratings or by the dates + the ratings were uploaded. The result was a class-action + suit against Netflix, which was only recently resolved + involving a lot of money. -\noindent The last property is the most interesting one. -Assume Alice has convinced Bob that she knows the secret and -Bob filmed the whole protocol with a camera. Can he use the -video to convince anybody else? After a moment of thought, you -will agree that this is not the case. Alice and Bob might have -just made it all up and colluded by Bob telling Alice -beforehand which tunnel he will call. In this way it appears -as if all iterations of the protocol were successful, but they -prove nothing. If another person wants to find out whether -Alice knows the secret, he or she would have to conduct the -protocol again. This is actually the formal definition of a -zero-knowledge proof: an independent observer cannot -distinguish between a real protocol (where Alice knows the -secret) and a fake one (where Bob and Alice colluded). +\item In the 1990ies, medical datasets were often made public + for research purposes. This was done in anonymised form + with names removed, but birth dates, gender and ZIP-code + were retained. In one case where such data about + hospital visits of state employees in Massachusetts was + made public, the then governor assured the public that + the released dataset protected patient privacy by + deleting identifiers. + + A graduate student could not resist cross-referencing + public voter data with the released data that still + included birth dates, gender and ZIP-code. The result + was that she could send the governor his own hospital + record. It turns out that birth dates, gender and + ZIP-code uniquely identify 87\% of people in the US. + This work resulted in a number of laws prescribing which + private data cannot be released in such datasets. + +\item In 2006, AOL published 20 million Web search queries + collected from 650,000 users (names had been deleted). + This was again done for research purposes. However, + within days an old lady, Thelma Arnold, from Lilburn, + Georgia, (11,596 inhabitants) was identified as user + No.~4417749 in this dataset. It turned out that search + engine queries are deep windows into people's private + lives. + +\item Genome-Wide Association Studies (GWAS) was a public + database of gene-frequency studies linked to diseases. + It would essentially record that people who have a + disease, say diabetes, have also certain genes. In order + to maintain privacy, the dataset would only include + aggregate information. In case of DNA data this + aggregation was achieved by mixing the DNA of many + individuals (having a disease) into a single solution. + Then this mixture was sequenced and included in the + dataset. The idea was that the aggregate information + would still be helpful to researchers, but would protect + the DNA data of individuals. + + In 2007 a forensic computer scientist showed that + individuals can still be identified. For this he used + the DNA data from a comparison group (people from the + general public) and ``subtracted'' this data from the + published data. He was left with data that included all + ``special'' DNA-markers of the individuals present in + the original mixture. He essentially deleted the + ``background noise'' in the published data. The problem + with DNA data is that it is of such a high resolution + that even if the mixture contained maybe 100 + individuals, you can with current technology detect + whether an individual was included in the mixture or + not. + + This result changed completely how DNA data is nowadays + published for research purposes. After the success of + the human-genome project with a very open culture of + exchanging data, it became much more difficult to + anonymise data so that patient's privacy is preserved. + The public GWAS database was taken offline in 2008. + +\end{itemize} + +\noindent There are many lessons that can be learned from +these examples. One is that when making datasets public in +anonymised form, you want to achieve \emph{forward privacy}. +This means, no matter what other data that is also available +or will be released later, the data in the original dataset +does not compromise an individual's privacy. This principle +was violated by the availability of ``outside data'' in the +Netflix and governor of Massachusetts cases. The additional +data permitted a re-identification of individuals in the +dataset. In case of GWAS a new technique of re-identification +compromised the privacy of people in the dataset. The case of +the AOL dataset shows clearly how incomplete such data can be: +Although the queries uniquely identified the older lady, she +also looked up diseases that her friends had, which had +nothing to do with her. Any rational analysis of her query +data must therefore have concluded, the lady is on her +death bed, while she was actually very much alive and kicking. + +In 2016, Yahoo released the so far largest machine learning +dataset to the research community. It includes approximately +13.5 TByte of data representing around 100 Billion events from +anonymized user-news items, collected by recording +interactions of about 20M users from February 2015 to May +2015. Yahoo's gracious goal is to promote independent research +in the fields of large-scale machine learning and recommender +systems. It remains to be seen whether this data will really +only be used for that purpose. + +\subsubsection*{Differential Privacy} + +Differential privacy is one of the few methods that tries to +achieve forward privacy. The basic idea is to add appropriate +noise, or errors, to any query of the dataset. The intention +is to make the result of a query insensitive to individual +entries in the database. That means the results are +approximately the same no matter if a particular individual is +in the dataset or not. The hope is that the added error does +not eliminate the ``signal'' one is looking for in the +dataset. + +%\begin{center} +%User\;\;\;\; +%\begin{tabular}{c} +%tell me $f(x)$ $\Rightarrow$\\ +%$\Leftarrow$ $f(x) + \text{noise}$ +%\end{tabular} +%\;\;\;\;\begin{tabular}{@{}c} +%Database\\ +%$x_1, \ldots, x_n$ +%\end{tabular} +%\end{center} +% +%\begin{center} +%\begin{tabular}{l|l} +%Staff & Salary\\\hline +%$PM$ & \pounds{107}\\ +%$PF$ & \pounds{102}\\ +%$LM_1$ & \pounds{101}\\ +%$LF_2$ & \pounds{97}\\ +%$LM_3$ & \pounds{100}\\ +%$LM_4$ & \pounds{99}\\ +%$LF_5$ & \pounds{98} +%\end{tabular} +%\end{center} +% +% +%\begin{center} +%\begin{tikzpicture} +%\begin{axis}[symbolic y coords={salary}, +% ytick=data, +% height=3cm] %\addplot+[jump mark mid] coordinates %{(0,salary) (0.1,salary) % (0.4,salary) (0.5,salary) % (0.8,salary) (0.9,salary)}; %\end{axis} %\end{tikzpicture} +%\end{center} +% +%\begin{tikzpicture}[outline/.style={draw=#1,fill=#1!20}] % \node [outline=red] {red box}; % \node [outline=blue] at (0,-1) {blue box}; %\end{tikzpicture} + +\ldots -\subsubsection*{Using an Graph-Isomorphism Problem for ZKPs} - -Now the question is how can we translate Alibaba's cave into a -computer science solution? It turns out we need an NP problem -for that. The main feature of an NP problem is that it is -computational very hard to generate a solution, but it is very -easy to check whether a given solution indeed solves the -problem at hand.\footnote{The question whether $P = NP$ or not, -we leave to others to speculate about.} - -One NP problem is the \emph{graph isomorphism problem}. It -essentially determines whether the following two graphs, say -$G_1$ and $G_2$, can be moved and stretched so that they look -exactly the same. +\subsubsection*{Further Reading} -\begin{center} -\begin{tabular}{c@{\hspace{8mm}}c@{\hspace{8mm}}c} -$G_1$ & $G_2$\\ -\raisebox{-13mm}{\includegraphics[scale=0.4]{../pics/simple.png}} & -\raisebox{-13mm}{\includegraphics[scale=0.4]{../pics/simple-b.png}}& +Two cool articles about how somebody obtained via the Freedom +of Information Law the taxicab dataset of New York and someone +else showed how easy it is to mine for private information: -\footnotesize -\begin{tabular}{rl} -Graph $G_1$ & Graph $G_2$\\ -a & 1\\ -b & 6\\ -c & 8\\ -d & 3\\ -g & 5\\ -h & 2\\ -i & 4\\ -j & 7\\ -\end{tabular} +\begin{center}\small +\begin{tabular}{p{0.78\textwidth}} +\url{http://chriswhong.com/open-data/foil_nyc_taxi/}\smallskip\\ +\url{http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset} \end{tabular} \end{center} -\noindent The table on the right gives a mapping of the nodes -of the first graph to the nodes of the second. With this -mapping we can check: node $a$ is connected in $G_1$ with $g$, -$h$ and $i$. Node $a$ maps to node $1$ in $G_2$, which is -connected to $2$, $4$ and $5$, which again correspond via the -mapping to $h$, $i$ and $g$ respectively. Let us write -$\sigma$ for such a table and let us write - -\[G_1 = \sigma(G_2)\] - -\noindent for two isomorphic graphs ($\sigma$ being the -isomorphism). It is actually very easy to construct two -isomorphic graphs: Start with an arbitrary graph, re-label the -nodes consistently. Alice will need to do this frequently -for the protocol below. In order to be useful, we therefore -would need to consider substantially larger graphs, like -with thousand nodes. - -Now the secret which Alice tries to hide is the knowledge of -such an isomorphism $\sigma$ between two such graphs. But she -can convince Bob whether she knows such an isomorphism for two -graphs, say $G_1$ and $G_2$, using the same idea as in the -example with Alibaba's cave. For this Alice and Bob must -follow the following protocol: - -\begin{enumerate} -\item Alice generates an isomorphic graph $H$ which she sends - to Bob (in each iteration she needs to generate a - different $H$). -\item - After receiving $H$, Bob asks Alice either for an - isomorphism between $G_1$ and $H$, or $G_2$ and $H$. -\item Alice and Bob repeat this procedure $n$ times. -\end{enumerate} - -\noindent In Step 1 it is important that Alice always -generates a fresh isomorphic graph. I let you think what -would happen if Alice sends out twice the same graph $H$. - -As said before, this is relatively easy to generate by -consistently relabelling nodes. If she started from $G_1$, -Alice will have generated - -\begin{equation} -H = \sigma'(G_1)\label{hiso} -\end{equation} - -\noindent where $\sigma'$ is the isomorphism between $H$ and -$G_1$. But she could equally have started from $G_2$. In the -case where $G_1$ and $G_2$ are isomorphic, if $H$ is -isomorphic with $G_1$, it will also be isomorphic with -$G_2$, and vice versa. - -After generating $H$, Alice sends it to Bob. The important -point is that she needs to ``commit'' to this $H$, therefore -this kind of zero-knowledge protocols are called -\emph{commitment protocols}. Only after receiving $H$, Bob -will make up his mind about which isomorphism he asks -for---whether between $H$ and $G_1$ or $H$ and $G_2$. For this -he could flip a coin, since the choice should be as -unpredictable for Alice as possible. Once Alice receives the -request, she has to produce an isomorphism. If she generated -$H$ as shown in \eqref{hiso} and is asked for an isomorphism -between $H$ and $G_1$, she just sends $\sigma'$. If she had -been asked for an isomorphism between $H$ and $G_2$, she just -has to compose her secret isomorphism $\sigma$ and $\sigma'$. -The main point for the protocol is that even knowing the -isomorphism between $H$ and $G_1$ or $H$ and $G_2$, will not -make the task easier to find the isomorphism between $G_1$ and -$G_2$, which is the secret Alice tries to protect. - -In order to make it crystal clear how this protocol proceeds, -let us give a version using our more formal notation for -protocols: +\noindent +A readable article about how supermarkets mine your shopping +habits (especially how they prey on new exhausted parents +;o) appeared in 2012 in the New York Times: \begin{center} -\begin{tabular}{lrl} -0) & $A \to B:$ & $G_1$ and $G_2$\\ -1a) & $A \to B:$ & $H_1$ \\ -1b) & $B \to A:$ & produce isomorphism $G_1 \leftrightarrow H_1$? - \quad(or $G_2 \leftrightarrow H_1$)\\ -1c) & $A \to B:$ & requested isomorphism\\ -2a) & $A \to B:$ & $H_2$\\ -2b) & $B \to A:$ & produce isomorphism $G_1 \leftrightarrow H_2$? - \quad(or $G_2 \leftrightarrow H_2$)\\ -2c) & $A \to B:$ & requested isomorphism\\ - & \ldots\\ -\end{tabular} +\url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html} \end{center} -\noindent As can be seen the protocol runs for some -agreed number of iterations. The $H_i$ Alice needs to -produce, need to be all distinct. I hope you now know -why? - -It is also crucial that in each iteration, Alice first sends -$H_i$ and then Bob can decide which isomorphism he wants: -either $G_1 \leftrightarrow H_i$ or $G_2 \leftrightarrow H_i$. -If somehow Alice can find out before she committed to $H_i$, -she can cheat. For this assume Alice does \emph{not} know an -isomorphism between $G_1$ and $G_2$. If she knows which -isomorphism Bob will ask for she can craft $H$ in such a way -that it is isomorphism with either $G_1$ or $G_2$ (but it -cannot with both). Then in each case she would send Bob -a correct answer and he would come to the conclusion that -all is well. I let you also answer the question whether -such a protocol run between Alice and Bob would convince -a third person, say Pete. - -Since the interactive nature of the above PKZ protocol and the -correct ordering of the messages is so important for the -``correctness'' of the protocol, it might look surprising that -the same goal can also me achieved in a completely offline -manner. By this I mean Alice can publish all data at once, and -then at a later time, Bob can inspect the data and come to the -conclusion whether or not Alice knows the secret (again -without actually learning about the secret). For this -Alice has to do the following: +\noindent An article that analyses privacy and shopping habits +from a more economic point of view is available from: -\begin{enumerate} -\item Alice generates $n$ isomorphic graphs - $H_{1..n}$ (they need to be all distinct) -\item she feeds the $H_{1..n}$ into a hashing function - (for example encoded as as matrix) -\item she takes the first $n$ bits of the output of the hashing - function: - whenever the output is $0$, she shows an isomorphism - with $G_1$; for $1$ she shows an isomorphism - with $G_2$ -\end{enumerate} +\begin{center} +\url{http://www.dtc.umn.edu/~odlyzko/doc/privacy.economics.pdf} +\end{center} -\noindent The reason why this works and achieves the same -goal as the interactive variant is that Alice has no -control over the hashing function. It would be -computationally just too hard to assemble a set of -$H_{1..n}$ such that she can force where $0$s and $1$s -in the hash values are such that it would pass an external -test. The point is that Alice can publish all this data -on the comfort of her own web-page, for example, and -in this way convince everybody who bothers to check. - -The virtue of the use of graphs and isomorphism for a -zero-knowledge protocol is that the idea why it works -are relatively straightforward. The disadvantage is -that encoding any secret into a graph-isomorphism, while -possible, is awkward. The good news is that in fact -any NP problem can be used as part of a ZKP protocol. - - -\subsubsection*{Using Modular Logarithms for ZKP Protocols} - -While information can be encoded into graph isomorphisms, it -is not the most convenient carrier of information. Clearly it -is much easier to encode information into numbers. Let us look -at zero-knowledge proofs that use numbers as secrets. For this -the underlying NP-problem is to calculate discrete logarithms. -It can be used by choosing public numbers $A$, $B$, $p$, and -private $x$ such that +\noindent An attempt to untangle the web of current technology +for spying on consumers is published in: \begin{center} -$A^x \equiv B\; mod\; p$ -\end{center} - -\noindent holds. The secret Alice tries to keep secret is $x$. -The point of the modular logarithm is that it is very hard -from the public data to calculate $x$ (for large primes). -Now the protocol proceeds in three stages: - -\begin{itemize} -\item {\bf Commitment Stage} - \begin{enumerate} - \item Alice generates $z$ random numbers $r_1, \ldots, r_z$, - all less than $p - 1$. Alice then sends Bob for all - $i = 1,\ldots, z$: - \[ h_i = A^{r_i}\; mod\; p\] - \item Bob generates $z$ random bits, say $b_1,\ldots, b_z$. He can do this - by flipping $z$ times a coin, for example. - \item For each bit $b_i$, Alice sends Bob an $s_i$ where - - \begin{center} - \begin{tabular}{ll} - if $b_i = 0$: & $s_i = r_i$\\ - if $b_i = 1$: & $s_i = (r_i - r_j) \;mod\; (p -1)$\\ - \end{tabular} - \end{center} +\url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf} +\end{center} - where $r_j$ is the lowest $j$ where $b_j = 1$. - - \end{enumerate} - \end{itemize} - -\noindent For understanding the last step, let $z$ be just 4. -We have four random values $r_i$ chosen by Alice and four -random bits $b_i$ chosen subsequently by Bob, for example - - \begin{center} - \begin{tabular}{lcccc} - $r_i$:\; & 4 & 9 & 1 & 3\\ - $b_i$:\; & 0 & 1 & 0 & 1\\ - & & $\uparrow$ \\ - & & $j$ - \end{tabular} - \end{center} - - \noindent The highlighted column is the lowest where $b_i = - 1$ (counted from the left). That means $r_j = 9$. The reason - for letting Alice choose the random numbers $r_1, \ldots, r_z$ - will become clear shortly. Next is the confirmation - phase where Bob essentially checks whether Alice has sent - him ``correct'' $s_i$ and $h_i$. - -\begin{itemize} -\item {\bf Confirmation Stage} - \begin{enumerate} - \item For each $b_i$ Bob checks whether $s_i$ - conform to the protocol - - \begin{center} - \begin{tabular}{ll} - if $b_i = 0$: & $A^{s_i} \stackrel{?}{\equiv} h_i\;mod\;p$\\ - if $b_i = 1$: & $A^{s_i} \stackrel{?}{\equiv} h_i * h_j^{-1} \;mod\; p$\\ - \end{tabular} - \end{center} - \end{enumerate} -\end{itemize} - -\noindent To understand the case for $b_i = 1$, you have -to do the following calculation: +\noindent An article that sheds light on the paradox that +people usually worry about privacy invasions of little +significance, and overlook the privacy invasion that might +cause significant damage: \begin{center} -\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l} -$A^{s_i}$ & $=$ & $A^{r_i - r_j}$\\ - & $=$ & $A^{r_i} * A^{-r_j}$\\ - & $=$ & $h_{r_i} * h_{r_j}^{-1}\;mod\;p$ -\end{tabular} -\end{center} - -\noindent What is interesting that so far nothing has been -sent about $x$, which is the secret Alice has. Also notice -that Bob does not know $r_j$. He received - -\begin{center} -$r_j - r_j$, $r_m - r_j$, \ldots, $r_p - r_j \;mod \;p - 1$ -\end{center} - -\noindent whenever his corresponding bits were $1$. So Bob -does not know $r_j$ and also does not know any $r_i$ where the -bit was $1$. Information about the $x$ is sent in the next -stage (obviously not revealing $x$). - -\begin{itemize} -\item {\bf Proving Stage} - -\begin{enumerate} -\item Alice proves she knows $x$, the discrete log of $B$, -by sending - -\begin{center} -$s_{z+1} = x - r_j\;mod\;p-1$ +\url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf} \end{center} -\item Bob confirms + +Interesting ideas \begin{center} -$A^{s_{z+1}} \stackrel{?}{\equiv} B * h_j^{-1} \;mod \; p$ +\url{https://adnauseam.io} \end{center} -\end{enumerate} -\end{itemize} - -\noindent To understand the last step, you have to do the trick -again that - -\[A^{s_{z+1}} = A^{x-r_j} = \ldots -\] - -\noindent -which I leave to you. - -Now the question is how can Alice cheat? In order to cheat she -has to coordinate what she sends as $h_i$ in step 1 and $s_i$ -in step 3 of the commitment stage, and also what to send as -$s_{z+1}$ in the proving stage. For the latter of course -Alice does not know $x$, so she just chooses some random -number for $s_{z+1}$ and calculates - -\[A^{s_{z+1}}\] \noindent -and then solves the equation - -\[A^{s_{z+1}} \equiv B * y \;mod\;p\] - -\noindent for $y$. This is easy since no logarithm needs to be -computed. If Alice can guess the $j$ where the first 1 will -appear in Bob's bit vector, then she sends the inverse of $y$ -as $h_j$ and 0 as $s_j$. However, notice that when she -calculates a solution for $y$ she does not know $r_j$. For this she -would need to calculate the modular logarithm - -\[y \equiv A^{r_j}\;mod\;p\] - -\noindent which is hard (see step 1 in the commitment stage). - -Having settled on what $h_j$ should be, now what should Alice -send as the other $h_i$ and other $s_i$? If the $b_i$ happens -to be a 1, then the $h_i$ and other $s_i$ need to satisfy the -test - -\[A^{s_i} \stackrel{?}{\equiv} h_i * h_j^{-1} \;mod\; p\] - -\noindent where she has already settled on the value of -$h_j^{-1}$. Lets say she choses $s_i$ at random, then she just -needs to solve - -\[A^{s_i} \equiv z * h_j^{-1} \;mod\; p\] - -\noindent for $z$. Again that is easy, but it does not allow -us to know $r_i$, because then we would again need to solve -a modular logarithm problem. Let us call an $h_i$ which was -solved the easy way as \emph{bogus}. Alice has to produce -bogus $h_i$ for all bits that are going to be $1$ in advance! -This means she has to guess all the bits correctly. (Yes? -I let you think about this.) - -Let us see what happens if she guesses wrongly: Suppose the -bit $b_i = 1$ where she thought she will get a 0. Then she has -already sent an $h_i$ and $h_j$ and now must find an $s_i$ -such that - -\[A^{s_i} \equiv h_i * h_j^{-1} \;mod\; p\] - -\noindent holds. For this remember in calculating $h_i$, she -just chose a random $s_i$. Now she has to send a genuine one. -But this is of course too hard. If she knew the genuine $r_i$ -and $r_j$ for $h_i$ and $h_j$, it would be easy (in this case -$s_i = r_i - r_j$). But she does not. So she will be found -out. If $b_i = 0$, but she thought she will get a 1, then -she has to send a $s_i$ which satisfies - -\[A^{s_i} \equiv h_i\;mod\;p\] - -\noindent Again she does not know $r_i$. So it is a too hard -task and she will be found out again. - -To sum up, in order for Alice to successfully cheat Bob, she -needs to guess \emph{all} bits correctly. She has only a -$\frac{1}{2^z}$ chance of doing this. - -\subsubsection*{Further Reading} - -Make sure you understand what NP problems -are.\footnote{\url{http://en.wikipedia.org/wiki/NP_(complexity)}} -They are the building blocks for zero-knowledge proofs. -Zero-Knowldege proofs are not yet widely used in production -systems, but it is slowly gaining ground. One area of application -where they pop up is crypto currencies (for example Zerocoins -or how to make sure a Bitcoin exchange is solvent without -revealing its assets). - -If you want to brush up on the modular logarithm problem, -the Khan Academy has a nice video: +And a paper that predicts ad-blockers will in the end win over anti-ad-blocking: \begin{center} -\url{https://www.khanacademy.org/video/discrete-logarithm-problem} +\url{http://randomwalker.info/publications/ad-blocking-framework-techniques.pdf} \end{center} + \end{document} -http://blog.cryptographyengineering.com/2014/11/zero-knowledge-proofs-illustrated-primer.html - -http://btravers.weebly.com/uploads/6/7/2/9/6729909/zero_knowledge_technique.pdf -http://zk-ssh.cms.ac/docs/Zero_Knowledge_Prinzipien.pdf -http://www.wisdom.weizmann.ac.il/~oded/PS/zk-tut02v4.ps - -socialist millionares problem -http://en.wikipedia.org/wiki/Socialist_millionaire -http://twistedoakstudios.com/blog/Post3724_explain-it-like-im-five-the-socialist-millionaire-problem-and-secure-multi-party-computation - +http://randomwalker.info/teaching/fall-2012-privacy-technologies/? +http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/ +http://repository.cmu.edu/cgi/viewcontent.cgi?article=1077&context=hcii +https://josephhall.org/papers/NYU-MCC-1303-S2012_privacy_syllabus.pdf +http://www.jetlaw.org/wp-content/uploads/2014/06/Bambauer_Final.pdf +http://www.cs.cmu.edu/~yuxiangw/docs/Differential%20Privacy.pdf +https://www.youtube.com/watch?v=Gx13lgEudtU +https://www.cs.purdue.edu/homes/ctask/pdfs/CERIAS_Presentation.pdf +http://www.futureofprivacy.org/wp-content/uploads/Differential-Privacy-as-a-Response-to-the-Reidentification-Threat-Klinefelter-and-Chin.pdf +http://www.cis.upenn.edu/~aaroth/courses/slides/Overview.pdf +http://www.cl.cam.ac.uk/~sjm217/papers/tor14design.pdf %%% Local Variables: %%% mode: latex