--- a/handouts/ho06.tex Tue Sep 26 12:03:24 2017 +0100
+++ b/handouts/ho06.tex Tue Sep 26 12:10:41 2017 +0100
@@ -1,620 +1,559 @@
\documentclass{article}
\usepackage{../style}
+\usepackage{../graphics}
\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015}
-\section*{Handout 6 (Zero-Knowledge Proofs)}
+%https://www.theguardian.com/technology/2016/oct/04/yahoo-secret-email-program-nsa-fbi
+%https://nakedsecurity.sophos.com/2015/11/12/california-collects-owns-and-sells-infants-dna-samples/
+%http://randomwalker.info/teaching/fall-2012-privacy-technologies/?
+%https://josephhall.org/papers/NYU-MCC-1303-S2012_privacy_syllabus.pdf
+%http://www.jetlaw.org/wp-content/uploads/2014/06/Bambauer_Final.pdf
+%http://www.cs.cmu.edu/~yuxiangw/docs/Differential%20Privacy.pdf
+%https://www.youtube.com/watch?v=Gx13lgEudtU
+%https://fpf.org/wp-content/uploads/Differential-Privacy-as-a-Response-to-the-Reidentification-Threat-Klinefelter-and-Chin.pdf
+%http://research.neustar.biz/2014/09/08/differential-privacy-the-basics/
+
+%=====
+%Tim Greene, Network World, 17 Dec 2015 (via ACM TechNews, 18 Dec 2015)
+%
+%Massachusetts Institute of Technology (MIT) researchers' experimental
+%Vuvuzela messaging system offers more privacy than The Onion Router (Tor) by
+%rendering text messages sent through it untraceable. MIT Ph.D. student
+%David Lazar says Vuvuzela resists traffic analysis attacks, while Tor
+%cannot. The researchers say the system functions no matter how many parties
+%are using it to communicate, and it employs encryption and a set of servers
+%to conceal whether or not parties are participating in text-based dialogues.
+%"Vuvuzela prevents an adversary from learning which pairs of users are
+%communicating, as long as just one out of [the] servers is not compromised,
+%even for users who continue to use Vuvuzela for years," they note. Vuvuzela
+%can support millions of users hosted on commodity servers deployed by a
+%single group of users. Instead of anonymizing users, Vuvuzela prevents
+%outside observers from differentiating between people sending messages,
+%receiving messages, or neither, according to Lazar. The system imposes
+%noise on the client-server traffic which cannot be distinguished from actual
+%messages, and all communications are triple-wrapped in encryption by three
+%servers. "Vuvuzela guarantees privacy as long as one of the servers is
+%uncompromised, so using more servers increases security at the cost of
+%increased message latency," Lazar notes.
+%http://orange.hosting.lsoft.com/trk/click?ref=znwrbbrs9_5-e70bx2d991x066779&
-Zero-knowledge proofs (short ZKP) solve a paradoxical puzzle:
-How to convince somebody else that one knows a secret, without
-revealing what the secret actually is? This sounds like a
-problem the Mad Hatter from Alice in Wonderland would occupy
-himself with, but actually there some serious and not so
-serious applications of it. For example, if you solve
-crosswords with your friend, say Bob, you might want to
-convince him that you found a solution for one question, but
-of course you do not want to reveal the solution, as this
-might give Bob an advantage somewhere else in the crossword.
+%%%%
+%% canvas tracking
+%%https://freedom-to-tinker.com/blog/englehardt/the-princeton-web-census-a-1-million-site-measurement-and-analysis-of-web-privacy/
+
+%%%
+%% cupit re-identification attack
+%% https://nakedsecurity.sophos.com/2016/05/20/published-personal-data-on-70000-okcupid-users-taken-down-after-dmca-order/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nakedsecurity+%28Naked+Security+-+Sophos%29
+
+%Differential privacy
+%=====================
+%https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
+
+%Differential privacy, translated from Apple-speak, is the
+%statistical science of trying to learn as much as possible
+%about a group while learning as little as possible about any
+%individual in it.
-So how to convince Bob that you know the answer (or a secret)?
-One way would be to come up with the following protocol:
-Suppose the answer is \textit{folio}. Then look up the
-definition of \textit{folio} in a dictionary. Say you find:
+%As Roth notes when he refers to a “mathematical proof,”
+%differential privacy doesn’t merely try to obfuscate or
+%“anonymize” users’ data. That anonymization approach, he
+%argues, tends to fail. In 2007, for instance, Netflix released
+%a large collection of its viewers’ film ratings as part of a
+%competition to optimize its recommendations, removing people’s
+%names and other identifying details and publishing only their
+%Netflix ratings. But researchers soon cross-referenced the
+%Netflix data with public review data on IMDB to match up
+%similar patterns of recommendations between the sites and add
+%names back into Netflix’s supposedly anonymous database.
+
+%As an example of that last method, Microsoft’s Dwork points to
+%the technique in which a survey asks if the respondent has
+%ever, say, broken a law. But first, the survey asks them to
+%flip a coin. If the result is tails, they should answer
+%honestly. If the result is heads, they’re instructed to flip
+%the coin again and then answer “yes” for heads or “no” for
+%tails. The resulting random noise can be subtracted from the
+%results with a bit of algebra, and every respondent is
+%protected from punishment if they admitted to lawbreaking.
-\begin{quote}
-``an \textit{individual} leaf of paper or parchment, either
-loose as one of a series or forming part of a bound volume,
-which is numbered on the recto or front side only.''
-\end{quote}
+%https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
+
+% Windows 10 data send back to Microsoft (Cortana)
+%Here’s a non-exhaustive list of data sent back: location data, text
+%input, voice input, touch input, webpages you visit, and telemetry
+%data regarding your general usage of your computer, including which
+%programs you run and for how long.
-\noindent
-Take the first non-small word\footnote{Let's say the, a, an,
-or, and \ldots fall into the category of small words.} in this definition,
-in this case \textit{individual}, and look up the definition
-of this word, say
+% Businesses are already using customised pricing online based on
+% information they can glean about you. It is hard to know how
+% widespread the practice is; companies keep their pricing strategies
+% closely guarded and are wary of the bad PR price discrimination
+% could pose. However, it is clear that a number of large retailers
+% are experimenting with it. Staples, for example, has offered
+% discounted prices based on whether rival stores are within 20 miles
+% of its customers’ location. Office Depot has admitted to using its
+% customers’ browsing history and location to vary its range of offers
+% and products. A 2014 study from Northeastern University found
+% evidence of “steering” or differential pricing at four out of 10
+% general merchandise websites and five out of five travel
+% websites. (Steering is when a company doesn’t give you a customised
+% price, but points you towards more expensive options if it thinks
+% you will pay more.) The online travel company Orbitz raised
+% headlines in 2012 when it emerged that the firm was pointing Mac
+% users towards higher-priced hotel rooms than PC users.
+
+
+%%% government will overwrite your wishes if it is annoymous
+%% https://www.lightbluetouchpaper.org/2016/12/05/government-u-turn-on-health-privacy/
+
+%% corporate surveilance / privacy - report and CC3C talk
+%% http://crackedlabs.org/en/networksofcontrol
+%% https://media.ccc.de/v/33c3-8414-corporate_surveillance_digital_tracking_big_data_privacy#video&t=2933
+
+\section*{Handout 6 (Privacy)}
-\begin{quote}
-``a single \textit{human} being as distinct from a group''
-\end{quote}
+The first motor car was invented around 1886. For ten years,
+until 1896, the law in the UK (and elsewhere) required a
+person to walk in front of any moving car waving a red flag.
+Cars were such a novelty that most people did not know what to
+make of them. The person with the red flag was intended to
+warn the public, for example horse owners, about the impending
+novelty---a car. In my humble opinion, we are at the same
+stage of development with privacy. Nobody really knows what it
+is about or what it is good for. All seems very hazy. There
+are a few laws (e.g.~cookie law, right-to-be-forgotten law)
+which address problems with privacy, but even if they are well
+intentioned, they either back-fire or are already obsolete
+because of newer technologies. The result is that the world of
+``privacy'' looks a little bit like the old Wild
+West---lawless and mythical.
-\noindent
-In this definition take the second non-small word, that
-is \textit{human}, and again look up the definition of this
-word. This will yield
+We would have hoped that after Snowden, Western governments
+would be a bit more sensitive and enlightned about the topic
+of privacy, but this is far from the truth. Ross Anderson
+wrote the following in his blog\footnote{\url{https://www.lightbluetouchpaper.org/2016/02/11/report-on-the-ip-bill/}} about the approach taken in
+the US to lessons learned from the Snowden leaks and contrasts
+this with the new snooping bill that is considered in the UK
+parliament:
-\begin{quote}
-``relating to or \textit{characteristic} of humankind''
+\begin{quote}\it
+``The comparison with the USA is stark. There, all three
+branches of government realised they'd gone too far after
+Snowden. President Obama set up the NSA review group, and
+implemented most of its recommendations by executive order;
+the judiciary made changes to the procedures of the FISA
+Court; and Congress failed to renew the data retention
+provisions in the Patriot Act (aided by the judiciary). Yet
+here in Britain the response is just to take Henry VIII powers
+to legalise all the illegal things that GCHQ had been up to,
+and hope that the European courts won't strike the law down
+yet again.''
\end{quote}
-\noindent
-You could go on looking up the definition of the third
-non-small word in this definition and so on. But let us assume
-you agreed with Bob to stop after three iterations with the
-third non-article word in the last definition, that is
-\textit{or}. Now, instead of sending to Bob the solution
-\textit{folio}, you send to him \textit{characteristic}.
+\noindent Unfortunately, also big organisations besides
+governments seem to take an unenlightened approach to privacy.
+For example, UCAS, a charity set up to help students with
+applying to universities in the UK, has a commercial unit that
+happily sells your email addresses to anybody who forks out
+enough money for bombarding you with spam. Yes, you can opt
+out very often from such ``schemes'', but in case of UCAS any
+opt-out will limit also legit emails you might actually be
+interested in.\footnote{The main objectionable point, in my
+opinion, is that the \emph{charity} everybody has to use for
+HE applications has actually very honourable goals
+(e.g.~assist applicants in gaining access to universities),
+but the small print (or better the link ``About us'') reveals
+they set up their organisation so that they can also
+shamelessly sell the email addresses they ``harvest''.
+Everything is of course very legal\ldots{}ethical?\ldots{}well
+that is in the eye of the beholder. See:
-How can Bob verify that you know the solution? Well, once he
-solved it himself, he can use the dictionary and follow the
-same ``trail'' as you did. If the final word agrees with what
-you had sent him, he must infer you knew the solution earlier
-than him. This protocol works like a one-way hash function
-because \textit{characteristic} does not give any hint as to
-what was the first word was. I leave you to think why this
-protocol avoids small words?
+\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities}
+or
+\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}
-After Bob found his solution and verified that according to
-the protocol it ``maps'' also to \textit{characteristic}, can
-he be entirely sure no cheating is going on? Not with 100\%
-certainty. It could have been possible that he was given
-\textit{characteristic} as the word, but it derived from a
-different word. This might seem very unlikely, but at least
-theoretical it is a possibility. Protocols based on
-zero-knowledge proofs will produce a similar result---they
-give an answer that might be erroneous in a very small number
-of cases. The point is to iterate them long enough so that the
-theoretical possibility of cheating is negligibly small.
+Another example: Verizon, an ISP who is supposed to provide
+you just with connectivity, has found a ``nice'' side-business
+too: When you have enabled all privacy guards in your browser
+(the few you have at your disposal), Verizon happily adds a
+kind of cookie to your
+HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
+As shown in the picture below, this cookie will be sent to
+every web-site you visit. The web-sites then can forward the
+cookie to advertisers who in turn pay Verizon to tell them
+everything they want to know about the person who just made
+this request, that is you.
+
+\begin{center}
+\includegraphics[scale=0.16]{../pics/verizon.png}
+\end{center}
-By the way, the authors of the paper ``Dismantling Megamos
-Crypto: Wirelessly Lockpicking a Vehicle Immobilizer'' who
-were barred from publishing their results used also a hash to
-prove they did the work and (presumably) managed to get into
-cars without a key; see Figure~\ref{paper}. This is very
-similar to the method above about crosswords: They like to
-prove that they did the work, but not giving out the
-``solution''. But this also shows what the problem with such a
-method is: yes, we can hide the secret temporarily, but if
-somebody else wants to verify it, then the secret has to be
-made public. Bob needs to know that \textit{folio} is the
-solution before he can verify the claim of Alice that she had
-the solution first. Similarly with the car-crypto paper: we
-needed to wait until September 2015 when the authors were
-finally able to publish their findings in order to verify the
-hash. Zero-knowledge proofs, in contrast, can be immediately
-checked, even if the secret is not public yet and perhaps
-never will be.
-
-\begin{figure}
-\begin{center}
-\addtolength{\fboxsep}{4mm}
-\fbox{\includegraphics[scale=0.4]{../pics/Dismantling_Megamos_Crypto.png}}
-\end{center}
-\caption{The authors of this paper used a hash in order to prove
-later that they have managed to break into cars.\label{paper}}
-\end{figure}
+\noindent How disgusting! Even worse, Verizon is not known for
+being the cheapest ISP on the planet (completely the
+contrary), and also not known for providing the fastest
+possible speeds, but rather for being among the few ISPs in
+the US with a quasi-monopolistic ``market distribution''.
-\subsubsection*{ZKP: An Illustrative Example}
+Well, we could go on and on\ldots{}and that has not even
+started us yet with all the naughty things NSA \& Friends are
+up to. Why does privacy actually matter? Nobody, I think, has
+a conclusive answer to this question yet. Maybe the following
+four notions help with clarifying the overall picture
+somewhat:
-The idea behind zero-knowledge proofs is not very obvious and
-will surely take some time for you to digest. Therefore let us
-start with a simple illustrative example. The example will not
-be perfect, but hopefully explain the gist of the idea. The
-example is called Alibaba's cave, which graphically looks as
-follows:\footnote{The example is taken from an
-article titled ``How to Explain Zero-Knowledge Protocols
-to Your Children'' available from
-\url{http://pages.cs.wisc.edu/~mkowalcz/628.pdf}.}
+\begin{itemize}
+\item \textbf{Secrecy} is the mechanism used to limit the
+ number of principals with access to information (e.g.,
+ cryptography or access controls). For example I better
+ keep my password secret, otherwise people from the wrong
+ side of the law might impersonate me.
+
+\item \textbf{Confidentiality} is the obligation to protect
+ the secrets of other people or organisations (secrecy
+ for the benefit of an organisation). For example as a
+ staff member at King's I have access to data, even
+ private data, I am allowed to use in my work but not
+ allowed to disclose to anyone else.
-\begin{center}
-\begin{tabular}{c@{\hspace{8mm}}c@{\hspace{8mm}}c}
-\includegraphics[scale=0.1]{../pics/alibaba1.png} &
-\includegraphics[scale=0.1]{../pics/alibaba2.png} &
-\includegraphics[scale=0.1]{../pics/alibaba3.png} \\
-Step 1 & Step 2 & Step 3
-\end{tabular}
-\end{center}
+\item \textbf{Anonymity} is the ability to leave no evidence of
+ an activity (e.g., sharing a secret). This is not equal
+ with privacy---anonymity is required in many
+ circumstances, for example for whistle-blowers,
+ voting, exam marking and so on.
-\noindent Let us take a closer look at the picture in Step 1:
-The cave has a tunnel which forks at some point. Both forks
-are connected in a loop. At the deep end of the loop is a
-magic wand. The point of the magic wand is that Alice knows
-the secret word for how to open it. She wants to keep the word
-secret, but wants to convince Bob that she knows it.
+\item \textbf{Privacy} is the ability or right to protect your
+ personal secrets (secrecy for the benefit of an
+ individual). For example, in a job interview, I might
+ not like to disclose that I am pregnant, if I were a
+ woman, or that I am a father. Lest they might not hire
+ me. Similarly, I might not like to disclose my location
+ data, because thieves might break into my house if they
+ know I am away at work. Privacy is essentially
+ everything which ``shouldn't be anybody's business''.
+
+\end{itemize}
+
+\noindent While this might provide us with some rough
+definitions, the problem with privacy is that it is an
+extremely fine line what should stay private and what should
+not. For example, since I am working in academia, I am every
+so often very happy to be a digital exhibitionist: I am very
+happy to disclose all `trivia' related to my work on my
+personal web-page. This is a kind of bragging that is normal
+in academia (at least in the field of CS), even expected if
+you look for a job. I am even happy that Google maintains a
+profile about all my academic papers and their citations.
-One way of course would be to let Bob follow her, but then he
-would also find out the secret. So this does not work. To find
-a solution, let us first fix the rules: At the beginning Alice
-and Bob are outside the cave. Alice goes in alone and takes
-either tunnel labelled $A$ in the picture, or the other tunnel
-labelled $B$. She waits at the magic wand for the instructions
-from Bob, who also goes into the gave and observes what
-happens at the fork. He has no knowledge which tunnel Alice
-took and calls out (in Step 2) that she should emerge from tunnel
-$A$, for example. If she knows the secret for opening the
-wand, this will not be a problem for Alice. If she was already
-in the $A$-segment of the tunnel, then she just comes back. If
-she was in the $B$-segment of the tunnel she will say the magic
-word and goes through the wand to emerge from $A$ as requested
-by Bob.
+On the other hand I would be very irritated if anybody I do
+not know had a too close look on my private live---it
+shouldn't be anybody's business. The reason is that knowledge
+about my private life can often be used against me. As mentioned
+above, public location data might mean I get robbed. If
+supermarkets build a profile of my shopping habits, they will
+use it to \emph{their} advantage---surely not to \emph{my}
+advantage. Also whatever might be collected about my life will
+always be an incomplete, or even misleading, picture. For
+example I am pretty sure my creditworthiness score was
+temporarily(?) destroyed by not having a regular income in
+this country (before coming to King's I worked in Munich for
+five years). To correct such incomplete or flawed credit
+history data there is, since recently, a law that allows you
+to check what information is held about you for determining
+your creditworthiness. But this concerns only a very small
+part of the data that is held about me/you. Also
+what about cases where data is wrong or outdated (but do we
+need a right-to be forgotten).
+
+
+To see how private matter can lead really to the wrong
+conclusions, take the example of Stephen Hawking: When he was
+diagnosed with his disease, he was given a life expectancy of
+two years. If employers would know about such problems, would
+they have employed Hawking? Now, he is enjoying his 70+
+birthday. Clearly personal medical data needs to stay private.
+
+To cut a long story short, I let you ponder about the two
+statements which are often voiced in discussions about privacy:
-Let us have a look at the case where Alice cheats, that is not
-knows the secret. She would still go into the cave and use,
-for example the $B$-segment of the tunnel. If now Bob says she
-should emerge from $B$, she is lucky. But if he says she
-should emerge from $A$ then Alice is in trouble: Bob will find
-out she does not actually know the secret. So in order to fool
-Bob she needs to anticipate his call, and already go into the
-corresponding tunnel. This of course also does not work, since
-Bob can make any call he wants. Consequently in order to find
-out whether Alice cheats, Bob just needs to repeat this
-protocol many times. Each time Alice has a chance of
-$\frac{1}{2}$ to be lucky or being found out. Iterating this
-$n$ times means she must be right every time and when
-cheating: the probability for this is $\frac{1}{2}^n$, number
-that for already relatively small $n$, say 10, is incredibly
-small.
+\begin{itemize}
+\item \textit{``You have zero privacy anyway. Get over
+it.''}\\
+\mbox{}\hfill{}{\small{}(by Scott Mcnealy, former CEO of Sun)}
+
+\item \textit{``If you have nothing to hide, you have nothing
+to fear.''}
+\end{itemize}
+
+\noindent If you like to watch a movie which has this topic as
+its main focus I recommend \emph{Gattaca} from
+1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}} If
+you want to read up on this topic, I can recommend the
+following article that appeared in 2011 in the Chronicle of
+Higher Education:
+
+\begin{center}
+\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/}
+\end{center}
+
+\noindent Funnily, or maybe not so funnily, the author of this
+article carefully tries to construct an argument that does not
+only attack the nothing-to-hide statement in cases where
+governments \& co collect people's deepest secrets, or
+pictures of people's naked bodies, but an argument that
+applies also in cases where governments ``only'' collect data
+relevant to, say, preventing terrorism. The fun is of course
+that in 2011 we could just not imagine that respected
+governments would do such infantile things as intercepting
+people's nude photos. Well, since Snowden we know some people
+at the NSA did exactly that and then shared such photos among
+colleagues as ``fringe benefit''.
-There are some interesting observations we can make about
-Alibaba's cave and the ZKP protocol between Alice and Bob:
+\subsubsection*{Re-Identification Attacks}
+
+Apart from philosophical musings, there are fortunately also
+some real technical problems with privacy. The problem I want
+to focus on in this handout is how to safely disclose datasets
+containing potentially very private data, say health records.
+What can go wrong with such disclosures can be illustrated
+with four well-known examples:
\begin{itemize}
-\item {\bf Completeness} If Alice knows the secret, Bob
- accepts Alice ``proof'' for sure. There is no error
- possible in that Bob thinks Alice cheats when she
- actually knows the secret.
-\item {\bf Soundness} If Alice does not know the secret,
- Bob accepts her ``proof'' with a very small probability.
- If, as in the example above, the probability of being
- able to hide cheating is $\frac{1}{2}$ in each round
- it will be $\frac{1}{2}^n$ after $n$-rounds, which even
- for small $n$ say $> 10$ is very small indeed.
-\item {\bf Zero-Knowledge} Even if Bob accepts
- the proof by Alice, he cannot convince anybody else.
-\end{itemize}
+\item In 2006, a then young company called Netflix offered a 1
+ Mio \$ prize to anybody who could improve their movie
+ rating algorithm. For this they disclosed a dataset
+ containing 10\% of all Netflix users at the time
+ (appr.~500K). They removed names, but included numerical
+ ratings of movies as well as times when ratings were
+ uploaded. Though some information was perturbed (i.e.,
+ slightly modified).
+
+ Two researchers had a closer look at this anonymised
+ data and compared it with public data available from the
+ International Movie Database (IMDb). They found that
+ 98\% of the entries could be re-identified in the
+ Netflix dataset: either by their ratings or by the dates
+ the ratings were uploaded. The result was a class-action
+ suit against Netflix, which was only recently resolved
+ involving a lot of money.
-\noindent The last property is the most interesting one.
-Assume Alice has convinced Bob that she knows the secret and
-Bob filmed the whole protocol with a camera. Can he use the
-video to convince anybody else? After a moment of thought, you
-will agree that this is not the case. Alice and Bob might have
-just made it all up and colluded by Bob telling Alice
-beforehand which tunnel he will call. In this way it appears
-as if all iterations of the protocol were successful, but they
-prove nothing. If another person wants to find out whether
-Alice knows the secret, he or she would have to conduct the
-protocol again. This is actually the formal definition of a
-zero-knowledge proof: an independent observer cannot
-distinguish between a real protocol (where Alice knows the
-secret) and a fake one (where Bob and Alice colluded).
+\item In the 1990ies, medical datasets were often made public
+ for research purposes. This was done in anonymised form
+ with names removed, but birth dates, gender and ZIP-code
+ were retained. In one case where such data about
+ hospital visits of state employees in Massachusetts was
+ made public, the then governor assured the public that
+ the released dataset protected patient privacy by
+ deleting identifiers.
+
+ A graduate student could not resist cross-referencing
+ public voter data with the released data that still
+ included birth dates, gender and ZIP-code. The result
+ was that she could send the governor his own hospital
+ record. It turns out that birth dates, gender and
+ ZIP-code uniquely identify 87\% of people in the US.
+ This work resulted in a number of laws prescribing which
+ private data cannot be released in such datasets.
+
+\item In 2006, AOL published 20 million Web search queries
+ collected from 650,000 users (names had been deleted).
+ This was again done for research purposes. However,
+ within days an old lady, Thelma Arnold, from Lilburn,
+ Georgia, (11,596 inhabitants) was identified as user
+ No.~4417749 in this dataset. It turned out that search
+ engine queries are deep windows into people's private
+ lives.
+
+\item Genome-Wide Association Studies (GWAS) was a public
+ database of gene-frequency studies linked to diseases.
+ It would essentially record that people who have a
+ disease, say diabetes, have also certain genes. In order
+ to maintain privacy, the dataset would only include
+ aggregate information. In case of DNA data this
+ aggregation was achieved by mixing the DNA of many
+ individuals (having a disease) into a single solution.
+ Then this mixture was sequenced and included in the
+ dataset. The idea was that the aggregate information
+ would still be helpful to researchers, but would protect
+ the DNA data of individuals.
+
+ In 2007 a forensic computer scientist showed that
+ individuals can still be identified. For this he used
+ the DNA data from a comparison group (people from the
+ general public) and ``subtracted'' this data from the
+ published data. He was left with data that included all
+ ``special'' DNA-markers of the individuals present in
+ the original mixture. He essentially deleted the
+ ``background noise'' in the published data. The problem
+ with DNA data is that it is of such a high resolution
+ that even if the mixture contained maybe 100
+ individuals, you can with current technology detect
+ whether an individual was included in the mixture or
+ not.
+
+ This result changed completely how DNA data is nowadays
+ published for research purposes. After the success of
+ the human-genome project with a very open culture of
+ exchanging data, it became much more difficult to
+ anonymise data so that patient's privacy is preserved.
+ The public GWAS database was taken offline in 2008.
+
+\end{itemize}
+
+\noindent There are many lessons that can be learned from
+these examples. One is that when making datasets public in
+anonymised form, you want to achieve \emph{forward privacy}.
+This means, no matter what other data that is also available
+or will be released later, the data in the original dataset
+does not compromise an individual's privacy. This principle
+was violated by the availability of ``outside data'' in the
+Netflix and governor of Massachusetts cases. The additional
+data permitted a re-identification of individuals in the
+dataset. In case of GWAS a new technique of re-identification
+compromised the privacy of people in the dataset. The case of
+the AOL dataset shows clearly how incomplete such data can be:
+Although the queries uniquely identified the older lady, she
+also looked up diseases that her friends had, which had
+nothing to do with her. Any rational analysis of her query
+data must therefore have concluded, the lady is on her
+death bed, while she was actually very much alive and kicking.
+
+In 2016, Yahoo released the so far largest machine learning
+dataset to the research community. It includes approximately
+13.5 TByte of data representing around 100 Billion events from
+anonymized user-news items, collected by recording
+interactions of about 20M users from February 2015 to May
+2015. Yahoo's gracious goal is to promote independent research
+in the fields of large-scale machine learning and recommender
+systems. It remains to be seen whether this data will really
+only be used for that purpose.
+
+\subsubsection*{Differential Privacy}
+
+Differential privacy is one of the few methods that tries to
+achieve forward privacy. The basic idea is to add appropriate
+noise, or errors, to any query of the dataset. The intention
+is to make the result of a query insensitive to individual
+entries in the database. That means the results are
+approximately the same no matter if a particular individual is
+in the dataset or not. The hope is that the added error does
+not eliminate the ``signal'' one is looking for in the
+dataset.
+
+%\begin{center}
+%User\;\;\;\;
+%\begin{tabular}{c}
+%tell me $f(x)$ $\Rightarrow$\\
+%$\Leftarrow$ $f(x) + \text{noise}$
+%\end{tabular}
+%\;\;\;\;\begin{tabular}{@{}c}
+%Database\\
+%$x_1, \ldots, x_n$
+%\end{tabular}
+%\end{center}
+%
+%\begin{center}
+%\begin{tabular}{l|l}
+%Staff & Salary\\\hline
+%$PM$ & \pounds{107}\\
+%$PF$ & \pounds{102}\\
+%$LM_1$ & \pounds{101}\\
+%$LF_2$ & \pounds{97}\\
+%$LM_3$ & \pounds{100}\\
+%$LM_4$ & \pounds{99}\\
+%$LF_5$ & \pounds{98}
+%\end{tabular}
+%\end{center}
+%
+%
+%\begin{center}
+%\begin{tikzpicture}
+%\begin{axis}[symbolic y coords={salary},
+% ytick=data,
+% height=3cm]
%\addplot+[jump mark mid] coordinates
%{(0,salary) (0.1,salary)
% (0.4,salary) (0.5,salary)
% (0.8,salary) (0.9,salary)};
%\end{axis}
%\end{tikzpicture}
+%\end{center}
+%
+%\begin{tikzpicture}[outline/.style={draw=#1,fill=#1!20}]
% \node [outline=red] {red box};
% \node [outline=blue] at (0,-1) {blue box};
%\end{tikzpicture}
+
+\ldots
-\subsubsection*{Using an Graph-Isomorphism Problem for ZKPs}
-
-Now the question is how can we translate Alibaba's cave into a
-computer science solution? It turns out we need an NP problem
-for that. The main feature of an NP problem is that it is
-computational very hard to generate a solution, but it is very
-easy to check whether a given solution indeed solves the
-problem at hand.\footnote{The question whether $P = NP$ or not,
-we leave to others to speculate about.}
-
-One NP problem is the \emph{graph isomorphism problem}. It
-essentially determines whether the following two graphs, say
-$G_1$ and $G_2$, can be moved and stretched so that they look
-exactly the same.
+\subsubsection*{Further Reading}
-\begin{center}
-\begin{tabular}{c@{\hspace{8mm}}c@{\hspace{8mm}}c}
-$G_1$ & $G_2$\\
-\raisebox{-13mm}{\includegraphics[scale=0.4]{../pics/simple.png}} &
-\raisebox{-13mm}{\includegraphics[scale=0.4]{../pics/simple-b.png}}&
+Two cool articles about how somebody obtained via the Freedom
+of Information Law the taxicab dataset of New York and someone
+else showed how easy it is to mine for private information:
-\footnotesize
-\begin{tabular}{rl}
-Graph $G_1$ & Graph $G_2$\\
-a & 1\\
-b & 6\\
-c & 8\\
-d & 3\\
-g & 5\\
-h & 2\\
-i & 4\\
-j & 7\\
-\end{tabular}
+\begin{center}\small
+\begin{tabular}{p{0.78\textwidth}}
+\url{http://chriswhong.com/open-data/foil_nyc_taxi/}\smallskip\\
+\url{http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset}
\end{tabular}
\end{center}
-\noindent The table on the right gives a mapping of the nodes
-of the first graph to the nodes of the second. With this
-mapping we can check: node $a$ is connected in $G_1$ with $g$,
-$h$ and $i$. Node $a$ maps to node $1$ in $G_2$, which is
-connected to $2$, $4$ and $5$, which again correspond via the
-mapping to $h$, $i$ and $g$ respectively. Let us write
-$\sigma$ for such a table and let us write
-
-\[G_1 = \sigma(G_2)\]
-
-\noindent for two isomorphic graphs ($\sigma$ being the
-isomorphism). It is actually very easy to construct two
-isomorphic graphs: Start with an arbitrary graph, re-label the
-nodes consistently. Alice will need to do this frequently
-for the protocol below. In order to be useful, we therefore
-would need to consider substantially larger graphs, like
-with thousand nodes.
-
-Now the secret which Alice tries to hide is the knowledge of
-such an isomorphism $\sigma$ between two such graphs. But she
-can convince Bob whether she knows such an isomorphism for two
-graphs, say $G_1$ and $G_2$, using the same idea as in the
-example with Alibaba's cave. For this Alice and Bob must
-follow the following protocol:
-
-\begin{enumerate}
-\item Alice generates an isomorphic graph $H$ which she sends
- to Bob (in each iteration she needs to generate a
- different $H$).
-\item
- After receiving $H$, Bob asks Alice either for an
- isomorphism between $G_1$ and $H$, or $G_2$ and $H$.
-\item Alice and Bob repeat this procedure $n$ times.
-\end{enumerate}
-
-\noindent In Step 1 it is important that Alice always
-generates a fresh isomorphic graph. I let you think what
-would happen if Alice sends out twice the same graph $H$.
-
-As said before, this is relatively easy to generate by
-consistently relabelling nodes. If she started from $G_1$,
-Alice will have generated
-
-\begin{equation}
-H = \sigma'(G_1)\label{hiso}
-\end{equation}
-
-\noindent where $\sigma'$ is the isomorphism between $H$ and
-$G_1$. But she could equally have started from $G_2$. In the
-case where $G_1$ and $G_2$ are isomorphic, if $H$ is
-isomorphic with $G_1$, it will also be isomorphic with
-$G_2$, and vice versa.
-
-After generating $H$, Alice sends it to Bob. The important
-point is that she needs to ``commit'' to this $H$, therefore
-this kind of zero-knowledge protocols are called
-\emph{commitment protocols}. Only after receiving $H$, Bob
-will make up his mind about which isomorphism he asks
-for---whether between $H$ and $G_1$ or $H$ and $G_2$. For this
-he could flip a coin, since the choice should be as
-unpredictable for Alice as possible. Once Alice receives the
-request, she has to produce an isomorphism. If she generated
-$H$ as shown in \eqref{hiso} and is asked for an isomorphism
-between $H$ and $G_1$, she just sends $\sigma'$. If she had
-been asked for an isomorphism between $H$ and $G_2$, she just
-has to compose her secret isomorphism $\sigma$ and $\sigma'$.
-The main point for the protocol is that even knowing the
-isomorphism between $H$ and $G_1$ or $H$ and $G_2$, will not
-make the task easier to find the isomorphism between $G_1$ and
-$G_2$, which is the secret Alice tries to protect.
-
-In order to make it crystal clear how this protocol proceeds,
-let us give a version using our more formal notation for
-protocols:
+\noindent
+A readable article about how supermarkets mine your shopping
+habits (especially how they prey on new exhausted parents
+;o) appeared in 2012 in the New York Times:
\begin{center}
-\begin{tabular}{lrl}
-0) & $A \to B:$ & $G_1$ and $G_2$\\
-1a) & $A \to B:$ & $H_1$ \\
-1b) & $B \to A:$ & produce isomorphism $G_1 \leftrightarrow H_1$?
- \quad(or $G_2 \leftrightarrow H_1$)\\
-1c) & $A \to B:$ & requested isomorphism\\
-2a) & $A \to B:$ & $H_2$\\
-2b) & $B \to A:$ & produce isomorphism $G_1 \leftrightarrow H_2$?
- \quad(or $G_2 \leftrightarrow H_2$)\\
-2c) & $A \to B:$ & requested isomorphism\\
- & \ldots\\
-\end{tabular}
+\url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html}
\end{center}
-\noindent As can be seen the protocol runs for some
-agreed number of iterations. The $H_i$ Alice needs to
-produce, need to be all distinct. I hope you now know
-why?
-
-It is also crucial that in each iteration, Alice first sends
-$H_i$ and then Bob can decide which isomorphism he wants:
-either $G_1 \leftrightarrow H_i$ or $G_2 \leftrightarrow H_i$.
-If somehow Alice can find out before she committed to $H_i$,
-she can cheat. For this assume Alice does \emph{not} know an
-isomorphism between $G_1$ and $G_2$. If she knows which
-isomorphism Bob will ask for she can craft $H$ in such a way
-that it is isomorphism with either $G_1$ or $G_2$ (but it
-cannot with both). Then in each case she would send Bob
-a correct answer and he would come to the conclusion that
-all is well. I let you also answer the question whether
-such a protocol run between Alice and Bob would convince
-a third person, say Pete.
-
-Since the interactive nature of the above PKZ protocol and the
-correct ordering of the messages is so important for the
-``correctness'' of the protocol, it might look surprising that
-the same goal can also me achieved in a completely offline
-manner. By this I mean Alice can publish all data at once, and
-then at a later time, Bob can inspect the data and come to the
-conclusion whether or not Alice knows the secret (again
-without actually learning about the secret). For this
-Alice has to do the following:
+\noindent An article that analyses privacy and shopping habits
+from a more economic point of view is available from:
-\begin{enumerate}
-\item Alice generates $n$ isomorphic graphs
- $H_{1..n}$ (they need to be all distinct)
-\item she feeds the $H_{1..n}$ into a hashing function
- (for example encoded as as matrix)
-\item she takes the first $n$ bits of the output of the hashing
- function:
- whenever the output is $0$, she shows an isomorphism
- with $G_1$; for $1$ she shows an isomorphism
- with $G_2$
-\end{enumerate}
+\begin{center}
+\url{http://www.dtc.umn.edu/~odlyzko/doc/privacy.economics.pdf}
+\end{center}
-\noindent The reason why this works and achieves the same
-goal as the interactive variant is that Alice has no
-control over the hashing function. It would be
-computationally just too hard to assemble a set of
-$H_{1..n}$ such that she can force where $0$s and $1$s
-in the hash values are such that it would pass an external
-test. The point is that Alice can publish all this data
-on the comfort of her own web-page, for example, and
-in this way convince everybody who bothers to check.
-
-The virtue of the use of graphs and isomorphism for a
-zero-knowledge protocol is that the idea why it works
-are relatively straightforward. The disadvantage is
-that encoding any secret into a graph-isomorphism, while
-possible, is awkward. The good news is that in fact
-any NP problem can be used as part of a ZKP protocol.
-
-
-\subsubsection*{Using Modular Logarithms for ZKP Protocols}
-
-While information can be encoded into graph isomorphisms, it
-is not the most convenient carrier of information. Clearly it
-is much easier to encode information into numbers. Let us look
-at zero-knowledge proofs that use numbers as secrets. For this
-the underlying NP-problem is to calculate discrete logarithms.
-It can be used by choosing public numbers $A$, $B$, $p$, and
-private $x$ such that
+\noindent An attempt to untangle the web of current technology
+for spying on consumers is published in:
\begin{center}
-$A^x \equiv B\; mod\; p$
-\end{center}
-
-\noindent holds. The secret Alice tries to keep secret is $x$.
-The point of the modular logarithm is that it is very hard
-from the public data to calculate $x$ (for large primes).
-Now the protocol proceeds in three stages:
-
-\begin{itemize}
-\item {\bf Commitment Stage}
- \begin{enumerate}
- \item Alice generates $z$ random numbers $r_1, \ldots, r_z$,
- all less than $p - 1$. Alice then sends Bob for all
- $i = 1,\ldots, z$:
- \[ h_i = A^{r_i}\; mod\; p\]
- \item Bob generates $z$ random bits, say $b_1,\ldots, b_z$. He can do this
- by flipping $z$ times a coin, for example.
- \item For each bit $b_i$, Alice sends Bob an $s_i$ where
-
- \begin{center}
- \begin{tabular}{ll}
- if $b_i = 0$: & $s_i = r_i$\\
- if $b_i = 1$: & $s_i = (r_i - r_j) \;mod\; (p -1)$\\
- \end{tabular}
- \end{center}
+\url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf}
+\end{center}
- where $r_j$ is the lowest $j$ where $b_j = 1$.
-
- \end{enumerate}
- \end{itemize}
-
-\noindent For understanding the last step, let $z$ be just 4.
-We have four random values $r_i$ chosen by Alice and four
-random bits $b_i$ chosen subsequently by Bob, for example
-
- \begin{center}
- \begin{tabular}{lcccc}
- $r_i$:\; & 4 & 9 & 1 & 3\\
- $b_i$:\; & 0 & 1 & 0 & 1\\
- & & $\uparrow$ \\
- & & $j$
- \end{tabular}
- \end{center}
-
- \noindent The highlighted column is the lowest where $b_i =
- 1$ (counted from the left). That means $r_j = 9$. The reason
- for letting Alice choose the random numbers $r_1, \ldots, r_z$
- will become clear shortly. Next is the confirmation
- phase where Bob essentially checks whether Alice has sent
- him ``correct'' $s_i$ and $h_i$.
-
-\begin{itemize}
-\item {\bf Confirmation Stage}
- \begin{enumerate}
- \item For each $b_i$ Bob checks whether $s_i$
- conform to the protocol
-
- \begin{center}
- \begin{tabular}{ll}
- if $b_i = 0$: & $A^{s_i} \stackrel{?}{\equiv} h_i\;mod\;p$\\
- if $b_i = 1$: & $A^{s_i} \stackrel{?}{\equiv} h_i * h_j^{-1} \;mod\; p$\\
- \end{tabular}
- \end{center}
- \end{enumerate}
-\end{itemize}
-
-\noindent To understand the case for $b_i = 1$, you have
-to do the following calculation:
+\noindent An article that sheds light on the paradox that
+people usually worry about privacy invasions of little
+significance, and overlook the privacy invasion that might
+cause significant damage:
\begin{center}
-\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
-$A^{s_i}$ & $=$ & $A^{r_i - r_j}$\\
- & $=$ & $A^{r_i} * A^{-r_j}$\\
- & $=$ & $h_{r_i} * h_{r_j}^{-1}\;mod\;p$
-\end{tabular}
-\end{center}
-
-\noindent What is interesting that so far nothing has been
-sent about $x$, which is the secret Alice has. Also notice
-that Bob does not know $r_j$. He received
-
-\begin{center}
-$r_j - r_j$, $r_m - r_j$, \ldots, $r_p - r_j \;mod \;p - 1$
-\end{center}
-
-\noindent whenever his corresponding bits were $1$. So Bob
-does not know $r_j$ and also does not know any $r_i$ where the
-bit was $1$. Information about the $x$ is sent in the next
-stage (obviously not revealing $x$).
-
-\begin{itemize}
-\item {\bf Proving Stage}
-
-\begin{enumerate}
-\item Alice proves she knows $x$, the discrete log of $B$,
-by sending
-
-\begin{center}
-$s_{z+1} = x - r_j\;mod\;p-1$
+\url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf}
\end{center}
-\item Bob confirms
+
+Interesting ideas
\begin{center}
-$A^{s_{z+1}} \stackrel{?}{\equiv} B * h_j^{-1} \;mod \; p$
+\url{https://adnauseam.io}
\end{center}
-\end{enumerate}
-\end{itemize}
-
-\noindent To understand the last step, you have to do the trick
-again that
-
-\[A^{s_{z+1}} = A^{x-r_j} = \ldots
-\]
-
-\noindent
-which I leave to you.
-
-Now the question is how can Alice cheat? In order to cheat she
-has to coordinate what she sends as $h_i$ in step 1 and $s_i$
-in step 3 of the commitment stage, and also what to send as
-$s_{z+1}$ in the proving stage. For the latter of course
-Alice does not know $x$, so she just chooses some random
-number for $s_{z+1}$ and calculates
-
-\[A^{s_{z+1}}\]
\noindent
-and then solves the equation
-
-\[A^{s_{z+1}} \equiv B * y \;mod\;p\]
-
-\noindent for $y$. This is easy since no logarithm needs to be
-computed. If Alice can guess the $j$ where the first 1 will
-appear in Bob's bit vector, then she sends the inverse of $y$
-as $h_j$ and 0 as $s_j$. However, notice that when she
-calculates a solution for $y$ she does not know $r_j$. For this she
-would need to calculate the modular logarithm
-
-\[y \equiv A^{r_j}\;mod\;p\]
-
-\noindent which is hard (see step 1 in the commitment stage).
-
-Having settled on what $h_j$ should be, now what should Alice
-send as the other $h_i$ and other $s_i$? If the $b_i$ happens
-to be a 1, then the $h_i$ and other $s_i$ need to satisfy the
-test
-
-\[A^{s_i} \stackrel{?}{\equiv} h_i * h_j^{-1} \;mod\; p\]
-
-\noindent where she has already settled on the value of
-$h_j^{-1}$. Lets say she choses $s_i$ at random, then she just
-needs to solve
-
-\[A^{s_i} \equiv z * h_j^{-1} \;mod\; p\]
-
-\noindent for $z$. Again that is easy, but it does not allow
-us to know $r_i$, because then we would again need to solve
-a modular logarithm problem. Let us call an $h_i$ which was
-solved the easy way as \emph{bogus}. Alice has to produce
-bogus $h_i$ for all bits that are going to be $1$ in advance!
-This means she has to guess all the bits correctly. (Yes?
-I let you think about this.)
-
-Let us see what happens if she guesses wrongly: Suppose the
-bit $b_i = 1$ where she thought she will get a 0. Then she has
-already sent an $h_i$ and $h_j$ and now must find an $s_i$
-such that
-
-\[A^{s_i} \equiv h_i * h_j^{-1} \;mod\; p\]
-
-\noindent holds. For this remember in calculating $h_i$, she
-just chose a random $s_i$. Now she has to send a genuine one.
-But this is of course too hard. If she knew the genuine $r_i$
-and $r_j$ for $h_i$ and $h_j$, it would be easy (in this case
-$s_i = r_i - r_j$). But she does not. So she will be found
-out. If $b_i = 0$, but she thought she will get a 1, then
-she has to send a $s_i$ which satisfies
-
-\[A^{s_i} \equiv h_i\;mod\;p\]
-
-\noindent Again she does not know $r_i$. So it is a too hard
-task and she will be found out again.
-
-To sum up, in order for Alice to successfully cheat Bob, she
-needs to guess \emph{all} bits correctly. She has only a
-$\frac{1}{2^z}$ chance of doing this.
-
-\subsubsection*{Further Reading}
-
-Make sure you understand what NP problems
-are.\footnote{\url{http://en.wikipedia.org/wiki/NP_(complexity)}}
-They are the building blocks for zero-knowledge proofs.
-Zero-Knowldege proofs are not yet widely used in production
-systems, but it is slowly gaining ground. One area of application
-where they pop up is crypto currencies (for example Zerocoins
-or how to make sure a Bitcoin exchange is solvent without
-revealing its assets).
-
-If you want to brush up on the modular logarithm problem,
-the Khan Academy has a nice video:
+And a paper that predicts ad-blockers will in the end win over anti-ad-blocking:
\begin{center}
-\url{https://www.khanacademy.org/video/discrete-logarithm-problem}
+\url{http://randomwalker.info/publications/ad-blocking-framework-techniques.pdf}
\end{center}
+
\end{document}
-http://blog.cryptographyengineering.com/2014/11/zero-knowledge-proofs-illustrated-primer.html
-
-http://btravers.weebly.com/uploads/6/7/2/9/6729909/zero_knowledge_technique.pdf
-http://zk-ssh.cms.ac/docs/Zero_Knowledge_Prinzipien.pdf
-http://www.wisdom.weizmann.ac.il/~oded/PS/zk-tut02v4.ps
-
-socialist millionares problem
-http://en.wikipedia.org/wiki/Socialist_millionaire
-http://twistedoakstudios.com/blog/Post3724_explain-it-like-im-five-the-socialist-millionaire-problem-and-secure-multi-party-computation
-
+http://randomwalker.info/teaching/fall-2012-privacy-technologies/?
+http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/
+http://repository.cmu.edu/cgi/viewcontent.cgi?article=1077&context=hcii
+https://josephhall.org/papers/NYU-MCC-1303-S2012_privacy_syllabus.pdf
+http://www.jetlaw.org/wp-content/uploads/2014/06/Bambauer_Final.pdf
+http://www.cs.cmu.edu/~yuxiangw/docs/Differential%20Privacy.pdf
+https://www.youtube.com/watch?v=Gx13lgEudtU
+https://www.cs.purdue.edu/homes/ctask/pdfs/CERIAS_Presentation.pdf
+http://www.futureofprivacy.org/wp-content/uploads/Differential-Privacy-as-a-Response-to-the-Reidentification-Threat-Klinefelter-and-Chin.pdf
+http://www.cis.upenn.edu/~aaroth/courses/slides/Overview.pdf
+http://www.cl.cam.ac.uk/~sjm217/papers/tor14design.pdf
%%% Local Variables:
%%% mode: latex