handouts/ho06.tex
changeset 523 7a6e8f603e08
parent 513 84ed8d6143ea
child 534 62985f147c85
--- a/handouts/ho06.tex	Sat Sep 23 19:39:53 2017 +0100
+++ b/handouts/ho06.tex	Sat Sep 23 19:52:27 2017 +0100
@@ -1,620 +1,559 @@
 \documentclass{article}
 \usepackage{../style}
+\usepackage{../graphics}
 
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015}
 
-\section*{Handout 6 (Zero-Knowledge Proofs)}
+%https://www.theguardian.com/technology/2016/oct/04/yahoo-secret-email-program-nsa-fbi
+%https://nakedsecurity.sophos.com/2015/11/12/california-collects-owns-and-sells-infants-dna-samples/
+%http://randomwalker.info/teaching/fall-2012-privacy-technologies/?
+%https://josephhall.org/papers/NYU-MCC-1303-S2012_privacy_syllabus.pdf
+%http://www.jetlaw.org/wp-content/uploads/2014/06/Bambauer_Final.pdf
+%http://www.cs.cmu.edu/~yuxiangw/docs/Differential%20Privacy.pdf
+%https://www.youtube.com/watch?v=Gx13lgEudtU
+%https://fpf.org/wp-content/uploads/Differential-Privacy-as-a-Response-to-the-Reidentification-Threat-Klinefelter-and-Chin.pdf
+%http://research.neustar.biz/2014/09/08/differential-privacy-the-basics/
+
+%=====
+%Tim Greene, Network World, 17 Dec 2015   (via ACM TechNews, 18 Dec 2015)
+%
+%Massachusetts Institute of Technology (MIT) researchers' experimental
+%Vuvuzela messaging system offers more privacy than The Onion Router (Tor) by
+%rendering text messages sent through it untraceable.  MIT Ph.D. student
+%David Lazar says Vuvuzela resists traffic analysis attacks, while Tor
+%cannot.  The researchers say the system functions no matter how many parties
+%are using it to communicate, and it employs encryption and a set of servers
+%to conceal whether or not parties are participating in text-based dialogues.
+%"Vuvuzela prevents an adversary from learning which pairs of users are
+%communicating, as long as just one out of [the] servers is not compromised,
+%even for users who continue to use Vuvuzela for years," they note.  Vuvuzela
+%can support millions of users hosted on commodity servers deployed by a
+%single group of users.  Instead of anonymizing users, Vuvuzela prevents
+%outside observers from differentiating between people sending messages,
+%receiving messages, or neither, according to Lazar.  The system imposes
+%noise on the client-server traffic which cannot be distinguished from actual
+%messages, and all communications are triple-wrapped in encryption by three
+%servers.  "Vuvuzela guarantees privacy as long as one of the servers is
+%uncompromised, so using more servers increases security at the cost of
+%increased message latency," Lazar notes.
+%http://orange.hosting.lsoft.com/trk/click?ref=znwrbbrs9_5-e70bx2d991x066779&
 
-Zero-knowledge proofs (short ZKP) solve a paradoxical puzzle:
-How to convince somebody else that one knows a secret, without
-revealing what the secret actually is? This sounds like a
-problem the Mad Hatter from Alice in Wonderland would occupy
-himself with, but actually there some serious and not so
-serious applications of it. For example, if you solve
-crosswords with your friend, say Bob, you might want to
-convince him that you found a solution for one question, but
-of course you do not want to reveal the solution, as this
-might give Bob an advantage somewhere else in the crossword.
+%%%%
+%% canvas tracking
+%%https://freedom-to-tinker.com/blog/englehardt/the-princeton-web-census-a-1-million-site-measurement-and-analysis-of-web-privacy/
+
+%%%
+%% cupit re-identification attack
+%% https://nakedsecurity.sophos.com/2016/05/20/published-personal-data-on-70000-okcupid-users-taken-down-after-dmca-order/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nakedsecurity+%28Naked+Security+-+Sophos%29
+
+%Differential privacy
+%=====================
+%https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
+
+%Differential privacy, translated from Apple-speak, is the
+%statistical science of trying to learn as much as possible
+%about a group while learning as little as possible about any
+%individual in it.
 
-So how to convince Bob that you know the answer (or a secret)?
-One way would be to come up with the following protocol:
-Suppose the answer is \textit{folio}. Then look up the
-definition of \textit{folio} in a dictionary. Say you find:
+%As Roth notes when he refers to a “mathematical proof,”
+%differential privacy doesn’t merely try to obfuscate or
+%“anonymize” users’ data. That anonymization approach, he
+%argues, tends to fail. In 2007, for instance, Netflix released
+%a large collection of its viewers’ film ratings as part of a
+%competition to optimize its recommendations, removing people’s
+%names and other identifying details and publishing only their
+%Netflix ratings. But researchers soon cross-referenced the
+%Netflix data with public review data on IMDB to match up
+%similar patterns of recommendations between the sites and add
+%names back into Netflix’s supposedly anonymous database.
+
+%As an example of that last method, Microsoft’s Dwork points to
+%the technique in which a survey asks if the respondent has
+%ever, say, broken a law. But first, the survey asks them to
+%flip a coin. If the result is tails, they should answer
+%honestly. If the result is heads, they’re instructed to flip
+%the coin again and then answer “yes” for heads or “no” for
+%tails. The resulting random noise can be subtracted from the
+%results with a bit of algebra, and every respondent is
+%protected from punishment if they admitted to lawbreaking.
 
-\begin{quote}
-``an \textit{individual} leaf of paper or parchment, either
-loose as one of a series or forming part of a bound volume,
-which is numbered on the recto or front side only.''
-\end{quote}
+%https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
+
+% Windows 10 data send back to Microsoft (Cortana)
+%Here’s a non-exhaustive list of data sent back: location data, text
+%input, voice input, touch input, webpages you visit, and telemetry
+%data regarding your general usage of your computer, including which
+%programs you run and for how long.
 
-\noindent
-Take the first non-small word\footnote{Let's say the, a, an,
-or, and \ldots fall into the category of small words.} in this definition,
-in this case \textit{individual}, and look up the definition
-of this word, say
+% Businesses are already using customised pricing online based on
+% information they can glean about you. It is hard to know how
+% widespread the practice is; companies keep their pricing strategies
+% closely guarded and are wary of the bad PR price discrimination
+% could pose. However, it is clear that a number of large retailers
+% are experimenting with it. Staples, for example, has offered
+% discounted prices based on whether rival stores are within 20 miles
+% of its customers’ location. Office Depot has admitted to using its
+% customers’ browsing history and location to vary its range of offers
+% and products. A 2014 study from Northeastern University found
+% evidence of “steering” or differential pricing at four out of 10
+% general merchandise websites and five out of five travel
+% websites. (Steering is when a company doesn’t give you a customised
+% price, but points you towards more expensive options if it thinks
+% you will pay more.) The online travel company Orbitz raised
+% headlines in 2012 when it emerged that the firm was pointing Mac
+% users towards higher-priced hotel rooms than PC users.
+
+
+%%% government will overwrite your wishes if it is annoymous
+%% https://www.lightbluetouchpaper.org/2016/12/05/government-u-turn-on-health-privacy/
+
+%% corporate surveilance / privacy - report and CC3C talk
+%%      http://crackedlabs.org/en/networksofcontrol
+%%      https://media.ccc.de/v/33c3-8414-corporate_surveillance_digital_tracking_big_data_privacy#video&t=2933
+
+\section*{Handout 6 (Privacy)}
 
-\begin{quote}
-``a single \textit{human} being as distinct from a group''
-\end{quote}
+The first motor car was invented around 1886. For ten years,
+until 1896, the law in the UK (and elsewhere) required a
+person to walk in front of any moving car waving a red flag.
+Cars were such a novelty that most people did not know what to
+make of them. The person with the red flag was intended to
+warn the public, for example horse owners, about the impending
+novelty---a car. In my humble opinion, we are at the same
+stage of development with privacy. Nobody really knows what it
+is about or what it is good for. All seems very hazy. There
+are a few laws (e.g.~cookie law, right-to-be-forgotten law)
+which address problems with privacy, but even if they are well
+intentioned, they either back-fire or are already obsolete
+because of newer technologies. The result is that the world of
+``privacy'' looks a little bit like the old Wild
+West---lawless and mythical.
 
-\noindent 
-In this definition take the second non-small word, that
-is \textit{human}, and again look up the definition of this 
-word. This will yield
+We would have hoped that after Snowden, Western governments
+would be a bit more sensitive and enlightned about the topic
+of privacy, but this is far from the truth. Ross Anderson
+wrote the following in his blog\footnote{\url{https://www.lightbluetouchpaper.org/2016/02/11/report-on-the-ip-bill/}} about the approach taken in
+the US to lessons learned from the Snowden leaks and contrasts
+this with the new snooping bill that is considered in the UK
+parliament: 
 
-\begin{quote}
-``relating to or \textit{characteristic} of humankind''
+\begin{quote}\it 
+``The comparison with the USA is stark. There, all three
+branches of government realised they'd gone too far after
+Snowden. President Obama set up the NSA review group, and
+implemented most of its recommendations by executive order;
+the judiciary made changes to the procedures of the FISA
+Court; and Congress failed to renew the data retention
+provisions in the Patriot Act (aided by the judiciary). Yet
+here in Britain the response is just to take Henry VIII powers
+to legalise all the illegal things that GCHQ had been up to,
+and hope that the European courts won't strike the law down
+yet again.''
 \end{quote}
 
-\noindent 
-You could go on looking up the definition of the third
-non-small word in this definition and so on. But let us assume
-you agreed with Bob to stop after three iterations with the
-third non-article word in the last definition, that is
-\textit{or}. Now, instead of sending to Bob the solution
-\textit{folio}, you send to him \textit{characteristic}. 
+\noindent Unfortunately, also big organisations besides
+governments seem to take an unenlightened approach to privacy.
+For example, UCAS, a charity set up to help students with
+applying to universities in the UK, has a commercial unit that
+happily sells your email addresses to anybody who forks out
+enough money for bombarding you with spam. Yes, you can opt
+out very often from such ``schemes'', but in case of UCAS any
+opt-out will limit also legit emails you might actually be
+interested in.\footnote{The main objectionable point, in my
+opinion, is that the \emph{charity} everybody has to use for
+HE applications has actually very honourable goals
+(e.g.~assist applicants in gaining access to universities),
+but the small print (or better the link ``About us'') reveals
+they set up their organisation so that they can also
+shamelessly sell the email addresses they ``harvest''.
+Everything is of course very legal\ldots{}ethical?\ldots{}well
+that is in the eye of the beholder. See:
 
-How can Bob verify that you know the solution? Well, once he
-solved it himself, he can use the dictionary and follow the
-same ``trail'' as you did. If the final word agrees with what
-you had sent him, he must infer you knew the solution earlier
-than him. This protocol works like a one-way hash function
-because \textit{characteristic} does not give any hint as to
-what was the first word was. I leave you to think why this
-protocol avoids small words? 
+\url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities} 
+or
+\url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}
 
-After Bob found his solution and verified that according to
-the protocol it ``maps'' also to \textit{characteristic}, can
-he be entirely sure no cheating is going on? Not with 100\%
-certainty. It could have been possible that he was given
-\textit{characteristic} as the word, but it derived from a
-different word. This might seem very unlikely, but at least
-theoretical it is a possibility. Protocols based on
-zero-knowledge proofs will produce a similar result---they
-give an answer that might be erroneous in a very small number
-of cases. The point is to iterate them long enough so that the
-theoretical possibility of cheating is negligibly small. 
+Another example: Verizon, an ISP who is supposed to provide
+you just with connectivity, has found a ``nice'' side-business
+too: When you have enabled all privacy guards in your browser
+(the few you have at your disposal), Verizon happily adds a
+kind of cookie to your
+HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
+As shown in the picture below, this cookie will be sent to
+every web-site you visit. The web-sites then can forward the
+cookie to advertisers who in turn pay Verizon to tell them
+everything they want to know about the person who just made
+this request, that is you.
+ 
+\begin{center}
+\includegraphics[scale=0.16]{../pics/verizon.png}
+\end{center}
 
-By the way, the authors of the paper ``Dismantling Megamos
-Crypto: Wirelessly Lockpicking a Vehicle Immobilizer'' who
-were barred from publishing their results used also a hash to
-prove they did the work and (presumably) managed to get into
-cars without a key; see Figure~\ref{paper}. This is very
-similar to the method above about crosswords: They like to
-prove that they did the work, but not giving out the
-``solution''. But this also shows what the problem with such a
-method is: yes, we can hide the secret temporarily, but if
-somebody else wants to verify it, then the secret has to be
-made public. Bob needs to know that \textit{folio} is the
-solution before he can verify the claim of Alice that she had
-the solution first. Similarly with the car-crypto paper: we
-needed to wait until September 2015 when the authors were
-finally able to publish their findings in order to verify the
-hash. Zero-knowledge proofs, in contrast, can be immediately
-checked, even if the secret is not public yet and perhaps
-never will be.
-
-\begin{figure}
-\begin{center}
-\addtolength{\fboxsep}{4mm}
-\fbox{\includegraphics[scale=0.4]{../pics/Dismantling_Megamos_Crypto.png}}
-\end{center}
-\caption{The authors of this paper used a hash in order to prove
-later that they have managed to break into cars.\label{paper}}
-\end{figure}
+\noindent How disgusting! Even worse, Verizon is not known for
+being the cheapest ISP on the planet (completely the
+contrary), and also not known for providing the fastest
+possible speeds, but rather for being among the few ISPs in
+the US with a quasi-monopolistic ``market distribution''.
 
 
-\subsubsection*{ZKP: An Illustrative Example}
+Well, we could go on and on\ldots{}and that has not even
+started us yet with all the naughty things NSA \& Friends are
+up to. Why does privacy actually matter? Nobody, I think, has
+a conclusive answer to this question yet. Maybe the following
+four notions help with clarifying the overall picture
+somewhat: 
 
-The idea behind zero-knowledge proofs is not very obvious and
-will surely take some time for you to digest. Therefore let us
-start with a simple illustrative example. The example will not
-be perfect, but hopefully explain the gist of the idea. The
-example is called Alibaba's cave, which graphically looks as 
-follows:\footnote{The example is taken from an 
-article titled ``How to Explain Zero-Knowledge Protocols
-to Your Children'' available from 
-\url{http://pages.cs.wisc.edu/~mkowalcz/628.pdf}.}
+\begin{itemize}
+\item \textbf{Secrecy} is the mechanism used to limit the
+      number of principals with access to information (e.g.,
+      cryptography or access controls). For example I better
+      keep my password secret, otherwise people from the wrong
+      side of the law might impersonate me.
+
+\item \textbf{Confidentiality} is the obligation to protect
+      the secrets of other people or organisations (secrecy
+      for the benefit of an organisation). For example as a
+      staff member at King's I have access to data, even
+      private data, I am allowed to use in my work but not
+      allowed to disclose to anyone else.
 
-\begin{center}
-\begin{tabular}{c@{\hspace{8mm}}c@{\hspace{8mm}}c}
-\includegraphics[scale=0.1]{../pics/alibaba1.png} &
-\includegraphics[scale=0.1]{../pics/alibaba2.png} &
-\includegraphics[scale=0.1]{../pics/alibaba3.png} \\
-Step 1 & Step 2 & Step 3
-\end{tabular}
-\end{center}
+\item \textbf{Anonymity} is the ability to leave no evidence of
+      an activity (e.g., sharing a secret). This is not equal
+        with privacy---anonymity is required in many 
+        circumstances, for example for whistle-blowers, 
+        voting, exam marking and so on.
 
-\noindent Let us take a closer look at the picture in Step 1:
-The cave has a tunnel which forks at some point. Both forks
-are connected in a loop. At the deep end of the loop is a
-magic wand. The point of the magic wand is that Alice knows
-the secret word for how to open it. She wants to keep the word
-secret, but wants to convince Bob that she knows it.
+\item \textbf{Privacy} is the ability or right to protect your
+      personal secrets (secrecy for the benefit of an
+      individual). For example, in a job interview, I might
+      not like to disclose that I am pregnant, if I were a
+      woman, or that I am a father. Lest they might not hire
+      me. Similarly, I might not like to disclose my location
+      data, because thieves might break into my house if they
+      know I am away at work. Privacy is essentially
+      everything which ``shouldn't be anybody's business''.
+
+\end{itemize}
+
+\noindent While this might provide us with some rough
+definitions, the problem with privacy is that it is an
+extremely fine line what should stay private and what should
+not. For example, since I am working in academia, I am every
+so often very happy to be a digital exhibitionist: I am very
+happy to disclose all `trivia' related to my work on my
+personal web-page. This is a kind of bragging that is normal
+in academia (at least in the field of CS), even expected if
+you look for a job. I am even happy that Google maintains a
+profile about all my academic papers and their citations. 
 
-One way of course would be to let Bob follow her, but then he
-would also find out the secret. So this does not work. To find
-a solution, let us first fix the rules: At the beginning Alice
-and Bob are outside the cave. Alice goes in alone and takes
-either tunnel labelled $A$ in the picture, or the other tunnel
-labelled $B$. She waits at the magic wand for the instructions
-from Bob, who also goes into the gave and observes what
-happens at the fork. He has no knowledge which tunnel Alice
-took and calls out (in Step 2) that she should emerge from tunnel
-$A$, for example. If she knows the secret for opening the
-wand, this will not be a problem for Alice. If she was already
-in the $A$-segment of the tunnel, then she just comes back. If
-she was in the $B$-segment of the tunnel she will say the magic
-word and goes through the wand to emerge from $A$ as requested
-by Bob.
+On the other hand I would be very irritated if anybody I do
+not know had a too close look on my private live---it
+shouldn't be anybody's business. The reason is that knowledge
+about my private life can often be used against me. As mentioned
+above, public location data might mean I get robbed. If
+supermarkets build a profile of my shopping habits, they will
+use it to \emph{their} advantage---surely not to \emph{my}
+advantage. Also whatever might be collected about my life will
+always be an incomplete, or even misleading, picture. For
+example I am pretty sure my creditworthiness score was
+temporarily(?) destroyed by not having a regular income in
+this country (before coming to King's I worked in Munich for
+five years). To correct such incomplete or flawed credit
+history data there is, since recently, a law that allows you
+to check what information is held about you for determining
+your creditworthiness. But this concerns only a very small
+part of the data that is held about me/you. Also
+what about cases where data is wrong or outdated (but do we
+need a right-to be forgotten).
+
+
+To see how private matter can lead really to the wrong
+conclusions, take the example of Stephen Hawking: When he was
+diagnosed with his disease, he was given a life expectancy of
+two years. If employers would know about such problems, would
+they have employed Hawking? Now, he is enjoying his 70+
+birthday. Clearly personal medical data needs to stay private.
+
+To cut a long story short, I let you ponder about the two
+statements which are often voiced in discussions about privacy:
 
-Let us have a look at the case where Alice cheats, that is not
-knows the secret. She would still go into the cave and use,
-for example the $B$-segment of the tunnel. If now Bob says she
-should emerge from $B$, she is lucky. But if he says she
-should emerge from $A$ then Alice is in trouble: Bob will find
-out she does not actually know the secret. So in order to fool
-Bob she needs to anticipate his call, and already go into the
-corresponding tunnel. This of course also does not work, since
-Bob can make any call he wants. Consequently in order to find
-out whether Alice cheats, Bob just needs to repeat this
-protocol many times. Each time Alice has a chance of
-$\frac{1}{2}$ to be lucky or being found out. Iterating this
-$n$ times means she must be right every time and when
-cheating: the probability for this is $\frac{1}{2}^n$, number
-that for already relatively small $n$, say 10, is incredibly 
-small.
+\begin{itemize}
+\item \textit{``You have zero privacy anyway. Get over 
+it.''}\\
+\mbox{}\hfill{}{\small{}(by Scott Mcnealy, former CEO of Sun)}
+
+\item \textit{``If you have nothing to hide, you have nothing 
+to fear.''}
+\end{itemize}
+ 
+\noindent If you like to watch a movie which has this topic as
+its main focus I recommend \emph{Gattaca} from
+1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}} If
+you want to read up on this topic, I can recommend the
+following article that appeared in 2011 in the Chronicle of
+Higher Education:
+
+\begin{center} 
+\url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/} 
+\end{center} 
+
+\noindent Funnily, or maybe not so funnily, the author of this
+article carefully tries to construct an argument that does not
+only attack the nothing-to-hide statement in cases where
+governments \& co collect people's deepest secrets, or
+pictures of people's naked bodies, but an argument that
+applies also in cases where governments ``only'' collect data
+relevant to, say, preventing terrorism. The fun is of course
+that in 2011 we could just not imagine that respected
+governments would do such infantile things as intercepting
+people's nude photos. Well, since Snowden we know some people
+at the NSA did exactly that and then shared such photos among
+colleagues as ``fringe benefit''.  
 
 
-There are some interesting observations we can make about 
-Alibaba's cave and the ZKP protocol between Alice and Bob:
+\subsubsection*{Re-Identification Attacks} 
+
+Apart from philosophical musings, there are fortunately also
+some real technical problems with privacy. The problem I want
+to focus on in this handout is how to safely disclose datasets
+containing potentially very private data, say health records.
+What can go wrong with such disclosures can be illustrated
+with four well-known examples:
 
 \begin{itemize}
-\item {\bf Completeness} If Alice knows the secret, Bob
-      accepts Alice ``proof'' for sure. There is no error
-      possible in that Bob thinks Alice cheats when she
-      actually knows the secret. 
-\item {\bf Soundness} If Alice does not know the secret,
-      Bob accepts her ``proof'' with a very small probability.
-      If, as in the example above, the probability of being
-      able to hide cheating is $\frac{1}{2}$ in each round 
-      it will be $\frac{1}{2}^n$ after $n$-rounds, which even
-      for small $n$ say $> 10$ is very small indeed.
-\item {\bf Zero-Knowledge} Even if Bob accepts
-      the proof by Alice, he cannot convince anybody else.
-\end{itemize} 
+\item In 2006, a then young company called Netflix offered a 1
+      Mio \$ prize to anybody who could improve their movie
+      rating algorithm. For this they disclosed a dataset
+      containing 10\% of all Netflix users at the time
+      (appr.~500K). They removed names, but included numerical
+      ratings of movies as well as times when ratings were
+      uploaded. Though some information was perturbed (i.e.,
+      slightly modified).
+      
+      Two researchers had a closer look at this anonymised
+      data and compared it with public data available from the
+      International Movie Database (IMDb). They found that
+      98\% of the entries could be re-identified in the
+      Netflix dataset: either by their ratings or by the dates
+      the ratings were uploaded. The result was a class-action
+      suit against Netflix, which was only recently resolved
+      involving a lot of money.
 
-\noindent The last property is the most interesting one.
-Assume Alice has convinced Bob that she knows the secret and
-Bob filmed the whole protocol with a camera. Can he use the
-video to convince anybody else? After a moment of thought, you
-will agree that this is not the case. Alice and Bob might have
-just made it all up and colluded by Bob telling Alice
-beforehand which tunnel he will call. In this way it appears
-as if all iterations of the protocol were successful, but they
-prove nothing. If another person wants to find out whether
-Alice knows the secret, he or she would have to conduct the
-protocol again. This is actually the formal definition of a
-zero-knowledge proof: an independent observer cannot
-distinguish between a real protocol (where Alice knows the
-secret) and a fake one (where Bob and Alice colluded).
+\item In the 1990ies, medical datasets were often made public
+      for research purposes. This was done in anonymised form
+      with names removed, but birth dates, gender and ZIP-code
+      were retained. In one case where such data about
+      hospital visits of state employees in Massachusetts was
+      made public, the then governor assured the public that
+      the released dataset protected patient privacy by
+      deleting identifiers. 
+      
+      A graduate student could not resist cross-referencing
+      public voter data with the released data that still
+      included birth dates, gender and ZIP-code. The result
+      was that she could send the governor his own hospital
+      record. It turns out that birth dates, gender and
+      ZIP-code uniquely identify 87\% of people in the US.
+      This work resulted in a number of laws prescribing which
+      private data cannot be released in such datasets.
+ 
+\item In 2006, AOL published 20 million Web search queries
+      collected from 650,000 users (names had been deleted).
+      This was again done for research purposes. However,
+      within days an old lady, Thelma Arnold, from Lilburn,
+      Georgia, (11,596 inhabitants) was identified as user
+      No.~4417749 in this dataset. It turned out that search
+      engine queries are deep windows into people's private
+      lives. 
+  
+\item Genome-Wide Association Studies (GWAS) was a public
+      database of gene-frequency studies linked to diseases.
+      It would essentially record that people who have a
+      disease, say diabetes, have also certain genes. In order
+      to maintain privacy, the dataset would only include
+      aggregate information. In case of DNA data this
+      aggregation was achieved by mixing the DNA of many
+      individuals (having a disease) into a single solution.
+      Then this mixture was sequenced and included in the
+      dataset. The idea was that the aggregate information
+      would still be helpful to researchers, but would protect
+      the DNA data of individuals. 
+       
+      In 2007 a forensic computer scientist showed that
+      individuals can still be identified. For this he used
+      the DNA data from a comparison group (people from the
+      general public) and ``subtracted'' this data from the
+      published data. He was left with data that included all
+      ``special'' DNA-markers of the individuals present in
+      the original mixture. He essentially deleted the
+      ``background noise'' in the published data. The problem
+      with DNA data is that it is of such a high resolution
+      that even if the mixture contained maybe 100
+      individuals, you can with current technology detect
+      whether an individual was included in the mixture or
+      not.
+      
+      This result changed completely how DNA data is nowadays
+      published for research purposes. After the success of 
+      the human-genome project with a very open culture of
+      exchanging data, it became much more difficult to 
+      anonymise data so that patient's privacy is preserved.
+      The public GWAS database was taken offline in 2008.
+      
+\end{itemize}
+
+\noindent There are many lessons that can be learned from
+these examples. One is that when making datasets public in
+anonymised form, you want to achieve \emph{forward privacy}.
+This means, no matter what other data that is also available
+or will be released later, the data in the original dataset
+does not compromise an individual's privacy. This principle
+was violated by the availability of ``outside data'' in the
+Netflix and governor of Massachusetts cases. The additional
+data permitted a re-identification of individuals in the
+dataset. In case of GWAS a new technique of re-identification
+compromised the privacy of people in the dataset. The case of
+the AOL dataset shows clearly how incomplete such data can be:
+Although the queries uniquely identified the older lady, she
+also looked up diseases that her friends had, which had
+nothing to do with her. Any rational analysis of her query
+data must therefore have concluded, the lady is on her
+death bed, while she was actually very much alive and kicking.
+
+In 2016, Yahoo released the so far largest machine learning
+dataset to the research community. It includes approximately
+13.5 TByte of data representing around 100 Billion events from
+anonymized user-news items, collected by recording
+interactions of about 20M users from February 2015 to May
+2015. Yahoo's gracious goal is to promote independent research
+in the fields of large-scale machine learning and recommender
+systems. It remains to be seen whether this data will really
+only be used for that purpose.
+
+\subsubsection*{Differential Privacy}
+
+Differential privacy is one of the few methods that tries to
+achieve forward privacy. The basic idea is to add appropriate
+noise, or errors, to any query of the dataset. The intention
+is to make the result of a query insensitive to individual
+entries in the database. That means the results are
+approximately the same no matter if a particular individual is
+in the dataset or not. The hope is that the added error does
+not eliminate the ``signal'' one is looking for in the
+dataset.
+
+%\begin{center}
+%User\;\;\;\;    
+%\begin{tabular}{c}
+%tell me $f(x)$ $\Rightarrow$\\
+%$\Leftarrow$ $f(x) + \text{noise}$
+%\end{tabular}
+%\;\;\;\;\begin{tabular}{@{}c}
+%Database\\
+%$x_1, \ldots, x_n$
+%\end{tabular}
+%\end{center}
+%
+%\begin{center}
+%\begin{tabular}{l|l}
+%Staff & Salary\\\hline
+%$PM$ & \pounds{107}\\
+%$PF$ & \pounds{102}\\
+%$LM_1$ & \pounds{101}\\
+%$LF_2$ & \pounds{97}\\
+%$LM_3$ & \pounds{100}\\
+%$LM_4$ & \pounds{99}\\
+%$LF_5$ & \pounds{98}
+%\end{tabular}
+%\end{center}
+%
+%
+%\begin{center}
+%\begin{tikzpicture} 
+%\begin{axis}[symbolic y coords={salary},
+%             ytick=data,
+%             height=3cm]
%\addplot+[jump mark mid] coordinates
%{(0,salary)   (0.1,salary) 
% (0.4,salary) (0.5,salary)  
% (0.8,salary) (0.9,salary)};
%\end{axis}
%\end{tikzpicture}
+%\end{center}
+%
+%\begin{tikzpicture}[outline/.style={draw=#1,fill=#1!20}]
%  \node [outline=red]            {red box};
%  \node [outline=blue] at (0,-1) {blue box};
%\end{tikzpicture}
+
+\ldots
 
 
-\subsubsection*{Using an Graph-Isomorphism Problem for ZKPs}
-
-Now the question is how can we translate Alibaba's cave into a
-computer science solution? It turns out we need an NP problem
-for that. The main feature of an NP problem is that it is
-computational very hard to generate a solution, but it is very
-easy to check whether a given solution indeed solves the
-problem at hand.\footnote{The question whether $P = NP$ or not,
-we leave to others to speculate about.} 
-
-One NP problem is the \emph{graph isomorphism problem}. It
-essentially determines whether the following two graphs, say
-$G_1$ and $G_2$, can be moved and stretched so that they look
-exactly the same.
+\subsubsection*{Further Reading}
 
-\begin{center}
-\begin{tabular}{c@{\hspace{8mm}}c@{\hspace{8mm}}c}
-$G_1$ & $G_2$\\
-\raisebox{-13mm}{\includegraphics[scale=0.4]{../pics/simple.png}} &
-\raisebox{-13mm}{\includegraphics[scale=0.4]{../pics/simple-b.png}}&
+Two cool articles about how somebody obtained via the Freedom
+of Information Law the taxicab dataset of New York and someone
+else showed how easy it is to mine for private information: 
 
-\footnotesize
-\begin{tabular}{rl}
-Graph $G_1$	& Graph $G_2$\\
-a  & 1\\
-b  & 6\\
-c  & 8\\
-d  & 3\\
-g  & 5\\
-h  & 2\\
-i  & 4\\
-j  & 7\\
-\end{tabular}
+\begin{center}\small
+\begin{tabular}{p{0.78\textwidth}}
+\url{http://chriswhong.com/open-data/foil_nyc_taxi/}\smallskip\\
+\url{http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset}
 \end{tabular}
 \end{center}
 
-\noindent The table on the right gives a mapping of the nodes
-of the first graph to the nodes of the second. With this
-mapping we can check: node $a$ is connected in $G_1$ with $g$,
-$h$ and $i$. Node $a$ maps to node $1$ in $G_2$, which is
-connected to $2$, $4$ and $5$, which again correspond via the
-mapping to $h$, $i$ and $g$ respectively. Let us write
-$\sigma$ for such a table and let us write
-
-\[G_1 = \sigma(G_2)\]
-
-\noindent for two isomorphic graphs ($\sigma$ being the
-isomorphism). It is actually very easy to construct two
-isomorphic graphs: Start with an arbitrary graph, re-label the
-nodes consistently. Alice will need to do this frequently 
-for the protocol below. In order to be useful, we therefore
-would need to consider substantially larger graphs, like
-with thousand nodes. 
-
-Now the secret which Alice tries to hide is the knowledge of
-such an isomorphism $\sigma$ between two such graphs. But she
-can convince Bob whether she knows such an isomorphism for two
-graphs, say $G_1$ and $G_2$, using the same idea as in the
-example with Alibaba's cave. For this Alice and Bob must
-follow the following protocol:
-
-\begin{enumerate}
-\item Alice generates an isomorphic graph $H$ which she sends 
-      to Bob (in each iteration she needs to generate a 
-      different $H$). 
-\item
-      After receiving $H$, Bob asks Alice either for an      
-      isomorphism between $G_1$ and $H$, or $G_2$ and $H$.
-\item Alice and Bob repeat this procedure $n$ times.
-\end{enumerate}
-
-\noindent In Step 1 it is important that Alice always 
-generates a fresh isomorphic graph. I let you think what
-would happen if Alice sends out twice the same graph $H$.
-
-As said before, this is relatively easy to generate by
-consistently relabelling nodes. If she started from $G_1$,
-Alice will have generated
-
-\begin{equation}
-H = \sigma'(G_1)\label{hiso}
-\end{equation}
- 
-\noindent where $\sigma'$ is the isomorphism between $H$ and
-$G_1$. But she could equally have started from $G_2$. In the 
-case where $G_1$ and $G_2$ are isomorphic, if $H$ is 
-isomorphic with $G_1$, it will also be isomorphic with 
-$G_2$, and vice versa. 
-
-After generating $H$, Alice sends it to Bob. The important
-point is that she needs to ``commit'' to this $H$, therefore
-this kind of zero-knowledge protocols are called
-\emph{commitment protocols}. Only after receiving $H$, Bob
-will make up his mind about which isomorphism he asks
-for---whether between $H$ and $G_1$ or $H$ and $G_2$. For this
-he could flip a coin, since the choice should be as
-unpredictable for Alice as possible. Once Alice receives the
-request, she has to produce an isomorphism. If she generated
-$H$ as shown in \eqref{hiso} and is asked for an isomorphism
-between $H$ and $G_1$, she just sends $\sigma'$. If she had
-been asked for an isomorphism between $H$ and $G_2$, she just
-has to compose her secret isomorphism $\sigma$ and $\sigma'$.
-The main point for the protocol is that even knowing the
-isomorphism between $H$ and $G_1$ or $H$ and $G_2$, will not
-make the task easier to find the isomorphism between $G_1$ and
-$G_2$, which is the secret Alice tries to protect.
-
-In order to make it crystal clear how this protocol proceeds, 
-let us give a version using our more formal notation for 
-protocols:
+\noindent 
+A readable article about how supermarkets mine your shopping
+habits (especially how they prey on new exhausted parents
+;o) appeared in 2012 in the New York Times:
 
 \begin{center}
-\begin{tabular}{lrl}
-0)  & $A \to B:$ & $G_1$ and $G_2$\\
-1a) & $A \to B:$ & $H_1$ \\
-1b) & $B \to A:$ & produce isomorphism $G_1 \leftrightarrow H_1$? 
- \quad(or $G_2 \leftrightarrow H_1$)\\
-1c) & $A \to B:$ & requested isomorphism\\
-2a) & $A \to B:$ & $H_2$\\
-2b) & $B \to A:$ & produce isomorphism $G_1 \leftrightarrow H_2$? 
- \quad(or $G_2 \leftrightarrow H_2$)\\
-2c) & $A \to B:$ & requested isomorphism\\
-    & \ldots\\
-\end{tabular}
+\url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html}
 \end{center}
 
-\noindent As can be seen the protocol runs for some
-agreed number of iterations. The $H_i$ Alice needs to 
-produce, need to be all distinct. I hope you now know
-why?
-
-It is also crucial that in each iteration, Alice first sends
-$H_i$ and then Bob can decide which isomorphism he wants:
-either $G_1 \leftrightarrow H_i$ or $G_2 \leftrightarrow H_i$.
-If somehow Alice can find out before she committed to $H_i$,
-she can cheat. For this assume Alice does \emph{not} know an
-isomorphism between $G_1$ and $G_2$. If she knows which
-isomorphism Bob will ask for she can craft $H$ in such a way
-that it is isomorphism with either $G_1$ or $G_2$ (but it
-cannot with both). Then in each case she would send Bob
-a correct answer and he would come to the conclusion that
-all is well. I let you also answer the question whether
-such a protocol run between Alice and Bob would convince
-a third person, say Pete.
- 
-Since the interactive nature of the above PKZ protocol and the
-correct ordering of the messages is so important for the
-``correctness'' of the protocol, it might look surprising that
-the same goal can also me achieved in a completely offline
-manner. By this I mean Alice can publish all data at once, and
-then at a later time, Bob can inspect the data and come to the
-conclusion whether or not Alice knows the secret (again
-without actually learning about the secret). For this
-Alice has to do the following:
+\noindent An article that analyses privacy and shopping habits 
+from a more economic point of view is available from:
 
-\begin{enumerate}
-\item Alice generates $n$ isomorphic graphs
-      $H_{1..n}$ (they need to be all distinct)
-\item she feeds the $H_{1..n}$ into a hashing function
-      (for example encoded as as matrix)
-\item she takes the first $n$ bits of the output of the hashing
-      function:
-      whenever the output is $0$, she shows an isomorphism
-      with $G_1$; for $1$ she shows an isomorphism
-      with $G_2$
-\end{enumerate}
+\begin{center}
+\url{http://www.dtc.umn.edu/~odlyzko/doc/privacy.economics.pdf}
+\end{center}
 
-\noindent The reason why this works and achieves the same
-goal as the interactive variant is that Alice has no
-control over the hashing function. It would be 
-computationally just too hard to assemble a set of
-$H_{1..n}$ such that she can force where $0$s and $1$s
-in the hash values are such that it would pass an external
-test. The point is that Alice can publish all this data
-on the comfort of her own web-page, for example, and 
-in this way convince everybody who bothers to check. 
-
-The virtue of the use of graphs and isomorphism for a 
-zero-knowledge protocol is that the idea why it works
-are relatively straightforward. The disadvantage is
-that encoding any secret into a graph-isomorphism, while
-possible, is awkward. The good news is that in fact
-any NP problem can be used as part of a ZKP protocol.  
-
-
-\subsubsection*{Using Modular Logarithms for ZKP Protocols}
-
-While information can be encoded into graph isomorphisms, it
-is not the most convenient carrier of information. Clearly it
-is much easier to encode information into numbers. Let us look
-at zero-knowledge proofs that use numbers as secrets. For this
-the underlying NP-problem is to calculate discrete logarithms.
-It can be used by choosing public numbers $A$, $B$, $p$, and
-private $x$ such that
+\noindent An attempt to untangle the web of current technology
+for spying on consumers is published in:
 
 \begin{center}
-$A^x \equiv B\; mod\; p$
-\end{center} 
-
-\noindent holds. The secret Alice tries to keep secret is $x$.
-The point of the modular logarithm is that it is very hard 
-from the public data to calculate $x$ (for large primes). 
-Now the protocol proceeds in three stages:
-
-\begin{itemize}
-\item {\bf Commitment Stage}
-  \begin{enumerate}
-    \item Alice generates $z$ random numbers $r_1, \ldots, r_z$, 
-      all less than $p - 1$. Alice then sends Bob for all 
-      $i = 1,\ldots, z$:
-      \[ h_i = A^{r_i}\; mod\; p\]
-    \item Bob generates $z$ random bits, say $b_1,\ldots, b_z$. He can do this 
-       by flipping $z$ times a coin, for example.
-    \item For each bit $b_i$, Alice sends Bob an $s_i$ where
-
-      \begin{center}
-      \begin{tabular}{ll}
-      if $b_i = 0$: & $s_i = r_i$\\
-      if $b_i = 1$: & $s_i = (r_i - r_j) \;mod\; (p -1)$\\
-      \end{tabular}
-      \end{center}
+\url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf}
+\end{center}
 
-      where $r_j$ is the lowest $j$ where $b_j = 1$.
- 
-    \end{enumerate}
- \end{itemize}   
-    
-\noindent For understanding the last step, let $z$ be just 4.
-We have four random values $r_i$ chosen by Alice and four
-random bits $b_i$ chosen subsequently by Bob, for example
-  
-  \begin{center}
-  \begin{tabular}{lcccc}
-  $r_i$:\; & 4 & 9 & 1 & 3\\ 
-  $b_i$:\; & 0 & 1 & 0 & 1\\
-  & & $\uparrow$ \\
-  & & $j$
-  \end{tabular}             
-  \end{center}         
-    
-  \noindent The highlighted column is the lowest where $b_i = 
-  1$ (counted from the left). That means $r_j = 9$. The reason
-  for letting Alice choose the random numbers $r_1, \ldots, r_z$
-  will become clear shortly. Next is the confirmation 
-  phase where Bob essentially checks whether Alice has sent
-  him ``correct'' $s_i$ and $h_i$.
-    
-\begin{itemize}   
-\item {\bf Confirmation Stage}
-  \begin{enumerate}
-  \item For each $b_i$ Bob checks whether $s_i$ 
-  conform to the protocol
-
-  \begin{center}
-  \begin{tabular}{ll}
-  if $b_i = 0$: & $A^{s_i} \stackrel{?}{\equiv} h_i\;mod\;p$\\
-  if $b_i = 1$: & $A^{s_i} \stackrel{?}{\equiv} h_i * h_j^{-1}  \;mod\; p$\\
-  \end{tabular}
-  \end{center}
-  \end{enumerate}
-\end{itemize}
-
-\noindent To understand the case for $b_i = 1$, you have 
-to do the following calculation: 
+\noindent An article that sheds light on the paradox that
+people usually worry about privacy invasions of little
+significance, and overlook the privacy invasion that might
+cause significant damage:
 
 \begin{center}
-\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
-$A^{s_i}$ & $=$ & $A^{r_i - r_j}$\\ 
-          & $=$ & $A^{r_i} * A^{-r_j}$\\
-          & $=$ & $h_{r_i} * h_{r_j}^{-1}\;mod\;p$   
-\end{tabular}
-\end{center}
-
-\noindent What is interesting that so far nothing has been 
-sent about $x$, which is the secret Alice has. Also notice 
-that Bob does not know $r_j$. He received 
-
-\begin{center}
-$r_j - r_j$,  $r_m - r_j$, \ldots, $r_p - r_j \;mod \;p - 1$ 
-\end{center}
-
-\noindent whenever his corresponding bits were $1$. So Bob
-does not know $r_j$ and also does not know any $r_i$ where the
-bit was $1$. Information about the $x$ is sent in the next
-stage (obviously not revealing $x$).
-
-\begin{itemize}   
-\item {\bf Proving Stage}
-
-\begin{enumerate}
-\item Alice proves she knows $x$, the discrete log of $B$,
-by sending
-
-\begin{center}
-$s_{z+1} = x - r_j\;mod\;p-1$
+\url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf}
 \end{center}
 
-\item Bob confirms
+
+Interesting ideas
 
 \begin{center}
-$A^{s_{z+1}} \stackrel{?}{\equiv} B * h_j^{-1} \;mod \; p$
+\url{https://adnauseam.io}
 \end{center}
-\end{enumerate}
-\end{itemize}
-
-\noindent To understand the last step, you have to do the trick
-again that 
-
-\[A^{s_{z+1}} = A^{x-r_j} = \ldots
-\]
-
-\noindent
-which I leave to you.
-
-Now the question is how can Alice cheat? In order to cheat she
-has to coordinate what she sends as $h_i$ in step 1 and $s_i$
-in step 3 of the commitment stage, and also what to send as
-$s_{z+1}$ in the proving stage. For the latter of course 
-Alice does not know $x$, so she just chooses some random 
-number for $s_{z+1}$ and calculates 
-
-\[A^{s_{z+1}}\]
 
 \noindent
-and then solves the equation
-
-\[A^{s_{z+1}} \equiv B * y \;mod\;p\]
-
-\noindent for $y$. This is easy since no logarithm needs to be
-computed. If Alice can guess the $j$ where the first 1 will
-appear in Bob's bit vector, then she sends the inverse of $y$
-as $h_j$ and 0 as $s_j$. However, notice that when she
-calculates a solution for $y$ she does not know $r_j$. For this she
-would need to calculate the modular logarithm
-
-\[y \equiv A^{r_j}\;mod\;p\] 
-
-\noindent which is hard (see step 1 in the commitment stage).
-
-Having settled on what $h_j$ should be, now what should Alice
-send as the other $h_i$ and other $s_i$? If the $b_i$ happens
-to be a 1, then the $h_i$ and other $s_i$ need to satisfy the 
-test
-
-\[A^{s_i} \stackrel{?}{\equiv} h_i * h_j^{-1}  \;mod\; p\]
-
-\noindent where she has already settled on the value of 
-$h_j^{-1}$. Lets say she choses $s_i$ at random, then she just 
-needs to solve
-
-\[A^{s_i} \equiv z * h_j^{-1}  \;mod\; p\] 
-
-\noindent for $z$. Again that is easy, but it does not allow 
-us to know $r_i$, because then we would again need to solve
-a modular logarithm problem. Let us call an $h_i$ which was 
-solved the easy way as \emph{bogus}. Alice has to produce
-bogus $h_i$ for all bits that are going to be $1$ in advance!
-This means she has to guess all the bits correctly. (Yes? 
-I let you think about this.)
-
-Let us see what happens if she guesses wrongly: Suppose the
-bit $b_i = 1$ where she thought she will get a 0. Then she has
-already sent an $h_i$ and $h_j$ and now must find an $s_i$
-such that 
-
-\[A^{s_i} \equiv h_i * h_j^{-1}  \;mod\; p\]
-
-\noindent holds. For this remember in calculating $h_i$, she
-just chose a random $s_i$. Now she has to send a genuine one.
-But this is of course too hard. If she knew the genuine $r_i$
-and $r_j$ for $h_i$ and $h_j$, it would be easy (in this case
-$s_i = r_i - r_j$). But she does not. So she will be found
-out. If $b_i = 0$, but she thought she will get a 1, then 
-she has to send a $s_i$ which satisfies
-
-\[A^{s_i} \equiv h_i\;mod\;p\]
-
-\noindent Again she does not know $r_i$. So it is a too hard 
-task and she will be found out again.
-
-To sum up, in order for Alice to successfully cheat Bob, she
-needs to guess \emph{all} bits correctly. She has only a
-$\frac{1}{2^z}$ chance of doing this.
-
-\subsubsection*{Further Reading}
-
-Make sure you understand what NP problems
-are.\footnote{\url{http://en.wikipedia.org/wiki/NP_(complexity)}}
-They are the building blocks for zero-knowledge proofs.
-Zero-Knowldege proofs are not yet widely used in production
-systems, but it is slowly gaining ground. One area of application
-where they pop up is crypto currencies (for example Zerocoins
-or how to make sure a Bitcoin exchange is solvent without
-revealing its assets).
-
-If you want to brush up on the modular logarithm problem,
-the Khan Academy has a nice video:
+And a paper that predicts ad-blockers will in the end win over anti-ad-blocking:
 
 \begin{center}
-\url{https://www.khanacademy.org/video/discrete-logarithm-problem}
+\url{http://randomwalker.info/publications/ad-blocking-framework-techniques.pdf}
 \end{center}
 
+
 \end{document}
 
-http://blog.cryptographyengineering.com/2014/11/zero-knowledge-proofs-illustrated-primer.html
-
-http://btravers.weebly.com/uploads/6/7/2/9/6729909/zero_knowledge_technique.pdf
-http://zk-ssh.cms.ac/docs/Zero_Knowledge_Prinzipien.pdf
-http://www.wisdom.weizmann.ac.il/~oded/PS/zk-tut02v4.ps
-
-socialist millionares problem
-http://en.wikipedia.org/wiki/Socialist_millionaire
-http://twistedoakstudios.com/blog/Post3724_explain-it-like-im-five-the-socialist-millionaire-problem-and-secure-multi-party-computation
-
+http://randomwalker.info/teaching/fall-2012-privacy-technologies/?
+http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/
+http://repository.cmu.edu/cgi/viewcontent.cgi?article=1077&context=hcii
+https://josephhall.org/papers/NYU-MCC-1303-S2012_privacy_syllabus.pdf
+http://www.jetlaw.org/wp-content/uploads/2014/06/Bambauer_Final.pdf
+http://www.cs.cmu.edu/~yuxiangw/docs/Differential%20Privacy.pdf
+https://www.youtube.com/watch?v=Gx13lgEudtU
+https://www.cs.purdue.edu/homes/ctask/pdfs/CERIAS_Presentation.pdf
+http://www.futureofprivacy.org/wp-content/uploads/Differential-Privacy-as-a-Response-to-the-Reidentification-Threat-Klinefelter-and-Chin.pdf
+http://www.cis.upenn.edu/~aaroth/courses/slides/Overview.pdf
+http://www.cl.cam.ac.uk/~sjm217/papers/tor14design.pdf
 
 %%% Local Variables: 
 %%% mode: latex