handouts/ho06.tex
changeset 523 7a6e8f603e08
parent 513 84ed8d6143ea
child 534 62985f147c85
equal deleted inserted replaced
522:280e057558b8 523:7a6e8f603e08
     1 \documentclass{article}
     2 \usepackage{../style}
       
     3 \usepackage{../graphics}
     3 
     4 
     4 \begin{document}
     5 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015}
     6 
     7 
     7 \section*{Handout 6 (Zero-Knowledge Proofs)}
     8 %https://www.theguardian.com/technology/2016/oct/04/yahoo-secret-email-program-nsa-fbi
     8 
     9 %https://nakedsecurity.sophos.com/2015/11/12/california-collects-owns-and-sells-infants-dna-samples/
     9 Zero-knowledge proofs (ZKP for short) solve a paradoxical puzzle:
    10 %http://randomwalker.info/teaching/fall-2012-privacy-technologies/?
    10 How to convince somebody else that one knows a secret, without
    11 %https://josephhall.org/papers/NYU-MCC-1303-S2012_privacy_syllabus.pdf
    11 revealing what the secret actually is? This sounds like a
    12 %http://www.jetlaw.org/wp-content/uploads/2014/06/Bambauer_Final.pdf
    12 problem the Mad Hatter from Alice in Wonderland would occupy
    13 %http://www.cs.cmu.edu/~yuxiangw/docs/Differential%20Privacy.pdf
    13 himself with, but actually there are some serious and not so
    14 %https://www.youtube.com/watch?v=Gx13lgEudtU
    14 serious applications of it. For example, if you solve
    15 %https://fpf.org/wp-content/uploads/Differential-Privacy-as-a-Response-to-the-Reidentification-Threat-Klinefelter-and-Chin.pdf
    15 crosswords with your friend, say Bob, you might want to
    16 %http://research.neustar.biz/2014/09/08/differential-privacy-the-basics/
    16 convince him that you found a solution for one question, but
    17 
    17 of course you do not want to reveal the solution, as this
    18 %=====
    18 might give Bob an advantage somewhere else in the crossword.
    19 %Tim Greene, Network World, 17 Dec 2015   (via ACM TechNews, 18 Dec 2015)
    19 
    20 %
    20 So how to convince Bob that you know the answer (or a secret)?
    21 %Massachusetts Institute of Technology (MIT) researchers' experimental
    21 One way would be to come up with the following protocol:
    22 %Vuvuzela messaging system offers more privacy than The Onion Router (Tor) by
    22 Suppose the answer is \textit{folio}. Then look up the
    23 %rendering text messages sent through it untraceable.  MIT Ph.D. student
    23 definition of \textit{folio} in a dictionary. Say you find:
    24 %David Lazar says Vuvuzela resists traffic analysis attacks, while Tor
    24 
    25 %cannot.  The researchers say the system functions no matter how many parties
    25 \begin{quote}
    26 %are using it to communicate, and it employs encryption and a set of servers
    26 ``an \textit{individual} leaf of paper or parchment, either
    27 %to conceal whether or not parties are participating in text-based dialogues.
    27 loose as one of a series or forming part of a bound volume,
    28 %"Vuvuzela prevents an adversary from learning which pairs of users are
    28 which is numbered on the recto or front side only.''
    29 %communicating, as long as just one out of [the] servers is not compromised,
       
    30 %even for users who continue to use Vuvuzela for years," they note.  Vuvuzela
       
    31 %can support millions of users hosted on commodity servers deployed by a
       
    32 %single group of users.  Instead of anonymizing users, Vuvuzela prevents
       
    33 %outside observers from differentiating between people sending messages,
       
    34 %receiving messages, or neither, according to Lazar.  The system imposes
       
    35 %noise on the client-server traffic which cannot be distinguished from actual
       
    36 %messages, and all communications are triple-wrapped in encryption by three
       
    37 %servers.  "Vuvuzela guarantees privacy as long as one of the servers is
       
    38 %uncompromised, so using more servers increases security at the cost of
       
    39 %increased message latency," Lazar notes.
       
    40 %http://orange.hosting.lsoft.com/trk/click?ref=znwrbbrs9_5-e70bx2d991x066779&
       
    41 
       
    42 %%%%
       
    43 %% canvas tracking
       
    44 %%https://freedom-to-tinker.com/blog/englehardt/the-princeton-web-census-a-1-million-site-measurement-and-analysis-of-web-privacy/
       
    45 
       
    46 %%%
       
    47 %% cupit re-identification attack
       
    48 %% https://nakedsecurity.sophos.com/2016/05/20/published-personal-data-on-70000-okcupid-users-taken-down-after-dmca-order/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nakedsecurity+%28Naked+Security+-+Sophos%29
       
    49 
       
    50 %Differential privacy
       
    51 %=====================
       
    52 %https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
       
    53 
       
    54 %Differential privacy, translated from Apple-speak, is the
       
    55 %statistical science of trying to learn as much as possible
       
    56 %about a group while learning as little as possible about any
       
    57 %individual in it.
       
    58 
       
    59 %As Roth notes when he refers to a “mathematical proof,”
       
    60 %differential privacy doesn’t merely try to obfuscate or
       
    61 %“anonymize” users’ data. That anonymization approach, he
       
    62 %argues, tends to fail. In 2007, for instance, Netflix released
       
    63 %a large collection of its viewers’ film ratings as part of a
       
    64 %competition to optimize its recommendations, removing people’s
       
    65 %names and other identifying details and publishing only their
       
    66 %Netflix ratings. But researchers soon cross-referenced the
       
    67 %Netflix data with public review data on IMDB to match up
       
    68 %similar patterns of recommendations between the sites and add
       
    69 %names back into Netflix’s supposedly anonymous database.
       
    70 
       
    71 %As an example of that last method, Microsoft’s Dwork points to
       
    72 %the technique in which a survey asks if the respondent has
       
    73 %ever, say, broken a law. But first, the survey asks them to
       
    74 %flip a coin. If the result is tails, they should answer
       
    75 %honestly. If the result is heads, they’re instructed to flip
       
    76 %the coin again and then answer “yes” for heads or “no” for
       
    77 %tails. The resulting random noise can be subtracted from the
       
    78 %results with a bit of algebra, and every respondent is
       
    79 %protected from punishment if they admitted to lawbreaking.
       
    80 
       
    81 %https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
       
    82 
       
    83 % Windows 10 data send back to Microsoft (Cortana)
       
    84 %Here’s a non-exhaustive list of data sent back: location data, text
       
    85 %input, voice input, touch input, webpages you visit, and telemetry
       
    86 %data regarding your general usage of your computer, including which
       
    87 %programs you run and for how long.
       
    88 
       
    89 % Businesses are already using customised pricing online based on
       
    90 % information they can glean about you. It is hard to know how
       
    91 % widespread the practice is; companies keep their pricing strategies
       
    92 % closely guarded and are wary of the bad PR price discrimination
       
    93 % could pose. However, it is clear that a number of large retailers
       
    94 % are experimenting with it. Staples, for example, has offered
       
    95 % discounted prices based on whether rival stores are within 20 miles
       
    96 % of its customers’ location. Office Depot has admitted to using its
       
    97 % customers’ browsing history and location to vary its range of offers
       
    98 % and products. A 2014 study from Northeastern University found
       
    99 % evidence of “steering” or differential pricing at four out of 10
       
   100 % general merchandise websites and five out of five travel
       
   101 % websites. (Steering is when a company doesn’t give you a customised
       
   102 % price, but points you towards more expensive options if it thinks
       
   103 % you will pay more.) The online travel company Orbitz raised
       
   104 % headlines in 2012 when it emerged that the firm was pointing Mac
       
   105 % users towards higher-priced hotel rooms than PC users.
       
   106 
       
   107 
       
   108 %%% government will overwrite your wishes if it is annoymous
       
   109 %% https://www.lightbluetouchpaper.org/2016/12/05/government-u-turn-on-health-privacy/
       
   110 
       
   111 %% corporate surveilance / privacy - report and CC3C talk
       
   112 %%      http://crackedlabs.org/en/networksofcontrol
       
   113 %%      https://media.ccc.de/v/33c3-8414-corporate_surveillance_digital_tracking_big_data_privacy#video&t=2933
       
   114 
       
   115 \section*{Handout 6 (Privacy)}
       
   116 
       
   117 The first motor car was invented around 1886. For ten years,
       
   118 until 1896, the law in the UK (and elsewhere) required a
       
   119 person to walk in front of any moving car waving a red flag.
       
   120 Cars were such a novelty that most people did not know what to
       
   121 make of them. The person with the red flag was intended to
       
   122 warn the public, for example horse owners, about the impending
       
   123 novelty---a car. In my humble opinion, we are at the same
       
   124 stage of development with privacy. Nobody really knows what it
       
    125 is about or what it is good for. It all seems very hazy. There
       
   126 are a few laws (e.g.~cookie law, right-to-be-forgotten law)
       
    127 which address problems with privacy, but even if they are
        
    128 well-intentioned, they either backfire or are already obsolete
       
   129 because of newer technologies. The result is that the world of
       
   130 ``privacy'' looks a little bit like the old Wild
       
   131 West---lawless and mythical.
       
   132 
       
   133 We would have hoped that after Snowden, Western governments
       
    134 would be a bit more sensitive and enlightened about the topic
       
   135 of privacy, but this is far from the truth. Ross Anderson
       
    136 wrote the following in his blog\footnote{\url{https://www.lightbluetouchpaper.org/2016/02/11/report-on-the-ip-bill/}} about the approach taken in
        
    137 the US to the lessons learned from the Snowden leaks,
        
    138 contrasting it with the new snooping bill being considered
        
    139 in the UK parliament:
       
   140 
       
   141 \begin{quote}\it 
       
   142 ``The comparison with the USA is stark. There, all three
       
   143 branches of government realised they'd gone too far after
       
   144 Snowden. President Obama set up the NSA review group, and
       
   145 implemented most of its recommendations by executive order;
       
   146 the judiciary made changes to the procedures of the FISA
       
   147 Court; and Congress failed to renew the data retention
       
   148 provisions in the Patriot Act (aided by the judiciary). Yet
       
   149 here in Britain the response is just to take Henry VIII powers
       
   150 to legalise all the illegal things that GCHQ had been up to,
       
   151 and hope that the European courts won't strike the law down
       
   152 yet again.''
    29 \end{quote}
   153 \end{quote}
    30 
   154 
       
    155 \noindent Unfortunately, big organisations besides
        
    156 governments also seem to take an unenlightened approach to privacy.
       
   157 For example, UCAS, a charity set up to help students with
       
   158 applying to universities in the UK, has a commercial unit that
       
   159 happily sells your email addresses to anybody who forks out
       
    160 enough money for bombarding you with spam. Yes, you can often
        
    161 opt out of such ``schemes'', but in the case of UCAS any
        
    162 opt-out will also limit legitimate emails you might actually be
       
   163 interested in.\footnote{The main objectionable point, in my
       
   164 opinion, is that the \emph{charity} everybody has to use for
       
    165 HE applications actually has very honourable goals
       
   166 (e.g.~assist applicants in gaining access to universities),
       
    167 but the small print (or rather, the link ``About us'') reveals
       
   168 they set up their organisation so that they can also
       
   169 shamelessly sell the email addresses they ``harvest''.
       
   170 Everything is of course very legal\ldots{}ethical?\ldots{}well
       
   171 that is in the eye of the beholder. See:
       
   172 
       
   173 \url{http://www.ucas.com/about-us/inside-ucas/advertising-opportunities} 
       
   174 or
       
   175 \url{http://www.theguardian.com/uk-news/2014/mar/12/ucas-sells-marketing-access-student-data-advertisers}}
       
   176 
       
    177 Another example: Verizon, an ISP that is supposed to provide
        
    178 you with mere connectivity, has found a ``nice'' side-business
       
   179 too: When you have enabled all privacy guards in your browser
       
   180 (the few you have at your disposal), Verizon happily adds a
       
   181 kind of cookie to your
       
   182 HTTP-requests.\footnote{\url{http://webpolicy.org/2014/10/24/how-verizons-advertising-header-works/}}
       
   183 As shown in the picture below, this cookie will be sent to
       
    184 every web-site you visit. The web-sites can then forward the
       
   185 cookie to advertisers who in turn pay Verizon to tell them
       
   186 everything they want to know about the person who just made
       
   187 this request, that is you.
       
   188  
       
   189 \begin{center}
       
   190 \includegraphics[scale=0.16]{../pics/verizon.png}
       
   191 \end{center}
       
   192 
       
   193 \noindent How disgusting! Even worse, Verizon is not known for
       
    194 being the cheapest ISP on the planet (quite the
       
   195 contrary), and also not known for providing the fastest
       
   196 possible speeds, but rather for being among the few ISPs in
       
   197 the US with a quasi-monopolistic ``market distribution''.
       
   198 
       
   199 
       
    200 Well, we could go on and on\ldots{}and we have not even
        
    201 started with all the naughty things NSA \& Friends are
       
   202 up to. Why does privacy actually matter? Nobody, I think, has
       
   203 a conclusive answer to this question yet. Maybe the following
       
   204 four notions help with clarifying the overall picture
       
   205 somewhat: 
       
   206 
       
   207 \begin{itemize}
       
   208 \item \textbf{Secrecy} is the mechanism used to limit the
       
   209       number of principals with access to information (e.g.,
       
    210       cryptography or access controls). For example, I had better
       
   211       keep my password secret, otherwise people from the wrong
       
   212       side of the law might impersonate me.
       
   213 
       
   214 \item \textbf{Confidentiality} is the obligation to protect
       
   215       the secrets of other people or organisations (secrecy
       
   216       for the benefit of an organisation). For example as a
       
   217       staff member at King's I have access to data, even
       
    218       private data, that I am allowed to use in my work but not
       
   219       allowed to disclose to anyone else.
       
   220 
       
   221 \item \textbf{Anonymity} is the ability to leave no evidence of
       
    222       an activity (e.g., sharing a secret). This is not the same
        
    223         as privacy---anonymity is required in many 
       
   224         circumstances, for example for whistle-blowers, 
       
   225         voting, exam marking and so on.
       
   226 
       
   227 \item \textbf{Privacy} is the ability or right to protect your
       
   228       personal secrets (secrecy for the benefit of an
       
   229       individual). For example, in a job interview, I might
       
   230       not like to disclose that I am pregnant, if I were a
       
    231       woman, or that I am a father, lest they might not hire
       
   232       me. Similarly, I might not like to disclose my location
       
   233       data, because thieves might break into my house if they
       
   234       know I am away at work. Privacy is essentially
       
   235       everything which ``shouldn't be anybody's business''.
       
   236 
       
   237 \end{itemize}
       
   238 
       
   239 \noindent While this might provide us with some rough
       
    240 definitions, the problem with privacy is that there is an
        
    241 extremely fine line between what should stay private and what
       
   242 not. For example, since I am working in academia, I am every
       
   243 so often very happy to be a digital exhibitionist: I am very
       
   244 happy to disclose all `trivia' related to my work on my
       
   245 personal web-page. This is a kind of bragging that is normal
       
   246 in academia (at least in the field of CS), even expected if
       
   247 you look for a job. I am even happy that Google maintains a
       
   248 profile about all my academic papers and their citations. 
       
   249 
       
    250 On the other hand I would be very irritated if anybody I do
        
    251 not know took too close a look at my private life---it
       
   252 shouldn't be anybody's business. The reason is that knowledge
       
   253 about my private life can often be used against me. As mentioned
       
   254 above, public location data might mean I get robbed. If
       
   255 supermarkets build a profile of my shopping habits, they will
       
   256 use it to \emph{their} advantage---surely not to \emph{my}
       
   257 advantage. Also whatever might be collected about my life will
       
   258 always be an incomplete, or even misleading, picture. For
       
   259 example I am pretty sure my creditworthiness score was
       
   260 temporarily(?) destroyed by not having a regular income in
       
   261 this country (before coming to King's I worked in Munich for
       
   262 five years). To correct such incomplete or flawed credit
       
    263 history data there is now a law that allows you
       
   264 to check what information is held about you for determining
       
   265 your creditworthiness. But this concerns only a very small
       
    266 part of the data that is held about me/you. And
        
    267 what about cases where data is wrong or outdated? (Do we
        
    268 need a right to be forgotten?)
       
   269 
       
   270 
       
    271 To see how private matters can really lead to wrong
        
    272 conclusions, take the example of Stephen Hawking: When he was
        
    273 diagnosed with his disease, he was given a life expectancy of
        
    274 two years. If employers had known about such problems, would
        
    275 they have employed Hawking? Yet he has since celebrated his 70th
        
    276 birthday. Clearly personal medical data needs to stay private.
       
   277 
       
   278 To cut a long story short, I let you ponder about the two
       
   279 statements which are often voiced in discussions about privacy:
       
   280 
       
   281 \begin{itemize}
       
   282 \item \textit{``You have zero privacy anyway. Get over 
       
   283 it.''}\\
       
    284 \mbox{}\hfill{}{\small{}(by Scott McNealy, former CEO of Sun)}
       
   285 
       
   286 \item \textit{``If you have nothing to hide, you have nothing 
       
   287 to fear.''}
       
   288 \end{itemize}
       
   289  
       
    290 \noindent If you would like to watch a movie that has this topic as
       
   291 its main focus I recommend \emph{Gattaca} from
       
   292 1997.\footnote{\url{http://www.imdb.com/title/tt0119177/}} If
       
   293 you want to read up on this topic, I can recommend the
       
   294 following article that appeared in 2011 in the Chronicle of
       
   295 Higher Education:
       
   296 
       
   297 \begin{center} 
       
   298 \url{http://chronicle.com/article/Why-Privacy-Matters-Even-if/127461/} 
       
   299 \end{center} 
       
   300 
       
   301 \noindent Funnily, or maybe not so funnily, the author of this
       
    302 article carefully tries to construct an argument that not
        
    303 only attacks the nothing-to-hide statement in cases where
       
   304 governments \& co collect people's deepest secrets, or
       
   305 pictures of people's naked bodies, but an argument that
       
    306 also applies in cases where governments ``only'' collect data
       
   307 relevant to, say, preventing terrorism. The fun is of course
       
    308 that in 2011 we simply could not imagine that respected
       
   309 governments would do such infantile things as intercepting
       
   310 people's nude photos. Well, since Snowden we know some people
       
   311 at the NSA did exactly that and then shared such photos among
       
   312 colleagues as ``fringe benefit''.  
       
   313 
       
   314 
       
   315 \subsubsection*{Re-Identification Attacks} 
       
   316 
       
   317 Apart from philosophical musings, there are fortunately also
       
   318 some real technical problems with privacy. The problem I want
       
   319 to focus on in this handout is how to safely disclose datasets
       
   320 containing potentially very private data, say health records.
       
   321 What can go wrong with such disclosures can be illustrated
       
   322 with four well-known examples:
       
   323 
       
   324 \begin{itemize}
       
    325 \item In 2006, a then young company called Netflix offered a
        
    326       1 million dollar prize to anybody who could improve their movie
       
   327       rating algorithm. For this they disclosed a dataset
       
   328       containing 10\% of all Netflix users at the time
       
    329       (approx.~500K). They removed names, but included numerical
       
   330       ratings of movies as well as times when ratings were
       
    331       uploaded. Some information was, however, perturbed (i.e.,
        
    332       slightly modified).
       
   333       
       
   334       Two researchers had a closer look at this anonymised
       
   335       data and compared it with public data available from the
       
   336       International Movie Database (IMDb). They found that
       
   337       98\% of the entries could be re-identified in the
       
   338       Netflix dataset: either by their ratings or by the dates
       
   339       the ratings were uploaded. The result was a class-action
       
   340       suit against Netflix, which was only recently resolved
       
   341       involving a lot of money.
       
   342 
       
    343 \item In the 1990s, medical datasets were often made public
       
   344       for research purposes. This was done in anonymised form
       
   345       with names removed, but birth dates, gender and ZIP-code
       
   346       were retained. In one case where such data about
       
   347       hospital visits of state employees in Massachusetts was
       
   348       made public, the then governor assured the public that
       
   349       the released dataset protected patient privacy by
       
   350       deleting identifiers. 
       
   351       
       
   352       A graduate student could not resist cross-referencing
       
   353       public voter data with the released data that still
       
   354       included birth dates, gender and ZIP-code. The result
       
   355       was that she could send the governor his own hospital
       
   356       record. It turns out that birth dates, gender and
       
   357       ZIP-code uniquely identify 87\% of people in the US.
       
   358       This work resulted in a number of laws prescribing which
       
   359       private data cannot be released in such datasets.
       
   360  
       
   361 \item In 2006, AOL published 20 million Web search queries
       
   362       collected from 650,000 users (names had been deleted).
       
   363       This was again done for research purposes. However,
       
   364       within days an old lady, Thelma Arnold, from Lilburn,
       
   365       Georgia, (11,596 inhabitants) was identified as user
       
   366       No.~4417749 in this dataset. It turned out that search
       
   367       engine queries are deep windows into people's private
       
   368       lives. 
       
   369   
       
   370 \item Genome-Wide Association Studies (GWAS) was a public
       
   371       database of gene-frequency studies linked to diseases.
       
   372       It would essentially record that people who have a
       
    373       disease, say diabetes, also have certain genes. In order
       
   374       to maintain privacy, the dataset would only include
       
    375       aggregate information. In the case of DNA data this
       
   376       aggregation was achieved by mixing the DNA of many
       
   377       individuals (having a disease) into a single solution.
       
   378       Then this mixture was sequenced and included in the
       
   379       dataset. The idea was that the aggregate information
       
   380       would still be helpful to researchers, but would protect
       
   381       the DNA data of individuals. 
       
   382        
       
   383       In 2007 a forensic computer scientist showed that
       
   384       individuals can still be identified. For this he used
       
   385       the DNA data from a comparison group (people from the
       
   386       general public) and ``subtracted'' this data from the
       
   387       published data. He was left with data that included all
       
   388       ``special'' DNA-markers of the individuals present in
       
   389       the original mixture. He essentially deleted the
       
   390       ``background noise'' in the published data. The problem
       
   391       with DNA data is that it is of such a high resolution
       
   392       that even if the mixture contained maybe 100
       
    393       individuals, you can, with current technology, detect
       
   394       whether an individual was included in the mixture or
       
   395       not.
       
   396       
       
   397       This result changed completely how DNA data is nowadays
       
   398       published for research purposes. After the success of 
       
   399       the human-genome project with a very open culture of
       
   400       exchanging data, it became much more difficult to 
       
    401       anonymise data so that patients' privacy is preserved.
       
   402       The public GWAS database was taken offline in 2008.
       
   403       
       
   404 \end{itemize}
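The Massachusetts case above is, at its core, just a database join. The sketch below shows what such a linkage attack looks like in a few lines of Python; all names, birth dates, ZIP-codes and diagnoses are invented for illustration, not taken from any real dataset.

```python
# A sketch of a linkage (re-identification) attack: join an "anonymised"
# medical dataset with a public voter roll on the quasi-identifiers
# birth date, gender and ZIP-code. All records below are invented.

medical = [  # names removed, quasi-identifiers retained
    {"dob": "1945-07-31", "sex": "M", "zip": "02138", "diagnosis": "heart disease"},
    {"dob": "1962-01-10", "sex": "F", "zip": "02139", "diagnosis": "diabetes"},
]

voters = [  # public record: names *and* the same quasi-identifiers
    {"name": "John Smith", "dob": "1945-07-31", "sex": "M", "zip": "02138"},
    {"name": "Jane Doe",   "dob": "1962-01-10", "sex": "F", "zip": "02139"},
]

def reidentify(medical, voters):
    """Link the two datasets on the triple (dob, sex, zip)."""
    index = {(v["dob"], v["sex"], v["zip"]): v["name"] for v in voters}
    hits = []
    for m in medical:
        key = (m["dob"], m["sex"], m["zip"])
        if key in index:  # unique for ~87% of the US population
            hits.append((index[key], m["diagnosis"]))
    return hits

print(reidentify(medical, voters))
# prints [('John Smith', 'heart disease'), ('Jane Doe', 'diabetes')]
```

The join succeeds precisely because, as noted above, the triple of birth date, gender and ZIP-code is unique for the vast majority of people; deleting names alone does not help.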
       
   405 
       
   406 \noindent There are many lessons that can be learned from
       
   407 these examples. One is that when making datasets public in
       
   408 anonymised form, you want to achieve \emph{forward privacy}.
       
    409 This means that no matter what other data is available
        
    410 or will be released later, the data in the original dataset
       
   411 does not compromise an individual's privacy. This principle
       
   412 was violated by the availability of ``outside data'' in the
       
   413 Netflix and governor of Massachusetts cases. The additional
       
   414 data permitted a re-identification of individuals in the
       
    415 dataset. In the case of GWAS a new technique of re-identification
       
   416 compromised the privacy of people in the dataset. The case of
       
   417 the AOL dataset shows clearly how incomplete such data can be:
       
    418 Although the queries uniquely identified the old lady, she
       
   419 also looked up diseases that her friends had, which had
       
   420 nothing to do with her. Any rational analysis of her query
       
    421 data must therefore have concluded that the lady was on her
        
    422 deathbed, while she was actually very much alive and kicking.
       
   423 
       
   424 In 2016, Yahoo released the so far largest machine learning
       
   425 dataset to the research community. It includes approximately
       
    426 13.5 TByte of data representing around 100 billion events from
        
    427 anonymised user-news item interactions, collected by recording
       
   428 interactions of about 20M users from February 2015 to May
       
   429 2015. Yahoo's gracious goal is to promote independent research
       
   430 in the fields of large-scale machine learning and recommender
       
   431 systems. It remains to be seen whether this data will really
       
   432 only be used for that purpose.
       
   433 
       
   434 \subsubsection*{Differential Privacy}
       
   435 
       
   436 Differential privacy is one of the few methods that tries to
       
   437 achieve forward privacy. The basic idea is to add appropriate
       
   438 noise, or errors, to any query of the dataset. The intention
       
   439 is to make the result of a query insensitive to individual
       
   440 entries in the database. That means the results are
       
    441 approximately the same no matter whether a particular individual is
       
   442 in the dataset or not. The hope is that the added error does
       
   443 not eliminate the ``signal'' one is looking for in the
       
   444 dataset.
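The query-plus-noise idea can be made concrete with the standard Laplace mechanism for a counting query (a textbook construction, not something specific to this handout): adding or removing one individual changes a count by at most 1, so Laplace noise of scale $1/\varepsilon$ suffices. The toy dataset and the choice $\varepsilon = 0.5$ below are invented for illustration.

```python
import math
import random

def laplace_noise(scale):
    # inverse-transform sample from a Laplace(0, scale) distribution
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    """Answer a counting query under differential privacy. A counting
    query has sensitivity 1 (one individual changes the result by at
    most 1), so Laplace noise of scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# invented toy dataset: 1000 records, 143 of them with diabetes
records = [{"diabetes": i % 7 == 0} for i in range(1000)]
answer = private_count(records, lambda r: r["diabetes"], epsilon=0.5)
# `answer` is close to the true count of 143 but randomised, so it
# barely depends on whether any single individual is in the dataset
```

Smaller values of $\varepsilon$ give stronger privacy but noisier answers; the hope, as said above, is that the added error does not drown out the ``signal'' in the dataset.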
       
   445 
       
   446 %\begin{center}
       
   447 %User\;\;\;\;    
       
   448 %\begin{tabular}{c}
       
   449 %tell me $f(x)$ $\Rightarrow$\\
       
   450 %$\Leftarrow$ $f(x) + \text{noise}$
       
   451 %\end{tabular}
       
   452 %\;\;\;\;\begin{tabular}{@{}c}
       
   453 %Database\\
       
   454 %$x_1, \ldots, x_n$
       
   455 %\end{tabular}
       
   456 %\end{center}
       
   457 %
       
   458 %\begin{center}
       
   459 %\begin{tabular}{l|l}
       
   460 %Staff & Salary\\\hline
       
   461 %$PM$ & \pounds{107}\\
       
   462 %$PF$ & \pounds{102}\\
       
   463 %$LM_1$ & \pounds{101}\\
       
   464 %$LF_2$ & \pounds{97}\\
       
   465 %$LM_3$ & \pounds{100}\\
       
   466 %$LM_4$ & \pounds{99}\\
       
   467 %$LF_5$ & \pounds{98}
       
   468 %\end{tabular}
       
   469 %\end{center}
       
   470 %
       
   471 %
       
   472 %\begin{center}
       
   473 %\begin{tikzpicture} 
       
   474 %\begin{axis}[symbolic y coords={salary},
       
   475 %             ytick=data,
       
   476 %             height=3cm]
       
   477 %\addplot+[jump mark mid] coordinates
       
   478 %{(0,salary)   (0.1,salary) 
       
   479 % (0.4,salary) (0.5,salary)  
       
   480 % (0.8,salary) (0.9,salary)};
       
   481 %\end{axis}
       
   482 %\end{tikzpicture}
       
   483 %\end{center}
       
   484 %
       
   485 %\begin{tikzpicture}[outline/.style={draw=#1,fill=#1!20}]
       
   486 %  \node [outline=red]            {red box};
       
   487 %  \node [outline=blue] at (0,-1) {blue box};
       
   488 %\end{tikzpicture}
       
   489 
       
   490 \ldots
       
   491 
       
   492 
       
   493 \subsubsection*{Further Reading}
       
   494 
       
    495 Two cool articles: one about how somebody obtained the New York
        
    496 taxicab dataset via the Freedom of Information Law, and another
        
    497 showing how easy it is to mine it for private information:
       
   498 
       
   499 \begin{center}\small
       
   500 \begin{tabular}{p{0.78\textwidth}}
       
   501 \url{http://chriswhong.com/open-data/foil_nyc_taxi/}\smallskip\\
       
   502 \url{http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset}
       
   503 \end{tabular}
       
   504 \end{center}
       
   505 
       
   506 \noindent 
       
A readable article about how supermarkets mine your shopping
habits (especially how they prey on exhausted new parents
;o) appeared in 2012 in the New York Times:
       
   510 
       
   511 \begin{center}
       
   512 \url{http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html}
       
   513 \end{center}
       
   514 
       
   515 \noindent An article that analyses privacy and shopping habits 
       
   516 from a more economic point of view is available from:
       
   517 
       
   518 \begin{center}
       
   519 \url{http://www.dtc.umn.edu/~odlyzko/doc/privacy.economics.pdf}
       
   520 \end{center}
       
   521 
       
   522 \noindent An attempt to untangle the web of current technology
       
   523 for spying on consumers is published in:
       
   524 
       
   525 \begin{center}
       
   526 \url{http://cyberlaw.stanford.edu/files/publication/files/trackingsurvey12.pdf}
       
   527 \end{center}
       
   528 
       
   529 \noindent An article that sheds light on the paradox that
       
   530 people usually worry about privacy invasions of little
       
   531 significance, and overlook the privacy invasion that might
       
   532 cause significant damage:
       
   533 
       
   534 \begin{center}
       
   535 \url{http://www.heinz.cmu.edu/~acquisti/papers/Acquisti-Grossklags-Chapter-Etrics.pdf}
       
   536 \end{center}
       
   537 
       
   538 
       
\noindent An interesting idea for fighting back against ad
tracking is the following tool:
       
   540 
       
   541 \begin{center}
       
   542 \url{https://adnauseam.io}
       
   543 \end{center}
       
   544 
\noindent
And a paper that predicts that ad-blockers will in the end win
over anti-ad-blocking:

\begin{center}
\url{http://randomwalker.info/publications/ad-blocking-framework-techniques.pdf}
\end{center}

\noindent
Take the first non-small word\footnote{Let's say the, a, an,
or, and \ldots{} fall into the category of small words.} in this
definition, in this case \textit{individual}, and look up the
definition of this word, say

\begin{quote}
``a single \textit{human} being as distinct from a group''
       
    39 \end{quote}
       
    40 
       
    41 \noindent 
       
    42 In this definition take the second non-small word, that
       
    43 is \textit{human}, and again look up the definition of this 
       
    44 word. This will yield
       
    45 
       
    46 \begin{quote}
       
    47 ``relating to or \textit{characteristic} of humankind''
       
    48 \end{quote}
       
    49 
       
    50 \noindent 
       
You could go on looking up the definition of the third
non-small word in this definition, and so on. But let us assume
you agreed with Bob to stop after three iterations with the
word reached in the last definition, that is
\textit{characteristic}. Now, instead of sending Bob the
solution \textit{folio}, you send him \textit{characteristic}.
       
    57 
       
    58 How can Bob verify that you know the solution? Well, once he
       
    59 solved it himself, he can use the dictionary and follow the
       
    60 same ``trail'' as you did. If the final word agrees with what
       
you had sent him, he must infer that you knew the solution
earlier than him. This protocol works like a one-way hash
function because \textit{characteristic} does not give any hint
as to what the first word was. I leave it to you to think about
why this protocol avoids small words.
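The trail can be sketched in a few lines of Python (the mini
dictionary entries and the agreed set of small words below are
made up for illustration; in reality you and Bob would fix a
dictionary and the small words beforehand):

```python
# Sketch of the dictionary trail (mini dictionary and small-word
# set are made up for illustration).
definitions = {
    "folio": "an individual leaf of paper or parchment",
    "individual": "a single human being as distinct from a group",
    "human": "relating to or characteristic of humankind",
}
SMALL = {"a", "an", "the", "and", "to", "of", "as", "from"}

def trail(word, steps):
    # in the i-th definition take the i-th non-small word
    for i in range(1, steps + 1):
        non_small = [w for w in definitions[word].split()
                     if w not in SMALL]
        word = non_small[i - 1]
    return word
```

Bob runs \texttt{trail} on his own solution and compares the
final word with the one you sent; like a hash value, the final
word gives (almost) no hint about the start word.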
       
    66 
       
    67 After Bob found his solution and verified that according to
       
    68 the protocol it ``maps'' also to \textit{characteristic}, can
       
    69 he be entirely sure no cheating is going on? Not with 100\%
       
certainty. It is possible that he was given
\textit{characteristic} as the word, but that it derived from a
different word. This might seem very unlikely, but at least
theoretically it is a possibility. Protocols based on
       
    74 zero-knowledge proofs will produce a similar result---they
       
    75 give an answer that might be erroneous in a very small number
       
    76 of cases. The point is to iterate them long enough so that the
       
    77 theoretical possibility of cheating is negligibly small. 
       
    78 
       
By the way, the authors of the paper ``Dismantling Megamos
Crypto: Wirelessly Lockpicking a Vehicle Immobilizer'', who
were barred from publishing their results, also used a hash to
prove they did the work and (presumably) managed to get into
cars without a key; see Figure~\ref{paper}. This is very
similar to the method above with the crossword: they want to
prove that they did the work, but without giving out the
``solution''. But this also shows what the problem with such a
method is: yes, we can hide the secret temporarily, but if
       
    88 somebody else wants to verify it, then the secret has to be
       
    89 made public. Bob needs to know that \textit{folio} is the
       
    90 solution before he can verify the claim of Alice that she had
       
    91 the solution first. Similarly with the car-crypto paper: we
       
    92 needed to wait until September 2015 when the authors were
       
    93 finally able to publish their findings in order to verify the
       
    94 hash. Zero-knowledge proofs, in contrast, can be immediately
       
    95 checked, even if the secret is not public yet and perhaps
       
    96 never will be.
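The hash trick can be sketched as a simple commitment (the
secret string below is of course made up):

```python
import hashlib

# Publish the digest of the secret now; reveal the secret later.
secret = b"how to start cars of type X without a key"
commitment = hashlib.sha256(secret).hexdigest()  # published immediately

def verify(revealed, commitment):
    # at reveal time anybody can re-hash and compare digests
    return hashlib.sha256(revealed).hexdigest() == commitment
```

Until the reveal, the digest convinces nobody of anything;
verification only becomes possible once the secret is public,
which is exactly the limitation described above.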
       
    97 
       
    98 \begin{figure}
       
    99 \begin{center}
       
   100 \addtolength{\fboxsep}{4mm}
       
   101 \fbox{\includegraphics[scale=0.4]{../pics/Dismantling_Megamos_Crypto.png}}
       
   102 \end{center}
       
\caption{The authors of this paper used a hash in order to prove
later that they had managed to break into cars.\label{paper}}
       
   105 \end{figure}
       
   106 
       
   107 
       
   108 \subsubsection*{ZKP: An Illustrative Example}
       
   109 
       
   110 The idea behind zero-knowledge proofs is not very obvious and
       
   111 will surely take some time for you to digest. Therefore let us
       
start with a simple illustrative example. The example will not
be perfect, but hopefully explains the gist of the idea. The
       
   114 example is called Alibaba's cave, which graphically looks as 
       
   115 follows:\footnote{The example is taken from an 
       
   116 article titled ``How to Explain Zero-Knowledge Protocols
       
   117 to Your Children'' available from 
       
   118 \url{http://pages.cs.wisc.edu/~mkowalcz/628.pdf}.}
       
   119 
       
   120 \begin{center}
       
   121 \begin{tabular}{c@{\hspace{8mm}}c@{\hspace{8mm}}c}
       
   122 \includegraphics[scale=0.1]{../pics/alibaba1.png} &
       
   123 \includegraphics[scale=0.1]{../pics/alibaba2.png} &
       
   124 \includegraphics[scale=0.1]{../pics/alibaba3.png} \\
       
   125 Step 1 & Step 2 & Step 3
       
   126 \end{tabular}
       
   127 \end{center}
       
   128 
       
   129 \noindent Let us take a closer look at the picture in Step 1:
       
   130 The cave has a tunnel which forks at some point. Both forks
       
   131 are connected in a loop. At the deep end of the loop is a
       
   132 magic wand. The point of the magic wand is that Alice knows
       
   133 the secret word for how to open it. She wants to keep the word
       
   134 secret, but wants to convince Bob that she knows it.
       
   135 
       
   136 One way of course would be to let Bob follow her, but then he
       
   137 would also find out the secret. So this does not work. To find
       
   138 a solution, let us first fix the rules: At the beginning Alice
       
   139 and Bob are outside the cave. Alice goes in alone and takes
       
   140 either tunnel labelled $A$ in the picture, or the other tunnel
       
labelled $B$. She waits at the magic wand for the instructions
from Bob, who also goes into the cave and observes what
       
   143 happens at the fork. He has no knowledge which tunnel Alice
       
   144 took and calls out (in Step 2) that she should emerge from tunnel
       
   145 $A$, for example. If she knows the secret for opening the
       
   146 wand, this will not be a problem for Alice. If she was already
       
   147 in the $A$-segment of the tunnel, then she just comes back. If
       
she was in the $B$-segment of the tunnel, she will say the magic
word and go through the wand to emerge from $A$ as requested
       
   150 by Bob.
       
   151 
       
Let us have a look at the case where Alice cheats, that is, does
not know the secret. She would still go into the cave and use,
for example, the $B$-segment of the tunnel. If now Bob says she
should emerge from $B$, she is lucky. But if he says she should
emerge from $A$, then Alice is in trouble: Bob will find out
that she does not actually know the secret. So in order to fool
Bob she needs to anticipate his call and already go into the
corresponding tunnel. This of course also does not work, since
Bob can make any call he wants. Consequently, in order to find
out whether Alice cheats, Bob just needs to repeat this
protocol many times. Each time Alice has a chance of
$\frac{1}{2}$ of being lucky or of being found out. Iterating
this $n$ times means that, when cheating, she must be lucky
every time: the probability for this is $\frac{1}{2^n}$, a
number that already for relatively small $n$, say 10, is
incredibly small.
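A quick way to convince yourself of the $\frac{1}{2^n}$ claim
is to simulate a cheating Alice (a small sketch; tunnel names
$A$/$B$ as in the pictures, function names are mine):

```python
import random

# A cheating Alice picks a tunnel at random; she survives one
# round only if Bob happens to call exactly that tunnel.
def cheating_alice_survives(rounds):
    return all(random.choice("AB") == random.choice("AB")
               for _ in range(rounds))

def estimate(rounds, trials=100_000):
    # fraction of simulated protocol runs in which the cheating
    # goes unnoticed; this approaches 1/2^rounds
    wins = sum(cheating_alice_survives(rounds) for _ in range(trials))
    return wins / trials
```

Already for 10 rounds the estimated success rate of a cheating
Alice drops below one in a thousand.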
       
   168 
       
   169 
       
   170 There are some interesting observations we can make about 
       
   171 Alibaba's cave and the ZKP protocol between Alice and Bob:
       
   172 
       
   173 \begin{itemize}
       
\item {\bf Completeness} If Alice knows the secret, Bob
      accepts Alice's ``proof'' for sure. There is no
      possibility of Bob thinking that Alice cheats when she
      actually knows the secret.
\item {\bf Soundness} If Alice does not know the secret,
      Bob accepts her ``proof'' only with a very small
      probability. If, as in the example above, the probability
      of being able to hide the cheating is $\frac{1}{2}$ in
      each round, it will be $\frac{1}{2^n}$ after $n$ rounds,
      which even for relatively small $n$, say $n \geq 10$, is
      very small indeed.
       
   184 \item {\bf Zero-Knowledge} Even if Bob accepts
       
   185       the proof by Alice, he cannot convince anybody else.
       
   186 \end{itemize} 
       
   187 
       
   188 \noindent The last property is the most interesting one.
       
   189 Assume Alice has convinced Bob that she knows the secret and
       
   190 Bob filmed the whole protocol with a camera. Can he use the
       
   191 video to convince anybody else? After a moment of thought, you
       
   192 will agree that this is not the case. Alice and Bob might have
       
   193 just made it all up and colluded by Bob telling Alice
       
   194 beforehand which tunnel he will call. In this way it appears
       
   195 as if all iterations of the protocol were successful, but they
       
   196 prove nothing. If another person wants to find out whether
       
   197 Alice knows the secret, he or she would have to conduct the
       
   198 protocol again. This is actually the formal definition of a
       
   199 zero-knowledge proof: an independent observer cannot
       
   200 distinguish between a real protocol (where Alice knows the
       
   201 secret) and a fake one (where Bob and Alice colluded).
       
   202 
       
   203 
       
\subsubsection*{Using a Graph-Isomorphism Problem for ZKPs}
       
   205 
       
   206 Now the question is how can we translate Alibaba's cave into a
       
   207 computer science solution? It turns out we need an NP problem
       
   208 for that. The main feature of an NP problem is that it is
       
computationally very hard to generate a solution, but it is very
       
   210 easy to check whether a given solution indeed solves the
       
   211 problem at hand.\footnote{The question whether $P = NP$ or not,
       
   212 we leave to others to speculate about.} 
       
   213 
       
   214 One NP problem is the \emph{graph isomorphism problem}. It
       
   215 essentially determines whether the following two graphs, say
       
   216 $G_1$ and $G_2$, can be moved and stretched so that they look
       
   217 exactly the same.
       
   218 
       
   219 \begin{center}
       
   220 \begin{tabular}{c@{\hspace{8mm}}c@{\hspace{8mm}}c}
       
   221 $G_1$ & $G_2$\\
       
   222 \raisebox{-13mm}{\includegraphics[scale=0.4]{../pics/simple.png}} &
       
   223 \raisebox{-13mm}{\includegraphics[scale=0.4]{../pics/simple-b.png}}&
       
   224 
       
   225 \footnotesize
       
   226 \begin{tabular}{rl}
       
   227 Graph $G_1$	& Graph $G_2$\\
       
   228 a  & 1\\
       
   229 b  & 6\\
       
   230 c  & 8\\
       
   231 d  & 3\\
       
   232 g  & 5\\
       
   233 h  & 2\\
       
   234 i  & 4\\
       
   235 j  & 7\\
       
   236 \end{tabular}
       
   237 \end{tabular}
       
   238 \end{center}
       
   239 
       
   240 \noindent The table on the right gives a mapping of the nodes
       
   241 of the first graph to the nodes of the second. With this
       
   242 mapping we can check: node $a$ is connected in $G_1$ with $g$,
       
   243 $h$ and $i$. Node $a$ maps to node $1$ in $G_2$, which is
       
   244 connected to $2$, $4$ and $5$, which again correspond via the
       
   245 mapping to $h$, $i$ and $g$ respectively. Let us write
       
   246 $\sigma$ for such a table and let us write
       
   247 
       
   248 \[G_1 = \sigma(G_2)\]
       
   249 
       
   250 \noindent for two isomorphic graphs ($\sigma$ being the
       
isomorphism). It is actually very easy to construct two
isomorphic graphs: start with an arbitrary graph and re-label
its nodes consistently. Alice will need to do this frequently
for the protocol below. In order to be useful, however, we
would need to consider substantially larger graphs, say with a
thousand nodes.
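Constructing and checking isomorphic graphs is easy to sketch
in Python (graphs as sets of edges; the example graph and the
permutation $\sigma$ below are made up):

```python
# A graph is a set of edges (frozensets of two nodes); an
# isomorphic copy is obtained by re-labelling the nodes with a
# permutation sigma (a dict node -> node).
def relabel(edges, sigma):
    return {frozenset((sigma[a], sigma[b])) for a, b in map(tuple, edges)}

def is_isomorphism(G, H, sigma):
    # sigma is an isomorphism iff it maps G's edge set exactly onto H's
    return relabel(G, sigma) == H

G1 = {frozenset(e) for e in [(0, 1), (1, 2), (2, 0), (2, 3)]}
sigma = {0: 2, 1: 3, 2: 0, 3: 1}   # a consistent re-labelling
G2 = relabel(G1, sigma)
```

Checking a given $\sigma$ is a single set comparison; finding
one for two large unlabelled graphs is the hard direction.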
       
   257 
       
   258 Now the secret which Alice tries to hide is the knowledge of
       
such an isomorphism $\sigma$ between two such graphs. But she
can convince Bob that she knows such an isomorphism for two
graphs, say $G_1$ and $G_2$, using the same idea as in the
example with Alibaba's cave. For this Alice and Bob follow the
protocol below:
       
   264 
       
   265 \begin{enumerate}
       
   266 \item Alice generates an isomorphic graph $H$ which she sends 
       
   267       to Bob (in each iteration she needs to generate a 
       
   268       different $H$). 
       
\item After receiving $H$, Bob asks Alice either for an
      isomorphism between $G_1$ and $H$, or between $G_2$ and $H$.
\item Alice sends the requested isomorphism.
\item Alice and Bob repeat this procedure $n$ times.
       
   273 \end{enumerate}
       
   274 
       
\noindent In Step 1 it is important that Alice always
generates a fresh isomorphic graph. I let you think about what
would happen if Alice sent out the same graph $H$ twice.
       
   278 
       
   279 As said before, this is relatively easy to generate by
       
   280 consistently relabelling nodes. If she started from $G_1$,
       
   281 Alice will have generated
       
   282 
       
   283 \begin{equation}
       
   284 H = \sigma'(G_1)\label{hiso}
       
   285 \end{equation}
       
   286  
       
   287 \noindent where $\sigma'$ is the isomorphism between $H$ and
       
   288 $G_1$. But she could equally have started from $G_2$. In the 
       
   289 case where $G_1$ and $G_2$ are isomorphic, if $H$ is 
       
   290 isomorphic with $G_1$, it will also be isomorphic with 
       
   291 $G_2$, and vice versa. 
       
   292 
       
   293 After generating $H$, Alice sends it to Bob. The important
       
point is that she needs to ``commit'' to this $H$; therefore
this kind of zero-knowledge protocol is called a
\emph{commitment protocol}. Only after receiving $H$, Bob
       
   297 will make up his mind about which isomorphism he asks
       
   298 for---whether between $H$ and $G_1$ or $H$ and $G_2$. For this
       
   299 he could flip a coin, since the choice should be as
       
   300 unpredictable for Alice as possible. Once Alice receives the
       
   301 request, she has to produce an isomorphism. If she generated
       
   302 $H$ as shown in \eqref{hiso} and is asked for an isomorphism
       
   303 between $H$ and $G_1$, she just sends $\sigma'$. If she had
       
   304 been asked for an isomorphism between $H$ and $G_2$, she just
       
   305 has to compose her secret isomorphism $\sigma$ and $\sigma'$.
       
The main point of the protocol is that even knowing an
isomorphism between $H$ and $G_1$, or between $H$ and $G_2$,
does not make it any easier to find the isomorphism between
$G_1$ and $G_2$, which is the secret Alice tries to protect.
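One round of this exchange can be sketched as follows (graphs
as edge sets, permutations as dictionaries; all function names
are mine, and Alice's secret is a $\sigma$ with
$G_1 = \sigma(G_2)$):

```python
import random

def apply_perm(perm, edges):
    return {frozenset((perm[a], perm[b])) for a, b in map(tuple, edges)}

def compose(p, q):
    # (p o q)(x) = p(q(x))
    return {x: p[q[x]] for x in q}

def one_round(G1, G2, sigma):
    nodes = list(sigma.values())           # the nodes of G1
    shuffled = nodes[:]
    random.shuffle(shuffled)
    sigma_p = dict(zip(nodes, shuffled))   # fresh random re-labelling
    H = apply_perm(sigma_p, G1)            # Alice commits to H
    if random.choice([1, 2]) == 1:         # Bob's coin flip
        answer = sigma_p                   # isomorphism G1 <-> H
        return apply_perm(answer, G1) == H # Bob's check
    else:
        answer = compose(sigma_p, sigma)   # isomorphism G2 <-> H
        return apply_perm(answer, G2) == H # Bob's check

G2 = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
sigma = {0: 1, 1: 2, 2: 3, 3: 0}           # Alice's secret
G1 = apply_perm(sigma, G2)
```

An honest Alice passes every round; a cheating Alice can only
answer the one challenge she prepared $H$ for.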
       
   310 
       
   311 In order to make it crystal clear how this protocol proceeds, 
       
   312 let us give a version using our more formal notation for 
       
   313 protocols:
       
   314 
       
   315 \begin{center}
       
   316 \begin{tabular}{lrl}
       
   317 0)  & $A \to B:$ & $G_1$ and $G_2$\\
       
   318 1a) & $A \to B:$ & $H_1$ \\
       
   319 1b) & $B \to A:$ & produce isomorphism $G_1 \leftrightarrow H_1$? 
       
   320  \quad(or $G_2 \leftrightarrow H_1$)\\
       
   321 1c) & $A \to B:$ & requested isomorphism\\
       
   322 2a) & $A \to B:$ & $H_2$\\
       
   323 2b) & $B \to A:$ & produce isomorphism $G_1 \leftrightarrow H_2$? 
       
   324  \quad(or $G_2 \leftrightarrow H_2$)\\
       
   325 2c) & $A \to B:$ & requested isomorphism\\
       
   326     & \ldots\\
       
   327 \end{tabular}
       
   328 \end{center}
       
   329 
       
\noindent As can be seen, the protocol runs for some agreed
number of iterations. The graphs $H_i$ Alice produces need to
be all distinct. I hope by now you know why.
       
   334 
       
   335 It is also crucial that in each iteration, Alice first sends
       
   336 $H_i$ and then Bob can decide which isomorphism he wants:
       
   337 either $G_1 \leftrightarrow H_i$ or $G_2 \leftrightarrow H_i$.
       
If Alice can somehow find out Bob's choice before she commits
to $H_i$, she can cheat. For this assume Alice does \emph{not}
know an isomorphism between $G_1$ and $G_2$. If she knows which
isomorphism Bob will ask for, she can craft $H$ in such a way
that it is isomorphic to either $G_1$ or $G_2$ (but it
cannot be isomorphic to both). Then in each case she would send Bob
       
   344 a correct answer and he would come to the conclusion that
       
   345 all is well. I let you also answer the question whether
       
   346 such a protocol run between Alice and Bob would convince
       
   347 a third person, say Pete.
       
   348  
       
Since the interactive nature of the above ZKP protocol and the
correct ordering of the messages are so important for the
``correctness'' of the protocol, it might look surprising that
the same goal can also be achieved in a completely offline
manner. By this I mean Alice can publish all the data at once, and
       
   354 then at a later time, Bob can inspect the data and come to the
       
   355 conclusion whether or not Alice knows the secret (again
       
   356 without actually learning about the secret). For this
       
   357 Alice has to do the following:
       
   358 
       
   359 \begin{enumerate}
       
   360 \item Alice generates $n$ isomorphic graphs
       
   361       $H_{1..n}$ (they need to be all distinct)
       
\item she feeds the $H_{1..n}$ into a hashing function
      (each graph encoded, for example, as a matrix)
\item she takes the first $n$ bits of the output of the hashing
      function:
      whenever the $i$-th bit is $0$, she shows an isomorphism
      of $H_i$ with $G_1$; if it is $1$, she shows an isomorphism
      of $H_i$ with $G_2$
       
   369 \end{enumerate}
       
   370 
       
   371 \noindent The reason why this works and achieves the same
       
   372 goal as the interactive variant is that Alice has no
       
   373 control over the hashing function. It would be 
       
computationally just too hard to assemble a set of
$H_{1..n}$ such that she could force where the $0$s and $1$s
in the hash value appear, so as to pass an external
test. The point is that Alice can publish all this data
from the comfort of her own web-page, for example, and
in this way convince everybody who bothers to check.
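The offline derivation of the challenge bits can be sketched
like this (the \texttt{repr}-based graph encoding is just for
illustration; any agreed, unambiguous encoding would do):

```python
import hashlib

# The n challenge bits are derived by hashing all commitments at
# once; changing any H_i changes all the bits, so Alice cannot
# steer them.
def challenge_bits(Hs):
    digest = hashlib.sha256(repr(Hs).encode()).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    return [int(b) for b in bits[:len(Hs)]]

# bit 0: Alice reveals the isomorphism H_i <-> G_1,
# bit 1: she reveals the isomorphism H_i <-> G_2
```

Anybody can later recompute the bits from the published
$H_{1..n}$ and check that the revealed isomorphisms match.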
       
   380 
       
The virtue of using graphs and isomorphisms for a
zero-knowledge protocol is that the idea of why it works
is relatively straightforward. The disadvantage is
       
   384 that encoding any secret into a graph-isomorphism, while
       
   385 possible, is awkward. The good news is that in fact
       
   386 any NP problem can be used as part of a ZKP protocol.  
       
   387 
       
   388 
       
   389 \subsubsection*{Using Modular Logarithms for ZKP Protocols}
       
   390 
       
   391 While information can be encoded into graph isomorphisms, it
       
   392 is not the most convenient carrier of information. Clearly it
       
   393 is much easier to encode information into numbers. Let us look
       
   394 at zero-knowledge proofs that use numbers as secrets. For this
       
   395 the underlying NP-problem is to calculate discrete logarithms.
       
   396 It can be used by choosing public numbers $A$, $B$, $p$, and
       
   397 private $x$ such that
       
   398 
       
   399 \begin{center}
       
   400 $A^x \equiv B\; mod\; p$
       
   401 \end{center} 
       
   402 
       
\noindent holds. The secret Alice wants to protect is $x$.
The point of the modular logarithm is that it is very hard
to calculate $x$ from the public data (for a large prime $p$).
Now the protocol proceeds in three stages:
       
   407 
       
   408 \begin{itemize}
       
   409 \item {\bf Commitment Stage}
       
   410   \begin{enumerate}
       
   411     \item Alice generates $z$ random numbers $r_1, \ldots, r_z$, 
       
   412       all less than $p - 1$. Alice then sends Bob for all 
       
   413       $i = 1,\ldots, z$:
       
   414       \[ h_i = A^{r_i}\; mod\; p\]
       
    \item Bob generates $z$ random bits, say $b_1,\ldots, b_z$. He can do this
       by flipping a coin $z$ times, for example.
       
   417     \item For each bit $b_i$, Alice sends Bob an $s_i$ where
       
   418 
       
   419       \begin{center}
       
   420       \begin{tabular}{ll}
       
   421       if $b_i = 0$: & $s_i = r_i$\\
       
   422       if $b_i = 1$: & $s_i = (r_i - r_j) \;mod\; (p -1)$\\
       
   423       \end{tabular}
       
   424       \end{center}
       
   425 
       
   426       where $r_j$ is the lowest $j$ where $b_j = 1$.
       
   427  
       
   428     \end{enumerate}
       
   429  \end{itemize}   
       
   430     
       
   431 \noindent For understanding the last step, let $z$ be just 4.
       
   432 We have four random values $r_i$ chosen by Alice and four
       
   433 random bits $b_i$ chosen subsequently by Bob, for example
       
   434   
       
   435   \begin{center}
       
   436   \begin{tabular}{lcccc}
       
   437   $r_i$:\; & 4 & 9 & 1 & 3\\ 
       
   438   $b_i$:\; & 0 & 1 & 0 & 1\\
       
   439   & & $\uparrow$ \\
       
   440   & & $j$
       
   441   \end{tabular}             
       
   442   \end{center}         
       
   443     
       
   444   \noindent The highlighted column is the lowest where $b_i = 
       
   445   1$ (counted from the left). That means $r_j = 9$. The reason
       
   446   for letting Alice choose the random numbers $r_1, \ldots, r_z$
       
   447   will become clear shortly. Next is the confirmation 
       
   448   phase where Bob essentially checks whether Alice has sent
       
   449   him ``correct'' $s_i$ and $h_i$.
       
   450     
       
   451 \begin{itemize}   
       
   452 \item {\bf Confirmation Stage}
       
   453   \begin{enumerate}
       
  \item For each $b_i$ Bob checks whether the corresponding $s_i$
  conforms to the protocol
       
   456 
       
   457   \begin{center}
       
   458   \begin{tabular}{ll}
       
   459   if $b_i = 0$: & $A^{s_i} \stackrel{?}{\equiv} h_i\;mod\;p$\\
       
   460   if $b_i = 1$: & $A^{s_i} \stackrel{?}{\equiv} h_i * h_j^{-1}  \;mod\; p$\\
       
   461   \end{tabular}
       
   462   \end{center}
       
   463   \end{enumerate}
       
   464 \end{itemize}
       
   465 
       
   466 \noindent To understand the case for $b_i = 1$, you have 
       
   467 to do the following calculation: 
       
   468 
       
   469 \begin{center}
       
   470 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   471 $A^{s_i}$ & $=$ & $A^{r_i - r_j}$\\ 
       
   472           & $=$ & $A^{r_i} * A^{-r_j}$\\
       
          & $=$ & $h_i * h_j^{-1}\;mod\;p$
       
   474 \end{tabular}
       
   475 \end{center}
       
   476 
       
\noindent What is interesting is that so far nothing has been
sent about $x$, the secret Alice holds. Also notice
that Bob does not know $r_j$. He received
       
   480 
       
   481 \begin{center}
       
$r_j - r_j$, $r_k - r_j$, \ldots, $r_l - r_j \;mod\; (p - 1)$
       
   483 \end{center}
       
   484 
       
   485 \noindent whenever his corresponding bits were $1$. So Bob
       
   486 does not know $r_j$ and also does not know any $r_i$ where the
       
bit was $1$. Information about $x$ is only sent in the next
stage (obviously without revealing $x$).
       
   489 
       
   490 \begin{itemize}   
       
   491 \item {\bf Proving Stage}
       
   492 
       
   493 \begin{enumerate}
       
   494 \item Alice proves she knows $x$, the discrete log of $B$,
       
   495 by sending
       
   496 
       
   497 \begin{center}
       
   498 $s_{z+1} = x - r_j\;mod\;p-1$
       
   499 \end{center}
       
   500 
       
   501 \item Bob confirms
       
   502 
       
   503 \begin{center}
       
   504 $A^{s_{z+1}} \stackrel{?}{\equiv} B * h_j^{-1} \;mod \; p$
       
   505 \end{center}
       
   506 \end{enumerate}
       
   507 \end{itemize}
       
   508 
       
   509 \noindent To understand the last step, you have to do the trick
       
   510 again that 
       
   511 
       
   512 \[A^{s_{z+1}} = A^{x-r_j} = \ldots
       
   513 \]
       
   514 
       
   515 \noindent
       
   516 which I leave to you.
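All three stages can be put together in a small simulation (the
tiny numbers $p = 23$, $A = 5$ and the secret $x = 7$ are made
up and of course hopelessly insecure; real instances use very
large primes):

```python
import random

p, A = 23, 5               # public (5 generates the group mod 23)
x = 7                      # Alice's secret
B = pow(A, x, p)           # public data: A^x mod p

def run(z=16):
    # commitment stage
    r = [random.randrange(p - 1) for _ in range(z)]   # Alice's r_i
    h = [pow(A, ri, p) for ri in r]
    b = [random.randrange(2) for _ in range(z)]       # Bob's bits
    if 1 not in b:
        return True        # degenerate run: no 1-bit at all
    j = b.index(1)         # lowest j with b_j = 1
    s = [r[i] if b[i] == 0 else (r[i] - r[j]) % (p - 1)
         for i in range(z)]
    # confirmation stage: Bob checks every s_i against h_i
    hj_inv = pow(h[j], -1, p)
    for i in range(z):
        lhs = pow(A, s[i], p)
        rhs = h[i] if b[i] == 0 else (h[i] * hj_inv) % p
        if lhs != rhs:
            return False
    # proving stage
    s_last = (x - r[j]) % (p - 1)
    return pow(A, s_last, p) == (B * hj_inv) % p
```

An honest Alice passes all checks in every run; replacing $x$
by a guess makes the final check fail almost always.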
       
   517 
       
   518 Now the question is how can Alice cheat? In order to cheat she
       
   519 has to coordinate what she sends as $h_i$ in step 1 and $s_i$
       
   520 in step 3 of the commitment stage, and also what to send as
       
   521 $s_{z+1}$ in the proving stage. For the latter of course 
       
   522 Alice does not know $x$, so she just chooses some random 
       
   523 number for $s_{z+1}$ and calculates 
       
   524 
       
   525 \[A^{s_{z+1}}\]
       
   526 
       
   527 \noindent
       
   528 and then solves the equation
       
   529 
       
   530 \[A^{s_{z+1}} \equiv B * y \;mod\;p\]
       
   531 
       
   532 \noindent for $y$. This is easy since no logarithm needs to be
       
   533 computed. If Alice can guess the $j$ where the first 1 will
       
   534 appear in Bob's bit vector, then she sends the inverse of $y$
       
   535 as $h_j$ and 0 as $s_j$. However, notice that when she
       
   536 calculates a solution for $y$ she does not know $r_j$. For this she
       
   537 would need to calculate the modular logarithm
       
   538 
       
\[A^{r_j} \equiv y^{-1}\;mod\;p\]
       
   540 
       
   541 \noindent which is hard (see step 1 in the commitment stage).
       
   542 
       
Having settled on what $h_j$ should be, what should Alice now
send as the other $h_i$ and $s_i$? If a $b_i$ happens
to be a 1, then the corresponding $h_i$ and $s_i$ need to
satisfy the test
       
   547 
       
   548 \[A^{s_i} \stackrel{?}{\equiv} h_i * h_j^{-1}  \;mod\; p\]
       
   549 
       
   550 \noindent where she has already settled on the value of 
       
$h_j^{-1}$. Let's say she chooses $s_i$ at random; then she just
needs to solve

\[A^{s_i} \equiv u * h_j^{-1}  \;mod\; p\]

\noindent for $u$, which she then sends as $h_i$. Again that is
easy, but it does not allow
       
   557 us to know $r_i$, because then we would again need to solve
       
a modular logarithm problem. Let us call an $h_i$ that was
produced the easy way \emph{bogus}. Alice has to produce bogus
$h_i$s in advance for all the bits that are going to be $1$!
This means she has to guess all of Bob's bits correctly. (Yes?
I let you think about this.)
       
   563 
       
Let us see what happens if she guesses wrongly: Suppose a bit
$b_i = 1$ arrives where she expected a 0. Then she has already
sent an $h_i$ and $h_j$, and must now find an $s_i$
       
   567 such that 
       
   568 
       
   569 \[A^{s_i} \equiv h_i * h_j^{-1}  \;mod\; p\]
       
   570 
       
\noindent holds. Remember that in calculating $h_i$, she
just chose a random $s_i$; now she has to send a genuine one.
But this is of course too hard. If she knew the genuine $r_i$
and $r_j$ for $h_i$ and $h_j$, it would be easy (in this case
$s_i = r_i - r_j$). But she does not. So she will be found
out. If $b_i = 0$, but she thought she would get a 1, then
she has to send an $s_i$ which satisfies

\[A^{s_i} \equiv h_i\;mod\;p\]

\noindent Again she does not know $r_i$. So this task is too
hard and she will be found out again.

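The claim that the genuine response is easy once the exponents are known can be checked with toy numbers (my own choice, not from the handout): with $r_i$ and $r_j$ in hand, $s_i = r_i - r_j$ passes Bob's test for a 1-bit.

```python
# Toy check (made-up numbers): with the genuine exponents r_i and r_j,
# the response s_i = r_i - r_j passes Bob's test A^{s_i} = h_i * h_j^{-1}.
p = 23                       # public prime (toy size)
A = 5                        # public base
r_i, r_j = 9, 4              # genuine random exponents
h_i = pow(A, r_i, p)
h_j = pow(A, r_j, p)

s_i = (r_i - r_j) % (p - 1)  # exponents can be reduced mod p-1 (Fermat)
h_j_inv = pow(h_j, -1, p)    # modular inverse of h_j (Python 3.8+)
assert pow(A, s_i, p) == (h_i * h_j_inv) % p
print("genuine response accepted")
```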
       
To sum up, in order for Alice to successfully cheat Bob, she
needs to guess \emph{all} bits correctly. She has only a
$\frac{1}{2^z}$ chance of doing this.

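A quick simulation (my own illustration; the parameters $z$ and the trial count are arbitrary) confirms this: Alice's guesses only ever succeed when every one of the $z$ challenge bits comes out as anticipated, which happens with probability $\frac{1}{2^z}$.

```python
# Toy simulation (made-up parameters): Alice cheats successfully only
# if she guesses all z challenge bits in advance.  Compare the
# estimated success rate with the analytic value (1/2)^z.
import random

def cheat_succeeds(z):
    guesses = [random.randrange(2) for _ in range(z)]  # Alice's advance guesses
    bits    = [random.randrange(2) for _ in range(z)]  # Bob's actual challenge
    return guesses == bits

z, trials = 10, 100_000
hits = sum(cheat_succeeds(z) for _ in range(trials))
print(f"estimated: {hits / trials:.5f}   analytic: {0.5 ** z:.5f}")
```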
       
\subsubsection*{Further Reading}

Make sure you understand what NP problems
are.\footnote{\url{http://en.wikipedia.org/wiki/NP_(complexity)}}
They are the building blocks for zero-knowledge proofs.
Zero-knowledge proofs are not yet widely used in production
systems, but they are slowly gaining ground. One area of application
where they pop up is crypto currencies (for example Zerocoins,
or how to make sure a Bitcoin exchange is solvent without
revealing its assets).

If you want to brush up on the modular logarithm problem,
the Khan Academy has a nice video:

\begin{center}
\url{https://www.khanacademy.org/video/discrete-logarithm-problem}
\end{center}


\end{document}

http://blog.cryptographyengineering.com/2014/11/zero-knowledge-proofs-illustrated-primer.html

http://btravers.weebly.com/uploads/6/7/2/9/6729909/zero_knowledge_technique.pdf
http://zk-ssh.cms.ac/docs/Zero_Knowledge_Prinzipien.pdf
http://www.wisdom.weizmann.ac.il/~oded/PS/zk-tut02v4.ps

socialist millionaires problem
http://en.wikipedia.org/wiki/Socialist_millionaire
http://twistedoakstudios.com/blog/Post3724_explain-it-like-im-five-the-socialist-millionaire-problem-and-secure-multi-party-computation


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: 