updated
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 23 Sep 2016 15:07:05 +0100
changeset 457 3feaf8bc3e48
parent 456 9eea20ad0caf
child 458 0647d8161a84
bsc-projects-16.html
--- a/bsc-projects-16.html	Fri Sep 23 13:29:59 2016 +0100
+++ b/bsc-projects-16.html	Fri Sep 23 15:07:05 2016 +0100
@@ -1,7 +1,7 @@
 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <HEAD>
-<TITLE>2015/16 MSc Projects</TITLE>
+<TITLE>2016/17 BSc Projects</TITLE>
 <BASE HREF="http://www.inf.kcl.ac.uk/staff/urbanc/">
 <script type="text/javascript" src="striper.js"></script>
 <link rel="stylesheet" href="nominal.css">
@@ -35,7 +35,7 @@
 <H4>Email: christian dot urban at kcl dot ac dot uk,  Office: Strand Building S1.27</H4>
 <H4>If you are interested in a project, please send me an email and we can discuss details. Please include
 a short description about your programming skills and Computer Science background in your first email. 
-I will also need your King's username in order to book the project for you. Thanks.</H4> 
+Thanks.</H4> 
 
 <H4>Note that besides being a lecturer at the theoretical end of Computer Science, I am also a passionate
     <A HREF="http://en.wikipedia.org/wiki/Hacker_(programmer_subculture)">hacker</A> &hellip;
@@ -58,7 +58,9 @@
   lexing programs, syntax highlighting and so on. Given that regular expressions were
   introduced in 1950 by <A HREF="http://en.wikipedia.org/wiki/Stephen_Cole_Kleene">Stephen Kleene</A>,
   you might think regular expressions have since been studied and implemented to death. But you would definitely be
-  mistaken: in fact they are still an active research area. For example
+  mistaken: in fact they are still an active research area. Off the top of my head, I can point
+  you to research papers that appeared in the last few years.
+  For example
   <A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/regex-parsing-derivatives.pdf">this paper</A> 
   about regular expression matching and derivatives was presented just last summer at the international 
   FLOPS'14 conference. The task in this project is to implement their results and use them for lexing.</p>
@@ -72,13 +74,21 @@
   innocent-looking regular expression <code>a?{28}a{28}</code> and match it, say, against the string 
   <code>aaaaaaaaaaaaaaaaaaaaaaaaaaaa</code> (that is 28 <code>a</code>s), you will soon notice that your CPU usage goes to 100%. In fact,
   Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself:
-  <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/re.py">re.py</A> (Python version) and 
-  <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/re.rb">re.rb</A> 
-  (Ruby version). You can imagine an attacker
+  <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/catastrophic.py">catastrophic.py</A> (Python version) and 
+  <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/catastrophic.rb">catastrophic.rb</A> 
+  (Ruby version). Here is a similar problem in Java: <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/catastrophic.rb">catastrophic.java</A>
+  </p> 
+
+  <p>
+  You can imagine an attacker
   mounting a nice <A HREF="http://en.wikipedia.org/wiki/Denial-of-service_attack">DoS attack</A> against 
-  your program if it contains such an &ldquo;evil&rdquo; regular expression. Actually 
-  <A HREF="http://www.scala-lang.org/">Scala</A> (and also Java) are almost immune from such
-  attacks as they can deal with strings of up to 4,300 <code>a</code>s in less than a second. But if you scale
+  your program if it contains such an &ldquo;evil&rdquo; regular expression. But it can also happen by accident:
+  on 20 July 2016 the website <A HREF="http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016">Stack Exchange</A>
+  was knocked offline because of an evil regular expression. One of their engineers talks about it in this
+  <A HREF="https://vimeo.com/112065252">video</A>. A similar problem needed to be fixed in the
+  <A HREF="http://davidvgalbraith.com/how-i-fixed-atom/">Atom</A> editor.
+  A few implementations of regular expression matchers are almost immune from such problems.
+  For example, <A HREF="http://www.scala-lang.org/">Scala</A> can deal with strings of up to 4,300 <code>a</code>s in less than a second. But if you scale
   the regular expression and string further to, say, 4,600 <code>a</code>s, then you get a <code>StackOverflowError</code> 
   potentially crashing your program. Moreover (besides the "minor" problem of being painfully slow) according to this
   <A HREF="http://www.haskell.org/haskellwiki/Regex_Posix">report</A>
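Reviewer note: the slowdown this hunk describes can be reproduced with a short script along the following lines. This is only a minimal sketch and not the contents of the linked catastrophic.py. The function name evil_match_time is made up for illustration, and the pattern is written out as n copies of a? followed by n copies of a, which is assumed to denote the same language as a?{n}a{n}. Small values of n are used so that the script finishes quickly; the running time should grow sharply as n increases.

```python
import re
import time


def evil_match_time(n):
    """Match n optional 'a's followed by n mandatory 'a's against 'a' * n.

    Returns (matched, seconds). The name and interface are hypothetical.
    """
    # a?a?...a? (n times) followed by aa...a (n times)
    pattern = re.compile("a?" * n + "a" * n)
    text = "a" * n
    start = time.perf_counter()
    matched = pattern.fullmatch(text) is not None
    elapsed = time.perf_counter() - start
    return matched, elapsed


if __name__ == "__main__":
    # Keep n small here; each step of 2 should make the match
    # noticeably slower on a backtracking matcher.
    for n in range(10, 21, 2):
        matched, seconds = evil_match_time(n)
        print(f"n={n:2d} matched={matched} time={seconds:.4f}s")
```

Raising the upper bound of the loop towards 28 should show the behaviour the text reports, but expect long waits.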
@@ -96,7 +106,7 @@
   <A HREF="http://lambda-the-ultimate.org/node/2293">derivatives of regular expressions</A>.
   These derivatives were introduced in 1964 by <A HREF="http://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)">
   Janusz Brzozowski</A>, but according to this
-  <A HREF="http://www.cl.cam.ac.uk/~so294/documents/jfp09.pdf">paper</A> had been lost in the &ldquo;sands of time&rdquo;.
+  <A HREF="https://www.cs.kent.ac.uk/people/staff/sao/documents/jfp09.pdf">paper</A> had been lost in the &ldquo;sands of time&rdquo;.
   The advantage of derivatives is that they side-step completely the usual 
   <A HREF="http://hackingoff.com/compilers/regular-expression-to-nfa-dfa">translations</A> of regular expressions
   into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
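Reviewer note: the idea of matching with derivatives, as referred to in this hunk, can be sketched as follows. This is a plain Brzozowski-style matcher under my own assumptions, not the algorithm from the cited papers and not the project's reference solution. All class and function names (Zero, One, Chr, Alt, Seq, Star, nullable, der, simp, matches, evil) are chosen for illustration only, and the simplification step is one possible choice among many.

```python
from dataclasses import dataclass


class Rexp:
    """Base class for regular expressions (illustrative names only)."""


@dataclass(frozen=True)
class Zero(Rexp):  # matches no string
    pass


@dataclass(frozen=True)
class One(Rexp):  # matches only the empty string
    pass


@dataclass(frozen=True)
class Chr(Rexp):  # matches a single character
    c: str


@dataclass(frozen=True)
class Alt(Rexp):  # r1 | r2
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Seq(Rexp):  # r1 followed by r2
    r1: Rexp
    r2: Rexp


@dataclass(frozen=True)
class Star(Rexp):  # zero or more repetitions of r
    r: Rexp


def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False  # Zero and Chr


def der(c, r):
    """Derivative of r with respect to the character c."""
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()  # Zero and One


def alternatives(r):
    """Flatten nested Alt nodes into a list of their leaves."""
    if isinstance(r, Alt):
        return alternatives(r.r1) + alternatives(r.r2)
    return [r]


def simp(r):
    """Remove Zero/One clutter and duplicate alternatives."""
    if isinstance(r, Alt):
        alts = alternatives(Alt(simp(r.r1), simp(r.r2)))
        unique = list(dict.fromkeys(a for a in alts if a != Zero()))
        if not unique:
            return Zero()
        result = unique[-1]
        for a in reversed(unique[:-1]):
            result = Alt(a, result)
        return result
    if isinstance(r, Seq):
        r1, r2 = simp(r.r1), simp(r.r2)
        if r1 == Zero() or r2 == Zero():
            return Zero()
        if r1 == One():
            return r2
        if r2 == One():
            return r1
        return Seq(r1, r2)
    return r


def matches(r, s):
    """Take the derivative for each character, then test nullability."""
    for c in s:
        r = simp(der(c, r))
    return nullable(r)


def evil(n):
    """Build n optional 'a's followed by n mandatory 'a's."""
    r = One()
    for _ in range(n):
        r = Seq(Chr("a"), r)
    for _ in range(n):
        r = Seq(Alt(Chr("a"), One()), r)
    return r
```

For example, matches(evil(28), "a" * 28) exercises the "evil" expression from the text without any backtracking. How well this scales depends heavily on the simplification step, which is why it is worth experimenting with in the project.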
@@ -124,7 +134,7 @@
   <B>Literature:</B> 
   The place to start with this project is obviously this
   <A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/regex-parsing-derivatives.pdf">paper</A>
-  and this <A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp12-part-deriv-sub-match.pdf">one</A>.
+  and this <A HREF="http://www.inf.kcl.ac.uk/staff/urbanc/Publications/posix.pdf">one</A>.
   Traditional methods for regular expression matching are explained
   in the Wikipedia articles 
   <A HREF="http://en.wikipedia.org/wiki/DFA_minimization">here</A> and 
@@ -150,8 +160,8 @@
   <A HREF="http://fsharp.org">F#</A>, 
   <A HREF="http://en.wikipedia.org/wiki/Standard_ML">ML</A>,  
   <A HREF="http://haskell.org/haskellwiki/Haskell">Haskell</A>, etc. Python and other non-functional languages
-  can also be used, but seem much less convenient. If you attend my Formal Languages and
-  Automata module, that would obviously give you a head-start with this project.
+  can also be used, but seem much less convenient. If you attend my Compilers and Formal Languages
+  module, that would obviously give you a head-start with this project.
   </p>
   
 <li> <H4>[CU2] A Compiler for a small Programming Language</H4>