msc-projects-12.html
changeset 165 89c9fcb211a8
parent 164 d99c0026ebaf
child 166 78e14199b429
164:d99c0026ebaf 165:89c9fcb211a8
    55   <A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp12-part-deriv-sub-match.pdf">this paper</A> 
    55   <A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp12-part-deriv-sub-match.pdf">this paper</A> 
    56   about regular expression matching and partial derivatives was presented this summer at the international 
    56   about regular expression matching and partial derivatives was presented this summer at the international 
    57   PPDP'12 conference. The task in this project is to implement the results from this paper.</p>
    57   PPDP'12 conference. The task in this project is to implement the results from this paper.</p>
    58 
    58 
    59   <p>The background for this project is that some regular expressions are 
    59   <p>The background for this project is that some regular expressions are 
    60   &quot;<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>&quot; 
    60   &ldquo;<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>&rdquo;
    61   and can &quot;stab you in the back&quot; according to
    61   and can &ldquo;stab you in the back&rdquo; according to
    62   this <A HREF="http://tech.blog.cueup.com/regular-expressions-will-stab-you-in-the-back">blog post</A>.
    62   this <A HREF="http://tech.blog.cueup.com/regular-expressions-will-stab-you-in-the-back">blog post</A>.
    63   For example, if you use in <A HREF="http://www.python.org">Python</A> or 
    63   For example, if you use in <A HREF="http://www.python.org">Python</A> or 
    64   in <A HREF="http://www.ruby-lang.org/en/">Ruby</A> (probably also in other mainstream programming languages) the 
    64   in <A HREF="http://www.ruby-lang.org/en/">Ruby</A> (probably also in other mainstream programming languages) the 
    65   innocent-looking regular expression <code>a?{28}a{28}</code> and match it, say, against the string 
    65   innocent-looking regular expression <code>a?{28}a{28}</code> and match it, say, against the string 
    66   <code>aaaaaaaaaaaaaaaaaaaaaaaaaaaa</code> (that is 28 <code>a</code>s), you will soon notice that your CPU usage goes to 100%. In fact,
    66   <code>aaaaaaaaaaaaaaaaaaaaaaaaaaaa</code> (that is 28 <code>a</code>s), you will soon notice that your CPU usage goes to 100%. In fact,
    67   Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself:
    67   Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself:
    68   <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/re.py">re.py</A> (Python version) and 
    68   <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/re.py">re.py</A> (Python version) and 
    69   <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/re-internal.rb">re.rb</A> 
    69   <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/re-internal.rb">re.rb</A> 
    70   (Ruby version). You can imagine an attacker
    70   (Ruby version). You can imagine an attacker
    71   mounting a nice <A HREF="http://en.wikipedia.org/wiki/Denial-of-service_attack">DoS attack</A> against 
    71   mounting a nice <A HREF="http://en.wikipedia.org/wiki/Denial-of-service_attack">DoS attack</A> against 
    72   your program if it contains such an &quot;evil&quot; regular expression. Actually 
    72   your program if it contains such an &ldquo;evil&rdquo; regular expression. Actually 
    73   <A HREF="http://www.scala-lang.org/">Scala</A> (and also Java) are almost immune from such
    73   <A HREF="http://www.scala-lang.org/">Scala</A> (and also Java) are almost immune from such
    74   attacks as they can deal with strings of up to 4,300 <code>a</code>s in less than a second. But if you scale
    74   attacks as they can deal with strings of up to 4,300 <code>a</code>s in less than a second. But if you scale
    75   the regular expression and string further to, say, 4,600 <code>a</code>s, then you get a <code>StackOverflowError</code> 
    75   the regular expression and string further to, say, 4,600 <code>a</code>s, then you get a <code>StackOverflowError</code> 
    76   potentially crashing your program.
    76   potentially crashing your program.
    77   </p>
    77   </p>
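The blow-up described above can be observed on a smaller scale with a few lines of Python. This is only a rough sketch and not the linked re.py script: the helper names `evil_pattern` and `time_match` are made up for illustration, and the pattern is built by string repetition (`a?a?...aa...`) rather than written literally as `a?{n}a{n}`, because regex engines differ in how they treat a quantifier applied directly to another quantifier. The sizes are kept small on purpose, since the running time of a backtracking matcher grows roughly exponentially in `n`.

```python
import re
import time

def evil_pattern(n):
    """Hypothetical helper: n optional a's followed by n mandatory a's."""
    return "a?" * n + "a" * n

def time_match(n):
    """Match the evil pattern against a string of n a's; return (matched, seconds)."""
    pattern = re.compile(evil_pattern(n))
    start = time.perf_counter()
    matched = pattern.fullmatch("a" * n) is not None
    return matched, time.perf_counter() - start

if __name__ == "__main__":
    # Small sizes only: each step up makes a backtracking matcher noticeably slower.
    for n in (8, 12, 16, 20):
        matched, seconds = time_match(n)
        print(f"n={n:2d} matched={matched} time={seconds:.4f}s")
```

The string always matches (every optional `a` has to match the empty string), so only the reported times are of interest. Exact numbers depend on the machine and the Python version.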
    85   official matcher maxes out at 4,600 <code>a</code>s). My matcher is approximately
    85   official matcher maxes out at 4,600 <code>a</code>s). My matcher is approximately
    86   85 lines of code and based on the concept of 
    86   85 lines of code and based on the concept of 
    87   <A HREF="http://lambda-the-ultimate.org/node/2293">derivatives of regular expressions</A>.
    87   <A HREF="http://lambda-the-ultimate.org/node/2293">derivatives of regular expressions</A>.
    88   These derivatives were introduced in 1964 by <A HREF="http://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)">
    88   These derivatives were introduced in 1964 by <A HREF="http://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)">
    89   Janusz Brzozowski</A>, but according to this 
    89   Janusz Brzozowski</A>, but according to this 
    90   <A HREF="http://www.cl.cam.ac.uk/~so294/documents/jfp09.pdf">paper</A> had been lost in the &quot;sands of time&quot;.
    90   <A HREF="http://www.cl.cam.ac.uk/~so294/documents/jfp09.pdf">paper</A> had been lost in the &ldquo;sands of time&rdquo;.
    91   The advantage of derivatives is that they side-step completely the usual 
    91   The advantage of derivatives is that they side-step completely the usual 
    92   <A HREF="http://hackingoff.com/compilers/regular-expression-to-nfa-dfa">translations</A> of regular expressions
    92   <A HREF="http://hackingoff.com/compilers/regular-expression-to-nfa-dfa">translations</A> of regular expressions
    93   into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
    93   into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
    94   expression matchers in Python and Ruby.
    94   expression matchers in Python and Ruby.
    95   </p>
    95   </p>
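To give an idea of how a derivative-based matcher works, here is a minimal sketch in Python. It is not the roughly 85-line matcher mentioned above; all class and function names (`Zero`, `One`, `Chr`, `Alt`, `Seq`, `Star`, `nullable`, `der`, `simp`, `matches`, `evil`) are chosen for illustration only. The idea is to take the derivative of the regular expression with respect to each character of the input in turn and then check whether the resulting regular expression can match the empty string. The simplification step is an assumption of this sketch: without removing duplicate alternatives the derivatives themselves can grow very quickly.

```python
from dataclasses import dataclass
from functools import reduce

class Rexp:
    """Base class for regular expressions."""

@dataclass(frozen=True)
class Zero(Rexp):
    """Matches no string at all."""

@dataclass(frozen=True)
class One(Rexp):
    """Matches only the empty string."""

@dataclass(frozen=True)
class Chr(Rexp):
    c: str

@dataclass(frozen=True)
class Alt(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):
    r: Rexp

def nullable(r):
    """True if r can match the empty string."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False  # Zero and Chr

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    return Zero()  # Zero and One

def alternatives(r):
    """Flatten nested Alt nodes into a list of alternatives."""
    if isinstance(r, Alt):
        return alternatives(r.r1) + alternatives(r.r2)
    return [r]

def simp(r):
    """Drop Zero, collapse One in sequences, remove duplicate alternatives."""
    if isinstance(r, Seq):
        a, b = simp(r.r1), simp(r.r2)
        if Zero() in (a, b):
            return Zero()
        if a == One():
            return b
        if b == One():
            return a
        return Seq(a, b)
    if isinstance(r, Alt):
        unique = {}  # dict keeps insertion order and removes duplicates
        for alt in alternatives(r):
            for s in alternatives(simp(alt)):
                if s != Zero():
                    unique[s] = None
        items = list(unique)
        if not items:
            return Zero()
        return reduce(lambda acc, x: Alt(x, acc), reversed(items))
    return r

def matches(r, s):
    """Derive r by each character of s, then test for the empty string."""
    for c in s:
        r = simp(der(c, r))
    return nullable(r)

def evil(n):
    """n optional a's followed by n a's (n >= 1), as nested sequences."""
    parts = [Alt(Chr("a"), One())] * n + [Chr("a")] * n
    return reduce(lambda acc, p: Seq(p, acc), reversed(parts))
```

With these definitions `matches(evil(28), "a" * 28)` can be evaluated directly. No automaton is built and no backtracking takes place, since each input character is consumed exactly once by a derivative step.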
   104   Their approach is based on the concept of partial derivatives introduced in 1994 by
   104   Their approach is based on the concept of partial derivatives introduced in 1994 by
   105   <A HREF="http://reference.kfupm.edu.sa/content/p/a/partial_derivatives_of_regular_expressio_1319383.pdf">Valentin Antimirov</A>.
   105   <A HREF="http://reference.kfupm.edu.sa/content/p/a/partial_derivatives_of_regular_expressio_1319383.pdf">Valentin Antimirov</A>.
   106   I used them once myself in a <A HREF="http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf">paper</A> 
   106   I used them once myself in a <A HREF="http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf">paper</A> 
   107   in order to prove the <A HREF="http://en.wikipedia.org/wiki/Myhill–Nerode_theorem">Myhill-Nerode theorem</A>.
   107   in order to prove the <A HREF="http://en.wikipedia.org/wiki/Myhill–Nerode_theorem">Myhill-Nerode theorem</A>.
   108   So I know they are worth their money. Still, it would be interesting to actually compare their results
   108   So I know they are worth their money. Still, it would be interesting to actually compare their results
   109   with my simple rainy-afternoon matcher and potentially &quot;blow away&quot; the regular expression matchers 
   109   with my simple rainy-afternoon matcher and potentially &ldquo;blow away&rdquo; the regular expression matchers 
   110   in Python and Ruby (and possibly in Scala too).
   110   in Python and Ruby (and possibly in Scala too).
   111   </p>
   111   </p>
   112 
   112 
   113   <p>
   113   <p>
   114   <B>Literature:</B> 
   114   <B>Literature:</B> 
   122   on automata and regular expressions is by John Hopcroft and Jeffrey Ullman (available in the library). 
   122   on automata and regular expressions is by John Hopcroft and Jeffrey Ullman (available in the library). 
   123   There is also an online course about this topic by Ullman at 
   123   There is also an online course about this topic by Ullman at 
   124   <A HREF="https://www.coursera.org/course/automata">Coursera</A>, though IMHO not 
   124   <A HREF="https://www.coursera.org/course/automata">Coursera</A>, though IMHO not 
   125   done with love. 
   125   done with love. 
   126   Finally, there are millions of other pointers about regular expression
   126   Finally, there are millions of other pointers about regular expression
   127   matching on the Net. Test cases for &quot;<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>&quot;
   127   matching on the Net. Test cases for &ldquo;<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>&rdquo;
   128   regular expressions can be obtained from <A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">here</A>.
   128   regular expressions can be obtained from <A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">here</A>.
   129   </p>
   129   </p>
   130 
   130 
   131   <p>
   131   <p>
   132   <B>Skills:</B> 
   132   <B>Skills:</B> 
   181 
   181 
   182 <p>
   182 <p>
   183   <B>Skills:</B> 
   183   <B>Skills:</B> 
   184   This is a project for a student with very good programming 
   184   This is a project for a student with very good programming 
   185   and <A HREF="http://en.wikipedia.org/wiki/Hacker_(programmer_subculture)">hacking</A> skills. 
   185   and <A HREF="http://en.wikipedia.org/wiki/Hacker_(programmer_subculture)">hacking</A> skills. 
   186   Some knowledge in JavaScript, HTML and CSS cannot hurt The algorithms from automata
   186   Some knowledge in JavaScript, HTML and CSS cannot hurt. The algorithms from automata
   187   theory are fairly standard material.
   187   theory are fairly standard material.
   188   </p>
   188   </p>
   189 
   189 
   190 
   190 
   191 <!--
   191 <!--
   293   <p>
   293   <p>
   294   <b>Description:</b> 
   294   <b>Description:</b> 
   295   This project is similar to [CU3]. The emphasis here, however, is on the
   295   This project is similar to [CU3]. The emphasis here, however, is on the
   296   implementation and comparison of register spilling algorithms, also often called register allocation 
   296   implementation and comparison of register spilling algorithms, also often called register allocation 
   297   algorithms. They are part of any respectable compiler.  As said
   297   algorithms. They are part of any respectable compiler.  As said
   298   in [CU3], however, my simple compiler lacks them and assumes an infinite amount of registers instead.
   298   in [CU3] my simple compiler lacks them and assumes an infinite amount of registers instead.
   299   Real CPUs however only provide a fixed amount of registers (for example
   299   Real CPUs however only provide a fixed amount of registers (for example
   300   x86-64 has 16 general purpose registers). Whenever a program needs
   300   x86-64 has 16 general purpose registers). Whenever a program needs
   301   to hold more values than registers, the values need to be &ldquo;spilled&rdquo;
   301   to hold more values than registers, the values need to be &ldquo;spilled&rdquo;
   302   into the main memory. Register spilling algorithms try to minimise
   302   into the main memory. Register spilling algorithms try to minimise
   303   this spilling, since fetching values from main memory is a costly 
   303   this spilling, since fetching values from main memory is a costly 
   381   computers or smart phones.
   381   computers or smart phones.
   382   </p>
   382   </p>
   383 
   383 
   384   <p>
   384   <p>
   385   However, there is one restriction that makes this project harder than it seems
   385   However, there is one restriction that makes this project harder than it seems
   386   as first sight. The department does not allow large server applications and databases
   386   at first sight: The department does not allow large server applications and databases
   387   to be run on calcium - the central server in the department. So the problem should be solved with as few resources needed
   387   to be run on calcium, which is the central server in the department. So the problem 
   388   on the &quot;back-end&quot; which collects the votes. 
   388   should be solved with as few resources as possible 
       
   389   on the &ldquo;back-end&rdquo; collecting the votes. 
   389   </p>
   390   </p>
   390 
   391 
   391   <p>
   392   <p>
   392   <B>Literature:</B> 
   393   <B>Literature:</B> 
   393   The project requires fluency in a web-programming language (for example 
   394   The project requires fluency in a web-programming language (for example