+-
[CU1] Regular Expression Matching and Partial Derivatives
+
+
+ Description:
+ Regular expressions
+ are extremely useful for many text-processing tasks, such as finding patterns in texts,
+ lexing programs, syntax highlighting and so on. Given that regular expressions were
+ introduced in the 1950s by Stephen Kleene, you might think
+ regular expressions have since been studied to death. But you would definitely be mistaken: in fact they are still
+ an active research area. For example
+ this paper
+ about regular expression matching and partial derivatives was presented this summer at the international
+ PPDP'12 conference.
+
+ The background for this project is that some regular expressions are
+ "evil"
+ and can "stab you in the back" according to
+ this recent blog post.
+ For example, if you use in Python or
+ in Ruby (probably also in other mainstream programming languages) the
+ innocent-looking regular expression a?{28}a{28}
+ and match it, say, against the string
+ aaaaaaaaaaaaaaaaaaaaaaaaaaaa,
+ you will soon notice that your CPU usage goes to 100%. In fact,
+ Python and Ruby need approximately 30 seconds for matching this string. You can try it for yourself:
+ re.py (Python version) and
+ re.rb
+ (Ruby version). You can imagine an attacker
+ mounting a nice DoS attack against
+ your program if it contains such an "evil" regular expression. Actually
+ Scala (and also Java) are almost immune to such
+ attacks, as they can deal with strings of up to 4,300 a's
+ in less than a second. But if you scale
+ the regular expression and string further to, say, 4,600 a's,
+ you get a StackOverflowError
+ exception crashing your program.
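
The slowdown can be reproduced on a small scale. The following is a minimal sketch, not the re.py script linked above: it writes the pattern in the expanded form a?a?...aa..., which is an assumption about what a?{n}a{n} abbreviates, and the function name time_evil_match is made up for this illustration. Timings depend on the machine and the Python version, so none are quoted here.

```python
import re
import time

def time_evil_match(n):
    """Match n a's against n optional a's followed by n mandatory a's."""
    pattern = "a?" * n + "a" * n          # expanded form of the "evil" pattern
    text = "a" * n
    start = time.perf_counter()
    matched = re.fullmatch(pattern, text) is not None
    return matched, time.perf_counter() - start

if __name__ == "__main__":
    # Keep n small: the running time of the backtracking matcher
    # grows very quickly with n.
    for n in (10, 14, 18, 20):
        matched, seconds = time_evil_match(n)
        print(f"n={n:2d} matched={matched} time={seconds:.3f}s")
```

Increasing n by a few steps at a time shows how quickly the running time grows.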
+
+
+
+ On a rainy afternoon, I implemented
+ this
+ regular expression matcher in Scala. It is not as fast as the official one in Scala, but
+ it can match up to 11,000 a's
+ in less than 5 seconds without raising any exception
+ (remember Python and Ruby both need nearly 30 seconds to process 28(!) a's,
+ and Scala's
+ official matcher maxes out at 4,600 a's).
+ My matcher is approximately
+ 85 lines of code and based on the concept of
+ derivatives of regular expressions.
+ Derivatives were introduced in 1964 by
+ Janusz Brzozowski, but according to this
+ paper had been lost in the "sands of time".
+ The advantage of derivatives is that they completely side-step the usual
+ translations of regular expressions
+ into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
+ expression matchers in Python and Ruby.
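
To give a flavour of the idea, here is a minimal sketch of a derivative-based matcher. It is written in Python rather than Scala and is not the 85-line matcher mentioned above: the tagged-tuple encoding, the helper names (alt, seq, der, matches, evil) and the choice to store alternatives in a set are all assumptions made for this illustration. The derivative of r with respect to a character c is a regular expression matching exactly those strings s for which r matches c followed by s; matching a string means taking one derivative per character and then checking whether the result accepts the empty string.

```python
# Regular expressions as tagged tuples (an illustrative encoding).
ZERO = ("zero",)            # matches nothing
ONE = ("one",)              # matches only the empty string

def char(c):
    return ("chr", c)

def alt(r, s):
    """r | s, flattened into a set so that duplicate alternatives collapse."""
    items = set()
    for x in (r, s):
        if x[0] == "alt":
            items |= x[1]
        elif x != ZERO:
            items.add(x)
    if not items:
        return ZERO
    if len(items) == 1:
        return next(iter(items))
    return ("alt", frozenset(items))

def seq(r, s):
    """r followed by s, simplifying the cases involving ZERO and ONE."""
    if r == ZERO or s == ZERO:
        return ZERO
    if r == ONE:
        return s
    if s == ONE:
        return r
    return ("seq", r, s)

def star(r):
    return ("star", r)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return any(nullable(x) for x in r[1])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False                      # zero and chr

def der(c, r):
    """Derivative of r with respect to the character c."""
    tag = r[0]
    if tag == "chr":
        return ONE if r[1] == c else ZERO
    if tag == "alt":
        out = ZERO
        for x in r[1]:
            out = alt(out, der(c, x))
        return out
    if tag == "star":
        return seq(der(c, r[1]), r)
    if tag == "seq":
        left = seq(der(c, r[1]), r[2])
        return alt(left, der(c, r[2])) if nullable(r[1]) else left
    return ZERO                       # zero and one

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

def evil(n):
    """n optional a's followed by n mandatory a's."""
    r = ONE
    for _ in range(n):
        r = seq(char("a"), r)
    for _ in range(n):
        r = seq(alt(ONE, char("a")), r)
    return r
```

For example, matches(evil(20), "a" * 20) can be used to try the "evil" pattern. Without the simplifications in alt and seq the intermediate expressions would grow very quickly, which is why they are built in here.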
+
+
+
+ Now the guys from the
+ PPDP'12-paper mentioned
+ above claim they are even faster than me and can deal with even more features of regular expressions
+ (for example subexpression matching, which my rainy-afternoon matcher lacks). I am sure they thought
+ about the problem much longer than a single afternoon. The task
+ in this project is to find out how good they actually are by implementing the results from their paper.
+ Their approach is based on the concept of partial derivatives introduced in 1994 by
+ Valentin Antimirov.
+ I used them once
+ in order to prove the Myhill-Nerode theorem
+ by using only regular expressions.
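
As a rough illustration of the difference to ordinary derivatives: a partial derivative with respect to a character is not a single regular expression but a set of them, which behaves much like the set of states of an NFA. The sketch below follows the textbook definition and is not the algorithm of the PPDP'12 paper (in particular it does no subexpression matching); the tuple encoding and the names pder and matches are assumptions made for this example.

```python
ZERO, ONE = ("zero",), ("one",)

def char(c):
    return ("chr", c)

def alt(r, s):
    return ("alt", r, s)

def seq(r, s):
    return s if r == ONE else ("seq", r, s)

def star(r):
    return ("star", r)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return False                      # zero and chr

def pder(c, r):
    """Set of partial derivatives of r with respect to the character c."""
    tag = r[0]
    if tag == "chr":
        return {ONE} if r[1] == c else set()
    if tag == "alt":
        return pder(c, r[1]) | pder(c, r[2])
    if tag == "seq":
        left = {seq(p, r[2]) for p in pder(c, r[1])}
        if nullable(r[1]):
            return left | pder(c, r[2])
        return left
    if tag == "star":
        return {seq(p, r) for p in pder(c, r[1])}
    return set()                      # zero and one

def matches(r, s):
    states = {r}
    for c in s:
        # take the partial derivatives of every expression in the current set
        states = set().union(*(pder(c, q) for q in states))
    return any(nullable(q) for q in states)
```

For instance, with r = seq(star(alt(char("a"), char("b"))), seq(char("a"), seq(char("b"), char("b")))), which encodes (a|b)*abb, matches(r, "ababb") succeeds and matches(r, "abab") fails.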
+
+
+
+ Literature:
+ The place to start with this project is obviously this
+ paper.
+ Traditional methods for regular expression matching are explained
+ in the Wikipedia articles
+ here and
+ here.
+ The authoritative book
+ on automata and regular expressions is by John Hopcroft and Jeffrey Ullman (available in the library).
+ There is also an online course about this topic by Ullman at
+ Coursera, though IMHO not
+ done with love.
+ Finally, there are millions of other pointers about regular expression
+ matching on the Net. Test cases for "evil"
+ regular expressions can be obtained from here.
+
+
+
+ Skills:
+ This is a project for a student with an interest in theory and some
+ reasonable programming skills. The project can be easily implemented
+ in languages like
+ Scala,
+ ML,
+ Haskell,
+ Python, etc.
+
+
+
+
+ -
[CU3] Machine Code Generation for a Simple Compiler
+
+
+ Description:
+ Compilers translate high-level programs that humans can read and write into
+ efficient machine code that can be run on a CPU or virtual machine.
+ I recently implemented a very simple compiler for a very simple functional
+ programming language following this
+ paper
+ (also described here).
+ My Scala code for this compiler is
+ here.
+ The compiler can deal with simple programs involving natural numbers, such
+ as Fibonacci numbers
+ or factorial (but it can be easily extended - that is not the point).
+
+
+
+ While the hard work has been done (understanding the two papers above),
+ my compiler only produces some idealised machine code. For example I
+ assume there are infinitely many registers. The goal of this
+ project is to generate machine code that is more realistic and can
+ run on a CPU, like x86, or run on a virtual machine, say the JVM.
+ This will probably give a speedup of a thousand times in comparison to
+ my naive machine code and virtual machine. The project
+ requires digging into the literature about real CPUs and generating
+ real machine code.
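
To make the target concrete, here is a minimal sketch of code generation for a stack machine such as the JVM. It is not taken from the compiler mentioned above: the input format (nested tuples for arithmetic expressions) and the functions compile_expr and run are assumptions made for this example. The mnemonics ldc, iadd, isub and imul are real JVM instructions, but the output is only a list of strings, not a complete class file or assembler input.

```python
def compile_expr(e):
    """Translate an expression into a list of JVM-style stack instructions.

    An expression is either an int or a tuple (op, left, right)
    with op being "+", "-" or "*".
    """
    if isinstance(e, int):
        return [f"ldc {e}"]                      # push a constant
    op, left, right = e
    opcode = {"+": "iadd", "-": "isub", "*": "imul"}[op]
    # operands first, then the operation that consumes them
    return compile_expr(left) + compile_expr(right) + [opcode]

def run(instructions):
    """Tiny stack interpreter, used only to check the generated code.

    Unlike the JVM it does not wrap integers at 32 bits.
    """
    stack = []
    for ins in instructions:
        if ins.startswith("ldc "):
            stack.append(int(ins[4:]))
        else:
            b, a = stack.pop(), stack.pop()
            stack.append({"iadd": a + b, "isub": a - b, "imul": a * b}[ins])
    return stack.pop()
```

For the expression ("+", 1, ("*", 2, 3)) this produces ldc 1, ldc 2, ldc 3, imul, iadd. A stack machine needs no registers at all, which is one reason the JVM is an easier first target than x86.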
+
+
+
+ Literature:
+ There is a lot of literature about compilers
+ (for example this book -
+ I can lend you my copy for the duration of the project). A very good overview article
+ about implementing compilers by
+ Laurie Tratt is
+ here.
+ An introduction to x86 machine code is here.
+ Intel's official manual for the x86 instruction set is
+ here.
+ A simple assembler for the JVM is described here.
+ An interesting twist of this project is to not generate code for a CPU, but
+ for the intermediate language of the LLVM compiler
+ (also described here and
+ here). If you want to see
+ what machine code looks like you can compile your C-program using gcc -S.
+
+
+
+ Skills:
+ This is a project for a student with a deep interest in programming languages and
+ compilers. Since my compiler is implemented in Scala,
+ it would make sense to continue this project in this language. I can be
+ of help with questions and books about Scala.
+ But if Scala is a problem, my code can also be translated quickly into any other functional
+ language.
+
+
+ -
[CU4] Implementation of Register Spilling Algorithms
+
+
+ Description:
+ This project is similar to [CU3]. The emphasis here, however, is on the
+ implementation and comparison of register spilling algorithms, also often called register allocation
+ algorithms. They are part of any respectable compiler. As said
+ in [CU3], however, my simple compiler lacks them and assumes an infinite number of registers instead.
+ Real CPUs however only provide a fixed number of registers (for example
+ x86-64 has 16 general purpose registers). Whenever a program needs
+ to hold more values than there are registers, the values need to be “spilled”
+ into the main memory. Register spilling algorithms try to minimise
+ this spilling, since fetching values from main memory is a costly
+ operation.
+
+
+
+ The classic algorithm for register spilling uses a
+ graph-colouring method.
+ For some time, however, the LLVM compiler
+ used a supposedly more efficient method, called the linear scan allocation method
+ (described
+ here).
+ It was later decided to abandon this method in favour of
+ a
+ greedy register allocation method. It would be nice if this project could find out
+ what the issues are with these methods and implement at least one of them for the
+ simple compiler referenced in [CU3].
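
As a starting point, the basic linear scan idea can be sketched as follows. This is a simplified illustration and not LLVM's implementation: it assumes that every value has a single live interval (start, end) that is already known, that there is at least one register, and that a spilled value simply lives in memory for its whole lifetime. The function name linear_scan and the input format are made up for this example.

```python
def linear_scan(intervals, num_regs):
    """Assign registers to live intervals, spilling when none is free.

    intervals: dict mapping a value name to its (start, end) positions.
    Returns (assignment, spilled): a dict name -> register number
    and a list of names kept in memory. Assumes num_regs >= 1.
    """
    free = list(range(num_regs))
    active = []                      # (end, name) of values holding a register
    assignment, spilled = {}, []
    by_start = sorted(intervals.items(), key=lambda kv: kv[1][0])
    for name, (start, end) in by_start:
        # give back the registers of intervals that have already ended
        for old_end, old_name in list(active):
            if old_end < start:
                active.remove((old_end, old_name))
                free.append(assignment[old_name])
        if free:
            assignment[name] = free.pop()
            active.append((end, name))
            continue
        # no register free: spill whichever interval ends last
        last_end, last_name = max(active)
        if last_end > end:
            assignment[name] = assignment.pop(last_name)
            active.remove((last_end, last_name))
            active.append((end, name))
            spilled.append(last_name)
        else:
            spilled.append(name)
    return assignment, spilled
```

With the intervals a: (0, 10), b: (1, 3), c: (2, 8), d: (4, 6) and two registers, the long-lived value a is spilled and b, c and d share the two registers. A graph-colouring allocator would instead build an interference graph from the same liveness information, so the two approaches can be compared on identical input.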
+
+
+
+ Literature:
+ The graph colouring method is described in Andrew Appel's
+ book on compilers
+ (I can give you my copy of this book, if it is not available in the library).
+ There is also a survey
+ article
+ about register allocation algorithms with further pointers.
+
+
+
+ Skills:
+ Same skills as [CU3].
+
+
+ -
[CU5] A Student Polling System
+
+
+ Description:
+ One of the more annoying aspects of giving a lecture is asking the students
+ a question and, no matter how easy the question is, not
+ receiving an answer. Recently, the online course system
+ Udacity made an art out of
+ asking questions during lectures (see for example the
+ Web Application Engineering
+ course CS253).
+ The lecturer there gives multiple-choice questions as part of the lecture and the students need to
+ click on the appropriate answer. This works very well in the online world.
+ For “real-world” lectures, the department has some
+ clickers
+ (these are little devices that are part of an audience response system). However,
+ they are a logistic nightmare for the lecturer: they need to be distributed
+ during the lecture and collected at the end. Nowadays, when students
+ come with their own laptop or smartphone to lectures, this can
+ be improved.
+
+
+
+ The task of this project is to implement an online student
+ polling system. The lecturer should be able to prepare
+ questions beforehand (encoded as some web-form) and be able to
+ show them during the lecture. The students
+ can give their answers by clicking on the corresponding webpage.
+ The lecturer can then collect the responses online and evaluate them
+ immediately. Such a system is sometimes called
+ HTML voting.
+ There are a number of commercial
+ solutions for this problem, but they are not easy to use (in addition
+ to being ridiculously expensive). A good student can easily improve upon
+ what they provide.
+
+
+
+ The problem of student polling is not as hard as
+ electronic voting,
+ which essentially is still an unsolved problem in Computer Science. The
+ students only need to be prevented from answering a question more than once and thus skewing
+ any statistics. Unlike electronic voting, no audit trail needs to be kept
+ for student polling. Restricting the number of answers can probably be solved
+ by setting appropriate cookies on the students'
+ computers or smartphones.
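
The bookkeeping behind the one-answer-per-student rule might look like the following minimal sketch. It only models the server-side logic and leaves out the web front end. The class Poll and its methods are hypothetical names, and the token stands for whatever value a web framework would store in the cookie. A student who deletes the cookie could still obtain a fresh token, so this is a deterrent rather than a guarantee.

```python
import secrets
from collections import Counter

class Poll:
    def __init__(self, question, choices):
        self.question = question
        self.choices = list(choices)
        self.issued = set()          # tokens handed out to browsers
        self.votes = {}              # token -> chosen answer

    def issue_token(self):
        """Create a random token that the front end would store in a cookie."""
        token = secrets.token_hex(16)
        self.issued.add(token)
        return token

    def vote(self, token, choice):
        """Record a vote; return False if it is rejected."""
        if token not in self.issued:     # unknown or forged token
            return False
        if token in self.votes:          # this token has already answered
            return False
        if choice not in self.choices:   # not one of the offered answers
            return False
        self.votes[token] = choice
        return True

    def results(self):
        """Number of votes per choice, in the order the choices were given."""
        counts = Counter(self.votes.values())
        return {c: counts[c] for c in self.choices}
```

The dictionary returned by results is the kind of data that could then be handed to a charting tool during the lecture.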
+
+
+
+ Literature:
+ The project requires fluency in a web-programming language (for example
+ Javascript,
+ PHP,
+ Java, Python,
+ Go,
+ Scala,
+ Ruby)
+ and possibly a cloud application platform (for example
+ Google App Engine or
+ Heroku).
+ For web-programming the
+ Web Application Engineering
+ course at Udacity is a good starting point
+ to be aware of the issues involved. This course uses Python.
+ To evaluate the answers from the students, Google's
+ Chart Tools
+ might be useful, which are also described in this
+ YouTube video.
+
+
+
+ Skills:
+ In order to provide convenience for the lecturer, this project needs very good web-programming skills. A
+ hacker mentality
+ (see above) is probably very beneficial: web-programming is an area that only emerged recently and
+ many tools still lack maturity. You probably have to experiment a lot with several different
+ languages and tools.
+
+
+ -
[CU6] Implementation of a Distributed Clock-Synchronisation Algorithm developed at NASA
+
+
+ Description:
+ There are many algorithms for synchronising clocks. This
+ paper
+ describes a new algorithm for clocks that communicate by exchanging
+ messages and thereby reach a state in which (within some bound) all clocks are synchronised.
+ A slightly longer and more detailed paper about the algorithm is
+ here.
+ The point of this project is to implement this algorithm and simulate networks of clocks.
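
A simulation harness for such a project could start from something like the sketch below. It does not implement the algorithm from the paper. As a stand-in it uses a plain averaging rule on a ring of clocks, where in each round every clock receives the values of its two neighbours and moves to the average of the three. The functions simulate and spread are hypothetical names, and message delays and clock drift are ignored.

```python
def simulate(clocks, rounds):
    """Run the averaging rule for a number of rounds on a ring of clocks."""
    values = list(clocks)
    n = len(values)
    for _ in range(rounds):
        # all clocks update at once from the values of the previous round
        values = [
            (values[(i - 1) % n] + values[i] + values[(i + 1) % n]) / 3
            for i in range(n)
        ]
    return values

def spread(values):
    """Largest difference between any two clocks."""
    return max(values) - min(values)

if __name__ == "__main__":
    start = [0.0, 5.0, 12.0, 3.0, 7.0, 1.0, 9.0, 4.0]
    for rounds in (0, 5, 20, 50):
        print(rounds, spread(simulate(start, rounds)))
```

Plotting the spread against the number of rounds for different network shapes is one simple way to compare synchronisation algorithms once the real one has been implemented.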
+
+
+
+ Literature:
+ There is a wide range of literature on clock synchronisation algorithms.
+ Some pointers are given in this
+ paper,
+ which describes the algorithm to be implemented in this project. Further pointers
+ are given here.
+
+
+
+ Skills:
+ In order to implement a simulation of a network of clocks, you need to tackle
+ concurrency. You can do this for example in the programming language
+ Scala with the help of the
+ Akka library. This library enables you to send messages
+ between different actors. Here
+ are some examples that explain how to implement exchanging messages between actors.
+
+
+
+