BSc Projects

Supervisor: Christian Urban

Email: christian dot urban at kcl dot ac dot uk, Office: Bush House N7.07

Note that besides being a lecturer at the theoretical end of Computer Science, I am also a passionate hacker … defined as “a person who enjoys exploring the details of programmable systems and stretching their capabilities, as opposed to most users, who prefer to learn only the minimum necessary.” I am always happy to supervise like-minded students.

In 2013/14, I was nominated by the students for the best BSc project supervisor and best MSc project supervisor awards in the NMES faculty. Somehow I won both. In 2014/15 I was nominated again for the best MSc project supervisor, but did not win it. ;o)

[CU1] Regular Expressions, Lexing and Derivatives

Description: Regular expressions are extremely useful for many text-processing tasks, such as finding patterns in hostile network traffic, lexing programs, syntax highlighting and so on. Given that regular expressions were introduced in the 1950s by Stephen Kleene, you might think regular expressions have since been studied and implemented to death. But you would definitely be mistaken: in fact they are still an active research area. Off the top of my head, I can give you at least ten research papers that appeared in the last few years. For example, this paper about regular expression matching and derivatives was presented in 2014 at the international FLOPS conference. Another paper by my PhD student and me was presented in 2016 at the international ITP conference. The task in this project is to implement these results and use them for lexing.

The background for this project is that some regular expressions are “evil” and can “stab you in the back” according to this blog post. For example, if you use the innocent-looking regular expression a?{28}a{28} in Python or in Ruby (or in a number of other mainstream programming languages) and match it, say, against the string aaaaaaaaaaaaaaaaaaaaaaaaaaaa (that is 28 as), you will soon notice that your CPU usage goes to 100%. In fact, Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself: catastrophic.py (Python version) and catastrophic.rb (Ruby version). Here is a similar problem with the regular expression (a*)*b in Java: catastrophic.java
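The effect can be observed from Scala as well, since Scala can call Java's backtracking matcher directly. The following is only a rough sketch for experimenting: the object and function names are made up for this illustration, the input (a string consisting only of as, so the final b can never match) is an assumption about what triggers the problem, and the timings you see will depend on your JVM version and machine.

```scala
object BacktrackingDemo {
  import java.util.regex.Pattern

  // Matches n copies of 'a' against (a*)*b. Because there is no 'b' in
  // the input, the match must fail, but a backtracking matcher may try
  // very many ways of splitting the as before giving up.
  def evilMatch(n: Int): Boolean =
    Pattern.compile("(a*)*b").matcher("a" * n).matches()

  // Returns the match result together with the elapsed time in milliseconds.
  def timed(n: Int): (Boolean, Long) = {
    val start = System.nanoTime()
    val result = evilMatch(n)
    (result, (System.nanoTime() - start) / 1000000)
  }

  def main(args: Array[String]): Unit =
    for (n <- 10 to 24 by 2) {
      val (result, ms) = timed(n)
      println(s"n = $n: matched = $result, time = $ms ms")
    }
}
```

If the matcher behaves exponentially on this input, the reported time should grow sharply with each step of n, even though the answer is always the same.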

You can imagine an attacker mounting a nice DoS attack against your program if it contains such an “evil” regular expression. But it can also happen by accident: on 20 July 2016 the website Stack Exchange was knocked offline because of an evil regular expression. One of their engineers talks about this in this video. A similar problem needed to be fixed in the Atom editor. A few implementations of regular expression matchers are almost immune to such problems. For example, Scala can deal with strings of up to 4,300 as in less than a second. But if you scale the regular expression and string further to, say, 4,600 as, then you get a StackOverflowError, potentially crashing your program. Moreover (besides the "minor" problem of being painfully slow), according to this report nearly all regular expression matchers using the POSIX rules are actually buggy.

On a rainy afternoon, I implemented this regular expression matcher in Scala. It is not as fast as the official one in Scala, but it can match up to 11,000 as in less than 5 seconds without raising any exception (remember Python and Ruby both need nearly 30 seconds to process 28(!) as, and Scala's official matcher maxes out at 4,600 as). My matcher is approximately 85 lines of code and based on the concept of derivatives of regular expressions. These derivatives were introduced in 1964 by Janusz Brzozowski, but according to this paper had been lost in the “sands of time”. The advantage of derivatives is that they side-step completely the usual translations of regular expressions into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular expression matchers in Python, Java and Ruby.
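To give a flavour of the idea, here is a minimal sketch of a derivative-based matcher. It is an illustration written for this description, not the rainy-afternoon matcher itself: the constructor names (ZERO, ONE, NTIMES and so on), the simplification rules and the helper evil are all assumptions of this sketch. The derivative of r with respect to a character c describes what is left to match after c has been read, and a string matches if the regular expression obtained after reading all its characters can match the empty string.

```scala
object DerivativeMatcher {
  sealed trait Rexp
  case object ZERO extends Rexp                      // matches no string
  case object ONE extends Rexp                       // matches only the empty string
  case class CHAR(c: Char) extends Rexp
  case class ALT(r1: Rexp, r2: Rexp) extends Rexp    // r1 | r2
  case class SEQ(r1: Rexp, r2: Rexp) extends Rexp    // r1 followed by r2
  case class STAR(r: Rexp) extends Rexp              // r*
  case class NTIMES(r: Rexp, n: Int) extends Rexp    // r{n}, exactly n copies

  // Can r match the empty string?
  def nullable(r: Rexp): Boolean = r match {
    case ZERO => false
    case ONE => true
    case CHAR(_) => false
    case ALT(r1, r2) => nullable(r1) || nullable(r2)
    case SEQ(r1, r2) => nullable(r1) && nullable(r2)
    case STAR(_) => true
    case NTIMES(r1, n) => n == 0 || nullable(r1)
  }

  // Brzozowski derivative of r with respect to the character c.
  def der(c: Char, r: Rexp): Rexp = r match {
    case ZERO => ZERO
    case ONE => ZERO
    case CHAR(d) => if (c == d) ONE else ZERO
    case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
    case SEQ(r1, r2) =>
      if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
      else SEQ(der(c, r1), r2)
    case STAR(r1) => SEQ(der(c, r1), STAR(r1))
    case NTIMES(r1, n) =>
      if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
  }

  // Light simplification so that derivatives do not grow needlessly.
  def simp(r: Rexp): Rexp = r match {
    case ALT(r1, r2) => (simp(r1), simp(r2)) match {
      case (ZERO, s2) => s2
      case (s1, ZERO) => s1
      case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)
    }
    case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
      case (ZERO, _) => ZERO
      case (_, ZERO) => ZERO
      case (ONE, s2) => s2
      case (s1, ONE) => s1
      case (s1, s2) => SEQ(s1, s2)
    }
    case _ => r
  }

  // Take the derivative for each character in turn, then test nullability.
  def matches(r: Rexp, s: String): Boolean =
    nullable(s.foldLeft(r)((acc, c) => simp(der(c, acc))))

  // The "evil" regular expression a?{n}a{n}, with a? encoded as (a | ONE).
  def evil(n: Int): Rexp =
    SEQ(NTIMES(ALT(CHAR('a'), ONE), n), NTIMES(CHAR('a'), n))
}
```

For example, DerivativeMatcher.matches(DerivativeMatcher.evil(28), "a" * 28) checks the string from above against the evil regular expression. No automaton is built and no backtracking takes place; the matcher makes a single pass over the string.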

Now the authors of the FLOPS'14 paper mentioned above claim they are even faster than me and can deal with even more features of regular expressions (for example subexpression matching, which my rainy-afternoon matcher cannot). I am sure they thought about the problem much longer than a single afternoon. The task in this project is to find out how good they actually are by implementing the results from their paper. Their approach to regular expression matching is also based on the concept of derivatives. I once used derivatives very successfully for something completely different in a paper about the Myhill-Nerode theorem. So I know they are worth their money. Still, it would be interesting to actually compare their results with my simple rainy-afternoon matcher and potentially “blow away” the regular expression matchers in Python, Ruby and Java (and possibly in Scala too). The application would be to implement a fast lexer for programming languages, or to improve the network traffic analysers in the tools Snort and Bro.

Literature: The place to start with this project is obviously this paper and this one. Traditional methods for regular expression matching are explained in the Wikipedia articles here and here. The authoritative book on automata and regular expressions is by John Hopcroft and Jeffrey Ullman (available in the library). There is also an online course about this topic by Ullman at Coursera, though IMHO not done with love. There are millions of other pointers about regular expression matching on the Web. I found the chapter on Lexing in this online book very helpful. Finally, it will be of great help for this project to take part in my Compilers and Formal Languages module (6CCS3CFL). Test cases for “evil” regular expressions can be obtained from here.

Skills: This is a project for a student with an interest in theory and with good programming skills. The project can be easily implemented in functional languages like Scala, F#, ML, Haskell, etc. Python and other non-functional languages can also be used, but seem much less convenient. If you do attend my Compilers and Formal Languages module, that would obviously give you a head-start with this project.
[CU1a] Grammars and Derivative-Based Parsing Algorithms

Parsing is an old nut. Generations of software developers have needed to parse data or text. There are zillions of links, tools, papers and textbooks about parsing. One particular book contains something like 700 different algorithms, nicely analysed and described. Surely, parsing must be a solved problem. Or is it? Laurie Tratt has a blog post about Parsing: The Solved Problem That Isn't. IMHO parsing is still a wide open field and not solved at all. PEG parsing, error reporting, error correction and runtime, to name just a few, are aspects that seem to cause headaches to developers and to researchers.

A recent paper follows an idea for regular expressions: it adapts the notion of derivatives of regular expressions to grammars. The idea is to implement in a functional programming language the parsing algorithm proposed in this paper and to try it out with some sample data. There is also a recent PhD thesis about derivative-based parsing: Efficient Parsing with Derivatives and Zippers.

Literature: paper, paper

Skills: See [CU1].
[CU2] A Compiler for a small Programming Language

Description: Compilers translate high-level programs that humans can read and write into efficient machine code that can be run on a CPU or virtual machine. A compiler for a simple functional language generating assembly code is described here. I recently implemented a very simple compiler for an even simpler functional programming language following this paper (also described here). My code, written in Scala, for this compiler is here. The compiler can only deal with simple programs involving natural numbers, such as Fibonacci numbers or the factorial function (but it can be easily extended - that is not the point). The interesting feature of this compiler is that it can also deal with closure conversions and hoisting of nested functions.

While the hard work has been done (understanding the two papers above), my compiler only produces some idealised machine code. For example I assume there are infinitely many registers. The goal of this project is to generate machine code that is more realistic and can run on a CPU, like x86, or run on a virtual machine, say the JVM. You could also compile to the LLVM-IR. This probably gives a speedup of a thousand times in comparison to my naive machine code and tiny virtual machine. The project requires digging into the literature about real machine code.
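To illustrate what generating code for the JVM involves, here is a minimal sketch that compiles arithmetic expressions into a list of JVM stack instructions. It is not taken from my compiler: the expression datatype, the function compile and the environment mapping variable names to local-variable slots are all assumptions made for this illustration. Only the instruction mnemonics (ldc, iload, iadd, isub, imul) are real JVM instructions. A complete compiler would additionally have to wrap such instructions into a method and class that an assembler accepts.

```scala
object StackCodeGen {
  sealed trait Exp
  case class Num(n: Int) extends Exp
  case class Var(x: String) extends Exp
  case class Aop(op: String, e1: Exp, e2: Exp) extends Exp

  // env maps variable names to local-variable slots (hypothetical layout).
  // The generated code leaves the value of the expression on top of the
  // operand stack, so no registers need to be allocated.
  def compile(e: Exp, env: Map[String, Int]): List[String] = e match {
    case Num(n) => List(s"ldc $n")
    case Var(x) => List(s"iload ${env(x)}")
    case Aop(op, e1, e2) =>
      val instr = op match {
        case "+" => "iadd"
        case "-" => "isub"
        case "*" => "imul"
        case _ => sys.error(s"unknown operator $op")
      }
      // evaluate both operands first, then apply the operation
      compile(e1, env) ++ compile(e2, env) ++ List(instr)
  }
}
```

For example, compiling 1 + x * 2 with x stored in slot 0, that is StackCodeGen.compile(Aop("+", Num(1), Aop("*", Var("x"), Num(2))), Map("x" -> 0)), yields the instructions ldc 1, iload 0, ldc 2, imul, iadd.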

An alternative is to not generate machine code, but build a compiler that compiles to JavaScript. This is the language that is supported by most browsers and therefore is a favourite vehicle for Web-programming. Some call it the scripting language of the Web. Unfortunately, JavaScript is also probably one of the worst languages to program in (being designed and released in a hurry). But it can be used as a convenient target for translating programs from other languages. In particular there are two very optimised subsets of JavaScript that can be used for this purpose: one is asm.js and the other is emscripten. For a few years now there has even been the official Webassembly. There is a tutorial for emscripten and an impressive demo which runs the Unreal Engine 3 in a browser with spectacular speed. This was achieved by compiling the C-code of the Unreal Engine to the LLVM intermediate language and then translating the LLVM code to JavaScript/Webassembly.

Literature: There is a lot of literature about compilers (for example this book - I can lend you my copy for the duration of the project, or this online book). A very good overview article about implementing compilers by Laurie Tratt is here. An online book about the Art of Assembly Language is here. An introduction to x86 machine code is here. Intel's official manual for the x86 instruction set is here. Two assemblers for the JVM are described here and here. An interesting twist of this project is to not generate code for a CPU, but for the intermediate language of the LLVM compiler (also described here). If you want to see what machine code looks like, you can compile your C-program using gcc -S.

If JavaScript is chosen as a target instead, then there are plenty of tutorials on the Web. Here is a list of free books on JavaScript. A project from which you can draw inspiration is this Lisp-to-JavaScript translator. Here is another such project. And another in less than 100 lines of code. Coffeescript is a similar project except that it is already quite mature. And finally, not to forget TypeScript developed by Microsoft. The main difference between these projects and this one is that they translate into relatively high-level JavaScript code; none of them use the much lower-level asm.js and emscripten.

Skills: This is a project for a student with a deep interest in programming languages and compilers. Since my compiler is implemented in Scala, it would make sense to continue this project in this language. I can be of help with questions and books about Scala. But if Scala is a problem, my code can also be translated quickly into any other functional language. Again, it will be of great help for this project to take part in my Compilers and Formal Languages module (6CCS3CFL).

PS: Compiler projects consistently received high marks in the past. I have supervised eight so far and many of them received a mark above 70% - one was even awarded a prize. However, in order to achieve anything better than a passing mark, you need to go beyond the compiler presented in the CFL-module. For example you could implement
1. first-class functions and closure conversions
2. recursive datatypes
3. interesting type-systems
[CU2a] Webassembly Interpreter / Compiler

Webassembly is a recently agreed standard for speeding up web applications in browsers. In this project the aim is to implement an interpreter or compiler for webassembly. There are already reference interpreters, but people take different views, for example implementing a Forth language on top of webassembly. What is good about webassembly is that it is a rather simple format, which can be generated quite easily, unlike Java class files, which need some head-standing when you generate them.
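To show how simple the core of such an interpreter can be, here is a minimal sketch covering a handful of webassembly's integer instructions (i32.const, local.get, i32.add, i32.sub and i32.mul). The Scala datatype, the function run and the representation of locals as a vector are assumptions of this sketch. A real interpreter would also need to parse the binary or text format and support control flow, memory and function calls.

```scala
object TinyWasm {
  sealed trait Instr
  case class I32Const(n: Int) extends Instr   // i32.const n
  case class LocalGet(i: Int) extends Instr   // local.get i
  case object I32Add extends Instr            // i32.add
  case object I32Sub extends Instr            // i32.sub
  case object I32Mul extends Instr            // i32.mul

  // Executes the instructions one by one on an operand stack (head = top)
  // and returns the final stack. Scala's Int is a 32-bit integer that
  // wraps around on overflow, like the i32 type.
  @annotation.tailrec
  def run(code: List[Instr], locals: Vector[Int], stack: List[Int] = Nil): List[Int] =
    (code, stack) match {
      case (Nil, _) => stack
      case (I32Const(n) :: rest, _) => run(rest, locals, n :: stack)
      case (LocalGet(i) :: rest, _) => run(rest, locals, locals(i) :: stack)
      case (I32Add :: rest, y :: x :: st) => run(rest, locals, (x + y) :: st)
      case (I32Sub :: rest, y :: x :: st) => run(rest, locals, (x - y) :: st)
      case (I32Mul :: rest, y :: x :: st) => run(rest, locals, (x * y) :: st)
      case _ => sys.error("stack underflow")
    }
}
```

For example, TinyWasm.run(List(LocalGet(0), I32Const(2), I32Mul, I32Const(1), I32Add), Vector(20)) computes 20 * 2 + 1 and returns List(41).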

A reference interpreter for webassembly.

Skills: See [CU1].

2018-09-24 12:12:35 by Christian Urban
