BSc Projects
Supervisor: Christian Urban
Email: christian dot urban at kcl dot ac dot uk, Office: Bush House N7.07
Note that besides being a lecturer at the theoretical end of Computer Science, I am also a passionate
hacker …
defined as “a person who enjoys exploring the details of programmable systems and
stretching their capabilities, as opposed to most users, who prefer to learn only the minimum
necessary.” I am always happy to supervise like-minded students.
In 2013/14, I was nominated by the students
for the best BSc project supervisor and best MSc project supervisor awards in the NMES
faculty. Somehow I won both. In 2014/15 I was nominated again for the best MSc
project supervisor, but did not win it. ;o)
-
[CU1] Regular Expressions, Lexing and Derivatives
Description:
Regular expressions
are extremely useful for many text-processing tasks, such as finding patterns in hostile
network traffic,
lexing programs, syntax highlighting and so on. Given that regular expressions were
introduced in the 1950s by Stephen Kleene,
you might think regular expressions have since been studied and implemented to death. But you would definitely be
mistaken: in fact they are still an active research area. Off the top of my head, I can give
you at least ten research papers that appeared in the last few years.
For example
this paper
about regular expression matching and derivatives was presented in 2014 at the international
FLOPS conference. Another paper by my PhD student and me was presented in 2016
at the international ITP conference.
The task in this project is to implement these results and use them for lexing.
The background for this project is that some regular expressions are
“evil”
and can “stab you in the back” according to
this blog post.
For example, if you use in Python or
in Ruby (or also in a number of other mainstream programming languages) the
innocent-looking regular expression a?{28}a{28} and match it, say, against the string
aaaaaaaaaaaaaaaaaaaaaaaaaaaa (that is, 28 a's), you will soon notice that your CPU usage goes to 100%. In fact,
Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself:
catastrophic.py (Python version) and
catastrophic.rb
(Ruby version). Here is a similar problem with the regular expression (a*)*b in Java:
catastrophic.java
You can imagine an attacker
mounting a nice DoS attack against
your program if it contains such an “evil” regular expression. But it can also happen by accident:
on 20 July 2016 the website Stack Exchange
was knocked offline because of an evil regular expression. One of their engineers talks about this in this
video. A similar problem needed to be fixed in the
Atom editor.
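To see the blow-up without waiting 30 seconds, the experiment can be scaled down. The following is a minimal sketch, assuming the expression is written as (a?){n}a{n} in Python's re syntax; the function name time_evil_match is made up for this illustration and is not taken from catastrophic.py. The exact timings depend on your machine, but they should grow rapidly as n increases.

```python
import re
import time

def time_evil_match(n):
    """Match (a?){n}a{n} against n copies of 'a'; return (matched, seconds)."""
    # Hypothetical helper for illustration: builds the pattern for a given n.
    pattern = "(a?){%d}a{%d}" % (n, n)
    start = time.perf_counter()
    matched = re.fullmatch(pattern, "a" * n) is not None
    return matched, time.perf_counter() - start

if __name__ == "__main__":
    # Small values of n only; each extra 'a' makes the search noticeably slower.
    for n in range(10, 21, 5):
        matched, secs = time_evil_match(n)
        print(f"n={n:2d} matched={matched} time={secs:.3f}s")
```

The string always matches; what changes is how long the backtracking matcher needs to find that out.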
A few implementations of regular expression matchers are almost immune to such problems.
For example, Scala can deal with strings of up to 4,300 a's in less than a second. But if you scale
the regular expression and string further to, say, 4,600 a's, then you get a StackOverflowError
potentially crashing your program. Moreover (besides the "minor" problem of being painfully slow) according to this
report
nearly all regular expression matchers using the POSIX rules are actually buggy.
On a rainy afternoon, I implemented
this
regular expression matcher in Scala. It is not as fast as the official one in Scala, but
it can match up to 11,000 a's in less than 5 seconds without raising any exception
(remember Python and Ruby both need nearly 30 seconds to process 28(!) a's, and Scala's
official matcher maxes out at 4,600 a's). My matcher is approximately
85 lines of code and based on the concept of
derivatives of regular expressions.
These derivatives were introduced in 1964 by
Janusz Brzozowski, but according to this
paper had been lost in the “sands of time”.
The advantage of derivatives is that they side-step completely the usual
translations of regular expressions
into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
expression matchers in Python, Java and Ruby.
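To give a flavour of the idea, here is a minimal sketch of a derivative-based matcher in Python. It is not the Scala matcher mentioned above; the class names, the NTimes constructor and the simplification rules are assumptions chosen for this illustration. The derivative of r with respect to a character c describes what remains to be matched after c has been read, and a string matches if the regular expression left over after consuming all characters can match the empty string.

```python
from dataclasses import dataclass

class Rexp:
    pass

@dataclass(frozen=True)
class Zero(Rexp):      # matches nothing
    pass

@dataclass(frozen=True)
class One(Rexp):       # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr(Rexp):       # matches the single character c
    c: str

@dataclass(frozen=True)
class Alt(Rexp):       # r1 | r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):       # r1 followed by r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):      # r*
    r: Rexp

@dataclass(frozen=True)
class NTimes(Rexp):    # r{n}, exactly n copies of r
    r: Rexp
    n: int

def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, (Zero, Chr)):
        return False
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    if isinstance(r, NTimes):
        return r.n == 0 or nullable(r.r)
    raise TypeError(f"unknown regular expression: {r!r}")

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    if isinstance(r, NTimes):
        return Zero() if r.n == 0 else Seq(der(c, r.r), NTimes(r.r, r.n - 1))
    raise TypeError(f"unknown regular expression: {r!r}")

def simp(r):
    """Remove redundant Zero/One parts and duplicate alternatives."""
    if isinstance(r, Alt):
        a, b = simp(r.r1), simp(r.r2)
        if a == Zero():
            return b
        if b == Zero():
            return a
        return a if a == b else Alt(a, b)
    if isinstance(r, Seq):
        a, b = simp(r.r1), simp(r.r2)
        if a == Zero() or b == Zero():
            return Zero()
        if a == One():
            return b
        if b == One():
            return a
        return Seq(a, b)
    return r

def matches(r, s):
    """Take one simplified derivative per character, then test nullability."""
    for c in s:
        r = simp(der(c, r))
    return nullable(r)
```

For example, matches(Seq(NTimes(Alt(Chr("a"), One()), 28), NTimes(Chr("a"), 28)), "a" * 28) encodes the "evil" expression from above. No automaton is built and nothing is backtracked; the simplification step is what keeps the intermediate regular expressions small.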
Now the authors from the
FLOPS'14-paper mentioned
above claim they are even faster than me and can deal with even more features of regular expressions
(for example subexpression matching, which my rainy-afternoon matcher cannot). I am sure they thought
about the problem much longer than a single afternoon. The task
in this project is to find out how good they actually are by implementing the results from their paper.
Their approach to regular expression matching is also based on the concept of derivatives.
I used derivatives very successfully once for something completely different in a
paper
about the Myhill-Nerode theorem.
So I know they are worth their money. Still, it would be interesting to actually compare their results
with my simple rainy-afternoon matcher and potentially “blow away” the regular expression matchers
in Python, Ruby and Java (and possibly in Scala too). The application would be to implement a fast lexer for
programming languages, or improve the network traffic analysers in the tools Snort and
Bro.
Literature:
The place to start with this project is obviously this
paper
and this one.
Traditional methods for regular expression matching are explained
in the Wikipedia articles
here and
here.
The authoritative book
on automata and regular expressions is by John Hopcroft and Jeffrey Ullman (available in the library).
There is also an online course about this topic by Ullman at
Coursera, though IMHO not
done with love.
There are millions of other pointers about regular expression
matching on the Web. I found the chapter on Lexing in this
online book very helpful. Finally, it will
be of great help for this project to take part in my Compilers and Formal Languages module (6CCS3CFL).
Test cases for “evil”
regular expressions can be obtained from here.
Skills:
This is a project for a student with an interest in theory and with
good programming skills. The project can be easily implemented
in functional languages like
Scala,
F#,
ML,
Haskell, etc. Python and other non-functional languages
can also be used, but seem much less convenient. If you do attend my Compilers and Formal Languages
module, that would obviously give you a head-start with this project.
-
[CU1a] Grammars and Derivative-Based Parsing Algorithms
Parsing is an old nut. Generations of software developers have needed to parse data or text.
There are zillions of links, tools, papers and textbooks about parsing. One particular
book contains something
like 700 different algorithms, nicely analysed and described. Surely, parsing must be a solved problem. Or is it?
Laurie Tratt has a blog post
about Parsing: The Solved Problem That Isn't. IMHO parsing is still a wide open field and not solved at all.
PEG parsing, error reporting, error correction and runtime, to name just a few, are aspects that seem to cause headaches
for developers and researchers alike.
A recent paper
follows an idea for regular expressions: it adapts the notion of
derivatives of regular expressions to grammars. The idea is to implement in a functional programming language
the parsing algorithm proposed in this paper and to try it out with some sample data. There is also
a recent PhD thesis about derivative-based parsing:
Efficient Parsing with Derivatives and Zippers.
Literature: paper,
paper
Skills: See [CU1].
-
[CU2] A Compiler for a small Programming Language
Description:
Compilers translate high-level programs that humans can read and write into
efficient machine code that can be run on a CPU or virtual machine.
A compiler for a simple functional language generating assembly code is described
here.
I recently implemented a very simple compiler for an even simpler functional
programming language following this
paper
(also described here).
My code, written in Scala, for this compiler is
here.
The compiler can only deal with simple programs involving natural numbers, such
as Fibonacci numbers or the factorial function (but it can be easily extended - that is not the point).
The interesting feature in this compiler is that it can also deal with closure conversions and hoisting of
nested functions.
While the hard work has been done (understanding the two papers above),
my compiler only produces some idealised machine code. For example I
assume there are infinitely many registers. The goal of this
project is to generate machine code that is more realistic and can
run on a CPU, like X86, or run on a virtual machine, say the JVM.
You could also compile to the LLVM-IR.
This would probably give a thousand-fold speedup in comparison to
my naive machine code and tiny virtual machine. The project
requires digging into the literature about real machine code.
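As a small taste of what targeting a real virtual machine involves, here is a minimal sketch in Python that compiles arithmetic expressions into JVM-style stack instructions. Because the JVM is a stack machine, no registers need to be allocated for such expressions. The expression classes Num, Add and Mul and the function compile_expr are made up for this illustration and have nothing to do with my Scala compiler; only the mnemonics ldc, iadd and imul are real JVM instructions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Num:             # integer literal
    n: int

@dataclass(frozen=True)
class Add:             # left + right
    left: object
    right: object

@dataclass(frozen=True)
class Mul:             # left * right
    left: object
    right: object

def compile_expr(e):
    """Return a list of JVM-style stack instructions that evaluate e."""
    if isinstance(e, Num):
        return [f"ldc {e.n}"]             # push the constant
    if isinstance(e, Add):
        # both operands end up on the stack, then iadd replaces them by the sum
        return compile_expr(e.left) + compile_expr(e.right) + ["iadd"]
    if isinstance(e, Mul):
        return compile_expr(e.left) + compile_expr(e.right) + ["imul"]
    raise TypeError(f"unknown expression: {e!r}")
```

Compiling 1 + 2 * 3, that is compile_expr(Add(Num(1), Mul(Num(2), Num(3)))), pushes the three constants and then emits imul followed by iadd. A real back end would additionally need to wrap such instructions in a class and method, for instance via one of the JVM assemblers listed below.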
An alternative is to not generate machine code, but build a compiler that compiles to
JavaScript. This is the language that is supported by most
browsers and therefore is a favourite
vehicle for Web-programming. Some call it the scripting language of the Web.
Unfortunately, JavaScript is also probably one of the worst
languages to program in (being designed and released in a hurry). But it can be used as a convenient target
for translating programs from other languages. In particular there are two
low-level routes that can be used for this purpose:
one is asm.js, a highly optimisable subset of JavaScript, and the other is
emscripten, a toolchain that compiles LLVM code to asm.js or WebAssembly. For
a few years now there has even been the official WebAssembly standard.
There is a tutorial for emscripten
and an impressive demo which runs the
Unreal Engine 3
in a browser with spectacular speed. This was achieved by compiling the
C-code of the Unreal Engine to the LLVM intermediate language and then translating the LLVM
code to JavaScript/WebAssembly.
Literature:
There is a lot of literature about compilers
(for example this book -
I can lend you my copy for the duration of the project, or this
online book). A very good overview article
about implementing compilers by
Laurie Tratt is
here.
An online book about the Art of Assembly Language is
here.
An introduction to x86 machine code is here.
Intel's official manual for the x86 instruction set is
here.
Two assemblers for the JVM are described here
and here.
An interesting twist of this project is to not generate code for a CPU, but
for the intermediate language of the LLVM compiler
(also described here). If you want to see
what machine code looks like you can compile your C-program using gcc -S.
If JavaScript is chosen as a target instead, then there are plenty of
tutorials on the Web.
Here is a list of free books on JavaScript.
A project from which you can draw inspiration is this
Lisp-to-JavaScript
translator. Here is another such project.
And another in less than 100 lines of code.
Coffeescript is a similar project
except that it is already quite mature. And finally, not to
forget, there is TypeScript, developed by Microsoft. The main
difference between these projects and this one is that they translate into relatively high-level
JavaScript code; none of them use the much lower-level asm.js and
emscripten.
Skills:
This is a project for a student with a deep interest in programming languages and
compilers. Since my compiler is implemented in Scala,
it would make sense to continue this project in this language. I can be
of help with questions and books about Scala.
But if Scala is a problem, my code can also be translated quickly into any other functional
language. Again, it will be of great help for this project to take part in
my Compilers and Formal Languages module (6CCS3CFL).
PS: Compiler projects consistently received high marks in the past.
I have supervised eight so far and many of them received a mark above 70%; one was even awarded a prize.
However, in order to achieve anything better than a passing mark, you need to go beyond the
compiler presented in the CFL-module. For example, you could implement
- first-class functions and closure conversions
- recursive datatypes
- interesting type-systems
-
[CU2a] WebAssembly Interpreter / Compiler
WebAssembly is a recently agreed standard for speeding up web applications in browsers. In this
project the aim is to implement an interpreter or compiler for WebAssembly. There are already
reference interpreters,
but people take different approaches, for example implementing a
Forth language on top of WebAssembly.
What is good about WebAssembly is that it is a rather simple format, which can be generated quite
easily, unlike Java class files, which need some head-standing when you generate them.
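To illustrate how simple the core of such an interpreter can be, here is a minimal sketch in Python of a stack machine for a handful of i32 instructions. The function run and the one-instruction-per-string input format are assumptions made for this illustration; a real interpreter would have to decode the binary or text format, validate modules and support locals, control flow and memory, all of which are ignored here.

```python
MASK32 = 0xFFFFFFFF  # i32 arithmetic is performed modulo 2**32

def run(instrs):
    """Execute a list of instruction strings and return the final value stack."""
    stack = []
    for ins in instrs:
        op, *args = ins.split()
        if op == "i32.const":
            stack.append(int(args[0]) & MASK32)
        elif op in ("i32.add", "i32.sub", "i32.mul"):
            b, a = stack.pop(), stack.pop()   # the right operand is on top
            if op == "i32.add":
                res = a + b
            elif op == "i32.sub":
                res = a - b
            else:
                res = a * b
            stack.append(res & MASK32)        # wrap around to 32 bits
        elif op == "drop":
            stack.pop()
        else:
            raise ValueError(f"unsupported instruction: {ins}")
    return stack
```

For instance, run(["i32.const 2", "i32.const 3", "i32.add", "i32.const 4", "i32.mul"]) leaves the single value 20 on the stack. Values are stored as unsigned 32-bit bit patterns, so 0 - 1 is represented as 4294967295.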
A reference interpreter for webassembly.
Skills: See [CU1].