In 2013/14, I was nominated by the students
for the best BSc project supervisor and best MSc project supervisor awards in the NMS
faculty. Somehow I won both ;o)
-
[CU1] Regular Expression Matching, Lexing and Derivatives
Description:
Regular expressions
are extremely useful for many text-processing tasks, such as finding patterns in texts,
lexing programs, syntax highlighting and so on. Given that regular expressions were
introduced in the 1950s by Stephen Kleene,
you might think regular expressions have since been studied and implemented to death. But you would definitely be
mistaken: in fact they are still an active research area. For example
this paper
about regular expression matching and derivatives was presented just last summer at the international
FLOPS'14 conference. The task in this project is to implement their results and use them for lexing.
The background for this project is that some regular expressions are
“evil”
and can “stab you in the back” according to
this blog post.
For example, if you use in Python or
in Ruby (or in a number of other mainstream programming languages, according to this
blog) the
innocent-looking regular expression a?{28}a{28}
and match it, say, against the string
aaaaaaaaaaaaaaaaaaaaaaaaaaaa
(that is, 28 a's), you will soon notice that your CPU usage goes to 100%. In fact,
Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself:
re.py (Python version) and
re.rb
(Ruby version). You can imagine an attacker
mounting a nice DoS attack against
your program if it contains such an “evil” regular expression. Actually
Scala (and also Java) are almost immune to such
attacks as they can deal with strings of up to 4,300 a's in less than a second. But if you scale
the regular expression and string further to, say, 4,600 a's, then you get a StackOverflowError
potentially crashing your program. Moreover (besides the "minor" problem of being painfully slow) according to this
report
nearly all POSIX regular expression matchers are actually buggy.
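The blow-up described above can be reproduced at a small, safe scale. The following is only a sketch in Python: it assumes the pattern is written with explicit parentheses, (a?){n}a{n}, since Python's re module is expected to reject a quantifier applied directly to a?, and the helper name evil_match_time is made up for this illustration. The absolute timings depend on the machine and the Python version; the point is only how quickly they grow with n.

```python
import re
import time

def evil_match_time(n):
    """Match (a?){n}a{n} against a string of n a's; return (matched, seconds)."""
    pattern = re.compile("(a?){%d}a{%d}" % (n, n))
    text = "a" * n
    start = time.perf_counter()
    matched = pattern.fullmatch(text) is not None
    return matched, time.perf_counter() - start

# Small values of n only: a backtracking matcher may roughly double its work with each step.
for n in (5, 10, 15, 18):
    matched, seconds = evil_match_time(n)
    print(f"n={n:2d} matched={matched} time={seconds:.4f}s")
```

Increasing n by a few steps at a time shows the trend without tying up the CPU for long.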
On a rainy afternoon, I implemented
this
regular expression matcher in Scala. It is not as fast as the official one in Scala, but
it can match up to 11,000 a's in less than 5 seconds without raising any exception
(remember Python and Ruby both need nearly 30 seconds to process 28(!) a's, and Scala's
official matcher maxes out at 4,600 a's). My matcher is approximately
85 lines of code and based on the concept of
derivatives of regular expressions.
These derivatives were introduced in 1964 by
Janusz Brzozowski, but according to this
paper had been lost in the “sands of time”.
The advantage of derivatives is that they side-step completely the usual
translations of regular expressions
into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
expression matchers in Python and Ruby.
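To make the idea of derivatives concrete, here is a minimal sketch of a derivative-based matcher in Python. It is not the 85-line Scala matcher mentioned above; the class names (Zero, One, Chr, Alt, Seq, Star, NTimes) and the small simplifying constructors alt and seq are assumptions chosen for this illustration. The derivative of r with respect to a character c describes what remains to be matched after c has been read; a string matches if the regular expression left over after all characters have been consumed accepts the empty string.

```python
from dataclasses import dataclass

class Rexp:
    """Base class for regular expressions."""

@dataclass(frozen=True)
class Zero(Rexp):      # matches no string at all
    pass

@dataclass(frozen=True)
class One(Rexp):       # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr(Rexp):       # matches the single character c
    c: str

@dataclass(frozen=True)
class Alt(Rexp):       # r1 or r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):       # r1 followed by r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):      # zero or more copies of r
    r: Rexp

@dataclass(frozen=True)
class NTimes(Rexp):    # exactly n copies of r, written r{n}
    r: Rexp
    n: int

def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, (Zero, Chr)):
        return False
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    if isinstance(r, NTimes):
        return r.n == 0 or nullable(r.r)
    raise TypeError(f"not a regular expression: {r!r}")

def alt(r1, r2):
    """Build r1 | r2, dropping Zero and exact duplicates."""
    if isinstance(r1, Zero):
        return r2
    if isinstance(r2, Zero) or r1 == r2:
        return r1
    return Alt(r1, r2)

def seq(r1, r2):
    """Build r1 followed by r2, simplifying Zero and One away."""
    if isinstance(r1, Zero) or isinstance(r2, Zero):
        return Zero()
    if isinstance(r1, One):
        return r2
    if isinstance(r2, One):
        return r1
    return Seq(r1, r2)

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = seq(der(c, r.r1), r.r2)
        return alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return seq(der(c, r.r), r)
    if isinstance(r, NTimes):
        return Zero() if r.n == 0 else seq(der(c, r.r), NTimes(r.r, r.n - 1))
    raise TypeError(f"not a regular expression: {r!r}")

def matches(r, s):
    """Take the derivative for each character, then test for the empty string."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

With these definitions the "evil" regular expression can be written as Seq(NTimes(Alt(Chr("a"), One()), 28), NTimes(Chr("a"), 28)) and matched against 28 a's with matches. No automaton is constructed and nothing is backtracked; whether a matcher stays fast in general depends on how well the intermediate derivatives are simplified, which this sketch does only in the most basic way.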
Now the authors of the
FLOPS'14 paper mentioned
above claim their matcher is even faster than mine and can deal with even more features of regular expressions
(for example subexpression matching, which my rainy-afternoon matcher cannot). I am sure they thought
about the problem much longer than a single afternoon. The task
in this project is to find out how good they actually are by implementing the results from their paper.
Their approach to regular expression matching is also based on the concept of derivatives.
I used derivatives very successfully once for something completely different in a
paper
about the Myhill-Nerode theorem.
So I know they are worth their money. Still, it would be interesting to actually compare their results
with my simple rainy-afternoon matcher and potentially “blow away” the regular expression matchers
in Python and Ruby (and possibly in Scala too). The application would be to implement a fast lexer for
programming languages.
Literature:
The place to start with this project is obviously this
paper.
Traditional methods for regular expression matching are explained
in the Wikipedia articles
here and
here.
The authoritative book
on automata and regular expressions is by John Hopcroft and Jeffrey Ullman (available in the library).
There is also an online course about this topic by Ullman at
Coursera, though IMHO not
done with love.
Finally, there are millions of other pointers about regular expression
matching on the Web. I found the chapter on Lexing in this
online book very helpful.
Test cases for “evil”
regular expressions can be obtained from here.
Skills:
This is a project for a student with an interest in theory and some
good programming skills. The project can be easily implemented
in functional languages like
Scala,
F#,
ML,
Haskell, etc. Python and other non-functional languages
can also be used, but seem much less convenient. If you attend my Formal Languages and
Automata module, that would obviously give you a head-start with this project.
-
[CU2] A Compiler for a small Programming Language
Description:
Compilers translate high-level programs that humans can read and write into
efficient machine code that can be run on a CPU or virtual machine.
A compiler for a simple functional language generating X86 code is described
here.
I recently implemented a very simple compiler for an even simpler functional
programming language following this
paper
(also described here).
My code for this compiler, written in Scala, is
here.
The compiler can deal with simple programs involving natural numbers, such
as Fibonacci numbers or factorial (but it can be easily extended - that is not the point).
While the hard work has been done (understanding the two papers above),
my compiler only produces some idealised machine code. For example I
assume there are infinitely many registers. The goal of this
project is to generate machine code that is more realistic and can
run on a CPU, like x86, or on a virtual machine, say the JVM.
This would probably give a thousandfold speedup in comparison to
my naive machine code and virtual machine. The project
requires digging into the literature about real CPUs and generating
real machine code.
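To give a flavour of what generating code for a stack-based virtual machine such as the JVM involves, here is a small sketch in Python. It is not the Scala compiler mentioned above: the nested-tuple expression format and the functions compile_expr and run are assumptions made for this illustration. The mnemonics ldc, iadd, isub and imul are borrowed from the JVM, but the little interpreter is only a stand-in for a real virtual machine.

```python
# Map source-level operators to JVM-style stack instructions.
OPCODES = {"+": "iadd", "-": "isub", "*": "imul"}

def compile_expr(e):
    """Compile an int or a tuple (op, left, right) into a list of stack instructions."""
    if isinstance(e, int):
        return [("ldc", e)]                      # push a constant
    op, left, right = e
    # Operands first, operator last: the operator finds its arguments on the stack.
    return compile_expr(left) + compile_expr(right) + [(OPCODES[op],)]

def run(code):
    """Execute the instruction list on a simple operand stack."""
    stack = []
    for instr in code:
        if instr[0] == "ldc":
            stack.append(instr[1])
            continue
        right = stack.pop()
        left = stack.pop()
        if instr[0] == "iadd":
            stack.append(left + right)
        elif instr[0] == "isub":
            stack.append(left - right)
        elif instr[0] == "imul":
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction: {instr!r}")
    return stack.pop()

code = compile_expr(("+", 2, ("*", 3, 4)))
print(code)
print(run(code))
```

A stack machine needs no register allocation at all, which is one reason why the JVM can be a more forgiving first target than a real CPU with a fixed number of registers.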
An alternative is to not generate machine code, but build a compiler that compiles to
JavaScript. This is the language that is supported by most
browsers and therefore is a favourite
vehicle for Web-programming. Some call it the scripting language of the Web.
Unfortunately, JavaScript is also probably one of the worst
languages to program in (being designed and released in a hurry). But it can be used as a convenient target
for translating programs from other languages. In particular there are two
technologies that can be used for this purpose:
one is asm.js, a highly optimisable subset of JavaScript, and the other is
emscripten, a compiler that translates LLVM code to JavaScript.
There is a tutorial for emscripten
and an impressive demo which runs the
Unreal Engine 3
in a browser with spectacular speed. This was achieved by compiling the
C-code of the Unreal Engine to the LLVM intermediate language and then translating the LLVM
code to JavaScript.
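The basic idea of using JavaScript as a compilation target can be sketched in a few lines of Python. The expression format below (nested tuples tagged with "if" and "call") and the functions to_js and def_to_js are assumptions made for this illustration. Note that the output is ordinary high-level JavaScript; targeting asm.js would require a much more restricted and explicitly typed form of code.

```python
def to_js(e):
    """Translate an expression tree into a JavaScript expression string."""
    if isinstance(e, int):
        return str(e)
    if isinstance(e, str):                       # a variable name
        return e
    tag = e[0]
    if tag in ("+", "-", "*", "<", "=="):
        js_op = "===" if tag == "==" else tag    # use strict equality in JavaScript
        return f"({to_js(e[1])} {js_op} {to_js(e[2])})"
    if tag == "if":                              # ("if", condition, then, else)
        return f"({to_js(e[1])} ? {to_js(e[2])} : {to_js(e[3])})"
    if tag == "call":                            # ("call", function_name, [arguments])
        args = ", ".join(to_js(a) for a in e[2])
        return f"{e[1]}({args})"
    raise ValueError(f"unknown expression: {e!r}")

def def_to_js(name, params, body):
    """Translate a function definition into a JavaScript function declaration."""
    return f"function {name}({', '.join(params)}) {{ return {to_js(body)}; }}"

factorial = ("if", ("==", "n", 0),
             1,
             ("*", "n", ("call", "fact", [("-", "n", 1)])))
print(def_to_js("fact", ["n"], factorial))
```

The printed declaration can be pasted into a browser console and called like any other JavaScript function.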
Literature:
There is a lot of literature about compilers
(for example this book -
I can lend you my copy for the duration of the project, or this
online book). A very good overview article
about implementing compilers by
Laurie Tratt is
here.
An online book about the Art of Assembly Language is
here.
An introduction to x86 machine code is here.
Intel's official manual for the x86 instruction set is
here.
A simple assembler for the JVM is described here.
An interesting twist on this project is to not generate code for a CPU, but
for the intermediate language of the LLVM compiler
(also described here). If you want to see
what machine code looks like you can compile your C-program using gcc -S.
If JavaScript is chosen as a target instead, then there are plenty of tutorials on the Web.
Here is a list of free books on JavaScript.
A project from which you can draw inspiration is this
Lisp-to-JavaScript
translator. Here is another such project.
And another in less than 100 lines of code.
CoffeeScript is a similar project
except that it is already quite mature. And finally, not to
forget TypeScript developed by Microsoft. The main
difference between these projects and this one is that they translate into relatively high-level
JavaScript code; none of them use the much lower-level asm.js and
emscripten.
Skills:
This is a project for a student with a deep interest in programming languages and
compilers. Since my compiler is implemented in Scala,
it would make sense to continue this project in this language. I can be
of help with questions and books about Scala.
But if Scala is a problem, my code can also be translated quickly into any other functional
language.
PS: Compiler projects consistently received high marks in the past.
I have supervised five so far and none of them received a mark below 70% - one even was awarded a prize.
-
[CU3] Slide-Making in the Web-Age
The standard technology for writing scientific papers in Computer Science is to use
LaTeX, a document preparation
system originally implemented by Donald Knuth
and Leslie Lamport.
LaTeX produces very pleasant-looking documents, can deal nicely with mathematical
formulas and is very flexible. If you are interested, here
is a side-by-side comparison between Word and LaTeX (which LaTeX “wins” with 18 out of 21 points).
Computer scientists not only use LaTeX for documents,
but also for slides (really, nobody who wants to be cool uses Keynote or Powerpoint).
Although used widely, LaTeX seems nowadays a bit dated for producing
slides. Unlike documents, which are typically “static” and published in a book or journal,
slides often contain changing contents that might first only be partially visible and
only later be revealed as the “story” of a talk or lecture demands.
Also slides often contain animated algorithms where each state in the
calculation is best explained by highlighting the changing data.
It seems HTML and JavaScript are much better suited for generating
such animated slides. This page
links to 22 slide-generating programs using this combination of technologies.
However, the problem with all of these projects is that they depend heavily on the users being
able to write JavaScript, CSS or HTML...not something one would like to depend on given that
“normal” users likely only have a LaTeX background. The aim of this project is to invent a
very simple language that is inspired by LaTeX and then generate from code written in this language
slides that can be displayed in a web-browser.
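As a rough idea of what such a translation could look like, here is a sketch in Python. The input syntax (a \slide{...} command that starts a slide, followed by \item lines) and the function name slides_to_html are invented for this illustration and are not a proposal for the final language; a real solution would also have to deal with overlays, animations and mathematics.

```python
import html
import re

def slides_to_html(source):
    """Translate a tiny LaTeX-like slide description into HTML sections."""
    slides = []                                      # list of (title, bullet points)
    for line in source.splitlines():
        line = line.strip()
        title = re.fullmatch(r"\\slide\{(.*)\}", line)
        if title:
            slides.append((title.group(1), []))      # a new slide begins
        elif line.startswith("\\item ") and slides:
            slides[-1][1].append(line[len("\\item "):])
    sections = []
    for title, items in slides:
        bullets = "".join(f"<li>{html.escape(item)}</li>" for item in items)
        sections.append(
            f"<section><h1>{html.escape(title)}</h1><ul>{bullets}</ul></section>"
        )
    return "\n".join(sections)

example = "\\slide{Derivatives}\n\\item fast\n\\item simple & short"
print(slides_to_html(example))
```

Each slide becomes one section element, which a small amount of JavaScript and CSS could then show one at a time.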
This sounds complicated, but there is already some help available:
MathJax is a JavaScript library that can
be used to display mathematical text, for example
When \(a \ne 0\), there are two solutions to \(ax^2 + bx + c = 0\) and they are
\(x = {-b \pm \sqrt{b^2-4ac} \over 2a}\).
by writing code in the familiar LaTeX-way. This can be reused.
Another such library is KaTeX.
There are also plenty of JavaScript
libraries for graphical animations (for example
Raphael,
SVG.JS,
Bonsaijs,
JSXGraph). The inspiration for how the user should be able to write
slides could come from the LaTeX packages Beamer
and PGF/TikZ. A slide-making project from which
inspiration can be drawn is hyhyhy.
Skills:
This is a project that requires good knowledge of JavaScript. You need to be able to
parse a language and translate it to a suitable part of JavaScript using
appropriate libraries. Tutorials for JavaScript are here.
A parser generator for JavaScript is here. There are probably also
others. If you want to avoid JavaScript there are a number of alternatives: for example the
Elm
language has been designed especially for implementing interactive animations with ease, which would be
very convenient for this project.
-
[CU4] An Online Student Voting System
Description:
One of the more annoying aspects of giving a lecture is asking the students a question
and, no matter how easy the question is, not
receiving any answer. The online course system
Udacity, in contrast, made an art out of
asking questions during lectures (see for example the
Web Application Engineering
course CS253).
The lecturer there gives multiple-choice questions as part of the lecture and the students need to
click on the appropriate answer. This works very well in the online world.
For “real-world” lectures, the department has some
clickers
(these are little devices which form a part of an audience response system). However,
they are a logistical nightmare for the lecturer: they need to be distributed
during the lecture and collected at the end. Nowadays, where students
come with their own laptop or smartphone to lectures, this can
be improved.
The task of this project is to implement an online student
polling system. The lecturer should be able to prepare
questions beforehand (encoded as some web-form) and be able to
show them during the lecture. The students
can give their answers by clicking on the corresponding webpage.
The lecturer can then collect the responses online and evaluate them
immediately. Such a system is sometimes called
HTML voting.
There are a number of commercial
solutions for this problem, but they are not easy to use (in addition
to being ridiculously expensive). A good student can easily improve upon
what they provide.
The problem of student polling is not as hard as
electronic voting,
which essentially is still an unsolved problem in Computer Science. The
students only need to be prevented from answering a question more than once, thus skewing
any statistics. Unlike electronic voting, no audit trail needs to be kept
for student polling. Restricting the number of answers can probably be solved
by setting appropriate cookies on the students'
computers or smartphones.
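The bookkeeping behind the cookie idea can be sketched independently of any web framework. In the following Python sketch, the class Poll and its methods are made up for illustration: the token returned by issue_token stands for the value that the server would store in a cookie on the student's device, and a vote is counted only the first time a given token is used. A student who deletes the cookie would simply receive a new token, so this prevents accidental double-voting rather than determined cheating.

```python
import uuid

class Poll:
    """In-memory tally for one multiple-choice question."""

    def __init__(self, question, options):
        self.question = question
        self.counts = {option: 0 for option in options}
        self.issued = set()      # tokens handed out (the cookie values)
        self.used = set()        # tokens that have already voted

    def issue_token(self):
        """Create the random value that would be stored in the student's cookie."""
        token = uuid.uuid4().hex
        self.issued.add(token)
        return token

    def vote(self, token, option):
        """Count the vote and return True, or return False if it is rejected."""
        if option not in self.counts:
            raise ValueError(f"unknown option: {option!r}")
        if token not in self.issued or token in self.used:
            return False         # forged token or second attempt
        self.used.add(token)
        self.counts[option] += 1
        return True

poll = Poll("Is a?{28}a{28} an evil regular expression?", ["yes", "no"])
cookie = poll.issue_token()
print(poll.vote(cookie, "yes"))   # first vote with this cookie
print(poll.vote(cookie, "no"))    # second attempt with the same cookie
print(poll.counts)
```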
Literature:
The project requires fluency in a web-programming language (for example
JavaScript,
Go,
Scala). However JavaScript with
the Node.js runtime seems to be best suited for the job.
Here is a tutorial on Node.js for beginners.
For web-programming the
Web Application Engineering
course at Udacity is a good starting point
to be aware of the issues involved. This course uses Python.
To evaluate the answers from the students, Google's
Chart Tools
might be useful, which is also described in this
youtube video.
Skills:
In order to provide convenience for the lecturer, this project needs very good web-programming skills. A
hacker mentality
(see above) is probably also very beneficial: web-programming is an area that only emerged recently and
many tools still lack maturity. You probably have to experiment a lot with several different
languages and tools.
-
[CU5] Raspberry Pi's and Arduinos
Description:
This project is for true hackers! Raspberry Pi's
are small Linux computers the size of a credit-card and only cost £26 (see picture on the left below). They were introduced
in 2012 and people went crazy...well some of them. There is a
Google+ community about Raspberry Pi's that has more
than 177k followers. It is hard to keep up with what people do with these small computers. The possibilities
seem to be limitless. The main resource for Raspberry Pi's is here.
There are magazines dedicated to them and tons of
books (not to mention
floods of online material).
Google just released a
framework
for web-programming on Raspberry Pi's turning them into webservers.
Arduinos are slightly older (from 2005) but still very cool (see picture on the right below). They
are small single-board micro-controllers that can talk to various external gadgets (sensors, motors, etc). Since Arduinos
are open-software and open-hardware there are many clones and add-on boards. Like for the Raspberry Pi, there
is a lot of material available about Arduinos.
The main reference is here. Like the Raspberry Pi's, the good thing about
Arduinos is that they can be powered with simple AA-batteries.
I have two such Raspberry Pi's including wifi-connectors and two cameras.
I also have two Freakduino Boards that are Arduinos extended with wireless communication. I can lend them to responsible
students for one or two projects. However, the aim is to first come up with an idea for a project. Popular projects are
automated temperature sensors, network servers, robots, web-cams (here
is a web-cam directed at the Shard that can
tell
you whether it is raining or cloudy). There are plenty more ideas listed
here for Raspberry Pi's and
here for Arduinos.
There are essentially two kinds of projects: One is purely software-based. Software projects for Raspberry Pi's are often
written in Python, but since these are Linux-capable computers any other
language would do as well. You can also write your own operating system as done
here. For example the students
here developed their own bare-metal OS and then implemented
a chess-program on top of it (have a look at their very impressive
youtube video).
The other kind of project is a combination of hardware and software; usually attaching some sensors
or motors to the Raspberry Pi or Arduino. This might require some soldering or what is called
a bread-board. But be careful before choosing a project
involving new hardware: these devices
can be destroyed (if “Vin connected to GND” or “drawing more than 30mA from a GPIO”
does not make sense to you, you should probably stay away from such a project).
Skills:
Well, you must be a hacker; happy to make things. Your desk might look like the photo below on the left.
The photo below on the right shows an earlier student project which connects wirelessly a wearable Arduino (packaged
in a "self-3d-printed" watch) to a Raspberry Pi seen in the background. The Arduino in the foreground takes measurements of
heart rate and body temperature; the Raspberry Pi collects this data and makes it accessible via a simple
web-service.
-
[CU6] An Infrastructure for Displaying and Animating Code in a Web-Browser
Description:
The project aim is to implement an infrastructure for displaying and
animating code in a web-browser. The infrastructure should be agnostic
with respect to the programming language, but should be configurable.
I envisage something smaller than the projects
here (for Python),
here (for Java),
here (for multiple languages),
here (for HTML),
here (for JavaScript),
and here (for Scala).
The tasks in this project are (1) to lex and parse languages and (2) to write an interpreter.
The goal is to implement this as much as possible in a language-agnostic fashion.
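One way to keep the lexing part language-agnostic is to drive a generic lexer by a table of token definitions, so that supporting another language only means supplying another table. The Python sketch below illustrates this; make_lexer, the token names and the table WHILE_SPEC (for an imaginary small while-language) are assumptions made for this example.

```python
import re

def make_lexer(token_spec):
    """Build a lexer from (token_name, regex) pairs; earlier pairs take priority."""
    master = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in token_spec))

    def lex(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = master.match(text, pos)
            if m is None or m.end() == pos:
                raise SyntaxError(f"cannot lex input at position {pos}")
            if m.lastgroup != "SKIP":                # drop whitespace
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        return tokens

    return lex

# Configuration for an imaginary small while-language.
WHILE_SPEC = [
    ("SKIP",    r"\s+"),
    ("KEYWORD", r"\b(?:while|if|then|else|do)\b"),
    ("NUM",     r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r":=|==|[-+*/<>]"),
    ("PAREN",   r"[(){}]"),
    ("SEMI",    r";"),
]

lex_while = make_lexer(WHILE_SPEC)
print(lex_while("while x < 10 do x := x + 1"))
```

The token names double as categories for syntax highlighting, which is the first thing a code display in a browser needs.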
Skills:
Good skills in lexing and language parsing, as well as being fluent with web programming (for
example JavaScript).
-
[CU7] Implementation of a Distributed Clock-Synchronisation Algorithm developed at NASA
Description:
There are many algorithms for synchronising clocks. This
paper
describes a new algorithm for clocks that communicate by exchanging
messages and thereby reach a state in which (within some bound) all clocks are synchronised.
A slightly longer and more detailed paper about the algorithm is
here.
The point of this project is to implement this algorithm and simulate networks of clocks.
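A simulation of a network of clocks can start out very simply. The Python sketch below does not implement the algorithm from the paper; it uses plain neighbour averaging as a stand-in, ignores clock drift and message delays, and all names (sync_round, ring, spread, simulate) are made up for this illustration. It only shows the shape of such a simulation: clocks exchange their values in rounds, and the spread between the fastest and slowest clock is tracked over time.

```python
def sync_round(clocks, neighbours):
    """One exchange: every clock adopts the mean of its own and its neighbours' values."""
    updated = []
    for i, own in enumerate(clocks):
        received = [clocks[j] for j in neighbours[i]]
        updated.append((own + sum(received)) / (1 + len(received)))
    return updated

def spread(clocks):
    """Difference between the largest and the smallest clock value."""
    return max(clocks) - min(clocks)

def ring(n):
    """Topology in which clock i talks only to clocks i-1 and i+1 (modulo n)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def simulate(clocks, neighbours, rounds):
    """Run several rounds and record the spread before and after each round."""
    history = [spread(clocks)]
    for _ in range(rounds):
        clocks = sync_round(clocks, neighbours)
        history.append(spread(clocks))
    return clocks, history

final, history = simulate([0.0, 3.0, 7.5, 1.2, 9.9], ring(5), rounds=30)
print(f"spread before: {history[0]:.3f}, after 30 rounds: {history[-1]:.6f}")
```

In an actor-based implementation each clock would become an actor and each list lookup a message, but the measured quantity stays the same.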
Literature:
There is a wide range of literature on clock synchronisation algorithms.
Some pointers are given in this
paper,
which describes the algorithm to be implemented in this project. Pointers
are given also here.
Skills:
In order to implement a simulation of a network of clocks, you need to tackle
concurrency. You can do this for example in the programming language
Scala with the help of the
Akka library. This library enables you to send messages
between different actors. Here
are some examples that explain how to implement exchanging messages between actors.
-
[CU8] Proving the Correctness of Programs
I am one of the main developers of the interactive theorem prover
Isabelle. This theorem prover
has been used to establish the correctness of some quite large
programs (for example an operating system).
Together with colleagues from Nanjing, I used this theorem prover to establish the correctness of a
scheduling algorithm, called
Priority Inheritance,
for real-time operating systems. This scheduling algorithm is part of the operating
system that drives, for example, the
Mars rovers.
Actually, the very first Mars rover mission in 1997 did not have this
algorithm switched on and it almost caused a catastrophic mission failure (see
this youtube video here
for an explanation of what happened).
We were able to prove the correctness of this algorithm, but were also able to
establish the correctness of some optimisations in this
paper.
On a much smaller scale, there are a few small programs and underlying algorithms where it
is not really understood whether they always compute a correct result (for example the
regular expression matcher by Sulzmann and Lu in project [CU1]). The aim of this
project is to completely specify an algorithm in Isabelle and then prove it correct (that is,
it always computes the correct result).
Skills:
This project is for a very good student with a knack for theoretical things and formal reasoning.
-
[CU9] Anything Security Related that is Interesting
If you have your own project that is related to security (must be
something interesting), please propose it. We can then have a look
whether it would be suitable for a project.
-
[CU10] A Graphics Framework for JavaScript
-
[CU11] Anything Interesting in the Areas
- Elm (a reactive functional language for animating webpages; have a look at the cool examples, or here for an introduction)
- SMLtoJS (an ML compiler to JavaScript; or anything else related to
sane languages that compile to JavaScript)
- Any statistical data related to Bitcoins (in the spirit of this
paper or
this one; this will probably require extensive knowledge of C or of some
other heavy-duty programming language)
- Anything related to programming languages and formal methods (like
static program analysis)
- Anything related to low-cost, hands-on hardware like Raspberry Pi, Arduino,
Cubieboard
- Anything related to microkernel operating systems, like
Xen or
Mirage OS
- Any kind of applied hacking, for example the Arduino-based keylogger described
here
-
Earlier Projects
I am also open to project suggestions from you. You might find some inspiration from my earlier projects:
BSc 2012/13,
MSc 2012/13,
BSc 2013/14,
MSc 2013/14,
BSc 2014/15,
MSc 2014/15