2012/13 MSc Projects

Supervisor: Christian Urban

Email: christian dot urban at kcl dot ac dot uk, Office: Strand Building S1.27

If you are interested in a project, please send me an email and we can discuss details. Please include a short description about your programming skills and Computer Science background in your first email. I will also need your King's username (not student ID) in order to book the project for you. Thanks.

Note that besides being a lecturer in the theoretical part of Computer Science, I am also a passionate hacker … defined as “a person who enjoys exploring the details of programmable systems and stretching their capabilities, as opposed to most users, who prefer to learn only the minimum necessary.” I am always happy to supervise like-minded students.

  • [CU1] Regular Expression Matching and Partial Derivatives

    Description: Regular expressions are extremely useful for many text-processing tasks such as finding patterns in texts, lexing programs, syntax highlighting and so on. Given that regular expressions were introduced in 1950 by Stephen Kleene, you might think regular expressions have since been studied and implemented to death. But you would definitely be mistaken: in fact they are still an active research area. For example this paper about regular expression matching and partial derivatives was presented this summer at the international PPDP'12 conference. The task in this project is to implement the results from this paper.

    The background for this project is that some regular expressions are “evil” and can “stab you in the back” according to this blog post. For example, if you use in Python or in Ruby (probably also in other mainstream programming languages) the innocently looking regular expression a?{28}a{28} and match it, say, against the string aaaaaaaaaaaaaaaaaaaaaaaaaaaa (that is 28 as), you will soon notice that your CPU usage goes to 100%. In fact, Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself: re.py (Python version) and re.rb (Ruby version). You can imagine an attacker mounting a nice DoS attack against your program if it contains such an “evil” regular expression. Actually Scala (and also Java) are almost immune from such attacks as they can deal with strings of up to 4,300 as in less than a second. But if you scale the regular expression and string further to, say, 4,600 as, then you get a StackOverflowError potentially crashing your program.

    On a rainy afternoon, I implemented this regular expression matcher in Scala. It is not as fast as the official one in Scala, but it can match up to 11,000 as in less than 5 seconds without raising any exception (remember Python and Ruby both need nearly 30 seconds to process 28(!) as, and Scala's official matcher maxes out at 4,600 as). My matcher is approximately 85 lines of code and based on the concept of derivatives of regular expressions. These derivatives were introduced in 1964 by Janusz Brzozowski, but according to this paper had been lost in the “sands of time”. The advantage of derivatives is that they side-step completely the usual translations of regular expressions into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular expression matchers in Python and Ruby.

    Now the guys from the PPDP'12-paper mentioned above claim they are even faster than me and can deal with even more features of regular expressions (for example subexpression matching, which my rainy-afternoon matcher cannot). I am sure they thought about the problem much longer than a single afternoon. The task in this project is to find out how good they actually are by implementing the results from their paper. Their approach is based on the concept of partial derivatives introduced in 1994 by Valentin Antimirov. I used them once myself in a paper in order to prove the Myhill-Nerode theorem. So I know they are worth their money. Still, it would be interesting to actually compare their results with my simple rainy-afternoon matcher and potentially “blow away” the regular expression matchers in Python and Ruby (and possibly in Scala too).

    Literature: The place to start with this project is obviously this paper. Traditional methods for regular expression matching are explained in the Wikipedia articles here and here. The authoritative book on automata and regular expressions is by John Hopcroft and Jeffrey Ullmann (available in the library). There is also an online course about this topic by Ullman at Coursera, though IMHO not done with love. Finally, there are millions of other pointers about regular expression matching on the Net. Test cases for “evil” regular expressions can be obtained from here.

    Skills: This is a project for a student with an interest in theory and some reasonable programming skills. The project can be easily implemented in languages like Scala, ML, Haskell, Python, etc.

  • [CU2] Automata Theory in Your Web-Browser

    This project is about web-programming (but not in Java): There are a number of classic algorithms in automata theory (such as the transformation of regular expressions into NFAs and DFAs, automata minimisation, subset construction). All these algorithms involve a fair amount of calculations, which cannot be easily done by hand. There are a few web applications, typically written in Javascript, that animate these calculations, for example this one. But they all have their deficiencies and can be improved with more modern technology. An instance is the impressive animation of Phython code found here.

    There now many useful libraries for JavaScript, for example, this one for graphs or this one for graphics. There are also a number of new programming languages targeting JavaScript, for example TypeScript, CoffeeScript, Dart, Script#, Clojure and so on. The task in this project is to use a web-programming language and suitable library to animate algorithms from automata theory (and also parsing, if wanted). This project is for someone who want to get to know these new languages.

    Literature: The same general literature as in [CU1].

    Skills: This is a project for a student with very good programming and hacking skills. Some knowledge in JavaScript, HTML and CSS cannot hurt. The algorithms from automata theory are fairly standard material.

  • [CU3] Machine Code Generation for a Simple Compiler

    Description: Compilers translate high-level programs that humans can read into efficient machine code that can be run on a CPU or virtual machine. I recently implemented a very simple compiler for a very simple functional programming language following this paper (also described here). My code, written in Scala, of this compiler is here. The compiler can deal with simple programs involving natural numbers, such as Fibonacci or factorial (but it can be easily extended - that is not the point).

    While the hard work has been done (understanding the two papers above), my compiler only produces some idealised machine code. For example I assume there are infinitely many registers. The goal of this project is to generate machine code that is more realistic and can run on a CPU, like x86, or run on a virtual machine, say the JVM. This gives probably a speedup of thousand times in comparison to my naive machine code and virtual machine. The project requires to dig into the literature about real CPUs and generating real machine code.

    Literature: There is a lot of literature about compilers (for example this book - I can lend you my copy for the duration of the project). A very good overview article about implementing compilers by Laurie Tratt is here. An introduction into x86 machine code is here. Intel's official manual for the x86 instruction is here. A simple assembler for the JVM is described here. An interesting twist of this project is to not generate code for a CPU, but for the intermediate language of the LLVM compiler (also described here and here). If you want to see what machine code looks like you can compile your C-program using gcc -S.

    Skills: This is a project for a student with a deep interest in programming languages and compilers. Since my compiler is implemented in Scala, it would make sense to continue this project in this language. I can be of help with questions and books about Scala. But if Scala is a problem, my code can also be translated quickly into any other language.

  • [CU4] Implementation of Register Spilling Algorithms

    Description: This project is similar to [CU3]. The emphasis here, however, is on the implementation and comparison of register spilling algorithms, also often called register allocation algorithms. They are part of any respectable compiler. As said in [CU3] my simple compiler lacks them and assumes an infinite amount of registers instead. Real CPUs however only provide a fixed amount of registers (for example x86-64 has 16 general purpose registers). Whenever a program needs to hold more values than registers, the values need to be “spilled” into the main memory. Register spilling algorithms try to minimise this spilling, since fetching values from main memory is a costly operation.

    The classic algorithm for register spilling uses a graph-colouring method. However, for some time the LLVM compiler used a supposedly more efficient method, called the linear scan allocation method (described here). However, it was later decided to abandon this method in favour of a greedy register allocation method. It would be nice if this project can find out what the issues are with these methods and implement at least one of them for the simple compiler referenced in [CU3].

    Literature: The graph colouring method is described in Andrew Appel's book on compilers (I can give you my copy of this book, if it is not available in the library). There is also a survey article about register allocation algorithms with further pointers.

    Skills: Same skills as [CU3].

  • [CU5] A Student Polling System

    Description: One of the more annoying aspects of giving a lecture is to ask a question to the students and no matter how easy the questions is to not receive an answer. Recently, the online course system Udacity made an art out of asking questions during lectures (see for example the Web Application Engineering course CS253). The lecturer there gives multiple-choice questions as part of the lecture and the students need to click on the appropriate answer. This works very well in the online world. For “real-world” lectures, the department has some clickers (these are little devices part of an audience response systems). However, they are a logistic nightmare for the lecturer: they need to be distributed during the lecture and collected at the end. Nowadays, where students come with their own laptop or smartphone to lectures, this can be improved.

    The task of this project is to implement an online student polling system. The lecturer should be able to prepare questions beforehand (encoded as some web-form) and be able to show them during the lecture. The students can give their answers by clicking on the corresponding webpage. The lecturer can then collect the responses online and evaluate them immediately. Such a system is sometimes called HTML voting. There are a number of commercial solutions for this problem, but they are not easy to use (in addition to being ridiculously expensive). A good student can easily improve upon what they provide.

    The problem of student polling is not as hard as electronic voting, which essentially is still an unsolved problem in Computer Science. The students only need to be prevented from answering question more than once thus skewing any statistics. Unlike electronic voting, no audit trail needs to be kept for student polling. Restricting the number of answers can probably be solved by setting appropriate cookies on the students' computers or smart phones.

    However, there is one restriction that makes this project harder than it seems at first sight: The department does not allow large server applications and databases to be run on calcium, which is the central server in the department. So the problem should be solved with as few resources as possible on the “back-end” collecting the votes.

    Literature: The project requires fluency in a web-programming language (for example JavaScript, PHP, Java, Python, Go, Scala, Ruby). For web-programming the Web Application Engineering course at Udacity is a good starting point to be aware of the issues involved. This course uses Python. To evaluate the answers from the student, Google's Chart Tools might be useful, which are also described in this youtube video.

    Skills: This project needs very good web-programming skills. A hacker mentality (see above) is probably very beneficial: web-programming is an area that only emerged recently and many tools still lack maturity. You probably have to experiment a lot with several different languages and tools.

  • [CU6] Implementation of a Distributed Clock-Synchronisation Algorithm developed at NASA

    Description: There are many algorithms for synchronising clocks. This paper describes a new algorithm developed by NASA for clocks that communicate by exchanging messages and thereby reach a state in which (within some bound) all clocks are synchronised. A slightly longer and more detailed paper about the algorithm is here. The point of this project is to implement this algorithm and simulate a networks of clocks.

    Literature: There is a wide range of literature on clock synchronisation algorithms. Some pointers are given in this paper, which describes the algorithm to be implemented in this project. Pointers are given also here.

    Skills: In order to implement a simulation of a network of clocks, you need to tackle concurrency. You can do this for example in the programming language Scala with the help of the Akka library. This library enables you to send messages between different actors. Here are some examples that explain how to implement exchanging messages between actors.

  • [CU7] An Infrastructure for Dispalying and Animating Code in a Web-Browser

    Description: The project aim is to implement an infrastructure for displaying and animating code in a web-browser. The infrastructure should be agnostic with respect to the programming language, but should be configurable. Something smaller than projects such as here, here, here, here

Last modified: Wed Sep 12 16:30:03 GMT 2012 [Validate this page.]