Workshop on Compression, Texts, and Algorithms (WCTA 2015)

Location: Friday, 4 Sep 2015. Room K6.63 (6th floor, King's Building)

Chairs: Travis Gagie, University of Helsinki, and Tatiana Starikovskaya, University of Bristol

Slides and photos: Dropbox - WCTA

 

Keynote talk, Chair: Giovanni Manzini

09:00-10:00

Richard Durbin, Wellcome Trust Sanger Institute

Using Suffix Array Based Data Structures in Computational Genomics


Abstract: This talk will describe how the BWT and FM-indexes are used in read mapping (e.g., with BWA), sequence assembly (e.g., with SGA and Fermi), and haplotype matching and storage (with the Positional BWT).
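As a toy illustration of the machinery behind these tools (not BWA's actual implementation), the following Python sketch builds a BWT by sorting rotations and counts pattern occurrences with FM-index backward search:

```python
# Toy BWT + FM-index backward search (quadratic construction; for demo only).

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations."""
    text += "$"  # unique terminator, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_index(bwt_str):
    """C[c] = number of symbols smaller than c; occ(c, i) = count of c in bwt[:i]."""
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)
    def occ(c, i):
        return bwt_str[:i].count(c)  # a real index would use a rank structure
    return C, occ

def backward_search(bwt_str, pattern):
    """Return the number of occurrences of pattern in the indexed text."""
    C, occ = fm_index(bwt_str)
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("GATTACA")
print(backward_search(b, "TA"))  # 1 occurrence of "TA" in "GATTACA"
```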


Biography: Richard Durbin, FRS, is Acting Head of Computational Genomics at the Wellcome Trust Sanger Institute and leader of the Genome Informatics group. He studied mathematics at Cambridge and earned a PhD on the development and organization of the nervous system in C. elegans. He has developed numerous methods for computational sequence analysis, co-authored a textbook on this subject, and co-leads the international 1000 Genomes Project. He was a joint winner of the Mullard Award of the Royal Society in 1994 (for work on the confocal microscope), won the Lord Lloyd of Kilgerran Award of the Foundation for Science and Technology in 2004, and was elected a Fellow of the Royal Society in 2004 (for contributions to computational biology) and a member of the European Molecular Biology Organization (EMBO) in 2009.

10:00-10:20

Coffee Break

Session on Bioinformatics, Chair: Giovanna Rosone

10:20-10:45

Jarno Alanko, University of Helsinki (with Djamal Belazzougui and Fabio Cunial)

A Framework for Space-Efficient Metagenomic Read Clustering


Abstract: A metagenomic sample is a set of DNA fragments sampled from an environment. We call metagenomic clustering the problem of partitioning a metagenomic sample into sets that correspond to distinct taxonomic units. It can be applied, for instance, to estimate the number of species in a sample, or as a preprocessing step in metagenome assembly. We present a space-efficient, parallel framework that can be used to implement a number of key steps in metagenomic clustering, such as filtering out reads from low-coverage species and merging reads that share strings of a specific length or maximal repeats of any length. The framework is based on an algorithm that traverses the suffix-link tree of a set of strings in parallel, using two Burrows-Wheeler transforms. Using our framework, we are able to scale a state-of-the-art clustering pipeline so that it takes less than a tenth of its original space, while using a similar amount of time.
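The BWT-based traversal itself is beyond a short sketch, but the clustering step it supports - merging reads that share a string of a specific length - can be illustrated naively with a k-mer table and union-find (this is not the authors' space-efficient method):

```python
# Naive read clustering: merge reads sharing any substring of length k.

def cluster_reads(reads, k):
    parent = list(range(len(reads)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    seen = {}  # k-mer -> first read containing it
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            kmer = read[j:j + k]
            if kmer in seen:
                union(i, seen[kmer])
            else:
                seen[kmer] = i
    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(map(sorted, clusters.values()))

print(cluster_reads(["ACGTAC", "GTACGG", "TTTTTT"], 4))  # [[0, 1], [2]]
```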

10:45-11:10

Broňa Brejová, Comenius University (with Petra Kubincová, Michal Petrucha and Tomáš Vinař)

Data Structures for Mapping Between Genomes


Abstract: When comparing genomes of related species, the first step is often the construction of a whole-genome alignment which contains information about corresponding parts of these genomes. Such alignments are then used for mapping genomic coordinates of regions of interest (e.g. genes) between pairs of genomes. In this talk, I will discuss the problem of space-efficient data structures supporting such mapping and present some preliminary work in this direction.
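As a minimal illustration of the mapping problem (not the data structures under study), the sketch below represents a whole-genome alignment as sorted colinear blocks and maps a coordinate with binary search; the block values are hypothetical:

```python
# Map a position through alignment blocks (src_start, dst_start, length);
# real tools use chain/net files and compressed data structures.
import bisect

def map_position(blocks, pos):
    """Map a source-genome position to the target genome, or None if unaligned."""
    starts = [b[0] for b in blocks]
    i = bisect.bisect_right(starts, pos) - 1
    if i < 0:
        return None
    src, dst, length = blocks[i]
    if pos < src + length:
        return dst + (pos - src)
    return None

blocks = [(0, 100, 50), (80, 200, 30)]  # hypothetical alignment blocks
print(map_position(blocks, 10))  # inside the first block -> 110
print(map_position(blocks, 60))  # unaligned gap -> None
```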

11:10-11:35

Nicola Prezza, University of Udine (with Giuseppe Narzisi and Bud Mishra)

ONTRC: Aligning Nanopore Events on a Reference Genome


Abstract: Nanopore Sequencing (NS) is a novel technology which aims to read single long DNA molecules at single-base resolution. With reasonable accuracy, it could thus enable haplotype-resolved whole-genome assembly, characterization of structural variants in a population, and detection of complex genomic rearrangements such as translocations in cancer genomes. NS technology is based on a simple biophysical principle: detecting the electrical signal as a DNA molecule, traveling through a nanometer-sized pore, modulates the ion current in the pore. Consequently, NS can be inexpensive both to build and to operate. The main cost lies in interpreting the ion flow, a process named base-calling because it reconstructs the hypothetical DNA sequence(s) that most probably produced the signal. Currently, base-calling of nanopore events is highly error-prone (yielding sequences with 70% to 85% average identity with the reference) and remains computationally challenging. Despite this, NS appears to be an extremely promising technology capable of reading long (tens of kbp) molecules at single-base resolution, with the potential to detect single-base modifications such as cytosine methylation. Here, we focus on an ongoing joint work with NYU's CIMS and NYGC that promises to improve nanopore base-calling quality by applying the TotalReCaller (TRC) technology. TRC improves the quality of base-calling by using a genomic reference as a Bayesian prior. The main idea underlying this line of attack is to align the nanopore electrical signal against a reference genome using a dynamic programming algorithm (a Hidden Markov Model) on the suffix trie of the reference. The trie encodes all substrings of the corpus and can be efficiently implemented with an FM-index (for a single genome) or a repetition-aware index (for a pan-genome), e.g. an LZ- or run-length-BWT index, or a stringome, a generalized compressed suffix array.
Our strategy merges base-calling and alignment in a single step and can be used to perform real-time variant calling, SNP phasing over long genomic regions, and identification of structural variations.

11:35-12:00

Szymon Grabowski, Lodz University of Technology (with Aleksander Cisłak)

Time is Cache: On Cache-Friendly Full-Text Indexing


Abstract: Modern computing platforms differ vastly from the classic von Neumann architecture. One of the most important (and relatively easy to model) differences is the memory hierarchy and the non-constant cost of access to its various levels. Locality of accesses is preferred, and while in theory this phenomenon is captured in the I/O (Aggarwal and Vitter, 1988) and cache-oblivious (Frigo et al., 1999) models, with optimal results achieved in many cases, it is our feeling that, for some problems, practical variants of these techniques lag behind. In this talk we revisit the classic problem of exact pattern matching with a full-text index, focusing on cache-friendly techniques and expressing the search complexities in terms of cache misses. In particular, we present several q-gram-based FM-index (Ferragina and Manzini, 2000) variants, under the umbrella term FM-bloated (as the space requirements of these variants are Ω(n log n) bits). Preliminary experiments show their practicality, at least when RAM is plentiful.
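To make the q-gram idea concrete (this only illustrates the general space-for-locality trade-off, not the actual FM-bloated layout), here is a toy index that answers a pattern query from one contiguous occurrence list per q-gram:

```python
# Toy q-gram index: one hash lookup returns a contiguous candidate list,
# trading extra space for cache-friendly verification scans.

def build_qgram_index(text, q):
    index = {}
    for i in range(len(text) - q + 1):
        index.setdefault(text[i:i + q], []).append(i)
    return index

def search(text, index, q, pattern):
    """Find pattern occurrences by verifying candidates from its first q-gram."""
    if len(pattern) < q:  # fall back to a plain scan for short patterns
        return [i for i in range(len(text) - len(pattern) + 1)
                if text[i:i + len(pattern)] == pattern]
    candidates = index.get(pattern[:q], [])
    return [i for i in candidates if text[i:i + len(pattern)] == pattern]

text = "abracadabra"
idx = build_qgram_index(text, 2)
print(search(text, idx, 2, "abra"))  # [0, 7]
```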

12:00-14:00

Lunch Break

Tutorial, Chair: Yasuo Tabei

14:00-15:00

Simon Gog, Karlsruhe Institute of Technology

Compact and Succinct Data Structures - From Theory to Practice


Abstract: For decades, index structures have been built on top of data to enable users to carry out queries efficiently. For instance, suffix trees or arrays are built on top of a text to answer pattern matching queries in time independent of the text length. Unfortunately, these traditional pointer-based index structures often take significantly more space than the original data and therefore cannot be used in scenarios where the data itself is not much smaller than the available main memory. Over the last 25 years, researchers have invented space-efficient counterparts for many index structures which use not much more space than the original data and have the same query complexity in theory. In this talk we will review popular examples of compact and succinct structures - ranging from bit vectors over wavelet trees to compressed suffix trees - and show how they can easily be used in applications by employing the Succinct Data Structure Library (SDSL). We will further show how to use the library's facilities to analyse, measure, and monitor the time and space requirements of structures. Finally, we will learn how more complex structures can be composed and integrated into the existing framework.
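SDSL itself is C++; as a language-neutral illustration of the most basic building block the tutorial starts from, here is a Python sketch of a bit vector with rank support via sampled prefix counts (a simplified cousin of SDSL's rank support; the class name and block size are arbitrary):

```python
# Bit vector with rank support: sampled prefix popcounts plus a short
# in-block scan, the core trick behind succinct rank structures.

class RankBitVector:
    BLOCK = 64  # sampling interval (arbitrary for this sketch)

    def __init__(self, bits):
        self.bits = bits
        self.samples = [0]  # prefix popcount at every BLOCK boundary
        count = 0
        for i, b in enumerate(bits, 1):
            count += b
            if i % self.BLOCK == 0:
                self.samples.append(count)

    def rank1(self, i):
        """Number of 1-bits in bits[0:i]."""
        block = i // self.BLOCK
        return self.samples[block] + sum(self.bits[block * self.BLOCK:i])

bv = RankBitVector([1, 0, 1, 1, 0] * 40)  # 200 bits, 120 ones
print(bv.rank1(200))  # 120
print(bv.rank1(5))    # 3
```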


Biography: Simon Gog obtained his PhD from Ulm University where he was working on space-efficient index data structures with applications in Bioinformatics. The Succinct Data Structure Library project was started in Ulm and continued at the University of Melbourne, where he was working as a PostDoc on compressed external index structures. Currently, Simon is working at the Karlsruhe Institute of Technology on compact index structures for applications in the area of Information Retrieval.

15:00-15:20

Coffee Break

Session on Stringology, Chair: Jouni Sirén

15:20-15:45

Jérémy Barbay, University of Chile

A Review of Beyond Worst Case (for Fixed Size) Results, and the Choice of a Name


Abstract: In the area of data structures, the concept of a compressed data structure is well defined: these are data structures which may take a small amount of additional space in the worst case, but which take much less space on large families of instances. In the area of algorithms, there are some algorithms which take slightly more time than optimal algorithms in the worst case (sometimes as much as a constant factor), but which take much less time on large families of instances. Defining a good classification of existing results often leads to interesting questions and new results (e.g. while some examples of succinct indexes existed before their definition, many more were discovered afterwards), whereas having distinct names for the same concept causes confusion. In the case of such "accelerated" algorithms, the classification is rather untidy, with many related names with overlapping scopes (e.g. Adaptive Algorithms, Distribution-Sensitive Algorithms, Parameterized Complexity). I will present a short review of such "beyond worst case" (for fixed size) results and how they have been classified so far, and propose some new classification schemes inspired by the field of compressed data structures, hoping to clarify the field and start a convergence of the vocabulary being used.

15:45-16:10

Allyx Fontaine, University of Bristol (with Raphaël Clifford, Ely Porat, Benjamin Sach and Tatiana Starikovskaya)

The k-Mismatch Problem Revisited


Abstract: We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows:

  • Our first result is a deterministic O(n k² log k / m + n polylog m) time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000.
  • We then give a randomised and online algorithm which runs in the same time complexity but requires only O(k² polylog m) space in total.
  • Next we give a randomised (1 + ε)-approximation algorithm for the streaming k-mismatch problem which uses O(k² polylog m / ε²) space and runs in O(polylog m / ε²) worst-case time per arriving symbol.
  • Finally we combine our new results to derive a randomised O(k² polylog m) space algorithm for the streaming k-mismatch problem which runs in O(sqrt(k) log k + polylog m) worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).
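For reference, the problem statement above can be made concrete with a direct O(nm)-time implementation (the algorithms in the talk are, of course, far faster):

```python
# Naive k-mismatch: at each alignment, report the Hamming distance if it is
# at most k, and "No" otherwise.

def k_mismatch(text, pattern, k):
    n, m = len(text), len(pattern)
    out = []
    for i in range(n - m + 1):
        dist = sum(1 for a, b in zip(text[i:i + m], pattern) if a != b)
        out.append(dist if dist <= k else "No")
    return out

print(k_mismatch("abcabd", "abd", 1))  # [1, 'No', 'No', 0]
```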

16:10-16:35

Tomasz Kociumaka, University of Warsaw (with Paweł Gawrychowski, Jakub Radoszewski, Wojciech Rytter and Tomasz Waleń)

Universal Reconstruction of a String


Abstract: Many properties of a string can be viewed as sets of dependencies between substrings of the string expressed in terms of substring equality. We design a linear-time algorithm which finds a solution to an arbitrary system of such constraints: a generic string satisfying a system of substring equations. This provides a general tool for reconstructing a string from different kinds of repetitions or symmetries present in the string, in particular, from runs or from maximal palindromes. The recursive structure of our algorithm in some aspects resembles the suffix array construction by Kärkkäinen and Sanders (J. ACM, 2006).
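A quadratic-time sketch of the underlying idea (the talk's algorithm achieves linear time): treat each substring equation as positionwise character equalities, union-find the positions, and assign one letter per equivalence class to obtain a generic solution:

```python
# Reconstruct a generic string from substring equations (a, b, length),
# each meaning s[a..a+length) = s[b..b+length).
import string

def reconstruct(n, equations):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b, length in equations:
        for d in range(length):
            parent[find(a + d)] = find(b + d)
    letters, out = {}, []
    for i in range(n):
        r = find(i)
        if r not in letters:
            letters[r] = string.ascii_lowercase[len(letters) % 26]
        out.append(letters[r])
    return "".join(out)

# One equation s[0..3) = s[4..7), e.g. a repetition constraint.
print(reconstruct(7, [(0, 4, 3)]))  # 'abcdabc'
```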

16:35-17:00

Alberto Ordóñez, University of A Coruña (with Simon Gog and Gonzalo Navarro)

Mixing Fully-Compressed and Grammar-Compressed Suffix Trees


Abstract: The Fully Compressed Suffix Tree (FCST) was known as the smallest CST by far, using around 4-5 bits per character (bpc), but also the slowest by far, taking milliseconds per operation. A recent reimplementation by Simon Gog et al., together with improvements by Navarro and Russo, makes the FCST orders of magnitude faster, although still significantly slower than larger CSTs. The FCST samples some suffix tree nodes and from those infers the information on the rest of the nodes. For repetitive collections, it is possible to achieve even lower space. The best current representative is the Grammar-Compressed Suffix Tree (GCST), which uses 2-3 bpc and answers queries within times similar to the faster FCST. The GCST applies grammar compression to the CST components. In this talk, we will explore the possibility of combining both CSTs by applying grammar compression to the sampled suffix tree data. We expect this to yield a smaller and/or faster CST for highly repetitive data. We will show some experimental results that hint at what kind of improvements we can expect.