Martin Riese:
Parallelization of RazerS
Abstract
In this work the existing implementation of the read mapping tool RazerS should be parallelized and improved in the handling of repeats.
Contents
Description:
RazerS [1] is a fast read mapping tool with sensitivity control, i.e. it provides a seamless tradeoff between sensitivity and running time and guarantees not to lose more reads than specified. It is based on a specialization of the SWIFT [2] algorithm and uses a k-mer index of the reads to search for common k-mer in parallelograms.
In this work 3 possible ways of parallelization should be examined:
- The scanning of genomic regions could be distributed
- The set of reads and there indices could be distributed
- The verification processes could be distributed
Therefore different multi-core frameworks should be examined, e.g. Threads, Mutexes, ... in SeqAn; OpenMP; Intel Threading Building Blocks [4]
Another goal of this work is to improve the handling of repeats. In contrast to non-repeat regions, low-complexity repeats and reads of these repeats cause a huge number of counting events. This number is linear in the length of repeat region and the number of reads covering this region. Different strategies should be developed to overcome this bottleneck, e.g. using repeat annotations / detection, compressing repeats, generating matches from a representative repeat substring.
The verification process of RazerS should be extended by a quality based DP alignment and a banded Myers bitvector verification.
Literature:
[1] Weese, D., Emde, A.-K., Rausch, T., Döring, A., and Reinert, K. (2009). Razers–fast read mapping with sensitivity control. Genome Res, 19(9):1646–54.
[2] Rasmussen K., Stoye J., Myers G. (2006). Efficient q-gram filters for finding all ε-matches over a given length. J Comput Biol 13:296–308.
[3] Morgulis A., Gertz E.M., Schäffer A.A., Agarwala R. (2006) A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol. 2006 Jun;13(5):1028-40.