Devising read mapping strategies in KNIME
Area
Read mapping quality, Mappability, KNIME
Topic
The main goal of this thesis is to devise strategies to compute (a) a mappability score for each base in the genome (or for k-mers; see Lee and Schatz [2]) and (b) a mapping quality for each read (as a starting point, see Li, Ruan and Durbin [1]), and to use these scores to annotate the results of best-mappers and all-mappers in conjunction with the read mapper benchmark Rabema [4].
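To make goal (a) concrete, the following minimal sketch computes a per-base score in the spirit of the genome mappability score of [2]: for every position, it averages the probability that the reads covering it are mapped correctly and scales the result to 0..100. The formula is written down here as we understand it from [2] and should be checked against the paper; the function names and the `(start, length, mapq)` input tuples are hypothetical and chosen only for illustration.

```python
def phred_to_correct_prob(mapq):
    """Probability that a mapping with Phred-scaled quality `mapq` is correct."""
    return 1.0 - 10.0 ** (-mapq / 10.0)


def gms_per_position(alignments, genome_length):
    """Sketch of a GMS-like score per genome position.

    alignments: iterable of (start, length, mapq) with 0-based start
                (hypothetical input format, e.g. extracted from a SAM file).
    Returns a list with one score in [0, 100] per position, or None
    where no read covers the position.
    """
    prob_sum = [0.0] * genome_length
    coverage = [0] * genome_length
    for start, length, mapq in alignments:
        p = phred_to_correct_prob(mapq)
        # Clip the read interval to the genome boundaries.
        for pos in range(max(start, 0), min(start + length, genome_length)):
            prob_sum[pos] += p
            coverage[pos] += 1
    return [100.0 * s / c if c else None for s, c in zip(prob_sum, coverage)]
```

For example, `gms_per_position([(0, 3, 30), (2, 3, 0)], 6)` gives a high score where only the confidently mapped read covers a base, a score near 50 where both reads overlap, and `None` for the uncovered last position.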
In particular, a stand-alone tool for mappability scores should be implemented in SeqAn. As a first step, use the genome mappability score (GMS) of Lee and Schatz [2]. Then develop a new mappability measure that does not rely on the information an all-mapper provides, i.e. calculate the ambiguity of all k-mers (e.g. k = 30, with k < |read|). This information can be used to give an upper bound for the ambiguity of a read. The mappability scores could also be used to construct a k-mer table of "good" k-mers with a certain mappability, which in turn could be used to filter reads. The mappability scores can further be used to compute the mapping quality of reads (given a SAM or BAM file). Finally, mapping qualities should be used to compare best-mappers and all-mappers and to correlate mapping quality with (weighted) alignment scores. All tools should be implemented as a pipeline in KNIME.
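The k-mer idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the planned SeqAn implementation: it counts exact k-mer occurrences on the forward strand only (no mismatches, no reverse complement), and all function names are hypothetical. Under exact matching, every occurrence of a read implies an occurrence of each k-mer it contains, so the smallest k-mer count within the read is an upper bound on the number of positions where the read matches exactly.

```python
from collections import Counter


def kmer_counts(genome, k):
    """Count exact occurrences of every k-mer in the genome (forward strand only)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def read_ambiguity_upper_bound(read, counts, k):
    """Upper bound on the number of exact occurrences of `read`.

    A read cannot occur more often than its rarest k-mer.
    """
    if len(read) < k:
        raise ValueError("read must be at least k bases long")
    return min(counts[read[i:i + k]] for i in range(len(read) - k + 1))


def good_kmers(counts, max_occurrences=1):
    """Table of 'good' k-mers: those occurring at most `max_occurrences` times."""
    return {kmer for kmer, c in counts.items() if c <= max_occurrences}
```

A read could then be kept by a filter if it contains at least one k-mer from the `good_kmers` table; the threshold `max_occurrences` is an assumption to be tuned.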
Let different methods select reads:
- RazerS in unique mode
- BWA/Bowtie2
- k-mer and pileup mappability as tiebreaker
- "gold-standard mapping quality" computed in RazerS all mode
Compare these.
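One way to obtain such a gold-standard mapping quality from the complete hit list of an all-mapper is sketched below. It is a deliberately simplified model and not the formula of [1], which also uses base qualities: each hit is weighted by `error_rate ** mismatches`, the best hit's share of the total weight is taken as the probability that it is correct, and the result is Phred-scaled. The function name, the default `error_rate` of 0.01, and the cap of 60 are assumptions made for illustration.

```python
import math


def gold_mapping_quality(mismatch_counts, error_rate=0.01, cap=60):
    """Phred-scaled mapping quality of the best hit among all hits of one read.

    mismatch_counts: number of mismatches (or edit distance) of every hit
                     reported by an all-mapper for this read.
    """
    if not mismatch_counts:
        raise ValueError("read has no hits")
    # Simplified likelihood of each hit: one error_rate factor per mismatch.
    likelihoods = [error_rate ** m for m in mismatch_counts]
    p_wrong = 1.0 - max(likelihoods) / sum(likelihoods)
    if p_wrong <= 0.0:
        return cap  # unique hit: no competing location
    return min(cap, int(round(-10.0 * math.log10(p_wrong))))
```

Under this model a read with a single hit receives the capped quality, while a read with two equally good hits receives a quality of about 3. These values can then be compared with the mapping quality reported by a best-mapper for the same read.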
Timeline
- Get accustomed to read mappers, SeqAn, KNIME, Rabema (May)
- Read [1] and [2] and think of an implementation of mapping qualities using k-mers (June/July)
- Propose first pipelines (meeting with Weese, Reinert, Siragusa)
Done
- Rabema pipeline is implemented in KNIME (except for Mason)
- R snippets as evaluation/visualisation scripts in KNIME (coverage, missed reads, Rabema results)
- Short bash script to use Flexbar in KNIME; direct use should be possible soon
- Improved samtools CTD
- Mason in KNIME
- SNP calling (using samtools pipeline?)
References
- [1] Li, Heng, Jue Ruan, and Richard Durbin. 2008. Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores. Genome Research 18 (11) (November): 1851-1858. doi:10.1101/gr.078212.108.
- [2] Lee, Hayan, and Michael C. Schatz. 2012. Genomic Dark Matter: The Reliability of Short Read Mapping Illustrated by the Genome Mappability Score. Bioinformatics (Oxford, England) 28 (16) (August 7): 2097-2105. doi:10.1093/bioinformatics/bts330.
- [3] Siragusa, Enrico, David Weese, and Knut Reinert. 2013. Fast and Accurate Read Mapping with Approximate Seeds and Multiple Backtracking. Nucleic Acids Research (January 28). doi:10.1093/nar/gkt005.
- [4] Holtgrewe, Manuel, Anne-Katrin Emde, David Weese, and Knut Reinert. 2011. A Novel and Well-Defined Benchmarking Method for Second Generation Read Mapping. BMC Bioinformatics 12 (1): 210. doi:10.1186/1471-2105-12-210. http://www.biomedcentral.com/1471-2105/12/210/abstract.
- [5] Weese, David, Manuel Holtgrewe, and Knut Reinert. 2012. RazerS 3: Faster, Fully Sensitive Read Mapping. Bioinformatics (Oxford, England) 28 (20) (October 10): 2592-2599. doi:10.1093/bioinformatics/bts505.