Solving the Multi-read assignment problem
Description
This thesis should provide new ideas for solving the multi-read assignment problem for NGS data. Building on the work of Kececioglu [1] and Tammi [2], who address a similar problem in the context of sequence assembly, methods should be developed that use the microheterogeneity among the multiple hit locations to group reads together and subsequently assign them to the correct genomic location (e.g. using partial overlap and mate-pair information).
Literature
- [1] Kececioglu, Yu: Separating repeats in DNA sequence assembly.
- [2] Tammi et al.: Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs. Bioinformatics 18, 2002, 379-388.
- [3] Stephan Aiche: Separation of repeats in shotgun assembly data, MSc thesis.
Work Plan
This master thesis has a duration of 6 months, from 01.12.2010 until 31.05.2011.
Month 1 - Setup evaluation framework. This includes
- Generation of simulated test cases
- Mapping of reads with RazerS
- Computing a consensus alignment
- Computing the correct location of the multi-reads (first a dummy implementation, i.e. random assignment or assignment to the position with the fewest errors)
- Compute and display the goodness of the solution.
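The two dummy assignment strategies and a simple quality measure for the evaluation framework could be sketched as follows. This is only an illustration: the input layout (a dict from read id to a list of `(position, errors)` hits), the function names, and the use of "fraction of reads placed at their simulated origin" as the goodness measure are assumptions, not part of the plan above or of RazerS.

```python
import random


def assign_random(candidates, rng):
    """Pick one hit position per multi-read uniformly at random."""
    return {read: rng.choice(hits)[0] for read, hits in candidates.items()}


def assign_fewest_errors(candidates):
    """Pick the hit with the fewest errors (ties: first hit listed)."""
    return {read: min(hits, key=lambda h: h[1])[0]
            for read, hits in candidates.items()}


def accuracy(assignment, truth):
    """Fraction of reads assigned to their true (simulated) position."""
    if not assignment:
        return 0.0
    correct = sum(1 for read, pos in assignment.items()
                  if truth.get(read) == pos)
    return correct / len(assignment)


# Hypothetical toy data: read id -> [(position, errors), ...]
candidates = {
    "r1": [(100, 2), (5000, 0)],
    "r2": [(300, 1), (7000, 3)],
}
truth = {"r1": 5000, "r2": 7000}

baseline = assign_fewest_errors(candidates)
print(baseline, accuracy(baseline, truth))
print(assign_random(candidates, random.Random(0)))
```

Because the test cases are simulated, the true origin of every read is known, so any later assignment strategy can be scored with the same `accuracy` function and compared against these two baselines.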
Month 2 - Implement different strategies for multi-read assignment
Month 3 - Integrate into SeqAn, use real-world data, and run performance tests
Month 4 - Enhance the implementations and begin writing the master thesis
Month 5 & 6 - Finish the work and write up the master thesis
Weekly Report
Thesis structure
- Introduction
- Shotgun sequencing
- Repeats in sequence assembly
- Various origins in read mapping applications
- All origins are present in the reference [A]
- Some origin(s) are present in the reference ("hidden repeat") [B]
- No origins are present in the reference [C] [?]
- The solution framework
- Identifying locations of interest for separation
- Based on unexpectedly high coverage [should work for A+B]
- Based on the target regions of multi-mapped reads [should work for A]
- Method for [C]?
- Identification of valid separating columns in those locations
- Kececioglu model
- Tammi model
- Classification of reads based on those columns by Kececioglu ILP
- [ Integration ]
- Implementation [?]
- Evaluation
- Conclusion / Discussion
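For the outline item "Based on unexpectedly high coverage", one possible way to flag candidate locations is sketched below. It is a minimal illustration under stated assumptions: coverage is given as a plain per-position list of read depths, and "unexpectedly high" is taken to mean "above `factor` times the median depth". Both the function name and the threshold rule are hypothetical and not taken from the outline.

```python
from statistics import median


def high_coverage_regions(coverage, factor=1.5):
    """Return half-open intervals (start, end) whose depth exceeds
    factor * median(coverage). The threshold rule is an assumption."""
    if not coverage:
        return []
    threshold = factor * median(coverage)
    regions, start = [], None
    for i, depth in enumerate(coverage):
        if depth > threshold and start is None:
            start = i                      # a high-coverage run begins
        elif depth <= threshold and start is not None:
            regions.append((start, i))     # the run ended before i
            start = None
    if start is not None:                  # run extends to the last position
        regions.append((start, len(coverage)))
    return regions


# Hypothetical depths: a collapsed two-copy repeat at positions 4..6
coverage = [10, 11, 9, 10, 21, 22, 20, 10, 9, 10]
print(high_coverage_regions(coverage))
```

A region flagged this way would cover cases [A] and [B], since reads from a copy missing in the reference still pile up on the copy that is present.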
To review
Integration of the local classification
Separation based on one or more DNPs yields a local classification: the read occurrences at the DNP position are classified against each other.
Procedure:
- Classify the read occurrences locally
- Check whether the local classes contain reads that already have a global class
- NO
- The next free global class ID is requested and assigned to all read occurrences
- YES
- If only one global class occurs in the local class:
- All read occurrences are assigned to this global class
- If several global classes occur:
- All classes that are in conflict with each other are ignored.
- If several global classes remain, they are merged
- The remaining global class is assigned to all read occurrences that have no assignment yet
- The global classes that were assigned to the respective local classes are marked pairwise as conflicting
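The procedure above can be sketched in code as follows. This is a minimal illustration, not the actual implementation: the data layout (a dict from read occurrence to global class ID, a set of conflicting ID pairs) and the name `integrate_site` are assumptions. The notes do not say what happens when every global class in a local class is in conflict; the sketch assumes a fresh global class is opened in that case.

```python
from itertools import combinations


def integrate_site(local_classes, global_class, conflicts, next_id):
    """Fold the local classes of one DNP site into the global labelling.

    local_classes: list of lists of read-occurrence ids
    global_class:  dict occurrence -> global class id (updated in place)
    conflicts:     set of frozenset({a, b}) id pairs (updated in place)
    Returns the next free global class id.
    """
    chosen = []  # global class picked for each local class at this site

    def relabel(old, new):
        # Merge global class `old` into `new` everywhere it is referenced.
        for occ, cls in global_class.items():
            if cls == old:
                global_class[occ] = new
        for pair in [p for p in conflicts if old in p]:
            conflicts.discard(pair)
            renamed = frozenset(new if c == old else c for c in pair)
            if len(renamed) == 2:
                conflicts.add(renamed)
        chosen[:] = [new if c == old else c for c in chosen]

    for members in local_classes:
        present = sorted({global_class[m] for m in members if m in global_class})
        if len(present) > 1:
            # Ignore classes that conflict with another class present here.
            clashing = {c for pair in combinations(present, 2)
                        if frozenset(pair) in conflicts for c in pair}
            present = [c for c in present if c not in clashing]
            for other in present[1:]:      # merge what remains
                relabel(other, present[0])
            present = present[:1]
        if present:
            target = present[0]
        else:
            # No usable global class (assumption for the all-conflicting case).
            target = next_id
            next_id += 1
        for m in members:
            global_class.setdefault(m, target)  # only unassigned occurrences
        chosen.append(target)

    # Classes chosen for different local classes at one site must differ.
    for a, b in combinations(chosen, 2):
        if a != b:
            conflicts.add(frozenset((a, b)))
    return next_id


# Hypothetical walk-through over three DNP sites
labels, clashes = {}, set()
nid = integrate_site([["r1", "r2"], ["r3"]], labels, clashes, 0)
nid = integrate_site([["r6"], ["r7"]], labels, clashes, nid)
nid = integrate_site([["r1", "r6", "r8"], ["r7", "r9"]], labels, clashes, nid)
print(labels, clashes, nid)
```

In the third call the first local class contains the non-conflicting global classes 0 and 2, so they are merged into class 0 and the earlier conflict between 2 and 3 carries over to 0 and 3.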