Overlap Module for NGS Pipeline
Summary
The overlap module merges the information obtained by mapping reads to a genome with annotation information (for example genes, transcripts, or known intervals of genomic aberrations such as inversions) and can be used to measure gene expression levels with RNA-Seq data (NGS of mRNAs) or to produce visualisation files for sequencing coverage. It counts the matched reads for the annotated intervals.
Expose
This thesis uses data structures that are already implemented in SeqAn. The most important one is the FragmentStore, which stores reads, contigs and alignments, among other things. The first step is to write a function that extracts, for a given read, the contig intervals matched in its alignment.
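The extraction step can be sketched as follows. This is an illustration in Python rather than SeqAn code, and the input representation (a contig start position plus a list of CIGAR-like operations) as well as the name `matched_intervals` are assumptions, not part of the FragmentStore. Matched stretches are returned as half-open contig intervals, and a skipped region such as an intron splits them.

```python
def matched_intervals(contig_start, ops):
    """Return half-open contig intervals [begin, end) covered by one read.

    ops is a list of (op, length) pairs with hypothetical op codes:
      'M' aligned bases, 'D' deletion in the read,
      'I' insertion in the read, 'N' skipped contig region (e.g. an intron).
    """
    intervals = []
    pos = contig_start      # current position on the contig
    begin = None            # start of the currently open interval, if any
    for op, length in ops:
        if op in ("M", "D"):        # consumes contig, keeps the interval open
            if begin is None:
                begin = pos
            pos += length
        elif op == "N":             # a skipped region closes the open interval
            if begin is not None:
                intervals.append((begin, pos))
                begin = None
            pos += length
        elif op == "I":             # consumes read bases only
            continue
        else:
            raise ValueError(f"unknown operation {op!r}")
    if begin is not None:
        intervals.append((begin, pos))
    return intervals
```

A read that spans an exon junction then yields two intervals, one per exon, while a read with only small insertions or deletions yields a single interval.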
In addition, a function is needed to read the given GFF file, which contains the annotations for special intervals (such as genes and exons) in a contig.
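A minimal sketch of such a reader for a single line is shown below, in Python rather than SeqAn. It relies on the nine tab-separated GFF columns and on 1-based inclusive coordinates; the `key=value` attribute syntax (as in GFF3), the record type `GffRecord` and the function name `parse_gff_line` are assumptions for illustration.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord", "contig source type begin end strand attributes")

def parse_gff_line(line):
    """Parse one GFF feature line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
    attributes = {}
    for field in attrs.split(";"):      # assumed GFF3-style key=value pairs
        field = field.strip()
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    # GFF positions are 1-based and inclusive; convert to 0-based half-open.
    return GffRecord(seqid, source, ftype,
                     int(start) - 1, int(end), strand, attributes)
```

Converting to half-open coordinates at read time keeps interval arithmetic in the later steps free of off-by-one corrections.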
The FragmentStore will be extended by an annotationStore and an annotationNameStore, which will store the information from the GFF file. The annotationNameStore will hold the names of the intervals in lexicographical order, so the position of a name implicitly gives the id of its entry. The intervals will be stored in the annotationStore under the corresponding ids. Each entry also holds the parent id (e.g. the gene id for an exon), the contig id, and the start and end positions.
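The relationship between the two stores can be sketched as follows, in Python rather than SeqAn. The function name `build_annotation_stores`, the tuple layout of the input and the dictionary fields are assumptions for illustration; only the idea (sorted names, id equals position, entries stored under that id) comes from the text above.

```python
from bisect import bisect_left

def build_annotation_stores(records):
    """Build a sorted name store and an id-indexed annotation store.

    records: iterable of (name, parent_name, contig_id, begin, end);
    parent_name is None for top-level entries such as genes.
    """
    records = list(records)
    # Names in lexicographical order: the position in this list is the id.
    name_store = sorted({r[0] for r in records})

    def id_of(name):
        """Look up an id by binary search in the sorted name store."""
        i = bisect_left(name_store, name)
        if i == len(name_store) or name_store[i] != name:
            raise KeyError(name)
        return i

    annotation_store = [None] * len(name_store)
    for name, parent, contig_id, begin, end in records:
        annotation_store[id_of(name)] = {
            "parent_id": id_of(parent) if parent is not None else None,
            "contig_id": contig_id,
            "begin": begin,
            "end": end,
        }
    return name_store, annotation_store, id_of
```

Because the names are sorted, a name-to-id lookup is a binary search, and an id-to-name lookup is a plain index access.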
In addition, an intervalTreeStore will be created as part of the FragmentStore. For each contig, two interval trees will be created (one per orientation); they hold the annotated intervals located on that contig, each with a pointer to the corresponding annotationStore entry. (Interval trees are already implemented in SeqAn.)
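To illustrate the layout, here is a sketch in Python with a small centered interval tree standing in for SeqAn's implementation. The class `IntervalTree`, the function `build_interval_tree_store` and the `strand` field are assumptions for illustration; each stored interval carries its annotation id as the pointer back to the annotationStore.

```python
class IntervalTree:
    """Centered interval tree over half-open intervals (begin, end, value)."""

    def __init__(self, intervals):
        intervals = [iv for iv in intervals if iv[0] < iv[1]]  # drop empty ones
        self.here = []                  # intervals that contain self.center
        self.left = self.right = None
        if not intervals:
            return
        points = sorted(p for b, e, _ in intervals for p in (b, e - 1))
        self.center = points[len(points) // 2]
        left, right = [], []
        for iv in intervals:
            b, e, _ = iv
            if e <= self.center:        # entirely left of the center
                left.append(iv)
            elif b > self.center:       # entirely right of the center
                right.append(iv)
            else:
                self.here.append(iv)
        if left:
            self.left = IntervalTree(left)
        if right:
            self.right = IntervalTree(right)

    def find(self, qb, qe):
        """Return the values of all intervals overlapping [qb, qe)."""
        if qb >= qe or not self.here:
            return []
        hits = [v for b, e, v in self.here if b < qe and e > qb]
        if self.left is not None and qb < self.center:
            hits += self.left.find(qb, qe)
        if self.right is not None and qe > self.center + 1:
            hits += self.right.find(qb, qe)
        return hits

def build_interval_tree_store(annotation_store):
    """One tree per (contig id, orientation); values are annotation ids."""
    grouped = {}
    for anno_id, entry in enumerate(annotation_store):
        key = (entry["contig_id"], entry["strand"])
        grouped.setdefault(key, []).append(
            (entry["begin"], entry["end"], anno_id))
    return {key: IntervalTree(ivs) for key, ivs in grouped.items()}
```

Keeping one tree per contig and orientation means a read is only ever compared against annotations on its own contig and strand.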
Next, the interval trees will be searched for the matched intervals of each read. The existing function findIntervals() will be used for this, but the queries will be slightly shortened intervals to compensate for sequencing errors and other inaccuracies. After that, a new function will increment the counts for the current interval and, if necessary, for its parent intervals. This requires distinguishing the ways in which a matched interval can occur (e.g. a read that overlaps two overlapping exons but lies completely within only one of them).
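One possible shape of this step is sketched below in Python. The tolerance value, the selection rule (keep only annotations that fully contain the shortened read interval) and all function names are assumptions for illustration, not the rule the module finally uses. Ancestors are collected into a set first so that a parent shared by several hit exons is counted only once per read.

```python
def shorten(begin, end, tolerance):
    """Trim `tolerance` bases from both ends, keeping at least one base."""
    if end - begin <= 2 * tolerance:
        mid = (begin + end) // 2
        return mid, mid + 1
    return begin + tolerance, end - tolerance

def select_annotations(read_interval, candidates, tolerance):
    """Keep candidate ids whose interval contains the shortened read interval.

    candidates maps annotation id -> (begin, end), half-open.
    """
    qb, qe = shorten(read_interval[0], read_interval[1], tolerance)
    return [aid for aid, (b, e) in candidates.items() if b <= qb and qe <= e]

def with_ancestors(anno_ids, parent_ids):
    """Return the given ids together with all of their parent ids."""
    result = set()
    for aid in anno_ids:
        while aid is not None and aid not in result:
            result.add(aid)
            aid = parent_ids[aid]
    return result

def count_read(counts, anno_ids, parent_ids):
    """Increment the count of every hit annotation and ancestor once."""
    for aid in with_ancestors(anno_ids, parent_ids):
        counts[aid] += 1
```

For example, with exons at (100, 300) and (250, 400), a read mapped to (248, 330) is not contained in either exon, but after trimming three bases from each end it lies completely within the second exon and is counted for that exon and its gene.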
There will be three different results:
- 1. For each read the ids of the intervals are stored.
- 2. For each interval of the annotationStore the counts of matched reads are stored.
- 3. For each interval of the annotationStore connected intervals by reads and their counts are stored.
Within the containers (SeqAn Strings and StringSets), the results are stored at the position corresponding to the annotationStore id.
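The second and third result can be derived from the first, as the following Python sketch suggests. Plain lists and `Counter` objects stand in for SeqAn Strings and StringSets; the function name `build_results` and the convention of storing each connection under the smaller of the two ids are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def build_results(read_annotations, num_annotations):
    """Derive per-annotation counts and connection counts.

    read_annotations: one list of annotation ids per read (result 1).
    Returns (anno_counts, tuple_counts), both indexed by annotation id:
      anno_counts[i]     number of reads matched to annotation i (result 2)
      tuple_counts[i][j] number of reads connecting i and j, i < j (result 3)
    """
    anno_counts = [0] * num_annotations
    tuple_counts = [Counter() for _ in range(num_annotations)]
    for anno_ids in read_annotations:
        unique = sorted(set(anno_ids))
        for aid in unique:
            anno_counts[aid] += 1
        # Every pair of annotations hit by the same read is one connection.
        for a, b in combinations(unique, 2):
            tuple_counts[a][b] += 1
    return anno_counts, tuple_counts
```

Indexing every container by annotation id means no additional lookup structure is needed to relate counts back to the annotationStore.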
Finally, a GFF file will be created for the output. The names can be retrieved quickly from the annotationNameStore by using the ids.
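A sketch of the output step in Python is given below. The text does not specify which columns or attributes the output file contains, so the `count` attribute, the source label `overlap_module`, the `type` and `strand` fields and the function name `gff_output_lines` are all hypothetical.

```python
def gff_output_lines(name_store, annotation_store, anno_counts, contig_names):
    """Yield one GFF line per annotation, with its read count as an attribute."""
    for anno_id, entry in enumerate(annotation_store):
        # The id is the position in the name store, so the name is one lookup.
        attributes = f"ID={name_store[anno_id]};count={anno_counts[anno_id]}"
        yield "\t".join([
            contig_names[entry["contig_id"]],
            "overlap_module",           # hypothetical source label
            entry["type"],
            str(entry["begin"] + 1),    # back to 1-based inclusive start
            str(entry["end"]),
            ".",                        # score not used
            entry["strand"],
            ".",                        # phase not used
            attributes,
        ])
```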
[1] Marc Sultan, Marcel H. Schulz, Hugues Richard, Alon Magen, Andreas Klingenhoff, Matthias Scherf, Martin Seifert, Tatjana Borodina, Aleksey Soldatov, Dmitri Parkhomchuk, Dominic Schmidt, Sean O'Keeffe, Stefan Haas, Martin Vingron, Hans Lehrach and Marie-Laure Yaspo, A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science (2008), published online July 3.
Weekly Reports
Week 1 ( - 09.08.):
- detailed working plan created
Week 2 (10.08. - 14.08.):
Implementation of:
- function to read gff-file and to store the information in annotationNameStore and annotationStore
- function to extract the matched intervals of a contig in an alignment
- function to get the annotationStore-IDs of one read, to select the right ones and to store them in a readCountStore
- function to get the counts of read- and matepair-connections from the readCountStore and to store them in a tupleCountStore
- function to append the parent-IDs to the readCountStore
- function to build an annoCountStore by use of the readCountStore, which stores the counts of all annotationStore-IDs
- function to calculate the start- and end-positions from parent-entries and to store them in the annotationStore
Great, that is exactly how it should be!
-- Main.maschulz - 13 Jul 2009