SLAGAN
Input: two genomic dna sequences in FASTA file(s)
Generation of local alignments (Svenja)
SLAGAN uses the CHAOS aligner for this phase:
- finding seeds: k-mers with d mismatches (k-mer index?)
- chaining seeds
- scoring chains
- removing chains with score < t
- extending seeds & finding optimal gap positions
- rescoring
Implementation:
Chaos is part of lagan so this is somewhere implemented in SeqAn. But we will probably need to adjust it for our needs.
Probably this step has to be run twice: sequence 1 vs sequence2 and sequence1 vs reverse complement of sequence2.
REINERT: have a look at lagan.cpp in the SeqAn demo folder. You will also find the call to chaos.
Alternatively, Birte Kehr can provide you with her new local aligner.
Output: local alignments as six-tuples (start1,end1,start2,end2,score,strand)
REINERT: Ask Birte for the standard output of her aligner. That format can also be read in again for the next step.
Building the 1-monotonic conservation map (Felix)
- storing local alignments in eight data structures for the different chaining types
- which data structure? paper says: balanced binary trees. Cant we use any structure that gives us access to the right alignment in logarithmic time?
- dynamic programming for finding high scoring chain (this is in fact a segment based approach)
Implementation:
As this is the core feature of the algorithm, we will try to reimplement it.
Output: chain consisting of consistent subchains (how to store this?)
Aligning consistent subsegments (Svenja)
- expand consistent subchains
- re-align consistent subchains with lagan
- clip overlaps in alignment in an optimal way
Implementation:
SLAGAN uses LAGAN for this task. This is already implemented in SeqAn somewhere so we can use this. But we will have either to formate our input in a way that the LAGAN implementation can deal with it or adjust the implementation so it can deal with our input
Output: glocal alignment
Implementation Detail
Types of Gaps
The type of gap between two matches L1 and L2 is denoted by three bits (linearInSeq2, IPositive, JPositive)
- linearInSeq2: L1 and L2 are collinear or consistent (i.e. L1.end2 < L2.start2)
- IPositive: L1 is positive
- JPositive: L2 is positive
Implemented a class that can store this three bits in one char. Realized afterward that there is a Bitset class in the STL…
Chaining
gap penalties:
- gap open: constant
- gap continue: |(L1.end1-L1.end2)-(L2.start1-L2.start2)| * constant (difference in diagonal numbers of end point of L1 and start point of L2 times constant)
- distance: min(|L1.end1-L2.start1|, |L1.end2-L2.start2|) * constant
- consistent (+++ and ---): gap open = 0, gap continue = -0.5, distance = ?
- inversions (++- and --+): gap open = -1000, gap continue = ?, distance = 1*distance
- translocation (+-+ and -+-): gap open = -2000, gap continue = ?, distance = 2.5*distance
- translocated inversions (+-- and -++): gap open = ?, gap continue = ?, distance = ?
Problems and Milestones
- Chaos is done but seems to return wrong results. Some closer examinations and examples are needed
- tested with a very short example: seq1: AAAAAAAAAAA, seq2: different combinations of 11-15 A's and other bases, k=11
- start-position of every local alignment which is found is always 0, end position always the last position of seq2 (that means, always the whole sequence seq2 is found)
- number of local alignments, which are found, would be correct
- further analysis is needed (look at the found seeds and the positions during the alignment step)
- seeds are okay on small examples
- re-align-step is not necessary
- addSeed with Chaos-Tag (chaining seeds with Chaos algorithm) leads to segmentation faults and infinity loops → some serious debugging is needed here
- we will skip this for now and only use the Merge tag. Maybe we can use Birtes local aligner later…
- scores for local matches can be extracted from the seedSets scoreMap
- position(seedSetIter, seedSet) seems not to work → count the steps the iterator made, use the counter to access the scoreMap
- Chaining Algorithms seems to be way more complicated than we thought. We read the supplemental material to the slagan paper but it seems we will also have to read (and understand) the Eppstein Paper on sparse dynamic programming. Maybe it is a good idea to first implement a quadratic version of the chaining
- we will build a graph with nodes representing the matches ans edges represneting possible chainings (only 1-monotinic ones)
- use bellman-ford algorithm to find the heaviest path → highscoring chaining
- gap costs are not really good explained in the paper (as can seen above we are lacking some of them)
Next steps
- debugging chaos; output of chaos
- implement the chaining in O(n^2)