Comparing graphs for genome alignment of multiple sequences under the presence of large structural events.
Finishing Date | Milestone | estimated time needed |
---|---|---|
31.07. | Modul for read-in of .maf files | 1 week |
21.08. | Modul to create A-Bruijn graphs from .maf files | 3 weeks |
28.08. | Modul for output of A-Bruijn graphs | 1 week |
04.09. | Adding translocation and inversion to AlignmentGraph modul | 1 week |
11.09. | Converting A-Bruijn graph to Alignment graph and vice versa (optional) | 1 week |
18.09. | Benchmarks | 1 week |
25.09. | Creating Bachelor Thesis | 2 weeks |
Week | progress |
---|---|
24.07. - 31.07. | Finished maf parser with some problems. Fixed in time though. |
31.07. - 07.08. | Decided to change schedule. Going to finish A-Bruijn graph implementation first. Added inversion to AlignmentGraph. Altering buildAlignmentGraph() for translocation is probably very time consuming, therefore the schedule change. Read up on A-Bruijn graph mainly and worked out a plan for the implementation. |
31.07. - 07.08. | Implemented a scaffold for the A-Bruijn graph and cleared up the last details. |
07.08. - 14.08. | Implemented the buildAbruijnGraph()-function with support for inversions. maf files can now be converted into A-Bruijn graphs. |
14.08. - 21.08. | Implemented the two functions write2MAF() and write2DOT(). write2MAF() creates a .maf file of the A-Bruijn. write2DOT() creates a .dot file that can be read by GraphViz and converted into a graph. |
21.08. - 28.08 | Problem with the output after insertion and deletion of vertices. Problem solved by fixing the functions addVertex() and removeVertex(). Minor changes to the datatypes used (Align object and AlignBlock object). Wrote tests for all functions except for write2DOT() and write2MAF(). Taking care of AlignmentGraph now. |
28.08. - 04.09. | Tried to get the buildALignmentGraph() function to work with a test.maf, I created. Problems occured, when using Align objects. Apparently it tries to create an edge where source = target. Milestone wasn't reached. Wrote the interface for the abruijn program. |
04.09. - 18.09. | Fixed the problem with the Align objects (made some major changes in the maf parser header). Wrote my own (simplified) buildAlignmentGraph() function and a function to convert an Abruijn graph to an Alignment graph. |
Genome alignment compares whole genomes of different species or individuals with the aim of identifying homologous regions. Challenges are the huge amount of data involved and the presence of evolutionary events that affect the structure of a genome. Graphs have proven to be a versatile data structure to cope with both challenges ([5],[6]).
At the moment, there are four dominant graph structures that are able to represent a genome alignment of multiple genomes under the presence of large evolutionary events ([1]-[4]). Appearing very different at first glance, the graphs share similar substructures and modification schemes.
The SeqAn library provides a generic interface for graph components and already implements an alignment graph. Within the scope of this thesis, the task is to extend the alignment graph and implement a basic version of the A-Bruijn graph in SeqAn. A further step is to implement a (simple) transformation that converts one graph into the other, and then test the performance of both graphs in terms of construction time, space and transformation time (and space).
This thesis is integrated into a larger project, where in the first part the above mentioned graphs have been compared from a theoretical point of view. The thesis is the first step of a potential proceeding part, where the graphs are compared based on their applicability and time and space consumption regarding construction and modification operations performed on the graphs. This requires sound and comparable implementations of the graph specialisations, that are able to build up (and transform) a graph in adequate run time given large sets of genome data.
Whole-genome alignment overview paper:
[5] Dewey CN, Pachter L.: Evolution at the nucleotide level: the problem of multiple whole-genome alignment. Human Mol Gen 2006, 15:R51-R56. [6] Dubchak I, Poliakov A, Kislyuk A, Brudno M: Multiple whole-genome alignments without a reference organism. Genome Res 2009, 19(4):682689.