The explosion in the availability of genomic data over the last one to two decades has facilitated large advances in the life sciences. Beginning with the sequencing of the fruit fly genome (Adams et al. 2000) and, more importantly, the human genome (Venter et al. 2001), more and more whole genomic sequences have become available. In the future, we can expect an even larger increase in the number of sequenced organisms. (However, few assembled genomes will be as complete as the human genome.)
This thesis has the following parts.
(1) The simulation of genomes. (2) The simulation of the evolution of genomes.
Part (1) should be based on (Myers, 1999); part (2) could be based on ideas from sgEvolver (Darling et al. 2004) but should extend the set of events considered there. The following plan covers 6 months of work.
Following Box's famous quote "all models are wrong, but some are useful", the aim is not to create a comprehensive model of evolution. This would require more biological knowledge than will be available in the foreseeable future. Rather, the model should capture events that are relevant for non-collinear whole genome alignment software.
The following should be seen as a proposal that should be followed "in spirit" but is not set in stone. If parts are replaced or left out, the well-documented and structured code should allow a later extension such that all features described below are possible.
Reading of literature, looking at existing biological databases.
The original motivation for celsim was to be able to test algorithms on "real looking" genomic data. While we have complete genomes today that can be used for testing algorithms, synthetic data has the advantage that it can be explained completely, since the process of generating it is known.
The student should write a program for the artificial creation of genomes with certain features observed in biological data. Myers' celsim (Myers, 1999) should serve as a template for this. However, the approach should be extended to allow both procedurally generated sequences (Dau, 2010) and real-world sequences as sequence sources. Real-world sequences could be retrieved from public gene databases or annotations of the human genome.
The input would be parameters for sequence generation, parameters for generation of the genome and possibly real-world sequences for genes, e.g. with exon/intron annotation. The output of this program should be (i) a genome (multiple chromosomes), (ii) a "log file" with the operations that were executed to generate the sequence, and (iii) annotations with this log information (see Figure 1).
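The relation between the generated genome and its log file can be illustrated with a minimal Python sketch. It is not celsim and not a proposed design: the function names, the parameters and the log record layout are assumptions made only for this illustration, and a single repeat family stands in for the much richer sequence grammar of (Myers, 1999).

```python
import random


def random_dna(rng, length):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def generate_genome(seed, chrom_lengths, repeat_length, repeats_per_chrom):
    """Build chromosomes from random background plus copies of one repeat.

    Returns (genome, log). genome maps a chromosome name to its sequence.
    log lists every executed operation in order; positions refer to the
    chromosome coordinates at the time the operation was applied.
    All names and record fields here are hypothetical.
    """
    rng = random.Random(seed)
    repeat = random_dna(rng, repeat_length)
    genome, log = {}, []
    for i, length in enumerate(chrom_lengths, start=1):
        name = f"chr{i}"
        seq = random_dna(rng, length)
        log.append({"op": "background", "chrom": name, "length": length})
        for _ in range(repeats_per_chrom):
            pos = rng.randrange(len(seq) + 1)
            seq = seq[:pos] + repeat + seq[pos:]
            log.append({"op": "insert_repeat", "chrom": name,
                        "pos": pos, "length": repeat_length})
        genome[name] = seq
    return genome, log


if __name__ == "__main__":
    genome, log = generate_genome(seed=1, chrom_lengths=[100, 50],
                                  repeat_length=10, repeats_per_chrom=2)
    for record in log:
        print(record)
```

Because the generator is seeded, the same parameters always reproduce the same genome and log, which is what makes the synthetic data fully explainable.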
The annotated genome from (1) or an annotated real-world genome is the input for the second program. These genome annotations should include at least two types of regions, with higher and lower conservation. Furthermore, it would be useful if the annotation of genes with alternating introns/exons (higher/lower evolution rate) were possible.
We now simulate evolutionary events, both small-scale (insertion, deletion, substitution) and large-scale (larger insertions, deletions, substitutions but also: inversions, transpositions, duplications), cf. Figure 2.
Additional interesting events include whole-genome duplications (or: duplication of one chromosome), retrotransposed pseudogenes and horizontal gene transfer (which occurs, e.g., in bacteria), cf. Figure 3. A nice-to-have feature would be events related to cancer genomics, e.g. rearrangements and gene fusions over short time spans.
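As a rough illustration of the two event scales, the following Python sketch applies per-base substitutions, insertions and deletions to a sequence and implements one large-scale event, an inversion, as a reverse complement of a segment. The function names and the per-base probability parameters are assumptions for this sketch only; sequences are assumed to consist of upper-case A, C, G and T.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def invert(seq, start, end):
    """Large-scale event: replace seq[start:end] by its reverse complement."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def mutate_small_scale(seq, rng, p_sub, p_ins, p_del):
    """Small-scale events, decided independently for every base.

    p_del and p_sub are mutually exclusive per base (p_del + p_sub <= 1);
    an insertion of one random base may follow any base that was kept.
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            continue  # deletion: the base is dropped
        if r < p_del + p_sub:
            # substitution: pick one of the other three bases
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
        if rng.random() < p_ins:
            out.append(rng.choice("ACGT"))  # insertion after this base
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(42)
    print(mutate_small_scale("ACGTACGTACGT", rng, 0.1, 0.05, 0.05))
    print(invert("AACGTT", 1, 4))
```

Different sequence classes (for example introns and exons) could be handled by calling `mutate_small_scale` per annotated region with class-specific probabilities.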
The input consists of probabilities for small-scale mutation rates in certain classes of sequences (e.g. higher rates in introns than in exons) as well as specifications for large-scale events and phylogenetic trees. The trees specify which reference sequences are to be generated and the distance (number of events) between them (cf. Figure 4). The specifications for large-scale events could be probabilities but also fixed commands, e.g. "force a transposition of a N(mu,sigma) part of genome from species 1 to species 2, wrap behind possibly broken genes with a probability of 0.8 (and thus do not break them with 80% probability)."
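How a phylogenetic tree with distances given as event counts could drive the simulation is sketched below in Python. The tree encoding (a dictionary from node name to a list of child/event-count pairs), the fixed root name "root" and the restriction to single-base substitutions are assumptions made to keep the example short; they are not part of the proposal.

```python
import random


def substitute_one(seq, rng):
    """Apply one random substitution; return the new sequence and its event."""
    pos = rng.randrange(len(seq))
    new = rng.choice([b for b in "ACGT" if b != seq[pos]])
    event = {"type": "substitution", "pos": pos, "base": new}
    return seq[:pos] + new + seq[pos + 1:], event


def evolve_tree(tree, root_seq, seed=0):
    """Evolve root_seq along a tree whose edges carry event counts.

    tree maps a node name to a list of (child name, number of events);
    the root must be called "root" (a convention of this sketch only).
    Returns one sequence per node and, per non-root node, the list of
    events that separate it from its predecessor.
    """
    rng = random.Random(seed)
    sequences = {"root": root_seq}
    logs = {}
    stack = ["root"]
    while stack:
        parent = stack.pop()
        for child, n_events in tree.get(parent, []):
            seq, events = sequences[parent], []
            for _ in range(n_events):
                seq, event = substitute_one(seq, rng)
                events.append(event)
            sequences[child] = seq
            logs[child] = events
            stack.append(child)
    return sequences, logs


if __name__ == "__main__":
    tree = {"root": [("A", 3), ("B", 0)], "A": [("A1", 2)]}
    sequences, logs = evolve_tree(tree, "ACGT" * 5)
    for name, seq in sequences.items():
        print(name, seq, logs.get(name, []))
```

A fixed command such as the forced transposition in the example above would be one more event type whose segment length is drawn with `rng.gauss(mu, sigma)` instead of being chosen by a per-base probability.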
The output is (i) one reference sequence for each node in the phylogenetic tree, (ii) a "log file"/list description of the evolutionary events that occurred between it and the predecessor (this could also contain horizontal gene transfer from an "outlier" given as part of the input or from other nodes of the tree) and (iii) a mapping between different parts of the reference sequences, flattened from the event list.
There is also the ongoing project Evolver (Edgar et al., 2011).
Technical aims of this thesis are a clean and well-documented implementation and the ability to extend the set of evolutionary events.
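What "flattening" an event list into a mapping could mean is shown in the following Python sketch. It assumes a hypothetical event record with the fields type, pos and length, with positions given in the coordinates valid at the moment the event is applied. For each position of the descendant it records the originating position in the predecessor, or None for inserted bases. A per-base list is used only for clarity; for real genome sizes a block-based representation would be needed, and strand information for inversions is not tracked here.

```python
def flatten_events(ancestor_length, events):
    """Return, for each descendant position, its ancestor position or None.

    events is a list of dicts with keys "type", "pos" and "length"
    (a record layout assumed for this sketch), applied in order.
    """
    origin = list(range(ancestor_length))
    for ev in events:
        kind, pos, length = ev["type"], ev["pos"], ev["length"]
        if kind == "insertion":
            # inserted bases have no counterpart in the ancestor
            origin[pos:pos] = [None] * length
        elif kind == "deletion":
            del origin[pos:pos + length]
        elif kind == "inversion":
            origin[pos:pos + length] = origin[pos:pos + length][::-1]
        else:
            raise ValueError(f"unknown event type: {kind}")
    return origin


if __name__ == "__main__":
    events = [
        {"type": "insertion", "pos": 2, "length": 2},
        {"type": "deletion", "pos": 0, "length": 1},
        {"type": "inversion", "pos": 3, "length": 2},
    ]
    print(flatten_events(6, events))
```

Composing the per-edge mappings along a path in the tree would give the mapping between any two reference sequences.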