Possible Project for a BSc Thesis in Bioinformatics or Computer Science
Introduction
"Many nucleotide and amino acid sequences are highly repetitive in nature. If your query sequence contains regions of low complexity or repeats, you can end up with many non-related, high scoring sequences being found during BLAST (or FASTA) searches (e.g. hits against proline-rich regions or poly-A tails). In other cases, your sequence may contain regions of vector sequence, or repeat regions such as Alu sequences, that you either do not want included in your sequence, or at the very least, wish to have discluded in any searches you carry out based on sequence similarity." [1]
Two projects make sense, depending on the student's interests, skills, and available time.
Implement multiple masking algorithms and compare (option 1)
The goal of this thesis is to reimplement the well-known filtering algorithm SEG (for protein sequences) [2] in a stand-alone
SeqAn application and to compare it against the original implementation.
Steps:
- implement the SEG algorithm [2] as a function in SeqAn
- add support for writing (and reading) SEG interval output to SeqAn
- develop a tool that reads FASTA files and outputs intervals for them
- benchmark and compare the solution to the original tool
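As a starting point for the steps above, the following is a minimal sketch of a window-based low-complexity filter. It is not the full SEG algorithm from [2]: it only flags windows whose Shannon entropy falls below a threshold and merges them into intervals, without SEG's extension and refinement stages. The names Interval, window_entropy and low_complexity_intervals are hypothetical, the code uses plain std::string instead of SeqAn types, and the defaults of 12 and 2.2 are assumed to mirror the window length and trigger complexity commonly quoted for SEG; they should be checked against [2].

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Half-open interval [begin, end) of sequence positions flagged as low complexity.
using Interval = std::pair<std::size_t, std::size_t>;

// Shannon entropy (in bits) of the residue composition of seq[pos, pos + window).
double window_entropy(std::string const & seq, std::size_t pos, std::size_t window)
{
    assert(window > 0 && pos + window <= seq.size());

    std::array<std::size_t, 256> counts{};
    for (std::size_t i = pos; i < pos + window; ++i)
        ++counts[static_cast<unsigned char>(seq[i])];

    double entropy = 0.0;
    for (std::size_t const c : counts)
    {
        if (c == 0)
            continue;
        double const p = static_cast<double>(c) / static_cast<double>(window);
        entropy -= p * std::log2(p);
    }
    return entropy;
}

// Simplified first stage only (an assumption, not the complete SEG procedure):
// every window with entropy <= threshold is flagged, and overlapping or
// adjacent flagged windows are merged into one interval.
std::vector<Interval> low_complexity_intervals(std::string const & seq,
                                               std::size_t window = 12,
                                               double threshold = 2.2)
{
    assert(window > 0);

    std::vector<Interval> result;
    for (std::size_t pos = 0; pos + window <= seq.size(); ++pos)
    {
        if (window_entropy(seq, pos, window) > threshold)
            continue;

        if (!result.empty() && pos <= result.back().second)
            result.back().second = pos + window;     // extend the previous interval
        else
            result.emplace_back(pos, pos + window);  // start a new interval
    }
    return result;
}
```

Calling low_complexity_intervals on a protein sequence returns the flagged regions as half-open intervals, which a tool could then write out or use to replace residues with 'X'. Sequences shorter than the window yield no intervals in this sketch.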
Stretch goals:
- include alternatives to SEG, like GBA [4]
- parallelise the tool over the input sequences (should be fairly simple with OpenMP)
- measure the influence of the algorithm on a tool like BLAST or Lambda [5]
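The parallelisation stretch goal can be sketched as follows, assuming each input sequence is masked independently so that one OpenMP loop over the sequences suffices. The function mask_homopolymers is a hypothetical stand-in for a real filter such as SEG, and mask_all is likewise an illustrative name. If the code is compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-sequence masking routine: replaces every run
// of at least min_run identical residues with 'X'.
std::string mask_homopolymers(std::string seq, std::size_t min_run = 4)
{
    assert(min_run > 0);

    std::size_t run_begin = 0;
    for (std::size_t i = 1; i <= seq.size(); ++i)
    {
        if (i < seq.size() && seq[i] == seq[run_begin])
            continue;                       // still inside the current run

        if (i - run_begin >= min_run)
        {
            for (std::size_t j = run_begin; j < i; ++j)
                seq[j] = 'X';
        }
        run_begin = i;                      // next run starts here
    }
    return seq;
}

// Masks all sequences; each iteration writes only to its own output slot,
// so no synchronisation is needed.
std::vector<std::string> mask_all(std::vector<std::string> const & seqs)
{
    std::vector<std::string> masked(seqs.size());
    long const n = static_cast<long>(seqs.size());

    // Dynamic scheduling helps when sequence lengths differ a lot.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i)
    {
        std::size_t const idx = static_cast<std::size_t>(i);
        masked[idx] = mask_homopolymers(seqs[idx]);
    }
    return masked;
}
```

The output vector is sized up front so that the order of results matches the order of the input regardless of which thread handles which sequence.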
Expected outcome for student:
- learn how to read and understand scientific papers, pseudo code and/or other implementations' source code
- learn how to efficiently implement an existing algorithm in SeqAn, do I/O and develop an application
- learn how to benchmark and compare your implementation with others
- learn how to write a thesis
Add masking support to SeqAn3 (option 2)
The focus of this work would be to add masking functionality to the new library, SeqAn3. It is more about a clean implementation, proper documentation, and participation in the software project and its workflows.
Goals:
- implement the SEG algorithm [2] or another, simpler algorithm as a function in SeqAn3 (if the license permits, an existing solution could be imported with little change)
- add alphabet types for mask (0 or 1) and a template masked that creates a masked alphabet from an existing one
- implement a masked_sequence_adaptor that stores masking information more efficiently than per character; evaluate the theoretical differences in space consumption and access time compared with a regular sequence over a masked alphabet
- write proper documentation and tests for all new functionality
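To make the difference between the two storage strategies in the goals above concrete, here is a minimal sketch in plain C++17. It is not the SeqAn3 API: only the names masked and masked_sequence_adaptor come from this proposal, while all members (letter, is_masked, mask, interval_count) and the interval representation are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Per-character strategy: every letter carries its own mask flag.
template <typename alphabet_t>
struct masked
{
    alphabet_t letter{};
    bool       is_masked{false};
};

// Interval strategy: the sequence is stored unchanged and masked regions are
// kept as sorted, non-overlapping, non-adjacent half-open intervals [begin, end).
template <typename sequence_t>
class masked_sequence_adaptor
{
public:
    using interval_t = std::pair<std::size_t, std::size_t>;
    using value_type = masked<typename sequence_t::value_type>;

    explicit masked_sequence_adaptor(sequence_t seq) : seq_{std::move(seq)} {}

    // Marks [begin, end) as masked and merges it with touching intervals.
    void mask(std::size_t begin, std::size_t end)
    {
        assert(begin <= end && end <= seq_.size());
        if (begin == end)
            return;

        std::vector<interval_t> merged;
        for (interval_t const & iv : intervals_)
        {
            if (iv.second < begin || end < iv.first)
                merged.push_back(iv);               // disjoint, keep unchanged
            else
            {
                begin = std::min(begin, iv.first);  // absorb touching interval
                end   = std::max(end, iv.second);
            }
        }
        merged.emplace_back(begin, end);
        std::sort(merged.begin(), merged.end());
        intervals_ = std::move(merged);
    }

    // Binary search over the intervals: O(log k) for k masked regions.
    bool is_masked(std::size_t pos) const
    {
        auto it = std::upper_bound(intervals_.begin(), intervals_.end(), pos,
                                   [](std::size_t p, interval_t const & iv)
                                   { return p < iv.first; });
        return it != intervals_.begin() && pos < std::prev(it)->second;
    }

    // Element access yields the same value type as the per-character variant.
    value_type operator[](std::size_t pos) const
    {
        return {seq_[pos], is_masked(pos)};
    }

    std::size_t size() const { return seq_.size(); }
    std::size_t interval_count() const { return intervals_.size(); }

private:
    sequence_t              seq_;
    std::vector<interval_t> intervals_;
};
```

In this sketch the per-character variant needs extra storage for every letter and gives constant-time access, whereas the adaptor needs two integers per masked region and a logarithmic lookup. Which one wins depends on how many and how long the masked regions are, which is the kind of trade-off the evaluation in the thesis would quantify.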
Stretch goals:
- get the changes merged before handing in the thesis
- evaluate more storage strategies for masked_sequence_adaptor
- adapt the FASTA input/output code to be able to read masked sequences from file
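For the FASTA stretch goal, the following sketch shows one way masked sequences could be read. It assumes the widespread soft-masking convention in which lower-case residues denote masked positions; the proposal itself does not fix a file convention. The names masked_record, append_residues and read_soft_masked_fasta are hypothetical, and the reader is a bare-bones standard-library parser rather than SeqAn's I/O code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>   // convenient for feeding in-memory data to the reader
#include <string>
#include <utility>
#include <vector>

struct masked_record
{
    std::string id;                                           // header line without '>'
    std::string sequence;                                     // residues in upper case
    std::vector<std::pair<std::size_t, std::size_t>> masked;  // half-open intervals
};

// Appends one line of residues; lower-case letters extend or open a masked interval.
void append_residues(masked_record & rec, std::string const & line)
{
    for (char const c : line)
    {
        auto const uc = static_cast<unsigned char>(c);
        if (std::isspace(uc))
            continue;                                 // also skips a trailing '\r'

        std::size_t const pos = rec.sequence.size();
        assert(rec.masked.empty() || rec.masked.back().second <= pos);

        if (std::islower(uc))
        {
            if (!rec.masked.empty() && rec.masked.back().second == pos)
                ++rec.masked.back().second;           // continue the current run
            else
                rec.masked.emplace_back(pos, pos + 1);
        }
        rec.sequence.push_back(static_cast<char>(std::toupper(uc)));
    }
}

// Reads all records; residue lines before the first header are ignored.
std::vector<masked_record> read_soft_masked_fasta(std::istream & in)
{
    std::vector<masked_record> records;
    std::string line;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;

        if (line[0] == '>')
        {
            records.emplace_back();
            records.back().id = line.substr(1);
        }
        else if (!records.empty())
        {
            append_residues(records.back(), line);
        }
    }
    return records;
}
```

Because positions are counted over the whole record, a masked run that continues across a line break ends up as a single interval. The resulting intervals could be handed directly to an interval-based masked sequence representation.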
Expected outcome for student:
- learn how to read and understand modern C++ library code and documentation
- improve C++ skills
- learn how to do good quality software engineering, including automated testing, documentation, version control
- learn how the SeqAn project is organised and how to participate in the development work-flow
- write a thesis
References
[1] http://www.molbiol.ox.ac.uk/analysis_tools/BLAST/BLAST_filtering.shtml
[2] http://www.sciencedirect.com/science/article/pii/009784859385006X , http://www.sciencedirect.com/science/article/pii/S0076687996660352 ; more information and documentation available; public reference implementation in C/C++ available in the ncbi-toolkit
[3] original unpublished, improved version: http://www.ncbi.nlm.nih.gov/pubmed/16796549
[4] http://bioinformatics.oxfordjournals.org/content/22/24/2980.full
[5] http://bioinformatics.oxfordjournals.org/content/30/17/i349.abstract