In this project you will write an interface that lets the Sequence I/O of
SeqAn seamlessly read SAM/BAM files as if they were FASTA/FASTQ files.
Introduction
SAM/BAM Files:
These are common file formats for storing the alignments of short sequences (often called reads) against a reference sequence, which is usually a longer sequence. To learn what SAM/BAM files look like, read the specification at
https://samtools.github.io/hts-specs/SAMv1.pdf.
FASTA/FASTQ Files:
These file formats are used for storing biological sequences, which can be DNA, RNA, or protein sequences. FASTQ additionally stores a quality value for each position of a sequence.
The SeqAn FormattedFile class supports reading and writing both FASTA/FASTQ sequence files and SAM/BAM alignment files. However, many people use SAM/BAM alignment files only for the sequences inside them and discard the mapping information. That is, given a SAM/BAM file, one wants to extract the sequences together with their corresponding identifiers and qualities.
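To illustrate the extraction step, here is a minimal sketch in plain C++ that uses only the standard library and deliberately no SeqAn API. It takes one text line of a SAM file and pulls out the identifier (QNAME, field 1), the sequence (SEQ, field 10), and the qualities (QUAL, field 11) as laid out in the SAM specification. The names SeqRecord, samLineToSeqRecord, reverseComplement, and toFastx are hypothetical and chosen only for this example; BAM input, which is binary, is not covered.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; it is not a SeqAn class.
struct SeqRecord
{
    std::string id;
    std::string seq;
    std::string qual;   // empty if the SAM record stores no qualities ("*")
};

// Reverse-complement a DNA string. Only upper-case A, C, G, T are complemented;
// every other character (e.g. N) is kept unchanged.
std::string reverseComplement(std::string s)
{
    std::reverse(s.begin(), s.end());
    for (char & c : s)
    {
        switch (c)
        {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: break;
        }
    }
    return s;
}

// Parse one line of a SAM file into a sequence record.
// Returns false for header lines ('@'), empty lines, lines with fewer than
// the 11 mandatory fields, and records without a stored sequence ("*").
// Assumption: the FLAG field is a valid decimal number (std::stoul throws otherwise).
bool samLineToSeqRecord(std::string const & line, SeqRecord & rec)
{
    if (line.empty() || line[0] == '@')
        return false;

    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, '\t'))
        fields.push_back(field);

    if (fields.size() < 11 || fields[9] == "*")
        return false;

    unsigned long const flag = std::stoul(fields[1]);
    rec.id = fields[0];
    rec.seq = fields[9];
    rec.qual = (fields[10] == "*") ? std::string() : fields[10];

    // FLAG bit 0x10: SEQ is stored reverse-complemented, so restore the
    // original read orientation and reverse the qualities accordingly.
    if (flag & 0x10)
    {
        rec.seq = reverseComplement(rec.seq);
        std::reverse(rec.qual.begin(), rec.qual.end());
    }
    return true;
}

// Format a record as FASTQ, or as FASTA if no qualities are available.
std::string toFastx(SeqRecord const & rec)
{
    if (rec.qual.empty())
        return ">" + rec.id + "\n" + rec.seq + "\n";
    return "@" + rec.id + "\n" + rec.seq + "\n+\n" + rec.qual + "\n";
}
```

Restoring the original read orientation for reverse-strand alignments is a design choice of this sketch, not a requirement stated above. The sketch also does not filter secondary (FLAG bit 0x100) or supplementary (FLAG bit 0x800) alignments, so a read with several alignment records would be emitted more than once.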
Tasks
- Get familiar with the file formats (FASTA/FASTQ and SAM/BAM)
- Take a closer look at the FormattedFile implementation of SeqAn
- Implement the interface for reading SAM/BAM files as sequence files
- Test the implementation with example data
Stretch goals
- Create tests within SeqAn that check whether the implementation works
- Bring your code in line with the SeqAn standards and integrate it into SeqAn
Extension as Bachelor Project
TODO
Literature