Weekly Reports for the Bachelor Thesis "Comparative Genomics with MEGAN and RazerS" by Hannes Hauswedell
Week 1 (2009-07-11..2009-07-18)
E-Values and Bit-Scores:
- contacted the NCBI BLAST help/developer mailing list about scoring schemes and parameter calculation in BLAST
- received large amounts of documentation regarding the topic (50+ pages)
- started reading the docs, beginning to think that dynamically or statically linking with BLAST is the only way of reaching similar scores
- postpone decision regarding this topic, for now use the functions and values already implemented earlier
- this will become important later, as we will definitely need protein-scoring in addition to nucleotide scoring
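For reference, a minimal sketch of the textbook Karlin-Altschul relations behind bit-scores and e-values. This is not the RazerBlastS code. The lambda and K values in the usage note are placeholders of the kind BLAST reports for ungapped +1/-3 nucleotide scoring and are an assumption here, and BLAST additionally applies effective-length corrections that this sketch ignores.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, lam, k, query_len, db_len):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    """Same e-value expressed through the bit-score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)
```

Usage, with the assumed placeholder parameters: `e_value(20, 1.374, 0.711, 100, 1_000_000)` and `e_value_from_bits(bit_score(20, 1.374, 0.711), 100, 1_000_000)` agree up to rounding, which is a quick consistency check for any table of lambda and K values.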
Overall progress on "BLASTN-Mode" for RazerS:
- implemented a dumpAlignment()-function for writing the final nucleotide-alignment to the report
- format seems to be compatible with Blast-Output, but more tests have to be done to make sure
- started to do some test runs with real-world data to have real reports to compare with output of RazerBlastS
- ran out of memory very quickly → need to get smaller samples and databases or better hardware (or just use other hardware)
=> all-in-all less progress than hoped for, but about as much as expected (I knew time would be limited since I am also preparing for my last exam)
Week 2 (2009-07-19..2009-07-26)
Overall progress on "BLASTN-Mode" for RazerS:
- implemented some more command-line options:
- it is now possible to choose the window-size (-W N), which deactivates parameter-choosing and uses an ungapped shape of length N (this behavior might be desirable, as it is closer to BLAST)
- generated an "artificial" dataset that can be used with available hardware
- first test-runs on the data with regular BLAST(2) and RazerBlastS
- RazerBlastS didn't produce any results
- spent a lot of time debugging this, fixed a lot of issues on the way, but found a new one
- was able to narrow down the problem, but was not able to fix the issue yet (RazerBlastS throws SIGABRT somewhere deep in seqan)
Week 3 (2009-07-27..2009-08-02)
Overall progress on "BLASTN-Mode" for RazerS:
- after some time the crash was resolved: there was a problem in the original verification function
- With RazerBlastS running I did tests to compare output with BLAST's
- RazerBlastS produced results!
- => most of the hits looked "similar" to BLAST's, but the actual alignments and scores differed
- found out that they actually differ inside RazerBlastS as well - between verification phase (where the actual alignment is computed) and output-phase (where it is recomputed from genome coordinates)
- first thought this was just the difference between Gotoh and BandedGotoh → wrong!
- it turned out that in verification the genome was aligned against the read and in output the other way around (which makes a difference because the DP-Matrix-Configuration is asymmetrical)
- that solved most of the issues, however many alignments had a huge "gap-prefix" or "gap-suffix"
- this was due to wrong parameter-retrieval from the match-fragments which could be fixed
- many formatting improvements to the BLASTN-Output-Format
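To illustrate why swapping genome and read changes the result under an asymmetrical DP-Matrix-Configuration, here is a minimal sketch. It is not the SeqAn implementation: it assumes linear gap costs instead of Gotoh's affine ones, and the function name and scoring values are hypothetical. End gaps are free only for the second sequence, so the first sequence must be aligned completely.

```python
def semiglobal_score(full, partial, match=1, mismatch=-3, gap=-5):
    """Best score for aligning ALL of `full` against ANY substring of `partial`.

    Leading and trailing overhang of `partial` is free; gaps that leave
    characters of `full` unaligned are always charged.
    """
    # Row 0 is all zeros: the alignment may start anywhere in `partial`.
    prev = [0] * (len(partial) + 1)
    for i in range(1, len(full) + 1):
        # Column 0: the first i characters of `full` are aligned to gaps.
        cur = [i * gap] + [0] * len(partial)
        for j in range(1, len(partial) + 1):
            step = match if full[i - 1] == partial[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # Maximum over the last row: the alignment may end anywhere in `partial`.
    return max(prev)
```

With a read "ACGT" and a genome "TTACGTTT", `semiglobal_score(read, genome)` is 4 (the read matches inside the genome), while `semiglobal_score(genome, read)` is -16, because every genome character outside the match now has to be paid for as a gap.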
=> Current State:
- on the testdata RazerBlastS finds all of the matches that BLAST finds, with almost the same alignments
- it finds very few additional useless matches (not yet sure where they come from)
- the scores on the matches are nearly identical to BLAST's score, even the Bit-Score
- The output-report already looks very similar and should satisfy MEGAN (no testing done there, yet)
Week 4 (2009-08-03..2009-08-09)
Overall progress on "BLASTN-Mode" for RazerS:
- fixed a minor problem in e-Value-calculation and switched to "scientific" output; e-Values are now similar to BLAST's
- added "overview" tables to output (beginning of sections) and fixed some formatting issues
- spent an entire day figuring out where strange hits with bad scores and alignments come from. Found out that those were reverse hits marked as duplicates, which (because of the way RazerS marks duplicate hits :O) lose their "reverse"-attribute and are therefore aligned against some forward sequence, resulting in a useless and confusing alignment during output-phase => now ignoring matches marked as duplicates
- changed back to BandedGotoh() in verification, which halves the execution time on a medium-sized testset, but produced bad results (in the process of finding errors I had previously switched to regular Gotoh())
- after a lot of debugging I found an error in calculation of diagonals
- fixed that and added a general +-3 to k
- now results are nearly identical to "real" Gotoh!
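A minimal sketch of why the diagonals matter for a banded alignment. This is not SeqAn's BandedGotoh(): it assumes linear gap costs and a global alignment, and the function name and band parameters are hypothetical. Only cells whose diagonal j - i lies inside [lower, upper] are computed, so a band that misses the diagonals of the optimal path gives a worse score than the unbanded DP.

```python
NEG_INF = float("-inf")


def banded_global_score(a, b, lower, upper, match=1, mismatch=-3, gap=-5):
    """Global alignment score restricted to diagonals lower <= j - i <= upper."""
    n, m = len(a), len(b)
    if not (lower <= 0 <= upper and lower <= m - n <= upper):
        raise ValueError("band must contain the start and the end diagonal")
    # Cells outside the band stay at -inf and can never be part of a path.
    score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i + lower), min(m, i + upper) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                step = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + step)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)  # gap in b
            if j > 0:
                best = max(best, score[i][j - 1] + gap)  # gap in a
            score[i][j] = best
    return score[n][m]
```

For a = "GACGTACGT" and b = "ACGTACGTG" the optimal path runs on diagonal -1. A band of [0, 0] yields -27 (nine mismatches), while [-1, 0] or any wider band yields the unbanded optimum of -2. Widening the band by a small safety margin trades a little speed for robustness against an off-by-a-few diagonal estimate.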
Overall progress on "BLASTX-Mode" for RazerS:
- started work, CLI parameters added
- began researching an efficient method for Codon→AminoAcid translation
- didn't find anything useful in seqan
- didn't manage to adapt ModView or ModifiedString<> because they don't like char[3] to char translation
- asked on seqan-dev for help
Weeks 5 & 6 (2009-08-10..2009-08-23)
- spent lots of time trying to switch to real local alignments from the current more-or-less semi-global approach
- had discussions with David about this and with Tobias via the list
- no real progress, other than the strong impression that this is not going to work the way David (and I) had planned
- progress on Protein-Mode:
- wrote a codon-conversion table that enables codon-translation in constant time
- wrote calls for translating a nucleotide-sequence with 1, 3 or 6 Frames
- wrote an import function for reading fasta-nucleotide-sequences from file and directly translating them
- imported tables of kappa- and lambda-values for Protein-Scoring from BLAST-Source-Code
- adapted e-Value and bitScore-calculation to also work for Protein-Scoring
- some code-refactoring to increase reusability and readability
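The codon table and the 1/3/6-frame translation from the Protein-Mode items above can be sketched as follows. This is an illustration, not the thesis code: all names are hypothetical, it assumes uppercase ACGT input and the standard genetic code, and any codon containing another character is translated to 'X'. Each codon is mapped to an index 16*b1 + 4*b2 + b3 into a 64-character string, which makes the lookup constant-time.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
BASE_INDEX = {base: i for i, base in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_codon(codon):
    """Constant-time lookup of one codon; unknown bases give 'X'."""
    try:
        index = (16 * BASE_INDEX[codon[0]]
                 + 4 * BASE_INDEX[codon[1]]
                 + BASE_INDEX[codon[2]])
    except KeyError:
        return "X"
    return AMINO_ACIDS[index]


def translate(seq, offset=0):
    """Translate complete codons of `seq`, starting at `offset` (0, 1 or 2)."""
    return "".join(translate_codon(seq[i:i + 3])
                   for i in range(offset, len(seq) - 2, 3))


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate_frames(seq, frames=6):
    """Return 1, 3 or 6 translations: forward frames first, then reverse."""
    if frames not in (1, 3, 6):
        raise ValueError("frames must be 1, 3 or 6")
    result = [translate(seq, offset) for offset in range(min(frames, 3))]
    if frames == 6:
        rc = reverse_complement(seq)
        result += [translate(rc, offset) for offset in range(3)]
    return result
```

For example, `translate_frames("ATGGCCTAA", 3)` gives `["MA*", "WP", "GL"]`; with 6 frames the three translations of the reverse complement "TTAGGCCAT" are appended.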
Week 7 (2009-08-24..2009-08-30)
- fixed e-Value-based Sorting of Matches
- switch to exact local Alignments via localAlignment()
- this works well, but is very slow
- removed the global typedefs to make read and genome-types generic (needed for BLASTX-Mode)
- → this resulted in many changed signatures
- a lot of improvements on Protein-Support
- still no protein alignments, though
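A minimal sketch of an exact local alignment score (Smith-Waterman style), to illustrate why the switch to exact local alignments is accurate but slow. It is not SeqAn's localAlignment(): it assumes linear gap costs and hypothetical scoring values, and it only returns the score. Every cell of the full read-by-genome matrix is computed, so the running time grows with the product of both lengths.

```python
def local_score(a, b, match=1, mismatch=-3, gap=-5):
    """Best score of any substring of `a` aligned to any substring of `b`."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart wherever the score would go negative.
            cur[j] = max(0, prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Unlike an alignment with one-sided free end gaps, this score does not depend on the order of the arguments: `local_score("TTACGTTT", "GGACGTGG")` and the swapped call both return 4 for the shared "ACGT".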
Week 8 (2009-08-31..2009-09-06)
- fix in e-Value calculation makes it closer to Blast's
- lots of progress with BLASTX, RazerBlastS now produces Protein-Alignments!
- change in find_swift.h to prevent it from overwriting the threshold-parameter
- we now have a lot more results than we need...
- had another appointment with David to plan the last weeks of work
Week 9 (2009-09-07..2009-09-13)
- implemented the verification function for ungapped alignments, already works for BLASTX!
- started writing the thesis paper! spent a lot of time on organizational stuff and LaTeX…
Weeks 10-11.5 (2009-09-14..2009-10-01)
- been busy writing most of each day
- fixed minor bugs along the way