"The latter approaches are especially useful for sequence alignment, because substitution matrices are also fundamental parts of scoring algorithms in most sequence alignment applications and the impact of reductions that are based on the same metric as the target function is intuitively clear. Besides the method of reduction, integral parameters are the size of the original alphabet and the desired output size of the target alphabet, i.e. the number of clusters that remain after reduction. All of the aforementioned methods begin the reduction on the canonical 20-letter amino acid alphabet that includes all proteinogenic amino acids, without the rare amino acids Selenocysteine (U) and Pyrrolysine (O), and that does not include a character for the STOP codon or any of the frequently encountered wildcard characters (X for any amino acid; B for N or D; Z for Q or E). Depending on the method, the target size may be fixed or variable, some research indicating that sizes as low as 5 are sufficient (Bacardit et al., 2009), most suggesting that 10-12 letters are required and/or most effective (Li et al., 2003; Murphy et al., 2000; Ye et al., 2011)."
-- from Hannes Hauswedell's master thesis: http://www.mi.fu-berlin.de/en/inf/groups/abi/theses/master_dipl/hauswedell/msc_thesis_hauswedell.pdf (pp. 14-15)

* study of literature: which alphabet reductions have been used historically? What methods were used for clustering? What are recent publications in the field? Which reductions are used by current protein alignment programs, e.g. Lambda, Diamond, MMSeqs2, Malt, Rapsearch2, Paladin?
* implementation: select an interesting subset of reductions and implement them in SeqAn. Write test and conversion functions…
* implementation: add support for the implemented alphabets to the Lambda application; possibly also add support to another application.
* evaluation: study the effect of different reductions on the performance and sensitivity of Lambda (and possibly another application). What can be said of the different reductions? What influence does the size of the reduced alphabet have? Can you recommend that Lambda choose a different alphabet in the future?
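
To make the idea of a reduction and its conversion function concrete, here is a minimal Python sketch, independent of SeqAn. It treats a reduced alphabet as a partition of the canonical 20-letter alphabet into clusters and maps each residue to a representative letter. The grouping `EXAMPLE_CLUSTERS` and the names `build_reduction` and `reduce_sequence` are assumptions chosen for illustration. They are not taken from the thesis, from SeqAn, or from Lambda.

```python
# Canonical 20-letter amino acid alphabet (no U, O, STOP, or wildcards X/B/Z).
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical 10-cluster grouping, used here only as an example input.
EXAMPLE_CLUSTERS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]


def build_reduction(clusters, alphabet=CANONICAL):
    """Return a dict mapping each letter to the first letter of its cluster.

    Raises ValueError unless the clusters form a partition of `alphabet`,
    i.e. every letter occurs in exactly one cluster.
    """
    table = {}
    for cluster in clusters:
        for letter in cluster:
            if letter in table:
                raise ValueError(f"letter {letter!r} is in more than one cluster")
            table[letter] = cluster[0]
    missing = set(alphabet) - set(table)
    extra = set(table) - set(alphabet)
    if missing or extra:
        raise ValueError(
            f"not a partition: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    return table


def reduce_sequence(seq, table):
    """Convert a protein sequence to the reduced alphabet.

    Letters outside the canonical alphabet (e.g. X, B, Z, U, O) raise
    KeyError; handling them is left open in this sketch.
    """
    return "".join(table[c] for c in seq.upper())


if __name__ == "__main__":
    table = build_reduction(EXAMPLE_CLUSTERS)
    print(len(set(table.values())))        # size of the reduced alphabet
    print(reduce_sequence("MKVLA", table))
```

The partition check doubles as a simple test for any candidate reduction, and the size of the reduced alphabet is the number of distinct representatives, which is the parameter the evaluation step would vary.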