L1 Cache | ~3-4 cycles | ~0.5-1 ns |
L2 Cache | ~10-20 cycles | ~3-7 ns |
L3 Cache | ~40-45 cycles | ~15 ns |
Main memory | ~120-240 cycles | ~60-120 ns |
The purpose of introducing the concept of minimizers into sequence alignment is to cluster adjacent k-mers in memory. The origin concept of minimizer was defined as lexicographical minimum substring with fixed length[2]. The main idea of minimizer concept is adjacent k-mers in a sequence are very likely to share the same minimizer such that when searching k-mers from left to right in given sequences they are tending to fall into the same searching block already loaded into caches by their previous k-mers. However simple lexicographical minimum minimizes has its own defects. For an example it might create many unwanted contents swapped into caches increasing times of cache miss dramatically. These unwanted contents can be caused by nonadjacent k-mers but having same minimizers. Besides, other factors such as "too complex" strategy to acquire minimizers can also leads to this undesirable effects.
Generally, a minimizer function takes a k-mer as an input and returns its minimizer. Three relevant requirements for minimizer functions are enumerated in the following list. These are not definitions of minimizer functions but functions meeting these requirements might be more cache friendly or adaptable to q-gram index.
Clustering is one of the defects. Lexicographic minimizer results bias toward lexicographic smaller and low-complexity k-mer. For an example k-mer AA...A is more likely to be a minimizer than k-mer TT...T.
Its better to keep each corresponding block of consistent k-mers created by minimizers at appropriate size. To be more specific, caches cant be utilized efficiently in too small size blocks because few adjacent k-mers in blocks can be loaded into caches. On the other hand too large size of blocks will be more time consuming for searching.
Some simple operations might be worth consideration. For example minimizer := min xor max can reduce numbers of unwanted nonadjacent minimizers and create more uniformly distributed minimizers while still keeps the probability that adjacent k-mers share the same minimizes. Its defect is its more complex to calculate.
Last but not least the cache performance does not only depend on how to choose minimizers. It is also very sensitive to how the function was programed. "Cache friendly" [3] programing is crucial to the final performance of the program. In other words the concept of minimizers might be able to enhance indexing structures generically, but adapting minimizers to these structures needs much more specific optimizations.
New strategies developed will be assessed on the basis of a test data set. Several aspects of results like computing speed, distribution will be compared to those of lexicographical minimizer.
I | Attachment | Action | Size | Date | Who | Comment |
---|---|---|---|---|---|---|
png | dp_antidiag.png | manage | 22 K | 18 Mar 2015 - 15:24 | CxpanUserTopic | |
png | pmsb_2015_cache.png | manage | 15 K | 18 Mar 2015 - 15:37 | CxpanUserTopic |