TODO: Wait for evaluation of Hobbes on human, execute class, update tables.
[Manuel]: Section S5 was updated to reflect the usage of the read mappers for the new evaluation. Section 3.3 was moved to the appendix because of space constraints.
Dear Dr. ???? (contact?)
Please find below our answers to the referees' comments. As you can tell from the time it took to revise our paper, we did indeed accept almost all criticism regarding the evaluation of our algorithm and substantially redid the evaluation part. We hope that the reviewers are satisfied with our changes and look forward to their comments.
For readability, we cite the points raised by the different reviewers and add our answer underneath each.
We changed our evaluation substantially. In particular, we now map against the whole human genome.
Use of the Rabema benchmark is not well justified here. If my understanding is correct, use of Rabema would tend to give RazerS 3 an unfair advantage since RazerS 3 and Rabema share certain assumptions that aren’t shared by the competing tools. The authors should either choose a different way to compare the tools or explain why it’s fair to use Rabema here. For instance, is the Rabema “gold standard” based on edit distance? If so, does that not give the edit-distance-based tools (RazerS 3) an advantage over the tools that use a non-edit-distance-based scoring scheme (BWA, Bowtie 2, etc)?

This might be a misunderstanding. The Rabema benchmark is not built into RazerS 3, and the definition of "lakes" is independent of the tool. In our tool, we chose to report one match per lake and parallelogram (i.e., there can be several hits per lake). The Rabema benchmark measures sensitivity in the edit distance model; since our tool and many others are edit distance based, this is an appropriate benchmark. To address the quality-based mappers, we added to the Rabema benchmark a test for finding the original location, which is independent of the chosen distance model. In addition, we changed our evaluation substantially and added new benchmarks similar in spirit to those of Bowtie 2 and SHRiMP 2.
Some of the comparisons take reads originating from across the whole genome and align them to just a single chromosome of the genome (e.g. human chr2). Whether a tool performs well on such a test depends heavily on how quickly the tool throws away a read that fails to align. This is not a very relevant basis for comparing tools.

We changed this and now compare against the whole genome.
The “island criterion” is in some sense “built in” to RazerS 3 but not into the tools compared against. This can lead to an unfair advantage for RazerS 3. For instance bowtie -k 100000 might “spend” its 100,000 alignments describing alignments to distinct positions within a long homopolymer, whereas RazerS 3 won’t (or shouldn’t) do that (and whether this is a good thing is, I think, debatable - see point 10 below).

See above. (We changed and expanded the evaluation section.)
“To make results comparable, we filter the output of all read mappers to obtain only alignments with a maximal error rate of 4%.” Why does this make the results more comparable, and is it fair? This would seem to handicap the tools that use more sophisticated scoring schemes, like BWA and Bowtie 2.

We filtered the results to make them comparable, since we believe that an edit distance match with, say, 2 errors is always better than a match with more than 2 errors. However, to address this point, we now additionally report the unfiltered results in the new evaluation section.
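To make the filtering step concrete, the following is a minimal sketch (not the script used for the evaluation) that keeps only SAM records whose edit distance, taken from the optional NM tag, is at most 4% of the read length:

```python
import sys

MAX_ERROR_RATE = 0.04  # maximal allowed error rate (4%)

def edit_distance(fields):
    """Extract the edit distance from the optional NM:i: tag, if present."""
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None

for line in sys.stdin:
    if line.startswith("@"):           # pass the SAM header through unchanged
        sys.stdout.write(line)
        continue
    fields = line.rstrip("\n").split("\t")
    seq, nm = fields[9], edit_distance(fields)
    # Keep the record only if the sequence is stored, the NM tag is present,
    # and the edit distance is within 4% of the read length.
    if seq != "*" and nm is not None and nm <= MAX_ERROR_RATE * len(seq):
        sys.stdout.write(line)
```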
A way to solve many of these issues at a stroke would be to perform a comparison more like what Heng Li does here (http://lh3lh3.users.sourceforge.net/alnROC.shtml), which is essentially also what is done in the Bowtie 2 manuscript (http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.1923.html), using the whole human genome as the reference.

Taking into account the comments of all reviewers, we changed the evaluation section substantially, incorporating ideas from the Bowtie 2 and SHRiMP 2 evaluations.
We shortened the SWIFT section and point to the previous publication. In Section 2.4, we once wrote RazerS instead of RazerS 3; we corrected this.
We updated Section 2.4 and hope that it is clearer now.
We had not used the optimal BWA parameters. We redid the evaluation with the correct parameters.
We added a note about the memory consumption of the whole human genome in Section 3.5. However, because of space constraints, we left the detailed discussion and tables in the supplement.
We added an explanation of how we compute the positional probabilities in Section 2.2. For mixed-length read sets, we compute loss rates for each read length and combine them into a weighted loss rate.
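As an illustration, a weighted loss rate for a mixed-length read set could be computed as in the following sketch; `loss_rate_for_length` stands in for the per-length loss rate computation described in Section 2.2 and, like the example numbers, is purely hypothetical:

```python
from collections import Counter

def weighted_loss_rate(read_lengths, loss_rate_for_length):
    """Combine per-length loss rates into a single loss rate, weighting
    each read length by its frequency in the read set."""
    counts = Counter(read_lengths)
    total = len(read_lengths)
    return sum(count / total * loss_rate_for_length(length)
               for length, count in counts.items())

# Example: 60% of the reads have length 100, 40% have length 150, with
# made-up per-length loss rates of 0.001 and 0.0005, respectively.
rates = {100: 0.001, 150: 0.0005}
lengths = [100] * 6 + [150] * 4
print(weighted_loss_rate(lengths, rates.get))  # ~0.0008
```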
The referee raises some good questions. Edit-distance-based mappers, which are agnostic to quality values, often have an advantage in finding real variants (SNPs, SNVs), since those are not "marked" by bad quality values that could misguide a quality-based mapper, as convincingly shown in the SHRiMP 2 paper. We now discuss this in greater detail and added an evaluation similar to the one in SHRiMP 2. Affine gap costs can be advantageous for finding larger indels, which is mostly not addressed by standard read mappers but by specialized tools like SplazerS, Pindel, etc.
I’m guessing that the answer is: it’s a tradeoff. Edit-distance based tools are a bit simpler, and therefore usually a bit faster. For instance, use of an affine gap penalty can lead to a more challenging “verification” step because the low penalty for a gap extension leads to wider parallelograms. There should be at least some discussion of this tradeoff, since the leading tool (BWA) has what seems like a strictly more sophisticated policy than RazerS 3’s.

We agree with the referee and added some discussion.
Yes, this was a typo and is fixed in the updated Section 3.
This is indeed a valid and interesting point. How to perform load balancing is always a tricky question, and there are many possible ways to do it. We should have considered this before, given that we manually parallelized mrFAST in this fashion. We note that the choice of package size is critical and adds another layer of complexity, which our dynamic load balancing does not require.
We evaluated this using 8-way parallelism in the following way: The input of 10M reads was split into packages of down to 250k reads (yielding p = 8/16/32/40 packages), and we ran RazerS in the same way we ran mrFAST, using 8 processes. We do not doubt that the resulting load balancing is bad; however, consider the following accumulated CPU times for each step.

mode | filtration [s] | verification [s] | filt.+verif. [s]
---|---|---|---
dynamic | 242.6 | 200.5 | 443.1
p=8 | 216.4 | 203.5 | 419.9
p=16 | 388.0 | 191.9 | 579.9
p=32 | 694.3 | 184.7 | 878.0
p=40 | 1025.4 | 200.5 | 1225.9
Note that these times exclude the time for I/O of the reads but include the time for I/O of the contigs. The CPU times should speak for themselves, since even an optimal load balancing algorithm can only achieve a total wall-clock time of TIME/t. The verification time is probably reduced because the processes run more isolated than the threads and can be scheduled better by the OS. As we can see, the overhead for scanning over the genome "eats up" any possible advantage of package-based load balancing.
Section S3.3 also describes these results.
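To make the TIME/t argument concrete, the following small illustration computes the best achievable wall-clock time for each mode from the accumulated filtration + verification CPU times in the table above, assuming t = 8 fully utilized cores and perfect load balancing:

```python
# Accumulated filtration + verification CPU times [s] from the table above.
cpu_time = {"dynamic": 443.1, "p=8": 419.9, "p=16": 579.9,
            "p=32": 878.0, "p=40": 1225.9}
threads = 8  # t cores, allocated exclusively

for mode, total in cpu_time.items():
    # Even an optimal scheduler cannot beat TIME / t wall-clock time.
    print(f"{mode:>7}: wall-clock lower bound = {total / threads:.1f} s")
```

Even under this idealized assumption, only the p = 8 split could compete with the dynamic scheme; from p = 16 on, the additional filtration work alone outweighs any load-balancing benefit.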
We are sorry if we gave the impression that a hit would be counted as unique if it lies in a homopolymer island. We indeed do not do that. The intention of the island criterion is to be fair to any read mapper in such a region: any end position within the lake counts as correct for the Rabema benchmark. A read mapper can provide all locations in each lake (which is not practical), several, or only one. A good measure for the uniqueness of a match is the size of the lake, i.e., a lake of size 8 with maximal edit distance 4 around a perfect match is "unique", whereas a lake that is much larger (say, greater than half the size of a read) indicates a repetitive region. This is explained in detail in the Rabema paper.
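A minimal sketch of this uniqueness heuristic, with a hypothetical function name and thresholds taken only from the example numbers above (not from the RazerS 3 code):

```python
def classify_by_lake_size(lake_size, read_length, max_errors):
    """Classify a match by the size of its Rabema lake: a lake of size
    2 * max_errors around a perfect match is as small as possible, i.e.
    the match is unique; a lake much larger than half the read length
    hints at a repetitive region."""
    if lake_size <= 2 * max_errors:
        return "unique"
    if lake_size > read_length / 2:
        return "repetitive"
    return "ambiguous"

print(classify_by_lake_size(8, 100, 4))    # unique
print(classify_by_lake_size(60, 100, 4))   # repetitive
```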
RazerS actually creates multiple hits in a homopolymer, one in each parallelogram. This is stated in the last paragraph of Section 2.3:

"The definition in (Holtgrewe et al., 2011) offers a clean, formal, and practical definition of the expected output of read mappers. RazerS reported one best match in every parallelogram whereas RazerS 3 reports one best match for each lake in each parallelogram. This increases the sensitivity and makes RazerS 3 a fully sensitive read mapper for the read mapping problem with the above match definition."

The reviewer is partially correct with this statement. Sharing parts of their trace is exactly what we define as "trace equivalence" in the Rabema paper. However, as reasoned in the same paper, this is not sufficient; one also needs to combine trace equivalence with (some kind of) "neighbour equivalence" to obtain a mathematical representation of what is intuitive. This gives rise to k-trace equivalence.
The Rabema paper (and its erratum) gives a detailed discussion, as does the following link: http://www.seqan.de/wp-content/uploads/downloads/2012/03/rabema-supp_mat2.pdf

It seems that the right-hand label was cropped during a manual PDF-to-EPS conversion. We fixed that in the resubmission.
We clarified the point.
We added data for one thread in Table S1. Furthermore, we updated Section S2 with an appropriate discussion to clarify this point further.
This is a good remark. However, how many blocks need to be computed does, in general, not only depend on the scores in the first block. A generalization of the reviewer's idea is the so-called Ukkonen trick [Ukkonen, E. (1985). Finding approximate patterns in strings. J. Algorithms, 6(1), 132–137.], which reduces the running time from O(mn) to O(kn) on average by computing the DP matrix column-wise from the topmost cell down to the last active cell (the last cell in the column with score <= k). This idea is already implemented in our unbanded Myers algorithm for large patterns to improve its running time by computing only the first words up to the word that contains the last active cell. We switch between the two variants (single-word and multi-word) depending on the length of the reads (<= 64 vs. > 64). As the effective performance drop is influenced by the specificity of the preceding filter, we removed the absolute numbers and clarified the paragraph.
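For illustration, here is a minimal scalar sketch of Ukkonen's cutoff for approximate string matching under edit distance; the actual implementation in RazerS 3 applies the same idea within the multi-word bit-parallel Myers algorithm rather than to a plain DP column:

```python
def approx_find(pattern, text, k):
    """Report end positions in text where pattern matches with at most k
    errors (edit distance). Each DP column is evaluated only down to the
    last active cell (Ukkonen's cutoff), giving O(kn) time on average."""
    m = len(pattern)
    col = list(range(m + 1))    # DP column for the empty text prefix
    lact = min(k, m)            # last active cell: deepest row with value <= k
    hits = []
    for j, c in enumerate(text, start=1):
        top = min(lact + 1, m)  # only rows 1..top can become active
        diag = col[0]           # D[i-1][j-1]
        col[0] = 0              # semi-global: a match may start anywhere
        for i in range(1, top + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur = min(diag + cost, col[i] + 1, col[i - 1] + 1)
            diag, col[i] = col[i], cur
        if top < m:
            col[top + 1] = k + 1  # cells below the cutoff are known to exceed k
        lact = top                # re-establish the last active cell
        while lact > 0 and col[lact] > k:
            lact -= 1
        if lact == m:             # whole pattern aligned with <= k errors
            hits.append(j)
    return hits

print(approx_find("abc", "xxabcxabxcx", 1))  # [4, 5, 6, 8, 9, 10]
```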
The referee is correct in that the space consumption grows linearly. We added results and a discussion about this in the supplement S6.
We substantially redid our evaluation. In general, we now report both unfiltered and filtered results (see above) and added further benchmarks in the spirit of SHRiMP 2 and Bowtie 2.
Regarding the Drosophila reads: We contacted the author of the original paper that used this data set, and he stated that these were probably among the first paired reads produced outside of Illumina and thus contain many errors. The author wrote that current reads exhibit much better quality.
When collecting the data sets for the original manuscript, these were the only 100bp genomic Drosophila Illumina reads we could find. We have now found an additional, better data set and have replaced the low-quality data set with the new one. The numbers are now more comparable to the ones for the other organisms.
The reviewers' point is valid. We changed the single-end and paired-end real-world data sets to align against the whole human genome; see the updates in Section 3.5 and our comments in the new experimental section.
Again, the authors state that they have sampled only 100k reads. I believe it is necessary to extend the performance analysis to a full set of reads (eg. a HiSeq2000 lane) and a full mammalian chromosome to give the reader an impression of the real-life performance.

The sampling is only performed for the Rabema benchmark, which we now run against the whole human genome. Sampling uniformly from the read set gives a good representation of the results, which will not change with a larger number of reads. Because of time constraints on the whole human genome, we stuck with 100k reads for this benchmark. For the real-world data sets, however, we used 10M reads in the original submission and have increased the number of reads further, as can be seen in the updates to Section 3. Running times for larger data sets can be estimated by splitting the reads into multiple packages and mapping them independently, as indicated in the updated manuscript.
In addition, we refer to our new experimental section, which contains large data sets.
TODO(Manuel): Update Section 3 with the data once we have collected it.

We are using default Mason settings. A clarifying sentence has been added in Section 3.5.
As the reviewer suggested, we have contacted the authors of Hobbes and mrFAST. A new version of mrFAST has been released by Can Alkan, which is now used, as indicated in the updated Section S5.
For Hobbes, one problem was that the program has to be compiled on the exact machine it is run on. Recompiling resolved some but not all problems.
Hobbes is not able to align reads longer than 100 bp, as confirmed by its authors. More problems appeared for Hobbes when switching to the cluster and 12 threads: Hobbes kept crashing. These issues could not be resolved by the day of resubmission.
We added tables with such times to the supplement. However, we want to remark that the "SYSTEM + USER" time (i.e., the CPU time) is not a very good measure: First, parallel programs are usually run with a fixed number of threads T, and T cores are allocated exclusively to the program. In parallel computing, one normally only considers wall-clock time, since CPU time increases for well-parallelized programs due to synchronization and increases less for programs with multiple sequential sections. Second, I/O time is not considered.
Also note that we were not able to record the CPU time for mrFAST exactly in a straightforward fashion because of the manual parallelization that has to be employed.
We added memory footprints of all aligners.
We also think that the exact parametrizations of the tools should be clearly documented; many papers are missing this information. However, because of space constraints, we feel that the supplement is the best place for the detailed description.
We completely changed our evaluation section, which does not contain 454 reads anymore (see point 4 of reviewer 2). The performance difference was mostly due to filtering the results according to edit distance.
For paired-end read mapping, we chose to compare only concordant reads, because most mappers only support those. Of course, the reviewer is correct that discordant reads are useful for variant detection. Such mappings are addressed by other tools from our group. We also refer to our new evaluation section.
2. Maybe we can turn off the post-alignment filtering and delete/turn off quality values with the following options. That would ease the discussion of whether it is fair to filter alignments that quality-based tools consider to be better.