Thsi page will be the discussion forum.
Lecture 1
REINERT (30.4): Hier könnte eine ausgearbeitetere Version der Vorlesungsabfolge stehen. Oder im SVN.
Momentan kann ich dann keine Kommentare geben.
Read Mapping
- 1 Review about RNA-Seq
- 1.1 Motivation
- 1.2 Pipeline
- 2 Outline about Read Mapping
- 3 Bowtie
- 3.1 Backtracking (in general)
- 3.2 Suffix Arrays
- 3.3 BWT
- Compression transformation
- Cyclic permutation
- Transformation output
- Decompression transformation
- BWT's advantages
- 3.4 Exactmatch
- Overline: What do we have? What do we want to do?
- FM-Index
- Short idea of BWT compression
- Backward-search algorithm
- Definition of C and Occ (with examples)
- Algorithm of searching a pattern of length=1
- Searching a longer pattern
- Algorithm to locate P in T
- 3.5 Backtracking of Bowtie
(Duc, Nicolas und Corinna)
Fragen zu lecture 1
F: Wenn ich ein Extrasymbol wie '#' anhänge, wird dann der Index I, den ich als BWT Output und auch für den Input zur Rücktransformation brauche, nicht trivial?
A: # lässt sich zumindest in F trivial finden. Beim Entpacken von hinten nach vorn wird dort auch angefangen, also braucht man I nicht extra zu speichern.
--
DavidWeese - 04 May 2010
F: Indem ich L nach der Position von # scanne, weiß ich automat. mein I. Oder soll dies
Performance steigernd sein?
A: Beim Entpacken von vorn nach hinten braucht man ja ein F-to-L mapping, damit findet man ja auch gleich das # in L, I oder Scannen ist dort also auch nicht nötig.
--
DavidWeese - 04 May 2010:
F: Umgekehrt, ist '#' nötig, nur um bei der Rücktransformation in keinen Loop zu kommen? Das kann ich auch
herausfinden, wenn ich weiß, dass der Eintrag F[I] erreicht wurde.
A: Probiert mal die BWT von
abab ohne angehängtes # zu entpacken.
--
DavidWeese - 04 May 2010:
Assignments lecture 1
Slides of lecture 1 (13 May 2010 5:26p.m.)
Task 1: Compute the BWT for S=ACGCACGTACGG. APply the exact match algorithm of the lecture to find the number of occurrences of the Pattern P=ACG. In the lecture we will learn how to also find the positions in the text and extend the example.
Task 2: Compute all matches with exactly one mismatch of the Pattern P=AGG using the backtracking algorithm of the lecture. (It will be given on Wednesday, nevertheless prepare ALL for it).
Slides
Here are some additional comments to the slides:
--
DavidWeese - 06 May 2010, 11pm:
- S41: "abracadabra" has 11 suffixes, only after appending # it has 12
- S51: "characters in L occur in same order as in F (rank preserving)" is not what rank/order preserving means
- S57: "every column in matrix M is a permutation of string S", in fact it is a permutation of S# (according to your definition of S)
- S59: "T[i] is position of the character F[i] in L" - No. F[2]=F[3]=i. So I would expect T[2]=T[3] after reading this sentence.
- S67: "that connect"
- S69: "the number of P in T" → "the number of occurrences of P in T"
- S77: "pos(i) = C[L[i]] + Occ(L[i], i)" you use pos( ) with a different definition later
- S78: "All suffixes prefixed by P occupies in a set of rows" should be "All suffixes prefixed by P occur in a consecutive set of rows"
- S98: "It’s possible to find row(i) indirectly" - what is row(i), define or omit
- I don't understand slides 98-99:
- I thought pos(j) is unknown or do you describe the preprocessing alg. that marks positions in T?
- successor©?
- "Determine L[j]: Compare Occ(c, j) with Occ(c,j-i) for every c ∈ Σ ∪ {#}" - this is only necessary if L is compressed
- "pos(i)=C[L[j]] + Occ(L[j], i)" seems wrong to me, "C[L[j]] + Occ(L[j], i)"" is the result of the L→F mapping of i
- Backward_step(x) is simply the result of the L→F mapping of x, it should be explicitly pointed out
- S101: "use Backward_step(i)" and "repeat t times" could be misleading, as you set "j=i" and actually repeat "j=Backward_step(j)" t times
- (hint) you could introduce with "pos(Backward_step(i)) = pos(i) - 1"
- (minor) S103: "Last = 10": You want to calculate pos(i) for i=9,..,10 so "i=10" would be more intuitive here
- (minor) S105: same as above, "First = 9" is correct, but I would prefer "i=9"
- S105: "Finding ssi"
- S107: Shouldn't the red arrow point to [s|7]?
- S111: Why Backward_step(4)? It should be Backward_step(11)=4.
- S119: The ↑ arrows are maybe unusual, you could use "The higher the …, the higher the ..." or "more … results in ..."
BLASSE: Folien wurden überarbeitet
Script
--
DavidWeese - 19 May 2010
My comments to revision 6251:
Definitionen
- p10: "Definition" Definitions
- p10: "lexicographically sorted" → "ordered"
- p10: "The character P[i] is the..." → "P[i] denotes the ..."
- p10: You not only need prefixes, "P[i,j] is a pattern of symbols from the i-th to the j-th position in P"
- p11: Introduce $ first and for both suffix array, see the SA definition in previous P4 lectures
BWT
- p12: "an special"
- p12: you already introduced $ as a character smaller than all characters of Σ, why introducing # with the same properties?
- p12: "M whose rows are cyclic shifts of S" - either of S$ or of S' with S'=S$
- p12: "column … are ..."
- p12: "position of character c in string X" - what is the position of character c=
s
? There are 4 occurrences of s
in L and F. You have to bring characters of the same origin in T into relation. Either use cyclic shifts or the already mentioned "predecessor of" relation.
- p12: What if c1=
m
and c2= $
?
- p12: "of given string"
- p12: Use the latex theorem block for the "rank preserving property" and the latex proof for its proof.
- p12: "There are regions in F with c" - inexact, they occur in one substring/block
- p13: If I=5 your row counting seems to begin with 0, you should say that
- p14: "c1andc2" and wrong formulation of rank preserving property
- p14: "Advantages of BWT" - advantages are relative, compared to what? Better use "Properties..."
- p14: the first paragraph of 6.2 could be shortened
- p14: "it is reversible algorithm"
- p14: cite "Move to Front algorithm", "Run-Length encoding"
- p15: "..algorithm is the string ..."
- p15: instead of writing "Σ ∪ special character" multiple times introduce Σ'=Σ ∪ {$}
- p15: "character in L corresponding to character F[i]" - the same here, corresponding is undefined, first define corresponding as stemming from the same text position
- p15: How do you get the row beginning with $ and ending with m (this is definitely not row I)? In this example its coincidence that L[I=5] = m, because your string begins with 1 and I is counted from 0.
- p15: "successor character of S" → "in S"
- p15: "the i character"? better: "The character corresponding to T[i] in L …."
- p16: "the rank of c" is inexact. It can only be used to relate the corresponding cyclic rotations ending with the same character. Leave it out.
- p16: "f you do not use" → "Without"
- p16: "If you construct.. you can use" - colloquial, omit "you", better: "The BWT can be used..."
kleine Änderungen beim L-to-F mapping,
- p17: "to to"
- p17: (minor) "C[L[i]] is the number of lexicographically smaller characters before the rst L[i] in F occurs." shorter: "C[L[i]] is the position before the first L[i] in F."
- p17: ".. calculate the first possible position … add the number of L[i]s which occurred before the L[i]" not exactly how defined it. Its the position before the first L[i] and you add j if L[i] is the j-th occurrence in L.
Exactmatch, Backward_search, Anfang von Backward_search]
- p18: "Bowtie used" You haven't introduced Bowtie. Remove or rephrase.
- p18: "Here you have the challenge" sounds strange
- p18: "a index.."
- p18: the most significant advantage is: Exactmatch is O(|p|) (without compression of Occ)
- p18: "For the performance it just needs" - What performance?
- p18: "occupies in"
- p18: "Because of the rank preserving property it is possible to calculate the first and last position of this set." Why that?
- p19: Please consider rewriting. Most of the sentences are hard to understand. Please avoid "you" (if there's no other way use "we").
- Try to precisely formulate each sentence in german first then translate
- Read the sentence multiple times and check for exactness and unambiguity.
- Let your group mates review the text.
- p19: You could move 4) between 1) and 2) and use j instead of j-1 in 2). Additionally you need 5) "Repeat 2)-4) until j=1"
- p20: The output is actually only First, Last (not number of occ.)
- p20: "Initialisierung"
- p20: "return no rows..." better: "output/print "no rows..."" and always return <First,Last>
- p20: "The … Algorithm runs..." → "algorithm"
- p20: "Because of space limits it isn't possible…." What limits, who sets them? Better: "To reduce the space consumption it is possible to..."
- p20: "isn't" → "is not"
- p20: what is the position of a row in the T? You actually mean the originating position of F[i] the first character of row i.
- p21: "pos(1)-1" → "pos_T(1)-1"
- p21: "Determine L[j]" again, Occ not L is compressed
- p21: "Extension of the BWT" section empty
- p21: Backtracking should be explained
- …
To all sections
- p*: check and replace all ?? and cite all non-standard terms/algorithms
Lecture 2
REINERT: Dann haben wir zwei Vorlesungen über read mapping. Ist das gewünscht? (siehe auch unten)
(Hannes, Sebastian)
Slides
--
DavidWeese - 25 May 2010
My comments to revision 6323:
- please turn on page numbers in the slides (easier to refer to)
- s13: "exponential runtime increasing" → "exponential incease in running time"
- s14: "easier" is not correct, it should be faster. A filter only makes sense if the work load of subsequent steps and also the overall running time is reduced.
- s15: "parts … does"
- s15: "contain possible", "be match" - quantifiers missing
- s16: "no guaranty to lose any significant match" - so the filter is/can be lossy? what is significant?
- s19: "fast" is relative. faster than what?
- s19: "improve" - compared to what?
- s21: "at least one set has no object" - you should say that each object is contained in one set
- s25: one could think you divide the pattern into pieces p1,...,pn and the text into t1,...,tx. Omit p1,...,pn for single characters, this is misleading. Better use |P|=m or "... of length m"
- s25: "one of the pattern pieces has to match without error" - Match with what and why? You didn't mention that there is an occurrence with <=k errors. And maybe it is easier to start with something like: T differs from P with at most k errors.
- s26: you need the generic pigeonhole lemma on s41 the first time. Try to rearrange
- s29: is very similar to s25
- s30: "p_i[start : end]" what's that? What is start and end? You use i_end and i_start later. I suggested to use positions s_i and e_i for pattern piece p_i
- s30: "matches at position t[...]" - t[...] is not a position, it's a substring
- s32: "before t_j" → t[j] or better: position j in t
- s35: "text area" → "text area is"
- s36: you only have one text, "text1 (t1)" → "text"
- s40: "many unnecessary verification" and "many verifications are unsuccessful" - is somehow the partially same, as they are unnecessary if they are unsuccessful.
- s42: my pdf viewer (Mac, TexShop) doesn't display the arrows/lines(?) between the tree nodes.
- s48: How do you partition the error rate in the unbalanced case
- s50: Does Quasar really compute approx. local matches (including BLAST) or only potential approximate matches (w/o BLAST, I'm not sure)
- s51: "with window length" undefined. Better: |Q'|=w (or |D'| depending on your algorithm)
- s56: "devide"
- s56: So if only one read reaches the threshold of a block a you have to verify all reads in this block?
- s60: Maybe to much text on this slide.
- s64: "higer"
- s69: You should consider skipping Minimum Coverage to spare some time
- s69: increases the filter specificity
Script
--
DavidWeese - 02 Jun 2010:
Revision 6452:
Comments to the proof on p39:
- (1) should be omitted, is given in the assumption.
- Use more connecting words.
- before (2): What is k_i? "Let k_i be the number of errors in p^i. Then the following holds: ..."
- "order to show … assumption" can be written as "We assume"
- The proof is wrong, as you argue: "Assume the Lemma is wrong. This is a contradiction to the Lemma. So it holds.".
- (4) is wrong. The indirect assumption is "For all i=1,..,j holds k_i>floor((a_i*k)/A)." The k_i are integers so for all i holds: k_i >= floor(a_i*k/A) + 1 > a_i*k/A. Thus, sum over k_i>=sum over a_i*k/A. → k > k (contradiction)
Lecture 3
REINERT: Passender wäre vielleicht die Arbeiten von Tammi und Kececioglu zusammen (wie im repeat resolution script)
(
TopHat und spliced mapping könnte man mit filter prinzipien in lecture 2 machen?)
- Splicing:
- Multireads:
- Algorithmus von Kececioglu
(Duc, Nicolas und Corinna)