Bowtie is the obvious choice as a mapper, since it also maps spliced reads, is fast, and was the topic of our lecture.
REINERT: This is also a good project. With the short reads you will of course have to account for the negative effects (lower specificity).
Progress plan 1:
Step | Task | Status | Notes
1. | download data | done | from http://bowtie-bio.sourceforge.net/index.shtml and ftp://ftp.ncbi.nlm.nih.gov/sra/Studies/SRP000/SRP000910/SRX005924/
1.1 | re-download data | done | downloaded the complete human mRNA sequences; parsed with a Perl script
2. | install bowtie | done | from binary, make
3. | run bowtie on data | done | test: ./bowtie hg18 reads/SRR017933_head80.fastq ::SUCCESS::
3.1 | move project to server | done | too much data, and too slow on the laptop
3.2 | check possible bowtie parameters | done | -v 0-3 (mismatches allowed), --best, -p 4 (multithreaded), -t, --refout, --suppress 6
4. | modify data to my needs | done | vary the read length; chosen lengths: 10, 15, 20, 25, 30, 40, …, 100, 120, …, 200, 800. bowtie does not accept very long reads (1024 bp max). See the trimming sketch below this table.
5. | statistical analysis | done | distribution of reads that could be mapped, over the different read lengths. See histogram and the counting sketch below this table.
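A minimal Python sketch of the read-trimming step (step 4). It is not the original parsing script; the input file name is a placeholder, and the length list is only an illustrative subset of the lengths named in the plan.

# Minimal sketch of step 4, assuming plain four-line FASTQ input.
# Input path and length list are illustrative placeholders.
LENGTHS = [10, 15, 20, 25, 30, 40, 100, 120, 200]  # subset of the lengths named in the plan

def fastq_records(path):
    # yield (header, sequence, plus line, quality) tuples
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            plus = fh.readline().rstrip()
            qual = fh.readline().rstrip()
            yield header, seq, plus, qual

def trim_to_length(in_path, out_path, length):
    # keep only reads that are long enough, truncated to the target length
    with open(out_path, "w") as out:
        for header, seq, plus, qual in fastq_records(in_path):
            if len(seq) >= length:
                out.write(f"{header}\n{seq[:length]}\n{plus}\n{qual[:length]}\n")

if __name__ == "__main__":
    for n in LENGTHS:
        trim_to_length("reads/SRR017933.fastq", f"reads/reads_{n}.fastq", n)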
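The per-length mapping statistic for step 5 can be counted as sketched below. This assumes the runs used --best without -a/-k, so bowtie's default output contains one line per mapped read; the file names are placeholders matching the trimming sketch above.

# Sketch of the per-length mapping-rate count behind the histogram (step 5).
# Assumes one alignment line per mapped read in bowtie's default output.
def count_lines(path):
    with open(path) as fh:
        return sum(1 for _ in fh)

if __name__ == "__main__":
    for n in [10, 15, 20, 25, 30, 40]:                        # subset of the tested lengths
        total = count_lines(f"reads/reads_{n}.fastq") // 4    # four lines per FASTQ record
        mapped = count_lines(f"OUT_bowtie_reads_{n}.txt")     # one line per mapped read
        print(f"{n} bp: {mapped}/{total} mapped ({100.0 * mapped / total:.1f}%)")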
Progress plan 2:
Step | Task | Status | Notes
1. | download data | done | from ftp://ftp.sanger.ac.uk/pub/gencode/rgasp/inputdata_2/
1.1 | get the needed (new) read file | done | H_sapiens-HepG2_10879_311B7AAXX_5_1.fastq
2. | use more reads | done | for coverage > 3: #reads = 3 * #basesInmRNA / readLength; 524475 reads are now being processed (21 reads per gene). See the coverage sketch below this table.
3.a | make different sizes | done | 20-50 bp in 2 bp steps; parsed with a Python script
3.b | same for the previous reads | done | 20-50 bp in 2 bp steps; parsed with a Perl script
4. | run bowtie on this new data | done | balcazar@tiaotiao:~/P4/bowtie-0.12.5> for i in `cat list.txt`; do ./bowtie -v 2 --best -p 4 -t --suppress 6 hg18 /project/MID1_complex/P4/reads2050/reads_$i.fastq OUT_bowtie_reads_$i.txt; done
5. | statistical analysis | open | distribution of reads that could be mapped, over the different read lengths; plot both curves (real and artificial data sets) in one histogram. See the plotting sketch below this table.
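A worked version of the coverage formula from step 2 (#reads = coverage * #basesInmRNA / readLength). Only the formula comes from the plan; the total-bases value below is a made-up placeholder to illustrate the arithmetic.

# Worked version of the coverage formula from step 2:
#   #reads = coverage * #basesInmRNA / readLength
# TOTAL_MRNA_BASES is a hypothetical placeholder; only the formula is from the plan.
def reads_for_coverage(total_mrna_bases, read_length, coverage=3):
    return coverage * total_mrna_bases // read_length

if __name__ == "__main__":
    TOTAL_MRNA_BASES = 100_000_000   # placeholder total length of all mRNA sequences
    for read_length in range(20, 52, 2):
        print(read_length, reads_for_coverage(TOTAL_MRNA_BASES, read_length))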
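A sketch of the planned comparison plot for step 5, drawing the mapping-rate curves of the real (HepG2) and artificial (mRNA-derived) read sets in one figure. The counting follows the sketch after plan 1; the file name patterns for the two data sets are assumptions.

# Sketch of the step-5 comparison plot: mapping rate vs. read length for the
# real (HepG2) and artificial (mRNA-derived) read sets in one figure.
# File name patterns are assumptions; counting follows the sketch after plan 1.
import matplotlib.pyplot as plt

LENGTHS = list(range(20, 52, 2))   # 20-50 bp in 2 bp steps

def count_lines(path):
    with open(path) as fh:
        return sum(1 for _ in fh)

def mapping_rates(fastq_pattern, bowtie_pattern):
    # fraction of mapped reads for every tested read length
    rates = []
    for n in LENGTHS:
        total = count_lines(fastq_pattern.format(n)) // 4
        mapped = count_lines(bowtie_pattern.format(n))
        rates.append(mapped / total)
    return rates

real = mapping_rates("reads2050/reads_{}.fastq", "OUT_bowtie_reads_{}.txt")
artificial = mapping_rates("reads_mrna/reads_{}.fastq", "OUT_bowtie_mrna_{}.txt")

plt.plot(LENGTHS, real, marker="o", label="real reads (HepG2)")
plt.plot(LENGTHS, artificial, marker="s", label="artificial reads (mRNA)")
plt.xlabel("read length [bp]")
plt.ylabel("fraction of reads mapped")
plt.legend()
plt.savefig("mapping_rate_vs_length.png")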