Bowtie is the obvious choice as a mapper, since it also maps spliced reads, is fast, and was the topic of our lecture.
REINERT: This is also a good project. With the short reads you will of course have to account for the negative effects (lower specificity).
Progress plan 1:
Step | Task | Status | Notes
1. | download data | done | from http://bowtie-bio.sourceforge.net/index.shtml and ftp://ftp.ncbi.nlm.nih.gov/sra/Studies/SRP000/SRP000910/SRX005924/
1.1 | re-download data | done | downloaded the complete human mRNA sequences; parsed with a Perl script
2. | install bowtie | done | from binary, make
3. | run bowtie on data | done | test: ./bowtie hg18 reads/SRR017933_head80.fastq ::SUCCESS::
3.1 | move project to server | done | too much data, and too slow on the laptop
3.2 | check possible bowtie parameters | done | -v 0-3 (mismatches allowed), --best, -p 4 (multithreaded), -t, --refout, --suppress 6
4. | modify data to my needs | done | vary the read length; chosen lengths: 10, 15, 20, 25, 30, 40, …, 100, 120, …, 200, 800. bowtie does not accept very long reads (1024 bp max). See the trimming sketch below this table.
5. | statistical analysis | done | distribution of reads that could be mapped, over the different read lengths. See histogram and the counting sketch below this table.
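A minimal Python sketch of the read-trimming step (step 4). It is not the original parsing script; the input file name is a placeholder, and the length list is only an illustrative subset of the lengths named in the plan.

# Minimal sketch of step 4, assuming plain four-line FASTQ input.
# Input path and length list are illustrative placeholders.
LENGTHS = [10, 15, 20, 25, 30, 40, 100, 120, 200]  # subset of the lengths named in the plan

def fastq_records(path):
    # yield (header, sequence, plus line, quality) tuples
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            plus = fh.readline().rstrip()
            qual = fh.readline().rstrip()
            yield header, seq, plus, qual

def trim_to_length(in_path, out_path, length):
    # keep only reads that are long enough, truncated to the target length
    with open(out_path, "w") as out:
        for header, seq, plus, qual in fastq_records(in_path):
            if len(seq) >= length:
                out.write(f"{header}\n{seq[:length]}\n{plus}\n{qual[:length]}\n")

if __name__ == "__main__":
    for n in LENGTHS:
        trim_to_length("reads/SRR017933.fastq", f"reads/reads_{n}.fastq", n)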
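The per-length mapping statistic for step 5 can be counted as sketched below. This assumes the runs used --best without -a/-k, so bowtie's default output contains one line per mapped read; the file names are placeholders matching the trimming sketch above.

# Sketch of the per-length mapping-rate count behind the histogram (step 5).
# Assumes one alignment line per mapped read in bowtie's default output.
def count_lines(path):
    with open(path) as fh:
        return sum(1 for _ in fh)

if __name__ == "__main__":
    for n in [10, 15, 20, 25, 30, 40]:                        # subset of the tested lengths
        total = count_lines(f"reads/reads_{n}.fastq") // 4    # four lines per FASTQ record
        mapped = count_lines(f"OUT_bowtie_reads_{n}.txt")     # one line per mapped read
        print(f"{n} bp: {mapped}/{total} mapped ({100.0 * mapped / total:.1f}%)")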
Progress plan 2:
Step | Task | Status | Notes
1. | download data | done | from ftp://ftp.sanger.ac.uk/pub/gencode/rgasp/inputdata_2/
1.1 | get the needed (new) read file | done | H_sapiens-HepG2_10879_311B7AAXX_5_1.fastq
2. | use more reads | done | for coverage > 3: #reads = 3 * #basesInmRNA / readLength; 524475 reads are now being processed (21 reads per gene). See the coverage sketch below this table.
3.a | make different sizes | done | 20-50 bp in 2 bp steps; parsed with a Python script
3.b | same for the previous reads | done | 20-50 bp in 2 bp steps; parsed with a Perl script
4. | run bowtie on this new data | done | balcazar@tiaotiao:~/P4/bowtie-0.12.5> for i in `cat list.txt`; do ./bowtie -v 2 --best -p 4 -t --suppress 6 hg18 /project/MID1_complex/P4/reads2050/reads_$i.fastq OUT_bowtie_reads_$i.txt; done
5. | statistical analysis | open | distribution of reads that could be mapped, over the different read lengths; plot both curves (real and artificial data sets) in one histogram. See the plotting sketch below this table.
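A worked version of the coverage formula from step 2 (#reads = coverage * #basesInmRNA / readLength). Only the formula comes from the plan; the total-bases value below is a made-up placeholder to illustrate the arithmetic.

# Worked version of the coverage formula from step 2:
#   #reads = coverage * #basesInmRNA / readLength
# TOTAL_MRNA_BASES is a hypothetical placeholder; only the formula is from the plan.
def reads_for_coverage(total_mrna_bases, read_length, coverage=3):
    return coverage * total_mrna_bases // read_length

if __name__ == "__main__":
    TOTAL_MRNA_BASES = 100_000_000   # placeholder total length of all mRNA sequences
    for read_length in range(20, 52, 2):
        print(read_length, reads_for_coverage(TOTAL_MRNA_BASES, read_length))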
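A sketch of the planned comparison plot for step 5, drawing the mapping-rate curves of the real (HepG2) and artificial (mRNA-derived) read sets in one figure. The counting follows the sketch after plan 1; the file name patterns for the two data sets are assumptions.

# Sketch of the step-5 comparison plot: mapping rate vs. read length for the
# real (HepG2) and artificial (mRNA-derived) read sets in one figure.
# File name patterns are assumptions; counting follows the sketch after plan 1.
import matplotlib.pyplot as plt

LENGTHS = list(range(20, 52, 2))   # 20-50 bp in 2 bp steps

def count_lines(path):
    with open(path) as fh:
        return sum(1 for _ in fh)

def mapping_rates(fastq_pattern, bowtie_pattern):
    # fraction of mapped reads for every tested read length
    rates = []
    for n in LENGTHS:
        total = count_lines(fastq_pattern.format(n)) // 4
        mapped = count_lines(bowtie_pattern.format(n))
        rates.append(mapped / total)
    return rates

real = mapping_rates("reads2050/reads_{}.fastq", "OUT_bowtie_reads_{}.txt")
artificial = mapping_rates("reads_mrna/reads_{}.fastq", "OUT_bowtie_mrna_{}.txt")

plt.plot(LENGTHS, real, marker="o", label="real reads (HepG2)")
plt.plot(LENGTHS, artificial, marker="s", label="artificial reads (mRNA)")
plt.xlabel("read length [bp]")
plt.ylabel("fraction of reads mapped")
plt.legend()
plt.savefig("mapping_rate_vs_length.png")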