[Bioperl-l] Getting sequences by base pair locations

Fri Jul 28 13:41:45 UTC 2006

Sendu Bala wrote:
> Yuval Itan wrote:
> 
>>Hello all,
>>
>>I was BLATing a few hundred human genes against the chimp genome, and 
>>kept the best chimp hits for every human gene.
>>I have the base pair start and end location for every chimp hit, and I 
>>need to get the sequence for each of these chimp hits. Here is an 
>>example for a few chimp hits bp locations:
>>
>>Start End
>>142854 144504
>>154479 155198
>>153066 167370
>>163146 163559
>>
>>I have one chimp genome file (about 3GB) including all chromosomes, but 
>>I could also get one file per chromosome if that would make things 
>>easier. Does anyone have a script or a link for an interface that can do 
>>the job?
> 
> 
> If your genome file is in some standard format, use SeqIO.
> http://www.bioperl.org/wiki/HOWTO:SeqIO
> 
> And then get the sequence corresponding to the correct chromosome and 
> get the desired chunk with subseq();
> http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object

My guess is that Yuval will need random access to the sequences.  With 
SeqIO, this is possible but takes a relatively large amount of memory, 
so Bio::DB::Fasta, which indexes the FASTA file and reads subsequences 
straight from disk, might be the better bet.
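
A minimal sketch of the Bio::DB::Fasta route follows.  It assumes things 
the original post does not state: that the genome is in FASTA format, 
that the sequence ID (here 'chr1') matches the first word of the FASTA 
header, and that the start/end values are 1-based and inclusive.  The 
helper name fetch_hits and the file name in the usage note are made up 
for illustration.

```perl
use strict;
use warnings;
use Bio::DB::Fasta;

# Hypothetical helper: return [start, end, sequence] for each hit.
# $fasta may be one FASTA file or a directory of per-chromosome files.
sub fetch_hits {
    my ($fasta, $chr, @hits) = @_;

    # The first call builds an index alongside the FASTA file;
    # later calls reuse it, so lookups do not rescan the genome.
    my $db = Bio::DB::Fasta->new($fasta);

    my @out;
    for my $hit (@hits) {
        my ($start, $end) = @$hit;

        # Assumption: coordinates are 1-based and inclusive.  If they come
        # straight from BLAT PSL output, the start is 0-based, so add 1.
        my $seq = $db->seq($chr, $start, $end);
        die "No sequence for $chr:$start-$end\n" unless defined $seq;

        push @out, [$start, $end, $seq];
    }
    return @out;
}
```

Something like the following would then print each hit in FASTA format 
(chimp.fa is a placeholder path):

    print ">chr1:$_->[0]-$_->[1]\n$_->[2]\n"
        for fetch_hits('chimp.fa', 'chr1', [142854, 144504], [154479, 155198]);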

Alternatively, make a custom track (see the documentation for doing so 
at the UCSC genome browser site) and upload it; getting the DNA is then 
trivial with just a couple of mouse clicks.  This method also lets you 
view the data in genome coordinates and do intersections with known 
chimp genes, so you could find hits that don't overlap known chimp 
genes, for example.

Sean