[Bioperl-l] Getting sequences by base pair locations
Sean Davis
sdavis2 at mail.nih.gov
Fri Jul 28 13:41:45 UTC 2006
Sendu Bala wrote:
> Yuval Itan wrote:
>
>>Hello all,
>>
>>I was BLATing a few hundred human genes against the chimp genome, and
>>kept the best chimp hits for every human gene.
>>I have the base pair start and end location for every chimp hit, and I
>>need to get the sequence for each of these chimp hits. Here is an
>>example for a few chimp hits bp locations:
>>
>>Start End
>>142854 144504
>>154479 155198
>>153066 167370
>>163146 163559
>>
>>I have one chimp genome file (about 3GB) including all chromosomes, but
>>I could also get one file per chromosome if that would make things
>>easier. Does anyone have a script or a link for an interface that can do
>>the job?
>
>
> If your genome file is in some standard format, use SeqIO.
> http://www.bioperl.org/wiki/HOWTO:SeqIO
>
> And then get the sequence corresponding to the correct chromosome and
> get the desired chunk with subseq();
> http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object
My guess is that Yuval will need random access to the sequences. With
SeqIO, this is possible but takes a relatively large amount of memory,
so Bio::DB::Fasta might be the better bet.
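
To make the random-access idea concrete, here is a minimal sketch in
plain Python (not BioPerl, and not a description of how Bio::DB::Fasta
works internally). It scans the FASTA file once to record where each
sequence starts on disk, then seeks straight to the requested bases
instead of loading a chromosome into memory. The function names
build_index and fetch are made up for illustration, and the sketch
assumes that every sequence line within a record has the same length
(except the last) and that there are no blank lines.

```python
import os
import tempfile


def build_index(path):
    """Scan the FASTA file once and record the on-disk layout of each record."""
    index = {}
    name = None
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Sequence ID = first word after '>'
                name = line[1:].split()[0].decode()
                # offset: byte position of the first base of this record
                index[name] = {"offset": fh.tell(), "bases": 0,
                               "width": 0, "length": 0}
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry["width"] == 0:
                    # bases per line, and bytes per line including the newline
                    entry["bases"], entry["width"] = bases, len(line)
                entry["length"] += bases
    return index


def fetch(path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) without reading the rest."""
    entry = index[name]
    if not 1 <= start <= end <= entry["length"]:
        raise ValueError("coordinates outside sequence")

    def byte_pos(base0):
        # Map a 0-based base position to its byte position in the file
        full_lines, rest = divmod(base0, entry["bases"])
        return entry["offset"] + full_lines * entry["width"] + rest

    first, last = byte_pos(start - 1), byte_pos(end - 1)
    with open(path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first + 1)
    # Drop the line breaks that fall inside the requested range
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    # Tiny made-up FASTA file, wrapped at 4 bases per line
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False,
                                     newline="\n") as tmp:
        tmp.write(">chr1 toy\nACGT\nACGT\nAC\n>chr2\nTTTT\nGGGG\nCC\n")
    idx = build_index(tmp.name)
    print(fetch(tmp.name, idx, "chr1", 3, 6))
    os.remove(tmp.name)
```

With a real genome file the call would look something like
fetch(genome_path, idx, "chr1", 142854, 144504), where the chromosome
name is whatever ID appears in the FASTA header for that hit. The
index only needs to be built once, so in practice you would want to
save it to disk rather than rescan a 3GB file on every run.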
Alternatively, make a custom track (see the documentation for doing so
at the UCSC genome browser site), upload it, and then getting the DNA
takes just a couple of mouse clicks. This method also lets you view the
data in genome coordinates and do intersections with known chimp genes,
so you could find hits that don't overlap known chimp genes, for
example.
Sean