[BioPython] NCBI: from protein to CDS

Fri Apr 11 13:40:43 EDT 2003

Ha!  It flushed Andrew out of the woodwork.  This week's award for most  
valuable python contributor goes to Jason!  :)

Jeff

On Thursday, April 10, 2003, at 07:18  PM, Iddo Friedberg wrote:

>
> The truth Jason: Jeff Chang deviously put you up to this to get me to  
> write the appropriate code for Biopython, right?
>
> Thanks! I'll use it!
>
> Best,
>
>
> Iddo
>
>
>
> Jason Stajich wrote:
>> Sorry had to do it =)  apologies for the perl code on a python list...
>> Not totally trivial but it's doable in bioperl.
>> <caveats>
>> Note don't use this for lots of request - NCBI will shut you off.   
>> Better
>> to download genbank local, yadda yadda if you are going to do multiple
>> genomes.  Follow NCBI's directions on limiting queries (we institute a
>> 2sec delay on our queries).
>> </caveats>
>> (download/install bioperl-1.2.1 and IO::String)
>> #!/usr/bin/perl -w
>> use strict;
>> use Bio::DB::GenBank;
>> use Bio::DB::GenPept;
>> use Bio::DB::FileCache;
>> use Bio::Factory::FTLocationFactory;
>> use Bio::SeqFeature::Generic;
>> my $ntdb = new Bio::DB::GenBank;
>> my $pepdb= new Bio::DB::GenPept;
>> # do some caching in the event you're pulling up the same
>> # chromosome and/or you are debugging
>> my $cachent = new Bio::DB::FileCache(-kept => 1,
>> 				     -file => '/tmp/cache/nt.idx',
>> 				     -seqdb => $ntdb);
>> my $cachepep = new Bio::DB::FileCache(-kept => 1,
>> 				      -file => '/tmp/cache/pep.idx',
>> 				      -seqdb => $pepdb);
>> # obj to turn strings into Bio::Location object
>> my $locfactory = new Bio::Factory::FTLocationFactory;
>> # you might get these from a file (and they can be accessions too)
>> my @protgis = (10956263);
>> foreach my $gi ( @protgis ) {
>>   my $protseq = $cachepep->get_Seq_by_id($gi);
>>   if( ! $protseq ) { print STDERR "could not find a seq for gi:$gi\n";
>> 		     next;
>> 	           }
>>   foreach my $cds (  grep { $_->primary_tag eq 'CDS' }
>>                           $protseq->get_SeqFeatures() )
>>   {
>>      next unless( $cds->has_tag('coded_by') ); # skip CDSes with no  
>> coded_by
>>      my ($codedby) = $cds->each_tag_value('coded_by');
>>      my ($ntacc,$loc) = split(/\:/, $codedby);
>>      $ntacc =~ s/(\.\d+)//; # genbank wants an accession not a  
>> versioned one
>>      my $cdslocation = $locfactory->from_string($loc);
>>      my $cdsfeature = new Bio::SeqFeature::Generic(-location =>  
>> $cdslocation);
>>      my $ntseq = $cachent->get_Seq_by_acc($ntacc);
>>      next unless $ntseq;
>>      $ntseq->add_SeqFeature($cdsfeature); # locate the feature on a  
>> seq
>>      my $cdsseq = $cdsfeature->spliced_seq();
>>      print "cds seq is ", $cdsseq->seq(), "\n";
>>  }
>> }
>> On Thu, 10 Apr 2003, Iddo Friedberg wrote:
>>> Sorry, slight mistake in my question.
>>>
>>> The protein sequence is retrieved using the gi-10956263
>>>
>>> Then the CDS link hold the following:
>>>
>>> val=10956247  (which is the nucleotide GI)
>>> itemID=98 (which tells us which bit of the nucleotide sequence  
>>> actually
>>> codes for this protein).
>>>
>>> Best,
>>>
>>> Iddo
>>>
>>> Iddo Friedberg wrote:
>>>
>>>> Hi,
>>>>
>>>> Can anyone suggest a painless way to retrieve a coding sequence  
>>>> using a
>>>> protein gi number via NCBI? manually that would mean to:
>>>>
>>>> 1) go to the protein page using the protein gi.
>>>> 2) Click on the CDS
>>>> 3) Get the DNA sequence coding to it.
>>>>
>>>> Here why this is a bummer:
>>>>
>>>> Using the gi-10956247 a protein record is retrieved from NCBI. The  
>>>> URL
>>>> for the CDS looks like :
>>>>
>>>> "http://www.ncbi.nlm.nih.gov/entrez/ 
>>>> viewer.fcgi?val=10956247&itemID=98&view=gbwithparts"
>>>>
>>>>
>>>> the "val" field holds the same gi number, but the itemID field  
>>>> varies
>>>> (that depends on the coding location in the full genome record, in  
>>>> this
>>>> case). So I cannot use the protein gi to go and retrieve its coding
>>>> sequence.
>>>>
>>>> I *can* write something to read the URL itemID field value, I just
>>>> thought there might be a more elegant way. Maybe even using  
>>>> "legitimate"
>>>> NCBI mechanisms...
>>>>
>>>> Thanks,
>>>>
>>>> Iddo
>>>>
>>>>
>>>>
>>>
>>>
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
>> _______________________________________________
>> BioPython mailing list  -  BioPython at biopython.org
>> http://biopython.org/mailman/listinfo/biopython
>
> -- 
> Iddo Friedberg, Ph.D.
> The Burnham Institute
> 10901 N. Torrey Pines Rd.
> La Jolla, CA 92037
> USA
> Tel: +1 (858) 646 3100 x3516
> Fax: +1 (858) 646 3171
> http://bioinformatics.ljcrf.edu/~iddo
>
> _______________________________________________
> BioPython mailing list  -  BioPython at biopython.org
> http://biopython.org/mailman/listinfo/biopython