[BioPython] NCBI: from protein to CDS
Jeffrey Chang
jchang at jeffchang.com
Fri Apr 11 13:40:43 EDT 2003
Ha! It flushed Andrew out of the woodwork. This week's award for most
valuable python contributor goes to Jason! :)
Jeff
On Thursday, April 10, 2003, at 07:18 PM, Iddo Friedberg wrote:
>
> The truth Jason: Jeff Chang deviously put you up to this to get me to
> write the appropriate code for Biopython, right?
>
> Thanks! I'll use it!
>
> Best,
>
>
> Iddo
>
>
>
> Jason Stajich wrote:
>> Sorry had to do it =) apologies for the perl code on a python list...
>> Not totally trivial but it's doable in bioperl.
>> <caveats>
>> Note don't use this for lots of request - NCBI will shut you off.
>> Better
>> to download genbank local, yadda yadda if you are going to do multiple
>> genomes. Follow NCBI's directions on limiting queries (we institute a
>> 2sec delay on our queries).
>> </caveats>
>> (download/install bioperl-1.2.1 and IO::String)
>> #!/usr/bin/perl -w
>> use strict;
>> use Bio::DB::GenBank;
>> use Bio::DB::GenPept;
>> use Bio::DB::FileCache;
>> use Bio::Factory::FTLocationFactory;
>> use Bio::SeqFeature::Generic;
>> my $ntdb = new Bio::DB::GenBank;
>> my $pepdb= new Bio::DB::GenPept;
>> # do some caching in the event you're pulling up the same
>> # chromosome and/or you are debugging
>> my $cachent = new Bio::DB::FileCache(-kept => 1,
>> -file => '/tmp/cache/nt.idx',
>> -seqdb => $ntdb);
>> my $cachepep = new Bio::DB::FileCache(-kept => 1,
>> -file => '/tmp/cache/pep.idx',
>> -seqdb => $pepdb);
>> # obj to turn strings into Bio::Location object
>> my $locfactory = new Bio::Factory::FTLocationFactory;
>> # you might get these from a file (and they can be accessions too)
>> my @protgis = (10956263);
>> foreach my $gi ( @protgis ) {
>> my $protseq = $cachepep->get_Seq_by_id($gi);
>> if( ! $protseq ) { print STDERR "could not find a seq for gi:$gi\n";
>> next;
>> }
>> foreach my $cds ( grep { $_->primary_tag eq 'CDS' }
>> $protseq->get_SeqFeatures() )
>> {
>> next unless( $cds->has_tag('coded_by') ); # skip CDSes with no
>> coded_by
>> my ($codedby) = $cds->each_tag_value('coded_by');
>> my ($ntacc,$loc) = split(/\:/, $codedby);
>> $ntacc =~ s/(\.\d+)//; # genbank wants an accession not a
>> versioned one
>> my $cdslocation = $locfactory->from_string($loc);
>> my $cdsfeature = new Bio::SeqFeature::Generic(-location =>
>> $cdslocation);
>> my $ntseq = $cachent->get_Seq_by_acc($ntacc);
>> next unless $ntseq;
>> $ntseq->add_SeqFeature($cdsfeature); # locate the feature on a
>> seq
>> my $cdsseq = $cdsfeature->spliced_seq();
>> print "cds seq is ", $cdsseq->seq(), "\n";
>> }
>> }
>> On Thu, 10 Apr 2003, Iddo Friedberg wrote:
>>> Sorry, slight mistake in my question.
>>>
>>> The protein sequence is retrieved using the gi-10956263
>>>
>>> Then the CDS link hold the following:
>>>
>>> val=10956247 (which is the nucleotide GI)
>>> itemID=98 (which tells us which bit of the nucleotide sequence
>>> actually
>>> codes for this protein).
>>>
>>> Best,
>>>
>>> Iddo
>>>
>>> Iddo Friedberg wrote:
>>>
>>>> Hi,
>>>>
>>>> Can anyone suggest a painless way to retrieve a coding sequence
>>>> using a
>>>> protein gi number via NCBI? manually that would mean to:
>>>>
>>>> 1) go to the protein page using the protein gi.
>>>> 2) Click on the CDS
>>>> 3) Get the DNA sequence coding to it.
>>>>
>>>> Here why this is a bummer:
>>>>
>>>> Using the gi-10956247 a protein record is retrieved from NCBI. The
>>>> URL
>>>> for the CDS looks like :
>>>>
>>>> "http://www.ncbi.nlm.nih.gov/entrez/
>>>> viewer.fcgi?val=10956247&itemID=98&view=gbwithparts"
>>>>
>>>>
>>>> the "val" field holds the same gi number, but the itemID field
>>>> varies
>>>> (that depends on the coding location in the full genome record, in
>>>> this
>>>> case). So I cannot use the protein gi to go and retrieve its coding
>>>> sequence.
>>>>
>>>> I *can* write something to read the URL itemID field value, I just
>>>> thought there might be a more elegant way. Maybe even using
>>>> "legitimate"
>>>> NCBI mechanisms...
>>>>
>>>> Thanks,
>>>>
>>>> Iddo
>>>>
>>>>
>>>>
>>>
>>>
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
>> _______________________________________________
>> BioPython mailing list - BioPython at biopython.org
>> http://biopython.org/mailman/listinfo/biopython
>
> --
> Iddo Friedberg, Ph.D.
> The Burnham Institute
> 10901 N. Torrey Pines Rd.
> La Jolla, CA 92037
> USA
> Tel: +1 (858) 646 3100 x3516
> Fax: +1 (858) 646 3171
> http://bioinformatics.ljcrf.edu/~iddo
>
> _______________________________________________
> BioPython mailing list - BioPython at biopython.org
> http://biopython.org/mailman/listinfo/biopython
More information about the BioPython
mailing list