[Bioperl-l] Getting nucleotide seq from protein accession

Mon Jun 27 12:16:28 EDT 2005

Hi Stefan,
I've devised a far simpler way of getting what I want using a perl 
module called: WWW::Mechanize.
It simply gets the page source of a url which you can then parse the 
appropriate links to get the nucleotide accession number.
Thanks again,
Kat

use 
WWW::Mechanize;                                                                                                                           

my $mech = 
WWW::Mechanize->new();                                                                                                                            

# THIS GETS THE HTML CONTENTS OF A WEBSITE

my $url = 
"http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=13488063";
$mech->get( $url 
);                                                                                                                             

print $mech->content();

Stefan Kirov wrote:

> Kat,
> In my experience works OK:
> 60 mendel /home/sao> perl test.pl
> seq is  NP_106261
> Feature source starts 1 ends 243 strand 1
> Feature count is 0
> count is 1
> GETS TO HERE WITH NO KEYS 3
> DOESN'T GET TO HERE
> Annotation db_xref stringified value Value: taxon:266835
> DOESN'T GET TO HERE
> Annotation strain stringified value Value: MAFF303099
> DOESN'T GET TO HERE
> Annotation organism stringified value Value: Mesorhizobium loti 
> MAFF303099
> Feature Protein starts 1 ends 243 strand 1
> Feature count is 0
> count is 2
> GETS TO HERE WITH NO KEYS 1
> DOESN'T GET TO HERE
> Annotation product stringified value Value: transcriptional regulator
> Feature CDS starts 1 ends 243 strand 1
> Feature count is 0
> count is 3
> GETS TO HERE WITH NO KEYS 4
> DOESN'T GET TO HERE
> Annotation locus_tag stringified value Value: mlr5637
> DOESN'T GET TO HERE
> Annotation db_xref stringified value Value: GeneID:1228922
> DOESN'T GET TO HERE
> Annotation coded_by stringified value Value: NC_002678.2:4529413..4530144
> DOESN'T GET TO HERE
> Annotation transl_table stringified value Value: 11
>
> What you need is
> Annotation coded_by stringified value Value: NC_002678.2:4529413..4530144
> What bioperl version are you using?
>
> Stefan
>
> Kat Hull wrote:
>
>> Hi Stefan,
>> I'm still having problems!
>> I get some features from the sequence object but can't get past this 
>> part in the code:
>>
>>  $ac=$feat->annotation();   # This is not returning anything.
>>
>> Could you look at the code to tell me where its going wrong?
>>
>> Thanks again,
>> #________________________________________
>> #!/biol/programs/perl580/bin/perl
>>                                                                                              
>>
>> use Bio::Annotation::Collection;
>> use Bio::DB::GenPept;
>> $gb = new Bio::DB::GenPept;
>>                                                                                              
>>
>>                                    my $seqio = 
>> $gb->get_Stream_by_id(['13474692']);
>> my $ac;
>> my $c;
>>                                                                                              
>>
>>  while( my $seq = $seqio->next_seq ) {
>>      print "seq is  ", $seq->display_id, "\n";
>>      my @f=$seq->get_all_SeqFeatures;     #This gives you the 
>> annotation of the retrieved sequence object
>>                                                                                              
>>
>>      foreach my $feat (@f) {
>>        ++$c;
>>       print "Feature ",$feat->primary_tag," starts ",$feat->start," 
>> ends ",
>>       $feat->end," strand ",$feat->strand,"\n";
>>                                                                                              
>>
>>       # features retain link to underlying sequence object
>>       #print "Feature sequence is ",$feat->seq->seq(),"\n";
>>                                                                                              
>>
>>        my $t = $feat->feature_count;
>>        print "Feature count is $t\n";
>>        print "count is $c\n";
>>                                                                                              
>>
>>        $ac=$feat->annotation();  #  PROBLEM SEEMS TO BE HERE
>>                                                                                              
>>
>>        my $blah=$ac->get_all_annotation_keys();
>>        print "GETS TO HERE WITH NO KEYS $blah\n";
>>                                                                                              
>>
>>           foreach $key ( $ac->get_all_annotation_keys() ) {
>>                print "DOESN'T GET TO HERE\n";
>>               @values = $ac->get_Annotations($key);
>>               foreach $value ( @values ) {
>>                # value is an Bio::AnnotationI, and defines a 
>> "as_text" method
>>                print "Annotation ",$key," stringified value 
>> ",$value->as_text,"\n";
>>                                                                                              
>>
>>                # also defined hash_tree method, which allows data 
>> orientated
>>                # access into this object
>>                $hash = $value->hash_tree();
>>               }
>>           }
>>                                                                                              
>>
>> # commented out for now          
>>                                                                                  
>>
>>   #    next unless ($ac->get_Annotations('coded_by'));
>>    #   my @coded=$ac->get_Annotations('coded_by');
>>     #  foreach my $location (@coded) {
>>      #   print $location->value, " is the location that codes this 
>> protein\n";
>>     #  }
>>      }
>>  }
>>
>>
>>
>>
>>
>> Stefan Kirov wrote:
>>
>>>   my $seqio = $gb->get_Stream_by_id(['13474692']);
>>>   while( my $seq = $seqio->next_seq ) {
>>>       print "seq is  ", $seq->display_id, "\n";
>>>       my $ann=$seq->annotation; #This gives you the annotation of 
>>> the retrieved sequence object
>>>      foreach my $dblink ($ann->get_Annotations('DBLink')) {
>>>          if ($dblink->database =~/refseq/i) {
>>>             print $database->primary_id, " is the mRNA accession 
>>> number\n";
>>>          }
>>>       }
>>>   }
>>> However, the gene you are looking at is not associated with any NM_ 
>>> sequence, but rather comes from NC_. Therefore the above will not 
>>> work for you. You will have to descend through the sequence features 
>>> and find teh feature that says 'coded_by':
>>> use Bio::DB::GenPept;
>>> my $gb=new Bio::DB::GenPept;
>>>  my $seqio = $gb->get_Stream_by_id(['13474692']);
>>>   while( my $seq = $seqio->next_seq ) {
>>>       print "seq is  ", $seq->display_id, "\n";
>>>       my @f=$seq->get_SeqFeatures; #This gives you the annotation of 
>>> the retrieved sequence object
>>>      foreach my $feat (@f) {
>>>        my $ann=$feat->annotation;
>>>        next unless ($ann->get_Annotations('coded_by'));
>>>        my @coded=$ann->get_Annotations('coded_by');
>>>        foreach my $location (@coded) {
>>>          print $location->value, " is the location that codes this 
>>> protein\n";
>>>        }
>>>       }
>>>   }
>>> No guarantees the code is typo free :-)
>>> Stefan
>>>
>>> Kat Hull wrote:
>>>
>>>> Hi Stefan,
>>>> Thanks for your advice but i'm still struggling!   I have used 
>>>> Bio::DB::GenPept to get the protein accession number given the 
>>>> protein gi number.  However, I don't understand how 
>>>> Bio::Annotation::DBLink works.  Does it fetch the url of a link on 
>>>> the web-site?   Basically, if I could use this (or something else) 
>>>> to get the url of the CDS link for my protein of interest, I can 
>>>> get the corresponding nucleotide accession from this, as it is 
>>>> encoded in the url.
>>>> Do you know how to use this module?  Is this what you were 
>>>> suggesting I try yesterday (I didn't really understand what you 
>>>> were getting at).
>>>> Many thanks,
>>>>
>>>> Kat
>>>>
>>>> ps. Here's where i'm at so far:
>>>>                                                                                                                                                                                    
>>>>
>>>> use Bio::Annotation::DBLink;
>>>> use Bio::DB::GenBank;
>>>> use Bio::DB::GenPept;
>>>>   $gb = new Bio::DB::GenPept;
>>>>                                                                                                                                                                                                                                                                                                                                                                                                                                                  
>>>>
>>>>    # given the gi number, this returns the accession
>>>>    my $seqio = $gb->get_Stream_by_id(['13474692']);
>>>>    while( my $seq = $seqio->next_seq ) {
>>>>        print "seq is  ", $seq->display_id, "\n";
>>>>    }
>>>>     # not sure what i'm doing here 
>>>>                                                                                                                                                                                                                                                                                                                                                               
>>>>
>>>>   $link2 = new Bio::Annotation::DBLink();
>>>>   $link2->database('dbSNP');
>>>>   $link2->primary_id('2367');
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Stefan Kirov wrote:
>>>>
>>>>> Kat,
>>>>> If you are familiar with Bioperl it is kind of easy-
>>>>> look at Bio::DB::GenPept (I suppose you use GenPept/GenBank?) on 
>>>>> how to get the protein record
>>>>> Go through the dblinks and find the appropriate accession number 
>>>>> (where the database method returns  GenBank).
>>>>> Then retrieve this accession number(s) through Bio::DB::GenBank. 
>>>>> If you are not familiar with Bioperl- read the docs for 
>>>>> Bio::DB::GenPept, Bio::DB::GenBank, Bio::Annotation and 
>>>>> Bio::Annotation::DBLink).
>>>>> Hope this helps,
>>>>> Stefan
>>>>>
>>>>> Kat Hull wrote:
>>>>>
>>>>>> Hi there,
>>>>>> I was wondering whether anyone has a solution to my problem. I 
>>>>>> have a list of protein assession numbers and want to retrieve the 
>>>>>> corresponding nucleotide sequences automatically.  I thought it 
>>>>>> would be possible to do this by changing the NCBI url, but this 
>>>>>> doesn't seem to be the case.
>>>>>> Is there a bio-perl module that can do this?
>>>>>>
>>>>>> Kind regards,
>>>>>> Kat
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at portal.open-bio.org
>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Stefan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>