[Bioperl-l] Getting nucleotide seq from protein accession
Kat Hull
khh103 at york.ac.uk
Mon Jun 27 12:16:28 EDT 2005
Hi Stefan,
I've devised a far simpler way of getting what I want using a perl
module called: WWW::Mechanize.
It simply gets the page source of a url which you can then parse the
appropriate links to get the nucleotide accession number.
Thanks again,
Kat
use
WWW::Mechanize;
my $mech =
WWW::Mechanize->new();
# THIS GETS THE HTML CONTENTS OF A WEBSITE
my $url =
"http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=13488063";
$mech->get( $url
);
print $mech->content();
Stefan Kirov wrote:
> Kat,
> In my experience works OK:
> 60 mendel /home/sao> perl test.pl
> seq is NP_106261
> Feature source starts 1 ends 243 strand 1
> Feature count is 0
> count is 1
> GETS TO HERE WITH NO KEYS 3
> DOESN'T GET TO HERE
> Annotation db_xref stringified value Value: taxon:266835
> DOESN'T GET TO HERE
> Annotation strain stringified value Value: MAFF303099
> DOESN'T GET TO HERE
> Annotation organism stringified value Value: Mesorhizobium loti
> MAFF303099
> Feature Protein starts 1 ends 243 strand 1
> Feature count is 0
> count is 2
> GETS TO HERE WITH NO KEYS 1
> DOESN'T GET TO HERE
> Annotation product stringified value Value: transcriptional regulator
> Feature CDS starts 1 ends 243 strand 1
> Feature count is 0
> count is 3
> GETS TO HERE WITH NO KEYS 4
> DOESN'T GET TO HERE
> Annotation locus_tag stringified value Value: mlr5637
> DOESN'T GET TO HERE
> Annotation db_xref stringified value Value: GeneID:1228922
> DOESN'T GET TO HERE
> Annotation coded_by stringified value Value: NC_002678.2:4529413..4530144
> DOESN'T GET TO HERE
> Annotation transl_table stringified value Value: 11
>
> What you need is
> Annotation coded_by stringified value Value: NC_002678.2:4529413..4530144
> What bioperl version are you using?
>
> Stefan
>
> Kat Hull wrote:
>
>> Hi Stefan,
>> I'm still having problems!
>> I get some features from the sequence object but can't get past this
>> part in the code:
>>
>> $ac=$feat->annotation(); # This is not returning anything.
>>
>> Could you look at the code to tell me where its going wrong?
>>
>> Thanks again,
>> #________________________________________
>> #!/biol/programs/perl580/bin/perl
>>
>>
>> use Bio::Annotation::Collection;
>> use Bio::DB::GenPept;
>> $gb = new Bio::DB::GenPept;
>>
>>
>> my $seqio =
>> $gb->get_Stream_by_id(['13474692']);
>> my $ac;
>> my $c;
>>
>>
>> while( my $seq = $seqio->next_seq ) {
>> print "seq is ", $seq->display_id, "\n";
>> my @f=$seq->get_all_SeqFeatures; #This gives you the
>> annotation of the retrieved sequence object
>>
>>
>> foreach my $feat (@f) {
>> ++$c;
>> print "Feature ",$feat->primary_tag," starts ",$feat->start,"
>> ends ",
>> $feat->end," strand ",$feat->strand,"\n";
>>
>>
>> # features retain link to underlying sequence object
>> #print "Feature sequence is ",$feat->seq->seq(),"\n";
>>
>>
>> my $t = $feat->feature_count;
>> print "Feature count is $t\n";
>> print "count is $c\n";
>>
>>
>> $ac=$feat->annotation(); # PROBLEM SEEMS TO BE HERE
>>
>>
>> my $blah=$ac->get_all_annotation_keys();
>> print "GETS TO HERE WITH NO KEYS $blah\n";
>>
>>
>> foreach $key ( $ac->get_all_annotation_keys() ) {
>> print "DOESN'T GET TO HERE\n";
>> @values = $ac->get_Annotations($key);
>> foreach $value ( @values ) {
>> # value is an Bio::AnnotationI, and defines a
>> "as_text" method
>> print "Annotation ",$key," stringified value
>> ",$value->as_text,"\n";
>>
>>
>> # also defined hash_tree method, which allows data
>> orientated
>> # access into this object
>> $hash = $value->hash_tree();
>> }
>> }
>>
>>
>> # commented out for now
>>
>>
>> # next unless ($ac->get_Annotations('coded_by'));
>> # my @coded=$ac->get_Annotations('coded_by');
>> # foreach my $location (@coded) {
>> # print $location->value, " is the location that codes this
>> protein\n";
>> # }
>> }
>> }
>>
>>
>>
>>
>>
>> Stefan Kirov wrote:
>>
>>> my $seqio = $gb->get_Stream_by_id(['13474692']);
>>> while( my $seq = $seqio->next_seq ) {
>>> print "seq is ", $seq->display_id, "\n";
>>> my $ann=$seq->annotation; #This gives you the annotation of
>>> the retrieved sequence object
>>> foreach my $dblink ($ann->get_Annotations('DBLink')) {
>>> if ($dblink->database =~/refseq/i) {
>>> print $database->primary_id, " is the mRNA accession
>>> number\n";
>>> }
>>> }
>>> }
>>> However, the gene you are looking at is not associated with any NM_
>>> sequence, but rather comes from NC_. Therefore the above will not
>>> work for you. You will have to descend through the sequence features
>>> and find teh feature that says 'coded_by':
>>> use Bio::DB::GenPept;
>>> my $gb=new Bio::DB::GenPept;
>>> my $seqio = $gb->get_Stream_by_id(['13474692']);
>>> while( my $seq = $seqio->next_seq ) {
>>> print "seq is ", $seq->display_id, "\n";
>>> my @f=$seq->get_SeqFeatures; #This gives you the annotation of
>>> the retrieved sequence object
>>> foreach my $feat (@f) {
>>> my $ann=$feat->annotation;
>>> next unless ($ann->get_Annotations('coded_by'));
>>> my @coded=$ann->get_Annotations('coded_by');
>>> foreach my $location (@coded) {
>>> print $location->value, " is the location that codes this
>>> protein\n";
>>> }
>>> }
>>> }
>>> No guarantees the code is typo free :-)
>>> Stefan
>>>
>>> Kat Hull wrote:
>>>
>>>> Hi Stefan,
>>>> Thanks for your advice but i'm still struggling! I have used
>>>> Bio::DB::GenPept to get the protein accession number given the
>>>> protein gi number. However, I don't understand how
>>>> Bio::Annotation::DBLink works. Does it fetch the url of a link on
>>>> the web-site? Basically, if I could use this (or something else)
>>>> to get the url of the CDS link for my protein of interest, I can
>>>> get the corresponding nucleotide accession from this, as it is
>>>> encoded in the url.
>>>> Do you know how to use this module? Is this what you were
>>>> suggesting I try yesterday (I didn't really understand what you
>>>> were getting at).
>>>> Many thanks,
>>>>
>>>> Kat
>>>>
>>>> ps. Here's where i'm at so far:
>>>>
>>>>
>>>> use Bio::Annotation::DBLink;
>>>> use Bio::DB::GenBank;
>>>> use Bio::DB::GenPept;
>>>> $gb = new Bio::DB::GenPept;
>>>>
>>>>
>>>> # given the gi number, this returns the accession
>>>> my $seqio = $gb->get_Stream_by_id(['13474692']);
>>>> while( my $seq = $seqio->next_seq ) {
>>>> print "seq is ", $seq->display_id, "\n";
>>>> }
>>>> # not sure what i'm doing here
>>>>
>>>>
>>>> $link2 = new Bio::Annotation::DBLink();
>>>> $link2->database('dbSNP');
>>>> $link2->primary_id('2367');
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Stefan Kirov wrote:
>>>>
>>>>> Kat,
>>>>> If you are familiar with Bioperl it is kind of easy-
>>>>> look at Bio::DB::GenPept (I suppose you use GenPept/GenBank?) on
>>>>> how to get the protein record
>>>>> Go through the dblinks and find the appropriate accession number
>>>>> (where the database method returns GenBank).
>>>>> Then retrieve this accession number(s) through Bio::DB::GenBank.
>>>>> If you are not familiar with Bioperl- read the docs for
>>>>> Bio::DB::GenPept, Bio::DB::GenBank, Bio::Annotation and
>>>>> Bio::Annotation::DBLink).
>>>>> Hope this helps,
>>>>> Stefan
>>>>>
>>>>> Kat Hull wrote:
>>>>>
>>>>>> Hi there,
>>>>>> I was wondering whether anyone has a solution to my problem. I
>>>>>> have a list of protein assession numbers and want to retrieve the
>>>>>> corresponding nucleotide sequences automatically. I thought it
>>>>>> would be possible to do this by changing the NCBI url, but this
>>>>>> doesn't seem to be the case.
>>>>>> Is there a bio-perl module that can do this?
>>>>>>
>>>>>> Kind regards,
>>>>>> Kat
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at portal.open-bio.org
>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Stefan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
More information about the Bioperl-l
mailing list