[Bioperl-l] Parsing entrezgene with Bio::SeqIO

Fri Mar 17 20:30:28 UTC 2006

Liisa,
This program:

#!/usr/bin/perl
use Bio::SeqIO;

use Data::Dumper;

my $file=shift;

my $previd;
my $eio=new Bio::SeqIO(-file=>$file,-format=>'entrezgene', 
-debug=>'on',-service_record=>'yes');#, -locuslink=>'convert');
while (1) {
my ($seq,$struct,$uncapt,%transvar);
eval {
($gene,$struct,$uncapt)=$eio->next_seq;
};
if ($@) {print $@,"\n"; print $previd,"\n"; undef $previd; };
last unless ($gene);
my $gid= $gene->accession_number;
 print "\n\nDBlinks for geneid: ",$gene->id, "\t",
          "acc: ", $gene->accession_number,"\n";
        my $ann=$gene->annotation;
    my @dblinks= $ann->get_Annotations('dblink');
    foreach my $dblink (@dblinks) {
        next unless ($dblink->database eq "KEGG");
        print "primary_id:", "\t",$dblink->primary_id,"\n";
        print "url:", "\t", $dblink->url, "\n"; 
        print "as_text:", "\t", $dblink->as_text, "\n";
        print "optional_id:","\t",$dblink->optional_id,"\n" ;
        print "comment:", "\t", $dblink->comment, "\n" ;
        print "object_id:", "\t", $dblink->object_id, "\n";
        print "namespase:", "\t", $dblink->namespace, "\n" ;
        print "authority:", "\t", $dblink->authority, "\n" ;

        print "\nhash_tree\n";
        my $hash_ref = $dblink->hash_tree;
         for my $key (keys %{$hash_ref}) {
           print $key,": ",$hash_ref->{$key},"\n";
         }
    }
}

prints out this:

DBlinks for geneid: A1BG        acc: 1
primary_id:     hsa:1
url:    http://www.genome.jp/dbget-bin/www_bget?hsa:1
as_text:        Direct database link to hsa:1 in database KEGG
optional_id:
comment:
object_id:      hsa:1
namespase:      KEGG
authority:

hash_tree
database: KEGG
primary_id: hsa:1

DBlinks for geneid: A2M acc: 2
primary_id:     hsa:2
url:    http://www.genome.jp/dbget-bin/www_bget?hsa:2
as_text:        Direct database link to hsa:2 in database KEGG
optional_id:
comment:
object_id:      hsa:2
namespase:      KEGG
authority:

hash_tree
database: KEGG
primary_id: hsa:2

By the way I encourage you to use Data::Dumper if you want to print an 
object- it is much nicer. Just do
print Dumper($obj);
The text you refer to is not captured (but I have no ider why you don't 
see the primary id). You can recover it from the uncaptured data, but it 
is kind of dirty.
I will try to make the parser capture this as well.
Stefan

Liisa Koski wrote:

>Thanks Stefan,
>Unfortunately I only parse out the URL and not the primary_id or any comments.
>
>
> print "\n\nDBlinks for geneid: ",$gene->id, "\t", 
>          "acc: ", $gene->accession_number,"\n";
>    my @dblinks= $ann->get_Annotations('dblink');
>    foreach my $dblink (@dblinks) {
>	next unless ($dblink->database eq "KEGG");
>	print "primary_id:", "\t",$dblink->primary_id,"\n"; 
>	print "url:", "\t", $dblink->url, "\n";  
>	print "as_text:", "\t", $dblink->as_text, "\n"; 
>	print "optional_id:","\t",$dblink->optional_id,"\n" ;
>	print "comment:", "\t", $dblink->comment, "\n" ;
>	print "object_id:", "\t", $dblink->object_id, "\n"; 
>	print "namespase:", "\t", $dblink->namespace, "\n" ;
>	print "authority:", "\t", $dblink->authority, "\n" ;
>
>	print "\nhash_tree\n";
>	my $hash_ref = $dblink->hash_tree;
>         for my $key (keys %{$hash_ref}) {
>           print $key,": ",$hash_ref->{$key},"\n";
>         }
>    }
>
>Output:
>------------------------------------
>DBlinks for geneid: ABAT        acc: 18
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 42.
>primary_id:
>url:    http://www.genome.jp/dbget-bin/www_bget?hsa:18
>Use of uninitialized value in concatenation (.) or string 
>at /netshare/home/koski/perl_modules/Bio/Annotation/DBLink.pm line 146.
>as_text: Direct database link to  in database KEGG
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 48.
>optional_id:
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 50.
>comment:
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 52.
>object_id:
>namespase:       KEGG
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 56.
>authority:
>
>printing hash_tree
>database: KEGG
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 62.
>primary_id:
>-----------------------------------------
>
>I see that on the gene page for ABAT (acc: 18) there are KEGG pathways:
>KEGG pathway: Alanine and aspartate metabolism 00252
>KEGG pathway: Butanoate metabolism 00650
>KEGG pathway: Glutamate metabolism 00251
>KEGG pathway: Propanoate metabolism 00640
>KEGG pathway: Valine, leucine and isoleucine degradation 00280
>KEGG pathway: beta-Alanine metabolism 00410 
>
>Is it possible to pull out these pathway names? 
>
>Thanks,
>Liisa
>
>
>On Thursday 16 March 2006 17:29, Stefan Kirov wrote:
>  
>
>>Do this:
>>my @dblinks=$ann->get_Annotations('dblink');
>>foreach my $link (@dblinks) {
>>    next unless ($dblink->database eq 'KEGG");
>>    print $dblink->primary_id,"\t",$dblink->url,"\n";
>>}
>>This works for me, hopefully it will for you too. Let me know if
>>something is not right.
>>Stefan
>>
>>Liisa Koski wrote:
>>    
>>
>>>Hi,
>>>I'm using Bio::SeqIO to parse the EntrezGene file Homo_sapiens (from
>>>ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_OLD/Mammalia/Homo_sapiens.gz).
>>>
>>>I'm using bioperl-1.5.1.
>>>
>>>I want to extract the KEGG annotations.
>>>See code below.
>>>
>>>use Bio::SeqIO;
>>>use Bio::ASN1::EntrezGene;
>>>
>>>my $seqio = Bio::SeqIO->new(-format => 'entrezgene',
>>>                                            -file => 'Homo_sapiens');
>>>while (my $gene = $seqio->next_seq){
>>>   print "\n",$gene->id, "\t", $gene->accession_number, "\n";
>>>   my $ann = $gene->annotation();
>>>   foreach my $key ( $ann->get_all_annotation_keys() ) {
>>>       my @values = $ann->get_Annotations($key);
>>>       foreach my $value ( @values ) {
>>>           print $key, "\t", "=", "\t", $value->as_text,"\n";
>>>       }
>>>   }
>>>}
>>>
>>>Unfortunately the only KEGG annotation I see in the results looks like:
>>>dblink  =       Direct database link to  in database KEGG
>>>(Notice the space between 'to  in')
>>>
>>>Anyone have any ideas how to get the KEGG annotation results?
>>>
>>>Note: I also tried parsing the file
>>>ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/Mammalia/Homo_sapiens.ags.
>>>gz but I got the below error:
>>>
>>>./entrez_gene_seqio.pl Homo_sapiens.ags
>>>Data Error: none conforming data found on line 1 in Homo_sapiens.ags!
>>>first 20 (or till end of input) characters including the non-conforming
>>>data: 00
>>>at /netshare/home/koski/perl_modules/bioperl-live/Bio/SeqIO/entrezgene.pm
>>>line 138
>>>
>>>
>>>Thanks,
>>>Liisa
>>>
>>>_______________________________________________
>>>Bioperl-l mailing list
>>>Bioperl-l at lists.open-bio.org
>>>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>      
>>>
>
>  
>