[Bioperl-l] Parsing entrezgene with Bio::SeqIO
Stefan Kirov
skirov at utk.edu
Fri Mar 17 20:30:28 UTC 2006
Liisa,
This program:
#!/usr/bin/perl
use Bio::SeqIO;
use Data::Dumper;
my $file=shift;
my $previd;
my $eio=new Bio::SeqIO(-file=>$file,-format=>'entrezgene',
-debug=>'on',-service_record=>'yes');#, -locuslink=>'convert');
while (1) {
my ($seq,$struct,$uncapt,%transvar);
eval {
($gene,$struct,$uncapt)=$eio->next_seq;
};
if ($@) {print $@,"\n"; print $previd,"\n"; undef $previd; };
last unless ($gene);
my $gid= $gene->accession_number;
print "\n\nDBlinks for geneid: ",$gene->id, "\t",
"acc: ", $gene->accession_number,"\n";
my $ann=$gene->annotation;
my @dblinks= $ann->get_Annotations('dblink');
foreach my $dblink (@dblinks) {
next unless ($dblink->database eq "KEGG");
print "primary_id:", "\t",$dblink->primary_id,"\n";
print "url:", "\t", $dblink->url, "\n";
print "as_text:", "\t", $dblink->as_text, "\n";
print "optional_id:","\t",$dblink->optional_id,"\n" ;
print "comment:", "\t", $dblink->comment, "\n" ;
print "object_id:", "\t", $dblink->object_id, "\n";
print "namespase:", "\t", $dblink->namespace, "\n" ;
print "authority:", "\t", $dblink->authority, "\n" ;
print "\nhash_tree\n";
my $hash_ref = $dblink->hash_tree;
for my $key (keys %{$hash_ref}) {
print $key,": ",$hash_ref->{$key},"\n";
}
}
}
prints out this:
DBlinks for geneid: A1BG acc: 1
primary_id: hsa:1
url: http://www.genome.jp/dbget-bin/www_bget?hsa:1
as_text: Direct database link to hsa:1 in database KEGG
optional_id:
comment:
object_id: hsa:1
namespase: KEGG
authority:
hash_tree
database: KEGG
primary_id: hsa:1
DBlinks for geneid: A2M acc: 2
primary_id: hsa:2
url: http://www.genome.jp/dbget-bin/www_bget?hsa:2
as_text: Direct database link to hsa:2 in database KEGG
optional_id:
comment:
object_id: hsa:2
namespase: KEGG
authority:
hash_tree
database: KEGG
primary_id: hsa:2
By the way I encourage you to use Data::Dumper if you want to print an
object- it is much nicer. Just do
print Dumper($obj);
The text you refer to is not captured (but I have no ider why you don't
see the primary id). You can recover it from the uncaptured data, but it
is kind of dirty.
I will try to make the parser capture this as well.
Stefan
Liisa Koski wrote:
>Thanks Stefan,
>Unfortunately I only parse out the URL and not the primary_id or any comments.
>
>
> print "\n\nDBlinks for geneid: ",$gene->id, "\t",
> "acc: ", $gene->accession_number,"\n";
> my @dblinks= $ann->get_Annotations('dblink');
> foreach my $dblink (@dblinks) {
> next unless ($dblink->database eq "KEGG");
> print "primary_id:", "\t",$dblink->primary_id,"\n";
> print "url:", "\t", $dblink->url, "\n";
> print "as_text:", "\t", $dblink->as_text, "\n";
> print "optional_id:","\t",$dblink->optional_id,"\n" ;
> print "comment:", "\t", $dblink->comment, "\n" ;
> print "object_id:", "\t", $dblink->object_id, "\n";
> print "namespase:", "\t", $dblink->namespace, "\n" ;
> print "authority:", "\t", $dblink->authority, "\n" ;
>
> print "\nhash_tree\n";
> my $hash_ref = $dblink->hash_tree;
> for my $key (keys %{$hash_ref}) {
> print $key,": ",$hash_ref->{$key},"\n";
> }
> }
>
>Output:
>------------------------------------
>DBlinks for geneid: ABAT acc: 18
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 42.
>primary_id:
>url: http://www.genome.jp/dbget-bin/www_bget?hsa:18
>Use of uninitialized value in concatenation (.) or string
>at /netshare/home/koski/perl_modules/Bio/Annotation/DBLink.pm line 146.
>as_text: Direct database link to in database KEGG
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 48.
>optional_id:
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 50.
>comment:
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 52.
>object_id:
>namespase: KEGG
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 56.
>authority:
>
>printing hash_tree
>database: KEGG
>Use of uninitialized value in print at ./entrez_gene_seqio.pl line 62.
>primary_id:
>-----------------------------------------
>
>I see that on the gene page for ABAT (acc: 18) there are KEGG pathways:
>KEGG pathway: Alanine and aspartate metabolism 00252
>KEGG pathway: Butanoate metabolism 00650
>KEGG pathway: Glutamate metabolism 00251
>KEGG pathway: Propanoate metabolism 00640
>KEGG pathway: Valine, leucine and isoleucine degradation 00280
>KEGG pathway: beta-Alanine metabolism 00410
>
>Is it possible to pull out these pathway names?
>
>Thanks,
>Liisa
>
>
>On Thursday 16 March 2006 17:29, Stefan Kirov wrote:
>
>
>>Do this:
>>my @dblinks=$ann->get_Annotations('dblink');
>>foreach my $link (@dblinks) {
>> next unless ($dblink->database eq 'KEGG");
>> print $dblink->primary_id,"\t",$dblink->url,"\n";
>>}
>>This works for me, hopefully it will for you too. Let me know if
>>something is not right.
>>Stefan
>>
>>Liisa Koski wrote:
>>
>>
>>>Hi,
>>>I'm using Bio::SeqIO to parse the EntrezGene file Homo_sapiens (from
>>>ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_OLD/Mammalia/Homo_sapiens.gz).
>>>
>>>I'm using bioperl-1.5.1.
>>>
>>>I want to extract the KEGG annotations.
>>>See code below.
>>>
>>>use Bio::SeqIO;
>>>use Bio::ASN1::EntrezGene;
>>>
>>>my $seqio = Bio::SeqIO->new(-format => 'entrezgene',
>>> -file => 'Homo_sapiens');
>>>while (my $gene = $seqio->next_seq){
>>> print "\n",$gene->id, "\t", $gene->accession_number, "\n";
>>> my $ann = $gene->annotation();
>>> foreach my $key ( $ann->get_all_annotation_keys() ) {
>>> my @values = $ann->get_Annotations($key);
>>> foreach my $value ( @values ) {
>>> print $key, "\t", "=", "\t", $value->as_text,"\n";
>>> }
>>> }
>>>}
>>>
>>>Unfortunately the only KEGG annotation I see in the results looks like:
>>>dblink = Direct database link to in database KEGG
>>>(Notice the space between 'to in')
>>>
>>>Anyone have any ideas how to get the KEGG annotation results?
>>>
>>>Note: I also tried parsing the file
>>>ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/Mammalia/Homo_sapiens.ags.
>>>gz but I got the below error:
>>>
>>>./entrez_gene_seqio.pl Homo_sapiens.ags
>>>Data Error: none conforming data found on line 1 in Homo_sapiens.ags!
>>>first 20 (or till end of input) characters including the non-conforming
>>>data: 00
>>>at /netshare/home/koski/perl_modules/bioperl-live/Bio/SeqIO/entrezgene.pm
>>>line 138
>>>
>>>
>>>Thanks,
>>>Liisa
>>>
>>>_______________________________________________
>>>Bioperl-l mailing list
>>>Bioperl-l at lists.open-bio.org
>>>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>
>
>
More information about the Bioperl-l
mailing list