[Bioperl-l] parsing GenBank file

shalabh sharma shalabh.sharma7 at gmail.com
Tue May 4 18:18:02 UTC 2010


Hi All,
      I have a huge GenBank file (downloaded from RDP, containing all
bacterial 16S sequences). I just want to parse the RDP id (in LOCUS) and the
organism's lineage (in ORGANISM).
I wrote a simple script for this:

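#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Read the GenBank file named on the command line; state the format explicitly.
my $seqio_object = Bio::SeqIO->new(-file => $ARGV[0], -format => 'genbank');
while (my $seq_object = $seqio_object->next_seq) {
    # The RDP id comes from the LOCUS line.
    my $id = $seq_object->id;
    print "$id\t";
    # species can be undefined if the record has no parsable ORGANISM block.
    my $species_object = $seq_object->species;
    my @classification = $species_object ? $species_object->classification : ();
    foreach my $val (@classification) { print "$val\t"; }
    print "\n";
}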

I am getting output like:

S000107505 uncultured Acidobacteria bacterium Geothrix Holophagaceae
Holophagales Holophagae "Acidobacteria" Bacteria Root
S000148973 uncultured Geothrix sp. Geothrix Holophagaceae Holophagales
Holophagae "Acidobacteria" Bacteria Root
S000431649 uncultured Acidobacteria bacterium Geothrix Holophagaceae
Holophagales Holophagae "Acidobacteria" Bacteria Root
..
..

This is exactly the output I want, but I am missing a lot of records (they
are in the GenBank file but not in my output).
I also got a warning during parsing:

--------------------- WARNING ---------------------
MSG: Unbalanced quote in:
/db_xref="taxon:35783" /germline"
/mol_type="genomic DNA"
/organism="Enterococcus sp."
/strain="LMG12316"No further qualifiers will be added for this feature
---------------------------------------------------

So I was wondering: is this warning causing the problem, or am I doing
something wrong?
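
One way to narrow this down (a rough sketch, not a definitive diagnosis) is to
list the record ids straight from the LOCUS lines, without Bio::SeqIO, and
compare them with the ids the script printed. The sub name locus_ids is
hypothetical, and the sketch assumes every record starts with a line beginning
with "LOCUS" whose second whitespace-separated field is the id.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect record ids by scanning for lines that
# start with "LOCUS". Assumes the second whitespace-separated field
# on that line is the record id.
sub locus_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        if ($line =~ /^LOCUS\s+(\S+)/) {
            push @ids, $1;
        }
    }
    return @ids;
}

# When run with a file name: print the record count, then one id per line.
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my @ids = locus_ids($fh);
    close($fh);
    print scalar(@ids), " LOCUS records\n";
    print "$_\n" for @ids;
}
```

Comparing this id list with the first column of the Bio::SeqIO output should
show which records are missing, and whether they cluster around the entry
named in the warning.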

Thanks
Shalabh


