[Bioperl-l] Question on whole-genome comparisions
Steven Lembark
lembark at wrkhors.com
Fri Dec 12 13:41:22 EST 2003
perl-5.8.2 w/ bio-perl 1.2.3
I am testing code for whole-genome comparisions. Right
now the two are m.genatilium and m.pneumoniae. For the
comparision I'm trying to use their GenBank files:
ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Mycoplasma_pneumoniae/U0008
9.gbk
ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Mycoplasma_genitalium/L4396
7.gbk
Catch is that I cannot seem to iterate the genes within
the whole-genome GenBank files. I'm using the code below
to try and walk down the genes. $genome gets set to the
entire genome -- which makes some sense as this is a
whole-genome file. The genes are IN $genome, I just cannot
seem to get at them using get_SeqFeatures. The problem is
that @genome ends up with 1003 entries in it for m.genatilium,
which has 400+ genes...
$genome[0] is obviously the entire DNA given the start and end:
'_gsf_tag_hash' => HASH(0x8a4f944)
'db_xref' => ARRAY(0x8a4f974)
0 'taxon:2097'
'isolate' => ARRAY(0x8a4f95c)
0 'G37'
'organism' => ARRAY(0x8a4f950)
0 'Mycoplasma genitalium'
'_location' => Bio::Location::Simple=HASH(0x8a4f848)
'_end' => 580074
'_location_type' => 'EXACT'
'_root_verbose' => 0
'_seqid' => 'L43967'
'_start' => 1
'_strand' => 1
'_primary_tag' => 'source'
'_source_tag' => 'EMBL/GenBank/SwissProt'
$genome[1] looks like a gene:
'_gsf_tag_hash' => HASH(0x8a4fa04)
'db_xref' => ARRAY(0x8a4fa40)
0 'GenBank:3844619'
'gene' => ARRAY(0x8a4fa7c)
0 'MG001'
'_location' => Bio::Location::Simple=HASH(0x8a4fb60)
'_end' => 1829
'_location_type' => 'EXACT'
'_root_verbose' => 0
'_seqid' => 'L43967'
'_start' => 735
'_strand' => 1
'_primary_tag' => 'gene'
'_source_tag' => 'EMBL/GenBank/SwissProt'
But short of hacking the hash key's, I cannot find any good
way to iterate @genome to get the DNA sequences for each of
the genes. The elements of @genome are not blessed, so there
is something I'm missing... I've gone through the tutorial
and sequence I/O doc's, Bio::DB::GenBank.
Q: Is there any doc on processing whole-genome GenBank files?
#!/opt/bin/perl
use strict;
use warnings;
use Bio::Seq;
sub read_genome
{
use Bio::SeqIO;
my $in = Bio::SeqIO->new( qw( -format genbank -file ), shift );
my @a = ();
# whole genome gives an array of all genes as a feature of the
# first sequence...
my $genome = $in->next_seq;
# this includes the whole genome as the first item,
# followed by individual genes.
my @genz = $genome->get_SeqFeatures;
}
# q: at this point what is the best way to iterate the
# DNA sequences for each of the genes?
my @genome = read_genome shift;
$DB::single = 1;
0
__END__
<DB> x $genome
<snip>
1 Bio::SeqFeature::Generic=HASH(0x8a4a420)
'_gsf_seq' => Bio::PrimarySeq=HASH(0x8c2aea8)
-> REUSED_ADDRESS
'_gsf_tag_hash' => HASH(0x8a4e190)
'db_xref' => ARRAY(0x8a4e1cc)
0 'GenBank:3844619'
'gene' => ARRAY(0x8a501d4)
0 'MG001'
'_location' => Bio::Location::Simple=HASH(0x8a502b8)
'_end' => 1829
'_location_type' => 'EXACT'
'_root_verbose' => 0
'_seqid' => 'L43967'
'_start' => 735
'_strand' => 1
'_primary_tag' => 'gene'
'_source_tag' => 'EMBL/GenBank/SwissProt'
2 Bio::SeqFeature::Generic=HASH(0x8a4a468)
'_gsf_seq' => Bio::PrimarySeq=HASH(0x8c2aea8)
-> REUSED_ADDRESS
'_gsf_tag_hash' => HASH(0x8a5036c)
'codon_start' => ARRAY(0x8a50474)
0 1
'db_xref' => ARRAY(0x8a50450)
0 'GI:3844620'
'gene' => ARRAY(0x8a503f0)
0 'MG001'
'note' => ARRAY(0x8a504e0)
0 'similar to GB:U00089 SP:Q50313 PID:1209517 PID:1673814
percent identity: 70.87; identified by sequence similarity; putative'
'product' => ARRAY(0x8a504ec)
0 'DNA polymerase III, subunit beta (dnaN)'
'protein_id' => ARRAY(0x8a50348)
0 'AAC71217.1'
'transl_table' => ARRAY(0x8a50438)
0 4
'translation' => ARRAY(0x8a504a4)
0
'MNNVIISNNKIKPHHSYFLIEAKEKEINFYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKEEIIT
IQEKDQTLLVKTKKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNREISSKFNGV
NFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDKSIVFYYRKDNKDSFSTEMLISM
DNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNELKDALQRIQTLAQNERTFLCDMQINSSELKIRAIVNNIGN
SLEEISCLKFEGYKLNISFNPSSLLDHIESFESNEINFDFQGNSKYFLITSKSEPELKQILVPSR'
'_location' => Bio::Location::Simple=HASH(0x8a5048c)
'_end' => 1829
'_location_type' => 'EXACT'
'_root_verbose' => 0
'_seqid' => 'L43967'
'_start' => 735
'_strand' => 1
'_primary_tag' => 'CDS'
'_source_tag' => 'EMBL/GenBank/SwissProt'
<snip>
--
Steven Lembark 2930 W. Palmer
Workhorse Computing Chicago, IL 60647
+1 888 359 3508
More information about the Bioperl-l
mailing list