[Bioperl-l] Question on whole-genome comparisions

Steven Lembark lembark at wrkhors.com
Fri Dec 12 13:41:22 EST 2003


perl-5.8.2 w/ bio-perl 1.2.3

I am testing code for whole-genome comparisions. Right
now the two are m.genatilium and m.pneumoniae. For the
comparision I'm trying to use their GenBank files:

ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Mycoplasma_pneumoniae/U0008
9.gbk
ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Mycoplasma_genitalium/L4396
7.gbk

Catch is that I cannot seem to iterate the genes within
the whole-genome GenBank files. I'm using the code below
to try and walk down the genes. $genome gets set to the
entire genome -- which makes some sense as this is a
whole-genome file. The genes are IN $genome, I just cannot
seem to get at them using get_SeqFeatures. The problem is
that @genome ends up with 1003 entries in it for m.genatilium,
which has 400+ genes...

$genome[0] is obviously the entire DNA given the start and end:

   '_gsf_tag_hash' => HASH(0x8a4f944)
      'db_xref' => ARRAY(0x8a4f974)
         0  'taxon:2097'
      'isolate' => ARRAY(0x8a4f95c)
         0  'G37'
      'organism' => ARRAY(0x8a4f950)
         0  'Mycoplasma genitalium'
   '_location' => Bio::Location::Simple=HASH(0x8a4f848)
      '_end' => 580074
      '_location_type' => 'EXACT'
      '_root_verbose' => 0
      '_seqid' => 'L43967'
      '_start' => 1
      '_strand' => 1
   '_primary_tag' => 'source'
   '_source_tag' => 'EMBL/GenBank/SwissProt'

$genome[1] looks like a gene:

   '_gsf_tag_hash' => HASH(0x8a4fa04)
      'db_xref' => ARRAY(0x8a4fa40)
         0  'GenBank:3844619'
      'gene' => ARRAY(0x8a4fa7c)
         0  'MG001'
   '_location' => Bio::Location::Simple=HASH(0x8a4fb60)
      '_end' => 1829
      '_location_type' => 'EXACT'
      '_root_verbose' => 0
      '_seqid' => 'L43967'
      '_start' => 735
      '_strand' => 1
   '_primary_tag' => 'gene'
   '_source_tag' => 'EMBL/GenBank/SwissProt'

But short of hacking the hash key's, I cannot find any good
way to iterate @genome to get the DNA sequences for each of
the genes. The elements of @genome are not blessed, so there
is something I'm missing...  I've gone through the tutorial
and sequence I/O doc's, Bio::DB::GenBank.

Q: Is there any doc on processing whole-genome GenBank files?

	#!/opt/bin/perl

	use strict;
	use warnings;

	use Bio::Seq;

	sub read_genome
	{
		use Bio::SeqIO;

		my $in = Bio::SeqIO->new( qw( -format genbank -file ), shift );

		my @a = ();

		# whole genome gives an array of all genes as a feature of the
		# first sequence...

		my $genome = $in->next_seq;

		# this includes the whole genome as the first item,
		# followed by individual genes.

		my @genz = $genome->get_SeqFeatures;
	}

	# q: at this point what is the best way to iterate the
	# DNA sequences for each of the genes?

	my @genome = read_genome shift;

	$DB::single = 1;

	0

	__END__


<DB> x $genome

<snip>

      1  Bio::SeqFeature::Generic=HASH(0x8a4a420)
         '_gsf_seq' => Bio::PrimarySeq=HASH(0x8c2aea8)
            -> REUSED_ADDRESS
         '_gsf_tag_hash' => HASH(0x8a4e190)
            'db_xref' => ARRAY(0x8a4e1cc)
               0  'GenBank:3844619'
            'gene' => ARRAY(0x8a501d4)
               0  'MG001'
         '_location' => Bio::Location::Simple=HASH(0x8a502b8)
            '_end' => 1829
            '_location_type' => 'EXACT'
            '_root_verbose' => 0
            '_seqid' => 'L43967'
            '_start' => 735
            '_strand' => 1
         '_primary_tag' => 'gene'
         '_source_tag' => 'EMBL/GenBank/SwissProt'
      2  Bio::SeqFeature::Generic=HASH(0x8a4a468)
         '_gsf_seq' => Bio::PrimarySeq=HASH(0x8c2aea8)
            -> REUSED_ADDRESS
         '_gsf_tag_hash' => HASH(0x8a5036c)
            'codon_start' => ARRAY(0x8a50474)
               0  1
            'db_xref' => ARRAY(0x8a50450)
               0  'GI:3844620'
            'gene' => ARRAY(0x8a503f0)
               0  'MG001'
            'note' => ARRAY(0x8a504e0)
               0  'similar to GB:U00089 SP:Q50313 PID:1209517 PID:1673814 
percent identity: 70.87; identified by sequence similarity; putative'
            'product' => ARRAY(0x8a504ec)
               0  'DNA polymerase III, subunit beta (dnaN)'
            'protein_id' => ARRAY(0x8a50348)
               0  'AAC71217.1'
            'transl_table' => ARRAY(0x8a50438)
               0  4
            'translation' => ARRAY(0x8a504a4)
               0 
'MNNVIISNNKIKPHHSYFLIEAKEKEINFYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKEEIIT
IQEKDQTLLVKTKKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNREISSKFNGV
NFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDKSIVFYYRKDNKDSFSTEMLISM
DNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNELKDALQRIQTLAQNERTFLCDMQINSSELKIRAIVNNIGN
SLEEISCLKFEGYKLNISFNPSSLLDHIESFESNEINFDFQGNSKYFLITSKSEPELKQILVPSR'
         '_location' => Bio::Location::Simple=HASH(0x8a5048c)
            '_end' => 1829
            '_location_type' => 'EXACT'
            '_root_verbose' => 0
            '_seqid' => 'L43967'
            '_start' => 735
            '_strand' => 1
         '_primary_tag' => 'CDS'
         '_source_tag' => 'EMBL/GenBank/SwissProt'

<snip>


--
Steven Lembark                               2930 W. Palmer
Workhorse Computing                       Chicago, IL 60647
                                            +1 888 359 3508


More information about the Bioperl-l mailing list