[Bioperl-l] Question on whole-genome comparisions

Brian Osborne brian_osborne at cognia.com
Fri Dec 12 14:08:11 EST 2003


Steven,

>seem to get at them using get_SeqFeatures. The problem is
>that @genome ends up with 1003 entries in it for m.genatilium,
>which has 400+ genes...

If I understand you correctly you're saying you want to get only those
sequence features which represent genes, yes? Any given gene in the sequence
is likely to have at least 2 features, "CDS" and "gene", and there may be
other interesting features that the authors have noted in the Genbank file
as well. This explains your numeric discrepancy. Since you're new to this
Genbank parsing business and I'm starting to write a HOWTO on Features could
you take a look at my first draft and see if it helps you? I'd be interested
in your comments. It's here:

http://bioperl.org/HOWTOs/html/Feature-Annotation.html


Brian O.

-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Steven Lembark
Sent: Friday, December 12, 2003 1:41 PM
To: bioperl-l at bioperl.org
Subject: [Bioperl-l] Question on whole-genome comparisions

perl-5.8.2 w/ bio-perl 1.2.3

I am testing code for whole-genome comparisions. Right
now the two are m.genatilium and m.pneumoniae. For the
comparision I'm trying to use their GenBank files:

ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Mycoplasma_pneumoniae/U0008
9.gbk
ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Mycoplasma_genitalium/L4396
7.gbk

Catch is that I cannot seem to iterate the genes within
the whole-genome GenBank files. I'm using the code below
to try and walk down the genes. $genome gets set to the
entire genome -- which makes some sense as this is a
whole-genome file. The genes are IN $genome, I just cannot
seem to get at them using get_SeqFeatures. The problem is
that @genome ends up with 1003 entries in it for m.genatilium,
which has 400+ genes...

$genome[0] is obviously the entire DNA given the start and end:

   '_gsf_tag_hash' => HASH(0x8a4f944)
      'db_xref' => ARRAY(0x8a4f974)
         0  'taxon:2097'
      'isolate' => ARRAY(0x8a4f95c)
         0  'G37'
      'organism' => ARRAY(0x8a4f950)
         0  'Mycoplasma genitalium'
   '_location' => Bio::Location::Simple=HASH(0x8a4f848)
      '_end' => 580074
      '_location_type' => 'EXACT'
      '_root_verbose' => 0
      '_seqid' => 'L43967'
      '_start' => 1
      '_strand' => 1
   '_primary_tag' => 'source'
   '_source_tag' => 'EMBL/GenBank/SwissProt'

$genome[1] looks like a gene:

   '_gsf_tag_hash' => HASH(0x8a4fa04)
      'db_xref' => ARRAY(0x8a4fa40)
         0  'GenBank:3844619'
      'gene' => ARRAY(0x8a4fa7c)
         0  'MG001'
   '_location' => Bio::Location::Simple=HASH(0x8a4fb60)
      '_end' => 1829
      '_location_type' => 'EXACT'
      '_root_verbose' => 0
      '_seqid' => 'L43967'
      '_start' => 735
      '_strand' => 1
   '_primary_tag' => 'gene'
   '_source_tag' => 'EMBL/GenBank/SwissProt'

But short of hacking the hash key's, I cannot find any good
way to iterate @genome to get the DNA sequences for each of
the genes. The elements of @genome are not blessed, so there
is something I'm missing...  I've gone through the tutorial
and sequence I/O doc's, Bio::DB::GenBank.

Q: Is there any doc on processing whole-genome GenBank files?

        #!/opt/bin/perl

        use strict;
        use warnings;

        use Bio::Seq;

        sub read_genome
        {
                use Bio::SeqIO;

                my $in = Bio::SeqIO->new( qw( -format genbank -file ),
shift );

                my @a = ();

                # whole genome gives an array of all genes as a feature of
the
                # first sequence...

                my $genome = $in->next_seq;

                # this includes the whole genome as the first item,
                # followed by individual genes.

                my @genz = $genome->get_SeqFeatures;
        }

        # q: at this point what is the best way to iterate the
        # DNA sequences for each of the genes?

        my @genome = read_genome shift;

        $DB::single = 1;

        0

        __END__


<DB> x $genome

<snip>

      1  Bio::SeqFeature::Generic=HASH(0x8a4a420)
         '_gsf_seq' => Bio::PrimarySeq=HASH(0x8c2aea8)
            -> REUSED_ADDRESS
         '_gsf_tag_hash' => HASH(0x8a4e190)
            'db_xref' => ARRAY(0x8a4e1cc)
               0  'GenBank:3844619'
            'gene' => ARRAY(0x8a501d4)
               0  'MG001'
         '_location' => Bio::Location::Simple=HASH(0x8a502b8)
            '_end' => 1829
            '_location_type' => 'EXACT'
            '_root_verbose' => 0
            '_seqid' => 'L43967'
            '_start' => 735
            '_strand' => 1
         '_primary_tag' => 'gene'
         '_source_tag' => 'EMBL/GenBank/SwissProt'
      2  Bio::SeqFeature::Generic=HASH(0x8a4a468)
         '_gsf_seq' => Bio::PrimarySeq=HASH(0x8c2aea8)
            -> REUSED_ADDRESS
         '_gsf_tag_hash' => HASH(0x8a5036c)
            'codon_start' => ARRAY(0x8a50474)
               0  1
            'db_xref' => ARRAY(0x8a50450)
               0  'GI:3844620'
            'gene' => ARRAY(0x8a503f0)
               0  'MG001'
            'note' => ARRAY(0x8a504e0)
               0  'similar to GB:U00089 SP:Q50313 PID:1209517 PID:1673814
percent identity: 70.87; identified by sequence similarity; putative'
            'product' => ARRAY(0x8a504ec)
               0  'DNA polymerase III, subunit beta (dnaN)'
            'protein_id' => ARRAY(0x8a50348)
               0  'AAC71217.1'
            'transl_table' => ARRAY(0x8a50438)
               0  4
            'translation' => ARRAY(0x8a504a4)
               0
'MNNVIISNNKIKPHHSYFLIEAKEKEINFYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKEEIIT
IQEKDQTLLVKTKKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNREISSKFNGV
NFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDKSIVFYYRKDNKDSFSTEMLISM
DNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNELKDALQRIQTLAQNERTFLCDMQINSSELKIRAIVNNIGN
SLEEISCLKFEGYKLNISFNPSSLLDHIESFESNEINFDFQGNSKYFLITSKSEPELKQILVPSR'
         '_location' => Bio::Location::Simple=HASH(0x8a5048c)
            '_end' => 1829
            '_location_type' => 'EXACT'
            '_root_verbose' => 0
            '_seqid' => 'L43967'
            '_start' => 735
            '_strand' => 1
         '_primary_tag' => 'CDS'
         '_source_tag' => 'EMBL/GenBank/SwissProt'

<snip>


--
Steven Lembark                               2930 W. Palmer
Workhorse Computing                       Chicago, IL 60647
                                            +1 888 359 3508
_______________________________________________
Bioperl-l mailing list
Bioperl-l at portal.open-bio.org
http://portal.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list