[Bioperl-l] Bio::DB::GFF, aggregators, and spliced_seq

Tue Aug 23 14:38:42 EDT 2005

Hi, BioPerl gurus: 

Although this question involves a Gbrowse database, I think it's actually a
BioPerl question at heart - and in any case it appears that there's a lot of
overlap between the people who answer questions in both lists, so I'm
guessing this is a good place for this question.  

I've written a script that finds particular pfam hits in a GBROWSE database,
then uses "overlapping_features" to find predicted gene features of  type
"transcript:GLEAN" that overlap those pfams. I've set the aggregator
"transcript" as {CDS/mRNA} already.  I select features using a regular
expression to choose particular names, then I use spliced_seq to return the
spliced CDS of each feature - but I'm only getting back the CDS that
actually overlap the pfam hit, not the full predicted gene.  So my question
is, what do I need to do in order to get ALL the CDS of each predicted gene
feature spliced together, instead of only the ones that actually overlap the
pfam hit I used to select that predicted gene?  

Thanks in advance for any help you can give...

Here's the code:

#!/usr/bin/perl

use strict;

use Bio::DB::GFF;

use Bio::Seq;

use Bio::SeqIO;

use Getopt::Long;

my $outfile; 

GetOptions(

           'o|outfile=s' => \$outfile,

           ); 

my $outfa= Bio::SeqIO -> new (-file => ">$outfile",

                              -format => 'Fasta'

                             );

my $db      = Bio::DB::GFF-> new ( - adaptor => 'dbi::mysql',

                               -dsn        =>
'dbi:mysql:database=cc;host=localhost',

                               -fasta      => '/gbrowse/databases/cc'

                               );

$db->add_aggregator('transcript{CDS/mRNA}');

     for (my $i =1; $i<=20; $i++){

      my $pfamname="Peptidase_C$i";

        my @pfams = $db->get_feature_by_name( Domain => $pfamname);

        foreach my $pfamhit (@pfams){

             my $desc = $pfamname;

             my $score=$pfamhit->score;

             my $name = $pfamhit->name;

             $desc.= " $score ";

             $desc.= $pfamhit->location->seq_id();

             $desc.= ": ";

#

# Here's where I'm selecting predicted genes that overlap the Pfam hit

#

             my @genes = $db -> overlapping_features(

                                   -refseq => $pfamhit->location->seq_id,

                                   -start => $pfamhit->start,

                                   -stop => $pfamhit->stop,

                                   -types =>'transcript:GLEAN' 

                                   );

#

# Now I'm choosing the ones with names I want out of the selected genes

#

                  foreach my $gene (@genes){

                       my $gid=$gene->display_id();

                       if ($gid =~/aug_GLEAN/){

                            $desc.=$gene->start;

                            $desc.=" - "; 

                            $desc.=$gene->stop;

#

# Here I'm splicing the gene, tacking on a description, and outputting it.

#

                            my $splseq = $gene->spliced_seq();

                            $splseq->desc($desc);

                            $splseq->display_id($gid);

                            $outfa->write_seq($splseq);

                       }# end if aug_GLEAN   

                  }# end foreach gene

         }# end foreach pfamhit

   } # end for numbers

close OUT;

Allen Gathman

http://cstl-csm.semo.edu/gathman