[Bioperl-l] Bio::DB::GFF, aggregators, and spliced_seq
Gathman, Allen
agathman at semo.edu
Tue Aug 23 14:38:42 EDT 2005
Hi, BioPerl gurus:
Although this question involves a Gbrowse database, I think it's actually a
BioPerl question at heart - and in any case it appears that there's a lot of
overlap between the people who answer questions in both lists, so I'm
guessing this is a good place for this question.
I've written a script that finds particular pfam hits in a GBROWSE database,
then uses "overlapping_features" to find predicted gene features of type
"transcript:GLEAN" that overlap those pfams. I've set the aggregator
"transcript" as {CDS/mRNA} already. I select features using a regular
expression to choose particular names, then I use spliced_seq to return the
spliced CDS of each feature - but I'm only getting back the CDS that
actually overlap the pfam hit, not the full predicted gene. So my question
is, what do I need to do in order to get ALL the CDS of each predicted gene
feature spliced together, instead of only the ones that actually overlap the
pfam hit I used to select that predicted gene?
Thanks in advance for any help you can give...
Here's the code:
#!/usr/bin/perl
use strict;
use Bio::DB::GFF;
use Bio::Seq;
use Bio::SeqIO;
use Getopt::Long;
my $outfile;
GetOptions(
'o|outfile=s' => \$outfile,
);
my $outfa= Bio::SeqIO -> new (-file => ">$outfile",
-format => 'Fasta'
);
my $db = Bio::DB::GFF-> new ( - adaptor => 'dbi::mysql',
-dsn =>
'dbi:mysql:database=cc;host=localhost',
-fasta => '/gbrowse/databases/cc'
);
$db->add_aggregator('transcript{CDS/mRNA}');
for (my $i =1; $i<=20; $i++){
my $pfamname="Peptidase_C$i";
my @pfams = $db->get_feature_by_name( Domain => $pfamname);
foreach my $pfamhit (@pfams){
my $desc = $pfamname;
my $score=$pfamhit->score;
my $name = $pfamhit->name;
$desc.= " $score ";
$desc.= $pfamhit->location->seq_id();
$desc.= ": ";
#
# Here's where I'm selecting predicted genes that overlap the Pfam hit
#
my @genes = $db -> overlapping_features(
-refseq => $pfamhit->location->seq_id,
-start => $pfamhit->start,
-stop => $pfamhit->stop,
-types =>'transcript:GLEAN'
);
#
# Now I'm choosing the ones with names I want out of the selected genes
#
foreach my $gene (@genes){
my $gid=$gene->display_id();
if ($gid =~/aug_GLEAN/){
$desc.=$gene->start;
$desc.=" - ";
$desc.=$gene->stop;
#
# Here I'm splicing the gene, tacking on a description, and outputting it.
#
my $splseq = $gene->spliced_seq();
$splseq->desc($desc);
$splseq->display_id($gid);
$outfa->write_seq($splseq);
}# end if aug_GLEAN
}# end foreach gene
}# end foreach pfamhit
} # end for numbers
close OUT;
Allen Gathman
http://cstl-csm.semo.edu/gathman
More information about the Bioperl-l
mailing list