[Bioperl-l] Extracting gene seq from Bio::DB::GFF

Fri Aug 11 21:36:27 UTC 2006

Many thanks Scott, 

At the same time I got your email I was coming to the same conclusion as
you.

Now I have a stranger problem in my hands... My goal is quite simple, I try
to get the sequence of the genes back from the Bio::DB::GFF database loaded
on MySQL. The gene list is from a file with one gene id per per line. When I
run the following script:

use Bio::DB::GFF;
use Bio::SeqIO;
my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
                            -format => 'fasta');
my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
                              -dsn => 'dbi:mysql:database=dmel_43_new');

while (<>){
    chomp;
    my $id = $_;
    my @feats = $db->get_feature_by_name($id);
    for my $f (@feats){
        $out->write_seq( $f->seq ) if $f->type =~/gene/;
    }
}

I get more sequence back than the number of gene in my input file. I double
check there. Some of the duplicated entries are the same, some are not!

I double check and none of the the duplicated entry in the output are having
duplicated "gene" entry in the original gff files that I just loaded in the
MySQL database using bp_bulk_load_gff.pl.

Any one as an idea as to what is going with my script?

Marco

On 8/11/06 12:05, "Scott Cain" <cain at cshl.edu> wrote:

> Hi Marco,
> 
> What you are getting from get_feature_by_name is a list of
> Bio::DB::GFF::Feature objects, which are Bio::SeqFeatureI objects.  What
> you need are Bio::PrimarySeq objects.  Fortunately,
> Bio::DB::GFF::Feature has a method to get a PrimarySeq out of it; that
> method is called seq.
> 
> So, you should be able to change your line to
> 
>   $out->write_seq( $_->seq() ) for @feat;
> 
> and it should work.  Of course, I haven't test that to make sure that it
> does :-)
> 
> Scott
> 
> 
> On Fri, 2006-08-11 at 11:30 -0700, Marco Blanchette wrote:
>> Dear all,
>> 
>> I used to use this very simple script to extract the gene sequence as a
>> fasta flat file from a Bio::DB::GFF database containing the GadFly 4.3
>> annotations
>> 
>> #!/usr/bin/perl
>> 
>> use strict;
>> use warnings;
>> use Bio::DB::GFF;
>> use Bio::SeqIO;
>> 
>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
>>                             -format => 'fasta');
>> 
>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>                               -dsn => 'dbi:mysql:database=dmel_43_LS');
>> 
>> while (<>){
>>     chomp;
>>     my @feat = $db->get_feature_by_name($_);
>>     $out->write_seq($_) for @feat;
>> }
>> 
>> Somehow I now get the following output instead of the actual sequences:
>>> FBgn0024988 gene:.(FBgn0024988)
>> Bio::PrimarySeq=HASH(0x19fd3d8)
>>> FBgn0041184 gene:.(FBgn0041184)
>> Bio::PrimarySeq=HASH(0x19fa684)
>>> FBgn0033636 gene:.(FBgn0033636)
>> Bio::PrimarySeq=HASH(0x19e1908)
>> 
>> What change and what would be the right way to get what I want?
>> 
>> Many thanks
>> 
>> Marco
>> ______________________________
>> Marco Blanchette, Ph.D.
>> 
>> mblanche at uclink.berkeley.edu
>> 
>> Donald C. Rio's lab
>> Department of Molecular and Cell Biology
>> 16 Barker Hall
>> University of California
>> Berkeley, CA 94720-3204
>> 
>> Tel: (510) 642-1084
>> Cell: (510) 847-0996
>> Fax: (510) 642-6062

______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--