[Bioperl-l] Extracting gene seq from Bio::DB::GFF

Marco Blanchette mblanche at berkeley.edu
Fri Aug 11 23:59:26 UTC 2006


Chris,
 
> Do you mean you get duplicates of sequences back, or that you get more than
> one chunk of the same sequence back?

I sometimes get duplicated sequences and sometimes overlapping regions (see
bellow)

> 
> Is it possible that each query using an ID could contain more than one
> feature?  That might explain it (you could check by testing the size of the
> array @feats).  
Most id return more than one features from various type ( point_mutation,
insertion_site, processed_transcript, etc...). That's why I restirct the
output to type "gene" using regexp /gene/ on $f->type.

> 
> I'm not sure how split locations are handled within Bio:DB::GFF, but do the
> specific features have split locations?
> 
> Chris
>
Not sure what you mean exactly but have a look at the following script, it
gives the location and the group id of the feature being reported:

use Bio::DB::GFF;
my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
                              -dsn => 'dbi:mysql:database=dmel_43_new');
my %dups;
while (<>){
    chomp;
    my $id = $_;
    my @feat = $db->get_feature_by_name($id);
    
     for my $f (@feat){
        if (exists $dups{$f->group} && $f->type =~/gene/){
            print     "Calling >>>", $f->group, "\n";
            print     "Chr: ", $f->refseq,
                    " Strand: ", $f->strand,
                    " Start: ", $f->start,
                    " End: ", $f->end,
                    "\n";
            print "Offending >>>", $dups{$f->group}->group, "\n";
            print     "Chr: ", $dups{$f->group}->refseq,
                    " Strand: ", $dups{$f->group}->strand,
                    " Start: ", $dups{$f->group}->start,
                      " End: ", $dups{$f->group}->end;
            print "\n\n";
         } else {
            $dups{$f->group} = $f;
         }
     }
}

Here is the output:
Calling >>>FBgn0004179
Chr: 3L Strand: 1 Start: 22201102 End: 22207587
Offending >>>FBgn0004179
Chr: 3L Strand: 1 Start: 22200575 End: 22200575

Calling >>>FBgn0025681
Chr: 2L Strand: 1 Start: 2992964 End: 2998614
Offending >>>FBgn0025681
Chr: 2L Strand: 1 Start: 2992964 End: 2998614

Calling >>>FBgn0025803
Chr: 3R Strand: 1 Start: 16966463 End: 17038413
Offending >>>FBgn0025803
Chr: 3R Strand: 1 Start: 16966463 End: 17038413

Calling >>>FBgn0000117
Chr: X Strand: -1 Start: 1756796 End: 1747557
Offending >>>FBgn0000117
Chr: X Strand: -1 Start: 1757776 End: 1747182

Calling >>>FBgn0005427
Chr: X Strand: -1 Start: 136456 End: 125343
Offending >>>FBgn0005427
Chr: X Strand: -1 Start: 133199 End: 124949

Calling >>>FBgn0000042
Chr: X Strand: 1 Start: 5746100 End: 5750026
Offending >>>FBgn0000042
Chr: X Strand: 1 Start: 5746096 End: 5746106

Calling >>>FBgn0004551
Chr: 2R Strand: -1 Start: 19443485 End: 19434556
Offending >>>FBgn0004551
Chr: 2R Strand: -1 Start: 19445155 End: 19429977

Do you have any suggestions?? Is the procedure I am using to retrieve  the
genes right?

Many thanks

Marco


 
>> Many thanks Scott,
>> 
>> At the same time I got your email I was coming to the same conclusion as
>> you.
>> 
>> Now I have a stranger problem in my hands... My goal is quite simple, I
>> try
>> to get the sequence of the genes back from the Bio::DB::GFF database
>> loaded
>> on MySQL. The gene list is from a file with one gene id per per line. When
>> I
>> run the following script:
>> 
>> 
>> 
>> use Bio::DB::GFF;
>> use Bio::SeqIO;
>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
>>                             -format => 'fasta');
>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>                               -dsn => 'dbi:mysql:database=dmel_43_new');
>> 
>> while (<>){
>>     chomp;
>>     my $id = $_;
>>     my @feats = $db->get_feature_by_name($id);
>>     for my $f (@feats){
>>         $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>     }
>> }
>> 
>> 
>> I get more sequence back than the number of gene in my input file. I
>> double
>> check there. Some of the duplicated entries are the same, some are not!
> 
> 
> ...
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
-- 






More information about the Bioperl-l mailing list