[Bioperl-l] Extracting gene seq from Bio::DB::GFF

Chris Fields cjfields at uiuc.edu
Mon Aug 14 00:45:02 UTC 2006


Marco,

Did you figure out what the problem was?  I was curious; the issue  
you were having was rather odd.  I wanted to see if it was an issue  
with the GFF data or with the database itself.

Chris

On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:

> Chris,
>
>> Do you mean you get duplicates of sequences back, or that you get  
>> more than
>> one chunk of the same sequence back?
>
> I sometimes get duplicated sequences and sometimes overlapping  
> regions (see
> bellow)
>
>>
>> Is it possible that each query using an ID could contain more than  
>> one
>> feature?  That might explain it (you could check by testing the  
>> size of the
>> array @feats).
> Most id return more than one features from various type  
> ( point_mutation,
> insertion_site, processed_transcript, etc...). That's why I  
> restirct the
> output to type "gene" using regexp /gene/ on $f->type.
>
>>
>> I'm not sure how split locations are handled within Bio:DB::GFF,  
>> but do the
>> specific features have split locations?
>>
>> Chris
>>
> Not sure what you mean exactly but have a look at the following  
> script, it
> gives the location and the group id of the feature being reported:
>
> use Bio::DB::GFF;
> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>                               -dsn =>  
> 'dbi:mysql:database=dmel_43_new');
> my %dups;
> while (<>){
>     chomp;
>     my $id = $_;
>     my @feat = $db->get_feature_by_name($id);
>
>      for my $f (@feat){
>         if (exists $dups{$f->group} && $f->type =~/gene/){
>             print     "Calling >>>", $f->group, "\n";
>             print     "Chr: ", $f->refseq,
>                     " Strand: ", $f->strand,
>                     " Start: ", $f->start,
>                     " End: ", $f->end,
>                     "\n";
>             print "Offending >>>", $dups{$f->group}->group, "\n";
>             print     "Chr: ", $dups{$f->group}->refseq,
>                     " Strand: ", $dups{$f->group}->strand,
>                     " Start: ", $dups{$f->group}->start,
>                       " End: ", $dups{$f->group}->end;
>             print "\n\n";
>          } else {
>             $dups{$f->group} = $f;
>          }
>      }
> }
>
> Here is the output:
> Calling >>>FBgn0004179
> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
> Offending >>>FBgn0004179
> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>
> Calling >>>FBgn0025681
> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> Offending >>>FBgn0025681
> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>
> Calling >>>FBgn0025803
> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> Offending >>>FBgn0025803
> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>
> Calling >>>FBgn0000117
> Chr: X Strand: -1 Start: 1756796 End: 1747557
> Offending >>>FBgn0000117
> Chr: X Strand: -1 Start: 1757776 End: 1747182
>
> Calling >>>FBgn0005427
> Chr: X Strand: -1 Start: 136456 End: 125343
> Offending >>>FBgn0005427
> Chr: X Strand: -1 Start: 133199 End: 124949
>
> Calling >>>FBgn0000042
> Chr: X Strand: 1 Start: 5746100 End: 5750026
> Offending >>>FBgn0000042
> Chr: X Strand: 1 Start: 5746096 End: 5746106
>
> Calling >>>FBgn0004551
> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
> Offending >>>FBgn0004551
> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>
> Do you have any suggestions?? Is the procedure I am using to  
> retrieve  the
> genes right?
>
> Many thanks
>
> Marco
>
>
>
>>> Many thanks Scott,
>>>
>>> At the same time I got your email I was coming to the same  
>>> conclusion as
>>> you.
>>>
>>> Now I have a stranger problem in my hands... My goal is quite  
>>> simple, I
>>> try
>>> to get the sequence of the genes back from the Bio::DB::GFF database
>>> loaded
>>> on MySQL. The gene list is from a file with one gene id per per  
>>> line. When
>>> I
>>> run the following script:
>>>
>>>
>>>
>>> use Bio::DB::GFF;
>>> use Bio::SeqIO;
>>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
>>>                             -format => 'fasta');
>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>                               -dsn =>  
>>> 'dbi:mysql:database=dmel_43_new');
>>>
>>> while (<>){
>>>     chomp;
>>>     my $id = $_;
>>>     my @feats = $db->get_feature_by_name($id);
>>>     for my $f (@feats){
>>>         $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>>     }
>>> }
>>>
>>>
>>> I get more sequence back than the number of gene in my input file. I
>>> double
>>> check there. Some of the duplicated entries are the same, some  
>>> are not!
>>
>>
>> ...
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> ______________________________
> Marco Blanchette, Ph.D.
>
> mblanche at uclink.berkeley.edu
>
> Donald C. Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
>
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062
> -- 
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign






More information about the Bioperl-l mailing list