[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF

Fri Aug 18 00:25:24 UTC 2006

Lincoln, 

I can¹t seem to find how to fetch genes by their FBgn id (which seems to now
be the official GadFly unique identifier for genes...). The
get_feature_by_name method uses the real gene name not the FBgn id... I can
see that the FBgn ids are store in the attributes table in the mySQL
database with an attribute id of 7. I tried using
Œget_features_by_attribute({parent_id => $FBgn})¹ without success.

What is the trick??

Also, there is a typo in the Bio::DB::SeqFeature::Store documentation.

...
         # ...by type
         @features = $db->get_features_by_name('gene');
...

Should read  
...
         # ...by type
         @features = $db->get_features_by_type('gene');
...

Many thanks

Marco

On 8/17/06 15:32, "Lincoln Stein" <lincoln.stein at gmail.com> wrote:

> Let me know how it works.
> 
> I also get a few of the warnings about the ortho:* features. They don't seem
> to hurt anything so you can go ahead and use fast loading if you want. The
> long-term fix is to sort the GFF3 files so that all features that share the
> same ID occur next to each other.
> 
> Lincoln
> 
> On 8/17/06, Marco Blanchette <mblanche at berkeley.edu> wrote:
>> I will answer my own question...
>> 
>> Yes, one can load the fasta file after having loaded the gff file by doing:
>> 
>> bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta
>> 
>> Marco
>> 
>> 
>> On 8/17/06 11:20, "Marco Blanchette" < mblanche at berkeley.edu> wrote:
>> 
>>> > Lincoln, thanks for the precision. I just could not find any references to
>>> > how to load the DNA (no where in bp_seqfeature_load.pl or in the
>>> > Bio::DB::SeqFeature::Store it says how load the DNA sequences).
>>> >
>>> > So right now the gff files were loaded in mysql using:
>>> > /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff
>>> >
>>> > I tried the --fast options but got a bunch of warning (see below).
>>> >
>>> > The DNA file (a single fasta database file containing all chromosome
>>> > sequences) was in a different location from the gff files and was not
>>> loaded
>>> > together with the gff files (the sequence table is empty in the database).
>>> >
>>> > Can I load the DNA sequence after the gff files were loaded?
>>> >
>>> > Many thanks
>>> >
>>> > Marco
>>> >
>>> >
>>> > -------------------- WARNING ---------------------
>>> > MSG: ID=ortho:2825 has been used more than once, but it cannot be found in
>>> > the database.
>>> > This can happen if you have specified fast loading, but features sharing
>>> the
>>> > same ID
>>> > are not contiguous in the GFF file. This will be loaded as a separate
>>> > feature.
>>> > Line 483681: "X .      orthologous_region      19477824        19478027
>>> > .       +       .
>>> > ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse"
>>> >
>>> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature
>>> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537
>>> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load
>>> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424
>>> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
>>> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342
>>> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
>>> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240
>>> > STACK toplevel /usr/bin/bp_seqfeature_load.pl:81
>>> >
>>> >
>>> > On 8/17/06 10:27, "Lincoln Stein" <lstein at cshl.edu> wrote:
>>> >
>>>> >> Hi,
>>>> >>
>>>> >> This message bounced because I tried to send it from my gmail account
>>>> and so 
>>>> >> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
>>>> >> finds a file that contains DNA data, it simply loads it. There is no
>>>> special
>>>> >> command line switch. Also you can include the DNA in the GFF3 file.
>>>> >>
>>>> >> Lincoln
>>>> >>
>>>> >> ---------- Forwarded message ----------
>>>> >> From: Lincoln Stein <lincoln.stein at gmail.com>
>>>> >> Date: Aug 17, 2006 12:26 PM
>>>> >> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
>>>> >> To: Chris Fields <cjfields at uiuc.edu>
>>>> >> Cc: Marco Blanchette < mblanche at berkeley.edu
>>>> <mailto:mblanche at berkeley.edu> >, "bioperl-l at lists.open-bio.org"
>>>> >> <Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com
>>>> >>
>>>> >> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to
>>>> >> load the GFF3-format Fly data? I think you're probably getting confused
>>>> by 
>>>> >> overlapping mRNA splice forms, an issue that won't occur with the full
>>>> >> GFF3-formatted data.
>>>> >>
>>>> >>
>>>> >> On 8/13/06, Chris Fields <cjfields at uiuc.edu  <mailto:cjfields at uiuc.edu>
>>>> > wrote:
>>>>> >>>
>>>>> >>> Marco,
>>>>> >>>
>>>>> >>> Did you figure out what the problem was?  I was curious; the issue
>>>>> >>> you were having was rather odd.  I wanted to see if it was an issue
>>>>> >>> with the GFF data or with the database itself.
>>>>> >>>
>>>>> >>> Chris
>>>>> >>>
>>>>> >>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
>>>>> >>>
>>>>>> >>>> Chris, 
>>>>>> >>>>
>>>>>>> >>>>> Do you mean you get duplicates of sequences back, or that you get
>>>>>>> >>>>> more than
>>>>>>> >>>>> one chunk of the same sequence back?
>>>>>> >>>> 
>>>>>> >>>> I sometimes get duplicated sequences and sometimes overlapping
>>>>>> >>>> regions (see
>>>>>> >>>> bellow)
>>>>>> >>>>
>>>>>>> >>>>>
>>>>>>> >>>>> Is it possible that each query using an ID could contain more than
>>>>>>> >>>>> one
>>>>>>> >>>>> feature?  That might explain it (you could check by testing the
>>>>>>> >>>>> size of the
>>>>>>> >>>>> array @feats).
>>>>>> >>>> Most id return more than one features from various type
>>>>>> >>>> ( point_mutation,
>>>>>> >>>> insertion_site, processed_transcript, etc...). That's why I
>>>>>> >>>> restirct the
>>>>>> >>>> output to type "gene" using regexp /gene/ on $f->type.
>>>>>> >>>>
>>>>>>> >>>>>
>>>>>>> >>>>> I'm not sure how split locations are handled within Bio:DB::GFF,
>>>>>>> >>>>> but do the
>>>>>>> >>>>> specific features have split locations?
>>>>>>> >>>>>
>>>>>>> >>>>> Chris
>>>>>>> >>>>>
>>>>>> >>>> Not sure what you mean exactly but have a look at the following
>>>>>> >>>> script, it
>>>>>> >>>> gives the location and the group id of the feature being reported:
>>>>>> >>>>
>>>>>> >>>> use Bio::DB::GFF;
>>>>>> >>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>> >>>>                              -dsn =>
>>>>>> >>>> 'dbi:mysql:database=dmel_43_new');
>>>>>> >>>> my %dups;
>>>>>> >>>> while (<>){
>>>>>> >>>>     chomp;
>>>>>> >>>>     my $id = $_;
>>>>>> >>>>     my @feat = $db->get_feature_by_name($id);
>>>>>> >>>>
>>>>>> >>>>      for my $f (@feat){
>>>>>> >>>>        if (exists $dups{$f->group} && $f->type =~/gene/){
>>>>>> >>>>            print     "Calling >>>", $f->group, "\n";
>>>>>> >>>>            print     "Chr: ", $f->refseq,
>>>>>> >>>>                    " Strand: ", $f->strand,
>>>>>> >>>>                    " Start: ", $f->start,
>>>>>> >>>>                    " End: ", $f->end,
>>>>>> >>>>                    "\n";
>>>>>> >>>>            print "Offending >>>", $dups{$f->group}->group, "\n";
>>>>>> >>>>            print     "Chr: ", $dups{$f->group}->refseq,
>>>>>> >>>>                    " Strand: ", $dups{$f->group}->strand,
>>>>>> >>>>                    " Start: ", $dups{$f->group}->start,
>>>>>> >>>>                      " End: ", $dups{$f->group}->end;
>>>>>> >>>>             print "\n\n";
>>>>>> >>>>          } else {
>>>>>> >>>>             $dups{$f->group} = $f;
>>>>>> >>>>          }
>>>>>> >>>>      }
>>>>>> >>>> }
>>>>>> >>>>
>>>>>> >>>> Here is the output:
>>>>>> >>>> Calling >>>FBgn0004179
>>>>>> >>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
>>>>>> >>>> Offending >>>FBgn0004179
>>>>>> >>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>>>>>> >>>>
>>>>>> >>>> Calling >>>FBgn0025681
>>>>>> >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>>> >>>> Offending >>>FBgn0025681
>>>>>> >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>>> >>>>
>>>>>> >>>> Calling >>>FBgn0025803
>>>>>> >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>>> >>>> Offending >>>FBgn0025803
>>>>>> >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>>> >>>>
>>>>>> >>>> Calling >>>FBgn0000117
>>>>>> >>>> Chr: X Strand: -1 Start: 1756796 End: 1747557
>>>>>> >>>> Offending >>>FBgn0000117
>>>>>> >>>> Chr: X Strand: -1 Start: 1757776 End: 1747182
>>>>>> >>>>
>>>>>> >>>> Calling >>>FBgn0005427
>>>>>> >>>> Chr: X Strand: -1 Start: 136456 End: 125343
>>>>>> >>>> Offending >>>FBgn0005427
>>>>>> >>>> Chr: X Strand: -1 Start: 133199 End: 124949
>>>>>> >>>>
>>>>>> >>>> Calling >>>FBgn0000042
>>>>>> >>>> Chr: X Strand: 1 Start: 5746100 End: 5750026
>>>>>> >>>> Offending >>>FBgn0000042
>>>>>> >>>> Chr: X Strand: 1 Start: 5746096 End: 5746106
>>>>>> >>>>
>>>>>> >>>> Calling >>>FBgn0004551
>>>>>> >>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
>>>>>> >>>> Offending >>>FBgn0004551
>>>>>> >>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>>>>>> >>>>
>>>>>> >>>> Do you have any suggestions?? Is the procedure I am using to
>>>>>> >>>> retrieve  the
>>>>>> >>>> genes right?
>>>>>> >>>>
>>>>>> >>>> Many thanks
>>>>>> >>>>
>>>>>> >>>> Marco
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> 
>>>>>>>> >>>>>> Many thanks Scott,
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> At the same time I got your email I was coming to the same
>>>>>>>> >>>>>> conclusion as
>>>>>>>> >>>>>> you.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Now I have a stranger problem in my hands... My goal is quite
>>>>>>>> >>>>>> simple, I
>>>>>>>> >>>>>> try
>>>>>>>> >>>>>> to get the sequence of the genes back from the Bio::DB::GFF
>>>>>>>> database 
>>>>>>>> >>>>>> loaded
>>>>>>>> >>>>>> on MySQL. The gene list is from a file with one gene id per per
>>>>>>>> >>>>>> line. When
>>>>>>>> >>>>>> I
>>>>>>>> >>>>>> run the following script:
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> use Bio::DB::GFF;
>>>>>>>> >>>>>> use Bio::SeqIO;
>>>>>>>> >>>>>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
>>>>>>>> >>>>>>                            -format => 'fasta');
>>>>>>>> >>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>>>> >>>>>>                              -dsn =>
>>>>>>>> >>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> while (<>){
>>>>>>>> >>>>>>     chomp;
>>>>>>>> >>>>>>     my $id = $_;
>>>>>>>> >>>>>>     my @feats = $db->get_feature_by_name($id);
>>>>>>>> >>>>>>     for my $f (@feats){
>>>>>>>> >>>>>>        $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>>>>>>> >>>>>>     }
>>>>>>>> >>>>>> }
>>>>>>>> >>>>>>
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> I get more sequence back than the number of gene in my input
>>>>>>>> file. I 
>>>>>>>> >>>>>> double
>>>>>>>> >>>>>> check there. Some of the duplicated entries are the same, some
>>>>>>>> >>>>>> are not!
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> ... 
>>>>>>> >>>>>
>>>>>>> >>>>> _______________________________________________
>>>>>>> >>>>> Bioperl-l mailing list
>>>>>>> >>>>> Bioperl-l at lists.open-bio.org
>>>>>>> <mailto:Bioperl-l at lists.open-bio.org>
>>>>>>> >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>> >>>>
>>>>>> >>>> ______________________________
>>>>>> >>>> Marco Blanchette, Ph.D.
>>>>>> >>>>
>>>>>> >>>> mblanche at uclink.berkeley.edu
>>>>>> >>>>
>>>>>> >>>> Donald C. Rio's lab
>>>>>> >>>> Department of Molecular and Cell Biology
>>>>>> >>>> 16 Barker Hall
>>>>>> >>>> University of California
>>>>>> >>>> Berkeley, CA 94720-3204
>>>>>> >>>>
>>>>>> >>>> Tel: (510) 642-1084
>>>>>> >>>> Cell: (510) 847-0996
>>>>>> >>>> Fax: (510) 642-6062
>>>>>> >>>> --
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> _______________________________________________
>>>>>> >>>> Bioperl-l mailing list
>>>>>> >>>> Bioperl-l at lists.open-bio.org
>>>>>> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>> >>>
>>>>> >>> Christopher Fields
>>>>> >>> Postdoctoral Researcher
>>>>> >>> Lab of Dr. Robert Switzer
>>>>> >>> Dept of Biochemistry
>>>>> >>> University of Illinois Urbana-Champaign
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> _______________________________________________
>>>>> >>> Bioperl-l mailing list
>>>>> >>> Bioperl-l at lists.open-bio.org
>>>>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>> >>> 
>>>> >>
>>>> >>
>>> >
>>> > ______________________________
>>> > Marco Blanchette, Ph.D.
>>> >
>>> > mblanche at uclink.berkeley.edu
>>> >
>>> > Donald C. Rio's lab
>>> > Department of Molecular and Cell Biology
>>> > 16 Barker Hall
>>> > University of California
>>> > Berkeley, CA 94720-3204
>>> >
>>> > Tel: (510) 642-1084
>>> > Cell: (510) 847-0996
>>> > Fax: (510) 642-6062
>> 
>> ______________________________
>> Marco Blanchette, Ph.D.
>> 
>> mblanche at uclink.berkeley.edu
>> 
>> Donald C. Rio's lab
>> Department of Molecular and Cell Biology
>> 16 Barker Hall
>> University of California
>> Berkeley, CA 94720-3204
>> 
>> Tel: (510) 642-1084
>> Cell: (510) 847-0996
>> Fax: (510) 642-6062
>> --
>> 
>> 
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 

______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--