[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF

Fri Aug 18 19:52:33 UTC 2006

Many thanks Lincoln,

Bio::DB::SeqFeature::Store seems to work fine with me as in:

use Bio::DB::SeqFeature::Store;
my $db = Bio::DB::SeqFeature::Store->new(-adaptor => 'DBI::mysql',
                                        -dsn => 'dbi:mysql:dmel_43_SeqF');
while (<>){
    chomp;
    my $id = $_;
    my @feats = $db->get_features_by_alias($id);
    for my $f (@feats){
        print "$id -> ", $f->name, "\n" if $f->type eq 'gene';
    }
}

get a list of FBgn ids and spits out the gene name. The good thing now is
that I am getting the same number of output gene as the number of genes in
my starting list (As oposed to when I was using Bio::DB::GFF).

My only problem is that I had to guess that the method type() and
attributes() were available. My understanding by now is that
get_features_by_alias() return a Bio::DB::Feature, however, I couldn't find
any documentation on that object (it does not return a Bio::SeqFeatureI as I
originally thought).

Is the Bio::DB::Feature essentially a clone of the Bio::DB::GFF::Feature?

Many thanks again, 

Marco

On 8/17/06 15:32, "Lincoln Stein" <lincoln.stein at gmail.com> wrote:

> Let me know how it works.
> 
> I also get a few of the warnings about the ortho:* features. They don't seem
> to hurt anything so you can go ahead and use fast loading if you want. The
> long-term fix is to sort the GFF3 files so that all features that share the
> same ID occur next to each other.
> 
> Lincoln
> 
> On 8/17/06, Marco Blanchette <mblanche at berkeley.edu> wrote:
>> 
>> I will answer my own question...
>> 
>> Yes, one can load the fasta file after having loaded the gff file by
>> doing:
>> 
>> bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta
>> 
>> Marco
>> 
>> 
>> On 8/17/06 11:20, "Marco Blanchette" <mblanche at berkeley.edu> wrote:
>> 
>>> Lincoln, thanks for the precision. I just could not find any references
>> to
>>> how to load the DNA (no where in bp_seqfeature_load.pl or in the
>>> Bio::DB::SeqFeature::Store it says how load the DNA sequences).
>>> 
>>> So right now the gff files were loaded in mysql using:
>>> /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff
>>> 
>>> I tried the --fast options but got a bunch of warning (see below).
>>> 
>>> The DNA file (a single fasta database file containing all chromosome
>>> sequences) was in a different location from the gff files and was not
>> loaded
>>> together with the gff files (the sequence table is empty in the
>> database).
>>> 
>>> Can I load the DNA sequence after the gff files were loaded?
>>> 
>>> Many thanks
>>> 
>>> Marco
>>> 
>>> 
>>> -------------------- WARNING ---------------------
>>> MSG: ID=ortho:2825 has been used more than once, but it cannot be found
>> in
>>> the database.
>>> This can happen if you have specified fast loading, but features sharing
>> the
>>> same ID
>>> are not contiguous in the GFF file. This will be loaded as a separate
>>> feature.
>>> Line 483681: "X .       orthologous_region      19477824        19478027
>>> .       +       .
>>> ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse"
>>> 
>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature
>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537
>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load
>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424
>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342
>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240
>>> STACK toplevel /usr/bin/bp_seqfeature_load.pl:81
>>> 
>>> 
>>> On 8/17/06 10:27, "Lincoln Stein" <lstein at cshl.edu> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> This message bounced because I tried to send it from my gmail account
>> and so
>>>> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
>>>> finds a file that contains DNA data, it simply loads it. There is no
>> special
>>>> command line switch. Also you can include the DNA in the GFF3 file.
>>>> 
>>>> Lincoln
>>>> 
>>>> ---------- Forwarded message ----------
>>>> From: Lincoln Stein <lincoln.stein at gmail.com>
>>>> Date: Aug 17, 2006 12:26 PM
>>>> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
>>>> To: Chris Fields <cjfields at uiuc.edu>
>>>> Cc: Marco Blanchette <mblanche at berkeley.edu>, "
>> bioperl-l at lists.open-bio.org"
>>>> <Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com
>>>> 
>>>> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class
>> to
>>>> load the GFF3-format Fly data? I think you're probably getting confused
>> by
>>>> overlapping mRNA splice forms, an issue that won't occur with the full
>>>> GFF3-formatted data.
>>>> 
>>>> 
>>>> On 8/13/06, Chris Fields <cjfields at uiuc.edu> wrote:
>>>>> 
>>>>> Marco,
>>>>> 
>>>>> Did you figure out what the problem was?  I was curious; the issue
>>>>> you were having was rather odd.  I wanted to see if it was an issue
>>>>> with the GFF data or with the database itself.
>>>>> 
>>>>> Chris
>>>>> 
>>>>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
>>>>> 
>>>>>> Chris,
>>>>>> 
>>>>>>> Do you mean you get duplicates of sequences back, or that you get
>>>>>>> more than
>>>>>>> one chunk of the same sequence back?
>>>>>> 
>>>>>> I sometimes get duplicated sequences and sometimes overlapping
>>>>>> regions (see
>>>>>> bellow)
>>>>>> 
>>>>>>> 
>>>>>>> Is it possible that each query using an ID could contain more than
>>>>>>> one
>>>>>>> feature?  That might explain it (you could check by testing the
>>>>>>> size of the
>>>>>>> array @feats).
>>>>>> Most id return more than one features from various type
>>>>>> ( point_mutation,
>>>>>> insertion_site, processed_transcript, etc...). That's why I
>>>>>> restirct the
>>>>>> output to type "gene" using regexp /gene/ on $f->type.
>>>>>> 
>>>>>>> 
>>>>>>> I'm not sure how split locations are handled within Bio:DB::GFF,
>>>>>>> but do the
>>>>>>> specific features have split locations?
>>>>>>> 
>>>>>>> Chris
>>>>>>> 
>>>>>> Not sure what you mean exactly but have a look at the following
>>>>>> script, it
>>>>>> gives the location and the group id of the feature being reported:
>>>>>> 
>>>>>> use Bio::DB::GFF;
>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>>                               -dsn =>
>>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>> my %dups;
>>>>>> while (<>){
>>>>>>     chomp;
>>>>>>     my $id = $_;
>>>>>>     my @feat = $db->get_feature_by_name($id);
>>>>>> 
>>>>>>      for my $f (@feat){
>>>>>>         if (exists $dups{$f->group} && $f->type =~/gene/){
>>>>>>             print     "Calling >>>", $f->group, "\n";
>>>>>>             print     "Chr: ", $f->refseq,
>>>>>>                     " Strand: ", $f->strand,
>>>>>>                     " Start: ", $f->start,
>>>>>>                     " End: ", $f->end,
>>>>>>                     "\n";
>>>>>>             print "Offending >>>", $dups{$f->group}->group, "\n";
>>>>>>             print     "Chr: ", $dups{$f->group}->refseq,
>>>>>>                     " Strand: ", $dups{$f->group}->strand,
>>>>>>                     " Start: ", $dups{$f->group}->start,
>>>>>>                       " End: ", $dups{$f->group}->end;
>>>>>>             print "\n\n";
>>>>>>          } else {
>>>>>>             $dups{$f->group} = $f;
>>>>>>          }
>>>>>>      }
>>>>>> }
>>>>>> 
>>>>>> Here is the output:
>>>>>> Calling >>>FBgn0004179
>>>>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
>>>>>> Offending >>>FBgn0004179
>>>>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>>>>>> 
>>>>>> Calling >>>FBgn0025681
>>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>>> Offending >>>FBgn0025681
>>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>>> 
>>>>>> Calling >>>FBgn0025803
>>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>>> Offending >>>FBgn0025803
>>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>>> 
>>>>>> Calling >>>FBgn0000117
>>>>>> Chr: X Strand: -1 Start: 1756796 End: 1747557
>>>>>> Offending >>>FBgn0000117
>>>>>> Chr: X Strand: -1 Start: 1757776 End: 1747182
>>>>>> 
>>>>>> Calling >>>FBgn0005427
>>>>>> Chr: X Strand: -1 Start: 136456 End: 125343
>>>>>> Offending >>>FBgn0005427
>>>>>> Chr: X Strand: -1 Start: 133199 End: 124949
>>>>>> 
>>>>>> Calling >>>FBgn0000042
>>>>>> Chr: X Strand: 1 Start: 5746100 End: 5750026
>>>>>> Offending >>>FBgn0000042
>>>>>> Chr: X Strand: 1 Start: 5746096 End: 5746106
>>>>>> 
>>>>>> Calling >>>FBgn0004551
>>>>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
>>>>>> Offending >>>FBgn0004551
>>>>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>>>>>> 
>>>>>> Do you have any suggestions?? Is the procedure I am using to
>>>>>> retrieve  the
>>>>>> genes right?
>>>>>> 
>>>>>> Many thanks
>>>>>> 
>>>>>> Marco
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> Many thanks Scott,
>>>>>>>> 
>>>>>>>> At the same time I got your email I was coming to the same
>>>>>>>> conclusion as
>>>>>>>> you.
>>>>>>>> 
>>>>>>>> Now I have a stranger problem in my hands... My goal is quite
>>>>>>>> simple, I
>>>>>>>> try
>>>>>>>> to get the sequence of the genes back from the Bio::DB::GFF
>> database
>>>>>>>> loaded
>>>>>>>> on MySQL. The gene list is from a file with one gene id per per
>>>>>>>> line. When
>>>>>>>> I
>>>>>>>> run the following script:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> use Bio::DB::GFF;
>>>>>>>> use Bio::SeqIO;
>>>>>>>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
>>>>>>>>                             -format => 'fasta');
>>>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>>>>                               -dsn =>
>>>>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>>>> 
>>>>>>>> while (<>){
>>>>>>>>     chomp;
>>>>>>>>     my $id = $_;
>>>>>>>>     my @feats = $db->get_feature_by_name($id);
>>>>>>>>     for my $f (@feats){
>>>>>>>>         $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>>>>>>>     }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I get more sequence back than the number of gene in my input file.
>> I
>>>>>>>> double
>>>>>>>> check there. Some of the duplicated entries are the same, some
>>>>>>>> are not!
>>>>>>> 
>>>>>>> 
>>>>>>> ...
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Bioperl-l mailing list
>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>> 
>>>>>> ______________________________
>>>>>> Marco Blanchette, Ph.D.
>>>>>> 
>>>>>> mblanche at uclink.berkeley.edu
>>>>>> 
>>>>>> Donald C. Rio's lab
>>>>>> Department of Molecular and Cell Biology
>>>>>> 16 Barker Hall
>>>>>> University of California
>>>>>> Berkeley, CA 94720-3204
>>>>>> 
>>>>>> Tel: (510) 642-1084
>>>>>> Cell: (510) 847-0996
>>>>>> Fax: (510) 642-6062
>>>>>> --
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>> 
>>>>> Christopher Fields
>>>>> Postdoctoral Researcher
>>>>> Lab of Dr. Robert Switzer
>>>>> Dept of Biochemistry
>>>>> University of Illinois Urbana-Champaign
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>> 
>>>> 
>>>> 
>>> 
>>> ______________________________
>>> Marco Blanchette, Ph.D.
>>> 
>>> mblanche at uclink.berkeley.edu
>>> 
>>> Donald C. Rio's lab
>>> Department of Molecular and Cell Biology
>>> 16 Barker Hall
>>> University of California
>>> Berkeley, CA 94720-3204
>>> 
>>> Tel: (510) 642-1084
>>> Cell: (510) 847-0996
>>> Fax: (510) 642-6062
>> 
>> ______________________________
>> Marco Blanchette, Ph.D.
>> 
>> mblanche at uclink.berkeley.edu
>> 
>> Donald C. Rio's lab
>> Department of Molecular and Cell Biology
>> 16 Barker Hall
>> University of California
>> Berkeley, CA 94720-3204
>> 
>> Tel: (510) 642-1084
>> Cell: (510) 847-0996
>> Fax: (510) 642-6062
>> --
>> 
>> 
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
> 
> 

______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--