[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF

Marco Blanchette mblanche at berkeley.edu
Fri Aug 18 03:15:14 UTC 2006


Again, I will answer my own questions...

The get_features_by_alias() method will use the FBgn ids to retrieve a given
feature as in

use Bio::DB::SeqFeature::Store;
my @ids = ('FBgn0026620', 'FBgn0010772', 'FBgn0025879');
my $db = Bio::DB::SeqFeature::Store->new(-adaptor => 'DBI::mysql',
                                         -dsn  =>'dbi:mysql:dmel_43_SeqF');
for my $id (@ids){
    my @feats = $db->get_features_by_alias($id);
    for my $f (@feats){
        print $f->name, "\n";
    }
}

Sorry for the spam...



On 8/17/06 17:25, "Marco Blanchette" <mblanche at berkeley.edu> wrote:

> Lincoln, 
> 
> I can¹t seem to find how to fetch genes by their FBgn id (which seems to now
> be the official GadFly unique identifier for genes...). The
> get_feature_by_name method uses the real gene name not the FBgn id... I can
> see that the FBgn ids are store in the attributes table in the mySQL
> database with an attribute id of 7. I tried using
> Œget_features_by_attribute({parent_id => $FBgn})¹ without success.
> 
> What is the trick??
> 
> Also, there is a typo in the Bio::DB::SeqFeature::Store documentation.
> 
> ...
>          # ...by type
>          @features = $db->get_features_by_name('gene');
> ...
> 
> Should read  
> ...
>          # ...by type
>          @features = $db->get_features_by_type('gene');
> ...
> 
> Many thanks
> 
> Marco
> 
> 
> On 8/17/06 15:32, "Lincoln Stein" <lincoln.stein at gmail.com> wrote:
> 
>> Let me know how it works.
>> 
>> I also get a few of the warnings about the ortho:* features. They don't seem
>> to hurt anything so you can go ahead and use fast loading if you want. The
>> long-term fix is to sort the GFF3 files so that all features that share the
>> same ID occur next to each other.
>> 
>> Lincoln
>> 
>> On 8/17/06, Marco Blanchette <mblanche at berkeley.edu> wrote:
>>> I will answer my own question...
>>> 
>>> Yes, one can load the fasta file after having loaded the gff file by doing:
>>> 
>>> bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta
>>> 
>>> Marco
>>> 
>>> 
>>> On 8/17/06 11:20, "Marco Blanchette" < mblanche at berkeley.edu> wrote:
>>> 
>>>>> Lincoln, thanks for the precision. I just could not find any references to
>>>>> how to load the DNA (no where in bp_seqfeature_load.pl or in the
>>>>> Bio::DB::SeqFeature::Store it says how load the DNA sequences).
>>>>> 
>>>>> So right now the gff files were loaded in mysql using:
>>>>> /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff
>>>>> 
>>>>> I tried the --fast options but got a bunch of warning (see below).
>>>>> 
>>>>> The DNA file (a single fasta database file containing all chromosome
>>>>> sequences) was in a different location from the gff files and was not
>>>> loaded
>>>>> together with the gff files (the sequence table is empty in the database).
>>>>> 
>>>>> Can I load the DNA sequence after the gff files were loaded?
>>>>> 
>>>>> Many thanks
>>>>> 
>>>>> Marco
>>>>> 
>>>>> 
>>>>> -------------------- WARNING ---------------------
>>>>> MSG: ID=ortho:2825 has been used more than once, but it cannot be found in
>>>>> the database.
>>>>> This can happen if you have specified fast loading, but features sharing
>>>> the
>>>>> same ID
>>>>> are not contiguous in the GFF file. This will be loaded as a separate
>>>>> feature.
>>>>> Line 483681: "X .      orthologous_region      19477824        19478027
>>>>> .       +       .
>>>>> ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse"
>>>>> 
>>>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature
>>>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537
>>>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load
>>>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424
>>>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
>>>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342
>>>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
>>>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240
>>>>> STACK toplevel /usr/bin/bp_seqfeature_load.pl:81
>>>>> 
>>>>> 
>>>>> On 8/17/06 10:27, "Lincoln Stein" <lstein at cshl.edu> wrote:
>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> This message bounced because I tried to send it from my gmail account
>>>>> and so 
>>>>>>> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
>>>>>>> finds a file that contains DNA data, it simply loads it. There is no
>>>>> special
>>>>>>> command line switch. Also you can include the DNA in the GFF3 file.
>>>>>>> 
>>>>>>> Lincoln
>>>>>>> 
>>>>>>> ---------- Forwarded message ----------
>>>>>>> From: Lincoln Stein <lincoln.stein at gmail.com>
>>>>>>> Date: Aug 17, 2006 12:26 PM
>>>>>>> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
>>>>>>> To: Chris Fields <cjfields at uiuc.edu>
>>>>>>> Cc: Marco Blanchette < mblanche at berkeley.edu
>>>>> <mailto:mblanche at berkeley.edu> >, "bioperl-l at lists.open-bio.org"
>>>>>>> <Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com
>>>>>>> 
>>>>>>> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to
>>>>>>> load the GFF3-format Fly data? I think you're probably getting confused
>>>>> by 
>>>>>>> overlapping mRNA splice forms, an issue that won't occur with the full
>>>>>>> GFF3-formatted data.
>>>>>>> 
>>>>>>> 
>>>>>>> On 8/13/06, Chris Fields <cjfields at uiuc.edu  <mailto:cjfields at uiuc.edu>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Marco,
>>>>>>>>> 
>>>>>>>>> Did you figure out what the problem was?  I was curious; the issue
>>>>>>>>> you were having was rather odd.  I wanted to see if it was an issue
>>>>>>>>> with the GFF data or with the database itself.
>>>>>>>>> 
>>>>>>>>> Chris
>>>>>>>>> 
>>>>>>>>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
>>>>>>>>> 
>>>>>>>>>>> Chris, 
>>>>>>>>>>> 
>>>>>>>>>>>> Do you mean you get duplicates of sequences back, or that you get
>>>>>>>>>>>> more than
>>>>>>>>>>>> one chunk of the same sequence back?
>>>>>>>>>>> 
>>>>>>>>>>> I sometimes get duplicated sequences and sometimes overlapping
>>>>>>>>>>> regions (see
>>>>>>>>>>> bellow)
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Is it possible that each query using an ID could contain more than
>>>>>>>>>>>> one
>>>>>>>>>>>> feature?  That might explain it (you could check by testing the
>>>>>>>>>>>> size of the
>>>>>>>>>>>> array @feats).
>>>>>>>>>>> Most id return more than one features from various type
>>>>>>>>>>> ( point_mutation,
>>>>>>>>>>> insertion_site, processed_transcript, etc...). That's why I
>>>>>>>>>>> restirct the
>>>>>>>>>>> output to type "gene" using regexp /gene/ on $f->type.
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm not sure how split locations are handled within Bio:DB::GFF,
>>>>>>>>>>>> but do the
>>>>>>>>>>>> specific features have split locations?
>>>>>>>>>>>> 
>>>>>>>>>>>> Chris
>>>>>>>>>>>> 
>>>>>>>>>>> Not sure what you mean exactly but have a look at the following
>>>>>>>>>>> script, it
>>>>>>>>>>> gives the location and the group id of the feature being reported:
>>>>>>>>>>> 
>>>>>>>>>>> use Bio::DB::GFF;
>>>>>>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>>>>>>>                              -dsn =>
>>>>>>>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>>>>>>> my %dups;
>>>>>>>>>>> while (<>){
>>>>>>>>>>>     chomp;
>>>>>>>>>>>     my $id = $_;
>>>>>>>>>>>     my @feat = $db->get_feature_by_name($id);
>>>>>>>>>>> 
>>>>>>>>>>>      for my $f (@feat){
>>>>>>>>>>>        if (exists $dups{$f->group} && $f->type =~/gene/){
>>>>>>>>>>>            print     "Calling >>>", $f->group, "\n";
>>>>>>>>>>>            print     "Chr: ", $f->refseq,
>>>>>>>>>>>                    " Strand: ", $f->strand,
>>>>>>>>>>>                    " Start: ", $f->start,
>>>>>>>>>>>                    " End: ", $f->end,
>>>>>>>>>>>                    "\n";
>>>>>>>>>>>            print "Offending >>>", $dups{$f->group}->group, "\n";
>>>>>>>>>>>            print     "Chr: ", $dups{$f->group}->refseq,
>>>>>>>>>>>                    " Strand: ", $dups{$f->group}->strand,
>>>>>>>>>>>                    " Start: ", $dups{$f->group}->start,
>>>>>>>>>>>                      " End: ", $dups{$f->group}->end;
>>>>>>>>>>>             print "\n\n";
>>>>>>>>>>>          } else {
>>>>>>>>>>>             $dups{$f->group} = $f;
>>>>>>>>>>>          }
>>>>>>>>>>>      }
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> Here is the output:
>>>>>>>>>>> Calling >>>FBgn0004179
>>>>>>>>>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
>>>>>>>>>>> Offending >>>FBgn0004179
>>>>>>>>>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>>>>>>>>>>> 
>>>>>>>>>>> Calling >>>FBgn0025681
>>>>>>>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>>>>>>>> Offending >>>FBgn0025681
>>>>>>>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>>>>>>>> 
>>>>>>>>>>> Calling >>>FBgn0025803
>>>>>>>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>>>>>>>> Offending >>>FBgn0025803
>>>>>>>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>>>>>>>> 
>>>>>>>>>>> Calling >>>FBgn0000117
>>>>>>>>>>> Chr: X Strand: -1 Start: 1756796 End: 1747557
>>>>>>>>>>> Offending >>>FBgn0000117
>>>>>>>>>>> Chr: X Strand: -1 Start: 1757776 End: 1747182
>>>>>>>>>>> 
>>>>>>>>>>> Calling >>>FBgn0005427
>>>>>>>>>>> Chr: X Strand: -1 Start: 136456 End: 125343
>>>>>>>>>>> Offending >>>FBgn0005427
>>>>>>>>>>> Chr: X Strand: -1 Start: 133199 End: 124949
>>>>>>>>>>> 
>>>>>>>>>>> Calling >>>FBgn0000042
>>>>>>>>>>> Chr: X Strand: 1 Start: 5746100 End: 5750026
>>>>>>>>>>> Offending >>>FBgn0000042
>>>>>>>>>>> Chr: X Strand: 1 Start: 5746096 End: 5746106
>>>>>>>>>>> 
>>>>>>>>>>> Calling >>>FBgn0004551
>>>>>>>>>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
>>>>>>>>>>> Offending >>>FBgn0004551
>>>>>>>>>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>>>>>>>>>>> 
>>>>>>>>>>> Do you have any suggestions?? Is the procedure I am using to
>>>>>>>>>>> retrieve  the
>>>>>>>>>>> genes right?
>>>>>>>>>>> 
>>>>>>>>>>> Many thanks
>>>>>>>>>>> 
>>>>>>>>>>> Marco
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> Many thanks Scott,
>>>>>>>>>>>> 
>>>>>>>>>>>> At the same time I got your email I was coming to the same
>>>>>>>>>>>> conclusion as
>>>>>>>>>>>> you.
>>>>>>>>>>>> 
>>>>>>>>>>>> Now I have a stranger problem in my hands... My goal is quite
>>>>>>>>>>>> simple, I
>>>>>>>>>>>> try
>>>>>>>>>>>> to get the sequence of the genes back from the Bio::DB::GFF
>>>>>>>>> database 
>>>>>>>>>>>> loaded
>>>>>>>>>>>> on MySQL. The gene list is from a file with one gene id per per
>>>>>>>>>>>> line. When
>>>>>>>>>>>> I
>>>>>>>>>>>> run the following script:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> use Bio::DB::GFF;
>>>>>>>>>>>> use Bio::SeqIO;
>>>>>>>>>>>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
>>>>>>>>>>>>                            -format => 'fasta');
>>>>>>>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>>>>>>>>                              -dsn =>
>>>>>>>>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>>>>>>>> 
>>>>>>>>>>>> while (<>){
>>>>>>>>>>>>     chomp;
>>>>>>>>>>>>     my $id = $_;
>>>>>>>>>>>>     my @feats = $db->get_feature_by_name($id);
>>>>>>>>>>>>     for my $f (@feats){
>>>>>>>>>>>>        $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>>>>>>>>>>>     }
>>>>>>>>>>>> }
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I get more sequence back than the number of gene in my input
>>>>>>>>> file. I 
>>>>>>>>>>>> double
>>>>>>>>>>>> check there. Some of the duplicated entries are the same, some
>>>>>>>>>>>> are not!
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> ... 
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioperl-l mailing list
>>>>>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>> <mailto:Bioperl-l at lists.open-bio.org>
>>>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>>>>>> 
>>>>>>>>>>> ______________________________
>>>>>>>>>>> Marco Blanchette, Ph.D.
>>>>>>>>>>> 
>>>>>>>>>>> mblanche at uclink.berkeley.edu
>>>>>>>>>>> 
>>>>>>>>>>> Donald C. Rio's lab
>>>>>>>>>>> Department of Molecular and Cell Biology
>>>>>>>>>>> 16 Barker Hall
>>>>>>>>>>> University of California
>>>>>>>>>>> Berkeley, CA 94720-3204
>>>>>>>>>>> 
>>>>>>>>>>> Tel: (510) 642-1084
>>>>>>>>>>> Cell: (510) 847-0996
>>>>>>>>>>> Fax: (510) 642-6062
>>>>>>>>>>> --
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioperl-l mailing list
>>>>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>>>> 
>>>>>>>>> Christopher Fields
>>>>>>>>> Postdoctoral Researcher
>>>>>>>>> Lab of Dr. Robert Switzer
>>>>>>>>> Dept of Biochemistry
>>>>>>>>> University of Illinois Urbana-Champaign
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioperl-l mailing list
>>>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> ______________________________
>>>>> Marco Blanchette, Ph.D.
>>>>> 
>>>>> mblanche at uclink.berkeley.edu
>>>>> 
>>>>> Donald C. Rio's lab
>>>>> Department of Molecular and Cell Biology
>>>>> 16 Barker Hall
>>>>> University of California
>>>>> Berkeley, CA 94720-3204
>>>>> 
>>>>> Tel: (510) 642-1084
>>>>> Cell: (510) 847-0996
>>>>> Fax: (510) 642-6062
>>> 
>>> ______________________________
>>> Marco Blanchette, Ph.D.
>>> 
>>> mblanche at uclink.berkeley.edu
>>> 
>>> Donald C. Rio's lab
>>> Department of Molecular and Cell Biology
>>> 16 Barker Hall
>>> University of California
>>> Berkeley, CA 94720-3204
>>> 
>>> Tel: (510) 642-1084
>>> Cell: (510) 847-0996
>>> Fax: (510) 642-6062
>>> --
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
>> 
> 
> 
> ______________________________
> Marco Blanchette, Ph.D.
> 
> mblanche at uclink.berkeley.edu
> 
> Donald C. Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
> 
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062

______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
-- 







More information about the Bioperl-l mailing list