[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF
Marco Blanchette
mblanche at berkeley.edu
Fri Aug 18 19:52:33 UTC 2006
Many thanks Lincoln,
Bio::DB::SeqFeature::Store seems to work fine with me as in:
use Bio::DB::SeqFeature::Store;
my $db = Bio::DB::SeqFeature::Store->new(-adaptor => 'DBI::mysql',
-dsn => 'dbi:mysql:dmel_43_SeqF');
while (<>){
chomp;
my $id = $_;
my @feats = $db->get_features_by_alias($id);
for my $f (@feats){
print "$id -> ", $f->name, "\n" if $f->type eq 'gene';
}
}
get a list of FBgn ids and spits out the gene name. The good thing now is
that I am getting the same number of output gene as the number of genes in
my starting list (As oposed to when I was using Bio::DB::GFF).
My only problem is that I had to guess that the method type() and
attributes() were available. My understanding by now is that
get_features_by_alias() return a Bio::DB::Feature, however, I couldn't find
any documentation on that object (it does not return a Bio::SeqFeatureI as I
originally thought).
Is the Bio::DB::Feature essentially a clone of the Bio::DB::GFF::Feature?
Many thanks again,
Marco
On 8/17/06 15:32, "Lincoln Stein" <lincoln.stein at gmail.com> wrote:
> Let me know how it works.
>
> I also get a few of the warnings about the ortho:* features. They don't seem
> to hurt anything so you can go ahead and use fast loading if you want. The
> long-term fix is to sort the GFF3 files so that all features that share the
> same ID occur next to each other.
>
> Lincoln
>
> On 8/17/06, Marco Blanchette <mblanche at berkeley.edu> wrote:
>>
>> I will answer my own question...
>>
>> Yes, one can load the fasta file after having loaded the gff file by
>> doing:
>>
>> bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta
>>
>> Marco
>>
>>
>> On 8/17/06 11:20, "Marco Blanchette" <mblanche at berkeley.edu> wrote:
>>
>>> Lincoln, thanks for the precision. I just could not find any references
>> to
>>> how to load the DNA (no where in bp_seqfeature_load.pl or in the
>>> Bio::DB::SeqFeature::Store it says how load the DNA sequences).
>>>
>>> So right now the gff files were loaded in mysql using:
>>> /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff
>>>
>>> I tried the --fast options but got a bunch of warning (see below).
>>>
>>> The DNA file (a single fasta database file containing all chromosome
>>> sequences) was in a different location from the gff files and was not
>> loaded
>>> together with the gff files (the sequence table is empty in the
>> database).
>>>
>>> Can I load the DNA sequence after the gff files were loaded?
>>>
>>> Many thanks
>>>
>>> Marco
>>>
>>>
>>> -------------------- WARNING ---------------------
>>> MSG: ID=ortho:2825 has been used more than once, but it cannot be found
>> in
>>> the database.
>>> This can happen if you have specified fast loading, but features sharing
>> the
>>> same ID
>>> are not contiguous in the GFF file. This will be loaded as a separate
>>> feature.
>>> Line 483681: "X . orthologous_region 19477824 19478027
>>> . + .
>>> ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse"
>>>
>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature
>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537
>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load
>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424
>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342
>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240
>>> STACK toplevel /usr/bin/bp_seqfeature_load.pl:81
>>>
>>>
>>> On 8/17/06 10:27, "Lincoln Stein" <lstein at cshl.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> This message bounced because I tried to send it from my gmail account
>> and so
>>>> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
>>>> finds a file that contains DNA data, it simply loads it. There is no
>> special
>>>> command line switch. Also you can include the DNA in the GFF3 file.
>>>>
>>>> Lincoln
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Lincoln Stein <lincoln.stein at gmail.com>
>>>> Date: Aug 17, 2006 12:26 PM
>>>> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
>>>> To: Chris Fields <cjfields at uiuc.edu>
>>>> Cc: Marco Blanchette <mblanche at berkeley.edu>, "
>> bioperl-l at lists.open-bio.org"
>>>> <Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com
>>>>
>>>> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class
>> to
>>>> load the GFF3-format Fly data? I think you're probably getting confused
>> by
>>>> overlapping mRNA splice forms, an issue that won't occur with the full
>>>> GFF3-formatted data.
>>>>
>>>>
>>>> On 8/13/06, Chris Fields <cjfields at uiuc.edu> wrote:
>>>>>
>>>>> Marco,
>>>>>
>>>>> Did you figure out what the problem was? I was curious; the issue
>>>>> you were having was rather odd. I wanted to see if it was an issue
>>>>> with the GFF data or with the database itself.
>>>>>
>>>>> Chris
>>>>>
>>>>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
>>>>>
>>>>>> Chris,
>>>>>>
>>>>>>> Do you mean you get duplicates of sequences back, or that you get
>>>>>>> more than
>>>>>>> one chunk of the same sequence back?
>>>>>>
>>>>>> I sometimes get duplicated sequences and sometimes overlapping
>>>>>> regions (see
>>>>>> bellow)
>>>>>>
>>>>>>>
>>>>>>> Is it possible that each query using an ID could contain more than
>>>>>>> one
>>>>>>> feature? That might explain it (you could check by testing the
>>>>>>> size of the
>>>>>>> array @feats).
>>>>>> Most id return more than one features from various type
>>>>>> ( point_mutation,
>>>>>> insertion_site, processed_transcript, etc...). That's why I
>>>>>> restirct the
>>>>>> output to type "gene" using regexp /gene/ on $f->type.
>>>>>>
>>>>>>>
>>>>>>> I'm not sure how split locations are handled within Bio:DB::GFF,
>>>>>>> but do the
>>>>>>> specific features have split locations?
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>> Not sure what you mean exactly but have a look at the following
>>>>>> script, it
>>>>>> gives the location and the group id of the feature being reported:
>>>>>>
>>>>>> use Bio::DB::GFF;
>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>> -dsn =>
>>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>> my %dups;
>>>>>> while (<>){
>>>>>> chomp;
>>>>>> my $id = $_;
>>>>>> my @feat = $db->get_feature_by_name($id);
>>>>>>
>>>>>> for my $f (@feat){
>>>>>> if (exists $dups{$f->group} && $f->type =~/gene/){
>>>>>> print "Calling >>>", $f->group, "\n";
>>>>>> print "Chr: ", $f->refseq,
>>>>>> " Strand: ", $f->strand,
>>>>>> " Start: ", $f->start,
>>>>>> " End: ", $f->end,
>>>>>> "\n";
>>>>>> print "Offending >>>", $dups{$f->group}->group, "\n";
>>>>>> print "Chr: ", $dups{$f->group}->refseq,
>>>>>> " Strand: ", $dups{$f->group}->strand,
>>>>>> " Start: ", $dups{$f->group}->start,
>>>>>> " End: ", $dups{$f->group}->end;
>>>>>> print "\n\n";
>>>>>> } else {
>>>>>> $dups{$f->group} = $f;
>>>>>> }
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> Here is the output:
>>>>>> Calling >>>FBgn0004179
>>>>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
>>>>>> Offending >>>FBgn0004179
>>>>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>>>>>>
>>>>>> Calling >>>FBgn0025681
>>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>>> Offending >>>FBgn0025681
>>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>>>
>>>>>> Calling >>>FBgn0025803
>>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>>> Offending >>>FBgn0025803
>>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>>>
>>>>>> Calling >>>FBgn0000117
>>>>>> Chr: X Strand: -1 Start: 1756796 End: 1747557
>>>>>> Offending >>>FBgn0000117
>>>>>> Chr: X Strand: -1 Start: 1757776 End: 1747182
>>>>>>
>>>>>> Calling >>>FBgn0005427
>>>>>> Chr: X Strand: -1 Start: 136456 End: 125343
>>>>>> Offending >>>FBgn0005427
>>>>>> Chr: X Strand: -1 Start: 133199 End: 124949
>>>>>>
>>>>>> Calling >>>FBgn0000042
>>>>>> Chr: X Strand: 1 Start: 5746100 End: 5750026
>>>>>> Offending >>>FBgn0000042
>>>>>> Chr: X Strand: 1 Start: 5746096 End: 5746106
>>>>>>
>>>>>> Calling >>>FBgn0004551
>>>>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
>>>>>> Offending >>>FBgn0004551
>>>>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>>>>>>
>>>>>> Do you have any suggestions?? Is the procedure I am using to
>>>>>> retrieve the
>>>>>> genes right?
>>>>>>
>>>>>> Many thanks
>>>>>>
>>>>>> Marco
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> Many thanks Scott,
>>>>>>>>
>>>>>>>> At the same time I got your email I was coming to the same
>>>>>>>> conclusion as
>>>>>>>> you.
>>>>>>>>
>>>>>>>> Now I have a stranger problem in my hands... My goal is quite
>>>>>>>> simple, I
>>>>>>>> try
>>>>>>>> to get the sequence of the genes back from the Bio::DB::GFF
>> database
>>>>>>>> loaded
>>>>>>>> on MySQL. The gene list is from a file with one gene id per per
>>>>>>>> line. When
>>>>>>>> I
>>>>>>>> run the following script:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> use Bio::DB::GFF;
>>>>>>>> use Bio::SeqIO;
>>>>>>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT,
>>>>>>>> -format => 'fasta');
>>>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>>>> -dsn =>
>>>>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>>>>
>>>>>>>> while (<>){
>>>>>>>> chomp;
>>>>>>>> my $id = $_;
>>>>>>>> my @feats = $db->get_feature_by_name($id);
>>>>>>>> for my $f (@feats){
>>>>>>>> $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> I get more sequence back than the number of gene in my input file.
>> I
>>>>>>>> double
>>>>>>>> check there. Some of the duplicated entries are the same, some
>>>>>>>> are not!
>>>>>>>
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioperl-l mailing list
>>>>>>> Bioperl-l at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>>
>>>>>> ______________________________
>>>>>> Marco Blanchette, Ph.D.
>>>>>>
>>>>>> mblanche at uclink.berkeley.edu
>>>>>>
>>>>>> Donald C. Rio's lab
>>>>>> Department of Molecular and Cell Biology
>>>>>> 16 Barker Hall
>>>>>> University of California
>>>>>> Berkeley, CA 94720-3204
>>>>>>
>>>>>> Tel: (510) 642-1084
>>>>>> Cell: (510) 847-0996
>>>>>> Fax: (510) 642-6062
>>>>>> --
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>> Christopher Fields
>>>>> Postdoctoral Researcher
>>>>> Lab of Dr. Robert Switzer
>>>>> Dept of Biochemistry
>>>>> University of Illinois Urbana-Champaign
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>>
>>>
>>> ______________________________
>>> Marco Blanchette, Ph.D.
>>>
>>> mblanche at uclink.berkeley.edu
>>>
>>> Donald C. Rio's lab
>>> Department of Molecular and Cell Biology
>>> 16 Barker Hall
>>> University of California
>>> Berkeley, CA 94720-3204
>>>
>>> Tel: (510) 642-1084
>>> Cell: (510) 847-0996
>>> Fax: (510) 642-6062
>>
>> ______________________________
>> Marco Blanchette, Ph.D.
>>
>> mblanche at uclink.berkeley.edu
>>
>> Donald C. Rio's lab
>> Department of Molecular and Cell Biology
>> 16 Barker Hall
>> University of California
>> Berkeley, CA 94720-3204
>>
>> Tel: (510) 642-1084
>> Cell: (510) 847-0996
>> Fax: (510) 642-6062
>> --
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>
______________________________
Marco Blanchette, Ph.D.
mblanche at uclink.berkeley.edu
Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204
Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--
More information about the Bioperl-l
mailing list