[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF
Marco Blanchette
mblanche at berkeley.edu
Thu Aug 17 18:40:50 UTC 2006
I will answer my own question...
Yes, one can load the fasta file after having loaded the gff file by doing:
bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta
Marco
On 8/17/06 11:20, "Marco Blanchette" <mblanche at berkeley.edu> wrote:
> Lincoln, thanks for the precision. I just could not find any references to
> how to load the DNA (no where in bp_seqfeature_load.pl or in the
> Bio::DB::SeqFeature::Store it says how load the DNA sequences).
>
> So right now the gff files were loaded in mysql using:
> /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff
>
> I tried the --fast options but got a bunch of warning (see below).
>
> The DNA file (a single fasta database file containing all chromosome
> sequences) was in a different location from the gff files and was not loaded
> together with the gff files (the sequence table is empty in the database).
>
> Can I load the DNA sequence after the gff files were loaded?
>
> Many thanks
>
> Marco
>
>
> -------------------- WARNING ---------------------
> MSG: ID=ortho:2825 has been used more than once, but it cannot be found in
> the database.
> This can happen if you have specified fast loading, but features sharing the
> same ID
> are not contiguous in the GFF file. This will be loaded as a separate
> feature.
> Line 483681: "X . orthologous_region 19477824 19478027
> . + .
> ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse"
>
> STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature
> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537
> STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load
> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424
> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342
> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240
> STACK toplevel /usr/bin/bp_seqfeature_load.pl:81
>
>
> On 8/17/06 10:27, "Lincoln Stein" <lstein at cshl.edu> wrote:
>
>> Hi,
>>
>> This message bounced because I tried to send it from my gmail account and so
>> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
>> finds a file that contains DNA data, it simply loads it. There is no special
>> command line switch. Also you can include the DNA in the GFF3 file.
>>
>> Lincoln
>>
>> ---------- Forwarded message ----------
>> From: Lincoln Stein <lincoln.stein at gmail.com>
>> Date: Aug 17, 2006 12:26 PM
>> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
>> To: Chris Fields <cjfields at uiuc.edu>
>> Cc: Marco Blanchette <mblanche at berkeley.edu>, "bioperl-l at lists.open-bio.org"
>> <Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com
>>
>> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to
>> load the GFF3-format Fly data? I think you're probably getting confused by
>> overlapping mRNA splice forms, an issue that won't occur with the full
>> GFF3-formatted data.
>>
>>
>> On 8/13/06, Chris Fields <cjfields at uiuc.edu> wrote:
>>>
>>> Marco,
>>>
>>> Did you figure out what the problem was? I was curious; the issue
>>> you were having was rather odd. I wanted to see if it was an issue
>>> with the GFF data or with the database itself.
>>>
>>> Chris
>>>
>>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
>>>
>>>> Chris,
>>>>
>>>>> Do you mean you get duplicates of sequences back, or that you get
>>>>> more than
>>>>> one chunk of the same sequence back?
>>>>
>>>> I sometimes get duplicated sequences and sometimes overlapping
>>>> regions (see
>>>> bellow)
>>>>
>>>>>
>>>>> Is it possible that each query using an ID could contain more than
>>>>> one
>>>>> feature? That might explain it (you could check by testing the
>>>>> size of the
>>>>> array @feats).
>>>> Most id return more than one features from various type
>>>> ( point_mutation,
>>>> insertion_site, processed_transcript, etc...). That's why I
>>>> restirct the
>>>> output to type "gene" using regexp /gene/ on $f->type.
>>>>
>>>>>
>>>>> I'm not sure how split locations are handled within Bio:DB::GFF,
>>>>> but do the
>>>>> specific features have split locations?
>>>>>
>>>>> Chris
>>>>>
>>>> Not sure what you mean exactly but have a look at the following
>>>> script, it
>>>> gives the location and the group id of the feature being reported:
>>>>
>>>> use Bio::DB::GFF;
>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>> -dsn =>
>>>> 'dbi:mysql:database=dmel_43_new');
>>>> my %dups;
>>>> while (<>){
>>>> chomp;
>>>> my $id = $_;
>>>> my @feat = $db->get_feature_by_name($id);
>>>>
>>>> for my $f (@feat){
>>>> if (exists $dups{$f->group} && $f->type =~/gene/){
>>>> print "Calling >>>", $f->group, "\n";
>>>> print "Chr: ", $f->refseq,
>>>> " Strand: ", $f->strand,
>>>> " Start: ", $f->start,
>>>> " End: ", $f->end,
>>>> "\n";
>>>> print "Offending >>>", $dups{$f->group}->group, "\n";
>>>> print "Chr: ", $dups{$f->group}->refseq,
>>>> " Strand: ", $dups{$f->group}->strand,
>>>> " Start: ", $dups{$f->group}->start,
>>>> " End: ", $dups{$f->group}->end;
>>>> print "\n\n";
>>>> } else {
>>>> $dups{$f->group} = $f;
>>>> }
>>>> }
>>>> }
>>>>
>>>> Here is the output:
>>>> Calling >>>FBgn0004179
>>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
>>>> Offending >>>FBgn0004179
>>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>>>>
>>>> Calling >>>FBgn0025681
>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>> Offending >>>FBgn0025681
>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>>>
>>>> Calling >>>FBgn0025803
>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>> Offending >>>FBgn0025803
>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>>>
>>>> Calling >>>FBgn0000117
>>>> Chr: X Strand: -1 Start: 1756796 End: 1747557
>>>> Offending >>>FBgn0000117
>>>> Chr: X Strand: -1 Start: 1757776 End: 1747182
>>>>
>>>> Calling >>>FBgn0005427
>>>> Chr: X Strand: -1 Start: 136456 End: 125343
>>>> Offending >>>FBgn0005427
>>>> Chr: X Strand: -1 Start: 133199 End: 124949
>>>>
>>>> Calling >>>FBgn0000042
>>>> Chr: X Strand: 1 Start: 5746100 End: 5750026
>>>> Offending >>>FBgn0000042
>>>> Chr: X Strand: 1 Start: 5746096 End: 5746106
>>>>
>>>> Calling >>>FBgn0004551
>>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
>>>> Offending >>>FBgn0004551
>>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>>>>
>>>> Do you have any suggestions?? Is the procedure I am using to
>>>> retrieve the
>>>> genes right?
>>>>
>>>> Many thanks
>>>>
>>>> Marco
>>>>
>>>>
>>>>
>>>>>> Many thanks Scott,
>>>>>>
>>>>>> At the same time I got your email I was coming to the same
>>>>>> conclusion as
>>>>>> you.
>>>>>>
>>>>>> Now I have a stranger problem in my hands... My goal is quite
>>>>>> simple, I
>>>>>> try
>>>>>> to get the sequence of the genes back from the Bio::DB::GFF database
>>>>>> loaded
>>>>>> on MySQL. The gene list is from a file with one gene id per per
>>>>>> line. When
>>>>>> I
>>>>>> run the following script:
>>>>>>
>>>>>>
>>>>>>
>>>>>> use Bio::DB::GFF;
>>>>>> use Bio::SeqIO;
>>>>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT,
>>>>>> -format => 'fasta');
>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>> -dsn =>
>>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>>>
>>>>>> while (<>){
>>>>>> chomp;
>>>>>> my $id = $_;
>>>>>> my @feats = $db->get_feature_by_name($id);
>>>>>> for my $f (@feats){
>>>>>> $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> I get more sequence back than the number of gene in my input file. I
>>>>>> double
>>>>>> check there. Some of the duplicated entries are the same, some
>>>>>> are not!
>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>> ______________________________
>>>> Marco Blanchette, Ph.D.
>>>>
>>>> mblanche at uclink.berkeley.edu
>>>>
>>>> Donald C. Rio's lab
>>>> Department of Molecular and Cell Biology
>>>> 16 Barker Hall
>>>> University of California
>>>> Berkeley, CA 94720-3204
>>>>
>>>> Tel: (510) 642-1084
>>>> Cell: (510) 847-0996
>>>> Fax: (510) 642-6062
>>>> --
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>> Christopher Fields
>>> Postdoctoral Researcher
>>> Lab of Dr. Robert Switzer
>>> Dept of Biochemistry
>>> University of Illinois Urbana-Champaign
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>>
>
> ______________________________
> Marco Blanchette, Ph.D.
>
> mblanche at uclink.berkeley.edu
>
> Donald C. Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
>
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062
______________________________
Marco Blanchette, Ph.D.
mblanche at uclink.berkeley.edu
Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204
Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--
More information about the Bioperl-l
mailing list