[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF

Marco Blanchette mblanche at berkeley.edu
Thu Aug 17 18:20:00 UTC 2006


Lincoln, thanks for the precision. I just could not find any references to
how to load the DNA (no where in bp_seqfeature_load.pl or in the
Bio::DB::SeqFeature::Store it says how load the DNA sequences).

So right now the gff files were loaded in mysql using:
/usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff

I tried the --fast options but got a bunch of warning (see below).

The DNA file (a single fasta database file containing all chromosome
sequences) was in a different location from the gff files and was not loaded
together with the gff files (the sequence table is empty in the database).

Can I load the DNA sequence after the gff files were loaded?

Many thanks

Marco


-------------------- WARNING ---------------------
MSG: ID=ortho:2825 has been used more than once, but it cannot be found in
the database.
This can happen if you have specified fast loading, but features sharing the
same ID
are not contiguous in the GFF file. This will be loaded as a separate
feature.
Line 483681: "X .       orthologous_region      19477824        19478027
.       +       .  
ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse"

STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature
/Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537
STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load
/Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424
STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh
/Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342
STACK Bio::DB::SeqFeature::Store::GFF3Loader::load
/Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240
STACK toplevel /usr/bin/bp_seqfeature_load.pl:81


On 8/17/06 10:27, "Lincoln Stein" <lstein at cshl.edu> wrote:

> Hi,
> 
> This message bounced because I tried to send it from my gmail account and so
> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
> finds a file that contains DNA data, it simply loads it. There is no special
> command line switch. Also you can include the DNA in the GFF3 file.
> 
> Lincoln
> 
> ---------- Forwarded message ----------
> From: Lincoln Stein <lincoln.stein at gmail.com>
> Date: Aug 17, 2006 12:26 PM
> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
> To: Chris Fields <cjfields at uiuc.edu>
> Cc: Marco Blanchette <mblanche at berkeley.edu>, "bioperl-l at lists.open-bio.org"
> <Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com
> 
> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to
> load the GFF3-format Fly data? I think you're probably getting confused by
> overlapping mRNA splice forms, an issue that won't occur with the full
> GFF3-formatted data.
> 
> 
> On 8/13/06, Chris Fields <cjfields at uiuc.edu> wrote:
>> 
>> Marco,
>> 
>> Did you figure out what the problem was?  I was curious; the issue
>> you were having was rather odd.  I wanted to see if it was an issue
>> with the GFF data or with the database itself.
>> 
>> Chris
>> 
>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
>> 
>>> Chris,
>>> 
>>>> Do you mean you get duplicates of sequences back, or that you get
>>>> more than
>>>> one chunk of the same sequence back?
>>> 
>>> I sometimes get duplicated sequences and sometimes overlapping
>>> regions (see
>>> bellow)
>>> 
>>>> 
>>>> Is it possible that each query using an ID could contain more than
>>>> one
>>>> feature?  That might explain it (you could check by testing the
>>>> size of the
>>>> array @feats).
>>> Most id return more than one features from various type
>>> ( point_mutation,
>>> insertion_site, processed_transcript, etc...). That's why I
>>> restirct the
>>> output to type "gene" using regexp /gene/ on $f->type.
>>> 
>>>> 
>>>> I'm not sure how split locations are handled within Bio:DB::GFF,
>>>> but do the
>>>> specific features have split locations?
>>>> 
>>>> Chris
>>>> 
>>> Not sure what you mean exactly but have a look at the following
>>> script, it
>>> gives the location and the group id of the feature being reported:
>>> 
>>> use Bio::DB::GFF;
>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>                               -dsn =>
>>> 'dbi:mysql:database=dmel_43_new');
>>> my %dups;
>>> while (<>){
>>>     chomp;
>>>     my $id = $_;
>>>     my @feat = $db->get_feature_by_name($id);
>>> 
>>>      for my $f (@feat){
>>>         if (exists $dups{$f->group} && $f->type =~/gene/){
>>>             print     "Calling >>>", $f->group, "\n";
>>>             print     "Chr: ", $f->refseq,
>>>                     " Strand: ", $f->strand,
>>>                     " Start: ", $f->start,
>>>                     " End: ", $f->end,
>>>                     "\n";
>>>             print "Offending >>>", $dups{$f->group}->group, "\n";
>>>             print     "Chr: ", $dups{$f->group}->refseq,
>>>                     " Strand: ", $dups{$f->group}->strand,
>>>                     " Start: ", $dups{$f->group}->start,
>>>                       " End: ", $dups{$f->group}->end;
>>>             print "\n\n";
>>>          } else {
>>>             $dups{$f->group} = $f;
>>>          }
>>>      }
>>> }
>>> 
>>> Here is the output:
>>> Calling >>>FBgn0004179
>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
>>> Offending >>>FBgn0004179
>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>>> 
>>> Calling >>>FBgn0025681
>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>> Offending >>>FBgn0025681
>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>>> 
>>> Calling >>>FBgn0025803
>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>> Offending >>>FBgn0025803
>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>>> 
>>> Calling >>>FBgn0000117
>>> Chr: X Strand: -1 Start: 1756796 End: 1747557
>>> Offending >>>FBgn0000117
>>> Chr: X Strand: -1 Start: 1757776 End: 1747182
>>> 
>>> Calling >>>FBgn0005427
>>> Chr: X Strand: -1 Start: 136456 End: 125343
>>> Offending >>>FBgn0005427
>>> Chr: X Strand: -1 Start: 133199 End: 124949
>>> 
>>> Calling >>>FBgn0000042
>>> Chr: X Strand: 1 Start: 5746100 End: 5750026
>>> Offending >>>FBgn0000042
>>> Chr: X Strand: 1 Start: 5746096 End: 5746106
>>> 
>>> Calling >>>FBgn0004551
>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
>>> Offending >>>FBgn0004551
>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>>> 
>>> Do you have any suggestions?? Is the procedure I am using to
>>> retrieve  the
>>> genes right?
>>> 
>>> Many thanks
>>> 
>>> Marco
>>> 
>>> 
>>> 
>>>>> Many thanks Scott,
>>>>> 
>>>>> At the same time I got your email I was coming to the same
>>>>> conclusion as
>>>>> you.
>>>>> 
>>>>> Now I have a stranger problem in my hands... My goal is quite
>>>>> simple, I
>>>>> try
>>>>> to get the sequence of the genes back from the Bio::DB::GFF database
>>>>> loaded
>>>>> on MySQL. The gene list is from a file with one gene id per per
>>>>> line. When
>>>>> I
>>>>> run the following script:
>>>>> 
>>>>> 
>>>>> 
>>>>> use Bio::DB::GFF;
>>>>> use Bio::SeqIO;
>>>>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
>>>>>                             -format => 'fasta');
>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>>>>                               -dsn =>
>>>>> 'dbi:mysql:database=dmel_43_new');
>>>>> 
>>>>> while (<>){
>>>>>     chomp;
>>>>>     my $id = $_;
>>>>>     my @feats = $db->get_feature_by_name($id);
>>>>>     for my $f (@feats){
>>>>>         $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>>>>     }
>>>>> }
>>>>> 
>>>>> 
>>>>> I get more sequence back than the number of gene in my input file. I
>>>>> double
>>>>> check there. Some of the duplicated entries are the same, some
>>>>> are not!
>>>> 
>>>> 
>>>> ...
>>>> 
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> 
>>> ______________________________
>>> Marco Blanchette, Ph.D.
>>> 
>>> mblanche at uclink.berkeley.edu
>>> 
>>> Donald C. Rio's lab
>>> Department of Molecular and Cell Biology
>>> 16 Barker Hall
>>> University of California
>>> Berkeley, CA 94720-3204
>>> 
>>> Tel: (510) 642-1084
>>> Cell: (510) 847-0996
>>> Fax: (510) 642-6062
>>> --
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>> 
>> 
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
> 
> 

______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
-- 






More information about the Bioperl-l mailing list