[Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF

Thu Aug 17 17:27:33 UTC 2006

Hi,

This message bounced because I tried to send it from my gmail account and so
I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it
finds a file that contains DNA data, it simply loads it. There is no special
command line switch. Also you can include the DNA in the GFF3 file.

Lincoln

---------- Forwarded message ----------
From: Lincoln Stein <lincoln.stein at gmail.com>
Date: Aug 17, 2006 12:26 PM
Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
To: Chris Fields <cjfields at uiuc.edu>
Cc: Marco Blanchette <mblanche at berkeley.edu>, "bioperl-l at lists.open-bio.org"
<Bioperl-l at lists.open-bio.org>, cain.cshl at gmail.com

I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to
load the GFF3-format Fly data? I think you're probably getting confused by
overlapping mRNA splice forms, an issue that won't occur with the full
GFF3-formatted data.

On 8/13/06, Chris Fields <cjfields at uiuc.edu> wrote:
>
> Marco,
>
> Did you figure out what the problem was?  I was curious; the issue
> you were having was rather odd.  I wanted to see if it was an issue
> with the GFF data or with the database itself.
>
> Chris
>
> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
>
> > Chris,
> >
> >> Do you mean you get duplicates of sequences back, or that you get
> >> more than
> >> one chunk of the same sequence back?
> >
> > I sometimes get duplicated sequences and sometimes overlapping
> > regions (see
> > bellow)
> >
> >>
> >> Is it possible that each query using an ID could contain more than
> >> one
> >> feature?  That might explain it (you could check by testing the
> >> size of the
> >> array @feats).
> > Most id return more than one features from various type
> > ( point_mutation,
> > insertion_site, processed_transcript, etc...). That's why I
> > restirct the
> > output to type "gene" using regexp /gene/ on $f->type.
> >
> >>
> >> I'm not sure how split locations are handled within Bio:DB::GFF,
> >> but do the
> >> specific features have split locations?
> >>
> >> Chris
> >>
> > Not sure what you mean exactly but have a look at the following
> > script, it
> > gives the location and the group id of the feature being reported:
> >
> > use Bio::DB::GFF;
> > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> >                               -dsn =>
> > 'dbi:mysql:database=dmel_43_new');
> > my %dups;
> > while (<>){
> >     chomp;
> >     my $id = $_;
> >     my @feat = $db->get_feature_by_name($id);
> >
> >      for my $f (@feat){
> >         if (exists $dups{$f->group} && $f->type =~/gene/){
> >             print     "Calling >>>", $f->group, "\n";
> >             print     "Chr: ", $f->refseq,
> >                     " Strand: ", $f->strand,
> >                     " Start: ", $f->start,
> >                     " End: ", $f->end,
> >                     "\n";
> >             print "Offending >>>", $dups{$f->group}->group, "\n";
> >             print     "Chr: ", $dups{$f->group}->refseq,
> >                     " Strand: ", $dups{$f->group}->strand,
> >                     " Start: ", $dups{$f->group}->start,
> >                       " End: ", $dups{$f->group}->end;
> >             print "\n\n";
> >          } else {
> >             $dups{$f->group} = $f;
> >          }
> >      }
> > }
> >
> > Here is the output:
> > Calling >>>FBgn0004179
> > Chr: 3L Strand: 1 Start: 22201102 End: 22207587
> > Offending >>>FBgn0004179
> > Chr: 3L Strand: 1 Start: 22200575 End: 22200575
> >
> > Calling >>>FBgn0025681
> > Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> > Offending >>>FBgn0025681
> > Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> >
> > Calling >>>FBgn0025803
> > Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> > Offending >>>FBgn0025803
> > Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> >
> > Calling >>>FBgn0000117
> > Chr: X Strand: -1 Start: 1756796 End: 1747557
> > Offending >>>FBgn0000117
> > Chr: X Strand: -1 Start: 1757776 End: 1747182
> >
> > Calling >>>FBgn0005427
> > Chr: X Strand: -1 Start: 136456 End: 125343
> > Offending >>>FBgn0005427
> > Chr: X Strand: -1 Start: 133199 End: 124949
> >
> > Calling >>>FBgn0000042
> > Chr: X Strand: 1 Start: 5746100 End: 5750026
> > Offending >>>FBgn0000042
> > Chr: X Strand: 1 Start: 5746096 End: 5746106
> >
> > Calling >>>FBgn0004551
> > Chr: 2R Strand: -1 Start: 19443485 End: 19434556
> > Offending >>>FBgn0004551
> > Chr: 2R Strand: -1 Start: 19445155 End: 19429977
> >
> > Do you have any suggestions?? Is the procedure I am using to
> > retrieve  the
> > genes right?
> >
> > Many thanks
> >
> > Marco
> >
> >
> >
> >>> Many thanks Scott,
> >>>
> >>> At the same time I got your email I was coming to the same
> >>> conclusion as
> >>> you.
> >>>
> >>> Now I have a stranger problem in my hands... My goal is quite
> >>> simple, I
> >>> try
> >>> to get the sequence of the genes back from the Bio::DB::GFF database
> >>> loaded
> >>> on MySQL. The gene list is from a file with one gene id per per
> >>> line. When
> >>> I
> >>> run the following script:
> >>>
> >>>
> >>>
> >>> use Bio::DB::GFF;
> >>> use Bio::SeqIO;
> >>> my $out = Bio::SeqIO->new(    -fh => \*STDOUT,
> >>>                             -format => 'fasta');
> >>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> >>>                               -dsn =>
> >>> 'dbi:mysql:database=dmel_43_new');
> >>>
> >>> while (<>){
> >>>     chomp;
> >>>     my $id = $_;
> >>>     my @feats = $db->get_feature_by_name($id);
> >>>     for my $f (@feats){
> >>>         $out->write_seq( $f->seq ) if $f->type =~/gene/;
> >>>     }
> >>> }
> >>>
> >>>
> >>> I get more sequence back than the number of gene in my input file. I
> >>> double
> >>> check there. Some of the duplicated entries are the same, some
> >>> are not!
> >>
> >>
> >> ...
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > ______________________________
> > Marco Blanchette, Ph.D.
> >
> > mblanche at uclink.berkeley.edu
> >
> > Donald C. Rio's lab
> > Department of Molecular and Cell Biology
> > 16 Barker Hall
> > University of California
> > Berkeley, CA 94720-3204
> >
> > Tel: (510) 642-1084
> > Cell: (510) 847-0996
> > Fax: (510) 642-6062
> > --
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT michelse at cshl.edu