[Bioperl-l] Extracting gene seq from Bio::DB::GFF
Chris Fields
cjfields at uiuc.edu
Mon Aug 14 00:45:02 UTC 2006
Marco,
Did you figure out what the problem was? I was curious; the issue
you were having was rather odd. I wanted to see if it was an issue
with the GFF data or with the database itself.
Chris
On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote:
> Chris,
>
>> Do you mean you get duplicates of sequences back, or that you get
>> more than
>> one chunk of the same sequence back?
>
> I sometimes get duplicated sequences and sometimes overlapping
> regions (see
> bellow)
>
>>
>> Is it possible that each query using an ID could contain more than
>> one
>> feature? That might explain it (you could check by testing the
>> size of the
>> array @feats).
> Most id return more than one features from various type
> ( point_mutation,
> insertion_site, processed_transcript, etc...). That's why I
> restirct the
> output to type "gene" using regexp /gene/ on $f->type.
>
>>
>> I'm not sure how split locations are handled within Bio:DB::GFF,
>> but do the
>> specific features have split locations?
>>
>> Chris
>>
> Not sure what you mean exactly but have a look at the following
> script, it
> gives the location and the group id of the feature being reported:
>
> use Bio::DB::GFF;
> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> -dsn =>
> 'dbi:mysql:database=dmel_43_new');
> my %dups;
> while (<>){
> chomp;
> my $id = $_;
> my @feat = $db->get_feature_by_name($id);
>
> for my $f (@feat){
> if (exists $dups{$f->group} && $f->type =~/gene/){
> print "Calling >>>", $f->group, "\n";
> print "Chr: ", $f->refseq,
> " Strand: ", $f->strand,
> " Start: ", $f->start,
> " End: ", $f->end,
> "\n";
> print "Offending >>>", $dups{$f->group}->group, "\n";
> print "Chr: ", $dups{$f->group}->refseq,
> " Strand: ", $dups{$f->group}->strand,
> " Start: ", $dups{$f->group}->start,
> " End: ", $dups{$f->group}->end;
> print "\n\n";
> } else {
> $dups{$f->group} = $f;
> }
> }
> }
>
> Here is the output:
> Calling >>>FBgn0004179
> Chr: 3L Strand: 1 Start: 22201102 End: 22207587
> Offending >>>FBgn0004179
> Chr: 3L Strand: 1 Start: 22200575 End: 22200575
>
> Calling >>>FBgn0025681
> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> Offending >>>FBgn0025681
> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
>
> Calling >>>FBgn0025803
> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> Offending >>>FBgn0025803
> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
>
> Calling >>>FBgn0000117
> Chr: X Strand: -1 Start: 1756796 End: 1747557
> Offending >>>FBgn0000117
> Chr: X Strand: -1 Start: 1757776 End: 1747182
>
> Calling >>>FBgn0005427
> Chr: X Strand: -1 Start: 136456 End: 125343
> Offending >>>FBgn0005427
> Chr: X Strand: -1 Start: 133199 End: 124949
>
> Calling >>>FBgn0000042
> Chr: X Strand: 1 Start: 5746100 End: 5750026
> Offending >>>FBgn0000042
> Chr: X Strand: 1 Start: 5746096 End: 5746106
>
> Calling >>>FBgn0004551
> Chr: 2R Strand: -1 Start: 19443485 End: 19434556
> Offending >>>FBgn0004551
> Chr: 2R Strand: -1 Start: 19445155 End: 19429977
>
> Do you have any suggestions?? Is the procedure I am using to
> retrieve the
> genes right?
>
> Many thanks
>
> Marco
>
>
>
>>> Many thanks Scott,
>>>
>>> At the same time I got your email I was coming to the same
>>> conclusion as
>>> you.
>>>
>>> Now I have a stranger problem in my hands... My goal is quite
>>> simple, I
>>> try
>>> to get the sequence of the genes back from the Bio::DB::GFF database
>>> loaded
>>> on MySQL. The gene list is from a file with one gene id per per
>>> line. When
>>> I
>>> run the following script:
>>>
>>>
>>>
>>> use Bio::DB::GFF;
>>> use Bio::SeqIO;
>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT,
>>> -format => 'fasta');
>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>> -dsn =>
>>> 'dbi:mysql:database=dmel_43_new');
>>>
>>> while (<>){
>>> chomp;
>>> my $id = $_;
>>> my @feats = $db->get_feature_by_name($id);
>>> for my $f (@feats){
>>> $out->write_seq( $f->seq ) if $f->type =~/gene/;
>>> }
>>> }
>>>
>>>
>>> I get more sequence back than the number of gene in my input file. I
>>> double
>>> check there. Some of the duplicated entries are the same, some
>>> are not!
>>
>>
>> ...
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> ______________________________
> Marco Blanchette, Ph.D.
>
> mblanche at uclink.berkeley.edu
>
> Donald C. Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
>
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062
> --
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
More information about the Bioperl-l
mailing list