[Bioperl-l] Extracting gene seq from Bio::DB::GFF
Marco Blanchette
mblanche at berkeley.edu
Fri Aug 11 23:59:26 UTC 2006
Chris,
> Do you mean you get duplicates of sequences back, or that you get more than
> one chunk of the same sequence back?
I sometimes get duplicated sequences and sometimes overlapping regions (see
bellow)
>
> Is it possible that each query using an ID could contain more than one
> feature? That might explain it (you could check by testing the size of the
> array @feats).
Most id return more than one features from various type ( point_mutation,
insertion_site, processed_transcript, etc...). That's why I restirct the
output to type "gene" using regexp /gene/ on $f->type.
>
> I'm not sure how split locations are handled within Bio:DB::GFF, but do the
> specific features have split locations?
>
> Chris
>
Not sure what you mean exactly but have a look at the following script, it
gives the location and the group id of the feature being reported:
use Bio::DB::GFF;
my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
-dsn => 'dbi:mysql:database=dmel_43_new');
my %dups;
while (<>){
chomp;
my $id = $_;
my @feat = $db->get_feature_by_name($id);
for my $f (@feat){
if (exists $dups{$f->group} && $f->type =~/gene/){
print "Calling >>>", $f->group, "\n";
print "Chr: ", $f->refseq,
" Strand: ", $f->strand,
" Start: ", $f->start,
" End: ", $f->end,
"\n";
print "Offending >>>", $dups{$f->group}->group, "\n";
print "Chr: ", $dups{$f->group}->refseq,
" Strand: ", $dups{$f->group}->strand,
" Start: ", $dups{$f->group}->start,
" End: ", $dups{$f->group}->end;
print "\n\n";
} else {
$dups{$f->group} = $f;
}
}
}
Here is the output:
Calling >>>FBgn0004179
Chr: 3L Strand: 1 Start: 22201102 End: 22207587
Offending >>>FBgn0004179
Chr: 3L Strand: 1 Start: 22200575 End: 22200575
Calling >>>FBgn0025681
Chr: 2L Strand: 1 Start: 2992964 End: 2998614
Offending >>>FBgn0025681
Chr: 2L Strand: 1 Start: 2992964 End: 2998614
Calling >>>FBgn0025803
Chr: 3R Strand: 1 Start: 16966463 End: 17038413
Offending >>>FBgn0025803
Chr: 3R Strand: 1 Start: 16966463 End: 17038413
Calling >>>FBgn0000117
Chr: X Strand: -1 Start: 1756796 End: 1747557
Offending >>>FBgn0000117
Chr: X Strand: -1 Start: 1757776 End: 1747182
Calling >>>FBgn0005427
Chr: X Strand: -1 Start: 136456 End: 125343
Offending >>>FBgn0005427
Chr: X Strand: -1 Start: 133199 End: 124949
Calling >>>FBgn0000042
Chr: X Strand: 1 Start: 5746100 End: 5750026
Offending >>>FBgn0000042
Chr: X Strand: 1 Start: 5746096 End: 5746106
Calling >>>FBgn0004551
Chr: 2R Strand: -1 Start: 19443485 End: 19434556
Offending >>>FBgn0004551
Chr: 2R Strand: -1 Start: 19445155 End: 19429977
Do you have any suggestions?? Is the procedure I am using to retrieve the
genes right?
Many thanks
Marco
>> Many thanks Scott,
>>
>> At the same time I got your email I was coming to the same conclusion as
>> you.
>>
>> Now I have a stranger problem in my hands... My goal is quite simple, I
>> try
>> to get the sequence of the genes back from the Bio::DB::GFF database
>> loaded
>> on MySQL. The gene list is from a file with one gene id per per line. When
>> I
>> run the following script:
>>
>>
>>
>> use Bio::DB::GFF;
>> use Bio::SeqIO;
>> my $out = Bio::SeqIO->new( -fh => \*STDOUT,
>> -format => 'fasta');
>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>> -dsn => 'dbi:mysql:database=dmel_43_new');
>>
>> while (<>){
>> chomp;
>> my $id = $_;
>> my @feats = $db->get_feature_by_name($id);
>> for my $f (@feats){
>> $out->write_seq( $f->seq ) if $f->type =~/gene/;
>> }
>> }
>>
>>
>> I get more sequence back than the number of gene in my input file. I
>> double
>> check there. Some of the duplicated entries are the same, some are not!
>
>
> ...
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
______________________________
Marco Blanchette, Ph.D.
mblanche at uclink.berkeley.edu
Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204
Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--
More information about the Bioperl-l
mailing list