[Bioperl-l] Problems with Bio::DB::Fasta

Fri May 27 02:55:35 UTC 2011

Hi Justin,

I been trying to reproduce your issue. A problem I ran into was that 
there were some extra empty lines in your FASTA files. Then I made a 
test script that gets the subsequences you mentioned using three 
different methods: Bio::SeqIO+Bio::Seq, Bio::DB::Fasta, and your 
InMemoryFastaAccess. These three methods return the same answer, so, I 
see no problem there.

My system is pretty similar to yours:
Bioperl-live from the BioPerl GitHub master branch from 27/5/11
Perl 5.12.3
Linux 2.6.38-2-amd64 (Linux Mint Debian Edition)

Can you run the attached script on the attached FASTA files and see if 
all tests pass?

Thanks,

Florent

On 21/05/11 05:51, Justin Chu wrote:
> Hello:
>
> I'm having trouble with Bio::DB::Fasta. It sometimes occurs when I use large
> fasta files and retrieve sequence from a bit past the start of the file. I
> think some characters are being ignored or a rounding error is occurring or
> something  when using the offset to retrieve entries from the index file. I
> have attached the Fasta files I have been using, just incase my problem is
> due to improper formatting of my files.
>
> For example:
>
> my $refDB   = Bio::DB::Fasta->new('Test2.Fasta');
> my $queryDB = Bio::DB::Fasta->new('Test1.Fasta');
>
> print $refDB->subseq( "gi|294675557|ref|NC_014034.1|", 161067, 161788
> )."\n";
> print $queryDB->subseq( "gi|169245903|gb|EU376363.1|", 1, 722 )."\n";
>
> output:
> GGTAGTCCACGCCGTAAACGATGAATGCCAGTCGTCGGCAG...
> GTAGTCCCGGCCGTAAACGATGGATGCTAGCCGTCGGATAG...
>
> my $refDB2  = InMemoryFastaAccess->new('Test2.Fasta');
> my $queryDB2 = InMemoryFastaAccess->new('Test1.Fasta');
>
> print $refDB2->subseq( "gi|294675557|ref|NC_014034.1|", 161067, 161788
> )."\n";
> print $queryDB2->subseq( "gi|169245903|gb|EU376363.1|", 1, 722 )."\n";
>
> I get:
>
> output:
> GTAGTCCACGCCGTAAACGATGAATGCCAGTCGTCGGCA...
> GTAGTCCCGGCCGTAAACGATGGATGCTAGCCGTCGGAT...
>
> Basically, sometimes the sequences retrieved are correct but other times it
> is offset slightly by a few base pairs. Interestingly it seems that the
> offset problem gets worse as you retrieve sequence chunks further and
> further down the sequence.
>
> print $refDB->subseq( "gi|294675557|ref|NC_014034.1|", 1514858,
> 1515579)."\n";
>
> output:
> CCCTGGTAGTCCACGCCGTAAACGATGAATGCCAGTCGT...
>
> when it should be:
>
> print $refDB2->subseq( "gi|294675557|ref|NC_014034.1|", 1514858,
> 1515579)."\n";
>
> output:
> GTAGTCCACGCCGTAAACGATGAATGCCAGTCGTCGGCA...
>
> This module is still way faster than what I have, so I want to keep using
> it. Do you think there something I'm overlooking that could be the problem
> or do you see a way to fix this?
>
> I am currently running:
> Bioperl-live from the BioPerl GitHub master branch from 19/5/11
> Perl 5.10.1
> Debian 6.0.1
>
> If you need any other information please let me know.
>
> Thanks,
>
> Justin Chu
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.pl
Type: application/x-perl
Size: 1409 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20110527/3d3291b7/attachment-0008.pl>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Test1.Fasta
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20110527/3d3291b7/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Test2.Fasta
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20110527/3d3291b7/attachment-0001.ksh>