[Bioperl-l] Bio::DB::Fasta problem: unable to fetch all sequences via get_PrimarySeq_stream

Fields, Christopher J cjfields at illinois.edu
Mon Nov 14 17:31:44 UTC 2016

We would probably need a list of the IDs, but this has come up a few times before.  In some cases it’s a line-ending mismatch, which can be normalized with a tool like dos2unix.  However, if you have IDs that evaluate as false, the issue is trickier and not so easy to fix, primarily because the returned object stringifies to its display ID (which is one reason I hate object stringification).
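
For the line-ending case, a quick check before reaching for dos2unix could look like the sketch below. It is plain Perl with no BioPerl dependency; count_crlf_headers is a hypothetical helper name, and it only looks for CR LF at the end of header lines (old Mac CR-only files are not handled).

```perl
use strict;
use warnings;

# Hypothetical helper: count FASTA header lines (starting with '>')
# that end in CR LF. The file is read in raw mode so that no I/O layer
# translates the line endings before we look at them.
sub count_crlf_headers {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /^>.*\r\n\z/;
    }
    close $fh;
    return $count;
}

# Usage: perl check_crlf.pl test2.fna
my $file = shift @ARGV;
if (defined $file) {
    my $n = count_crlf_headers($file);
    print $n
        ? "$n header line(s) end in CRLF; consider running dos2unix\n"
        : "no CRLF header lines found\n";
}
```

If the count is non-zero, normalizing the file and reindexing would be the first thing to try.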

As an example of the false-ID case, the following would likely short-circuit without showing any sequence IDs: a seq ID of ‘0’ (note this does not include the description, which is separate) evaluates as false and kills the while loop:

>0 desc1
>1 desc2

The issue, the problems with a fix, and a workaround are described here: https://github.com/bioperl/bioperl-live/issues/170
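
The short-circuit can be reproduced without BioPerl. Below is a minimal sketch in plain Perl (5.14 or later). My::Seq, make_stream, count_truthy and count_defined are hypothetical names invented for illustration; the stand-in class only assumes, as described above, that the object stringifies to its ID. Testing the loop condition with defined() instead of truthiness is a general Perl idiom, and it is assumed here that the stream returns undef once it is exhausted.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence object that stringifies to its ID.
# With fallback => 1, boolean context falls back to the string value,
# so an object whose ID is '0' is false in a while() condition.
package My::Seq {
    use overload '""' => sub { $_[0]->{id} }, fallback => 1;
    sub new { my ($class, $id) = @_; return bless { id => $id }, $class }
}

package main;

# Hypothetical stream: a closure that returns one object per call and
# undef once the queue is empty.
sub make_stream {
    my @queue = map { My::Seq->new($_) } @_;
    return sub { return shift @queue };
}

# Loop on truthiness: stops early at the first ID that evaluates as false.
sub count_truthy {
    my $next = make_stream(@_);
    my $n = 0;
    while (my $seq = $next->()) { $n++ }
    return $n;
}

# Loop on definedness: stops only when the stream returns undef.
sub count_defined {
    my $next = make_stream(@_);
    my $n = 0;
    while (defined(my $seq = $next->())) { $n++ }
    return $n;
}

print "truthy loop:  ", count_truthy('1', '0', '2'), "\n";   # prints 1
print "defined loop: ", count_defined('1', '0', '2'), "\n";  # prints 3
```

The same defined() wrapping applied to a real next_seq() loop is only a suggestion under the assumptions above.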


From: Bioperl-l <bioperl-l-bounces+cjfields=illinois.edu at mailman.open-bio.org> on behalf of Helene RIMBERT <helene.rimbert at inra.fr>
Date: Monday, November 14, 2016 at 10:16 AM
To: "bioperl-l at mailman.open-bio.org" <bioperl-l at mailman.open-bio.org>
Subject: [Bioperl-l] Bio::DB::Fasta problem: unable to fetch all sequences via get_PrimarySeq_stream

Dear BioPerl developers,

I have a question regarding get_PrimarySeq_stream!

I am using the Bio::DB::Fasta module to access my FASTA sequences and I am facing a problem with get_PrimarySeq_stream().
When I check the content of the db object, all the sequences are indexed (I mean that I can see all the sequence IDs in the offsets hash).

I then use get_PrimarySeq_stream to loop over all my sequences, but only 1 sequence is retrieved from the stream object.
I tried to look for an explanation, and the only thing I could find is that my seq IDs seem to be treated as undef during the while ($dbstream->next_seq()) statement, when reaching IndexedBase.pm line 1116.

I tried to loop over all sequence ids using my @seq_ids = $self->{fastaObj}->get_all_primary_ids; and it works very well.

I don't understand why the stream object does not retrieve all the sequences whereas get_all_primary_ids does!
Is there something wrong with my input FASTA (my IDs are very long...) or am I missing something?

I am really interested in finding out why I am not able to use get_PrimarySeq_stream!

Many thanks in advance :)



# Here is the part of the code that causes the problem:
# initialize the Bio::DB::Fasta object
$self->{fastaObj} = Bio::DB::Fasta->new("test2.fna", -reindex => 1);

# create the stream object
my $seq_stream = $self->{fastaObj}->get_PrimarySeq_stream();

# loop over all sequences in the Bio::DB::Fasta object using the stream object
while ($self->{seq} = $seq_stream->next_seq()) {
    #foreach my $seq_id (@seq_ids) {
    #$self->{seq} = $self->{fastaObj}->get_Seq_by_id($seq_id); # to use with the foreach loop

    print(" New sequence: ", Dumper($self->{seq}));
} # end of while loop (closing brace not shown in the original excerpt)
print(" Fetched sequences in _PrimarySeq_stream: $self->{nbSeqFetchedInStream}\n");


--> New e-mail address: helene.rimbert at inra.fr <--

Bioinformatic Engineer
helene.rimbert at inra.fr
UMR 1095 INRA/UBP – Site de Crouel
Tel.: +33 (0)4 73 62 43 49
5 chemin de beaulieu
63039 Clermont-Ferrand Cedex 2

