[Bioperl-l] Parsing FASTA files into PrimarySeq objects

Tue Feb 16 17:11:16 UTC 2010

Yes, except that tests are now failing on live.  Bio::PrimarySeq isn't a Bio::SeqI, so it's a bit of an API change in a lot of places expecting a Bio::SeqI.  We should make this optional for the time being, until we discuss it more on list.

chris

On Feb 16, 2010, at 10:57 AM, Jason Stajich wrote:

> Great!  thanks for committing it.
> 
> Florent Angly wrote:
>> Thanks for your reply Jason. It's very valuable since you coded the modules in question!
>> 
>> Now that I understand the purpose of these modules, I did a little bit of documentation clarification. I also figured out that I should implement the 'type' method in Bio::Seq::SeqFastaSpeedFactory and made it support creating Bio::PrimarySeq. This is all committed to SVN.
>> 
>> When sequences are large, you are right that it does not make much difference in time or memory resource to use one sequence type over the other. However, that's different for short sequences. I took a FASTA file weighting 39.3 MB and containing 360,000 sequences of 106 bp on average.
>> 
>> CASE 1:
>>  Requested factory Bio::Seq::SeqFactory
>>  Requested sequence type is Bio::Seq
>>  Memory used: 329 MB
>>  Time spent: 1m8.522s
>> 
>> CASE 2:
>>  Requested factory Bio::Seq::SeqFactory
>>  Requested sequence type is Bio::PrimarySeq
>>  Memory used: 246 MB
>>  Time spent: 0m52.663s
>> 
>> CASE 3:
>>  Requested factory Bio::Seq::SeqFastaSpeedFactory
>>  Requested sequence type is Bio::Seq
>>  Memory used: 298 MB
>>  Time spent: 0m31.292s
>> 
>> CASE 4:
>>  Requested factory Bio::Seq::SeqFastaSpeedFactory
>>  Requested sequence type is Bio::PrimarySeq
>>  Memory used: 231 MB
>>  Time spent: 0m28.500s
>> 
>> Clearly, when only simple sequence attributes are needed or FASTA files are used, it's fast and memory efficient to use the SeqFastaSpeedFactory with PrimarySeq.
>> 
>> Florent
>> 
>> 
>> On 16/02/10 10:03, Jason Stajich wrote:
>>> I don't think that aspect of the documentation vs interface was ever implemented - the interface object doesn't specify a type method or init argument even though the documentation says so. Not really sure why not, this was ages ago unfortunately.
>>> 
>>> This particular factory definitely assumes you are building Bio::Seq objects - you can try and subclass and build your own to see if it makes much of a difference in speed/memory - I would posit it won't make a significant difference but be interested to see what you find.
>>> 
>>> Just make your own Bio::Seq::PrimarySeqFactory object based on Bio::Seq::FastaSpeedFactory and simplify that code so it doesn't build Bio::Seq object wrapper around the Bio::PrimarySeq and do some perf tests so we'll know if it makes any difference here.
>>> 
>>> -jason
>>> Florent Angly wrote:
>>>> Hi all,
>>>> 
>>>> I am trying to reduce memory usage and speedup reading FASTA files using the facilities provided by BioPerl.
>>>> 
>>>> The first thing I noticed is that when using Bio::SeqIO::fasta, the objects returned are Bio::Seq, not Bio::PrimarySeq objects. Bio::PrimarySeq sequences are lighter than Bio::Seq sequences, so it's what I need. See code below:
>>>>> use warnings;
>>>>> use strict;
>>>>> use Data::Dumper;
>>>>> use Bio::SeqIO;
>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>> my $seqfactory = $in->sequence_factory; # Bio::Factory::ObjectBuilderI
>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>> my $seq = $in->next_seq;
>>>>> print Dumper($seq);
>>>>> __END__
>>>>> >seq1 a small test sequence q
>>>>> ACGTACGACTACGACTAGCGCCATCAGC 
>>>> It returns:
>>>>> $VAR1 = bless( {
>>>>>                 'primary_id' => 'seq1',
>>>>>                 'primary_seq' => bless( {
>>>>>                                           'display_id' => 'seq1',
>>>>>                                           'primary_id' => 'seq1',
>>>>>                                           'desc' => 'a small test sequence',
>>>>>                                           'seq' => 'ACGTACGACTACGACTAGCGCCATCAGC',
>>>>>                                           'alphabet' => 'dna'
>>>>>                                         }, 'Bio::PrimarySeq' )
>>>>>               }, 'Bio::Seq' ); 
>>>> 
>>>> Actually, we have a Bio::Seq containing a Bio::PrimarySeq. I really only need the Bio::PrimarySeq. Looking at the documentation for Bio::SeqIO I found that I could in theory adjust the sequence factory and sequence builder to my liking... I tried this:
>>>>> use warnings;
>>>>> use strict;
>>>>> use Data::Dumper;
>>>>> use Bio::SeqIO;
>>>>> my $in = Bio::SeqIO->new(-fh=>\*DATA);
>>>>> my $seqfactory = $in->sequence_factory; # Bio::Factory::ObjectBuilderI
>>>>> print "The factory is a ".ref($seqfactory)."\n";
>>>>> $seqfactory->type('Bio::PrimarySeq'); # gives an error
>>>>> my $seq = $in->next_seq;
>>>>> print Dumper($seq);
>>>>> __END__
>>>>> >seq1 a small test sequence
>>>>> ACGTACGACTACGACTAGCGCCATCAGC 
>>>> This returns:
>>>>> The factory is a Bio::Seq::SeqFastaSpeedFactory
>>>>> Can't locate object method "type" via package "Bio::Seq::SeqFastaSpeedFactory" at ./seqbuilder_test_3.pl line 12, <DATA> line 1. 
>>>> 
>>>> According to Bio::Seq::FastaSpeedFactory's documentation:
>>>>> If you want the factory to create Bio::Seq objects instead
>>>>> of the default Bio::PrimarySeq objects, use the -type parameter 
>>>> So, PrimarySeq should be the default type, even though in my case it seems not to be. Second, I can't seem to use the -type method to change what the return type is... It errors.
>>>> 
>>>> Any ideas??? Thanks,
>>>> 
>>>> Florent
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l