[Bioperl-l] Next-gen modules

Wed Jun 17 22:10:57 UTC 2009

Chris Fields wrote:
> On Jun 17, 2009, at 1:20 PM, Sendu Bala wrote:
> 
>> Tristan Lefebure wrote:
>>> Hello,
>>> Regarding next-gen sequences and BioPerl, in my experience
>>> another issue is BioPerl speed. For example, if you want to trim bad
>>> quality bases at ends of 1E6 Solexa reads using Bio::SeqIO::fastq and
>>> some methods in Bio::Seq::Quality, well, you've got to be patient
>>> (but maybe I missed some shortcuts...).
>>
>> This is my concern as well. Or, rather, is there actually a 
>> significant set of users out there who are dealing with next-gen 
>> sequencing and would consider using BioPerl for their work?
>>
>> I'm working with all the 1000-genomes data at the Sanger, and we at 
>> least are probably never going to use BioPerl for the work.
> 
> Are you using pure perl or (gasp) something else?  ;>

We use some perl stuff, some C stuff. My own stuff is OO perl, but much 
lighter weight than BioPerl. Absolute minimal object creation.
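
A minimal sketch of what "minimal object creation" can look like for the read-trimming case mentioned above. This is not the code used at the Sanger, and it is written in Python rather than Perl purely for illustration. The names read_fastq and trim_ends, the 4-line record layout, the quality cutoff of 20, and the ASCII offset of 33 are all assumptions; Solexa-era files often used a different offset and scoring scheme.

```python
import io

def read_fastq(handle):
    """Yield (name, seq, qual) tuples of plain strings.

    Assumes strict 4-line FASTQ records; stops at EOF or a blank header line.
    """
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def trim_ends(seq, qual, min_q=20, offset=33):
    """Strip bases scoring below min_q from both ends of a read.

    offset=33 is an assumed quality encoding; adjust it for other formats.
    """
    threshold = chr(min_q + offset)
    start, end = 0, len(qual)
    while start < end and qual[start] < threshold:
        start += 1
    while end > start and qual[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]

# Hypothetical one-read input, used only to exercise the two functions.
sample = io.StringIO("@read1\nACGTACGT\n+\n##IIII#I\n")
trimmed = [(name, *trim_ends(seq, qual)) for name, seq, qual in read_fastq(sample)]
```

Each read stays a tuple of strings from parsing through trimming, so no per-read object is ever constructed; that is the trade-off under discussion, since none of the BioPerl object methods are available on the result.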

>>> A pure perl solution will be between 100 and 1000x faster... Would it 
>>> be possible to have an ultra-light quality object with few simple 
>>> methods for next-gen reads?
>>
>> The fastq parser itself already seems pretty fast. The way to get the 
>> speedup is to not create any Bio::Seq* objects but just return the 
>> data directly. At that point it's not taking much advantage of 
>> BioPerl. But certainly it could be done...
> 
> I suppose the best way to assess what needs to be done is to come up with
> a set of 'use cases' specifying what users want so we can design around
> them, otherwise we're shooting in the dark.

Indeed. Though at least I think we can all agree it would be nice to 
have the functionality there even if it's slow. There will always be at 
least some use-cases where the run speed doesn't matter.

> I'm personally wondering if this could be done as a sequence database, 
> something similar in theme to Lincoln's SeqFeature::Store, but sequence 
> only, and returns quality objects in a similar manner (ala Storable)?  
> Not sure whether that's feasible, but it appears at least scalable.

I think not. Well, at least SeqFeature::Store doesn't scale. Try storing 
millions of features in a database and watch it crawl to complete 
unusability. I can't imagine a db scaling to holding hundreds of TB of 
data either. I'm also not sure what the benefit is. There are already 
high-speed ways of indexing your fastq or bam files.
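
As a rough illustration of file-level indexing, here is a sketch of a byte-offset index over a FASTQ file. It is only one simple approach, not necessarily any of the tools alluded to above, and it is in Python for illustration. The names build_index and fetch are hypothetical, and strict 4-line records with unique read names are assumed.

```python
import io

def build_index(handle):
    """Map each read name to the byte offset of its header line.

    Expects a handle opened in binary mode so that tell() and seek()
    work on exact byte positions.
    """
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header.strip():
            return index
        # The read name is the first whitespace-delimited token after '@'.
        index[header[1:].split()[0].decode()] = offset
        for _ in range(3):  # skip sequence, '+', and quality lines
            handle.readline()

def fetch(handle, index, name):
    """Seek straight to one record and return its four lines as strings."""
    handle.seek(index[name])
    return [handle.readline().decode().rstrip("\n") for _ in range(4)]

# Hypothetical two-read input held in memory instead of a real file.
data = io.BytesIO(b"@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n#III\n")
idx = build_index(data)
record = fetch(data, idx, "r2")
```

The index holds only names and integers, so the reads themselves never leave the flat file. For a real data set the dictionary would need to be persisted to disk rather than rebuilt on each run.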