[Bioperl-l] Next-gen modules
    Chris Fields 
    cjfields at illinois.edu
       
    Wed Jun 17 18:40:05 UTC 2009
    
    
  
On Jun 17, 2009, at 1:09 PM, Tristan Lefebure wrote:
> Thanks both for the light.
>
> That probably means that the place bioperl will take in the
> handling of the next-gen sequencing raw data (i.e. reads) is
> very limited, nope? (at least until bioperl6). A single GA2
> solexa lane generates about 9 million reads, and I would
> really not call that a big project...

I don't think it's impossible.  If you parse any very long list of
sequences in order it will be very slow, yes, but if they were indexed
or loaded into a DB, lookups would of course be orders of magnitude
faster.

We already have perl-based indexing for fastq (Bio::Index::Fastq), so
maybe something could be built on top of that. I haven't looked, but we
could also wrap other C/C++-based parsers. BioLib, for instance, has
bindings to io_lib, so maybe that could be (ab)used in some way.

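To make the indexing idea concrete, here is a minimal sketch in Python
(not BioPerl, and not how Bio::Index::Fastq is implemented). It assumes
strict four-line FASTQ records; build_fastq_index and fetch_read are
hypothetical names made up for this illustration. The index maps each
read ID to a byte offset, so a lookup becomes a seek instead of a scan
over the whole file.

```python
import io


def build_fastq_index(handle):
    """Map each read ID to the byte offset of its '@' header line.

    Assumes a binary handle and strict 4-line FASTQ records
    (no wrapped sequence or quality lines).
    """
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith(b"@"):
            raise ValueError("expected '@' header at byte %d" % offset)
        # Skip the sequence, '+' separator, and quality lines.
        for _ in range(3):
            handle.readline()
        read_id = header[1:].split()[0].decode("ascii")
        index[read_id] = offset
    return index


def fetch_read(handle, index, read_id):
    """Seek straight to one record and return (description, seq, qual)."""
    handle.seek(index[read_id])
    header, seq, _plus, qual = (
        handle.readline().rstrip(b"\r\n").decode("ascii") for _ in range(4)
    )
    return header[1:], seq, qual


# Tiny in-memory example; a real file would be opened with open(path, "rb").
demo = io.BytesIO(b"@read1 lane1\nACGT\n+\nIIII\n@read2 lane1\nTTGA\n+\n!!!!\n")
demo_index = build_fastq_index(demo)
print(fetch_read(demo, demo_index, "read2"))
```

Building the index still costs one full pass, but that pass creates no
sequence objects, and the index could be saved to disk and reused.
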
> BTW, is there a simple way to see object instantiation and
> inheritance, as well as time consumption for each, when one
> calls next_seq() (or any other method)?
>
> -Tristan

As a simple benchmark, at one point all feature tag information was
converted into Bio::Annotations.  I reverted that behavior to be
simple tag/value again and saw a pretty decent speedup:

http://www.bioperl.org/wiki/Feature_Annotation_rollback#Simple_Benchmark

Also, I tried reimplementing some parsers as generic 'event'-based
driver/handler pairs and they were slightly faster, the key roadblock
being instantiation again.  If I didn't create Features/Annotations I
saw a significant speedup.  That's not entirely unexpected, as
SeqFeatures also contain Locations (which in turn can contain
subLocations) and (until recently) tag-based Bio::Annotation by
default.  Annotations are collected in an Annotation::Collection and
can, I believe, contain other objects (Ontology terms, etc).

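A rough sketch of the driver/handler split, in Python rather than Perl
and not taken from the BioPerl parsers. The record format (tab-separated
key/value lines ending in '//') and the names drive, CountingHandler
and DictHandler are all invented for illustration. The driver only
tokenizes and reports events; the handler decides how much to build.

```python
def drive(lines, handler):
    """Driver: tokenize 'key<TAB>value' lines and report them as events."""
    in_record = False
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line == "//":  # record terminator in this made-up format
            if in_record:
                handler.end_record()
                in_record = False
            continue
        if not in_record:
            handler.start_record()
            in_record = True
        key, _, value = line.partition("\t")
        handler.data(key, value)
    if in_record:  # last record had no terminator
        handler.end_record()


class CountingHandler:
    """Cheap handler: counts records and builds no per-record objects."""

    def __init__(self):
        self.records = 0

    def start_record(self):
        pass

    def data(self, key, value):
        pass

    def end_record(self):
        self.records += 1


class DictHandler:
    """Heavier handler: collects every record into a dict of lists."""

    def __init__(self):
        self.records = []
        self._current = None

    def start_record(self):
        self._current = {}

    def data(self, key, value):
        self._current.setdefault(key, []).append(value)

    def end_record(self):
        self.records.append(self._current)
        self._current = None


sample = ["ID\tseq1\n", "gene\tabc\n", "//\n", "ID\tseq2\n", "//\n"]
counter = CountingHandler()
drive(sample, counter)
print(counter.records)
```

The same driver serves both handlers, so the cost of object creation is
paid only by callers that ask for objects.
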
The overall lesson is, if you don't have very heavy objects being  
created the overhead is actually quite small; it's only when you  
greedily instantiate everything that you run into problems.
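
One way to avoid greedy instantiation is to keep the raw parsed data
and build the heavy objects only on first access. The sketch below is
in Python and uses hypothetical stand-in classes (Feature, LazyRecord),
not BioPerl's SeqFeature or Annotation classes; it only illustrates the
pattern.

```python
class Feature:
    """Stand-in for a heavy feature object (hypothetical class)."""

    created = 0  # counts how many Feature objects have been built

    def __init__(self, kind, start, end, tags):
        Feature.created += 1
        self.kind = kind
        self.start = start
        self.end = end
        self.tags = dict(tags)


class LazyRecord:
    """Holds raw feature tuples; builds Feature objects on first access."""

    def __init__(self, seq_id, raw_features):
        self.seq_id = seq_id
        self._raw_features = raw_features  # cheap tuples from the parser
        self._features = None

    @property
    def features(self):
        if self._features is None:
            self._features = [Feature(*raw) for raw in self._raw_features]
        return self._features


raw = [("gene", 1, 100, {"locus_tag": "x"})] * 3
records = [LazyRecord("seq%d" % i, raw) for i in range(1000)]
print(Feature.created)           # no Feature objects built yet
print(len(records[0].features))  # builds the features of one record only
print(Feature.created)
```

Records that are only counted or filtered by ID never pay for their
features.
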
chris
    
    