[Biojava-l] Dealing with huge sequences (was: "memory leak while reading nr.fasta")

Aaron Darling darling at cs.wisc.edu
Mon Jul 4 02:35:17 EDT 2005

Richard HOLLAND wrote:

>What is required for files this size is a SeqIOTools parser that reads
>sequence objects _on demand_ as requested by the iterator, rather than
>reading the whole lot at once.

This brings up a related issue that I'm grappling with at the moment...  
I would like to have BioJava parse a large sequence file and then 
periodically extract arbitrary subsequences.  As currently implemented, 
it seems that in order to extract a subsequence, the entire sequence 
entry must be loaded from the GenBank/FastA/whatever file into memory.  
This becomes a problem when dealing with large chromosomal data sets of 
the type displayed in the Mauve alignment viewer.  Yes, I'm aware of the 
PackedSymbolList.  Unfortunately, mammalian genomes are around 3 
gigabases, requiring around 700MB each even with a 2-bit-per-base encoding.

Given that it won't be practical to store the entire sequence in memory, 
the next best solution would be keeping an in-memory index of relevant 
sequence file offsets.  Enter BioJava's IndexStore.  Unless I've 
misunderstood the documentation, the IndexStore family of classes index 
sequence files on a per-contig/per-entry basis.  Such a scheme creates 
rather sparse indexes for chromosomes that can be > 100MB in length.  
What seems ideal is a SeqIOTools implementation that could 
read a GenBank/FastA file and construct a Sequence-derived object 
with lazy references to the data.  The Sequence-derived class would also 
need mappings of sequence coordinates to file offsets so that reading a 
10-character subsequence n...n+9 doesn't require also reading 
subsequence 1...n-1.  I implemented a similar scheme in a small C++ 
library called libGenome years ago and it makes manipulating large data 
sets a breeze.
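
To make the coordinate-to-offset idea concrete, here is a minimal sketch 
in plain Java.  It is not BioJava API and not how libGenome does it: the 
class name LazyFastaSequence and its methods are hypothetical, and it 
assumes a single-record FastA file whose sequence lines all have the same 
width (except possibly the last) and a consistent line terminator.  Under 
those assumptions only three numbers need to stay in memory, and any 
subsequence can be fetched with one seek.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

/** Hypothetical helper (not part of BioJava): lazy reads from a one-record FastA file. */
class LazyFastaSequence {
    private final Path file;
    private final long dataOffset;   // byte offset of the first residue
    private final int basesPerLine;  // residues on each full sequence line
    private final int bytesPerLine;  // residues plus line terminator bytes
    private final long length;       // total number of residues in the record

    LazyFastaSequence(Path file) throws IOException {
        this.file = file;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.readLine();                          // skip the '>' header line
            dataOffset = raf.getFilePointer();
            String first = raf.readLine();           // first sequence line fixes the layout
            if (first == null || first.isEmpty()) {
                throw new IOException("no sequence data in " + file);
            }
            basesPerLine = first.length();
            bytesPerLine = (int) (raf.getFilePointer() - dataOffset);
            // Derive the residue count from the file size instead of reading the data.
            long dataBytes = raf.length() - dataOffset;
            raf.seek(raf.length() - 1);
            if (raf.read() == '\n') {
                dataBytes -= bytesPerLine - basesPerLine; // drop the trailing terminator
            }
            length = (dataBytes / bytesPerLine) * basesPerLine + dataBytes % bytesPerLine;
        }
    }

    long length() {
        return length;
    }

    /** Maps a 0-based sequence coordinate to its byte offset in the file. */
    private long offsetOf(long pos) {
        return dataOffset + (pos / basesPerLine) * bytesPerLine + pos % basesPerLine;
    }

    /** Returns count residues starting at 0-based position start, reading only those lines. */
    String subsequence(long start, int count) throws IOException {
        if (start < 0 || count < 0 || start + count > length) {
            throw new IndexOutOfBoundsException("start=" + start + ", count=" + count);
        }
        if (count == 0) {
            return "";
        }
        long firstByte = offsetOf(start);
        long lastByte = offsetOf(start + count - 1);
        byte[] raw = new byte[(int) (lastByte - firstByte + 1)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(firstByte);
            raf.readFully(raw);
        }
        // The raw span still contains the line terminators it crossed; strip them.
        StringBuilder sb = new StringBuilder(count);
        for (byte b : raw) {
            if (b != '\n' && b != '\r') {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }
}
```

With this, new LazyFastaSequence(path).subsequence(n, 10) touches only the 
bytes spanning those 10 residues, however large n is.  A multi-entry or 
GenBank file would need one such offset record per entry, and irregular 
line widths would need a per-line (or per-block) offset table instead of 
the fixed-width arithmetic used here.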

Echoing Richard's question for this slightly different problem:

>Can someone clarify if a lazy-loading parser/database implementation
>already exists for situations like this, or does one need to be written?

Thanks for BioJava, and thanks for any feedback.

BTW, I also brought this up at the BOSC BioJava BOF, but we were rather 
abruptly ushered out of the meeting room by an anxious hotel staffer 
prior to reaching a conclusion.
