[Biojava-l] Dealing with huge sequences (was: "memory leak while reading nr.fasta")

Mon Jul 4 02:49:35 EDT 2005

I think this would be easily doable with BioJava. It would require a 
custom implementation of Sequence and, thanks to the beauty of interfaces, 
you probably wouldn't even know you were dealing with an assembly (except 
that it might sometimes be a bit slow while collecting data).

As you say, you could use IndexStore. It might also be worth looking at 
how Dazzle deals with DAS to see if you can steal anything from there.

Ideally, the SequenceBuilders called (eventually) by SeqIOTools should 
decide what kind of Sequence implementation you get back. For example, 
small sequences get SimpleSequence, mid-sized ones get PackedSymbolList, 
and really large ones get some kind of lazily loaded sequence.
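
A rough sketch of that size-based choice, in plain Java. Everything here is 
hypothetical: the class name, the enum and the two thresholds are made up for 
illustration and are not BioJava API. It also assumes the builder can learn 
the sequence length before it starts storing symbols, which a streaming 
parser may not be able to do.

```java
// Hypothetical sketch: these names are illustrative, not BioJava API.
final class StorageChooser {
    enum Storage { SIMPLE, PACKED, LAZY }

    // Assumed cut-offs, picked only for the example.
    static final long PACKED_FROM = 1_000_000L;    // 1 Mbase
    static final long LAZY_FROM = 100_000_000L;    // 100 Mbase

    /** Picks a storage strategy from the sequence length in symbols. */
    static Storage choose(long length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        if (length >= LAZY_FROM) return Storage.LAZY;
        if (length >= PACKED_FROM) return Storage.PACKED;
        return Storage.SIMPLE;
    }

    public static void main(String[] args) {
        long[] lengths = { 5_000L, 4_600_000L, 3_000_000_000L };
        for (long n : lengths) {
            System.out.println(n + " symbols -> " + choose(n));
        }
    }
}
```

A real builder would return a different Sequence implementation for each 
case; the enum only stands in for that decision.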

Before diving in, it would be interesting to know whether it is the 
sequence itself or the thousands of features that make large sequences 
problematic. If it's the features, you would need to lazy load those as 
well (which could be tricky).

- Mark

Aaron Darling <darling at cs.wisc.edu>
Sent by: biojava-l-bounces at portal.open-bio.org
07/04/2005 02:35 PM

        To:     biojava-l at biojava.org, Paul Infield-Harm <pinfield at cs.wisc.edu>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        Re: [Biojava-l] Dealing with huge sequences (was: "memory leak while 
reading nr.fasta")

Richard HOLLAND wrote:

>What is required for files this size is a SeqIOTools parser that reads
>sequence objects _on demand_ as requested by the iterator, rather than
>reading the whole lot at once. 
>
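
For illustration only, the on-demand behaviour Richard describes can be 
sketched in plain Java as an iterator that parses one FASTA record per call 
to next(). The class names LazyFastaIterator and Entry are hypothetical and 
are not part of BioJava; the sketch says nothing about how SeqIOTools is 
actually implemented.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch, not BioJava API: yields one FASTA record at a time.
final class LazyFastaIterator implements Iterator<LazyFastaIterator.Entry> {
    static final class Entry {
        final String name;
        final String residues;
        Entry(String name, String residues) {
            this.name = name;
            this.residues = residues;
        }
    }

    private final BufferedReader in;
    private String pendingHeader; // header of the next record, null at end of input

    LazyFastaIterator(BufferedReader in) {
        this.in = in;
        try {
            String line;
            // Skip anything that precedes the first '>' header line.
            while ((line = in.readLine()) != null && !line.startsWith(">")) { }
            pendingHeader = line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    @Override
    public Entry next() {
        if (pendingHeader == null) throw new NoSuchElementException();
        String name = pendingHeader.substring(1).trim();
        StringBuilder residues = new StringBuilder();
        try {
            String line;
            // Read only up to the next header; later records stay on disk.
            while ((line = in.readLine()) != null && !line.startsWith(">")) {
                residues.append(line.trim());
            }
            pendingHeader = line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Entry(name, residues.toString());
    }

    public static void main(String[] args) {
        String fasta = ">a first\nACGT\nAC\n>b\nTT\n";
        Iterator<Entry> it = new LazyFastaIterator(new BufferedReader(new StringReader(fasta)));
        while (it.hasNext()) {
            Entry e = it.next();
            System.out.println(e.name + " length=" + e.residues.length());
        }
    }
}
```

Only one record is held in memory at a time, provided the caller does not 
keep references to earlier entries.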
This brings up a related issue that I'm grappling with at the moment... 
I would like to have biojava parse a large sequence file and then 
periodically extract arbitrary subsequences.  As currently implemented, 
it seems that in order to extract a subsequence, the entire sequence 
entry must be loaded from the GenBank/FastA/whatever file into memory. 
This becomes a problem when dealing with large chromosomal data sets of 
the type displayed in the Mauve alignment viewer.  Yes, I'm aware of the 
PackedSymbolList.  Unfortunately, mammalian genomes are around 3 
gigabases, requiring around 700 MB each even with a 2-bits-per-base encoding.
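
The arithmetic behind that estimate, as a small Java sketch (the class and 
method names are made up for the example): 3,000,000,000 bases at 2 bits each 
is 6,000,000,000 bits, or 750,000,000 bytes, which is a little over 700 MiB.

```java
// Hypothetical helper for the back-of-the-envelope estimate above.
final class PackedSize {
    /** Bytes needed to store `bases` symbols at `bitsPerBase`, rounded up. */
    static long packedBytes(long bases, int bitsPerBase) {
        return (bases * bitsPerBase + 7) / 8;
    }

    public static void main(String[] args) {
        long bytes = packedBytes(3_000_000_000L, 2);
        System.out.println(bytes + " bytes");
        System.out.println((bytes / (1024.0 * 1024.0)) + " MiB");
    }
}
```

This counts only the packed symbols; any per-object overhead, annotations or 
features come on top of it.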

Given that it won't be practical to store the entire sequence in memory, 
the next best solution would be keeping an in-memory index of relevant 
sequence file offsets.  Enter BioJava's IndexStore.  Unless I've 
misunderstood the documentation, the IndexStore family of classes index 
sequence files on a per-contig/per-entry basis.  Such a scheme creates 
rather sparse indexes for chromosomes that can be > 100MB in length. 
What seems ideal would be an implementation of SeqIOTools that could 
read a GenBank/FastA file and construct a Sequence-derivative object 
with lazy references to the data.  The Sequence-derived class would also 
need mappings of sequence coordinates to file offsets so that reading a 
10-character subsequence n...n+9 doesn't require also reading 
subsequence 1...n-1.  I implemented a similar scheme in a small C++ 
library called libGenome years ago and it makes manipulating large data 
sets a breeze.
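
A minimal sketch of the coordinate-to-offset mapping, in plain Java. It is 
not BioJava API and not the libGenome implementation; the class FastaWindow 
and its parameters are hypothetical. It assumes a single-record FASTA file in 
which every sequence line except the last holds the same number of symbols, 
so the byte offset of a position can be computed instead of stored. Files 
with irregular line lengths would need a real index of line offsets.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not BioJava API. Assumes fixed-width sequence lines.
final class FastaWindow {
    /** Byte offset of the 1-based sequence position pos. */
    static long offsetOf(long pos, long dataStart, int basesPerLine, int eolBytes) {
        long zeroBased = pos - 1;
        long fullLinesBefore = zeroBased / basesPerLine;
        return dataStart + zeroBased + fullLinesBefore * eolBytes;
    }

    /** Reads positions start..end (1-based, inclusive) without reading 1..start-1. */
    static String read(Path file, long dataStart, int basesPerLine, int eolBytes,
                       long start, long end) throws IOException {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("bad range " + start + ".." + end);
        }
        long from = offsetOf(start, dataStart, basesPerLine, eolBytes);
        long to = offsetOf(end, dataStart, basesPerLine, eolBytes);
        byte[] raw = new byte[Math.toIntExact(to - from + 1)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(from);
            raf.readFully(raw); // EOFException if the range runs past the file
        }
        StringBuilder out = new StringBuilder(raw.length);
        for (byte b : raw) {
            // Drop line terminators that fall inside the requested window.
            if (b != '\n' && b != '\r') out.append((char) b);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".fa");
        String header = ">chr\n";
        Files.write(tmp, (header + "ACGT\nTTGA\nCC\n").getBytes(StandardCharsets.US_ASCII));
        // dataStart is the first byte after the header line; 4 bases per line, "\n" endings.
        System.out.println(read(tmp, header.length(), 4, 1, 3, 6));
        Files.delete(tmp);
    }
}
```

The only state a lazy Sequence would need to keep in memory under these 
assumptions is dataStart, the line width and the line-terminator length.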

Echoing Richard's question for this slightly different problem:

>Can someone clarify if a lazy-loading parser/database implementation
>already exists for situations like this, or does one need to be written?
>
Thanks for Biojava, and thanks for any feedback
-Aaron

btw: I also brought this up at the BOSC biojava BOF but we were rather 
abruptly ushered out of the meeting room by an anxious hotel staffer 
prior to reaching a conclusion.
_______________________________________________
Biojava-l mailing list  -  Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l