[Biojava-l] Dealing with huge sequences (was: "memory leak while reading nr.fasta")
Aaron Darling
darling at cs.wisc.edu
Mon Jul 4 02:35:17 EDT 2005
Richard HOLLAND wrote:
>What is required for files this size is a SeqIOTools parser that reads
>sequence objects _on demand_ as requested by the iterator, rather than
>reading the whole lot at once.
>
This brings up a related issue that I'm grappling with at the moment...
I would like to have BioJava parse a large sequence file and then
periodically extract arbitrary subsequences. As currently implemented,
it seems that in order to extract a subsequence, the entire sequence
entry must be loaded from the GenBank/FastA/whatever file into memory.
This becomes a problem when dealing with large chromosomal data sets of
the type displayed in the Mauve alignment viewer. Yes, I'm aware of
PackedSymbolList, but mammalian genomes are around 3 gigabases, so each
one still requires around 700MB even with a 2-bits-per-base encoding.
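For reference, the arithmetic behind that estimate can be sketched as below. The class and method names are made up for illustration and are not part of BioJava; 3 gigabases at 2 bits per base works out to 750,000,000 bytes, or roughly 715 MiB.

```java
// Hypothetical helper, not a BioJava class: size of a bit-packed sequence.
class PackedSizeSketch {
    /** Bytes needed to store `bases` symbols at `bitsPerBase` bits each. */
    static long packedBytes(long bases, int bitsPerBase) {
        // Round up so that a partially filled final byte is still counted.
        return (bases * bitsPerBase + 7) / 8;
    }
}
```

Calling PackedSizeSketch.packedBytes(3_000_000_000L, 2) gives the per-genome figure, before any object overhead.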
Given that it won't be practical to store the entire sequence in memory,
the next best solution would be keeping an in-memory index of relevant
sequence file offsets. Enter BioJava's IndexStore. Unless I've
misunderstood the documentation, the IndexStore family of classes indexes
sequence files on a per-contig/per-entry basis. Such a scheme creates
rather sparse indexes for chromosomes that can be > 100 megabases in length.
What seems ideal would be an implementation of SeqIOTools that could
read a GenBank/FastA file and construct a Sequence-derivative object
with lazy references to the data. The Sequence-derived class would also
need mappings of sequence coordinates to file offsets so that reading a
10-character subsequence n...n+9 doesn't also require reading
subsequence 1...n-1. I implemented a similar scheme in a small C++
library called libGenome years ago and it makes manipulating large data
sets a breeze.
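To make the coordinate-to-offset idea concrete, here is a minimal sketch in plain Java. It is not BioJava or libGenome code, and the class name is invented. It assumes a single-record FastA file in a single-byte encoding whose sequence lines all have the same width (except possibly the last), so a base's file offset can be computed instead of stored. A GenBank file or a FastA file with ragged lines would need a real per-line index.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: lazy subsequence reads from a fixed-width FastA file.
class LazyFastaSketch {
    private final RandomAccessFile file;
    private final long dataStart;    // byte offset of the first base
    private final int basesPerLine;  // residues on each full sequence line
    private final int bytesPerLine;  // residues plus the line terminator

    LazyFastaSketch(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        if (file.readLine() == null) {           // skip the '>' header line
            throw new IOException("empty file");
        }
        dataStart = file.getFilePointer();
        String firstLine = file.readLine();      // measure one sequence line
        if (firstLine == null || firstLine.isEmpty()) {
            throw new IOException("no sequence data");
        }
        basesPerLine = firstLine.length();
        bytesPerLine = (int) (file.getFilePointer() - dataStart);
    }

    /** File offset of the base at 1-based sequence coordinate pos. */
    long offsetOf(long pos) {
        long idx = pos - 1;
        return dataStart + (idx / basesPerLine) * bytesPerLine + idx % basesPerLine;
    }

    /** Reads bases start..end (1-based, inclusive) without touching 1..start-1. */
    String subStr(long start, long end) throws IOException {
        long from = offsetOf(start);
        byte[] raw = new byte[(int) (offsetOf(end) - from + 1)];
        file.seek(from);
        file.readFully(raw);                     // EOFException if end is past the data
        StringBuilder sb = new StringBuilder(raw.length);
        for (byte b : raw) {
            if (b != '\n' && b != '\r') {        // drop embedded line terminators
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    void close() throws IOException {
        file.close();
    }
}
```

With something like this, new LazyFastaSketch("chr1.fa").subStr(n, n + 9) costs one seek and one short read regardless of n. Only three numbers are kept in memory per entry, which is the property I'd want from a lazy Sequence implementation.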
Echoing Richard's question for this slightly different problem:
>Can someone clarify if a lazy-loading parser/database implementation
>already exists for situations like this, or does one need to be written?
Thanks for Biojava, and thanks for any feedback
-Aaron
btw: I also brought this up at the BOSC BioJava BOF but we were rather
abruptly ushered out of the meeting room by an anxious hotel staffer
prior to reaching a conclusion.