[Biojava-l] Dealing with huge sequences (was: "memory leak while reading nr.fasta")

Mon Jul 4 02:50:09 EDT 2005

I should probably mention some comments Mark made to me privately here -
that the current fileToBiojava method _does_ read on demand, and
sequentially, as opposed to buffered random access as I originally
thought it did. 
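
For anyone following along, the on-demand, sequential behaviour described
above can be sketched in plain Java as follows. This is only an
illustration and not BioJava's actual code: the class name
StreamingFastaIterator is made up, it handles FASTA only, and it returns
bare {name, residues} string pairs rather than Sequence objects. The
point is that each call to next() parses exactly one record, so nothing
is retained once the caller drops its reference.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical streaming FASTA reader (not a BioJava class). */
final class StreamingFastaIterator implements Iterator<String[]> {
    private final BufferedReader in;
    private String pendingHeader; // header of the next record, or null at end of input

    StreamingFastaIterator(Reader reader) {
        this.in = new BufferedReader(reader);
        try {
            // Skip anything before the first ">" header line.
            String line = in.readLine();
            while (line != null && !line.startsWith(">")) {
                line = in.readLine();
            }
            pendingHeader = line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return pendingHeader != null;
    }

    /** Parses and returns the next record as {name, residues}. */
    @Override
    public String[] next() {
        if (pendingHeader == null) {
            throw new NoSuchElementException();
        }
        String name = pendingHeader.substring(1).trim();
        StringBuilder residues = new StringBuilder();
        try {
            // Read residue lines until the next header or end of input.
            String line = in.readLine();
            while (line != null && !line.startsWith(">")) {
                residues.append(line.trim());
                line = in.readLine();
            }
            pendingHeader = line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] { name, residues.toString() };
    }
}
```

Memory use here is bounded by the largest single record, which is fine
for nr but obviously not for a whole chromosome in one entry.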

The memory leak is in fact a mystery - I can't find any trace in the
code to suggest that Biojava is holding internal references to Sequence
objects read by fileToBiojava. The BJIA example _should_ work without
any problems even on large files such as nr. Mark suggested a profiler
would be useful. Does somebody have access to one?

Apologies if I misled anyone.

Anyhow, on to Aaron's points... 

A lazy-loading sequence object shouldn't be too much trouble at first
glance. It would have to (a) know which file it came from, (b) know the
format of that file, and (c) keep in memory each part it loads as we go
along, unless told otherwise, so that repeated accesses don't trigger
duplicate reads. This is, however, fundamentally different from the way
files are currently parsed in BioJava, so I'm not sure how it would
actually work in practice.
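
To make the idea a bit more concrete, here is a rough sketch of points
(a) and (b) in plain Java. It is not BioJava code and makes some big
assumptions: the class name LazyFastaSequence is invented, the file holds
a single FASTA record, and every residue line has the same width except
possibly the last. Under those assumptions a sequence coordinate maps to
a file offset by simple arithmetic, so fetching a short subsequence never
touches the residues before it. Caching as in (c) is left out to keep it
short.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical lazy view of a single-record, fixed-line-width FASTA file. */
final class LazyFastaSequence implements AutoCloseable {
    private final RandomAccessFile file; // (a) the backing file
    private final long dataStart;        // byte offset of the first residue
    private final int lineWidth;         // residues per full line
    private final int lineBytes;         // residues plus line terminator
    private final long length;           // total number of residues

    LazyFastaSequence(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        String header = file.readLine();
        if (header == null || !header.startsWith(">")) {
            file.close();
            throw new IOException("not a FASTA file: " + path);
        }
        dataStart = file.getFilePointer();
        String firstLine = file.readLine();
        if (firstLine == null || firstLine.isEmpty()) {
            file.close();
            throw new IOException("no residues in " + path);
        }
        lineWidth = firstLine.length();
        lineBytes = (int) (file.getFilePointer() - dataStart);
        // Find the end of the residue data, ignoring trailing line terminators.
        long end = file.length();
        while (end > dataStart) {
            file.seek(end - 1);
            int b = file.read();
            if (b != '\n' && b != '\r') {
                break;
            }
            end--;
        }
        long dataBytes = end - dataStart;
        length = (dataBytes / lineBytes) * lineWidth + dataBytes % lineBytes;
    }

    long length() {
        return length;
    }

    /** Maps a 0-based sequence coordinate to its byte offset in the file. */
    private long offsetOf(long pos) {
        return dataStart + (pos / lineWidth) * lineBytes + pos % lineWidth;
    }

    /** Returns residues in [start, end), 0-based, reading only the bytes needed. */
    String subSequence(long start, long end) throws IOException {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        if (start == end) {
            return "";
        }
        long first = offsetOf(start);
        byte[] raw = new byte[(int) (offsetOf(end - 1) - first + 1)];
        file.seek(first);
        file.readFully(raw);
        StringBuilder sb = new StringBuilder((int) (end - start));
        for (byte b : raw) {
            if (b != '\n' && b != '\r') {
                sb.append((char) b); // drop line terminators inside the span
            }
        }
        return sb.toString();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

GenBank or multi-record files would need a real coordinate-to-offset
index built on a first pass instead of the fixed-width arithmetic, which
is where the format awareness in (b) gets harder.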

Any takers?

Richard Holland
Bioinformatics Specialist
GIS extension 8199

> -----Original Message-----
> From: biojava-l-bounces at portal.open-bio.org 
> [mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of 
> Aaron Darling
> Sent: Monday, July 04, 2005 2:35 PM
> To: biojava-l at biojava.org; Paul Infield-Harm
> Subject: Re: [Biojava-l] Dealing with huge sequences (was: 
> "memory leak while reading nr.fasta")
> 
> 
> Richard HOLLAND wrote:
> 
> >What is required for files this size is a SeqIOTools parser that reads
> >sequence objects _on demand_ as requested by the iterator, rather than
> >reading the whole lot at once.
> >
> This brings up a related issue that I'm grappling with at the moment...
> I would like to have biojava parse a large sequence file and then
> periodically extract arbitrary subsequences.  As currently implemented,
> it seems that in order to extract a subsequence, the entire sequence
> entry must be loaded from the GenBank/FastA/whatever file into memory.
> This becomes a problem when dealing with large chromosomal data sets of
> the type displayed in the Mauve alignment viewer.  Yes, I'm aware of the
> PackedSymbolList.  Unfortunately, mammalian genomes are around 3
> gigabases, requiring around 700MB each using a 2 bits per base encoding.
> 
> Given that it won't be practical to store the entire sequence in memory,
> the next best solution would be keeping an in-memory index of relevant
> sequence file offsets.  Enter BioJava's IndexStore.  Unless I've
> misunderstood the documentation, the IndexStore family of classes index
> sequence files on a per-contig/per-entry basis.  Such a scheme creates
> rather sparse indexes for chromosomes that can be > 100MB in length.
> What seems ideal would be an implementation of SeqIOTools that could
> read a GenBank/FastA file and construct a Sequence-derivative object
> with lazy references to the data.  The Sequence-derived class would also
> need mappings of sequence coordinates to file offsets so that reading a
> 10 character subsequence n...n+10 doesn't require also reading
> subsequence 1...n-1.  I implemented a similar scheme in a small c++
> library called libGenome years ago and it makes manipulating large data
> sets a breeze.
> 
> Echoing Richard's question for this slightly different problem:
> 
> >Can someone clarify if a lazy-loading parser/database implementation
> >already exists for situations like this, or does one need to be written?
> >
> Thanks for Biojava, and thanks for any feedback
> -Aaron
> 
> btw: I also brought this up at the BOSC biojava BOF but we were rather
> abruptly ushered out of the meeting room by an anxious hotel staffer
> prior to reaching a conclusion.