[Biojava-l] performance problems with BJX
holland at eaglegenomics.com
Tue Nov 4 11:58:18 UTC 2008
BioJava stores sequences as a list of symbol objects. Each object is a
singleton, and so the sequence is essentially a list of pointers.
I guess you're running on a 64-bit machine. One pointer on a 64-bit
machine is 64 bits = 8 bytes.
Your genbank file for human chromosome 1 is probably going to have
about 250,000,000 bases or so. Each of those bases, as a string, takes
1 or 2 bytes (8 or 16 bits) depending on encoding. So for a string, or
file, encoded using 8 bits, you're going to get 1 byte per base =
approx. 250MB, plus feature and annotation data. That sounds about
right given that your file is 300MB.
As a list of pointers to singletons on a 64-bit machine, you're going
to end up using 8 bytes per base instead of 1 byte. 8 * 250MB = 2Gig.
This explains the first 2 gigabytes. Where the other 13 are coming
from I'm not sure, but I wouldn't be surprised if they're the
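
For reference, the back-of-the-envelope arithmetic above can be written out as a small Java sketch. The base count is just the rough figure quoted in this message, the 8 bytes per reference assumes uncompressed 64-bit pointers as described above, and the class and method names are made up for illustration.

```java
// Illustrative only: reproduces the rough estimate from this message.
class SequenceMemoryEstimate {

    // Bytes needed to hold `bases` items at `bytesPerBase` bytes each.
    static long estimateBytes(long bases, int bytesPerBase) {
        return bases * bytesPerBase;
    }

    public static void main(String[] args) {
        long bases = 250_000_000L; // approximate length of human chromosome 1 (assumed)

        long asText = estimateBytes(bases, 1);     // 8-bit encoded text: 1 byte per base
        long asPointers = estimateBytes(bases, 8); // one 64-bit reference per base

        System.out.println("as text:     " + asText / 1_000_000 + " MB");
        System.out.println("as pointers: " + asPointers / 1_000_000 + " MB");
    }
}
```

This only counts the references themselves; list overhead, features and annotations would come on top.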
PS. The new BJ3 will only convert sequences from String to Symbols
when explicitly requested to do so - this will help save memory (and
time) when doing simple operations such as copying files into
databases without any intermediate processing.
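
A minimal sketch of the "convert only when asked" idea, to show why it saves memory. This is not the BJ3 API: the class `LazySequence` and its methods are hypothetical, and `Character` stands in for BioJava's symbol objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the BJ3 API.
class LazySequence {
    private final String text;        // compact form, always kept
    private List<Character> symbols;  // expanded form, built on first request only

    LazySequence(String text) {
        this.text = text;
    }

    // Cheap: no conversion happens, e.g. when copying a file into a database.
    String asString() {
        return text;
    }

    // True once the expanded list has been built.
    boolean isConverted() {
        return symbols != null;
    }

    // Expensive: builds one list entry (a reference) per base, then caches the list.
    List<Character> asSymbols() {
        if (symbols == null) {
            List<Character> list = new ArrayList<>(text.length());
            for (int i = 0; i < text.length(); i++) {
                list.add(text.charAt(i)); // autoboxed; ASCII values are shared cached instances
            }
            symbols = list;
        }
        return symbols;
    }
}
```

Code that only calls `asString()` never pays for the per-base list of references.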
2008/11/4 Mark Schreiber <markjschreiber at gmail.com>:
> Object relational mapping can be memory hungry but this does seem
> expensive. Probably the easiest way to track the problem is to run a
> memory profiler (such as the one in Netbeans) and see if there is
> either some very large object or a proliferation of thousands of small
> objects which would point to a memory leak.
> Best regards,
> - Mark
> On Tue, Nov 4, 2008 at 4:29 PM, Gabrielle Doan <gabrielle_doan at gmx.net> wrote:
>> Dear all,
>> I have some performance problems with BioJavaX.
>> I wanted to add human chromosome 1 to my MySQL database with the built-in method addRichSequence from org.biojavax.bio.db.biosql.BioSQLRichSequenceDB and wondered why this method uses so much memory. I started my program with -Xmx8G but I got a Java heap space message on my console. So I used -Xmx15G to insert the chromosome (the range between 8G and 15G was not tested).
>> I wonder why a GenBank file of about 300 MB needs about 15G of memory to be added to the database.
>> Does anybody have the same problem, or can anyone give me any advice?
>> Cheers, Gabrielle
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com