[Biojava-l] performance problems with BJX

Tue Nov 4 11:58:18 UTC 2008

BioJava stores sequences as a list of symbol objects. Each object is a
singleton, and so the sequence is essentially a list of pointers.

I guess you're running on a 64-bit machine. One pointer on a 64-bit
machine is 64 bits = 8 bytes.

Your GenBank file for human chromosome 1 is probably going to have
about 250,000,000 bases or so. Each of those bases, as a character in a
string, takes 1 or 2 bytes (8 or 16 bits) depending on encoding. So in
a string, or file, encoded using 8 bits per character, you're going to
get 1 byte per base = approx. 250MB, plus feature and annotation data.
That sounds about right given that your file is 300MB.

As a list of pointers to singletons on a 64-bit machine, you're going
to end up using 8 bytes per base instead of 1 byte. 8 bytes *
250,000,000 bases = 2GB.
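
The arithmetic above can be written out as a rough sketch. This is only
a back-of-the-envelope estimate, not anything from BioJava itself: the
class and method names are made up for illustration, the base count is
the approximate figure quoted above, and the 8-byte reference size is an
assumption about a 64-bit JVM. Overhead from the list object itself is
ignored.

```java
public class SequenceMemoryEstimate {

    /** Bytes needed to hold one reference per base (list overhead ignored). */
    public static long referenceListBytes(long bases, int bytesPerReference) {
        return bases * bytesPerReference;
    }

    /** Bytes needed to hold the bases as text at a fixed width per character. */
    public static long textBytes(long bases, int bytesPerChar) {
        return bases * bytesPerChar;
    }

    public static void main(String[] args) {
        // Approximate length of human chromosome 1, as estimated above.
        long bases = 250_000_000L;

        // 1 byte per base in an 8-bit encoded file or string.
        System.out.printf("As 8-bit text:      %d MB%n",
                textBytes(bases, 1) / 1_000_000);

        // 2 bytes per base if held as 16-bit characters.
        System.out.printf("As 16-bit text:     %d MB%n",
                textBytes(bases, 2) / 1_000_000);

        // 8 bytes per base as references on a 64-bit machine (assumed size).
        System.out.printf("As 8-byte pointers: %d MB%n",
                referenceListBytes(bases, 8) / 1_000_000);
    }
}
```

Changing the second argument of referenceListBytes shows how the
estimate scales with the pointer size on a given machine.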

This explains the first 2 gigabytes. Where the other 13 are coming
from I'm not sure, but I wouldn't be surprised if they're the
features.

cheers,
Richard

PS. The new BJ3 will only convert sequences from String to Symbols
when explicitly requested to do so - this will help save memory (and
time) when doing simple operations such as copying files into
databases without any intermediate processing.

2008/11/4 Mark Schreiber <markjschreiber at gmail.com>:
> Object relational mapping can be memory hungry but this does seem
> expensive. Probably the easiest way to track the problem is to run a
> memory profiler (such as the one in Netbeans) and see if there is
> either some very large object or a proliferation of thousands of small
> objects which would point to a memory leak.
>
> Best regards,
>
> - Mark
>
> On Tue, Nov 4, 2008 at 4:29 PM, Gabrielle Doan <gabrielle_doan at gmx.net> wrote:
>>
>> Dear all,
>> I have some performance problems with BioJavaX.
>>
>> I wanted to add human chromosome 1 to my MySQL database with the built-in method addRichSequence from org.biojavax.bio.db.biosql.BioSQLRichSequenceDB, and wondered why this method uses so much memory. I started my program with -Xmx8G but got a Java heap space message on my console. So I used -Xmx15G to insert the chromosome (the range between 8G and 15G was not tested).
>>
>> I wonder why a GenBank file of about 300 MB needs about 15G of memory to be added to the database.
>>
>> Does anybody have the same problem, or can anybody give me any advice?
>>
>> Cheers, Gabrielle
>>
>> _______________________________________________
>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/