[Biojava-l] performance problems with BJX

Mark Schreiber markjschreiber at gmail.com
Tue Nov 4 12:05:34 UTC 2008


I would have thought the symbol packing threshold would have kicked in,
which should reduce the memory load of a sequence.

- Mark
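
To illustrate why symbol packing reduces the memory load, here is a minimal sketch of the general idea: storing each unambiguous DNA base in 2 bits instead of one object reference. It is not BioJava's own packing code. The class name PackedDna and its methods are hypothetical, and the sketch assumes a plain A/C/G/T alphabet; a real chromosome also contains N and other ambiguity symbols, which need more bits per base.

```java
/**
 * Hypothetical 2-bit packed DNA store (illustration only, not a BioJava class).
 * Assumes the sequence contains only A, C, G and T.
 */
class PackedDna {
    // The position of a base in this string is its 2-bit code.
    private static final String ALPHABET = "ACGT";

    private final long[] words; // 32 bases fit in one 64-bit word
    private final int length;

    PackedDna(String bases) {
        length = bases.length();
        words = new long[(length + 31) / 32];
        for (int i = 0; i < length; i++) {
            int code = ALPHABET.indexOf(Character.toUpperCase(bases.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException(
                        "unsupported symbol at " + i + ": " + bases.charAt(i));
            }
            // Word index is i / 32; bit offset inside the word is (i % 32) * 2.
            words[i >>> 5] |= ((long) code) << ((i & 31) * 2);
        }
    }

    char baseAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int code = (int) ((words[i >>> 5] >>> ((i & 31) * 2)) & 3L);
        return ALPHABET.charAt(code);
    }

    int length() {
        return length;
    }

    /** Bytes used by the packed array itself, excluding object headers. */
    long payloadBytes() {
        return words.length * 8L;
    }

    public static void main(String[] args) {
        PackedDna dna = new PackedDna("ACGTTGCA");
        StringBuilder decoded = new StringBuilder();
        for (int i = 0; i < dna.length(); i++) {
            decoded.append(dna.baseAt(i));
        }
        System.out.println(decoded + " stored in " + dna.payloadBytes() + " bytes");
    }
}
```

At 2 bits per base, 250,000,000 bases would need about 62.5MB, compared with roughly 2GB as 8-byte references.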

On Tue, Nov 4, 2008 at 7:58 PM, Richard Holland
<holland at eaglegenomics.com> wrote:
> BioJava stores sequences as a list of symbol objects. Each object is a
> singleton, and so the sequence is essentially a list of pointers.
>
> I guess you're running on a 64-bit machine. One pointer on a 64-bit
> machine is 64 bits = 8 bytes.
>
> Your GenBank file for human chromosome 1 is probably going to have
> about 250,000,000 bases or so. Each of those bases, as a character in a
> string, takes 1 or 2 bytes (8 or 16 bits) depending on the encoding. So
> for a string, or file, encoded using 8 bits per character, you're going
> to get 1 byte per base = approx. 250MB, plus feature and annotation
> data. That sounds about right given that your file is 300MB.
>
> As a list of pointers to singletons on a 64-bit machine, you're going
> to end up using 8 bytes per base instead of 1 byte. 8 * 250MB = 2Gig.
>
> This explains the first 2 gigabytes. Where the other 13 are coming
> from I'm not sure, but I wouldn't be surprised if they're the
> features.
>
> cheers,
> Richard
>
> PS. The new BJ3 will only convert sequences from String to Symbols
> when explicitly requested to do so - this will help save memory (and
> time) when doing simple operations such as copying files into
> databases without any intermediate processing.
>
>
>
> 2008/11/4 Mark Schreiber <markjschreiber at gmail.com>:
>> Object relational mapping can be memory hungry but this does seem
>> expensive. Probably the easiest way to track the problem is to run a
>> memory profiler (such as the one in NetBeans) and see if there is
>> either some very large object or a proliferation of thousands of small
>> objects which would point to a memory leak.
>>
>> Best regards,
>>
>> - Mark
>>
>> On Tue, Nov 4, 2008 at 4:29 PM, Gabrielle Doan <gabrielle_doan at gmx.net> wrote:
>>>
>>> Dear all,
>>> I have some performance problems with BioJavaX.
>>>
>>> I wanted to add human chromosome 1 to my MySQL database with the built-in method addRichSequence from org.biojavax.bio.db.biosql.BioSQLRichSequenceDB and wondered why this method uses so much memory. I started my program with -Xmx8G but got a Java heap space message on my console. So I used -Xmx15G to insert the chromosome (the range between 8G and 15G was not tested).
>>>
>>> I wonder why a genbank file of about 300 MB needs about 15G memory to be added into the database.
>>>
>>> Does anybody have the same problem, or can anyone give me any advice?
>>>
>>> Cheers, Gabrielle
>>>
>>> _______________________________________________
>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>
> --
> Richard Holland, BSc MBCS
> Finance Director, Eagle Genomics Ltd
> M: +44 7500 438846 | E: holland at eaglegenomics.com
> http://www.eaglegenomics.com/
>
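
The back-of-the-envelope estimate quoted above (one 8-byte reference per base versus one byte per base in the file) can be reproduced with a short sketch. The class and method names are hypothetical. The 8-byte reference size is the assumption made in the thread; a JVM running with compressed references would use 4 bytes per reference instead.

```java
/** Hypothetical helper that reproduces the estimate from the thread. */
class SequenceMemoryEstimate {

    /** Bytes needed to hold one object reference per base. */
    static long bytesAsReferences(long bases, int bytesPerReference) {
        return bases * bytesPerReference;
    }

    /** Bytes needed to hold one character per base in a single-byte encoding. */
    static long bytesAsText(long bases) {
        return bases;
    }

    public static void main(String[] args) {
        long bases = 250_000_000L;   // approximate length of human chromosome 1, per the thread
        int bytesPerReference = 8;   // assumption: uncompressed 64-bit references

        System.out.println("as text:       " + bytesAsText(bases) + " bytes");
        System.out.println("as references: "
                + bytesAsReferences(bases, bytesPerReference) + " bytes");
    }
}
```

This counts only the list of references to the singleton symbols. It does not include the features and annotations, which the thread suspects account for the remaining memory.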


