[Biojava-dev] [BioJava - Bug #3345] (Closed) Static object cache in SimpleRichObjectBuilder causing memory leak

Tjeerd Boerman twboerman at gmail.com
Fri Apr 27 02:08:37 UTC 2012


Hey,

I cannot reopen the bug myself, but I'm sure Andreas or another manager 
of the bugtracker can.

For my purposes the memory usage could be kept in check by disabling the 
cache in SimpleRichObjectBuilder. It's hard to say what the performance 
hit is, but it's not been noticeable in my application. A permanent 
solution needs some more thought.

I double-checked, and all calls to SimpleObjectBuilder.getObject() go 
through RichObjectFactory.getObject() - it effectively delegates the 
call to the SimpleObjectBuilder. The SimpleObjectBuilder class keeps a 
cache of unrestricted size, with one object for every unique combination 
of class and parameters. The RichObjectFactory class also keeps a cache 
of objects, but this cache is restricted to containing instances for at 
most 20 different object classes at any given time.
So we can think of this as a level 1 cache in RichObjectFactory, and a 
level 2 cache in SimpleObjectBuilder. The level 1 cache may contain at 
most 20 entries, and the level 2 cache may contain at most the amount of 
instantiable classes in BioJava, which will be a couple hundred 
*maximum*. I don't think this difference justifies the use of two caches.

Possible fix:
Unless I missed something, removing the cache from SimpleObjectBuilder 
and increasing the size of the LRU cache in RichObjectFactory to 
compensate seems like a good option. This would solve the unbounded 
growth of the level 2 cache by eliminating it.



There is another issue with the level 1 cache that could cause unbounded 
growth. Look at this piece of code in RichObjectFactory.getObject():

    Map m = (Map)cache.get(applicationClass);
    if (!m.containsKey(paramsList)) {
        m.put(paramsList,builder.buildObject(applicationClass, paramsList));
    }

For every unique combination of parameters, the object is built and 
stored in the Map associated with that class. This Map is not a LRU 
implementation, so creating 10,000,000 instances of a class (i.e. 
SimpleDocRef) with different parameters would cause every single 
instance to be saved in the cache. Such a situation would again cause 
memory problems.

Possible fix:
Well that's an easy one! Replace this simple Map cache with a LRU 
implementation, as proposed before.

Regards,
Tjeerd


On 4/27/2012 12:03 AM, George Waldon wrote:
> Hi Tjeerd,
>
> You can certainly reopen a bug.
>
> I am not the expert here, but looking at the code, RichObjectFactory 
> uses a LRU cache as a temporary cache for objects that are reused 
> often while SimpleRichObjectBuilder caches everything. I doubt you can 
> remove this last one.
>
> Maybe you can write your own RichObjectBuilder that would cache only 
> things you are interested in (to compare?). I bet you generate a lot 
> of DocRef that do not need to be cached; then use 
> RichObjectFactory.setRichObjectBuilder to set your partial builder 
> before you parse. Just an idea.
>
> Hope this helps,
> George
>
> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>
>> Wow. I'm quite emberassed, but the issue is actually not fixed in 
>> 1.8.2. The RichObjectFactory class uses an LRU cache, while the 
>> SimpleRichObjectBuilder class uses the cache described in my earlier 
>> post, and I mixed up the two. Sorry about that, I must have been 
>> staring at this code for too long today.
>>
>> It turns out the RichObjectFactory class is the only user of 
>> SimpleRichObjectBuilder, and has its own LRU cache. So the cache in 
>> SimpleRichObjectBuilder can probably be removed altogether. If the 
>> experts agree I would be happy to write a patch for this.
>>
>> Can the old bug report be reopened, or should I file a new bug 
>> report? Or should I submit a possible patch through this mailing list?
>>
>> Regards,
>> Tjeerd
>>
>> On 4/26/2012 6:01 PM, redmine at redmine.open-bio.org wrote:
>>> Issue #3345 has been updated by Andreas Prlic.
>>>
>>>  * Status changed from New to Closed
>>>  * % Done changed from 0 to 100
>>>
>>> already fixed in 1.8.2
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>>
>>>  Bug #3345: Static object cache in SimpleRichObjectBuilder causing
>>>  memory leak <https://redmine.open-bio.org/issues/3345>
>>>
>>>  * Author: Tjeerd Boerman
>>>  * Status: Closed
>>>  * Priority: Normal
>>>  * Assignee: biojava-dev list
>>>  * Category: bio
>>>  * Target version: BioJava 1.8 - legacy
>>>  * URL:
>>>
>>> I encountered a memory problem when parsing many Genbank files with 
>>> the Biojava 1.8.1. The parsed files were protein GPFF (GenPept Flat 
>>> File format) files from the latest RefSeq release. The application 
>>> tried to parse millions of protein sequences from these files, but 
>>> an OutOfMemoryException would always occur after some time. The used 
>>> heap space would gradually increase from a couple hundred megabytes 
>>> to over 1.5 GB, until the heap could grow no further. Upon 
>>> inspection I discovered a HashMap in RichSequenceBuilder was the 
>>> culprit:
>>>
>>> public class SimpleRichObjectBuilder implements RichObjectBuilder {
>>>
>>>     private static Map objects = new HashMap();
>>>
>>>     public Object buildObject(Class clazz, List paramsList) {
>>>         ...
>>>
>>>         // return the constructed object from the hashmap if there 
>>> already
>>>         if (contents.containsKey(ourParamsList)) return 
>>> contents.get(ourParamsList);
>>>
>>>         ...
>>>
>>>         // Instantiate it with the parameters
>>>         Object o = c.newInstance(ourParamsList.toArray());
>>>
>>>         // store it for later in the singleton map
>>>         contents.put(ourParamsList, o);
>>>
>>>         ...
>>>     }
>>> }
>>>
>>> It seems the *objects* Map in SimpleRichSequenceBuilder is used as a 
>>> static cache for objects, but when many different objects are 
>>> created this cache grows out of control. I am unsure if this is a 
>>> 'true' bug, but for my application it was a definite problem. My fix 
>>> was to simply comment out the *contents.put()* statement, but I'm 
>>> sure there is a better way to resolve this - perhaps by making the 
>>> use of the cache optional through a configuration option.
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>> You have received this notification because you have either 
>>> subscribed to it, or are involved in it.
>>> To change your notification preferences, please click here and 
>>> login: http://redmine.open-bio.org
>>>
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>
>
>
>
> --------------------------------
> George Waldon
>



More information about the biojava-dev mailing list