[Biojava-dev] [BioJava - Bug #3345] (Closed) Static object cache in SimpleRichObjectBuilder causing memory leak

Andreas Prlic andreas at sdsc.edu
Sat May 12 02:36:55 UTC 2012


awesome!

Andreas

On Fri, May 11, 2012 at 5:04 PM, Scooter Willis <HWillis at scripps.edu> wrote:
> Andreas
>
> I will take a look at what is involved.
>
> Scooter
>
>
> ----- Reply message -----
> From: "Andreas Prlic" <andreas at sdsc.edu>
> To: "George Waldon" <gwaldon at geneinfinity.org>
> Cc: "biojava-dev at lists.open-bio.org" <biojava-dev at lists.open-bio.org>
> Subject: [Biojava-dev] [BioJava - Bug #3345] (Closed) Static object cache in
> SimpleRichObjectBuilder causing memory leak
> Date: Fri, May 11, 2012 7:06 pm
>
>
>
> Hi,
>
> We really need somebody to write a new genebank parser for the biojava
> 3 world.... Any volunteers? :-)
>
> Andreas
>
>
>
>
>
> On Fri, May 11, 2012 at 2:16 PM, George Waldon <gwaldon at geneinfinity.org>
> wrote:
>> Tjeerd,
>>
>> It is not a bug, it is a design for a singleton map to interact with a
>> database to actually address the issue you had in the first place. If you
>> change it, hibernate sessions may not work properly or memory issues may
>> arise if too many sequences are loaded at the same time because the
>> current
>> code also prevent duplicate objects.
>>
>> What about submitting your own version of RichObjectBuilder with a clear
>> description of when and how to use it?
>>
>>
>> - George
>>
>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>
>>> Hi,
>>>
>>> I've fixed the memory issue for me personally, but it may be a problem
>>> for
>>> future users, hence the question.
>>>
>>> As far I can see changing the default caching strategy will not break any
>>> contracts. It simply means taking a small (negligible?) performance hit
>>> in
>>> order to be able to run in a bounded memory space. Would you agree?
>>>
>>> Tjeerd
>>>
>>> On 5/11/2012 5:34 PM, George Waldon wrote:
>>>>
>>>> Hi Tjeerd,
>>>>
>>>> Bj1x was designed and written by Richard Holland; that would be the
>>>> person to ask directly I guess:-).
>>>>
>>>> It is not desirable to change behavior retroactively. Are you still
>>>> facing memory issues? Could you be more explicit on your problem?
>>>>
>>>> There are several ways we could adjust the current code without breaking
>>>> the original design; e.g. use of service providers or implement a
>>>> getRichObjectBuilder method to swap the builder for a particular job.
>>>> Just
>>>> thoughts.
>>>>
>>>> - George
>>>>
>>>>
>>>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>>>
>>>>> Any thoughts on this from the Biojava authors? I strongly believe that
>>>>> the default behavior of BioJava should not be "unbounded memory usage
>>>>> unless
>>>>> you implement your own RichObjectBuilder".
>>>>>
>>>>> - Tjeerd
>>>>>
>>>>> On 4/27/2012 3:45 PM, George Waldon wrote:
>>>>>>
>>>>>> You can use RichObjectFactory.setRichObjectBuilder(RichObjectBuilder
>>>>>> b)
>>>>>> to enable you own builder.
>>>>>>
>>>>>> -  George
>>>>>>
>>>>>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>>>>>
>>>>>>> Hi George,
>>>>>>>
>>>>>>> A very good point. While the specification of the RichObjectBuilder
>>>>>>> interface does not *enforce* that implementations persist the
>>>>>>> objects, it
>>>>>>> seems to be the intended use. Both implementations
>>>>>>> (BioSQLRichObjectBuilder
>>>>>>> and SimpleRichObjectBuilder) persist all created objects. Creating my
>>>>>>> own
>>>>>>> class would work for me, but this should be fixed in BioJava codebase
>>>>>>> as
>>>>>>> well, correct? Creating another RichObjectBuilder implementation that
>>>>>>> works
>>>>>>> in finite memory would work, but it would have to be 'plugged in' by
>>>>>>> editing
>>>>>>> the RichObjectFactory or enabled through a configuration option -
>>>>>>> both seem
>>>>>>> undesirable. Any ideas?
>>>>>>>
>>>>>>> Tjeerd
>>>>>>>
>>>>>>> On 4/27/2012 6:44 AM, George Waldon wrote:
>>>>>>>>
>>>>>>>> Hi Tjeerd,
>>>>>>>>
>>>>>>>> The use of a singleton map for rich objects with a Hibernate
>>>>>>>> database
>>>>>>>> is described in the biojavax documentation; see also the class
>>>>>>>> BioSQLRichObjectBuilder.
>>>>>>>>
>>>>>>>> You can implement your solution simply by creating your own
>>>>>>>> implementation of RichObjectBuilder as I mentioned before. Maybe you
>>>>>>>> could
>>>>>>>> store the cache values as WeakReference and recreated objects only
>>>>>>>> when
>>>>>>>> needed. Or you could filter the list of parameters, e.g. in the case
>>>>>>>> of
>>>>>>>> SimpleDocRef return the same singleton representing an empty DocRef
>>>>>>>> if you
>>>>>>>> are not interested in those, etc.
>>>>>>>>
>>>>>>>> - George
>>>>>>>>
>>>>>>>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>>>>>>>
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> I cannot reopen the bug myself, but I'm sure Andreas or another
>>>>>>>>> manager of the bugtracker can.
>>>>>>>>>
>>>>>>>>> For my purposes the memory usage could be kept in check by
>>>>>>>>> disabling
>>>>>>>>> the cache in SimpleRichObjectBuilder. It's hard to say what the
>>>>>>>>> performance
>>>>>>>>> hit is, but it's not been noticeable in my application. A permanent
>>>>>>>>> solution
>>>>>>>>> needs some more thought.
>>>>>>>>>
>>>>>>>>> I double-checked, and all calls to SimpleObjectBuilder.getObject()
>>>>>>>>> go through RichObjectFactory.getObject() - it effectively delegates
>>>>>>>>> the call
>>>>>>>>> to the SimpleObjectBuilder. The SimpleObjectBuilder class keeps a
>>>>>>>>> cache of
>>>>>>>>> unrestricted size, with one object for every unique combination of
>>>>>>>>> class and
>>>>>>>>> parameters. The RichObjectFactory class also keeps a cache of
>>>>>>>>> objects, but
>>>>>>>>> this cache is restricted to containing instances for at most 20
>>>>>>>>> different
>>>>>>>>> object classes at any given time.
>>>>>>>>> So we can think of this as a level 1 cache in RichObjectFactory,
>>>>>>>>> and
>>>>>>>>> a level 2 cache in SimpleObjectBuilder. The level 1 cache may
>>>>>>>>> contain at
>>>>>>>>> most 20 entries, and the level 2 cache may contain at most the
>>>>>>>>> amount of
>>>>>>>>> instantiable classes in BioJava, which will be a couple hundred
>>>>>>>>> *maximum*. I
>>>>>>>>> don't think this difference justifies the use of two caches.
>>>>>>>>>
>>>>>>>>> Possible fix:
>>>>>>>>> Unless I missed something, removing the cache from
>>>>>>>>> SimpleObjectBuilder and increasing the size of the LRU cache in
>>>>>>>>> RichObjectFactory to compensate seems like a good option. This
>>>>>>>>> would solve
>>>>>>>>> the unbounded growth of the level 2 cache by eliminating it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is another issue with the level 1 cache that could cause
>>>>>>>>> unbounded growth. Look at this piece of code in
>>>>>>>>> RichObjectFactory.getObject():
>>>>>>>>>
>>>>>>>>> Map m = (Map)cache.get(applicationClass);
>>>>>>>>> if (!m.containsKey(paramsList)) {
>>>>>>>>>    m.put(paramsList,builder.buildObject(applicationClass,
>>>>>>>>> paramsList));
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> For every unique combination of parameters, the object is built and
>>>>>>>>> stored in the Map associated with that class. This Map is not a LRU
>>>>>>>>> implementation, so creating 10,000,000 instances of a class (i.e.
>>>>>>>>> SimpleDocRef) with different parameters would cause every single
>>>>>>>>> instance to
>>>>>>>>> be saved in the cache. Such a situation would again cause memory
>>>>>>>>> problems.
>>>>>>>>>
>>>>>>>>> Possible fix:
>>>>>>>>> Well that's an easy one! Replace this simple Map cache with a LRU
>>>>>>>>> implementation, as proposed before.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Tjeerd
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/27/2012 12:03 AM, George Waldon wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Tjeerd,
>>>>>>>>>>
>>>>>>>>>> You can certainly reopen a bug.
>>>>>>>>>>
>>>>>>>>>> I am not the expert here, but looking at the code,
>>>>>>>>>> RichObjectFactory uses a LRU cache as a temporary cache for
>>>>>>>>>> objects that are
>>>>>>>>>> reused often while SimpleRichObjectBuilder caches everything. I
>>>>>>>>>> doubt you
>>>>>>>>>> can remove this last one.
>>>>>>>>>>
>>>>>>>>>> Maybe you can write your own RichObjectBuilder that would cache
>>>>>>>>>> only things you are interested in (to compare?). I bet you
>>>>>>>>>> generate a lot of
>>>>>>>>>> DocRef that do not need to be cached; then use
>>>>>>>>>> RichObjectFactory.setRichObjectBuilder to set your partial builder
>>>>>>>>>> before
>>>>>>>>>> you parse. Just an idea.
>>>>>>>>>>
>>>>>>>>>> Hope this helps,
>>>>>>>>>> George
>>>>>>>>>>
>>>>>>>>>> Quoting Tjeerd Boerman <twboerman at gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> Wow. I'm quite emberassed, but the issue is actually not fixed in
>>>>>>>>>>> 1.8.2. The RichObjectFactory class uses an LRU cache, while the
>>>>>>>>>>> SimpleRichObjectBuilder class uses the cache described in my
>>>>>>>>>>> earlier post,
>>>>>>>>>>> and I mixed up the two. Sorry about that, I must have been
>>>>>>>>>>> staring at this
>>>>>>>>>>> code for too long today.
>>>>>>>>>>>
>>>>>>>>>>> It turns out the RichObjectFactory class is the only user of
>>>>>>>>>>> SimpleRichObjectBuilder, and has its own LRU cache. So the cache
>>>>>>>>>>> in
>>>>>>>>>>> SimpleRichObjectBuilder can probably be removed altogether. If
>>>>>>>>>>> the experts
>>>>>>>>>>> agree I would be happy to write a patch for this.
>>>>>>>>>>>
>>>>>>>>>>> Can the old bug report be reopened, or should I file a new bug
>>>>>>>>>>> report? Or should I submit a possible patch through this mailing
>>>>>>>>>>> list?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Tjeerd
>>>>>>>>>>>
>>>>>>>>>>> On 4/26/2012 6:01 PM, redmine at redmine.open-bio.org wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Issue #3345 has been updated by Andreas Prlic.
>>>>>>>>>>>>
>>>>>>>>>>>> * Status changed from New to Closed
>>>>>>>>>>>> * % Done changed from 0 to 100
>>>>>>>>>>>>
>>>>>>>>>>>> already fixed in 1.8.2
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>>>  Bug #3345: Static object cache in SimpleRichObjectBuilder
>>>>>>>>>>>> causing
>>>>>>>>>>>> memory leak <https://redmine.open-bio.org/issues/3345>
>>>>>>>>>>>>
>>>>>>>>>>>> * Author: Tjeerd Boerman
>>>>>>>>>>>> * Status: Closed
>>>>>>>>>>>> * Priority: Normal
>>>>>>>>>>>> * Assignee: biojava-dev list
>>>>>>>>>>>> * Category: bio
>>>>>>>>>>>> * Target version: BioJava 1.8 - legacy
>>>>>>>>>>>> * URL:
>>>>>>>>>>>>
>>>>>>>>>>>> I encountered a memory problem when parsing many Genbank files
>>>>>>>>>>>> with the Biojava 1.8.1. The parsed files were protein GPFF
>>>>>>>>>>>> (GenPept Flat
>>>>>>>>>>>> File format) files from the latest RefSeq release. The
>>>>>>>>>>>> application tried to
>>>>>>>>>>>> parse millions of protein sequences from these files, but an
>>>>>>>>>>>> OutOfMemoryException would always occur after some time. The
>>>>>>>>>>>> used heap space
>>>>>>>>>>>> would gradually increase from a couple hundred megabytes to over
>>>>>>>>>>>> 1.5 GB,
>>>>>>>>>>>> until the heap could grow no further. Upon inspection I
>>>>>>>>>>>> discovered a HashMap
>>>>>>>>>>>> in RichSequenceBuilder was the culprit:
>>>>>>>>>>>>
>>>>>>>>>>>> public class SimpleRichObjectBuilder implements
>>>>>>>>>>>> RichObjectBuilder
>>>>>>>>>>>> {
>>>>>>>>>>>>
>>>>>>>>>>>> private static Map objects = new HashMap();
>>>>>>>>>>>>
>>>>>>>>>>>> public Object buildObject(Class clazz, List paramsList) {
>>>>>>>>>>>>    ...
>>>>>>>>>>>>
>>>>>>>>>>>>    // return the constructed object from the hashmap if there
>>>>>>>>>>>> already
>>>>>>>>>>>>    if (contents.containsKey(ourParamsList)) return
>>>>>>>>>>>> contents.get(ourParamsList);
>>>>>>>>>>>>
>>>>>>>>>>>>    ...
>>>>>>>>>>>>
>>>>>>>>>>>>    // Instantiate it with the parameters
>>>>>>>>>>>>    Object o = c.newInstance(ourParamsList.toArray());
>>>>>>>>>>>>
>>>>>>>>>>>>    // store it for later in the singleton map
>>>>>>>>>>>>    contents.put(ourParamsList, o);
>>>>>>>>>>>>
>>>>>>>>>>>>    ...
>>>>>>>>>>>> }
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> It seems the *objects* Map in SimpleRichSequenceBuilder is used
>>>>>>>>>>>> as a static cache for objects, but when many different objects
>>>>>>>>>>>> are created
>>>>>>>>>>>> this cache grows out of control. I am unsure if this is a 'true'
>>>>>>>>>>>> bug, but
>>>>>>>>>>>> for my application it was a definite problem. My fix was to
>>>>>>>>>>>> simply comment
>>>>>>>>>>>> out the *contents.put()* statement, but I'm sure there is a
>>>>>>>>>>>> better way to
>>>>>>>>>>>> resolve this - perhaps by making the use of the cache optional
>>>>>>>>>>>> through a
>>>>>>>>>>>> configuration option.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------------------ You
>>>>>>>>>>>> have received this notification because you have either
>>>>>>>>>>>> subscribed to it, or
>>>>>>>>>>>> are involved in it.
>>>>>>>>>>>> To change your notification preferences, please click here and
>>>>>>>>>>>> login: http://redmine.open-bio.org
>>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> biojava-dev mailing list
>>>>>>>>>>> biojava-dev at lists.open-bio.org
>>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------
>>>>>>>>>> George Waldon
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> _______________________________________________
>> biojava-dev mailing list
>> biojava-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-dev



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------




More information about the biojava-dev mailing list