[Biojava-l] A little (??!!!) stress test.

Gerald Loeffler Gerald.Loeffler@vienna.at
Sun, 02 Jul 2000 23:22:24 +0200


Hi thomas,

can you please be more more specific about the "trouble" the big
ArrayList causes?! Does it throw an Out of Memory exception, which could
be  solved with the -Xmx argument to "java"? Or does it throw any other
exception? Or does it hang?...

	cheers,
	gerald

Thomas Down wrote:
> 
> Up until now, most of my biojava code has tended to access fairly
> short sequences (longest were Yeast chromosomes).  I'd guess that's
> true for most users.
> 
> Today this changed.
> 
> I now need to work with a rather long piece of DNA -- about 35Mb
> (human chromosome 22, if anyone out there is looking for a piece
> that length).  Unfortunately, BioJava can't cope with it :(.
> 
> Actually, to be more precise, it turns out that Java runs into
> trouble when working with ArrayLists of 19522579 or more elements.
> I'm still not at all sure what's causing this -- I've filed a
> report on the bug parade, but it's not publically visible yet.
> I'm seeing this on Compaq java 1.2.2 and Sun 1.3beta for Linux.
> I suspect it's found in pretty much any port which uses Sun's code.
> Anyone who has access to other platforms might like to try the
> tiny attached test case.  If there's anyone out there who doesn't
> mind looking at the Sun source code, you might also like to open
> up your src.zip and see if there's any logical explanation for
> this behaviour.
> 
> What this means in practice is that we can't RELIABLY store sequences
> of more than 19522578 symbols using the current SimpleSymbolList
> implementation, or any other implementation which is based on
> an ArrayList.  In the absence of a rapid fix for all platforms
> (which I guess is unlikely), this problem suggests that we need
> a new SimpleSymbolList.  It shouldn't be too hard to manage
> an implementation which acts just like the current one, but which
> doesn't depend on ArrayList, and therefore shouldn't show this bug.
> I'll have a go at that tommorow, but if anyone wants to see
> any other changes to SimpleSymbolList, now is probably the time to
> bring them up.
> 
> There's actually another worrying aspect of this issue.  When a
> sequence is read from disk (or some other stream), the SequenceFormat
> class (or at least any of the current implementations) first constructs
> a SymbolList containing the sequence data.  It then uses the supplied
> SequenceFactory to turn this into a fully-fledged Sequence object.
> The trouble is, when you construct a SimpleSequence from a SymbolList,
> what actually happens is that a new ArrayList is constructed, and all
> the sequence data is copied.  It is this process which exposes the
> ArrayList bug.  But even without the bug, it seems kind of wasteful,
> and means that some simple BioJava programs could end up needing
> twice the memory they ought to :(
> 
> What I'd prefer to see is a SimpleSequence implementation which
> /wraps/ a SymbolList object and acts as a proxy to it, rather
> that extending SimpleSymbolList, and copying all the data as we
> do at the moment.  This will reduce the CPU and memory requirements
> for sequence loading, and also means that we can plug in new
> SymbolList implementations (for instance, one which stores DNA
> using bytes rather than object references) without having to
> write matching new Sequence implementations.
> 
> There is a slight catch though: if SimpleSequence just becomes
> a view onto a SymbolList, we have to start thinking about
> sequence mutability.  Right now, I'm used to thinking of
> SymbolLists as immutable, so there's no problem having one
> SymbolList (or Sequence) which is actually a view onto another
> one.  BioJava already uses SymbolList views: e.g.
> DNATools.reverseComplement, and AbstractSymbolList.subList.
> But right now there is nothing in the SymbolList contract which
> says that they /have/ to be immutable.  So a view onto a SymbolList
> can potentially change under you without any notification.
> 
> So, now seems to be the time to either:
> 
>   - Formally state that SymbolLists are immutable, at least
>     until some future release.
> 
> OR
> 
>   - Thrash out a standard for mutable SymbolLists, preferably
>     with some kind of event model which allows notification of
>     changes to bubble up through a stack of `view' classes.
> 
> There's a certain amount to be said for the latter option,
> since it minimizes the amount of code that gets broken when
> we /do/ have a mutable SequenceList standard.  But I guess
> it depends whether anyone is interested in these at the moment.
> 
> Sorry for the long message.  What do people think?
> 
> Happy hacking,
> 
> Thomas.
> --
> He looked up with big brown eyes.  ``They're really only
> tiny little A-bombs, honest.''
>                                      -- David Brin.
> 
>   ------------------------------------------------------------------------
> 
>    ListTest.javaName: ListTest.java
>                 Type: Plain Text (text/plain)

-- 
   Gerald.Loeffler@vienna.at _________________ Software Architect
   http://www.imp.univie.ac.at ____ http://www.daemonstration.com
   OOA&D, Java, J2EE, JSP, Servlets, JavaBeans, ODBMS, RDBMS, XML