[Biojava-l] A little (??!!!) stress test.

Thomas Down td2@sanger.ac.uk
Sun, 2 Jul 2000 17:01:34 +0100


--AhhlLboLdkugWU4S
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Up until now, most of my biojava code has tended to access fairly
short sequences (longest were Yeast chromosomes).  I'd guess that's
true for most users.

Today this changed.

I now need to work with a rather long piece of DNA -- about 35Mb
(human chromosome 22, if anyone out there is looking for a piece
that length).  Unfortunately, BioJava can't cope with it :(.

Actually, to be more precise, it turns out that Java runs into
trouble when working with ArrayLists of 19522579 or more elements.
I'm still not at all sure what's causing this -- I've filed a
report on the bug parade, but it's not publically visible yet.
I'm seeing this on Compaq java 1.2.2 and Sun 1.3beta for Linux.
I suspect it's found in pretty much any port which uses Sun's code.
Anyone who has access to other platforms might like to try the
tiny attached test case.  If there's anyone out there who doesn't
mind looking at the Sun source code, you might also like to open
up your src.zip and see if there's any logical explanation for
this behaviour.

What this means in practice is that we can't RELIABLY store sequences
of more than 19522578 symbols using the current SimpleSymbolList
implementation, or any other implementation which is based on
an ArrayList.  In the absence of a rapid fix for all platforms
(which I guess is unlikely), this problem suggests that we need
a new SimpleSymbolList.  It shouldn't be too hard to manage
an implementation which acts just like the current one, but which
doesn't depend on ArrayList, and therefore shouldn't show this bug.
I'll have a go at that tommorow, but if anyone wants to see
any other changes to SimpleSymbolList, now is probably the time to
bring them up.

There's actually another worrying aspect of this issue.  When a
sequence is read from disk (or some other stream), the SequenceFormat
class (or at least any of the current implementations) first constructs
a SymbolList containing the sequence data.  It then uses the supplied
SequenceFactory to turn this into a fully-fledged Sequence object.
The trouble is, when you construct a SimpleSequence from a SymbolList,
what actually happens is that a new ArrayList is constructed, and all
the sequence data is copied.  It is this process which exposes the
ArrayList bug.  But even without the bug, it seems kind of wasteful,
and means that some simple BioJava programs could end up needing
twice the memory they ought to :(

What I'd prefer to see is a SimpleSequence implementation which
/wraps/ a SymbolList object and acts as a proxy to it, rather
that extending SimpleSymbolList, and copying all the data as we
do at the moment.  This will reduce the CPU and memory requirements
for sequence loading, and also means that we can plug in new
SymbolList implementations (for instance, one which stores DNA
using bytes rather than object references) without having to
write matching new Sequence implementations.

There is a slight catch though: if SimpleSequence just becomes
a view onto a SymbolList, we have to start thinking about
sequence mutability.  Right now, I'm used to thinking of
SymbolLists as immutable, so there's no problem having one
SymbolList (or Sequence) which is actually a view onto another
one.  BioJava already uses SymbolList views: e.g.
DNATools.reverseComplement, and AbstractSymbolList.subList.
But right now there is nothing in the SymbolList contract which
says that they /have/ to be immutable.  So a view onto a SymbolList
can potentially change under you without any notification.

So, now seems to be the time to either:
  
  - Formally state that SymbolLists are immutable, at least
    until some future release.

OR

  - Thrash out a standard for mutable SymbolLists, preferably
    with some kind of event model which allows notification of
    changes to bubble up through a stack of `view' classes.

There's a certain amount to be said for the latter option,
since it minimizes the amount of code that gets broken when
we /do/ have a mutable SequenceList standard.  But I guess
it depends whether anyone is interested in these at the moment.

Sorry for the long message.  What do people think?

Happy hacking,

Thomas.
-- 
He looked up with big brown eyes.  ``They're really only
tiny little A-bombs, honest.''
                                     -- David Brin.

--AhhlLboLdkugWU4S
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="ListTest.java"

import java.util.*;

/**
 * Simple demonstration of the ArrayList constructor problem.
 * On all platforms currently tested, this program succeeds
 * when run with a count of 19522578, but fails with a 
 * count of 19522579 or higher.
 *
 * @author Thomas Down <td2@sanger.ac.uk>
 */

public class ListTest {
    public static void main(String[] args) throws Exception {
	if (args.length < 1)
	    throw new Exception("usage: test.ListTest <count>");

	int count = Integer.parseInt(args[0]);

	String myObject = "Hello, world";
	List myList = Collections.nCopies(count, myObject);
	System.out.println(myList.size());
	List myList2 = new ArrayList(myList);
    }
}

--AhhlLboLdkugWU4S--