[Biojava-l] seqString/subStr bottleneck in SymbolList

29 Jun 2001 14:27:16 +0100

I've been fixing a severe performance problem in my hacked Artemis,
which delegates its sequence functions to BioJava SymbolLists. It
was taking much longer to scroll, update, etc. than vanilla Artemis,
with the time increasing roughly linearly with sequence length and
feature count. The end result is that a 5 Mb sequence with 10000
features brings it to a halt on a Compaq ES40.

The cause turned out to be subStr() in AbstractSymbolList, which makes
four method calls for each base in the subsequence (symbolAt(),
length(), getToken(), append()) when creating a readable (i.e. string)
representation of the sequence. Is this something that is worth
looking at in the BioJava core?
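
To illustrate the pattern, here is a toy sketch. It is not the BioJava
source: ToySymbolList, its char[] backing store, the call counter and
subStrBulk() are all made up for the example, and only the four method
names in the per-base loop come from the description above. It
contrasts a loop that makes four calls per base with a single bulk
copy, which only works if the tokens are already held as chars.

```java
// Hypothetical stand-in for a symbol list; NOT the BioJava API.
final class ToySymbolList {
    private final char[] tokens; // assumption: one token char per residue
    int calls = 0;               // counts method calls made while stringifying

    ToySymbolList(String seq) {
        this.tokens = seq.toCharArray();
    }

    int length() {
        calls++;
        return tokens.length;
    }

    // Returns the residue at a 1-based position, boxed as an object.
    Character symbolAt(int pos) {
        calls++;
        return tokens[pos - 1];
    }

    // Maps a residue object back to its printable token.
    char getToken(Character sym) {
        calls++;
        return sym;
    }

    // Per-base loop: length(), symbolAt(), getToken() and append()
    // are each called once for every residue in start..end.
    String subStrPerBase(int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i <= end && i <= length(); i++) {
            Character sym = symbolAt(i);
            char c = getToken(sym);
            sb.append(c);
            calls++; // count the append() call as well
        }
        return sb.toString();
    }

    // Bulk alternative (hypothetical): one array copy, no per-base calls.
    String subStrBulk(int start, int end) {
        return new String(tokens, start - 1, end - start + 1);
    }
}
```

Both methods take 1-based inclusive coordinates and return the same
string. The per-base version costs four calls per residue, so the work
grows with the length of every substring requested.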

For now I'm caching the whole stringified sequence elsewhere to get
round this. The reason for lots of substringing is that Artemis avoids
Java graphics rounding errors at high sequence/pixel coordinates by
checking visibility of residues/features and then only representing
those in the viewable area, all drawn from a zero origin using integer
coords. So the main genome sequence gets a substring and so do all the
visible features.
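
The clipping step described above could be sketched roughly as
follows. This is not Artemis code: the class, method names and the
pixelsPerBase parameter are invented for illustration, and 1-based
inclusive sequence coordinates are assumed.

```java
// Hypothetical helper; names and coordinate conventions are assumptions.
final class VisibleRange {
    // Returns {start, end} of the part of a feature that overlaps the
    // viewable window, or null if the feature is not visible at all.
    static int[] clip(int featStart, int featEnd, int viewStart, int viewEnd) {
        int s = Math.max(featStart, viewStart);
        int e = Math.min(featEnd, viewEnd);
        return (s <= e) ? new int[] { s, e } : null;
    }

    // Pixel x for a sequence position, measured from the left edge of
    // the view (zero origin), so values stay small however far along
    // the genome the view is.
    static int toPixelX(int seqPos, int viewStart, int pixelsPerBase) {
        return (seqPos - viewStart) * pixelsPerBase;
    }
}
```

Only the clipped range is then turned into a string, which is why
every scroll or redraw triggers a fresh round of substring calls for
the sequence and for each visible feature.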

Still, caching the whole sequence as chars on top of the overhead of
an object for each residue seems pretty inefficient.

Or is this a case of

Patient: "Doctor, it hurts when I do this."
Doctor:  "Well, don't do that."

;)

Keith

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA