[Biojava-l] seqString/subStr bottleneck in SymbolList

Thomas Down td2@sanger.ac.uk
Fri, 29 Jun 2001 14:38:08 +0100


On Fri, Jun 29, 2001 at 02:27:16PM +0100, Keith James wrote:
> 
> I've been fixing a severe performance problem with my hacked Artemis
> which delegates its sequence functions to BioJava SymbolLists. This
> was taking much longer to scroll/update etc than vanilla Artemis,
> increasing roughly linearly with sequence length and feature
> count. End result is a 5Mb sequence + 10000 features halts on a Compaq
> ES40.

This worries me...  The subStr method ought to run in linear
time w.r.t. the size of the window you're requesting, and
constant time w.r.t. the overall length of the sequence.

My first thought is to wonder if there might be 
a SymbolList.seqString().substring(x, y)
somewhere where you really want a SymbolList.subStr(x, y)...
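
To make the difference concrete, here is a rough sketch. ToySymbolList
is a made-up stand-in, not the real BioJava class, and the 1-based
inclusive coordinates and the visit counter are assumptions added
purely for illustration. It only shows why stringifying the whole
sequence and then slicing costs work proportional to the full length,
while asking for the window directly costs work proportional to the
window.

```java
// Hypothetical toy class for illustration; NOT the BioJava SymbolList.
final class ToySymbolList {
    private final char[] symbols;
    long symbolsVisited = 0; // counts per-symbol work done so far

    ToySymbolList(String sequence) {
        this.symbols = sequence.toCharArray();
    }

    int length() {
        return symbols.length;
    }

    // Assumed 1-based indexing, as biological coordinates usually are.
    char symbolAt(int index) {
        symbolsVisited++;
        return symbols[index - 1];
    }

    // Stringifies the whole list: touches every symbol.
    String seqString() {
        return subStr(1, length());
    }

    // Stringifies only start..end (inclusive): touches end - start + 1 symbols.
    String subStr(int start, int end) {
        StringBuilder sb = new StringBuilder(end - start + 1);
        for (int i = start; i <= end; i++) {
            sb.append(symbolAt(i));
        }
        return sb.toString();
    }
}

final class SubStrDemo {
    // The suspected anti-pattern: build the full string, then slice it.
    static long costOfWholeThenSubstring(ToySymbolList list, int start, int end) {
        list.symbolsVisited = 0;
        // String.substring is 0-based with an exclusive end index.
        list.seqString().substring(start - 1, end);
        return list.symbolsVisited;
    }

    // The intended call: ask the list for just the window.
    static long costOfSubStr(ToySymbolList list, int start, int end) {
        list.symbolsVisited = 0;
        list.subStr(start, end);
        return list.symbolsVisited;
    }
}
```

On a 5Mb sequence the first form would touch every symbol on each
redraw, whatever the window size, which would look exactly like a
slowdown that grows with sequence length.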

> The cause turned out to be subStr() in AbstractSymbolList which makes
> 4 method calls for each base in the subsequence (symbolAt(), length(),
> getToken(), append()) when creating a readable (i.e. string)
> representation of the sequence. Is this something that is worth
> looking at in the BioJava core?

When I've looked at this in the past, this process hasn't been
too slow (actually, I did once try optimizing stringification for
one particular special case, and ended up slowing things down!).
Possibly, if your Java VM isn't handling method inlining well
(you are using the Fast VM on Compaqs, aren't you?), it'll slow
down.

> For now I'm caching the whole stringified sequence elsewhere to get
> round this. The reason for lots of substringing is that Artemis avoids
> Java graphics rounding errors at high sequence/pixel coordinates by
> checking visibility of residues/features and then only representing
> those in the viewable area, all drawn from a zero origin using integer
> coords. So the main genome sequence gets a substring and so do all the
> visible features.

Fair enough...
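
For anyone following along, the workaround described above might look
something like the sketch below. CachedSequenceText and its methods
are hypothetical names, not part of BioJava or Artemis, and the
1-based inclusive window is an assumption. The idea is just to
stringify once and slice the cached String for each visible region.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of BioJava or Artemis.
final class CachedSequenceText {
    private final Supplier<String> stringifier; // e.g. wraps a seqString() call
    private String cached;                      // null until first use
    int builds = 0;                             // how often the full string was built

    CachedSequenceText(Supplier<String> stringifier) {
        this.stringifier = stringifier;
    }

    // Returns the text for start..end (assumed 1-based, inclusive).
    String window(int start, int end) {
        if (cached == null) {
            cached = stringifier.get(); // pay the full stringification cost once
            builds++;
        }
        return cached.substring(start - 1, end);
    }

    // Must be called whenever the underlying sequence is edited.
    void invalidate() {
        cached = null;
    }
}
```

The trade-off is memory (a second copy of the sequence as a String)
and the need to invalidate the cache on every edit.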

> Patient: "Doctor, it hurts when I do this."
> Doctor:  "Well, don't do that."

No, getting string representations of SymbolLists should be fast.
And for applications like the DAS server, it /does/ go reasonably
fast...

Hmmm...

   Thomas.