[Biojava-l] seqString/subStr bottleneck in SymbolList

Keith James kdj@sanger.ac.uk
02 Jul 2001 11:28:54 +0100


>>>>> "Thomas" == Thomas Down <td2@sanger.ac.uk> writes:

[...]

    Thomas> This worries me...  The subStr method ought to run in
    Thomas> linear time w.r.t the size of the window you're
    Thomas> requesting, and constant w.r.t. the overall length of the
    Thomas> sequence.

    Thomas> My first thought is to wonder if there might be a
    Thomas> SymbolList.seqString().substring(x, y) somewhere where you
    Thomas> really want a SymbolList.subStr(x, y)...

Nope, afraid not. The Artemis sequence object simply wraps a
SymbolList and the subsequence method is

public String getSubSequence(int index1, int index2)
{
    String subSeq = "";

    try
    {
        // symbols is a BioJava SymbolList (a SimpleSequence)
        subSeq = symbols.subStr(index1, index2);

        // Using JDK method
        // subSeq = sequence.substring(index1 - 1, index2);
    }
    catch (IndexOutOfBoundsException ioe)
    {
	 ioe.printStackTrace();
    }

    // print static counter
    callCount++;
    System.err.println("count " + callCount);

    return subSeq;
}


    >> The cause turned out to be subStr() in AbstractSymbolList which
    >> makes 4 method calls for each base in the subsequence
    >> (symbolAt(), length(), getToken(), append()) when creating a
    >> readable (i.e. string) representation of the sequence. Is this
    >> something that is worth looking at in the BioJava core?

    Thomas> When I've looked at this in the past, this process hasn't
    Thomas> been too slow (actually, I did once try optimizing
    Thomas> stringification for one particular special case, and ended
    Thomas> up slowing things down!).  Possibly if your java VM isn't
    Thomas> handling method inlining well (you are using the Fast VM
    Thomas> on Compaqs, aren't you?) it'll slow down.

Yes, fast VM on the Compaqs, but I get the same symptom with the Sun
and Blackdown ports on Linux. I added printing a static counter -
calling the BioJava subStr() I can count 3-4 seconds between
increments, while using the java.lang.String.substring they whizz past
too fast to see.

It's not really a big deal though, as I can just cache the String.

-- 

-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA