[Biojava-dev] FastaFormat performance enhancement

Thomas Down td2 at sanger.ac.uk
Wed Oct 19 09:53:39 EDT 2005


On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:

> Hi,
> I had a lot of trouble using SeqIOTools.writeFasta on large
> sequences. The subStr method of SymbolList seems to introduce a
> memory leak (I did not track it down in detail!). Anyway, I would
> suggest changing FastaFormat:
>    public void writeSequence(Sequence seq, PrintStream os)
>    throws IOException {
>        os.print(">");
>        os.println(describeSequence(seq));
>        int length = seq.length();
>        for (int pos = 1; pos <= length; pos += lineWidth) {
>            int end = Math.min(pos + lineWidth - 1, length);
>            os.println(seq.subStr(pos, end));
>        }
>    }
>
> to
>    public void writeSequence(Sequence seq, PrintStream os)
>    throws IOException {
>        os.print(">");
>        os.println(describeSequence(seq));
>        int length = seq.length();
>        String seqString = seq.seqString();
>        for (int pos = 0; pos < length; pos += lineWidth) {
>            int end = Math.min(pos + lineWidth, length);
>            String sub = seqString.substring(pos, end);
>            os.println(sub);
>        }
>    }
>
> Since only String manipulation takes place in the loop, I think
> there is no point in using SymbolList's subStr anyway.

Hi,

I'd argue against this patch, since it could generate some really
huge strings.  Suppose I've got a Sequence object representing human
chromosome 1 (somewhere around 220 Mb, i.e. roughly 220 million
bases).  If this is a database-backed object with chunks of sequence
lazy-loaded on demand (biojava-ensembl does this, for example), then
there'll be no problem working with it even on a fairly modest PC.
But converting the whole thing to a String is going to use at least
440 MB of RAM, since Java stores each char in two bytes, and could
easily cause an OutOfMemoryError.

I'd be fine with stringifying sequences in larger chunks rather than  
one line at a time -- but I think we should be cautious about  
stringifying complete large sequences.
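
A minimal sketch of the "larger chunks" idea, not an actual BioJava
patch.  To keep it self-contained it does not use BioJava: SeqSource is
a hypothetical stand-in for the two SymbolList methods the original
code relies on (length() and the 1-based, inclusive subStr()), and the
class name, fromString() helper and linesPerChunk parameter are
likewise made up for illustration.

```java
import java.io.PrintStream;

final class ChunkedFastaWriter {

    /** Hypothetical stand-in for SymbolList: 1-based, inclusive coordinates. */
    interface SeqSource {
        int length();
        String subStr(int start, int end);
    }

    /** Demo-only helper: wraps a plain String as a SeqSource. */
    static SeqSource fromString(final String s) {
        return new SeqSource() {
            public int length() { return s.length(); }
            public String subStr(int start, int end) {
                // Convert 1-based inclusive to String's 0-based, end-exclusive.
                return s.substring(start - 1, end);
            }
        };
    }

    /**
     * Writes one FASTA record, stringifying at most
     * lineWidth * linesPerChunk symbols at a time.
     */
    static void writeSequence(String description, SeqSource seq, PrintStream os,
                              int lineWidth, int linesPerChunk) {
        if (lineWidth <= 0 || linesPerChunk <= 0) {
            throw new IllegalArgumentException("lineWidth and linesPerChunk must be positive");
        }
        os.print(">");
        os.println(description);

        int length = seq.length();
        // Chunk size is a whole number of lines, so only the final chunk
        // can end with a short line.
        int chunkSize = lineWidth * linesPerChunk;

        for (int chunkStart = 1; chunkStart <= length; chunkStart += chunkSize) {
            int chunkEnd = Math.min(chunkStart + chunkSize - 1, length);
            // One subStr call per chunk instead of one per output line.
            String chunk = seq.subStr(chunkStart, chunkEnd);
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                int end = Math.min(pos + lineWidth, chunk.length());
                os.println(chunk.substring(pos, end));
            }
        }
    }
}
```

The chunk size bounds the largest String held at any moment: with
60-column lines and 1000 lines per chunk, that is 60,000 chars, or
about 120 KB, regardless of how long the sequence is.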

Do you have any idea where the memory leak might be?  I'd be  
interested to track it down.  What sort of sequences were you using?

              Thomas