[Biojava-dev] FastaFormat performance enhancement

mark.schreiber at novartis.com mark.schreiber at novartis.com
Wed Oct 19 21:12:35 EDT 2005


Hi Thomas -

I can confirm this. I ran a profiler a while back after getting a similar 
complaint. It seems that every time you call subList you add a reference 
to the parent SymbolList. For some reason this reference remains even when 
the sub list is garbage collected. Also, oddly, if you ever do an edit 
operation, all the old references disappear.

The best way to see it happen is to assign lots of memory to the JVM and 
infinitely loop over a sublist operation:


Sequence seq = ...
while(true){
    SymbolList sl = seq.subList(1, 10);
}


You quickly accumulate thousands of references. I could never figure out 
why they don't get released.
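
For what it's worth, here is a toy model of the pattern in the subList code quoted further down, where each new view is registered as a change listener on its parent. It is only a sketch and not BioJava code: Listener, View, StrongParent and WeakParent are made-up names, and holding listeners through WeakReference is one possible mitigation that I am assuming would be acceptable, not something the library is known to do.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Toy model only. All names here are hypothetical stand-ins, not BioJava classes.
class ListenerLeakSketch {

    /** Stand-in for a change listener. */
    interface Listener {
        void changed();
    }

    /** Stand-in for a sub list view that listens to its parent. */
    static class View implements Listener {
        @Override
        public void changed() {
            // A real view would update itself or invalidate a cache here.
        }
    }

    /** Parent that holds strong references: every view stays reachable. */
    static class StrongParent {
        final List<Listener> listeners = new ArrayList<>();

        View subList() {
            View view = new View();
            listeners.add(view); // pins the view for as long as the parent lives
            return view;
        }
    }

    /** Parent that holds weak references: unused views can be collected. */
    static class WeakParent {
        final List<WeakReference<Listener>> listeners = new ArrayList<>();

        View subList() {
            View view = new View();
            listeners.add(new WeakReference<>(view));
            return view;
        }

        /** Drops entries whose view is gone and returns how many remain. */
        int purge() {
            listeners.removeIf(ref -> ref.get() == null);
            return listeners.size();
        }
    }

    public static void main(String[] args) {
        StrongParent strong = new StrongParent();
        for (int i = 0; i < 1000; i++) {
            strong.subList(); // result discarded, but the parent still holds it
        }
        System.out.println("strong parent retains " + strong.listeners.size() + " views");

        WeakParent weak = new WeakParent();
        for (int i = 0; i < 1000; i++) {
            weak.subList();
        }
        // How many entries survive depends on when the collector runs.
        System.out.println("weak parent retains " + weak.purge() + " views");
    }
}
```

In the strong variant the discarded views can never be collected while the parent is alive, which would match the accumulation seen in the profiler. The weak variant still needs an occasional purge of dead entries, so it trades the leak for a little bookkeeping.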

- Mark





ml-it-biojava-dev at epigenomics.com
Sent by: biojava-dev-bounces at portal.open-bio.org
10/20/2005 12:28 AM

 
        To:     biojava-dev at biojava.org
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        Re: [Biojava-dev] FastaFormat performance enhancement


Dirk Habighorst wrote:
> Thomas Down wrote:
> 
>>
>> On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
>>
>>> Hi,
>>> I had a lot of trouble using SeqIOTools.writeFasta on large 
>>> sequences. The subStr method of SymbolList seems to introduce a 
>>> memory leak (I did not track that in detail!). Anyway, I would 
>>> suggest changing FastaFormat from:
>>>    public void writeSequence(Sequence seq, PrintStream os)
>>>    throws IOException {
>>>        os.print(">");
>>>        os.println(describeSequence(seq));
>>>        int length = seq.length();
>>>        for (int pos = 1; pos <= length; pos += lineWidth) {
>>>            int end = Math.min(pos + lineWidth - 1, length);
>>>            os.println(seq.subStr(pos, end));
>>>        }
>>>    }
>>>
>>> to
>>>    public void writeSequence(Sequence seq, PrintStream os)
>>>    throws IOException {
>>>        os.print(">");
>>>        os.println(describeSequence(seq));
>>>        int length = seq.length();
>>>        String seqString = seq.seqString();
>>>        for (int pos = 0; pos < length; pos += lineWidth) {
>>>            int end = Math.min(pos + lineWidth, length);
>>>            String sub = seqString.substring(pos, end);
>>>            os.println(sub);
>>>        }
>>>    }
>>>
>>> Since only String manipulation takes place in the loop, I 
>>> think there is no point in using SymbolList subStr anyway.
>>
>>
>>
>> Hi,
>>
>> I'd argue against this patch since it could potentially generate some 
>> really huge strings.  Suppose I've got a Sequence object representing 
>> human chromosome 1 (somewhere around 220Mb).  If this is a 
>> database-backed object with chunks of sequence lazy-loaded on demand 
>> (biojava-ensembl does this, for example) then there'll be no problem 
>> working with it even on a fairly modest PC.  But converting the whole 
>> thing to a String is going to use at least 440MB of RAM (Java chars 
>> are two bytes each), and could easily cause an OutOfMemoryError.
>>
>> I'd be fine with stringifying sequences in larger chunks rather than 
>> one line at a time -- but I think we should be cautious about 
>> stringifying complete large sequences.
>>
>> Do you have any idea where the memory leak might be?  I'd be 
>> interested to track it down.  What sort of sequences were you using?
>>
>>              Thomas
>>
> Hi Thomas,
> 
> I experienced performance problems (even OutOfMemoryError) when working 
> with large Sequences (not lazy loaded). You might want to check this 
> little example:
> 
> package test;
> 
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Properties;
> 
> import org.biojava.bio.seq.DNATools;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.symbol.IllegalSymbolException;
> import org.ensembl.datamodel.CoordinateSystem;
> import org.ensembl.datamodel.Location;
> import org.ensembl.datamodel.Sequence;
> import org.ensembl.datamodel.SequenceRegion;
> import org.ensembl.driver.AdaptorException;
> import org.ensembl.driver.ConfigurationException;
> import org.ensembl.driver.CoreDriver;
> import org.ensembl.driver.DriverManager;
> import org.ensembl.driver.SequenceAdaptor;
> import org.ensembl.driver.SequenceRegionAdaptor;
> 
> 
> public class ExportFasta
> {
> 
>  /**
>   * @param args
>   */
>  public static void main (String[] args) {
>    // TODO Auto-generated method stub
>    Properties props = createDriverProperties (args);
>    try {
>      OutputStream os;
>      os = new FileOutputStream (args[3]);
> 
>      CoreDriver coreDriver = DriverManager.loadDriver (props);
>      SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
>      SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
>      CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
>      SequenceRegion[] srs = sra.fetchAllByCoordinateSystem(coordinateSystem);
>      int size = Integer.parseInt(args[5]);
>      for (SequenceRegion seqRegion : srs) {
>        Location loc = null;
>        int length = (int) seqRegion.getLength();
>        int start = 1;
>        int end;
>        while (start <= length) {
>          end = start + size - 1 < length ? start + size - 1: length;
>          loc = new Location (coordinateSystem, seqRegion.getName(), start, end, 1);
>          System.out.println(loc);
>          start = end + 1;
>          Sequence seq = sa.fetch(loc);
>          org.biojava.bio.seq.Sequence bioseq = DNATools.createDNASequence(seq.getString(), loc.toString());
>          SeqIOTools.writeFasta(os, bioseq);
>        }
>      }
>    }
>    catch (ConfigurationException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (AdaptorException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (FileNotFoundException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (IllegalSymbolException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (IOException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>  }
> 
>  private static Properties createDriverProperties (String[] args) {
>    Properties props = new Properties ();
>    props.setProperty("host", args[0]);
>    props.setProperty("user", args[1]);
>    props.setProperty("database", args[2]);
>       return props;
>  }
> 
> }
> 
> java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
> 
> Since the chunk size is constant, the memory required should be stable. 
> With large chunks (1000000), allocated memory keeps growing!
> Hope that helps, dirk

Hi Thomas,

I did a little debugging myself and found an interesting place to look at! 
The SimpleSymbolList backing Sequences created with DNATools 
implements subList like this:

    public SymbolList subList(int start, int end){
        if (start < 1 || end > length()) {
            throw new IndexOutOfBoundsException(
                "Sublist index out of bounds " + length() + ":" + start + "," + end
                );
        }

        if (end < start) {
            throw new IllegalArgumentException(
                "end must not be lower than start: start=" + start + ", end=" + end
                );
        }

        SimpleSymbolList sl = new SimpleSymbolList(this,viewOffset+start,viewOffset+end);
        if (isView){
            referenceSymbolList.addChangeListener(sl);
        }else{
            this.addChangeListener(sl);
        }
        return sl;
    }

So every call registers the new SymbolList on the original Sequence via 
the addChangeListener method, which adds another reference to it. It 
appears that garbage collection can't keep up with that if the Sequence 
is too long. I have not checked this in detail though.
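
Until that is sorted out, one middle ground between the two writeSequence versions quoted above would follow Thomas's suggestion of stringifying in larger chunks: call subStr once per block of lines rather than once per line, and never build the whole sequence as a single String. The following is only a sketch under assumptions. SeqSource is a hypothetical stand-in for a SymbolList (same 1-based, inclusive subStr coordinates), and the linesPerChunk parameter is my invention, not part of FastaFormat.

```java
import java.io.PrintStream;

// Sketch only: SeqSource is a hypothetical stand-in for a BioJava SymbolList,
// using the same 1-based, inclusive subStr coordinates as the quoted code.
class ChunkedFastaSketch {

    interface SeqSource {
        int length();
        String subStr(int start, int end); // 1-based, inclusive
    }

    /** Wraps a plain String so the sketch can run without BioJava. */
    static SeqSource fromString(final String s) {
        return new SeqSource() {
            @Override
            public int length() { return s.length(); }

            @Override
            public String subStr(int start, int end) { return s.substring(start - 1, end); }
        };
    }

    /**
     * Writes one FASTA record, fetching linesPerChunk lines' worth of
     * sequence per subStr call instead of one line per call.
     */
    static void writeFasta(String description, SeqSource seq, PrintStream os,
                           int lineWidth, int linesPerChunk) {
        if (lineWidth <= 0 || linesPerChunk <= 0) {
            throw new IllegalArgumentException("lineWidth and linesPerChunk must be positive");
        }
        os.print(">");
        os.println(description);

        int length = seq.length();
        // The chunk size is a multiple of lineWidth, so line breaks stay aligned.
        long chunkSize = (long) lineWidth * linesPerChunk;
        for (long chunkStart = 1; chunkStart <= length; chunkStart += chunkSize) {
            int chunkEnd = (int) Math.min(chunkStart + chunkSize - 1, length);
            String chunk = seq.subStr((int) chunkStart, chunkEnd);
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                int end = Math.min(pos + lineWidth, chunk.length());
                os.println(chunk.substring(pos, end));
            }
        }
    }

    public static void main(String[] args) {
        writeFasta("example", fromString("ACGTACGTAC"), System.out, 4, 2);
    }
}
```

With a line width of 60 and, say, 1000 lines per chunk, the number of sub list views created drops by a factor of 1000, while peak String size stays around 60,000 characters however long the sequence is.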

ciao, dirk
-- 
Dirk Habighorst                  Software Engineer/ Bioinformatician
Epigenomics AG    Kleine Praesidentenstr. 1    10178 Berlin, Germany
phone:+49-30-24345-372                          fax:+49-30-24345-555
http://www.epigenomics.com           dirk.habighorst at epigenomics.com