[Biojava-dev] FastaFormat performance enhancement

Wed Oct 19 12:28:37 EDT 2005

Dirk Habighorst wrote:
> Thomas Down wrote:
> 
>>
>> On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
>>
>>> Hi,
>>> I had a lot of trouble using SeqIOTools.writeFasta on large  
>>> sequences. The subStr method of SymbolList seems to introduce a  
>>> memory leak (I did not track that in detail!). Anyway I would  
>>> suggest to change FastaFormat:
>>>    public void writeSequence(Sequence seq, PrintStream os)
>>>    throws IOException {
>>>        os.print(">");
>>>        os.println(describeSequence(seq));
>>>        int length = seq.length();
>>>        for (int pos = 1; pos <= length; pos += lineWidth) {
>>>            int end = Math.min(pos + lineWidth - 1, length);
>>>            os.println(seq.subStr(pos, end));
>>>        }
>>>    }
>>>
>>> to
>>>    public void writeSequence(Sequence seq, PrintStream os)
>>>    throws IOException {
>>>        os.print(">");
>>>        os.println(describeSequence(seq));
>>>        int length = seq.length();
>>>        String seqString = seq.seqString();
>>>        for (int pos = 0; pos < length; pos += lineWidth) {
>>>            int end = Math.min(pos + lineWidth, length);
>>>            String sub = seqString.substring(pos, end);
>>>            os.println(sub);
>>>        }
>>>    }
>>>
>>> since it is String manipulation that takes place in the loop, I  
>>> think there is no point in using SymbolList subStr anyway.
>>
>>
>>
>> Hi,
>>
>> I'd argue against this patch since it could potentially generate some  
>> really huge strings.  Suppose I've got a Sequence object representing  
>> human chromosome 1 (somewhere around 220Mb).  If this is a database- 
>> backed object with chunks of sequence lazy-loaded on demand (biojava- 
>> ensembl does this, for example) then there'll be no problem working  
>> with it even on a fairly modest PC.  But converting the whole thing  
>> to a String is going to use at least 440Mb of RAM, and could easily  
>> cause an OutOfMemoryError.
>>
>> I'd be fine with stringifying sequences in larger chunks rather than  
>> one line at a time -- but I think we should be cautious about  
>> stringifying complete large sequences.
>>
>> Do you have any idea where the memory leak might be?  I'd be  
>> interested to track it down.  What sort of sequences were you using?
>>
>>              Thomas
>>
> Hi thomas,
> 
> I experienced performance problems (even OutOfMemoryError) when working 
> with large Sequences (not lazy loaded). You might want to check this 
> little example:
> 
> package test;
> 
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Properties;
> 
> import org.biojava.bio.seq.DNATools;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.symbol.IllegalSymbolException;
> import org.ensembl.datamodel.CoordinateSystem;
> import org.ensembl.datamodel.Location;
> import org.ensembl.datamodel.Sequence;
> import org.ensembl.datamodel.SequenceRegion;
> import org.ensembl.driver.AdaptorException;
> import org.ensembl.driver.ConfigurationException;
> import org.ensembl.driver.CoreDriver;
> import org.ensembl.driver.DriverManager;
> import org.ensembl.driver.SequenceAdaptor;
> import org.ensembl.driver.SequenceRegionAdaptor;
> 
> 
> public class ExportFasta
> {
> 
>  /**
>   * @param args
>   */
>  public static void main (String[] args) {
>    // TODO Auto-generated method stub
>    Properties props = createDriverProperties (args);
>    try {
>      OutputStream os;
>      os = new FileOutputStream (args[3]);
> 
>      CoreDriver coreDriver = DriverManager.loadDriver (props);
>      SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
>      SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
>      CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
>      SequenceRegion[] srs = sra.fetchAllByCoordinateSystem(coordinateSystem);
>      int size = Integer.parseInt(args[5]);
>      for (SequenceRegion seqRegion : srs) {
>        Location loc = null;
>        int length = (int) seqRegion.getLength();
>        int start = 1;
>        int end;
>        while (start <= length) {
>          end = start + size - 1 < length ? start + size - 1: length;
>          loc = new Location (coordinateSystem, seqRegion.getName(), start, end, 1);
>          System.out.println(loc);
>          start = end + 1;
>          Sequence seq = sa.fetch(loc);
>          org.biojava.bio.seq.Sequence bioseq =
>              DNATools.createDNASequence(seq.getString(), loc.toString());
>          SeqIOTools.writeFasta(os, bioseq);
>        }
>      }
>    }
>    catch (ConfigurationException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (AdaptorException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (FileNotFoundException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (IllegalSymbolException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>    catch (IOException e) {
>      // TODO Auto-generated catch block
>      e.printStackTrace();
>    }
>  }
> 
>  private static Properties createDriverProperties (String[] args) {
>    Properties props = new Properties ();
>    props.setProperty("host", args[0]);
>    props.setProperty("user", args[1]);
>    props.setProperty("database", args[2]);
>    return props;
>  }
> 
> }
> 
> java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE 
> RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
> 
> Since the chunk size is constant, the memory required should stay
> constant as well. With large chunks (1000000), allocated memory keeps
> growing!
> hope that helps, dirk

Hi thomas,

I did a little debugging myself and found an interesting place to look
at. The SimpleSymbolList backing Sequences created with DNATools
implements subList like this:

    public SymbolList subList(int start, int end){
        if (start < 1 || end > length()) {
            throw new IndexOutOfBoundsException(
                      "Sublist index out of bounds " + length() + ":" + start + "," + end
                      );
        }

        if (end < start) {
            throw new IllegalArgumentException(
                "end must not be lower than start: start=" + start + ", end=" + end
                );
        }

        SimpleSymbolList sl = new SimpleSymbolList(this,viewOffset+start,viewOffset+end);
        if (isView){
            referenceSymbolList.addChangeListener(sl);
        }else{
            this.addChangeListener(sl);
        }
        return sl;
    }

So every subList call registers the new SimpleSymbolList as a change
listener on the original Sequence (or on the reference list, when the
Sequence is already a view). It appears that garbage collection can't
keep up with that if the Sequence is too long. I have not checked this
in detail, though.
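
To make the suspicion concrete, here is a minimal sketch of the pattern
in plain Java, without BioJava. Parent, View and Listener are
hypothetical stand-ins, not the real classes, and the sketch assumes
the listener list holds ordinary strong references, which I have not
verified for the real change-support code. It only shows how one
registration per FASTA line adds up unless something removes the
listener again.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerLeakSketch {

    /** Hypothetical stand-in for a change listener; not the BioJava interface. */
    interface Listener {
        void changed();
    }

    /** Hypothetical stand-in for a symbol list that hands out views of itself. */
    static class Parent {
        // Assumption: strong references, so every registered view stays
        // reachable for as long as the parent is reachable.
        private final List<Listener> listeners = new ArrayList<>();

        void addListener(Listener l) { listeners.add(l); }
        void removeListener(Listener l) { listeners.remove(l); }
        int listenerCount() { return listeners.size(); }

        /** Mirrors the shape of subList above: create a view, register it, return it. */
        View subList(int start, int end) {
            View v = new View(start, end);
            addListener(v);
            return v;
        }
    }

    /** A view over part of the parent; it listens for changes to the parent. */
    static class View implements Listener {
        final int start;
        final int end;

        View(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public void changed() {
            // A real view would invalidate cached state here.
        }
    }

    /**
     * Walks a sequence of the given length one line at a time, the way the
     * FASTA writer does, and reports how many listeners remain registered.
     */
    static int listenersAfterWriting(int length, int lineWidth, boolean unregister) {
        Parent parent = new Parent();
        for (int pos = 1; pos <= length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth - 1, length);
            View line = parent.subList(pos, end);
            // ... the line would be printed here ...
            if (unregister) {
                parent.removeListener(line);
            }
        }
        return parent.listenerCount();
    }

    public static void main(String[] args) {
        System.out.println(listenersAfterWriting(1_000_000, 60, false));
        System.out.println(listenersAfterWriting(1_000_000, 60, true));
    }
}
```

With a length of 1000000 and 60 symbols per line the loop runs 16667
times, so the first call leaves 16667 registered views and the second
leaves none.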

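Regarding Thomas's suggestion of stringifying in larger chunks rather
than the whole sequence, a rough sketch of what that could look like.
RangeSource is a hypothetical stand-in for the subStr(start, end) part
of SymbolList (1-based, inclusive), and linesPerChunk is a made-up
tuning parameter. The chunk size is a multiple of the line width so
line breaks fall in the same places as before, while the number of
subStr calls drops by a factor of linesPerChunk.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class ChunkedFastaSketch {

    /** Hypothetical stand-in for SymbolList: 1-based, inclusive coordinates. */
    interface RangeSource {
        int length();
        String subStr(int start, int end);
    }

    /** Wraps a plain String so the sketch can run without BioJava. */
    static RangeSource of(final String s) {
        return new RangeSource() {
            public int length() { return s.length(); }
            public String subStr(int start, int end) { return s.substring(start - 1, end); }
        };
    }

    /**
     * Writes the sequence body (no header line), stringifying at most
     * lineWidth * linesPerChunk symbols at a time.
     */
    static void writeBody(RangeSource seq, PrintStream os, int lineWidth, int linesPerChunk) {
        int chunkSize = lineWidth * linesPerChunk; // multiple of lineWidth keeps lines aligned
        int length = seq.length();
        for (int chunkStart = 1; chunkStart <= length; chunkStart += chunkSize) {
            int chunkEnd = Math.min(chunkStart + chunkSize - 1, length);
            String chunk = seq.subStr(chunkStart, chunkEnd); // one call per chunk, not per line
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                int end = Math.min(pos + lineWidth, chunk.length());
                os.println(chunk.substring(pos, end));
            }
        }
    }

    /** Convenience for trying the sketch: renders the body into a String. */
    static String render(String residues, int lineWidth, int linesPerChunk) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream os = new PrintStream(buf);
        writeBody(of(residues), os, lineWidth, linesPerChunk);
        os.flush();
        return buf.toString();
    }

    public static void main(String[] args) {
        // Ten symbols, four per line, two lines per chunk.
        System.out.print(render("ACGTACGTAC", 4, 2));
    }
}
```

The peak String size is bounded by the chunk size rather than by the
sequence length, so a lazy-loaded chromosome would still never be
stringified in one piece.
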
ciao, dirk
-- 
Dirk Habighorst                  Software Engineer/ Bioinformatician
Epigenomics AG    Kleine Praesidentenstr. 1    10178 Berlin, Germany
phone:+49-30-24345-372                          fax:+49-30-24345-555
http://www.epigenomics.com           dirk.habighorst at epigenomics.com