[Biojava-dev] FastaFormat performance enhancement

Wed Oct 19 11:09:27 EDT 2005

Thomas Down wrote:
> 
> On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
> 
>> Hi,
>> I had a lot of trouble using SeqIOTools.writeFasta on large
>> sequences. The subStr method of SymbolList seems to introduce a
>> memory leak (I did not track that in detail!). Anyway, I would
>> suggest changing FastaFormat from:
>>    public void writeSequence(Sequence seq, PrintStream os)
>>    throws IOException {
>>        os.print(">");
>>        os.println(describeSequence(seq));
>>        int length = seq.length();
>>        for (int pos = 1; pos <= length; pos += lineWidth) {
>>            int end = Math.min(pos + lineWidth - 1, length);
>>            os.println(seq.subStr(pos, end));
>>        }
>>    }
>>
>> to
>>    public void writeSequence(Sequence seq, PrintStream os)
>>    throws IOException {
>>        os.print(">");
>>        os.println(describeSequence(seq));
>>        int length = seq.length();
>>        String seqString = seq.seqString();
>>        for (int pos = 0; pos < length; pos += lineWidth) {
>>            int end = Math.min(pos + lineWidth, length);
>>            String sub = seqString.substring(pos, end);
>>            os.println(sub);
>>        }
>>    }
>>
>> Since it is String manipulation that takes place in the loop, I think
>> there is no point in using SymbolList subStr anyway.
> 
> 
> Hi,
> 
> I'd argue against this patch since it could potentially generate some
> really huge strings. Suppose I've got a Sequence object representing
> human chromosome 1 (somewhere around 220Mb). If this is a
> database-backed object with chunks of sequence lazy-loaded on demand
> (biojava-ensembl does this, for example) then there'll be no problem
> working with it even on a fairly modest PC. But converting the whole
> thing to a String is going to use at least 440 MB of RAM (Java chars
> are two bytes each), and could easily cause an OutOfMemoryError.
> 
> I'd be fine with stringifying sequences in larger chunks rather than
> one line at a time -- but I think we should be cautious about
> stringifying complete large sequences.
> 
> Do you have any idea where the memory leak might be? I'd be interested
> to track it down. What sort of sequences were you using?
> 
>              Thomas
> 
Hi Thomas,

I experienced performance problems (even an OutOfMemoryError) when working with large Sequences that are not lazy-loaded. You might want to check this little example:

package test;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

import org.biojava.bio.seq.DNATools;
import org.biojava.bio.seq.io.SeqIOTools;
import org.biojava.bio.symbol.IllegalSymbolException;
import org.ensembl.datamodel.CoordinateSystem;
import org.ensembl.datamodel.Location;
import org.ensembl.datamodel.Sequence;
import org.ensembl.datamodel.SequenceRegion;
import org.ensembl.driver.AdaptorException;
import org.ensembl.driver.ConfigurationException;
import org.ensembl.driver.CoreDriver;
import org.ensembl.driver.DriverManager;
import org.ensembl.driver.SequenceAdaptor;
import org.ensembl.driver.SequenceRegionAdaptor;

public class ExportFasta
{

  /**
   * Exports each sequence region of a coordinate system as FASTA,
   * one record per fixed-size chunk.
   *
   * @param args host, user, database, result file, coordinate system, chunk size
   */
  public static void main (String[] args) {
    Properties props = createDriverProperties (args);
    // try-with-resources closes the output file even if an exception is thrown
    try (OutputStream os = new FileOutputStream (args[3])) {
      CoreDriver coreDriver = DriverManager.loadDriver (props);
      SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
      SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
      CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
      SequenceRegion[] srs = sra.fetchAllByCoordinateSystem(coordinateSystem);

      int size = Integer.parseInt(args[5]);
      for (SequenceRegion seqRegion : srs) {
        int length = (int) seqRegion.getLength();
        int start = 1;
        // "<=" so that a final chunk consisting of a single base is not skipped
        while (start <= length) {
          int end = Math.min(start + size - 1, length);
          Location loc = new Location (coordinateSystem, seqRegion.getName(), start, end, 1);
          System.out.println(loc);
          start = end + 1;
          // fetch the chunk from Ensembl and wrap it in a BioJava sequence
          Sequence seq = sa.fetch(loc);
          org.biojava.bio.seq.Sequence bioseq = DNATools.createDNASequence(seq.getString(), loc.toString());
          SeqIOTools.writeFasta(os, bioseq);
        }
      }
    }
    catch (ConfigurationException e) {
      e.printStackTrace();
    }
    catch (AdaptorException e) {
      e.printStackTrace();
    }
    catch (FileNotFoundException e) {
      e.printStackTrace();
    }
    catch (IllegalSymbolException e) {
      e.printStackTrace();
    }
    catch (IOException e) {
      e.printStackTrace();
    }
  }

  private static Properties createDriverProperties (String[] args) {
    Properties props = new Properties ();
    props.setProperty("host", args[0]);
    props.setProperty("user", args[1]);
    props.setProperty("database", args[2]);

    return props;
  }

}

java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE

Since the chunk size is constant, the memory required should stay constant as well. With large chunks (1000000), however, allocated memory keeps growing!
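
As for the two versions of writeSequence discussed above, a possible middle ground is to stringify a bounded chunk of several lines at a time, so the whole sequence is never held as one String. The following is only a rough sketch in plain Java, without BioJava, so that it runs standalone. SubStringSource is a hypothetical stand-in for SymbolList.subStr (1-based, inclusive), and the names ChunkedFastaSketch, writeFasta and chunkSize are made up for illustration. It only bounds the size of the temporary strings; it does not explain the memory growth described above.

```java
import java.io.PrintStream;

class ChunkedFastaSketch {

    /** Hypothetical stand-in for SymbolList.subStr(start, end): 1-based, inclusive. */
    interface SubStringSource {
        String subStr(int start, int end);
    }

    /** Writes one FASTA record, stringifying at most about chunkSize symbols at a time. */
    static void writeFasta(String description, int length, SubStringSource seq,
                           int lineWidth, int chunkSize, PrintStream os) {
        if (lineWidth <= 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("lineWidth and chunkSize must be positive");
        }
        // Round the chunk down to a whole number of lines (at least one line),
        // so that no output line straddles two chunks.
        long step = (long) Math.max(1, chunkSize / lineWidth) * lineWidth;

        os.print(">");
        os.println(description);
        // A long loop counter avoids int overflow for very long sequences.
        for (long start = 1; start <= length; start += step) {
            int end = (int) Math.min(start + step - 1, length);
            // Only this chunk is held as a String, never the whole sequence.
            String chunk = seq.subStr((int) start, end);
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                int lineEnd = Math.min(pos + lineWidth, chunk.length());
                os.println(chunk.substring(pos, lineEnd));
            }
        }
    }

    public static void main(String[] args) {
        String dna = "ACGTACGTAC";
        // The lambda adapts Java's 0-based substring to 1-based inclusive coordinates.
        writeFasta("demo", dna.length(), (s, e) -> dna.substring(s - 1, e), 4, 8, System.out);
    }
}
```

With a line width of 4 and a chunk size of 8, the ten-base demo sequence is fetched as two chunks (bases 1-8 and 9-10) and written as the lines ACGT, ACGT and AC.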

Hope that helps, Dirk
-- 
Dirk Habighorst                  Software Engineer/ Bioinformatician
Epigenomics AG    Kleine Praesidentenstr. 1    10178 Berlin, Germany
phone:+49-30-24345-372                          fax:+49-30-24345-555
http://www.epigenomics.com           dirk.habighorst at epigenomics.com