[Biojava-dev] FastaFormat performance enhancement
ml-it-biojava-dev at epigenomics.com
Wed Oct 19 12:28:37 EDT 2005
Dirk Habighorst wrote:
> Thomas Down wrote:
>
>>
>> On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
>>
>>> Hi,
>>> I had a lot of trouble using SeqIOTools.writeFasta on large
>>> sequences. The subStr method of SymbolList seems to introduce a
>>> memory leak (I did not track that in detail!). Anyway I would
>>> suggest to change FastaFormat:
>>>     public void writeSequence(Sequence seq, PrintStream os)
>>>             throws IOException {
>>>         os.print(">");
>>>         os.println(describeSequence(seq));
>>>         int length = seq.length();
>>>         for (int pos = 1; pos <= length; pos += lineWidth) {
>>>             int end = Math.min(pos + lineWidth - 1, length);
>>>             os.println(seq.subStr(pos, end));
>>>         }
>>>     }
>>>
>>> to
>>>     public void writeSequence(Sequence seq, PrintStream os)
>>>             throws IOException {
>>>         os.print(">");
>>>         os.println(describeSequence(seq));
>>>         int length = seq.length();
>>>         String seqString = seq.seqString();
>>>         for (int pos = 0; pos < length; pos += lineWidth) {
>>>             int end = Math.min(pos + lineWidth, length);
>>>             String sub = seqString.substring(pos, end);
>>>             os.println(sub);
>>>         }
>>>     }
>>>
>>> since it is String manipulation that takes place in the loop, I
>>> think there is no point in using SymbolList subStr anyway.
>>
>>
>>
>> Hi,
>>
>> I'd argue against this patch since it could potentially generate some
>> really huge strings. Suppose I've got a Sequence object representing
>> human chromosome 1 (somewhere around 220Mb). If this is a database-
>> backed object with chunks of sequence lazy-loaded on demand (biojava-
>> ensembl does this, for example) then there'll be no problem working
>> with it even on a fairly modest PC. But converting the whole thing
>> to a String is going to use at least 440Mb of RAM, and could easily
>> cause an OutOfMemoryError.
>>
>> I'd be fine with stringifying sequences in larger chunks rather than
>> one line at a time -- but I think we should be cautious about
>> stringifying complete large sequences.
>>
>> Do you have any idea where the memory leak might be? I'd be
>> interested to track it down. What sort of sequences were you using?
>>
>> Thomas
>>
> Hi Thomas,
>
> I experienced performance problems (even OutOfMemoryError) when working
> with large Sequences (not lazy loaded). You might want to check this
> little example:
>
> package test;
>
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Properties;
>
> import org.biojava.bio.seq.DNATools;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.symbol.IllegalSymbolException;
> import org.ensembl.datamodel.CoordinateSystem;
> import org.ensembl.datamodel.Location;
> import org.ensembl.datamodel.Sequence;
> import org.ensembl.datamodel.SequenceRegion;
> import org.ensembl.driver.AdaptorException;
> import org.ensembl.driver.ConfigurationException;
> import org.ensembl.driver.CoreDriver;
> import org.ensembl.driver.DriverManager;
> import org.ensembl.driver.SequenceAdaptor;
> import org.ensembl.driver.SequenceRegionAdaptor;
>
>
> public class ExportFasta
> {
>
>     /**
>      * @param args ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE RESULT_FILE
>      *             COORDINATE_SYSTEM CHUNK_SIZE
>      */
>     public static void main (String[] args) {
>         Properties props = createDriverProperties (args);
>         try {
>             OutputStream os;
>             os = new FileOutputStream (args[3]);
>
>             CoreDriver coreDriver = DriverManager.loadDriver (props);
>             SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
>             SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
>             CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
>             SequenceRegion[] srs =
>                 sra.fetchAllByCoordinateSystem(coordinateSystem);
>             int size = Integer.parseInt(args[5]);
>             for (SequenceRegion seqRegion : srs) {
>                 Location loc = null;
>                 int length = (int) seqRegion.getLength();
>                 int start = 1;
>                 int end;
>                 // <= so that a final one-base chunk is not skipped
>                 while (start <= length) {
>                     end = start + size - 1 < length ? start + size - 1 : length;
>                     loc = new Location (coordinateSystem, seqRegion.getName(),
>                         start, end, 1);
>                     System.out.println(loc);
>                     start = end + 1;
>                     Sequence seq = sa.fetch(loc);
>                     org.biojava.bio.seq.Sequence bioseq =
>                         DNATools.createDNASequence(seq.getString(), loc.toString());
>                     SeqIOTools.writeFasta(os, bioseq);
>                 }
>             }
>         }
>         catch (ConfigurationException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (AdaptorException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (FileNotFoundException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (IllegalSymbolException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>         catch (IOException e) {
>             // TODO Auto-generated catch block
>             e.printStackTrace();
>         }
>     }
>
>     private static Properties createDriverProperties (String[] args) {
>         Properties props = new Properties ();
>         props.setProperty("host", args[0]);
>         props.setProperty("user", args[1]);
>         props.setProperty("database", args[2]);
>         return props;
>     }
>
> }
>
> java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE
> RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
>
> Since the chunk size is constant, the memory required should stay
> constant too. With large chunks (1000000) the allocated memory keeps growing!
> hope that helps, dirk
Hi Thomas,
I did a little debugging myself and found an interesting place to look at! The SimpleSymbolList backing Sequences created with DNATools implements subList like this:
    public SymbolList subList(int start, int end){
        if (start < 1 || end > length()) {
            throw new IndexOutOfBoundsException(
                "Sublist index out of bounds " + length() + ":" + start + "," + end
            );
        }
        if (end < start) {
            throw new IllegalArgumentException(
                "end must not be lower than start: start=" + start + ", end=" + end
            );
        }
        SimpleSymbolList sl = new SimpleSymbolList(this,viewOffset+start,viewOffset+end);
        if (isView){
            referenceSymbolList.addChangeListener(sl);
        }else{
            this.addChangeListener(sl);
        }
        return sl;
    }
So every call to subList registers another SymbolList on the original Sequence via the addChangeListener method. It appears that garbage collection can't keep up with that if the Sequence is too long. I have not checked this in detail, though.
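To make the suspected effect concrete, here is a minimal toy sketch in plain Java. It is not BioJava code: ToyList, ToyListener and listenersAfterWrite are made-up names, and the sketch assumes that the parent holds every registered view by a strong reference, which has not been verified against the real change-listener implementation. Under that assumption, writing a sequence line by line leaves one registered view per output line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: none of these types are BioJava classes. */
class ListenerGrowthSketch {

    /** Hypothetical listener interface standing in for a change listener. */
    interface ToyListener {
        void parentChanged();
    }

    /** Hypothetical symbol list whose sub-lists register themselves on the parent. */
    static class ToyList implements ToyListener {
        private final String symbols;
        private final List<ToyListener> listeners = new ArrayList<ToyListener>();

        ToyList(String symbols) {
            this.symbols = symbols;
        }

        /** Returns the range [start, end] (1-based, inclusive) and registers it. */
        ToyList subList(int start, int end) {
            ToyList view = new ToyList(symbols.substring(start - 1, end));
            // Assumption: a strong reference, so the parent keeps the view alive.
            listeners.add(view);
            return view;
        }

        String seqString() {
            return symbols;
        }

        int listenerCount() {
            return listeners.size();
        }

        @Override
        public void parentChanged() {
            // A real view would invalidate cached state here.
        }
    }

    /** Mimics line-by-line FASTA output; returns how many views stay registered. */
    static int listenersAfterWrite(int length, int lineWidth) {
        char[] bases = new char[length];
        Arrays.fill(bases, 'a');
        ToyList parent = new ToyList(new String(bases));

        StringBuilder out = new StringBuilder();
        for (int pos = 1; pos <= length; pos += lineWidth) {
            int end = Math.min(pos + lineWidth - 1, length);
            out.append(parent.subList(pos, end).seqString()).append('\n');
        }
        return parent.listenerCount();
    }

    public static void main(String[] args) {
        // One million symbols at 60 per line, as in the chunk size mentioned above.
        System.out.println(listenersAfterWrite(1000000, 60));
    }
}
```

With 1,000,000 symbols and 60 symbols per line, main prints 16667: one view per output line, all still reachable from the parent for as long as the parent itself is alive. If the real listener list behaves like this, dropping the listener when a view is no longer needed, or holding listeners through weak references and pruning cleared ones, would be two possible ways to bound the growth.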
ciao, dirk
--
Dirk Habighorst Software Engineer/ Bioinformatician
Epigenomics AG Kleine Praesidentenstr. 1 10178 Berlin, Germany
phone:+49-30-24345-372 fax:+49-30-24345-555
http://www.epigenomics.com dirk.habighorst at epigenomics.com