[Biojava-dev] FastaFormat performance enhancement
mark.schreiber at novartis.com
Wed Oct 19 21:12:35 EDT 2005
Hi Thomas -
I can confirm this. I ran a profiler a while back after getting a similar
complaint. It seems that every time you call subList, a reference is added
to the parent SymbolList. For some reason this reference remains even after
the sublist itself is garbage collected. Oddly, as soon as you do an edit
operation, all the old references disappear.
The best way to see it happen is to assign lots of memory to the JVM and
infinitely loop over a sublist operation:
Sequence seq = ...
while (true) {
    SymbolList sl = seq.subList(1, 10);
}
You quickly accumulate thousands of references. I could never figure out
why they don't get released.
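
One mechanism that would be consistent with both observations (entries outliving the sublists, and vanishing after an edit) is a listener registry that holds weak references and only drops dead ones when a change is fired. The sketch below is a self-contained model of that idea, not BioJava code: Parent, subList, edit and slotCount are hypothetical stand-ins, and whether BioJava's change-listener support really behaves this way is an assumption that has not been checked against the source.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class WeakSlotDemo {
    /** Hypothetical stand-in for a sequence that notifies sublist views of edits. */
    public static class Parent {
        private final List<WeakReference<Object>> slots = new ArrayList<>();

        /** Every call registers a new view; nothing is purged here. */
        public Object subList(int start, int end) {
            Object view = new int[] {start, end};
            slots.add(new WeakReference<>(view));
            return view;
        }

        /** An edit notifies listeners and, as a side effect, drops dead slots. */
        public void edit() {
            slots.removeIf(ref -> ref.get() == null);
        }

        public int slotCount() {
            return slots.size();
        }
    }

    public static void main(String[] args) {
        Parent seq = new Parent();
        for (int i = 0; i < 100_000; i++) {
            seq.subList(1, 10); // the view is discarded straight away
        }
        System.out.println("slots before edit: " + seq.slotCount());
        System.gc(); // a hint only; collection is not guaranteed
        seq.edit();
        System.out.println("slots after edit:  " + seq.slotCount());
    }
}
```

In this model the slot count grows by one per subList call no matter how often the collector runs, because only the views are weakly held, not the small wrapper objects that point at them.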
- Mark
ml-it-biojava-dev at epigenomics.com
Sent by: biojava-dev-bounces at portal.open-bio.org
10/20/2005 12:28 AM
To: biojava-dev at biojava.org
cc: (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-dev] FastaFormat performance enhancement
Dirk Habighorst wrote:
> Thomas Down wrote:
>
>>
>> On 19 Oct 2005, at 09:41, ml-it-biojava-dev at epigenomics.com wrote:
>>
>>> Hi,
>>> I had a lot of trouble using SeqIOTools.writeFasta on large
>>> sequences. The subStr method of SymbolList seems to introduce a
>>> memory leak (I did not track that down in detail!). Anyway, I would
>>> suggest changing FastaFormat from:
>>> public void writeSequence(Sequence seq, PrintStream os)
>>>     throws IOException {
>>>     os.print(">");
>>>     os.println(describeSequence(seq));
>>>     int length = seq.length();
>>>     for (int pos = 1; pos <= length; pos += lineWidth) {
>>>         int end = Math.min(pos + lineWidth - 1, length);
>>>         os.println(seq.subStr(pos, end));
>>>     }
>>> }
>>>
>>> to
>>> public void writeSequence(Sequence seq, PrintStream os)
>>>     throws IOException {
>>>     os.print(">");
>>>     os.println(describeSequence(seq));
>>>     int length = seq.length();
>>>     String seqString = seq.seqString();
>>>     for (int pos = 0; pos < length; pos += lineWidth) {
>>>         int end = Math.min(pos + lineWidth, length);
>>>         String sub = seqString.substring(pos, end);
>>>         os.println(sub);
>>>     }
>>> }
>>>
>>> Since only String manipulation takes place in the loop, I think
>>> there is no point in using SymbolList subStr there anyway.
>>
>>
>>
>> Hi,
>>
>> I'd argue against this patch since it could potentially generate some
>> really huge strings. Suppose I've got a Sequence object representing
>> human chromosome 1 (somewhere around 220Mb). If this is a database-
>> backed object with chunks of sequence lazy-loaded on demand (biojava-
>> ensembl does this, for example) then there'll be no problem working
>> with it even on a fairly modest PC. But converting the whole thing
>> to a String is going to use at least 440Mb of RAM, and could easily
>> cause an OutOfMemoryError.
>>
>> I'd be fine with stringifying sequences in larger chunks rather than
>> one line at a time -- but I think we should be cautious about
>> stringifying complete large sequences.
>>
>> Do you have any idea where the memory leak might be? I'd be
>> interested to track it down. What sort of sequences were you using?
>>
>> Thomas
>>
> Hi Thomas,
>
> I experienced performance problems (even OutOfMemoryError) when working
> with large Sequences (not lazy loaded). You might want to check this
> little example:
>
> package test;
>
> import java.io.FileNotFoundException;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.Properties;
>
> import org.biojava.bio.seq.DNATools;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.symbol.IllegalSymbolException;
> import org.ensembl.datamodel.CoordinateSystem;
> import org.ensembl.datamodel.Location;
> import org.ensembl.datamodel.Sequence;
> import org.ensembl.datamodel.SequenceRegion;
> import org.ensembl.driver.AdaptorException;
> import org.ensembl.driver.ConfigurationException;
> import org.ensembl.driver.CoreDriver;
> import org.ensembl.driver.DriverManager;
> import org.ensembl.driver.SequenceAdaptor;
> import org.ensembl.driver.SequenceRegionAdaptor;
>
>
> public class ExportFasta
> {
>
>     /**
>      * @param args host, user, database, result file, coordinate system,
>      *             chunk size
>      */
>     public static void main (String[] args) {
>         Properties props = createDriverProperties (args);
>         try {
>             OutputStream os;
>             os = new FileOutputStream (args[3]);
>
>             CoreDriver coreDriver = DriverManager.loadDriver (props);
>             SequenceRegionAdaptor sra = coreDriver.getSequenceRegionAdaptor();
>             SequenceAdaptor sa = coreDriver.getSequenceAdaptor();
>             CoordinateSystem coordinateSystem = new CoordinateSystem (args[4]);
>             SequenceRegion[] srs =
>                 sra.fetchAllByCoordinateSystem(coordinateSystem);
>             int size = Integer.parseInt(args[5]);
>             for (SequenceRegion seqRegion : srs) {
>                 Location loc = null;
>                 int length = (int) seqRegion.getLength();
>                 int start = 1;
>                 int end;
>                 // <= so that a final single-base chunk is not skipped
>                 while (start <= length) {
>                     end = start + size - 1 < length ? start + size - 1 : length;
>                     loc = new Location (coordinateSystem, seqRegion.getName(),
>                                         start, end, 1);
>                     System.out.println(loc);
>                     start = end + 1;
>                     Sequence seq = sa.fetch(loc);
>                     org.biojava.bio.seq.Sequence bioseq =
>                         DNATools.createDNASequence(seq.getString(), loc.toString());
>                     SeqIOTools.writeFasta(os, bioseq);
>                 }
>             }
>             os.close();
>         }
>         catch (ConfigurationException e) {
>             e.printStackTrace();
>         }
>         catch (AdaptorException e) {
>             e.printStackTrace();
>         }
>         catch (FileNotFoundException e) {
>             e.printStackTrace();
>         }
>         catch (IllegalSymbolException e) {
>             e.printStackTrace();
>         }
>         catch (IOException e) {
>             e.printStackTrace();
>         }
>     }
>
>     private static Properties createDriverProperties (String[] args) {
>         Properties props = new Properties ();
>         props.setProperty("host", args[0]);
>         props.setProperty("user", args[1]);
>         props.setProperty("database", args[2]);
>         return props;
>     }
>
> }
>
> java -cp ... test.ExportFasta ENSEMBL_HOST ENSEMBL_USER ENSEMBL_DATABASE
> RESULT_FILE COORDINATE_SYSTEM CHUNK_SIZE
>
> Since the chunk size is constant, the memory required should stay
> constant too. With large chunks (1000000), allocated memory keeps growing!
> Hope that helps, dirk
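
The middle ground Thomas describes above (stringify in larger chunks, but never the whole sequence) might look roughly like the sketch below. It is only an illustration: the Seq interface is a hypothetical stand-in for the 1-based, inclusive subStr of SymbolList, and write, lineWidth and linesPerChunk are made-up names rather than anything in FastaFormat.

```java
import java.io.PrintStream;

public class ChunkedFastaWriter {
    /** Hypothetical minimal sequence view: 1-based, inclusive coordinates. */
    public interface Seq {
        int length();
        String subStr(int start, int end);
    }

    public static void write(Seq seq, String description, PrintStream os,
                             int lineWidth, int linesPerChunk) {
        os.print(">");
        os.println(description);
        int length = seq.length();
        int chunkSize = lineWidth * linesPerChunk;
        for (int chunkStart = 1; chunkStart <= length; chunkStart += chunkSize) {
            int chunkEnd = Math.min(chunkStart + chunkSize - 1, length);
            // One subStr call per chunk instead of one per output line.
            String chunk = seq.subStr(chunkStart, chunkEnd);
            // Plain String slicing inside the chunk; 0-based, end exclusive.
            for (int pos = 0; pos < chunk.length(); pos += lineWidth) {
                int end = Math.min(pos + lineWidth, chunk.length());
                os.println(chunk.substring(pos, end));
            }
        }
    }
}
```

Peak string size is then bounded by lineWidth * linesPerChunk characters regardless of sequence length, and the number of subStr calls drops by a factor of linesPerChunk.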
Hi Thomas,
I did a little debugging myself and found an interesting place to look at!
The SimpleSymbolList backing Sequences created with DNATools
implements subList like this:
public SymbolList subList(int start, int end){
    if (start < 1 || end > length()) {
        throw new IndexOutOfBoundsException(
            "Sublist index out of bounds " + length() + ":" +
            start + "," + end
        );
    }
    if (end < start) {
        throw new IllegalArgumentException(
            "end must not be lower than start: start=" + start +
            ", end=" + end
        );
    }
    SimpleSymbolList sl =
        new SimpleSymbolList(this, viewOffset+start, viewOffset+end);
    if (isView){
        referenceSymbolList.addChangeListener(sl);
    }else{
        this.addChangeListener(sl);
    }
    return sl;
}
So every subList call registers the new SymbolList with the original
Sequence via addChangeListener, and those references keep piling up. It
appears that garbage collection can't keep up with that if the Sequence is
too long. I have not checked this in detail, though.
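
If the listeners are indeed what piles up, one possible mitigation is to register them weakly and drop dead entries on every registration rather than waiting for some later event. The following is a minimal sketch of that technique using java.lang.ref.ReferenceQueue; ReapingListenerRegistry and its methods are hypothetical names, and this is not a patch against SimpleSymbolList.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashSet;
import java.util.Set;

public class ReapingListenerRegistry<L> {
    // The collector enqueues a reference here once its listener is gone.
    private final ReferenceQueue<L> dead = new ReferenceQueue<>();
    private final Set<Reference<? extends L>> slots = new HashSet<>();

    /** Drops dead slots first, then registers the listener weakly. */
    public WeakReference<L> add(L listener) {
        reap();
        WeakReference<L> ref = new WeakReference<>(listener, dead);
        slots.add(ref);
        return ref;
    }

    /** Removes every slot that has been enqueued since the last call. */
    public void reap() {
        Reference<? extends L> ref;
        while ((ref = dead.poll()) != null) {
            slots.remove(ref);
        }
    }

    public int size() {
        return slots.size();
    }
}
```

With this design the registry can only grow as far as the number of listeners that are still alive plus whatever the collector has not yet processed, so a tight subList loop would no longer accumulate entries without bound.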
ciao, dirk
--
Dirk Habighorst Software Engineer/ Bioinformatician
Epigenomics AG Kleine Praesidentenstr. 1 10178 Berlin, Germany
phone:+49-30-24345-372 fax:+49-30-24345-555
http://www.epigenomics.com dirk.habighorst at epigenomics.com
_______________________________________________
biojava-dev mailing list
biojava-dev at biojava.org
http://biojava.org/mailman/listinfo/biojava-dev