[Biojava-l] BLAST parsing explodes in size
VERHOEF Frans
verhoeff2 at gis.a-star.edu.sg
Wed Nov 12 04:37:22 EST 2003
Hi Keith,
Thanks for your response. I have pasted the method that does the
parsing below. I also just ran this method on a BLAST output file of
approximately 350 MB. The output generated is this:
Before parsing: 402280
After parsing: 1043162496
The numbers are the memory used by Java, in bytes. That means that
during the parsing (all BioJava) the usage explodes from a mere
402 KB to about 1 GB. After that it hardly changes.
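For reference, the figures above come from totalMemory() minus freeMemory(), as in the method further down. A minimal sketch of that measurement, with a conversion to MiB for readability, might look like the following. The class and method names (HeapUsage, usedBytes, toMiB) are made up for illustration. Note that the value includes garbage that has not been collected yet, so it is an upper bound on the live data rather than an exact count.

```java
/** Hypothetical helper, for illustration only: reports used heap in bytes and MiB. */
final class HeapUsage {
    private HeapUsage() {}

    /** Heap currently in use: total heap claimed by the JVM minus the free part of it. */
    static long usedBytes() {
        Runtime r = Runtime.getRuntime();
        return r.totalMemory() - r.freeMemory();
    }

    /** Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long before = usedBytes();
        byte[] block = new byte[16 * 1024 * 1024]; // allocate something measurable
        long after = usedBytes();
        System.out.println("Allocated " + block.length + " bytes");
        System.out.println("Before: " + before + " bytes (" + toMiB(before) + " MiB)");
        System.out.println("After:  " + after + " bytes (" + toMiB(after) + " MiB)");
    }
}
```

Applied to the numbers reported above, toMiB(1043162496L) comes out at a little under 995 MiB.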
For your information, I am using the following:
- NCBI Blast 2.2.4
- Java 1.4.2_01
- Linux
- BioJava from CVS, last updated on 21 October
Hopefully you will now tell me I am doing something stupid ;-)
private void parseBlastOutput(File file) throws Exception {
    Runtime r = Runtime.getRuntime();
    System.out.println("Before parsing: " +
        (r.totalMemory() - r.freeMemory()));
    InputStream is = new FileInputStream(file);
    BlastLikeSAXParser parser = new BlastLikeSAXParser();
    parser.setModeLazy();
    SeqSimilarityAdapter adapter = new SeqSimilarityAdapter();
    parser.setContentHandler(adapter);
    List results = new ArrayList();
    SearchContentHandler builder = new BlastLikeSearchBuilder(results,
        new DummySequenceDB("queries"), new DummySequenceDBInstallation());
    adapter.setSearchContentHandler(builder);
    parser.parse(new InputSource(is));
    for (Iterator i = results.iterator(); i.hasNext(); ) {
        System.out.println("Iterating: " +
            (r.totalMemory() - r.freeMemory()));
        SeqSimilaritySearchResult result =
            (SeqSimilaritySearchResult) i.next();
        org.biojava.bio.Annotation anno = result.getAnnotation();
        String queryID = (String) anno.getProperty("queryId");
        String database =
            this.parseNameFromDBPath((String) anno.getProperty("databaseId"));
        String lib = this.parseIDForLibrary(queryID);
        BlastSetting bsetting = null;
        if (lib != null && database != null) {
            bsetting = adaptor.fetchSetting(lib, database);
        }
        if (lib == null || database == null || bsetting == null) {
            // means no blast setting can be found for this library and database
            System.out.println("HELP!!!!!");
            throw new Exception("Cannot find Blast Setting in database for library "
                + lib + " and blastdatabase " + database);
        }
        File outFile = new File(destDir, queryID + ".out");
        BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
        out.write("queryID\tqueryStart\tqueryEnd\tdatabase\tsubjectID\t"
            + "subjectStart\tsubjectEnd\tscore\teValue\tDescription\n");
        List hits = result.getHits();
        //System.out.println("Start writing with " + hits.size() + " hits.");
        for (int j = 0; j < hits.size(); j++) {
            SeqSimilaritySearchHit hit =
                (SeqSimilaritySearchHit) hits.get(j);
            if (hit.getEValue() > bsetting.getMaxEValue()) {
                break;
            }
            //System.out.println("HIT!!!");
            org.biojava.bio.Annotation hitAnno = hit.getAnnotation();
            String description =
                hitAnno.containsProperty("subjectDescription") ?
                (String) hitAnno.getProperty("subjectDescription") : "No Description";
            out.write(queryID + "\t");
            out.write(hit.getQueryStart() + "\t");
            out.write(hit.getQueryEnd() + "\t");
            out.write(database + "\t");
            out.write(hit.getSubjectID() + "\t");
            out.write(hit.getSubjectStart() + "\t");
            out.write(hit.getSubjectEnd() + "\t");
            out.write(hit.getScore() + "\t");
            out.write(hit.getEValue() + "\t");
            out.write(description + "\n");
            out.flush();
            hitAnno = null; description = null; hit = null;
            System.gc();
        }
        out.close();
        hits = null; out = null; outFile = null; bsetting = null; lib = null;
        database = null; queryID = null; anno = null; result = null;
        System.gc();
    }
    file.delete();
}
> -----Original Message-----
> From: Keith James [mailto:kdj at sanger.ac.uk]
> Sent: Wednesday, November 12, 2003 12:25 AM
> To: VERHOEF Frans
> Cc: biojava-l at biojava.org
> Subject: Re: [Biojava-l] BLAST parsing explodes in size
>
> >>>>> "FV" == VERHOEF Frans <verhoeff2 at gis.a-star.edu.sg> writes:
>
> FV> Hi, I am having a problem parsing huge blast
> FV> results. Basically I am parsing the blast results pretty much
> FV> the same way as in "Biojava in Anger"; the only difference is
> FV> that I use the setModeLazy() of the BlastLikeSAXParser, since
> FV> I am using NCBI Blast version 2.2.4 and that version is not
> FV> recognised by the parser yet.
>
> Using blast 2.2.4 or 2.2.6 is safe in lazy mode - diffs show only
> minor whitespace changes in the format.
>
> FV> Besides that, the only difference lies in what I do with
> FV> the data.
>
> This is likely to be the cause of the problem. See below.
>
> FV> The problem is that when I parse a blast result that is a few
> FV> hundred MB, for example 300MB, the java application is
> FV> ballooning up to around 1.6GB of memory. Sometimes the
> FV> application even crashes because I only have got 2GB to play
> FV> with.
>
> The parser uses an event driven framework which is designed to handle
> very big data - it will handle multi-GB reports. However, if you
> create many fine-grained objects for every element of every report you
> will quickly run out of resources.
>
> FV> Does anyone know what's causing this? Is it because I set the
> FV> lazy mode? Is there any way to work around it?
>
> Either you need to think about which elements of the report you are
> interested in and build a filter which captures those events,
> discarding the rest. See the demos/ssbind package for an example by
> Matthew. Or if you really need all those objects then you should look
> at allowing them to be garbage-collected as soon as possible.
>
> It is possible that there is a bug somewhere, but without seeing
> any code it isn't possible to say much more. If you need more help,
> post a short (working) piece of code illustrating the problem and we
> will do our best.
>
> hth
>
> Keith
>
> --
>
> - Keith James <kdj at sanger.ac.uk> Microarray Facility, Team 65 -
> - The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK -
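Keith's second suggestion, letting objects be garbage-collected as soon as possible, could be sketched roughly as follows. The idea is to hand the builder a List that processes each finished result immediately and keeps no reference to it, instead of an ArrayList that holds every result until parsing ends. This is only a sketch under a stated assumption: that the builder delivers each completed result through List.add(), which has not been checked against the BioJava source here. The class name ConsumingList is made up, and the example uses generics and lambdas from newer Java versions, not Java 1.4.

```java
import java.util.AbstractList;
import java.util.function.Consumer;

/**
 * Hypothetical illustration: a List that passes every added element to a
 * handler and then forgets it, so the element can be garbage-collected as
 * soon as the handler returns.
 */
final class ConsumingList<E> extends AbstractList<E> {
    private final Consumer<? super E> handler;
    private int seen = 0;

    ConsumingList(Consumer<? super E> handler) {
        this.handler = handler;
    }

    /** Processes the element right away; nothing is stored. */
    @Override
    public boolean add(E item) {
        handler.accept(item);
        seen++;
        return true;
    }

    /** No element is retained, so there is never anything to return. */
    @Override
    public E get(int index) {
        throw new IndexOutOfBoundsException("ConsumingList retains no elements");
    }

    /** Always empty from the caller's point of view. */
    @Override
    public int size() {
        return 0;
    }

    /** Number of elements handled so far. */
    int seenCount() {
        return seen;
    }

    public static void main(String[] args) {
        // Stand-in for "write one result to its output file".
        ConsumingList<String> sink =
            new ConsumingList<>(s -> System.out.println("handled " + s));
        sink.add("query1");
        sink.add("query2");
        System.out.println(sink.seenCount() + " handled, " + sink.size() + " retained");
    }
}
```

If the assumption about add() holds, the per-result work from the loop in parseBlastOutput would move into the handler, and the iteration over results after parsing would no longer be needed.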