[Biojava-l] BLAST parsing explodes in size

Keith James kdj at sanger.ac.uk
Wed Nov 12 05:40:26 EST 2003


>>>>> " " == VERHOEF Frans <verhoeff2 at gis.a-star.edu.sg> writes:

     > Hi Keith, Thanks for your response. I did paste the method
     > that's doing the parsing somewhere below. I also ran just now
     > this method trying to parse a blast output file with a size of
     > approximately 350mb. The output generated is this:

     > Before parsing: 402280 After parsing: 1043162496

     > With the number indicating the memory size of java in
     > bytes. That means that during the parsing (all biojava) the
     > size explodes from a mere 402kb to 1gb. After that the size
     > doesn't do much anymore.

A 350 MB report is more than enough to generate a very large number
of objects if you fully represent all hits, HSPs, alignments and
annotation.

At the top of your method you create a list to contain all your
results:

 List results = new ArrayList();

and pass it to the builder. Although you make a couple of System.gc()
calls further down, they do not address the cause of the problem:
this list is still in scope, so the objects within it cannot be
garbage collected. Because the BlastLikeSearchBuilder stores its
results in a List in this way, it is not appropriate for your
situation.

This is the same as choosing whether to parse XML using SAX or DOM -
only use DOM if you can afford to have the whole lot in memory at
once.
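
To make the SAX side of that comparison concrete, here is a minimal
sketch using only the JDK's own SAX parser. The element name "Hit" and
the class name SaxHitCounter are made up for illustration and are not
taken from any BLAST or BioJava format. The point is only that the
handler sees each element as an event and keeps a single counter, so
no tree of the whole document is ever built.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxHitCounter {

    /** Counts <Hit> elements as they stream past; nothing else is retained. */
    static int countHits(String xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // "Hit" is a hypothetical element name used for this sketch
                if ("Hit".equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Report><Hit id='a'/><Hit id='b'/><Other/></Report>";
        System.out.println(countHits(xml)); // prints 2
    }
}
```

Memory use here depends on what the handler chooses to keep, not on
the size of the input.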

The data you are saving in your output file are taken from a very
small subset of the objects being created (so you are not using most
of them). You need to extend the event-driven way of handling the data
from the SAXContentHandler right through the SearchContentHandler and
up to the point where you write to your file. Don't collect everything
as objects before you write.
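
As a rough sketch of what I mean by writing as you go: the handler
below keeps only the fields it needs for the hit currently being
parsed, writes one line when that hit ends, and then forgets it. The
HitEvents interface, the property keys "id" and "score" and the score
cutoff are all hypothetical stand-ins for this illustration. This is
not the BioJava SearchContentHandler interface, which has more
callbacks.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

public class StreamingHitWriter {

    /** Hypothetical callback interface standing in for a real search handler. */
    interface HitEvents {
        void startHit();
        void addHitProperty(String key, String value);
        void endHit();
    }

    /** Writes each passing hit immediately instead of collecting hits in a List. */
    static class WritingHandler implements HitEvents {
        private final Writer out;
        private final double minScore;
        private final Map<String, String> current = new HashMap<>();

        WritingHandler(Writer out, double minScore) {
            this.out = out;
            this.minScore = minScore;
        }

        public void startHit() {
            current.clear(); // drop everything from the previous hit
        }

        public void addHitProperty(String key, String value) {
            // keep only the fields that end up in the output file
            if (key.equals("id") || key.equals("score")) {
                current.put(key, value);
            }
        }

        public void endHit() {
            String id = current.get("id");
            String score = current.get("score");
            if (id == null || score == null) {
                return;
            }
            if (Double.parseDouble(score) < minScore) {
                return; // filtered out, never stored
            }
            try {
                out.write(id + "\t" + score + "\n");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Drives the handler with three made-up hits and returns what was written. */
    static String demo() {
        StringWriter out = new StringWriter();
        WritingHandler handler = new WritingHandler(out, 50.0);
        String[][] hits = {{"seqA", "120.5"}, {"seqB", "12.0"}, {"seqC", "75"}};
        for (String[] hit : hits) {
            handler.startHit();
            handler.addHitProperty("id", hit[0]);
            handler.addHitProperty("score", hit[1]);
            handler.addHitProperty("alignment", "ACGT"); // ignored by the handler
            handler.endHit();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

With a FileWriter in place of the StringWriter, the only state held
between hits is the small map for the current one.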

There is a working example in demos/ssbind (ProcessBlastReport) of
using this event and filtering approach.

Keith

-- 

- Keith James <kdj at sanger.ac.uk> Microarray Facility, Team 65 -
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK -

