[Biojava-l] Parsing a huge Blast File with Biojava

Thomas Down thomas at derkholm.net
Mon Nov 1 10:15:27 EST 2004


On Mon, Nov 01, 2004 at 03:49:39PM +0100, Can Gencer wrote:
> Hello everyone,
> 
> We are trying to parse a quite large multiple BLAST results file (around
> 4GB), and the computer available has 1GB of RAM. However, when the code
> in the cookbook is used (
> "http://www.biojava.org/docs/bj_in_anger/BlastParser.htm"), using the
> BlastLikeSAXParser it will give out an OutOfMemory exception after a
> short while, and when I monitor the system during the parsing, I don't
> see the memory usage going up significantly. It is the
> parse(InputSource) method that throws the exception. Is there a way to
> solve this problem ?

Hi,

When you use the BioJava blast parser as described in the BJIA
article, it does build a fairly comprehensive set of objects which
reflect the contents of the blast output.  If those objects
turn out to be bigger than your available memory, then you'll
either have to split up the output or process it in a "streaming"
fashion.

The BioJava blast parsers actually work by converting the blast
output to XML, which is then presented to a SAX ContentHandler.
The normal strategy is to use a ContentHandler which builds objects,
and this is what the BioJava BlastLikeSearchBuilder class is doing.
However, there's nothing to stop you writing a custom ContentHandler
which extracts the information you want directly from the XML
representation.  This strategy should let you process unlimited
amounts of blast output without running into memory problems, but
does involve a certain amount of work.  If you want to see what the
XML representation looks like, try the demos/nativeapps/BlastLike2XML.java
program, included in the BioJava source distribution.
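
To make the streaming idea concrete, here is a minimal sketch of a custom
ContentHandler that keeps only a running count per element name and never
builds an object tree. It uses the JDK's own SAX parser on a tiny inline
document so that it runs on its own; the element names in the sample
(Report, Hit, HSP) are invented placeholders, not the names BioJava really
emits, and the class and method names are hypothetical. With BioJava you
would presumably hand such a handler to the BlastLikeSAXParser via
setContentHandler() instead of the object-building adapter, but check that
against the version you have.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical streaming handler: it stores one counter per distinct
// element name, so memory use does not grow with the size of the input.
class ElementCountingHandler extends DefaultHandler {
    private final Map<String, Integer> counts = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        // Called once per opening tag; nothing from the event is retained.
        counts.merge(qName, 1, Integer::sum);
    }

    Map<String, Integer> getCounts() {
        return counts;
    }

    // Runs the JDK SAX parser over the input and returns the tallies.
    static Map<String, Integer> countElements(InputSource input) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            ElementCountingHandler handler = new ElementCountingHandler();
            reader.setContentHandler(handler);
            reader.parse(input);
            return handler.getCounts();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder document standing in for the parser's XML events.
        String xml = "<Report>"
                   + "<Hit id=\"a\"><HSP/><HSP/></Hit>"
                   + "<Hit id=\"b\"><HSP/></Hit>"
                   + "</Report>";
        System.out.println(countElements(new InputSource(new StringReader(xml))));
    }
}
```

A real handler would pull out whatever fields you care about in
startElement(), characters() and endElement(), and write each result out
(or aggregate it) as soon as the enclosing element closes.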

However, since you say "I don't see the memory usage going up
significantly", I'm wondering if your program is *really*
exhausting system memory, or if you're just hitting the default
limit on the Java heap size.  On many platforms, the default heap
size can be pretty low.  You can control it using the -Xmx (maximum
heap) and -Xms (initial heap) options (try typing java -X for proper
descriptions).  On a 1GB machine, I'd suggest trying something like:

       java -Xmx850M YourProgram

This allows Java to use the bulk of system memory, while still leaving
a little for the operating system, etc.
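
If you want to confirm which limit the JVM is actually running with, a
quick check along these lines should do it. This is only a sketch: the
class name is made up, and the figures reported depend on your JVM and
platform.

```java
// Hypothetical helper that prints the heap limits of the running JVM.
class HeapLimit {
    private static final long MB = 1024L * 1024L;

    // Upper bound the JVM will try to use for the heap (governed by -Xmx).
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("Max heap (MB):         " + maxHeapMegabytes());
        // Heap currently reserved by the JVM; it can grow up to the maximum.
        System.out.println("Allocated heap (MB):   " + rt.totalMemory() / MB);
        // Unused portion of the currently reserved heap.
        System.out.println("Free in allocated (MB): " + rt.freeMemory() / MB);
    }
}
```

Run it once with no options and once with -Xmx850M; if the first figure
changes, then the default heap limit is what your parser was hitting.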

Hope this helps,

        Thomas.

