[Biojava-l] Parsing blast result with a lot of hit

mark.schreiber at group.novartis.com
Thu Nov 4 20:01:03 EST 2004


Hello Lu Qiang -

We get this question a lot. I have posted below a recent response (by 
Thomas Down) to the same question:


Hi,

When you use the BioJava blast parser as described in the BJIA
article, it does build a fairly comprehensive set of objects which
reflect the contents of the blast output.  If those objects
turn out to be bigger than your available memory, then you'll
either have to split up the output or process it in a "streaming"
fashion.

The BioJava blast parsers actually work by converting the blast
output to XML, which is then presented to a SAX contenthandler.
The normal strategy is to use a ContentHandler which builds objects,
and this is what the BioJava BlastLikeSearchBuilder class is doing.
However, there's nothing to stop you writing a custom ContentHandler
which extracts the information you want directly from the XML
representation.  This strategy should let you process unlimited
amounts of blast output without running into memory problems, but
does involve a certain amount of work.  If you want to see what the
XML representation looks like, try the demos/nativeapps/BlastLike2XML.java
script, included in the BioJava source distribution.
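
As a rough illustration of the streaming approach, here is a minimal
sketch of a handler that counts hits as the SAX events arrive and keeps
nothing else in memory. It uses only the standard JDK SAX classes so it
can be run on its own. The class name StreamingHitCounter and the
element name "Hit" are assumptions made for this example; check the
output of BlastLike2XML for the element names the BioJava parser
actually emits before relying on them.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements with a given local name
// while the document streams past, instead of building an object tree.
class StreamingHitCounter extends DefaultHandler {
    private final String hitElement; // assumed element name, e.g. "Hit"
    private int hits = 0;

    StreamingHitCounter(String hitElement) {
        this.hitElement = hitElement;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Only a counter is updated; no per-hit objects are retained.
        if (hitElement.equals(localName)) {
            hits++;
        }
    }

    int getHits() {
        return hits;
    }

    // Convenience method for trying the handler on a plain XML string.
    static int countHits(String xml, String hitElement) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            XMLReader reader = factory.newSAXParser().getXMLReader();
            StreamingHitCounter handler = new StreamingHitCounter(hitElement);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.getHits();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

To use something like this on real blast output, the same handler would
be registered with the BioJava blast parser in place of the
object-building handler from the BJIA article, rather than with the JDK
XML parser used here.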

However, since you say "I don't see the memory usage going up
significantly", I'm wondering if your program is *really*
exhausting system memory, or if you're just hitting the default
limit on the Java heap size.  On many platforms, the default heap
size can be pretty low.  You can control it using the -Xmx and
-Xms options (try typing java -X for proper descriptions).  On
a machine with 1 GB of RAM, I'd suggest trying something like:

       java -Xmx850M YourProgram

This allows Java to use the bulk of system memory, while still leaving
a bit left for the operating system, etc.
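
One way to check whether the heap limit is the real problem is to print
the maximum heap the JVM believes it may use, with and without the -Xmx
option. The following is a small sketch using the standard Runtime API;
the class name HeapInfo is just a placeholder for this example.

```java
// Hypothetical helper class: reports the JVM's maximum heap size.
class HeapInfo {
    // Maximum amount of memory the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as "java HeapInfo" and then as "java -Xmx850M HeapInfo"
should show whether the option is being picked up on your platform.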

Hope this helps,

        Thomas.


Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910





"Lu Qiang" <luqiang at scbit.org>
Sent by: biojava-l-bounces at portal.open-bio.org
11/05/2004 02:42 AM

 
        To:     "biojava-l at biojava.org" <biojava-l at biojava.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] Parsing blast result with a lot of hit


Hi, Guys,

If we try to parse a blast result with a lot of hits, the machine
crashes, for example when 5000 sequences are blasted against themselves.

This must be caused by an ArrayList storing all the results.

How can we solve this problem?

regards,

Lu


_______________________________________________
Biojava-l mailing list  -  Biojava-l at biojava.org
http://biojava.org/mailman/listinfo/biojava-l




