[Biojava-l] BLAST parsing explodes in size

Keith James kdj at sanger.ac.uk
Tue Nov 11 11:21:52 EST 2003


>>>>> "FV" == VERHOEF Frans <verhoeff2 at gis.a-star.edu.sg> writes:

    FV> Hi, I am having a problem parsing huge blast
    FV> results. Basically I am parsing the blast results pretty much
    FV> the same way as in "Biojava in Anger", with as only difference
    FV> that I use the setModeLazy() of the BlastLikeSAXParser, since
    FV> I am using NCBI Blast version 2.2.4 and that version is not
    FV> recognised by the parser yet.

Using blast 2.2.4 or 2.2.6 is safe in lazy mode - diffs show only
minor whitespace changes in the format.

    FV> Besides that, the only difference lies in the things I do with
    FV> the data.

This is likely to be the cause of the problem. See below.

    FV> The problem is that when I parse a blast result that is a few
    FV> hundred MB, for example 300MB, the java application is
    FV> ballooning up to around 1.6GB of memory. Sometimes the
    FV> application even crashes because I only have got 2GB to play
    FV> with.

The parser uses an event-driven framework which is designed to handle
very large inputs - it will handle multi-GB reports. However, if you
create many fine-grained objects for every element of every report
and keep them all reachable, you will quickly run out of memory.
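
To illustrate the point about event handling, here is a minimal
sketch of a handler that consumes events without creating an object
per element. It assumes the parser delivers SAX events to a standard
org.xml.sax ContentHandler. The element name "Hit" and the sample XML
are placeholders I made up for the example, not the parser's real
vocabulary, and the JDK's own SAX parser stands in for the BLAST
parser so that the sketch runs on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Counts hits as events stream past; allocates nothing per element. */
public class HitCounter extends DefaultHandler {
    private long hits;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        // "Hit" is an assumed element name, used for illustration only.
        if ("Hit".equals(qName)) {
            hits++;
        }
    }

    public long getHits() {
        return hits;
    }

    /** Convenience wrapper: parse an XML string and return the hit count. */
    public static long countHits(String xml) {
        HitCounter counter = new HitCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), counter);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return counter.getHits();
    }

    public static void main(String[] args) {
        String xml = "<Report><Hit/><Hit/><Hit/></Report>";
        System.out.println(countHits(xml)); // prints 3
    }
}
```

The handler's state is a single long, so the handler itself adds
nothing to the heap as the report grows.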

    FV> Does anyone know what's causing this? Is it because I set the
    FV> lazy mode?  Is there any way to work around it?

You have two options. Either decide which elements of the report you
are interested in and build a filter which captures those events,
discarding the rest (see the demos/ssbind package for an example by
Matthew), or, if you really need all those objects, make sure they
can be garbage-collected as soon as possible.
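
A rough sketch of both ideas together, under the same assumptions as
any plain SAX handler: the filter reacts only to the element it cares
about, hands each captured value to a caller-supplied sink, and keeps
no reference itself, so the value can be collected once the sink is
done with it. The names "HitId" and "id", the sink callback and the
sample XML are hypothetical. This is not the code from demos/ssbind.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Passes hit identifiers to a sink and ignores every other event. */
public class HitIdFilter extends DefaultHandler {
    private final Consumer<String> sink;

    public HitIdFilter(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        // "HitId" and its "id" attribute are assumed names.
        if ("HitId".equals(qName)) {
            String id = attrs.getValue("id");
            if (id != null) {
                sink.accept(id); // the filter keeps no reference to id
            }
        }
        // All other elements fall through and are discarded.
    }

    /** Runs the filter over an XML string with the given sink. */
    public static void filter(String xml, Consumer<String> sink) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new HitIdFilter(sink));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    /** Test helper: here the caller chooses to retain the ids in a list. */
    public static List<String> collect(String xml) {
        List<String> ids = new ArrayList<>();
        filter(xml, ids::add);
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<Report>"
            + "<Hit><HitId id=\"seqA\"/><Alignment>ACGT</Alignment></Hit>"
            + "<Hit><HitId id=\"seqB\"/><Alignment>TTGA</Alignment></Hit>"
            + "</Report>";
        // Stream each id straight to stdout; nothing accumulates.
        filter(xml, System.out::println);
    }
}
```

Whether memory stays flat then depends only on the sink: writing each
value to a file or database lets it go, whereas adding everything to
a list (as collect() does) brings the original problem back.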

It is possible that there is a bug somewhere, but without seeing any
code it isn't possible to say much more. If you need more help, post
a short (working) piece of code illustrating the problem and we will
do our best.

hth

Keith

-- 

- Keith James <kdj at sanger.ac.uk> Microarray Facility, Team 65 -
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK -

