[Biojava-l] Parsing massive blast-like output (was... Problems with SAX parsing)

Matthew Pocock matthew_pocock at yahoo.co.uk
Fri Feb 14 21:06:03 EST 2003


Great. Thanks Simon. I hate tracking down these reference leaks.

Matthew

Simon Brocklehurst wrote:
> Re: Parsing massive Blast output
> 
> Regarding recent mail to the list (and mail from up to a couple of years ago):
> 
> Until now, when attempting to parse *very* large blast outputs 
> consisting of many (thousands of) separate reports concatenated 
> together, the Java Virtual Machine could sometimes run out of memory. 
> The workaround people have been using was to split their output into 
> smaller chunks that the parser could deal with.
> 
> This parsing problem was due to a small bug, which we've now (I 
> think/hope) fixed in the biojava cvs (biojava-live).
> 
> The parser should now deal successfully with arbitrarily large amounts of 
> data, without any need for chunking the output.
> 
> After applying this fix, the "BlastLike" SAX parser was tested for 
> scalability in terms of handling large numbers of concatenated blast 
> reports as follows:
> 
> Size measures of typical test input files:
> 
> o Tens of thousands of concatenated blast-like reports
> 
> o Tens of millions of individual lines of blast-like pairwise output data
> 
> o Gigabytes in size
> 
> Tests were run using JDK 1.4.1 on Solaris 9.  Input data was parsed in 
> such a way as to process all SAX events generated by the underlying SAX 
> driver.
> 
> o For each test, the outputs from the parser were XML documents, each on 
> the order of hundreds of millions of lines in size.
> 
> o Memory footprint remained both small and constant throughout the 
> parsing process, with a typical memory footprint under 14 MB in size.
> 
> 
> Simon
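For readers unfamiliar with the event-driven style described above: the "BlastLike" parser emits standard SAX events, so client code follows the usual SAX pattern of registering a ContentHandler and reacting to events as they stream past, rather than building the whole document in memory. This is what keeps the footprint small and constant. The sketch below uses the JDK's own SAX driver on a tiny XML fragment purely to illustrate the pattern; the class name `HitCounter` and the element name `Hit` are illustrative assumptions, not part of the BioJava API.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Streaming SAX handler: counts <Hit> elements as events arrive,
// never holding more than the current event in memory. The same
// pattern applies to any XMLReader implementation, including a
// SAX driver for blast-like output.
public class HitCounter extends DefaultHandler {
    private int hitCount = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("Hit".equals(qName)) {
            hitCount++;
        }
    }

    public int getHitCount() {
        return hitCount;
    }

    // Parse the given XML with the JDK SAX driver and count Hit elements.
    public static int countHits(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                                           .newSAXParser()
                                           .getXMLReader();
        HitCounter handler = new HitCounter();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.getHitCount();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<BlastOutput><Hit/><Hit/><Hit/></BlastOutput>";
        System.out.println("hits=" + countHits(xml));
    }
}
```

Because the handler keeps only a single counter between events, memory use is independent of input size, which is the property the scalability tests above exercise at the multi-gigabyte scale.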


-- 
BioJava Consulting LTD - Support and training for BioJava
http://www.biojava.co.uk
