[Biojava-l] Parsing massive blast-like output (was... Problems with SAX parsing)

Simon Brocklehurst simon.brocklehurst at cambridgeantibody.com
Thu Feb 13 15:47:13 EST 2003


Re: Parsing massive Blast output

Regarding recent mail to the list (and mail going back a couple of years):

Up 'til now, when attempting to parse *very* large blast outputs 
consisting of many (thousands of) separate reports concatenated 
together, the Java Virtual Machine could sometimes run out of memory. 
The workaround that people have been using was to split their output 
into smaller chunks that the parser can deal with.
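
For anyone still on an older build, the chunking workaround can be sketched roughly as below. This is only an illustration, not code from biojava: the class name ReportChunker, the openChunk callback, and in particular the assumption that every report begins with a line starting with "BLAST" are all hypothetical and would need checking against your own output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical helper: splits concatenated reports into smaller chunks. */
class ReportChunker {

    /**
     * Streams lines from 'in' to a series of writers, opening a new writer
     * once the current one already holds reportsPerChunk reports.
     * Returns the number of chunks written.
     */
    static int split(BufferedReader in, int reportsPerChunk,
                     IntFunction<Writer> openChunk) throws IOException {
        int chunks = 0;
        int reportsInChunk = 0;
        Writer out = null;
        String line;
        while ((line = in.readLine()) != null) {
            // Assumption: a line starting with "BLAST" marks a new report.
            if (line.startsWith("BLAST")) {
                if (out == null || reportsInChunk == reportsPerChunk) {
                    if (out != null) {
                        out.close();
                    }
                    out = openChunk.apply(chunks++);
                    reportsInChunk = 0;
                }
                reportsInChunk++;
            }
            // Any text before the first report header is skipped.
            if (out != null) {
                out.write(line);
                out.write('\n');
            }
        }
        if (out != null) {
            out.close();
        }
        return chunks;
    }

    /** Tiny in-memory demo: three made-up reports, two reports per chunk. */
    static List<String> demo() {
        String input = "BLASTP 1\nhit a\nBLASTP 2\nhit b\nBLASTP 3\nhit c\n";
        List<StringWriter> sinks = new ArrayList<>();
        try {
            split(new BufferedReader(new StringReader(input)), 2, i -> {
                StringWriter w = new StringWriter();
                sinks.add(w);
                return w;
            });
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        List<String> chunks = new ArrayList<>();
        for (StringWriter w : sinks) {
            chunks.add(w.toString());
        }
        return chunks;
    }
}
```

In real use the callback would presumably open a FileWriter per chunk rather than a StringWriter. Only one line is held in memory at a time, so the splitter itself should not hit the memory problem it is working around.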

This parsing problem was due to a small bug, which we've now (I 
think/hope) fixed in the biojava cvs (biojava-live).

The parser should now deal successfully with arbitrarily large amounts 
of data, without any need for chunking the output.

After applying this fix, the "BlastLike" SAX parser was tested for 
scalability in terms of handling large numbers of concatenated blast 
reports as follows:

Size measures of typical test input files:

o Tens of thousands of concatenated blast-like reports

o Tens of millions of individual lines of blast-like pairwise output data

o Gigabytes in size

Tests were run using JDK 1.4.1 on Solaris 9.  Input data was parsed in 
such a way as to process all SAX events generated by the underlying SAX 
driver.
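
As a rough illustration of what "processing all SAX events" can look like without holding on to any parsed data, here is a minimal handler that just counts events. It is a sketch, not the actual test harness: the class EventCountingHandler and the tiny XML document are made up, and the demo uses the JDK's own XML reader so that it runs without biojava. It assumes the BlastLike parser can be driven through the standard org.xml.sax.XMLReader interface in the same way.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: counts SAX events and retains no parsed content. */
class EventCountingHandler extends DefaultHandler {
    long startElements;
    long endElements;
    long characterChunks;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        startElements++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        endElements++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        characterChunks++;
    }

    /** Drives any XMLReader over the input and returns the filled-in counter. */
    static EventCountingHandler count(XMLReader reader, InputSource input)
            throws Exception {
        EventCountingHandler handler = new EventCountingHandler();
        reader.setContentHandler(handler);
        reader.parse(input);
        return handler;
    }

    /** Demo on a made-up document, using the JDK's XML reader as the driver. */
    static EventCountingHandler demo() {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            String xml = "<Reports><Report><Hit/><Hit/></Report></Reports>";
            return count(reader, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the handler keeps only three counters, its memory use does not grow with the size of the input, so any growth seen during a run would have to come from the driver itself.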

o For each test, the outputs from the parser were XML documents each of 
the order of hundreds of millions of lines in size.

o Memory footprint remained both small and constant throughout the 
parsing process, with a typical memory footprint under 14 MB in size.


Simon
-- 
Dr Simon M. Brocklehurst, Ph.D.
Director of Informatics & Robotics

Cambridge Antibody Technology
Milstein Building
Granta Park
Cambridge
CB1 6GH
UK

Telephone: + 44 (0) 1763 263233
Facsimile: + 44 (0) 1763 263413
Email: mailto:simon.brocklehurst at cambridgeantibody.com
http://www.cambridgeantibody.com



