[Biojava-l] Parsing massive blast-like output (was... Problems with
SAX parsing)
Simon Brocklehurst
simon.brocklehurst at cambridgeantibody.com
Thu Feb 13 15:47:13 EST 2003
Re: Parsing massive Blast output
This is in response to recent mail to the list (and to similar reports going back a couple of years).
Until now, attempts to parse *very* large blast outputs consisting of
many (thousands of) separate reports concatenated together could
sometimes cause the Java Virtual Machine to run out of memory.
The workaround people have been using was to split their output into
smaller chunks that the parser could deal with.
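For anyone still on an older build, the chunking workaround can be sketched roughly as below. This is only an illustration, not code from biojava: the class name ReportChunker and the helper isReportStart are hypothetical, and it assumes each report begins with a line starting with "BLAST", which may not hold for every blast-like program. For clarity it collects the chunks in a list; for really large inputs each chunk would be written to its own file instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ReportChunker {

    // Assumption: a new report starts at a line beginning with "BLAST".
    static boolean isReportStart(String line) {
        return line.startsWith("BLAST");
    }

    // Splits concatenated reports into chunks of at most reportsPerChunk reports.
    public static List<String> split(Reader in, int reportsPerChunk) throws IOException {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int reportsInCurrent = 0;
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (isReportStart(line)) {
                // Close the current chunk once it already holds enough reports.
                if (reportsInCurrent == reportsPerChunk) {
                    chunks.add(current.toString());
                    current.setLength(0);
                    reportsInCurrent = 0;
                }
                reportsInCurrent++;
            }
            current.append(line).append('\n');
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        String input = "BLASTP 1\na\nBLASTP 2\nb\nBLASTP 3\nc\n";
        List<String> chunks = split(new StringReader(input), 2);
        System.out.println(chunks.size() + " chunks");
    }
}
```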
This parsing problem was due to a small bug, which we've now (I
think/hope) fixed in the biojava cvs (biojava-live).
The parser should now deal successfully with arbitrarily large amounts
of data, without any need to chunk the output.
After applying this fix, the "BlastLike" SAX parser was tested for
scalability in terms of handling large numbers of concatenated blast
reports as follows:
Size measures of typical test input files:
o Tens of thousands of concatenated blast-like reports
o Tens of millions of individual lines of blast-like pairwise output data
o Gigabytes in size
Tests were run using JDK 1.4.1 on Solaris 9. Input data was parsed in
such a way as to process all SAX events generated by the underlying SAX
driver.
o For each test, the outputs from the parser were XML documents each of
the order of hundreds of millions of lines in size.
o Memory footprint remained small and constant throughout the
parsing process, typically under 14 MB.
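To illustrate why memory can stay flat when every SAX event is processed, here is a minimal sketch of a handler that keeps only running counters and never builds a document tree. It is not the test harness used above: the class name SaxEventCounter is hypothetical, the element names in the sample are made up, and it uses the JDK's standard XML SAX parser as a stand-in. With biojava, the same kind of handler would presumably be registered on the "BlastLike" SAX parser instead.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventCounter extends DefaultHandler {

    // Only two counters are retained, so memory use does not grow with input size.
    private long elements;
    private long characters;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several pieces; only the lengths are summed.
        characters += length;
    }

    public long getElements() {
        return elements;
    }

    public long getCharacters() {
        return characters;
    }

    // Streams the input through a SAX reader and returns the filled-in handler.
    public static SaxEventCounter count(Reader in) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SaxEventCounter handler = new SaxEventCounter();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(in));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document; the element names are illustrative only.
        String xml = "<Report><Hit>a</Hit><Hit>bc</Hit></Report>";
        SaxEventCounter c = count(new StringReader(xml));
        System.out.println(c.getElements() + " elements, " + c.getCharacters() + " characters");
    }
}
```

Because the handler discards each event after updating its counters, the input size affects run time but not heap usage, which is the behaviour the figures above describe.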
Simon
--
Dr Simon M. Brocklehurst, Ph.D.
Director of Informatics & Robotics
Cambridge Antibody Technology
Milstein Building
Granta Park
Cambridge
CB1 6GH
UK
Telephone: + 44 (0) 1763 263233
Facsimile + 44 (0) 1763 263413
Email: mailto:simon.brocklehurst at cambridgeantibody.com
http://www.cambridgeantibody.com