[Biojava-l] [newio] SymbolParser.java move, and StreamParser

Thomas Down td2@sanger.ac.uk
Mon, 4 Dec 2000 10:51:30 +0000


Hi...

I've had a stab at implementing the `streamable' SymbolParser
proposal discussed last week.  There are a few class moves
resulting from this:

  - SymbolParser (and all the standard implementations) are now
    in the package org.biojava.bio.seq.io

  - A new StreamParser interface has been added.

Almost all code should be source compatible (although not
necessarily binary compatible) with last week's builds.  In a
few cases, though, you might need to add a new import statement
(import org.biojava.bio.seq.io.SymbolParser;) to pick up the
SymbolParser interface in its new location.

I'm currently just using the StreamParser interface in FastaFormat.
In the future, it'll be worth using it in EmblLikeFormat, too.
For anyone writing a new SequenceFormat, please consider using
the new interface -- it's quite simple, and FastaFormat shows
how to use it efficiently.
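
To illustrate the general idea of a streaming parser, here is a
minimal, self-contained sketch.  It is NOT the BioJava API: the
ChunkParser interface, the GCCounter class and the feed() method
are all hypothetical names made up for this example.  The point
is only that the format reader hands over sequence characters a
buffer-load at a time, so nothing has to hold a whole sequence
line (or a whole sequence) as a String first.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical streaming callback (not a BioJava type). */
interface ChunkParser {
    /** Receives len sequence characters starting at data[start]. */
    void characters(char[] data, int start, int len);

    /** Called once when no more characters will arrive. */
    void close();
}

/** Counts G/C residues as chunks arrive, without storing them. */
class GCCounter implements ChunkParser {
    long gc = 0;
    long total = 0;
    boolean closed = false;

    public void characters(char[] data, int start, int len) {
        for (int i = start; i < start + len; i++) {
            char c = Character.toLowerCase(data[i]);
            if (c == 'g' || c == 'c') {
                gc++;
            }
            total++;
        }
    }

    public void close() {
        closed = true;
    }
}

class StreamingFastaSketch {
    /**
     * Reads FASTA-style text through a fixed-size buffer and passes
     * each run of sequence characters straight to the parser.
     * Header lines (starting with '>') and line breaks are skipped.
     */
    static void feed(Reader in, ChunkParser parser, int bufSize)
            throws IOException {
        char[] buf = new char[bufSize];
        boolean inHeader = false;
        boolean atLineStart = true;
        int n;
        while ((n = in.read(buf, 0, buf.length)) > 0) {
            int runStart = -1; // start of the current run, or -1 if none
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                boolean newline = (c == '\n' || c == '\r');
                if (atLineStart && c == '>') {
                    inHeader = true;
                }
                if (!newline && !inHeader) {
                    if (runStart < 0) {
                        runStart = i;
                    }
                } else if (runStart >= 0) {
                    // A run ended inside the buffer: hand it over in place.
                    parser.characters(buf, runStart, i - runStart);
                    runStart = -1;
                }
                if (newline) {
                    inHeader = false;
                }
                atLineStart = newline;
            }
            if (runStart >= 0) {
                // Flush a run that reaches the end of the buffer.
                parser.characters(buf, runStart, n - runStart);
            }
        }
        parser.close();
    }

    public static void main(String[] args) throws IOException {
        GCCounter counter = new GCCounter();
        feed(new StringReader(">demo\nGGCC\nAATT\n"), counter, 4096);
        System.out.println(counter.gc + " of " + counter.total);
    }
}
```

The sketch passes slices of the read buffer directly to the
callback, so the only per-character work is the scan itself.  A
real SequenceFormat would also have to deal with the header
contents and with illegal symbols, which are left out here.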

This does indeed seem to be giving a worthwhile speedup for
parsing large files.  As a quick benchmark, I've timed the
GCContent demo (which spends most of its time loading the
sequence) running on a 26Mb FASTA file.  Platform is an
Alpha EV6 (466MHz) with the Compaq fast VM, times in
seconds:


   BioJava 1.01                               70.8

   `Last week' (core newio changes, but       56.4
                not streaming parser)

   Streamable SymbolParser                    19.1
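
To put those timings in relative terms, the speedup factors can
be worked out as baseline time divided by new time.  This is
just arithmetic on the three numbers above (the class and method
names are made up for the example): roughly 3.7x against
BioJava 1.01 and roughly 3.0x against last week's build.

```java
class SpeedupSketch {
    /** Speedup factor: baseline seconds divided by new seconds. */
    static double speedup(double baselineSeconds, double newSeconds) {
        return baselineSeconds / newSeconds;
    }

    public static void main(String[] args) {
        // Timings in seconds, copied from the table above.
        double biojava101 = 70.8;
        double lastWeek = 56.4;
        double streaming = 19.1;

        System.out.printf("vs BioJava 1.01: %.1fx%n",
                speedup(biojava101, streaming));
        System.out.printf("vs last week:    %.1fx%n",
                speedup(lastWeek, streaming));
    }
}
```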

There are probably still some spot optimizations which might
be worth using, but I hope we've now got a good framework for
high performance I/O.

Happy hacking,

   Thomas
-- 
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett