[Biojava-l] [1.1] Sequence I/O rethink

Thomas Down td2@sanger.ac.uk
Tue, 7 Nov 2000 16:33:35 +0000


Hi...

I'd guess that the biological sequence I/O code is one of the most
widely useful parts of BioJava.  The current system has
served us quite well so far, but there are some issues that
have cropped up, and I think the time might be ripe for a
proper discussion of what we want from the package in the
future.

Issues which would be worth addressing (in no particular order):

  - It's not entirely clear how to handle alignments within
    the current I/O framework.

  - SequenceFormat classes tend to be tightly coupled to
    one particular mechanism for constructing SymbolLists.
    The mechanism used by all the current SequenceFormats
    is rather inefficient (both in time and space) when 
    handling very long pieces of sequence.

  - There is not always an easy way to control the rules
    used to convert data from a sequence file into BioJava
    Annotation bundles and Feature objects.  Some attempts
    /have/ been made in this direction (look at FastaDescriptionReader
    and FeatureBuilder).  Unfortunately, this kind of
    functionality currently has to be implemented on
    a per-format basis, and has its limitations.  For
    instance, there is no simple way to aggregate several
    feature-table entries in an EMBL file into a single
    BioJava feature.

  - The I/O framework only works on files which contain sequence
    data.  It would be nice if at least some parts of it could
    be applied to the handling of, for example, GFF files (which
    currently have an entirely separate framework).

What I'm thinking of is potentially an event-driven framework for parsing all
kinds of sequence files (by which I include sequence-and-feature
formats like EMBL, sequence-only like FASTA, feature-only like GFF,
and alignments).  We already have a simple event-driven system
in BioJava (org.biojava.bio.program.gff) and it works pretty well.
There would then be a major refactoring of SequenceFactory so that
it can act as a listener for the event stream.
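
To make the idea a bit more concrete, here is a minimal sketch of what
an event-driven parser might look like.  Every name in it (SeqListener,
parseFasta, CollectingListener, readAll) is hypothetical and made up for
illustration; none of it is existing BioJava API, and the real design
could well end up looking quite different.  The point is only that the
parser emits events and never constructs sequence objects itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class EventIoSketch {
    /** Hypothetical callback interface; NOT part of the BioJava API. */
    interface SeqListener {
        void startSequence(String description);
        void symbols(char[] buffer, int start, int length);
        void endSequence();
    }

    /** Minimal FASTA parser: it only emits events and builds no objects. */
    static void parseFasta(BufferedReader in, SeqListener listener)
            throws IOException {
        boolean open = false;
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.charAt(0) == '>') {
                if (open) {
                    listener.endSequence();   // close the previous record
                }
                listener.startSequence(line.substring(1));
                open = true;
            } else if (open) {
                char[] chars = line.toCharArray();
                listener.symbols(chars, 0, chars.length);
            }
        }
        if (open) {
            listener.endSequence();
        }
    }

    /** One possible listener: stores each record as "description:residues". */
    static class CollectingListener implements SeqListener {
        final List<String> records = new ArrayList<>();
        private String description;
        private StringBuilder residues;

        public void startSequence(String description) {
            this.description = description;
            this.residues = new StringBuilder();
        }

        public void symbols(char[] buffer, int start, int length) {
            residues.append(buffer, start, length);
        }

        public void endSequence() {
            records.add(description + ":" + residues);
        }
    }

    /** Convenience wrapper so the simple case stays a one-liner. */
    static List<String> readAll(String fastaText) {
        CollectingListener listener = new CollectingListener();
        try {
            parseFasta(new BufferedReader(new StringReader(fastaText)), listener);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return listener.records;
    }

    public static void main(String[] args) {
        System.out.println(readAll(">seq1 test\nACGT\nTTGA\n>seq2\nGGCC\n"));
    }
}
```

A refactored SequenceFactory would play the role that CollectingListener
plays here, and a GFF or alignment parser could presumably drive the
same kind of listener with a few extra event types.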

NOTE: I'm talking here primarily about changes to the guts of
the I/O framework.  I hope there won't be any significant
increase in the number of lines of code needed in the simple
case of reading a sequence from a common file format (EMBL, Genbank,
FASTA).

I know there are a number of people on the list who are interested
in file parsing, so it would be good to hear everyone's thoughts
and requirements before we finalize any API.


Just to start the ball rolling, I've written an extension to the
current I/O framework which decouples SymbolList creation
from file parsing.  I've been using this myself for a few
weeks now, and it considerably improves speed (3-4 times faster)
and reduces peak memory usage (potentially by a factor of almost two)
when reading large sequences.  This certainly doesn't address all
the issues with the I/O framework, but it shows one area where
some real improvements can be made.
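
For anyone who wants the general shape of the decoupling without
downloading anything, here is a rough sketch.  It is NOT the code in
newio.jar; the names (SymbolSink, streamResidues, ByteArraySink,
CountingSink) are hypothetical and chosen only for illustration.  The
parsing loop hands residues to a sink in blocks, and the sink alone
decides how (or whether) to store them.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SymbolSinkSketch {
    /** Hypothetical interface: receives residues a block at a time. */
    interface SymbolSink {
        void addSymbols(char[] buffer, int start, int length);
    }

    /** Parsing side: strips whitespace and forwards blocks to the sink. */
    static void streamResidues(Reader in, SymbolSink sink) throws IOException {
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            int kept = 0;
            for (int i = 0; i < n; i++) {
                if (!Character.isWhitespace(buf[i])) {
                    buf[kept++] = buf[i];   // compact in place
                }
            }
            if (kept > 0) {
                sink.addSymbols(buf, 0, kept);
            }
        }
    }

    /** Storage choice 1: one byte per residue in a growing array. */
    static class ByteArraySink implements SymbolSink {
        private byte[] data = new byte[1024];
        private int length = 0;

        public void addSymbols(char[] buffer, int start, int count) {
            if (length + count > data.length) {
                data = Arrays.copyOf(data,
                        Math.max(data.length * 2, length + count));
            }
            for (int i = 0; i < count; i++) {
                data[length++] = (byte) buffer[start + i];
            }
        }

        public String toString() {
            return new String(data, 0, length, StandardCharsets.US_ASCII);
        }
    }

    /** Storage choice 2: keep nothing, just count residues. */
    static class CountingSink implements SymbolSink {
        long count = 0;

        public void addSymbols(char[] buffer, int start, int length) {
            count += length;
        }
    }

    /** Runs the same parsing loop against whichever sink is supplied. */
    static void run(String text, SymbolSink sink) {
        try {
            streamResidues(new StringReader(text), sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArraySink stored = new ByteArraySink();
        CountingSink counted = new CountingSink();
        run("ACG\nTAC\n", stored);
        run("ACG\nTAC\n", counted);
        System.out.println(stored + " " + counted.count);
    }
}
```

The same parsing loop serves both sinks unchanged, which is the property
I care about: a format class written this way would not need to know
which SymbolList implementation ends up holding the data.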

If you want to try this out, the source code and class
files are in:

  http://www.biojava.org/proposals/newio.jar

There's also javadoc at:

  http://www.biojava.org/proposals/newio-doc/index.html

Any comments?

   Thomas.
-- 
One of the advantages of being disorderly is that one is
constantly making exciting discoveries.
                                       -- A. A. Milne