[Biojava-l] Streaming symbol-parsing

Thomas Down td2@sanger.ac.uk
Fri, 1 Dec 2000 13:50:12 +0000

I've been thinking some more about building an efficient
way to parse text into Symbols which can still cope with
cases where there isn't a single-char -> Symbol mapping.
What I've come up with is the following (simple) interface:

  public interface StreamingParser {
      /**
       * Parse one or more bytes of character data, notifying
       * the listener object of any Symbols which result.
       *
       * @param data An array of characters containing the segment
       *             to parse
       * @param start The offset of the first character in the array
       *              which we wish to parse.
       * @param len The length of the segment to parse.
       */
      public void characters(char[] data, int start, int len)
          throws IllegalSymbolException;

      /**
       * Flush the parser.  This is provided mainly for the
       * benefit of parsers which implement multi-char ->
       * Symbol mappings, allowing them to throw an exception
       * if the final symbol in the stream is incomplete.
       */
      public void close()
          throws IllegalSymbolException;
  }

These are constructed via a new factory method in the
SymbolParser interface.

  public StreamingParser parseStream(SeqIOListener siol);

The StreamingParser then parses some character data, notifying
the SeqIOListener of the results via the addSymbols method.
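
To make the multi-char case concrete, here is a rough sketch of how an
implementation might buffer a partial symbol between calls to characters()
and complain in close().  It is only an illustration under some loud
assumptions: SketchListener is a hypothetical stand-in for SeqIOListener
(symbols are plain Strings rather than Symbol objects), the
IllegalSymbolException here is a simplified local copy, and
FixedWidthStreamingParser is a made-up name, not a proposed BioJava class.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified local stand-in for the real IllegalSymbolException (assumption).
class IllegalSymbolException extends Exception {
    IllegalSymbolException(String message) { super(message); }
}

// Hypothetical stand-in for SeqIOListener: symbols are plain Strings here.
interface SketchListener {
    void addSymbols(String[] syms, int start, int length);
}

// The interface as proposed in this message.
interface StreamingParser {
    void characters(char[] data, int start, int len) throws IllegalSymbolException;
    void close() throws IllegalSymbolException;
}

// Hypothetical parser for a fixed-width multi-char mapping (e.g. width 3).
class FixedWidthStreamingParser implements StreamingParser {
    private final int width;
    private final SketchListener listener;
    // Characters received so far which do not yet form a whole symbol.
    private final StringBuilder pending = new StringBuilder();

    FixedWidthStreamingParser(int width, SketchListener listener) {
        this.width = width;
        this.listener = listener;
    }

    @Override
    public void characters(char[] data, int start, int len) {
        pending.append(data, start, len);
        int complete = pending.length() / width;
        if (complete == 0) {
            return; // nothing to report yet, keep buffering
        }
        String[] syms = new String[complete];
        for (int i = 0; i < complete; i++) {
            syms[i] = pending.substring(i * width, (i + 1) * width);
        }
        // Keep only the trailing partial symbol for the next call.
        pending.delete(0, complete * width);
        listener.addSymbols(syms, 0, complete);
    }

    @Override
    public void close() throws IllegalSymbolException {
        if (pending.length() > 0) {
            throw new IllegalSymbolException(
                "Incomplete symbol at end of stream: " + pending);
        }
    }
}

class StreamingParserDemo {
    public static void main(String[] args) throws IllegalSymbolException {
        List<String> seen = new ArrayList<>();
        StreamingParser p = new FixedWidthStreamingParser(3, (syms, start, len) -> {
            for (int i = start; i < start + len; i++) {
                seen.add(syms[i]);
            }
        });
        char[] first = "ATGG".toCharArray();
        char[] second = "CTTA".toCharArray();
        p.characters(first, 0, first.length);   // reports ATG, buffers G
        p.characters(second, 0, second.length); // reports GCT, buffers TA
        System.out.println(seen);
        try {
            p.close(); // two characters are left over, so this throws
        } catch (IllegalSymbolException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that the caller can hand over data in whatever
chunk sizes the underlying reader produces; symbol boundaries which straddle
two chunks are the parser's problem, not the caller's.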

The (slight) complication:

This introduces a dependency from the org.biojava.bio.symbol
package to the org.biojava.bio.seq.io package (which contains
SeqIOListener).  There are two (potentially) justifiable ways 
this could be sliced:

  - Leave SymbolParser where it is.  Put StreamingParser in
    symbol, too.  This is the `least change' route, but leaves
    the distinction between seq.io and symbol rather blurred.

  - Move SymbolParser to seq.io, and put StreamingParser there.
    Thus, the only inter-package dependency is that Alphabets
    return SymbolParsers.

The latter seems to give a clearer view of the roles of the
two packages.  It also leaves us with the option to add a new
method for creating parsers for alphabets (possibly via some
kind of ParserManager?) and (eventually) remove the getParser
method from Alphabet.

I'm planning on taking the second route at the moment -- if
anyone has strong objections to moving SymbolParser, please
let me know ASAP.


``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
           -- Terry Pratchett