[Biojava-l] TokenParser.TPStreamParser

David Huen <smh1008@cus.cam.ac.uk>
Sun, 10 Jun 2001 16:07:54 +0100 (BST)


On Sun, 10 Jun 2001, Thomas Down wrote:

> On Sun, Jun 10, 2001 at 02:40:33PM +0100, David Huen wrote:
> > The above appears to fubar when fed sequences with whitespace.
> > Unfortunately, these are common with XML derived sequences.  Would anyone
> > object to a modification such that whitespace characters are ignored
> > rather than worthy of an exception?
> 
> As I recall, I wrote TPStreamParser to be compatible with
> the existing TokenParser.  I'd actually be kind-of reluctant
> to add whitespace ignoring at this level, because it effectively
> means that you can /never/ use whitespace characters as tokens
> (which is probably a very bad idea, but it still worries me a
> little to completely rule it out).
> 
OK.

> How about the following alternative strategy:
> 
> I presume you're talking about driving a StreamParser from a
> SAX or StAX event source.  The S[t]AX listener will receive
> arrays of characters.  You can then identify blocks of
> non-whitespace within this array, and pass them to the
> StreamParser.characters(char[], int, int) method.  No
> need to copy the characters into another array or anything,
> so it should be quite efficient.

OK, I'll do that.  That's no problem.
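
A rough sketch of the scanning loop suggested above, for the record.  The
CharSink interface is a hypothetical stand-in for the
StreamParser.characters(char[], int, int) callback, and I am assuming its
third argument is a length rather than an end index.  The class and method
names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceSplitter {
    /** Hypothetical stand-in for StreamParser.characters(char[], int, int). */
    interface CharSink {
        void characters(char[] data, int start, int length);
    }

    /**
     * Forwards each maximal run of non-whitespace characters found in
     * ch[start] .. ch[start + length - 1] to the sink.  The original array
     * is passed straight through, so nothing is copied.
     */
    static void forwardNonWhitespace(char[] ch, int start, int length, CharSink sink) {
        int end = start + length;          // exclusive upper bound
        int i = start;
        while (i < end) {
            // skip any whitespace before the next run
            while (i < end && Character.isWhitespace(ch[i])) {
                i++;
            }
            int runStart = i;
            // advance to the end of the non-whitespace run
            while (i < end && !Character.isWhitespace(ch[i])) {
                i++;
            }
            if (i > runStart) {
                sink.characters(ch, runStart, i - runStart);
            }
        }
    }

    /** Helper for trying it out: collects the forwarded runs as strings. */
    static List<String> runs(String text) {
        final List<String> out = new ArrayList<String>();
        char[] ch = text.toCharArray();
        forwardNonWhitespace(ch, 0, ch.length, new CharSink() {
            public void characters(char[] data, int start, int length) {
                out.add(new String(data, start, length));
            }
        });
        return out;
    }
}
```

Calling WhitespaceSplitter.runs("ac gt\n\ttt  ") should hand back the three
runs "ac", "gt" and "tt".  A run that straddles two characters() calls simply
arrives as two consecutive blocks, which ought to be harmless if the parser
consumes tokens as a stream.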

I have encountered another problem with StaxContentHandlerBase.

There is a method defined in the API:-
 public void characters(char[] ch, int start, int end)

When given elements with lots of data, it is usually called with start = 0
and end = 16384.  Any attempt to access ch[16384] results in an
immediate exception, which suggests to me that end is really a length rather
than the index of the highest element within ch[].  Is that correct?
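
For comparison, plain SAX documents the third argument of
ContentHandler.characters(char[], int, int) as a length, so the valid indices
run from start to start + length - 1.  Below is a small sketch of reading a
block on that basis.  Whether StaxContentHandlerBase follows the same
convention is my assumption, and the class name CharactersLengthDemo is made
up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CharactersLengthDemo {
    /** Returns the concatenated character data of an XML document. */
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The third argument is a count, not an index:
                // the last valid element is ch[start + length - 1].
                for (int i = start; i < start + length; i++) {
                    text.append(ch[i]);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // wrapped so callers need not declare checked exceptions
            throw new RuntimeException(e);
        }
        return text.toString();
    }
}
```

With start = 0 and a third argument of 16384, this loop stops at ch[16383],
which would be consistent with the exception seen at ch[16384].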

Thanks,
David Huen, Dept. of Genetics, Univ. of Cambridge