[Biojava-l] Re: More Questions on behavior of SymbolList

Thomas Down td2@sanger.ac.uk
Fri, 7 Sep 2001 13:36:50 +0100


On Thu, Sep 06, 2001 at 12:52:39PM -0700, David Waring wrote:
> 
> > Here's an alternative:  why not use the StreamParser interface
> > here?  The reason that was added was to allow optimized code-paths
> > for simple cases (like single-character tokens) without having
> > to worry about details of specific SymbolParser implementations.
> > So long as you pump reasonably large chunks of characters through
> > the StreamParser, you should get performance which is very close
> > to direct calls to parseCharToken().
> >
> > Does that make sense?
> 
> I am not sure what advantage this gives. If I understand this, I would use
> the parseStream method of whichever parser I was given (presumably
> TokenParser) I would have to implement a SeqIOListener within my
> SimpleSymbolList, right? Otherwise how does the parser give me back my
> symbols?

Yes, you do need to implement SeqIOListener, but you can stub
almost all the methods (or just inherit from SeqIOAdaptor).  The
only method that actually matters is addSymbols.

> Now I have a String and I want to parse it into Symbols and put each one
> into my array. I convert my String to a char[], instantiate my listener, and
> pass it off to the streamParser.  This would then parse each char, put it
> into a Symbol[] then add it to my Symbol[].

Pretty much.  Better still, allocate a single char[] array
yourself, then use String.getChars to copy data into this.
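
A minimal sketch of that buffer-reuse pattern, in plain Java. The
CharChunkSink interface and the ChunkPump class are hypothetical
stand-ins for whatever receives the characters; they are not BioJava
types. Only String.getChars is a real library call here.

```java
/** Hypothetical stand-in for a stream parser: it receives characters in chunks. */
interface CharChunkSink {
    void characters(char[] data, int start, int len);
}

final class ChunkPump {
    /**
     * Copies text into one reusable buffer and hands each filled chunk
     * to the sink. Returns the number of chunks delivered.
     */
    static int pump(String text, CharChunkSink sink, int bufferSize) {
        char[] buffer = new char[bufferSize]; // allocated once, reused for every chunk
        int chunks = 0;
        for (int pos = 0; pos < text.length(); pos += bufferSize) {
            int len = Math.min(bufferSize, text.length() - pos);
            // String.getChars(srcBegin, srcEnd, dst, dstBegin)
            text.getChars(pos, pos + len, buffer, 0);
            sink.characters(buffer, 0, len);
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        StringBuilder seen = new StringBuilder();
        // A deliberately tiny buffer, just to show several chunks going through.
        int chunks = pump("gattacagattaca",
                (data, start, len) -> seen.append(data, start, len), 4);
        System.out.println(chunks + " chunks, text = " + seen);
    }
}
```

In real use the buffer would be much larger than 4, so that the
per-chunk call overhead is spread over many characters.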

> Is this right? All of this so that someone could say:
> 	new SimpleSymbolList(nameParser,"SerHisIleThr");
> 
> Implementing SeqIOListener seems excessive. Am I missing something here?
> I can see the flexibility here, but it sure gets in the way of performance
> sometimes.

The whole reason StreamParser was implemented in the first place
was performance.  It sounds a little more complex than using
an explicit call to parseCharToken, but in reality, the only
difference is an extra array-copy (which, as you've already
said, is really rather cheap, so long as you're not constantly
allocating and freeing arrays).


But if you're happier using parseCharToken, I don't have any
objection in principle to making it public, so long as it's
only available on TokenParser.

> I still suggest making parseCharToken public, and let people know that
> SimpleSymbolList will only handle a String constructor with one char/token
> Strings (after all it is not in the SymbolList interface). Another more
> flexible option I see, changing the SymbolParser interface to
> parseStream(SymbolListIOListener) and having a SymbolListIOListener that
> requires one method addSymbols(Symbol[] s). SeqIOListener could extend this.
> Then I would not have to implement a dozen empty methods.

SymbolListIOListener would be okay, too, if that makes you
happier.  I've always just inherited the stub-methods off
SeqIOAdaptor, though.  None of them actually get called
by the StreamParsers -- it's just interface re-use ;-).
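
To make the two options concrete, here is a rough sketch of how a
narrow listener, a wider listener that extends it, and an adaptor
with stub methods could fit together. All the names below
(ChunkListener, WideListener, WideListenerAdaptor, ListenerDemo) and
the use of String[] for symbols are assumptions made for illustration;
none of this is the actual BioJava API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical narrow callback: the only method a parser would need to call. */
interface ChunkListener {
    void addSymbols(String[] syms, int start, int length);
}

/** Hypothetical wide listener that reuses the narrow one by extension. */
interface WideListener extends ChunkListener {
    void startSequence();
    void setName(String name);
    void endSequence();
}

/** Adaptor with no-op stubs, so subclasses override only what they need. */
class WideListenerAdaptor implements WideListener {
    public void startSequence() {}
    public void setName(String name) {}
    public void endSequence() {}
    public void addSymbols(String[] syms, int start, int length) {}
}

final class ListenerDemo {
    /** Stand-in for a parser: it only ever touches the narrow interface. */
    static void emit(ChunkListener listener) {
        String[] syms = {"Ser", "His", "Ile", "Thr"};
        listener.addSymbols(syms, 0, syms.length);
    }

    public static void main(String[] args) {
        // Option 1: implement the single-method interface directly.
        List<String> direct = new ArrayList<>();
        emit((syms, start, length) -> {
            for (int i = start; i < start + length; i++) direct.add(syms[i]);
        });

        // Option 2: inherit the stubs and override only addSymbols.
        List<String> viaAdaptor = new ArrayList<>();
        emit(new WideListenerAdaptor() {
            @Override
            public void addSymbols(String[] syms, int start, int length) {
                for (int i = start; i < start + length; i++) viaAdaptor.add(syms[i]);
            }
        });

        System.out.println(direct + " " + viaAdaptor);
    }
}
```

Either way the caller writes one method; the difference is only
whether the empty stubs live in an adaptor class or are absent from
the interface altogether.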

> > > certainly makes reading symbolAt() slower (80-400% slower than
> > > SimpleSymbolList).
> >
> > 400%?  Ouch!  I didn't realize.  What virtual machine are you using?
> >
> 
> The 400% number comes from my tests on Win2000,jdk1.3. My tests on unix, an
> Alpha, again with jdk1.3 the number is generally around 80% for reading the
> last base in a 100,000 base sequence. The Win box is much faster overall
> about 10 times faster, for example at reading from the SimpleSymbolList. But
> the difference between the two tasks is greater. I suspect it may have to do
> with the modulo since reading from a two dimensional array should not be
> more than 2 times slower, but this is just a guess.

You might have a point about the modulo, actually.  When
I first implemented it, it had the constraint that blockSize
was always 2^n, so that the division and modulo could both
be performed as (fast) bitwise operators, but for some
reason it got changed back to normal '/' and '%'.
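
For reference, a small sketch of the 2^n trick on a blocked array.
BlockedIntList is a hypothetical class written for this example (it
stores ints rather than Symbols) and is not the BioJava
implementation; it assumes non-negative indices, for which the shift
and mask give the same results as '/' and '%'.

```java
/** Hypothetical blocked int store with a power-of-two block size. */
final class BlockedIntList {
    private final int shift;      // log2(blockSize)
    private final int mask;       // blockSize - 1, i.e. the low 'shift' bits set
    private final int[][] blocks;

    BlockedIntList(int length, int log2BlockSize) {
        this.shift = log2BlockSize;
        int blockSize = 1 << log2BlockSize;
        this.mask = blockSize - 1;
        int nBlocks = (length + blockSize - 1) >>> shift; // round up
        this.blocks = new int[nBlocks][blockSize];
    }

    void set(int index, int value) {
        // index >>> shift replaces index / blockSize,
        // index & mask   replaces index % blockSize.
        blocks[index >>> shift][index & mask] = value;
    }

    int get(int index) {
        return blocks[index >>> shift][index & mask];
    }

    /** Checks that shift/mask agree with '/' and '%' for a non-negative index. */
    static boolean agreesWithDivMod(int index, int log2BlockSize) {
        int blockSize = 1 << log2BlockSize;
        return (index >>> log2BlockSize) == index / blockSize
            && (index & (blockSize - 1)) == index % blockSize;
    }

    public static void main(String[] args) {
        BlockedIntList list = new BlockedIntList(100000, 10); // blocks of 1024
        list.set(99999, 7);
        System.out.println(list.get(99999) + " " + agreesWithDivMod(99999, 10));
    }
}
```

Whether this is measurably faster depends on the VM and JIT, so it
would need timing on the platforms in question.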

I'll have a look at this and see if there's a speedup to be
had here.  The 400% is really surprising...

Thanks,

    Thomas.

> PC
> java version "1.3.0"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
> Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)
> 
> Unix
> java version "1.3.0"
> Java(TM) 2 Runtime Environment, Standard Edition
> Classic VM (build 1.3.0-1, native threads, jit)