[Biojava-l] Re: More Questions on behavior of SymbolList

Thomas Down td2@sanger.ac.uk
Thu, 6 Sep 2001 10:52:12 +0100


On Wed, Sep 05, 2001 at 07:01:34PM -0700, David Waring wrote:
> I am adding editability to SimpleSymbolList.
> 
> How should subList be implemented? It becomes important when a SymbolList
> becomes editable. The subList method of AbstractSymbolList gives a veiw onto
> the original so if the original is modified the subList is too. This might
> be what is expected, but I don't think so. Especially because any edit that
> changes the register shifts the sequence within all subLists. I figure that
> the new version of SimpleSymbolList will return a new SimpleSymbolList
> unless others request differently.

That's a good point.  I'd always assumed that subList would
normally just reflect edits to the underlying sequence, but
maybe that isn't the best way to go.

Anyone have strong feelings on this?  Matthew?

But if you do decide that subLists /shouldn't/ reflect changes,
I'd be slightly concerned about going down the always-copy
route, since subList is a very common operation in some
cases.  A better approach might be `copy-on-write'.  Have an
implementation which starts off as a view on the parent sequence,
but installs a ChangeListener, and takes a full copy if
it receives a preChange notification?

> I want to implement a constructor SimpleSymbolList(Alphabet alpha, String
> seqString) and this would be much faster using TokenParser.parseCharToken()
> rather than parseToken (about 5 times faster). !!! Can we make this public?
> !!!!!

Here's an alternative:  why not use the StreamParser interface
here?  The reason that was added was to allow optimized code-paths
for simple cases (like single-character tokens) without having
to worry about details of specific SymbolParser implementations.
So long as you pump reasonably large chunks of characters through
the StreamParser, you should get performance which is very close
to direct calls to parseCharToken().

Does that make sense?

> Thomas suggested SimpleSymbolList(SymbolParser parser, String seqString)
> instead, but I think that Alphabet is better for 2 reasons; Other
> constructors of SimpleSymbolList use Alphabet, and programmers trying to
> call the constructor are much more likely to have easy access to the
> alphabet than try to rememeber how to get the parser. The only reason to use
> parser is if there were multiple implementations of TokenParser that a
> programmer would choose from. This seems unlikely to me, am I wrong?

We already have more than one SymbolParser for some alphabets:
DNA, RNA, and Protein can all be parsed either from single-character
or multi-character (name) representations.

There has also been some demand for more complex cases -- for
instance, there's on-going disagreement about what characters
represent gaps, especially in protein.  Having multiple parsers
gets us out of that kind of argument.  Unfortunately, the
current code for initializing built-in alphabets is a bit limited
in it's support for multiple parsers.  I've got a branch which
addresses this issue, but it needs a little bit more tidying
(and some review by other developers) before I feel confident
landing it.

> I am still trying to figure out this whole SymbolList deal.  I see that the
> TokenParser constructs a ChunkedSymbolList or SubArraySymbolList for any
> String >= 100 bases. What types of applications are ChunkedSymbolList, and
> SubArraySymbolList designed for? ChunkedSymbolList seems to be designed for
> speed when reading from a stream with a SymbolReader. But the only advantage
> I see is that it it saves a few array copies if you don't know what the size
> the sequence is beforehand. What kind of speed optimization are we talking
> about? By my calculations an arraycopy of 10,000 elements takes about 200
> nanoseconds and an arraycopy of 100,000 takes about 2 milliseconds. It
> certainly makes reading symbolAt() slower (80-400% slower than
> SimpleSymbolList).

400%?  Ouch!  I didn't realize.  What virtual machine are you using?

You're right in saying that array-copies are quite cheap.  On the
other hand, allocated and garbage-collecting big arrays can be rather
expensive.  Especially once you get to chromosome-sized sequences.
Of course, if you grow the array in large enough increments,
this goes away.  But Sun's ArrayList grows (IIRC) by only 10%
each time it overflows.  With that kind of increment, costs
can start to mount up quite quickly.

But actually, the main thing I was optimizing was memory.  In
particular, I was scared about the prospect of needing to
effectively hold two full copies of a really large sequence
in memory at once (and yes, I was really having trouble with
this at the time I wrote the code).

ChunkedSymbolLists are also quite nice if you read in a big
sequence, then take just a few small portions and throw away
the rest.

But maybe it is time to re-evaluate some of the issues here.
I'm certainly concerned by that 400% figure...

> I see two other things that could be faster with ChunkedSymbolList than
> SimpleSymbolList: editing large sequences and getting subLists that are true
> copies rather than views, but ChunkedSymbolList does neither of these.

ChunkedSymbolList in it's current form isn't really good for
editing, since performance will suffer (at least in the case
of really big sequences) if the chunks aren't all equal size.

Of course, an implementation which allowed variable-sized
chunks would be very good for the case of wanting to EDIT
a large sequence...

> What is this reason that editing has not been supported? Is it a general
> lack of interest, performance concerns, or just that nobody has gotten
> around to it yet?

As I said before, I think the main answer is apathy -- it
doesn't seem to be something that's been asked for up 'til
now.  But we're very glad to have you on the case!

And there are some niggling issues, like the contract for
subList, and also what happens to any Features on regions
of Sequences which get edited.


    Thomas.