[Biojava-l] More Questions on behavior of SymbolList

Wed, 5 Sep 2001 19:01:34 -0700

I am adding editability to SimpleSymbolList.

How should subList be implemented? This becomes important once a SymbolList
is editable. The subList method of AbstractSymbolList gives a view onto the
original, so if the original is modified the subList changes too. This might
be what is expected, but I don't think so, especially because any edit that
changes the register shifts the sequence within all subLists. Unless others
request otherwise, I plan to have the new version of SimpleSymbolList return
a new, independent SimpleSymbolList from subList.
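
To make the view-versus-copy difference concrete, here is a minimal sketch.
It does not use the BioJava classes: View, Copy and SubListDemo are
hypothetical stand-ins, a StringBuilder plays the role of the editable
sequence, and coordinates are 1-based and inclusive as in SymbolList.

```java
/** Hypothetical stand-ins for an editable sequence; not the BioJava API. */
class SubListDemo {

    /** A subList that is a window onto the original (view semantics). */
    static final class View {
        private final StringBuilder backing;
        private final int start; // 1-based, inclusive
        private final int end;   // 1-based, inclusive

        View(StringBuilder backing, int start, int end) {
            this.backing = backing;
            this.start = start;
            this.end = end;
        }

        String seqString() {
            // Reads through to the backing sequence on every call.
            return backing.substring(start - 1, end);
        }
    }

    /** A subList that takes its own snapshot of the symbols (copy semantics). */
    static final class Copy {
        private final String symbols;

        Copy(StringBuilder source, int start, int end) {
            this.symbols = source.substring(start - 1, end);
        }

        String seqString() {
            return symbols;
        }
    }

    public static void main(String[] args) {
        StringBuilder original = new StringBuilder("AACCGGTT");
        View view = new View(original, 3, 6);
        Copy copy = new Copy(original, 3, 6);

        // A one-base insertion upstream changes the register of the original.
        original.insert(0, "T");

        System.out.println(view.seqString()); // ACCG: the window now covers different bases
        System.out.println(copy.seqString()); // CCGG: the snapshot is unaffected
    }
}
```

The view keeps its coordinates but silently ends up showing different bases
after the insertion, which is the behaviour I would find surprising.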

I want to implement a constructor SimpleSymbolList(Alphabet alpha, String
seqString), and this would be much faster using TokenParser.parseCharToken()
rather than parseToken (about 5 times faster). Can we make parseCharToken()
public?
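
As an illustration of why a char-based path can be cheaper, here is a rough
sketch. It is not the TokenParser code: CharParseDemo, parseByToken and
parseByChar are hypothetical names, and int codes stand in for Symbol
objects. The assumption is that the String-based path has to create a
one-character String per base and hash it, while the char-based path can
index straight into a table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical comparison of per-token and per-char lookup; not the BioJava API. */
class CharParseDemo {
    // Integer codes stand in for Symbol objects (an assumption for this sketch).
    private static final Map<String, Integer> BY_TOKEN = new HashMap<>();
    private static final int[] BY_CHAR = new int[128];

    static {
        Arrays.fill(BY_CHAR, -1);
        String letters = "acgt";
        for (int i = 0; i < letters.length(); i++) {
            BY_TOKEN.put(String.valueOf(letters.charAt(i)), i);
            BY_CHAR[letters.charAt(i)] = i;
        }
    }

    /** String-keyed lookup: allocates a one-character String for every base. */
    static int[] parseByToken(String seq) {
        int[] out = new int[seq.length()];
        for (int i = 0; i < seq.length(); i++) {
            Integer code = BY_TOKEN.get(seq.substring(i, i + 1));
            if (code == null) {
                throw new IllegalArgumentException("Unknown token at " + (i + 1));
            }
            out[i] = code;
        }
        return out;
    }

    /** Char-keyed lookup: a bounds check and an array read, no allocation per base. */
    static int[] parseByChar(String seq) {
        int[] out = new int[seq.length()];
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            int code = c < BY_CHAR.length ? BY_CHAR[c] : -1;
            if (code < 0) {
                throw new IllegalArgumentException("Unknown char at " + (i + 1));
            }
            out[i] = code;
        }
        return out;
    }

    public static void main(String[] args) {
        String seq = "gattaca";
        // Both paths produce the same codes; only the work per base differs.
        System.out.println(Arrays.equals(parseByToken(seq), parseByChar(seq)));
    }
}
```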

Thomas suggested SimpleSymbolList(SymbolParser parser, String seqString)
instead, but I think that Alphabet is better for two reasons: other
constructors of SimpleSymbolList use Alphabet, and programmers calling the
constructor are much more likely to have easy access to the alphabet than to
remember how to get the parser. The only reason to take a parser would be if
there were multiple implementations of TokenParser for a programmer to
choose from. This seems unlikely to me. Am I wrong?

I am still trying to figure out this whole SymbolList deal.  I see that the
TokenParser constructs a ChunkedSymbolList or SubArraySymbolList for any
String >= 100 bases. What types of applications are ChunkedSymbolList and
SubArraySymbolList designed for? ChunkedSymbolList seems to be designed for
speed when reading from a stream with a SymbolReader. But the only advantage
I see is that it saves a few array copies if you don't know the size of the
sequence beforehand. What kind of speed optimization are we talking
about? By my calculations an arraycopy of 10,000 elements takes about 200
nanoseconds and an arraycopy of 100,000 takes about 2 milliseconds.
ChunkedSymbolList certainly makes symbolAt() slower (80-400% slower than
SimpleSymbolList).
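
The trade-off I am describing can be sketched as follows. This is not the
ChunkedSymbolList source: ChunkedChars and GrowingChars are hypothetical
classes, chars stand in for Symbols, and the doubling growth policy of the
contiguous version is an assumption. The chunked version never copies while
appending but pays a divide, a modulo and an extra indirection on every
symbolAt(); the contiguous version copies occasionally and reads with a
single array access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of chunked versus contiguous storage; not the BioJava API. */
class ChunkedDemo {

    /** Appends into fixed-size chunks, so growth never copies existing data. */
    static final class ChunkedChars {
        private final int chunkSize;
        private final List<char[]> chunks = new ArrayList<>();
        private int length;

        ChunkedChars(int chunkSize) {
            this.chunkSize = chunkSize;
        }

        void add(char c) {
            int offset = length % chunkSize;
            if (offset == 0) {
                chunks.add(new char[chunkSize]); // start a new chunk, no copy
            }
            chunks.get(length / chunkSize)[offset] = c;
            length++;
        }

        /** 1-based access: locate the chunk first, then the slot inside it. */
        char symbolAt(int index) {
            if (index < 1 || index > length) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            int i = index - 1;
            return chunks.get(i / chunkSize)[i % chunkSize];
        }

        int length() {
            return length;
        }
    }

    /** Appends into one array, doubling (and copying) whenever it fills up. */
    static final class GrowingChars {
        private char[] data = new char[4];
        private int length;
        private int copies;

        void add(char c) {
            if (length == data.length) {
                data = Arrays.copyOf(data, data.length * 2); // the array copy
                copies++;
            }
            data[length++] = c;
        }

        /** 1-based access: a single array read. */
        char symbolAt(int index) {
            if (index < 1 || index > length) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            return data[index - 1];
        }

        int copies() {
            return copies;
        }
    }

    public static void main(String[] args) {
        ChunkedChars chunked = new ChunkedChars(4);
        GrowingChars growing = new GrowingChars();
        for (char c : "ACGTACGTAC".toCharArray()) {
            chunked.add(c);
            growing.add(c);
        }
        System.out.println(chunked.symbolAt(5) == growing.symbolAt(5)); // same content
        System.out.println(growing.copies()); // number of array copies while appending
    }
}
```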

I see two other things that could be faster with ChunkedSymbolList than
SimpleSymbolList: editing large sequences and getting subLists that are true
copies rather than views, but ChunkedSymbolList does neither of these.

What is the reason that editing has not been supported? Is it a general
lack of interest, performance concerns, or just that nobody has gotten
around to it yet?

David