[Biojava-l] More Questions on behavior of SymbolList

Emig, Robin Robin.Emig@maxygen.com
Thu, 6 Sep 2001 07:47:06 -0700


	Please use SimpleSymbolList(SymbolParser parser, String seqString).
The reason is that sometimes we have sequences (mostly from patents)
that use three-letter AA codes and sometimes we have single-letter AA codes.
If you were to use just the alphabet, how would SimpleSymbolList know which
parser to use based on the string?
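
The ambiguity can be shown with a small, self-contained sketch. Everything
here is hypothetical and for illustration only: ResidueParser,
ParserChoiceSketch and the tiny code tables are made-up names, not BioJava's
SymbolParser or any BioJava class. The string "ALA" is valid under both
encodings, so only the caller can say which parser applies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a parser; NOT BioJava's SymbolParser interface.
interface ResidueParser {
    List<String> parse(String seq);
}

final class ParserChoiceSketch {
    // Deliberately tiny code tables, for illustration only.
    private static final Map<String, String> ONE_LETTER = new HashMap<>();
    private static final Map<String, String> THREE_LETTER = new HashMap<>();
    static {
        ONE_LETTER.put("A", "alanine");
        ONE_LETTER.put("L", "leucine");
        THREE_LETTER.put("ALA", "alanine");
        THREE_LETTER.put("LEU", "leucine");
    }

    // Splits the input into fixed-width tokens and looks each one up.
    private static List<String> parseFixedWidth(String seq, int width,
                                                Map<String, String> table) {
        if (seq.length() % width != 0) {
            throw new IllegalArgumentException("Length is not a multiple of " + width);
        }
        String upper = seq.toUpperCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < upper.length(); i += width) {
            String token = upper.substring(i, i + width);
            String residue = table.get(token);
            if (residue == null) {
                throw new IllegalArgumentException("Unknown code: " + token);
            }
            out.add(residue);
        }
        return out;
    }

    static final ResidueParser SINGLE = seq -> parseFixedWidth(seq, 1, ONE_LETTER);
    static final ResidueParser TRIPLE = seq -> parseFixedWidth(seq, 3, THREE_LETTER);

    // Mirrors the shape of the proposed constructor: the caller supplies the parser.
    static List<String> toResidues(ResidueParser parser, String seq) {
        return parser.parse(seq);
    }

    public static void main(String[] args) {
        System.out.println(toResidues(SINGLE, "ALA")); // [alanine, leucine, alanine]
        System.out.println(toResidues(TRIPLE, "ALA")); // [alanine]
    }
}
```

The same three characters yield three residues or one, depending on the
parser, which an alphabet alone would not tell you.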
	I think one issue with the "behind the scenes" symbol list is that
some people are dealing with genome-sized sequences, or with genome-scale
quantities of sequences. Whatever the implementation, it should be fast for
those uses.
	I like the idea of getting a copy (i.e., not a reference to the
original) of the data when getting a subList. If others disagree that this
should be the default, then let's create two functions, one that returns a
view and another that returns a copy.
	>What is the reason that editing has not been supported?
	Most of the work we have to do required us to write very custom
editing routines or create new sequences anyway. However, had that
functionality been there, we could have used it.

-----Original Message-----
From: David Waring [mailto:dwaring@u.washington.edu]
Sent: Wednesday, September 05, 2001 7:02 PM
To: Thomas Down
Cc: biojava-l@biojava.org
Subject: [Biojava-l] More Questions on behavior of SymbolList


I am adding editability to SimpleSymbolList.

How should subList be implemented? This becomes important once a SymbolList
is editable. The subList method of AbstractSymbolList gives a view onto
the original, so if the original is modified the subList changes too. This
might be what is expected, but I don't think so, especially because any edit
that changes the register shifts the sequence within all subLists. I figure
that the new version of SimpleSymbolList will return a new SimpleSymbolList
unless others request otherwise.
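
To make the register shift concrete, here is a minimal sketch. It assumes a
made-up EditableSeq class with 1-based, inclusive coordinates; the names
EditableSeq, View, subListView and subListCopy are hypothetical and are not
the BioJava API. The view re-reads the parent on every access, while the copy
is a snapshot.

```java
// Hypothetical sketch; none of these types are BioJava classes.
final class SubListSketch {

    // A minimal editable sequence using 1-based, inclusive coordinates.
    static final class EditableSeq {
        private final StringBuilder symbols;

        EditableSeq(String s) {
            this.symbols = new StringBuilder(s);
        }

        // Inserts s so that its first symbol lands at 1-based position pos.
        void insert(int pos, String s) {
            symbols.insert(pos - 1, s);
        }

        // Reads positions start..end (inclusive) from the current contents.
        String text(int start, int end) {
            return symbols.substring(start - 1, end);
        }

        // A view keeps only coordinates and re-reads the parent on every access.
        View subListView(int start, int end) {
            return new View(this, start, end);
        }

        // A copy is a snapshot taken now; later edits do not affect it.
        String subListCopy(int start, int end) {
            return text(start, end);
        }
    }

    static final class View {
        private final EditableSeq parent;
        private final int start;
        private final int end;

        View(EditableSeq parent, int start, int end) {
            this.parent = parent;
            this.start = start;
            this.end = end;
        }

        String read() {
            return parent.text(start, end);
        }
    }

    public static void main(String[] args) {
        EditableSeq seq = new EditableSeq("ATGCCGTA");
        View view = seq.subListView(4, 6);   // currently "CCG"
        String copy = seq.subListCopy(4, 6); // "CCG"

        seq.insert(1, "GG");                 // sequence is now "GGATGCCGTA"

        System.out.println(view.read());     // TGC  (register shifted by two)
        System.out.println(copy);            // CCG  (unchanged snapshot)
    }
}
```

After the two-symbol insertion the view no longer covers the symbols it was
created for, which is the surprise described above; the copy is unaffected.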

I want to implement a constructor SimpleSymbolList(Alphabet alpha, String
seqString), and this would be much faster using TokenParser.parseCharToken()
rather than parseToken() (about 5 times faster). Can we make
parseCharToken() public?

Thomas suggested SimpleSymbolList(SymbolParser parser, String seqString)
instead, but I think that Alphabet is better for two reasons: other
constructors of SimpleSymbolList use Alphabet, and programmers trying to
call the constructor are much more likely to have easy access to the
alphabet than to remember how to get the parser. The only reason to use a
parser is if there were multiple implementations of TokenParser that a
programmer would choose from. This seems unlikely to me; am I wrong?

I am still trying to figure out this whole SymbolList deal. I see that the
TokenParser constructs a ChunkedSymbolList or SubArraySymbolList for any
String of 100 or more bases. What types of applications are
ChunkedSymbolList and SubArraySymbolList designed for? ChunkedSymbolList
seems to be designed for speed when reading from a stream with a
SymbolReader. But the only advantage I see is that it saves a few array
copies if you don't know the size of the sequence beforehand. What kind of
speed optimization are we talking about? By my calculations an arraycopy of
10,000 elements takes about 200 nanoseconds and an arraycopy of 100,000
takes about 2 milliseconds. It certainly makes symbolAt() reads slower
(80-400% slower than SimpleSymbolList).
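
Copy costs like the ones quoted above depend heavily on the JVM and hardware,
so they are worth re-measuring locally. The following is only a crude timing
sketch using System.arraycopy and System.nanoTime; the class and method names
are made up, and a proper benchmark harness would give more trustworthy
numbers.

```java
import java.util.Arrays;

// Crude timing sketch; results vary by JVM, hardware and warm-up.
final class ArrayCopyTimingSketch {

    // Returns the average nanoseconds per System.arraycopy of n references.
    static double averageCopyNanos(int n, int rounds) {
        Object[] src = new Object[n];
        Arrays.fill(src, "x");
        Object[] dst = new Object[n];

        // Warm-up so the JIT has a chance to compile the loop first.
        for (int i = 0; i < rounds; i++) {
            System.arraycopy(src, 0, dst, 0, n);
        }

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            System.arraycopy(src, 0, dst, 0, n);
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / rounds;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10_000, 100_000}) {
            System.out.printf("n=%d: %.0f ns per copy%n", n, averageCopyNanos(n, 1_000));
        }
    }
}
```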


I see two other things that could be faster with ChunkedSymbolList than
SimpleSymbolList: editing large sequences and getting subLists that are true
copies rather than views, but ChunkedSymbolList does neither of these.
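
The trade-off raised in the preceding paragraphs can be sketched as follows.
This is an assumption about how a chunked list might work, not the actual
ChunkedSymbolList implementation; ChunkedCharsSketch and its methods are
hypothetical names. Appending never copies existing symbols, but every
positional read pays for a division, a modulo and an extra indirection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical chunked buffer; NOT BioJava's ChunkedSymbolList.
final class ChunkedCharsSketch {
    private final int chunkSize;
    private final List<char[]> chunks = new ArrayList<>();
    private int length = 0;

    ChunkedCharsSketch(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    // Appending never copies existing symbols; a full chunk just
    // triggers the allocation of a fresh one.
    void append(char c) {
        int offset = length % chunkSize;
        if (offset == 0) {
            chunks.add(new char[chunkSize]);
        }
        chunks.get(chunks.size() - 1)[offset] = c;
        length++;
    }

    // 1-based lookup: a division, a modulo and an extra indirection
    // on top of the plain array access a flat list would need.
    char symbolAt(int pos) {
        if (pos < 1 || pos > length) {
            throw new IndexOutOfBoundsException("position " + pos);
        }
        int i = pos - 1;
        return chunks.get(i / chunkSize)[i % chunkSize];
    }

    int length() {
        return length;
    }

    // Flattens into a single array once the final size is known.
    char[] toArray() {
        char[] out = new char[length];
        for (int c = 0; c < chunks.size(); c++) {
            int from = c * chunkSize;
            int count = Math.min(chunkSize, length - from);
            System.arraycopy(chunks.get(c), 0, out, from, count);
        }
        return out;
    }
}
```

Flattening with toArray() once reading has finished would restore plain array
access, at the price of one final copy.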

What is the reason that editing has not been supported? Is it a general
lack of interest, performance concerns, or just that nobody has gotten
around to it yet?

David


_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l