[Biojava-l] IntegerAlphabet IntegerSymbol

David Waring dwaring@u.washington.edu
Fri, 19 Oct 2001 15:56:54 -0700


I am working on bio.program.PhredSequence and its friends (for handling the
quality data associated with the output of Phred). PhredSequence uses
SymbolLists with an IntegerAlphabet. At present the getToken() method of
IntegerAlphabet.IntegerSymbol returns '#'. I guess this is because the
Symbol interface specifies that getToken() return a char. Shouldn't this be
a String? After all, SymbolParser parseToken() parses a String, and aren't we
dealing with alphabets that can have multi-character tokens, such as the
three-letter amino acid names? Has this issue come up before? Am I
misunderstanding 'token'?

One of the things that must be done with a PhredSequence is to write the
quality data (an IntegerAlphabet based SymbolList) to a fasta-like format.
I'd like to just create a Sequence with the quality SymbolList and be able
to write this using a FastaFormat. But FastaFormat calls seqString(), which
is coded in AbstractSymbolList to use getToken(), so it can only deal with
chars and can't handle IntegerSymbols. Another issue is that with an
IntegerSymbolList one would really like seqString() to output something
like '10 20 22 7' as opposed to '1020227'.
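
To illustrate why the separator matters, here is a minimal sketch in plain
Java. It does not touch any BioJava classes; QualityStringSketch and join()
are hypothetical names made up for this example only.

```java
// Hypothetical helper, not part of BioJava: joins integer quality
// scores into a single string using the given separator.
final class QualityStringSketch {
    private QualityStringSketch() {}

    static String join(int[] scores, String separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < scores.length; i++) {
            if (i > 0) {
                sb.append(separator); // separator goes between scores only
            }
            sb.append(scores[i]);
        }
        return sb.toString();
    }
}
```

With an empty separator, join(new int[]{10, 20, 22, 7}, "") and
join(new int[]{1, 0, 20, 2, 27}, "") produce the same string, so the
original scores cannot be recovered. With " " as the separator they can.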

Three options:
1) Create a new SequenceFormat just for this, and if there will be no other
use of IntegerSymbolList perhaps this is the best way to go.

2) Create an IntegerSymbolList that extends SimpleSymbolList overriding
seqString().

3) (most invasive but perhaps cleanest) Change getToken() to return a
String, or add toString() to Symbol, and add a method paddedSeqString() to
AbstractSymbolList.
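
As a rough idea of what option 2 could look like, here is a sketch using
stand-in classes. SketchSymbolList and SketchIntegerSymbolList are
hypothetical and only mimic the behaviour described above (one char per
symbol, '#' for every integer); they are not the real SimpleSymbolList or
AbstractSymbolList, and the real constructors and fields will differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a symbol list whose seqString() emits one
// char per symbol. Every integer symbol is rendered as '#'.
class SketchSymbolList {
    protected final List<Integer> values = new ArrayList<>();

    SketchSymbolList(int... vals) {
        for (int v : vals) {
            values.add(v);
        }
    }

    String seqString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            sb.append('#');
        }
        return sb.toString();
    }
}

// Hypothetical subclass (option 2): overrides seqString() to write the
// integer values separated by single spaces.
class SketchIntegerSymbolList extends SketchSymbolList {
    SketchIntegerSymbolList(int... vals) {
        super(vals);
    }

    @Override
    String seqString() {
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(v);
        }
        return sb.toString();
    }
}
```

Any caller that only knows about seqString(), as FastaFormat is described
as doing above, would pick up the space-separated form without changes.
The cost is that the fix stays local to this one subclass.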

Preferences, suggestions?

David

|||||||||||||||||||||||||||||||||||||||||||||||||||||||
|   David Waring
|   Systems Programmer
|   University of Washington Genome Center
|   dwaring@u.washington.edu
|   (206) 221-6902
|||||||||||||||||||||||||||||||||||||||||||||||||||||||