[Biojava-l] adding X to the DNA alphabet

Thomas Down td2@sanger.ac.uk
Sun, 20 Jan 2002 22:52:43 +0000


On Fri, Jan 18, 2002 at 02:34:05PM -0800, David Waring wrote:
> I know that this discussion has come up before and it seems that people
> generally agreed that it would be OK to add X to the dna ambiguity symbol
> list. I certainly need it in my work because I deal with sequence files
> generated by other programs that use X.
> 
> In the old IO model it was easy enough to modify the AlphabetManager.xml
> file so I did and have not worried about it for months. Well with the new
> model it is not so easy. As best I can tell there can only exist one
> ambiguity symbol for any set of bases. So you can not have both n and x act
> as symbols for AGCT. So if you add x as agct n will no longer work. If you
> add x as agc, v will no longer work (last one in the XML file wins). I'm
> guessing that there is a Map somewhere, though I have not found it.
> 
> I have temporarily gotten around it by just wiping out my 'b', since I
> really don't worry about ambiguity in my DNA much but must be able to read
> X. But does anyone see a proper solution that would let us use X?

Hi...

Yes, this would have been broken by the SymbolTokenization
patch.  But I think we can solve this one reasonably easily:

I've just checked in a small modification to CharacterTokenization
to support synonyms (it was always meant to do this, but it
fell off the bottom of my mental todo list -- sorry...).  This
means that you can now add an extra <ambiguityMapping>
element into AlphabetManager.xml, and have it parse 'X' into
the proper (matches any symbol) ambiguity symbol.  Does this
solve the problem?
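
To illustrate the idea (this is only a rough standalone sketch, not
the actual CharacterTokenization code; the class and method names
below are made up for the example): parsing uses a many-to-one table
from characters to symbols, while writing uses a one-to-one table
from each symbol to its first-registered character.  That way a
synonym like 'x' no longer knocks out 'n'.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of BioJava.
public class SynonymTokenizer {
    // Parsing: several characters may map to the same symbol.
    private final Map<Character, Set<Character>> parseTable = new HashMap<>();
    // Writing: each symbol keeps exactly one canonical character.
    private final Map<Set<Character>, Character> writeTable = new HashMap<>();

    // Register a character for the symbol matching the given bases.
    // The first character bound to a symbol stays canonical; later
    // ones become parse-only synonyms.
    public void bind(char token, String bases) {
        char key = Character.toLowerCase(token);
        Set<Character> symbol = toSet(bases);
        parseTable.put(key, symbol);
        writeTable.putIfAbsent(symbol, key);
    }

    // Character -> symbol (case-insensitive).
    public Set<Character> parse(char token) {
        Set<Character> symbol = parseTable.get(Character.toLowerCase(token));
        if (symbol == null) {
            throw new IllegalArgumentException("Unknown token: " + token);
        }
        return symbol;
    }

    // Symbol -> canonical character.
    public char tokenize(Set<Character> symbol) {
        Character token = writeTable.get(symbol);
        if (token == null) {
            throw new IllegalArgumentException("No token for symbol: " + symbol);
        }
        return token;
    }

    private static Set<Character> toSet(String bases) {
        Set<Character> set = new TreeSet<>();
        for (char c : bases.toLowerCase().toCharArray()) {
            set.add(c);
        }
        return set;
    }
}
```

With 'n' bound to acgt first and 'x' bound to acgt afterwards, both
characters parse to the same symbol, and that symbol is still
written back out as 'n'.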

To sort this out once and for all, I'd like to check the
modified AlphabetManager.xml in, so that individual users
no longer have to hack their own copy.  Does anyone have
any objections to me doing this?

   Thomas.