[Biojava-l] Subtle bug in SimpleDistribution

Schreiber, Mark mark.schreiber@agresearch.co.nz
Fri, 18 Jan 2002 10:56:35 +1300


Well, maybe not a bug, but a potential danger.

SimpleDistribution.Trainer will accept counts from ambiguity symbols (N)
and the Gap symbol. However, when it comes to training, it uses the
AlphabetIndexer for DNA, which does not include indices for these
symbols, and this leads to some very odd results.

A number of solutions might exist:

1) Prevent the addition of ambiguous symbols to
SimpleDistribution.Trainer. Safe, but an unexpected N in your sequence
could cause an unexpected exception, so not very user friendly.

2) Refactor SimpleDistribution.Trainer to add equal numbers of counts to
ambiguity subsymbols, i.e. if N is added then add 1 count to each of
a,c,g,t. However, this will not work for gap symbols.

3) Extend the DNA AlphabetIndex to include IUPAC ambiguities
(m,r,w,s,y,k,v,h,d,b,n) and the gap symbol. Solves the gap problem, but
maybe N should be added as one count to each of its subsymbols.
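
To make option 2 concrete, here is a rough sketch in plain Java. It does
not use the BioJava classes: AmbiguityAwareCounter, addCount, getCount
and getWeight are hypothetical names made up for illustration, and
symbols are plain chars rather than Symbol objects. It only shows the
idea of spreading a count over the subsymbols of an ambiguity code and
refusing the gap, which has no subsymbols to spread over.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a trainer; NOT the BioJava SimpleDistribution.Trainer API.
class AmbiguityAwareCounter {
    // Each entry is an IUPAC code, '=', then the unambiguous bases it can stand for.
    private static final String[] IUPAC = {
        "a=a", "c=c", "g=g", "t=t",
        "m=ac", "r=ag", "w=at", "s=cg", "y=ct", "k=gt",
        "v=acg", "h=act", "d=agt", "b=cgt", "n=acgt"
    };
    private static final Map<Character, String> SUBSYMBOLS = new LinkedHashMap<>();
    static {
        for (String entry : IUPAC) {
            SUBSYMBOLS.put(entry.charAt(0), entry.substring(2));
        }
    }

    // Counts are kept only for the four unambiguous bases.
    private final Map<Character, Double> counts = new LinkedHashMap<>();

    AmbiguityAwareCounter() {
        for (char base : "acgt".toCharArray()) {
            counts.put(base, 0.0);
        }
    }

    // Adds `count` to every subsymbol of `symbol`, so N adds `count` to each of a,c,g,t.
    // Returns false (and changes nothing) for symbols with no expansion, e.g. the gap '-'.
    boolean addCount(char symbol, double count) {
        String subs = SUBSYMBOLS.get(Character.toLowerCase(symbol));
        if (subs == null) {
            return false;
        }
        for (char base : subs.toCharArray()) {
            counts.merge(base, count, Double::sum);
        }
        return true;
    }

    // Raw count for one unambiguous base; 0.0 for anything else.
    double getCount(char base) {
        return counts.getOrDefault(Character.toLowerCase(base), 0.0);
    }

    // Count for `base` divided by the total; 0.0 if nothing has been counted yet.
    double getWeight(char base) {
        double total = 0.0;
        for (double c : counts.values()) {
            total += c;
        }
        return total == 0.0 ? 0.0 : getCount(base) / total;
    }
}
```

With this sketch, adding one count for a and one for n leaves a with 2
counts and c, g, t with 1 each, while addCount('-', 1.0) returns false
so the caller has to decide what to do with gaps. Dividing the count by
the number of subsymbols instead would keep the total at one per
observed symbol; which of the two is preferable is a separate question.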

Not sure I like any of these. Any other suggestions?


Mark Schreiber
Bioinformatics
AgResearch Invermay
PO Box 50034
Mosgiel
New Zealand

PH: +64 3 489 9175

 
