[Biojava-l] Subtle bug in SimpleDistribution

Thu, 17 Jan 2002 22:29:01 +0000

Hi Mark,

What are the symptoms of your bug? The handling of ambiguous counts 
should be implemented by:

DistributionTrainerContext.addCount(
   Distribution dist,
   Symbol sym,
   double times)

This passes the count on to:

DistributionTrainer.addCount(
   AtomicSymbol aSym,
   double times)

Is this chain of events not getting fired off for you?
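
To make the delegation concrete, here is a minimal sketch in plain Java. It is not BioJava's implementation: the class AmbiguousCountSketch, its MATCHES table and the nested Trainer are hypothetical stand-ins, and passing the full count to every matching atomic symbol is an assumption (the real context may weight the matches differently).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the context -> trainer delegation; not BioJava code.
class AmbiguousCountSketch {

    // Hypothetical table of the atomic DNA symbols that each symbol matches.
    static final Map<Character, List<Character>> MATCHES = Map.of(
        'a', List.of('a'), 'c', List.of('c'), 'g', List.of('g'), 't', List.of('t'),
        'r', List.of('a', 'g'), 'y', List.of('c', 't'),
        'n', List.of('a', 'c', 'g', 't'));

    // Plays the role of the trainer: it only ever sees atomic symbols.
    static final class Trainer {
        final Map<Character, Double> counts = new LinkedHashMap<>();

        void addCount(char atomicSym, double times) {
            counts.merge(atomicSym, times, Double::sum);
        }
    }

    // Plays the role of the context: expand the symbol, then delegate.
    // Assumption: every matching atomic symbol receives the full count.
    static void addCount(Trainer trainer, char sym, double times) {
        List<Character> matches = MATCHES.get(Character.toLowerCase(sym));
        if (matches == null) {
            throw new IllegalArgumentException("No atomic matches for symbol: " + sym);
        }
        for (char atomic : matches) {
            trainer.addCount(atomic, times);
        }
    }

    public static void main(String[] args) {
        Trainer trainer = new Trainer();
        for (char sym : "acgtn".toCharArray()) {
            addCount(trainer, sym, 1.0);
        }
        System.out.println(trainer.counts);
    }
}
```

In this sketch the trainer never has to index an ambiguity symbol, because the expansion happens before it is called. A gap has no entry in the table and is rejected, so gaps would still need a separate decision.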

There's a larger question over how to treat ambiguities during training.
I have come to the opinion that a sequence with an ambiguity in it is
actually a set of unambiguous sequences that match the ambiguous symbols
in all possible ways. This makes the maths work out better than saying
that it is partly one sequence and partly another. If you use the odds
measurements during DP then this all divides out to give the expected
likelihoods for alignments. However, it makes P(seq | model) equal to
sum_i P(seq_i | model), where the seq_i are the unambiguous sequences
obtained by choosing matches for all combinations of all ambiguous
symbols. Be aware: an obvious result of this is that a string of N's has
the maximal probability of all sequences of that length being aligned to
a model, whereas the log-odds of a string of N's being aligned will be 0
(an odds ratio of 1), as there is no information relative to the null
model to consider.
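
The effect on a string of N's can be checked with a small sketch. It assumes the simplest possible model, one independent emission distribution shared by every position, so that summing over all unambiguous expansions reduces to a product of per-position sums. The class AmbiguityLikelihoodSketch, the MATCHES table and the example emission values are hypothetical, chosen only for illustration.

```java
import java.util.Map;

// Hypothetical sketch: likelihood of a sequence when each ambiguous
// symbol is summed over all of its matching atomic symbols.
class AmbiguityLikelihoodSketch {

    // Hypothetical table: atomic symbols matched by each symbol.
    static final Map<Character, String> MATCHES = Map.of(
        'a', "a", 'c', "c", 'g', "g", 't', "t", 'n', "acgt");

    // P(seq | emissions) for an independent-position model. Summing over
    // every unambiguous expansion equals the product of per-position sums.
    static double probability(String seq, Map<Character, Double> emissions) {
        double p = 1.0;
        for (char sym : seq.toLowerCase().toCharArray()) {
            String matches = MATCHES.get(sym);
            if (matches == null) {
                throw new IllegalArgumentException("Unknown symbol: " + sym);
            }
            double positionSum = 0.0;
            for (char atomic : matches.toCharArray()) {
                positionSum += emissions.get(atomic);
            }
            p *= positionSum;
        }
        return p;
    }

    // Log of the odds ratio between the model and the null model.
    static double logOdds(String seq, Map<Character, Double> model,
                          Map<Character, Double> nullModel) {
        return Math.log(probability(seq, model) / probability(seq, nullModel));
    }

    public static void main(String[] args) {
        Map<Character, Double> model =
            Map.of('a', 0.5, 'c', 0.25, 'g', 0.125, 't', 0.125);
        Map<Character, Double> nullModel =
            Map.of('a', 0.25, 'c', 0.25, 'g', 0.25, 't', 0.25);

        System.out.println("P(aaaa | model) = " + probability("aaaa", model));
        System.out.println("P(nnnn | model) = " + probability("nnnn", model));
        System.out.println("log-odds(nnnn)  = " + logOdds("nnnn", model, nullModel));
    }
}
```

Under these assumptions every N contributes a factor of 1 to both the model and the null model, so the all-N string gets probability 1 while its log-odds come out as 0.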

Feel free to tell me this is all pants and come up with a more sensible
scheme for all of this. Anyone out there know some probability theory?

Matthew

Schreiber, Mark wrote:

> Well maybe not a bug but a potential danger.
> 
> SimpleDistribution.Trainer will accept counts from ambiguity symbols (N)
> and the gap symbol. However, when it comes to train, it uses the
> AlphabetIndexer for DNA, which does not include indices for these
> symbols, and this leads to some very odd results.
> 
> A number of solutions might exist:
> 
> 1) prevent the addition of ambiguous symbols to
> SimpleDistribution.Trainer. Safe, but an unexpected N in your sequence
> could cause an unexpected exception, so not very user friendly.
> 
> 2) refactor SimpleDistribution.Trainer to add equal numbers of counts to
> ambiguity subsymbols, i.e. if N is added then add 1 count to each of
> a,c,g,t. However, this will not work for gap symbols.
> 
> 3) extend the DNA AlphabetIndex to include IUPAC ambiguities
> (m,r,w,s,y,k,v,h,d,b,n) and the gap symbol. Solves the gap problem, but
> maybe N should be added as one count to each of its subsymbols.
> 
> Not sure I like any of these. Any other suggestions?
> 
> 
> Mark Schreiber
> Bioinformatics
> AgResearch Invermay
> PO Box 50034
> Mosgiel
> New Zealand
> 
> PH: +64 3 489 9175
