[Biojava-l] Subtle bug in SimpleDistribution

Schreiber, Mark mark.schreiber@agresearch.co.nz
Fri, 18 Jan 2002 14:01:02 +1300


Actually, after looking with a debugger I found that a programming error
on my part was causing negative counts to be added to the distribution.
Perhaps either SimpleDistributionTrainerContext or
SimpleDistribution.Trainer could be made to throw an exception in such a
case, so as to make it more idiot-proof.
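
A minimal sketch of that kind of guard, assuming nothing about the real
BioJava classes: GuardedCounts and its String-keyed count table are
hypothetical stand-ins for the trainer, used only to show where the check
would sit.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a distribution trainer; not the BioJava class. */
final class GuardedCounts {
    private final Map<String, Double> counts = new HashMap<>();

    /** Adds a count for a symbol, rejecting negative or non-finite values. */
    void addCount(String symbol, double times) {
        if (Double.isNaN(times) || Double.isInfinite(times) || times < 0.0) {
            throw new IllegalArgumentException(
                "Count for '" + symbol + "' must be finite and non-negative, got " + times);
        }
        counts.merge(symbol, times, Double::sum);
    }

    /** Returns the accumulated count, or 0 if the symbol was never seen. */
    double getCount(String symbol) {
        return counts.getOrDefault(symbol, 0.0);
    }

    public static void main(String[] args) {
        GuardedCounts trainer = new GuardedCounts();
        trainer.addCount("a", 2.0);
        try {
            trainer.addCount("a", -1.0); // the kind of caller bug described above
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("count(a) = " + trainer.getCount("a"));
    }
}
```

Failing at the point of insertion keeps the bad value out of the table, so
the error surfaces next to the caller that produced it rather than later
as odd-looking weights.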

Mark


Mark Schreiber
Bioinformatics
AgResearch Invermay
PO Box 50034
Mosgiel
New Zealand

PH: +64 3 489 9175

 

> -----Original Message-----
> From: Matthew Pocock [mailto:matthew_pocock@yahoo.co.uk]
> Sent: Friday, January 18, 2002 11:29 AM
> To: Schreiber, Mark
> Cc: biojava-l (E-mail)
> Subject: Re: [Biojava-l] Subtle bug in SimpleDistribution
> 
> 
> Hi Mark,
> 
> What are the symptoms of your bug? The handling of ambiguous counts 
> should be implemented by:
> 
> DistributionTrainerContext.addCount(
>    Distribution dist,
>    Symbol sym,
>    double times)
> 
> This passes the count on to:
> 
> DistributionTrainer.addCount(
>    AtomicSymbol aSym,
>    double times)
> 
> Is this chain of events not getting fired off for you?
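
The hand-off described above might look roughly like the sketch below.
Every type here is hypothetical (plain chars stand in for Symbol and
AtomicSymbol, and Context and AtomicTrainer are not the BioJava
interfaces). The even split of the weight is also an assumption; the
thread does not say how the count is divided.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the context-to-trainer hand-off; not the BioJava API. */
final class CountChainSketch {

    /** Receives counts for atomic (unambiguous) symbols only. */
    static final class AtomicTrainer {
        final Map<Character, Double> counts = new LinkedHashMap<>();

        void addCount(char atomic, double times) {
            counts.merge(atomic, times, Double::sum);
        }
    }

    /** Resolves a possibly ambiguous symbol before delegating to the trainer. */
    static final class Context {
        private final AtomicTrainer trainer;

        Context(AtomicTrainer trainer) {
            this.trainer = trainer;
        }

        // `matches` lists the atomic symbols the incoming symbol stands for.
        // Assumption: the weight is shared evenly between them.
        void addCount(List<Character> matches, double times) {
            if (matches.isEmpty()) {
                return; // nothing to delegate to
            }
            double share = times / matches.size();
            for (char atomic : matches) {
                trainer.addCount(atomic, share);
            }
        }
    }

    public static void main(String[] args) {
        AtomicTrainer trainer = new AtomicTrainer();
        Context context = new Context(trainer);
        context.addCount(Arrays.asList('a'), 1.0);                // unambiguous symbol
        context.addCount(Arrays.asList('a', 'c', 'g', 't'), 1.0); // n
        System.out.println(trainer.counts);
    }
}
```

The point of the two layers is that the trainer only ever sees atomic
symbols, so it never needs an index entry for an ambiguity.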
> 
> There's a larger question over how to treat ambiguities during
> training. I have come to the opinion that a sequence with an ambiguity
> in it is actually a set of unambiguous sequences that match the
> ambiguous symbols in all possible ways. This makes the maths work out
> better than saying that it is partly one sequence and partly another.
> If you use the odds measurements during DP then this all divides out
> to give the expected likelihoods for alignments. However, it makes
> P(seq | model) equal to sum_i P(seq_i | model), where the seq_i are
> the unambiguous sequences that could match your sequence by choosing
> matches for all combinations of all ambiguous symbols. Be aware: an
> obvious result of this is that a string of N's has the maximal
> probability, of all sequences of that length, of being aligned to a
> model, whereas the log-odds of a string of N's being aligned will be 0,
> as there is no information relative to the null model to consider.
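
The last point can be checked for a single emission. This is only a
sketch: it assumes N matches each of a, c, g, t, and the labels e(x) and
q(x) for the model and null-model emission probabilities are introduced
here, not taken from the thread.

```latex
\begin{align*}
P(\mathrm{N} \mid \text{model}) &= \sum_{x \in \{a,c,g,t\}} e(x) = 1, \\
P(\mathrm{N} \mid \text{null})  &= \sum_{x \in \{a,c,g,t\}} q(x) = 1, \\
\log \frac{P(\mathrm{N} \mid \text{model})}{P(\mathrm{N} \mid \text{null})}
  &= \log \frac{1}{1} = 0 .
\end{align*}
```

No single unambiguous symbol can have an emission probability above 1, so
N is maximal in probability while its log-odds score is 0 (an odds ratio
of 1).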
> 
> Feel free to tell me this is all pants and come up with a more sensible
> scheme for all of this. Anyone out there know some probability theory?
> 
> Matthew
> 
> Schreiber, Mark wrote:
> 
> > Well maybe not a bug but a potential danger.
> > 
> > SimpleDistribution.Trainer will accept counts from ambiguity symbols
> > (N) and the Gap symbol. However, when it comes to train, it uses the
> > AlphabetIndexer for DNA, which does not include indices for these
> > symbols, and this leads to some very odd results.
> > 
> > A number of solutions might exist:
> > 
> > 1) prevent the addition of ambiguous symbols to
> > SimpleDistribution.Trainer. Safe, but an unexpected N in your sequence
> > could cause an unexpected exception, so not very user friendly.
> > 
> > 2) refactor SimpleDistribution.Trainer to add equal numbers of counts
> > to ambiguity subsymbols, i.e. if N is added then add 1 count to each of
> > a,c,g,t. However, this will not work for gap symbols.
> > 
> > 3) extend the DNA AlphabetIndex to include IUPAC ambiguities
> > (m,r,w,s,y,k,v,h,d,b,n) and the gap symbol. Solves the gap problem, but
> > maybe N should be added as one count to each of its subsymbols.
> > 
> > Not sure I like any of these, any other suggestions??
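
Options 1 and 2 can be compared side by side in a small sketch. It is
illustrative only: symbols are plain chars rather than BioJava Symbols,
the class and its Policy enum are hypothetical, and only a few IUPAC codes
are mapped. Throwing on the gap under option 2 is one possible choice, not
something the thread settles.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of options 1 and 2; not the BioJava trainer. */
final class AmbiguityPolicySketch {

    enum Policy { REJECT, SPREAD }

    /** Sub-symbols for a few IUPAC codes; the gap '-' has none. */
    static List<Character> subSymbols(char sym) {
        switch (sym) {
            case 'a': case 'c': case 'g': case 't':
                return Collections.singletonList(sym);
            case 'r': return Arrays.asList('a', 'g');
            case 'y': return Arrays.asList('c', 't');
            case 'n': return Arrays.asList('a', 'c', 'g', 't');
            default:  return Collections.emptyList();
        }
    }

    final Map<Character, Double> counts = new LinkedHashMap<>();
    private final Policy policy;

    AmbiguityPolicySketch(Policy policy) {
        this.policy = policy;
    }

    void addCount(char sym, double times) {
        List<Character> subs = subSymbols(sym);
        if (subs.size() == 1) {
            // Atomic symbol: count it directly.
            counts.merge(sym, times, Double::sum);
        } else if (policy == Policy.REJECT || subs.isEmpty()) {
            // Option 1, and also the gap case that option 2 cannot handle.
            throw new IllegalArgumentException("Cannot train on symbol '" + sym + "'");
        } else {
            // Option 2: the full count goes to every sub-symbol.
            for (char sub : subs) {
                counts.merge(sub, times, Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        AmbiguityPolicySketch spread = new AmbiguityPolicySketch(Policy.SPREAD);
        spread.addCount('a', 1.0);
        spread.addCount('n', 1.0);
        System.out.println(spread.counts);
        try {
            spread.addCount('-', 1.0); // gap: no sub-symbols to spread over
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The gap shows why option 2 is incomplete on its own: with no sub-symbols
there is nothing to spread the count over, so some other rule is still
needed for it.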
> > 
=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================