[Biojava-l] issue with translating codons with N

LAW Andy andy.law at roslin.ed.ac.uk
Fri Sep 20 15:10:19 UTC 2013


I would expect a HashMap to perform every bit as well as a hand-rolled array.


On 20 Sep 2013, at 15:57, Nick England <nickengland at gmail.com> wrote:

> 
> Andy and Andy,
> 
> Does the array really result in that much performance improvement from a hashmap? Rather than a huge empty array you could just use a char[][][] for the three codons if you want to do it that way. I think the problem is that the code for translating "N" as X isn't called if the array lookup returns a match. And the array lookup doesn't seem calibrated to cope with Ns (or indeed any ambiguous bases.)
> 
> And Andy Law yes, its using the int values (but used to use the numbers 1,2,3,4 from the looks of it.)
> 
>        public int compoundToInt(NucleotideCompound c) {
>             char b = c.getUpperedBase().charAt(0);
>             return (int)b;
> //            int v = -1;
> //            if('A' == b) {
> //                v = 1;
> //            }
> //            else if('C' == b) {
> //                v = 2;
> //            }
> //            else if('G' == b) {
> //                v = 3;
> //            }
> //            else if('T' == b || 'U' == b) {
> //                v = 4;
> //            }
> //            return v;
> }
> 
> 
> On 20 September 2013 15:51, LAW Andy <andy.law at roslin.ed.ac.uk> wrote:
> Looking at the multipliers, I would hazard a guess that the *intent* is to multiply the numbers 0,1,2,3 (ACGT) rather than the ASCII codes. Are you sure the code uses ASCII values?
> 
> 
> On 20 Sep 2013, at 15:16, Nick England <nickengland at gmail.com> wrote:
> 
> > Everyone,
> >
> > I've stepped through with a debugger, and this is a bad bug.
> >
> > The code to translate from RNA->Protein does the following:
> > - Take the ASCII Value for the 3 RNA bases, and multiple the first pos by
> > 16, second by 4 and third by 1 and add them up.
> > - Assume there won't be any collisions.
> >
> > Here are the values which it then uses:
> >
> > A:65
> > G:71
> > C:67
> > U:85
> > N:78
> > ANA: 1417
> > CAU: 1417
> > ANG: 1423
> > CGA: 1423
> >
> > Notice any hash collisions?
> >
> > I don't get why this wasn't done in a standard JavaHashMap which would
> > ensure that any collisions were resolved. This is a pretty critical bug for
> > a biology informatics package.
> >
> > Nick
> >
> >
> > On 20 September 2013 13:45, Nick England <nickengland at gmail.com> wrote:
> >
> >> Hara,
> >>
> >> Hmm this is rather odd. I get the same issue with that sequence with a
> >> custom engine as well.
> >>
> >> My code has:
> >> Builder builder = new TranscriptionEngine.Builder();
> >>    builder.initMet(false);
> >>    builder.translateNCodons(true);
> >>    builder.trimStop(false);
> >>    TranscriptionEngine engine = builder.build();
> >>    Sequence<AminoAcidCompound> seq=engine.translate(new
> >> DNASequence("GTNTGTTAGTGT"));
> >>    assertEquals("XC*C", seq.toString());
> >>    Sequence<AminoAcidCompound> seq2=engine.translate(new
> >> DNASequence("ANAANG"));
> >>    System.out.println(seq2);
> >> the first sequence translates as expected, but your sequence is
> >> translating as HR, when it should be XX. This looks like a pretty bad bug!
> >>
> >> Nick
> >>
> >>
> >> On 19 September 2013 19:59, Hara Dilley <hdilley at sutrobio.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Is there an issue with the DNA Translation in biojava3.core?
> >>> It appears that it wants to translate "N" in certain cases
> >>> Executing:
> >>> new
> >>> DNASequence("ANAANG").getRNASequence().getProteinSequence().getSequenceAsString();
> >>> will produce  aa HR.
> >>>
> >>> thanks
> >>> Hara
> >>>
> >>> ________________________________
> >>>
> >>> This email and any attachments thereto may contain private, confidential,
> >>> and privileged material for the sole use of the intended recipient. Any
> >>> review, copying, or distribution of this email (or any attachments thereto)
> >>> by others is strictly prohibited. If you are not the intended recipient,
> >>> please contact the sender immediately and permanently delete the original
> >>> and any copies of this email and any attachments thereto.
> >>>
> >>> _______________________________________________
> >>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/biojava-l
> >>>
> >>
> >>
> > _______________________________________________
> > Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biojava-l
> 
> Later,
> 
> Andy
> --------
> Yada, yada, yada...
> 
> Disclaimer: This e-mail and any attachments are confidential and intended solely for the use of the recipient(s) to whom they are addressed. If you have received it in error, please destroy all copies and inform the sender.
> 
> 
> 
> 
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 

Later,

Andy
--------
Yada, yada, yada...

Disclaimer: This e-mail and any attachments are confidential and intended solely for the use of the recipient(s) to whom they are addressed. If you have received it in error, please destroy all copies and inform the sender.




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.





More information about the Biojava-l mailing list