[Biojava-l] Problem to translate RNA into DNA with 'N' ambiguity

Thomas Down td2@sanger.ac.uk
Tue, 14 Jan 2003 14:06:44 +0000


On Tue, Jan 14, 2003 at 02:25:13PM +0100, Olivier JEFFROY wrote:
> Hi everyone,
> 
> I'm new to BioJava and I have a small problem to solve. I have DNA sequences which come from sequencing. These sequences contain N (an ambiguity over the a, t, g, c nucleotides). I'd like to know how I could resolve my problem.

Could you explain your problem in more detail?  In general, BioJava
has good support for ambiguous symbols, in any Alphabet.  Internally,
all possible ambiguities can be represented.

What do you mean by `translate'?  In biology, that normally
refers specifically to the RNA -> protein data conversion.

In BioJava, when you apply any kind of translation table to
an ambiguity symbol, it will translate all possible matching
symbols, then return an ambiguity symbol over all the possible
translations.  So if you convert the DNA 'n' [a,c,g,t] to RNA,
you'll get [a,c,g,u], which will also be printed as 'n' if you
write it to a file.  Similarly, if you translate the sequence
"agn" to protein, you'll get back the ambiguity symbol
[serine, arginine], since these are the two possible matching
amino acids.  But if you translate "ggn", you'll just get
back the (non-ambiguous) symbol for glycine, since that's
the only possible translation.
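
To make the idea concrete, here is a minimal sketch in plain Java.
It does not use the BioJava API at all: the class name
AmbiguousCodonSketch and its methods are made up for illustration,
and it only understands a, c, g, t/u and n.  It expands each
ambiguous position into every matching base, looks each resulting
codon up in the standard genetic code, and collects the distinct
amino acids.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: NOT the BioJava API, just the expand-then-collect idea.
class AmbiguousCodonSketch {
    // Standard genetic code, codons enumerated with bases in the order t, c, a, g.
    private static final String BASES = "tcag";
    private static final String AMINO =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    // Bases matched by one symbol: 'n' matches all four, 'u' is treated as 't'.
    static String matches(char symbol) {
        char c = Character.toLowerCase(symbol);
        if (c == 'n') return BASES;
        if (c == 'u') return "t";
        if (BASES.indexOf(c) < 0) {
            throw new IllegalArgumentException("Unsupported symbol: " + symbol);
        }
        return String.valueOf(c);
    }

    // Translate one codon, returning every amino acid it could encode.
    static Set<Character> translateCodon(String codon) {
        if (codon.length() != 3) {
            throw new IllegalArgumentException("A codon has three symbols: " + codon);
        }
        Set<Character> result = new TreeSet<>();
        for (char b1 : matches(codon.charAt(0)).toCharArray()) {
            for (char b2 : matches(codon.charAt(1)).toCharArray()) {
                for (char b3 : matches(codon.charAt(2)).toCharArray()) {
                    int index = 16 * BASES.indexOf(b1)
                              + 4 * BASES.indexOf(b2)
                              + BASES.indexOf(b3);
                    result.add(AMINO.charAt(index));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("agn -> " + translateCodon("agn")); // agn -> [R, S]
        System.out.println("ggn -> " + translateCodon("ggn")); // ggn -> [G]
    }
}
```

A result set with more than one member corresponds to an ambiguity
symbol, and a set with exactly one member to an ordinary symbol.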

If you *are* talking about translating sequences containing
ambiguity symbols to protein, there was a problem in BioJava 1.2x
if you tried to print the resulting protein sequence, since only
a few protein ambiguities have standard single-letter
representations.  Those without one caused an error when you tried
to print them.  BioJava 1.3 contains a workaround for this -- any
`unknown' ambiguity symbol is printed as a more general alternative
which does have a defined character.  So in a protein sequence, most
ambiguity symbols will just become "X".
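
The fallback can be pictured with another small plain-Java sketch.
Again, this is not BioJava's actual code: the class
AmbiguityPrintSketch is hypothetical, and it assumes the only
protein ambiguities with their own letters are the IUPAC codes
B (D or N) and Z (E or Q), with X standing for any residue.  Given
a set of possible residues, it picks the narrowest of those that
covers the whole set.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: NOT BioJava's implementation of the workaround.
class AmbiguityPrintSketch {
    // IUPAC protein ambiguity codes that have their own single letter.
    private static final Set<Character> ASX = new TreeSet<>(Arrays.asList('D', 'N'));
    private static final Set<Character> GLX = new TreeSet<>(Arrays.asList('E', 'Q'));

    // Pick a printable character for a set of possible residues.
    static char toChar(Set<Character> residues) {
        if (residues.isEmpty()) {
            throw new IllegalArgumentException("No residues given");
        }
        if (residues.size() == 1) {
            return residues.iterator().next(); // unambiguous: print as-is
        }
        if (ASX.containsAll(residues)) return 'B';
        if (GLX.containsAll(residues)) return 'Z';
        return 'X'; // no narrower letter exists, so fall back to "any"
    }

    public static void main(String[] args) {
        System.out.println(toChar(new TreeSet<>(Arrays.asList('R', 'S')))); // X
        System.out.println(toChar(new TreeSet<>(Arrays.asList('D', 'N')))); // B
        System.out.println(toChar(new TreeSet<>(Arrays.asList('G'))));      // G
    }
}
```

With this rule the [serine, arginine] ambiguity from "agn" has no
letter of its own, so it prints as "X".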

Does this help?

    Thomas.