[Biojava-l] How to create a SymbolList with a String that contains illegal Char

Tue Dec 9 05:25:14 EST 2003

Is 'i' actually a legal symbol from the RNA alphabet, in terms of biojava?
If not how should we define it? Would it be best modelled as an atomic
symbol or some kind of ambiguity? Stretching back to my biochem undergrad
days I think it should be atomic. That would make the size of the RNA
alphabet 5.

I've just checked the AlphabetManager.xml and inosine isn't in there. If
there are no objections I will add it as an AtomicSymbol tomorrow with a
mapping to the character 'i'. The question is should it be added as a member
of the RNA alphabet or as a member of the nucleotide alphabet or both?
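
To make the atomic option concrete, here is a rough sketch in plain Java. It does not use BioJava or AlphabetManager.xml at all; the class RnaAlphabetSketch and the method atomicRna are hypothetical names made up for illustration, and the alphabet is modelled as nothing more than a map from token character to symbol name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not BioJava's Alphabet or AtomicSymbol classes.
public class RnaAlphabetSketch {

    /**
     * Builds a token -> symbol-name map holding atomic symbols only.
     * When withInosine is true, 'i' is added as a fifth atomic symbol
     * rather than as an ambiguity over the other four.
     */
    public static Map<Character, String> atomicRna(boolean withInosine) {
        Map<Character, String> symbols = new LinkedHashMap<>();
        symbols.put('a', "adenine");
        symbols.put('c', "cytosine");
        symbols.put('g', "guanine");
        symbols.put('u', "uracil");
        if (withInosine) {
            symbols.put('i', "inosine");
        }
        return symbols;
    }

    public static void main(String[] args) {
        System.out.println("without inosine: " + atomicRna(false).size());
        System.out.println("with inosine:    " + atomicRna(true).size());
    }
}
```

Modelled this way the alphabet size goes from 4 to 5, which is the consequence of the atomic choice mentioned above. An ambiguity symbol would instead leave the size at 4 and map 'i' onto a set of the existing symbols.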

- Mark

-----Original Message-----
From: biojava-l-bounces at portal.open-bio.org
[mailto:biojava-l-bounces at portal.open-bio.org] On Behalf Of David Huen
Sent: Tuesday, 9 December 2003 9:00 p.m.
To: taoxu at bioinformatics.ubc.ca; biojava-l at biojava.org
Subject: Re: [Biojava-l] How to create a SymbolList with a String
that contains illegal Char

On Tuesday 09 Dec 2003 2:01 am, Tao Xu wrote:
> Hi there,
>
> Does anyone know how to create a SymbolList with a String that 
> contains an illegal symbol?
>
> I encountered an IllegalSymbolException when I tried to retrieve 
> sequences from a sequence database. The sequence that gave me the 
> trouble was a RefSeq sequence, accession number NT_039621, Mus 
> musculus chromosome 15 genomic contig. I first used 
> DNATools.createDNA(String dna), and got an IllegalSymbolException that 
> indicated there was at least one 'u' in the sequence. I then used 
> NucleotideTools.createNucleotide(String nucleotide). This time the 'u'
> did not cause any problem, but I still got an 
> IllegalSymbolException that indicated there was an 'l' in the sequence.
>
> I am afraid there must be lots of illegal symbols in GenBank's 
> sequences, so I am wondering if there is a way to create an error-tolerant 
> SymbolList object. If not, I am afraid I have to create an Alphabet 
> object that contains Symbols covering every char in Java and use this 
> Alphabet object to create a CharacterTokenization using 
> CharacterTokenization(Alphabet alpha, boolean caseSensitive) 
> constructor, and then use the resulting CharacterTokenization object 
> to call SimpleSymbolList(SymbolTokenization st, String seqString) to 
> get a SimpleSymbolList object. I guess there must be a better way in 
> Biojava to do this. Your help is highly appreciated.
>
> If I have to create an Alphabet that covers every char in Java, how 
> can I do it? I originally thought that merging the NUCLEOTIDE and PROTEIN 
> Alphabets to create a new Alphabet would cover all the 
> Symbols in GenBank sequences, but I noticed there is no method to 
> merge two Alphabets in AlphabetManager. Is there a way to merge two 
> Alphabets? If not, it is probably worth implementing one. It would be 
> useful not only for handling illegal symbols that exist in the databases, but 
> also for other applications, like using non-standard symbols to generate 
> a blastable MSBlast database.
>
> Thanks a lot for your help.
>
I think the problem you are encountering is because the sequence you are
reading is an RNA sequence.  So the "u" and "i" are uracil and inosine
respectively and therefore correctly illegal for a DNA sequence.

You will probably have much greater happiness by using:-
RNATools.createRNA(String rna)
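
Before choosing between the DNA and RNA factory methods, it can help to see which characters in the string fall outside a given character set. The following is a rough sketch in plain Java, not a BioJava call: IllegalCharScan and findIllegal are hypothetical names, and the allowed sets "acgt" and "acgu" are simplifying assumptions that cover only the four standard bases, not BioJava's actual tokenization tables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; it does not depend on BioJava.
public class IllegalCharScan {

    /**
     * Returns every character of seq that is not in allowed (compared
     * case-insensitively), mapped to the 0-based index where it first occurs.
     */
    public static Map<Character, Integer> findIllegal(String seq, String allowed) {
        String allowedLower = allowed.toLowerCase();
        Map<Character, Integer> firstSeen = new LinkedHashMap<>();
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toLowerCase(seq.charAt(i));
            if (allowedLower.indexOf(c) < 0 && !firstSeen.containsKey(c)) {
                firstSeen.put(c, i);
            }
        }
        return firstSeen;
    }

    public static void main(String[] args) {
        String seq = "acgUuacgi"; // made-up test string, not from NT_039621
        System.out.println("vs DNA set: " + findIllegal(seq, "acgt"));
        System.out.println("vs RNA set: " + findIllegal(seq, "acgu"));
    }
}
```

If only 'u' shows up against the DNA set, the sequence is most likely RNA. Anything still reported against the RNA set, such as 'i' in the made-up string here, is a character that neither set covers.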

Regards,
David Huen
