[Biojava-l] How to create a SymbolList with a String that contains illegal Char

David Huen david.huen at ntlworld.com
Tue Dec 9 02:59:55 EST 2003


On Tuesday 09 Dec 2003 2:01 am, Tao Xu wrote:
> Hi there,
>
> Does anyone know how to create a SymbolList from a String that
> contains an illegal symbol?
>
> I encountered an IllegalSymbolException when I tried to retrieve
> sequences from a sequence database. The sequence that gave me the
> trouble was a refseq sequence, accession number NT_039621, Mus
> musculus chromosome 15 genomic contig. I first used
> DNATools.createDNA(String dna), and got an IllegalSymbolException that
> indicated there was at least one 'u' in the sequence. I then used
> NucleotideTools.createNucleotide(String nucleotide); this time the 'u'
> did not cause any problem, but I still got an
> IllegalSymbolException that indicated there was an 'l' in the sequence.
>
> I am afraid there must be lots of illegal symbols in GenBank's
> sequences, so I am wondering if there is a way to create an
> error-tolerant SymbolList object. If not, I am afraid I have to create
> an Alphabet object containing Symbols that cover every char in Java,
> use this Alphabet object to create a CharacterTokenization with the
> CharacterTokenization(Alphabet alpha, boolean caseSensitive)
> constructor, and then pass the resulting CharacterTokenization object
> to SimpleSymbolList(SymbolTokenization st, String seqString) to
> get a SimpleSymbolList object. I guess there must be a better way in
> Biojava to do this. Your help is highly appreciated.
>
> If I have to create an Alphabet that covers every char in Java, how
> can I do it? I originally thought merging the NUCLEOTIDE and PROTEIN
> Alphabets into a new Alphabet would cover all the
> Symbols in GenBank sequences, but I noticed there is no method to
> merge two Alphabets in AlphabetManager. Is there a way to merge two
> Alphabets? If not, it is probably worth implementing one. It would be
> useful not only for handling illegal symbols that exist in the
> databases, but also for other applications, such as using
> non-standard symbols to generate a blastable MSBlast database.
>
> Thanks a lot for your help.
>
I think the problem you are encountering is because the sequence you are 
reading is an RNA sequence.  So the "u" and "i" are uracil and inosine 
respectively and therefore correctly illegal for a DNA sequence.

You will probably have much greater happiness by using:-
RNATools.createRNA(String rna)
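
If it is unclear which characters are tripping the parser, it may help to scan the raw string before handing it to any of the create methods. The following is only a plain-Java sketch and does not use BioJava at all: the class name UnexpectedChars, the method find, and the "acgt" allowed set are all made up for illustration, and the set of characters a given BioJava alphabet really accepts should be checked against the library itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BioJava.
class UnexpectedChars {

    // Returns each character of seq that is not in allowed (compared
    // case-insensitively), mapped to the zero-based index where it first occurs.
    static Map<Character, Integer> find(String seq, String allowed) {
        String allowedLower = allowed.toLowerCase();
        Map<Character, Integer> firstSeen = new LinkedHashMap<>();
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toLowerCase(seq.charAt(i));
            if (allowedLower.indexOf(c) < 0 && !firstSeen.containsKey(c)) {
                firstSeen.put(c, i);
            }
        }
        return firstSeen;
    }

    public static void main(String[] args) {
        // "acgt" is an assumed allowed set chosen for this example only.
        String seq = "acgtuacgil";
        System.out.println(find(seq, "acgt")); // prints {u=4, i=8, l=9}
    }
}
```

Knowing the offending characters and where they first appear should make it easier to tell whether the sequence is really RNA or whether the input is carrying stray text.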



Regards,
David Huen


