[Biojava-l] How to create a SymbolList with a String that contains illegal Char

Tao Xu taoxu at bioinformatics.ubc.ca
Mon Dec 8 21:01:36 EST 2003


Hi there,

Does anyone know how to create a SymbolList from a String that 
contains illegal symbols? 

I encountered an IllegalSymbolException when I tried to retrieve 
sequences from a sequence database. The sequence that gave me 
trouble was a RefSeq sequence, accession number NT_039621, Mus 
musculus chromosome 15 genomic contig. I first used 
DNATools.createDNA(String dna) and got an IllegalSymbolException 
indicating that there was at least one 'u' in the sequence. I then 
used NucleotideTools.createNucleotide(String nucleotide); this time 
the 'u' did not cause any problem, but I still got an 
IllegalSymbolException, indicating that there was an 'l' in the 
sequence. 
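
To see which characters trigger the exception before parsing, the 
sequence string can be scanned first. The following is a minimal 
plain-Java sketch, not BioJava code: the class name, the method name, 
and the ALLOWED set are assumptions made for illustration, and the 
real set of accepted characters depends on the alphabet in use.

```java
// Plain-Java sketch; it does not call BioJava.
class IllegalCharScan {
    // Assumption: characters the target alphabet accepts (illustrative only).
    static final String ALLOWED = "acgtn";

    // Returns the distinct characters of seq that are not in ALLOWED,
    // in order of first appearance. The comparison ignores case.
    static String illegalChars(String seq) {
        StringBuilder found = new StringBuilder();
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toLowerCase(seq.charAt(i));
            boolean allowed = ALLOWED.indexOf(c) >= 0;
            boolean alreadySeen = found.indexOf(String.valueOf(c)) >= 0;
            if (!allowed && !alreadySeen) {
                found.append(c);
            }
        }
        return found.toString();
    }

    public static void main(String[] args) {
        // A short made-up string containing both a 'u' and an 'l'.
        System.out.println(illegalChars("acgtuacglta")); // prints "ul"
    }
}
```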

I am afraid there must be many illegal symbols in GenBank 
sequences, so I am wondering whether there is a way to create an 
error-tolerant SymbolList object. If not, I am afraid I will have to 
create an Alphabet object containing Symbols that cover every char in 
Java, use this Alphabet to create a CharacterTokenization with the 
CharacterTokenization(Alphabet alpha, boolean caseSensitive) 
constructor, and then pass the resulting CharacterTokenization to 
SimpleSymbolList(SymbolTokenization st, String seqString) to get a 
SimpleSymbolList object. I guess there must be a better way to do 
this in BioJava. Your help would be highly appreciated.
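
One error-tolerant workaround that avoids a custom Alphabet is to 
clean the String before parsing it. The sketch below is plain Java 
and only an illustration: the class and method names are 
hypothetical, and it assumes the chosen placeholder character (for 
example 'n') is itself accepted by the alphabet used afterwards.

```java
// Plain-Java sketch; it does not call BioJava.
class SequenceSanitizer {
    // Replaces every character of seq that is not listed in allowed
    // (compared case-insensitively) with the given placeholder.
    // Accepted characters are copied through unchanged.
    static String sanitize(String seq, String allowed, char placeholder) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            boolean ok = allowed.indexOf(Character.toLowerCase(c)) >= 0;
            out.append(ok ? c : placeholder);
        }
        return out.toString();
    }
}
```

The cleaned String, e.g. sanitize(raw, "acgtn", 'n'), could then be 
passed to DNATools.createDNA(String dna). Note that this discards the 
original characters, so it only suits cases where the unknown 
positions do not need to be preserved.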

If I have to create an Alphabet that covers every char in Java, how 
can I do it? I originally thought that merging the NUCLEOTIDE and 
PROTEIN Alphabets into a new Alphabet would cover all the Symbols in 
GenBank sequences, but I noticed that there is no method to merge two 
Alphabets in AlphabetManager. Is there a way to merge two Alphabets? 
If not, it is probably worth implementing one. It would be useful not 
only for handling illegal symbols that exist in the databases, but 
also for other applications, such as using non-standard symbols to 
generate a blastable MSBlast database.
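
To make the idea of a merge concrete, here is a plain-Java sketch of 
the union of two sets of token characters. It does not build a 
BioJava Alphabet; the class and method names are hypothetical, and 
the example strings are made up rather than the actual NUCLEOTIDE or 
PROTEIN token sets.

```java
// Plain-Java sketch; it does not call BioJava.
class TokenUnion {
    // Returns every character that occurs in a or b exactly once,
    // in order of first appearance (characters of a come first).
    static String union(String a, String b) {
        StringBuilder out = new StringBuilder();
        for (char c : (a + b).toCharArray()) {
            if (out.indexOf(String.valueOf(c)) < 0) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, union("acgtu", "acdefg") yields "acgtudef". A real merge 
would also have to decide what a shared token such as 'a' means when 
the two source alphabets assign it different Symbols.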

Thanks a lot for your help.

Regards,

Tao




