[Biojava-dev] Case-sensitive ProteinSequences

Spencer Bliven sbliven at ucsd.edu
Wed Nov 30 01:29:59 UTC 2011


I'm currently trying to read a FASTA file which encodes some information in
the case of each amino acid. Specifically, the FASTA contains an alignment
where upper case letters are aligned and lower case are unaligned.

The first problem I ran into was that lower-case letters are not valid as
input to AminoAcidCompoundSet.getCompoundForString(String), which gets
called indirectly from the FastaReader. This could be fixed by subclassing
AminoAcidCompoundSet and calling toUpperCase() on the input. However, the
second problem is that I need to extract that case information later on. My
current solution is a subclass of AminoAcidCompoundSet which contains two
copies of each amino acid, one upper-case and one lower-case. This seems like a very
ugly solution and it breaks all the Alignment algorithms (due to missing
amino acids in the scoring matrices). Does anyone have a better suggestion?
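
One direction that might avoid both problems is to keep the case information
outside the compound set: upper-case the residues before they reach the
parser, and record which positions were originally upper case in a separate
mask. The sketch below is plain Java with no BioJava dependency. The class
CaseMaskedSequence and its methods are hypothetical names for illustration,
not existing BioJava API, and it assumes ASCII input (so upper-casing never
changes the length) and that gap characters count as unaligned.

```java
import java.util.BitSet;
import java.util.Locale;

/**
 * Hypothetical helper (not part of BioJava): stores an upper-cased
 * sequence together with a mask of the positions that were upper case
 * (aligned) in the original FASTA text.
 */
final class CaseMaskedSequence {
    private final String residues; // upper-cased residues, usable with a standard compound set
    private final BitSet aligned;  // bit i is set if residue i was upper case in the input

    private CaseMaskedSequence(String residues, BitSet aligned) {
        this.residues = residues;
        this.aligned = aligned;
    }

    /** Splits a mixed-case sequence into upper-case residues plus a case mask. */
    static CaseMaskedSequence parse(String raw) {
        BitSet aligned = new BitSet(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            // Non-letters such as '-' are not upper case, so they stay unaligned.
            if (Character.isUpperCase(raw.charAt(i))) {
                aligned.set(i);
            }
        }
        // Assumes ASCII input, so the upper-cased string has the same length.
        return new CaseMaskedSequence(raw.toUpperCase(Locale.ROOT), aligned);
    }

    /** Upper-cased residues, suitable for passing on to a sequence parser. */
    String residues() {
        return residues;
    }

    /** True if the residue at this index was upper case (aligned) in the input. */
    boolean isAligned(int index) {
        return aligned.get(index);
    }

    /** Rebuilds the mixed-case string from the residues and the mask. */
    String originalCase() {
        StringBuilder sb = new StringBuilder(residues.length());
        for (int i = 0; i < residues.length(); i++) {
            char c = residues.charAt(i);
            sb.append(aligned.get(i) ? c : Character.toLowerCase(c));
        }
        return sb.toString();
    }
}
```

With something like this, residues() could be handed to the unmodified
AminoAcidCompoundSet, so the scoring matrices would see only standard amino
acids, while isAligned(i) recovers the case information afterwards. How to
attach the mask to a ProteinSequence is left open here.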

Thanks,
Spencer



