[Biojava-dev] Case sensitivity in Alignment
Waring, David A
dwaring at fhcrc.org
Thu Nov 21 19:19:01 UTC 2013
The fact that there is a class CaseFreeAminoAcidCompoundSet is a sign of the very problem with the design.
There is no such thing as an uppercase amino acid or a lowercase nucleotide. Representing a nucleotide with an ASCII character is a convention, and in most cases a guanine is written as either 'g' or 'G'. Regardless of how it is represented in a file, the object must represent a guanine, not a 'G' or a 'g'.
BioJava 1 was quite explicit in its understanding of this basic point. As best as I can tell BioJava 3 seems to miss this. I have just begun to try out BioJava 3 and this makes me wonder what other issues I will run into.
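To make the point concrete, here is a minimal sketch of the separation I have in mind. It does not use BioJava at all: the class name SoftMaskedDna and all of its methods are hypothetical and exist only for illustration. The assumption is that a symbol's identity is stored once in canonical form, and that case (for example soft-masking) is kept as separate per-position information, much like the boolean array described further down the thread.

```java
import java.util.Locale;

/**
 * Hypothetical illustration only; not part of BioJava.
 * Identity of each base is stored in canonical (upper-case) form,
 * while the case seen in the input is kept as a separate mask.
 */
public final class SoftMaskedDna {
    private final String bases;      // canonical upper-case symbols
    private final boolean[] masked;  // true where the input character was lower case

    public SoftMaskedDna(String text) {
        this.bases = text.toUpperCase(Locale.ROOT);
        this.masked = new boolean[text.length()];
        for (int i = 0; i < text.length(); i++) {
            masked[i] = Character.isLowerCase(text.charAt(i));
        }
    }

    /** Canonical symbols, suitable for lookup in an upper-case matrix. */
    public String bases() {
        return bases;
    }

    /** Whether position i was lower case (soft-masked) in the input. */
    public boolean isMasked(int i) {
        return masked[i];
    }

    /** Rebuilds the text exactly as it was given, including case. */
    public String original() {
        StringBuilder sb = new StringBuilder(bases.length());
        for (int i = 0; i < bases.length(); i++) {
            char c = bases.charAt(i);
            sb.append(masked[i] ? Character.toLowerCase(c) : c);
        }
        return sb.toString();
    }

    /** Design choice in this sketch: equality ignores the mask. */
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SoftMaskedDna)) {
            return false;
        }
        return bases.equals(((SoftMaskedDna) o).bases);
    }

    @Override
    public int hashCode() {
        return bases.hashCode();
    }

    public static void main(String[] args) {
        SoftMaskedDna upper = new SoftMaskedDna("ACCG");
        SoftMaskedDna mixed = new SoftMaskedDna("acCG");
        System.out.println(upper.equals(mixed)); // prints "true"
        System.out.println(mixed.original());    // prints "acCG"
    }
}
```

Whether equality should ignore the mask is itself a design decision; the sketch ignores it because the thread is about alignment scoring, where a guanine should score as a guanine however it was typed.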
On Nov 20, 2013, at 3:02 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:
> I neglected the CaseFreeAminoAcidCompoundSet from the aa-prop module's xml
> package. I have no idea why it's there.
>
> -Spencer
>
>
> On Wed, Nov 20, 2013 at 2:54 PM, Spencer Bliven <sbliven at ucsd.edu> wrote:
>
>> The issue of case has come up before, and to my knowledge it hasn't been
>> handled particularly consistently. There's a CaseInsensitiveCompound in
>> core which is not used by any other class, and which is pretty much useless
>> since it doesn't derive from NucleotideCompound but merely wraps it.
>> There's also a CasePreservingProteinSequenceCreator, which was my solution
>> to maintain the case information while still working with a standard
>> AminoAcid CompoundSet. It's an ugly solution: I just turn everything to
>> uppercase while storing the case as a boolean array in the sequence's
>> UserCollection. That could easily be adapted to nucleic acid, but I'd
>> welcome a cleaner solution if anyone has one.
>>
>>
>> On Wed, Nov 20, 2013 at 8:57 AM, Michael Heuer <heuermh at gmail.com> wrote:
>>
>>> Sorry, I may not be keeping up with you both here, but the code in
>>> question is in the alignment package, and if the substitution matrices
>>> are all upper case they won't match lower case soft masked sequence;
>>> wouldn't that be the intent? (A feature not a bug)
>>>
>>> michael
>>>
>>> On Wed, Nov 20, 2013 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> The problem is that the substitution matrices are all upper case. We can
>>>> probably fix this by making the NucleotideCompound.equals method case
>>>> insensitive...
>>>>
>>>> Does anybody see an issue with that?
>>>>
>>>> A
>>>>
>>>>
>>>> On Wed, Nov 20, 2013 at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:
>>>>>
>>>>> Hello Andreas, David
>>>>>
>>>>> Lower case is the convention for soft-masking sequences from alignment
>>>>>
>>>>> http://www.ncbi.nlm.nih.gov/books/NBK1763/
>>>>>
>>>>>
>>>>> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
>>>>>
>>>>> If we are using this convention, perhaps it should be more clearly
>>>>> documented. What happens if you use mixed case?
>>>>>
>>>>> michael
>>>>>
>>>>>
>>>>> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> not sure if we should consider this a bug or a feature: It should be easy
>>>>>> to work around this by calling toUpperCase on your strings. We could of
>>>>>> course internally convert all nucleotides to upper case, but that would
>>>>>> remove the possibility for people to use mixed upper case and lower case
>>>>>> sequences to represent e.g. alignment conservation.
>>>>>>
>>>>>> Any opinions by other people on this? Is anybody using mixed case
>>>>>> sequences?
>>>>>>
>>>>>> Andreas
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org> wrote:
>>>>>>
>>>>>>>
>>>>>>> There seems to be a bug in the alignment package. If DNA sequences are
>>>>>>> created using lower case letters, the alignment methods don't work. It
>>>>>>> looks like the default substitution matrix is coded in upper case, and the
>>>>>>> underlying case of the DNA sequence is being used in the alignment. Seems
>>>>>>> like a bug to me.
>>>>>>>
>>>>>>> This problem occurs when the DNA sequence is created either using the
>>>>>>> DNASequence constructor, or by reading from a FASTA file which is in lower
>>>>>>> case.
>>>>>>>
>>>>>>>
>>>>>>> The code below shows the problem.
>>>>>>>
>>>>>>>
>>>>>>> static SimpleGapPenalty gapP;
>>>>>>> static SubstitutionMatrix<NucleotideCompound> matrix;
>>>>>>>
>>>>>>> public static void main(String[] args) throws Exception {
>>>>>>>     matrix = SubstitutionMatrixHelper.getNuc4_4();
>>>>>>>     gapP = new SimpleGapPenalty();
>>>>>>>     gapP.setOpenPenalty((short) 5);
>>>>>>>     gapP.setExtensionPenalty((short) 2);
>>>>>>>     testHardcoded();
>>>>>>> }
>>>>>>>
>>>>>>> public static void testHardcoded() throws Exception {
>>>>>>>     Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
>>>>>>>     Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
>>>>>>>     Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
>>>>>>>     Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
>>>>>>>     alignPair(seq1, seq2); // upper case vs. upper case
>>>>>>>     alignPair(seq1, seq4); // upper case vs. lower case
>>>>>>>     alignPair(seq3, seq4); // lower case vs. lower case
>>>>>>> }
>>>>>>>
>>>>>>> public static void alignPair(Sequence<NucleotideCompound> seq1,
>>>>>>>         Sequence<NucleotideCompound> seq2) {
>>>>>>>     SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
>>>>>>>             Alignments.getPairwiseAlignment(seq1, seq2,
>>>>>>>                     Alignments.PairwiseSequenceAlignerType.GLOBAL,
>>>>>>>                     gapP, matrix);
>>>>>>>
>>>>>>>     System.out.printf("%s", pair);
>>>>>>>     System.out.println();
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> biojava-dev mailing list
>>>>>>> biojava-dev at lists.open-bio.org
>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev