[Biojava-dev] Case sensitivity in Alignment
Waring, David A
dwaring at fhcrc.org
Thu Nov 21 19:18:14 UTC 2013
If that is the intended behavior, then it must be explicit and dependent on the class of sequence. So there would need to be a MaskedDNASequence, or perhaps a MaskedNucleotideCompound, with a different equals() method.
A DNASequence<NucleotideCompound> should behave exactly the same way regardless of how it was created and, in particular, regardless of the file format it was read from. How does the current code behave with a GenBank file? An EMBL file? A GCG file? There should be no question in a user's mind about how it will behave. Now, if a user is explicitly using a mixed-case file, aware of its significance, he should have different options. So a DNASequence<MaskedNucleotideCompound> could be available. This path would also allow for programmatically masking a sequence and using the alignment tools in the same way.
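To make the distinction concrete, here is a minimal, self-contained sketch of the two equality behaviors. It does not use BioJava at all: PlainBase, MaskedBase, MaskingSketch and softMask are hypothetical names invented for illustration, standing in for a case-insensitive NucleotideCompound and the proposed MaskedNucleotideCompound. It only shows how the choice of compound class, rather than the input file, could decide whether case matters.

```java
import java.util.Locale;

/** Hypothetical stand-in for an ordinary compound: case is ignored. */
final class PlainBase {
    private final char symbol;

    PlainBase(char symbol) { this.symbol = symbol; }

    @Override
    public boolean equals(Object o) {
        // Exact class check, so a PlainBase never equals a MaskedBase.
        if (o == null || o.getClass() != PlainBase.class) return false;
        return Character.toUpperCase(symbol)
                == Character.toUpperCase(((PlainBase) o).symbol);
    }

    @Override
    public int hashCode() {
        // Must agree with equals(): 'a' and 'A' hash identically.
        return Character.toUpperCase(symbol);
    }
}

/** Hypothetical stand-in for a masked compound: case is significant. */
final class MaskedBase {
    private final char symbol;

    MaskedBase(char symbol) { this.symbol = symbol; }

    /** Lower case marks a soft-masked position. */
    boolean isMasked() { return Character.isLowerCase(symbol); }

    @Override
    public boolean equals(Object o) {
        if (o == null || o.getClass() != MaskedBase.class) return false;
        return symbol == ((MaskedBase) o).symbol;
    }

    @Override
    public int hashCode() { return symbol; }
}

final class MaskingSketch {
    /** Soft-masks the half-open range [from, to) by lower-casing it. */
    static String softMask(String seq, int from, int to) {
        return seq.substring(0, from)
                + seq.substring(from, to).toLowerCase(Locale.ROOT)
                + seq.substring(to);
    }

    public static void main(String[] args) {
        System.out.println(new PlainBase('a').equals(new PlainBase('A')));   // true
        System.out.println(new MaskedBase('a').equals(new MaskedBase('A'))); // false
        System.out.println(softMask("ACGTACGT", 2, 5));                      // ACgtaCGT
    }
}
```

Note that any case-insensitive equals() also needs a matching hashCode(), otherwise hash-based lookups keyed on compounds could still miss lower-case entries.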
On Nov 20, 2013, at 8:57 AM, Michael Heuer <heuermh at gmail.com> wrote:
> Sorry, I may not be keeping up with you both here, but the code in
> question is in the alignment package, and if the substitution matrices
> are all upper case they won't match lower case soft masked sequence;
> wouldn't that be the intent? (A feature not a bug)
>
> michael
>
> On Wed, Nov 20, 2013 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>> The problem is that the substitution matrices are all upper case. We can
>> probably fix this by making the NucleotideCompound.equals method case
>> insensitive...
>>
>> Does anybody see an issue with that?
>>
>> A
>>
>>
>> On Wed, Nov 20, 2013 at 8:22 AM, Michael Heuer <heuermh at gmail.com> wrote:
>>>
>>> Hello Andreas, David
>>>
>>> Lower case is the convention for soft-masking sequences from alignment
>>>
>>> http://www.ncbi.nlm.nih.gov/books/NBK1763/
>>>
>>> http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Create_a_masked_BLAST
>>>
>>> If we are using this convention, perhaps it should be more clearly
>>> documented. What happens if you use mixed case?
>>>
>>> michael
>>>
>>>
>>> On Wed, Nov 20, 2013 at 5:29 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> Hi David,
>>>>
>>>> not sure if we should consider this a bug or a feature: It should be easy
>>>> to work around this by calling toUpperCase() on your strings. We could of
>>>> course internally convert all nucleotides to upper case, but that would
>>>> remove the possibility for people to use mixed upper case and lower case
>>>> sequences to represent e.g. alignment conservation.
>>>>
>>>> Any opinions by other people on this? Is anybody using mixed case
>>>> sequences?
>>>>
>>>> Andreas
>>>>
>>>>
>>>> On Mon, Nov 18, 2013 at 11:43 AM, Waring, David A <dwaring at fhcrc.org>
>>>> wrote:
>>>>
>>>>>
>>>>> There seems to be a bug in the alignment package. If DNA sequences are
>>>>> created using lower case letters, the alignment methods don't work. Looks
>>>>> like the default substitution matrix is coded in upper case, and the
>>>>> underlying case of the DNA sequence is being used in the alignment. Seems
>>>>> like a bug to me.
>>>>>
>>>>> This problem occurs when the DNA sequence is created either using the
>>>>> DNASequence constructor or by reading from a FASTA file that is in lower
>>>>> case.
>>>>>
>>>>>
>>>>> The code below shows the problem.
>>>>>
>>>>>
>>>>> static SimpleGapPenalty gapP;
>>>>> static SubstitutionMatrix<NucleotideCompound> matrix;
>>>>>
>>>>> public static void main(String[] args) throws Exception {
>>>>>     matrix = SubstitutionMatrixHelper.getNuc4_4();
>>>>>     gapP = new SimpleGapPenalty();
>>>>>     gapP.setOpenPenalty((short) 5);
>>>>>     gapP.setExtensionPenalty((short) 2);
>>>>>     testHardcoded();
>>>>> }
>>>>>
>>>>> public static void testHardcoded() throws Exception {
>>>>>     // seq1 and seq2 are upper case; seq3 and seq4 are the same in lower case
>>>>>     Sequence<NucleotideCompound> seq1 = new DNASequence("AGGGCTTTACCCCGGTTAA");
>>>>>     Sequence<NucleotideCompound> seq2 = new DNASequence("ACCCCGGTTTAATATTTTT");
>>>>>     Sequence<NucleotideCompound> seq3 = new DNASequence("agggctttaccccggttaa");
>>>>>     Sequence<NucleotideCompound> seq4 = new DNASequence("accccggtttaatattttt");
>>>>>     alignPair(seq1, seq2); // upper vs. upper
>>>>>     alignPair(seq1, seq4); // upper vs. lower
>>>>>     alignPair(seq3, seq4); // lower vs. lower
>>>>> }
>>>>>
>>>>> public static void alignPair(Sequence<NucleotideCompound> seq1,
>>>>>         Sequence<NucleotideCompound> seq2) {
>>>>>     SequencePair<Sequence<NucleotideCompound>, NucleotideCompound> pair =
>>>>>             Alignments.getPairwiseAlignment(seq1, seq2,
>>>>>                     Alignments.PairwiseSequenceAlignerType.GLOBAL,
>>>>>                     gapP, matrix);
>>>>>
>>>>>     System.out.printf("%s", pair);
>>>>>     System.out.println();
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> biojava-dev mailing list
>>>>> biojava-dev at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>