[Biojava-l] comparison of the pairwise aligner to emboss' needle
Wim De Smet
Wim.DeSmet at UGent.be
Thu Apr 21 07:54:56 UTC 2011
Hi Andreas
Thanks for having a look. I found the issue: if your sequence is lower
case, the alignment is different. Try aligning with query.toLowerCase()
and target.toLowerCase(), then the output is:
Number of identical residues: 334
% identical query: 0.23405746
% identical query: 0.22845417
vs upper case
Number of identical residues: 1394
% identical query: 0.9768746
% identical query: 0.95348835
So I just inserted a toUpperCase() and it works now.
Regards
Wim
On 20-04-11 19:33, Andreas Prlic wrote:
> Hi Wim,
>
> are you sure you are using the correct sequences in your test? When I
> run the code at the bottom of this emails I am getting 95 and 97%
> sequence ID, which is similar to what you are expecting.
>
> Andreas
>
> Here my code: (using latest code from SVN)
>
> package demo;
>
> import org.biojava3.alignment.Alignments;
> import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType;
> import org.biojava3.alignment.SimpleGapPenalty;
> import org.biojava3.alignment.SubstitutionMatrixHelper;
> import org.biojava3.alignment.template.GapPenalty;
> import org.biojava3.alignment.template.PairwiseSequenceAligner;
> import org.biojava3.alignment.template.SequencePair;
> import org.biojava3.core.sequence.DNASequence;
> import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet;
> import org.biojava3.core.sequence.compound.NucleotideCompound;
>
> public class TestDNANeedlemanWunsch {
> public static void main(String[] args){
>
> String query =
> "AGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGTGAAGCCCAGCTTGCTGGGTGGATCA"
> +
> "GTGGCGAACGGGTGAGTAACACGTGAGCAACCTGCCCCTGACTCTGGGATAAGCGCTGGAAACGGTGTCT" +
> "AATACTGGATATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCG" +
> "GCCTATCAGCTTGTTGGTGAGGTAATGGCTCACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGACC" +
> "GGCCACACTGGGACTGAGACACGGCCCAGACTCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGC" +
> "GGAAGCCTGATGCAGCAACGCCGCGTGAGGGACGACGGCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAG" +
> "AAGCGAGAGTGACGGTACCTGCAGAAAAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAG" +
> "GGCGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAA" +
> "CCCGAGGCTCAACCTNNGGGCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCC" +
> "TGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAAC" +
> "TGACGCTGAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTT" +
> "GGGAACTAGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCACGTAACGCATTAAGTTCCCCGCCTGGGG" +
> "AGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTA" +
> "AATCGATGCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCT" +
> "TTGGACACTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCG" +
> "CAACGAGCGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTC" +
> "AACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACA" +
> "ATGGCCGGTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATT" +
> "GAGGTCTGCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATA" +
> "CGTTCCCGGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCTAA" +
> "CCCTTGTGGAGGGAGCCGGTAATTAAA";
>
> String target =
> "CTGGCCGCCTGCTTAACACATCCAAGTCGAACGGTGAAGCCCCANCTTACTGGGTGGATCAGTGCCGAAC"
> +
> "GGGTGAGTAACACGTGAGCAACCTCCCCCTGACTCTGGGATAAGCGCTGGAANCGGTGTCTAATACTGGA" +
> "TATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCGCCCTATGAG" +
> "CTTGTTGGTGAGGTAATGGCTCACCAAGCCGTCGACGGGTAGCCGGCCTGAGAGGGTGACCGNCCACACT" +
> "GGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGGAAGCCT" +
> "GATTCANCAACCCCGCGTGAGGGACGACGGCCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAGAAGCGAG" +
> "AGTGACGGTACCTGCAGAAAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAA" +
> "GCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAACCCGAGG" +
> "CTCAACCTCGGGCCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCCTGGTGTA" +
> "GCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAACTGACGCT" +
> "GAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTTGGGAACT" +
> "AGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCAGCTAACGCATTAAGTTCCCCGCCTGGGGAGTACGG" +
> "CCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTAATTCGAT" +
> "GCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCTTTGGACA" +
> "CTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAG" +
> "CGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTCAACTCGG" +
> "AGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACAATGGCCG" +
> "GTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATTGAGGTCT" +
> "GCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCC" +
> "GGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCCAACCCTTGT" +
> "GGAGGGAGCCGTCGAAGGTGGGATCGGTAATTAGGACTAAGTCGTAACAAGGTAGCCGTACC";
>
> GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
> PairwiseSequenceAligner<DNASequence, NucleotideCompound> aligner =
> Alignments.getPairwiseAligner(
> new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
> new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
> PairwiseSequenceAlignerType.GLOBAL,
> penalty, SubstitutionMatrixHelper.getNuc4_4());
> SequencePair<DNASequence, NucleotideCompound>
> alignment = aligner.getPair();
>
> System.out.println(alignment);
>
> int identical = alignment.getNumIdenticals();
> System.out.println("Number of identical residues: " + identical);
> System.out.println("% identical query: " + identical / (float)
> query.length() );
> System.out.println("% identical query: " + identical / (float)
> target.length() );
> }
> }
>
>
>
>
>
>
> On Mon, Apr 18, 2011 at 8:22 AM, Wim De Smet<Wim.DeSmet at ugent.be> wrote:
>> Hi all,
>>
>> I've been trying to generate some global alignments with biojava and
>> comparing them with what needle returns. Doing this, I can't seem to
>> reproduce needle's alignment with biojava. The score returned from biojava
>> seems to be worse than that from needle, so I'm not sure what's happening
>> here.
>>
>> The sequences are AB004720 and Y17238 (I didn't attach a fasta file to avoid
>> spamming people, let me know if you want one). I align them with:
>> GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
>> PairwiseSequenceAligner<DNASequence, NucleotideCompound> aligner =
>> Alignments.getPairwiseAligner(
>> new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
>> new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
>> PairwiseSequenceAlignerType.GLOBAL,
>> penalty, SubstitutionMatrixHelper.getNuc4_4());
>> SequencePair<DNASequence, NucleotideCompound>
>> alignment = aligner.getPair();
>>
>> This gives me an alignment with only 23% similarity and a gap at the end.
>> Varying the gap penalties can give me a gap in front too, but that's about
>> it. When aligning in needle, I get a sequence with a higher score (6784 vs
>> (-)5862) and 94% similarity (which seems closer to home). Needle I just run
>> with defaults (so it uses EDNAFULL) and a go/ge of 14/4.
>>
>> Could this be a bug or am I misunderstanding some of the options?
>>
>> BTW, if I use a really large gapextend, say -4000, I also get a nullpointer
>> exception.
>>
>> TIA,
>> Wim De Smet
>>
>> --
>> Wim De Smet
>> http://www.straininfo.net/
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
>
--
Wim De Smet
http://www.straininfo.net/
More information about the Biojava-l
mailing list