[Biojava-l] comparison of the pairwise aligner to emboss' needle

Andreas Prlic andreas at sdsc.edu
Wed Apr 20 17:33:03 UTC 2011


Hi Wim,

are you sure you are using the correct sequences in your test? When I
run the code at the bottom of this emails I am getting 95 and 97%
sequence ID, which is similar to what you are expecting.

Andreas

Here my code: (using latest code from SVN)

package demo;

import org.biojava3.alignment.Alignments;
import org.biojava3.alignment.Alignments.PairwiseSequenceAlignerType;
import org.biojava3.alignment.SimpleGapPenalty;
import org.biojava3.alignment.SubstitutionMatrixHelper;
import org.biojava3.alignment.template.GapPenalty;
import org.biojava3.alignment.template.PairwiseSequenceAligner;
import org.biojava3.alignment.template.SequencePair;
import org.biojava3.core.sequence.DNASequence;
import org.biojava3.core.sequence.compound.AmbiguityDNACompoundSet;
import org.biojava3.core.sequence.compound.NucleotideCompound;

public class TestDNANeedlemanWunsch {
	public static void main(String[] args){

		String query =
"AGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGTGAAGCCCAGCTTGCTGGGTGGATCA"
+
		"GTGGCGAACGGGTGAGTAACACGTGAGCAACCTGCCCCTGACTCTGGGATAAGCGCTGGAAACGGTGTCT" +
		"AATACTGGATATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCG" +
		"GCCTATCAGCTTGTTGGTGAGGTAATGGCTCACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGACC" +
		"GGCCACACTGGGACTGAGACACGGCCCAGACTCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGC" +
		"GGAAGCCTGATGCAGCAACGCCGCGTGAGGGACGACGGCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAG" +
		"AAGCGAGAGTGACGGTACCTGCAGAAAAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAG" +
		"GGCGCAAGCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAA" +
		"CCCGAGGCTCAACCTNNGGGCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCC" +
		"TGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAAC" +
		"TGACGCTGAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTT" +
		"GGGAACTAGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCACGTAACGCATTAAGTTCCCCGCCTGGGG" +
		"AGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTA" +
		"AATCGATGCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCT" +
		"TTGGACACTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCG" +
		"CAACGAGCGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTC" +
		"AACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACA" +
		"ATGGCCGGTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATT" +
		"GAGGTCTGCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATA" +
		"CGTTCCCGGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCTAA" +
		"CCCTTGTGGAGGGAGCCGGTAATTAAA";

		String target =
"CTGGCCGCCTGCTTAACACATCCAAGTCGAACGGTGAAGCCCCANCTTACTGGGTGGATCAGTGCCGAAC"
+
		"GGGTGAGTAACACGTGAGCAACCTCCCCCTGACTCTGGGATAAGCGCTGGAANCGGTGTCTAATACTGGA" +
		"TATGAGCTACCACCGCATGGTGAGTGGTTGGAAAGATTTTTCGGTTGGGGATGGGCTCGCGCCCTATGAG" +
		"CTTGTTGGTGAGGTAATGGCTCACCAAGCCGTCGACGGGTAGCCGGCCTGAGAGGGTGACCGNCCACACT" +
		"GGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGGAAGCCT" +
		"GATTCANCAACCCCGCGTGAGGGACGACGGCCTTCGGGTTGTAAACCTCTTTTAGCAGGGAAGAAGCGAG" +
		"AGTGACGGTACCTGCAGAAAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAA" +
		"GCGTTATCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGCTGTGAAAACCCGAGG" +
		"CTCAACCTCGGGCCTGCAGTGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGATTGGAATTCCTGGTGTA" +
		"GCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGATCTCTGGGCCGTAACTGACGCT" +
		"GAGGAGCGAAAGGGTGGGGAGCAAACAGGCTTAGATACCCTGGTAGTCCACCCCGTAAACGTTGGGAACT" +
		"AGTTGTGGGGTCCTTTCCACGGATTCCGTGACGCAGCTAACGCATTAAGTTCCCCGCCTGGGGAGTACGG" +
		"CCGCAAGGCTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGCGGAGCATGCGGATTAATTCGAT" +
		"GCAACGCGAAGAACCTTACCAAGGCTTGACATACACGAGAACGCTGCAGAAATGTAGAACTCTTTGGACA" +
		"CTCGTGAACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAG" +
		"CGCAACCCTCGTTCTATGTTGCCAGCACGTAATGGTGGGAACTCATGGGATACTGCCGGGGTCAACTCGG" +
		"AGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCTTGGGCTTCACGCATGCTACAATGGCCG" +
		"GTACAAAGGGCTGCAATACCGTGAGGTGGAGCGAATCCCAAAAAGCCGGTCCCAGTTCGGATTGAGGTCT" +
		"GCAACTCGACCTCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCC" +
		"GGGTCTTGTACACACCGCCCGTCAAGTCATGAAAGTCGGTAACACCTGAAGCCGGTGGCCCAACCCTTGT" +
		"GGAGGGAGCCGTCGAAGGTGGGATCGGTAATTAGGACTAAGTCGTAACAAGGTAGCCGTACC";

		GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
		PairwiseSequenceAligner<DNASequence, NucleotideCompound> aligner =
Alignments.getPairwiseAligner(
				new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
				new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
				PairwiseSequenceAlignerType.GLOBAL,
				penalty, SubstitutionMatrixHelper.getNuc4_4());
		SequencePair<DNASequence, NucleotideCompound>
		alignment = aligner.getPair();
		
		System.out.println(alignment);
		
		int identical = alignment.getNumIdenticals();
		System.out.println("Number of identical residues: " + identical);
		System.out.println("% identical query: " + identical / (float)
query.length() );
		System.out.println("% identical query: " + identical / (float)
target.length() );
	}
}






On Mon, Apr 18, 2011 at 8:22 AM, Wim De Smet <Wim.DeSmet at ugent.be> wrote:
> Hi all,
>
> I've been trying to generate some global alignments with biojava and
> comparing them with what needle returns. Doing this, I can't seem to
> reproduce needle's alignment with biojava. The score returned from biojava
> seems to be worse than that from needle, so I'm not sure what's happening
> here.
>
> The sequences are AB004720 and Y17238 (I didn't attach a fasta file to avoid
> spamming people, let me know if you want one). I align them with:
> GapPenalty penalty = new SimpleGapPenalty((short)-14, (short)-4);
> PairwiseSequenceAligner<DNASequence, NucleotideCompound> aligner =
> Alignments.getPairwiseAligner(
> new DNASequence(query, AmbiguityDNACompoundSet.getDNACompoundSet()),
> new DNASequence(target, AmbiguityDNACompoundSet.getDNACompoundSet()),
> PairwiseSequenceAlignerType.GLOBAL,
> penalty, SubstitutionMatrixHelper.getNuc4_4());
> SequencePair<DNASequence, NucleotideCompound>
> alignment = aligner.getPair();
>
> This gives me an alignment with only 23% similarity and a gap at the end.
> Varying the gap penalties can give me a gap in front too, but that's about
> it. When aligning in needle, I get a sequence with a higher score (6784 vs
> (-)5862) and 94% similarity (which seems closer to home). Needle I just run
> with defaults (so it uses EDNAFULL) and a go/ge of 14/4.
>
> Could this be a bug or am I misunderstanding some of the options?
>
> BTW, if I use a really large gapextend, say -4000, I also get a nullpointer
> exception.
>
> TIA,
> Wim De Smet
>
> --
> Wim De Smet
> http://www.straininfo.net/
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
>



-- 
-----------------------------------------------------------------------
Dr. Andreas Prlic
Senior Scientist, RCSB PDB Protein Data Bank
University of California, San Diego
(+1) 858.246.0526
-----------------------------------------------------------------------




More information about the Biojava-l mailing list