[Biojava-dev] More about SNPs in BioJava 2

Jason Stajich jason at cgt.duhs.duke.edu
Wed May 19 21:29:15 EDT 2004

I've done quite a few of these in bioperl.

Many of the distance methods came from the EMBOSS or PHYLIP code.

You are more than welcome to steal liberally from the Bio::PopGen code in
bioperl (use the live CVS code rather than the release code, as it
includes some bug fixes).

On Wed, 19 May 2004, Bruno Aranda - Dev wrote:

> Hello again,
> I've been thinking about the SNP stuff. In order to work with SNPs it
> would be useful to first have a good API for alignments. From the
> alignments we can find segregating sites (polymorphic sites) and
> calculate some biological statistics from them, to name a few:
> - Polymorphism measures:
>     Number of segregating sites (S) (Nei 1987)
>     Number of segregating sites per nucleotide
>     Theta θ per site (estimated from S)
>     Minimum number of mutations (η) (Tajima 1996)
>     Minimum number of mutations η per site
>     Theta θ per site (estimated from η)
>     Theta θ per DNA sequence (estimated from S) (Tajima 1993)
>     Variance of θ per DNA sequence (estimated from S) - without recombination
>     Standard deviation of θ per DNA sequence (estimated from S) - without recombination
>     Variance of θ per DNA sequence (estimated from S) - free recombination
>     [...]
> - Synonymous/non-synonymous:
>     Number of synonymous differences (Nei and Gojobori 1986)
>     Number of non-synonymous differences
>     Number of synonymous positions
>     Number of non-synonymous positions
>     Number of synonymous polymorphisms per synonymous site
>     Number of non-synonymous polymorphisms per non-synonymous site
>     Number of synonymous polymorphisms per synonymous site - with Jukes & Cantor correction
>     Number of non-synonymous polymorphisms per non-synonymous site - with Jukes & Cantor correction
>     Variance of Ks (Jukes & Cantor)
>     Variance of Ka (Jukes & Cantor)
>     [...]
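
To make a few of the measures listed above concrete, here is a minimal sketch in Java. It assumes the alignment is simply an array of equal-length strings (not the BioJava Alignment interface), treats every character, gaps included, as a symbol, and uses Watterson's estimator theta = S / a1 for "theta per DNA sequence (estimated from S)". The class and method names are hypothetical and do not exist in BioJava.

```java
// Hypothetical helper for illustration only; none of these names exist in BioJava.
final class PolymorphismSketch {

    /** Counts alignment columns that contain more than one distinct symbol (S). */
    static int segregatingSites(String[] alignment) {
        if (alignment.length == 0) {
            return 0;
        }
        // Assumption: all rows have the same length as the first row.
        int columns = alignment[0].length();
        int s = 0;
        for (int col = 0; col < columns; col++) {
            char first = alignment[0].charAt(col);
            for (int row = 1; row < alignment.length; row++) {
                if (alignment[row].charAt(col) != first) {
                    s++;   // the column is polymorphic, count it once
                    break;
                }
            }
        }
        return s;
    }

    /** Watterson's estimator per sequence: theta = S / a1, a1 = sum of 1/i for i = 1..n-1. */
    static double thetaPerSequence(int s, int sampleSize) {
        if (sampleSize < 2) {
            throw new IllegalArgumentException("need at least two sequences");
        }
        double a1 = 0.0;
        for (int i = 1; i < sampleSize; i++) {
            a1 += 1.0 / i;
        }
        return s / a1;
    }

    /** Theta per site: the per-sequence estimate divided by the number of sites. */
    static double thetaPerSite(int s, int sampleSize, int sites) {
        return thetaPerSequence(s, sampleSize) / sites;
    }

    /** Jukes-Cantor correction of a proportion p of differing sites: -3/4 * ln(1 - 4p/3). */
    static double jukesCantor(double p) {
        if (p < 0.0 || p >= 0.75) {
            throw new IllegalArgumentException("p must be in [0, 0.75)");
        }
        return -0.75 * Math.log(1.0 - 4.0 * p / 3.0);
    }

    public static void main(String[] args) {
        String[] alignment = {"ACGT", "ACGA", "ATGA"};
        int s = segregatingSites(alignment);
        System.out.println("S = " + s);
        System.out.println("theta per sequence = " + thetaPerSequence(s, alignment.length));
        System.out.println("theta per site = " + thetaPerSite(s, alignment.length, 4));
    }
}
```

The synonymous/non-synonymous counts would need a codon table and a path-counting rule (e.g. Nei and Gojobori), so they are left out of this sketch; only the Jukes-Cantor correction step is shown.
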
> We can also do pairwise comparisons between polymorphic sites and
> calculate linkage disequilibrium (more equations here, such as D'
> (Lewontin), R and R^2 (Hill and Robertson), the distance between the
> two sites being compared, and so on...).
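
For the pairwise measures mentioned above, a rough sketch for two biallelic sites could look like the following. It assumes the caller already has the haplotype frequency pAB (allele A at the first site together with allele B at the second) and the allele frequencies pA and pB, and that these are mutually consistent. It returns the absolute value of D'. The class name and method names are hypothetical.

```java
// Hypothetical helper for illustration only; not part of BioJava.
final class LinkageDisequilibriumSketch {

    /** Raw disequilibrium coefficient: D = pAB - pA * pB. */
    static double d(double pAB, double pA, double pB) {
        return pAB - pA * pB;
    }

    /** Lewontin's normalised coefficient |D'| = |D| / Dmax. */
    static double dPrime(double pAB, double pA, double pB) {
        double d = d(pAB, pA, pB);
        if (d == 0.0) {
            return 0.0;
        }
        // Dmax depends on the sign of D.
        double dMax = d > 0
                ? Math.min(pA * (1 - pB), (1 - pA) * pB)
                : Math.min(pA * pB, (1 - pA) * (1 - pB));
        return Math.abs(d) / dMax;
    }

    /** Squared correlation (Hill and Robertson): r^2 = D^2 / (pA (1-pA) pB (1-pB)). */
    static double rSquared(double pAB, double pA, double pB) {
        double denominator = pA * (1 - pA) * pB * (1 - pB);
        if (denominator == 0.0) {
            throw new IllegalArgumentException("both sites must be polymorphic");
        }
        double d = d(pAB, pA, pB);
        return d * d / denominator;
    }

    public static void main(String[] args) {
        System.out.println("D   = " + d(0.3, 0.5, 0.5));
        System.out.println("|D'| = " + dPrime(0.3, 0.5, 0.5));
        System.out.println("r^2 = " + rSquared(0.3, 0.5, 0.5));
    }
}
```
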
> All (or most) of these equations are only used in this field of
> biology, and you cannot find them in libraries such as Colt or Apache
> Commons Math... Of course, you can compute other basic statistics, such
> as the chi-square and Fisher distributions, with other packages.
> Moreover, these equations are pretty simple and often share the same
> variables. I guess the best strategy here would be to use some
> objects. For example, we could have:
> PolymorphismCalculator polyCalc = new PolymorphismCalculator(alignment);
> int segregatingSites = polyCalc.calculateSegregatingSites();
> ....
> .... or something like this. The biological functions would live in
> Strategy objects for the different models and statistics, but I am sure
> that better ideas will come up :)
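
One way the Strategy idea could look, purely as a sketch: each statistic is its own object behind a small interface, and the calculator only holds the alignment and delegates. The interface and class names below are hypothetical, and the alignment is again assumed to be an array of equal-length strings rather than a BioJava Alignment.

```java
// Hypothetical design sketch; none of these types exist in BioJava.

/** Strategy interface: one implementation per statistic or model. */
interface AlignmentStatistic {
    double calculate(String[] alignment);
}

/** Number of columns with more than one distinct symbol (S). */
final class SegregatingSitesStatistic implements AlignmentStatistic {
    @Override
    public double calculate(String[] alignment) {
        if (alignment.length == 0) {
            return 0;
        }
        int s = 0;
        for (int col = 0; col < alignment[0].length(); col++) {
            char first = alignment[0].charAt(col);
            for (int row = 1; row < alignment.length; row++) {
                if (alignment[row].charAt(col) != first) {
                    s++;
                    break;
                }
            }
        }
        return s;
    }
}

/** Wraps another statistic and divides its value by the alignment length. */
final class PerSiteStatistic implements AlignmentStatistic {
    private final AlignmentStatistic delegate;

    PerSiteStatistic(AlignmentStatistic delegate) {
        this.delegate = delegate;
    }

    @Override
    public double calculate(String[] alignment) {
        if (alignment.length == 0 || alignment[0].isEmpty()) {
            throw new IllegalArgumentException("alignment has no sites");
        }
        return delegate.calculate(alignment) / alignment[0].length();
    }
}

/** Holds the alignment once and applies whichever strategy it is given. */
final class PolymorphismCalculatorSketch {
    private final String[] alignment;

    PolymorphismCalculatorSketch(String[] alignment) {
        this.alignment = alignment.clone();
    }

    double calculate(AlignmentStatistic statistic) {
        return statistic.calculate(alignment);
    }
}
```

With this shape, a new measure is added by writing another AlignmentStatistic rather than another method on the calculator, at the cost of losing the convenience of shared intermediate values unless they are cached somewhere.
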
> We could also get the SNP objects from an alignment using some
> utilities...
> SNP[] snps = SNPTools.retrieveFromAlignment(alignment);
> A SNP would be a Symbol composed of AtomicSymbols, plus some
> annotations...
> Well, but alignments are not the only source of SNPs. For example, we
> can get SNPs from NCBI in a few formats; I think this kind of FASTA is
> the most used:
> >gnl|dbSNP|ss12724475|allelePos=632|totalLen=659|PIGGENOME|P1|taxid=9823|mol=cDNA|snpclass=1|alleles='T/A'
> W
> The SNP here is the ambiguous symbol W (A or T).
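
A header like the one quoted above could be picked apart roughly as follows. This is only a sketch that assumes the fields are separated by '|' and that the interesting ones are key=value pairs; it is not based on a formal specification of the dbSNP FASTA format, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of BioJava.
final class DbSnpHeaderSketch {

    /** Collects the key=value fields of a '|'-separated header into a map. */
    static Map<String, String> parse(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        String line = header.startsWith(">") ? header.substring(1) : header;
        for (String token : line.split("\\|")) {
            int eq = token.indexOf('=');
            if (eq <= 0) {
                continue;   // positional fields such as "gnl" or "dbSNP" are skipped
            }
            String value = token.substring(eq + 1);
            // Strip the single quotes around values such as alleles='T/A'.
            if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
                value = value.substring(1, value.length() - 1);
            }
            fields.put(token.substring(0, eq), value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String header = ">gnl|dbSNP|ss12724475|allelePos=632|totalLen=659|PIGGENOME|P1"
                + "|taxid=9823|mol=cDNA|snpclass=1|alleles='T/A'";
        Map<String, String> fields = parse(header);
        System.out.println("position: " + fields.get("allelePos"));
        System.out.println("alleles:  " + fields.get("alleles"));
    }
}
```

The allelePos and alleles fields would then be enough to locate the ambiguous symbol in the sequence and to attach the two alleles to it as an annotation.
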
> Well, this is an overall idea about SNPs... more focused on SNP
> calculations. It would be useful to have more opinions about this and
> other aspects of SNPs...
> Bruno
> _______________________________________________
> biojava-dev mailing list
> biojava-dev at biojava.org
> http://biojava.org/mailman/listinfo/biojava-dev

Jason Stajich
Duke University
jason at cgt.mc.duke.edu
