[Biojava-dev] More about SNPs in BioJava 2
Jason Stajich
jason at cgt.duhs.duke.edu
Wed May 19 21:29:15 EDT 2004
I've done quite a few of these in bioperl.
Many of the distance methods came from the EMBOSS code or phylip code.
You are more than welcome to steal liberally from the Bio::PopGen code in
bioperl (use the live CVS code, not the release, as it has some bug fixes).
-jason
On Wed, 19 May 2004, Bruno Aranda - Dev wrote:
> Hello again,
>
> I've been thinking about the SNP stuff. In order to work with SNPs, it
> would be useful to first have a good API for alignments. From the
> alignments we can find segregating sites (polymorphic sites) and
> calculate some biological statistics from them, to name a few:
>
> - Polymorphism measures:
> Number of segregating sites (S) Nei 1987
> Number of segregating sites per nucleotide
> Theta θ per site (estimated from S)
> Minimum number of mutations (η) Tajima 1996
> Minimum number of mutations η per site
> Theta θ per site (estimated from η)
> Theta θ per DNA sequence (estimated from S) Tajima 1993
> Variance of θ per DNA sequence (estimated from S) - without
> recombination
> Standard deviation of θ per DNA sequence (estimated from S) -
> without recombination
> Variance of θ per DNA sequence (estimated from S) - free
> recombination
>
> [...]
>
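A minimal sketch of the first few measures in the list above, assuming the alignment is just a list of equal-length strings rather than a BioJava alignment object. The class and method names are made up for illustration. Theta is computed as S / a_n with a_n = 1 + 1/2 + ... + 1/(n-1), the estimator usually attributed to Watterson, and columns containing a gap or N are skipped, which is only one of several possible conventions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of BioJava. Rows are equal-length strings.
final class PolymorphismSketch {

    // Counts columns in which more than one distinct nucleotide occurs.
    // Assumption: columns containing '-' or 'N' are ignored entirely.
    static int segregatingSites(List<String> alignment) {
        int length = alignment.get(0).length();
        int count = 0;
        for (int col = 0; col < length; col++) {
            Set<Character> seen = new HashSet<>();
            boolean skip = false;
            for (String row : alignment) {
                char c = Character.toUpperCase(row.charAt(col));
                if (c == '-' || c == 'N') {
                    skip = true;
                    break;
                }
                seen.add(c);
            }
            if (!skip && seen.size() > 1) {
                count++;
            }
        }
        return count;
    }

    // Theta per sequence estimated from S: S / a_n, a_n = sum_{i=1}^{n-1} 1/i,
    // where n is the number of sequences (n must be at least 2).
    static double thetaPerSequence(int s, int n) {
        double a = 0.0;
        for (int i = 1; i < n; i++) {
            a += 1.0 / i;
        }
        return s / a;
    }

    // Theta per site: the per-sequence estimate divided by the number of sites.
    static double thetaPerSite(int s, int n, int sites) {
        return thetaPerSequence(s, n) / sites;
    }
}
```

For the three rows ACGT, ACGA and ATGA, `segregatingSites` finds the second and fourth columns, and `thetaPerSequence(2, 3)` divides 2 by 1 + 1/2.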
> - Synonymous-Nonsynonymous
> Number of Synonymous differences Nei and Gojobori 1986
> Number of Non-synonymous differences
> Number of Synonymous positions
> Number of Non-synonymous positions
> Number of synonymous polymorphisms per synonymous site
> Number of non-synonymous polymorphisms per non-synonymous site
> Number of synonymous polymorphisms per synonymous site - with
> Jukes&Cantor correction
> Number of non-synonymous polymorphisms per non-synonymous site -
> with Jukes&Cantor correction
> Variance of Ks (Jukes&Cantor)
> Variance of Ka (Jukes&Cantor)
>
> [...]
>
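The Jukes&Cantor correction in the list above is short enough to sketch. This assumes p is the proportion of differing sites (for Ks, synonymous differences divided by synonymous positions; for Ka, the non-synonymous equivalents) and uses the standard formula d = -(3/4) ln(1 - 4p/3). The variance shown is the usual large-sample approximation; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of BioJava or bioperl.
final class JukesCantorSketch {

    // Jukes-Cantor corrected distance for a proportion p of differing sites.
    // The formula is undefined for p >= 0.75 (the logarithm's argument hits zero).
    static double correct(double p) {
        if (p < 0.0 || p >= 0.75) {
            throw new IllegalArgumentException("p must be in [0, 0.75): " + p);
        }
        return -0.75 * Math.log(1.0 - 4.0 * p / 3.0);
    }

    // Large-sample variance of the corrected distance:
    // p(1 - p) / (sites * (1 - 4p/3)^2), where sites is the number of
    // positions compared (synonymous or non-synonymous positions).
    static double variance(double p, double sites) {
        if (p < 0.0 || p >= 0.75 || sites <= 0.0) {
            throw new IllegalArgumentException("invalid p or sites");
        }
        double factor = 1.0 - 4.0 * p / 3.0;
        return p * (1.0 - p) / (sites * factor * factor);
    }
}
```

For example, `JukesCantorSketch.correct(sd / s)` would give Ks from `sd` synonymous differences over `s` synonymous positions, both of which have to be counted beforehand (for instance with the Nei and Gojobori method).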
> We can also do pairwise comparisons between polymorphic sites and
> calculate linkage disequilibrium (more equations here, like D'
> (Lewontin), R and R^2 (Hill and Robertson), the distance between the
> two sites being compared, and so on...).
>
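The pairwise measures can be sketched from the four haplotype counts at two biallelic sites. This assumes phased data with alleles A/a at the first site and B/b at the second, and uses the textbook definitions: D = pAB - pA*pB, |D'| = |D| / Dmax, and r = D / sqrt(pA(1-pA)pB(1-pB)). The class name and fields are made up for illustration.

```java
// Hypothetical helper, not part of BioJava. Inputs are haplotype counts.
final class LinkageDisequilibriumSketch {
    final double d;          // D = pAB - pA * pB
    final double absDPrime;  // |D| scaled by its maximum possible value
    final double r;          // correlation between the two sites
    final double rSquared;   // r^2

    LinkageDisequilibriumSketch(int nAB, int nAb, int naB, int nab) {
        double n = nAB + nAb + naB + nab;
        double pAB = nAB / n;
        double pA = (nAB + nAb) / n;
        double pB = (nAB + naB) / n;
        double denominator = pA * (1 - pA) * pB * (1 - pB);
        if (denominator == 0.0) {
            throw new IllegalArgumentException("both sites must be polymorphic");
        }
        d = pAB - pA * pB;
        // Dmax depends on the sign of D.
        double dMax = d >= 0
                ? Math.min(pA * (1 - pB), (1 - pA) * pB)
                : Math.min(pA * pB, (1 - pA) * (1 - pB));
        absDPrime = Math.abs(d) / dMax;
        r = d / Math.sqrt(denominator);
        rSquared = r * r;
    }
}
```

With counts AB=4, Ab=1, aB=1, ab=4, both allele frequencies are 0.5 and the AB haplotype frequency is 0.4, so D is 0.4 - 0.25.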
> All (or most) of these equations are only applied in this field of
> biology, and you cannot find them in libraries such as Colt or Apache
> Commons Math... Of course, you can calculate some other basic
> probabilities, like the chi-square and Fisher distributions, from other
> packages. Moreover, these equations are pretty simple and often use the
> same variables. I guess that the best strategy here would be to use
> some objects. For example, we could have:
>
> PolymorphismCalculator polyCalc = new PolymorphismCalculator(alignment);
> int segregatingSites = polyCalc.calculateSegregatingSites();
> ....
>
> .... or something like this. The biological functions would be in
> Strategy objects for different models and biostatistics, but I am sure
> that better ideas will come up about this :)
>
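One way the Strategy idea could look, as a rough sketch only. The `AlignmentStatistic` interface, the two implementations and the `calculateAll` method are all hypothetical, and the alignment is again assumed to be a list of equal-length strings rather than a real BioJava type.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical strategy interface: one implementation per statistic or model.
interface AlignmentStatistic {
    String name();
    double calculate(List<String> alignment);
}

// Number of segregating sites (S): columns with more than one distinct symbol.
final class SegregatingSites implements AlignmentStatistic {
    public String name() { return "S"; }

    public double calculate(List<String> alignment) {
        int s = 0;
        for (int col = 0; col < alignment.get(0).length(); col++) {
            final int c = col;
            if (alignment.stream().map(row -> row.charAt(c)).distinct().count() > 1) {
                s++;
            }
        }
        return s;
    }
}

// Segregating sites per nucleotide, reusing the strategy above.
final class SegregatingSitesPerSite implements AlignmentStatistic {
    private final AlignmentStatistic segregating = new SegregatingSites();

    public String name() { return "S per site"; }

    public double calculate(List<String> alignment) {
        return segregating.calculate(alignment) / alignment.get(0).length();
    }
}

// The calculator only holds the alignment and delegates to the strategies.
final class PolymorphismCalculator {
    private final List<String> alignment;

    PolymorphismCalculator(List<String> alignment) {
        this.alignment = alignment;
    }

    // Runs each strategy and returns the results keyed by statistic name,
    // in the order the strategies were given.
    Map<String, Double> calculateAll(List<AlignmentStatistic> statistics) {
        Map<String, Double> results = new LinkedHashMap<>();
        for (AlignmentStatistic statistic : statistics) {
            results.put(statistic.name(), statistic.calculate(alignment));
        }
        return results;
    }
}
```

New statistics or models can then be added as further `AlignmentStatistic` implementations without touching the calculator.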
> We could also get the SNP objects from an alignment using some
> utilities...
>
> SNP[] snps = SNPTools.retrieveFromAlignment(alignment);
>
> A SNP would be a Symbol composed of AtomicSymbols, and some
> annotations...
>
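A rough sketch of what such a utility might return, deliberately not using BioJava's Symbol or AtomicSymbol types. Here a SNP is just a 1-based column position, the sorted set of observed alleles, and the matching IUPAC ambiguity code (for example A/T gives W). The `SnpSketch` and `Snp` names are hypothetical, and columns containing anything other than A, C, G or T are skipped as a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for SNPTools.retrieveFromAlignment(alignment).
final class SnpSketch {

    static final class Snp {
        final int position;    // 1-based alignment column
        final String alleles;  // observed alleles, sorted, e.g. "AT"
        final char iupac;      // IUPAC ambiguity code, e.g. 'W'

        Snp(int position, String alleles, char iupac) {
            this.position = position;
            this.alleles = alleles;
            this.iupac = iupac;
        }
    }

    // IUPAC codes for two and three alleles; all four alleles map to 'N'.
    private static final Map<String, Character> IUPAC = Map.of(
            "AG", 'R', "CT", 'Y', "CG", 'S', "AT", 'W', "GT", 'K', "AC", 'M',
            "CGT", 'B', "AGT", 'D', "ACT", 'H', "ACG", 'V');

    static List<Snp> fromAlignment(List<String> alignment) {
        List<Snp> snps = new ArrayList<>();
        for (int col = 0; col < alignment.get(0).length(); col++) {
            TreeSet<Character> alleles = new TreeSet<>();
            boolean unambiguous = true;
            for (String row : alignment) {
                char c = Character.toUpperCase(row.charAt(col));
                if ("ACGT".indexOf(c) < 0) {
                    unambiguous = false;  // gap or ambiguity symbol: skip column
                    break;
                }
                alleles.add(c);
            }
            if (!unambiguous || alleles.size() < 2) {
                continue;
            }
            StringBuilder key = new StringBuilder();
            for (char allele : alleles) {
                key.append(allele);
            }
            String k = key.toString();
            snps.add(new Snp(col + 1, k, IUPAC.getOrDefault(k, 'N')));
        }
        return snps;
    }
}
```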
> Well, alignments are not the only source for SNPs. For example, we
> can get SNPs from the NCBI in a few formats, this kind of FASTA being
> the most used, I think:
>
> >gnl|dbSNP|ss12724475|allelePos=632|totalLen=659|PIGGENOME|P1|taxid=9823|mol=cDNA|snpclass=1|alleles='T/A'
> CATCATTGAGCTACTTGCCCTTCGGAGCAGGACCCCGCTCTTGCGTAGGG
> GAGATGCTAGCCCGCCAGGAGCTCTTCCTCTTCACGGCTGGATTGCTGCA
> GAGGTTCGACCTGGAGCTCCCAGATGATGGGCAGCTACCCTGTCTCGTGG
> GCAACCCCAGTTTGGTCCTGCAGATAGATCCTTTCAAAGTGAAGATCAAG
> GAGCGCCAGGCCTGGAAGGAAGCCCACACTG
> W
> AGGGGAGTACCTCCTGACTCCACCCTG
>
> The SNP here is the W ambiguous symbol.
>
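Reading a record like the one above mostly comes down to splitting the header on `|` and locating the ambiguity symbol in the sequence. The sketch below assumes the header fields are `key=value` pairs as shown, that quotes around values can be dropped, and that the sequence lines have already been joined into one string. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for dbSNP-style FASTA records, not a BioJava class.
final class DbSnpFastaSketch {

    // Splits a header such as ">gnl|dbSNP|ss...|allelePos=632|alleles='T/A'"
    // into its key=value fields. Parts without '=' (gnl, dbSNP, the ss id,
    // and so on) are ignored here; single quotes are stripped from values.
    static Map<String, String> parseHeader(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : header.split("\\|")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                fields.put(part.substring(0, eq),
                           part.substring(eq + 1).replace("'", ""));
            }
        }
        return fields;
    }

    // Returns the 1-based position of the first symbol that is not
    // A, C, G or T (for example the W above), or -1 if there is none.
    static int firstAmbiguousPosition(String sequence) {
        for (int i = 0; i < sequence.length(); i++) {
            char c = Character.toUpperCase(sequence.charAt(i));
            if ("ACGT".indexOf(c) < 0) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

The `alleles` field (`T/A` in the record above) says which nucleotides the ambiguity symbol stands for, and `allelePos` can be compared with the position found in the sequence.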
> Well, this is an overall idea about SNPs... more focused on SNP
> calculations. It would be useful to have more opinions about this and
> other aspects of SNPs...
>
>
> Bruno
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu