[Biojava-dev] More about SNPs in BioJava 2

Wed May 19 08:23:33 EDT 2004

Hello again,

I've been thinking about the SNP stuff. To work with SNPs, it would be
useful to have a good API for alignments first. From the alignments we
can find segregating (polymorphic) sites and calculate some biological
statistics from them, to name a few:

- Polymorphism measures:
    Number of segregating sites (S)    Nei 1987
    Number of segregating sites per nucleotide
    Theta (θ) per site (estimated from S)
    Minimum number of mutations (η)    Tajima 1996
    Minimum number of mutations (η) per site
    Theta (θ) per site (estimated from η)
    Theta (θ) per DNA sequence (estimated from S)    Tajima 1993
    Variance of θ per DNA sequence (estimated from S) - without
      recombination
    Standard deviation of θ per DNA sequence (estimated from S) -
      without recombination
    Variance of θ per DNA sequence (estimated from S) - free
      recombination

    [...]

- Synonymous/non-synonymous:
    Number of synonymous differences    Nei and Gojobori 1986
    Number of non-synonymous differences
    Number of synonymous positions
    Number of non-synonymous positions
    Number of synonymous polymorphisms per synonymous site
    Number of non-synonymous polymorphisms per non-synonymous site
    Number of synonymous polymorphisms per synonymous site - with
      Jukes & Cantor correction
    Number of non-synonymous polymorphisms per non-synonymous site -
      with Jukes & Cantor correction
    Variance of Ks (Jukes & Cantor)
    Variance of Ka (Jukes & Cantor)

    [...]
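
To make the first few polymorphism measures concrete, here is a minimal
sketch in plain Java. It is not BioJava code: the class and method names
are made up for illustration, the alignment is just an array of
equal-length strings, and I am assuming that "theta estimated from S"
means Watterson's estimator (theta = S / a_n, where a_n is the sum of
1/i for i = 1..n-1 and n is the number of sequences). Skipping columns
that contain gaps or ambiguity codes is also an assumption.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not a BioJava class): polymorphism measures on aligned DNA strings. */
final class PolymorphismSketch {

    /**
     * Number of segregating sites (S): columns with more than one distinct base.
     * Assumptions: all sequences have the same length, and columns holding
     * anything other than A, C, G or T are skipped.
     */
    static int segregatingSites(String[] alignment) {
        int length = alignment[0].length();
        int s = 0;
        for (int col = 0; col < length; col++) {
            Set<Character> bases = new HashSet<>();
            boolean clean = true;
            for (String seq : alignment) {
                char c = Character.toUpperCase(seq.charAt(col));
                if ("ACGT".indexOf(c) < 0) {
                    clean = false;
                    break;
                }
                bases.add(c);
            }
            if (clean && bases.size() > 1) {
                s++;
            }
        }
        return s;
    }

    /** Watterson's estimator per sequence: theta = S / a_n, a_n = sum of 1/i for i = 1..n-1. */
    static double thetaPerSequence(int s, int sampleSize) {
        if (sampleSize < 2) {
            throw new IllegalArgumentException("need at least two sequences");
        }
        double a = 0.0;
        for (int i = 1; i < sampleSize; i++) {
            a += 1.0 / i;
        }
        return s / a;
    }

    /** Theta per site: the per-sequence value divided by the number of sites analysed. */
    static double thetaPerSite(int s, int sampleSize, int sites) {
        return thetaPerSequence(s, sampleSize) / sites;
    }

    public static void main(String[] args) {
        String[] alignment = { "ACGTA", "ACGTT", "ACCTA" };
        int s = segregatingSites(alignment);
        System.out.println("S = " + s);
        System.out.println("theta per sequence = " + thetaPerSequence(s, alignment.length));
    }
}
```

The variances in the list need one more harmonic-type sum (over 1/i^2),
so they would fit naturally next to thetaPerSequence.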

We can also do pairwise comparisons between polymorphic sites and
calculate linkage disequilibrium (more equations here, like D'
(Lewontin), R and R^2 (Hill and Robertson), the distance between the
two sites being compared, and so on...).
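
As an illustration of the pairwise statistics, the sketch below computes
D, D', R and R^2 for two biallelic sites from the four haplotype counts.
The class name and the count-based constructor are hypothetical. The
definitions used are the usual ones: D = p(AB) - p(A)p(B), D' = D / Dmax,
and R = D / sqrt(p(A)(1 - p(A))p(B)(1 - p(B))). Keeping the sign of D in
D' is a choice made here; many programs report |D'| instead.

```java
/** Hypothetical sketch (not a BioJava class): linkage disequilibrium between two biallelic sites. */
final class LinkageDisequilibriumSketch {
    final double d;        // D = p(AB) - p(A) * p(B)
    final double dPrime;   // D scaled by its maximum possible magnitude
    final double r;        // correlation coefficient between the two sites
    final double rSquared; // r * r

    /** Counts of the four haplotypes: AB, Ab, aB and ab. */
    LinkageDisequilibriumSketch(int nAB, int nAb, int naB, int nab) {
        double n = nAB + nAb + naB + nab;
        double pAB = nAB / n;
        double pA = (nAB + nAb) / n;
        double pB = (nAB + naB) / n;

        double denom = pA * (1 - pA) * pB * (1 - pB);
        if (!(denom > 0)) {
            // Also catches n == 0, where the frequencies are NaN.
            throw new IllegalArgumentException("both sites must be polymorphic");
        }

        d = pAB - pA * pB;
        // The bound on |D| depends on the sign of D.
        double dMax = d >= 0
                ? Math.min(pA * (1 - pB), (1 - pA) * pB)
                : Math.min(pA * pB, (1 - pA) * (1 - pB));
        dPrime = d / dMax;
        r = d / Math.sqrt(denom);
        rSquared = r * r;
    }

    public static void main(String[] args) {
        LinkageDisequilibriumSketch ld = new LinkageDisequilibriumSketch(40, 10, 10, 40);
        System.out.println("D = " + ld.d + ", D' = " + ld.dPrime + ", R^2 = " + ld.rSquared);
    }
}
```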

All (or most) of these equations are applied only in this field of
biology, and you cannot find them in libraries such as Colt or Apache
Commons Math... Of course, you can calculate some other basic
probabilities, like the chi-square and Fisher distributions, with other
packages. Moreover, these equations are pretty simple and often use the
same variables. I guess that the best strategy here would be to use
some objects. For example, we could have:

PolymorphismCalculator polyCalc = new PolymorphismCalculator(alignment);
int segregatingSites = polyCalc.calculateSegregatingSites();
....

.... or something like this. The biological functions would live in
Strategy objects for the different models and biostatistics, but I am
sure that better ideas will come up :)
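
A rough sketch of what "Strategy objects for different models" could
look like, in plain Java. All names here (DistanceModel,
UncorrectedDistance, JukesCantorDistance, DivergenceCalculator) are
invented for the example and sequences are plain strings rather than
BioJava objects. The Jukes and Cantor model uses the standard correction
d = -(3/4) ln(1 - (4/3) p), where p is the proportion of differing
sites; the same formula is what corrects the per-site synonymous and
non-synonymous values in the list above.

```java
/** Hypothetical strategy: turns an observed proportion of differences into a distance. */
interface DistanceModel {
    double distance(double proportionDifferent);
}

/** No correction: the distance is the observed proportion itself. */
final class UncorrectedDistance implements DistanceModel {
    public double distance(double p) {
        return p;
    }
}

/** Jukes and Cantor correction for multiple hits: d = -(3/4) ln(1 - (4/3) p). */
final class JukesCantorDistance implements DistanceModel {
    public double distance(double p) {
        if (p < 0 || p >= 0.75) {
            throw new IllegalArgumentException("correction is undefined for p = " + p);
        }
        return -0.75 * Math.log(1.0 - (4.0 / 3.0) * p);
    }
}

/** The calculator owns the counting; the model is swapped in from outside. */
final class DivergenceCalculator {
    private final DistanceModel model;

    DivergenceCalculator(DistanceModel model) {
        this.model = model;
    }

    /** Proportion of positions at which two aligned sequences differ. */
    static double proportionDifferent(String a, String b) {
        if (a.length() != b.length() || a.isEmpty()) {
            throw new IllegalArgumentException("sequences must be aligned and non-empty");
        }
        int diffs = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diffs++;
            }
        }
        return (double) diffs / a.length();
    }

    double divergence(String a, String b) {
        return model.distance(proportionDifferent(a, b));
    }

    public static void main(String[] args) {
        String a = "ACGTACGTAC";
        String b = "ACGTACGTTT";
        System.out.println(new DivergenceCalculator(new UncorrectedDistance()).divergence(a, b));
        System.out.println(new DivergenceCalculator(new JukesCantorDistance()).divergence(a, b));
    }
}
```

The point of the split is that adding another model only means adding
another DistanceModel implementation; the counting code stays untouched.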

We could also get the SNP objects from an alignment using some
utilities...

SNP[] snps = SNPTools.retrieveFromAlignment(alignment);

A SNP would be a Symbol composed of AtomicSymbols, plus some
annotations...

Well, alignments are not the only source of SNPs. For example, we can
get SNPs from the NCBI in a few formats; I think this kind of FASTA is
the most used:

>gnl|dbSNP|ss12724475|allelePos=632|totalLen=659|PIGGENOME|P1|taxid=9823|mol=cDNA|snpclass=1|alleles='T/A'
CATCATTGAGCTACTTGCCCTTCGGAGCAGGACCCCGCTCTTGCGTAGGG
GAGATGCTAGCCCGCCAGGAGCTCTTCCTCTTCACGGCTGGATTGCTGCA
GAGGTTCGACCTGGAGCTCCCAGATGATGGGCAGCTACCCTGTCTCGTGG
GCAACCCCAGTTTGGTCCTGCAGATAGATCCTTTCAAAGTGAAGATCAAG
GAGCGCCAGGCCTGGAAGGAAGCCCACACTG
W
AGGGGAGTACCTCCTGACTCCACCCTG

The SNP here is the ambiguity symbol W (A or T), which matches
alleles='T/A' in the header.
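
Reading the header of such a record is mostly a matter of splitting on
'|' and keeping the key=value tokens. The sketch below does just that
for the header shown above; it is only a guess at a parser based on this
one example, not a description of the full dbSNP format, and the class
and method names are hypothetical. The second method maps a two-allele
string such as T/A to its IUPAC ambiguity code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not a BioJava class): reads key=value fields from a dbSNP-style FASTA header. */
final class DbSnpHeaderSketch {

    /** Returns the key=value tokens of the header in order; tokens without '=' are ignored. */
    static Map<String, String> parseHeader(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        String line = header.startsWith(">") ? header.substring(1) : header;
        for (String token : line.split("\\|")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                String value = token.substring(eq + 1);
                // Strip surrounding single quotes, as in alleles='T/A'.
                if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
                    value = value.substring(1, value.length() - 1);
                }
                fields.put(token.substring(0, eq), value);
            }
        }
        return fields;
    }

    /** IUPAC ambiguity code for a two-allele string such as "T/A". */
    static char ambiguityCode(String alleles) {
        String[] parts = alleles.toUpperCase().split("/");
        if (parts.length != 2 || parts[0].length() != 1 || parts[1].length() != 1) {
            throw new IllegalArgumentException("expected two single-base alleles: " + alleles);
        }
        char[] pair = { parts[0].charAt(0), parts[1].charAt(0) };
        Arrays.sort(pair);
        switch (new String(pair)) {
            case "AC": return 'M';
            case "AG": return 'R';
            case "AT": return 'W';
            case "CG": return 'S';
            case "CT": return 'Y';
            case "GT": return 'K';
            default: throw new IllegalArgumentException("unsupported alleles: " + alleles);
        }
    }

    public static void main(String[] args) {
        String header = ">gnl|dbSNP|ss12724475|allelePos=632|totalLen=659|PIGGENOME|P1"
                + "|taxid=9823|mol=cDNA|snpclass=1|alleles='T/A'";
        Map<String, String> fields = parseHeader(header);
        System.out.println(fields);
        System.out.println(ambiguityCode(fields.get("alleles")));
    }
}
```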

Well, this is an overall idea about SNPs... more focused on SNP
calculations. It would be useful to have more opinions about this and
other aspects of SNPs...

Bruno