[Biojava-dev] More about SNPs in BioJava 2
Bruno Aranda - Dev
bruno_dev at ebiointel.com
Wed May 19 08:23:33 EDT 2004
Hello again,
I've been thinking about the SNP stuff. To work with SNPs, it
would be useful to have a good API for alignments first. From the
alignments we can find segregating (polymorphic) sites and
calculate some biological statistics from them, such as, to name a few:
- Polymorphism measures:
Number of segregating sites (S) Nei 1987
Number of segregating sites per nucleotide
Theta θ per site (estimated from S)
Minimum number of mutations (η) Tajima 1996
Minimum number of mutations η per site
Theta θ per site (estimated from η)
Theta θ per DNA sequence (estimated from S) Tajima 1993
Variance of θ per DNA sequence (estimated from S) - without recombination
Standard deviation of θ per DNA sequence (estimated from S) - without recombination
Variance of θ per DNA sequence (estimated from S) - free recombination
[...]
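As a rough illustration of the first measures in the list above, here is a minimal Java sketch. It assumes the alignment is just an array of equal-length strings with no gaps or ambiguity codes, and that θ is estimated from S with Watterson's estimator (θ = S / a_n, where a_n is the sum of 1/i for i = 1..n-1). The class and method names are made up for the example and are not existing BioJava API.

```java
/** Hypothetical helper, not part of BioJava: plain strings stand in for an alignment. */
final class SegregatingSitesSketch {

    /** Counts alignment columns in which at least two different residues occur (S). */
    static int countSegregatingSites(String[] alignment) {
        if (alignment.length == 0) {
            return 0;
        }
        int length = alignment[0].length();
        int segregating = 0;
        for (int col = 0; col < length; col++) {
            char first = alignment[0].charAt(col);
            for (int row = 1; row < alignment.length; row++) {
                if (alignment[row].charAt(col) != first) {
                    segregating++;
                    break; // one mismatch is enough to call the column polymorphic
                }
            }
        }
        return segregating;
    }

    /** Watterson's estimator of theta per sequence: S / a_n, a_n = sum of 1/i for i = 1..n-1. */
    static double thetaPerSequence(int segregatingSites, int sampleSize) {
        if (sampleSize < 2) {
            throw new IllegalArgumentException("need at least two sequences");
        }
        double an = 0.0;
        for (int i = 1; i < sampleSize; i++) {
            an += 1.0 / i;
        }
        return segregatingSites / an;
    }

    /** Theta per site: theta per sequence divided by the number of sites. */
    static double thetaPerSite(int segregatingSites, int sampleSize, int alignmentLength) {
        return thetaPerSequence(segregatingSites, sampleSize) / alignmentLength;
    }
}
```

For the three sequences ACGT, ACGA and AGGT, the second and fourth columns are polymorphic, so countSegregatingSites returns 2.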
- Synonymous-Nonsynonymous
Number of Synonymous differences Nei and Gojobori 1986
Number of Non-synonymous differences
Number of Synonymous positions
Number of Non-synonymous positions
Number of synonymous polymorphisms per synonymous site
Number of non-synonymous polymorphisms per non-synonymous site
Number of synonymous polymorphisms per synonymous site - with Jukes&Cantor correction
Number of non-synonymous polymorphisms per non-synonymous site - with Jukes&Cantor correction
Variance of Ks (Jukes&Cantor)
Variance of Ka (Jukes&Cantor)
[...]
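The Jukes and Cantor correction mentioned in the list above can be sketched in a few lines of Java. This assumes the usual form d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differences per site (for example synonymous differences divided by synonymous positions). Counting the synonymous and non-synonymous positions themselves (Nei and Gojobori) is not shown, and the class name is hypothetical.

```java
/** Hypothetical helper, not part of BioJava. */
final class JukesCantorSketch {

    /** Observed proportion of differences, e.g. synonymous differences / synonymous positions. */
    static double proportion(double differences, double positions) {
        if (positions <= 0.0) {
            throw new IllegalArgumentException("positions must be positive");
        }
        return differences / positions;
    }

    /** Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - (4/3) * p). */
    static double correct(double p) {
        if (p < 0.0 || p >= 0.75) {
            // the logarithm is undefined once p reaches 3/4
            throw new IllegalArgumentException("p must be in [0, 0.75)");
        }
        return -0.75 * Math.log(1.0 - (4.0 / 3.0) * p);
    }
}
```

The corrected value is always at least as large as p, because the correction accounts for multiple substitutions at the same site.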
We can also do pairwise comparisons between polymorphic sites and
calculate linkage disequilibrium (more equations here, such as D'
(Lewontin), R and R2 (Hill and Robertson), the distance between the two
sites being compared, and so on...).
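To make the pairwise measures a bit more concrete, here is a minimal Java sketch for two biallelic sites. It assumes the haplotype frequency p(AB) and the allele frequencies p(A) and p(B) have already been estimated, and uses the textbook definitions D = p(AB) - p(A)p(B), D' = D / Dmax and R2 = D^2 / (p(A)(1 - p(A))p(B)(1 - p(B))). The names are hypothetical, not BioJava API.

```java
/** Hypothetical helper, not part of BioJava: LD between two biallelic sites. */
final class LinkageDisequilibriumSketch {

    /** D = p(AB) - p(A) * p(B). */
    static double d(double pAB, double pA, double pB) {
        return pAB - pA * pB;
    }

    /** Lewontin's D': D scaled by the largest magnitude it can reach for these allele frequencies. */
    static double dPrime(double pAB, double pA, double pB) {
        double d = d(pAB, pA, pB);
        double dMax = d >= 0
                ? Math.min(pA * (1 - pB), (1 - pA) * pB)
                : Math.min(pA * pB, (1 - pA) * (1 - pB));
        // a monomorphic site gives dMax == 0; report no disequilibrium in that case
        return dMax == 0.0 ? 0.0 : d / dMax;
    }

    /** Squared correlation: D^2 / (pA * (1 - pA) * pB * (1 - pB)). */
    static double rSquared(double pAB, double pA, double pB) {
        double denominator = pA * (1 - pA) * pB * (1 - pB);
        if (denominator == 0.0) {
            return Double.NaN; // undefined when either site is monomorphic
        }
        double d = d(pAB, pA, pB);
        return d * d / denominator;
    }
}
```

In this sketch D' keeps the sign of D; taking the absolute value is a common alternative.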
All (or most) of these equations are used only in this field of
biology, and you cannot find them in libraries such as Colt or Apache
Commons Math... Of course, you can calculate some other basic
probabilities, like the chi-square and Fisher distributions, with other
packages. Moreover, these equations are pretty simple and often use the
same variables. I guess that the best strategy here would be to use some
objects. For example, we could have:
PolymorphismCalculator polyCalc = new PolymorphismCalculator(alignment);
int segregatingSites = polyCalc.calculateSegregatingSites();
....
.... or something like this. The biological functions would be in
Strategy objects for different models and biostatistics, but I am sure
that better ideas about this will come up :)
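One way the Strategy idea could look is sketched below. This is only an assumption about the design: the interface, the two example statistics and the plain string array used as an alignment are all invented for illustration, and the calculator takes a strategy instead of exposing one method per statistic.

```java
import java.util.HashSet;
import java.util.Set;

/** Strategy: one statistic computed from an alignment (plain strings here, an assumption). */
interface AlignmentStatistic {
    double calculate(String[] alignment);
}

/** Shared helper so that the strategies reuse the same per-column count. */
final class Columns {
    static int distinctResidues(String[] alignment, int col) {
        Set<Character> seen = new HashSet<>();
        for (String sequence : alignment) {
            seen.add(sequence.charAt(col));
        }
        return seen.size();
    }
}

/** Number of segregating sites (S): columns with more than one distinct residue. */
final class SegregatingSites implements AlignmentStatistic {
    @Override
    public double calculate(String[] alignment) {
        int count = 0;
        for (int col = 0; col < alignment[0].length(); col++) {
            if (Columns.distinctResidues(alignment, col) > 1) {
                count++;
            }
        }
        return count;
    }
}

/** Minimum number of mutations (eta): distinct residues minus one, summed over columns. */
final class MinimumMutations implements AlignmentStatistic {
    @Override
    public double calculate(String[] alignment) {
        int total = 0;
        for (int col = 0; col < alignment[0].length(); col++) {
            total += Columns.distinctResidues(alignment, col) - 1;
        }
        return total;
    }
}

/** Holds the alignment and delegates each calculation to the chosen strategy. */
final class PolymorphismCalculator {
    private final String[] alignment;

    PolymorphismCalculator(String[] alignment) {
        if (alignment.length == 0) {
            throw new IllegalArgumentException("alignment must not be empty");
        }
        this.alignment = alignment;
    }

    double calculate(AlignmentStatistic statistic) {
        return statistic.calculate(alignment);
    }
}
```

New statistics can then be added as new classes without touching the calculator, for example `polyCalc.calculate(new MinimumMutations())`.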
We could also get the SNP objects from an alignment using some
utilities...
SNP[] snps = SNPTools.retrieveFromAlignment(alignment);
A SNP would be a Symbol composed of AtomicSymbols, and some
annotations...
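A minimal sketch of what such a utility might do is shown here. It does not use BioJava's Symbol or AtomicSymbol types: an SNP is reduced to a column index plus the residues observed there, and the alignment is again assumed to be an array of equal-length strings. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical value class: a polymorphic column and the residues seen in it. */
final class SnpSite {
    final int position;   // 0-based column in the alignment
    final String alleles; // distinct residues in sorted order, e.g. "AT"

    SnpSite(int position, String alleles) {
        this.position = position;
        this.alleles = alleles;
    }
}

/** Hypothetical utility, loosely following the SNPTools call above. */
final class SnpToolsSketch {

    /** Returns one SnpSite per column that contains more than one distinct residue. */
    static List<SnpSite> retrieveFromAlignment(String[] alignment) {
        List<SnpSite> snps = new ArrayList<>();
        if (alignment.length == 0) {
            return snps;
        }
        for (int col = 0; col < alignment[0].length(); col++) {
            TreeSet<Character> residues = new TreeSet<>();
            for (String sequence : alignment) {
                residues.add(sequence.charAt(col));
            }
            if (residues.size() > 1) {
                StringBuilder alleles = new StringBuilder();
                for (char residue : residues) {
                    alleles.append(residue);
                }
                snps.add(new SnpSite(col, alleles.toString()));
            }
        }
        return snps;
    }
}
```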
Well, alignments are not the only source of SNPs. For example, we
can get SNPs from the NCBI in a few formats; I think this kind of
FASTA is the most widely used:
>gnl|dbSNP|ss12724475|allelePos=632|totalLen=659|PIGGENOME|P1|taxid=9823|mol=cDNA|snpclass=1|alleles='T/A'
CATCATTGAGCTACTTGCCCTTCGGAGCAGGACCCCGCTCTTGCGTAGGG
GAGATGCTAGCCCGCCAGGAGCTCTTCCTCTTCACGGCTGGATTGCTGCA
GAGGTTCGACCTGGAGCTCCCAGATGATGGGCAGCTACCCTGTCTCGTGG
GCAACCCCAGTTTGGTCCTGCAGATAGATCCTTTCAAAGTGAAGATCAAG
GAGCGCCAGGCCTGGAAGGAAGCCCACACTG
W
AGGGGAGTACCTCCTGACTCCACCCTG
The SNP here is the ambiguous symbol W.
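Reading the header of such a record could look roughly like the sketch below. It only assumes what the example above shows: fields separated by '|', some of them in key=value form, and the alleles value wrapped in single quotes. The class and method names are hypothetical. Note that W is the IUPAC code for "A or T", which matches alleles='T/A'.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the dbSNP-style FASTA header shown above. */
final class DbSnpHeaderSketch {

    /** Collects the key=value fields of a '|'-separated header; other fields are skipped. */
    static Map<String, String> parseFields(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        String line = header.startsWith(">") ? header.substring(1) : header;
        for (String part : line.split("\\|")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                String value = part.substring(eq + 1);
                // the alleles value is quoted in the example ('T/A'); drop the quotes
                if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
                    value = value.substring(1, value.length() - 1);
                }
                fields.put(part.substring(0, eq), value);
            }
        }
        return fields;
    }

    /** Splits an alleles value such as "T/A" into its individual alleles. */
    static String[] alleles(Map<String, String> fields) {
        String value = fields.get("alleles");
        return value == null ? new String[0] : value.split("/");
    }
}
```

With the header above, parseFields gives allelePos=632, totalLen=659, taxid=9823, mol=cDNA, snpclass=1 and alleles=T/A.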
Well, this is an overall idea about SNPs... more focused on SNP
calculations. It would be useful to have more opinions about this and
other aspects of SNPs...
Bruno