[Biopython-dev] Statistics code

Wed Oct 3 15:37:17 UTC 2007

Ralph,

Thanks for the detailed explanation. Because of a couple of requests I
had, I am going to commit first the coalescent code, but after the
coalescent code is in, I will pick this up.

Tiago

On 10/3/07, Ralph Haygood <rhaygood at duke.edu> wrote:
> Tiago,
>
> Sorry to be so long replying---I've been almost drowning in work.
>
> Use anything you find useful in my code.  If you do write an article
> about it, I'd be glad to be a coauthor, not just in name but actually
> to help with writing the discussion of sequence statistics.
>
> There *is* a lot of stuff in my code, not all of it generally
> important.  For example, few people will care about indel statistics,
> beyond counting them and maybe getting the frequency distribution of
> their lengths.  The things most people will care about are K (the
> number of polymorphic sites), Watterson's theta, pi, Tajima's D, Fu
> and Li's D, Fay and Wu's H, F_ST, and McDonald--Kreitman testing.
>
> As for ambiguous nucleotides, my code handles them in one of two ways,
> at the programmer's option.  By default, a site at which any sequence
> in the alignment contains an ambiguous nucleotide is ignored; for
> example,
>
>         ACRGTY
>         ACAGTC
>
> is effectively equivalent to
>
>         ACGT
>         ACGT .
>
> However, if the 'expand_diplotypes' option is specified when the
> Sample object is constructed, each sequence in the alignment is
> interpreted as a diplotype and converted into a pair of pseudo-
> haplotypes, two-fold ambiguous nucleotides (R, Y, W, S, M, and K)
> being interpreted as heterozygous; for example,
>
>         ACRGTY
>         ACAGTC
>
> is effectively equivalent to
>
>         ACAGTC
>         ACGGTT
>         ACAGTC
>         ACAGTC .
>
> In expand_diplotypes mode, sites containing three- or four-fold
> ambiguous nucleotides are still ignored.  Also, you'll get a warning
> if you request a statistic that depends on correct SNP phasing, which
> most statistics don't.  So far, I've found these two operating modes
> sufficient for my needs.
>
> I think your plan sounds very reasonable, just adding sequence
> statistics at a pace that's comfortable for you.  Any time you have
> questions, feel free to ask me, and I'll give you whatever benefit
> there is in my opinion and experience.
>
> I'm happy for all this to happen on biopython-dev, so that other
> people (e.g., Alex Lancaster) can add to it.  I'll leave it to the
> core developers to tell us if we're too noisy.  (I'd recommend still
> sending messages to me with copies to biopython-dev, however, so that
> I don't accidentally miss them on biopython-dev, which I don't always
> read carefully.)
>
> Ralph
>
> On Sat, 29 Sep 2007, Tiago Antão wrote:
>
> > Hi Ralph,
> >
> > Hope all is good with you. I am now finally starting to commit
> > statistics code to Biopython. But before I go ahead I would like to
> > ask some advice to you (plus some extra comments):
> >
> > About code merging and authorship:
> >
> > I am finally looking to your code. There is really lots of stuff
> > there! Would it be OK with you if I merged your code with mine into
> > Bio.PopGen.Stats? Obviously the copyright/authorship for the module
> > would be co-shared as would any authorship of any article deriving
> > from it...
> >
> > About a strategy to advance:
> >
> > 1. I personally don't have any experience, really, with working with
> > sequence data (My background are SNPs, microsatellites/STRs, AFLPs and
> > that sort of stuff)
> > 2. Starting on Monday I am beginning a PhD which will require, part
> > time, sequence analysis
> > 3. What I mean from 1 and 2 is that I currently don't have maturity to
> > architect and design a good framework for sequence analysis but I will
> > gain it with time.
> > My plan is then to defer all sequence code until I fell I know what I
> > am doing (although I was still thinking in providing something like
> > BioPerl's facility of extracting all SNPs from sequences)
> > If this is OK with you I plan to start committing code the week
> > starting on this Monday,
> >
> > About request for insight:
> >
> > If you have any comments to offer on issues regarding representing
> > indels and ambiguous data (ie ambiguous nucleotides) they might be
> > useful, as I suppose that is the biggest issue that makes me afraid of
> > sequence code.
> >
> >
> > Finally: I would summarize our discussion here on biopython-dev (I am
> > not taking it there directly just because you might not want your code
> > on Biopython or might want it in other terms).
> >
> > Thanks,
> > Tiago
> >

-- 
http://www.tiago.org/ps