[Biopython-dev] Statistics code

Tiago Antão tiagoantao at gmail.com
Thu Apr 3 00:55:20 UTC 2008


On Thu, Apr 3, 2008 at 1:13 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> > > It's not out of the question, but what exactly do you need from SciPy?
> >
> > I know that this will sound ridiculous, but, in the long run I will
> > need almost everything.
>
> I don't think we should include a dependency now because we may need it in
> the long run.

I already need it now, but only for one small thing: the chi-square
test. It is quite easy to reimplement. If it ends up being just
chi-square (which I doubt, but I might be able to leave the
conventional stats part to the user), then I think the best thing
would be to reimplement it and not force the dependency. But I think
that I will need more stats functionality as I implement more features.

The point is, IF I need to use it extensively, can I go ahead? If I
end up needing just a couple of functions, I would not mind
implementing them myself (but more than that is too much work, and as
you say, I don't think it makes sense for Biopython to also be
biostats).
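
To give an idea of how small a self-contained chi-square test could be,
here is a minimal sketch in pure Python. The names chisquare and
_chi2_sf are hypothetical (this is not existing Biopython code), and the
p-value uses a plain series expansion of the incomplete gamma function,
which is fine for small statistics but slow for very large ones.

```python
import math


def chisquare(observed, expected=None):
    """Pearson chi-square goodness-of-fit test: return (statistic, p_value).

    Hypothetical helper for illustration; not part of Biopython.
    """
    n = len(observed)
    if n < 2:
        raise ValueError("need at least two categories")
    if expected is None:
        # Assumption: default to equal expected counts in every category.
        expected = [sum(observed) / float(n)] * n
    stat = sum((o - e) ** 2 / float(e) for o, e in zip(observed, expected))
    return stat, _chi2_sf(stat, n - 1)


def _chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution with df degrees."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, x/2):
    # sum over n of (x/2)^n / (a * (a+1) * ... * (a+n)).
    term = 1.0 / a
    total = term
    k = a
    while term > total * 1e-15:
        k += 1.0
        term *= half_x / k
        total += term
    log_p = a * math.log(half_x) - half_x - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_p))
```

For example, chisquare([10, 20, 30], [20, 20, 20]) gives a statistic of
10.0 with 2 degrees of freedom.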


>
> > SciPy is a stable project, not an obscure library.
> While this is true, in my experience SciPy is also difficult to install. It
> may mean fewer people using your code because they don't want to go through
> the hassle of installing SciPy. Particularly users coming from a biology
> rather than a computer science background.

And poorly documented also, in my view. But population genetics is
actually 90% statistics. One doesn't do population genetics without
statistics. So, if one does pop gen then some kind of statistical
processing will have to exist somewhere.
If SciPy is difficult to install on Windows/Mac then there is an
adoption problem, as you point out (I am on Linux/Ubuntu, where it
is trivial to install), but I don't see a way around statistics for
anyone who wants to do population genetics (again, statistics was
invented for population genetics; it is really core for us). Of
course, better solutions than SciPy might exist...

> Previously we also discussed switching from the old Numerical Python to the
> new NumPy. I've heard rumors that the NumPy documentation will be declared
> open at the SciPy conference this year. Not having this documentation was my
> biggest argument against NumPy. In my understanding, NumPy has more
> functionality than Numeric. Maybe it has better statistics support also?

It says on http://www.scipy.org/Documentation : "fee based until SciPy 2008"
I think that NumPy has only basic stuff (standard deviation, mean). I
might be wrong, but my research points to that.

To sum it up:
1. It is still not clear to me whether I will need a stats library;
most probably yes.
2. I won't mind reimplementing some stats stuff in Biopython, as long
as it is little work, in order to avoid a dependency. I will try as
much as possible to avoid one.
3. The dependency (in case it appears) would have zero impact outside
of Bio.PopGen.Stats (maybe just setup.py to optionally allow using
scipy).
4. I need to know "the rules of the game" before I write more code (in
order to know what I can or cannot use, in case I need to).
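
One way the optional dependency in point 3 could look is sketched
below. This is only an illustration: chisquare_statistic is a
hypothetical name, not an existing Bio.PopGen function. The module uses
scipy.stats.chisquare when SciPy can be imported and otherwise falls
back to a few lines of pure Python, so nothing outside the module
needs to know whether SciPy is installed.

```python
try:
    # Optional dependency: used only if SciPy is installed.
    from scipy.stats import chisquare as _scipy_chisquare
except ImportError:
    _scipy_chisquare = None


def chisquare_statistic(observed, expected):
    """Return the Pearson chi-square statistic (hypothetical helper).

    Assumes observed and expected have the same length and the same total.
    """
    if _scipy_chisquare is not None:
        result = _scipy_chisquare(observed, f_exp=expected)
        return float(result[0])
    # Pure Python fallback when SciPy is not available.
    return sum((o - e) ** 2 / float(e) for o, e in zip(observed, expected))
```

Callers would simply write chisquare_statistic([10, 20, 30], [20, 20, 20])
and get the same statistic either way.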

Tiago
PS - In the spirit of waterfall software development I could do an a
priori study of the requirements, but I really don't believe the
conclusion would be reliable.


