[BioPython] [Biopython-dev] Statistics in population genetics module - Part I

Mon Nov 3 20:50:14 UTC 2008

Giovanni Marco Dall'Olio wrote:
> On Fri, Oct 31, 2008 at 12:58 AM, Tiago Antão <tiagoantao at gmail.com> wrote:
>
>   
>> Hi,
>>
>> Statistics is the most important part of population genetics modules.
>> In fact one could say that statistics were invented FOR population
>> genetics (check http://en.wikipedia.org/wiki/Ronald_Fisher ).
>> When I started to work on the population genetics module I decided to
>> delay the statistics module a bit, in order to get experience with the
>> whole biopython project before committing to do the most important
>> thing.
>> Irrespective of whether it is possible to link scipy or not, now seems
>> to be the time to advance, especially considering that Giovanni is
>> interested in participating.
>> A few points need to be made before suggesting how to put
>> statistics in Bio.PopGen
>>
>> 1. Whatever design is put in, it should be reasonably future proof: in
>> a few releases it should not be a good idea to break older code. That
>> should be avoided in as much as possible.
>>     
>
>
> For how much time do you think a biopython module should be kept compatible
> with older versions, more or less?
> It will take a long time to develop the module, and it is sure that we will
> make some mistakes. So, what is the best way to proceed? What if we create a
> separated biopython branch where we can test all the new features?
> At the moment I am working with a separated git repository for all the
> popgen modules. The problem is that I didn't include all biopython modules
> in the repository, so, if any of my changes breaks something in biopython, I
> won't know it until I merge everything with the biopython code.
> On the other hand, if I include a biopython release in my popgen repository,
> I won't be able to track changes made in biopython, and my popgen code will
> be compatible with that version only.
> I think git provides some options to handle this kind of situation... I am
> not very used to cvs, so I don't know.
>   

If you have modified a Biopython module, you should probably check
whether the change is acceptable for the main Biopython distribution,
especially if it involves an API change. If it is not, modify your own
code instead, because I do not think it is a good idea to have different
versions of the same Biopython module or any name clashes with
Biopython. Otherwise, you just need to check that your code runs with a
very recent version of Biopython (and under the Python versions that
Biopython supports).

If you have not done so, I would suggest developing unit tests that not
only check the accuracy of the code but also guard future compatibility.
A failed test indicates a problem that needs resolving, and fixing it
will restore compatibility where necessary.
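
As a rough illustration of the kind of test I mean, here is a minimal
sketch using Python's standard unittest module. The function
expected_heterozygosity is a hypothetical stand-in for a popgen
statistic, not an existing Bio.PopGen function; the tests pin down
known values and the behaviour on bad input, so a later change that
alters either one shows up as a failure.

```python
import unittest


def expected_heterozygosity(allele_freqs):
    """Hypothetical example statistic (not part of Bio.PopGen).

    Returns 1 - sum(p_i ** 2) for a list of allele frequencies.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


class ExpectedHeterozygosityTest(unittest.TestCase):
    def test_two_equal_alleles(self):
        # Accuracy check against a value worked out by hand.
        self.assertAlmostEqual(expected_heterozygosity([0.5, 0.5]), 0.5)

    def test_single_allele(self):
        # A monomorphic locus has no heterozygosity.
        self.assertAlmostEqual(expected_heterozygosity([1.0]), 0.0)

    def test_rejects_frequencies_not_summing_to_one(self):
        # Compatibility check: callers may rely on this error.
        with self.assertRaises(ValueError):
            expected_heterozygosity([0.5, 0.4])
```

Saved in a file, these tests can be run with "python -m unittest"
followed by the file name.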

> p.s. When python3000 is released, it will probably be necessary to
> rewrite large portions of biopython, if not create a 'biopython 2' version
> (I think they were discussing something like this in bioperl's list).
> I thought that maybe, even if we make some 'mistakes' in this version of
> biopython, we will be able to fix them in a later version.
>   

Python 3 cannot really be discussed until the modules that are currently
incompatible with it, such as numpy and Biopython itself, can be used
under Python 3 (rc1 is available). Further, the advice from above (see
Guido's blog
http://www.artima.com/weblogs/viewpost.jsp?thread=227041) is that the
conversion should be a direct port without any changes, especially API
ones. So correcting any major 'mistakes' in the existing module as part
of that port will probably not be acceptable to the community. Further,
any correction to the main distribution, at any time, is not trivial,
especially as you must first get the users informed (I saw that when
histogram was changed in numpy).

A separate project gives you a lot of flexibility that you will lose
once the code is widely released or included in a well-established
project like Biopython. I think that you should maintain a separate
project of some type until everything is sufficiently acceptable to the
Biopython community. This gives sufficient time to address various
concerns and enables an easy integration.

Finally, if you require dependencies beyond those currently required by
Biopython (especially something like scipy), then I think it will be
very hard or impossible for you to get any code that relies on these
dependencies into Biopython.

Just my opinions on your questions,
Bruce