[Bioperl-l] bioperl reorganization (was Re: Tree refactor? was Re: Bootstrap, root, reroot...)

Thu Jul 16 07:22:00 UTC 2009

Renaming thread to bioperl reorganization.

Chris Fields wrote:
> I agree with you, but we've had this discussion before. Repeatedly, 
> actually. I have a page in the wiki dedicated to it, having first raised 
> the issue myself:
>
> http://www.bioperl.org/wiki/Proposed_core_modules_changes

Ah good.  It's good that there's been some discussion of this already.
This is a major issue.  I took a look at the proposed changes page, and
it's a fundamentally unsound idea.  If we're having problems maintaining
a big, monolithic distribution of modules, the solution is not "let's
keep doing monolithic distributions, but just slightly smaller ones".
That just pushes the problem back a bit; we'll still have the same
problems down the road.

The proven, scalable, maintainable way to maintain and distribute Perl 
modules is small, focused distributions tied together with dependencies. 
(For those of you following along at home, a distribution is just a
tar.gz that the cpan installer downloads and installs behind the scenes.)

For users, this is fine.  It has to be fine, or they could not use
anything else that's on the CPAN, because it is all like this (except
for BioPerl).  And if a user doesn't know how to use the CPAN and
refuses to learn, they are missing out, and that's just how it is.
Learning it is not trivial, but it is not that hard, and if they are
going to be using bioperl to write their own Perl programs, they need to
learn it.  It is, in 2009, an integral part of writing Perl.

For developers, this system works very well for reasons already covered.
And without developers, there is no code.

> First: the problem we have consistently run into is exactly how to 
> deliver a core set of modules in a way that works both for users and for 
> release managers.   We have settled on one of the original proposals 
> noted above, starting by roughly splitting up the current 'core' into 
> something based on similar functions and level of development/support.  
> bioperl-dev was part of that, for instance, and represents code we 
> consider 'developer-only' or experimental.  The true 'core' would be a 
> base set of modules with minimal additional dependencies (see below for 
> how nebulous this becomes).
> 
> If you haven't already noticed, prior to 1.6.0 Bio::Graphics basically 
> started the process (it's now an independent release on CPAN) and we 
> already have a bioperl-dev.  As you've noted we can't split everything 
> up right from the beginning, but we have started down that path.

The Bio::Graphics split is definitely a step in the right direction. 
There it is on the CPAN, 
(http://search.cpan.org/~lds/Bio-Graphics-1.97/).  Beautiful.

> Second: Bio::Tree seems independent of the other modules, but that's 
> highly misleading. Bio::Species and Bio::Taxon (and thus anything that 
> will use said objects, like Bio::Seqs, which are very much core) are now 
> completely dependent on Bio::Tree code.  Both are-a Bio::Tree::NodeI, I 
> believe since 1.5.2.  If we split that code off it then creates a 
> circular dependency (Bio::Species, in core, requires Bio::Tree in the 
> bio-tree package, which in turn requires Bio::Root::Root in the core 
> package).  Bio::Tree code also has a Bio::DB::Taxonomy, thus expanding 
> core a little bit more.  Similarly, Bio::Ontology classes are used by 
> several key modules (Bio::Annotation::OntologyTerm comes to mind, but 
> also Bio::Annotation::OntologyTerm).  In other words, there are some 
> parts of core that can't easily be split off w/o repercussions (and thus 
> probably won't be).

OK, Bio::Tree is definitely not the place to start then.  You have to 
start chipping away and extracting leaf nodes in the dependency tree, 
and that's what was done with Bio::Graphics.
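
To make "extracting leaf nodes" concrete, here is a rough sketch of how
one might list the candidates.  The dependency map is made up for
illustration (loosely based on the relationships mentioned in this
thread), and extractable_leaves is a hypothetical helper, not anything
in BioPerl:

```perl
use strict;
use warnings;

# Hypothetical dependency map: each key depends on the namespaces listed.
# The edges are illustrative only, not a real analysis of BioPerl.
my %depends_on = (
    'Bio::Root'     => [],
    'Bio::Tree'     => ['Bio::Root'],
    'Bio::Species'  => ['Bio::Tree', 'Bio::Root'],
    'Bio::Seq'      => ['Bio::Species', 'Bio::Root'],
    'Bio::Graphics' => ['Bio::Seq', 'Bio::Root'],
);

# A "leaf" here is a namespace that nothing else depends on, so it can
# be moved into its own distribution without breaking what stays behind.
sub extractable_leaves {
    my ($deps) = @_;
    my %needed;
    for my $name (keys %$deps) {
        $needed{$_} = 1 for @{ $deps->{$name} };
    }
    my @leaves = grep { !$needed{$_} } keys %$deps;
    @leaves = sort @leaves;
    return @leaves;
}

print join(', ', extractable_leaves(\%depends_on)), "\n";
```

With these made-up edges the only candidate is Bio::Graphics.  Once a
leaf is split off and removed from the map, running it again shows the
next layer that can go.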

> 
> Third: the largest issue in my opinion, that being what really 
> constitutes 'core', not just to us but to current bioperl users.  To me, 
> the idea of a true 'core' is the bare essentials (Seq, Features, 
> Annotations, and some basic IO modules, the most common interfaces).
> Should 'core' include SearchIO, or AlignIO?  Remote and/or local DB 
> functionality?  Bio::Tools?  All of those are feasibly independent sets 
> of modules, and I would definitely support those being in their own 
> subdistributions and would be easier to fix bugs and release updates, 
> but I may be in the minority as they are extremely popular, and many 
> users still consider them 'core'.  We need a workaround for that.

There is no workaround needed.  The user types at their cpan prompt: 
"install Bio::SeqIO" and says 'yes' to follow dependencies.  There 
should be no core.  Only dependencies.  If we want to give users a 
convenient abstraction of "BioPerl", the way to do that would be to 
revisit Bundle::BioPerl (as you say below), or do a Task::BioPerl.
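
A Task::BioPerl could be little more than a Build.PL that lists
prerequisites.  A rough sketch, assuming Module::Build is installed and
that each piece below were installable on its own; the selection of
modules is arbitrary and this is not an existing distribution's
Build.PL:

```perl
use strict;
use warnings;
use Module::Build;

# Build.PL for a hypothetical Task::BioPerl.  It ships almost no code of
# its own and expects lib/Task/BioPerl.pm to exist and carry a $VERSION.
my $build = Module::Build->new(
    module_name => 'Task::BioPerl',
    license     => 'perl',
    requires    => {
        # illustrative selection; 0 means "any version"
        'Bio::SeqIO'      => 0,
        'Bio::Graphics'   => 0,
        'Bio::Tree::Tree' => 0,
    },
);
$build->create_build_script;
```

Somebody who wants the old everything-at-once experience would install
the Task and say 'yes' to following dependencies; everyone else installs
only what they use.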

Really, the whole idea of having a "core" is bogus.  Somebody doing 
phylogenetics will say the Tree stuff should be core, because, you know, 
whatever else would you use BioPerl for anyway, but I, who run genome
annotation pipelines and data handling, do not give a hoot about
trees.  At least not right now.  So you can go round and round arguing 
about what should and should not be in core, and you will never come to 
a set of modules that satisfies even "most researchers'" needs unless 
you have a huge, unmaintainable monolithic distribution, which, as has 
been demonstrated, is not a good idea.

> Finally (a wrap-up of bits and pieces): a) how are the various bio-* 
> packages to be maintained?  Would there be several release pumpkins, one 
> for each release?
> b) How do we sort out versioning?  For instance, 
> would bio-foo have a separate version (like Bio::Graphics now does) and 
> require a specific core version?  c) I'm sure I have forgotten a few 
> things, but I've rambled on enough already.

Each distribution would be versioned and released independently. 
Perhaps they could all start out at version 1.6.  If there is a change 
in one module that breaks something in another distro (which of course 
should not be done lightly) it's the responsibility of the other 
distro's maintainer to fix it or code around it or pin it down with a 
specific version number dependency in its Build.PL, or whatever. 
Finding and characterizing these interactions is what automated testing 
is for, and why it's built into CPAN.
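
For example, pinning could look roughly like this in the dependent
distribution's Build.PL.  This is a sketch only: Bio::Foo is a
placeholder name (the "bio-foo" from above), it assumes Module::Build is
installed, and the version numbers are illustrative:

```perl
use strict;
use warnings;
use Module::Build;

# Build.PL for a hypothetical Bio-Foo distribution.
my $build = Module::Build->new(
    module_name => 'Bio::Foo',    # placeholder module name
    license     => 'perl',
    requires    => {
        # Accept the 1.6 series of Bio::Root::Root, but refuse anything
        # later until this distribution has been tested against it.
        # The bounds are illustrative, not a recommendation.
        'Bio::Root::Root' => '>= 1.006, < 1.007',
    },
);
$build->create_build_script;
```

When the maintainer has fixed or coded around the breakage, the upper
bound gets relaxed and a new version of this one distribution goes out,
without waiting on anything else.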

> </breather>

Grrr!!! No breather!  (just kidding)

> Now, my suggestions.  We have settled on a general layout, so...
> * Each subdistribution would have a separate version and require a 
> specific core (Bio::Root::Root) version.  Note that Bio::Graphics is 
> using a different versioning scheme than BioPerl, but we may want to 
> stick to a similar tripartite numbering scheme as for core.  Whatever 
> happens, this must be decided on first, as there will be no turning back.
> * We repurpose Bundle::BioPerl (or a similar Bundle::* package) or make 
> the BioPerl distribution itself a bundle-like installation.  This would 
> be for packaging up an old-style 'everything and the kitchen sink' core 
> package from the various distributions.  Anytime we split off something 
> into its own distribution we release a newly trimmed-down core and add 
> the new distribution to the bundle or BioPerl.  Refer everyone to 
> install the bundle if they want the old-style installation.
> * Other current subdistributions (run, db, network, etc) follow the same 
> pattern as the above.  Releases for non-core distributions do not have 
> to be tied together with core except where needed.
> * Avoid any circular dependencies (Bio::ASN1::EntrezGene, I'm staring at 
> you).
(Is there any point in staring into its dead sunken eye sockets?  It was 
last released in 2005.  Need to remove this dependency, rewriting the 
module in question if necessary.)
> * As you mention, work these out on branches to test things out.

The above is all exactly right.  The proposed layout of the 
distributions is the only thing that's wrong.  They need to be much 
smaller, more focused, and thus more maintainable.

> 
> And finally, and I am saying this with the utmost respect and sincerest 
> thanks for everything Sendu is doing and has done for BioPerl, but I'm 
> not convinced we should keep using Bio::Root::Build. It does make some 
> things convenient, but at the cost of additional bugs (2-3 at last 
> count), some API breakage (some methods conflict with Module::Build), 
> and a bit of a chicken-and-egg dilemma that particularly impacts 
> subdistributions (attempting to fall back to Module::Build doesn't work 
> due to API issues).  I can elaborate on that more if asked, but I think 
> this post is already long enough, so I'll leave that to later.

Yes, please elaborate on that more.  I want to know.

Such progress.

Seems like now we just need to get everyone to agree that distributions 
need to be small and focused.

Right?

Rob