[Bioperl-l] bioperl reorganization (was Re: Tree refactor? was Re: Bootstrap, root, reroot...)

Thu Jul 16 16:48:53 UTC 2009

On Jul 16, 2009, at 2:22 AM, Robert Buels wrote:

> Renaming thread to bioperl reorganization.
>
> Chris Fields wrote:
>> I agree with you, but we've had this discussion before. Repeatedly,  
>> actually. I have a page in the wiki dedicated to it, having first  
>> raised the issue myself:
>>
>> http://www.bioperl.org/wiki/Proposed_core_modules_changes
>
> Ah good.  It's good that there's been some discussion of this  
> already. This is a major issue.  I took a look at the proposed changes  
> page, and it's a fundamentally unsound idea.  If we're having  
> problems maintaining a big, monolithic distribution of modules, the  
> solution is not "let's keep doing monolithic distributions, but just  
> slightly smaller ones".  It's just pushing the problem back a bit,  
> we'll still have the same problems down the road.
> ...
>
> For developers, this system works very well for reasons already  
> covered.  And without developers, there is no code.

Robert, it helps to read the older mail threads the page links to for  
some historical context.  This has been argued fairly extensively,  
with the proposed split written up on the page being the *initial*  
one, with more likely to occur along the way.  BTW, this was  
originally planned for 1.6 but we were trying to cram too much into  
the release, so I bit the bullet and pushed it back until we had 1.6.0  
out.  As you have seen we did manage to have Bio::Graphics migrate out  
successfully.

>> First: the problem we have consistently run into is exactly how to  
>> deliver a core set of modules in a way that works both for users  
>> and for release managers.   We have settled on one of the original  
>> proposals noted above, starting by roughly splitting up the current  
>> 'core' into something based on similar functions and level of  
>> development/support.  bioperl-dev was part of that, for instance,  
>> and represents code we consider 'developer-only' or experimental.   
>> The true 'core' would be a base set of modules with minimal  
>> additional dependencies (see below for how nebulous this becomes).
>> If you haven't already noticed, prior to 1.6.0 Bio::Graphics  
>> basically started the process (it's now an independent release on  
>> CPAN) and we already have a bioperl-dev.  As you've noted we can't  
>> split everything up right from the beginning, but we have started  
>> down that path.
>
> The Bio::Graphics split is definitely a step in the right direction.  
> There it is on the CPAN
> (http://search.cpan.org/~lds/Bio-Graphics-1.97/).  Beautiful.
>
>> Second: Bio::Tree seems independent of the other modules, but  
>> that's highly misleading. Bio::Species and Bio::Taxon (and thus  
>> anything that will use said objects, like Bio::Seqs, which are very  
>> much core) are now completely dependent on Bio::Tree code.  Both  
>> are-a Bio::Tree::NodeI, I believe since 1.5.2.  If we split that  
>> code off it then creates a circular dependency (Bio::Species, in  
>> core, requires Bio::Tree in the bio-tree package, which in turn  
>> requires Bio::Root::Root in the core package).  Bio::Tree code also  
>> has a Bio::DB::Taxonomy, thus expanding core a little bit more.   
>> Similarly, Bio::Ontology classes are used by several key modules  
>> (Bio::Annotation::OntologyTerm comes to mind).  In other words,  
>> there are some  
>> parts of core that can't easily be split off w/o repercussions (and  
>> thus probably won't be).
>
> OK, Bio::Tree is definitely not the place to start then.  You have  
> to start chipping away and extracting leaf nodes in the dependency  
> tree, and that's what was done with Bio::Graphics.

That will be the issue (and one of the reasons I brought up SearchIO,  
AlignIO, Tools, etc).
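
To make the leaf-node and circular-dependency points concrete, here is a
rough sketch in Python (3.9+, for graphlib).  It is illustrative only:
the module list is a heavily simplified assumption based on the modules
named in this thread (Bio::Tree::Node stands in for the Bio::Tree code,
and the Bio::Graphics -> Bio::Seq edge is assumed), and the distribution
names 'core', 'bio-tree' and 'bio-graphics' are hypothetical labels, not
an agreed layout.

```python
from graphlib import TopologicalSorter, CycleError

# Simplified, assumed module-level dependencies (module -> modules it needs).
MODULE_DEPS = {
    "Bio::Root::Root": [],
    "Bio::Tree::Node": ["Bio::Root::Root"],
    "Bio::Species": ["Bio::Tree::Node"],
    "Bio::Seq": ["Bio::Species", "Bio::Root::Root"],
    "Bio::Graphics": ["Bio::Seq"],
}

# Hypothetical layout 1: only Bio::Graphics is split out of core.
SPLIT_GRAPHICS = {
    "Bio::Root::Root": "core",
    "Bio::Tree::Node": "core",
    "Bio::Species": "core",
    "Bio::Seq": "core",
    "Bio::Graphics": "bio-graphics",
}

# Hypothetical layout 2: the tree code is split out as well.
SPLIT_TREE = {**SPLIT_GRAPHICS, "Bio::Tree::Node": "bio-tree"}


def dist_graph(module_deps, module_to_dist):
    """Collapse module dependencies into distribution -> required distributions."""
    graph = {dist: set() for dist in module_to_dist.values()}
    for module, deps in module_deps.items():
        for dep in deps:
            src, dst = module_to_dist[module], module_to_dist[dep]
            if src != dst:  # dependencies inside one distribution are free
                graph[src].add(dst)
    return graph


def has_cycle(graph):
    """True if the distributions cannot be ordered dependency-first."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


def leaves(graph):
    """Distributions that no other distribution requires."""
    required = set().union(*graph.values())
    return sorted(set(graph) - required)


if __name__ == "__main__":
    for name, layout in (("graphics only", SPLIT_GRAPHICS), ("plus tree", SPLIT_TREE)):
        graph = dist_graph(MODULE_DEPS, layout)
        print(name, "cycle:", has_cycle(graph), "leaves:", leaves(graph))
```

Under these assumptions, splitting out only the graphics code leaves a
clean one-way dependency on core, while splitting out the tree code
makes core and bio-tree require each other.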

>> Third: the largest issue in my opinion, that being what really  
>> constitutes 'core', not just to us but to current bioperl users.   
>> To me, the idea of a true 'core' is the bare essentials (Seq,  
>> Features, Annotations, and some basic IO modules, the most common  
>> interfaces).
>> Should 'core' include SearchIO, or AlignIO?  Remote and/or local DB  
>> functionality?  Bio::Tools?  All of those are feasibly independent  
>> sets of modules, and I would definitely support moving them into  
>> their own subdistributions, which would make it easier to fix bugs  
>> and release updates, but I may be in the minority as they are  
>> extremely popular, and many users still consider them 'core'.  We  
>> need a workaround for that.
>
> There is no workaround needed.  The user types at their cpan prompt:  
> "install Bio::SeqIO" and says 'yes' to follow dependencies.  There  
> should be no core.  Only dependencies.  If we want to give users a  
> convenient abstraction of "BioPerl", the way to do that would be to  
> revisit Bundle::BioPerl (as you say below), or do a Task::BioPerl.

Well, a Task::BioPerl or Bundle::BioPerl would essentially be a  
workaround.  I consider anything to appease long-time users who expect  
an old-style core a 'workaround', though one might use 'solution'  
there as well.

> Really, the whole idea of having a "core" is bogus.  Somebody doing  
> phylogenetics will say the Tree stuff should be core, because, you  
> know, whatever else would you use BioPerl for anyway, but me, who  
> runs genome annotation pipelines and data handling, does not give a  
> hoot about trees.  At least not right now.  So you can go round and  
> round arguing about what should and should not be in core, and you  
> will never come to a set of modules that satisfies even "most  
> researchers'" needs unless you have a huge, unmaintainable  
> monolithic distribution, which, as has been demonstrated, is not a  
> good idea.

I don't agree.  I do think there is a 'core' set of modules  
(Bio::Root, if you want to take the most extreme point of view, would  
represent the purest core set of modules).  Most larger projects  
define a core set.  Perl itself.  Moose as well; they have had recent  
discussions about adapting AttributeHelpers for Moose core:

http://thread.gmane.org/gmane.comp.lang.perl.moose/890

The difference between Moose and BioPerl is Moose has effectively  
preempted the large distribution issue with MooseX::*, which comes with  
its own versioning.  However, for the tons of MooseX::*, there will  
always be one Moose 'core' set of modules.

Conversely (and my point with the question), bioperl's core was never  
truly defined as 'this is the base set of modules, everything else is  
a separate distribution', and therefore it has grown to an almost  
unmaintainable proportion.  Essentially we're the reverse of Moose,  
having to deal with splitting up a very large core into more  
maintainable bits.  I think it's possible, but it won't be easy w/o  
having some way of bundling the whole lot together.
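
As a rough illustration of what 'bundling the whole lot together' could
mean mechanically, here is a small Python sketch (3.9+, for graphlib).
The distribution names and their requirements are assumptions made up
for the example, not a proposed layout: a bundle is just a list of
distributions, and installing it means pulling in everything they
require, dependencies first.

```python
from graphlib import TopologicalSorter

# Hypothetical distributions and what each one requires.
DIST_REQUIRES = {
    "bioperl-core": set(),
    "bio-graphics": {"bioperl-core"},
    "bioperl-run": {"bioperl-core"},
    "bioperl-db": {"bioperl-core"},
}

# Hypothetical 'everything and the kitchen sink' bundle.
BUNDLE = ["bio-graphics", "bioperl-run", "bioperl-db"]


def install_order(bundle, requires):
    """Return every distribution the bundle needs, dependencies first."""
    needed, stack = set(), list(bundle)
    while stack:  # collect the transitive closure of requirements
        dist = stack.pop()
        if dist not in needed:
            needed.add(dist)
            stack.extend(requires[dist])
    subgraph = {dist: requires[dist] for dist in needed}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(install_order(BUNDLE, DIST_REQUIRES))
```

The bundle itself stays a plain list; adding a newly split-off
distribution to it is a one-line change.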

>> Finally (a wrap-up of bits and pieces): a) how are the various  
>> bio-* packages to be maintained?  Would there be several release  
>> pumpkins, one for each release?
>> b) How do we sort out versioning?  For instance, would bio-foo have  
>> a separate version (like Bio::Graphics now does) and require a  
>> specific core version?  c) I'm sure I have forgotten a few things,  
>> but I've rambled on enough already.
>
> Each distribution would be versioned and released independently.  
> Perhaps they could all start out at version 1.6.

That's what I'm thinking as well, at least for the modules split out  
of core.  Anything else could have its own (hopefully sane)  
versioning; that would be left up to the developer.

> If there is a change in one module that breaks something in another  
> distro (which of course should not be done lightly) it's the  
> responsibility of the other distro's maintainer to fix it or code  
> around it or pin it down with a specific version number dependency  
> in its Build.PL, or whatever. Finding and characterizing these  
> interactions is what automated testing is for, and why it's built  
> into CPAN.

Yes.
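
For what pinning to a minimum version amounts to, here is a minimal
Python sketch assuming plain tripartite version strings such as
'1.6.0'.  It is only an illustration: real Perl version handling
(decimal versus dotted forms) is more involved, and the module names
and version numbers in the usage lines are example values, not actual
requirements.

```python
def parse_version(text, width=3):
    """Turn '1.6' or '1.6.0' into a tuple padded to `width` parts."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def unmet(installed, requires):
    """Names that are missing or older than the required minimum version."""
    problems = []
    for name, minimum in requires.items():
        have = installed.get(name)
        if have is None or parse_version(have) < parse_version(minimum):
            problems.append(name)
    return sorted(problems)


if __name__ == "__main__":
    # Example values only: a subdistribution requiring a minimum core version.
    requires = {"Bio::Root::Root": "1.6.0"}
    print(unmet({"Bio::Root::Root": "1.5.2"}, requires))
    print(unmet({"Bio::Root::Root": "1.6"}, requires))
```

Padding to three parts means '1.6' and '1.6.0' compare as equal, which
matters if distributions 'all start out at version 1.6'.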

>> </breather>
>
> Grrr!!! No breather!  (just kidding)
>
>
>> Now, my suggestions.  We have settled on a general layout, so...
>> * Each subdistribution would have a separate version and require a  
>> specific core (Bio::Root::Root) version.  Note that Bio::Graphics  
>> is using a different versioning scheme than BioPerl, but we may  
>> want to stick to a similar tripartite numbering scheme as for  
>> core.  Whatever happens, this must be decided on first, as there  
>> will be no turning back.
>> * We repurpose Bundle::BioPerl (or a similar Bundle::* package) or  
>> make the BioPerl distribution itself a bundle-like installation.   
>> This would be for packaging up an old-style 'everything and the  
>> kitchen sink' core package from the various distributions.  Anytime  
>> we split off something into its own distribution we release a  
>> newly trimmed-down core and add the new distribution to the bundle  
>> or BioPerl.  Refer everyone to install the bundle if they want the  
>> old-style installation.
>> * Other current subdistributions (run, db, network, etc) follow the  
>> same pattern as the above.  Releases for non-core distributions do  
>> not have to be tied together with core except where needed.
>> * Avoid any circular dependencies (Bio::ASN1::EntrezGene, I'm  
>> staring at you).
> (Is there any point in staring into its dead sunken eye sockets?  It  
> was last released in 2005.  Need to remove this dependency,  
> rewriting the module in question if necessary.)
>> * As you mention, work these out on branches to test things out.
>
> The above is all exactly right.  The proposed layout of the  
> distributions is the only thing that's wrong.  They need to be much  
> smaller, more focused, and thus more maintainable.

Yes, I agree.  However, a large set of modules in bioperl was  
effectively donated by the original authors, so maintaining them falls  
to the core devs by sheer force of legacy.

>> And finally, and I am saying this with the utmost respect and  
>> sincerest thanks for everything Sendu is doing and has done for  
>> BioPerl, but I'm not convinced we should keep using  
>> Bio::Root::Build. It does make some things convenient, but at the  
>> cost of additional bugs (2-3 at last count), some API breakage  
>> (some methods conflict with Module::Build), and a bit of a  
>> chicken-and-egg dilemma that particularly impacts subdistributions  
>> (attempting to fall back to Module::Build doesn't work due to API  
>> issues).  I can elaborate on that more if asked, but I think this  
>> post is already long enough, so I'll leave that to later.
>
> Yes, please elaborate on that more.  I want to know.

On bugs:

http://bugzilla.open-bio.org/show_bug.cgi?id=2792
http://bugzilla.open-bio.org/show_bug.cgi?id=2831
http://bugzilla.open-bio.org/show_bug.cgi?id=2859
http://bugzilla.open-bio.org/show_bug.cgi?id=2832 (this one is more a  
TODO)

Note that the author of Bio::Root::Build hasn't touched these, so my  
inclination is to convert over to plain ol' Module::Build.

On API and the 'chicken-or-egg' issue:

Several methods within Bio::Root::Build override Module::Build methods  
but break API, in that they accept, generate, or process different  
(sometimes bioperl-specific) data than what the same Module::Build  
methods expect.  I think 'requires' and 'recommends' fall into this  
category, as well as some metadata generation, such as META.yml and  
PPM.  Other bits are more akin to syntactic sugar (automated  
installation via CPAN, network checking, etc).  This may cause bugs as  
noted above, which goes to demonstrate that too much 'sugar' can send  
you into a coma ;>

It also causes a bit of a 'chicken-or-egg' issue with subdistributions  
wanting to use Bio::Root::Build, in that one has to check for the  
presence of Bio::Root::Build first and then completely bail if it  
isn't present.  One can't fall back to Module::Build due to the API  
difference.  I have run into this when releasing bioperl-run and the  
others.

What I would like is to have the various breakaway Bio::* either fall  
back to Module::Build if Bio::Root::Build isn't present, or just use  
Module::Build.  My suggestion is to just use Module::Build directly,  
but we could scale down Bio::Root::Build to respect the Module::Build  
API (thus allowing it as a fallback).

> Such progress.
>
> Seems like now we just need to get everyone to agree that  
> distributions need to be small and focused.
>
> Right?
>
> Rob

I think most devs are on board with this, as long as we have some way  
of *easily* collecting the various bits into a larger whole.  We do  
get a ton of first-time programmers on this list, which probably makes  
it more similar to the perl users list than to the moose list.

Anyway, bundling should solve this.

chris