[Bioperl-l] Splits again

Sendu Bala bix at sendu.me.uk
Thu Jun 28 07:25:03 UTC 2007


Chris Fields wrote:
> On Jun 27, 2007, at 5:43 PM, Sendu Bala wrote:
>> What advantage is there of these defined splits instead of  
>> individual modules? As I see it you lose some of the potential  
>> benefits of breaking Bioperl up completely, whilst also suffering  
>> the maintenance problems I outlined in my objection to Steve's post.
>>
>> Being able to work on all Bioperl from a single cvs (ne svn) check  
>> out/ archive, whilst distributing it as individual modules on CPAN  
>> seems like the best of both worlds to me. What am I missing?
> 
> Okay, forewarned, but here's my long-winded reasoning.  The short and  
> sweet version: I (very) respectfully don't agree with you, at least  
> re: the idea we should commit all modules to CPAN independently. It  
> doesn't make any sense to me, but maybe you can elaborate more?   
> Maybe I'm misinterpreting what you mean?

The short and sweet version: my proposal has all the benefits of yours, 
but none of the disadvantages. What's not to like?


> Finally, all of this should wait until later.  Much later, like after  
> a decent release, after svn, etc kind of 'later'.  I think we can  
> agree on that.

Hmm, not really. If it can be implemented by a change in just Build.PL 
and ModuleBuildBioperl, it's really independent of everything else. 
That's the beauty of it: the only thing that changes is how things are 
uploaded to and downloaded from CPAN. The only person that normally 
deals with that issue is the pumpkin for a release, and he only cares 
about it at release time.

In fact, if we're going to do it at all it makes sense to try it out on 
a minor release like 1.5.3. We've already got experience of doing it 
split-style from 1.5.2. (And let me tell you: splits at the code-base 
level suck.)


> Individual CPAN modules:
> 
> CPAN is not our personal versioning system; it may be if a  
> distribution consists of only a few modules, but not when it's one of  
> the largest distros present.  If someone wants to update an  
> individual bioperl module for a quick bug fix they are more than  
> welcome to download it via cvs, svn, or even using a web browser, and  
> replace the one they have.

And where is the harm in letting them do it via CPAN as well? In fact, 
there are significant benefits:


> I'm trying to reason how one could break up the individual SeqIO/ 
> SearchIO/otherIO modules into single module distributions.  They are  
> intrinsically tied together (SeqIO::genbank won't work w/o SeqIO,  
> which relies on the various interfaces, RootIO, and on down).  How  
> would tests be run off CPAN when the modules are distributed  
> independently?

Bio::SeqIO::genbank would have a dependency on the latest version of 
Bio::SeqIO (etc.), and Bio::SeqIO would have its own dependencies.

So when a user wants to get the latest version of Bio::SeqIO::genbank, 
they no longer have to worry about what other modules in its dependency 
hierarchy they should also install.

Instead they just request Bio::SeqIO::genbank which itself ensures you 
have the latest version of all its dependencies before installing itself 
and running its tests.
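
As a rough illustration of that "dependencies first" behaviour (Python,
purely for exposition; this is not how any CPAN client is implemented,
and the dependency map below is a made-up toy, not Bioperl's real
prerequisite list):

```python
# Illustrative only: a toy dependency map, not Bioperl's real prerequisites.
DEPENDS = {
    "Bio::SeqIO::genbank": ["Bio::SeqIO"],
    "Bio::SeqIO": ["Bio::Root::IO"],
    "Bio::Root::IO": ["Bio::Root::Root"],
    "Bio::Root::Root": [],
}


def install_order(module, depends=DEPENDS):
    """Return modules in install order: every prerequisite comes
    before the module that needs it (depth-first, post-order)."""
    order = []
    seen = set()
    in_progress = set()

    def visit(name):
        if name in seen:
            return
        if name in in_progress:
            raise ValueError("circular dependency at " + name)
        in_progress.add(name)
        for dep in depends.get(name, []):
            visit(dep)
        in_progress.discard(name)
        seen.add(name)
        order.append(name)

    visit(module)
    return order


print(install_order("Bio::SeqIO::genbank"))
```

The user only names the one module they care about; the walk over the
declared dependencies works out everything else.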

When a dev makes a major bugfix to Bio::SeqIO::genbank that all genbank 
users should have, he could just call './Build dist Bio::SeqIO::genbank' 
which would generate a new package for Bio::SeqIO::genbank suitable for 
uploading to CPAN. No more long release cycles and having to constantly 
tell people to 'use CVS' to get working Bioperl code.


> Would they also be individually distributed?  What  
> would you use to tie all the individual modules together?  How would  
> you explain to the CPAN maintainers that you want to split bioperl  
> into 990 individual modules, all updated independently, but intend on  
> bundling them afterwards anyway?

They would be tied together by a CPAN bundle. You don't have to 
'explain' anything to the CPAN maintainers because you're not doing 
anything wrong. In fact, you're using it the way you're supposed to.


> Splitting up core:
> 
> As I see it, here are the advantages of a defined split as Steve and  
> I see it (off the top of my head).  Some of this probably reiterates  
> my previous points, as well as Steve's, so apologies in advance.

Below I answer with how it would be with my single-module approach 
compared to the defined splits.


> - A lean, mean, focused set of bioperl base modules (core) w/o or  
> with very few external deps, minimal installation issues, etc.  The  
> very basic stuff to get up and running.

Even leaner, even more focused.


> - BioPerl bundled modules (Nathan's 'cliques') with defined, focused  
> functionality, code, and tests, which add a bit more 'sugar' to the  
> base functionality of the core.  If you only care about parsing BLAST  
> reports, get SearchIO, which requires core and optionally other  
> modules (XML::SAX).  If you want additional DB functionality apart  
> from the very basic ones in core, install DB (with its additional  
> requirements, including core, DBI, and so on).  Same with Graphics,  
> Tools, Tree/Phylo, etc.  We just need to define and limit the number  
> of splits.

The same can be achieved with CPAN bundles for each kind of functional 
grouping you can think of. And since it's just a single text file that 
defines such a grouping, it's easy to change or add new ones as you feel 
like it, as opposed to the rather more permanent and substantial effort 
of creating one of your splits on the code-base level.
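
To give a feel for how small such a file is, here is a sketch (Python,
for illustration only) that writes one out. The layout follows the usual
shape of a CPAN bundle file, i.e. a package stub plus a POD "CONTENTS"
section listing the modules, as best I understand it. The bundle name
Bundle::BioPerl::SearchIO and the helper bundle_pod are hypothetical.

```python
def bundle_pod(bundle_name, modules, summary):
    """Build the text of a bundle file that lists `modules`.

    Assumption: a bundle is a package stub followed by POD with a
    '=head1 CONTENTS' section naming one module per entry.
    """
    lines = [
        "package %s;" % bundle_name,
        "",
        "our $VERSION = '0.01';  # placeholder version",
        "",
        "1;",
        "",
        "__END__",
        "",
        "=head1 NAME",
        "",
        "%s - %s" % (bundle_name, summary),
        "",
        "=head1 CONTENTS",
        "",
    ]
    for module in modules:
        lines.append(module)
        lines.append("")
    lines.append("=cut")
    return "\n".join(lines) + "\n"


# Hypothetical grouping: just enough to parse Blast results.
print(bundle_pod(
    "Bundle::BioPerl::SearchIO",
    ["Bio::SearchIO", "Bio::SearchIO::blast_pull"],
    "modules for parsing search reports",
))
```

Changing a grouping then means editing the list of names, nothing more.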

Also, the world doesn't have to rely on /our/ ideas of what a useful 
functional split is. If someone just wants to parse Blast results, they 
can just use CPAN to install Bio::SearchIO::blast_pull instead of having 
to install all of SearchIO.


> - Easier to add additional bundled modules.  For instance, I could  
> focus all of my RNA work into a discrete set of modules (say, bioperl- 
> rna) which I maintain, I ensure works with the latest core code, I  
> ensure also plays well with the other children =) , and I distribute  
> via CPAN.  Same with EUtilities, which could go into a separated DB- 
> related set or stay in core.

And if you lose interest in them? They eventually die because they no 
longer have someone looking after them by default (the pumpkin and other 
devs). Alternatively you could just make a CPAN bundle. One text file! 
Easy! No duplication of modules in CPAN, no new hassle for you or the 
Bioperl 'core' pumpkin to ensure that the latest versions all work with 
each other and with other splits.


> - If we want a full-fledged 'install everything', the CPAN Bundle  
> system is available.  I think it's easier to use a Bundle for 4-5,  
> even 10 groups of modules as opposed to over 900.

No, it isn't any easier. It's /equally/ easy to install a bundle of 900 
single-module packages as it is to install a bundle of 5 packages that 
contain the same 900 modules between them.

When not installing absolutely everything, but perhaps 'most' things, 
there's the additional benefit that it would be easier to skip a 
particular Bio::module because you didn't want to install its external 
dependencies and weren't that interested in it anyway.


> - A Bundle or a build file where discrete distributions are listed  
> (Bio::SearchIO, etc) wouldn't need to be updated every time a new  
> module is added to a distribution.  I suppose this could be  
> automated, but why have the additional headache?

Yes, it would be automated, and no, it wouldn't at all be any kind of 
additional headache. I'm proposing a fully-automated system that the 
pumpkin wouldn't even have to think about. Much /less/ of a headache 
than dealing with splits. Orders of magnitude easier to deal with.
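
One piece of that automation could look something like this (a Python
sketch for illustration only; the helper name module_names is
hypothetical and nothing like it exists in Build.PL today). It derives
the module list from the files in the checkout, so nobody has to update
a list by hand when a module is added:

```python
def module_names(paths):
    """Turn relative .pm paths from a checkout into Perl module names.

    Files that are not Perl modules (README, tests, ...) are skipped.
    """
    names = []
    for path in paths:
        if path.endswith(".pm"):
            # "Bio/SeqIO/genbank.pm" -> "Bio::SeqIO::genbank"
            names.append(path[:-3].replace("/", "::"))
    return sorted(names)


print(module_names(["Bio/SeqIO/genbank.pm", "Bio/SeqIO.pm", "README"]))
```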


> - A chance to cut out some cruft.  We all know that particular areas  
> need work or a complete overhaul (Restriction, Structure, maybe a few  
> others).  Smaller, concentrated sets of modules I believe would be  
> easier to maintain, and those that don't get use will eventually fall  
> out of favor and may be lost or replaced from the more maintained  
> group of modules.  Survival of the fittest.

And the smallest, most concentrated set of modules is the individual module.


> - We already have had practice; bioperl-db, bioperl-run, bioperl- 
> network, and others.  Those that have been routinely maintained and  
> enjoy wide use (db, run, network) have survived; others not so much  
> (corba-related stuff, microarray, ext, etc., though the code is still  
> available if someone else wants to take it up and revive it!).

The reason some of these existing splits (microarray, ext) have fallen by 
the wayside? /Because/ they're splits. If they had been part of 
bioperl-live all along, they'd have been kept in a working, compatible 
state and would have been released along with everything else in 1.5.2.


> Disadvantages of a defined split:
> 
> - The initial headache of identifying which groups go where,  
> coordinating with those who rely on bioperl (GMOD, etc) on how this  
> will be set up, so on...

No need to worry about this with individual modules.


> - Separate groups of modules require testing together to ensure  
> functionality is consistent and maintained (something I think you  
> pointed out previously).

No need to worry.


> - I think an increased possibility of branching is possible.
> 
> - Extra headaches for devs, who have to keep track of the various  
> critical distributions and make sure they work well together.

No headaches.


