[Bioperl-l] bioperl reorganization

Sat Jul 18 03:14:49 UTC 2009

My 2c...

On Jul 17, 2009, at 12:01 PM, Jason Stajich wrote:

> Will try to weigh in more, a little bit of stream of consciousness  
> to let you know I'm thinking about it.  Tough summer to focus much  
> on this.

Yes, for me as well.  That will change soon (approx two weeks) ;>

> It's too bad we are apparently the laughing stock of Perl gurus, but  
> it would be great to see how to modernize aspects of the development.
>
> I'm curious how it will work if we have dozens of separate
> distros: won't we have a hard time keeping track of what directory
> things are in? Will there have to be a master list of what version
> and what modules are in what distro now?

I don't think we're a laughingstock as much as we haven't had the time  
to dedicate towards this (and much of this occurred at a point early  
on, with that whole 'Cathedral and Bazaar' esr-based thingy).  BTW,  
those same gurus shouldn't speak: perl core is just as bad and riddled  
with worse bugs, though rgs and co. wouldn't admit it.

In fact, base.pm itself has a nasty one; I'm surprised no one in the  
bioperl community has noticed it yet (it's listed as a bug on RT I  
think):

pyrimidine1:biomoose cjfields$ perl -MBio::SeqIO -e 'print  
$Bio::SeqIO::VERSION."\n"'
1.0069
pyrimidine1:biomoose cjfields$ perl -MBio::SeqIO -e 'print  
$Bio::Root::IO::VERSION."\n"'
-1, set by base.pm

A module loaded through base.pm can end up with its VERSION set to
this placeholder instead of a real version number.  This hasn't become
an issue in bioperl yet (it's really an edge case), but several devs
have run into it.  And really, why set VERSION to a string like
'-1, set by base.pm'?
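
Until that is fixed, code that reports or compares versions could guard against the placeholder.  A minimal sketch, not taken from bioperl: the package names are hypothetical stand-ins, and the placeholder is assigned by hand so the example does not depend on which base.pm is installed.

```perl
use strict;
use warnings;

# Return a package's $VERSION, or nothing when it is missing or is only
# the placeholder string shown above.
sub real_version {
    my ($pkg) = @_;
    no strict 'refs';
    my $v = ${"${pkg}::VERSION"};
    return if !defined $v || $v =~ /set by base\.pm/;
    return $v;
}

# Hypothetical packages standing in for a module and its base class.
# The placeholder is set manually here (an assumption for illustration).
{ package My::SeqIO;    our $VERSION = '1.0069'; }
{ package My::Root::IO; our $VERSION = '-1, set by base.pm'; }

for my $pkg (qw(My::SeqIO My::Root::IO)) {
    my $v = real_version($pkg);
    print "$pkg: ", (defined $v ? $v : 'no real version'), "\n";
}
```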

Anyway, re: versioning, the way I think about it, if we have a small  
very stable core with version X, and a focused very stable module  
group with version Y, other distributions would have a separate  
version and require subgroup version Y (which would in turn require  
core version X).  CPAN would take care of it.  This isn't much  
different from what occurs every day on CPAN anyway (Jay's Catalyst,  
Moose and MooseX, and so on).  In fact, several Moose-requiring  
distributions don't require the latest Moose.
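
To make the X/Y chain concrete, here is a rough sketch.  The distribution names and version numbers are made up for illustration; each entry lists only its direct prerequisites, much as PREREQ_PM does in a Makefile.PL, and the helper walks the chain the way a CPAN client would.

```perl
use strict;
use warnings;

# Hypothetical distributions and versions, for illustration only.
my %requires = (
    'BioPerl-Core'   => {},
    'BioPerl-Align'  => { 'BioPerl-Core'  => '1.7' },  # subgroup Y needs core X
    'BioPerl-PopGen' => { 'BioPerl-Align' => '0.3' },  # other distro needs only Y
);

# Collect everything a distribution pulls in, directly or indirectly.
sub all_prereqs {
    my ($dist, $seen) = @_;
    $seen ||= {};
    for my $dep (sort keys %{ $requires{$dist} || {} }) {
        next if exists $seen->{$dep};
        $seen->{$dep} = $requires{$dist}{$dep};
        all_prereqs($dep, $seen);
    }
    return $seen;
}

my $needed = all_prereqs('BioPerl-PopGen');
print "$_ >= $needed->{$_}\n" for sort keys %$needed;
```

The top-level distribution never mentions the core directly; it only names the subgroup, and the core requirement follows from that.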

> When I do a SVN (or git) checkout do I need to checkout each of  
> these in its own directory?  Or will there be a master packaging  
> script that makes the necessary zip files for CPAN submission?

Not sure; that would be up to us I suppose.  I think it would be  
easier to maintain and release if they were separate or packaged up as  
Jay suggests.

> If they are in separate directories are we organizing by conceptual  
> topic (phylogenetics, alignment, database search) or by namespace of  
> the modules?

By topic, retaining namespaces.  We have a basic Bio::* directory  
structure already in place for various generic terms (Tools, DB, etc),  
so I see this crossing simple namespaces very easily.  And as I  
pointed out to Robert, several of those could possibly go together.

> Do all the 'database' modules live together - probably not - so do
> we name them bioperl-db-remote, bioperl-db-local-index,
> bioperl-db-local-sql, etc?  Really, bioperl-db is somewhat focused on
> sequences and features, but what about things that integrate multiple
> data types - like biosql?

I don't see bioperl-db (BioSQL) being split up.  I think it's too  
intrinsically linked and cohesive (it's almost a separate core unto  
itself), so it would be counterproductive to do so.

Maybe have bioperl-db become bioperl-biosql.  Web-based =
bioperl-remotedb.  Local = bioperl-localdb.  OBDA = bioperl-obda.

> If they are in separate directories, what about all the test data
> that might be shared?  Is this replicated among all the
> sub-directories?  How do we do a good job keeping that up to date?
> Could we have a test-data distro instead, with symlinks within SVN?

We have to see how much is actually shared and proceed from there.  I  
would like to eventually resurrect the idea of a separate biodata repo  
that we could just ftp the data from as needed.  That would cut down  
on the package size quite a bit, but I'm not sure how feasible that is  
from the testing point of view (would we have to skip all tests if  
there were no network access?).
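
One way the skipping could look, as a sketch only: the BIODATA_NETWORK_TESTS variable and the biodata.example.org host are placeholders I made up, not an existing switch or server.  Tests that need remote data run only when the user opts in and the host answers; otherwise they are skipped rather than failed.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use Test::More;

# Return 1 if a TCP connection to $host:$port succeeds within $timeout
# seconds, 0 otherwise.
sub network_available {
    my ($host, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return 0 unless $sock;
    close $sock;
    return 1;
}

# Hypothetical opt-in switch and placeholder host (assumptions).
my $want_network = $ENV{BIODATA_NETWORK_TESTS};

SKIP: {
    skip 'network tests not requested', 1 unless $want_network;
    skip 'remote test data host unreachable', 1
        unless network_available('biodata.example.org', 80, 5);
    pass('remote test data host is reachable');
}

done_testing();
```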

> For some other obvious modules that can be split off and
> self-contained, each of these could be a package.  I would estimate
> more than 20 packages depending on how Bio::Tools are carved up.
> - I think Bio::DB::SeqFeature needs to be split off for sure; this is
> a nice logical peeling off.  Could be another test case since it is
> a Gbrowse dependency
> -  Bio::DB::GFF as well for the same reasons.

Completely agree (and I think Lincoln would like this as well).

> -  Bio::PopGen - self contained for the most part, but depends on  
> Bio::Tree and Bio::Align objects

Could list those as a required dependency.

> -  Bio::Variation
> -  Bio::Map and Bio::MapIO
> -  Bio::Cluster and Bio::ClusterIO
> -  Bio::Assembly
> - Bio::Coordinate
>
> My nightmare is that we're going to have to manage a lot of 'use XX
> 1.01' statements to enforce version requirements for the dependencies
> on the interface classes, and keep these all up to date.
> The version was implicit when they were all part of the same big
> distro.

Right.  But it also becomes a maintenance problem when serious bugs in  
one module impede the needed release of others to CPAN.
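
For reference, the 'use XX 1.01' requirement Jason describes boils down to a call to the class's VERSION method, which dies when the installed version is too old.  A small sketch, using a hypothetical stand-in for an interface class that would ship in a separate distro:

```perl
use strict;
use warnings;

# Hypothetical interface class; the name and version are made up.
{ package My::Core::SeqI; our $VERSION = '1.006'; }

# Return 1 if $class is at least version $min, 0 otherwise.  This is the
# same check that "use My::Core::SeqI 1.005;" performs at compile time.
sub meets_version {
    my ($class, $min) = @_;
    return eval { $class->VERSION($min); 1 } ? 1 : 0;
}

print meets_version('My::Core::SeqI', '1.005') ? "ok\n" : "too old\n";
print meets_version('My::Core::SeqI', '1.010') ? "ok\n" : "too old\n";
```

With separate distros, each of these minimums has to be stated once per dependency, which is the bookkeeping cost being weighed against independent releases.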

> Also the splits need not only include one namespace if need be I  
> guess but we have generally grouped things by namespace.
>
> What do you want to do about bioperl-run?  Do we make a set of  
> parallel splits from all of these?  I think at the outset we need to  
> coordinate the applications supported here in some sort of loose  
> ontology - the namespaces were not consistently applied so we have  
> some alignment tools in different directories, etc.  So the  
> namespace sort of classifies them but it could be better.  One of  
> the challenges of multiple developers without a totally shared  
> vision on how it should be done.

We could split bp-run and Tools, pairing the wrappers with the  
relevant parser modules.  Not sure if this can be done with SearchIO  
as well but it could be tested to see how feasible that would be.

> I'm not convinced that the Bio::Graphics splitoff has been painless  
> so we should take stock of how that is working.

Really?  Lincoln has made several fixes lately on CPAN, so I thought  
everything was going well.  If anything I would think the lack of  
additional 1.6.x bioperl releases has probably held Gbrowse 2.0 up  
more due to Bio::DB::SeqFeature (my fault, but as you know life and  
job take precedence sometimes).

> It seems like this split off would be a way to better streamline  
> things in bioperl so that modern versions of bioperl might be able  
> to better interface with things like Ensembl again too.
>
> How much of this effort is worth triaging on the current code versus  
> the efforts we want to make on a cleaner, simpler bioperl system  
> that appears to scare so many users (and potential developers) off.

I say triage away on a branch, but we need to indicate which ones to  
whittle out first.  The reason I believe we went for a larger split  
initially (as indicated on the wiki page) was to push something  
forward and not get too bogged down in the details.  But we may as  
well go full throttle and do this right away.

> Okay I rambled, hope that was helpful.
>
> -jason
> --
> Jason Stajich
> jason at bioperl.org

Very, very helpful.  Now I need a beer.

chris