[Bioperl-l] WARNING INCOMING: collection consolidation

Wed Feb 26 19:18:15 EST 2003

Hilmar Lapp wrote:

> Just as an aside, a little more communication about what's going on in 
> the freaky branch wouldn't hurt if this changes a lot of things (as 
> opposed to adding things) and is ever to go into the main trunk ...

I agree, and I apologize if it seems mysterious.

Mostly it's the collection consolidation.  I've been holding back from 
checking in the bulk of my work on that because the same branch has been 
used by Lincoln and others to test out some (basically unrelated) ideas 
(relative locations and I think GFF3), and I don't want to check stuff 
in until the tests pass for fear of breaking other people's ongoing work.

The unique identifier stuff is also unrelated and was a quick answer to 
a short discussion that Lincoln and I had about the bulkiness of the 
existing IdentifierI interface and my desire to have a lighter-weight 
one that could unify the disparate concepts of 'unique identifier' that 
I find confusing in BioPerl.  It has so far remained sequestered on the 
freaky branch because we've all had so many other things to squabble about.

The collection consolidation has been briefly mentioned on the list, 
mostly as a warning because it will affect users of feature collections, 
including DasI, GFF, and the gbrowse stuff.  The discussion brought up a 
lot of important issues that are still unresolved, particularly about a) 
handling relative ranges, b) the relationship between sequences and 
their annotations, and c) naming conventions.  I have had to trudge 
through with these things up in the air, so I've made some working 
decisions:

a) I've added seq_id() to RangeI, but have documented that it can 
remain undef and that's okay.  I've also created a RelRangeI (and an 
implementation, RelRange) that adds accessor methods for absolute 
start, end, and strand values, utility methods for conversion between 
absolute and relative range values, and an absolute() flag for forcing 
absoluteness (this all came from the Bio::DB::GFF::RelSegment class).  
My new interface Bio::SeqFeature::SegmentI isa RelRangeI, and it is the 
only thing besides RelRange that presently extends/implements RelRangeI.

b) I'm just using the SeqFeatureI stuff as-is because I don't yet 
understand the proposed new model; I'm a bit wary about how that will 
work with the new Bio::SeqFeature::CollectionI stuff but I'm excited 
for the challenge.

c) I'm sticking with (the name) Bio::SeqFeature::CollectionI for now 
because I'm lazy and we can't seem to decide if it should be 
Bio::SeqFeatureCollectionI instead; this is a minor change downstream 
if necessary.

On the whole the plan is to make sure that things remain 
backwards-compatible where possible.  The collection consolidation 
unites many existing classes that provide filtered access to feature 
lists, including Bio::SeqFeature::CollectionI, 
Bio::SeqFeature::Collection, Bio::Das, Bio::DasI, Bio::Das::Segment, 
Bio::DB::GFF, and Bio::DB::GFF::Segment.  We've also made a new interface 
for _providers_ of collections, to unify access to databases and DAS 
servers and other things that store features.  This is needed because 
gbrowse currently gets unified access to Das and GFF data sources via 
the DasI interface, which is poorly named and poorly placed for a 
generic data access interface.  The result is three new interfaces in 
Bio::DB: Bio::DB::FeatureProviderI, Bio::DB::SequenceProviderI, and 
Bio::DB::SegmentProviderI, where the last is a simple extension of the 
first two.  SequenceProviderI isa Bio::DB::RandomAccessI and 
a Bio::DB::UpdateableSeqI.  All three interfaces provide a minimal core 
set of methods for adding, retrieving, updating, and deleting (features 
or sequences) from a data store.
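
The shape of the three provider interfaces can be sketched roughly as 
follows.  Again this is Python used as neutral pseudocode, not the 
BioPerl interfaces themselves: the method names and the in-memory 
implementation are assumptions for illustration only, and the sketch 
leaves out the RandomAccessI/UpdateableSeqI inheritance.

```python
from abc import ABC, abstractmethod


class FeatureProvider(ABC):
    """Hypothetical analogue of a feature provider: CRUD for features."""

    @abstractmethod
    def add_feature(self, feature_id, feature): ...

    @abstractmethod
    def get_feature(self, feature_id): ...

    @abstractmethod
    def update_feature(self, feature_id, feature): ...

    @abstractmethod
    def remove_feature(self, feature_id): ...


class SequenceProvider(ABC):
    """Hypothetical analogue of a sequence provider: CRUD for sequences."""

    @abstractmethod
    def add_sequence(self, seq_id, sequence): ...

    @abstractmethod
    def get_sequence(self, seq_id): ...

    @abstractmethod
    def update_sequence(self, seq_id, sequence): ...

    @abstractmethod
    def remove_sequence(self, seq_id): ...


class SegmentProvider(FeatureProvider, SequenceProvider):
    """Simple extension of the two interfaces above; adds nothing here."""


class InMemoryProvider(SegmentProvider):
    """Toy data store backed by dictionaries, for illustration only."""

    def __init__(self):
        self._features = {}
        self._sequences = {}

    def add_feature(self, feature_id, feature):
        self._features[feature_id] = feature

    def get_feature(self, feature_id):
        return self._features.get(feature_id)

    def update_feature(self, feature_id, feature):
        if feature_id not in self._features:
            raise KeyError(feature_id)
        self._features[feature_id] = feature

    def remove_feature(self, feature_id):
        return self._features.pop(feature_id, None)

    def add_sequence(self, seq_id, sequence):
        self._sequences[seq_id] = sequence

    def get_sequence(self, seq_id):
        return self._sequences.get(seq_id)

    def update_sequence(self, seq_id, sequence):
        if seq_id not in self._sequences:
            raise KeyError(seq_id)
        self._sequences[seq_id] = sequence

    def remove_sequence(self, seq_id):
        return self._sequences.pop(seq_id, None)
```

A client written against the combined interface would not need to know 
whether a database, a DAS server, or something like the toy store above 
sits behind it.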

So far there's nothing (else) major here.  Some existing things will be 
deprecated, such as Bio::DB::GFF::RelSegment.  Some existing things will 
implement additional interfaces (e.g., those many collections will now 
implement the common Bio::SeqFeature::CollectionI interface).

I do not think that this email will suffice as a request for comment, 
but comments are welcome.  When it gets closer to real (like when I can 
get the tests to succeed and can check it all in to the freaky branch) I 
will get back to this list with a real proposal and can refer people to 
its working implementation.  I hope that the initial investment will pay 
off.  This is all groundwork for an overhaul of gbrowse's data access 
methodology, with the goal of making gbrowse more component-based and 
allowing for multiple simultaneous data sources of more disparate types.

Thanks for reading all the way through this long message.  Please accept 
my apology if it seems that we have failed to solicit sufficient input 
from the group; your comments will be appreciated.

:Paul