[Bioperl-l] Hilmar and Ewan debate SeqFeatures some more...

Thomas Down td2@sanger.ac.uk
Fri, 19 Jan 2001 19:45:44 +0000


On Fri, Jan 19, 2001 at 11:13:57AM -0800, Hilmar Lapp wrote:
>
> The BioJava project came up, as far as I can recall, with a
> Location class model separate from the Feature class. I put
> Matthew and Thomas on the cc to ask for their experience with this
> model, and what the feedback from the biojava community was so
> far.

Yes, we have this approach (well, strictly speaking we have
a Location interface plus various implementations).  It's worked
pretty well for us so far -- any type of feature can have
any type of location attached to it (point, range, compound),
and it's efficient in terms of memory usage.
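
To make the shape of this concrete, here is a minimal sketch of the
idea. All names and signatures below are simplified and hypothetical;
this is not the actual BioJava API, just an illustration of a Feature
holding any kind of Location.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified interface; not the real BioJava API.
interface Location {
    int getMin();
    int getMax();
    boolean contains(int pos);
}

// A single position.
final class PointLocation implements Location {
    private final int pos;
    PointLocation(int pos) { this.pos = pos; }
    public int getMin() { return pos; }
    public int getMax() { return pos; }
    public boolean contains(int p) { return p == pos; }
}

// A contiguous, inclusive range.
final class RangeLocation implements Location {
    private final int min;
    private final int max;
    RangeLocation(int min, int max) {
        if (min > max) throw new IllegalArgumentException("min > max");
        this.min = min;
        this.max = max;
    }
    public int getMin() { return min; }
    public int getMax() { return max; }
    public boolean contains(int p) { return p >= min && p <= max; }
}

// Several blocks treated as one location (e.g. a spliced feature).
final class CompoundLocation implements Location {
    private final List<Location> blocks;
    CompoundLocation(List<? extends Location> blocks) {
        if (blocks.isEmpty()) throw new IllegalArgumentException("no blocks");
        this.blocks = new ArrayList<Location>(blocks);
    }
    public int getMin() {
        int min = Integer.MAX_VALUE;
        for (Location b : blocks) min = Math.min(min, b.getMin());
        return min;
    }
    public int getMax() {
        int max = Integer.MIN_VALUE;
        for (Location b : blocks) max = Math.max(max, b.getMax());
        return max;
    }
    public boolean contains(int p) {
        for (Location b : blocks) if (b.contains(p)) return true;
        return false;
    }
}

// The feature says *what*; the attached location says *where*.
final class Feature {
    private final String type;
    private final Location location;
    Feature(String type, Location location) {
        this.type = type;
        this.location = location;
    }
    String getType() { return type; }
    Location getLocation() { return location; }
}

class LocationSketch {
    public static void main(String[] args) {
        Feature snp = new Feature("SNP", new PointLocation(42));
        Feature mrna = new Feature("mRNA", new CompoundLocation(
            Arrays.asList(new RangeLocation(100, 200), new RangeLocation(300, 400))));
        System.out.println(snp.getType() + " at " + snp.getLocation().getMin());
        System.out.println(mrna.getType() + " covers 250? "
            + mrna.getLocation().contains(250));
    }
}
```

The point of the sketch is that Feature never needs to know which
kind of Location it has been given.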

We've also found that the Location objects can be kind-of useful
on their own -- I've got all sorts of scripts which use bare
Locations for tracking coverage, or even keeping track of available
space when working out an optimal GUI layout.
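
As an illustration of the coverage-tracking use, the sketch below
merges a set of bare ranges and counts the positions they cover. The
Span class and the method names are hypothetical stand-ins for a
range location, assuming inclusive coordinates; they are not taken
from BioJava.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a bare range location (inclusive coordinates).
final class Span {
    final int min;
    final int max;
    Span(int min, int max) {
        if (min > max) throw new IllegalArgumentException("min > max");
        this.min = min;
        this.max = max;
    }
}

class CoverageSketch {
    // Merge overlapping or directly adjacent spans into a sorted, minimal list.
    static List<Span> union(List<Span> spans) {
        List<Span> sorted = new ArrayList<Span>(spans);
        sorted.sort((a, b) -> Integer.compare(a.min, b.min));
        List<Span> merged = new ArrayList<Span>();
        for (Span s : sorted) {
            int last = merged.size() - 1;
            if (last >= 0 && s.min <= merged.get(last).max + 1) {
                // Extend the previous span rather than starting a new one.
                Span prev = merged.get(last);
                merged.set(last, new Span(prev.min, Math.max(prev.max, s.max)));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }

    // Total number of positions covered by at least one span.
    static int coveredBases(List<Span> spans) {
        int total = 0;
        for (Span s : union(spans)) total += s.max - s.min + 1;
        return total;
    }

    public static void main(String[] args) {
        List<Span> hits = new ArrayList<Span>();
        hits.add(new Span(1, 50));
        hits.add(new Span(40, 80));
        hits.add(new Span(200, 210));
        System.out.println("covered positions: " + coveredBases(hits));
    }
}
```

Nothing here refers to features at all, which is why the same code
can serve for sequence coverage or for laying out boxes in a GUI.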

I don't know exactly how this experience would translate into
your design, though.

> > People might be interested that I originally argued for an explicit
> > location object about 1 month ago. I don't now...
> > 
> > I am suggesting that SeqFeatures do not have an explicit location object,
> > but we subclass SeqFeatures into Split, Simple and Fuzzy, all inheriting
> > from a common SeqFeature interface

The only potential consideration is that this then makes any
further polymorphism of SeqFeature quite difficult.  We're
experimenting with polymorphic features in BioJava -- look
at the org.biojava.bio.seq.genomic package for lots of
useful sub-interfaces of Feature.

If you are thinking of ever going down this route, beware the possible
explosion of combinations of feature type and location type.

> > Benefits - (a) less objects (b) only one place where the client gets the
> > information and (c) more backwardly compatible.
> 
> I'd like to note here that 'less objects' is not a benefit by
> itself, unless loading modules imposes a significant run-time
> performance hit, which I think we agree it doesn't. Having less
> objects I think does constitute a benefit if it removes redundant
> definitions, and makes for a steeper learning curve of the API,
> that is, if they're easier to use. This is the point I doubt here:
> I think further inflating SeqFeatureI flattens the learning curve.
> And I think Location (where) and Feature (what) are not redundant.

Actually, my understanding is that the per-object overhead in
Perl is pretty high, especially for objects implemented as
hashes.  If you ever want to hold millions of SeqFeatures in
memory (a not unreasonable requirement, I'd suggest), a few
hundred bytes per location might come back with a vengeance.

Of course, this can probably be mitigated by implementing the
locations as C structs.  Is this approach currently being
used in BioPerl?

So I'm going to be inconclusive.  I like the separate
Locations design, but I'd suggest investigating the memory-usage
issues before deciding one way or the other.

    Thomas.