[Bioperl-l] Bio::FeatureHolderI

Matthew Pocock matthew_pocock@yahoo.co.uk
Wed, 20 Nov 2002 16:00:11 +0000 (GMT)


 --- Lincoln Stein <lstein@cshl.org> wrote: 
> The Bio::Das and Bio::DB::GFF get_SeqFeatures
> methods both support filters, so 
> I am happy if you formalize this.
> 
> I suggest two ways to filter, to start with:
> 
> 	(-tag => \@tag_list)
> 
> Filter on the basis of a list of primary tags.  The
> returned features are a 
> union or ORing the tags.
> 

Hi Lincoln,

So in the first case, could I say something like:

  $features->filter(
    -source => "blast",
    -type => "homology" );

While there's nothing wrong with this approach for the
really simple cases, you will quickly have users who
want or/and/not, and this API can never do that. It
can, of course, be expanded to handle enumerations over
values as you suggested, e.g. -source => [qw/blast
fasta/], and is fine for a flat space of features that
you want to separate into single tracks for rendering.
It's of limited use for data-mining.
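
A minimal sketch of that simple style, assuming features are plain
hash refs with source and type keys. The helper name filter_features
and the hash layout are invented for illustration; they are not the
Bio::SeqFeatureI API. Values listed for one key are ORed, separate
keys are ANDed, and there is nowhere to express a NOT:

```perl
use strict;
use warnings;

# Hypothetical helper: features are plain hash refs, not BioPerl objects.
sub filter_features {
    my ($features, %args) = @_;

    # Turn (-source => [...], -type => '...') into lookup tables per field.
    my %want;
    for my $key (keys %args) {
        (my $field = $key) =~ s/^-//;
        my $vals = $args{$key};
        $vals = [$vals] unless ref $vals eq 'ARRAY';
        $want{$field} = { map { $_ => 1 } @$vals };
    }

    # Keep a feature only if every requested field has an allowed value.
    return grep {
        my $f  = $_;
        my $ok = 1;
        for my $field (keys %want) {
            $ok = 0 unless $want{$field}{ $f->{$field} // '' };
        }
        $ok;
    } @$features;
}

my @features = (
    { source => 'blast',   type => 'homology' },
    { source => 'fasta',   type => 'homology' },
    { source => 'genscan', type => 'exon'     },
);

my @hits = filter_features(\@features,
    -source => [qw/blast fasta/],
    -type   => 'homology');
print scalar(@hits), " features matched\n";
```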

> 	(-filter => \&filter)
> 
> Pass each feature to the filter subroutine.  Return
> features for which the 
> subroutine returns true.

The second case, while being useful for quickly
writing custom filters, is useless for allowing feature
providers to optimize how they fetch clumps of
features. Implicitly, the filtering method must be
shown each feature - doing something equivalent to:

  return grep { $filt_sub->($_) } @features;

(though with working perl syntax ;) )

An opaque sub is of minimal use to implementers because
Perl doesn't expose the implementation code of a method as
Perl data. It would be fine in lisp or some other
nasty language. One common operation in BioJava is to
take a filter, work out that it could hit a pot of
features, transform the filter into a new one that is
suited to that pot (e.g. move locations, flip strands)
and pass it on. You can't do that with a sub.
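
To make that concrete, here is a rough sketch of a filter held as
data and rewritten for a child pot of features. Everything in it is
an assumption for illustration: the hash layout, the name
translate_filter, and the 1-based inclusive coordinates. It is not
how BioJava does it, only the general shape of the idea:

```perl
use strict;
use warnings;

# Hypothetical filter-as-data: a hash describing the constraint, not a sub.
# Rewrite a filter given in parent coordinates into the coordinates of a
# child feature set that starts at $child_start on the parent, is
# $child_length long, and may lie on the reverse strand ($flipped).
sub translate_filter {
    my ($filter, $child_start, $child_length, $flipped) = @_;

    # Move the location into the child's frame (1-based, inclusive).
    my $start  = $filter->{start} - $child_start + 1;
    my $end    = $filter->{end}   - $child_start + 1;
    my $strand = $filter->{strand};

    # Flip strands: mirror the interval and negate the strand.
    if ($flipped) {
        ($start, $end) = ($child_length - $end + 1, $child_length - $start + 1);
        $strand = -$strand;
    }

    # Return a new filter; the original is left untouched.
    return { %$filter, start => $start, end => $end, strand => $strand };
}

my $parent_filter = { op => 'overlaps', start => 1500, end => 1600, strand => 1 };
my $child_filter  = translate_filter($parent_filter, 1001, 1000, 1);
print "$child_filter->{start}..$child_filter->{end} strand $child_filter->{strand}\n";
```

A code ref offers no equivalent: it can be called, but not inspected
or rewritten.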

As I said before, you can go ahead and implement
$filt->accept($feature) based filtering tomorrow, but
other than reducing the number of get_foo_by_bar
methods that the user of the class is exposed to, it
won't buy you anything. Or you can define meta-data
and a filter language for this stuff, write the query
interpreters, get pots of features to expose schemas
for what they contain, and get real benefits in terms
of performance, but it will take 6 months of effort.
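
For reference, the quick accept-based version might look roughly
like the sketch below. The My::Filter::* package names are made up
(nothing like them exists in BioPerl as far as this sketch assumes),
and features are again plain hash refs. It does give and/not
composition, but each feature still has to be shown to the filter
one at a time:

```perl
use strict;
use warnings;

# Hypothetical filter classes, invented for this example.
package My::Filter::ByField;
sub new {
    my ($class, $field, @values) = @_;
    return bless { field => $field, ok => { map { $_ => 1 } @values } }, $class;
}
sub accept {
    my ($self, $f) = @_;
    return $self->{ok}{ $f->{ $self->{field} } // '' } ? 1 : 0;
}

package My::Filter::And;
sub new { my ($class, @parts) = @_; return bless { parts => \@parts }, $class }
sub accept {
    my ($self, $f) = @_;
    for my $part (@{ $self->{parts} }) {
        return 0 unless $part->accept($f);
    }
    return 1;
}

package My::Filter::Not;
sub new { my ($class, $inner) = @_; return bless { inner => $inner }, $class }
sub accept { my ($self, $f) = @_; return $self->{inner}->accept($f) ? 0 : 1 }

package main;

# "homology features that did not come from fasta"
my $filter = My::Filter::And->new(
    My::Filter::ByField->new(type => 'homology'),
    My::Filter::Not->new(My::Filter::ByField->new(source => 'fasta')),
);

my @features = (
    { source => 'blast',   type => 'homology' },
    { source => 'fasta',   type => 'homology' },
    { source => 'genscan', type => 'exon'     },
);

my @kept = grep { $filter->accept($_) } @features;
print join(', ', map { $_->{source} } @kept), "\n";
```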

To give you some idea, by fixing some filter
optimizations last week, Thomas and I were able to get
a query that combines Ensembl (via our adaptor code)
with two tables of GFF features and a DAS source down
from 45 mins to 5 mins. It's worth doing. It's worth
doing right.

> 
> Lincoln

This basically boils down to what you want to use
filters for - are they for reducing the amount of user
API, for moving data-search constraints out of code
and into data-structures, or for providing hooks for
optimizations? You can't guarantee that a solution
aimed at one of these will also do the others.

I'm going to lurk from now on - feel free to pinch a
working solution from BioJava 1.3 or roll your own
from the ground up or just have a working
perl<->daml/oil bridge.

/Signing off

Matthew
