[Biojava-l] Proposal: Making a more general Filter framework.

Thomas Down td2@sanger.ac.uk
Tue, 6 Mar 2001 12:13:03 +0000


[Warning: long e-mail ahead...]

        A generalized filtering model for BioJava
         
             By Thomas Down and Matthew Pocock


We've already got a fairly powerful mechanism for filtering
Feature objects, including AND and OR operators for building
complex filters.  A couple of weeks ago, Matthew wrote some
cool code in the FilterUtils class which is helping
FilterHolder implementer to provide optimized implementations
of the filter method.

Now, however, we're reaching the point where we want to filter
other types of object as well (in the first instance, the requirement
is for filtering Sequences, but in the slightly longer term
we'll want to filter all sorts of other object types -- Hit objects
from search results come to mind).  We could have a separate
SequenceFilter interface (plus implementations), but we'd then
need to duplicate things link AND and OR filters, and all the
stuff in FilterUtils *sigh*.

There's also another issue that's been lurking for a while.
We've got a hierarchical feature model, and the current
filtering system only allows this to be handled in two ways:

  - Filter only `top level' features in a given FeatureHolder
    (ignore hierarchy).

  - Filter recursively, and place all the matches in a single
    result FeatureHolder (equivalent to flattening the hierarchy
    before filtering).

This makes it potentially hard to ask questions like: `return all
the exons >100 bases which are parts of immunoglobulin genes'.  Okay,
contrived example but this is turning into a very real limitation.


The Solution (maybe).
---------------------

This is a proposal for how we might be able to go forward, based
on discussions between myself and Matthew (with input from the
acedb group at the sanger centre, plus some inspiration from
technologies like XPath).

Most of this code should go in a new package (org.biojava.utils.filter)

Please feel free to suggest, complain, or pick it to pieces.

First of all, filters should now be able to work on any
kind of object:

  public interface Filter {
      public boolean accept(Object o);
  }

Notes:

  1. The FeatureFilter interface still remains in place
     (as a subinterface of filter), both as a placeholder
     for standard Filter implementations which work on
     Features, and to reduce the impact on all code.

  2. Clearly, many operations are only meaningful on certain
     types of object (for instance, a filter which looks at
     a Feature's location).  These filters will appear as
     proper subsets of "Filter.ByClass(Feature.class)" (for
     those who haven't seen FilterUtils, this contains a method
     to tell you if one filter is a subset of another).

     We might also want to have a getAcceptedClass() method
     on Filter.  Suggestions? 


This allows us to filter Sequences (and Blast hits, and whatever).
But to sort out the issues with hierarchical data, we need a slightly
more radical change.

We propose an extended model for the filtering process, where a
filter-chain consists of alternating `filter' and `follow links'
stages.  The example query I suggested above can be represented
like this:

                       Gene                                  Exon
   get all             AND                get all            AND
  descendants   type=Immunoglobulin      children         length >100

  ----------->  ???????????????????  ----------------->  ?????????????

  [follow]           [filter]             [follow]          [filter]


This seems a much more powerful solution than the current
(rather problematic) `recurse' flag on the FeatureHolder.filter()
method.  It can also deal with some really exciting possibilities
such as pulling out all the sequences from one particular lab,
then filtering their features for some property.

As with the current `single stage' filters, I'd hope that
objects would analyse any FilterChain parsed in and try to
optimise their execution of it.  Just as the FilterUtils
class provides utilities to help find execution strategies
for `single pass' filters, standard execution planners
would be provided for FilterChains.

Notes:

  1. There is a distinction between `filter an object based
     on some child property' (e.g. select Sequences from a
     SequenceDB which contain an immunoglobulin gene), and
     `make a selection from the children of this object'
     (e.g. actually return those Gene features).  The formerr
     can be modelled using simple Filters, while the latter
     requires a link-follower.  In a relational database, the
     distinction between these two operations is blurred.
 
  2. As an example of the kind of optimization which could
     be performed, a Sequence object backed by data stored
     in a relational database might reasonably be able to
     re-write the filter chain above as a single SQL query.


I'd suggest the following APIs for the basic query object model.
APIs for the query optimizer utilities can be discussed later:

  public interface Filterable {
      public int size();
      public Iterator iterator();
      public Filterable filter(FilterChain fc);
   
      // Other methods?  Mutators?
  }

This is just a (very) lightweight collection interface.  This would
become the base interface for FeatureHolder, and also SequenceDB
(or perhaps we should have separate SequenceDB and FilterableSequenceDB?)

  public interface Follower {
      /**
       * Standard follower which just returns the same set
       * of objects.
       */

      public static final Follower IDENTITY;        

      /**
       * Return all children of the objects in the working set.
       */
 
      public static final Follower ALL_CHILDREN;

      /**
       * Return all descendants of objects in the working set.
       */

      public static final Follower ALL_DESCENDANTS;

      public Filterable followLinks(Filterable f)
          throws CannotFollowException;
  } 


Finally, there would be the filter-chain itself.  This is
just a list of tuples of (Follower, Filter).

  public final class FilterChain {
      public static final class Element {
          public Follower getFollower();
          public Filter getFilter();
      }

      public void addElement(Follower flwr, Filter fltr);
      // Methods for inserting and removing elements?

      public int length();
      public Element getElementAt(int pos);

      /**
       * Convenience method for turning a single-pass filter into
       * a FilterChain.
       */

      public static FilterChain singleFilter(Filter f);
  }




Can anyone see any flaws in this plan?  Is  it complete
overkill?  Comments?  Suggestions?  Offers of help when
implementing it?


Thank you for reading to the end of this...

    Thomas.