[Bioperl-l] feature holder for testing overlaps, etc

Jason Stajich jason@cgt.mc.duke.edu
Mon, 20 May 2002 16:13:10 -0400 (EDT)


On Mon, 20 May 2002, Lincoln Stein wrote:

> Hi Jason,
>
> Would it be OK to overlay the DasI interface on top of features_in_range()
> and get_features()?  Then gbrowse will run on top of it.
>

That sounds like a great idea.  I'll look at the interface and see what it
would take to implement it.  Is Bio::SeqFeature::Collection an okay name
in everyone's mind?

> What if I want to combine those two methods to return features of a
> particular type that fall inside a particular range?  This is a very common
> optimization and will greatly help performance if implemented correctly.  The
> DasI overlapping_features() method works this way.  There are also the
> following methods:
>

I was thinking about something like this just this morning -
perhaps gbrowse could allow a set of features (and their
associated sequences) to be selected based on a feature range and/or some
feature metadata like:

$f->has_tag('gene') && grep { /$GENE/ } $f->each_tag_value('gene')

Let's tackle this after I get the range query working.
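
To make the combined range-plus-metadata idea concrete, here is a minimal
sketch. My::Feature is a hypothetical stand-in that mimics only the four
Bio::SeqFeatureI methods used here (start, end, has_tag, each_tag_value),
and select_features is an illustrative name, not an existing BioPerl or
gbrowse call.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Bio::SeqFeatureI object (assumption:
# only the methods used by the filter below are provided).
package My::Feature;

sub new {
    my ($class, %args) = @_;
    return bless {
        start => $args{start},
        end   => $args{end},
        tags  => $args{tags} || {},
    }, $class;
}
sub start          { $_[0]{start} }
sub end            { $_[0]{end} }
sub has_tag        { exists $_[0]{tags}{ $_[1] } }
sub each_tag_value { @{ $_[0]{tags}{ $_[1] } || [] } }

package main;

# Keep features that overlap [$start, $end] AND carry a 'gene' tag
# with at least one value matching $gene.
sub select_features {
    my ($features, $start, $end, $gene) = @_;
    return grep {
        my $f = $_;    # the inner grep reuses $_, so keep the feature here
        $f->start <= $end
            && $f->end >= $start
            && $f->has_tag('gene')
            && grep { /\Q$gene\E/ } $f->each_tag_value('gene')
    } @$features;
}

# Made-up example data.
my @features = (
    My::Feature->new(start => 100,  end => 500,  tags => { gene => ['BRCA2'] }),
    My::Feature->new(start => 400,  end => 900,  tags => { gene => ['TP53'] }),
    My::Feature->new(start => 5000, end => 6000, tags => { gene => ['BRCA2'] }),
    My::Feature->new(start => 450,  end => 460,  tags => { note => ['untagged'] }),
);

my @hits = select_features(\@features, 300, 1000, 'BRCA');
print scalar(@hits), " feature(s) selected\n";
```

This is a linear scan; the point of the indexed collection is to do the
range test first so the tag test only runs on a handful of candidates.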

> 	contained_features()  -- find features that are contained inside range
> 	contained_in()            -- find features that completely contain a range
>
> The way to fetch a range with a B-Tree is to use the DB_File object-oriented
> seq() method with a cursor of R_CURSOR.  This has to be coupled with a custom
> indexer that performs a numeric comparison, and the appropriate flags to
> allow you to fetch duplicate keys.  See the DB_File documentation for
> examples of this.
>
Thanks Lincoln.

I almost have it working. I just figured out that you probably can't mix
get_dup calls with calls to the cursor iterator, or else you'll only
get keys that have more than one value.  I'll commit the code and tests tonight if
it all works and we can expand from there.
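
For reference, a minimal sketch of the cursor approach Lincoln describes,
using only documented DB_File calls (BTREEINFO with R_DUP and a numeric
compare, then seq() with R_CURSOR and R_NEXT).  The data and the
starts_in_range name are made up, and the key here is simply the feature
start rather than a bin, so this only finds features that *start* inside
the range.  It is not the code I plan to commit.  It avoids get_dup
entirely and lets R_NEXT walk through the duplicate keys.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

# B-Tree that orders keys numerically and permits duplicate keys.
my $btree = DB_File::BTREEINFO->new;
$btree->{flags}   = R_DUP;
$btree->{compare} = sub { $_[0] <=> $_[1] };

# An undef filename gives an in-memory database.
my %index;
my $db = tie %index, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $btree
    or die "Cannot create in-memory B-Tree: $!";

# Made-up data: key = feature start, value = feature name.
for my $pair ([100, 'a'], [250, 'b'], [250, 'c'], [900, 'd'], [1200, 'e']) {
    $db->put($pair->[0], $pair->[1]);
}

# Return the values whose key lies in [$start, $end].
sub starts_in_range {
    my ($db, $start, $end) = @_;
    my @hits;
    my ($key, $value) = ($start, '');

    # R_CURSOR positions the cursor on the smallest key >= $start;
    # R_NEXT then advances one record at a time, duplicates included.
    for (my $status = $db->seq($key, $value, R_CURSOR);
         $status == 0;
         $status = $db->seq($key, $value, R_NEXT))
    {
        last if $key > $end;
        push @hits, $value;
    }
    return @hits;
}

my @found = starts_in_range($db, 200, 1000);
print "found: @found\n";
```

A real overlap query also has to catch features that start before the
range and extend into it, which is where the binning scheme comes in.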

-jason


> Lincoln
>
>
> On Wednesday 15 May 2002 18:45, Jason Stajich wrote:
> > Here is the proposal for an in-memory SeqFeature collection interface
> > and object tentatively called Bio::SeqFeature::FeatureCollectionI and
> > Bio::SeqFeature::Collection - which is analogous to ChrisM's described
> > IntersectionGraph (maybe it can inherit from an InterfaceGraphI if
> > you want to help abstract those methods out).
> >
> > SeqFeatureCollectionI interface
> > methods:
> > add_features    -- add a set of features to the collection
> >
> > features_in_range -- returns a list of features that are contained in
> > 		     a specified start & end, range, or LocationI.
> > 		     Optionally taking into account strand in the same
> > 		     way the Range overlap/contains methods do.
> > 		     Accept a flag as to whether to test for features
> > 		     that overlap or are completely contained.
> > get_features(-tag => $tag) - returns a list of features that have the
> > 		     requested tag (this will only be more efficient
> > 		     than grepping on the list if the # of features is
> > 		     large).
> >
> > It could be reasonable to let Bio::Seq objects use a
> > SeqFeatureCollection to hold their features depending on the
> > efficiency here - but one thing at a time.
> >
> > Bio::SeqFeature::Collection would be implemented with a BDB B-Tree and
> > use Lincoln's bin method from Bio::DB::GFF::Util::Binning.  I'm not
> > sure how to get things that fall within a range from the BDB B-Tree
> > interface - have to pull from a sorted list somehow and most of the
> > examples are for duplicate hash keys, hints appreciated.
> >
> > -jason
>
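
As a rough illustration of the three methods proposed above, here is a
naive in-memory version with no B-Tree and no binning.  Everything in it
is an assumption made for the sketch: the My::Collection name, the
-start/-end/-contain argument names, and features represented as plain
hashes with start, end and tags keys.  Strand handling is left out.

```perl
use strict;
use warnings;

package My::Collection;    # hypothetical name, for illustration only

sub new {
    my ($class) = @_;
    return bless { features => [] }, $class;
}

# Add a set of features; returns how many were added.
sub add_features {
    my ($self, @features) = @_;
    push @{ $self->{features} }, @features;
    return scalar @features;
}

# Features overlapping [-start, -end], or fully contained in it
# when -contain is true.
sub features_in_range {
    my ($self, %args) = @_;
    my ($start, $end) = @args{qw(-start -end)};
    my $contain = $args{-contain};
    return grep {
        my $f = $_;
        $contain
            ? ($f->{start} >= $start && $f->{end} <= $end)
            : ($f->{start} <= $end   && $f->{end} >= $start)
    } @{ $self->{features} };
}

# Features that carry the requested tag.
sub get_features {
    my ($self, %args) = @_;
    my $tag = $args{-tag};
    return grep { exists $_->{tags}{$tag} } @{ $self->{features} };
}

package main;

my $collection = My::Collection->new;
$collection->add_features(
    { start => 10, end => 50,  tags => { gene => ['x'] } },
    { start => 40, end => 120, tags => { exon => [1] } },
);

my @overlapping = $collection->features_in_range(-start => 30, -end => 60);
my @contained   = $collection->features_in_range(-start => 1, -end => 60,
                                                 -contain => 1);
my @genes       = $collection->get_features(-tag => 'gene');
printf "%d overlapping, %d contained, %d with a gene tag\n",
    scalar @overlapping, scalar @contained, scalar @genes;
```

Every call here scans the whole list, which is exactly the cost the
B-Tree index is meant to avoid once the number of features gets large.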

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu