[Bioperl-l] Bio::DB::GFF::Util::Binning

Fri Oct 20 02:13:52 UTC 2006

I know that there may be some changes resulting from new GFF3 implementations, 
but thought I would see if the following is useful anyway.

I implemented the R-tree binning scheme as used by Bio::DB::GFF::Util::Binning 
and as mentioned in this article:

I tested the following query on a normal table (no binning).  It assumes 
that you know the longest range in the table.  For example, in a table of 
human genes, the longest gene we know of is around 2.4Mb.

 SELECT COUNT(*) AS count FROM groups g WHERE g.start > max(0,[start-2.4Mb]) 
AND g.start < [end] AND g.end > [start] AND g.chromosome = '1'

So for the region 100Mb:101Mb (lower bound 100000000 - 2400000 = 97600000):

 SELECT COUNT(*) AS count FROM groups g WHERE g.start > 97600000 AND g.start < 
101000000 AND g.end > 100000000 AND g.chromosome = '1'

Here [start] and [end] define the region of interest.  This query outperforms 
the R-tree implementation in all tests that I have performed (for lengths of 
200bp to 10Mb across a whole chromosome).  Could this be of some practical use?
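
For anyone who wants to try the idea, here is a minimal sketch of the query 
above using Python's built-in sqlite3 module and an in-memory table.  The 
table layout, the index, the function names and the demo rows are all 
assumptions made for illustration and are not part of Bio::DB::GFF.  The 
identifiers "groups", "start" and "end" are quoted because they can clash 
with SQL keywords.

```python
import sqlite3

# Assumed upper bound on feature length (the ~2.4Mb figure from the post).
MAX_FEATURE_LEN = 2_400_000


def make_demo_db(rows):
    """Build a hypothetical in-memory 'groups' table with a (chromosome, start) index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "groups" (chromosome TEXT, "start" INTEGER, "end" INTEGER)'
    )
    conn.execute('CREATE INDEX groups_chr_start ON "groups" (chromosome, "start")')
    conn.executemany('INSERT INTO "groups" VALUES (?, ?, ?)', rows)
    return conn


def count_overlaps(conn, chromosome, start, end, max_len=MAX_FEATURE_LEN):
    """Count features overlapping [start, end], bounding the scan on start.

    A feature can only overlap the region if it starts no more than
    max_len before the region start, so the start column is bounded
    on both sides.
    """
    lower = max(0, start - max_len)
    sql = (
        'SELECT COUNT(*) FROM "groups" g '
        'WHERE g."start" > ? AND g."start" < ? '
        'AND g."end" > ? AND g.chromosome = ?'
    )
    return conn.execute(sql, (lower, end, start, chromosome)).fetchone()[0]


def count_overlaps_naive(conn, chromosome, start, end):
    """Same overlap test without the lower bound on start, for comparison."""
    sql = (
        'SELECT COUNT(*) FROM "groups" g '
        'WHERE g."start" < ? AND g."end" > ? AND g.chromosome = ?'
    )
    return conn.execute(sql, (end, start, chromosome)).fetchone()[0]


# Made-up demo rows: (chromosome, start, end).
DEMO_ROWS = [
    ("1", 99_000_000, 100_500_000),   # spans the region start
    ("1", 100_200_000, 100_300_000),  # fully inside the region
    ("1", 97_700_000, 100_000_001),   # long feature, just reaches the region
    ("1", 50_000_000, 50_100_000),    # far upstream
    ("1", 101_000_000, 101_500_000),  # starts exactly at the region end
    ("2", 100_200_000, 100_300_000),  # different chromosome
]

conn = make_demo_db(DEMO_ROWS)
bounded = count_overlaps(conn, "1", 100_000_000, 101_000_000)
naive = count_overlaps_naive(conn, "1", 100_000_000, 101_000_000)
print(bounded, naive)
```

The two counts agree only as long as no feature in the table is longer than 
max_len; a longer feature that starts before the lower bound would be 
missed by the bounded query.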