[DAS] maxbins

Andy Jenkinson andy.jenkinson at ebi.ac.uk
Wed Jul 6 15:54:56 UTC 2011


Hi Gustavo,

I'm not sure what you mean by "non-overlapping" features exactly. For a maxbins of 700, the thing to avoid IMO is returning the first (or random) 700 features because this does not divide the segment up into "bins". If all 700 are in the first bin (which might conceivably be represented by a single pixel on the client), the client might still only be able to show one/a few, whilst the rest of the pixels show nothing.

Each procedure should divide the segment into 700 bins, and sort features into them by position. Some features might fall into only one bin, others might fall into more than one and may or may not be overlapping. A filter should then be applied to choose which features in each bin should be returned. Such a filter might be based on score (e.g. highest, lowest, sum, mean/median/mode averages), or some other factor. You could also create a new feature representing all those in the bin (e.g. a feature count).

In my mind there are two complicating factors:

1. How to decide which features to return if they are of different sizes and cover a different number of bins.

Perhaps it makes sense if a feature is selected for one bin that it should be selected for all the other bins it covers, for example, but what if the algorithm is using highest score and the feature selected for the first bin has a substantially lower score than another feature in the second bin?

2. Rounding. If you're tired or losing interest it's probably best to stop reading now...

Consider segment X:12345,98765 (86421 bases) and a maxbins of 1000. Assuming bins are of equal size, each is 86.421 bases long. Obviously it's not possible to express fractions of a base in DAS, so it is important that the server and the client interpret this in the same way. Firstly it's important not to round the bin size at the beginning, which would create an error in the total length or number of bins. So the first bin is >= 12345 and <= 12431.421, and the second is > 2431.421 and <= 12517.842. Which bin(s) does X:12431 fall into? You might be tempted to say "easy, it's in bin 1". But would your answer change if it was a feature at X:12400,12431 [which really means X:12400,12431.99999], or a feature at X:12431,12500? Basically what I am getting at is, do we count an end position as being 12431.0 or 12431.9999999? I believe Ensembl does the former but am not 100% sure and this is probably not strictly speaking accurate.

Sorry about the complicated numbers... 

Cheers,
Andy

On 6 Jul 2011, at 14:53, Gustavo Salazar wrote:

> Hey guys,
> 
> One of the topics in the workshop was the idea of having a set of strategies for maxbins, and we said we will discuss it here... so this is my call to hear ideas about it, i might have a some spare time soon and if we get a couple of good strategies I can implement them in mydas as part of its core, so a datasource provider can choose to use one of the predefined strategies or to define a particular algorithm if is their wish. 
> 
> I suppose the easiest maxbins strategy is to return the X random non-overlaping features in he segment. 
> 
> Any other Ideas?
> 
> Regards,
> 
> Gustavo.
> 
> 
> _______________________________________________
> DAS mailing list
> DAS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/das





More information about the DAS mailing list