[DAS] maxbins

Thu Jul 7 15:45:23 UTC 2011

Hey guys,

Thanks Rafa for remind us the minutes of the workshop, it is a good place to start.

Andy, what I meant as 'non-overlaping' strategy, was no really dividing the segment into bins, but using the numbers of requested bins just as the maximum number of features to return in the segment, which in that case is can be selected in a random way or by size, so the inclusion in the response of one feature in conditioned to no overlap with a previous selected feature. That obviously is no necessarily the most representative set of features but again I think that works in order to having a very simple maxbins implementation, although i can see that it doesn't really goes with the concept of bins.

In the cases where the goal is to represent the score the way to go IMO is to create a feature that represent the bin(avg, max or min). And for those cases your question 1 is solved, because the feature is contributing to all the bins that is covering, we can think in a formula that consider the number of bases that features contribute to a bin(see below), and in this way give a more representative ranking. In the cases that the score is meaningless, the same feature can be used in all the bins that is covering.
About the rounding issue I think thats a complete problem of the server, meaning that if the client is asking for 700 bins, it shouldn't care about how it was calculated and the response should be the 700 features that he requested.
>From the side of the server we can divide the problem in the ones based on continuos values, usually based on the score,  and the ones with discrete values(frequency of types, etc). I wont go in this email to the discrete case because i haven't though about it :-)

**WARNING** Dunno how this is going to look in pure text.
**WARNING** calculation in the example may not be exact.

For the continuos case, even that conceptually speaking is not possible to speak about a fraction of a base, we can know what is the proportion of its contribution for an specific bean. And from there we can calculate the score of the compendium feature(thinking in average), which will represent the bin by the formulae:

Sb = ∑(scoren * contributionn) / (length/#bins)

For instance, for a segment  request of 7 bases with maxbins of 3 we can have an annotation with its score per base like this:

|0.5|0.4|0.5|0.4|0.4|0.5|0.6|

and then the score for the 3 bins will be:

S1 = (0.5*1 + 0.4*1 + 0.5*0.33)/2.33 	= 0.457
S2 = (0.5*0.66 + 0.4*1 + 0.4*0.66)/2.33 	= 0.426
S3 = (0.4*0.33 + 0.5*1 + 0.6*1)/2.33 	= 0.528

So the client receives 3 bins as requested an is its responsibility how to draw those but he asked for this number so we can assume that the client knows what it is doing.

In the cases where there are positions with non annotations or more than one annotation per position, we may generalise the formulae by changing the divisor to ∑(contributionn)

Sb = ∑(scoren * contributionn) / ∑(contributionn)

I know that won't work for all the cases but I think it is one of the strategies to implement and that the data provider can choose to use or not.

What do you guys think?

Cheers,

Gustavo.

On 6 Jul 2011, at 17:54, Andy Jenkinson wrote:

> Hi Gustavo,
> 
> I'm not sure what you mean by "non-overlapping" features exactly. For a maxbins of 700, the thing to avoid IMO is returning the first (or random) 700 features because this does not divide the segment up into "bins". If all 700 are in the first bin (which might conceivably be represented by a single pixel on the client), the client might still only be able to show one/a few, whilst the rest of the pixels show nothing.
> 
> Each procedure should divide the segment into 700 bins, and sort features into them by position. Some features might fall into only one bin, others might fall into more than one and may or may not be overlapping. A filter should then be applied to choose which features in each bin should be returned. Such a filter might be based on score (e.g. highest, lowest, sum, mean/median/mode averages), or some other factor. You could also create a new feature representing all those in the bin (e.g. a feature count).
> 
> In my mind there are two complicating factors:
> 
> 1. How to decide which features to return if they are of different sizes and cover a different number of bins.
> 
> Perhaps it makes sense if a feature is selected for one bin that it should be selected for all the other bins it covers, for example, but what if the algorithm is using highest score and the feature selected for the first bin has a substantially lower score than another feature in the second bin?
> 
> 2. Rounding. If you're tired or losing interest it's probably best to stop reading now...
> 
> Consider segment X:12345,98765 (86421 bases) and a maxbins of 1000. Assuming bins are of equal size, each is 86.421 bases long. Obviously it's not possible to express fractions of a base in DAS, so it is important that the server and the client interpret this in the same way. Firstly it's important not to round the bin size at the beginning, which would create an error in the total length or number of bins. So the first bin is >= 12345 and <= 12431.421, and the second is > 2431.421 and <= 12517.842. Which bin(s) does X:12431 fall into? You might be tempted to say "easy, it's in bin 1". But would your answer change if it was a feature at X:12400,12431 [which really means X:12400,12431.99999], or a feature at X:12431,12500? Basically what I am getting at is, do we count an end position as being 12431.0 or 12431.9999999? I believe Ensembl does the former but am not 100% sure and this is probably not strictly speaking accurate.
> 
> Sorry about the complicated numbers... 
> 
> Cheers,
> Andy
> 
> On 6 Jul 2011, at 14:53, Gustavo Salazar wrote:
> 
>> Hey guys,
>> 
>> One of the topics in the workshop was the idea of having a set of strategies for maxbins, and we said we will discuss it here... so this is my call to hear ideas about it, i might have a some spare time soon and if we get a couple of good strategies I can implement them in mydas as part of its core, so a datasource provider can choose to use one of the predefined strategies or to define a particular algorithm if is their wish. 
>> 
>> I suppose the easiest maxbins strategy is to return the X random non-overlaping features in he segment. 
>> 
>> Any other Ideas?
>> 
>> Regards,
>> 
>> Gustavo.
>> 
>> 
>> _______________________________________________
>> DAS mailing list
>> DAS at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/das
>