[DAS] maxbins
Andy Jenkinson
andy.jenkinson at ebi.ac.uk
Thu Jul 7 17:09:53 UTC 2011
On 7 Jul 2011, at 16:45, Gustavo Salazar wrote:
> Hey guys,
>
> Thanks Rafa for remind us the minutes of the workshop, it is a good place to start.
>
> Andy, what I meant as 'non-overlaping' strategy, was no really dividing the segment into bins, but using the numbers of requested bins just as the maximum number of features to return in the segment, which in that case is can be selected in a random way or by size, so the inclusion in the response of one feature in conditioned to no overlap with a previous selected feature. That obviously is no necessarily the most representative set of features but again I think that works in order to having a very simple maxbins implementation, although i can see that it doesn't really goes with the concept of bins.
What I am trying to get across is that "maxbins=700" is NOT "I want 700 features". It is "I have 700 pixels/subdivisions/something abstract to render this segment in". It is fine to have a simple binning strategy of "non-overlapping features" (although I don't like random selection as this will be inconsistent between page refreshes). But this does not mean that you will always get 700 features. The point is that the client is telling the server how much space the features will be rendered in, and it is up to the server to decide what is an appropriate number and distribution of features to return. So a more meaningful implementation of "non-overlapping features" would be to provide ALL non-overlapping features, where overlaps are BY BIN not by position. I may be missing something, but I can't see a case for wanting to give the first X features regardless of where they appear in the query segment.
> In the cases where the goal is to represent the score the way to go IMO is to create a feature that represent the bin(avg, max or min). And for those cases your question 1 is solved, because the feature is contributing to all the bins that is covering, we can think in a formula that consider the number of bases that features contribute to a bin(see below), and in this way give a more representative ranking. In the cases that the score is meaningless, the same feature can be used in all the bins that is covering.
This is one interpretation of a score-based algorithm, assuming that a feature is simply a score for the region of the genome. This is fine, and comes under the category of replacing the feature with a new one (although there is an issue of how notes/types etc are merged). However it is also quite feasible to have features that _have_ scores, but are not themselves _purely_ scores. Hypothetically, let's say genes are ranked by score, and as we zoom out we want to see the highest-ranking genes in the region. The algorithm would be expected to filter features, not modify/replace them. It is not an insurmountable problem, it just needs to be recognised that there is a difference between giving each bin a score, and choosing features in bins based on score.
> About the rounding issue I think thats a complete problem of the server, meaning that if the client is asking for 700 bins, it shouldn't care about how it was calculated and the response should be the 700 features that he requested.
Again, 700 bins does NOT mean 700 features. Remember also that bins is an expression of something that is specific to the client. When rendering those bins, the client must convert between position and bin so it is important that it does so in the same way that the server did it. The client internally has one view of the positions that are represented by each bin, and if the rounding is not the same then it will render the feature "out by one bin".
Consider this example. My client has divided a segment of 1001 bases into 1000 bins. Note 1 of these bases has to go into a bin with another base. After handing that to the server, let's imagine that the server assigns a score to each bin and returns a feature for each bin like this:
feature 1 = start 1, end 2 = score 56
feature 2 = start 3, end 3 = score 14
feature 3 = start 4, end 4 = score 97
...
feature 999 = start 999, end 999 = score 11
feature 1000 = start 1000, end 1000 = score 65
The client then takes the start/end positions and converts them into rendering space:
bin 1 = feature 1 = score 56
bin 2 = feature 1 = score 56
bin 3 = feature 2 = score 14
bin 4 = feature 3 = score 97
...
bin 1000 = feature 999 + feature 1000 = score ???
Now what has happened here is that the server and client have not used the same rounding mechanism for assigning positions into bins, and thus produce different results.
> From the side of the server we can divide the problem in the ones based on continuos values, usually based on the score, and the ones with discrete values(frequency of types, etc). I wont go in this email to the discrete case because i haven't though about it :-)
>
> **WARNING** Dunno how this is going to look in pure text.
> **WARNING** calculation in the example may not be exact.
>
> For the continuos case, even that conceptually speaking is not possible to speak about a fraction of a base, we can know what is the proportion of its contribution for an specific bean. And from there we can calculate the score of the compendium feature(thinking in average), which will represent the bin by the formulae:
Yes this is a consideration for the algorithm of how to derive a score for a bin, but it is an advanced implementation detail of the algorithm. It is largely orthogonal to what I mean by consistency of rounding - that is, deciding what start/end positions to _return_ in the DAS XML for a particular bin (which must be expressed as a whole number). Your features response is stating something like "bases X to Y have a score of Z". Whilst you can divide bases into multiple bins to derive the score, you still have to say which _whole_ bases that score applies to.
However what I would say about the below algorithm is that, given that you have to express a region that the score applies to in whole numbers, dividing by "contribution" ends up presenting something that is not entirely true because the score for a feature also includes "a bit of the next base too". You're also making an assumption that a score for base 12345 is contributing 0.25 of its score to 12344.75, which is only one way of looking at it (as I was saying in my last email). In reality a start position of 12345 usually means "12345.0 and upwards" whereas an end position of 12345 usually means "everything less than 12346.0". For these reasons, I think weighting by contribution is not something I'd be keen on implementing.
>
> Sb = ∑(scoren * contributionn) / (length/#bins)
>
> For instance, for a segment request of 7 bases with maxbins of 3 we can have an annotation with its score per base like this:
>
> |0.5|0.4|0.5|0.4|0.4|0.5|0.6|
>
> and then the score for the 3 bins will be:
>
> S1 = (0.5*1 + 0.4*1 + 0.5*0.33)/2.33 = 0.457
> S2 = (0.5*0.66 + 0.4*1 + 0.4*0.66)/2.33 = 0.426
> S3 = (0.4*0.33 + 0.5*1 + 0.6*1)/2.33 = 0.528
>
> So the client receives 3 bins as requested an is its responsibility how to draw those but he asked for this number so we can assume that the client knows what it is doing.
>
> In the cases where there are positions with non annotations or more than one annotation per position, we may generalise the formulae by changing the divisor to ∑(contributionn)
>
> Sb = ∑(scoren * contributionn) / ∑(contributionn)
>
> I know that won't work for all the cases but I think it is one of the strategies to implement and that the data provider can choose to use or not.
>
> What do you guys think?
>
> Cheers,
>
> Gustavo.
>
> On 6 Jul 2011, at 17:54, Andy Jenkinson wrote:
>
>> Hi Gustavo,
>>
>> I'm not sure what you mean by "non-overlapping" features exactly. For a maxbins of 700, the thing to avoid IMO is returning the first (or random) 700 features because this does not divide the segment up into "bins". If all 700 are in the first bin (which might conceivably be represented by a single pixel on the client), the client might still only be able to show one/a few, whilst the rest of the pixels show nothing.
>>
>> Each procedure should divide the segment into 700 bins, and sort features into them by position. Some features might fall into only one bin, others might fall into more than one and may or may not be overlapping. A filter should then be applied to choose which features in each bin should be returned. Such a filter might be based on score (e.g. highest, lowest, sum, mean/median/mode averages), or some other factor. You could also create a new feature representing all those in the bin (e.g. a feature count).
>>
>> In my mind there are two complicating factors:
>>
>> 1. How to decide which features to return if they are of different sizes and cover a different number of bins.
>>
>> Perhaps it makes sense if a feature is selected for one bin that it should be selected for all the other bins it covers, for example, but what if the algorithm is using highest score and the feature selected for the first bin has a substantially lower score than another feature in the second bin?
>>
>> 2. Rounding. If you're tired or losing interest it's probably best to stop reading now...
>>
>> Consider segment X:12345,98765 (86421 bases) and a maxbins of 1000. Assuming bins are of equal size, each is 86.421 bases long. Obviously it's not possible to express fractions of a base in DAS, so it is important that the server and the client interpret this in the same way. Firstly it's important not to round the bin size at the beginning, which would create an error in the total length or number of bins. So the first bin is >= 12345 and <= 12431.421, and the second is > 2431.421 and <= 12517.842. Which bin(s) does X:12431 fall into? You might be tempted to say "easy, it's in bin 1". But would your answer change if it was a feature at X:12400,12431 [which really means X:12400,12431.99999], or a feature at X:12431,12500? Basically what I am getting at is, do we count an end position as being 12431.0 or 12431.9999999? I believe Ensembl does the former but am not 100% sure and this is probably not strictly speaking accurate.
>>
>> Sorry about the complicated numbers...
>>
>> Cheers,
>> Andy
>>
>> On 6 Jul 2011, at 14:53, Gustavo Salazar wrote:
>>
>>> Hey guys,
>>>
>>> One of the topics in the workshop was the idea of having a set of strategies for maxbins, and we said we will discuss it here... so this is my call to hear ideas about it, i might have a some spare time soon and if we get a couple of good strategies I can implement them in mydas as part of its core, so a datasource provider can choose to use one of the predefined strategies or to define a particular algorithm if is their wish.
>>>
>>> I suppose the easiest maxbins strategy is to return the X random non-overlaping features in he segment.
>>>
>>> Any other Ideas?
>>>
>>> Regards,
>>>
>>> Gustavo.
>>>
>>>
>>> _______________________________________________
>>> DAS mailing list
>>> DAS at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/das
>>
>
More information about the DAS
mailing list