[DAS] querying for nonpositional annotations

Mon Aug 2 08:18:57 UTC 2010

  Interesting thread this - it's something that hasn't been properly 
discussed at previous DAS developer meetings.

On 30/07/2010 20:33, Dave Messina wrote:
> I too agree with Eugene.
>
> No magic numbers.
You're too late here. 0 *is already* a magic number in the normalised 
protein sequence world, since it indicates the translation start site 
for a coding sequence (i.e. the initial M). This is the cause of some 
ambiguity in the bio* bindings, and of confusion on the part of 
simpler-minded programmers like myself :)

> Types can be used for filtering, and actually you get more fine-grained control than simply positional or non-positional. (I use this technique now in DASher.) *
>
> In my opinion, the current spec as written is correct. That is, non-positional features don't just apply to the whole sequence, they apply to any part of the sequence.
Agreed. But read on...
> As an example, consider a journal reference — a particular protein was isolated by a lab, they wrote a paper about it, and deposited the protein sequence in a database. If you look at a subsequence of the protein sequence, that subsequence still derives from the paper, right? So therefore the feature containing that journal reference should still be attached to the subsequence.
>
> On that basis, I think the uniprot server is technically doing it wrong and should be changed, although I have to say that in practice it hasn't been an issue for me.
It's a difficult call. The uniprot server's behaviour is almost 
certainly due to the ambiguity between non-positional annotations 
that have start/end attributes (where start==end && start==0) and 
those that do not (the annotation is then usually derived from some 
other table, viz. the BioSQL schema). Other DAS servers do similar 
things, and kludges are needed to fix them.
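
To make the two cases concrete, here is a minimal Python sketch of the 
sort of check a client might apply. The Feature class and the 
is_non_positional function are hypothetical names used for illustration 
only (they are not part of any DAS library), and the sketch assumes 
start/end have already been parsed to integers, or left as None when the 
server omitted them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """Hypothetical, simplified stand-in for a parsed DAS feature."""
    feature_id: str
    start: Optional[int] = None  # None when the server sent no start
    end: Optional[int] = None    # None when the server sent no end


def is_non_positional(feature: Feature) -> bool:
    """Treat both encodings of 'non-positional' the same way."""
    # Case 1: no start/end at all (e.g. annotation pulled from another table).
    if feature.start is None and feature.end is None:
        return True
    # Case 2: start/end present but both set to the magic value 0.
    return feature.start == 0 and feature.end == 0
```

Normalising both encodings in one place at least keeps the kludge out of 
the rest of the client code.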

My only worry about expecting 'proper behaviour' is that, currently, 
I frequently see IDs with more non-positional annotation than 
positional (notwithstanding histogram-like continuous quantitative 
annotation such as running averages of predicted or observed local 
sequence properties). Enforcing compliance with the spec as written 
means that the average DAS metaserver (i.e. uniprot, or some server that 
aggregates sequence database info with other data) will send a huge 
non-positional header in response to every range-qualified feature 
request, which is pretty inefficient. It may not scale well, either, 
since the number of database cross-references is (still) increasing.

> * It might be nice, though, to add 'positional' and 'non-positional' types, which would be a way to grab all of the existing positional or non-positional types in one go. (currently it's necessary to specify multiple types to get the same functionality.)
This is essential, I think. However, the only way you are going to be 
able to do this in a DAS type constraint currently is to ensure the 
feature annotation source is ontology aware (and said ontology includes 
a distinct positional/non-positional hierarchy)**. One route would be to 
introduce a DAS-specific type term that the server maps to its source's 
ontology. Another, simpler approach would be to introduce a new boolean 
constraint 'positional', which, if specified, limits the response to 
positional annotation only.
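
As a rough illustration of the second approach, the Python sketch below 
shows how a server might apply such a flag to a range-qualified request. 
The positional_only argument is hypothetical and is not part of the 
current DAS spec. The Feature tuple and select_features function are 
likewise made-up names, and the sketch assumes non-positional features 
are encoded as start==end==0.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical minimal feature record; 0/0 means non-positional."""
    feature_id: str
    start: int
    end: int


def select_features(features: Iterable[Feature], seg_start: int,
                    seg_end: int, positional_only: bool = False) -> List[Feature]:
    """Return the features a range-qualified request would receive."""
    selected = []
    for feature in features:
        if feature.start == 0 and feature.end == 0:
            # Non-positional: apply to any part of the sequence, so they
            # are returned unless the (assumed) flag excludes them.
            if not positional_only:
                selected.append(feature)
        elif feature.start <= seg_end and feature.end >= seg_start:
            # Positional: keep only those overlapping the requested range.
            selected.append(feature)
    return selected
```

With the flag unset, the behaviour matches the spec as written, so 
existing clients would see no change; only clients that ask for it would 
skip the non-positional header.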

Jim.

** but this immediately brings to mind a nasty potential gotcha: e.g. 
'expression' in the context of a genome is positional, but is a 
non-positional feature in the context of a proteome. So terms will have 
to be fully qualified in the type constraint on a feature request.

-- 
-------------------------------------------------------------------
J. B. Procter  (JALVIEW/ENFIN)  Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764  http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.