[DAS] querying for nonpositional annotations
Jim Procter
jprocter at compbio.dundee.ac.uk
Mon Aug 2 08:18:57 UTC 2010
Interesting thread, this. It's something that hasn't been properly
discussed at previous DAS developer meetings.
On 30/07/2010 20:33, Dave Messina wrote:
> I too agree with Eugene.
>
> No magic numbers.
You're too late here. 0 *is already* a magic number in the normalised
protein sequence world, since it indicates the translation start site
for a coding sequence (i.e. the initial M). This is the cause of some
ambiguity in the bio* bindings, and of confusion on the part of more
simple-minded programmers like myself :)
> Types can be used for filtering, and actually you get more fine-grained control than simply positional or non-positional. (I use this technique now in DASher.) *
>
> In my opinion, the current spec as written is correct. That is, non-positional features don't just apply to the whole sequence, they apply to any part of the sequence.
Agreed. But read on...
> As an example, consider a journal reference — a particular protein was isolated by a lab, they wrote a paper about it, and deposited the protein sequence in a database. If you look at a subsequence of the protein sequence, that subsequence still derives from the paper, right? So therefore the feature containing that journal reference should still be attached to the subsequence.
>
> On that basis, I think the uniprot server is technically doing it wrong and should be changed, although I have to say that in practice it hasn't been an issue for me.
It's a difficult call. The uniprot server's behaviour is almost
certainly due to the ambiguity between non-positional annotations
that have start/end attributes (where start==end && start==0) and
those that do not (the annotation is then usually derived from some
other table, viz. the BioSQL schema). Other DAS servers do similar
things, and kludges are needed to fix them.
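To make the two encodings concrete, here is a minimal Python sketch of the
sort of client-side kludge I mean. The function name is made up for
illustration and is not part of any DAS library; it only assumes that a
client has already parsed start/end into integers, or None where the
server sent nothing.

```python
from typing import Optional


def is_non_positional(start: Optional[int], end: Optional[int]) -> bool:
    """Treat both encodings of 'no position' the same way.

    Hypothetical helper, not part of any DAS client library.
    """
    # Encoding 1: the server omitted start/end altogether.
    if start is None and end is None:
        return True
    # Encoding 2: the server used the start==end==0 convention.
    return start == 0 and end == 0
```

Anything else, including a feature with a real start and end, is treated
as positional.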
My only worry with the expectation of 'proper behaviour' is that,
currently, I frequently see IDs with more non-positional annotation than
positional (notwithstanding histogram-like continuous quantitative
annotation such as running averages of predicted or observed local
sequence properties). Enforcing compliance with the spec as written
means that the average DAS metaserver (e.g. uniprot, or some server that
aggregates sequence database info with other data) will send a huge
non-positional header in response to every range-qualified feature
request, which is pretty inefficient. It may not scale well, either,
since the number of database cross-references is (still) increasing.
> * It might be nice, though, to add 'positional' and 'non-positional' types, which would be a way to grab all of the existing positional or non-positional types in one go. (currently it's necessary to specify multiple types to get the same functionality.)
This is essential, I think. However, the only way you are going to be
able to do this in a DAS type constraint currently is to ensure the
feature annotation source is ontology-aware (and that said ontology
includes a distinct positional/non-positional hierarchy)**. One route
would be to introduce a DAS-specific type term that the server maps to
its source's ontology. Another, simpler approach would be to introduce a
new boolean constraint 'positional' which, if specified, limits the
response to positional annotation only.
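As a rough sketch of how a server might honour such a boolean constraint
on a range-qualified request (Python; the Feature class, the
select_features function and the positional_only flag are all invented
here for illustration and do not exist in the DAS spec or in any server
I know of):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical in-memory feature record, not a DAS schema."""
    id: str
    type: str
    start: Optional[int] = None
    end: Optional[int] = None

    @property
    def non_positional(self) -> bool:
        # A missing coordinate or the 0,0 convention both mean 'no position'.
        if self.start is None or self.end is None:
            return True
        return self.start == 0 and self.end == 0


def select_features(features: List[Feature], seg_start: int, seg_stop: int,
                    positional_only: bool = False) -> List[Feature]:
    """Return the features to send for the range seg_start..seg_stop."""
    selected = []
    for f in features:
        if f.non_positional:
            # Spec as written: non-positional features apply to any
            # subsequence, so send them unless the client opted out.
            if not positional_only:
                selected.append(f)
        elif f.start <= seg_stop and f.end >= seg_start:
            # Positional feature overlapping the requested range.
            selected.append(f)
    return selected
```

With the flag left unset the behaviour follows the spec as written; with
it set, the non-positional header is dropped from the response.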
Jim.
** but this immediately brings to mind a nasty potential gotcha: e.g.
'expression' in the context of a genome is positional, but is a
non-positional feature in the context of a proteome. So terms will have
to be fully qualified in the type constraint on a feature request.
--
-------------------------------------------------------------------
J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.