[DAS2] outstanding questions
Andrew Dalke
dalke at dalkescientific.com
Mon Apr 17 07:31:13 UTC 2006
These are culled from the current draft of the spec. I used "XXX"
to denote regions where I had questions.
1) type ontology URI
The TYPE elements have an 'ontology' attribute. This is supposed
to be a required attribute whose value is the URI of the
corresponding ontology term.
At present there is no URI system for ontology terms. We added a
special accession attribute which holds the SO id, as in
so_accession="SO:0000704"
This was meant to be a hack for the hackathon.
My thought is:
- keep the SO accession (as an optional attribute)
- make 'ontology' be an optional attribute, but one of 'ontology'
or 'so_accession' is required
Also, should that be "SO:0000704" or simply "0000704" ? I think
the "SO:" should be present.
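A minimal sketch of how a validator might enforce that proposal. The
function name and the seven-digit pattern are my assumptions for
illustration; neither is in the spec.

```python
import re

# Assumed pattern: "SO:" prefix followed by seven digits, as in "SO:0000704".
SO_ACCESSION = re.compile(r"^SO:\d{7}$")


def check_type_attrs(attrs):
    """Return a list of problems with a TYPE element's attributes (a dict)."""
    problems = []
    ontology = attrs.get("ontology")
    so_accession = attrs.get("so_accession")
    # Proposed rule: both are optional, but at least one must be present.
    if ontology is None and so_accession is None:
        problems.append("one of 'ontology' or 'so_accession' is required")
    # Proposed rule: the accession keeps its "SO:" prefix.
    if so_accession is not None and not SO_ACCESSION.match(so_accession):
        problems.append("so_accession should look like 'SO:0000704'")
    return problems
```

An empty list means the attributes pass both proposed rules.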
2) Feature strand.
I want to make sure this is correct
1 for positive
-1 for negative
0 for unknown
attribute omitted if the feature is on both strands or strand has no
meaning
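If that reading is right, a client could interpret the attribute
roughly as follows. This is only a sketch; the function name and the
returned labels are made up for illustration.

```python
def parse_strand(value):
    """Map a strand attribute value (None when the attribute is absent)
    to a human-readable meaning."""
    if value is None:
        # Attribute omitted: both strands, or strand is not meaningful.
        return "both strands / not meaningful"
    meanings = {"1": "positive", "-1": "negative", "0": "unknown"}
    if value not in meanings:
        raise ValueError("unexpected strand value: %r" % (value,))
    return meanings[value]
```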
3) taxid
The 'taxid' in the SOURCE element does not appear to be useful.
It's written
<SOURCE uri="volvox" title="Volvox Database" writeable="no"
doc_href="http://www.example.org/volvox_db.pdf" taxid="3066">
<VERSION uri="volvox/build_1" title="Build 1, October 2002"
created="2002-10-15" modified="2002-10-25T09:56:23">
<COORDINATES uri="http://ncbi.nlm.nih.gov/das-genomes/human-35"
taxid="3066" source="chromosome" authority="NCBI"
version="35" />
<COORDINATES uri="http://embl.ebi.ac.uk/genome/volvox-clone"
taxid="2034" source="clone" authority="EMBL" />
Notice how the taxid exists in both the SOURCE element and the
COORDINATES elements (and how each COORDINATES has a different taxid)?
I think we can drop 'taxid' from the SOURCE element, and if it's
important someone should supply a COORDINATES element.
4) 'writeable'
The versioned source element contains the attribute "writeable", as in
<VERSION uri="volvox/build_1" title="Build 1, October 2002"
writeable="no" created="2002-10-15"
modified="2002-10-25T09:56:23">
Do we need that 'writeable' attribute? It seems that if there's a
writeback capability then the versioned source is writeable.
5) content-type for FASTA records
"text/plain", "text/x-fasta" or "chem/x-fasta"
Looking around now I also see "application/x-fasta" and
"application/fasta".
I'm going to say "should be text/x-fasta but may be text/plain".
Objections?
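On the client side that rule might be checked like this. It is a
sketch only; the names are hypothetical and it assumes the client
ignores parameters such as charset.

```python
PREFERRED = "text/x-fasta"
ACCEPTED = {"text/x-fasta", "text/plain"}


def is_fasta_content_type(header):
    """True if a Content-Type header value is acceptable for FASTA."""
    # Drop any parameters (e.g. "; charset=utf-8") and normalise case.
    media_type = header.split(";", 1)[0].strip().lower()
    return media_type in ACCEPTED
```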
6) response document too large
I've described that a server may return an error if the response
document is too large. This means a client may try again, hopefully
making a request which returns a smaller document.
My question is, how does a client make a smaller request? What if the
server decides that sending more than 5 features at a time is too much?
When does the client just give up and say the server implementation is
crazy?
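One possible client strategy, sketched under assumptions: the client
queries by range, halves the range on a "too large" error, and gives
up below some minimum span. ResponseTooLarge, fetch_all and min_span
are all hypothetical names. A real client would also have to handle
features that overlap the split point.

```python
class ResponseTooLarge(Exception):
    """Hypothetical error raised when the server refuses a request
    because the response document would be too large."""


def fetch_all(fetch, start, end, min_span=1000):
    """Call fetch(start, end), halving the range whenever the server
    says the response is too large. Gives up at min_span bases."""
    try:
        return fetch(start, end)
    except ResponseTooLarge:
        if end - start <= min_span:
            # The "server implementation is crazy" case.
            raise RuntimeError(
                "server refuses even %d bases; giving up" % (end - start))
        mid = (start + end) // 2
        return (fetch_all(fetch, start, mid, min_span)
                + fetch_all(fetch, mid, end, min_span))
```

This only helps when the size of a response tracks the size of the
range, which is part of the question.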
7) styles
Are we going to go with the current style system or some other
approach?
The DAS1 styles had support for limited semantic zooming, with
options for "high", "medium" and "low" resolution. What do those
mean? When should a client choose one over another?
What does "height" mean for a glyph? How do the glyph and text
interoperate? Eg, is the "height" the height for both, or just
for the glyph?
Should style information be moved outside of the DAS2 exchange
spec?
8) the "count" format
We talked about, and people wanted, a "count" format. This returns
the number of features which would be returned in a query.
Does it really return the number of features, or does it return the
number of complex annotations? Eg, if there is a complex annotation
with a root and two children, is that a count of "1" or a count of
"3"? Given the way we've done things, I'm going with "3".
9) alignments
How do I write an alignment? Please give an example - I can't
figure it out.
10) CIGAR string
What's the format of the CIGAR string? I've found two main
variations. They are
M 40 I 1 M 12 D 4
40M1I12M4D
The latter appears to be the more common. However, I did see one
case where a missing count implies "1", so the latter can
also be written
40MI12M4D
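To make the three spellings concrete, here is a sketch of a parser
that accepts all of them and yields the same (count, op) pairs. The
function name is made up, and I assume ops are single letters.

```python
import re


def parse_cigar(text):
    """Parse a CIGAR string into a list of (count, op) pairs.

    Accepts "M 40 I 1" (op first, space separated) and "40M1I"
    (count first, compact). In the compact form a missing count
    is read as 1.
    """
    tokens = text.split()
    if len(tokens) > 1:
        # Space-separated form: alternating op and count.
        if len(tokens) % 2:
            raise ValueError("odd number of tokens in %r" % (text,))
        return [(int(count), op)
                for op, count in zip(tokens[0::2], tokens[1::2])]
    pairs = re.findall(r"(\d*)([A-Za-z])", text)
    # Reject input with characters the pattern skipped over.
    if "".join(count + op for count, op in pairs) != text:
        raise ValueError("cannot parse %r" % (text,))
    return [(int(count) if count else 1, op) for count, op in pairs]
```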
10) Do we need a REGION element? I've written
All feature locations are given in coordinates on a segment. Some
features may be locatable on other features. For example, a contig
feature may be locatable on a supercontig. This relationship is
stored using a REGION element. A FEATURE element has zero or more
REGION elements. The 'feature' attribute of the REGION element
contains the URI of the parent feature, on which the current feature
is located. A REGION record has an optional 'range' attribute. If
not given the feature is on the entire parent feature. The range
string is the same syntax and meaning as in the LOC record.
XXX I think this is overkill - what are some good examples of use;
perhaps when the global coordinates are not well-defined? Are
negative coordinates important, like "promoter region is 20 bases
upstream from some gene"? Does this need a CIGAR string too? XXX
For example, suppose feature A is 6 bases long and is on chromosome 5
at position 10000, on exon X at position 300 and on contig K at
position 7. The FEATURE record for this feature may be as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!-- XXX fix this -->
<FEATURES xmlns="http://www.biodas.org/documents/das2"
xml:base="http://www.biodas.org/das2/sequence/volvox/v3/">
<FEATURE uri="feature/A" type="type/Type_A">
<LOC segment="segment/5" range="10000:10006" />
<REGION feature="feature/exon_X" range="300:306" />
<REGION feature="feature/contig_K" range="7:13" />
</FEATURE>
</FEATURES>
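For what it's worth, here is a sketch of how a client might read such
a record with Python's xml.etree.ElementTree. The embedded document is
a well-formed variant of the example; the attribute name 'feature' on
REGION follows the prose description, and the helper names are
hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.biodas.org/documents/das2}"

DOC = """<FEATURES xmlns="http://www.biodas.org/documents/das2"
  xml:base="http://www.biodas.org/das2/sequence/volvox/v3/">
  <FEATURE uri="feature/A" type="type/Type_A">
    <LOC segment="segment/5" range="10000:10006" />
    <REGION feature="feature/exon_X" range="300:306" />
    <REGION feature="feature/contig_K" range="7:13" />
  </FEATURE>
</FEATURES>"""


def parse_range(text):
    """Split a "start:end" range string into two integers."""
    start, end = text.split(":")
    return int(start), int(end)


def locations(feature):
    """Return (target_uri, start, end) for each LOC and REGION child."""
    found = []
    for loc in feature.findall(NS + "LOC"):
        found.append((loc.get("segment"),) + parse_range(loc.get("range")))
    for region in feature.findall(NS + "REGION"):
        rng = region.get("range")
        # No range means the feature is on the entire parent feature.
        span = parse_range(rng) if rng else (None, None)
        found.append((region.get("feature"),) + span)
    return found


feature_a = ET.fromstring(DOC).find(NS + "FEATURE")
for target, start, end in locations(feature_a):
    print(target, start, end)
```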
11) XID
Currently the XID element has a single attribute, 'href'. I wrote
A FEATURE has zero or more XID elements linking the feature record to
an external database entry. XXX This is not well-thought out. I
think it should have:
'uri' -- a URL or LSID
'authority' -- the name of the database (controlled vocabulary)
'type' -- 'primary', 'accession', or possibly others?
'id' -- the actual identifier
'description' -- a paragraph or so describing the link, for humans
to see why they might want to look into a link
This has to be a well-defined concept. Let's steal from someone else.
The use-case here is to link to sequence records in other databases
and to link to PubMed or other bibliographic databases.
12) complex features
In the spec I wrote
Some features are complex and cannot easily be modeled with a single
feature record. Quoting from the "Chado Schema Documentation" XXX
give hyperlink XXX
The class of transplicing events that involve ligating transcripts
from different loci into a mature mRNA requires a separate feature
to represent each locus transcript and one to represent the fused
transcript. The fragments are located on the fused transcript;
portions of the fused transcript can also be located on the genome.
Is this a relevant example of a complex feature for DAS2? If not,
give another example.
In general I'm having a hard time coming up with good examples of
various forms of complex features. I just don't know the domain
well enough.
13) "root" attribute
I proposed that features have a new, optional attribute called "root".
If a feature is part of a complex annotation then the "root" attribute
must be present and it must have the URI of the root feature for the
annotation.
This makes client processing easier, though it is not needed in the
purest of senses.
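As an illustration of the "easier client processing" claim, here is a
sketch that groups features by annotation using only the "root"
attribute. The dict layout and function name are assumptions; a
feature without "root" is treated as a standalone annotation.

```python
from collections import defaultdict


def group_by_root(features):
    """Group feature URIs by the URI of their annotation root.

    Each feature is a dict with a 'uri' key and an optional 'root' key.
    """
    groups = defaultdict(list)
    for feature in features:
        root = feature.get("root", feature["uri"])
        groups[root].append(feature["uri"])
    return dict(groups)
```

With this grouping, the number of groups and the total number of
members are the two candidate answers to the "count" question in 8).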
14) features have a 'STYLE' element
The idea was that an individual feature could override the style
given in the feature type record. I don't think that's useful
and/or we need a real stylesheet instead. I'm going to drop the
STYLE element from the FEATURE element unless there is objection.
15) In text searches we've defined
ABC -- field exactly matches "ABC"
*ABC -- field ends with "ABC"
ABC* -- field starts with "ABC"
*ABC* -- field contains the substring "ABC"
I want to say that using "*" and "?" elsewhere in the query string
is implementation dependent. That is, "A*B" might match everything
with an A followed by a B or it might match the exact string "A*B"
and only that string.
I did this because looking around at various tools it looks like
it might be hard to change the meaning of "*" and "?" for the
text searches.
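A sketch of the four defined cases, with everything else left
undefined. The function name is made up, matching is assumed to be
case-sensitive, and the sketch raises an error where a real server
would apply its own implementation-dependent behaviour.

```python
def das_match(pattern, field):
    """Apply the four defined text-search patterns to a field value."""
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*") and len(pattern) > 1
    begin = 1 if leading else 0
    stop = len(pattern) - 1 if trailing else len(pattern)
    core = pattern[begin:stop]
    if "*" in core or "?" in core:
        # "A*B" and friends: implementation dependent, so not handled.
        raise NotImplementedError(
            "wildcards inside the term are implementation dependent")
    if leading and trailing:
        return core in field             # *ABC*
    if leading:
        return field.endswith(core)      # *ABC
    if trailing:
        return field.startswith(core)    # ABC*
    return field == core                 # ABC
```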
Andrew
dalke at dalkescientific.com