[DAS2] outstanding questions

Mon Apr 17 07:31:13 UTC 2006

These are culled from the current draft of the spec.  I used "XXX"
to denote regions where I had questions.

1) type ontology URI

The TYPE elements have an 'ontology' attribute.  This is supposed
to be a required attribute, which is the URI of the corresponding
ontology term.

At present there is no URI system for ontology.  We added a special
'so_accession' attribute which is the SO id, as in

       so_accession="SO:0000704"

This was meant to be a hack for the hackathon.

My thought is:
   - keep the SO accession (as an optional attribute)
   - make 'ontology' be an optional attribute, but one of 'ontology'
       or 'so_accession' is required

Also, should that be "SO:0000704" or simply "0000704" ?  I think
the "SO:" should be present.
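A minimal sketch of how a validator might enforce the proposal above, assuming 'ontology' and 'so_accession' are both plain attributes on TYPE and that an SO accession is "SO:" followed by seven digits (that shape is my assumption, taken from the one example here).  The function name is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed shape of an SO accession, e.g. "SO:0000704".
SO_ACCESSION = re.compile(r"^SO:\d{7}$")

def check_type(elem):
    """Return a list of problems with a TYPE element (empty if it passes)."""
    problems = []
    ontology = elem.get("ontology")
    so_accession = elem.get("so_accession")
    # Proposed rule: each attribute is optional, but at least one must be given.
    if ontology is None and so_accession is None:
        problems.append("one of 'ontology' or 'so_accession' is required")
    # Proposed rule: the "SO:" prefix is part of the accession.
    if so_accession is not None and not SO_ACCESSION.match(so_accession):
        problems.append("so_accession should look like 'SO:0000704'")
    return problems

print(check_type(ET.fromstring('<TYPE so_accession="SO:0000704" />')))
print(check_type(ET.fromstring('<TYPE so_accession="0000704" />')))
```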

2)  Feature strand.

I want to make sure this is correct:

   1 for positive
  -1 for negative
   0 for unknown
   not given for both strands or does not have meaning
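If that reading is right, a client could interpret the attribute roughly as follows.  This is only a sketch: the function name and the returned labels are mine, and it assumes the attribute arrives as the raw text "1", "-1" or "0", or is absent.

```python
def parse_strand(value):
    """Interpret the raw 'strand' attribute text; None means it was not given."""
    if value is None:
        # Attribute absent: both strands, or strand has no meaning here.
        return "both-or-not-applicable"
    mapping = {"1": "positive", "-1": "negative", "0": "unknown"}
    try:
        return mapping[value.strip()]
    except KeyError:
        raise ValueError(f"unexpected strand value: {value!r}")

for raw in ("1", "-1", "0", None):
    print(repr(raw), "->", parse_strand(raw))
```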

3)  taxid

The 'taxid' in the SOURCE element does not appear to be useful.
It's written

   <SOURCE uri="volvox" title="Volvox Database" writeable="no"
       doc_href="http://www.example.org/volvox_db.pdf" taxid="3066">

     <VERSION uri="volvox/build_1" title="Build 1, October 2002"
            created="2002-10-15" modified="2002-10-25T09:56:23">

       <COORDINATES uri="http://ncbi.nlm.nih.gov/das-genomes/human-35"
                    taxid="3066" source="chromosome" authority="NCBI"
                    version="35" />

       <COORDINATES uri="http://embl.ebi.ac.uk/genome/volvox-clone"
                    taxid="2034" source="clone" authority="EMBL" />

Notice how the taxid exists in the SOURCE element and the COORDINATES
element (and how there are different taxids for each COORDINATES)?

I think we can drop 'taxid' from the SOURCE element and if it's
important someone should have a COORDINATES element.

4)  'writeable'

The versioned source element contains the attribute "writeable", as in

     <VERSION uri="volvox/build_1" title="Build 1, October 2002"
            writeable="no" created="2002-10-15"
            modified="2002-10-25T09:56:23">

Do we need that 'writeable' attribute?  It seems that if there's a
writeback capability then the versioned source is writeable.

5) content-type for FASTA records

"text/plain", "text/x-fasta" or "chem/x-fasta"

Looking around now I also see "application/x-fasta" and "application/fasta".

I'm going to say "should be text/x-fasta but may be text/plain".
Objections?
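On the client side that rule could look something like this sketch.  The function name and the three labels are hypothetical; it strips any parameters (such as a charset) before comparing.

```python
def classify_fasta_content_type(header):
    """Classify a Content-Type header value under the proposed FASTA rule."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header.split(";", 1)[0].strip().lower()
    if media_type == "text/x-fasta":
        return "preferred"
    if media_type == "text/plain":
        return "allowed"
    return "unexpected"

print(classify_fasta_content_type("text/x-fasta"))
print(classify_fasta_content_type("text/plain; charset=utf-8"))
print(classify_fasta_content_type("chem/x-fasta"))
```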

6) response document too large

I've described that a server may return an error if the response
document is too large.  This means a client may try again, hopefully
making a request which returns a smaller document.

My question is, how does a client make a smaller request?  What if the
server decides that sending more than 5 features at a time is too much?
When does the client just give up and say the server implementation is
crazy?
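One possible client strategy, sketched here purely for discussion: split the requested range in half each time the server objects, and give up once the range is narrower than some floor.  Everything in it is an assumption on my part: the ResponseTooLarge exception, the fetch(start, end) callable, the half-open ranges and the min_width cutoff are not in the spec.

```python
class ResponseTooLarge(Exception):
    """Hypothetical signal that the server refused because the document was too big."""

def fetch_all(fetch, start, end, min_width=1000):
    """Collect features in [start, end), halving the range whenever the server objects.

    `fetch(start, end)` is a hypothetical callable that returns a list of
    features or raises ResponseTooLarge.
    """
    pending = [(start, end)]
    results = []
    while pending:
        lo, hi = pending.pop()
        try:
            results.extend(fetch(lo, hi))
        except ResponseTooLarge:
            if hi - lo <= min_width:
                # The "server implementation is crazy" case: stop retrying.
                raise RuntimeError(f"server refuses even {lo}:{hi}; giving up")
            mid = (lo + hi) // 2
            # Push the right half first so the left half is fetched next.
            pending.append((mid, hi))
            pending.append((lo, mid))
    return results
```

In a real range query a feature that overlaps a split point could come back from both halves, so a client would probably also need to de-duplicate by feature URI.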

7) styles

Are we going to go with the current style system or some other
approach?

The DAS1 styles had support for limited semantic zooming, with
options for "high", "medium" and "low" resolution.  What do those
mean?  When should a client choose one over another?

What does "height" mean for a glyph?  How do the glyph and text
interoperate?  Eg, is the "height" the height for both, or just
for the glyph?

Should style information be moved outside of the DAS2 exchange
spec?

8) the "count" format

We talked about, and people wanted, a "count" format.  This returns
the number of features which would be returned in a query.

Does it really return the number of features, or does it return the
number of complex annotations (eg, if there is a complex annotation
with a root and two children, is that a count of "1" or a count of "3"?
Given the way we've done things, I'm going with "3".)
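To make the two readings concrete, here is a small sketch.  The dictionaries are stand-ins for feature records, and the "root" key borrows the attribute proposed in item 13; neither is a real DAS2 structure.

```python
def count_features(features):
    """Count every feature record, parents and children alike."""
    return len(features)

def count_annotations(features):
    """Count distinct annotations; a feature with no root is its own annotation."""
    roots = {f.get("root") or f["uri"] for f in features}
    return len(roots)

# One complex annotation: a root with two children.
features = [
    {"uri": "feature/gene1", "root": "feature/gene1"},
    {"uri": "feature/exon1", "root": "feature/gene1"},
    {"uri": "feature/exon2", "root": "feature/gene1"},
]
print(count_features(features))     # the "3" reading
print(count_annotations(features))  # the "1" reading
```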

9) alignments

How do I write an alignment?  Please give an example - I can't
figure it out.

10) CIGAR string

What's the format of the CIGAR string?  I've found two main
variations.  They are
   M 40 I 1 M 12 D 4
   40M1I12M4D

The latter appears to be the most common.  However, I did see one
case where if no count is given "1" is implied, so the latter can
also be written
   40MI12M4D
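For what it's worth, accepting all three spellings in one parser is not much work.  The sketch below assumes single uppercase letters for operations, that the spaced form always puts the operation before its count, and that a missing count in the compact form means 1; the function name is mine.

```python
import re

# One compact item: optional count followed by a single operation letter.
_COMPACT_ITEM = re.compile(r"(\d*)([A-Z])")

def parse_cigar(text):
    """Return a list of (operation, count) pairs from either CIGAR variation."""
    text = text.strip()
    tokens = text.split()
    if len(tokens) > 1:
        # Spaced form: operation first, then its count, e.g. "M 40 I 1".
        if len(tokens) % 2:
            raise ValueError(f"odd number of tokens in {text!r}")
        return [(op, int(n)) for op, n in zip(tokens[0::2], tokens[1::2])]
    # Compact form: count first, then operation; a missing count means 1.
    pairs = []
    pos = 0
    while pos < len(text):
        m = _COMPACT_ITEM.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse {text!r} at offset {pos}")
        digits, op = m.groups()
        pairs.append((op, int(digits) if digits else 1))
        pos = m.end()
    return pairs

print(parse_cigar("M 40 I 1 M 12 D 4"))
print(parse_cigar("40MI12M4D"))
```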

10) Do we need a REGION element?  I've written

   All feature locations are given in coordinates on a segment.  Some
   features may be locatable on other features.  For example, a contig
   feature may be locatable on a supercontig.  This relationship is
   stored using a REGION element.  A FEATURE element has zero or more
   REGION elements.  The 'feature' attribute of the REGION element
   contains the URI of the parent feature, on which the current feature
   is located.  A REGION record has an optional 'range' attribute.  If
   not given, the feature is on the entire parent feature.  The range
   string has the same syntax and meaning as in the LOC record.

   XXX I think this is overkill - what are some good examples of use;
   perhaps when the global coordinates are not well-defined?  Are
   negative coordinates important, like "promoter region is 20 bases
   upstream from some gene"?  Does this need a CIGAR string too? XXX

   For example, suppose feature A is 6 bases long and is on chromosome 5
   at position 10000, on exon X at position 300 and on contig K at
   position 7.  The FEATURE record for this feature may be as follows:

   <?xml version="1.0" encoding="UTF-8"?>
   <!-- XXX fix this -->
   <FEATURES  xmlns="http://www.biodas.org/documents/das2"
        xml:base="http://www.biodas.org/das2/sequence/volvox/v3/">
     <FEATURE uri="feature/A" type="type/Type_A">
       <LOC segment="segment/5" range="10000:10006" />

       <REGION feature="feature/exon_X" range="300:306" />
       <REGION feature="feature/contig_K" range="7:13" />
     </FEATURE>
   </FEATURES>

11) XID

Currently the XID element has a single attribute, 'href'.  I wrote

   A FEATURE has zero or more XID elements linking the feature record to
   an external database entry.  XXX This is not well-thought out.  I
   think it should have:
     'uri' -- a URL or LSID
     'authority' -- the name of the database (controlled vocabulary)
     'type' -- 'primary', 'accession', or possibly others?
     'id' -- the actual identifier
     'description' -- a paragraph or so describing the link, for humans
        to see why they might want to look into a link
   This has to be a well-defined concept.  Let's steal from someone else.
   The use-case here is to link to sequence records in other databases
   and to link to PubMed or other bibliographic databases.

12)  complex features

In the spec I wrote

   Some features are complex and cannot easily be modeled with a single
   feature record.  Quoting from the "Chado Schema Documentation" XXX
   give hyperlink XXX

      The class of transplicing events that involve ligating transcripts
      from different loci into a mature mRNA requires a separate feature
      to represent each locus transcript and one to represent the fused
      transcript. The fragments are located on the fused transcript;
      portions of the fused transcript can also be located on the genome.

Is this a relevant example of a complex feature for DAS2?  If not,
give another example.

In general I'm having a hard time coming up with good examples of
various forms of complex features.  I just don't know the domain
well enough.

13) "root" attribute

I proposed that features have a new, optional attribute called "root".
If a feature is part of a complex annotation then the "root" attribute
must be present and it must have the URI of the root feature for the
annotation.

This makes client processing easier, though it is not needed in the
purest of senses.

14) features have a 'STYLE' element

The idea was that an individual feature could override the style
given in the feature type record.  I don't think that's useful
and/or we need a real stylesheet instead.  I'm going to drop the
STYLE element from the FEATURE element unless there is objection.

15) In text searches we've defined

     ABC  -- field exactly matches "ABC"
    *ABC  -- field ends with "ABC"
     ABC* -- field starts with "ABC"
    *ABC* -- field contains the substring "ABC"

I want to say that using "*" and "?" elsewhere in the query string
is implementation dependent.  That is, "A*B" might match everything
with an A followed by a B or it might match the exact string "A*B"
and only that string.

I did this because looking around at various tools it looks like
it might be hard to change the meaning of "*" and "?" for the
text searches.
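A sketch of the four defined forms, to pin down what I mean.  Here any "*" or "?" that is not at the very start or end is compared literally, which is just one of the implementation-dependent behaviours allowed above, not a recommendation.  The function name is hypothetical.

```python
def matches(pattern, field):
    """Apply the four defined wildcard forms; interior '*' and '?' are literal here."""
    leading = pattern.startswith("*")
    trailing = len(pattern) > 1 and pattern.endswith("*")
    core = pattern[1:] if leading else pattern
    if trailing:
        core = core[:-1]
    if leading and trailing:
        return core in field            # *ABC*  contains
    if leading:
        return field.endswith(core)     # *ABC   ends with
    if trailing:
        return field.startswith(core)   # ABC*   starts with
    return field == core                # ABC    exact match

print(matches("*ABC", "XYZABC"))
print(matches("A*B", "AXB"))
```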

					Andrew
					dalke at dalkescientific.com