[DAS2] Notes from the DAS/2 teleconference for the code sprint, 6 Feb 2006

Mon Feb 6 19:50:14 UTC 2006

$Id: das2-teleconf-2006-02-06.txt,v 1.2 2006/02/06 19:57:05 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Andreas Prlic, Thomas Down, Roy
  Sweden: Andrew Dalke
  UC Berkeley: Nomi Harris
  UCLA: Allen Day

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

Gregg's topics for discussion:

* Status report 
* DAS/2 XML - valid or not valid?
* CATEGORY elements -- constructing query URLs
* MAINTAINER information
* Use of xml:base
* update on feature properties - searching, etc.

Status Reports - what people are working on for the code sprint
------------------------------------------------------------

andrew
- getting folks up to speed on the spec changes, what he wrote.
- getting a feel for ensembl schema.
- change today: time zone specification b/c td's java time lib did
  something different than iso did.

aday: tag & branch?
gh: no branch, maybe tag
ad: tagging probably not necessary

gh: brings up a related issue:
 what is our mechanism for versioning - client & spec to understand
 which version of the spec they are/should be implementing
- can talk about it later during the xml validation issue discussion

ap: [missed it -- sorry!]

td: java om, feature xml done, can read and write.

roy: zmap das2 client, read/write das2, written in C. working with
ed griffith who's not available this week.
currently just a reader. from james gilbert, based on fmap from Acedb

gh: updating client and server (mostly client). top down syncing in
parallel, one command at a time. sources request is working on both
sides. will start w/ allen's server today, doing gh's sources query
against allen's server. segments and types today.

nh: apollo das2 client. reads das2 xml from andrew's example, write
out features in das2, now working on get, testing with server.

sc: affy das2 server stuff. streamlining updating it with feature data
from UCSC. also working on updating exon array data for use in IGB
client. working w/ gregg on other server-related work.
gh: graph data as well.

ee: working on igb client. talk w/ gregg later to get specifics.
gh: lots of ui stuff

Topic: xml validation
---------------------

ad: dtd's don't support namespaces, so we can't support dtds
gh: not that simple. where do we add namespaces?
ad: schemas have ns's
    testing....
gh: concern #1: is one of perception. don't like telling people we
don't have valid xml
ad: only means supports the dtd, not in human sense.
gh: it's one of perception
td: self-contained document + validation

gh: getting rid of doctype declaration is issue of versioning. how
will client know which version of spec it's supposed to be implementing?
need to deal with spec crawl. The only way i'm aware of is via looking
at dtd pointer changing.
gh: not worried about new categories, but changing things like
optional vs req'd attributes/elements.
ad: content-type contains version
td: or content negotiation
ap: xml schema validator at w3c.org can use that and claim it is
valid. can upload your files, push a button.
ad: I have an extension of properties with arbitrary binary data vs
text vs href. this is ok with relaxng, not by xsd.
ad: we could say what is valid das2 since we're the arbiters of what
is valid das xml document. e.g., well-formed, validates against the rng
schemas
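
A minimal sketch of the first half of ad's definition (well-formedness),
using only the Python standard library. The element names and the
namespace URI below are made up for illustration and are not taken from
the spec. Validating against the rng schemas would need a separate
RELAX NG validator and is not shown.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML (well-formedness only)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical documents; element and namespace names are illustrative.
GOOD = '<TYPES xmlns="http://example.org/das2-ns"><TYPE id="type/a"/></TYPES>'
BAD = '<TYPES><TYPE id="type/a"></TYPES>'  # TYPE is never closed

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

Note that neither document carries a doctype declaration; the check
does not depend on one.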

gh: the rng we now have allows arbitrary xml?
ad: yes. can say there are arbitrary elements under some node. checked
in as file named common.rnc
gh: ok, getting rid of requirement for doctype declaration. any
versioning is done via content-type

gh: if we don't do content neg, a sources query goes out, whatever
version that the server supports comes back. this will be the latest
version of the spec the server supports.
ad: for backwards compatibility that won't be needed. extensibility
will be sufficient for a few years.
gh: don't believe it.
td: spec is churning fast now. there'll be less churn once there are impls.
gh: there were impls 3 or 4 mos ago (allen, gregg). so there has been
plenty of churn even with impls. so we'll need versioning, ok on
content-type.
aday: we definitely need versioning. need it now. also want a tagged
version we can work on at the same time.
ad: content-type-xdas;version=1.1
in general not the right solution (not general purpose), but for this
case, makes sense. 
aday: can impl, header says 1.1
gh/ad: contents are a subset of the specification. so it's tied to a
version of the rng schema.
ad: the tag will be the cvs revision #

gh: this isn't temporary. there will not be a time when we are not
generating churn.
ad: believes this is temporary, won't have to have it long-term
aday: no mechanism for it now.
ad: need a way to turn it into meaning. agreement on what string means
which version of a program.

nh: second gregg. will always be an issue. ad says it's not good
long-term, maybe we should come up with it.
gh: we have some basis to go forward.

[A] das/2 server will specify spec version via content-type-xdas;version=X.X
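
A rough sketch of how a client might read the spec version from a
response's Content-Type header, assuming the version travels as an
ordinary "version" parameter. The media type string
"application/x-das-sources+xml" and the SUPPORTED set are assumptions
made for this example; the notes only record the shorthand
"content-type-xdas;version=X.X".

```python
from email.message import Message

# Hypothetical set of spec versions this client understands.
SUPPORTED = {"1.1"}


def parse_das_content_type(header_value: str):
    """Split a Content-Type value into (media type, version or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_param("version")


def client_accepts(header_value: str) -> bool:
    """True if the server declared a version this client supports."""
    _, version = parse_das_content_type(header_value)
    return version in SUPPORTED


# Assumed header value, for illustration only.
header = "application/x-das-sources+xml; version=1.1"
print(parse_das_content_type(header))
print(client_accepts(header))
```

A client that gets no version parameter, or one it does not know, can
then decide whether to refuse the document or try to parse it anyway.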

Topic: category elements, how to construct a query url
------------------------------------------------------

ad: what is syntax of string used to specify ontology? SO:?
aday: attribute for it
gh: ontol term is a uri
aday: type element has ontology
gh: id of type is not nec an ontol term
ad: the attrib of feat type, ontol=something
gh: that's a uri, abs or rel point to a frag in so/fa ontol
ad: can't find how this should look. said SO:0000001. that should be
a uri?
gh: yes. in types xml that's returned, id and ontol are uri's. a
server will pick one for its xml base. the other will have to be a
full uri.
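
To illustrate gh's point, here is a small sketch of resolving a
relative id and a full ontology uri against one xml:base. The document
below is invented for the example: the host names, the TYPES/TYPE
element names, and the attribute names are assumptions, and the
fragment form of the SO term is only one of the notations discussed
here, not an agreed format.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the predefined XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical types document; all names and hosts are illustrative.
DOC = """<TYPES xml:base="http://das.example.org/das2/genome/x/">
  <TYPE id="type/example" ontology="http://so.example.org/sofa#SO:0000001"/>
</TYPES>"""


def resolved_types(xml_text: str):
    """Return (id uri, ontology uri) pairs with relative uris resolved."""
    root = ET.fromstring(xml_text)
    base = root.get(XML_BASE, "")
    return [
        (urljoin(base, t.get("id")), urljoin(base, t.get("ontology")))
        for t in root.iter("TYPE")
    ]


for type_uri, ontology_uri in resolved_types(DOC):
    print(type_uri)
    print(ontology_uri)
```

The relative id picks up the base, while the ontology uri, being
absolute, passes through unchanged.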
ad: how do diff clients know a given term corresponds to what term in
the ontol?
gh: they will have to understand sofa/so.
ad: do they have persistent ids?
gh: my understanding is that they can use fragment notation for a
stable url for the term
aday: ontol docs aren't xml, no anchors for pointing to a
fragment. they're their own format. nervous about building dependency
on fragment record uris into our system
gh: good point. would be happier if it was recast as xml
aday: is now pointing to an xml document for ontology nodes
ad: happier if we could use "SO:xxx" i.e., a urn
gh: would like a re-cast as xml document, hosted at so/sofa
website. that xml would be like a std ontology representation so you
could extend it. so someone could point to an extension of it.

Category elements -- constructing query URLs
--------------------------------------------

gh: andreas' point (email): query id attribute, constructing these out
of relative uri, or based on base uri.
agree with andreas: we know what those will be.
for clarity of spec, we should specify: here's base uri, here's how you
construct the segments query, etc.
ad: trouble for segments- could be on ref server
gh: doubt that people will impl this way. will be specific to server
and will be related to everyone else's notion of chromosomes and
assemblies.
ad: where does the distributed nature of das come from? ref server
gh: das/1: ref server has residues to serve, regions (entry pts)
served up by everyone. this was the notion of ref vs non-ref
server to carry forward. non-ref server still serves up segments.
will have segments in its reference space. reference would be genome
assembly version + organism. sufficient to globally identify it.
ap: had discussions about this. query id
td: issue comes from seqs being urls rather than opaque ids in a ns
defined by coord system. have a set of servers that share common coord
syst. then a seq identified by stringx on one server is same as on the
other server.
the remaining q: server that doesn't want to serve up seqs, what urls
does it use? can it use an opaque seq name that is known by that name of
ref server? 
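
One way to picture td's shared coordinate system idea: each server maps
its own segment urls to a key of organism + assembly version + sequence
name, and two urls name the same sequence when their keys match. This is
a sketch only; the SegmentKey type, both lookup tables, and every url in
them are hypothetical, and nothing here was agreed on the call.

```python
from typing import Dict, NamedTuple


class SegmentKey(NamedTuple):
    """Opaque identity of a sequence within a shared coordinate system."""
    organism: str
    assembly: str
    name: str


# Hypothetical per-server tables from segment url to coordinate-system key.
REF_SERVER: Dict[str, SegmentKey] = {
    "http://ref.example.org/das2/hg17/segment/chr1":
        SegmentKey("Homo sapiens", "hg17", "chr1"),
}
ANNOT_SERVER: Dict[str, SegmentKey] = {
    "http://annot.example.org/genome/human-17/seq/1":
        SegmentKey("Homo sapiens", "hg17", "chr1"),
}


def same_segment(url_a: str, table_a: Dict[str, SegmentKey],
                 url_b: str, table_b: Dict[str, SegmentKey]) -> bool:
    """True if two server-specific urls refer to the same sequence."""
    return table_a[url_a] == table_b[url_b]


print(same_segment(
    "http://ref.example.org/das2/hg17/segment/chr1", REF_SERVER,
    "http://annot.example.org/genome/human-17/seq/1", ANNOT_SERVER,
))
```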

gh: restating concerns here: using query string to construct uri's
1. confusion: arbitrary uri means more confusing spec, and how to
   implement it (can't just say /segment, but 'whatever is pointed at
   by such and such uri')
2. size of documents. right now, can use same xml:base for features
   document, can make feat ids and location id relative to it, nice
   and short. if seg is on other server, need to expand one of the ids

compresses well, but that will take longer than transmission.
this is only for features xml.

can use coords or assembly info to determine identity between urls.
want a defined ns.
ad: you want a way to say: these are relative urls to a base url for
that data type. so that this type url is relative to some base url for
types, similar for segments, features.
gh: we have this now, can be relative or absolute
ad: there is a default xml base like thing: one for type, segment,
features. so you could have relative ids to those bases.
gh: possibly, but not ideal. It's better to use a std xml base for all
of them. 
each server has its own unique uris for segments.

I'm proposing that we decouple segments from residues and having
segments doesn't mean we can serve residues. reasoning:
- this leads to smaller xml docs
- simplifies the spec if we didn't have to construct query ids from
  category element

would rather specify the string that's appended in the spec.

sc: could deal with this issue by adding structure to the
document in order to add different xml:bases for different data
types. e.g., use different parent elements that could define their own
xml:bases, one each for types, segments, and features. might
complicate the spec tho.

ad: a single genome has the same types across all dbs.
gh: across servers, dangerous.
ad/td: globally unique ids, could have everything in the same directory.
td: can we just use seq/name, type/name. i.e., codifying what the
convention now is.
ad: name is put at end of base url
a feature document may give types, segments, other features.
td: just use simple strings, not urls.
gh: std uri syntax isn't important, but a std query mechanism to get
all of these is. some uri you put a '/types' on or a '/segments'.
ad: you have this right now.
gh: but it's only defined for a server, not the whole spec. there's
nowhere in the spec that says this. confusing for people
reading/implementing the spec.
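
A sketch of the convention gh describes, where the spec would fix the
string appended to a base uri for each query. Only '/types' and
'/segments' come from the discussion; the base uri and the function
name are assumptions made for the example.

```python
# Suffixes named in the discussion; a spec would fix the full list.
QUERY_SUFFIXES = {"types": "types", "segments": "segments"}


def query_url(base_uri: str, category: str) -> str:
    """Build a query url by appending a fixed suffix to the base uri.

    Raises KeyError for a category that has no defined suffix.
    """
    return base_uri.rstrip("/") + "/" + QUERY_SUFFIXES[category]


# Hypothetical base uri for one versioned source on a server.
base = "http://das.example.org/das2/genome/x/"
print(query_url(base, "types"))
print(query_url(base, "segments"))
```

With a rule like this, a client that knows only the base uri can build
every query url without first fetching a document that lists them,
which is the case ap raises.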
ap: If you make it free text, you don't know what to put for a given server?
ad: you get a document
ap: I already know the server, not necessarily a document.

ad: taking out the mention of any hierarchy, just refer to things as
feat query url.

[note taker is having trouble following the thread of this discussion.]

gh: let's sleep on it, discuss tomorrow, vote then.