[DAS2] Notes from DAS/2 code sprint #3, day four, 17 Aug 2006

Thu Aug 17 22:18:21 UTC 2006

$Id: das2-teleconf-2006-08-17.txt,v 1.1 2006/08/17 22:15:30 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  Dalke Scientific: Andrew Dalke
  UCLA: Allen Day, Brian O'Connor

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

Topic: Status Reports
----------------------

ls: Perl interface in good shape. reorg'd to get parser based on
content type dynamically. response comes in, figures out what parser
to use, returns the objects, should be extensible for other
formats. main task todo is to implement the feature object so that I
can actually return features. now parser is there, object is not. Not
a Bio::SeqFeatureI object, in order to work with gbrowse and other
parts of bioperl. some issues with biopackages with xml:base,
sometimes slashes there that shouldn't be and vice versa. segments
request has extraneous / at end, so it has 'segments' repeated twice.
didn't try to fetch to see if would work, but looks like a bug.
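
A minimal sketch of the content-type dispatch ls describes, written in
Python rather than the Perl of the actual client. The registry, the
function names and the media type string are all placeholders made up
for illustration, not the client's real API.

```python
from typing import Callable, Dict, Tuple

# Registry: media type -> parser. All names here are illustrative.
PARSERS: Dict[str, Callable[[bytes], Tuple[str, int]]] = {}


def register(media_type: str):
    """Decorator that registers a parser function for one media type."""
    def wrap(func):
        PARSERS[media_type] = func
        return func
    return wrap


def media_type_of(content_type: str) -> str:
    """Drop parameters such as '; charset=utf-8' and lowercase the rest."""
    return content_type.split(";", 1)[0].strip().lower()


@register("application/x-das-segments+xml")  # placeholder media type
def parse_segments(body: bytes) -> Tuple[str, int]:
    # Stand-in for a real parser: report the kind and the payload size.
    return ("segments", len(body))


def parse_response(content_type: str, body: bytes):
    """Pick the parser from the response's Content-Type and run it."""
    media_type = media_type_of(content_type)
    if media_type not in PARSERS:
        raise ValueError(f"no parser registered for {media_type!r}")
    return PARSERS[media_type](body)
```

Supporting another format would then only mean registering one more
parser function; the dispatch code itself does not change.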

gh: regarding parent-child relationships between features: if they
have parent, need to point to it, if they have children need to
point to them.
ls: parsing with sax, I'll know when an object is complete.
will create a feature stream and start returning features as the parse
is coming across. threaded, so you can have multiple streams going
simultaneously. 
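
A minimal sketch of the "object is complete at the end tag" idea, using
Python's xml.sax instead of the Perl client. The element and attribute
names (FEATURE, PARENT, PART, uri) are assumptions for illustration, and
the threading ls mentions is left out.

```python
import xml.sax


class FeatureHandler(xml.sax.ContentHandler):
    """Calls on_feature(feature) as soon as each FEATURE element closes."""

    def __init__(self, on_feature):
        super().__init__()
        self.on_feature = on_feature
        self.current = None  # feature being built, or None between features

    def startElement(self, name, attrs):
        if name == "FEATURE":
            self.current = {"uri": attrs.get("uri"), "parents": [], "parts": []}
        elif self.current is not None and name == "PARENT":
            self.current["parents"].append(attrs.get("uri"))
        elif self.current is not None and name == "PART":
            self.current["parts"].append(attrs.get("uri"))

    def endElement(self, name):
        if name == "FEATURE" and self.current is not None:
            # The end tag is the point where the object is known complete.
            self.on_feature(self.current)
            self.current = None


def parse_features(xml_bytes):
    """Parse a document and return the features in arrival order."""
    found = []
    xml.sax.parseString(xml_bytes, FeatureHandler(found.append))
    return found
```

Note that a feature handed out this way only carries the uris of its
parents and parts; the features they point to may not have arrived yet,
which is where the parent-child questions gh raises come in.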

gh: more issues with parent child hierarchy. will wait for allen to
arrive before discussing.

Topic: Spec issues
------------------

ad: working on content negotiation, but now is not right time to do
it. in sequence doc, default doc should be das2-segments.

sc: xml:base issue -- where do we allow it (0, 1, infinity)?
gh: our policy is that we follow the xml:base spec.
ad: if you use it, use it everywhere.
gh: my parser is looking for it everywhere.
ad: my email explains why you might want to use it on multiple
features. eg., combining data from different servers.

sc: what about brian gilman's issue, when you get to root what if
xml:base is still relative?

ad: uri spec defines how to resolve relative urls, e.g., get it from
document.

gh: relaxNG says it can be anywhere. I think it should therefore be
allowed anywhere.
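
A minimal sketch of what "xml:base allowed anywhere" means for a client:
each element's base is resolved against its parent's base, and a base
that is still relative at the root falls back to the URL the document
was fetched from. The element names and example URLs are made up; only
the xml:base attribute and the 'uri' attribute convention are assumed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree reports the xml:base attribute under this qualified name.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def resolve_uris(elem, base, out=None):
    """Collect every 'uri' attribute in the tree as an absolute URI.

    'base' is the base URI in effect above this element; for the root
    element that is the URL the document was retrieved from.
    """
    if out is None:
        out = []
    # An xml:base here (absolute or relative) refines the inherited base.
    base = urljoin(base, elem.get(XML_BASE, ""))
    uri = elem.get("uri")
    if uri is not None:
        out.append(urljoin(base, uri))
    for child in elem:
        resolve_uris(child, base, out)
    return out
```

For example, with a root xml:base of "v1/" in a document fetched from
http://example.org/das2/features, a feature with uri="x" resolves to
http://example.org/das2/v1/x.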

ad: right now all services return an xml object file except the segment
request, which returns a fasta file. would like to return xml.

sc: this is along the lines of what I proposed a while back. I like
it. See discussion under this thread:
http://lists.open-bio.org/pipermail/das2/2005-December/000395.html

ad: formats on a per-segment basis. current scheme only defines them on a
per-everything basis. propose that each segment also has its own
format. each segment can have alt formats. (see ad's email from today
on this topic).
gh: like it. it means that a server doesn't have to know about all
residues. 

ad: for case of reference server, we guarantee that it supports fasta
sequence.  affects other servers, not just reference server.
gh: I like that flexibility. any objections?
[silence]

gh: if you return the segments doc we now have, you are only serving
up xml. if you want to return fasta, you need to return a format
element.
ls: is there a way for client to determine what it will get?
gh: in the segments document, returned back from reference
server. client can specify format defined there.
ls: not impl yet, just a proposal?
gh: yes. another plus is the ability to specify more efficient binary
formats too.
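
A minimal sketch of how a client might use per-segment formats under
this proposal (which is not implemented yet). The format names and the
shape of the data are assumptions for illustration only.

```python
from typing import Dict, List, Optional


def pick_format(offered: List[str], preferred: List[str]) -> Optional[str]:
    """Return the first format the client prefers that the segment offers."""
    for name in preferred:
        if name in offered:
            return name
    return None


# Hypothetical result of reading a segments document: formats per segment.
SEGMENT_FORMATS: Dict[str, List[str]] = {
    "chr1": ["das2xml", "fasta"],
    "chr2": ["das2xml"],
}

# Client preference order, most wanted first (names are placeholders).
CLIENT_PREFERENCE = ["binary", "fasta", "das2xml"]
```

With these values the client would ask for fasta on chr1 and fall back
to das2xml on chr2, without the server having to offer every format for
every segment.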

Topic: Ann's issue on content-type
----------------------------------
gh: server has option to specify that you can return things as
text/xml, but still send das2xml format.

ad: content negotiation doesn't work to allow the browser to view
XML. only works for clients that can do content neg, not general
clients (e.g., safari). I tried two different browsers, got two
different results.

[A] Ask Ann Loraine if this solution is sufficient.

Topic: Writeback issues
-------------------------

aday: problem with writeback. creating a new feat or updating an existing feat. if
it's a new feature, das_private uri scheme has no info about source or
versioned source that the feature is intended to be written to. This
is not necessarily a problem, could be a different uri post.

But it is a problem when parsing and it's possible for parents or
children to be attached to the feat and they are not the
source/vsource combination. make sense?

ad: every feat has unique id. could do it by saying when you see this
id, it corresponds to this segment or this versioned source.

ad: feature comes from NCBI but is being posted to affymetrix.

gh: I talked about this as a use case for the grant. Example:
snps being served by an authority (dbSNP) and people are trying to
create their own haplotype blocking structure. you want them to be
able to point to the authority for the leaf features (snps,
children). so you can have one server serving up haplotype blocks, and
points to snps that reside on another server that is the authority.

right now in the spec, can't do that because of the bidirectional
parent-child stuff. you'd have to point the snps at the authority to
the new stuff. 
ad: could have parent-child relationships that are incorrect.
all parents connected together are places you can get to. has to be a
single root.
gh: due to that and the bidirectional stuff, we can't support my use
case,  also can't build features from multiple servers to construct
curations.
ad: can do it in datamodel. I point to features over there.
gh: in xml it can't be done.
ad: also means that, you have to keep requesting features over and
over again. you have to do at least one request for every feat.
gh: even if we have these restrictions, how can we enforce them with
das-private id.
aday: the document is not enough to tell you if the parent being
associated with a feature is valid. you have to know more.

aday: it's only these das-private ids that are a problem, you cannot
know where it came from or where it's to be written to.
the child-parent pointers are not a problem.
gh: post to a writeable das server with das-private id, it means the
feature is to be written on that server.
aday: new document comes in, you don't know where to write them to.
gh: which writeable server are they to be written to.
ad: there will be a different distinct url.
gh: client is aware of 5 different writeback servers, which one do I
write to. this is a client issue. it should present options to the
user and let them select.

aday: what about creating a hybrid feature?
gh: it's a totally new curation.
ad: what if you want to have one writeback url for several dbs on the
server?
gh: i would say no.
aday: you need to know what is the context of the write.
gh: for server, it knows, for client.
aday: so are we saying that the document does not need to be
validatable when standalone (ie, outside the context of the server)?
there is not enough information to know whether some features being
grouped together should be.
if I upload this document to xxx, is it loadable?
gh: i don't see that as an issue. we have validation issues with read
document as well. the validators don't go into the uris of each
feature and see if they come from same server.
aday: if absolute, yes, but if all relative.
as long as all relative, you can tell if compatible.
gh: if you have the document the element was retrieved from, it's relative to
that. if not, it's application-specific, which in our case means
punting.
validator can't guarantee that certain uri's are compatible. to do
that, it would have to know how to resolve every uri, and they don't
need to be url's. nobody knows how to resolve every uri.
what that means is that the server will have to reject the post if it
sees uris that it doesn't recognize.
aday: or, that it 

sc: how does server know if uri's are compatible?
gh: for posts, those features have to be coming from that server
aday: adding new exon to transcript that already exists in db, can I
give you the new exon and pointer to transcript?
gets into uri compatibility issue.
I have exon whose parent I don't have access to (on remote
server). could I do an external request on the parent, figure out its
location, close it, send xid to parent on remote server.
ad: would say it's legal but you have to pass in the complete feature
record.
gh: the legality is in the document that is being posted. you have
parent-child resolvability back up to the root. that's the requirement
now.
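
A minimal sketch of the bidirectional closure requirement as discussed
here: every parent or part a feature points to must be in the posted
document and must point back. The dict layout is an assumption for
illustration, not the spec's data model, and the single-root condition
is not checked.

```python
def check_closure(features):
    """Return a list of closure problems; an empty list means none found.

    'features' maps a feature uri to {"parents": [...], "parts": [...]}.
    """
    errors = []
    for uri, feat in features.items():
        for parent in feat["parents"]:
            if parent not in features:
                errors.append(f"{uri}: parent {parent} is not in the document")
            elif uri not in features[parent]["parts"]:
                errors.append(f"{parent}: does not list {uri} as a part")
        for part in feat["parts"]:
            if part not in features:
                errors.append(f"{uri}: part {part} is not in the document")
            elif uri not in features[part]["parents"]:
                errors.append(f"{part}: does not list {uri} as a parent")
    return errors
```

Under this check, aday's case of posting only a new exon with a pointer
to an existing transcript fails, which matches ad's point that the
complete feature record has to be passed in.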

gh: is it worth considering relaxing our bidirectional closure
requirement? 
ad: makes parsing harder. have to wait to very end. takes lots of time,
memory.
gh: use case you have, you need parent. we could relax it to require
parent-child (as needed for my use case). but for Allen's case you
need child-to-parent pointers.
ad: using xid

gh: xid's are free form. how do you know that it means x was derived
from y? there's no way to represent that in our xml. it's open to
interpretation by client and server.
ad: in the xid have one of them be the type, constrained vocab, so you
know what kind of link it is.
keyword 'rel', this means get css, rss....
also the xml-link stuff steve mentioned a while ago.

gh: would require some significant rejiggering to resolve it.
ad: can we do it by having a new feature type, of its own
vocabulary?
gh: if you do this in one client, it does this by cloning, it looks to
user you are doing it from different servers. write to client. another
one reads it, and it has no way of knowing that it was derived from the
two different sources.

gh: for now, you can only point to newly created features or features
coming from the server you are posting to, for feature ids. need to
know more about evidence trails, to know more about what info they
need to preserve.
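
A minimal sketch of the interim rule gh states: a posted feature may
only reference newly created features or features that live on the
server being posted to. The "das-private:" prefix spelling and the
example URLs are assumptions for illustration.

```python
PRIVATE_PREFIX = "das-private:"  # assumed spelling of the private id scheme


def unacceptable_refs(referenced_ids, server_base):
    """Return the referenced ids that the server should refuse.

    An id is acceptable if it is a das-private id (a feature created in
    this post) or if it starts with this server's own base URI.
    """
    return [
        ref for ref in referenced_ids
        if not (ref.startswith(PRIVATE_PREFIX) or ref.startswith(server_base))
    ]
```

A server applying this rule would reject the post whenever the returned
list is non-empty, for example when a parent id points at another
server.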

[A] talk to curator pro (nomi) about what evidence to save when
creating/modifying feats

ad: new type: external-feature-reference, do a new element at end of
record. doesn't require a new format.
gh: it's outside the spec right now, allen doesn't have to support it.
extra xml in the document to describe the relationship. e.g., a
derived-from element. it's doable, but I don't think it should be in
the current spec.
ad: can be done without making backwards incompatible changes to the
current spec.

aday: now I get free rein to validate the way I want to. I will be
liberal in what I reject.

gh: end of the spec issues we were looking at yesterday.

Topic: Status report
--------------------

ee: started working on gff3 parser for IGB.

bo: feature filtering. using full uri's not just 'chr2'. going through
biopackages.net server checking if it is up to spec. coordinates
issues, mapping document, stored in extra file.
gh: reference to each segment.

aday: writeback server able to do delete and update now. fixed bug
reported by andrew. name based query was not returning parents.
gh: lincoln mentioned xml:base problem. segment/segment/
bo/aday: fixed this.
aday: started impl a new server that takes any arbitrary range
request. performs modulus on range request. you know that there are
only certain blocks being requested, so you can use a cache. does it
satisfy requested range, and return that.
I always do children before parent. inserting hints on the thing that
does backend parsing.
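
A minimal sketch of the "modulus on range request" idea: any requested
range is widened to fixed-size blocks, so only a limited set of block
requests ever reaches the backend and each one can be cached. The
half-open coordinate convention and the function name are assumptions
for illustration, not a description of aday's server.

```python
def covering_blocks(start, end, block_size):
    """Return the (block_start, block_end) pairs covering [start, end).

    Coordinates are treated as zero-based and half-open.
    """
    if start < 0 or end <= start or block_size <= 0:
        raise ValueError("need 0 <= start < end and block_size > 0")
    pos = start - start % block_size  # round start down to a block boundary
    blocks = []
    while pos < end:
        blocks.append((pos, pos + block_size))
        pos += block_size
    return blocks
```

Each returned pair can serve as a cache key; the server would then
filter the cached block contents down to the range actually requested
before returning them.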

gh: are you supporting multiple parents of children (e.g., multiple
transcripts that share an exon)?
aday: a good question. I keep track of children and multiple locations
of children and then I give parents after that. after the grooming, I
can have multiple hints, 'this is the end of this 15mb block'.
all parents are presented. then all of my comments would be presented.

gh: got out IGB release, but had to recall it, since it broke things.
verifying I can write back to new and improved writeback server.
if you post to a writeback server, that's also the address you should
be using to get the....
a versioned source with a writeable attribute. I should be able to use
that same source to both write to and retrieve from.
aday: you can't retrieve
gh: I have to use two different urls to do retrieve and posts.
The way I think it should work: anything you write to you should be
able to do retrieval as well.
aday: writeable=yes attribute, and go over here and write. should be
ok. thinking about using redirection under the covers.

gh: resolving new ids mapping to das-private ids, editing is working
on client side.

sc: worked on info page for affy das servers. Generating new
drosophila alignment data for Ann.

gh: had trouble hooking up exon chp data with new binary formatted
exon data you generated (gregg's new bp2 format for exon data). could
be that I have only control probes and they are not in your data.

[A] steve will check to see if there are any control probes in the exon
array data.

ad: I got the validation server back up and running. will work on
sequence retrieval spec.

question: does spec guarantee that seq will be upper or lowercase?
gh: no, fasta can be either.

gh: spec docs don't have date stamp, eg, writeback document. this is
useful to see if it has been updated.

[A] andrew will put date stamp back in spec docs that don't have it.