[DAS2] Notes from the biweekly DAS/2 teleconference, 5 Mar 2007

Mon Mar 5 19:03:03 UTC 2007

Notes from the biweekly DAS/2 teleconference, 5 Mar 2007

$Id: das2-teleconf-2007-03-05.txt,v 1.2 2007/03/05 19:01:59 sac Exp $

Teleconference Info:
   * Schedule:         Biweekly on Monday
   * Time of Day:      9:30 AM PST, 17:30 GMT
   * Dialin (US):      800-531-3250
   * Dialin (Intl):    303-928-2693
   * Toll-free UK:     08 00 40 49 467
   * Toll-free France: 08 00 907 839
   * Conference ID:    2879055
   * Passcode:         1365

Attendees:
    Affy: Steve Chervitz, Ed Erwin, Gregg Helt
    CSHL: Lincoln Stein
  Sanger: Andreas Prlic
    UCLA: Allen Day

Note taker: Steve Chervitz

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

Agenda
-------
* Review of BioSapiens DAS workshop
* Status updates

gh: I sent my summary of the biosapiens das workshop and the feature
classification workshop I attended with Ed in Hinxton:
http://lists.open-bio.org/pipermail/das2/2007-March/000982.html

"das developers workshop from a das/2 perspective" summarizes what I
took home from these meetings and how well das/2 meets the needs of
people in europe (ensembl, sanger, biosapiens -- the focus of these
meetings). A quick biosapiens overview: a big european project, 25
institutions, large scale genome protein annotation. They decided
early on to use das to distribute annotations between organizations.
You can check the stats on their das servers in andreas' registry: 23
servers serving up 69 das sources -- a major das investment!

In developing das/2 we haven't had too much experience with the kind
of data they're dealing with (protein annotations).

das/1 clients under study:
 - dasty2, dasty1 - ajax-based viz clients
 - jalview - alignment viewer, editor
 - igb - Ed gave presentation
 - pepper and spice - das viewers, also use alignment and 3d structure
   info
 - proview - protein annotation,
 - ensembl viewer

servers presented/discussed:
 - pfam, ensembl, proserver, Andreas',
 - Extensions to das/1 protocol discussed: gene das, protein das,
   structure das, 3d-em das (arbitrary 3d volumes), interaction das for
   prot-prot interactions. Moddas - writeback in das/1. Alignment das
   (Andreas). 
 - Simple das - das servers that don't impl all of das/1 (entry_points,
   or types, e.g.,).

Gregg presented on das/2, will put up ppt later.

[A] Gregg will send out powerpoint for his talk from BioSapiens DAS workshop

Tailored it assuming familiarity with das/1, focussing on how big the
diffs are with an eye towards how hard it would be to move to das/2.
Conceptually, not that big a switch, though the XML is a lot different.

Also discussed how well das/2 addresses some of the problems with
das/1 that came up at the workshop.

extensions for das/1:
- das/2 addressed some of them very well. E.g., gene das (das w/o
  specifying location of feature). this is addressed well in
  das/2. can have features w/o location, or w/o range.
- protein das - das/2 did a good job of removing nucleotide specific
  parts of das features (orientation, phase are not required). das/2
  is much more agnostic about dna vs protein.
- alignment das - pairwise or multiple - locations with features in
  das/2 address some of these issues (0, 1, or more locations for a
  feature). each location can have an optional gap attribute (cigar
  string), so if you can describe it with a cigar string, you can
  describe it in das/2. Can use multiple locations to do multiple
  alignments. Not dealt with in das/2: 3d-threading of an alignment
  through a structure. Need to look at this in the future.

[A] Look at how to handle 3D structure alignment threading in DAS/2 spec

- simple das stuff handled better in das/2 - in das/1 the assumption
  is that you support everything unless stated otherwise. but in das/2
  there is a capabilities header, and you must indicate support there.
  if not stated, the default is you don't support it. Can also say you
  support feature filters, so there's more formal support for that.
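
As a rough illustration of the gap attribute idea under alignment das
above: a minimal sketch of reading a cigar string and working out how
much of each sequence it covers. The syntax assumed here (a run length
followed by one of M, I, D, as in "8M2D4M") is a common cigar
convention and is an assumption; the exact form in the das/2 spec may
differ. The function names are made up for this example.

```python
import re

# Assumed syntax: run length followed by an operation letter, e.g. "8M2D4M".
# M = aligned columns, I = insertion in the query, D = deletion from the query.
_CIGAR_OP = re.compile(r"(\d+)([MID])")


def parse_cigar(cigar):
    """Return a list of (length, op) pairs, rejecting malformed strings."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Re-serialise and compare so stray characters are not silently dropped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed cigar string: {cigar!r}")
    return ops


def aligned_lengths(cigar):
    """Return (reference_span, query_span) covered by the alignment."""
    ref = qry = 0
    for n, op in parse_cigar(cigar):
        if op in "MD":  # M and D consume reference positions
            ref += n
        if op in "MI":  # M and I consume query positions
            qry += n
    return ref, qry
```

With this convention, one location plus one cigar string is enough to
describe a gapped pairwise alignment; a multiple alignment would need
one such location per aligned sequence.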

Surprises:
- smaller subset of das/1 is in use than expected. of 69 sources, 64
  either fail entry points or say not applicable. types query: 49
  fail/not applicable

ls: for types query. only one type?
gh: for ensembl, this is the case.
ap: lack of consistency of types is addressed in the other workshop
related to features.

gh: in types in das/1 it is less necessary because all info is
replicated in each feature, type-method, category, id
ls: use case for types query is to present user with set of
checkboxes, select which type to retrieve from source. if in practice
das sources are being used for one type or a set of types that only
make sense together, and there is no reason to turn off a part of it,
then it makes sense to not support types query.
ls: have heard that types query is expensive computationally. with
simple db backends with no normalization/indexing, finding all types
involves visiting each record.
gh: part of justification for 1 type / source is that those types
are stored in separate db. so having a das server to integrate them
makes sense.

gh: Re: using smaller subset of das/1 than I expected:
types can be expensive in another way, example: representing pfam in
das. feat type for each pfam domain type (9000 primary domains).
Pfam b - there are 70-400K more!

ls: in das/2 create a single type 'protein domain' then use attribute
pointing to an ontology saying which pfam domain it is.
gh: concern there is, assuming clients will do something useful for
particular attributes. For rendering, I could do diff rendering based
on diff attribs (color diff domains differently). but for clients to
really understand that they're different, that's a more complicated
issue.
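
A minimal sketch of Lincoln's suggestion above, assuming a feature is
just a record with a free-form property map. The class, the property
name "domain_term", and the term identifiers are all hypothetical and
only stand in for whatever the server and ontology actually provide.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Feature:
    feature_id: str
    type_name: str  # one shared type for every domain feature
    start: int
    end: int
    properties: dict = field(default_factory=dict)


def group_by_property(features, key):
    """Group feature ids by the value of one property, e.g. for colouring."""
    groups = defaultdict(list)
    for f in features:
        groups[f.properties.get(key)].append(f.feature_id)
    return dict(groups)


# Hypothetical data: three features, one type, two different ontology terms.
features = [
    Feature("f1", "protein domain", 10, 120, {"domain_term": "example:domain_A"}),
    Feature("f2", "protein domain", 200, 310, {"domain_term": "example:domain_B"}),
    Feature("f3", "protein domain", 400, 520, {"domain_term": "example:domain_A"}),
]
by_term = group_by_property(features, "domain_term")
```

This covers the rendering case Gregg mentions (one colour per group).
It does not address his deeper concern: the client still has to know
that "domain_term" is the property worth looking at.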

gh: clients aren't using types or entry_points because servers don't
support them, and servers don't because clients don't ask: a feedback
loop.
ap: low coverage genomes (e.g., elephant) may have several 100K entry
points. 
gh: in das/2 we are more formal and say that you don't support
it. Creates problem: how do you know what to query in the first place?
Then you have to know what you're looking for.

gh: feature hierarchies handled in das/2 -- this is not an issue for
protein das, where annotations are completely flat. even protein
disulfide bond is one level, just rendered differently so it doesn't
span all residues in between. But doing non-visual things (unions,
intersections) this could be a problem.
ls: flat in terms of location or ontology?
gh: location. there is no feature ontology yet (no consistent, agreed
upon yet, just proposed at this meeting).
ls: they aren't creating discontinuous features because too hard, or
don't care.
gh: just not needed for most protein annotations. even when it could
be needed, just not being used.
ls: for nucleotide, it's needed frequently
gh: not an issue for das/2

gh: ensembl collapses type and source into one thing. what does this
mean? das/2 could be over complicated.
ls: no doubt that it is too complicated for the biosapiens use
case. we could make it easy for them to use by providing tool kits to
read and write. could also argue that postscript is too complicated to
draw simple rectangles on the page. You wouldn't expect then to
simplify postscript. There are tools to ease simple rendering.
The complexity of das/2 won't interfere with adoption, but not having
toolkits and middleware layers to read/write will. Not getting ensembl
buy-in to das/2 could be a problem.
gh: tim hubbard was there and was on-board to transition to
das/2. 
ls: would have been better to have buy in now (i.e., Tony Cox dropping
out)
gh: we've made it more formal to say, here is the subset of das/2 that
this server supports. for other use cases, we do need the added
complexity.

gh: re: ensembl support for das/2. I mentioned andrew's das/1 - das/2
transformational proxy server. not released yet, but making progress
on it. So if you have a das/1 server, you can put a das/2 front end on
it.
ls: can you go the other way, provide das/1 interface on das/2?
gh: want to do this for the affy public das/2 server. Andrew's doesn't
do that yet, but I'd like to do this. Another thing: integrate that
proxy into the registry, so the registry makes it into a das/2
server. then we don't have a burden on servers to support two versions
of the protocol. 
got email from andrew about his proxy on that.

sc: I put a note about Andrew's proxy server on the biodas.org wiki.
gh: he needs to have a place to keep it.
sc: open-bio server would work. Just need a better mechanism to
ensure it stays up. I think it's not getting started when the machine
gets rebooted.

[A] Steve/Andrew work on stable home for the proxy server

[Correction: In my note in the teleconf, I was thinking about Andrew's
validation server, which is hosted on open-bio and has a problem with
not being up reliably. The proxy server is another issue. There's a
mention of it on the DAS FAQ page, but no pointer to any server
yet. -steve] 

gh: data overload and redundancy from the user perspective. in clients
where the default for protein annotation is to go to all servers, you
have way too many tracks showing up. Lots of servers and types. Ensembl is
moving to expose even more data via das, thousands of new tracks
(organisms, type, assembly version). Concern with biosapiens is
replication of the same annotation data. E.g., pfam domains in
different biosapiens data sources, may return same thing or slight
diffs in feature ranges. how does user decide which is authoritative?
Which can be left out? A big concern at the biosapiens meeting --
redundant information.
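
One way a client could flag likely duplicates of this kind, sketched
under stated assumptions: features are plain dicts with a "type" and a
half-open (start, end) "range", and two features count as "probably the
same" when their ranges overlap almost completely. The 0.9 threshold
and all names are illustrative; nothing like this was proposed at the
meeting.

```python
def jaccard(a, b):
    """Overlap of two half-open (start, end) ranges over their combined extent."""
    (s1, e1), (s2, e2) = a, b
    inter = max(0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union else 0.0


def likely_same(f1, f2, threshold=0.9):
    """Same type and near-identical ranges: candidates for collapsing."""
    return (
        f1["type"] == f2["type"]
        and jaccard(f1["range"], f2["range"]) >= threshold
    )
```

This only detects candidates. It says nothing about which source is
authoritative, which is the harder question raised above.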

gh: another issue: mirrors for the data. discussed in early days of
das/2, not resolved how to deal with mirrors, http redirection
mechanism. This can lead to redundant data when you hit all mirrors.

gh: feature classification and ontologies around that. My take was
that the sequence ontology is inadequate to describe protein
annotation as it stands now. PAO - protein annotation ontology
ls: are they doing this with NCBO involved?
gh: talked to them about getting hold of lincoln and suzi and
integrating with SO as an extension.
ap: for 3rd version of SO we will contact lincoln and suzi to discuss
ls: great
gh: for biosapiens, Janet Thornton is the person to contact about
that.

gh: more about types (proliferation causing data overload issue mentioned
above.)
also discussion about dag vs hierarchical tree. pointing to multiple
terms in the ontology for a particular type. in SO, how much has
multiple parents come up? may need a type that can point to multiple
ontology terms for that type. das/2 cannot do it yet, only one term
per type.
ls: the more flexible we make it the less coherent it will be. data
overload will get even worse. to reduce data overload, need a way to
take data from servers and decide if it is the same or different. are
they reachable in the same ontology? allowing set arithmetic will
create ambiguity. the biosapiens case can be handled with attributes:
multiple attributes that point at different ontologies.
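
A small sketch of the "reachable in the same ontology" question for a
dag, where a term may have several parents. The ontology is assumed to
be available as a dict mapping each term to a list of its parents; the
term names below are invented and are not SO or PAO terms.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:  # a dag can reach the same parent twice
                seen.add(p)
                stack.append(p)
    return seen


def related(a, b, parents):
    """True if the terms are equal or one is an ancestor of the other."""
    return a == b or a in ancestors(b, parents) or b in ancestors(a, parents)


# Invented example: "kinase_domain" has two parents, so this is not a tree.
parents = {
    "kinase_domain": ["catalytic_domain", "binding_domain"],
    "catalytic_domain": ["protein_domain"],
    "binding_domain": ["protein_domain"],
    "signal_peptide": ["protein_region"],
}
```

With one term per type, as in das/2 today, a check like this is enough
to say whether two types from different servers could describe the same
thing. Allowing several terms per type would turn it into a comparison
of sets, which is where the ambiguity Lincoln mentions comes in.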

gh: combining cellular location with protein classification
ontologies. 
ls: certainly, but those are separate attributes. what we created is
essentially an RDF. Actually, terminology is 'property' not
attribute. Types property is the correct way to do this.

gh: use of subset of das/1, what it means for das/2
data overload for users,
feature classification issues

gh: das wish list: people wrote up what they feel das is
inadequate for. Das/2 group was aware of these.

ls: encryption, synchronous request seem like impl issues, not part of
protocol.
gh: some people complained that das is inadequate because it relies on
http(s). you can do much more high-level things with soap-based
system. I think this is correct, but wrong that no one in our space
needs that.
ls: no pharma that cares about this will entrust anything to the
public internet, soap or otherwise.
gh: at affy, we've done das/1 servers with https and no one has ever
complained. 
ls: identity theft problems via people stealing from encrypted streams
never emerged as a problem. they steal it from your physical trash,
setting up phony banking sites. Not related to strength of encryption.
gh: regarding asynch request - discussed 2 years ago -- yes, it's
outside of das/2 spec, but we say, use http as you will. redirect and
say "your request has been accepted, check back here in a while."
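
A client-side sketch of the "accepted, check back later" pattern Gregg
describes, using plain http status codes. It assumes the server answers
202 (Accepted) while the result is pending, optionally with a Location
header saying where to check back, and 200 once the data is ready. This
is ordinary http usage and an assumption here, not something the das/2
spec defines. `fetch` is a caller-supplied function returning
(status, headers, body), so the sketch does not depend on any
particular http library.

```python
import time


def poll_until_ready(url, fetch, max_tries=5, delay=0.0):
    """Repeat a request until the server returns the finished result."""
    for _ in range(max_tries):
        status, headers, body = fetch(url)
        if status == 200:
            return body
        if status == 202:  # accepted, result not ready yet
            url = headers.get("Location", url)  # server may point elsewhere
            time.sleep(delay)
            continue
        raise RuntimeError(f"unexpected http status {status}")
    raise TimeoutError(f"no result after {max_tries} attempts")
```

A real client would pass a `fetch` built on its http library of choice
and a non-zero delay.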

gh: wish list (sent out in email to the list noted above):
- multi-level features, stylesheets
- caching - use http caching as you will
- features from other sources - dealt with since we use URIs. a
  problem for das/1

ls: provenance requires people to put in effort to maintain the
provenance, but it doesn't free you of responsibility of having to
track it.

- scalability and large analysis - the data overload issue. the
answer to me is smarter clients.

- more queries -- addressed in das/2
- entry point support - in das/2 we have a less ambiguous way to say
  whether a server supports it or not.
- counting number of features of each type per source -- have the
  'count' format in das/2
- referring to id's externally (das/2 uri's)
- errors and exception handling - we have http error codes -- remains
  to be seen how well it works out. done a reasonable job to map it to
  http error codes
- better stylesheets - in progress for das/2
- mapping servers - different genome assembly versions or mapping from
  protein to nucleotide space. -- under discussion with data
  providers.

ap: Another thing on wish list: people want to know stats per server,
uptime, hits, etc. (server stats).
gh: andreas' registry does a good job for das/1. biosapiens registry
is built on Andreas' registry. How many are up, which requests they
support, the data the server. Very nice.

ap: Gregg's coverage was good. Also gave a very good advertisement for
das/2!

gh: the das/1 to das/2 transformational proxy was quite
popular. doesn't take advantage of das/2 power, but gets people started.

Other Topics:
--------------
sc: biodas.org wiki is now officially up.
gh: mentioned to Tim Hubbard. He said, "I know. I already edited it."

sc: globalseqids page needs das2xml snippets for coordinates.

[A] Lincoln will add das2xml coordinate snippets to globalseqids page on wiki

sc: might also be good to have notice of the next teleconf on the
site. Maybe pointers to the notes as well.
gh: maybe have an automatic email sent out reminding folks?
sc: maybe not, if we have a list of the dates for upcoming meetings on
the site. 

[A] Steve post list of dates of upcoming DAS/2 teleconferences on wiki

Next meeting in two weeks: 19 Mar 2007