[DAS2] Notes from DAS/2 code sprint #2, day one, 13 Mar 2006

Steve Chervitz Steve_Chervitz at affymetrix.com
Tue Mar 14 04:22:36 UTC 2006

Notes from DAS/2 code sprint #2, day one, 13 Mar 2006

$Id: das2-teleconf-2006-03-13.txt,v 1.1 2006/03/14 04:31:36 sac Exp $

Note taker: Steve Chervitz

  Affy: Steve Chervitz, Ed E., Gregg Helt
  Sanger: Andreas Prlic
  Dalke Scientific: Andrew Dalke (at Affy)
  UC Berkeley: Nomi Harris (at Affy)
  UCLA: Allen Day, Brian O'Connor (at Affy)

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2006. Instructions on how to access this
repository are at http://biodas.org

The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

General note: 
Passcode is now required to enter teleconf.
This is a change in their system.

Issue: Continuation Grant

gh: no word yet.

Issue: Coordinate System

ad: question of what happens when there are multiple coordinate
systems for an assembly.

auth and source,
source: contig space, scaffold space
auth: organization (e.g. ncbi, ucsc)

gh: not enough to get uniqueness.
ncbi, genome, human is not enough, need version
to uniquely id the coord system

ad: auth, source, species, version identification string
gh: use case: need to know whether uris for two versioned source
refer to the same genome.
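
A minimal sketch of the use case above, assuming a coordinate system is
identified by the four fields under discussion (auth, source, species,
version). The class name, field names and example values are
illustrative only and are not part of the DAS/2 spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CoordSystem:
    """Hypothetical identity record for a coordinate system."""
    authority: str                  # e.g. "ncbi", "ucsc"
    source: str                     # e.g. "chromosome", "contig", "scaffold"
    version: str                    # e.g. "35"
    species: Optional[str] = None   # optional, e.g. "human"

    def key(self):
        # Compare case-insensitively so "NCBI" and "ncbi" are treated alike.
        species = self.species.lower() if self.species else None
        return (self.authority.lower(), self.source.lower(),
                self.version.lower(), species)


def same_coord_system(a, b):
    """True when two records name the same coordinate system."""
    return a.key() == b.key()


a = CoordSystem("NCBI", "chromosome", "35", "human")
b = CoordSystem("ncbi", "chromosome", "35", "human")
c = CoordSystem("ncbi", "chromosome", "34", "human")
print(same_coord_system(a, b))
print(same_coord_system(a, c))
```

Leaving out the version makes a and c indistinguishable, which is the
point gh raises: auth, source and species alone are not enough.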

gh: ncbi version numbers are separate from organism info, eg. v35.

ad: we could have a service for mapping strings

gh: idea - every server can say this assembly name is same as
that. Clients could chain together statements from multiple servers.
For the affy das server used by igb, we now have a synonyms file on
our server which igb reads. It's a pain to maintain.

ad: type of alignment server?
gh: a synonym server. Here's a uri, give me a list of synonyms that
refer to the same thing.
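
The chaining idea can be sketched as a union-find table: each "A is the
same as B" statement, from whichever server, merges two groups, and a
client can then ask whether two uris ended up in the same group. This is
only a sketch; the class, its methods and the example.org uris are
hypothetical and no such service was defined at the meeting.

```python
class SynonymTable:
    """Groups URIs that servers have declared to be the same thing."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Locate the representative of the group containing `uri`.
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def add_statement(self, uri_a, uri_b):
        """Record one server's claim that uri_a and uri_b are synonyms."""
        root_a, root_b = self._find(uri_a), self._find(uri_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, uri_a, uri_b):
        return self._find(uri_a) == self._find(uri_b)

    def synonyms(self, uri):
        root = self._find(uri)
        return sorted(u for u in list(self._parent) if self._find(u) == root)


A = "http://server-a.example.org/genome/hg17"
B = "http://server-b.example.org/assembly/ncbi35"
C = "http://server-c.example.org/human/may2004"
D = "http://server-d.example.org/mouse/build34"

table = SynonymTable()
table.add_statement(A, B)   # claimed by one server
table.add_statement(B, C)   # claimed by a different server
print(table.same(A, C))     # linked only through the chain A = B = C
print(table.synonyms(C))
```

Chaining means one wrong statement merges two groups for every client,
so a real service would need some way to decide which servers to trust.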

This is something to talk more about when Andreas is online.

[Andreas joins in.]

gh: How would a das server verify that the version info in a sources
document points to the same genome assembly?
ap: You would check auth=ncbi, vers=35, taxid=human
ap: In protein structure space, you check the version on every object you
work with. Protein seq.
gh: so we have to map version info on sequences as well as genome
gh: use case: two segment responses from diff servers, diff uris for
the diff sequences, how do you know they are referring to the same seq?
name=chromosome21 vs name=chr21?
ad: we require the same name for the same segments.
gh: going to fall apart fast. no way to enforce it. People use 1, I,
chr1, chromI.
ee: can put this in the validation suite.
aday: yes.
gh: but what do you use for name: accession # for entry, string chr1,
gh: important since this is the name that goes to user.
ad: could have one slot for computer to use, one for human consumption.
ad: for segments there seem to be two diff ids: url,
ad: the point of having special ids for segments is segment
equivalence from different servers. Separate coordinates element that
says how to merge things together. Identifiers in here that are just
coordinate space ids, not necessarily for human use. Only for identifying
gh: but how do we get people to use it?
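
As an illustration of the naming problem (1, I, chr1, chromI,
chromosome21 vs chr21), here is a sketch of the kind of check a
validation suite or client could apply: reduce each variant to one
machine-readable id and keep the server's original string for display.
The "chr" + number convention and the function names are assumptions
made for this example, not something agreed in the meeting.

```python
import re

_PREFIX = re.compile(r"^(chromosome|chrom|chr)", re.IGNORECASE)
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}


def roman_to_int(numeral):
    """Convert a roman numeral built from I, V and X to an integer."""
    values = [_ROMAN_VALUES[ch] for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (IV = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def canonical_segment_name(name, roman=False):
    """Map a chromosome name variant to a single id such as 'chr1'.

    roman=True is for assemblies numbered with roman numerals; it is
    off by default because 'X' would otherwise be read as 10.
    """
    core = _PREFIX.sub("", name.strip()).lstrip("_ ")
    if core.isdigit():
        return "chr" + str(int(core))
    upper = core.upper()
    if roman and upper and all(ch in _ROMAN_VALUES for ch in upper):
        return "chr" + str(roman_to_int(upper))
    return "chr" + upper


for variant in ["1", "chr1", "Chr1", "chromosome1"]:
    print(variant, "->", canonical_segment_name(variant))
for variant in ["I", "chromI", "chrXVI"]:
    print(variant, "->", canonical_segment_name(variant, roman=True))
```

The roman flag has to come from knowledge of the assembly: the same
string "chrX" is the X chromosome in one genome and chromosome 10 in
another, so the name alone cannot settle it.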

sc: what about the idea of using checksums as identifiers for a seq?
ad: problem of duplicate seqs in an assembly. eg., same seq from chr1
and chr9.
gh: if they are the same seq they should get the same id.
ad: don't you want to know if there is a region on chr1 that is an
exact duplicate of a region on chr9?
sc: we could create the checksum on source:sequence
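
A sketch of the two checksum variants being compared: one over the
residues alone, and one over source:sequence. SHA-256, the whitespace
and case normalization, and the function names are choices made for this
example; the meeting did not settle on a digest or a format.

```python
import hashlib


def _normalize(sequence):
    # Drop whitespace/newlines and ignore case so formatting does not matter.
    return "".join(sequence.split()).upper()


def sequence_checksum(sequence):
    """Checksum of the residues only: identical sequences share an id."""
    return hashlib.sha256(_normalize(sequence).encode("ascii")).hexdigest()


def segment_checksum(source, sequence):
    """Checksum of 'source:sequence': identical residues from different
    sources (e.g. chr1 and chr9) get different ids."""
    text = source + ":" + _normalize(sequence)
    return hashlib.sha256(text.encode("ascii")).hexdigest()


region = "acgtacgt\nACGTACGT"
print(sequence_checksum(region) == sequence_checksum("ACGTACGTACGTACGT"))
print(segment_checksum("chr1", region) == segment_checksum("chr9", region))
```

Keeping both values around would answer both questions: the sequence
checksum shows that the chr1 and chr9 regions are exact duplicates, and
the segment checksum still tells them apart.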

gh: useful to have a central place to ask for diff names for the same
coord system.
ad: uniqueness idea: coords element, has: auth, source, version,
species (optional) 
uniqueness says these are the names you use.
gh: this can fail. What do we say happens when it fails? Should there
be a way of resolving it.
ad: this is where your synonym table comes in. Publish it?
gh: maybe as part of the registry, knows

ap: there isn't a big variety in naming because there aren't many
people providing assemblies.
gh: we already have 10 different synonyms for an assembly
ee: this has some performance impact on igb. should have to do it.
ap: we should say this is how naming works.
gh: will fail.

ad: is this required for this version of the spec?
gh: need something that can be used now.
aday: without hardwiring
gh: if we don't agree during the code sprint, then it won't happen for
everyone else.
aday: using roman numerals for yeast since sgd uses it.
ee: trouble with chrX

ad: andreas: is there a place for naming of segments to use
ap: no, something for the reference server, not coords
ad: given these coords, here are the names that are used.
ap: same as reference server.

gh: maybe registry should provide: here's a coord system and here are
the names you can use for
ap: you would get a long list for proteins
aday: a user who wants to

gh: question for brian g: LSID, when you come across this for LSIDs,
ncbi is auth for human genome assembly yet they have no lsid for their
assembly, how do people refer to their lsid when there's no authority
to say what it is?
bg: you can't, no one is the authority. but you can write a resolver
that queries ncbi under the cover, in your resolver you make ncbi the
authority of the lsid, add namespace, object id. Then everyone has to
know that your resolver is hosted at some site somewhere. So there is
no satisfactory answer. It's a problem if the authority does not host
the resolver.
bg: I'm at the w3c meeting at mit, providing a webified resolver, they
would host a resolver, everyone would know to go to a well-known web
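
For reference, an LSID has the form
urn:lsid:authority:namespace:object[:revision]. The sketch below builds
and parses such a string for the case bg describes, where a third-party
resolver names ncbi as the authority. The namespace, object id and
revision used here are invented for illustration; ncbi publishes no such
LSID.

```python
def make_lsid(authority, namespace, object_id, revision=None):
    """Assemble an LSID string from its components."""
    parts = [authority, namespace, object_id]
    if revision is not None:
        parts.append(revision)
    for part in parts:
        if not part or ":" in part:
            raise ValueError("LSID components must be non-empty, without ':'")
    return "urn:lsid:" + ":".join(parts)


def parse_lsid(lsid):
    """Split an LSID into a dict of its components."""
    fields = lsid.split(":")
    prefix = [f.lower() for f in fields[:2]]
    if len(fields) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError("not an LSID: %r" % lsid)
    keys = ["authority", "namespace", "object_id", "revision"]
    return dict(zip(keys, fields[2:]))


# Hypothetical identifier for a human genome assembly, version 35.
lsid = make_lsid("ncbi.nlm.nih.gov", "assembly", "human", "35")
print(lsid)
print(parse_lsid(lsid))
```

Nothing in the string says where the resolver lives, which is the gap
bg points out when the named authority does not host one itself.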

bg: you start a convention, enforce it, give error if people don't
use it.
gh: thinking we need it associated with registry.
ap: ref server + coord system, provides ids that can be used,
gh: so other ids can be used, but registry server wouldn't support it.

ad: site has ftp site for downloading chromosomes, contains names for
different segments in the file. How do I go from the ids in this file
to the ids that Andreas describes, to make my annotations in the same
space? Mapping from file from ncbi.
bg: what are your use cases? write back to server?
ad: user publishing locally,
bg: you make a ref server.
gh: experience from das1 is that everyone makes their own reference
server and refers to it from their annotation server, using different
ad: new tag 'coordinates'
gh: like enforcing common names at registry server. Can use their own
names, they just won't be allowed to post on the registry.
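
A sketch of how a registry could enforce common names, assuming it
keeps one accepted set of segment names per coordinate system and
refuses to list a server that uses any other name. The table contents,
the tuple key and the function names are hypothetical.

```python
# Hypothetical registry table: coordinate system -> accepted segment names.
# The key is (authority, source, version, species).
HUMAN_35 = ("ncbi", "chromosome", "35", "human")
REGISTRY_NAMES = {
    HUMAN_35: {"chr%d" % n for n in range(1, 23)} | {"chrX", "chrY"},
}


def unregistered_names(coord_key, segment_names):
    """Return the names a server uses that the registry does not accept."""
    accepted = REGISTRY_NAMES.get(coord_key, set())
    return sorted(set(segment_names) - accepted)


def may_register(coord_key, segment_names):
    """A server may be listed only if every segment name is accepted."""
    return not unregistered_names(coord_key, segment_names)


print(may_register(HUMAN_35, ["chr1", "chr21", "chrX"]))
print(unregistered_names(HUMAN_35, ["chr1", "chromosome21", "X"]))
```

A server rejected this way can still run with its own names; it just
does not appear in the registry, as gh describes.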

ad: need documentation
ap: could point to documentation on reference server

bg: workflow1: fish researcher looking for aberrant regions in chr7,
11 and 3, singled out the abctransporter gene. How does that work in
das/2? type 'abc' in web page for reference server? This is a gene name.
ad: your client browser can go to the registry to find servers that
host the assemblies for your fish. Go to those reference servers, do
searches there. Will go to coord system, get a segments document, get
display chromosome by title.
gh: get a das features xml document saying the sequence and
gh: our discussion here is on getting the diff.
ad: we don't have anything on coordinates saying which is the latest

bg: latest build may have changed their gene coordinate.
gh: mapping servers is part of our continuation grant. Can push an
annotation on one assembly to another assembly.
bg: a hard thing.
gh: that's why we're enlisting UCSC to do it!

ad: Topic: id, url, uri, iri (see email)
gh: likes uri, not url. Some things aren't really urls
(resolvable). Iri might work.

ad: multiple coord elements for same ref server.
ap: originally there was one, but some use two, zebrafish guy chrom
and scaffold coordinates. or chromosomes vs. gene ids. same types,
different accession codes and features.
ad: if you have graphical browser, do you get scaffolds or
ap: depends on your view.
gh: if you do a segments query, do you get segments and contigs?
ap: depending on the coordinate system of the request.
ad: one capabilities for scaffolds and one for chromosomes?
gh: maybe

[A] gregg: by end of week, load stuff from multiple servers, compare in the
same view.

[A] steve will work on getting gregg's das/2 server up and running.

gh: trouble with biopackages.net server
aday: possible power outage interference.

gh: target filters have been dropped.
aday: yay!
