[DAS2] DAS/2 weekly meeting notes from 12 Dec 2005

Tue Dec 13 01:38:14 UTC 2005

Notes from the weekly DAS/2 teleconference, 12 Dec 2005.

$Id: das2-teleconf-2005-12-12.txt,v 1.1 2005/12/13 01:03:01 sac Exp $

Note taker: Steve Chervitz

Attendees: 
  Affy: Steve Chervitz, Ed E., Gregg Helt
  CSHL: Lincoln Stein
  Sanger: Andreas Prlic, Thomas Down
  Sweden: Andrew Dalke

Action items are flagged with '[A]'.

These notes are checked into the biodas.org CVS repository at
das/das2/notes/2005. Instructions on how to access this
repository are at http://biodas.org

DISCLAIMER: 
The note taker aims for completeness and accuracy, but these goals are
not always achievable, given the desire to get the notes out with a
rapid turnaround. So don't consider these notes as complete minutes
from the meeting, but rather abbreviated, summarized versions of what
was discussed. There may be errors of commission and omission.
Participants are welcome to post comments and/or corrections to these
as they see fit. 

Today's topic: Spec Issues
--------------------------

* Regions
-----------

Discussion thread:
http://portal.open-bio.org/pipermail/das2/2005-December/000388.html

AD: Can the region request be removed? It's just a type of feature.

LS: There are situations where we need to say "something lives in this
region and you can't get base pairs for it." Example, gaps or
chromosomal sequence based on mapping data only.

AP: How would start/stop be specified?
LS: Endpoints of gap are specified in base pair coordinates. This is
standard in AGP files. Can indicate approximate length.
It's not a case of a feature with an ambiguous location, just a
location that is not precisely defined.
GH: Does the spec allow decimal places in location, e.g., for
recombination frequency?
LS: No. Still genome/base pair oriented. If we want to require the
retrieval of bases, we could possibly have a convention where Ns could
be returned. 
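
As a rough illustration of the gap case LS describes, the sketch below
reads one gap line from an AGP file and returns Ns for it. The column
layout is assumed from the AGP format as commonly documented (gap lines
carry the gap length in column 6); the names AgpGap, parse_agp_gap and
gap_residues, and the sample line, are made up for this example and are
not part of DAS/2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgpGap:
    object_id: str   # e.g. the chromosome being assembled
    start: int       # 1-based, inclusive, as written in the AGP file
    end: int
    gap_length: int
    gap_type: str


def parse_agp_gap(line: str) -> Optional[AgpGap]:
    """Return an AgpGap for a gap line, or None for a component line."""
    cols = line.rstrip("\n").split("\t")
    # Assumption: column 5 is the component type; N (and U) mark gaps.
    if cols[4] not in ("N", "U"):
        return None
    return AgpGap(cols[0], int(cols[1]), int(cols[2]), int(cols[5]), cols[6])


def gap_residues(gap: AgpGap) -> str:
    """Convention suggested in the notes: answer a residues request with Ns."""
    return "N" * gap.gap_length


# Hypothetical sample lines, tab-separated.
gap_line = "chr1\t5001\t5100\t2\tN\t100\tclone\tyes"
component_line = "chr1\t1\t5000\t1\tF\tAC000001.1\t1\t5000\t+"
gap = parse_agp_gap(gap_line)
```

The gap still has base pair endpoints, so it can be placed on the
assembly even though no real bases exist for it.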

AD: Didn't know DAS needed to handle this info. For features
type=region, names could be returned, not necessarily sequence data.

LS: The operations we need to support:
   1) Return entry points for an interactive search
      - e.g., chromosome length fragments
   2) Assembly info (AGP)
   3) Bases (residues) given location on the sequence

TD: Why not return assembly as a set of features as DAS/1 does? Why do
we need a special assembly communication format?

GH: You can't get the whole picture all at once. You have to get the
top-level contigs, each of which has its own assembly,
recursively. Lots of queries may be required.

TD, LS: Hierarchical features are supported. You could have one
feature per chromosome of type=assembly. You then do a non-recursive
request to get the top-level features, then do a recursive request to
get the feature with all children.
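
The two-step retrieval TD and LS describe can be sketched with an
in-memory feature tree. This is only a toy model under assumed names
(Feature, describe, query and the sample ids are hypothetical); it says
nothing about how a real DAS/2 server stores or serializes features.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    id: str
    type: str
    children: List["Feature"] = field(default_factory=list)


def describe(feature: Feature, recursive: bool) -> dict:
    """Serialize a feature, including its children only when asked to."""
    out = {"id": feature.id, "type": feature.type}
    if recursive:
        out["children"] = [describe(c, True) for c in feature.children]
    return out


def query(features: List[Feature], type_: str, recursive: bool = False) -> list:
    """Return all top-level features of the given type."""
    return [describe(f, recursive) for f in features if f.type == type_]


# Hypothetical store: one assembly feature per chromosome.
store = [
    Feature("chr1", "assembly", [
        Feature("contigA", "contig"),
        Feature("contigB", "contig"),
    ]),
]

top_level = query(store, "assembly")                 # step 1: non-recursive
full_tree = query(store, "assembly", recursive=True)  # step 2: with children
```

The first call is cheap and gives the client something to list; the
second is only issued for the feature the client actually wants to
expand.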

GH: This is the chado approach where every sequence is a feature. I
have trouble with this.
TD: The feature indicates an alignment. The region for the feature
aligns to a piece of chromosome.
GH: How do you find out what the chromosomes are?
LS: Assembly fragment type could be used for children.

Currently in the DAS/2 spec:
  - a region request returns contigs a la the entry points list from DAS/1.
    volvox/Contig1 - or in a finished assembly, this would be chromosome
    length things that IGB would present to the user to select for
    browsing.
    These are not necessarily chromosomes, just recommended entry points
    for browsing.
Feature-based approach:
  - do a feature query using filter type=assembly

AD: Why do we need region request?
GH: To get top-level entry points for browsing.

LS: Sequence ontology has these types that could appear as entry points:
  - assembly 
  - assembly component
  - contig
  - supercontig
  - chromosome
  - chromosome arm

Problem: A naive browser comes into a genome and doesn't know what
the entry points are.
AD: Saying type=top-level is wrong. It should be a property.
LS: 'Entry point' or 'landmark' attribute.
GH: How do you get the entry points?
LS: Feature request with a filter for attribute='entry point', and
type='assembly component'
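
A client-side sketch of the request LS describes might look like the
following. The base URL and the query parameter names ("type", "att")
are placeholders chosen for this example; the notes do not fix the
actual filter syntax.

```python
from urllib.parse import urlencode

# Hypothetical feature-request URL on a hypothetical server.
BASE = "http://example.org/das/genome/volvox/1/feature"


def entry_point_query(base: str = BASE) -> str:
    """Build a feature request filtered by type and by the entry point attribute."""
    params = [
        ("type", "assembly component"),  # SO type named in the notes
        ("att", "entry point"),          # assumed name for the attribute filter
    ]
    return base + "?" + urlencode(params)


url = entry_point_query()
```

urlencode encodes the spaces in both filter values, so the type and
attribute names can be passed through as written.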

AD: Possible trouble with people defining features at different
servers from the one providing regions:
  - server 1 provides regions
  - server 2 provides other feature types
So you need to go to multiple servers.
TD: This is not a big change from DAS/1

LS: gmod has chromosomes as features; this has never been a problem.
Trade-offs of Andrew's suggestion (regions as features):
- Advantage: simplifies the protocol
- Disadvantage: can't return AGP format files, must parse DAS2XML (or
  can only get AGP for a subcomponent of assembly).
GH: Can use the same alternative format approach we use for types
request (optional FORMAT subelements). But then no server would be
required to return it.
TD: Not a big deal because every client will be required to parse
feature XML. Also, the top-level assembly won't be very large.
GH: AGP support is not that important.
LS: Was a request from UCSC (Jim Kent). OK to get rid of region and
use feature.

GH: Still have trouble using feature to get region data because of the
circular nature of referring to yourself as your coordinate system.
AD: You can still point to a sequence as your coordinate system.
GH: How do you know the size of the sequence without requesting the
whole sequence?
There's also the possibility of 0 vs 1-based coordinate confusion.
Someone could provide an assembly top-level feature and
declare it starts at 1, getting around our 0-based requirement for
genomic features.
LS: They could, but will suffer the consequences of pervasive
off-by-one errors. 
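
The off-by-one risk can be made concrete with a small conversion
sketch. It assumes the 0-based convention is half-open (start counts
positions before the first base, end is exclusive) and the 1-based
convention is closed, as in AGP; the notes only say "0-based", so the
half-open part is an assumption, and the function names are made up.

```python
def one_based_to_zero_based(start: int, end: int) -> tuple:
    """Convert a 1-based closed interval to a 0-based half-open one."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_to_one_based(start: int, end: int) -> tuple:
    """Convert a non-empty 0-based half-open interval to 1-based closed."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def length_zero_based(start: int, end: int) -> int:
    return end - start


def length_one_based(start: int, end: int) -> int:
    return end - start + 1
```

A server that declares its top-level feature as starting at 1 but is
read by a client applying the 0-based rule would shift every position
by one base, which is the consequence LS refers to.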

Proposal: Abolish the region namespace (request/response):
 - Add special feature type 'assembly component'
 - assembly component has optional attribute 'entry point'
 - Response to this query must be fast
 - Servers must be able to handle attribute filters

GH: Not comfortable with this, or with how gmod treats chromosomes as
being the same as features. Why? Data modeling, e.g., the sequence symmetry
concept of genometry used in IGB. An annotation/feature is always
described as a relation between one or more sequences. The annotation
only points to the sequence.

LS: In Bioperl GFF database and chado schema, entry-level sequences are
features that use their own coord system. Top-level sequence is a
feature with type=chromosome or contig.
Limitation is that you need to know what to use for type.
Advantage is in relative addressing (e.g., get all genes within 1000
bases of other genes). Works when feature is in its own coordinate
system. 
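
The relative addressing LS mentions can be sketched as a neighborhood
query around an anchor feature. The Gene class, the within function and
the sample coordinates are all hypothetical; positions are treated as
0-based half-open, and "within N bases" is read as a gap of at most N.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    name: str
    seq: str     # id of the coordinate system the feature lives on
    start: int
    end: int


def within(features: List[Gene], anchor: Gene, distance: int) -> List[Gene]:
    """Features on the anchor's sequence whose gap to the anchor is <= distance."""
    lo = anchor.start - distance
    hi = anchor.end + distance
    return [
        f for f in features
        if f is not anchor
        and f.seq == anchor.seq
        and f.start <= hi
        and f.end >= lo
    ]


a = Gene("geneA", "chr1", 5000, 6000)
genes = [
    a,
    Gene("geneB", "chr1", 6500, 7000),   # 500 bases downstream
    Gene("geneC", "chr1", 7500, 8000),   # 1500 bases downstream
    Gene("geneD", "chr1", 3000, 4200),   # 800 bases upstream
    Gene("geneE", "chr2", 5200, 5800),   # different coordinate system
]
neighbors = [g.name for g in within(genes, a, 1000)]
```

The comparison only makes sense because every feature, including the
anchor, is expressed in the same coordinate system.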

AD: There's a danger of becoming too generic, example from WebDAV. When
everything is a property, there's no structure.
LS: There is the risk of having too many magic attributes.

AD: We could keep the top-level or landmark request as a special alias
that retrieves a subset of the data -- just top-level entry points,
instead of having a special feature request. Would be the same as a
region without the extra stuff.
GH: Bad to have two ways to get same data.
LS: Regions as features is OK, but no top-level attribute.

Proposal: DAS-defined special feature type 'top level' or 'entry
point' that maps to SO assembly component. Hard-coded, special type
that returns entry type features.
AD: Is there multiple inheritance support? Are there features that
inherit from both SO and our special type? E.g., of type entry point
and contig?
LS: No. A data source must support type='das:entry point'. To get
top-level features, you ask to get features of this type. They can
have children to describe the assembly.
Trouble with this: Duplication, you now have features that appear as
type=entry point and as type=supercontig or chromosome. One is a
physical object, one is a navigation object.
This just trades a magic attribute for a magic type.

AD: So we have a choice:
   - magic attribute
   - magic type
   - magic URL

LS: Likes special attribute the best. Advantage is that you can tag
whatever feature type you want to appear as an entry
point. Disadvantage is potential abuse and implementation could be
harder. Attribute filtering must be fast.
Use case: At an intermediate stage of a big assembly you can choose
what you want to be top-level, rather than creating a new database
object, or figuring out another way to make it appear in response to a
region request. 

Vote:
 - GH: special URL (region)
 - LS, AD, TD: special attribute

AD: As benevolent dictator, decides that DAS/2 will employ a special
attribute to handle regions as features.

Question: What do we do with the location attribute (which is now a
feature URL)? Or do we get rid of the position attribute?
LS: LOC points to feature that establishes coord system and has
subranges of that feature. So the URL gets longer. Attributes specify
position of the feature. LOC is for feature space. It specifies the
reference system of the feature and where it starts relative to the
feature.  
Position attribute points to the sequence. Clients know to parse the
URL to get the start/end.
TD: In XML, LOC with attributes start, end, strand, seq.
GH: We now permit matching feature filters to allow combining
these. So we should keep the filter syntax. Feature loc syntax can be
different. 
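
TD's suggestion of a LOC element carrying start, end, strand and seq as
attributes could be written and read back as below. Only the element
name and the four attribute names come from the discussion; the value
formats (for example "+" for strand) and the helper names are
assumptions for this sketch.

```python
import xml.etree.ElementTree as ET


def make_loc(seq: str, start: int, end: int, strand: str) -> ET.Element:
    """Build a LOC element with the four attributes named in the notes."""
    return ET.Element(
        "LOC",
        {"seq": seq, "start": str(start), "end": str(end), "strand": strand},
    )


def parse_loc(xml_text: str) -> tuple:
    """Read seq, start, end and strand back without any URL parsing."""
    el = ET.fromstring(xml_text)
    return el.get("seq"), int(el.get("start")), int(el.get("end")), el.get("strand")


xml_text = ET.tostring(make_loc("volvox/Contig1", 100, 200, "+"), encoding="unicode")
loc = parse_loc(xml_text)
```

With explicit attributes the client reads the range directly, whereas a
position encoded in a URL has to be parsed out of the URL string.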

[A]: Andrew: provide details for retrieving regions via feature request
 - need to get the feature the coordinates are relative to (contig)
 - need to get the bases, which may not be on the same server

SC: Has some philosophical issues with collapsing regions into
features, but willing to explore doing so for simplicity. Trouble is
putting objects with some physical correlate (sequence) at same level
as objects lacking such solid substrate (features).

GH: This discussion has created a lot of churn in the spec fairly late
in the game. We should be more settled by now.
ALL: General agreement.

[A]: Everyone make a push to stabilize the retrieval spec.

* New topic: Rename DAS 'genome' domain to 'sequence'
-----------------------------------------------------

Discussion thread:
http://portal.open-bio.org/pipermail/das2/2005-December/000394.html

AD: Why not remove the top-level domain completely?
(das/genome becomes just das).
GH: Need to know what data a given server has.
AP: As long as the source description provides info about what it's
about, that should be sufficient.
GH: This pushes the URL data into the source type tag (this is the
same magic URL vs magic type vs magic attribute issue all over
again...) 
AD: If we get rid of it, a given server can provide different data
without special URLs.
GH: What you put on the URL determines the return type. Why don't you
like it?
AD: 1) 'genome' in the URL is extra fluff.
    2) saying you might need it in the future is a weak argument
       (ain't gonna need it).
GH: Most servers will provide one type of data.
AD: People who provide meta data might want to combine it into one
document for all data.

LS: Saying you're in 'genome space' is a contract for what the
coordinate system is (positive integers, start, end, strand) and what
type of requests/responses are expected. If we jumble things up, it
makes it difficult to deal with other systems (3D coords).
The 'genome' space is intended to cover both protein and DNA.
AD: The top-level DAS response would point to the versioned source,
and indicate that it has a sequence, and a top-level URL.
LS: Seems like an arbitrary decision.

AP: What about the original proposal, to simply change 'genome' to
'sequence'? 
GH: OK with this.

[A]: Andrew (spec czar) change 'das/genome' domain to 'das/sequence'.
[A]: Andrew (spec czar) change 'sequence' request to 'residues'.

Other Issues:
-------------
LS: Concerned about big changes being made to spec at this date.
ALL: Agreed. Should have happened earlier, but the discussion is important.

[A]: All - focus on spec issues again next week. No meeting in two weeks.